Grand Challenges in Image Processing

Frédéric Dufaux

Université Paris-Saclay, CNRS, CentraleSupélec, Laboratoire des signaux et Systèmes, Gif-sur-Yvette, France

Introduction

The field of image processing has been the subject of intensive research and development activities for several decades. This broad area encompasses topics such as image/video processing, image/video analysis, image/video communications, image/video sensing, modeling and representation, computational imaging, electronic imaging, information forensics and security, 3D imaging, medical imaging, and machine learning applied to these topics. Hereafter, we consider both image and video content (i.e., sequences of images), and more generally all forms of visual information.

Rapid technological advances, especially in terms of computing power and network transmission bandwidth, have resulted in many remarkable and successful applications. Nowadays, images are ubiquitous in our daily life. Entertainment is one class of applications that has greatly benefited, including digital TV (e.g., broadcast, cable, and satellite TV), Internet video streaming, digital cinema, and video games. Beyond entertainment, imaging technologies are central in many other applications, including digital photography, video conferencing, video monitoring and surveillance, and satellite imaging, but also in domains farther afield such as healthcare and medicine, distance learning, digital archiving, cultural heritage, and the automotive industry.

In this paper, we highlight a few research grand challenges for future imaging and video systems, in order to achieve breakthroughs to meet the growing expectations of end users. Given the vastness of the field, this list is by no means exhaustive.

A Brief Historical Perspective

We first briefly discuss a few key milestones in the field of image processing. Key inventions in the development of photography and motion pictures can be traced to the 19th century. The earliest surviving photograph of a real-world scene was made by Nicéphore Niépce in 1827 ( Hirsch, 1999 ). The Lumière brothers made the first cinematographic film in 1895, with a public screening the same year ( Lumiere, 1996 ). After decades of remarkable developments, the second half of the 20th century saw the emergence of new technologies launching the digital revolution. While the first prototype digital camera using a Charge-Coupled Device (CCD) was demonstrated in 1975, the first commercial consumer digital cameras started appearing in the early 1990s. These digital cameras quickly surpassed film cameras, and the digital revolution in the field of imaging was underway. As a key consequence, the digital process enabled computational imaging, in other words, the use of sophisticated processing algorithms to produce high-quality images.

In 1992, the Joint Photographic Experts Group (JPEG) released the JPEG standard for still image coding ( Wallace, 1992 ). In parallel, in 1993, the Moving Picture Experts Group (MPEG) published its first standard for coding of moving pictures and associated audio, MPEG-1 ( Le Gall, 1991 ), and a few years later MPEG-2 ( Haskell et al., 1996 ). By guaranteeing interoperability, these standards have been essential in many successful applications and services, for both the consumer and business markets. In particular, it is remarkable that, almost 30 years later, JPEG remains the dominant format for still images and photographs.

In the late 2000s and early 2010s, we observed a paradigm shift with the appearance of smartphones integrating a camera. Thanks to advances in computational photography, these new smartphones soon became capable of rivaling the quality of consumer digital cameras of the time. Moreover, these smartphones were also capable of acquiring video sequences. Almost concurrently, another key evolution was the development of high-bandwidth networks. In particular, the launch of 4G wireless services circa 2010 enabled users to quickly and efficiently exchange multimedia content. Since then, most of us carry a camera anywhere and anytime, allowing us to capture images and videos at will and to seamlessly exchange them with our contacts.

As a direct consequence of the above developments, we are currently observing a boom in the usage of multimedia content. It is estimated that today 3.2 billion images are shared each day on social media platforms, and 300 hours of video are uploaded every minute on YouTube 1 . In a 2019 report, Cisco estimated that video content represented 75% of all Internet traffic in 2017, and this share was forecast to grow to 82% by 2022 ( Cisco, 2019 ). While Internet video streaming and Over-The-Top (OTT) media services account for a significant share of this traffic, other applications are also expected to see significant increases, including video surveillance and Virtual Reality (VR)/Augmented Reality (AR).

Hyper-Realistic and Immersive Imaging

A major direction and key driver of research and development activities over the years has been the objective of delivering ever-improving image quality and user experience.

For instance, in the realm of video, we have observed constantly increasing spatial and temporal resolutions, with the emergence nowadays of Ultra High Definition (UHD). Another aim has been to provide a sense of depth in the scene. For this purpose, various 3D video representations have been explored, including stereoscopic 3D and multi-view ( Dufaux et al., 2013 ).

In this context, the ultimate goal is to faithfully represent the physical world and to deliver an immersive and perceptually hyper-realistic experience. For this purpose, we discuss hereafter some emerging innovations. These developments are also very relevant in VR and AR applications ( Slater, 2014 ). Finally, while this paper focuses only on the visual information processing aspects, it is obvious that emerging display technologies ( Masia et al., 2013 ) and audio also play key roles in many application scenarios.

Light Fields, Point Clouds, Volumetric Imaging

In order to wholly represent a scene, the light information coming from all directions has to be represented. For this purpose, the 7D plenoptic function is a key concept ( Adelson and Bergen, 1991 ), although it is unmanageable in practice.

By introducing additional constraints, the light field representation collects the radiance of rays in all directions. Therefore, it contains much richer information than traditional 2D imaging, which captures a 2D projection of the light in the scene by integrating over the angular domain. For instance, this allows post-capture processing such as refocusing and changing the viewpoint. However, it also entails several technical challenges, in terms of acquisition and calibration, as well as computational image processing steps including depth estimation, super-resolution, compression and image synthesis ( Ihrke et al., 2016 ; Wu et al., 2017 ). The trade-off between spatial and angular resolution is a fundamental issue. With a significant fraction of the earlier work focusing on static light fields, it is also expected that dynamic light field videos will stimulate more interest in the future. In particular, dense multi-camera arrays are becoming more tractable. Finally, the development of efficient light field compression and streaming techniques is a key enabler in many applications ( Conti et al., 2020 ).
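To make the post-capture refocusing idea concrete, the following minimal sketch performs synthetic-aperture refocusing by shifting and averaging the sub-aperture views of a light field. The array layout, the integer-shift approximation, and the alpha focus parameter are illustrative assumptions, not a reference implementation.

```python
import numpy as np

def refocus(light_field, alpha):
    """Synthetic-aperture refocusing by shift-and-add over the angular views.

    light_field: array of shape (U, V, H, W) holding one sub-aperture image
    per angular position (u, v). Each view is shifted proportionally to its
    offset from the array center, scaled by the focus parameter alpha.
    """
    U, V, H, W = light_field.shape
    out = np.zeros((H, W), dtype=np.float64)
    for u in range(U):
        for v in range(V):
            du = int(round(alpha * (u - U // 2)))
            dv = int(round(alpha * (v - V // 2)))
            out += np.roll(light_field[u, v], shift=(du, dv), axis=(0, 1))
    return out / (U * V)

# Toy 5x5 light field of random views; alpha selects the virtual focal plane.
lf = np.random.rand(5, 5, 128, 128)
print(refocus(lf, alpha=1.5).shape)
```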

Another promising direction is to consider a point cloud representation. A point cloud is a set of points in 3D space represented by their spatial coordinates and additional attributes, including color, normals, or reflectance. Point clouds are often very large, easily ranging in the millions of points, and are typically sparse. One major distinguishing feature of point clouds is that, unlike images, they do not have a regular structure, calling for new algorithms. To remove the noise often present in acquired data, while preserving the intrinsic characteristics, effective 3D point cloud filtering approaches are needed ( Han et al., 2017 ). It is also important to develop efficient techniques for Point Cloud Compression (PCC). For this purpose, MPEG is developing two standards: Geometry-based PCC (G-PCC) and Video-based PCC (V-PCC) ( Graziosi et al., 2020 ). G-PCC considers the point cloud in its native form and compresses it using 3D data structures such as octrees. Conversely, V-PCC projects the point cloud onto 2D planes and then applies existing video coding schemes. More recently, deep learning-based approaches for PCC have been shown to be effective ( Guarda et al., 2020 ). Another challenge is to develop generic and robust solutions able to handle potentially widely varying characteristics of point clouds, e.g., in terms of size and non-uniform density. Efficient solutions for dynamic point clouds are also needed. Finally, while many techniques focus on the geometric information or the attributes independently, it is paramount to process them jointly.
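As a rough illustration of the octree idea behind geometry-based coding, the toy sketch below recursively splits the bounding cube of a point cloud into octants and records one occupancy byte per occupied node. The normalization, depth parameter, and byte layout are simplifying assumptions and bear no relation to the actual G-PCC bitstream syntax.

```python
import numpy as np

def octree_occupancy(points, depth):
    """Encode point positions as a sequence of octree occupancy bytes.

    The bounding cube is split recursively into 8 octants; for each occupied
    node, one byte records which of its children contain points.
    """
    mins, maxs = points.min(0), points.max(0)
    pts = (points - mins) / (maxs - mins + 1e-12)   # normalize to unit cube

    occupancy = []
    nodes = [pts]                                   # points inside each current node
    for _ in range(depth):
        next_nodes = []
        for node in nodes:
            byte = 0
            for child in range(8):
                # Child octant selected by one bit per axis.
                sel = np.ones(len(node), dtype=bool)
                for axis in range(3):
                    half = (child >> axis) & 1
                    sel &= (node[:, axis] >= 0.5) == bool(half)
                if sel.any():
                    byte |= 1 << child
                    offset = 0.5 * np.array([(child >> a) & 1 for a in range(3)])
                    # Re-map the occupied child octant to the unit cube.
                    next_nodes.append((node[sel] - offset) * 2.0)
            occupancy.append(byte)
        nodes = next_nodes
    return bytes(occupancy)

# Example: 10,000 random points encoded to depth 4.
cloud = np.random.rand(10000, 3).astype(np.float32)
print(len(octree_occupancy(cloud, depth=4)), "occupancy bytes")
```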

High Dynamic Range and Wide Color Gamut

The human visual system is able to perceive, using various adaptation mechanisms, a broad range of luminous intensities, from very bright to very dark, as experienced every day in the real world. Nonetheless, current imaging technologies are still limited in terms of capturing or rendering such a wide range of conditions. High Dynamic Range (HDR) imaging aims at addressing this issue. Wide Color Gamut (WCG) is also often associated with HDR in order to provide a wider colorimetry.

HDR has reached some level of maturity in the context of photography. However, extending HDR to video sequences raises scientific challenges in order to provide high-quality and cost-effective solutions, impacting the whole image processing pipeline, including content acquisition, tone reproduction, color management, coding, and display ( Dufaux et al., 2016 ; Chalmers and Debattista, 2017 ). Backward compatibility with legacy content and traditional systems is another issue. Despite recent progress, the potential of HDR has not been fully exploited yet.
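As one small, self-contained example of the tone reproduction step mentioned above, the sketch below applies a simple global operator in the spirit of Reinhard's photographic tone mapping to bring an HDR image into display range. The key value and luminance weights are conventional choices; a production HDR pipeline would involve far more (local operators, color management, display adaptation).

```python
import numpy as np

def reinhard_tonemap(hdr, key=0.18, eps=1e-6):
    """Map an HDR radiance image (float, unbounded) to display range [0, 1]
    using the simple global operator L/(1+L) after exposure adjustment."""
    # Luminance from linear RGB (Rec. 709 weights).
    lum = 0.2126 * hdr[..., 0] + 0.7152 * hdr[..., 1] + 0.0722 * hdr[..., 2]
    log_avg = np.exp(np.mean(np.log(lum + eps)))   # geometric-mean luminance
    scaled = key / log_avg * lum                   # exposure adjustment
    mapped = scaled / (1.0 + scaled)               # compress to [0, 1)
    # Apply the per-pixel luminance ratio to each color channel.
    ratio = mapped / (lum + eps)
    return np.clip(hdr * ratio[..., None], 0.0, 1.0)

# Synthetic HDR image spanning several orders of magnitude.
hdr = np.random.lognormal(mean=0.0, sigma=2.0, size=(480, 640, 3))
ldr = reinhard_tonemap(hdr)
print(ldr.min(), ldr.max())
```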

Coding and Transmission

Three decades of standardization activities have continuously improved the hybrid video coding scheme based on the principles of transform coding and predictive coding. The Versatile Video Coding (VVC) standard was finalized in 2020 ( Bross et al., 2021 ), achieving approximately 50% bit rate reduction for the same subjective quality when compared to its predecessor, High Efficiency Video Coding (HEVC). While substantially outperforming VVC in the short term may be difficult, one encouraging direction is to rely on improved perceptual models to further optimize compression in terms of visual quality. Another direction, which has already shown promising results, is to apply deep learning-based approaches ( Ding et al., 2021 ). Here, one key issue is the ability to generalize these deep models to a wide diversity of video content. A second key issue is the implementation complexity, both in terms of computation and memory requirements, which is a significant obstacle to widespread deployment. In addition, the emergence of new video formats targeting immersive communications also calls for new coding schemes ( Wien et al., 2019 ).

Considering that in many application scenarios videos are processed by intelligent analytic algorithms rather than viewed by users, another interesting track is the development of video coding for machines ( Duan et al., 2020 ). In this context, compression is optimized by taking into account the performance of video analysis tasks.

The push toward hyper-realistic and immersive visual communications most often entails an increased raw data rate. Despite improved compression schemes, more transmission bandwidth is needed. Moreover, some emerging applications, such as VR/AR, autonomous driving, and Industry 4.0, bring a strong requirement for low-latency transmission, with implications for both the image processing pipeline and the transmission channel. In this context, the emergence of 5G wireless networks will positively contribute to the deployment of new multimedia applications, and the development of future wireless communication technologies points toward promising advances ( Da Costa and Yang, 2020 ).

Human Perception and Visual Quality Assessment

It is important to develop effective models of human perception. On the one hand, such models can contribute to the development of perceptually inspired algorithms. On the other hand, perceptual quality assessment methods are needed in order to optimize and validate new imaging solutions.

The notion of Quality of Experience (QoE) relates to the degree of delight or annoyance of the user of an application or service ( Le Callet et al., 2012 ). QoE is strongly linked to subjective and objective quality assessment methods. Many years of research have resulted in the successful development of perceptual visual quality metrics based on models of human perception ( Lin and Kuo, 2011 ; Bovik, 2013 ). More recently, deep learning-based approaches have also been successfully applied to this problem ( Bosse et al., 2017 ). While these perceptual quality metrics have achieved good performance, several significant challenges remain. First, when applied to video sequences, most current perceptual metrics operate on individual images, neglecting temporal modeling. Second, whereas color is a key attribute, there are currently no widely accepted perceptual quality metrics explicitly considering color. Finally, new modalities, such as 360° videos, light fields, point clouds, and HDR, require new approaches.
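For reference, the snippet below computes two widely used full-reference fidelity measures, PSNR and SSIM, with scikit-image. These are signal-fidelity baselines rather than the advanced perceptual or learned metrics discussed above, and the random test images are placeholders.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

# Reference image and a degraded version (additive Gaussian noise).
ref = np.random.rand(256, 256)
degraded = np.clip(ref + 0.05 * np.random.randn(256, 256), 0.0, 1.0)

psnr = peak_signal_noise_ratio(ref, degraded, data_range=1.0)
ssim = structural_similarity(ref, degraded, data_range=1.0)
print(f"PSNR = {psnr:.2f} dB, SSIM = {ssim:.3f}")
```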

Another closely related topic is image esthetic assessment ( Deng et al., 2017 ). The esthetic quality of an image is affected by numerous factors, such as lighting, color, contrast, and composition. It is useful in different application scenarios such as image retrieval and ranking, recommendation, and photo enhancement. While earlier attempts used handcrafted features, most recent techniques to predict esthetic quality are data-driven and based on deep learning approaches, leveraging the availability of large annotated datasets for training ( Murray et al., 2012 ). One key challenge is the inherently subjective nature of esthetics assessment, resulting in ambiguity in the ground-truth labels. Another important issue is to explain the behavior of deep esthetic prediction models.

Analysis, Interpretation and Understanding

Another major research direction has been the objective to efficiently analyze, interpret and understand visual data. This goal is challenging, due to the high diversity and complexity of visual data. This has led to many research activities, involving both low-level and high-level analysis, addressing topics such as image classification and segmentation, optical flow, image indexing and retrieval, object detection and tracking, and scene interpretation and understanding. Hereafter, we discuss some trends and challenges.

Keypoint Detection and Local Descriptors

Local image matching has been the cornerstone of many analysis tasks. It involves the detection of keypoints, i.e., salient visual points that can be robustly and repeatedly detected, and descriptors, i.e., compact signatures locally describing the visual features at each keypoint. Pairwise matching of these descriptors then reveals local correspondences. In this context, several frameworks have been proposed, including Scale Invariant Feature Transform (SIFT) ( Lowe, 2004 ) and Speeded Up Robust Features (SURF) ( Bay et al., 2008 ), and later binary variants including Binary Robust Independent Elementary Features (BRIEF) ( Calonder et al., 2010 ), Oriented FAST and Rotated BRIEF (ORB) ( Rublee et al., 2011 ) and Binary Robust Invariant Scalable Keypoints (BRISK) ( Leutenegger et al., 2011 ). Although these approaches exhibit scale and rotation invariance, they are less suited to deal with large 3D distortions such as perspective deformations, out-of-plane rotations, and significant viewpoint changes. Besides, they tend to fail under significantly varying and challenging illumination conditions.
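A minimal example of the keypoint/descriptor/matching pipeline, using OpenCV's ORB on a synthetic image and a rotated copy of it; the synthetic content and parameter values are arbitrary choices for illustration.

```python
import cv2
import numpy as np

# Build a synthetic test image with a few shapes, plus a rotated copy of it.
img1 = np.zeros((480, 640), dtype=np.uint8)
rng = np.random.default_rng(0)
for _ in range(30):
    x, y = int(rng.integers(0, 600)), int(rng.integers(0, 440))
    cv2.rectangle(img1, (x, y), (x + 30, y + 30), int(rng.integers(60, 255)), -1)
M = cv2.getRotationMatrix2D((320, 240), angle=20, scale=1.0)
img2 = cv2.warpAffine(img1, M, (640, 480))

# ORB keypoints + binary descriptors, matched with Hamming distance.
orb = cv2.ORB_create(nfeatures=500)
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)
print(f"{len(kp1)} and {len(kp2)} keypoints, {len(matches)} cross-checked matches")
```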

These traditional approaches based on handcrafted features have been successfully applied to problems such as image and video retrieval, object detection, visual Simultaneous Localization And Mapping (SLAM), and visual odometry. Besides, the emergence of new imaging modalities as introduced above can also be beneficial for image analysis tasks, including light fields ( Galdi et al., 2019 ), point clouds ( Guo et al., 2020 ), and HDR ( Rana et al., 2018 ). However, when applied to high-dimensional visual data for semantic analysis and understanding, these approaches based on handcrafted features have been supplanted in recent years by approaches based on deep learning.

Deep Learning-Based Methods

Data-driven deep learning-based approaches ( LeCun et al., 2015 ), and in particular the Convolutional Neural Network (CNN) architecture, nowadays represent the state of the art in terms of performance for complex pattern recognition tasks in scene analysis and understanding. By combining multiple processing layers, deep models are able to learn data representations with different levels of abstraction.

Supervised learning is the most common form of deep learning. It requires a large, fully labeled training dataset; building such a dataset is typically time-consuming and expensive and must be repeated whenever a new application scenario is tackled. Moreover, in some specialized domains, e.g., medical data, it can be very difficult to obtain annotations. To alleviate this major burden, methods such as transfer learning and weakly supervised learning have been proposed.
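A common way to reduce the labeling burden is transfer learning: reuse a backbone pretrained on a large labeled corpus and retrain only a small task-specific head. The sketch below illustrates this with a torchvision ResNet-18, assuming a recent torchvision and a downloadable pretrained checkpoint; the 5-class target task and the dummy batch are placeholders.

```python
import torch
import torch.nn as nn
from torchvision import models

# Start from an ImageNet-pretrained backbone and keep its weights frozen.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
for param in model.parameters():
    param.requires_grad = False

# Replace the classification head for a hypothetical 5-class target task.
model.fc = nn.Linear(model.fc.in_features, 5)

# Only the new head is optimized, so few labeled samples are needed.
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# One dummy training step on random data, for illustration only.
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, 5, (8,))
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
print(f"loss = {loss.item():.3f}")
```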

In another direction, deep models have been shown to be vulnerable to adversarial attacks ( Akhtar and Mian, 2018 ). These attacks introduce subtle perturbations to the input, such that the model predicts an incorrect output. For instance, in the case of images, imperceptible pixel differences are able to fool deep learning models. Such adversarial attacks are a serious obstacle to the successful deployment of deep learning, especially in applications where safety and security are critical. While some early solutions have been proposed, a significant challenge is to develop effective defense mechanisms against these attacks.
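To illustrate how simple such perturbations can be to construct, the sketch below implements the classic Fast Gradient Sign Method (FGSM), a one-step white-box attack; the toy linear model and the epsilon value are illustrative assumptions only.

```python
import torch
import torch.nn as nn

def fgsm_attack(model, image, label, epsilon=0.03):
    """Fast Gradient Sign Method: one-step perturbation along the sign of the
    loss gradient with respect to the input pixels."""
    image = image.clone().detach().requires_grad_(True)
    loss = nn.functional.cross_entropy(model(image), label)
    loss.backward()
    adversarial = image + epsilon * image.grad.sign()
    return adversarial.clamp(0.0, 1.0).detach()

# Toy model and input, for illustration only.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
x = torch.rand(1, 3, 32, 32)
y = torch.tensor([3])
x_adv = fgsm_attack(model, x, y)
print("max pixel change:", (x_adv - x).abs().max().item())
```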

Finally, another challenge is to enable low-complexity and efficient implementations. This is especially important for mobile or embedded applications. For this purpose, further interactions between signal processing and machine learning can potentially bring additional benefits. For instance, one direction is to compress deep neural networks in order to enable their more efficient handling. Moreover, by combining traditional processing techniques with deep learning models, it is possible to develop low-complexity solutions while preserving high performance.

Explainability in Deep Learning

While data-driven deep learning models often achieve impressive performance on many visual analysis tasks, their black-box nature often makes it inherently very difficult to understand how they reach a predicted output and how that output relates to particular characteristics of the input data. This is a major impediment in many decision-critical application scenarios. Moreover, it is important not only to have confidence in the proposed solution, but also to gain further insights from it. Based on these considerations, some deep learning systems aim at promoting explainability ( Adadi and Berrada, 2018 ; Xie et al., 2020 ). This can be achieved by exhibiting traits related to confidence, trust, safety, and ethics.

However, explainable deep learning is still in its early phase. More developments are needed, in particular to develop a systematic theory of model explanation. Important aspects include the need to understand and quantify risk, to comprehend how the model makes predictions for transparency and trustworthiness, and to quantify the uncertainty in the model prediction. This challenge is key in order to deploy and use deep learning-based solutions in an accountable way, for instance in application domains such as healthcare or autonomous driving.

Self-Supervised Learning

Self-supervised learning refers to methods that learn general visual features from large-scale unlabeled data, without the need for manual annotations. Self-supervised learning is therefore very appealing, as it allows exploiting the vast amount of unlabeled images and videos available. Moreover, it is widely believed to be closer to how humans actually learn. One common approach is to let the data itself provide the supervision, leveraging its structure. More generally, a pretext task can be defined, e.g., image inpainting, colorizing grayscale images, or predicting future frames in videos, by withholding some part of the data and training the neural network to predict it ( Jing and Tian, 2020 ). By learning an objective function corresponding to the pretext task, the network is forced to learn relevant visual features in order to solve the problem. Self-supervised learning has also been successfully applied to autonomous vehicle perception. More specifically, the complementarity between analytical and learning methods can be exploited to address various autonomous driving perception tasks, without the prerequisite of an annotated data set ( Chiaroni et al., 2021 ).
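The sketch below illustrates one classic pretext task, rotation prediction: each image is rotated by a random multiple of 90° and the network is trained to predict the rotation, with no manual labels involved. The tiny backbone and the random batch are placeholders; in practice the pretrained encoder would then be transferred to the downstream task.

```python
import torch
import torch.nn as nn

# A pretext task needs no manual labels: the rotation applied to each image
# is itself the training target.
def rotation_batch(images):
    ks = torch.randint(0, 4, (images.size(0),))              # 0, 90, 180, 270 degrees
    rotated = torch.stack([torch.rot90(img, int(k), dims=(1, 2))
                           for img, k in zip(images, ks)])
    return rotated, ks

encoder = nn.Sequential(                                      # tiny CNN backbone
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten())
head = nn.Linear(16, 4)                                       # 4 rotation classes

optimizer = torch.optim.Adam(
    list(encoder.parameters()) + list(head.parameters()), lr=1e-3)
images = torch.rand(16, 3, 64, 64)                            # unlabeled batch
rotated, targets = rotation_batch(images)
loss = nn.functional.cross_entropy(head(encoder(rotated)), targets)
loss.backward()
optimizer.step()
print(f"pretext loss = {loss.item():.3f}")
```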

While good performance has already been obtained using self-supervised learning, further work is still needed. A few promising directions are outlined hereafter. Combining self-supervised learning with other learning methods is a first interesting path. For instance, semi-supervised learning ( Van Engelen and Hoos, 2020 ) and few-shot learning ( Fei-Fei et al., 2006 ) methods have been proposed for scenarios where limited labeled data is available. The performance of these methods can potentially be boosted by incorporating self-supervised pre-training. The pretext task can also serve to add regularization. Another interesting trend in self-supervised learning is to train neural networks with synthetic data. The challenge here is to bridge the domain gap between synthetic and real data. Finally, another compelling direction is to exploit data from different modalities. A simple example is to consider both the video and audio signals in a video sequence. In another example in the context of autonomous driving, vehicles are typically equipped with multiple sensors, including cameras, LIght Detection And Ranging (LIDAR), Global Positioning System (GPS), and Inertial Measurement Units (IMU). In such cases, it is easy to acquire large unlabeled multimodal datasets, where the different modalities can be effectively exploited in self-supervised learning methods.

Reproducible Research and Large Public Datasets

The reproducible research initiative is another way to further ensure high-quality research for the benefit of our community ( Vandewalle et al., 2009 ). Reproducibility, referring to the ability of someone else working independently to accurately reproduce the results of an experiment, is a key principle of the scientific method. In the context of image and video processing, it is usually not sufficient to provide a detailed description of the proposed algorithm. Most often, it is essential to also provide access to the code and data. This is even more imperative in the case of deep learning-based models.

In parallel, the availability of large public datasets is also highly desirable in order to support research activities. This is especially critical for new emerging modalities or specific application scenarios, where it is difficult to get access to relevant data. Moreover, with the emergence of deep learning, large datasets, along with labels, are often needed for training, which can be another burden.

Conclusion and Perspectives

The field of image processing is very broad and rich, with many successful applications in both the consumer and business markets. However, many technical challenges remain in order to further push the limits of imaging technologies. Two main trends are, on the one hand, to keep improving the quality and realism of image and video content, and on the other hand, to effectively interpret and understand this vast and complex amount of visual data. However, the list is certainly not exhaustive and there are many other interesting problems, e.g., related to computational imaging, information security and forensics, or medical imaging. Key innovations will be found at the crossroads of image processing, optics, psychophysics, communication, computer vision, artificial intelligence, and computer graphics. Multi-disciplinary collaborations involving actors from both academia and industry are therefore critical moving forward, in order to drive these breakthroughs.

The “Image Processing” section of Frontiers in Signal Processing aims to give the research community a forum to exchange, discuss, and improve new ideas, with the goal of contributing to the further advancement of the field of image processing and of bringing exciting innovations in the foreseeable future.

Author Contributions

The author confirms being the sole contributor of this work and has approved it for publication.

Conflict of Interest

The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

1 https://www.brandwatch.com/blog/amazing-social-media-statistics-and-facts/ (accessed on Feb. 23, 2021).

Adadi, A., and Berrada, M. (2018). Peeking inside the black-box: a survey on explainable artificial intelligence (XAI). IEEE access 6, 52138–52160. doi:10.1109/access.2018.2870052


Adelson, E. H., and Bergen, J. R. (1991). “The plenoptic function and the elements of early vision” Computational models of visual processing . Cambridge, MA: MIT Press , 3-20.


Akhtar, N., and Mian, A. (2018). Threat of adversarial attacks on deep learning in computer vision: a survey. IEEE Access 6, 14410–14430. doi:10.1109/access.2018.2807385

Bay, H., Ess, A., Tuytelaars, T., and Van Gool, L. (2008). Speeded-up robust features (SURF). Computer Vis. image understanding 110 (3), 346–359. doi:10.1016/j.cviu.2007.09.014

Bosse, S., Maniry, D., Müller, K. R., Wiegand, T., and Samek, W. (2017). Deep neural networks for no-reference and full-reference image quality assessment. IEEE Trans. Image Process. 27 (1), 206–219. doi:10.1109/TIP.2017.2760518


Bovik, A. C. (2013). Automatic prediction of perceptual image and video quality. Proc. IEEE 101 (9), 2008–2024. doi:10.1109/JPROC.2013.2257632

Bross, B., Chen, J., Ohm, J. R., Sullivan, G. J., and Wang, Y. K. (2021). Developments in international video coding standardization after AVC, with an overview of Versatile Video Coding (VVC). Proc. IEEE . doi:10.1109/JPROC.2020.3043399

Calonder, M., Lepetit, V., Strecha, C., and Fua, P. (2010). Brief: binary robust independent elementary features. In K. Daniilidis, P. Maragos, and N. Paragios (eds) European conference on computer vision . Berlin, Heidelberg: Springer , 778–792. doi:10.1007/978-3-642-15561-1_56

Chalmers, A., and Debattista, K. (2017). HDR video past, present and future: a perspective. Signal. Processing: Image Commun. 54, 49–55. doi:10.1016/j.image.2017.02.003

Chiaroni, F., Rahal, M.-C., Hueber, N., and Dufaux, F. (2021). Self-supervised learning for autonomous vehicles perception: a conciliation between analytical and learning methods. IEEE Signal. Process. Mag. 38 (1), 31–41. doi:10.1109/msp.2020.2977269

Cisco (2019). Cisco visual networking index: forecast and trends, 2017-2022 (white paper) . Indianapolis, Indiana: Cisco Press .

Conti, C., Soares, L. D., and Nunes, P. (2020). Dense light field coding: a survey. IEEE Access 8, 49244–49284. doi:10.1109/ACCESS.2020.2977767

Da Costa, D. B., and Yang, H.-C. (2020). Grand challenges in wireless communications. Front. Commun. Networks 1 (1), 1–5. doi:10.3389/frcmn.2020.00001

Deng, Y., Loy, C. C., and Tang, X. (2017). Image aesthetic assessment: an experimental survey. IEEE Signal. Process. Mag. 34 (4), 80–106. doi:10.1109/msp.2017.2696576

Ding, D., Ma, Z., Chen, D., Chen, Q., Liu, Z., and Zhu, F. (2021). Advances in video compression system using deep neural network: a review and case studies . Ithaca, NY: Cornell university .

Duan, L., Liu, J., Yang, W., Huang, T., and Gao, W. (2020). Video coding for machines: a paradigm of collaborative compression and intelligent analytics. IEEE Trans. Image Process. 29, 8680–8695. doi:10.1109/tip.2020.3016485

Dufaux, F., Le Callet, P., Mantiuk, R., and Mrak, M. (2016). High dynamic range video - from acquisition, to display and applications . Cambridge, Massachusetts: Academic Press .

Dufaux, F., Pesquet-Popescu, B., and Cagnazzo, M. (2013). Emerging technologies for 3D video: creation, coding, transmission and rendering . Hoboken, NJ: Wiley .

Fei-Fei, L., Fergus, R., and Perona, P. (2006). One-shot learning of object categories. IEEE Trans. Pattern Anal. Mach Intell. 28 (4), 594–611. doi:10.1109/TPAMI.2006.79

Galdi, C., Chiesa, V., Busch, C., Lobato Correia, P., Dugelay, J.-L., and Guillemot, C. (2019). Light fields for face analysis. Sensors 19 (12), 2687. doi:10.3390/s19122687

Graziosi, D., Nakagami, O., Kuma, S., Zaghetto, A., Suzuki, T., and Tabatabai, A. (2020). An overview of ongoing point cloud compression standardization activities: video-based (V-PCC) and geometry-based (G-PCC). APSIPA Trans. Signal Inf. Process. 9, 2020. doi:10.1017/ATSIP.2020.12

Guarda, A., Rodrigues, N., and Pereira, F. (2020). Adaptive deep learning-based point cloud geometry coding. IEEE J. Selected Top. Signal Process. 15, 415-430. doi:10.1109/mmsp48831.2020.9287060

Guo, Y., Wang, H., Hu, Q., Liu, H., Liu, L., and Bennamoun, M. (2020). Deep learning for 3D point clouds: a survey. IEEE transactions on pattern analysis and machine intelligence . doi:10.1109/TPAMI.2020.3005434

Han, X.-F., Jin, J. S., Wang, M.-J., Jiang, W., Gao, L., and Xiao, L. (2017). A review of algorithms for filtering the 3D point cloud. Signal. Processing: Image Commun. 57, 103–112. doi:10.1016/j.image.2017.05.009

Haskell, B. G., Puri, A., and Netravali, A. N. (1996). Digital video: an introduction to MPEG-2 . Berlin, Germany: Springer Science and Business Media .

Hirsch, R. (1999). Seizing the light: a history of photography . New York, NY: McGraw-Hill .

Ihrke, I., Restrepo, J., and Mignard-Debise, L. (2016). Principles of light field imaging: briefly revisiting 25 years of research. IEEE Signal. Process. Mag. 33 (5), 59–69. doi:10.1109/MSP.2016.2582220

Jing, L., and Tian, Y. (2020). “Self-supervised visual feature learning with deep neural networks: a survey,” IEEE transactions on pattern analysis and machine intelligence , Ithaca, NY: Cornell University .

Le Callet, P., Möller, S., and Perkis, A. (2012). Qualinet white paper on definitions of quality of experience. European network on quality of experience in multimedia systems and services (COST Action IC 1003), 3(2012) .

Le Gall, D. (1991). MPEG: a video compression standard for multimedia applications. Commun. ACM 34, 46–58. doi:10.1145/103085.103090

LeCun, Y., Bengio, Y., and Hinton, G. (2015). Deep learning. nature 521 (7553), 436–444. doi:10.1038/nature14539

Leutenegger, S., Chli, M., and Siegwart, R. Y. (2011). “BRISK: binary robust invariant scalable keypoints,” IEEE International conference on computer vision , Barcelona, Spain , 6-13 Nov, 2011 ( IEEE ), 2548–2555.

Lin, W., and Jay Kuo, C.-C. (2011). Perceptual visual quality metrics: a survey. J. Vis. Commun. image representation 22 (4), 297–312. doi:10.1016/j.jvcir.2011.01.005

Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 60 (2), 91–110. doi:10.1023/b:visi.0000029664.99615.94

Lumiere, L. (1996). 1936 the lumière cinematograph. J. Smpte 105 (10), 608–611. doi:10.5594/j17187

Masia, B., Wetzstein, G., Didyk, P., and Gutierrez, D. (2013). A survey on computational displays: pushing the boundaries of optics, computation, and perception. Comput. & Graphics 37 (8), 1012–1038. doi:10.1016/j.cag.2013.10.003

Murray, N., Marchesotti, L., and Perronnin, F. (2012). “AVA: a large-scale database for aesthetic visual analysis,” IEEE conference on computer vision and pattern recognition , Providence, RI , June, 2012 . ( IEEE ), 2408–2415. doi:10.1109/CVPR.2012.6247954

Rana, A., Valenzise, G., and Dufaux, F. (2018). Learning-based tone mapping operator for efficient image matching. IEEE Trans. Multimedia 21 (1), 256–268. doi:10.1109/TMM.2018.2839885

Rublee, E., Rabaud, V., Konolige, K., and Bradski, G. (2011). “ORB: an efficient alternative to SIFT or SURF,” IEEE International conference on computer vision , Barcelona, Spain , November, 2011 ( IEEE ), 2564–2571. doi:10.1109/ICCV.2011.6126544

Slater, M. (2014). Grand challenges in virtual environments. Front. Robotics AI 1, 3. doi:10.3389/frobt.2014.00003

Van Engelen, J. E., and Hoos, H. H. (2020). A survey on semi-supervised learning. Mach Learn. 109 (2), 373–440. doi:10.1007/s10994-019-05855-6

Vandewalle, P., Kovacevic, J., and Vetterli, M. (2009). Reproducible research in signal processing. IEEE Signal. Process. Mag. 26 (3), 37–47. doi:10.1109/msp.2009.932122

Wallace, G. K. (1992). The JPEG still picture compression standard. IEEE Trans. Consumer Electron. 38 (1), xviii-xxxiv. doi:10.1109/30.125072

Wien, M., Boyce, J. M., Stockhammer, T., and Peng, W.-H. (2019). Standardization status of immersive video coding. IEEE J. Emerg. Sel. Top. Circuits Syst. 9 (1), 5–17. doi:10.1109/JETCAS.2019.2898948

Wu, G., Masia, B., Jarabo, A., Zhang, Y., Wang, L., Dai, Q., et al. (2017). Light field image processing: an overview. IEEE J. Sel. Top. Signal. Process. 11 (7), 926–954. doi:10.1109/JSTSP.2017.2747126

Xie, N., Ras, G., van Gerven, M., and Doran, D. (2020). Explainable deep learning: a field guide for the uninitiated . Ithaca, NY: Cornell University .

Keywords: image processing, immersive, image analysis, image understanding, deep learning, video processing

Citation: Dufaux F (2021) Grand Challenges in Image Processing. Front. Sig. Proc. 1:675547. doi: 10.3389/frsip.2021.675547

Received: 03 March 2021; Accepted: 10 March 2021; Published: 12 April 2021.

Copyright © 2021 Dufaux. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Frédéric Dufaux, [email protected]


  • Open access
  • Published: 05 December 2018

Application research of digital media image processing technology based on wavelet transform

Lina Zhang, Lijuan Zhang & Liduo Zhang

EURASIP Journal on Image and Video Processing, volume 2018, Article number: 138 (2018)


Abstract

With the development of information technology, people increasingly access information through the network, and more than 80% of the information on the network is multimedia content represented by images. Research on image processing technology is therefore very important, but most existing work focuses on a single aspect of the processing chain; results on unified modeling across the different stages of image processing are still rare. To this end, this paper builds a unified model covering image denoising, watermarking, encryption and decryption, and image compression, using the wavelet transform as the common method, and evaluates it in simulations on 300 photos taken from daily life. The results show that the unified model achieves good results in all of these aspects of image processing.

1 Introduction

With the increase in computer processing power, the objects people process with computers have slowly shifted from characters to images. According to statistics, more than 80% of the information transmitted and stored today, especially on the Internet, is image information. Compared with character-based information, image information is much more complex, so processing images on a computer is more complicated than processing characters. Therefore, in order to make the use of image information safer and more convenient, it is particularly important to carry out applied research on digital media images. Digital media image processing technology mainly includes denoising, encryption, compression, storage, and many other aspects.

The purpose of image denoising is to remove noise from the image so as to highlight the meaning of the image itself. Image acquisition, processing, and transmission all degrade the original signal of the image. Noise is an important factor that interferes with the clarity of an image; its sources are varied and mainly lie in the transmission and quantization processes. According to the relationship between noise and signal, noise can be divided into additive noise, multiplicative noise, and quantization noise. Commonly used methods for image noise removal include the mean filter, the adaptive Wiener filter, the median filter, and wavelet transform methods. For example, the image denoising performed by the neighborhood averaging method used in the literature [ 1 , 2 , 3 ] is a mean filtering method suitable for removing particle noise in scanned images. The neighborhood averaging method strongly suppresses the noise but also causes blurring due to the averaging; the degree of blurring is proportional to the radius of the neighborhood. The Wiener filter adjusts its output based on the local variance of the image and has the best filtering effect on images with white noise; for example, in the literature [ 4 , 5 ], this method is used for image denoising and good denoising results are obtained. Median filtering is a commonly used nonlinear smoothing filter that is very effective in filtering out salt-and-pepper noise while protecting the edges of the image, giving a satisfactory restoration; in practice, it does not require the statistical characteristics of the image, which brings a lot of convenience. For example, the literature [ 6 , 7 , 8 ] reports successful cases of image denoising using median filtering. Wavelet analysis denoises the image by operating on the wavelet coefficients at each decomposition level, so image details can be well preserved, as in the literature [ 9 , 10 ].

Image encryption is another important application area of digital image processing technology, mainly including two aspects: digital watermarking and image encryption. Digital watermarking technology directly embeds some identification information (that is, the digital watermark) into a digital carrier (including multimedia, documents, software, etc.) without affecting the use value of the original carrier and without being easily perceived or noticed by the human perceptual system (such as the visual or auditory system). Through the information hidden in the carrier, it is possible to confirm the content creator or purchaser, transmit secret information, or determine whether the carrier has been tampered with. Digital watermarking is an important research direction of information hiding technology; for example, the literature [ 11 , 12 ] studies image digital watermarking methods. Some researchers have also tried to use wavelet methods for digital watermarking. For example, AH Paquet [ 13 ] and others used wavelet packets for personal authentication with digital watermarks in 2003, successfully introducing wavelet theory into digital watermark research and opening up a new direction for image-based digital watermarking technology. To achieve digital image secrecy in practice, the two-dimensional image is generally converted into one-dimensional data and then encrypted by a conventional encryption algorithm. Unlike ordinary text information, images and videos are temporal, spatial, visually perceptible, and amenable to lossy compression; these features make it possible to design more efficient and secure encryption algorithms for images. For example, Z Wen [ 14 ] and others use a key value to generate real-valued chaotic sequences and then encrypt the image by scrambling it in the spatial domain; the experimental results show that the technique is effective and safe. YY Wang [ 15 ] et al. proposed a new optical image encryption method using a binary Fourier transform computer-generated hologram (CGH) and pixel scrambling technology, in which the order of pixel scrambling and the encrypted image are used as keys for decrypting the original image. Zhang X Y [ 16 ] et al. combined the mathematical principles of two-dimensional cellular automata (CA) with image encryption technology and proposed a new image encryption algorithm that is convenient to implement and has good security, a large key space, a good avalanche effect, strong confusion and diffusion characteristics, simple operation, low computational complexity, and high speed.

In order to transmit image information quickly, image compression is also a research direction of image application technology. The information age has brought about an “information explosion” and a corresponding increase in the amount of data, so data needs to be effectively compressed for both transmission and storage. For example, in remote sensing, space probes use compression coding technology to send huge amounts of information back to the ground. Image compression is the application of data compression technology to digital images; its purpose is to reduce redundant information in image data and to store and transmit data in a more efficient format. Through the unremitting efforts of researchers, image compression technology is now maturing. For example, Lewis A S [ 17 ] hierarchically encodes the transformed coefficients and designs a new image compression method based on the local noise sensitivity of the human visual system (HVS); the algorithm can be easily mapped to a 2-D orthogonal wavelet transform to decompose the image into spatially and spectrally local coefficients. Devore R A [ 18 ] introduced a novel theory to analyze image compression methods based on wavelet decomposition. Buccigrossi R W [ 19 ] developed a probabilistic model of natural images based on empirical observations of statistics in the wavelet transform domain. The wavelet coefficients of basis functions corresponding to adjacent spatial locations, orientations, and scales are found to be non-Gaussian in their marginal and joint statistical properties. They proposed a Markov model that uses linear predictors to capture these dependencies, where amplitude is combined with multiplicative and additive uncertainty, and showed that it can explain the statistics of various images, including photographic, graphic, and medical images. To directly demonstrate the efficacy of the model, an image encoder called the Embedded Prediction Wavelet Image Coder (EPWIC) was constructed in their research. The subband coefficients are encoded one bit plane at a time using a non-adaptive arithmetic coder, and the encoder sorts the bit planes with a greedy algorithm based on the conditional probabilities calculated from the model, considering the MSE reduction for each coded bit. The decoder uses the statistical model to predict coefficient values based on the bits it has received. Although the model is simple, the rate-distortion performance of the encoder is roughly equivalent to the best image encoders in the literature.

From the existing research results, we find that digital image application research has achieved fruitful results. However, these results mainly focus on methods, such as deep learning [ 20 , 21 ], genetic algorithms [ 22 , 23 ], and fuzzy theory [ 24 , 25 ], which also include wavelet analysis. The biggest problem in existing image application research is that digital multimedia processing technology is an organic whole: from denoising, compression, storage, encryption, and decryption to retrieval, it should be treated as a whole, yet current research results basically study only one part of this whole. Therefore, although a method may be superior for one of these stages, it is not clear whether it will also be suitable for the other stages. To address this problem, this paper takes the digital image as the research object, realizes unified modeling over the three main steps of encryption, compression, and retrieval in image processing, and studies the capability of a single method across multiple processing steps.

The wavelet transform is a commonly used digital signal processing method. Since most digital signals are composed of components at multiple frequencies, a signal typically contains noise components, secondary components, and main components. In image processing, many research teams have also used the wavelet transform as a processing method and achieved good results. So, can the wavelet transform be used as a single method to build a model suitable for a variety of image processing applications?

In this paper, the wavelet transform is used to establish a unified denoising, encryption, and compression model for the image processing pipeline, and captured images are used for simulation. The results show that the same wavelet transform parameters achieve good results across the different image processing applications.

2.1 Image binarization processing method

The gray value of each pixel in an image ranges from 0 to 255. In image processing, in order to facilitate further processing, the frame of the image is first highlighted by binarization. So-called binarization maps the gray value of each pixel from the range 0–255 to either 0 or 255. In this process, threshold selection is a key step. The threshold used in this paper is obtained with the maximum between-class variance method (Otsu). For an image, when the segmentation threshold between foreground and background is t , the proportion of foreground pixels is w0 with mean u0, and the proportion of background pixels is w1 with mean u1. Then the mean of the entire image is:

\( u = w_0 u_0 + w_1 u_1 \)

The objective function can be established according to formula (1):

\( g(t) = w_0 w_1 (u_0 - u_1)^2 \)   (1)

The OTSU algorithm makes g ( t ) take the global maximum, and the corresponding t when g ( t ) is maximum is called the optimal threshold.
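For illustration, Otsu thresholding is available directly in OpenCV; the minimal sketch below binarizes a synthetic bimodal image (the image content and noise level are arbitrary placeholders).

```python
import cv2
import numpy as np

# Synthetic bimodal test image: dark background with a brighter square, plus noise.
gray = np.full((256, 256), 60, dtype=np.uint8)
gray[64:192, 64:192] = 180
noise = np.random.normal(0, 10, gray.shape)
gray = np.clip(gray.astype(np.float64) + noise, 0, 255).astype(np.uint8)

# Otsu's method searches for the threshold t that maximizes g(t) = w0*w1*(u0-u1)^2.
t, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
print("Otsu threshold:", t, "values in result:", np.unique(binary))
```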

2.2 Wavelet transform method

The wavelet transform (WT) grew out of the development of Fourier transform techniques. Whereas the Fourier transform only decomposes a signal into different frequencies, the wavelet transform retains localization while providing frequency information, with an analysis window that adapts with scale. Therefore, compared with the Fourier transform, the wavelet transform is better suited to time-frequency analysis. Its biggest advantage is that it can represent local features of a signal at particular frequencies, and by varying the scale it divides the signal into low-frequency and high-frequency bands, making features more concentrated. This paper mainly uses the wavelet transform to analyze the image in different frequency bands. For a signal f ( t ), the wavelet transform can be expressed as follows:

\( WT_f(a,\tau) = \frac{1}{\sqrt{a}} \int_{-\infty}^{+\infty} f(t)\, \psi\!\left(\frac{t-\tau}{a}\right) dt \)

Where ψ ( t ) is the mother wavelet, a is the scale factor, and τ is the translation factor.

Because the image signal is a two-dimensional signal, when using the wavelet transform for image analysis, it is necessary to generalize the wavelet transform to two dimensions. Suppose the image signal is represented by f ( x ,  y ), ψ ( x ,  y ) represents a two-dimensional basic wavelet, and ψ a , b , c ( x ,  y ) represents the scaled and translated basic wavelet, that is:

\( \psi_{a,b,c}(x,y) = \frac{1}{a}\, \psi\!\left(\frac{x-b}{a}, \frac{y-c}{a}\right) \)

According to the above definition of the continuous wavelet, the two-dimensional continuous wavelet transform can be calculated by the following formula:

\( WT_f(a,b,c) = \frac{1}{a} \iint f(x,y)\, \overline{\psi\!\left(\frac{x-b}{a}, \frac{y-c}{a}\right)}\, dx\, dy \)

Where \( \overline{\psi \left(x,y\right)} \) is the conjugate of ψ ( x ,  y ).
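In practice the 2-D discrete wavelet decomposition can be computed with PyWavelets, as sketched below with the db3 wavelet and the three decomposition levels used later in this paper; the random test image is a placeholder.

```python
import numpy as np
import pywt

# Three-level 2-D discrete wavelet decomposition with a Daubechies-3 wavelet.
image = np.random.rand(256, 256)
coeffs = pywt.wavedec2(image, wavelet='db3', level=3)

approx = coeffs[0]                       # low-frequency approximation subimage
# coeffs[1:] runs from the coarsest to the finest detail level.
for lvl, (cH, cV, cD) in enumerate(coeffs[1:], start=1):
    print(f"detail set {lvl}: horizontal {cH.shape}, vertical {cV.shape}, diagonal {cD.shape}")

# Perfect reconstruction from the coefficients (cropped to the original size).
reconstructed = pywt.waverec2(coeffs, wavelet='db3')[:256, :256]
print("max reconstruction error:", np.abs(reconstructed - image).max())
```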

2.3 Digital water mark

According to different methods of use, digital watermarking technology can be divided into the following types:

Spatial domain approach: A typical watermarking algorithm of this type embeds information into the least significant bits (LSB) of randomly selected image points, which ensures that the embedded watermark is invisible. However, because it uses pixel bits that are not perceptually important, the robustness of the algorithm is poor, and the watermark information is easily destroyed by filtering, image quantization, and geometric deformation operations. Another common method is to use the statistical characteristics of the pixels to embed the information in the luminance values of the pixels.

Transform domain approach: first calculate the discrete cosine transform (DCT) of the image, and then superimpose the watermark on the k coefficients with the largest amplitude in the DCT domain (excluding the DC component), usually the low-frequency components of the image. If the k largest DCT coefficients are represented as D = { d i }, i  = 1, ..., k , and the watermark is a random real sequence W = { w i }, i  = 1, ..., k obeying a Gaussian distribution, then the watermark embedding rule is d i ′ = d i (1 +  a w i ), where the constant a is a scale factor that controls the strength of the watermark. The watermarked image I ′ is then obtained by applying the inverse transform with the new coefficients. The decoding function computes the discrete cosine transform of the original image I and the watermarked image I * respectively, extracts the embedded watermark W * , and then performs a correlation test to determine the presence or absence of the watermark.

Compressed domain algorithm: The compressed domain digital watermarking system based on JPEG and MPEG standards not only saves a lot of complete decoding and re-encoding process but also has great practical value in digital TV broadcasting and video on demand (VOD). Correspondingly, watermark detection and extraction can also be performed directly in the compressed domain data.

The wavelet transform used in this paper is a transform domain method. The main process is as follows: assume that x ( m ,  n ) is a grayscale picture of size M  ×  N with 2 a gray levels, where M , N and a are positive integers, and the ranges of m and n are 1 ≤  m  ≤  M , 1 ≤  n  ≤  N . If the image is decomposed into L wavelet levels ( L a positive integer), then 3 L high-frequency detail subimages and one low-frequency approximation subimage are obtained. The wavelet coefficients can then be denoted X K , L , where L is the number of decomposition levels and K can be H , V , or D , representing the horizontal, vertical, and diagonal subimages, respectively. Because distortion of the low-frequency subimage is highly visible, the watermark is embedded in the subimages other than the low-frequency one.

In order to embed the digital watermark, X K , L ( m i ,  n j ) must first be divided into blocks of a certain size; let B ( s , t ) denote a coefficient block of size s  ×  t in X K , L ( m i ,  n j ). The average value of the block can then be expressed by the following formula:

\( \mathrm{AVG} = \frac{\sum B(s,t)}{s \times t} \)

Where ∑ B ( s ,  t ) is the cumulative sum of the magnitudes of the coefficients within the block.

The embedding of the watermark sequence w is achieved by the quantization of AVG.

The quantization interval Δ l is chosen as a compromise between robustness and imperceptibility. For the coarsest layer L , since the coefficient amplitudes are large, a larger interval can be set; for the other layers, starting from layer L −1, the interval is successively decreased.

According to w i  = {0, 1}, AVG is quantized to the nearest odd or even quantization point. D ( i , j ) denotes the wavelet coefficients in the block, and the quantized coefficients are denoted D ( i ,  j ) ′ , where i  = 1, 2, ..., s ; j  = 1, 2, ..., t . Suppose T  =  AVG /Δ l and TD = rem(| T |, 2), where | | denotes rounding and rem denotes the remainder after division by 2.

According to whether TD and w i are the same, the calculation of the quantized wavelet coefficient D ( i ,  j ) ' can be as follows:

Using the same wavelet base, an image containing the watermark is generated by inverse wavelet transform, and the wavelet base, the wavelet decomposition layer number, the selected coefficient region, the blocking method, the quantization interval, and the parity correspondence are recorded to form a key.

Watermark extraction is determined by the embedding method and is its inverse. First, the wavelet transform is applied to the image under test, the position of the embedded watermark is determined according to the key, and the inverse of the scrambling operation is applied to recover the watermark.
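The following sketch illustrates the quantization idea only: one bit is embedded per block by moving the block mean of a wavelet detail subband into an even or odd quantization cell, and recovered by reading back the cell parity. The subband choice, block size, quantization step, and bit assignment are simplifying assumptions; the paper's full scheme additionally records the key (wavelet base, decomposition depth, coefficient region, blocking, interval, parity mapping), applies scrambling, and inverse-transforms to obtain the watermarked image.

```python
import numpy as np
import pywt

def embed_bit(block, bit, delta=8.0):
    """Shift the block mean to an even (bit 0) or odd (bit 1) quantization
    cell of width delta: a simplified quantization-index-modulation step."""
    avg = block.mean()
    q = np.floor(avg / delta)
    if int(q) % 2 != bit:
        q += 1
    target = (q + 0.5) * delta                  # center of the chosen cell
    return block + (target - avg)

def extract_bit(block, delta=8.0):
    return int(np.floor(block.mean() / delta)) % 2

image = np.random.rand(256, 256) * 255
coeffs = pywt.wavedec2(image, 'db3', level=3)
cH3 = coeffs[1][0].copy()                       # coarsest horizontal detail subband

watermark = [1, 0, 1, 1, 0, 0, 1, 0]
for i, bit in enumerate(watermark):
    cH3[0:4, 4 * i:4 * i + 4] = embed_bit(cH3[0:4, 4 * i:4 * i + 4], bit)

recovered = [extract_bit(cH3[0:4, 4 * i:4 * i + 4]) for i in range(len(watermark))]
print("embedded:", watermark)
print("recovered:", recovered)
```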

2.4 Evaluation method

Normalized mean square error

In order to measure the effect before and after filtering, this paper uses the normalized mean square error M . The calculation of M is as follows:

where N 1 and N 2 are the pixel values before and after filtering, respectively.

Normalized cross-correlation function

The normalized cross-correlation function is a classic algorithm of image matching algorithm, which can be used to represent the similarity of images. The normalized cross-correlation is determined by calculating the cross-correlation metric between the reference map and the template graph, generally expressed by NC( i , j ). If the NC value is larger, it means that the similarity between the two is greater. The calculation formula for the cross-correlation metric is as follows:

where T ( m , n ) is the pixel value at row n and column m of the template image; S ( i , j ) is the part of the reference image covered by the template; and ( i , j ) are the coordinates of the lower left corner of the subimage in the reference image S .

Normalize the above formula NC according to the following formula:

Peak signal-to-noise ratio

Peak signal-to-noise ratio is often used as a measure of signal reconstruction quality in areas such as image compression, and it is often simply defined via the mean square error (MSE). For two m  ×  n monochrome images I and K , where one is a noisy approximation of the other, the mean square error is defined as:

\( MSE = \frac{1}{mn} \sum_{i=0}^{m-1} \sum_{j=0}^{n-1} \left[ I(i,j) - K(i,j) \right]^2 \)

Then the peak signal-to-noise ratio PSNR calculation method is:

Where Max is the maximum value of the pigment representing the image.
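The standard definition can be sketched as follows, assuming 8-bit grayscale images (MAX = 255):

```python
import numpy as np

def psnr(reference, test, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two monochrome images
    of the same size."""
    mse = np.mean((reference.astype(np.float64) - test.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")                  # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))
```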

Information entropy

For the digital signal of an image, different pixel values occur with different frequencies, so the image can be regarded as a signal with uncertainty. For image encryption, the higher the uncertainty of the image, the closer it is to random and the more difficult it is to crack; the lower the uncertainty, the more regular the image and the easier it is to crack. For a 256-level grayscale image, the maximum information entropy is 8, so the closer the computed value is to 8, the better.

The information entropy is computed as H = −∑ i p i  · log2( p i ), where p i is the probability of gray level i occurring in the image.
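A minimal sketch of this computation for an 8-bit grayscale image stored as a NumPy array:

```python
import numpy as np

def entropy(gray_image, levels=256):
    """Shannon information entropy of a grayscale image in bits;
    the maximum is 8 for a 256-level image."""
    hist, _ = np.histogram(gray_image, bins=levels, range=(0, levels))
    p = hist / hist.sum()
    p = p[p > 0]                             # skip empty bins (0 * log 0 = 0)
    return float(-np.sum(p * np.log2(p)))
```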

Correlation

Correlation is a parameter describing the relationship between two vectors. This paper uses the correlation coefficient to describe the relationship between the images before and after encryption. Let p ( x ,  y ) denote the correlation between corresponding pixels before and after encryption; p ( x ,  y ) is computed as the covariance of the two pixel sequences divided by the product of their standard deviations.
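A sketch using the standard (Pearson) correlation coefficient over corresponding pixels, assuming equal-sized NumPy arrays:

```python
import numpy as np

def correlation(img_a, img_b):
    """Correlation coefficient between two images of the same size,
    e.g. before and after encryption; values near 0 indicate that the
    encrypted image is uncorrelated with the original."""
    a = img_a.astype(np.float64).ravel()
    b = img_b.astype(np.float64).ravel()
    return float(np.corrcoef(a, b)[0, 1])
```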

3 Experiment

3.1 Image parameters

The images used in this paper are all everyday photographs taken with a Huawei Mate 10: 1440 × 1920 pixels, 96 dpi, 24-bit depth, no flash. A total of 300 such photographs were used as simulation images; all are ordinary life photographs with no special content.

3.2 System environment

The computer system used in this simulation is Windows 10, and the simulation software is MATLAB R2014b.

3.3 Wavelet transform-related parameters

For a unified model, this paper uses a three-level wavelet decomposition with a Daubechies wavelet as the basis. The Daubechies wavelets, constructed by the wavelet analyst Ingrid Daubechies, are generally abbreviated dbN, where N is the order of the wavelet. The support of the wavelet function Ψ( t ) and of the scaling function ϕ ( t ) is 2 N − 1, and Ψ( t ) has N vanishing moments. The dbN wavelets have good regularity, i.e., the smoothing error introduced when the wavelet is used as a sparse basis is hard to notice, which makes signal reconstruction smoother. As the order N increases, the number of vanishing moments increases; more vanishing moments give better smoothness, stronger frequency-domain localization, and better band division. However, the time-domain support widens, the amount of computation grows considerably, and real-time performance deteriorates. In addition, except for N  = 1, the dbN wavelets are not symmetric (i.e., they have nonlinear phase), so a certain phase distortion is introduced when a signal is analyzed and reconstructed. This paper uses N  = 3 (db3).

4 Results and discussion

4.1 Results 1: image filtering using wavelet transform

In the process of image recording, transmission, storage, and processing, the image signal may be polluted: the digital signal carried by the image acquires noise, and this noise often appears as isolated pixels. Although such isolated points do not destroy the overall structure of the image, they tend to be of high frequency and appear on the image as bright spots, which greatly degrades viewing quality, so the image must be denoised to ensure the effect of subsequent processing. An effective denoising approach is to remove noise at certain frequencies by filtering, but denoising must remove the noise data without destroying the image itself. Figure 1 shows the result of filtering the image using the wavelet-transform method. To test the wavelet filtering effect, this paper adds Gaussian white noise (20%) to the original image. Comparing the frequency analysis of the noisy image with that of the original shows that, after the noise is added, the main frequency band of the original image is disturbed by the noise frequencies, but after wavelet-transform filtering the frequency band of the image's main structure reappears, while the filtered image shows no significant visible change compared to the original. The normalized mean square error before and after filtering is M = 0.0071. The wavelet transform therefore preserves image detail well while removing the noise data.

Figure 1. Image denoising results comparison. (First row, left to right: original image, noisy image, filtered image. Second row, left to right: frequency distribution of the original image, of the noisy image, and of the filtered image.)

4.2 Results 2: digital watermark encryption based on wavelet transform

Figure 2 shows the watermark encryption process based on the wavelet transform. Watermarking the image via the wavelet transform does not affect the structure of the original image. The added noise is 40% salt-and-pepper noise. For both the original image and the noisy image, the wavelet-transform method extracts the watermark well.

Figure 2. Comparison before and after digital watermarking. (First row, left to right: original image, image with noise and watermark, image after denoising. Second row: original watermark, watermark extracted from the noisy watermarked image, watermark extracted after denoising.)

Following the method described in this paper, the correlation coefficient and peak signal-to-noise ratio between the original image and the watermarked image are computed. The correlation coefficient between the original image and the watermarked image is 0.9871 (first and third images of the first row in the figure), so the watermark does not destroy the structure of the original image. The PSNR of the original image is 33.5 dB and that of the watermarked image is 31.58 dB, which shows that the wavelet transform hides the watermark well. From the second row of results, the correlation coefficients between the original watermark and the watermarks extracted from the noisy and the denoised images are 0.9745 and 0.9652, respectively. This shows that the watermark signal can be extracted well after being hidden via the wavelet transform.

4.3 Results 3: image encryption based on wavelet transform

In image transmission, the most common way to protect image content is to encrypt the image. Figure 3 shows the process of encrypting and decrypting an image using the wavelet transform. After encryption the image shows no correlation with the original at all, while decrypting the encrypted image reproduces the original.

Figure 3. Image encryption and decryption process comparison. (Left: original image; middle: encrypted image; right: decrypted image.)

The information entropy of the images in Fig. 3 is calculated. The information entropy of the original image is 3.05, that of the decrypted image is 3.07, and that of the encrypted image is 7.88. The entropy of the image before encryption and after decryption is therefore essentially unchanged, while the entropy of the encrypted image rises to 7.88, indicating that the encrypted image is close to a random signal and provides good confidentiality.

4.4 Result 4: image compression

Image data can be compressed because of redundancy in the data. The redundancy of image data mainly manifests as spatial redundancy caused by correlation between adjacent pixels in an image, temporal redundancy due to correlation between different frames in an image sequence, and spectral redundancy due to correlation between different color planes or spectral bands. The purpose of data compression is to reduce the number of bits required to represent the data by removing these redundancies. Since image data volumes are huge and therefore difficult to store, transmit, and process, compression of image data is very important. Figure 4 shows the result of compressing the original image twice. Although the image is compressed, its main structure does not change, but the image sharpness is noticeably reduced. Table 1 lists the properties of the compressed images.

Figure 4. Image comparison before and after compression. (Left: original image; middle: after the first compression; right: after the second compression.)

The results in Table 1 show that, after repeated compression, the file size is significantly reduced and the image becomes smaller and smaller. The original image requires 2,764,800 bytes; after one compression this falls to 703,009 bytes, a reduction of 74.5%, and after the second compression only 182,161 bytes remain, a further reduction of 74.1%. The wavelet transform therefore achieves image compression well.

5 Conclusion

With the development of informatization, today's era is full of information. As the visual basis of human perception of the world, images are an important means for humans to obtain, express, and transmit information. Digital image processing, that is, processing images with a computer, has a long history: it originated in the 1920s, when a photograph was transmitted from London to New York via a submarine cable using digital compression technology. Digital image processing technology can help people understand the world more objectively and accurately. The human visual system provides more than three quarters of the information humans obtain from the outside world, and images and graphics are the carriers of all visual information. Although the human eye is very powerful and can distinguish thousands of colors, in many cases an image is blurred or even invisible to the human eye; image enhancement technology can make such blurred or invisible images clear and bright. Relevant research results on this topic show that such research is feasible [ 26 , 27 ].

It is precisely because of the importance of image processing technology that many researchers have begun to study it and have achieved fruitful results. However, as research on image processing technology deepens, current work tends to focus on individual aspects of the technology in ever greater depth, whereas applying image processing technology is a systems-engineering problem: in addition to depth, it also has systematic requirements. Research on a unified model covering multiple aspects of image applications will therefore undoubtedly promote the application of image processing technology. Since the wavelet transform has been successfully applied in many fields of image processing, this paper takes the wavelet transform as the basis of a unified model and carries out simulation research on image filtering, watermark hiding, encryption and decryption, and image compression. The results show that the model achieves good results.

Abbreviations

CA: Cellular automata
CGH: Computer-generated hologram
DCT: Discrete cosine transform
EPWIC: Embedded Prediction Wavelet Image Coder
HVS: Human visual system
LSB: Least significant bits
VOD: Video on demand
WT: Wavelet transform

References

1. H.W. Zhang, The research and implementation of image denoising method based on Matlab. Journal of Daqing Normal University 36(3), 1-4 (2016)
2. J.H. Hou, J.W. Tian, J. Liu, Analysis of the errors in locally adaptive wavelet domain Wiener filter and image denoising. Acta Photonica Sinica 36(1), 188-191 (2007)
3. M. Lebrun, An analysis and implementation of the BM3D image denoising method. Image Processing on Line 2(25), 175-213 (2012)
4. A. Fathi, A.R. Naghsh-Nilchi, Efficient image denoising method based on a new adaptive wavelet packet thresholding function. IEEE Trans. Image Process. 21(9), 3981 (2012)
5. X. Zhang, X. Feng, W. Wang, et al., Gradient-based Wiener filter for image denoising. Comput. Electr. Eng. 39(3), 934-944 (2013)
6. T. Chen, K.K. Ma, L.H. Chen, Tri-state median filter for image denoising. IEEE Trans. Image Process. 8(12), 1834 (1999)
7. S.M.M. Rahman, M.K. Hasan, Wavelet-domain iterative center weighted median filter for image denoising. Signal Process. 83(5), 1001-1012 (2003)
8. H.L. Eng, K.K. Ma, Noise adaptive soft-switching median filter for image denoising, in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '00), vol. 4, pp. 2175-2178 (2000)
9. S.G. Chang, B. Yu, M. Vetterli, Adaptive wavelet thresholding for image denoising and compression. IEEE Trans. Image Process. 9(9), 1532 (2000)
10. M. Kivanc Mihcak, I. Kozintsev, K. Ramchandran, et al., Low-complexity image denoising based on statistical modeling of wavelet coefficients. IEEE Signal Processing Letters 6(12), 300-303 (1999)
11. J.H. Wu, F.Z. Lin, Image authentication based on digital watermarking. Chinese Journal of Computers 9, 1153-1161 (2004)
12. A. Wakatani, Digital watermarking for ROI medical images by using compressed signature image, in Hawaii International Conference on System Sciences (IEEE, 2002), pp. 2043-2048
13. A.H. Paquet, R.K. Ward, I. Pitas, Wavelet packets-based digital watermarking for image verification and authentication. Signal Process. 83(10), 2117-2132 (2003)
14. Z. Wen, L.I. Taoshen, Z. Zhang, An image encryption technology based on chaotic sequences. Comput. Eng. 31(10), 130-132 (2005)
15. Y.Y. Wang, Y.R. Wang, Y. Wang, et al., Optical image encryption based on binary Fourier transform computer-generated hologram and pixel scrambling technology. Optics and Lasers in Engineering 45(7), 761-765 (2007)
16. X.Y. Zhang, C. Wang, S.M. Li, et al., Image encryption technology on two-dimensional cellular automata. Journal of Optoelectronics Laser 19(2), 242-245 (2008)
17. A.S. Lewis, G. Knowles, Image compression using the 2-D wavelet transform. IEEE Trans. Image Process. 1(2), 244-250 (2002)
18. R.A. Devore, B. Jawerth, B.J. Lucier, Image compression through wavelet transform coding. IEEE Trans. Inf. Theory 38(2), 719-746 (1992)
19. R.W. Buccigrossi, E.P. Simoncelli, Image compression via joint statistical characterization in the wavelet domain. IEEE Trans. Image Process. 8(12), 1688-1701 (1999)
20. A.A. Cruzroa, J.E. Arevalo Ovalle, A. Madabhushi, et al., A deep learning architecture for image representation, visual interpretability and automated basal-cell carcinoma cancer detection. Med. Image Comput. Comput. Assist. Interv. 16, 403-410 (2013)
21. S.P. Mohanty, D.P. Hughes, M. Salathé, Using deep learning for image-based plant disease detection. Front. Plant Sci. 7, 1419 (2016)
22. B. Sahiner, H. Chan, D. Wei, et al., Image feature selection by a genetic algorithm: application to classification of mass and normal breast tissue. Med. Phys. 23(10), 1671 (1996)
23. B. Bhanu, S. Lee, J. Ming, Adaptive image segmentation using a genetic algorithm. IEEE Transactions on Systems, Man, and Cybernetics 25(12), 1543-1567 (2002)
24. Y. Egusa, H. Akahori, A. Morimura, et al., An application of fuzzy set theory for an electronic video camera image stabilizer. IEEE Trans. Fuzzy Syst. 3(3), 351-356 (1995)
25. K. Hasikin, N.A.M. Isa, Enhancement of the low contrast image using fuzzy set theory, in UKSim International Conference on Computer Modelling and Simulation (IEEE, 2012), pp. 371-376
26. P. Yang, Q. Li, Wavelet transform-based feature extraction for ultrasonic flaw signal classification. Neural Comput. Applic. 24(3-4), 817-826 (2014)
27. R.K. Lama, M.-R. Choi, G.-R. Kwon, Image interpolation for high-resolution display based on the complex dual-tree wavelet transform and hidden Markov model. Multimedia Tools Appl. 75(23), 16487-16498 (2016)


Acknowledgements

The authors thank the editor and anonymous reviewers for their helpful comments and valuable suggestions.

This work was supported by:

  • Shandong Social Science Planning Research Project, 2018: "The Application of Shandong Folk Culture in Animation in the View of Digital Media" (No. 18CCYJ14).
  • Shandong Education Science 12th Five-Year Plan, 2015: "Innovative Research on Stop-motion Animation in the Digital Media Age" (No. YB15068).
  • Shandong Education Science 13th Five-Year Plan, 2016-2017, "Ports and Arts Education Special Fund": "Reform of Teaching Methods of Hand Drawn Presentation Techniques" (No. BCA2017017).
  • National Research Youth Project of the State Ethnic Affairs Commission, 2018: "Protection and Development of Villages with Ethnic Characteristics Under the Background of Rural Revitalization Strategy" (No. 2018-GMC-020).

Availability of data and materials

Authors can provide the data.

About the authors

Zaozhuang University, No. 1 Beian Road., Shizhong District, Zaozhuang City, Shandong, P.R. China.

Lina Zhang was born in Jining, Shandong, P.R. China, in 1983. She received a master's degree from Bohai University, P.R. China. She now works in the School of Media, Zaozhuang University, P.R. China. Her research interests include animation and digital media art.

Lijuan Zhang was born in Jining, Shandong, P.R. China, in 1983. She received a master's degree from Jingdezhen Ceramic Institute, P.R. China. She now works in the School of Fine Arts and Design, Zaozhuang University, P.R. China. Her research interests include interior design and digital media art.

Liduo Zhang was born in Zaozhuang, Shandong, P.R. China, in 1982. He received a master's degree from Monash University, Australia. He now works in the School of Economics and Management, Zaozhuang University. His research interests include Internet finance and digital media.

Author information

Authors and affiliations.

School of Media, Zaozhuang University, Zaozhuang, Shandong, China

Lina Zhang

School of Fine Arts and Design, Zaozhuang University, Zaozhuang, Shandong, China

Lijuan Zhang

School of Economics and Management, Zaozhuang University, Zaozhuang, Shandong, China

Liduo Zhang


Contributions

All authors took part in the discussion of the work described in this paper. One author (LZ) wrote the first version of the paper; the authors carried out parts of the experiments and revised successive versions of the paper. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Lijuan Zhang .

Ethics declarations

Competing interests.

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License ( http://creativecommons.org/licenses/by/4.0/ ), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.


About this article


Zhang, L., Zhang, L. & Zhang, L. Application research of digital media image processing technology based on wavelet transform. J Image Video Proc. 2018 , 138 (2018). https://doi.org/10.1186/s13640-018-0383-6


Received : 28 September 2018

Accepted : 23 November 2018

Published : 05 December 2018

DOI : https://doi.org/10.1186/s13640-018-0383-6


  • Image processing
  • Digital watermark
  • Image denoising
  • Image encryption
  • Image compression


Image processing and pattern recognition in industrial engineering

Sensor Review

ISSN : 0260-2288

Article publication date: 29 March 2011

Du, Z. (2011), "Image processing and pattern recognition in industrial engineering", Sensor Review , Vol. 31 No. 2. https://doi.org/10.1108/sr.2011.08731baa.002

Emerald Group Publishing Limited

Copyright © 2011, Emerald Group Publishing Limited

Article type: Viewpoint. From: Sensor Review, Volume 31, Issue 2.

Along with the information superhighway, the statement of the digital globe concept, and the Internet's widespread application, image information has become an important source of, and an important means of, human access to information. As a result, the demands on image processing and pattern recognition technology grow day by day.

Currently, image processing and pattern recognition have become objects of study and research in areas such as engineering, computer science, information science, statistics, physics, biology, chemistry, medicine and even the social sciences. Therefore, the use of image processing and pattern recognition technology by other disciplines is inevitably increasing.

Recently, there is a growing demand for image processing and pattern recognition in various application areas, such as remote sensing, multimedia computing, secured image data communication, biomedical imaging, texture understanding, content-based image retrieval, image compression, and so on. As a result, the challenge to scientists, engineers and business people is to quickly extract valuable information from raw image data. This is the primary purpose of image processing and pattern recognition.

In electrical engineering and computer science, image processing is any form of signal processing for which the input is an image, such as a photograph or video frame. The output of image processing may be either an image or a set of characteristics or parameters related to the image. Most image-processing techniques involve treating the image as a two-dimensional signal and applying standard signal-processing techniques to it. A digital image is composed of a grid of pixels and stored as an array. A single pixel represents a value of either light intensity or color. Images are processed to obtain information beyond what is apparent given the image’s initial pixel values.

Image-processing tasks can include any combination of the following: modifying the image view, adding dimensionality to image data, working with masks and calculating statistics, warping images, specifying regions of interest, manipulating images in various domains, enhancing contrast and filtering, extracting and analyzing shapes, and so on.

Pattern recognition techniques are concerned with the theory and algorithms for putting abstract objects, e.g. measurements made on physical objects, into categories. Methods of pattern recognition are useful in many applications such as information retrieval, data mining, document image analysis and recognition, computational linguistics, forensics, biometrics and bioinformatics.

Pattern recognition is the science and art of giving names to the natural objects in the real world. It is often considered part of artificial intelligence. However, the problem here is even more challenging because the observations are not in symbolic form and often contain much variability and noise; another term for pattern recognition is artificial perception. Typical inputs to a pattern recognition system are images or sound signals, out of which the relevant objects have to be found and identified. A pattern recognition solution involves many stages, such as making the measurements, processing and segmentation, finding a suitable numerical representation for the objects of interest, and finally classifying them based on these representations.

Image processing and pattern recognition technology is also closely linked with the national economy and with science; it has brought huge economic and social benefits to humanity. In the near future, image processing and pattern recognition technology will not only develop further theoretically, but will also be an indispensable and powerful tool in scientific research and in our everyday lives. In our information-based society, image processing and pattern recognition have huge potential, both in theory and in practice.

Zhenyu Du Professor at the Information Technology and Industrial Engineering Research Center (ITTE), Hong Kong, China


A deep neural network for hand gesture recognition from RGB image in complex background

  • Original Paper
  • Published: 05 May 2024


  • Tsung-Han Tsai 1 ,
  • Yuan-Chen Ho 1 ,
  • Po-Ting Chi 1 &
  • Ting-Jia Chen 1  


Deep learning research has gained significant popularity recently, finding applications in various domains such as image preprocessing, segmentation, object recognition, and semantic analysis. Deep learning has gradually replaced traditional algorithms such as color-based, contour-based, and motion-based methods. In the context of hand gesture recognition, traditional algorithms rely heavily on depth information for accuracy, but their performance is often subpar. This paper introduces a novel approach using a deep neural network for hand gesture recognition, requiring only a single complementary metal oxide semiconductor (CMOS) camera to operate amidst complex backgrounds. The neural network design incorporates depthwise separable convolutional layers, dividing the model into segmentation and recognition components. With the proposed single-stage model, the whole model need not be used, which reduces the number of weights and calculations. Additionally, in the training phase, data augmentation and an iterative training strategy further increase recognition accuracy. The results show that the proposed work uses few parameters while still achieving a higher gesture recognition rate than other works.


Data availability

Data will be made available on request.


Author information

Authors and affiliations.

Department of Electrical Engineering, National Central University, Taoyuan City, Taiwan, ROC

Tsung-Han Tsai, Yuan-Chen Ho, Po-Ting Chi & Ting-Jia Chen


Contributions

Tsung-Han Tsai wrote the main manuscript text. Y-CH, P-TC, and T-JC did the programming work.

Corresponding author

Correspondence to Tsung-Han Tsai .

Ethics declarations

Conflict of interest.

The authors declare no conflict of interest.

Additional information

Publisher’s note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article

Tsai, TH., Ho, YC., Chi, PT. et al. A deep neural network for hand gesture recognition from RGB image in complex background. SIViP (2024). https://doi.org/10.1007/s11760-024-03198-x


Received : 11 December 2023

Revised : 07 March 2024

Accepted : 31 March 2024

Published : 05 May 2024

DOI : https://doi.org/10.1007/s11760-024-03198-x


  • Hand gesture recognition
  • Hand segmentation
  • Deep neural network
  • Attention model
  • Depthwise separable convolution
  • Human–computer interaction


Title: Diffusion-Aided Joint Source Channel Coding for High Realism Wireless Image Transmission

Abstract: Deep learning-based joint source-channel coding (deep JSCC) has been demonstrated as an effective approach for wireless image transmission. Nevertheless, current research has concentrated on minimizing a standard distortion metric such as Mean Squared Error (MSE), which does not necessarily improve the perceptual quality. To address this issue, we propose DiffJSCC, a novel framework that leverages pre-trained text-to-image diffusion models to enhance the realism of images transmitted over the channel. The proposed DiffJSCC utilizes prior deep JSCC frameworks to deliver an initial reconstructed image at the receiver. Then, the spatial and textual features are extracted from the initial reconstruction, which, together with the channel state information (e.g., signal-to-noise ratio, SNR), are passed to a control module to fine-tune the pre-trained Stable Diffusion model. Extensive experiments on the Kodak dataset reveal that our method significantly surpasses both conventional methods and prior deep JSCC approaches on perceptual metrics such as LPIPS and FID scores, especially with poor channel conditions and limited bandwidth. Notably, DiffJSCC can achieve highly realistic reconstructions for 768x512 pixel Kodak images with only 3072 symbols (<0.008 symbols per pixel) under 1dB SNR. Our code will be released in this https URL .


  • Open access
  • Published: 07 May 2024

Computer vision digitization of smartphone images of anesthesia paper health records from low-middle income countries

  • Ryan D. Folks 1 ,
  • Bhiken I. Naik 1 ,
  • Donald E. Brown 2 &
  • Marcel E. Durieux 1  

BMC Bioinformatics volume 25, Article number: 178 (2024)


In low-middle income countries, healthcare providers primarily use paper health records for capturing data. Paper health records are utilized predominately due to the prohibitive cost of acquisition and maintenance of automated data capture devices and electronic medical records. Data recorded on paper health records is not easily accessible in a digital format to healthcare providers. The lack of real time accessible digital data limits healthcare providers, researchers, and quality improvement champions to leverage data to improve patient outcomes. In this project, we demonstrate the novel use of computer vision software to digitize handwritten intraoperative data elements from smartphone photographs of paper anesthesia charts from the University Teaching Hospital of Kigali. We specifically report our approach to digitize checkbox data, symbol-denoted systolic and diastolic blood pressure, and physiological data.

We implemented approaches for removing perspective distortions from smartphone photographs, removing shadows, and improving image readability through morphological operations. YOLOv8 models were used to deconstruct the anesthesia paper chart into specific data sections. Handwritten blood pressure symbols and physiological data were identified, and values were assigned using deep neural networks. Our work builds upon the contributions of previous research by improving upon their methods, updating the deep learning models to newer architectures, as well as consolidating them into a single piece of software.

The model for extracting the sections of the anesthesia paper chart achieved an average box precision of 0.99, an average box recall of 0.99, and an mAP0.5-95 of 0.97. Our software digitizes checkbox data with greater than 99% accuracy and digitizes blood pressure data with a mean average error of 1.0 and 1.36 mmHg for systolic and diastolic blood pressure respectively. Overall accuracy for physiological data which includes oxygen saturation, inspired oxygen concentration and end tidal carbon dioxide concentration was 85.2%.

Conclusions

We demonstrate that under normal photography conditions we can digitize checkbox, blood pressure and physiological data to within human accuracy when provided legible handwriting. Our contributions provide improved access to digital data to healthcare practitioners in low-middle income countries.


Globally, approximately 313 million surgical cases are performed annually. 6% of these surgeries are performed in low-middle income countries (LMICs), where a third of the global population currently resides. Surgical mortality rates are twice as high in LMICs, compared to high-income countries despite patients being younger, having a lower risk profile and undergoing less invasive surgery [ 1 ]. A significant majority of these deaths are preventable with surveillance of high-risk patients and early evidence-based interventions [ 1 , 2 ].

Surveillance and improvement in surgical and anesthesia care is dependent on having access to continuous, reproducible, and real-time data. However, in LMICs the primary method of data capture for anesthesia and surgery is within paper health records. These records are characterized by having multiple data elements including medication administration, physiological parameters, and procedural-specific elements recorded manually by the provider at a regular frequency (e.g., every 5 min). The data density of the anesthesia paper health records, defined as the data generated per unit of time, is amongst the highest for any healthcare setting [ 3 ].

The most efficient method to record high-volume anesthesia data is with automatic data capture monitors and electronic medical record systems (EMRs). Unfortunately, due to their cost and complexity, electronic records remain an unlikely solution in LMICs for the foreseeable future [ 4 ]. This creates major gaps in digital data access for anesthesia providers in LMICs, and their ability to utilize data to rapidly anticipate and intervene to reduce anesthesia and surgical complications and mortality.

In this paper we describe our methodology to further improve the accuracy of digitizing anesthesia paper health records from the University Teaching Hospital of Kigali (CHUK) in real time using computer vision. Our work builds on our previous digitization efforts and further consolidates the process into a single software program. Our overarching goal for this project is to provide rapidly accessible digital data to anesthesia healthcare providers in LMICs, which can facilitate evidence-based, actionable interventions to reduce morbidity and mortality.

The remainder of this paper begins with an introduction to the paper anesthesia record from CHUK, leading into a discussion on our methodology for correcting common distortions in smartphone images of the paper anesthesia record, followed by our methods for extracting the blood pressure, physiological, and checkbox data elements. Finally, we assess the improvements in our methods from previous research in the results section, and discuss the impact, challenges, and future directions of our results and work.

The intraoperative anesthesia paper health record

We utilized 500 smartphone photographs of paper anesthesia records collected from 2019 to 2023. The photographs of the anesthesia paper records varied greatly in quality, with some being clear, well lit, and legible, whereas others were blurry, poorly lit, and illegible. The anesthesia record has seven distinct sections: handwritten medications (Fig  1 , Section A), inhaled volatile anesthetics (Fig  1 , Section B), intravenous fluids (Fig  1 , Section C), blood and blood product transfused (Fig  1 , Section D), blood pressure and heart rate (Fig  1 , Section E), physiological data elements (Fig  1 , Section F), and checkboxes for marking key procedural events (Fig  1 , Section G).

Figure 1. An example of an intraoperative paper anesthesia record from the University Teaching Hospital in Kigali, Rwanda.

Intravenous medications

Multiple intravenous medications are administered over the course of surgery, with both the dose and timing of administration recorded in the anesthesia paper health record. Commonly administered medications include drugs required for induction of anesthesia, prevention of infection (e.g., antibiotics), to induce or reverse muscle paralysis, and to ensure blood pressure and heart stability. The medications are written in the temporal order in which they are administered.

Inhaled volatile medications

The inhaled volatile anesthetic medications are halogenated hydrocarbon gases administered to maintain general anesthesia. To document the type of volatile inhaled anesthetic administered, the anesthesia paper health record has three checkboxes: two for the most commonly used inhaled anesthetics, isoflurane and halothane, and a third fill-in box if another gas such as sevoflurane or desflurane is used. The dose of the volatile inhaled anesthetic medication is recorded as a percentage value.

Intravenous fluids

Intravenous fluids are administered during anesthesia to maintain fluid homeostasis and hemodynamic stability. The type of intravenous fluids, in addition to the incremental and total volume given during anesthesia is recorded as free text.

Blood and blood product transfused

Blood and component blood products are administered when significant bleeding and hemorrhagic complications occur. The Blood and Blood Product Transfused section is a free text section where providers list both the specific blood component product (e.g., packed red blood cells or fresh frozen plasma) and volume administered.

Blood pressure and heart rate

The blood pressure and heart rate section utilizes handwritten arrows and dots to encode blood pressure in millimeters of mercury (mmHg) and heart rate in beats per minute (bpm). The x-axis on the grid indicates five-minute epochs, during which a provider takes a systolic blood pressure (downward arrow), diastolic blood pressure (upward arrow), and heart rate measurement (dot). The y-axis encodes both bpm and mmHg in increments of 10.

Physiological indicators

The physiological indicators section uses handwritten digits to encode different types of physiological information including oxygen saturation, inspired oxygen concentration, exhaled carbon dioxide, mechanical ventilator data, body temperature, amount of urine produced, and blood loss encountered. The x-axis on the grid represents five minute epochs.

Checkboxes

The checkboxes section uses handwritten check marks to indicate boolean values associated with a patient's position on the operating table, intubation status, type of monitoring devices and details, and safety best practices utilized during the surgery.

Related work

In 2015, Ohuabunwa et al. [ 5 ] detailed the need for electronic medical record systems in LMICs. According to their analysis, the rise of “communicable diseases necessitates adequate record keeping for effective follow-up”, and for retrospective research. Among the difficulties with implementing these EMRs in LMICs are unfamiliarity with these systems and the cost of implementation and maintenance which make them prohibitively expensive. The authors assert that even hybrid paper-electronic systems where an image of the health record is scanned into a database and certain data elements are manually entered into an EMR can be very costly and require significant human and monetary resources. We postulate that a system which would only require the user to take a smartphone image of an anesthesia paper record would impose minimal burdens to the existing clinical workflow and require a very small amount of capital to adopt in comparison to EMR systems.

In 2020, Rho et al. described using computer vision software to automatically digitize portions of an anesthesia paper record from CHUK using smartphone images [ 6 ]. Their work utilized a wooden box within which the anesthesia paper record would be inserted and on top of which a smartphone could be placed to attain an image that was standardized for lighting and position. They digitized the checkboxes section with 82.2% accuracy, blood pressure data with an average mean squared error of 21.44 between the systolic and diastolic symbols, and classified handwritten images of medication text with an accuracy of 90.1%. It is unclear how comparable this metric is to future work, since the algorithm used was trained to reject “unreadable” samples, and did so on approximately 15% of the test set.

Subsequently, Adorno et al. developed an improved approach for blood pressure symbol detection utilizing U-Nets [ 7 ]. By generating a segmentation mask of the blood pressure symbols, using image morphology to separate the detections, and computing the centroid of each pixel cluster, Adorno was able to improve the object detection precision to 99.7% and recall to 98.2%. The mean average error of the association between U-Net detections and the ground truth blood pressure values was approximately 4 mmHg. Our approaches build on this conceptual basis of using deep learning to identify handwritten symbols in conjunction with a post-processing algorithm to associate values with detections. We implement two of the suggestions in the future work section of Adorno’s paper, namely to incorporate image tiling, and to improve the post-processing algorithms.

For checkbox detection, Murphy et al. utilized a deep neural network approach. They used a template matching algorithm called ORB and a convolutional neural network (CNN) to locate and classify the checkboxes rather than the proportion of pixel intensity method initially used by Rho et al. [ 8 ]. Their new algorithm was capable of locating checkboxes with an accuracy of 99.8% and classifying them as checked or unchecked with an accuracy of 96.7%. In subsequent development, we simplified this process by using the YOLOv8 single shot detector to combine the detection and classification steps.

Finally, Annapareddy et al. investigated the use of the YOLOv5 single shot detector to extract and classify handwritten intravenous medications and to digitize the physiological indicators section [ 9 ]. Due to the large number of classes in the medication and physiological indicator sections, their paper found that models attempting both detection and classification were generally unable to do either well, owing to a lack of sufficient data in each class. Models trained on a single class, however, performed much better at detection, but could not classify.

The extraction of data from an anesthesia paper chart begins with optimizing the lighting of the smartphone photographs, removing shadows, and using object detection to find document landmarks for use in removing perspective distortion. Then, each section of the chart is identified by a YOLOv8 model and cropped out of the chart. YOLOv8 models which are trained to detect handwritten blood pressure symbols, numbers, and checkboxes used in anesthesia paper charts produce lists of bounding boxes that a combination of convolutional neural networks, traditional computer vision, machine learning, and algorithms then use to impute meaningful values and detect errors.

Image optimization techniques

To maximize the accuracy of digitization, the input images need to be optimized as follows: (1) shadows removed, (2) pixel intensities standardized and normalized, (3) perspective distortions such as rotation, shear, and scaling corrected, and (4) general location of document landmarks fixed. We accomplish this by first removing shadows using image morphology techniques, then normalize and standardize the pixel values of the images, and finally correct perspective distortions and approximately correct the location of document landmarks using a homography transformation.

Shadow removal

Smartphone photographs of the anesthesia paper chart often suffer from sudden changes in pixel intensities caused by shadows being cast onto the image which break up the lighting. Sudden changes in the value of pixels can cause difficulty for deep learning models which learn representations of objects as functions of the weighted sums of pixels. Therefore, both normalization and shadow removal are necessary to optimize our inputs and maximize detection accuracy. One algorithm for accomplishing this is outlined by Dan Mašek in a stack overflow post from 2017 (Algorithm 1 ) [ 10 ].

Algorithm 1: Basic Shadow Removal
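A minimal OpenCV sketch of Algorithm 1 (not the authors' exact code); the kernel and blur sizes shown are illustrative and, as noted below, must be tuned to the image size and degree of shadow.

```python
import cv2
import numpy as np

def remove_shadows(gray, dilate_size=7, blur_size=21):
    """Approximate the slowly varying illumination (shadows) with a
    dilated, median-blurred copy of the grayscale image, subtract it
    from the original, and stretch the result back to 0-255."""
    kernel = np.ones((dilate_size, dilate_size), np.uint8)
    background = cv2.dilate(gray, kernel)               # remove thin dark strokes
    background = cv2.medianBlur(background, blur_size)  # keep only the illumination
    diff = 255 - cv2.absdiff(gray, background)          # subtract and re-invert
    return cv2.normalize(diff, None, 0, 255, cv2.NORM_MINMAX, cv2.CV_8U)
```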

The exact values for the median blur and dilation operations are subject to the image’s size and degree of shadow and can be tuned to the dataset. This algorithm only operates on grayscale images, but since no information in the anesthesia paper charts are encoded with color, we converted our charts to grayscale. We did not use any metrics to assess shadow removal, but a visual inspection of the output shows that the resulting images no longer suffer from a lighting gradient (Fig.  2 ).

Figure 2. Example of an anesthesia paper chart before and after the removal of shadows and normalization. The dilated, blurred image is subtracted pixel-wise from the original image to produce the final result.

The planar homography

The planar homography is defined as the most general linear mapping of all the points contained within one quadrilateral to the points of another quadrilateral (Fig.  3 ). A planar homography was used to correct perspective distortions within the smartphone image.

Figure 3: An illustration of a homography performing a general linear mapping of the points of one quadrilateral to another. Images suffering from perspective distortions can have much of their error corrected by finding four anchor points on the image, and using them as the four points on a quadrilateral to map to a perfect, scanned sheet

Translation, rotation, scaling, affine, and shear transformations are all special cases of the homography, and the homography in turn can be decomposed into these transformations. Here, as in many other computer vision applications, the homography is used to correct linear distortions in the image caused by an off-angle camera perspective (Fig.  4 ).

Figure 4: An illustration of perspective based distortion due to an off-angle camera. Even the most vigilant camera operators will have some degree of perspective distortion [ 11 ]

In order to compute a useful homography for document correction, four document landmarks need to be identified on a target anesthesia paper chart image. The same four landmark locations were then identified on a scanned, perfectly aligned control anesthesia paper chart image. We trained a YOLOv8 model to detect the document landmarks “Total”, “Time”, “Procedure Details”, and “Patient Position”, which fall in the four corners of the anesthesia paper chart described in Fig.  1 . We then used the OpenCV python package to compute the homography between the two sheets and warp the target image accordingly (Fig.  5 ). The benefit of this method is that the homography computation is robust to failure, owing to YOLOv8’s high accuracy even under sub-optimal conditions. In cases where the planar homography failed to correct the distortion, clear errors were found on the anesthesia paper chart, including: (1) landmarks obscured by writing, (2) landmarks covered by other pieces of paper, and (3) landmarks not included in the smartphone image at all. Initially, this deep object detection approach may seem excessive, as there are a number of traditional computer vision methods for automatic feature matching between two images, such as ORB and SIFT. However, the variance in lighting and blurriness in our dataset posed challenges for these non-deep algorithms, which often failed silently, mistaking one landmark for another and warping images such that they were unidentifiable.

Figure 5: An illustration of correction using a homography on an image of the anesthesia paper chart. Perspective based distortions are corrected
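The homography step can be sketched with OpenCV as follows. The landmark coordinates and output size below are placeholders standing in for the YOLOv8 landmark detections and the dimensions of the scanned control chart, not values from our pipeline.

```python
import cv2
import numpy as np

# Landmark centers detected on the smartphone photo (target) and the corresponding
# locations on the perfectly aligned scanned control chart. Placeholder coordinates.
target_pts = np.array([[412, 388], [2966, 400], [380, 2105], [2931, 2090]], dtype=np.float32)
control_pts = np.array([[400, 350], [3000, 350], [400, 2150], [3000, 2150]], dtype=np.float32)

image = cv2.imread("chart_photo.jpg")

# Estimate the 3x3 homography mapping the photographed landmarks onto the control chart.
H, _ = cv2.findHomography(target_pts, control_pts)

# Warp the photograph into the control chart's coordinate frame.
control_size = (3200, 2400)  # (width, height) of the control image, assumed here
corrected = cv2.warpPerspective(image, H, control_size)
```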

Section extraction

There are seven sections which encode different pieces of intraoperative information on the anesthesia paper chart (Fig.  1 ). Due to nonlinear distortions in the image, the homography is not a perfect pixel-to-pixel matching from the target image to the scanned control image. Therefore, an alternative method of identifying the precise location of the sections is required. We accomplished this by training a model to place a bounding box around each section. Because the homography already normalizes the locations of the sections to within a few dozen pixels, we were able to use one of the smallest YOLOv8 architectures, YOLOv8s, to extract the different sections.

Image tiling for small object detection

The anesthesia paper chart is characterized by handwritten symbols (e.g., medication, numerical, and blood pressure symbols) that are small and often tightly packed together (Fig.  1 ). Single shot detectors like YOLO struggle to separate and identify these handwritten symbols because they use a grid that assigns responsibility for an object to the single cell containing its center. One solution to this issue is to increase the image size; however, since YOLO pads all images to be square, and the number of pixels in a square image grows quadratically with its side length, training memory usage and detection time increase quadratically as well. To overcome this problem, we used an approach called image tiling, where we divided the image into smaller pieces called tiles and trained on the tiles rather than the entire image. This increases the size of the small objects relative to the frame, yielding much better object detections.

There are, however, several challenges associated with image tiling. First, objects which are larger than the tiles into which we have divided the image will not fit into a single tile and will be missed by the model. All the handwritten symbols in our dataset were small and uniform in size, allowing us to use image tiling without the risk of losing any detections. Second, because detection must be run on every sub-image, the detection time increases. While this may be an issue in real-time detection, the difference in detection time is only several hundred milliseconds, which does not affect our use case. Third, the number of unique images and total objects in a single training batch is smaller, causing the model’s weights to receive noisy updates and requiring longer training. We addressed this by using the memory savings from tiling to double the training batch size from 16 to 32. In addition, due to the very large number of empty tiles, we added only a small, random proportion of them to the training dataset, which further increased the object-to-tile ratio. Finally, objects which lie on the border of two tiles will not be detected, since they do not reside fully in either image. Our solution to this issue is to not divide the image into a strict grid, but instead to treat the tiling process as a sliding window which moves by one half of its width or height every step. With this approach, if an object is on the edge of one sub-image, it will be directly in the center of the next one (Fig.  6 ). This solution introduces its own challenge, since nearly every detection will be double counted when the detections are reassembled. Our solution to this problem is to compute the intersection-over-union of every bounding box with every other bounding box at detection time, group together boxes whose intersection-over-union exceeds a given threshold, and combine each group into one detection. Since the objects we are detecting should be well separated and never overlap, this allows us to remove the doubled detections.

Figure 6: An example of our implementation of image tiling. By using a sliding window rather than a grid, the edge of one image is the center of the next one [ 12 ]
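A minimal sketch of the half-stride tiling and the IoU-based merging of duplicated detections is shown below. The tile size, IoU threshold, and box format are illustrative assumptions, not the exact values used in our models.

```python
import numpy as np

def make_tiles(image: np.ndarray, tile: int = 640):
    """Yield (x0, y0, crop) tiles with a half-tile stride, so an object straddling
    one tile boundary lands near the center of a neighboring tile."""
    h, w = image.shape[:2]
    step = tile // 2
    for y0 in range(0, max(h - step, 1), step):
        for x0 in range(0, max(w - step, 1), step):
            yield x0, y0, image[y0:y0 + tile, x0:x0 + tile]

def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def merge_duplicates(boxes, thresh=0.5):
    """Greedily group boxes whose IoU exceeds thresh and average each group,
    removing the double counts produced by overlapping tiles."""
    merged, used = [], [False] * len(boxes)
    for i, box in enumerate(boxes):
        if used[i]:
            continue
        group, used[i] = [box], True
        for j in range(i + 1, len(boxes)):
            if not used[j] and iou(box, boxes[j]) > thresh:
                group.append(boxes[j])
                used[j] = True
        merged.append(tuple(np.mean(group, axis=0)))
    return merged
```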

Blood pressure symbol detection and interpretation

The blood pressure section encodes blood pressure values using arrows, and heart rate using dots or lines. Each vertical line on the grid indicates a five minute epoch of time during which a provider records a blood pressure and heart rate reading (Fig.  1 ). The y-axis encodes the value of blood pressure in mmHg, and each horizontal line denotes a multiple of ten (Fig.  1 ).

Symbol detection

Systolic blood pressure values are encoded by a downward arrow, and diastolic blood pressure values are encoded by an upward arrow. The downward and upward arrows are identical when reflected over the x-axis, so we were able to collapse the two classes into one. We then trained a YOLOv8 model on the single “arrow” class, and during detection we simply run the model on the image and on an upside-down version of it to obtain the systolic and diastolic detections, respectively. Finally, the diastolic detections’ y-values are subtracted from the image’s height to correct for the flip.
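The flip trick can be sketched as follows, assuming the ultralytics YOLO interface; the weight and image filenames are hypothetical. The diastolic boxes are reflected back into the original coordinate frame after detection on the flipped image.

```python
import cv2
from ultralytics import YOLO  # assumed interface; not the exact training artifacts

model = YOLO("arrow_detector.pt")  # hypothetical weights for the single "arrow" class
image = cv2.imread("bp_section.png")
height = image.shape[0]

# Downward (systolic) arrows are detected on the original image.
systolic = model(image)[0].boxes.xyxy.tolist()

# Upward (diastolic) arrows become downward arrows when the image is flipped vertically,
# so the same model is run on the flipped image...
flipped = cv2.flip(image, 0)
diastolic = model(flipped)[0].boxes.xyxy.tolist()

# ...and the y-coordinates are reflected back into the original frame.
diastolic = [(x1, height - y2, x2, height - y1) for x1, y1, x2, y2 in diastolic]
```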

Thereafter, two key pieces of information are required from each bounding box: (1) its value in millimeters of mercury (mmHg), and (2) its timestamp in minutes.

Inferring mmHg values from blood pressure symbol detections

The value of blood pressure encoded by an arrow corresponds to the y-pixel of the tip of the arrow. By associating a blood pressure value with each y-pixel in the blood pressure section, we can obtain a value for each blood pressure bounding box. We trained a YOLOv8 model to identify the 200 and 30 legend markers, and used their locations to interpolate the value of blood pressure for each y-pixel between the 200 and 30 bounding boxes (Fig.  7 ).

Figure 7: By dividing the space between the 30 and 200 bounding boxes equally, we can find the blood pressure values of each y-pixel. We ran the algorithm on this image and set all the y-pixels that were multiples of 10 to red. We can see the efficacy of the algorithm visually, as the detections cover the lines on the image almost perfectly
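The interpolation reduces to a linear map between the two legend detections, as in this sketch; the example coordinates are illustrative only.

```python
def mmhg_for_pixel(y: float, y_200: float, y_30: float) -> float:
    """Linearly interpolate the blood pressure value for a y-pixel, given the
    y-centers of the detected 200 mmHg and 30 mmHg legend markers."""
    mmhg_per_pixel = (200 - 30) / (y_30 - y_200)  # y grows downward, so y_30 > y_200
    return 200 - (y - y_200) * mmhg_per_pixel

# Example: legend markers detected at y=120 (200 mmHg) and y=800 (30 mmHg);
# an arrow tip at y=360 then reads as 200 - 240 * 0.25 = 140 mmHg.
```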

Assigning timestamps to blood pressure symbol detections

To impute timestamps, we wrote an algorithm that assigns timestamps based on the relative x-distances between the systolic and diastolic detections (Algorithm 2).

Algorithm 2: Imputing a Timestamp to a Blood Pressure Bounding Box

Missing detections are a common problem when assigning timestamps. Our algorithm deals with this in two ways. The while loop checks whether two boxes are within 1% of the image’s width of one another, ensuring they are not too far apart to plausibly match before actually pairing them. If a box has no candidate within the 1% range, the algorithm considers it to have no match. Another problem occurs when there are no detections for a five minute epoch. This is solved by sampling the distance between true matches in the dataset: we found that 100% of the matches were within 0.016 times the image’s width of the next matching pair. So, adding a small margin for error, if a match is more than 0.018 times the image’s width from the next pair, a time gap of 10 min is applied instead of the typical 5.
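A simplified sketch of this pairing and timestamping logic is given below. The representation of the detections (lists of x-centers) and the handling of unmatched boxes are assumptions; the full Algorithm 2 contains additional checks.

```python
def pair_and_timestamp(systolic_x, diastolic_x, image_width,
                       pair_tol=0.01, gap_tol=0.018):
    """Pair systolic/diastolic x-centers and assign 5-minute epochs.

    X-centers within pair_tol * image_width are treated as one reading; a jump
    larger than gap_tol * image_width between consecutive readings is taken to be
    a skipped epoch and receives a 10-minute increment instead of 5."""
    systolic_x, diastolic_x = sorted(systolic_x), sorted(diastolic_x)
    readings, remaining = [], list(diastolic_x)
    for s in systolic_x:
        # Find the closest diastolic detection that is still unmatched.
        match = min(remaining, key=lambda x: abs(x - s), default=None)
        if match is not None and abs(match - s) <= pair_tol * image_width:
            readings.append(((s + match) / 2, s, match))
            remaining.remove(match)
        else:
            readings.append((s, s, None))          # systolic with no partner
    readings.extend((x, None, x) for x in remaining)  # leftover diastolic boxes
    readings.sort(key=lambda r: r[0])

    out, minutes, prev_x = [], 0, None
    for x, sys_val, dia_val in readings:
        if prev_x is not None:
            minutes += 10 if (x - prev_x) / image_width > gap_tol else 5
        out.append({"time_min": minutes, "systolic_x": sys_val, "diastolic_x": dia_val})
        prev_x = x
    return out
```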

Blood pressure model training and error testing

A YOLOv8l model, the second largest architecture of YOLOv8, was trained to detect downward arrows for 150 epochs with a batch size of 32 images. The images used to train this model were tiled images of the blood pressure section where only the systolic arrows were annotated on unflipped images, and only the diastolic arrows were annotated on flipped images.

There are two ways that error will be assessed for the blood pressure section: detection error and inference error. Detection error will be computed using the normal object detection model metrics of accuracy, recall, precision, and F1. Inference error is the error between the value in millimeters of mercury the program assigned to a blood pressure detection on the whole image of the blood pressure section, and the ground truth value that was manually annotated. Blood pressure detections made by the program were hand matched with ground truth values during assessment in order to avoid the case where the correct blood pressure value was assigned to a different timestamp. The error metric we used for this was mean average error. The 30 chart images used for testing included 1040 systolic and diastolic marks (this number varies from the object detection testing set due to image tiling duplicating detections). The ability of the program to match blood pressure detections to a particular time stamp was not assessed.

The physiological indicators section is the most challenging section to digitize. Handwritten digits are written on the line that corresponds to the physiological data they encode, but are free to vary along the time axis rather than being discretely boxed in or listed in fixed increments. In addition, the individual digits which appear in the physiological indicators section must be concatenated into strings of digits to form the number the provider intended to write. Our approach to digitizing this section is described below.

Handwritten number detection

Our approach for the detection of numbers is a two-step process: (1) a YOLOv8 model trained on a single “digit” class locates and bounds handwritten numbers, and (2) a RegNetY_1.6gf CNN classifies those digits. There are two advantages to this method over using a single YOLOv8 model for both detection and classification. First, the distribution of digits in our training dataset was not uniform. For example, there are over one thousand examples of the digit ’9’ on the training charts, but only approximately 160 examples of the digit ’5’, because the typical range of oxygen saturation is between 90 and 99. This leads to the digit 5 having much poorer box recall in a model that does both classification and localization. Visually, handwritten digits are very similar to one another, so by collapsing every digit into a single “digit” class, the model can learn to localize underrepresented digits by drawing on the overrepresented ones. Second, there is an added advantage of training the classification CNN separately, since its dataset can be augmented with images of digits not found on the anesthesia paper charts. We used the MNIST dataset to expand and augment our training dataset, providing sufficient examples from each class to attain a high accuracy [ 13 ].
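A sketch of this two-stage pipeline is shown below, assuming the ultralytics YOLO interface and torchvision's RegNetY-1.6GF builder; the weight filenames and preprocessing choices are hypothetical placeholders.

```python
import cv2
import torch
from torchvision import transforms
from torchvision.models import regnet_y_1_6gf
from ultralytics import YOLO  # assumed interface, as above

detector = YOLO("digit_locator.pt")          # hypothetical single-class "digit" weights
classifier = regnet_y_1_6gf(num_classes=10)  # RegNetY-1.6GF with a 10-way digit head
classifier.load_state_dict(torch.load("digit_classifier.pt"))  # hypothetical weights
classifier.eval()

preprocess = transforms.Compose([transforms.ToPILImage(),
                                 transforms.Resize((224, 224)),
                                 transforms.ToTensor()])

image = cv2.imread("physio_section.png")
digits = []
# Step 1: locate digits with the single-class detector; step 2: classify each crop.
for x1, y1, x2, y2 in detector(image)[0].boxes.xyxy.int().tolist():
    crop = cv2.cvtColor(image[y1:y2, x1:x2], cv2.COLOR_BGR2RGB)
    with torch.no_grad():
        logits = classifier(preprocess(crop).unsqueeze(0))
    digits.append(int(logits.argmax(dim=1)))
```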

Matching each box to the corresponding row

Prior to clustering the digit bounding boxes together by proximity (Fig.  9 ), we had to find which row each box belongs to. For any given patient, between 0 and 7 rows were filled out, depending on the type of surgery and the ventilation parameter data recorded by the anesthesia provider. For the special cases where 0 or 1 rows were filled out, there were either no detected digits or the standard deviation of the y-centers of the detected digits was only a few pixels. For the case where there was more than one row, we used KMeans clustering on the y-centers of the digit bounding boxes with \(k \in [2, 3, 4, 5, 6, 7]\) and determined the number of rows by choosing the value of k which maximized the silhouette score, a metric which measures how well a particular clustering fits the data. To determine which row a cluster encodes, we examined the y-centroids of clusters from 30 sheets and found that the distribution of y-centroids for a particular row never overlapped with any other row. This meant that there were distinct ranges of y-pixels corresponding to each row, allowing us to determine which row a cluster encodes by finding which range contains its y-centroid (Fig.  8 ).

Figure 8: Clustered detections in the physiological indicator section using the KMeans clustering algorithm, and selecting K based on the maximum silhouette score
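The row clustering can be sketched with scikit-learn as follows; the candidate range of k mirrors the [2, 7] range above, while the guard against very small inputs is our own assumption.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def cluster_rows(y_centers, k_candidates=range(2, 8)):
    """Cluster digit-box y-centers into rows, choosing k by maximum silhouette score."""
    y = np.asarray(y_centers, dtype=float).reshape(-1, 1)
    best_k, best_score, best_labels = None, -1.0, None
    for k in k_candidates:
        if k >= len(y):  # silhouette needs more samples than clusters (safeguard)
            break
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(y)
        score = silhouette_score(y, labels)
        if score > best_score:
            best_k, best_score, best_labels = k, score, labels
    return best_k, best_labels
```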

Clustering single digit detections into multi-digit detections

Once each row was assigned an ordered list of boxes, we clustered those boxes into observations that each encode a single value (Fig.  9 ). This is done with the same KMeans-silhouette method used to find which row each digit bounding box corresponds to. In order to narrow down the search for the correct value of k , we used the plausible range of values for each row. For example, the first row encodes oxygen saturation, which realistically falls within the range \(\text {SpO}_2 \in [75, 100]\) . If we let n be the number of digit bounding boxes, the minimum number of clusters would be realized if the patient had a \(100\%\) oxygen saturation for the entire surgery, leading to \(k = \lfloor n/3\rfloor\) . In contrast, the maximum number would be realized if the patient never had a \(100\%\) oxygen saturation, leading to \(k = \lceil n/2\rceil\) . Allowing for a margin of error of \(10\%\) on either side due to missed or erroneous detections, we fit a KMeans clustering model with each of \(k \in [\lfloor n/3\rfloor - \lceil 0.1*n \rceil , \lceil n/2\rceil + \lceil 0.1*n\rceil ]\) , and selected the value of k which maximized the silhouette score. For the other physiological parameter rows, we reassessed the plausible number of digits for that specific variable and obtained a new range of k values. The clusters created by the optimal KMeans model are then considered to be digits which semantically combine to form one value.

Figure 9: Boxes from the SpO \(_2\) section clustered into observations using KMeans. A plausible range of values for k is determined by computing the number of boxes divided by the highest and lowest plausible number of digits found in a cluster (3 and 2 for the SpO \(_2\) section, respectively). From this range, the k which maximizes the silhouette score is chosen
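The candidate range of k for a row follows directly from the plausible number of digits per observation, as in this sketch; the slack term mirrors the 10% margin described above, and the lower bound of 2 is a safeguard we add for the silhouette computation. The resulting range could be passed to the cluster_rows helper sketched earlier.

```python
import math

def k_range_for_row(n_boxes: int, min_digits: int = 2, max_digits: int = 3,
                    slack: float = 0.10) -> range:
    """Candidate cluster counts for a row, from the plausible digits-per-observation
    (2-3 for the SpO2 row) plus a slack for missed or spurious detections."""
    k_min = max(2, math.floor(n_boxes / max_digits) - math.ceil(slack * n_boxes))
    k_max = math.ceil(n_boxes / min_digits) + math.ceil(slack * n_boxes)
    return range(k_min, k_max + 1)

# Example: 24 detected digits in the SpO2 row gives k in [8 - 3, 12 + 3] = [5, 15].
```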

The only row which does not conform to this paradigm is the tidal volume row. In this row, an “X” separates the tidal volume in milliliters from the respiratory rate in breaths per minute. To detect semantic groupings of digits, we used the fact that tidal volume is nearly always three digits and respiratory rate is nearly always two digits, with an “X” mark in the center, and shaped our search accordingly. A small CNN, trained as a one-vs-rest model to detect the “X” mark, was then used to separate the tidal volume from the respiratory rate.

Assigning a value to each multi-digit detection cluster

We trained a RegNetY CNN model to classify images of handwritten numbers by combining the MNIST dataset with the digits from the charts we labeled. Initially the program runs the model on each digit in a cluster and concatenates them together to form a single value. However, due to the poor quality of handwriting, our test set classification accuracy was approximately 90% rather than the standard 99% or greater that is achievable with most modern CNNs using the MNIST dataset.

One way to minimize this error is to check whether the value assigned is biologically plausible. The program first checks whether the concatenated characters of a section fall in a plausible range for each row. For example, if SpO \(_2 \not \in [75\%, 100\%]\) , the program marks the observation as implausible. In addition, if the absolute difference between a value and the values immediately before or after it is larger than a one-sided tolerance interval constructed from the differences we observed in the dataset, the program also marks it as implausible. For example, if an observation for SpO \(_2\) is truly 99, but the model mistakes it for 79, and the observations just before and after it are 98 and 100 respectively, the observation is marked as implausible, since SpO \(_2\) is very unlikely to fall and recover that rapidly. If an observation is marked as implausible, the program imputes a value by fitting a linear regression line to the previous two and next two plausible values, and predicts the current value by rounding the output of the regression model at the unknown index.
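A simplified sketch of the plausibility filter and regression-based imputation is shown below. The jump tolerance is illustrative rather than the tolerance interval fitted from our data, the jump is only checked against the preceding value, and neighbours are taken from a window of two indices on either side.

```python
import numpy as np

def impute_implausible(values, low=75, high=100, max_jump=6):
    """Flag implausible values in a physiological row (SpO2 shown) and replace
    them with a rounded linear fit through nearby plausible values."""
    values = list(values)
    plausible = [low <= v <= high for v in values]
    # Flag sudden jumps relative to the preceding plausible value (illustrative rule).
    for i in range(1, len(values)):
        if plausible[i] and plausible[i - 1] and abs(values[i] - values[i - 1]) > max_jump:
            plausible[i] = False
    # Impute each flagged value from plausible neighbours within two positions.
    for i in range(len(values)):
        if plausible[i]:
            continue
        neighbours = [(j, values[j]) for j in range(len(values))
                      if plausible[j] and abs(j - i) <= 2]
        if len(neighbours) >= 2:
            xs, ys = zip(*neighbours)
            slope, intercept = np.polyfit(xs, ys, 1)
            values[i] = round(slope * i + intercept)
    return values

# Example: impute_implausible([98, 79, 99, 100]) replaces the implausible 79 with 99.
```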

Physiological indicator model and error testing

A YOLOv8l model was trained to detect one class, handwritten digits, for 150 epochs with a batch size of 32.

A RegNetY_1.6gf model was trained on a mixture of digit images cropped from the charts and the MNIST dataset. The model was validated and tested on digit images from the charts only. The training set contained 88571 images, while the validation and testing sets had 7143 images each. The model was trained for 25 epochs, and images were augmented using Torchvision’s autoaugment transformation under the ’imagenet’ autoaugment policy.

Error for object detection will be assessed with accuracy, precision, recall, and F1. Error for classifying numbers will be reported using accuracy only. The error for inferring a value from the classified object detections will be assessed using the mean average error on each of the 5 physiological indicators on all 30 test charts. Using the output of the program and the ground truth dataset, we compute the mean average error by matching list entries by index. For example, if the program output is (99, 98, 97) and the ground truth from the chart image is (98, 99, 100), then the matched values are ((99, 98), (98, 99), (97, 100)), and the error is computed as \((\Vert 99-98\Vert + \Vert 98-99\Vert + \Vert 97-100\Vert )/3\) . If the ground truth and predictions vary in length, the longer of the two lists is truncated to the length of the shorter.
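The index-matched error reduces to the following helper; the function name is ours.

```python
def index_matched_mae(pred, truth):
    """Mean average error between two value lists matched by index; the longer
    list is truncated to the length of the shorter, as described above."""
    n = min(len(pred), len(truth))
    return sum(abs(p - t) for p, t in zip(pred[:n], truth[:n])) / n

# index_matched_mae([99, 98, 97], [98, 99, 100]) == (1 + 1 + 3) / 3 ≈ 1.67
```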

The checkbox section is a two-class object detection and classification problem. Imputing a value can be made difficult if there are missing or erroneous detections.

Checkbox detection and classification

We labeled each checkbox from all the anesthesia paper charts in the dataset as checked or unchecked, and then trained a YOLOv8 model to detect and classify each checkbox in the image. Approximately one out of every twenty checkboxes that was intended to be checked did not actually contain a marking inside it. Instead, the marking would be placed on the text next to the box, slightly above the box, or in some other adjacent location. We decided a priori to label these as checked, both because the provider’s intention was to mark the box as checked and so that the model would learn to look in the areas adjacent to a box for check marks as well.

Assigning meaning to checkboxes

The checkboxes are arranged in columns (Fig.  1 ), so the algorithm for determining which bounding box corresponds to which checkbox starts by sorting the bounding boxes by x-center, then groups them according to the columns that appear on the page, and sorts each group by y-center. For example, the left-most boxes “Eye Protection”, “Warming”, “TED Stockings”, and “Safety Checklist” on the anesthesia paper chart are all in the “Patient Safety” column and have approximately the same x-center. The algorithm sorts all checkbox bounding boxes by x-center, selects the first four, then sorts them by y-center. Assuming there are no missing or erroneous boxes, these first four bounding boxes should match the “Patient Safety” checkboxes they encode.
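A sketch of this column-wise assignment is shown below; the detection format and the column_layout argument are assumptions about how the printed layout would be encoded, and the sketch presumes the detection count matches the number of printed checkboxes.

```python
def assign_checkbox_labels(boxes, column_layout):
    """Map checkbox detections to their printed labels.

    boxes: list of (x_center, y_center, checked) detections.
    column_layout: list of lists of label strings, one inner list per printed
    column, ordered top to bottom (e.g. the "Patient Safety" column first)."""
    boxes = sorted(boxes, key=lambda b: b[0])                 # sort by x-center
    results, i = {}, 0
    for column in column_layout:
        group = sorted(boxes[i:i + len(column)], key=lambda b: b[1])  # sort by y-center
        for label, (_, _, checked) in zip(column, group):
            results[label] = checked
        i += len(column)
    return results

# Example layout fragment for the left-most column described above:
# assign_checkbox_labels(detections, [["Eye Protection", "Warming",
#                                      "TED Stockings", "Safety Checklist"], ...])
```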

Checkbox model training and error testing

A YOLOv8l model was trained to detect and classify checkboxes for 150 epochs using a batch size of 32. Error will be reported by overall accuracy, precision, recall, and F1 score. Sheets where the number of detections does not match the number of checkboxes will be removed from the error calculation, and the number of sheets where this occurred will be reported.

In addition to detection and classification, the program’s ability to correctly infer which checked/unchecked bounding box detection associates with which checkbox will be assessed. This error will be quantified with accuracy, precision, recall, and F1.

Results and discussion

Our testing results are based on a 30 chart holdout set. We report accuracy on this set rather than on the testing sets used during YOLO training because image tiling duplicates many of the labels, which would produce an accuracy that does not reflect performance on the whole section of the chart. While not reported, in all cases the test and validation sets had nearly identical metrics, suggesting the models were generalizing.

On the 30 test charts, the model for extracting the sections of the anesthesia paper chart achieved an average box precision of 0.99, an average box recall of 0.99, and an mAP0.5-95 of 0.97. Because the handwritten symbols lie in the interior of the sections rather than at the edges, this small error is, for our purposes, equivalent to a perfect model, since it never cut off the important data elements in the sections.

Blood pressure

Detection errors were computed using the full test set of 30 images, which in total had 1040 systolic and diastolic marks. Inference errors were computed using the first 5 images, which in total had 141 systolic and diastolic markers. This set is smaller because the systolic and diastolic markers were manually matched with their ground truth counterparts due to 8 erroneous extra markers and 2 missed markers.

Detection error

Table 1 demonstrates that our new method has a slightly lower accuracy rate. However, it is important to note that the previous method was tested on scanned, synthetic anesthesia paper chart images, whereas the new method was tested on smartphone images of anesthesia paper charts from real cases.

Inference error

The mean average error for inferring a mmHg measurement from a blood pressure detection was only approximately 1.25 mmHg and did not vary greatly (Table 2 ). While not listed, the mean squared error also remains small, suggesting that the error we observed did not come from a few very incorrect observations. Rather, it came from most observations being some small distance away from the true value.

The MAE for imputing a value in mmHg to a blood pressure detection is much lower than previous methods. The MAE of the new method is within the variance that human beings assign to the handwritten symbols and is clinically insignificant.

By passing the output bounding boxes of the single-class YOLOv8 model to the classification CNN, we can obtain an end-to-end detection error for single characters. The overall accuracy was 85.2%, but this metric varied greatly between digits, primarily due to the lower representation of certain digits in the training dataset and to handwritten digits looking similar to one another (e.g., 7, 2, and 9).

Obtaining an error for the imputed value of the physiological indicators is challenging. Approximately one out of every six characters that should be detected was not (false negative), and one out of every twenty proposed boxes was not actually a character, but was instead a percentage sign or other nondigit pen marking (false positive). In addition, there were relatively few examples of FiO \(_2\) (inspired oxygen concentration) and EtCO \(_2\) (end tidal carbon dioxide) in the test set, making their error highly dependent on the quality of the small number of sheets which did record them.

Therefore, we assessed error only on observations in which at least one character was detected, and decided a priori to exclude those which were completely undetected. In addition, we left in any erroneous boxes that were clustered together with an observation.

We identified that handwriting quality had a very large positive effect on inference accuracy, so to determine a best-case error we created five synthetic sheets, filled them with an average of 35 plausible datapoints per sheet, and took images of them with smartphones in lighting similar to the real dataset. Table 3 contains the average and squared error for each section for the real and the synthetic anesthesia paper chart test sheets. The inference error on the synthetic sheets was near zero and much more consistent than on the real anesthesia paper charts, where the error was comparatively higher and more variable. When a smartphone application is developed for use by physicians, we believe that handwriting will improve to match that of the synthetic sheets due to the Hawthorne effect [ 14 ].

1117 checkboxes from 29 of the 30 test set images were used for assessing error. One test set image was excluded because it was too blurry to manually annotate. The accuracy metrics in Table 4 demonstrate improvement in all measures compared to previous approaches.

Some checkboxes had markings which were not strictly inside the checkbox they were intending to mark, but were still classified as checked in the training dataset since the intention of the provider was to check them. Because of this, the model learned how to look in the space immediately around the checkbox to find markings, and was able to classify some checkboxes that did not have markings inside them (Tables 5 , 6 , 7 and 8 ).

To increase the accuracy of the data extracted from the sheets, our implementation of the checkbox detection algorithm was written to throw an error unless it detected exactly the expected number of checkboxes on the sheet. The program threw this error on 4 of the 29 sheets in the test dataset (13.7%). Among the remaining 25 sheets, the program inferred the exact box that was being checked almost perfectly. The conditional error metrics are reported in Table 9 .

Impact of image preprocessing

To assess the impact of both homography and deshadowing, errors were recomputed without them. We found the homography to raise accuracy across all metrics, while deshadowing had no effect on accuracy (Tables 10 , 11 , 12 , 13 , 14 , 15 ).

The effect of preprocessing on the physiological indicator section was unclear. Removing deshadowing raised the number of correctly detected digits by 3%, and removing both the homography correction and deshadowing had varying effects on the inference of values for the detections (Tables 12 , 13 ).

The checkboxes showed very little performance loss when the deshadowing component was removed, but did show a small yet notable drop in the metrics when the homography correction was removed (Tables 14 , 15 ). Removing the homography caused one additional sheet from the test dataset to lack the correct number of detections for imputing meaning to the checkbox detections.

In this manuscript we discussed the integration of previous research into one piece of software and the improvement of algorithms for extracting handwritten data from smartphone photographs of anesthesia paper health records. While electronic medical records are not a feasible solution for LMICs in the near future, we have demonstrated that it is possible to extract high quality data elements from anesthesia paper charts, utilizing locally available, low-cost resources such as a smartphone. Through the use of deep neural networks and the careful filtering and correction of their output by classical machine learning models and algorithms, we were able to improve the digitization of blood pressure and checkboxes to near perfect accuracy, under realistic photography and lighting conditions. In addition, we demonstrated that, through careful and legible handwriting, physiological data could likewise be digitized with high accuracy. Our work is an important step in improving access to data for health care providers in LMICs, and is a major advance in providing access to data for real time, point of care clinical decision support.

Challenges and limitations

Image and chart quality

We have demonstrated the ability of the program to digitize multiple components of the anesthesia paper chart with high accuracy. However, as demonstrated with the digitization of the physiological indicators, poor or illegible handwriting and poor image quality make extraction difficult and are responsible for the majority of errors in the system. It is important to note that model development was done on previously archived anesthesia paper charts. We believe that in the future there will likely be a Hawthorne effect, with improved handwriting quality once health care providers are aware that paper health records will be digitized [ 14 ]. This will improve the accuracy of the physiological data.

Single site usage

Anesthesia paper health charts are not standardized, with different hospitals having their own unique chart. This means that our current software will only work on a single version of the chart at a single hospital.

Future work

Improvement of error detection and inference algorithms

For our initial implementation of the system, we either (1) kept the algorithms for imputing values to erroneous detections simple, using only linear models and filtering algorithms, or (2) left them out entirely, as in the case of the checkboxes. The software we developed can now be used to test and compare local or nonlinear regression algorithms for imputing values, as well as new filtering methods for detecting erroneous values.

Digitization of remaining chart elements

There are several reasons why the remaining anesthesia paper chart elements remain undigitized. In our current dataset, Inhaled Volatile Medications (Fig.  1 , Section B), Intravenous Fluids (Fig.  1 , Section C), and Blood and Blood Product Transfused (Fig.  1 , Section D) were infrequently recorded. In addition, the transfusions and intravenous fluids sections are completely free text, the heart rate encoding is not consistent, with some anesthesia paper records using a dot whereas others use a straight line, and the intravenous drugs section is particularly hard to read even for human clinicians. The inhaled anesthetics, however, could be digitized, since they are simple checkboxes and digits, which are both currently readable. Other techniques for digitizing these data may also become available in the future, especially with a potentially larger training dataset. If a smartphone app implemented our code as a full system, providers could list the drugs they used, eliminating the most difficult section while imposing only a minor amount of extra work for anesthesia providers.

Prospective creation of a new intraoperative sheet

As noted above, anesthesia paper health charts are not standardized, and immense time and effort is required to digitize each hospital's unique chart. To ensure the future success of this project, our next goal is to design a standardized, machine readable anesthesia paper chart through a collaborative effort between anesthesia providers from LMICs and computer vision engineers, using a Delphi approach. By creating a chart prospectively, chart sections that are currently outside our ability to digitize accurately, such as the intravenous fluids, transfusions, and intravenous drugs, could be redesigned with machine readability in mind. For example, the intravenous drugs could have a three digit alphanumeric code written alongside the name of the medication, allowing the machine to accurately read drugs and circumventing the need to read handwritten words altogether. A smartphone app that sends images of charts to a server for processing could also store a medication-to-code dictionary so providers can easily look up the code for a medication. Findings and knowledge gained from this work will guide future efforts to digitize paper charts from nonsurgical locations such as the emergency room, obstetrical delivery areas, and critical care units.

Availability of data and materials

The data for this paper are not available due to the protected health information contained in the images.

Code availability

The code for this project can be found in the references [ 15 ].

Abbreviations

BPM: Beats per minute

CHUK: University Teaching Hospital of Kigali

CNN: Convolutional neural network

EMR: Electronic medical records

EtCO2: End tidal carbon dioxide

FiO2: Fraction of inspired oxygen

LMIC: Low-middle income countries

MAE: Mean average error

mAP: Mean average precision

mmHg: Millimeters of mercury

MSE: Mean squared error

ORB: Oriented FAST and rotated BRIEF

SpO2: Oxygen saturation

RegNetY_1.6gf: Regulated residual network Y, 1.6 gigaflops

SIFT: Scale-invariant feature transform

YOLOv5: You only look once, version 5

YOLOv8: You only look once, version 8

YOLOv8s: You only look once, version 8, small architecture

References

Biccard BM, Madiba TE, Kluyts H-L, Munlemvo DM, Madzimbamuto FD, Basenero A, Gordon CS, Youssouf C, Rakotoarison SR, Gobin V, Samateh AL, Sani CM, Omigbodun AO, Amanor-Boadu SD, Tumukunde JT, Esterhuizen TM, Manach YL, Forget P, Elkhogia AM, Mehyaoui RM, Zoumeno E, Ndayisaba G, Ndasi H, Ndonga AKN, Ngumi ZWW, Patel UP, Ashebir DZ, Antwi-Kusi AAK, Mbwele B, Sama HD, Elfiky M, Fawzy MA, Pearse RM. African Surgical Outcomes Study (ASOS) investigators: perioperative patient outcomes in the African surgical outcomes study: a 7-day prospective observational cohort study. Lancet. 2018;391(10130):1589–98.


ASOS-2 Investigators: Enhanced postoperative surveillance versus standard of care to reduce mortality among adult surgical patients in africa (ASOS-2): a cluster-randomised controlled trial. Lancet Glob. Health 9(10), 1391–1401 (2021)

Durieux ME, Naik BI. Scientia potentia est: striving for data equity in clinical medicine for low- and middle-income countries. Anesth Analg. 2022;135(1):209–12.

Akanbi MO, Ocheke AN, Agaba PA, Daniyam CA, Agaba EI, Okeke EN, Ukoli CO. Use of electronic health records in sub-saharan Africa: progress and challenges. J Med Trop. 2012;14(1):1–6.


Ohuabunwa EC, Sun J, Jean Jubanyik K, Wallis LA. Electronic medical records in low to middle income countries: the case of Khayelitsha hospital, South Africa. Afr J Emerg Med. 2016;6(1):38–43. https://doi.org/10.1016/j.afjem.2015.06.003 .

Rho V, Yi A, Channavajjala B, McPhillips L, Nathan SW, Focht R, Ohene N, Adorno W, Durieux M, Brown D. Digitization of perioperative surgical flowsheets. In: 2020 systems and information engineering design symposium (SIEDS), pp. 1–6 (2020). https://doi.org/10.1109/SIEDS49339.2020.9106679

Adorno W, Yi A, Durieux M, Brown D. Hand-drawn symbol recognition of surgical flowsheet graphs with deep image segmentation. In: 2020 IEEE 20th international conference on bioinformatics and bioengineering (BIBE), pp. 295–302 (2020). https://doi.org/10.1109/BIBE50027.2020.00055

Murphy E, Samuel S, Cho J, Adorno W, Durieux M, Brown D, Ndaribitse C. Checkbox detection on rwandan perioperative flowsheets using convolutional neural network. In: 2021 systems and information engineering design symposium (SIEDS), pp. 1–6 (2021). https://doi.org/10.1109/SIEDS52267.2021.9483723

Annapareddy N, Fallin K, Folks R, Jarrard W, Durieux M, Moradinasab N, Naik B, Sengupta S, Ndaribitse C, Brown D. Handwritten text and digit classification on rwandan perioperative flowsheets via yolov5. In: 2022 systems and information engineering design symposium (SIEDS), pp. 270–275 (2022). IEEE

Mašek D. Increase image brightness without overflow (2017). https://stackoverflow.com/a/44054699/16292661

Rosengren P. Appoose: Homography-transl-bold.svg. https://commons.wikimedia.org/wiki/File:Homography-transl-bold.svg

Flesier M. A Domestic Cat in Zeytinburnu. https://commons.wikimedia.org/wiki/File:A_domestic_cat_in_Zeytinburnu.jpg

Deng L. The mnist database of handwritten digit images for machine learning research. IEEE Signal Process Mag. 2012;29(6):141–2.


Edwards K-E, Hagen SM, Hannam J, Kruger C, Yu R, Merry AF. A randomized comparison between records made with an anesthesia information management system and by hand, and evaluation of the hawthorne effect. Can J Anaesth. 2013;60(10):990–7.

Folks RD. Rwandan-Flowsheet-Digitizer. https://github.com/RyanDoesMath/Rwandan-Flowsheet-Digitizer


Acknowledgements

Our team would like to acknowledge Michael G. Rich, Jose P. Trejo, and Faiz M. Plastikwala for their work in labeling much of the training data for the project. We would also like to thank Dr Christian Ndaribitse for providing the data for this project.

The Lacuna Fund funded the collection of anesthesia paper chart images for the creation of a dataset.

Author information

Authors and affiliations

Department of Anesthesiology, University of Virginia, Charlottesville, VA, USA

Ryan D. Folks, Bhiken I. Naik & Marcel E. Durieux

School of Data Science, University of Virginia, Charlottesville, VA, USA

Donald E. Brown


Contributions

RDF wrote the code for the project, trained the deep learning models, evaluated their accuracy, labeled datasets, and helped write and draft the manuscript. BIN organized data collection and helped write and draft the manuscript. MED helped write and draft the manuscript. DEB helped write and draft the manuscript.

Corresponding author

Correspondence to Ryan D. Folks .

Ethics declarations

Ethics approval and consent to participate.

The data utilized for this project was obtained from CHUK after receiving Institutional Review Board approval at both the University of Rwanda and University of Virginia (no. 029/College of Medicine and Health Sciences IRB/2020), University Teaching Hospital of Kigali (EC/Centre Hospitalier Universitaire De Kigali/049/2020, July 13, 2020), and the University of Virginia (Health Sciences Research no. 22259) institutional review board (IRB).

Consent for publication

Not applicable

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ . The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated in a credit line to the data.


About this article

Cite this article.

Folks, R.D., Naik, B.I., Brown, D.E. et al. Computer vision digitization of smartphone images of anesthesia paper health records from low-middle income countries. BMC Bioinformatics 25 , 178 (2024). https://doi.org/10.1186/s12859-024-05785-8


Received : 08 November 2023

Accepted : 15 April 2024

Published : 07 May 2024

DOI : https://doi.org/10.1186/s12859-024-05785-8


Keywords

  • Computer vision
  • Computer extraction of time series
  • Document analysis

