
Specialty Grand Challenge Article: Grand Challenges in Image Processing


  • Université Paris-Saclay, CNRS, CentraleSupélec, Laboratoire des signaux et Systèmes, Gif-sur-Yvette, France


The field of image processing has been the subject of intensive research and development activities for several decades. This broad area encompasses topics such as image/video processing, image/video analysis, image/video communications, image/video sensing, modeling and representation, computational imaging, electronic imaging, information forensics and security, 3D imaging, medical imaging, and machine learning applied to these respective topics. Hereafter, we will consider both image and video content (i.e., sequences of images), and more generally all forms of visual information.

Rapid technological advances, especially in terms of computing power and network transmission bandwidth, have resulted in many remarkable and successful applications. Nowadays, images are ubiquitous in our daily life. Entertainment is one class of applications that has greatly benefited, including digital TV (e.g., broadcast, cable, and satellite TV), Internet video streaming, digital cinema, and video games. Beyond entertainment, imaging technologies are central in many other applications, including digital photography, video conferencing, video monitoring and surveillance, satellite imaging, but also in more distant domains such as healthcare and medicine, distance learning, digital archiving, cultural heritage or the automotive industry.

In this paper, we highlight a few research grand challenges for future imaging and video systems, in order to achieve breakthroughs to meet the growing expectations of end users. Given the vastness of the field, this list is by no means exhaustive.

A Brief Historical Perspective

We first briefly discuss a few key milestones in the field of image processing. Key inventions in the development of photography and motion pictures can be traced to the 19th century. The earliest surviving photograph of a real-world scene was made by Nicéphore Niépce in 1827 (Hirsch, 1999). The Lumière brothers made the first cinematographic film in 1895, with a public screening the same year (Lumiere, 1996). After decades of remarkable developments, the second half of the 20th century saw the emergence of new technologies launching the digital revolution. While the first prototype digital camera using a Charge-Coupled Device (CCD) was demonstrated in 1975, the first commercial consumer digital cameras started appearing in the early 1990s. These digital cameras quickly surpassed film cameras, and the digital revolution in the field of imaging was underway. As a key consequence, the digital process enabled computational imaging, in other words the use of sophisticated processing algorithms in order to produce high-quality images.

In 1992, the Joint Photographic Experts Group (JPEG) released the JPEG standard for still image coding (Wallace, 1992). In parallel, in 1993, the Moving Picture Experts Group (MPEG) published its first standard for coding of moving pictures and associated audio, MPEG-1 (Le Gall, 1991), and a few years later MPEG-2 (Haskell et al., 1996). By guaranteeing interoperability, these standards have been essential in many successful applications and services, for both the consumer and business markets. In particular, it is remarkable that, almost 30 years later, JPEG remains the dominant format for still images and photographs.

In the late 2000s and early 2010s, we could observe a paradigm shift with the appearance of smartphones integrating a camera. Thanks to advances in computational photography, these new smartphones soon became capable of rivaling the quality of consumer digital cameras at the time. Moreover, these smartphones were also capable of acquiring video sequences. Almost concurrently, another key evolution was the development of high bandwidth networks. In particular, the launch of 4G wireless services circa 2010 enabled users to quickly and efficiently exchange multimedia content. Since then, most of us have been carrying a camera anywhere and anytime, allowing us to capture images and videos at will and to seamlessly exchange them with our contacts.

As a direct consequence of the above developments, we are currently observing a boom in the usage of multimedia content. It is estimated that today 3.2 billion images are shared each day on social media platforms, and 300 hours of video are uploaded every minute on YouTube (see footnote 1). In a 2019 report, Cisco estimated that video content represented 75% of all Internet traffic in 2017, and this share is forecast to grow to 82% in 2022 (Cisco, 2019). While Internet video streaming and Over-The-Top (OTT) media services account for a significant bulk of this traffic, other applications are also expected to see significant increases, including video surveillance and Virtual Reality (VR)/Augmented Reality (AR).

Hyper-Realistic and Immersive Imaging

A major direction and key driver of research and development activities over the years has been the objective of delivering an ever-improving image quality and user experience.

For instance, in the realm of video, we have observed constantly increasing spatial and temporal resolutions, with the emergence nowadays of Ultra High Definition (UHD). Another aim has been to provide a sense of depth in the scene. For this purpose, various 3D video representations have been explored, including stereoscopic 3D and multi-view (Dufaux et al., 2013).

In this context, the ultimate goal is to be able to faithfully represent the physical world and to deliver an immersive and perceptually hyper-realistic experience. For this purpose, we discuss hereafter some emerging innovations. These developments are also very relevant in VR and AR applications (Slater, 2014). Finally, while this paper focuses only on the visual information processing aspects, it is obvious that emerging display technologies (Masia et al., 2013) and audio also play key roles in many application scenarios.

Light Fields, Point Clouds, Volumetric Imaging

In order to wholly represent a scene, the light information coming from all the directions has to be represented. For this purpose, the 7D plenoptic function is a key concept (Adelson and Bergen, 1991), although it is unmanageable in practice.

By introducing additional constraints, the light field representation collects radiance from rays in all directions. Therefore, it contains much richer information than traditional 2D imaging, which captures a 2D projection of the light in the scene by integrating over the angular domain. For instance, this allows post-capture processing such as refocusing and changing the viewpoint. However, it also entails several technical challenges, in terms of acquisition and calibration, as well as computational image processing steps including depth estimation, super-resolution, compression and image synthesis (Ihrke et al., 2016; Wu et al., 2017). The trade-off between spatial and angular resolution is a fundamental issue. With a significant fraction of the earlier work focusing on static light fields, it is also expected that dynamic light field videos will stimulate more interest in the future. In particular, dense multi-camera arrays are becoming more tractable. Finally, the development of efficient light field compression and streaming techniques is a key enabler in many applications (Conti et al., 2020).
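
As a rough illustration of the post-capture refocusing mentioned above, the following minimal sketch applies the classical shift-and-add idea to a tiny light field stored as a set of sub-aperture views. The data layout (a dictionary keyed by angular offsets), the function name `refocus`, and the integer `slope` parameter are assumptions made for this example only; practical systems rely on sub-pixel interpolation and calibrated geometry.

```python
def refocus(views, slope):
    """Shift-and-add refocusing of a light field.

    views: dict mapping an angular offset (u, v) to a 2D image (list of rows);
           all images are assumed to have the same size (illustrative layout).
    slope: integer pixel shift per unit of angular offset; each value
           brings a different depth plane into focus.
    """
    first = next(iter(views.values()))
    height, width = len(first), len(first[0])
    out = [[0.0] * width for _ in range(height)]
    for (u, v), img in views.items():
        for y in range(height):
            for x in range(width):
                # Sample the shifted view, clamping at the image border.
                sy = min(max(y + slope * v, 0), height - 1)
                sx = min(max(x + slope * u, 0), width - 1)
                out[y][x] += img[sy][sx]
    n = len(views)
    return [[value / n for value in row] for row in out]


# A single bright point seen from three horizontal viewpoints, with a
# disparity of one pixel per view.
views = {
    (-1, 0): [[0, 0, 1, 0, 0, 0, 0]],
    (0, 0):  [[0, 0, 0, 1, 0, 0, 0]],
    (1, 0):  [[0, 0, 0, 0, 1, 0, 0]],
}
in_focus = refocus(views, slope=1)      # matches the point's disparity
out_of_focus = refocus(views, slope=0)  # plain average of the views
```

With `slope=1` the three views align and the point stays sharp; with `slope=0` its energy is spread over three pixels, which is the blur one would expect for an out-of-focus plane.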

Another promising direction is to consider a point cloud representation. A point cloud is a set of points in the 3D space represented by their spatial coordinates and additional attributes, including color values, normals, or reflectance. Point clouds are often very large, easily ranging in the millions of points, and are typically sparse. One major distinguishing feature of point clouds is that, unlike images, they do not have a regular structure, calling for new algorithms. To remove the noise often present in acquired data, while preserving the intrinsic characteristics, effective 3D point cloud filtering approaches are needed (Han et al., 2017). It is also important to develop efficient techniques for Point Cloud Compression (PCC). For this purpose, MPEG is developing two standards: Geometry-based PCC (G-PCC) and Video-based PCC (V-PCC) (Graziosi et al., 2020). G-PCC considers the point cloud in its native form and compresses it using 3D data structures such as octrees. Conversely, V-PCC projects the point cloud onto 2D planes and then applies existing video coding schemes. More recently, deep learning-based approaches for PCC have been shown to be effective (Guarda et al., 2020). Another challenge is to develop generic and robust solutions able to handle potentially widely varying characteristics of point clouds, e.g., in terms of size and non-uniform density. Efficient solutions for dynamic point clouds are also needed. Finally, while many techniques focus on the geometric information or the attributes independently, it is paramount to process them jointly.
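
To make the octree idea concrete, the sketch below recursively splits a cubic bounding volume into eight children and emits one 8-bit occupancy code per visited node. It only illustrates the general principle of describing geometry through octree occupancy; it is not the G-PCC bitstream, and the function name `encode_octree`, the depth-first ordering, and the bit layout are assumptions chosen for this example.

```python
def encode_octree(points, origin, size, depth):
    """Return a depth-first list of 8-bit occupancy codes.

    points: list of (x, y, z) tuples inside the cube [origin, origin + size).
    origin: (x, y, z) corner of the current cube.
    size:   edge length of the current cube.
    depth:  number of subdivision levels still to be performed.
    """
    if depth == 0 or not points:
        return []
    half = size / 2
    children = [[] for _ in range(8)]
    for (x, y, z) in points:
        # One bit per axis selects the child octant (hypothetical bit layout).
        index = (
            (int(x >= origin[0] + half) << 2)
            | (int(y >= origin[1] + half) << 1)
            | int(z >= origin[2] + half)
        )
        children[index].append((x, y, z))
    # Bit i of the code is set when child i contains at least one point.
    codes = [sum(1 << i for i in range(8) if children[i])]
    for i, child in enumerate(children):
        if child:
            child_origin = (
                origin[0] + half * ((i >> 2) & 1),
                origin[1] + half * ((i >> 1) & 1),
                origin[2] + half * (i & 1),
            )
            codes += encode_octree(child, child_origin, half, depth - 1)
    return codes


cloud = [(0, 0, 0), (3, 3, 3)]
codes = encode_octree(cloud, origin=(0, 0, 0), size=4, depth=2)
```

Empty octants cost nothing beyond a zero bit in their parent's code, which is why such structures suit the sparse, irregular data described above.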

High Dynamic Range and Wide Color Gamut

The human visual system is able to perceive, using various adaptation mechanisms, a broad range of luminous intensities, from very bright to very dark, as experienced every day in the real world. Nonetheless, current imaging technologies are still limited in terms of capturing or rendering such a wide range of conditions. High Dynamic Range (HDR) imaging aims at addressing this issue. Wide Color Gamut (WCG) is also often associated with HDR in order to provide a wider colorimetry.

HDR has reached a certain level of maturity in the context of photography. However, extending HDR to video sequences raises scientific challenges in order to provide high-quality and cost-effective solutions, impacting the whole image processing pipeline, including content acquisition, tone reproduction, color management, coding, and display (Dufaux et al., 2016; Chalmers and Debattista, 2017). Backward compatibility with legacy content and traditional systems is another issue. Despite recent progress, the potential of HDR has not been fully exploited yet.
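
As a small illustration of the tone reproduction step, the sketch below compresses a wide range of luminance values into a displayable range with a simple global curve of the form L/(1+L), in the spirit of well-known photographic tone mapping operators. The function name `tone_map`, the `key` value of 0.18, and the sample luminances are assumptions made for this example; it is not the method of any reference cited here.

```python
import math


def tone_map(luminance, key=0.18, eps=1e-6):
    """Compress HDR luminance values into [0, 1) with a global curve.

    luminance: list of non-negative scene luminances (arbitrary units).
    key:       target mid-gray level (0.18 is a common photographic choice).
    eps:       small offset that avoids log(0) for black pixels.
    """
    # The log-average luminance is a robust estimate of scene brightness.
    log_avg = math.exp(
        sum(math.log(eps + value) for value in luminance) / len(luminance)
    )
    scaled = [key * value / log_avg for value in luminance]
    # L / (1 + L) is nearly linear for dark values and saturates toward 1.
    return [s / (1.0 + s) for s in scaled]


hdr = [0.01, 1.0, 100.0, 10000.0]  # spans six orders of magnitude
ldr = tone_map(hdr)
```

The curve preserves the ordering of intensities while mapping every value below 1, which is the basic requirement for rendering HDR content on a conventional display.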

Coding and Transmission

Three decades of standardization activities have continuously improved the hybrid video coding scheme based on the principles of transform coding and predictive coding. The Versatile Video Coding (VVC) standard was finalized in 2020 (Bross et al., 2021), achieving approximately 50% bit rate reduction for the same subjective quality when compared to its predecessor, High Efficiency Video Coding (HEVC). While substantially outperforming VVC in the short term may be difficult, one encouraging direction is to rely on improved perceptual models to further optimize compression in terms of visual quality. Another direction, which has already shown promising results, is to apply deep learning-based approaches (Ding et al., 2021). Here, one key issue is the ability to generalize these deep models to a wide diversity of video content. A second key issue is the implementation complexity, both in terms of computation and memory requirements, which is a significant obstacle to widespread deployment. Besides, the emergence of new video formats targeting immersive communications is also calling for new coding schemes (Wien et al., 2019).
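
The two principles behind hybrid coding can be sketched in a few lines: a block is first predicted (here simply from given reference samples), and the prediction residual is then transformed and quantized. The toy example below uses a 1D orthonormal DCT on four samples; the function names, the block size, and the quantization step are assumptions for illustration, and real codecs such as HEVC or VVC add motion compensation, 2D integer transforms, entropy coding, and in-loop filtering.

```python
import math


def dct(x):
    """Orthonormal 1D DCT-II."""
    n = len(x)
    return [
        (math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n))
        * sum(x[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n)) for i in range(n))
        for k in range(n)
    ]


def idct(c):
    """Inverse of dct() (orthonormal DCT-III)."""
    n = len(c)
    return [
        sum(
            (math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n))
            * c[k] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
            for k in range(n)
        )
        for i in range(n)
    ]


def encode_block(current, prediction, step):
    """Predictive coding (residual) followed by transform coding (DCT + quantization)."""
    residual = [c - p for c, p in zip(current, prediction)]
    return [round(coef / step) for coef in dct(residual)]


def decode_block(levels, prediction, step):
    """Dequantize, inverse transform, and add the prediction back."""
    residual = idct([level * step for level in levels])
    return [p + r for p, r in zip(prediction, residual)]


prediction = [10, 10, 10, 10]  # e.g., co-located samples of the previous frame
current = [10, 12, 14, 16]
levels = encode_block(current, prediction, step=0.5)
reconstructed = decode_block(levels, prediction, step=0.5)
```

When the prediction is perfect, all quantized levels are zero and almost nothing needs to be transmitted; otherwise the quantization step controls the trade-off between bit rate and reconstruction error.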

Considering that in many application scenarios, videos are processed by intelligent analytic algorithms rather than viewed by users, another interesting track is the development of video coding for machines (Duan et al., 2020). In this context, the compression is optimized taking into account the performance of video analysis tasks.

The push toward hyper-realistic and immersive visual communications most often entails an increasing raw data rate. Despite improved compression schemes, more transmission bandwidth is needed. Moreover, some emerging applications, such as VR/AR, autonomous driving, and Industry 4.0, bring a strong requirement for low latency transmission, with implications for both the image processing pipeline and the transmission channel. In this context, the emergence of 5G wireless networks will positively contribute to the deployment of new multimedia applications, and the development of future wireless communication technologies points toward promising advances (Da Costa and Yang, 2020).

Human Perception and Visual Quality Assessment

It is important to develop effective models of human perception. On the one hand, it can contribute to the development of perceptually inspired algorithms. On the other hand, perceptual quality assessment methods are needed in order to optimize and validate new imaging solutions.

The notion of Quality of Experience (QoE) relates to the degree of delight or annoyance of the user of an application or service (Le Callet et al., 2012). QoE is strongly linked to subjective and objective quality assessment methods. Many years of research have resulted in the successful development of perceptual visual quality metrics based on models of human perception (Lin and Kuo, 2011; Bovik, 2013). More recently, deep learning-based approaches have also been successfully applied to this problem (Bosse et al., 2017). While these perceptual quality metrics have achieved good performance, several significant challenges remain. First, when applied to video sequences, most current perceptual metrics are applied to individual frames, neglecting temporal modeling. Second, whereas color is a key attribute, there are currently no widely accepted perceptual quality metrics explicitly considering color. Finally, new modalities, such as 360° videos, light fields, point clouds, and HDR, require new approaches.
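
The limitation of frame-by-frame scoring can be illustrated with a short sketch. Here the Peak Signal-to-Noise Ratio (PSNR) is used purely as a stand-in for an image quality metric (it is a signal fidelity measure, not a perceptual one), and the helper names `psnr` and `video_score` are assumptions for this example. Because the per-frame scores are simply averaged, the result does not depend on the order of the frames, so no temporal effect can be captured.

```python
import math


def psnr(reference, distorted, peak=255.0):
    """PSNR in dB between two frames given as flat lists of pixel values."""
    mse = sum((r - d) ** 2 for r, d in zip(reference, distorted)) / len(reference)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(peak ** 2 / mse)


def video_score(reference_frames, distorted_frames):
    """Average of per-frame scores: a purely spatial, frame-wise pooling."""
    scores = [psnr(r, d) for r, d in zip(reference_frames, distorted_frames)]
    return sum(scores) / len(scores)


reference = [[0, 0, 0, 0], [100, 100, 100, 100]]
distorted = [[10, 0, 0, 0], [100, 90, 100, 100]]

score = video_score(reference, distorted)
# Reversing the temporal order of the frames leaves the score unchanged.
score_reversed = video_score(reference[::-1], distorted[::-1])
```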

Another closely related topic is image esthetic assessment (Deng et al., 2017). The esthetic quality of an image is affected by numerous factors, such as lighting, color, contrast, and composition. Esthetic assessment is useful in different application scenarios such as image retrieval and ranking, recommendation, and photo enhancement. While earlier attempts have used handcrafted features, most recent techniques to predict esthetic quality are data driven and based on deep learning approaches, leveraging the availability of large annotated datasets for training (Murray et al., 2012). One key challenge is the inherently subjective nature of esthetic assessment, resulting in ambiguity in the ground-truth labels. Another important issue is to explain the behavior of deep esthetic prediction models.

Analysis, Interpretation and Understanding

Another major research direction has been the objective of efficiently analyzing, interpreting and understanding visual data. This goal is challenging, due to the high diversity and complexity of visual data. It has led to many research activities, involving both low-level and high-level analysis, addressing topics such as image classification and segmentation, optical flow, image indexing and retrieval, object detection and tracking, and scene interpretation and understanding. Hereafter, we discuss some trends and challenges.

Keypoints Detection and Local Descriptors

Local image matching has been the cornerstone of many analysis tasks. It involves the detection of keypoints, i.e., salient visual points that can be robustly and repeatedly detected, and the computation of descriptors, i.e., compact signatures locally describing the visual features at each keypoint. Pairwise matching between the features can then be computed to reveal local correspondences. In this context, several frameworks have been proposed, including Scale Invariant Feature Transform (SIFT) (Lowe, 2004) and Speeded Up Robust Features (SURF) (Bay et al., 2008), and later binary variants including Binary Robust Independent Elementary Features (BRIEF) (Calonder et al., 2010), Oriented FAST and Rotated BRIEF (ORB) (Rublee et al., 2011) and Binary Robust Invariant Scalable Keypoints (BRISK) (Leutenegger et al., 2011). Although these approaches exhibit scale and rotation invariance, they are less suited to deal with large 3D distortions such as perspective deformations, out-of-plane rotations, and significant viewpoint changes. Besides, they tend to fail under significantly varying and challenging illumination conditions.

These traditional approaches based on handcrafted features have been successfully applied to problems such as image and video retrieval, object detection, visual Simultaneous Localization And Mapping (SLAM), and visual odometry. Besides, the emergence of new imaging modalities as introduced above can also be beneficial for image analysis tasks, including light fields (Galdi et al., 2019), point clouds (Guo et al., 2020), and HDR (Rana et al., 2018). However, when applied to high-dimensional visual data for semantic analysis and understanding, these approaches based on handcrafted features have been supplanted in recent years by approaches based on deep learning.

Deep Learning-Based Methods

Data-driven deep learning-based approaches (LeCun et al., 2015), and in particular the Convolutional Neural Network (CNN) architecture, nowadays represent the state of the art in terms of performance for complex pattern recognition tasks in scene analysis and understanding. By combining multiple processing layers, deep models are able to learn data representations with different levels of abstraction.

Supervised learning is the most common form of deep learning. It requires a large and fully labeled training dataset, and building such a dataset is typically a time-consuming and expensive process that has to be repeated whenever a new application scenario is tackled. Moreover, in some specialized domains, e.g., medical data, it can be very difficult to obtain annotations. To alleviate this major burden, methods such as transfer learning and weakly supervised learning have been proposed.

In another direction, deep models have been shown to be vulnerable to adversarial attacks (Akhtar and Mian, 2018). Those attacks consist of introducing subtle perturbations to the input, such that the model predicts an incorrect output. For instance, in the case of images, imperceptible pixel differences are able to fool deep learning models. Such adversarial attacks are definitely an important obstacle to the successful deployment of deep learning, especially in applications where safety and security are critical. While some early solutions have been proposed, a significant challenge is to develop effective defense mechanisms against those attacks.
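
The principle of such an attack can be sketched on a toy model. The example below applies a fast-gradient-sign style perturbation to a three-input linear classifier trained with logistic loss: each input component is moved by a small amount `eps` in the direction that increases the loss. The weights, the input, and the function names are made up for illustration; attacks on deep networks follow the same idea but obtain the gradient by backpropagation.

```python
import math


def predict(w, b, x):
    """Linear classifier returning the label +1 or -1."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1


def sign(value):
    return (value > 0) - (value < 0)


def gradient_sign_attack(w, b, x, y, eps):
    """Perturb x by eps along the sign of the loss gradient.

    For the logistic loss log(1 + exp(-y * score)), the gradient with
    respect to x is -y * sigmoid(-y * score) * w.
    """
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    factor = -y / (1.0 + math.exp(y * score))  # equals -y * sigmoid(-y * score)
    gradient = [factor * wi for wi in w]
    return [xi + eps * sign(g) for xi, g in zip(x, gradient)]


w, b = [1.0, -2.0, 0.5], 0.0     # hypothetical trained weights
x, y = [0.3, 0.1, 0.2], 1        # input correctly classified as +1
x_adv = gradient_sign_attack(w, b, x, y, eps=0.1)
```

No component of the input changes by more than `eps`, yet the predicted label flips from +1 to -1, which mirrors the "imperceptible pixel differences" described above.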

Finally, another challenge is to enable low complexity and efficient implementations. This is especially important for mobile or embedded applications. For this purpose, further interactions between signal processing and machine learning can potentially bring additional benefits. For instance, one direction is to compress deep neural networks so that they can be stored and executed more efficiently. Moreover, by combining traditional processing techniques with deep learning models, it is possible to develop low complexity solutions while preserving high performance.
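
One simple way to compress a network, given here only as an illustrative sketch, is to store its weights as small integers instead of floating-point numbers. The example below applies uniform symmetric 8-bit quantization to a short list of weights; the function names and the weight values are assumptions, and this is just one of several compression techniques (others include pruning and low-rank factorization).

```python
def quantize(weights, bits=8):
    """Uniform symmetric quantization of floats to signed integer codes."""
    qmax = 2 ** (bits - 1) - 1               # 127 for 8 bits
    largest = max(abs(w) for w in weights)
    scale = largest / qmax if largest > 0 else 1.0
    codes = [round(w / scale) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float weights from the integer codes."""
    return [c * scale for c in codes]


weights = [0.8, -0.05, 0.31, -1.27]          # hypothetical layer weights
codes, scale = quantize(weights)
restored = dequantize(codes, scale)
```

Each weight now needs 8 bits instead of 32, and the reconstruction error per weight is bounded by half of the quantization step `scale`.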

Explainability in Deep Learning

While data-driven deep learning models often achieve impressive performance on many visual analysis tasks, their black-box nature often makes it inherently very difficult to understand how they reach a predicted output and how it relates to particular characteristics of the input data. This is a major impediment in many decision-critical application scenarios. Moreover, it is important not only to have confidence in the proposed solution, but also to gain further insights from it. Based on these considerations, some deep learning systems aim at promoting explainability (Adadi and Berrada, 2018; Xie et al., 2020). This can be achieved by exhibiting traits related to confidence, trust, safety, and ethics.

However, explainable deep learning is still in its early phase. Further work is needed, in particular to develop a systematic theory of model explanation. Important aspects include the need to understand and quantify risk, to comprehend how the model makes predictions for transparency and trustworthiness, and to quantify the uncertainty in the model prediction. This challenge is key in order to deploy and use deep learning-based solutions in an accountable way, for instance in application domains such as healthcare or autonomous driving.

Self-Supervised Learning

Self-supervised learning refers to methods that learn general visual features from large-scale unlabeled data, without the need for manual annotations. Self-supervised learning is therefore very appealing, as it allows exploiting the vast amount of unlabeled images and videos available. Moreover, it is widely believed that it is closer to how humans actually learn. One common approach is to use the data to provide the supervision, leveraging its structure. More generally, a pretext task can be defined, e.g., image inpainting, colorizing grayscale images, or predicting future frames in videos, by withholding some parts of the data and by training the neural network to predict them (Jing and Tian, 2020). By learning an objective function corresponding to the pretext task, the network is forced to learn relevant visual features in order to solve the problem. Self-supervised learning has also been successfully applied to autonomous vehicle perception. More specifically, the complementarity between analytical and learning methods can be exploited to address various autonomous driving perception tasks, without the prerequisite of an annotated dataset (Chiaroni et al., 2021).
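
The way a pretext task derives its supervision from the data itself can be sketched for the inpainting case: a patch is withheld from an image, the masked image becomes the network input, and the withheld patch becomes the training target, so no manual label is involved. The function name `make_inpainting_pair`, the square mask, and the fill value are assumptions for this example, and the training of the network itself is omitted.

```python
def make_inpainting_pair(image, top, left, size, fill=0):
    """Build one (input, target) training pair for an inpainting pretext task.

    image: 2D list of pixel values (left unmodified).
    top, left, size: position and side length of the withheld square patch.
    Returns the masked copy of the image and the patch that was removed.
    """
    target = [row[left:left + size] for row in image[top:top + size]]
    masked = [list(row) for row in image]
    for y in range(top, top + size):
        for x in range(left, left + size):
            masked[y][x] = fill
    return masked, target


image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
masked, target = make_inpainting_pair(image, top=1, left=1, size=2)
```

Applying this to every image of an unlabeled collection, possibly with random patch positions, yields an arbitrarily large training set at no annotation cost.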

While good performance has already been obtained using self-supervised learning, further work is still needed. A few promising directions are outlined hereafter. Combining self-supervised learning with other learning methods is a first interesting path. For instance, semi-supervised learning (Van Engelen and Hoos, 2020) and few-shot learning (Fei-Fei et al., 2006) methods have been proposed for scenarios where limited labeled data is available. The performance of these methods can potentially be boosted by incorporating a self-supervised pre-training. The pretext task can also serve to add regularization. Another interesting trend in self-supervised learning is to train neural networks with synthetic data. The challenge here is to bridge the domain gap between the synthetic and real data. Finally, another compelling direction is to exploit data from different modalities. A simple example is to consider both the video and audio signals in a video sequence. In another example in the context of autonomous driving, vehicles are typically equipped with multiple sensors, including cameras, LIght Detection And Ranging (LIDAR), Global Positioning System (GPS), and Inertial Measurement Units (IMU). In such cases, it is easy to acquire large unlabeled multimodal datasets, where the different modalities can be effectively exploited in self-supervised learning methods.

Reproducible Research and Large Public Datasets

The reproducible research initiative is another way to further ensure high-quality research for the benefit of our community (Vandewalle et al., 2009). Reproducibility, referring to the ability of someone else working independently to accurately reproduce the results of an experiment, is a key principle of the scientific method. In the context of image and video processing, it is usually not sufficient to provide a detailed description of the proposed algorithm. Most often, it is essential to also provide access to the code and data. This is even more imperative in the case of deep learning-based models.

In parallel, the availability of large public datasets is also highly desirable in order to support research activities. This is especially critical for new emerging modalities or specific application scenarios, where it is difficult to get access to relevant data. Moreover, with the emergence of deep learning, large datasets, along with labels, are often needed for training, which can be another burden.

Conclusion and Perspectives

The field of image processing is very broad and rich, with many successful applications in both the consumer and business markets. However, many technical challenges remain in order to further push the limits in imaging technologies. Two main trends stand out: on the one hand, continually improving the quality and realism of image and video content, and on the other hand, effectively interpreting and understanding this vast and complex amount of visual data. However, the list is certainly not exhaustive and there are many other interesting problems, e.g., related to computational imaging, information security and forensics, or medical imaging. Key innovations will be found at the crossroads of image processing, optics, psychophysics, communication, computer vision, artificial intelligence, and computer graphics. Multi-disciplinary collaborations are therefore critical moving forward, involving actors from both academia and industry, in order to drive these breakthroughs.

The “Image Processing” section of Frontiers in Signal Processing aims at giving the research community a forum to exchange, discuss and improve new ideas, with the goal of contributing to the further advancement of the field of image processing and bringing exciting innovations in the foreseeable future.

Author Contributions

The author confirms being the sole contributor of this work and has approved it for publication.

Conflict of Interest

The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

1 https://www.brandwatch.com/blog/amazing-social-media-statistics-and-facts/ (accessed on Feb. 23, 2021).

Adadi, A., and Berrada, M. (2018). Peeking inside the black-box: a survey on explainable artificial intelligence (XAI). IEEE Access 6, 52138–52160. doi:10.1109/access.2018.2870052

Adelson, E. H., and Bergen, J. R. (1991). “The plenoptic function and the elements of early vision,” in Computational models of visual processing. Cambridge, MA: MIT Press, 3–20.

Akhtar, N., and Mian, A. (2018). Threat of adversarial attacks on deep learning in computer vision: a survey. IEEE Access 6, 14410–14430. doi:10.1109/access.2018.2807385

Bay, H., Ess, A., Tuytelaars, T., and Van Gool, L. (2008). Speeded-up robust features (SURF). Comput. Vis. Image Underst. 110 (3), 346–359. doi:10.1016/j.cviu.2007.09.014

Bosse, S., Maniry, D., Müller, K. R., Wiegand, T., and Samek, W. (2017). Deep neural networks for no-reference and full-reference image quality assessment. IEEE Trans. Image Process. 27 (1), 206–219. doi:10.1109/TIP.2017.2760518

Bovik, A. C. (2013). Automatic prediction of perceptual image and video quality. Proc. IEEE 101 (9), 2008–2024. doi:10.1109/JPROC.2013.2257632

Bross, B., Chen, J., Ohm, J. R., Sullivan, G. J., and Wang, Y. K. (2021). Developments in international video coding standardization after AVC, with an overview of Versatile Video Coding (VVC). Proc. IEEE. doi:10.1109/JPROC.2020.3043399

Calonder, M., Lepetit, V., Strecha, C., and Fua, P. (2010). “BRIEF: binary robust independent elementary features,” in European conference on computer vision, eds K. Daniilidis, P. Maragos, and N. Paragios (Berlin, Heidelberg: Springer), 778–792. doi:10.1007/978-3-642-15561-1_56

Chalmers, A., and Debattista, K. (2017). HDR video past, present and future: a perspective. Signal. Processing: Image Commun. 54, 49–55. doi:10.1016/j.image.2017.02.003

Chiaroni, F., Rahal, M.-C., Hueber, N., and Dufaux, F. (2021). Self-supervised learning for autonomous vehicles perception: a conciliation between analytical and learning methods. IEEE Signal. Process. Mag. 38 (1), 31–41. doi:10.1109/msp.2020.2977269

Cisco (2019). Cisco visual networking index: forecast and trends, 2017-2022 (white paper). Indianapolis, Indiana: Cisco Press.

Conti, C., Soares, L. D., and Nunes, P. (2020). Dense light field coding: a survey. IEEE Access 8, 49244–49284. doi:10.1109/ACCESS.2020.2977767

Da Costa, D. B., and Yang, H.-C. (2020). Grand challenges in wireless communications. Front. Commun. Networks 1 (1), 1–5. doi:10.3389/frcmn.2020.00001

Deng, Y., Loy, C. C., and Tang, X. (2017). Image aesthetic assessment: an experimental survey. IEEE Signal. Process. Mag. 34 (4), 80–106. doi:10.1109/msp.2017.2696576

Ding, D., Ma, Z., Chen, D., Chen, Q., Liu, Z., and Zhu, F. (2021). Advances in video compression system using deep neural network: a review and case studies. Ithaca, NY: Cornell University.

Duan, L., Liu, J., Yang, W., Huang, T., and Gao, W. (2020). Video coding for machines: a paradigm of collaborative compression and intelligent analytics. IEEE Trans. Image Process. 29, 8680–8695. doi:10.1109/tip.2020.3016485

Dufaux, F., Le Callet, P., Mantiuk, R., and Mrak, M. (2016). High dynamic range video - from acquisition, to display and applications. Cambridge, Massachusetts: Academic Press.

Dufaux, F., Pesquet-Popescu, B., and Cagnazzo, M. (2013). Emerging technologies for 3D video: creation, coding, transmission and rendering. Hoboken, NJ: Wiley.

Fei-Fei, L., Fergus, R., and Perona, P. (2006). One-shot learning of object categories. IEEE Trans. Pattern Anal. Mach. Intell. 28 (4), 594–611. doi:10.1109/TPAMI.2006.79

Galdi, C., Chiesa, V., Busch, C., Lobato Correia, P., Dugelay, J.-L., and Guillemot, C. (2019). Light fields for face analysis. Sensors 19 (12), 2687. doi:10.3390/s19122687

Graziosi, D., Nakagami, O., Kuma, S., Zaghetto, A., Suzuki, T., and Tabatabai, A. (2020). An overview of ongoing point cloud compression standardization activities: video-based (V-PCC) and geometry-based (G-PCC). APSIPA Trans. Signal Inf. Process. 9, 2020. doi:10.1017/ATSIP.2020.12

Guarda, A., Rodrigues, N., and Pereira, F. (2020). Adaptive deep learning-based point cloud geometry coding. IEEE J. Selected Top. Signal Process. 15, 415-430. doi:10.1109/mmsp48831.2020.9287060

Guo, Y., Wang, H., Hu, Q., Liu, H., Liu, L., and Bennamoun, M. (2020). Deep learning for 3D point clouds: a survey. IEEE Trans. Pattern Anal. Mach. Intell. doi:10.1109/TPAMI.2020.3005434

Han, X.-F., Jin, J. S., Wang, M.-J., Jiang, W., Gao, L., and Xiao, L. (2017). A review of algorithms for filtering the 3D point cloud. Signal. Processing: Image Commun. 57, 103–112. doi:10.1016/j.image.2017.05.009

Haskell, B. G., Puri, A., and Netravali, A. N. (1996). Digital video: an introduction to MPEG-2. Berlin, Germany: Springer Science and Business Media.

Hirsch, R. (1999). Seizing the light: a history of photography. New York, NY: McGraw-Hill.

Ihrke, I., Restrepo, J., and Mignard-Debise, L. (2016). Principles of light field imaging: briefly revisiting 25 years of research. IEEE Signal. Process. Mag. 33 (5), 59–69. doi:10.1109/MSP.2016.2582220

Jing, L., and Tian, Y. (2020). “Self-supervised visual feature learning with deep neural networks: a survey,” IEEE Trans. Pattern Anal. Mach. Intell. Ithaca, NY: Cornell University.

Le Callet, P., Möller, S., and Perkis, A. (2012). Qualinet white paper on definitions of quality of experience. European network on quality of experience in multimedia systems and services (COST Action IC 1003), 3 (2012).

Le Gall, D. (1991). MPEG: a video compression standard for multimedia applications. Commun. ACM 34, 46–58. doi:10.1145/103085.103090

LeCun, Y., Bengio, Y., and Hinton, G. (2015). Deep learning. Nature 521 (7553), 436–444. doi:10.1038/nature14539

Leutenegger, S., Chli, M., and Siegwart, R. Y. (2011). “BRISK: binary robust invariant scalable keypoints,” IEEE International conference on computer vision, Barcelona, Spain, 6-13 Nov, 2011 (IEEE), 2548–2555.

Lin, W., and Jay Kuo, C.-C. (2011). Perceptual visual quality metrics: a survey. J. Vis. Commun. Image Represent. 22 (4), 297–312. doi:10.1016/j.jvcir.2011.01.005

Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 60 (2), 91–110. doi:10.1023/b:visi.0000029664.99615.94

Lumiere, L. (1996). 1936 the lumière cinematograph. J. Smpte 105 (10), 608–611. doi:10.5594/j17187

Masia, B., Wetzstein, G., Didyk, P., and Gutierrez, D. (2013). A survey on computational displays: pushing the boundaries of optics, computation, and perception. Comput. & Graphics 37 (8), 1012–1038. doi:10.1016/j.cag.2013.10.003

Murray, N., Marchesotti, L., and Perronnin, F. (2012). “AVA: a large-scale database for aesthetic visual analysis,” IEEE conference on computer vision and pattern recognition , Providence, RI , June, 2012 . ( IEEE ), 2408–2415. doi:10.1109/CVPR.2012.6247954

Rana, A., Valenzise, G., and Dufaux, F. (2018). Learning-based tone mapping operator for efficient image matching. IEEE Trans. Multimedia 21 (1), 256–268. doi:10.1109/TMM.2018.2839885

Rublee, E., Rabaud, V., Konolige, K., and Bradski, G. (2011). “ORB: an efficient alternative to SIFT or SURF,” IEEE International conference on computer vision , Barcelona, Spain , November, 2011 ( IEEE ), 2564–2571. doi:10.1109/ICCV.2011.6126544

Slater, M. (2014). Grand challenges in virtual environments. Front. Robotics AI 1, 3. doi:10.3389/frobt.2014.00003

Van Engelen, J. E., and Hoos, H. H. (2020). A survey on semi-supervised learning. Mach Learn. 109 (2), 373–440. doi:10.1007/s10994-019-05855-6

Vandewalle, P., Kovacevic, J., and Vetterli, M. (2009). Reproducible research in signal processing. IEEE Signal. Process. Mag. 26 (3), 37–47. doi:10.1109/msp.2009.932122

Wallace, G. K. (1992). The JPEG still picture compression standard. IEEE Trans. Consumer Electron. 38 (1), xviii-xxxiv. doi:10.1109/30.125072

Wien, M., Boyce, J. M., Stockhammer, T., and Peng, W.-H. (2019). Standardization status of immersive video coding. IEEE J. Emerg. Sel. Top. Circuits Syst. 9 (1), 5–17. doi:10.1109/JETCAS.2019.2898948

Wu, G., Masia, B., Jarabo, A., Zhang, Y., Wang, L., Dai, Q., et al. (2017). Light field image processing: an overview. IEEE J. Sel. Top. Signal. Process. 11 (7), 926–954. doi:10.1109/JSTSP.2017.2747126

Xie, N., Ras, G., van Gerven, M., and Doran, D. (2020). Explainable deep learning: a field guide for the uninitiated , Ithaca, NY: Cornell University .

Keywords: image processing, immersive, image analysis, image understanding, deep learning, video processing

Citation: Dufaux F (2021) Grand Challenges in Image Processing. Front. Sig. Proc. 1:675547. doi: 10.3389/frsip.2021.675547

Received: 03 March 2021; Accepted: 10 March 2021; Published: 12 April 2021.

Copyright © 2021 Dufaux. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Frédéric Dufaux, [email protected]

scikit-image: image processing in Python

1 code implementation • 23 Jul 2014

scikit-image is an image processing library that implements algorithms and utilities for use in research, education and industry applications.

Loss Functions for Neural Networks for Image Processing

2 code implementations • 28 Nov 2015

Neural networks are becoming central in several areas of computer vision and image processing and different architectures have been proposed to solve specific problems.

Picasso: A Modular Framework for Visualizing the Learning Process of Neural Network Image Classifiers

1 code implementation • 16 May 2017

Picasso is a free open-source (Eclipse Public License) web application written in Python for rendering standard visualizations useful for analyzing convolutional neural networks.

MAXIM: Multi-Axis MLP for Image Processing

1 code implementation • CVPR 2022

In this work, we present a multi-axis MLP based architecture called MAXIM, that can serve as an efficient and flexible general-purpose vision backbone for image processing tasks.

Fast Image Processing with Fully-Convolutional Networks

2 code implementations • ICCV 2017

Our approach uses a fully-convolutional network that is trained on input-output pairs that demonstrate the operator's action.

Pre-Trained Image Processing Transformer

6 code implementations • CVPR 2021

To maximally excavate the capability of transformer, we present to utilize the well-known ImageNet benchmark for generating a large amount of corrupted image pairs.

In Defense of Classical Image Processing: Fast Depth Completion on the CPU

2 code implementations • 31 Jan 2018

With the rise of data driven deep neural networks as a realization of universal function approximators, most research on computer vision problems has moved away from hand crafted classical image processing algorithms.

Image Processing Using Multi-Code GAN Prior

1 code implementation • CVPR 2020

Such an over-parameterization of the latent space significantly improves the image reconstruction quality, outperforming existing competitors.

Comparison of Image Quality Models for Optimization of Image Processing Systems

1 code implementation • 4 May 2020

The performance of objective image quality assessment (IQA) models has been evaluated primarily by comparing model predictions to human quality judgments.

Quaternion Convolutional Neural Networks for Heterogeneous Image Processing

1 code implementation • 31 Oct 2018

Convolutional neural networks (CNN) have recently achieved state-of-the-art results in various applications.

Deep Learning-based Image Text Processing Research *

Image forgery detection: a survey of recent deep-learning approaches

  • Open access
  • Published: 03 October 2022
  • Volume 82 , pages 17521–17566, ( 2023 )

  • Marcello Zanardelli   ORCID: orcid.org/0000-0001-5529-2408 1 ,
  • Fabrizio Guerrini 1 ,
  • Riccardo Leonardi 1 &
  • Nicola Adami 1  

In recent years, due to the availability and ease of use of image editing tools, a large number of fake and altered images have been produced and spread through the media and the Web. Many different approaches have been proposed in order to assess the authenticity of an image and in some cases to localize the altered (forged) areas. In this paper, we conduct a survey of some of the most recent image forgery detection methods that are specifically designed upon Deep Learning (DL) techniques, focusing on commonly found copy-move and splicing attacks. DeepFake generated content is also addressed insofar as its application is aimed at images, achieving the same effect as splicing. This survey is especially timely because deep learning powered techniques appear to be the most relevant right now, since they give the best overall performance on the available benchmark datasets. We discuss the key aspects of these methods, while also describing the datasets on which they are trained and validated. We also discuss and compare (where possible) their performance. Building upon this analysis, we conclude by addressing possible future research trends and directions, in both deep learning architectural and evaluation approaches, and dataset building for easy comparison of methods.

1 Introduction

The worldwide spread of smart devices, which integrate increasing quality cameras and image processing tools and “apps”, the ubiquity of desktop computers, and the fact that all these devices are almost permanently connected with each other and to remotely located data servers through the Internet, have given ordinary people the possibility to collect, store, and process an enormous quantity of digital visual data on a scale just until recently quite unthinkable.

As a consequence, images and videos are often shared and considered as information sources in several different contexts. Indeed, a great number of everyday facts are documented through the use of smartphones, even by professionals [ 64 ]. Massive sharing of visual content is enabled by a variety of digital technologies [ 79 ], such as effective compression methods, fast networks, and specially designed user applications. The latter, in particular, include Web platforms, e.g., social networks such as Instagram and forums like Reddit, that allow the almost instantaneous spreading of user generated images and video. On the other hand, user-friendly, advanced image editing software, both commercial like Adobe Photoshop [ 3 ], and free and open source like GIMP [ 33 ], not to mention smartphone-based apps that can apply basic image manipulations on the fly (Footnote 1), is widely available to everyone.

All these factors have contributed to the spread of fake or forged images and videos, in which the semantic content is significantly altered. Sometimes this is done for malevolent purposes, such as political or commercial ones [ 94 ]. As of 2022, all of the major social network platforms are struggling to filter manipulated data, so as to prevent such fake content, often directed at the most vulnerable users, from “going viral” [ 96 ]. Legal conundrums are also emerging regarding where to put the responsibility for the possibly damaging fallout of fake content spreading [ 34 ].

Such problems arise because humans are most often easily fooled by forgeries, and in some cases they are even demonstrably unable to detect any but the least sophisticated modifications undergone by visual content, due to the so-called change blindness cognitive effect [ 73 , 93 ]. Thus, there is a need for carefully designed digital techniques.

Semantic alterations can be carried out on all types of digital media content, like video or even audio. However, the focus of the analysis presented in this paper is on methods and algorithms specifically designed for forgery detection on still images , which is by far the most common case.

In this context, the general problem of determining whether a given image has not been altered so as to modify its semantics is referred to as image authentication, or image integrity verification [ 48 ]. If the emphasis is put on expressly establishing whether a given image has undergone a semantic alteration, or forgery , the same application is often referred to as image forgery detection in the literature [ 29 ]. The objective of this paper is to provide a survey of selected forgery detection methods, with particular attention to deep learning (DL) techniques that have since come to the fore.

Before starting our analysis on forgery detection methods, in the rest of this Section we frame why we think this comprehensive, performance-driven survey that describes the most recent DL methods is both timely and necessary. We first provide a broad overview of the considered application, mainly to fix some definitions. Next, we provide a concise summary of the most commonly found types of forgery. We finally provide the organization of the remainder of the paper while also detailing the contributions of our present analysis.

1.1 Image forgery detection applications

Image forgery detection can mainly be divided into two categories: active and passive . Sometimes these methods also give a localization of the altered/forged areas of the image, and even provide an estimate of the original visual content.

Active methods for general visual content protection are based on technologies like digital signatures [ 74 ] and digital watermarking [ 6 ]. Digital signatures are straight cryptographic methods that authenticate the bit-stream. However, the authentication in this case is fragile, meaning that any change in the bit-stream invalidates the signature, and thus it is more tailored to alternative applications such as copyright protection. This fragility is instead not desirable when verifying image semantic content, since alterations that do not change the semantics (e.g., a mild amount of compression) should be tolerated. In other words, the authentication method needs to be robust. Another serious drawback is that the signature has to be attached as metadata to the image, and therefore could be discarded or sometimes even substituted by a malicious user.
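
The fragility of bit-stream authentication can be illustrated with a minimal sketch. It assumes a keyed hash (HMAC-SHA256 from the Python standard library) as a stand-in for a true public-key digital signature, and the key and byte string are hypothetical placeholders rather than anything taken from the cited works. The only point is that a single flipped bit invalidates the tag, even though such a change would be visually negligible.

```python
import hashlib
import hmac

# Hypothetical secret key, for illustration only.
SECRET_KEY = b"demo-key"


def sign(bitstream: bytes, key: bytes = SECRET_KEY) -> bytes:
    """Return an authentication tag computed over the raw bit-stream."""
    return hmac.new(key, bitstream, hashlib.sha256).digest()


def verify(bitstream: bytes, tag: bytes, key: bytes = SECRET_KEY) -> bool:
    """Recompute the tag and compare it in constant time."""
    return hmac.compare_digest(sign(bitstream, key), tag)


def flip_lowest_bit(bitstream: bytes, index: int) -> bytes:
    """Flip the least significant bit of one byte of the bit-stream."""
    altered = bytearray(bitstream)
    altered[index] ^= 0x01
    return bytes(altered)


original = bytes(range(256)) * 4      # stand-in for image data
tag = sign(original)
altered = flip_lowest_bit(original, 100)

print(verify(original, tag))   # the untouched bit-stream verifies
print(verify(altered, tag))    # one flipped bit is enough to fail
```

A robust scheme would instead have to tolerate this kind of change while still rejecting semantic alterations, which is what the approaches discussed next aim for.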

To address these shortcomings, robust methods have been proposed. For example, robust digital watermarking embeds security information in the content itself by controlled imperceptible modifications. Ideally, an attacker should not be able to alter the content of an image without changing the embedded watermark, while being able to safely apply selected processing such as compression, thus allowing the consumer of the image to detect the manipulation.

Note that variants of the aforementioned approaches exist, namely, robust signatures (based on content hashing techniques) [ 87 , 92 ], and fragile watermarking [ 21 ]. Sometimes these variants have been cleverly combined [ 66 ]. However, they still inherit the same problems associated with metadata presence and fragility that we have just outlined.

In the end, active methods have the advantage of being able to convey side information which may be useful to detect the attempted forgery, but they need the watermark or signature to be computed on the unaltered version of the image, ideally at acquisition level. This in turn requires the capturing camera to have specific hardware and/or on-board post-processing software. Furthermore, any entity interested in verifying the semantic content of a given image must be able to decode the authentication information, which means having access to the (private or public) key of the creator and/or the watermark detector. However, leaving both the security information embedding and the decoding devices to potentially malicious users is usually a threat to the entire framework (Footnote 2).

As an alternative, a trusted third party could be set up to verify the image integrity, for instance, a Web site able to embed and decode the watermarks. However, scalability problems prevent such an architecture from being feasible for everyday images shared on the Internet. Recently, commercial solutions based on the blockchain paradigm aimed at image integrity have also appeared to get rid of the trusted third party presence, though few details of their inner workings are currently known (Footnote 3). Blockchain methods can be considered active only in the sense that a block needs to be generated for each protected image, but the image itself is released without modifications. To the best of the authors’ knowledge, however, these techniques are not widespread for forgery detection. That may well be because, while the distributed ledger paradigm does not need a trusted third party, fragile authentication is unavoidable since in the end blockchain has a cryptographic core, and furthermore scalability issues are still present. Still, new solutions are being proposed in this field, for instance [ 47 ].

Conversely, passive methods do not need the presence of additional data attached to the image, and they are commonly known as forensics [ 81 ]. Their goal is thus to tell whether an image is authentic or not by analysing only the image itself, searching for traces of specific processing undergone by the image. In the case of massively shared, ordinary images, this solution has been traditionally considered the only feasible one.

Often, an attacker can apply one or a set of successive manipulations on the target image, either on the whole image or only on a tampered region, such as a semantic alteration, e.g. object duplication, JPEG compression, geometric transformations, up-sampling, filtering, e.g. contrast enhancement, and so on. When this chain of manipulations is used by an attacker to disguise the original forgery it is referred to as anti-forensics .

The task of determining the history of attacks that a target image has undergone is sometimes called image phylogeny [ 70 ]. Of course, this is a more challenging problem than simply telling apart pristine and forged images, as it involves the detection of multiple kinds of attacks while also determining the order in which they were performed. Let us consider, for example, a scenario in which the attacker can perform three different manipulations, and assume for simplicity that each attack is applied at most once. The number N of possible processing histories is thus the sum, over k, of the number of ordered arrangements without repetition (k-permutations) of k attacks out of the possible three, as in:

$$N = \sum_{k=0}^{3} \frac{3!}{(3-k)!} = 1 + 3 + 6 + 6 = 16$$

Note that k = 0 means that the image is pristine. As can be observed, the number of possible histories grows very rapidly (faster than exponentially) with the number of available attacks. A possible solution can be found in [ 14 , 60 , 61 ], where the authors formulated the problem of determining the processing history as a multi-class classification problem. Therein, each of the N histories corresponds to a class, and a fusion-decision algorithm tries to combine the outputs of multiple forgery detection methods by means of an agreement function, which aims to give a higher weight to decisions on which more forgery detection methods agree and less to the ones on which there is less consensus.
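
As a quick check of this counting argument, the following minimal sketch enumerates every ordered history in which each attack appears at most once, and compares the result with the closed-form sum of k-permutations. The attack labels are hypothetical placeholders, and the sketch only counts classes; it does not implement the fusion-decision algorithm of the cited works.

```python
from itertools import permutations
from math import factorial


def processing_histories(attacks):
    """Yield every ordered history in which each attack is used at most once."""
    for k in range(len(attacks) + 1):
        # k = 0 yields the single empty history, i.e. the pristine image.
        yield from permutations(attacks, k)


def count_histories(n):
    """Closed-form count: sum over k = 0..n of n! / (n - k)!."""
    return sum(factorial(n) // factorial(n - k) for k in range(n + 1))


attacks = ["compression", "resampling", "filtering"]  # hypothetical labels
histories = list(processing_histories(attacks))

print(len(histories))        # 16
print(count_histories(3))    # 16
print([count_histories(n) for n in range(1, 7)])
```

Each enumerated tuple would correspond to one class in a multi-class formulation, which makes clear why the number of classes becomes unmanageable as more attacks are considered.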

As a final note, there is another possible forensics application, that is the trustworthy attribution of the visual content to its creator, for example, the device that generated the image. The forensics traces could be present all the way from the acquisition level (e.g., the camera-specific acquisition noise, known as Photo Response Non-Uniformity [ 32 ], or PRNU) down to the post-processing stage (that is, after the original image has been stored in digital form) [ 48 ].

Sometimes, however, forgery detection follows the “in-the-wild” assumption that the creator of a particular image is not safely attributable to any entity, and thus it is to be considered coming from a possibly anonymous, unreliable source.

1.2 Image forgery types

We now present the most common forgeries and manipulations found in the context of the just discussed applications. Visual examples are depicted in Fig.  1 .

Fig. 1. Examples for each discussed forgery kind

Copy-move

The copy-move forgery is performed by copying one or more regions of an image and pasting them in the same image in different locations. Copy-move forgeries are typically used to hide information or to duplicate objects/people, thus severely altering the semantic content of the target image. An example of copy-move forgery is shown in Fig.  1a , where the right building tower has been inserted as a copy of the left one.

Splicing

This forgery is similar to copy-move, with the difference that the pasted regions/objects are cut from one or more other images. A splicing forgery can be done in order to hide some content, or to show a fake situation. For example, in Fig.  1b , we can see an image in which two famous people are depicted together, but the picture has been shown to be the composition of two different images.

Inpainting

This kind of attack consists in filling a region or a “hole” in the image with plausible content. Inpainting is typically employed to restore damaged patches in images. However, it can also be used by potential attackers as a malicious means to hide information from an image, or to remove a visible watermark. The filled region can either be copied from another part of the image, or synthesized with a specific algorithm, such as a GAN (Generative Adversarial Network [ 35 ], see also below). Note that, in the former case, this attack can be thought of as a particular instance of copy-move.

A particularly interesting instance of inpainting is the reconstruction of deleted parts of faces, such as the eyes or the mouth. Promising results in this regard have been obtained by Nvidia [ 63 ] (an example is shown in Fig.  1c ).

DeepFake

DeepFake is a particular kind of manipulation in which a deep learning model is employed to synthesize fake content in images or videos. The “deep” term is used to emphasize the difference between the pre-DL era, in which this task was manually done by experts with professional editing tools, and the current era, in which this is automatically done by deep models, such as GANs [ 35 ].

A typical application of DeepFake consists in the substitution of the face of a person with the face of another person (usually a VIP) taken from a second image or video (see Fig.  1e ). In another kind of DeepFake attack the facial expressions of a donor person are extracted and applied to the target person in another image or video. This is usually done by means of synthesis methods (namely, GANs) or by merging algorithms that aim to maximise the realism of the obtained face.

Even if most of the time DeepFakes are created for entertainment/comedy purposes, there have been cases in which a VIP was shown to be in certain situations in which he/she never was, thus damaging his/her image and leading to scandals. As a matter of fact, the vast majority of DeepFakes with the latter purpose are created in the video domain, because this kind of media usually poses a bigger semantic threat to the attacked person/VIP, especially when an appropriate audio track is available and can be matched to the facial expressions of the talking person. Furthermore, a number of easy-to-use tools have been developed to produce convincing DeepFakes, such as FakeApp , faceswap-GAN , and that available at [ 27 ]. As a consequence, many DeepFake videos have been spreading through the Web in the last few years.

DeepFakes for static images are less common, but they are still worthy of interest for forgery detection purposes. Note that this kind of attack can be thought of as a particular case of the aforementioned splicing.

CGI-generated images/videos

This approach consists in creating photo-realistic content as the rendering output of a computer graphics generated 3D scene. Thanks to the recent advances in the video-gaming industry and in GPU technology, techniques such as ray-tracing have become much easier to implement, thus making it possible to reach realism levels unthinkable just a few years ago (an example is shown in Fig.  1d ). In fact, in recent years a number of graphics engines, such as Unity and the Unreal Engine , have been developed and can be used freely (or rather cheaply) by everyone. So, more and more convincing rendered images/videos are being produced every day.

Consequently, the images generated through these engines can be almost indistinguishable from images taken with a real camera, and, of course, this can be used for malicious intents by potential attackers that can use these renderings to depict false scenes. It is worth noticing, though, that in the case of CGI generated content a certain level of expertise is still required in order to produce convincing results.

In this case, there are no clear parallels with splicing, since the scene is generated from scratch.

GAN-based face synthesization

Last, we introduce a particularly popular kind of fake content generation approach, which consists in the creation of a realistic face of a completely non-existing person, employing the previously cited GANs. This is done by feeding the trained model with a vector of randomly sampled noise, which is converted by the model to a realistic face (theoretically) different from any existing one. Again, as for the previously discussed CGI generated content, the fake image is synthesized anew instead of being copied from another source.

In [ 45 ], Nvidia proposed a GAN architecture that is considered a breakthrough for this technology. Interactive demos based upon this original work can also be found on the Web, such as [ 39 ]. Apart from artifacts that can sometimes still be noticeable in the background, the produced faces are really convincing and they are hardly detectable as fake by the naked eye.

1.3 Contributions and paper organization

Since the early 2000s, many approaches to image forgery detection have been proposed, and many excellent reviews can be found [ 11 , 29 , 38 , 48 , 84 , 103 ]. However, deep learning techniques have proved to be a game-changer in many digital processing and computer vision applications, especially when a lot of training data are available [ 56 , 62 , 109 ]. Even if in the case of forgery detection this last assumption is not quite satisfied, nonetheless, as discussed in what follows, the best performance on standard benchmarks was obtained with algorithms that leverage DL models in one or more phases.

For this reason, we feel that it is very important to keep track of the breakthroughs made possible by deep learning in forgery detection. In particular, it is crucial that some degree of comparison between DL-based techniques that follow different perspectives is carried out. This is especially true since it is challenging to identify future (and even present) trends in a technology like DL, which is already vast and still expanding at a tremendous rate.

In this paper, we mainly focus our discussion on copy-move and splicing detection methods. Even if these attacks are not as recent as GAN-based ones or DeepFakes, they are very prominent in the literature and many algorithms for their detection are still being published to date. The reason why these forgeries are so widespread is mainly their simplicity, both in terms of end user employment and experimental dataset building, but also because they are a very immediate threat to the integrity of the image semantics.

Even so, we discuss some of the DeepFake detection techniques, insofar as this kind of attack can be seen as a special (and more sophisticated) case of splicing, or at least a manipulation that usually involves a source or donor image/video and a target one. However, since this work aims to give an overview of image forgery detection methods, we do not deal with approaches specifically designed for video content, i.e., that cannot be applied to single images. In fact, video-specific methods typically do not analyze each frame as a standalone image, but they also leverage temporal clues between different frames or, if available, inconsistencies between the audio track and the facial expressions. We refer the reader interested in DeepFakes seen as a standalone research field to the review in [ 102 ].

This paper is organized as follows. As stated before, the focus of this paper is on the most recent methods for copy-move and splicing detection that are specifically based upon DL. To better highlight the contrast with the previous state-of-the-art, it is useful to first recap in Section  2 several of the established forensics-based techniques for image forgery detection that instead follow traditional approaches. In Section  3 , we describe the key aspects of the deep learning based methods, including their applicability and their limitations, and we illustrate their properties such as the kind of attacks they can detect and whether or not they localize the forged areas. We concurrently discuss the datasets on which they were trained/tested. Then, in Section  4 we discuss their performance, which is also directly compared when possible (that is, when tested on the same benchmark dataset). Finally, in Section  5 we follow up on the previous discussion by drawing some conclusions, while providing some insights on what we think should be the most important future research directions.

2 Traditional passive forgery detection methods

We now briefly discuss some of the “conventional” passive image forgery detection approaches that have been proposed since the early 2000s. Of course, what we present here is not an exhaustive, nor in-depth review of these methods. For a more comprehensive review, see [ 29 , 38 ], and [ 103 ].

Conventional passive methods leverage techniques from the fields of signal processing, statistics, physics, and geometry, and are usually also referred to as “classic” or “traditional” approaches. In fact, they predate the DL era that we are currently in and, as such, they require little or no data for a possible training phase. Those that still require data for training are typically based on traditional machine learning techniques, such as clustering, support vector machines (SVM), linear/logistic regression, random forests, and so on. Here, we still consider those as belonging to the classic methods, because they rely on models that have a relatively small number of parameters, and therefore do not require a great amount of data for training.

We think it is useful to briefly describe some of the traditional approaches, for the following two reasons:

  • As mentioned above, they typically require little or no data for training. Of course, this is an advantage when it is hard or impossible to collect a good amount of labelled images to train a highly parameterized deep learning model. Also, most of these methods are not as computationally expensive, and thus can be easily deployed on commercial low-power hardware, like smartphones or tablets;

  • Some of the core ideas and principles these methods rely on can also be used in conjunction with deep learning models, in order to accelerate the training phase or to achieve better performance. For example, in [ 86 ], an SVM model is used as the final classification phase applied on the output of a CNN. In [ 85 ], a YCbCr color space conversion and a DCT transform are used as pre-processing stages before a CNN. In [ 97 ], a CNN takes as input the Laplacian filter residuals (LFR) computed on the input images rather than the images themselves. All of these methods, among several others, are discussed in detail in Section  3 .
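
To make the idea of a hand-crafted pre-processing stage concrete, the following minimal sketch computes a Laplacian filter residual of a grayscale image with NumPy. The 3x3 kernel and the edge-replication border handling are assumptions made for illustration; the exact filter and pipeline used in [ 97 ] may differ.

```python
import numpy as np

# Assumed 3x3 discrete Laplacian kernel; the kernel used in the cited work may differ.
LAPLACIAN = np.array([[0.0, 1.0, 0.0],
                      [1.0, -4.0, 1.0],
                      [0.0, 1.0, 0.0]])


def laplacian_residual(gray):
    """Return the Laplacian-filtered residual of a 2D grayscale image.

    Borders are handled by edge replication, so the output has the
    same shape as the input.
    """
    gray = np.asarray(gray, dtype=np.float64)
    padded = np.pad(gray, 1, mode="edge")
    h, w = gray.shape
    out = np.zeros_like(gray)
    # Accumulate the nine shifted, weighted copies of the image.
    for dy in range(3):
        for dx in range(3):
            out += LAPLACIAN[dy, dx] * padded[dy:dy + h, dx:dx + w]
    return out


# A single bright pixel on a flat background: smooth content is suppressed,
# while the local discontinuity is emphasized.
image = np.zeros((5, 5))
image[2, 2] = 1.0
residual = laplacian_residual(image)
print(residual)
```

Feeding such a residual to a network, rather than the raw pixels, suppresses smooth image content and leaves mostly high-frequency traces, which is where manipulation artifacts tend to appear.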

Passive traditional methods can usually be grouped into five main categories. We discuss each separately in the remainder of this Section.

2.1 Pixel based

These methods rely on the fact that certain manipulations introduce anomalies that can affect the statistical content of the image at the pixels level. Some of these anomalies can be detected in the spatial domain, while others in the frequency domain or in a combination of both.

For copy-move attacks, it is common to observe a strong correlation between copied regions in the image but, since these can be of any size and shape, it is computationally infeasible to explore the whole space of possible shape/size combinations. The authors of [ 31 ] proposed a method based on the Discrete Cosine Transform (DCT). In particular, they divided the image into overlapping blocks and applied a DCT to each block. The DCT coefficients were used as feature vectors that describe each block. Duplicated regions were then detected by lexicographically ordering the DCT block coefficients and grouping the most similar ones. Another approach, proposed in [ 82 ], consisted in applying a Principal Component Analysis (PCA) to the features of the image blocks, and then comparing the block representations in this reduced-dimension space. These approaches have been shown to be robust to minor variations in the copied regions, like additive noise or lossy compression. However, in general these methods do not perform well in the case of geometric transformations like rotation or scaling.
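The block-matching idea can be illustrated with a minimal sketch. It is not the implementation of [ 31 ]: the function names, the block size, the number of retained coefficients, the tolerance and the minimum shift are all hypothetical choices, and only exact copies are handled (a practical detector would typically quantize the coefficients to tolerate noise or recompression).

```python
import numpy as np


def dct_matrix(n):
    """Orthonormal DCT-II basis: row k holds the k-th cosine basis vector."""
    k = np.arange(n).reshape(-1, 1)
    i = np.arange(n).reshape(1, -1)
    basis = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    basis[0, :] = np.sqrt(1.0 / n)
    return basis


def find_duplicate_blocks(image, block=8, keep=4, tol=1e-6, min_shift=8):
    """Return pairs of top-left corners of blocks with matching DCT features.

    block, keep, tol and min_shift are illustrative parameters (assumptions).
    """
    image = np.asarray(image, dtype=float)
    basis = dct_matrix(block)
    height, width = image.shape
    coords, feats = [], []
    for y in range(height - block + 1):
        for x in range(width - block + 1):
            patch = image[y:y + block, x:x + block]
            coeffs = basis @ patch @ basis.T          # 2-D DCT of the block
            feats.append(coeffs[:keep, :keep].ravel())  # low-frequency corner
            coords.append((y, x))
    feats = np.array(feats)
    # np.lexsort uses the LAST key as the primary one, so reverse the columns
    # to sort the rows lexicographically from the first coefficient onwards.
    order = np.lexsort(feats.T[::-1])
    pairs = []
    for a, b in zip(order[:-1], order[1:]):
        # Only neighbours in the sorted list are compared.
        if np.max(np.abs(feats[a] - feats[b])) <= tol:
            (y1, x1), (y2, x2) = coords[a], coords[b]
            if abs(y1 - y2) + abs(x1 - x2) >= min_shift:  # skip overlapping blocks
                pairs.append(tuple(sorted((coords[a], coords[b]))))
    return pairs
```

Sorting makes similar blocks adjacent, so each block is compared only with its neighbour in the sorted list instead of with every other block.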

Thus, let us now consider a situation in which a geometric transformation is used in order to make a copy-move attack more convincing. Geometric transformations usually involve some form of interpolation between neighbouring pixels; the most common techniques are bilinear and cubic interpolation. Depending on the chosen technique, a specific correlation pattern between these pixels is created. Statistical methods are then employed with the aim of finding these patterns in order to detect regions in which a geometric manipulation has been applied. An example of this approach is described in [ 83 ].

An example of frequency-based forgery detection is [ 28 ]. To detect spliced regions, the authors observed that, even if the boundary between the spliced region and the original image can be visually imperceptible, high-order Fourier statistics are affected by this kind of manipulation and thus can be used for detection.

Another common family of methods, specifically designed for copy-move detection, is that of key-point based methods. They typically require the following steps:

Key-points extraction. Key-points are variously defined as “points of interest” of the image, for example, local minima or maxima, corners, blobs, etc. Some of the most commonly employed key-points extraction processes include the well-known Scale Invariant Feature Transform (SIFT) [ 65 ], Speeded Up Robust Features (SURF) [ 9 ], or Features from Accelerated Segment Test (FAST) [ 89 ];

Descriptors extraction. One or more feature vectors (descriptors) are extracted from each key-point. Usually, these vectors are a compact description of the region in the vicinity of the key-point. In addition to the SIFT/SURF feature values, Histogram of Oriented Gradients (HOG) and the FAST-based ORB [ 89 ] are other common ones;

Descriptors matching. In this step, descriptors are compared according to a distance (or a complementary similarity) function. If the distance of two or more descriptors is below a certain threshold, a match between these descriptors is declared;

Filtering step. In this phase, some form of filtering of the matching results is performed in order to rule out weak matches. Different criteria can be used, such as Lowe’s ratio test [ 65 ], in which a match is considered valid only if the distance between a descriptor and its most similar one is considerably smaller than the distance between the same descriptor and its second most similar one. Other criteria can be employed, for instance, based on the spatial relationship between the key-points.
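The matching and filtering steps can be sketched as follows, assuming that the descriptors have already been extracted and are given as plain lists of numbers. The function name and the ratio value of 0.6 are illustrative assumptions, not values taken from the cited works.

```python
import math


def ratio_test_matches(descriptors, ratio=0.6):
    """Match each descriptor to its nearest neighbour in the same image,
    keeping the match only if it passes a Lowe-style ratio test."""
    matches = []
    for i, d in enumerate(descriptors):
        # Distances from descriptor i to every other descriptor, ascending.
        ranked = sorted(
            (math.dist(d, other), j)
            for j, other in enumerate(descriptors)
            if j != i
        )
        if len(ranked) < 2:
            continue
        (best_dist, best_j), (second_dist, _) = ranked[0], ranked[1]
        # Valid only if the best match is clearly better than the runner-up.
        if best_dist < ratio * second_dist:
            matches.append((i, best_j))
    return matches
```

In a copy-move setting the descriptors all come from the same image, which is why each descriptor is excluded from its own candidate list.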

One of the most cited key-point based methods for copy-move detection was proposed by Amerini et al. in [ 5 ]. The authors showed that these methods are quite robust even against rotation and scaling, but the performance is not as good when the copy-moved regions are too uniform. In fact, in this case only a few key-points can be extracted, and consequently the matching phase provides weak results.

2.2 Format based

Usually, images captured by a digital camera are encoded in JPEG format. This means that the image is divided into 8 × 8 pixel blocks, which are then DCT transformed and quantized. As a consequence, specific artefacts are generated at the border of neighbouring blocks. The authors of [ 67 ] observed that image manipulations like copy-move or splicing result in alterations in the JPEG artefact pattern, and proposed a method in which they used a sample region of the target image (assumed to be authentic) to estimate the JPEG quantization table. Then, they divided the image into blocks and computed a “blocking artefact” measure for each block. A block is considered tampered if the score given by this measure is sufficiently distant from the average value over the whole image.
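The final decision rule can be sketched as below. The blocking artefact measure of [ 67 ] is not reproduced here: the per-block scores are assumed to be already available, and the function name and the threshold of k standard deviations are hypothetical.

```python
import statistics


def flag_tampered_blocks(scores, k=2.0):
    """Return indices of blocks whose score lies more than k standard
    deviations away from the mean score of the whole image."""
    mean = statistics.fmean(scores)
    std = statistics.pstdev(scores)
    if std == 0:
        return []  # all blocks behave identically: nothing stands out
    return [i for i, s in enumerate(scores) if abs(s - mean) > k * std]
```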

Obviously, a key limitation of these methods is that they are based on specific assumptions on the format of the stored image (e.g. JPEG), and therefore they are not universally applicable.

2.3 Camera based

The basic idea exploited by these methods is that every digital camera leaves a particular “footprint” or “signature” on each image it generates. This fact can also be useful to tie an image to a specific capturing device. In [ 32 ], the authors used a set of images taken by a known camera to estimate the parameters of the already mentioned PRNU, which is a camera-specific multiplicative term that models the result of in-camera processing operations. These PRNU parameters are also extracted from the target image, which is assumed to have been taken with the same camera, and compared with the previously estimated ones. The idea is that, if a region has been spliced in from an image taken with a different camera type, this results in a discrepancy between the estimated parameters.

One of the obvious limitations of this method is that it is camera-specific: this means that a different training set of images must be used for each type of camera in order to build its specific PRNU model. Also, this method is effective just for those splicing attacks in which the spliced region is extracted from a source image taken with a different camera with respect to the one used to acquire the target image, which is not always the case.

The authors of [ 41 ], instead, leveraged chromatic aberration to detect image forgeries. The phenomenon of chromatic aberration arises from the fact that photographic lenses are not able to focus light of different wavelengths on the same point of the camera sensor. In fact, the refractive index of a material depends on the wavelength of the incident light and thus, by Snell’s Law, light of different wavelengths is refracted at slightly different angles. As a consequence, each point of the physical scene is mapped, in the RGB color channels, into points that are spatially slightly shifted from one another.

The authors of [ 41 ] therefore built a model that approximates the effect of the chromatic aberration and estimated its parameters. Forged regions usually show inconsistencies with the estimated model, and can thus be detected. In this case as well, the main drawback is that the method is camera-specific. In fact, different cameras have different chromatic aberration levels (which typically depend on the kind of lens), and consequently it is hard to set a specific threshold for anomaly detection if the camera with which the target image was taken is not known a priori.

2.4 Lighting based

Typically, when an attacker performs a copy-move or splicing attack, it is hard to ensure that the lighting conditions of the forged region are consistent with those of the surrounding image. Compensating for this effect can be hard even using professional software like Adobe Photoshop. Therefore, the basic idea of lighting (or physics) based techniques is to build a global lighting model from the target image, and then to find local inconsistencies with the model as evidence of forgery.

Different lighting models were proposed, such as those in [ 40 ] and in [ 44 ], for which least squares error approaches are usually employed for parameter estimation. Techniques like Random Sample Consensus (RANSAC) [ 30 ] are sometimes used in order to make the model more robust to outliers. The positive aspect of these methods is their wide applicability. In fact, they are not based on assumptions on the type of camera that generated the image, and they can be used to detect both copy-move and splicing attacks. However, a downside of these methods is that they depend on the physical context present in the image. In particular, if the lighting conditions are quite complex (for example, in an indoor scene), a global lighting model cannot be estimated, and thus the method cannot be applied.

2.5 Geometry based

Geometry-based methods rely on the fact that a copy-move or a splicing attack usually results in some anomalies in the geometric properties of the 3D scene from which the image is obtained.

The authors of [ 43 ] proposed a method to estimate the so-called principal point through the analysis of known planar objects, and observed that this is usually near the center of the image. They also showed that a translation of an object in the image plane results in a shift of the principal point, and thus this fact can be used as evidence of forgery.

Another interesting approach was proposed in [ 42 ]. The idea was to consider specific known objects such as billboard signs or license plates and make them planar through a perspective transformation. Once the reference objects are viewed in a convenient plane, it is possible, through a camera calibration, to make real world measurements, which can then be used to make considerations on the authenticity of the objects in the image.

Of course, these methods are based on strong assumptions on the geometry of the 3D scene. They also require a human knowledge of the real world measures taken from specific objects in the image. Consequently, their applicability is quite limited.

3 Deep Learning based methods

Deep learning methods have gained huge popularity over the past decade, and indeed they have been applied to a great variety of scientific problems. This is due to the fact that they were shown to perform particularly well on classification problems, as well as on regression and segmentation ones. For certain tasks, these methods can even outperform humans in terms of accuracy and precision. Another crucial factor that contributed to the spread of deep learning techniques is that, in contrast to conventional machine learning approaches, they do not require the researcher to manually create (craft) meaningful features to be used as input to the learning algorithm, which is often a hard task that requires domain-specific knowledge. Deep learning models, such as Convolutional Neural Networks (CNN), are in fact capable of automatically extracting descriptive features which capture those facets of the input data that are well tailored to the task at hand.

For image forgery detection too, deep learning techniques have been explored in the recent literature in order to achieve better accuracy than previously proposed, traditional methods. The techniques that we are considering can be grouped in distinct categories according to different criteria, in this case:

A) Type of detected forgery: copy-move, splicing, or both;

B) Localization property, i.e. whether the considered algorithm is able to localize the forged areas. In the case of copy-move detection, an additional question is whether the algorithm is able to distinguish between the source region and the target one, i.e. the region on which the source patch is pasted. This property is useful, for example, in a scenario in which a forensic expert is asked to analyze a tampered image in order to interpret the semantic meaning of a copy-move attack;

C) Architecture type, that is, whether or not the algorithm is an end-to-end trainable solution, i.e. one without parameters that need manual tweaking.

As discussed in Section  1.2 , DeepFakes can be regarded as a particular case of splicing attack. However, given that the vast majority of DeepFake forgeries involve face manipulations, methods that aim to detect these attacks can leverage domain-specific knowledge (e.g. face detection algorithms) that cannot be used by generic splicing detection algorithms. As such, different datasets need to be used for evaluating and comparing these methods. Therefore, DeepFake forgery detection performance cannot presently be directly compared with that of generic splicing detection algorithms. Consequently, in this paper, the discussion on the former methods is conducted separately, both in regard to employed datasets and experimental results.

For our analysis, we have selected some papers among the most recent ones that we think are particularly representative of those that can be categorized into at least one of the distinct groups that we have outlined above. A further principle that we have used for this selection is performance driven, with the added objective of being able to do a meaningful comparison (when possible), given in Section  4 . These papers are described in some detail in this Section, with the further objective of identifying if any trend in the DL overall architecture choice is emerging.

In particular, we have used the criteria A) and B) above to sort the presentation order of the papers. Methods [ 1 , 4 , 25 , 78 ], and [ 105 ] are copy-move-only specific, and are presented first in Section  3.2 . Then, methods [ 22 , 68 , 85 , 86 , 97 , 105 , 107 ], and [ 18 ], that are for both splicing and copy-move detection, are discussed next in Section  3.3 .

Besides this first separation through criterion A), we sort the techniques in each subset using criterion B), namely, [ 1 , 4 ], and [ 105 ] in the first subset possess the localization property and are discussed first. For the second subset, such property is verified by [ 85 , 107 ], and [ 18 ], which are thus described before the others. Note that methods [ 1 ] and [ 105 ] are also able to distinguish the source from the target regions.

Regarding criterion C), which is not used for sorting the methods, we remark here that end-to-end architectures can be found in [ 25 , 68 , 105 ], and [ 78 ]. The reader is referred to Table  5 for a summary of the characteristics of the described techniques.

Finally, DeepFake specific methods are discussed in Section  3.4 .

For each described method, we also discuss:

which datasets, whether public benchmark or custom ones, were used for the experimental validation;

the performance on one or more of the above datasets: metrics like accuracy, precision, localization accuracy, etc.

Therefore, before diving into a detailed overview of the deep learning based approaches, we proceed to first briefly describe in Section  3.1 some of the benchmark datasets that are typically used in the most recent literature for evaluation of the considered forgery detection methods, and summarize the employed performance metrics.

Finally, we mention that there are several other interesting works that involve deep learning as a means for forgery detection, which are however not analyzed here because their characteristics are a mixture of those of the representative works that we have selected. Some examples are [ 71 ] and [ 106 ]. In the former, a copy-move-only method is presented that leverages an AlexNet pre-trained on ImageNet as a block feature extractor, followed by a feature matching step that allows the copy-moved regions to be localized. In [ 106 ], instead, a technique for both copy-move and splicing detection is discussed, which is built upon the formulation of the forgery detection and localization task as a local anomaly detection problem. In particular, a “Z-score” feature is designed that describes the local anomaly level and is used in conjunction with an LSTM (long short-term memory) model that is trained to assess local anomalies. Note that both of these methods satisfy criterion B), i.e. they give the localization of the forged areas.

As a further remark regarding the property of being able to distinguish between source and target regions, we refer the reader to the recently published work in [ 7 ], in which a DL-based method is presented as a post-processing phase to distinguish between source and target regions, starting from the localization mask of any copy-move forgery detection technique.

3.1 Datasets description

We now provide a comprehensive list of the benchmark datasets used by a majority of the proposed copy-move, splicing and DeepFake (confined by the previously stated purposes) detection methods. In fact, most of the deep learning methods that are presented in what follows are trained and/or tested on either one of these datasets, or a custom one built upon the datasets themselves. The main characteristics of each dataset are summarized in Table  1 . Evaluation metrics are discussed next in Section  3.1.1 .

CASIA v1.0 (CASIA1) [ 24 ]

It contains 1725 color images with resolution of 384 × 256 pixels in JPEG format. Of these, 975 images are forged while the rest are original. It contains both copy-move and splicing attacks;

CASIA v2.0 (CASIA2) [ 24 ]

It contains 7491 authentic and 5123 forged color images with different sizes. The image formats comprise JPEG, BMP, and TIFF. This dataset is more challenging than CASIA1 because the boundary regions of the forged areas are post-processed in order to make the detection more difficult. It contains both copy-move and splicing attacks;

DVMM [ 101 ]

It is made of 933 authentic and 912 spliced uncompressed grayscale images in BMP format, with fixed size of 128 × 128;

MICC-F220 [ 5 ]

It is composed of 110 copy-moved and 110 original JPEG color images. Different kinds of post-processing are also performed on the copied patches, such as rotation, scaling, and noise addition;

MICC-F600 [ 5 ]

It contains 440 original and 160 tampered color images in JPEG and PNG formats. The tampered images involve multiple copy-moved regions, which are also rotated. The image sizes vary between 722 × 480 and 800 × 600 pixels;

MICC-F2000 [ 5 ]

It consists of 700 copy-moved and 1300 original JPEG images, each one with a resolution of 2048 × 1536 pixels;

SATs-130 [ 16 ]

It contains 130 images, generated from 10 authentic source images, with copy-moved regions of different sizes. Various JPEG compression levels are applied, therefore the images are stored in JPEG format;

CMFD [ 17 ]

It is composed of 48 source images in which a total of 87 regions (referred to by the authors as “snippets”), with different sizes and content (from smooth areas, e.g. the sky, to rough ones, e.g. rocks, to human-made ones, e.g. buildings), are manually selected and copy-moved. The authors also provide software that allows different post-processing steps to be applied to the forged images in a controlled way. The images are given in JPEG and PNG formats;

CoMoFoD [ 100 ]

This dataset contains 4800 original and 4800 forged images, with copy-move attacks and post-processing operations such as JPEG compression, noise adding, blurring, contrast adjustment, and brightness change. The images are stored in PNG and JPEG formats;

DS0-1 [ 19 ]

It contains 200 images, 100 of which are pristine and 100 are forged with splicing attacks. All the images are in PNG format at a resolution of 2048 × 1536 pixels. Color and contrast adjustment operations are applied as counter-forensic measures;

Korus [ 49 , 50 ]

This dataset is composed of 220 pristine and 220 forged RGB images in TIFF format. The dataset contains both copy-move and splicing attacks, performed by hand with professional editing software. The resolution of the images is of 1920 × 1080.

DFDC (DeepFake detection challenge on kaggle) [ 23 ]

It contains 4113 DeepFake videos created from a set of 1131 original ones, involving 66 subjects of various ethnicities and both genders. The video resolution varies from 180p to 2160p. All the videos are in MP4 format and the employed codec is H.264;

FaceForensics++ [ 88 ]

It is an extension of the previous dataset FaceForensics, with a total of 1.8 million images created with 4 different state-of-the-art DeepFake generation methods ( DeepFakes [ 27 ], Face2Face [ 98 ], FaceSwap [ 51 ], and NeuralTextures [ 99 ]), starting from 4000 videos downloaded from YouTube. Compared to other previously proposed datasets, it is bigger by at least one order of magnitude. The dataset contains videos of different sizes, such as 480p, 720p, and 1080p. The videos are in MP4 format, and the codec used is again H.264.

Celeb-DF [ 59 ]

The authors of this dataset specifically created it in order to overcome the lack of realism of a large portion of DeepFake videos in previously published datasets (such as the original FaceForensics). It comprises a total of 5639 DeepFake videos and 590 pristine videos in MPEG4.0 format (H.264 coded), with different resolutions and a standard frame rate of 30 fps. The average length is about 13 seconds (corresponding to a total of more than 2 million frames). Another feature that sets this dataset apart from previously proposed ones is that it includes a pronounced variety of ethnicities and a balance between genders.

3.1.1 Evaluation metrics

Performance metrics in the considered forgery detection applications are the same as those used for binary classification problems. There are two classes, authentic or forged, that can be attributed either to the whole image or at the pixel level (through appropriate masks).

Table  2 recaps the terminology for binary classification evaluation using the so-called confusion matrix. Starting from ground-truth classes and the labels output by the detection system, the 4 outcomes given as TP, FP, TN, and FN can be counted according to the concordance or discordance of the labels with the corresponding classes.

The sum of every element in Table  2 is equal to the total number of queries T , namely the population (or the number of objects in the ground-truth). Among these T queries, P have a positive ground-truth class and N have a negative ground-truth class, therefore T = P + N . In forgery detection, as in many other binary classification problems, each element in Table  2 is suitably divided by P or N , and thus expresses the corresponding fraction, or rate, as follows:

\[ TPR = \frac{TP}{P}, \qquad FNR = \frac{FN}{P}, \qquad TNR = \frac{TN}{N}, \qquad FPR = \frac{FP}{N} \tag{2} \]

Please note that in some papers the R (rate) part can be omitted; however, no confusion is possible, since the given number lies in the [0,1] interval. Given the outcomes in Table  2 and the rates in ( 2 ), additional metrics can be obtained as follows:
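In their standard form, and using the notation of Table  2 , these metrics read as below. The exact set intended here is an assumption, based on the metrics reported for the methods in the following sections:

\[ \text{precision} = \frac{TP}{TP + FP}, \qquad \text{recall} = TPR = \frac{TP}{P}, \qquad \text{accuracy} = \frac{TP + TN}{T}, \qquad F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \]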

An additional metric is the AUC (Area Under the ROC Curve), i.e. the two-dimensional area under the whole Receiver Operating Characteristic (ROC) curve, which plots TPR against FPR as the decision threshold of the detection algorithm varies.

These measures, or slight variations thereof, are extensively used in the papers described in what follows. There are commonly used synonyms for some of them, for example, the false alarm rate or fallout is the same as FPR and sensitivity is a synonym for recall. Such occurrences have been adjusted for clarity’s sake.
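As a minimal sketch of how these quantities are computed in practice, the snippet below derives the rates from the four confusion-matrix counts and approximates the AUC with the trapezoidal rule. The function names are hypothetical, and labels are assumed to be 1 for forged and 0 for authentic.

```python
def classification_rates(tp, fp, tn, fn):
    """Rates and derived metrics from the four confusion-matrix counts."""
    p, n = tp + fn, tn + fp          # positives and negatives in the ground truth
    precision = tp / (tp + fp)
    recall = tp / p                  # same as TPR
    return {
        "TPR": recall,
        "FNR": fn / p,
        "TNR": tn / n,
        "FPR": fp / n,
        "precision": precision,
        "accuracy": (tp + tn) / (p + n),
        "F1": 2 * precision * recall / (precision + recall),
    }


def roc_auc(scores, labels):
    """Area under the ROC curve, sweeping the threshold over the scores."""
    ranked = sorted(zip(scores, labels), reverse=True)
    p = sum(labels)
    n = len(labels) - p
    tp = fp = 0
    points = [(0.0, 0.0)]            # (FPR, TPR) pairs
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Samples with equal scores cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n, tp / p))
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )
```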

3.2 Copy-move specific methods

According to the grouping and sorting criteria of the DL-based techniques discussed in this work, we begin in this Section by introducing copy-move only forgery detection methods.

3.2.1 R. Agarwal et al. [ 4 ]

The authors of [ 4 ] proposed a method specific for copy-move detection that uses deep learning in conjunction with a segmentation step and further feature extraction phases. First, the M × N input image is segmented with the Simple Linear Iterative Clustering (SLIC) procedure [ 2 ]. In order to do so, a 5-D feature vector is built for each pixel, by concatenating its RGB color values and spatial x , y coordinates. A clustering is then performed on these features, and the segmented patches (referred to as “super pixels”) are given as output.

Then, multi-scale features are extracted from each super-pixel \(S_k\) with a VGGNet [ 95 ] network. This process involves the following steps:

Given the segmented image, a binary mask \(BM_k\) for each super-pixel is obtained as:

Let \(f \in \mathbb {R}^{M^{\prime } \times N^{\prime } \times D}\) be the output of the first convolutional layer, where \(M^{\prime }\) , \(N^{\prime }\) are the spatial dimensions, and D is the number of output channels. \(RF(l, m)\) denotes the receptive field on the input image in the \((l, m)\) position. A continuous value mask \(MConv_{k} \in \mathbb {R}^{M^{\prime } \times N^{\prime }}\) is then computed as follows:

The super-pixel-level feature map \(g_k\) is obtained by multiplying the output of the convolutional layer with the mask:

The previous steps are repeated for each convolutional stage of the VGGNet. By using max-pooling after each convolutional layer, increasingly high-level features are extracted for each super-pixel (see Fig.  2 ).
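The masking steps above can be illustrated with a rough sketch. It rests on two assumptions that are not stated in the text: the continuous mask is taken to be the average of the binary mask over each receptive field, and the receptive fields are simplified to non-overlapping square windows. All function names are hypothetical.

```python
import numpy as np


def binary_mask(labels, k):
    """Binary mask: 1 where a pixel belongs to super-pixel k, 0 elsewhere."""
    return (np.asarray(labels) == k).astype(float)


def conv_mask(bm, stride):
    """Continuous mask at feature-map resolution.

    Assumption: the value at (l, m) is the mean of the binary mask over a
    non-overlapping stride x stride window standing in for RF(l, m).
    """
    h, w = bm.shape
    hp, wp = h // stride, w // stride
    cropped = bm[:hp * stride, :wp * stride]
    return cropped.reshape(hp, stride, wp, stride).mean(axis=(1, 3))


def superpixel_features(f, mconv):
    """Weight every channel of the feature map f (M' x N' x D) by the mask."""
    return f * mconv[:, :, None]
```

With this choice, feature locations lying entirely inside the super-pixel keep their full response, locations outside are zeroed, and boundary locations are attenuated in proportion to their overlap.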

Fig. 2: In [ 4 ], the super-pixel segmentation map is given, along with the target image, as input to a VGGNet. Features at different levels are extracted for each of the input super-pixels. Finally, high-level features undergo a so-called “relocation phase” to obtain a localization mask at the original resolution

Next, a “relocation” phase of the higher-levels features (with lower spatial resolution) is employed in order to find a pixel-level position of the features themselves in the input image. In this way, a set of key-points, with the corresponding multi-level features, is obtained for each patch.

Finally, a key-points matching phase is performed by comparing their associated features, and the copy-moved patches are identified by a further comparison of the super-pixels to which the key-points belong. This procedure is referred to as ADM (Adaptive Patch Matching) by the authors.

The VGGNet is trained on the MICC-F220 dataset. The same dataset is used for testing, though it is not specified which portion of it is used for training and which one is used for testing. The metrics used for evaluation are TNR, FNR, FPR, precision, TPR (recall), and accuracy. The reported results are:

TNR: 97.1%;

Precision: 98%;

Accuracy: 95%.

Therefore, the reported accuracy of the method is high, but at the cost of a large number of false positives.

Also, it should be noted that the reported performance refers to the MICC-F220 dataset, which has only 220 images and a limited number of types of copy-move attacks. For these reasons, results obtained on just this dataset are not as statistically significant as those obtained on other, larger copy-move datasets, such as MICC-F2000 or CoMoFoD.

3.2.2 Y. Abdalla et al. [ 1 ]

The authors of [ 1 ] proposed a three-branch method for copy-move detection. An overview of the considered architecture, which is ultimately based on a GAN model, is shown in Fig.  3 . To recap, a GAN is composed of two different deep learning modules: the Generator (G) and the Discriminator (D).

The generator is a Unet that takes as input an image I and gives as output a forged version of the image itself \(I^{\prime }=G(I)\) ;

The discriminator is a CNN network that takes as input either an original image I or a generated image \(I^{\prime }=G(I)\) . The output is a binary mask, in which each pixel is labelled as either authentic or forged.

The purpose of D is to discriminate between original pixels and pixels that were manipulated by G. G, instead, aims to generate forgeries \(I^{\prime }=G(I)\) , with \(I^{\prime }\simeq I\) , in order to fool the discriminator into wrongly classifying the forged areas of \(I^{\prime }\) as authentic. The training of the two modules can be seen as a competitive game between them, at the end of which the generator is able to create forgeries that are difficult to detect, and the discriminator is able to correctly classify them.

Fig. 3: Architecture of the GAN-based method in [ 1 ]. The upper branch implements a per-pixel binary classifier (forged/pristine), while the bottom one is used to find similarities between regions. The outputs of these branches are then combined to obtain the final output mask in which, if the image is considered forged, source and target regions can be distinguished

In addition to the described GAN network, the authors used a custom CNN model specifically designed to detect similarities between regions (i.e. copy-moved areas). This CNN is composed of different convolutional layers as well as custom ones that perform a self-correlation operation on the input features. Then, different pooling steps are used to extract more compact features that are fed to fully connected layers. Finally, a mask-decoder layer is used to reconstruct, from the extracted features, a binary mask that represents the similar regions in the image.

As a final decision step in the forgery detection pipeline, a linear SVM model is used for classification. The SVM is fed with an input vector that combines the output of the GAN and the output of the similarity detection CNN. If the image is classified as copy-moved by the SVM model, an additional mask is given as output by comparing the two input binary masks obtained by the GAN and the custom CNN, in which not only the forged areas are labelled, but also the source region used for the copy-move attack is identified (with a different label).

Two datasets unrelated to forgery detection, namely, the CIFAR-10 [ 52 ] and MNIST [ 55 ] datasets, were used to pre-train and test the GAN network. In detail, the CIFAR-10 dataset contains 60,000 color images of size 32 × 32, divided into 10 distinct classes (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck), while MNIST is composed of 60,000 grayscale images depicting handwritten digits. After the pre-training phase, the other two modules of the detection pipeline were trained and validated on a custom dataset composed of a total of 1792 pairs of forged and corresponding authentic images, sampled from MICC-F600 and two other datasets, the “Oxford Building Dataset” [ 80 ] and the “IM” [ 12 ].

The obtained detection performances on this composite dataset are as follows:

\(F_1\)-score: 88.35%;

Precision: 69.63%;

Recall: 80.42%.

In conclusion, it would have been interesting if the authors had evaluated the performance of their method on one of the public benchmark datasets (such as MICC-F2000 or CASIA2) rather than on a custom, composite one. One aspect of this method that should be further noted is that it is one of the few that gives as output not only a localization of the forged areas, but also the source regions of the copy-move attacks.

3.2.3 Y. Wu et al. [ 105 ]

In this paper, a pure end-to-end deep neural network pipeline (referred to as BusterNet by the authors) is presented as a copy-move forgery detection solution. A key aspect of this method, as in [ 1 ], is that it not only gives a pixel-level localization of the copy-move attacks, but also distinguishes between the source and the target regions.

The detection pipeline is composed of two branches and a fusion module (see Fig.  4 ):

The first branch, called Mani-det , is responsible for the detection of manipulations in the image, and it is composed of the following modules: a feature extractor, a mask decoder, and a binary classifier. The feature extractor is a standard CNN that coincides with the first 4 blocks of the VGG16 network [ 95 ].

The mask decoder is used in order to restore the input resolution of the image, via a de-convolution process, and it uses the BN-inception and BilinearUpPool2D layers [ 104 ].

The binary classifier, which is implemented as a convolutional layer followed by a sigmoid activation function, produces a binary manipulation mask, in which the pasted patches of the copy-move attacks are localized;

The second branch, referred to as Simi-det , is used in order to generate a copy-move binary mask, in which similar regions in the input image are detected. In particular, the detection process can be summarized as follows: first, a CNN is used as feature extractor. Then, a self-correlation module is used to compute all-to-all feature similarities. These are given, as input, to a percentile pooling unit, which collects useful statistics. A mask decoder is used to up-sample the obtained tensor to the size of the input image. Finally, a binary classifier is applied in order to obtain the copy-move mask;

The fusion unit takes as input the computed features from the two branches. It is constituted by a convolutional layer followed by a soft-max activation, that gives as output a three-class prediction mask: pristine, source region, and target region.

Note that the CNN networks used in the Simi-det and in the Mani-det branches have the same architecture, but they have different weights, since they are trained independently. The same applies for the mask-decoder and the binary classification modules.
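As a rough illustration of the self-correlation and percentile-pooling steps of the Simi-det branch, the following sketch replaces CNN feature maps with short Python lists. The function names, the cosine-style normalization, and the choice of sampled ranks are assumptions made for illustration only; they are not taken from [ 105 ].

```python
import math

def self_correlation(features):
    """All-to-all cosine similarity between feature vectors.

    `features` holds one vector per spatial location (N vectors in total).
    Returns an N x N similarity matrix.
    """
    normed = []
    for f in features:
        norm = math.sqrt(sum(v * v for v in f)) or 1.0  # avoid division by zero
        normed.append([v / norm for v in f])
    return [[sum(a * b for a, b in zip(u, v)) for v in normed] for u in normed]

def percentile_pooling(sim_row, num_bins=4):
    """Sort one row of similarities (descending) and sample it at fixed ranks.

    The result has a fixed length, whatever the number of locations N is.
    """
    ordered = sorted(sim_row, reverse=True)
    n = len(ordered)
    ranks = [round(k * (n - 1) / (num_bins - 1)) for k in range(num_bins)]
    return [ordered[r] for r in ranks]

# Toy feature map: locations 0 and 2 carry the same feature (a "copied" region).
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
sim = self_correlation(features)
pooled = [percentile_pooling(row) for row in sim]
```

In this toy case the second-largest pooled value of location 0 is again 1 (the largest one being the trivial self-match), which is the kind of statistic a subsequent decoder could use to flag duplicated content.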

figure 4

Architecture of BusterNet [ 105 ]. The Mani-det branch classifies each pixel of the input image as forged or pristine. The Simi-det branch, instead, aims to find similarities between pixels in the input image. Finally, a fusion module takes as input the outputs of the two branches and outputs a classification for each pixel: source, target, or pristine

In order to train their model, the authors built a dataset of 100,000 images by automatically performing copy-move operations from source pristine images. For each tampered image, they built three ground-truth pixel-level masks:

A three-class mask \(M_{s,t}\) with the following labels: pristine, source copy-move, and target copy-move;

A binary mask \(M_{man}\) with the following labels: pristine and manipulated. Note that the source region here is considered pristine;

A binary mask \(M_{sim}\) with the following labels: pristine and copy-move. Note that the source and target regions are both labeled as copy-move.
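The relationship between the three masks can be made explicit with a short sketch: both binary masks follow directly from the three-class one. The integer label encoding and the function name below are hypothetical and chosen only for illustration.

```python
# Hypothetical label encoding for the three-class mask.
PRISTINE, SOURCE, TARGET = 0, 1, 2

def derive_binary_masks(mask_st):
    """Derive the manipulation and similarity masks from the three-class mask."""
    # Manipulation mask: only the pasted (target) region counts as manipulated.
    mask_man = [[1 if v == TARGET else 0 for v in row] for row in mask_st]
    # Similarity mask: source and target regions are both marked as copy-move.
    mask_sim = [[1 if v in (SOURCE, TARGET) else 0 for v in row] for row in mask_st]
    return mask_man, mask_sim

mask_st = [
    [0, 1, 1, 0],
    [0, 0, 0, 0],
    [0, 2, 2, 0],
]
mask_man, mask_sim = derive_binary_masks(mask_st)
```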

The authors adopted the following three-stage strategy for training:

Each branch is trained independently. To do so, the copy-move mask \(M_{sim}\) and the manipulation mask \(M_{man}\) are used as ground truth for the Simi-det and Mani-det branches, respectively;

The weights of each branch are frozen and the fusion module is trained with the three-class mask \(M_{s,t}\) as ground truth;

A fine-tuning step is performed by un-freezing the weights of the two branches and training the whole network end-to-end.

The performance of the method was evaluated on CASIA2. As CASIA2 contains both copy-move and splicing attacks, the authors selected a total of 1313 copy-move-only images along with their authentic counterparts, thus obtaining a test set of 2626 images. The authors used the following metrics: precision, recall, and \(F_1\) score, computed both at image level and at pixel level. For the latter, the authors used two different approaches: (i) aggregate TPR, FPR, and FNR over the whole dataset, and (ii) compute precision, recall, and the \(F_1\) score for each image and then average the results over all of them. The obtained results are reported in Table 3.
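The two pixel-level protocols can give noticeably different numbers on the same predictions. The sketch below contrasts them on two toy masks; it assumes that protocol (i) amounts to summing true-positive, false-positive, and false-negative pixel counts over all images before computing the metrics, which is our reading of the description above. All function names are illustrative.

```python
def confusion_counts(pred, truth):
    """Count true positives, false positives, and false negatives in flat 0/1 masks."""
    tp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 1)
    return tp, fp, fn

def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

def aggregate_protocol(pairs):
    """Protocol (i): pool the pixel counts of all images, then compute the metrics."""
    totals = [0, 0, 0]
    for pred, truth in pairs:
        for i, count in enumerate(confusion_counts(pred, truth)):
            totals[i] += count
    return precision_recall_f1(*totals)

def per_image_protocol(pairs):
    """Protocol (ii): compute the metrics per image, then average them."""
    scores = [precision_recall_f1(*confusion_counts(p, t)) for p, t in pairs]
    return tuple(sum(s[i] for s in scores) / len(scores) for i in range(3))

# Two toy images (flattened masks): a perfect detection and an over-detection.
pairs = [
    ([1, 1, 0, 0], [1, 1, 0, 0]),
    ([1, 1, 1, 1, 0, 0, 0, 0], [1, 0, 0, 0, 0, 0, 0, 0]),
]
agg = aggregate_protocol(pairs)   # precision 0.5, recall 1.0
avg = per_image_protocol(pairs)   # precision 0.625, recall 1.0
```

Protocol (i) weights each image by its number of pixels, whereas protocol (ii) gives every image the same weight, which explains the gap between the two precision values.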

3.2.4 M. Elaskily et al. [ 25 ]

In [ 25 ], a method for copy-move forgery detection is presented. It is purely DL-based, that is, no separate features are pre-computed. In detail, the authors built a CNN with the following architecture:

Six convolutional layers, each one followed by a max pooling layer;

A Global Average Pooling (GAP) layer, used to reduce the number of parameters of the network and to limit the probability of overfitting. This layer acts as a fully-connected dense layer;

A soft-max classification with two classes: authentic or forged.

Therefore, the method does not output a localization of the forged regions, but only a global classification of the image. It has been evaluated on 4 benchmark datasets: MICC-F220, MICC-F600, MICC-F2000, and SATs-130. Since each of the listed datasets is too small to train a CNN, the authors merged them into a new one more suitable for the training phase. The obtained dataset is thus composed of 2916 images: 1010 tampered and 1906 original.

The authors used the following metrics in order to evaluate the performance of the method: accuracy, TPR, TNR, FNR, and FPR. The metrics were evaluated by k-fold cross-validation (with k = 10). To elaborate, for each validation round a random split of the composed dataset is performed: 90% for training and 10% for testing. Here, the 10% of test images are all selected from one of the 4 constituent sets of the composed dataset.
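For reference, a generic 10-fold partition of a dataset of 2916 images can be sketched as follows. This is a plain k-fold split under our own assumptions (function name, seed, round-robin fold assignment); it does not model the additional constraint that the test images of a round come from a single constituent dataset.

```python
import random

def k_fold_splits(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Round-robin assignment gives folds whose sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test

splits = list(k_fold_splits(2916, k=10))
```

Each round uses roughly 90% of the images for training and the remaining 10% for testing, and every image is tested exactly once across the 10 rounds.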

The obtained metrics are presented in Table 4, and they are very high. However, we observe that the testing was performed on a small percentage (10%) of the composed dataset, which contains images from all the 4 benchmark datasets. As a consequence, test and training images are possibly highly correlated. Hence, they likely contain similar kinds of forgeries, that is, with similar dimensions and types of post-processing operations. It would have been interesting if the authors had trained their model on one dataset, like MICC-F2000, and evaluated it on another one, such as MICC-F600, in order to better assess the robustness and generalization capability of the model.

3.2.5 J. Ouyang et al. [ 78 ]

The method presented in [ 78 ] is an end-to-end deep learning approach that features a CNN for binary classification (forged vs. authentic) of the whole image. The crucial aspect of this approach is the use of the transfer learning technique, as follows:

A CNN with the same architecture as AlexNet [ 53 ] is used as base-model;

The classification layer is changed in order to have two classes as output: authentic or forged;

The weights of the AlexNet model trained on the ImageNet dataset [ 20 ] are used as initial weights for the training step;

A first training phase is carried out by freezing the weight values of the first levels of the network;

A second training phase (which is often referred to as “fine tuning”) is performed by unfreezing all the network weights, and by using a smaller learning-rate value than the one used in the first training step (such as \(10^{-5}\)).

Since, as already mentioned, the public forgery detection datasets are not large enough to train a CNN without overfitting, the authors artificially created copy-move forgeries by randomly selecting rectangles from an image and pasting them in different locations of the same image. By adopting this approach, they built the following datasets:

“data1”, that contains (i) all the 1338 color images from the UCID dataset [ 91 ], and (ii) a total of 10,000 forgeries obtained by applying the above discussed copy-move operations to the original images;

“data2”, that contains (i) all the 8189 color images from the Oxford flower dataset [ 75 ], and, again, (ii) a total of 10,000 forgeries obtained with copy-move operations on the original images.
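The rectangle copy-paste procedure used to build these forgeries can be sketched as follows. The function name, the size limits, and the representation of an image as a list of rows are assumptions for illustration; the sampling details actually used in [ 78 ] are not specified here.

```python
import random

def random_copy_move(image, rng, min_size=2, max_size=None):
    """Copy a random rectangle and paste it at a different location of the same image.

    `image` is a list of rows. Returns the forged image and a binary mask
    that is 1 on the pasted (target) pixels.
    """
    h, w = len(image), len(image[0])
    max_size = max_size or min(h, w) // 2
    rh, rw = rng.randint(min_size, max_size), rng.randint(min_size, max_size)
    sy, sx = rng.randint(0, h - rh), rng.randint(0, w - rw)
    # Resample the target position until it differs from the source position.
    ty, tx = sy, sx
    while (ty, tx) == (sy, sx):
        ty, tx = rng.randint(0, h - rh), rng.randint(0, w - rw)
    # Copy the patch first, so that overlapping source/target regions are safe.
    patch = [row[sx:sx + rw] for row in image[sy:sy + rh]]
    forged = [row[:] for row in image]
    mask = [[0] * w for _ in range(h)]
    for dy in range(rh):
        for dx in range(rw):
            forged[ty + dy][tx + dx] = patch[dy][dx]
            mask[ty + dy][tx + dx] = 1
    return forged, mask

# Toy 8 x 8 "image" in which every pixel has a unique value.
image = [[y * 8 + x for x in range(8)] for y in range(8)]
forged, mask = random_copy_move(image, random.Random(0))
```

Forgeries produced this way have sharp rectangular borders and no post-processing, which is consistent with the limited generalization to realistic forgeries discussed below.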

The training of the network was done on both the “data1” and “data2” datasets. Data-augmentation with flipping and cropping operations was performed on the authentic images in order to balance the distribution of the two classes.

For the model performance evaluation, the “data1”, “data2”, and CMFD datasets were used. The obtained results are reported in terms of test detection error (which is the measure complementary to accuracy). They are as follows: 2.32%, 2.43% and 42% for “data1”, “data2” and CMFD, respectively.

From these results it is clear that, even if the model performs well on the custom datasets, it has poor generalization capabilities for real-scenario forgeries, such as the ones contained in CMFD, likely due to its basic approach to generating forgeries. However, this simple approach could still be useful if richer copy-move datasets were available, or if a more sophisticated algorithm, such as a GAN (see Section 3.2.2 ), were used to build synthetic forgeries.

3.2.6 Amit Doegara et al. [ 22 ]

The authors of [ 22 ] proposed a simple yet effective method for copy-move detection.

An AlexNet model [ 53 ] pre-trained on the MICC-F220 dataset is used to extract deep feature vectors of 4096 elements from the input images (note that, in order to obtain the feature vector, the classification layer of the AlexNet network is removed).

An SVM model is then fed with the extracted features and used to obtain a binary classification on the whole image: either pristine or forged.

The training process is carried out in two phases (see Fig.  5 ). First, the pretrained AlexNet CNN is used to extract features both from the pristine images and from the forged ones. As a pre-processing step, the images are resized to match the input dimension required by the AlexNet model, which is 227 × 227 pixels. Then, the SVM classifier is trained on the obtained dataset of features and corresponding binary labels.

figure 5

Detection approach of [ 22 ]. A pre-trained AlexNet is used as feature extractor. The extracted features, either from pristine or forged images, are then used to train an SVM classifier to obtain the final decision on the input image: forged vs. pristine

The authors evaluated their method on the MICC-F220 dataset, and it obtained the following results:

FPR: 12.12%;

Precision: 89.19%;

Accuracy: 93.94%.

Even if the accuracy is quite high, there is still room for improvement, as the false positive rate is not particularly low, especially if compared with other approaches, such as [ 5 ], in which the reported FPR was 8%, along with a TPR of 100%.

A final note on the choice of the MICC-F220 dataset for performance evaluation is in order. This dataset is also used for pre-training the AlexNet model used by the authors. In the paper, it is not clear which portion of the dataset is used for training and which for testing. Therefore, it is not possible to evaluate if and how much the reported results are affected by bias due to correlation between training and testing sets. In order to clear up these issues, the authors could have used different datasets for the two phases, such as MICC-F2000 or MICC-F600.

3.3 Copy-move and splicing methods

We now move on to discuss those methods designed to detect both copy-move and splicing forgeries.

3.3.1 Cozzolino and Verdoliva [ 18 ]

In this work, the authors presented a deep learning approach that aims to extract a camera model noise pattern (referred to as “noise print”) as a means to detect forgeries.

A digital camera, due to the on-board processing operations carried out on the signal received from the sensor, leaves on the generated picture a distinctive pattern of artifacts that are model-specific. This can be exploited, in a forensic scenario, to estimate which camera model a certain picture was taken with. This idea can also be applied for the purpose of forgery detection. For instance, in the case of a spliced image, if the patch used to create the composition was extracted from a photo taken by a different camera model, then inconsistencies between the camera model artifacts could be leveraged in order to detect the tampering.

A useful property of the camera noise pattern is that it is not space invariant. This means that two patches extracted at different locations from the same image are characterized by different noise artifacts. By exploiting this property, this method can also be used for copy-move detection, as the camera noise pattern at the target location of the copy-move attack is hardly consistent with the expected one at that particular location. The authors used the pre-trained denoising CNN presented in [ 108 ] as the starting point for their approach. This network was trained with a great number of paired input-output patches, where the input is a noisy image and the output is its corresponding noise pattern.

In order to estimate the camera model noise print, a further training of the previous architecture was performed. Since a mathematical model describing the camera noise pattern is not available, it is not easy to build a dataset with pairs of an input image and its corresponding desired camera noise print. In order to overcome this problem, the authors used the following key idea: patches extracted from images taken with the same camera model, and at the same location, should share similar camera noise print, while this should not be true for patches coming from different camera models or from different spatial locations. Following this insight, the authors built a Siamese architecture, in which two identical Residual CNNs (initialized with the optimal weights computed in the first training phase) are coupled and the prediction of one network is used as desired output for the other one and vice-versa. The overall architecture is shown in Fig.  6 .

figure 6

Architecture of the Siamese network proposed in [ 18 ]. Two residual networks (with shared weights) are trained to extract noise patterns that are given as input to a binary classifier. The model learns to extract similar noise patterns for positive labels (patches from the same camera) and different ones for negative labels (patches from different cameras and/or different spatial locations)

In the training phase, the two CNNs are fed with patches \({x^{a}_{i}}\) and \({x^{b}_{i}}\) , respectively. These patches can be:

extracted from images taken from different camera models;

extracted from images taken from the same camera model, but at different spatial locations;

extracted from images taken from the same camera model, at the same location.

The input pair \(({x}_{i}^{a}, {x}_{i}^{b})\) is assigned, as expected output, a positive label \(y_{i} = +1\) (“similar camera noise print”) in the third case, and a negative label \(y_{i} = -1\) (“different camera noise print”) in the first and second cases. The output of the Siamese architecture is obtained by means of a binary classification layer that takes as input the noise prints extracted by the two CNNs. This output is then compared to the expected label \(y_{i}\) and the error is back-propagated through the network. This way, the network is pushed towards generating a similar noise print for patches from the same camera model (and at the same location), and different ones for patches corresponding to different camera models and/or locations. As a result, the network learns to enhance the specific model artifacts and discard the irrelevant features, while reducing the high-level scene content of the images. Once the network is trained, the noise print can be obtained as output of one of the two CNNs from an input target image.
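The labelling rule for the training pairs can be summarized in a few lines. The dictionary keys and the function name are hypothetical and serve only to make the three cases explicit.

```python
def pair_label(patch_a, patch_b):
    """Return +1 for same camera model and same location, -1 otherwise."""
    same_model = patch_a["camera_model"] == patch_b["camera_model"]
    same_location = patch_a["location"] == patch_b["location"]
    return 1 if (same_model and same_location) else -1

# The three cases listed above (camera names and coordinates are made up).
different_models = pair_label(
    {"camera_model": "A", "location": (0, 0)},
    {"camera_model": "B", "location": (0, 0)},
)
different_locations = pair_label(
    {"camera_model": "A", "location": (0, 0)},
    {"camera_model": "A", "location": (64, 64)},
)
same_model_and_location = pair_label(
    {"camera_model": "A", "location": (0, 0)},
    {"camera_model": "A", "location": (0, 0)},
)
```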

In order to detect and localize forgeries, the authors used the EM (Expectation - Maximization) algorithm. With the assumption that the pristine and manipulated parts of the target image are characterized by different camera noise models, the algorithm searches for anomalies with respect to the dominant model. This is done by extracting features from the noise print image at a regular sampling grid, that are then used to train the EM algorithm. A heat-map with the probability of manipulation for each pixel is given as output.
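To give an intuition of how EM can separate a dominant model from an anomalous one, the sketch below fits a two-component Gaussian mixture to scalar values and returns, for each value, the posterior probability of belonging to the minority component. This is a deliberately simplified stand-in: the scalar features, the initialization, and the two-Gaussian model are our assumptions and do not reproduce the features or the statistical model used in [ 18 ].

```python
import math

def gaussian_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_two_gaussians(values, iterations=50):
    """Fit a two-component 1-D Gaussian mixture with EM.

    Returns, for each value, the posterior probability of the component
    with the smaller mixing weight (the "anomalous" one).
    """
    means = [min(values), max(values)]   # crude initialization at the extremes
    variances = [1.0, 1.0]
    weights = [0.5, 0.5]
    resp = []
    for _ in range(iterations):
        # E-step: posterior responsibility of each component for each value.
        resp = []
        for x in values:
            p = [weights[k] * gaussian_pdf(x, means[k], variances[k]) for k in range(2)]
            total = sum(p) or 1e-300
            resp.append([pk / total for pk in p])
        # M-step: re-estimate weights, means, and variances.
        for k in range(2):
            nk = sum(r[k] for r in resp) or 1e-12
            means[k] = sum(r[k] * x for r, x in zip(resp, values)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, values)) / nk
            variances[k] = max(var, 1e-6)  # floor to avoid a degenerate component
            weights[k] = nk / len(values)
    minor = 0 if weights[0] < weights[1] else 1
    return [r[minor] for r in resp]

# Eight "pristine" values around 0 and three "anomalous" values around 5.
values = [0.0, 0.1, -0.1, 0.05, -0.05, 0.2, -0.2, 0.0, 5.0, 5.1, 4.9]
anomaly_prob = em_two_gaussians(values)
```

Arranging such probabilities on the sampling grid and interpolating them to the image size would yield a heat-map of the kind described above.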

The authors tested their method on 9 different datasets for forgery detection, containing many kinds of tampering, such as copy-move, splicing, inpainting, face-swap, GAN-generated patches, and so on. Here, we only report the results on the DSO-1 [ 19 ] and Korus [ 49 ] datasets, as they contain only splicing and copy-move attacks (with possible post-processing operations). The obtained \(F_1\) score is 78% on DSO-1 and 35% on Korus. The authors also computed the AUC score, which is 82.1% and 58.3%, respectively.

3.3.2 Y. Zhang et al. [ 107 ]

The authors of this paper proposed the following approach for image forgery detection:

Feature extraction and pre-processing. The image is first converted into the YCbCr color space, then it is divided into 32 × 32 overlapping patches. For each component of the YCbCr space a total of 450 features are extracted from each patch by leveraging the 2-D Daubechies Wavelet transform;

The extracted features from each patch are used to train a 3-layer Stacked AutoEncoder (SAE), which is an unsupervised model. On top of the SAE, an additional MLP (Multi-Layer Perceptron) is employed for supervised learning and fine tuning;

Context learning. In order to detect forged regions that span across multiple 32 × 32 patches, each patch-level prediction from the MLP is integrated with the predictions of the neighbouring patches. Specifically, for each patch p, a neighbouring patch set N(p) with cardinality k + 1 is defined as:

\[N(p) = \left\{ y_{p}^{0}, y_{p}^{1}, \ldots, y_{p}^{k} \right\}\]

where \({y^{0}_{p}}\) is the output feature of the SAE for the patch p , and \({y}_{p}^{i}\) , with i ≥ 1 is the feature of its i -th neighbouring patch;

Finally, a binary output Z ( p ) (forged/authentic) is obtained by computing the average of the MLP predictions of the neighbouring patches and comparing it to a threshold, as follows:

where the authors set k = 3 and α = 0.5.
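The context-learning decision can be sketched as an average-and-threshold rule over the MLP scores of a patch and its neighbours. The sketch assumes that the average includes the patch itself (k + 1 scores in total) and that the comparison is "greater than or equal to"; both choices, as well as the function name, are assumptions and may differ from the exact formulation in [ 107 ].

```python
def context_decision(mlp_scores, alpha=0.5):
    """Average the MLP scores of a patch and its neighbours, then threshold.

    `mlp_scores` holds k + 1 values in [0, 1]: the score of the patch itself
    followed by the scores of its k neighbours. Returns 1 (forged) or 0 (authentic).
    """
    mean_score = sum(mlp_scores) / len(mlp_scores)
    return 1 if mean_score >= alpha else 0

# With k = 3, each decision uses four scores.
supported = context_decision([0.9, 0.8, 0.2, 0.7])   # mean 0.65
isolated = context_decision([0.9, 0.1, 0.2, 0.1])    # mean 0.325
```

The second example shows the intended effect: a patch with a high score of its own is not flagged when its neighbours do not support the prediction.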

For the training and testing stages of the model, a total of 1000 images were randomly extracted both from the CASIA1 and the CASIA2 datasets. In particular, 770 images were used for training and the remaining 230 for testing. The authors manually built a pixel-wise ground-truth mask for each image in order to train their model at the patch level. Likewise, a patch-level ground-truth mask for each of the test images was also built, as shown in Fig. 7 .

figure 7

Construction of patch-wise ground-truth from the pixel-level mask as in [ 107 ]

In order to evaluate the performance, the authors used the following metrics: accuracy, FPR (fallout), and precision, where the usual rates are again defined at patch-level. The method can be applied for copy-move detection, as well as splicing detection. Note that this method gives a coarse localization of the forged areas (at patch-level).

The reported performance is 43.1%, 57.67%, and 91.09% for the fallout, precision, and accuracy metrics, respectively. Even if these results are not quite satisfactory at first glance, it should be considered that these metrics are evaluated at patch level, and hence are more restrictive than the same metrics evaluated at image level.

3.3.3 N. H. Rajini [ 85 ]

This technique involves two separate CNN models that are used for different purposes in the forgery detection pipeline. It is able to detect both splicing and copy-move attacks. A schematic view of the method is shown in Fig.  8 , and it can be summarized as follows:

Pre-processing stage. The image is first converted into the YCbCr space. Then, a Block DCT is applied on each Y, Cb, and Cr component. In order to reduce the effect of the actual image content, horizontal and vertical de-correlation is computed from the DCT coefficients. Finally, a set of features are extracted from these values by means of a Markov Random Chain model;

Forged/authentic decision. The extracted features are given as input to the first CNN model, which gives a binary classification of the image as either forged or authentic;

Type of attack recognition. In the case that the image is recognized as forged, a second CNN is then employed to classify the type of attack: copy-move or splicing;

Post-processing. If a copy-move attack is detected by the second network, further features are extracted and used in order to localize the forged regions.

figure 8

Multi-step strategy proposed in [ 85 ]. First, features are extracted from the YCbCr converted image to classify the image as authentic or forged. If the image is classified as forged, a CNN is used to distinguish between copy-move and splicing attacks. Finally, in the case of copy-move attack, another feature extraction and localization procedure is employed to obtain a map of the forged regions

The authors evaluated their method on the CASIA2 dataset. In particular, they used 80% of the images for training and the remaining 20% for testing. The procedure was repeated 50 times with differently extracted training and testing sets, and the reported performance was computed as an average over all the experiments. The TPR, TNR, and accuracy are used as evaluation metrics.

Although the described method can provide as output the localization of the forged areas, the authors only reported performance at a global level (that is, the forged vs. non forged image assessment). The obtained results are the following:

98.91%, 99.16%, and 99.03% for TPR, TNR, and accuracy, respectively, in the case of copy-move attacks;

98.98%, 99.24%, and 99.11% for TPR, TNR, and accuracy, respectively, in the case of splicing attacks.

The reported performance metrics are very high. In addition, they are statistically meaningful, as they are evaluated on the sizable CASIA2 dataset. It would have been interesting, though, if the authors had also evaluated the localization accuracy of their method, in a manner similar to [ 107 ].

3.3.4 F. Marra et al. [ 69 ]

The authors proposed a full-resolution, end-to-end deep learning framework for forgery detection.

Typically, due to limited memory resources, deep learning models, such as CNNs, are designed to take as input images with small sizes. So, in order to process high resolution images, either a resize to match the network input size or a patch-level analysis (with possible overlapping) is needed. For computer-vision tasks in which only a high level understanding of the image content is required, such as object recognition, this is usually not an issue. But, for the purpose of forensic analysis, resizing is not recommended, as it tends to destroy important information that is usually stored at high frequencies. Patch-level analysis can also be a limiting factor, as usually the context of the whole image is important as well for forgery detection purposes.

In order to address these problems, the authors built a deep learning framework that takes as input full-resolution images and performs image-level predictions: “forged” or “pristine”. The framework is composed of three consecutive blocks:

Patch-level feature extraction. This is a CNN that takes as input a patch extracted from the target image and gives as output a feature vector;

Feature aggregation module. This block takes as input the extracted feature vectors from the overlapping patches and aggregates them in order to obtain an image-level feature. The authors considered different methods for feature aggregation, such as average pooling, min/max pooling, and average square pooling;

Decision step. It is a binary classification process, that was implemented with two fully-connected layers.
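The aggregation step reduces a variable number of patch-level feature vectors to a single image-level vector. The sketch below illustrates the pooling variants mentioned above; it assumes that "average square pooling" means averaging the squared feature values, and the function name and mode strings are illustrative.

```python
def aggregate(features, mode="avg"):
    """Pool a list of patch-level feature vectors into one image-level vector."""
    columns = list(zip(*features))  # one tuple per feature dimension
    if mode == "avg":
        return [sum(c) / len(c) for c in columns]
    if mode == "max":
        return [max(c) for c in columns]
    if mode == "min":
        return [min(c) for c in columns]
    if mode == "avg_square":
        # Assumed meaning: mean of the squared values in each dimension.
        return [sum(v * v for v in c) / len(c) for c in columns]
    raise ValueError(f"unknown pooling mode: {mode}")

# Three patches, each described by a 2-dimensional feature vector.
patch_features = [[1.0, -2.0], [3.0, 0.0], [-1.0, 2.0]]
pooled_avg = aggregate(patch_features, "avg")          # [1.0, 0.0]
pooled_max = aggregate(patch_features, "max")          # [3.0, 2.0]
pooled_sq = aggregate(patch_features, "avg_square")
```

Whatever the number of patches, the output length equals the feature dimension, which is what allows the subsequent fully-connected layers to operate on images of arbitrary size.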

The whole framework is trained end-to-end. This is not the case for other similar approaches, in which the patch feature extractor, the feature aggregation module, and the classification layers are trained independently of one another.

Note that, when a large input image is processed during training, a great amount of memory is required to simultaneously store all the overlapping patches and to compute their corresponding feature vectors. Also, in the forward pass, the activations of all the intermediate layers need to be stored for the computation of the loss gradients (needed to update the network weights) in the subsequent back-propagation pass. In order to solve this issue, the authors exploited the gradient check-pointing strategy [ 13 ]. This technique consists in saving the activations only at certain check-point layers during the forward pass. In the back-propagation phase, the activations are re-computed up to the next check-point layer and used to compute the gradients. As a consequence, less memory is required at the cost of an increased computational time during the back-propagation.
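The trade-off behind gradient check-pointing can be illustrated on a toy chain of scalar functions. In the sketch below, the forward pass keeps only one activation every `every` layers, and the backward pass re-computes the missing ones segment by segment. The layer list and function names are made up for illustration and are unrelated to the network used in [ 69 ].

```python
import math

# Each layer is a pair (f, df): a scalar function and its derivative.
LAYERS = [
    (lambda x: 2.0 * x, lambda x: 2.0),
    (math.tanh, lambda x: 1.0 - math.tanh(x) ** 2),
    (lambda x: x * x, lambda x: 2.0 * x),
    (lambda x: x + 1.0, lambda x: 1.0),
]

def forward_checkpointed(x, layers, every):
    """Run the chain, storing only the inputs of layers 0, every, 2*every, ..."""
    checkpoints = {}
    for i, (f, _) in enumerate(layers):
        if i % every == 0:
            checkpoints[i] = x
        x = f(x)
    return x, checkpoints

def backward_checkpointed(layers, checkpoints, every):
    """Back-propagate segment by segment, re-computing activations on the fly."""
    grad = 1.0
    for start in sorted(checkpoints, reverse=True):
        end = min(start + every, len(layers))
        # Re-compute the input of every layer in this segment from its checkpoint.
        inputs = [checkpoints[start]]
        for f, _ in layers[start:end - 1]:
            inputs.append(f(inputs[-1]))
        # Chain rule through the segment, last layer first.
        for (_, df), x_in in zip(reversed(layers[start:end]), reversed(inputs)):
            grad *= df(x_in)
    return grad

def backward_full(x, layers):
    """Reference implementation that stores every activation."""
    inputs = []
    for f, _ in layers:
        inputs.append(x)
        x = f(x)
    grad = 1.0
    for (_, df), x_in in zip(reversed(layers), reversed(inputs)):
        grad *= df(x_in)
    return grad

output, checkpoints = forward_checkpointed(0.3, LAYERS, every=2)
grad_ckpt = backward_checkpointed(LAYERS, checkpoints, every=2)
grad_full = backward_full(0.3, LAYERS)
```

Both functions return the same derivative, but the check-pointed version stores two activations instead of four, at the price of re-running part of the forward computation.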

The authors evaluated their method on the DSO-1 and Korus datasets, obtaining an AUC score of 82.4% and 65.5%, respectively.

3.3.5 Y. Rao et al. [ 86 ]

An overview of the architecture of this method is shown in Fig. 9 . It starts by taking an input RGB image of size M × N and dividing it into overlapping patches \(X_i\), i = 1, …, T, of size p × p with p = 128, where T is the total number of patches. Each patch \(X_i\) is given as input to a 10-layer CNN that gives a softmax binary output \(Y_i\), as follows:

The \(Y_i\) vector represents a compact feature that describes the patch i. A global feature vector is then obtained by concatenating the \(Y_i\) of each image patch:

A mean or max function is then applied for each of the 2 dimensions:

Finally, \(\hat {Y}\) is given as input to a SVM classifier that performs a global two-class prediction on the whole image: authentic vs. forged.

figure 9

Architecture of the technique in [ 86 ]. Overlapping patches are extracted from the input image and feature vectors are extracted from each of them. A global feature, computed by averaging along the spatial dimension, is then fed to an SVM model, which is used to obtain the final global classification: forged vs. authentic

A key aspect of this technique is the following: in order to suppress the image perceptual content and instead focus the detection phases on the subtle artefacts introduced by the tampering operations, the authors initialized the first CNN layer weights with a set of high-pass filters that are used for residual maps computation in SRM (Spatial Rich Models). This step also has the benefit of speeding up the training phase of the network.
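The effect of initializing the first layer with high-pass filters can be seen by applying one such filter to simple inputs. The sketch below uses a 5 × 5 kernel that is commonly cited in the SRM literature; whether this particular kernel is among those used in [ 86 ] is not verified here, and the filtering routine is a plain illustration rather than the authors' implementation.

```python
# A widely cited 5 x 5 SRM high-pass kernel (to be divided by 12).
SRM_KERNEL = [
    [-1,  2,  -2,  2, -1],
    [ 2, -6,   8, -6,  2],
    [-2,  8, -12,  8, -2],
    [ 2, -6,   8, -6,  2],
    [-1,  2,  -2,  2, -1],
]

def filter_valid(image, kernel, scale=1.0):
    """Correlate `image` with `kernel`, keeping only fully covered positions."""
    kh, kw = len(kernel), len(kernel[0])
    h, w = len(image), len(image[0])
    out = []
    for y in range(h - kh + 1):
        row = []
        for x in range(w - kw + 1):
            acc = sum(kernel[i][j] * image[y + i][x + j]
                      for i in range(kh) for j in range(kw))
            row.append(acc / scale)
        out.append(row)
    return out

flat = [[100] * 7 for _ in range(7)]                       # uniform region
ramp = [[3 * x + 2 * y for x in range(7)] for y in range(7)]  # smooth gradient
spike = [[0] * 7 for _ in range(7)]
spike[3][3] = 1                                            # isolated pixel change

res_flat = filter_valid(flat, SRM_KERNEL, scale=12.0)
res_ramp = filter_valid(ramp, SRM_KERNEL, scale=12.0)
res_spike = filter_valid(spike, SRM_KERNEL, scale=12.0)
```

Smooth content (the uniform region and the gradient) produces a zero residual, while the isolated pixel change survives the filtering, which is why such kernels help the network focus on subtle artefacts rather than on scene content.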

The CNN was trained on the CASIA1, CASIA2, and DVMM datasets. This method can be applied both for splicing and copy-move detection, because the CNN and the SVM are trained on the aforementioned datasets, which contain both types of forgeries. Note that the SVM classification step is only used for the CASIA datasets.

The detection performance, in terms of accuracy, is 98.04%, 97.83%, and 96.38% on the CASIA1, CASIA2, and DVMM datasets, respectively. These accuracy values are high, in particular in the case of CASIA2, which not only has the most images (and is consequently the most statistically relevant, as noted before), but also contains both splicing and copy-move attacks. It should be noted, though, that this method only gives a global binary prediction on the image, and no localization of the forged areas is performed.

3.3.6 M. T. H. Majumder et al. [ 68 ]

The approach described in [ 68 ] is also based on a CNN to classify an image as authentic or forged. In contrast to the previously discussed methods, however, in which deep learning networks were composed of a high number of layers, in this case a shallow CNN model, composed of just two convolutional layers, was employed. Also, no max-pooling steps were used for dimensionality reduction, as this goal was achieved by exploiting large convolutional filters, with size of 32 by 32 and 20 by 12 for the first and the second layer, respectively.

This strategy is based on the following idea: in deep neural networks, complex high-level features are learnt at deeper levels, while simpler visual structures, such as edges and corners, are learnt at the first ones. Hence, in order to detect the artefacts introduced by forgery operations, low-level features are more likely to be useful. As a consequence of this choice, the number of parameters of the network is limited, thus allowing for training with less risk of overfitting.

The CASIA2 dataset was used both for training and testing. The authors trained their shallow network multiple times in an independent fashion, using different pre-processing strategies, such as: raw input (that is, no pre-processing), DCT-based transformation, and YCbCr space conversion. They showed that the best results were obtained without any kind of pre-processing.

To further reduce the risk of overfitting, real-time data augmentation was applied during training, with transformations such as shearing, zooming, and vertical and horizontal flipping. An accuracy of 79% was obtained with this training strategy, and, as we said, without pre-processing.

As a comparative experiment, the authors also applied the aforementioned transfer learning technique, by using two deep learning models with a high number of layers that were pre-trained on the ImageNet dataset: the VGG-16 [ 95 ] and the well-known ResNet-152. Despite the fact that these models perform well on standard image classification problems, they were not able to transfer the acquired knowledge to this specific task, and a substantial underfitting issue was observed in the training phase. The outcome of this test validated the choice of a shallow model instead of a deep one.

The main contribution of this work is therefore the usage of a shallow network, in which low-level features (rather than high-level ones) are exploited as a means to detect the subtle artefacts generated by tampering, and thus can be used for the forgery detection task. Also, the authors showed that large convolutional filters can be exploited in place of max-pooling layers to reduce the number of network parameters, therefore reducing the risk of overfitting. Despite this, the obtained accuracy still leaves room for improvement.

3.3.7 R. Thakur et al. [ 97 ]

In [ 97 ], a filtering scheme based on image residuals is exploited: the residuals, rather than the raw images, are fed as input to a CNN for classification (as usual, original vs. forged). This approach is tailored to emphasize the high frequencies in the image data, which, as is often assumed by other approaches as well, carry most of the possible tampering traces. The image residuals are computed as follows:

The image is resized to 128 × 128 pixels and converted to grayscale;

The second-order median filter residuals (SDMFR) are then calculated as follows. Given an image, a first median filtering is applied:

where w is a 5 × 5 window and \(x_{i,j}\) is the intensity of the pixel at position (i, j). Then, a second median filtering is applied to the median-filtered image:

Finally, the residuals are obtained by subtracting the second order median filtered image from the first order filtered image:

Laplacian filter residuals (LFR) are also computed, with the following algorithm. Let:

be the Laplacian kernel filter. The Laplacian-filtered image is obtained by convolving the original image with K , that is:

The residuals are then calculated as the difference between the filtered image and the original one:

Both the SDMFR and the LFR residuals are fed to the CNN classification network as a combined input. The CNN model comprises 6 convolutional layers, each one followed by a max pooling step (except the first one). Two fully connected layers are then used before the final binary softmax classifier.
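The two residual computations can be sketched in plain Python as follows. The border handling (edge replication) and the 4-neighbour Laplacian kernel are assumptions made for this illustration, since the exact kernel and padding used in [ 97 ] are not reproduced here; the signs of the residuals follow the verbal description above.

```python
import statistics

def _clamp(v, n):
    return min(max(v, 0), n - 1)

def median_filter(image, size=5):
    """Median filter with edge replication (border handling is an assumption)."""
    h, w = len(image), len(image[0])
    r = size // 2
    return [[statistics.median(
                image[_clamp(y + dy, h)][_clamp(x + dx, w)]
                for dy in range(-r, r + 1) for dx in range(-r, r + 1))
             for x in range(w)] for y in range(h)]

def sdmfr(image):
    """Second-order median filter residual: first-order minus second-order output."""
    first = median_filter(image)
    second = median_filter(first)
    return [[a - b for a, b in zip(r1, r2)] for r1, r2 in zip(first, second)]

# Assumed 4-neighbour Laplacian kernel.
LAPLACIAN = [[0, 1, 0], [1, -4, 1], [0, 1, 0]]

def lfr(image, kernel=LAPLACIAN):
    """Laplacian filter residual: filtered image minus original image."""
    h, w = len(image), len(image[0])
    filtered = [[sum(kernel[dy + 1][dx + 1] * image[_clamp(y + dy, h)][_clamp(x + dx, w)]
                     for dy in (-1, 0, 1) for dx in (-1, 0, 1))
                 for x in range(w)] for y in range(h)]
    return [[f - o for f, o in zip(fr, orow)] for fr, orow in zip(filtered, image)]

flat = [[10] * 6 for _ in range(6)]
res_median = sdmfr(flat)
res_laplacian = lfr(flat)
```

On a uniform image the median residual vanishes, while the Laplacian residual reduces to the negated image, since the Laplacian response of a constant region is zero.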

The authors trained and tested their network on two different datasets: the CoMoFoD and the BOSSBase [ 8 ]. In the case of the first dataset, a split of 70% and 30% has been made for training and validation, respectively. In the case of the second one, as it is composed of 10,000 raw pristine images, the authors applied median filtering to each image in order to simulate a tampering operation, thus obtaining a total of 20,000 images (half authentic and half filtered). Then, they split the obtained dataset into 70% for training and 30% for validation.

The accuracy obtained on both datasets is high: 95.97% for the CoMoFoD dataset, and 94.26% for BOSSBase. However, it would have been interesting if the authors had also tested their method, without retraining, on other benchmark datasets for forgery detection, such as CASIA2, MICC-F2000, or MICC-F600, in order to assess its generalization capability.

3.4 DeepFake methods

We now present a few of the most recent DeepFake-specific detection methods that achieved the best results on the previously introduced datasets for DeepFake detection evaluation (see Section 3.1 ). The selection has been made according to the criteria previously outlined, namely, suitability for the still-image case.

3.4.1 A. Rössler et al. [ 88 ]

In [ 88 ], the authors developed a method to detect image DeepFakes that is based upon the XceptionNet architecture proposed by Google in a previous paper [ 15 ]. The main peculiarity of this model is the employment of a custom layer, called SeparableConv , whose purpose is to decouple the depth-wise convolution from the spatial one, thus reducing the number of weights of the model itself.
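The saving in weights obtained with separable convolutions can be quantified with a simple parameter count. The sketch below compares a standard convolution with a depth-wise separable one, ignoring biases; the layer sizes in the example are arbitrary and are not taken from the XceptionNet architecture.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights of a standard k x k convolution (biases ignored)."""
    return k * k * c_in * c_out

def separable_conv_params(k, c_in, c_out):
    """Weights of a depth-wise separable convolution (biases ignored)."""
    depthwise = k * k * c_in   # one k x k spatial filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes the channels
    return depthwise + pointwise

# Arbitrary example: 3 x 3 kernel, 128 input channels, 256 output channels.
n_standard = standard_conv_params(3, 128, 256)    # 294912
n_separable = separable_conv_params(3, 128, 256)  # 33920
```

For this configuration the separable layer needs roughly 8.7 times fewer weights than the standard one.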

The detection pipeline can be summarized as follows: a state-of-the-art face detection/tracking method [ 98 ] is used to extract the face region from the image/frame, which is cropped as a slightly larger rectangle than the size of the face in order to include some contextual information.
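The enlarged crop can be illustrated as follows. This is a minimal sketch: the function name `expand_box` and the enlargement factor of 1.3 are assumptions made for illustration, since the text above only states that the rectangle is slightly larger than the face.

```python
def expand_box(box, image_size, scale=1.3):
    """Enlarge a face box (x0, y0, x1, y1) around its centre by `scale`
    and clamp the result to the image borders."""
    x0, y0, x1, y1 = box
    width, height = image_size
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    half_w = (x1 - x0) * scale / 2
    half_h = (y1 - y0) * scale / 2
    return (
        max(0, int(round(cx - half_w))),
        max(0, int(round(cy - half_h))),
        min(width, int(round(cx + half_w))),
        min(height, int(round(cy + half_h))),
    )
```

Clamping matters for faces close to the frame border, where the enlarged rectangle would otherwise fall outside the image.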

The obtained bounding box is then fed to a modified XceptionNet for binary classification. In order to do this, the final fully-connected layer of the original XceptionNet is substituted with a fully-connected layer with binary output.

The authors adopted the following transfer-learning strategy to train the model:

The weights of each layer from the original XceptionNet are initialized with the ImageNet ones, while the fully-connected layer is randomly initialized;

The network is trained for 3 epochs, with all the weights frozen except the ones in the fully-connected layer;

All the weights are unfrozen and the network is trained for a further 15 epochs (fine-tuning step).
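The two-phase schedule above can be expressed as a small helper that reports which parts of the network are trainable at a given epoch. The names `trainable_groups`, `backbone`, and `classifier` are hypothetical; in an actual framework the returned flags would be applied to the corresponding layers.

```python
def trainable_groups(epoch, head_only_epochs=3, fine_tune_epochs=15):
    """Return which parameter groups are trainable at a zero-based epoch index.

    During the first `head_only_epochs` epochs only the new fully-connected
    classifier is updated; afterwards every layer is fine-tuned.
    """
    total = head_only_epochs + fine_tune_epochs
    if not 0 <= epoch < total:
        raise ValueError(f"epoch must be in [0, {total})")
    if epoch < head_only_epochs:
        return {"backbone": False, "classifier": True}
    return {"backbone": True, "classifier": True}


if __name__ == "__main__":
    for epoch in range(18):
        print(epoch, trainable_groups(epoch))
```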

The authors released three different versions of their model: the first one was trained on uncompressed videos, while the second and the third ones were trained on videos compressed with the H.264 codec at quantization levels of 23 and 40, respectively. We denote these variants as Xception_a, Xception_b, and Xception_c, respectively.

While Xception_a achieved the best results on the FaceForensics++ dataset, with a detection accuracy of 99.7%, its performance dropped when evaluated on DFDC and CelebDF, with accuracy scores under 50% in both cases. Xception_b achieved the best accuracy on DFDC (72.2%), while Xception_c performed better on CelebDF, with an accuracy of 65.5%.

3.4.2 Huy H. Nguyen et al. [ 72 ]

In [ 72 ], a novel forgery detection framework called Capsule-Forensics was proposed. Its main feature is that it uses a particular kind of neural network, the Capsule Network (first introduced in [ 37 ]), as the binary detector, instead of the more usual convolutional neural networks.

Capsule Networks were designed in order to efficiently model hierarchical relationships between objects in an image, and to infer not only the probability of observation of objects, but also their pose.

The main idea behind Capsule Networks is the concept of “capsule”. A capsule is an ensemble of neurons that describe a set of properties for a given object. In contrast to single neurons, in which the scalar output represents the probability of observation of a certain feature, the output of a capsule is an activation vector, in which each element represents the activation of one of the capsule’s neurons, i.e., the value corresponding to the associated feature.

Capsules are arranged in different layers in a hierarchical fashion: a parent capsule receives, as input, the output of different child capsules. The connections between child and parent capsules (i.e., which outputs are kept and which are discarded for the next layer) are not fixed at the beginning, as they are for max/average pooling layers (usually employed in standard CNNs), but are dynamically computed by means of a routing-by-agreement algorithm.

Thanks to this procedure, child capsules whose predictions are closest to the predictions of certain parent ones become more and more “attached” to these parents, and a connection can be considered established. The interested reader is referred to the original paper for a more detailed explanation of how the hierarchical connections are built.
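As an illustration of routing by agreement, the sketch below follows the commonly described dynamic-routing formulation: coupling coefficients are obtained with a softmax, parent outputs pass through a "squash" non-linearity, and the agreement between a child's prediction and a parent's output is measured with a dot product. It is an assumption that this matches the exact variant used in [ 72 ]; the function names are hypothetical.

```python
import math


def squash(v):
    """Shrink a vector so that its length lies in [0, 1), keeping its direction."""
    sq = sum(x * x for x in v)
    if sq == 0.0:
        return [0.0] * len(v)
    scale = sq / (1.0 + sq) / math.sqrt(sq)
    return [scale * x for x in v]


def softmax(xs):
    """Numerically stable softmax of a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route(predictions, iterations=3):
    """predictions[i][j] is the vector that child i predicts for parent j.

    Returns the parent outputs of the last iteration and the coupling
    coefficients obtained after the final agreement update.
    """
    n_children, n_parents = len(predictions), len(predictions[0])
    dim = len(predictions[0][0])
    logits = [[0.0] * n_parents for _ in range(n_children)]
    parents = []
    for _ in range(iterations):
        coupling = [softmax(row) for row in logits]
        parents = []
        for j in range(n_parents):
            weighted = [
                sum(coupling[i][j] * predictions[i][j][d] for i in range(n_children))
                for d in range(dim)
            ]
            parents.append(squash(weighted))
        # Children whose prediction agrees with a parent get a stronger link to it.
        for i in range(n_children):
            for j in range(n_parents):
                logits[i][j] += sum(
                    p * v for p, v in zip(predictions[i][j], parents[j])
                )
    return parents, [softmax(row) for row in logits]
```

When several children predict similar vectors for the same parent, their coupling to that parent grows at each iteration, which is the "attachment" effect described above.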

Among the advantages of Capsule Networks compared to CNNs, a remarkable fact is that they have fewer parameters, as neurons are grouped in capsules and the connections between layers are between capsules and not directly between neurons. Also, thanks to the presence of pose matrices, they are robust against changes in the viewpoint from which objects are seen in the image. This is not true for CNNs, which need to be trained on many possible rotations and transformations in order to generalize well to unseen transformations.

The proposed method is designed for different forensics tasks, such as (i) DeepFakes detection, and (ii) computer-generated frame detection, both for image and video content.

The detection pipeline (shown in Fig.  10 ) comprises the following elements:

Pre-processing phase. It depends on the specific forensic task at hand, e.g., for DeepFakes detection it involves a face detection algorithm in order to extract the face region, while for CGI detection it consists in patch extraction from the input image. For video content the frames are separated and fed one by one to the subsequent steps;

Feature extraction. This is done by using the first layers of a VGG-19 network pre-trained on the ILSVRC dataset [ 90 ]. These weights are fixed during training;

Capsule Network. It is the core of the detection method, involving three primary capsules (children) and two output capsules “Real” and “Fake” (parents). The predicted label is computed as in ( 18 ):

where $V_1 \in \mathbb{R}^M$ and $V_2 \in \mathbb{R}^M$ represent the output capsules, and $M$ is their dimension;

Post-processing phase. Like the pre-processing step, this is task-specific: the scores are averaged among patches for computer-generated image detection, or among frames for video input.
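The last two stages of the pipeline can be sketched as follows. The per-sample score is computed here by applying a softmax across the two output capsules in each of the M dimensions and averaging the result; this specific form is an assumption made for illustration. The averaging over patches or frames follows the post-processing description above, while the 0.5 threshold and the function names are hypothetical.

```python
import math


def predicted_score(v_real, v_fake):
    """Probability of 'Fake' for one input: softmax over the two output
    capsules in each dimension, averaged over the M dimensions (assumed form)."""
    probs = []
    for r, f in zip(v_real, v_fake):
        m = max(r, f)
        e_r, e_f = math.exp(r - m), math.exp(f - m)
        probs.append(e_f / (e_r + e_f))
    return sum(probs) / len(probs)


def aggregate(scores, threshold=0.5):
    """Average per-patch or per-frame scores and turn the mean into a label."""
    mean = sum(scores) / len(scores)
    return mean, ("Fake" if mean > threshold else "Real")
```

For a video, `predicted_score` would be evaluated once per frame and the resulting list passed to `aggregate`.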

Fig. 10: Overview of method [ 72 ]. Note that pre-processing and post-processing stages are task-dependent, e.g. for DeepFake detection in the former a face tracking algorithm is used to extract and normalize the face region, while for CGI detection this step consists in the extraction of overlapping patches

The achieved detection accuracy is very high on FaceForensics++, with a score of 96.6%, but it is lower on the more challenging datasets DFDC and CelebDF, with accuracies of 53.3% and 57.5%, respectively.

3.4.3 Y. Li et al. [ 57 ]

In [ 57 ], the authors proposed a deep learning method to detect DeepFakes based on the following observation: typically, DeepFake generation algorithms tend to leave distinctive artifacts in the face region due to resolution inconsistencies between the source image/video and the target one. In particular, GAN-synthesized face images usually have a fixed, low resolution and, in order to be applied to the target video, an affine warping needs to be performed to match the source face to the facial landmarks of the target face. If the resolutions of the source and target videos do not match, or if the facial landmarks of the target person are far from the standard frontal view, these artifacts become increasingly evident.

The authors trained four different CNNs, namely a VGG-16, a ResNet50, a ResNet101, and a ResNet152, to detect these kinds of artifacts. In particular, they used a face-tracking algorithm to extract regions of interest containing the face as well as the surrounding area, which are then fed to the networks. A portion of the surrounding area is included to let the model learn the difference between the face area, which contains artifacts in the case of positive (fake) examples, and the surrounding one, which does not.

The authors used the following training strategy. Instead of generating positive examples by means of a GAN-based synthesis algorithm, which requires a considerable amount of time and computational resources to train and run, they generated positive examples by simulating the warping artifacts with standard image processing approaches, starting from negative (real) images. The processing steps are summarized as follows:

The face region is extracted with a face tracking algorithm;

The face is aligned and multiple scaled versions are created by down/up-sampling the original one. Then, one scale is randomly selected and Gaussian-smoothed. This has the effect of simulating the mismatch in resolutions between source and target videos;

The smoothed face is then affine-warped to match the face landmarks of the original face;

Further processing can be done in order to augment the training data, such as brightness change, gamma correction, contrast variations, and face shape modifications through face landmarks manipulation.
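The resolution-mismatch step of this procedure can be sketched as below. Only the down/up-sampling and the Gaussian smoothing are covered; face extraction, alignment, random scale selection, and the affine warp back onto the landmarks are omitted. The block averaging, nearest-neighbour upsampling, 3 x 3 kernel, and function names are assumptions chosen for brevity, not the authors' implementation.

```python
def downsample(img, factor):
    """Average non-overlapping factor x factor blocks
    (image sides are assumed to be divisible by factor)."""
    h, w = len(img), len(img[0])
    return [
        [
            sum(img[y + dy][x + dx] for dy in range(factor) for dx in range(factor))
            / factor ** 2
            for x in range(0, w, factor)
        ]
        for y in range(0, h, factor)
    ]


def upsample(img, factor):
    """Nearest-neighbour upsampling by an integer factor."""
    h, w = len(img) * factor, len(img[0]) * factor
    return [[img[y // factor][x // factor] for x in range(w)] for y in range(h)]


def gaussian_blur(img):
    """3 x 3 Gaussian smoothing, kernel [1, 2, 1] x [1, 2, 1] / 16, replicated borders."""
    k = (1, 2, 1)
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy = min(max(y + dy, 0), h - 1)
                    xx = min(max(x + dx, 0), w - 1)
                    acc += k[dy + 1] * k[dx + 1] * img[yy][xx]
            out[y][x] = acc / 16
    return out


def simulate_low_res_face(face, factor=2):
    """Lose detail by down/up-sampling, then smooth the result."""
    return gaussian_blur(upsample(downsample(face, factor), factor))
```

The output has the same size as the input face but lacks its fine detail, which mimics a low-resolution synthesized face pasted into a higher-resolution frame.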

The detection accuracies obtained are: 93.0% for FaceForensics++, 75.5% for DFDC and 64.6% for CelebDF.

4 Performance comparison

In this Section we proceed to compare the previously described forgery detection methods from a performance perspective.

We begin by comparing techniques specific for copy-move and splicing, while DeepFake detection algorithms are discussed in a separate section. In fact, even if DeepFakes can be seen as a particular kind of splicing attack, they are mostly performed on faces. As a consequence, DeepFake detection techniques must be evaluated with datasets specialized in face manipulations, while the standard splicing datasets, such as CASIA, contain pictures of generic scenes. Furthermore, these methods can successfully exploit domain-specific knowledge, such as face landmarks, mouth/eyes-based features, and so on, which is not the case for generic splicing detection algorithms.

4.1 Splicing and copy-move methods

In Table 5 the performance of all previously discussed copy-move and splicing detection techniques is reported. For each method, we also indicate the type of detected attacks (splicing, copy-move, or both) and whether or not it outputs the localization of the forged areas.

As a first comment, from the sparseness of the table it is easy to see that it is very challenging to compare the different techniques strictly in terms of performance. This is due to a number of reasons. The first and most obvious one is that approaches designed specifically for copy-move detection cannot be easily evaluated on the CASIA datasets (both v1.0 and v2.0), as these also contain splicing attacks (an exception can be made for method [ 105 ], which was evaluated on a copy-move-only subset of the dataset itself, see Section 3.2.3 ). In this case, copy-move specific datasets, such as MICC-F220, MICC-F600, and MICC-F2000, should be considered for evaluation.

The second reason is that the presented methods, especially the copy-move specific ones, are mostly neither trained nor tested on the same benchmark sets. This is due to the fact that some of the standard datasets are either too small for training a highly parameterized deep learning model, or contain only naive attacks (such as MICC-F220, in which copy-moved regions are square or rectangular patches). For this reason, different authors instead built their own custom datasets to fulfill their specific requirements, either by merging together the benchmark ones or by artificially generating them. However, the downside of this approach is the difficulty of comparing the results with those achieved by other techniques.

Therefore, the comparison between different techniques, when it is possible, is performed by grouping them on the basis of specific criteria, such as the type of detected attacks, the dataset used for evaluation, and the localization property.

We start by focusing our analysis on the methods designed for copy-move only forgeries, then proceed to both copy-move and splicing detection techniques, and conclude with DeepFake specific ones.

4.1.1 Copy-move detection methods

We start the present analysis by first comparing methods [ 4 , 25 ], and [ 22 ], as they have all been tested on the MICC-F220 dataset. The first method achieved a slightly better accuracy and a considerably better FPR (see Table 4 ) than the other two, along with a considerably better accuracy. In addition, [ 25 ] has been shown to achieve perfect results on MICC-F600 and almost perfect ones on MICC-F2000, which are more significant evaluation datasets (see Section 3.2.4 ). However, it should be considered that [ 25 ] only gives as output a global decision on the authenticity of the image, while [ 4 ] also provides the location of the forgery.
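For reference, the two figures of merit used in this comparison can be computed from binary labels as in the following sketch, where 1 denotes a forged image and 0 an authentic one. The function names and the toy labels are illustrative only.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels, with 1 = forged, 0 = authentic."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def accuracy(y_true, y_pred):
    """Fraction of images that are classified correctly."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fp + tn + fn)


def false_positive_rate(y_true, y_pred):
    """Fraction of authentic images that are wrongly flagged as forged."""
    _, fp, tn, _ = confusion_counts(y_true, y_pred)
    return fp / (fp + tn)


if __name__ == "__main__":
    truth = [1, 1, 1, 0, 0, 0, 0, 0]
    preds = [1, 1, 0, 1, 0, 0, 0, 0]
    print(accuracy(truth, preds))             # 0.75
    print(false_positive_rate(truth, preds))  # 0.2
```

A low FPR matters in forensic use because every false alarm flags an authentic image as manipulated.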

Regarding the forgery localization property, it is worth noting that the techniques presented in [ 1 ] and [ 105 ] make it possible not only to detect the copy-moved regions, but also to distinguish them from the source patches used to perform the attack. This property is useful in real forensic scenarios, in which it is important to understand the semantic aspects of an image manipulation.

A further interesting feature of [ 1 ] is the adoption of a GAN to generate increasingly hard-to-detect forgeries, which are used to train the discriminator network. This is an original approach to address the problem of data scarcity that plagues many existing standard datasets. However, from a performance point of view, it is hard to compare this method to the other ones, as it was evaluated on a custom dataset and not on one of the benchmark datasets. This is not the case for [ 105 ], which was evaluated on CASIA2. Note that, even if its accuracy is slightly worse than that of [ 68 ], it has the source plus target localization property mentioned before, while the latter gives as output only a global classification of the image.

4.1.2 Splicing and copy-move detection methods

These techniques are best suited to a general application context, in which the type of attack is not known a priori, so it is better to cover as many attacks as possible. In particular, we consider the methods tested on CASIA2, which is likely the most significant dataset for copy-move and splicing detection evaluation, both for its sheer size and for the various applied post-processing operations.

Among the methods that we discussed, the one presented in [ 85 ] obtained the best overall accuracy. It also gives as output the localization of the forged areas, which, as we mentioned, is relevant in many application contexts. Looking at its forgery detection pipeline, it features both a pre-processing stage, in this case based on YCbCr space conversion and DCT compression, and a post-processing phase that performs localization through further feature extraction. Therefore, the good performance that it achieved indicates that an exclusively end-to-end deep learning model, without any pre-processing or post-processing, could indeed be a sub-optimal choice for the task of forgery detection.

On the same note, another comment can be made about the method in [ 68 ]. Even if its performance is worse than the others in terms of accuracy, the proposed approach is quite interesting because it involves a “shallow” deep learning model. This allows reducing not only the number of network parameters (and consequently the training time), but also the risk of over-fitting. This idea is in contrast to the common trend in computer vision of using ever deeper networks to achieve high accuracy on specific datasets, an accuracy that usually cannot be reproduced on slightly different ones, which is a clear indicator of over-fitting issues.

A remark should be made on the approach proposed in [ 18 ]. This method has a wide applicability even outside the field of forgery detection. In fact, the possibility to extract the camera noise pattern and suppress the high-level scene content of a target image is of great utility in other forensic scenarios as well as for sophisticated camera-specific denoising applications. It is also important to note that the authors evaluated the performance of their algorithm on different datasets, which contain a wide set of forgery attacks such as copy-move, splicing, inpainting, GAN-synthesized content, face-swap, etc., thus proving its wide applicability and robustness. Still, it would have been interesting to have the detection results on other more classic benchmark data, such as CASIA2, thus allowing a better comparison with other existing methods.

4.1.3 DeepFake detection methods

In Table 6 , the performance of DeepFake detection methods is reported.

As can be immediately observed from the table, no single method performs best on all three considered benchmark datasets: [ 57 ] reports the best accuracy on DFDC, while [ 88 ](a) performs best on FaceForensics++, and [ 88 ](c) achieved the highest accuracy on Celeb-DF. It must be considered, though, that FaceForensics++ was built by the same authors as [ 88 ] (all three versions). As such, it is, to some extent, expected that these methods perform better on that particular dataset. Nonetheless, [ 88 ](c) still has the best results on Celeb-DF, while [ 88 ](b) has only slightly worse performance than [ 57 ] on DFDC, thus showing how the XceptionNet-based strategy can be the go-to choice for its generalization capability on different datasets.

Finally, we observe that, when evaluated against challenging and realistic datasets such as Celeb-DF, DeepFake detection methods still need to be improved, as the best accuracy obtained is just around 65%. This allows us to infer that the research field of DeepFake detection is still lagging behind, especially considering the fact that DeepFake generation algorithms are still largely improving year after year.

5 Conclusions

In this work we provide a survey of some of the recent AI-powered methods (from 2016 onward) for copy-move and splicing detection that achieve the best results in terms of accuracy on the standard benchmark datasets. Several reviews and surveys have been published on this topic, but most of them mainly concern traditional approaches like those based on key-points/blocks, segmentation, or physical properties. Instead, we focused our analysis on recently published, deep learning based methods, because they have been shown to be more effective in terms of performance and generalization capability than the traditional approaches. As a matter of fact, they are able to achieve very high accuracy scores on the benchmark datasets.

We separated the performance analysis between copy-move only, both copy-move and splicing, and DeepFake detection methods. In the case of copy-move only detection, the method in [ 25 ] shows an almost perfect accuracy on all three standard benchmark datasets (MICC-F220, MICC-F600, and MICC-F2000). The technique presented in [ 4 ] is able to achieve a similar accuracy, while also giving the identification of both the copied regions and the original ones used as source for the attacks. In the case of both copy-move and splicing detection, similar results were achieved on the CASIA2 dataset. In particular, method [ 85 ] shows the best accuracy and gives the localization of the forged regions as well.

Concerning DeepFake detection, from the reported performance (see Table 6 ) we infer that there is no clearly winning approach; in particular, no method is general enough for different kinds of DeepFake content. However, we can conclude that the XceptionNet-based models proposed in [ 88 ] are able to achieve better performance on at least two out of the three considered benchmark datasets.

From a general point of view, the DL-based methods surveyed in this paper show that a clear trend has not yet emerged. Most works have been proposed more or less independently, in the sense that the vast possibilities offered by DL architectures are still being explored, without any indication of a clearly winning strategy. Nonetheless, we showed that, in the case of splicing and copy-move detection methods, the best accuracy scores were obtained by the techniques that involve some form of pre-processing and post-processing in addition to a deep learning network. For this reason, this appears to be the most promising approach, and we believe that further research should be conducted on algorithms that combine deep learning approaches with traditional techniques from across the field of (statistical) signal processing.

As a further consideration, it can be noted that in the case of techniques aimed at “classic” forgery detection (splicing and copy-move), most state-of-the-art methods are able to achieve good performance (on different datasets). This is not the case for newer challenges like DeepFake detection, whose methods report accuracies that are still not satisfactory on complex datasets, like Celeb-DF. As such, further research efforts are needed and new ideas still need to be explored in this particular direction.

Further remarks are in order on the problem of performance evaluation of deep learning based methods. Different authors built custom datasets or merged different ones in order to train and test their algorithms. While this can be a solution to overcome issues of data-scarcity (over-fitting), it makes the comparison with other methods more difficult, or even impossible. Even when the same dataset is used to evaluate different approaches, the authors do not always specify which and how many images were used as testing set.

This problem could be addressed by building a custom dataset for training, and using one or possibly more benchmark datasets in their entirety for testing. In this way, it would be possible not only to easily compare different deep learning based approaches, but also to compare them to traditional, non-learning based ones.

Of course, building a custom dataset with thousands of images, with realistic forgeries and post-processing operations on the forged areas, such as blurring, JPEG-compression, smoothing, and so on, is not a simple undertaking. For this reason, we point out that another possible future research direction could be the automation of this task, for example by leveraging a GAN (as done in [ 1 ]), or encoder-decoder models such as a U-Net.

A wholly different comment on the subject of dataset building should also be made about the meaning of the forgery attacks currently contained in the benchmark datasets. As these have always been generated in a laboratory environment (whether manually or not), they typically contain copy-move and splicing attacks that hardly bring a particular semantic value to the altered images. For example, when a tree is copied and pasted into a woodland landscape, or a cloud is pasted into a blue sky, the obtained image could hardly be used for malicious purposes. This is clearly not the case for many manipulated images that can be found on the Web. Let us consider for example the splicing shown in Fig. 1b : the fact that the 2004 presidential election candidate John Kerry was (falsely) immortalized together with pacifist actress Jane Fonda, who was viewed by many as an anti-patriotic celebrity, could have seriously influenced the electoral campaign (in this case, the image was shown to be false, but not quickly enough to avoid some damage to the candidate’s reputation).

Of course, in such real-world cases, the context adds a lot to the meaning of the forgery, and thus it can hardly be taken into account by a forensic tool without human supervision. Nevertheless, we feel that it could be interesting to build a database that collects more realistic, manually made, “in-the-wild” forgeries, like the ones that have routinely spread on social media in recent years, and that therefore present potentially malicious attacks from a purely semantic point of view. Also, this database should contain, for each forgery, the associated ground-truth mask, in order to better assess and compare the forgery localization capability of the forensic tools.

We would like to conclude with a final, more philosophical observation. As is typical in security-related fields, attackers usually embody, in their attacks, ideas and “hacks” that are specifically designed to counter the latest state-of-the-art detection methods, e.g., so-called adversarial attacks [ 26 , 36 , 54 ], which are used to fool deep learning classification systems. For example, a possible strategy to achieve this confusion consists in using a certain CNN architecture as a discriminator in a GAN model, in order to produce synthesized content which is, by construction, hard for that particular CNN to detect as fake. Another interesting example of this kind involves DeepFake detection: in [ 58 ], the authors observed that, in DeepFake videos, it was common to see unnatural eye-blinking patterns (or no blinking at all), because DeepFake generation algorithms were trained mostly on pictures of people with open eyes. As expected, attackers immediately adapted DeepFake methods in order to generate realistic eye-blinking, either by including pictures of people with closed eyes during training, or by synthetically correcting this issue altogether.

As a consequence, it is probably an illusion to consider a certain forgery detection method to be “safe” forever, even if it has been shown to achieve great detection accuracy on different datasets. For this reason, we think that continuous research efforts should be made in order to develop methods that can, at least to some extent, keep up with the attackers’ pace in developing more and more sophisticated and hard-to-detect forgeries. One possible strategy, which tries to anticipate potential attacker moves, could be to actively implement new forgery techniques while developing detection algorithms, thereby understanding and leveraging their flaws and enabling the creation of possible counter-measures.

Code Availability

No code has been developed by the authors for this work.

For example Instagram, Snapseed, Prisma Photo Editor, Visage, and many more.

The on-board image signature algorithm developed by Nikon, for example, has long been compromised [ 10 ]. Another high-profile case is Blu-Ray, whose protection scheme used a combination of cryptography, digital signatures, and watermarking [ 77 ].

For example, Photo Proof Pro by Keeex [ 46 ] and Numbers Protocol [ 76 ]

Abdalla Y, Iqbal T, Shehata M (2019) Copy-move forgery detection and localization using a generative adversarial network and convolutional neural-network. Information 10(09):286. https://doi.org/10.3390/info10090286


Achanta R, Shaji A, Smith K, Lucchi A, Fua P, Süsstrunk S (2010) Slic superpixels. Technical report, EPFL

Adobe Photoshop. https://www.adobe.com/it/products/photoshop.html . Accessed 16 Mar 2022

Agarwal R, Verma O (2020) An efficient copy move forgery detection using deep learning feature extraction and matching algorithm. Multimed Tools Appl 79. https://doi.org/10.1007/s11042-019-08495-z

Amerini I, Ballan L, Caldelli R, Del Bimbo A, Serra G (2011) A SIFT-based forensic method for copy-move attack detection and transformation recovery. IEEE Trans Inf Forensics Secur:1099–1110. https://doi.org/10.1109/TIFS.2011.2129512

Arnold MK, Schmucker M, Wolthusen SD (2003) Techniques and applications of digital watermarking and content protection. Artech House

Barni M, Phan QT, Tondi B (2021) Copy move source-target disambiguation through multi-branch cnns. IEEE Trans Inf Forensics Secur 16:1825–1840

Bas P, Filler T, Pevnỳ T (2011) Break our steganographic system the ins and outs of organizing BOSS. In: International workshop on information hiding, pp 59–70. https://doi.org/10.1007/978-3-642-24178-9_5

Bay H, Ess A, Tuytelaars T, Van Gool L (2008) Speeded-up robust features (surf). Comp Vision Image Underst 110(3):346–359. https://doi.org/10.1016/j.cviu.2007.09.014 . Similarity Matching in Computer Vision and Multimedia

Blog post on Elcomsoft, April 2011. https://blog.elcomsoft.com/2011/04/nikon-image-authentication-system-compromised/ . Accessed 16 Mar 2022

Birajdar GK, Mankar VH (2013) Digital image forgery detection using passive techniques: a survey. Digit Investig 10(3):226–245. https://doi.org/10.1016/j.diin.2013.04.007

Cao Z, Gao H, Mangalam K, Cai Q-Z, Vo M, Malik J (2020) Long-term human motion prediction with scene context. In: Vedaldi A, Bischof H, Brox T, Frahm J-M (eds) Computer vision - ECCV, pp 387–404

Chen T, Bing X, Zhang C, Guestrin C (2016) Training deep nets with sublinear memory cost

Chen J, Liao X, Qin Z (2021) Identifying tampering operations in image operator chains based on decision fusion. Sig Process Image Commun 95:116287. https://doi.org/10.1016/j.image.2021.116287

Chollet F (2017) Xception: deep learning with depthwise separable convolutions, pp 1800–1807. https://doi.org/10.1109/CVPR.2017.195

Christlein V, Riess C, Angelopoulou E (2010) On rotation invariance in copy-move forgery detection. In: 2010 IEEE international workshop on information forensics and security, pp 1–6. https://doi.org/10.1109/WIFS.2010.5711472

Christlein V, Riess C, Jordan J, Riess C, Angelopoulou E (2012) An evaluation of popular copy-move forgery detection approaches. IEEE Trans Inf Forensics Secur 7(6):1841–1854. https://doi.org/10.1109/TIFS.2012.2218597

Cozzolino D, Verdoliva L (2020) Noiseprint: a cnn-based camera model fingerprint. IEEE Trans Inf Forensics Secur 15:144–159. https://doi.org/10.1109/TIFS.2019.2916364

de Carvalho TJ, Riess C, Angelopoulou E, Pedrini H, de Rezende Rocha A (2013) Exposing digital image forgeries by illumination color classification. IEEE Trans Inf Forensics Secur 8(7):1182–1194. https://doi.org/10.1109/TIFS.2013.2265677

Deng J, Dong W, Socher R, Li L-J, Li K, Fei-Fei L (2009) ImageNet: a large-scale hierarchical image database. In: CVPR. https://doi.org/10.1109/CVPR.2009.5206848

Dittmann J (2001) Content-fragile watermarking for image authentication. In: Security and watermarking of multimedia contents III, vol 4314, pp 175–184. International Society for Optics and Photonics. https://doi.org/10.1117/12.435398

Doegar A, Dutta M, Gaurav K (2019) Cnn based image forgery detection using pre-trained alexnet model. Electronic

Dolhansky B, Howes R, Pflaum B, Baram N, Ferrer C (2019) The deepfake detection challenge dfdc preview dataset

Dong J, Wang W, Tan T (2013) Casia image tampering detection evaluation database. In: 2013 IEEE China summit and international conference on signal and information processing, pp 422–426. https://doi.org/10.1109/ChinaSIP.2013.6625374

Elaskily M, Elnemr H, Sedik A, Dessouky M, El Banby G, Elaskily O, Khalaf AAM, Aslan H, Faragallah O, El-Samie FA (2020) A novel deep learning framework for copy-move forgery detection in images. Multimed Tools Appl 79. https://doi.org/10.1007/s11042-020-08751-7

Eykholt K, Evtimov I, Fernandes E, Li B, Rahmati A, Xiao C, Prakash A, Kohno T, Song DX (2018) Robust physical-world attacks on deep learning visual classification. In: 2018 IEEE/CVF conference on computer vision and pattern recognition, pp 1625–1634

Faceswap. https://github.com/deepfakes/faceswap . Accessed 16 Mar 2022

Farid H (1999) Detecting digital forgeries using bispectral analysis. AI Lab, Massachusetts Institute of Technology, Tech Rep AIM-1657

Farid H (2009) Image forgery detection: a survey. Signal Proc Mag IEEE 26(04):16–25. https://doi.org/10.1109/MSP.2008.931079

Fischler M, Bolles R (1981) Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Commun ACM 24(6):381–395. https://doi.org/10.1145/358669.358692


Fridrich J, Soukal D, Lukás J (2003) Detection of copy move forgery in digital images. Proc. Digital Forensic Research Workshop

Fridrich J, Chen M, Goljan M (2007) Imaging sensor noise as digital x-ray for revealing forgeries. In: Proceedings of the 9th international workshop on information hiding, Saint Malo, France, pp 342–358. https://doi.org/10.1007/978-3-540-77370-2_23

Gimp. https://www.gimp.org/ . Accessed 16 Mar 2022

Goldman E (2018) The complicated story of FOSTA and Section 230. First Amend L Rev 17:279


Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y (2014) Generative adversarial networks. Adv Neural Inf Process Syst 3

Goodfellow I, Shlens J, Szegedy C (2014) Explaining and harnessing adversarial examples, vol 12. arXiv: 1412.6572

Hinton GE, Krizhevsky A, Wang SD (2011) Transforming auto-encoders. In: Honkela T, Duch W, Girolami M, Kaski S (eds) Artificial Neural Networks and Machine Learning – ICANN 2011. Springer, Berlin, pp 44–51

Huynh TK, Huynh KV, Le-Tien T, Nguyen SC (2015) A survey on image forgery detection techniques. In: The 2015 IEEE RIVF international conference on computing & communication technologies-research, innovation, and vision for future (RIVF). IEEE, pp 71–76. https://doi.org/10.1109/RIVF.2015.7049877 https://doi.org/10.1109/RIVF.2015.7049877

Interactive Web demo: Whichfaceisreal. https://www.whichfaceisreal.com/index.php . Accessed 16 Mar 2022

Johnson MK, Farid H (2005) Exposing digital forgeries by detecting inconsistencies in lighting. In: Proceedings of the ACM multimedia and security workshop, New York, NY, pp 1–10. https://doi.org/10.1145/0731701073171

Johnson MK, Farid H (2006) Exposing digital forgeries through chromatic aberration. In: Proceedings of the ACM multimedia and security workshop, Geneva, pp 48–55. https://doi.org/10.1145/1161366.1161376

Johnson MK, Farid H (2006) Metric measurements on a plane from a single image. Tech Rep TR2006- 579

Johnson MK, Farid H (2007) Detecting photographic composites of people. In: Proceedings of the 6th international workshop on digital watermarking, Guangzhou. https://doi.org/10.1007/978-3-540-92238-4_3

Johnson MK, Farid H (2007) Exposing digital forgeries through specular highlights on the eye. In: Proceedings of the 9th international workshop on information hiding, Saint Malo, France, pp 311–325. https://doi.org/10.1007/978-3-540-77370-2_21

Karras T, Laine S, Aila T (2019) A style-based generator architecture for generative adversarial networks, pp 4396–4405. https://doi.org/10.1109/CVPR.2019.00453

Keek. https://keeex.me/products/ . Accessed 16 Mar 2022

Koptyra K, Ogiela MR (2021) Imagechain—application of blockchain technology for images. Sensors 21(1):82. https://doi.org/10.3390/s21010082

Korus P (2017) Digital image integrity–a survey of protection and verification techniques. Digit Signal Process 71:1–26. https://doi.org/10.1016/j.dsp.2017.08.009

Korus P, Huang J (2016) Evaluation of random field models in multi-modal unsupervised tampering localization. In: 2016 IEEE international workshop on information forensics and security (WIFS), pp 1–6. https://doi.org/10.1109/WIFS.2016.7823898

Korus P, Huang J (2017) Multi-scale analysis strategies in prnu-based tampering localization. IEEE Trans Inf Forensic Secur

Kowalski M (2016) https://github.com/MarekKowalski/FaceSwap/ . Accessed 16 Mar 2022

Krizhevsky A, Nair V, Hinton G (2009) Cifar-10 (Canadian Institute for Advanced Research)

Krizhevsky A, Sutskever I, Geoffrey H (2012) Imagenet classification with deep convolutional neural networks. Neural Inf Process Syst 25. https://doi.org/10.1145/3065386

Kurakin A, Goodfellow I, Bengio S (2016) Adversarial examples in the physical world

LeCun Y, Cortes C (2010) MNIST handwritten digit database. http://yann.lecun.com/exdb/mnist/ . Accessed 16 Mar 2022 [cited 2016-01-14 14:24:11]

LeCun Y, Bengio Y, Hinton G (2015) Deep learning. Nature 521 (7553):436–444. https://doi.org/10.1038/nature14539

Li Y, Lyu S (2018) Exposing deepfake videos by detecting face warping artifacts

Li Y, Chang MC, Lyu S (2018) In ictu oculi: exposing ai created fake videos by detecting eye blinking, pp 1–7. https://doi.org/10.1109/WIFS.2018.8630787

Li Y, Yang X, Qi H, Lyu S (2016) Celeb-df: a large-scale challenging dataset for deepfake forensics, pp 3204–3213. https://doi.org/10.1109/CVPR42600.2020.00327

Liao X, Li K, Zhu X, Liu KJR (2020) Robust detection of image operator chain with two-stream convolutional neural network. IEEE J Sel Top Signal Process 14(5):955–968. https://doi.org/10.1109/JSTSP.2020.3002391

Liao X, Huang Z, Peng L, Qiao T (2021) First step towards parameters estimation of image operator chain. Inf Sci 575. https://doi.org/10.1016/j.ins.2021.06.045

Litjens G, Kooi T, Bejnordi BE, Setio AAA, Ciompi F, Ghafoorian M, Van Der Laak JA, Ginneken BV, Sánchez CI (2017) A survey on deep learning in medical image analysis. Med Image Anal 42:60–88. https://doi.org/10.1016/j.media.2017.07.005

Liu G, Reda F, Shih K, Wang TC, Tao A, Catanzaro B (2018) Image inpainting for irregular holes using partial convolutions

López-García X, Silva-Rodríguez A, Vizoso-García AA, Oscar W, Westlund J (2019) Mobile journalism: systematic literature review. Comunicar Media Educ Res J 27(1). https://doi.org/10.3916/C59-2019-01

Lowe D (2004) Distinctive image features from scale-invariant keypoints. Int J Comput Vision 60:91–. https://doi.org/10.1023/B:VISI.0000029664.99615.94

Lu C-S, Liao H-YM (2001) Multipurpose watermarking for image authentication and protection. IEEE Trans Image Process 10(10):1579–1592. https://doi.org/10.1109/83.951542

Article   MATH   Google Scholar  

Lukás J, Fridrich J (2003) Estimation of primary quantization matrix in double compressed jpeg images. Proc Digital Forensic Research Workshop. https://doi.org/10.1117/12.759155

Majumder MTH, Alim Al Islam ABM (2018) A tale of a deep learning approach to image forgery detection. In: 2018 5th international conference on networking, systems and security (NSysS), pp 1–9. https://doi.org/10.1109/NSysS.2018.8631389

Marra F, Gragnaniello D, Verdoliva L, Poggi G (2020) A full-image full-resolution end-to-end-trainable cnn framework for image forgery detection. IEEE Access:1–1.

Moreira D, Bharati A, Brogan J, Pinto A, Parowski M, Bowyer KW, Flynn PJ, Rocha A, Scheirer WJ (2018) Image provenance analysis at scale. IEEE Trans Image Process 27(12):6109–6123

Muzaffer G, Ulutas G (2019) A new deep learning-based method to detection of copy-move forgery in digital images. In: 2019 Scientific meeting on electrical-electronics biomedical engineering and computer science (EBBT), pp 1–4. https://doi.org/10.1109/EBBT.2019.8741657

Nguyen H, Yamagishi J, Echizen I (2019) Use of a capsule network to detect fake images and videos

Nightingale SJ, Wade KA, Watson DG (2017) Can people identify original and manipulated photos of real-world scenes?. Cognitive Research: Principles and Implications 2(1):1–21. https://doi.org/10.1186/s41235-017-0067-2

Nikolaidis N, Pitas I (1996) Copyright protection of images using robust digital signatures. In: 1996 IEEE international conference on acoustics, speech, and signal processing conference proceedings, vol 4. IEEE, pp 2168–2171. https://doi.org/10.1109/ICASSP.1996.545849

Nilsback M, Zisserman A (2008) Automated flower classification over a large number of classes. In: 2008 Sixth Indian conference on computer vision, simage processing, pp 722–729. https://doi.org/10.1109/ICVGIP.2008.47

Numbersprotocol.io. https://numbersprotocol.io/ . Accessed 16 Mar 2022

Online article on Arstechnica, May 2007 https://arstechnica.com/uncategorized/2007/05/latest-aacs-revision-defeated-a-week-before-release/ . Accessed 16 Mar 2022

Ouyang J, Liu Y, Liao M (2017) Copy-move forgery detection based on deep learning. In: 2017 10th international congress on image and signal processing, BioMedical engineering and informatics (CISP-BMEI), pp 1–5. https://doi.org/10.1109/CISP-BMEI.2017.8301940

Passarella A (2012) A survey on content-centric technologies for the current internet CDN and P2P solutions. Comput Commun 35(1):1–32. https://doi.org/10.1016/j.comcom.2011.10.005

Philbin J, randjelović R, Zisserman A (2007) The Oxford Buildings Dataset. https://www.robots.ox.ac.uk/vgg/data/oxbuildings/ . Accessed 16 Mar 2022

Piva A (2013) An overview on image forensics. International Scholarly Research Notices 2013. https://doi.org/10.1155/2013/496701

Popescu AC, Farid H (2004) Exposing digital forgeries by detecting duplicated image regions. Tech. Rep. TR2004-515

Popescu AC, Farid H (2005) Exposing digital forgeries by detecting traces of re-sampling. IEEE Trans Signal Process 53(2):758–767. https://doi.org/10.1109/TSP.2004.839932

Qureshi MA, Deriche M (2015) A bibliography of pixel-based blind image forgery detection techniques. Signal Process Image Commun 39:46–74. https://doi.org/10.1016/j.image.2015.08.008

Rajini NH (2019) Image forgery identification using convolution neural network. Int J Recent Technol Eng 8

Rao Y, Ni J (2016) A deep learning approach to detection of splicing and copy-move forgeries in images. In: 2016 IEEE international workshop on information forensics and security (WIFS), pp 1–6. https://doi.org/10.1109/WIFS.2016.7823911

Roy S, Sun Q (2007) Robust hash for detecting and localizing image tampering. In: 2007 IEEE international conference on image processing, vol 6. IEEE, pp VI–117. https://doi.org/10.1109/ICIP.2007.4379535

Rössler A, Cozzolino D, Verdoliva L, Riess C, Thies J, Nießner M (2019) Faceforensics++: learning to detect manipulated facial images

Rublee E, Rabaud V, Konolige K, Bradski G (2011) Orb: an efficient alternative to sift or surf. In: 2011 International conference on computer vision, pp 2564–2571. https://doi.org/10.1109/ICCV.2011.6126544

Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, Huang Z, Karpathy A, Khosla A, Bernstein M, Berg A, Fei-Fei L (2014) Imagenet large scale visual recognition challenge. Int J Comput Vision 115. https://doi.org/10.1007/s11263-015-0816-y

Schaefer G, Stich M (2003) UCID: an uncompressed color image database. In: Yeung MM , Lienhart RW, Li CS (eds) Storage and retrieval methods and applications for multimedia 2004, vol 5307. International Society for Optics and Photonics, SPIE, pp 472–480. https://doi.org/10.1117/12.525375

Schetinger M, Chang S (1996) A robust content based digital signature for image authentication. In: Proceedings of 3rd IEEE international conference on image processing, vol 3. IEEE, pp 227–230. https://doi.org/10.1109/ICIP.1996.560425

Schetinger V, Oliveira MM, da Silva R, Carvalho TJ (2017) Humans are easily fooled by digital images. Comput Graph 68:142–151. https://doi.org/10.1016/j.cag.2017.08.010

Shen C, Kasra M, Pan P, Bassett GA, Malloch Y, F O’Brien J (2019) Fake images: the effects of source, intermediary, and digital media literacy on contextual assessment of image credibility online. New Media & Society 21(2):438–463. https://doi.org/10.1177/1461444818799526

Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. arXiv: 1409.1556

Spohr D (2017) Fake news and ideological polarization: Filter bubbles and selective exposure on social media. Bus Inf Rev 34(3):150–160. https://doi.org/10.1177/0266382117722446

Thakur R, Rohilla R (2019) Copy-move forgery detection using residuals and convolutional neural network framework: a novel approach. In: 2019 2nd international conference on power energy, environment and intelligent control PEEIC, pp 561–564. https://doi.org/10.1109/PEEIC47157.2019.8976868

Thies T, Zollhöfer M, Stamminger M, Christian T, Nießner M (2018) Face2face: real-time face capture and reenactment of rgb videos. Commun ACM 62:96–104. https://doi.org/10.1145/3292039

Thies J, Zollhöfer M, Nießner M (2019) Deferred neural rendering: image synthesis using neural textures. ACM Trans Graph 38:1–12. https://doi.org/10.1145/3306346.3323035

Tralic D, Zupancic I, Grgic S, Grgic M (2013) Comofod — new database for copy-move forgery detection. In: Proceedings ELMAR-2013, pp 49–54

Various. Columbia image splicing detection evaluation dataset - list of photographers, 2004. https://www.ee.columbia.edu/ln/dvmm/downloads/AuthSplicedDataSet/photographers.htm . Accessed 16 Mar 2022

Verdoliva L (2020) Media forensics and deepfakes: an overview. IEEE J Sel Top Signal Process:1–1. https://doi.org/10.1109/JSTSP.2020.3002101

Warif NBA, Wahab AWA, dris MYI, Ramli R, Salleh R, Shamshirband S, Choo K-KR (2016) Copy-move forgery detection: Survey, challenges and future directions. J Netw Comput Appl 75:259–278. https://doi.org/10.1016/j.jnca.2016.09.008

Wojna Z, Ferrari V, Guadarrama S, Silberman N, Chen LC, Fathi A, Uijlings J (2017) The devil is in the decoder. In: British machine vision conference (BMVC), pp 1–13

Wu Y, Abd-Almageed W, Natarajan P (2018) Busternet: detecting copy-move image forgery with source/target localization. In: Proceedings of the European conference on computer vision (ECCV), pp 168–184. https://doi.org/10.1007/978-3-030-01231-1_11

Wu Y, AbdAlmageed W, Natarajan P (2019) Mantra-net: manipulation tracing network for detection and localization of image forgeries with anomalous features. In: 2019 IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 9535–9544. https://doi.org/10.1109/CVPR.2019.00977

Zhang Y, Goh J, Win LL, Vrizlynn T (2016) Image region forgery detection: a deep learning approach. In: SG-CRC, pp 1–11. https://doi.org/10.3233/978-1-61499-617-0-1

Zhang K, Zuo W, Cheng Y, Meng D, Zhang L (2017) Beyond a gaussian denoiser: residual learning of deep cnn for image denoising. IEEE Trans Image Process 26(7):3142–3155. https://doi.org/10.1109/TIP.2017.2662206

Article   MathSciNet   MATH   Google Scholar  

Zhang Q, Yang LT, Chen Z, Li P (2018) A survey on deep learning for big data. Information Fusion 42:146–157. https://doi.org/10.1016/j.inffus.2017.10.006

Download references

Open access funding provided by Università degli Studi di Brescia within the CRUI-CARE Agreement. No funding was received to assist with the preparation of this manuscript.

Author information

Authors and affiliations.

Department of Information Engineering, CNIT – University of Brescia, Via Branze 38, 25134, Brescia, Italy

Marcello Zanardelli, Fabrizio Guerrini, Riccardo Leonardi & Nicola Adami


The authors contributed equally to this work.

Corresponding author

Correspondence to Marcello Zanardelli .

Ethics declarations

Conflict of interests.

The authors declare that they have no conflict of interest.

Additional information

Availability of data and material.

No additional data or material has been used for this work other than the referenced papers.

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .

About this article

Zanardelli, M., Guerrini, F., Leonardi, R. et al. Image forgery detection: a survey of recent deep-learning approaches. Multimed Tools Appl 82 , 17521–17566 (2023). https://doi.org/10.1007/s11042-022-13797-w

Received : 11 August 2021

Revised : 23 March 2022

Accepted : 05 September 2022

Published : 03 October 2022

Issue Date : May 2023

DOI : https://doi.org/10.1007/s11042-022-13797-w

  • Image forgery detection
  • Image forensics
  • Deep learning

  • Int J Environ Res Public Health


A Novel Image Processing Approach to Enhancement and Compression of X-ray Images

Yaghoub Pourasad

1 Department of Electrical Engineering, Urmia University of Technology, Urmia 17165-57166, Iran

Fausto Cavallaro

2 Department of Economics, University of Molise, Via De Sanctis, 86100 Campobasso, Italy; cavallaro@unimol.it

Associated Data

The dataset is available online: http://www.med.harvard.edu/AANLIB/ (accessed on 18 April 2021).

At present, the volume of data generated and stored in the medical field is increasing. To handle these extensive data efficiently, compression methods need to be re-examined with the algorithm's complexity in mind. An image processing approach is needed to reduce the redundancy in image content and thereby improve the ability to store or transfer information in an optimal form. In this study, two classes of compression techniques, namely lossless and lossy compression, were applied to images with the aim of preserving image quality. Moreover, some enhancement techniques were employed to increase the quality of the compressed image. These methods were investigated, and several comparison results are demonstrated. Finally, the performance metrics were extracted and analyzed based on state-of-the-art methods. PSNR, MSE, and SSIM are the three performance metrics that were used for the sample medical images. Detailed analysis of the measurement metrics demonstrates better efficiency than the other image processing techniques. This study helps to better understand these strategies and assists researchers in selecting a more appropriate technique for a given use case.

1. Introduction

Image processing is a substantial part of medical research and clinical practice [ 1 , 2 ]. In the last few years, medical image analysis has advanced considerably through improvements in digital imaging methods. A massive number of medical images have been generated with ever-increasing diversity and quality. Although traditional medical image analysis techniques have had limited success, they cannot deal with colossal quantities of image data [ 3 , 4 , 5 ]. The idea behind digital image processing is the processing of digital images using digital computers. A digital image is composed of a finite number of elements. Each element has its own location and value and is known as a pixel, image element, or picture element. The term “pixel” is the one most frequently used for the elements of a digital image. In medical research, medical images, e.g., CT scans, MRI, and X-ray images, are the most used images these days. Therefore, the analysis of these diverse image types requires sophisticated computerized tools. Image compression is a type of method that efficiently stores and transmits images while retaining the highest possible quality. Many software tools and techniques address this compression problem by establishing an appropriate balance between reconstructed image quality and compression ratio [ 6 , 7 ].

X-rays were discovered in 1895 by German physicist Wilhelm Roentgen who pioneered medical imaging. Medical images help physicians see through the human body, detect injuries or diseases, and direct therapeutic procedures. Based on image quality, they can determine how visible the different disease signs and anatomical structures are. The Food and Drug Administration (FDA) is an agency within the US Department of Health and Human Services, consisting of nine centers and offices. FDA has described “medical imaging” as a technology. Medical imaging includes various technologies employed to see through the human body. The primary purpose of this technology is to diagnose, monitor, or treat patients’ medical conditions. Different parts of this technology inform us about the treated or studied body area associated with a potential injury or disease or medical treatment effectiveness.

Nowadays, medical imaging technology is an essential part of medicine [ 8 , 9 , 10 ]. Surgeons, pathologists, and other medical groups can observe symptoms of diseases directly. In recent years, medical imaging techniques have made much progress. Accordingly, planners and even surgeons can take advantage of this technology. Polap [ 11 ] proposed a versatile method for image analysis composed of a genetic algorithm (GA) and a cascade of convolutional classifiers. The GA was used to indicate the probability of belonging to the appropriate class. The indicated probability is important because it is used to weigh the outcomes obtained from the cascade of neural classifiers.

This paper mainly aims to find an efficient method for the compression and enhancement of medical images. It starts with image compression and finishes with the enhancement of medical images for higher output quality. An in-depth study of previous research was conducted, and various compression techniques were examined to obtain better output. Moreover, the performance metrics were extracted and analyzed based on state-of-the-art methods. Two compression techniques, namely lossless compression and lossy compression, were applied to image compression. Then, the chosen images were restored using enhancement techniques. Finally, the efficiency of this technique was analyzed using various performance parameters to assess the output.

The present paper is outlined as follows: Section 1 describes medical imaging, its key characteristics, and its quality factors. Section 2 reviews relevant papers in medical image processing and studies some of the image processing methods that researchers have proposed for improving medical images. Section 3 is the core of the present research paper. This section, together with Section 4 , explains some of the significant engineering subjects related to image processing in general and medical imaging in particular. The evaluation metrics are discussed in Section 5 . Finally, Section 6 summarizes the numerical results and future works.

2. Literature Review

In this section, various image compression and image enhancement techniques are investigated. Image compression and image enhancement play a key role in medical image processing, and considerable research has focused on both for the improvement of medical images. A Haar wavelet-based approach to image compression and quality assessment of the compressed image is described in [ 12 ]. In [ 13 ], wavelet-based image compression was applied to grayscale images with various techniques, such as SPIHT, EZW, and SOFM. In [ 14 ], the authors deal with a specific type of compression that utilizes wavelet transforms. Wavelets were employed as simple patterns/coefficients that reproduce the initial pattern when multiplied and combined. The author of [ 15 ] introduces a new lossy compression technique that employs singular value decomposition (SVD) and wavelet difference reduction (WDR). The two methods are combined so that WDR compression increases the performance of SVD compression.

In [ 16 , 17 , 18 , 19 , 20 , 21 ], the researchers investigated different steps in image processing techniques. The authors provided an overview of all relevant image processing methods, including preprocessing, segmentation, feature extraction, and classification techniques. In [ 22 ], the researchers focused on several medical image compression methods, such as the discrete cosine transform, subband block hierarchical partitioning, JPEG 2000 image compression, JPEG2000 MAXSHIFT ROI coding, JPEG2000 scaling ROI coding, the mesh coding scheme, ROI-based scaling, and the shape-adaptive wavelet transform. In [ 16 , 17 , 18 , 19 ], the researchers discussed various medical image compression techniques. Each of the studied methods has a distinctive feature, but each also compresses medical images with certain drawbacks. Therefore, the present research aims to overcome these shortcomings and to increase the reconstructed quality of the compressed medical image at a high compression rate. A new approach to image modification for visually acceptable images was introduced in [ 23 ].

The choice of image enhancement techniques depends on the particular task, picture content, viewing conditions, and observer characteristics. Researchers have provided an overview of spatial domain techniques for image enhancement processing. More specifically, processing methods are categorized based on representative image improvement techniques. In [ 24 ], the authors studied mass detection and segmentation techniques for image processing. That study sought to use MATLAB tools in the area of medical image processing. Many medical images can be used in visualization tools, and many are challenging to work with. Using the MATLAB package to manage and visualize matrix data thus helps to create simple computer graphics, e.g., bar charts, histograms, and scatter plots.

Nowadays, image processing tool packages are available for researchers and image processing enthusiasts. The results of the proposed method help users to analyze and process images efficiently in a newer software package. Wavelet-based volumetric medical image compression is provided in [ 25 ]. In that article, researchers studied how volumetric medical images can be optimally compressed using JP3D. An enhanced technique of medical image compression with a lossless area is provided in [ 26 ]. Lossless compression techniques lose no data but achieve a low compression ratio, whereas lossy compression techniques achieve a high compression ratio at the cost of minor data loss. In [ 27 ], a medical image watermarking technique for lossless compression is introduced, which reduces the watermark by lossless compression, without loss of data. The watermark in that work combines the defined region of interest (ROI) and the secret key of image watermarking. An approach based on digital image compression, digital watermarking, and lossless compression was presented in [ 28 ]. The authors proposed new ways of combining techniques such as digital watermarking, image reduction/expansion, and lossless compression standards (JPEG-LS (JLS) or TIFF), amongst others. These compression techniques have been named wREPro.TIFF (Watermarked Reduction/Expansion Protocol in conjunction with TIFF) and wREPro.JLS (wREPro combined with the JPEG-LS format).

Designing the architecture of convolutional neural networks (CNNs) is a classic NP-hard optimization challenge, and some frameworks for creating network architectures for particular image classification tasks have been suggested [ 29 , 30 ]. Bacanin et al. [ 31 ] developed the hybridized monarch butterfly optimization algorithm to solve this issue. Moreover, Rosa et al. [ 22 ] used metaheuristic-driven strategies to address overfitting in CNNs by choosing the regularization parameter known as dropout. The findings show that optimizing dropout-based CNNs is worthwhile, owing to the ease with which appropriate dropout probability values can be found without setting new parameters empirically. Other methods in image processing include the optimized quantum matched-filter technique [ 32 ], robust principal component analysis [ 33 ], and the generalized autoregressive conditional heteroscedasticity model [ 34 ]. Moreover, expression programming [ 35 , 36 , 37 ], the optimization problem [ 38 ], the fuzzy best-worst method [ 39 ], and the GP-DEA model [ 40 ] are other methods. Several central mathematical problems in medical imaging are explained by the authors of [ 41 ].

The field has changed rapidly owing to improved software and hardware. Much of the software is built on new techniques that utilize geometric partial differential equations combined with standard image/signal processing techniques. In this enterprise, scholars have attempted to base the principles of biomedical engineering on rigorous mathematical foundations for the development of software methods to be integrated into complete therapy delivery systems. They show how mathematical research can influence some key medical subjects, such as the enhancement, registration, and segmentation of images. The present research has developed an extensible image processing method that includes image compression and image enhancement to facilitate imaging research in the medical field. It is important to note that there is currently no agreement among researchers regarding the steps of image processing. Hence, in this paper, different compression and enhancement techniques from many researchers are studied and analyzed based on different performance metrics. At first, different MRI and CT scan images were selected, and compression methods were applied to the images.

3. Methods and Materials

DICOM is used in nearly every radiology, cardiology imaging, and radiotherapy device (X-ray, CT, MRI, ultrasound) and in equipment in other medical fields, including ophthalmology and dentistry. With hundreds of thousands of medical imaging systems in use, DICOM is one of the most widely deployed healthcare communications standards in the world. Since its first publication in 1993, DICOM has revolutionized radiology practice, allowing X-ray film to be replaced by a fully digital workflow. DICOM has enabled innovative medical imaging technologies that have changed the face of clinical medicine, in the same way that the Internet has enabled new consumer information applications. From the emergency room to cardiac stress testing and breast cancer diagnosis, DICOM is the standard that makes medical imaging work, for doctors and for patients.

In image applications where an image is reconstructed from its degraded version, the efficiency of the image processing algorithms needs to be measured quantitatively. For this evaluation, the original image must be available. In this research, the medical images used to examine and evaluate the methods were selected from MedPix, presented by the National Library of Medicine. MedPix is an online open-access database of medical images, teaching cases, and clinical topics that integrates textual metadata and images and includes more than 12,000 patient cases, 9000 topics, and almost 59,000 images. The collected images are free of copyright problems and are open for use by the public. Based on the relevant journals and research papers, the performance parameters for evaluating image processing algorithms when both the original image and the image reconstructed from its degraded version are available are as follows:

  • Mean squared error (MSE).
  • Root-mean-square error (RMSE).
  • Peak signal-to-noise ratio (PSNR).
  • Mean absolute error (MAE).
  • Cross-correlation parameter (CP).
  • Structural similarity index measure (SSIM).
  • Histogram analysis.

In this research, MSE, PSNR, and SSIM as performance parameters were used for evaluating image processing algorithms. Additionally, the experiments were performed using MATLAB software (MathWorks, Natick, MA, USA).
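As an illustration of how the three chosen metrics compare an original image with its reconstruction, the following is a minimal sketch in Python (the study itself used MATLAB; this is not the authors' code). The function names, the 8-bit peak value of 255, and the tiny sample images are assumptions made for the example. The SSIM shown is a simplified single-window variant computed over the whole image, whereas SSIM is normally averaged over local windows.

```python
import math


def mse(original, reconstructed):
    """Mean squared error between two equally sized 2-D lists of pixel values."""
    total = 0.0
    count = 0
    for row_o, row_r in zip(original, reconstructed):
        for o, r in zip(row_o, row_r):
            total += (o - r) ** 2
            count += 1
    return total / count


def psnr(original, reconstructed, peak=255.0):
    """Peak signal-to-noise ratio in dB; infinite when the images are identical."""
    err = mse(original, reconstructed)
    if err == 0:
        return math.inf
    return 10.0 * math.log10(peak ** 2 / err)


def global_ssim(original, reconstructed, peak=255.0):
    """Single-window SSIM over the whole image (simplified; assumption of this sketch)."""
    x = [p for row in original for p in row]
    y = [p for row in reconstructed for p in row]
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    c1 = (0.01 * peak) ** 2  # stabilizing constants from the usual SSIM definition
    c2 = (0.03 * peak) ** 2
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    )


# Hypothetical 3x3 grayscale patch and a slightly degraded copy.
original = [[52, 55, 61], [59, 79, 61], [76, 62, 59]]
degraded = [[54, 55, 60], [59, 77, 61], [76, 63, 59]]

print("MSE :", mse(original, degraded))
print("PSNR:", psnr(original, degraded))
print("SSIM:", global_ssim(original, degraded))
```

Lower MSE and higher PSNR indicate a reconstruction closer to the original, and an SSIM of 1 is reached only when the two images are identical.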

3.1. Image Compression Techniques

Two categories of image compression used in medical image processing research are lossless and lossy. Lossless compression requires that the image reconstructed from the compressed replica be an exact copy of the original. Such compression is used for medical images, where data loss can lead to misdiagnosis. Unlike error-free coding, lossy image compression gives up some reconstruction accuracy in exchange for a higher compression ratio. The encoder contains a quantizer that restricts the number of bits needed for the image. The quantizer aims to eliminate psychovisual redundancy. Vector quantization, predictive coding, and transform coding are three standard methods for lossy image compression. Hybrid coding is a combined system that uses the characteristics of different image compression coding schemes to improve efficiency. The two lossy techniques used here were the discrete cosine transform (DCT) and the discrete wavelet transform (DWT). Additionally, run length encoding (RLE) and block truncation coding (BTC) were applied to the medical images as the lossless techniques for evaluating the experiments [ 42 ].
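To make the idea of error-free coding concrete, the sketch below shows run length encoding on one row of pixels in Python. It is an illustrative assumption of how RLE can be written, not the implementation used in the study; the function names and the sample row are hypothetical.

```python
def rle_encode(pixels):
    """Collapse a flat sequence of pixel values into [value, run_length] pairs."""
    runs = []
    for value in pixels:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1  # extend the current run
        else:
            runs.append([value, 1])  # start a new run
    return runs


def rle_decode(runs):
    """Expand [value, run_length] pairs back into the original pixel sequence."""
    out = []
    for value, length in runs:
        out.extend([value] * length)
    return out


# Hypothetical image row with long runs of background pixels.
row = [0, 0, 0, 0, 255, 255, 17, 0, 0, 0]
encoded = rle_encode(row)
decoded = rle_decode(encoded)
print(encoded)
```

Decoding returns exactly the input row, which is what makes the scheme lossless; the saving depends on how long the runs of equal pixels are.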

3.2. Discrete Cosine Transform Technique

The discrete cosine transform (DCT) expresses a finite sequence of data points as a sum of cosine functions oscillating at different frequencies [43]. Compared with the other techniques applied to grayscale medical images, the DCT shows better MSE and compression ratio results, and several studies on grayscale medical images have confirmed this claim. The DCT is also faster than other methods for images with smooth edges. DCTs are essential for various applications in medical science and engineering. The DCT is well suited to lossy compression of audio (e.g., MP3) and images (e.g., JPEG), in which small high-frequency components can be discarded [43,44,45,46].

In the present research, a grayscale medical image was taken from MedPix and compressed using the DCT technique. Afterwards, the inverse DCT was employed to reconstruct the medical image. This procedure was performed twice, for the following purposes:

  • In the first step, the DCT was applied to reduce the image’s spatial resolution.
  • In the second step, the medical image was split into blocks and re-compressed.

The first step was implemented in MATLAB; in the second step, the image was split into blocks, and the DCT was applied to each block twice.
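The block-based step described above can be illustrated with a short sketch. This is a plain-Python illustration under stated assumptions, not the authors’ MATLAB implementation: the helper names (`dct_2d`, `compress_block`), the square block size, and the rule of keeping only coefficients whose frequency indices satisfy u + v < keep are all choices made for the example.

```python
import math

def dct_1d(v):
    """Orthonormal DCT-II of a sequence of numbers."""
    n = len(v)
    out = []
    for k in range(n):
        total = sum(v[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * total)
    return out

def idct_1d(c):
    """Inverse of dct_1d (orthonormal DCT-III)."""
    n = len(c)
    out = []
    for i in range(n):
        total = c[0] * math.sqrt(1.0 / n)
        total += sum(c[k] * math.sqrt(2.0 / n) * math.cos(math.pi * (i + 0.5) * k / n)
                     for k in range(1, n))
        out.append(total)
    return out

def transpose(m):
    return [list(col) for col in zip(*m)]

def dct_2d(block):
    """Separable 2D DCT: transform the rows, then the columns."""
    rows = [dct_1d(r) for r in block]
    return transpose([dct_1d(c) for c in transpose(rows)])

def idct_2d(coeffs):
    """Inverse 2D DCT: invert the columns, then the rows."""
    cols = transpose([idct_1d(c) for c in transpose(coeffs)])
    return [idct_1d(r) for r in cols]

def compress_block(block, keep=4):
    """Zero every coefficient with u + v >= keep, then reconstruct the block."""
    coeffs = dct_2d(block)
    n = len(coeffs)
    kept = [[coeffs[u][v] if u + v < keep else 0.0 for v in range(n)]
            for u in range(n)]
    return idct_2d(kept)

# Hypothetical smooth 8 x 8 block (a linear ramp of gray levels).
ramp = [[100.0 + 2 * i + 3 * j for j in range(8)] for i in range(8)]
approx = compress_block(ramp, keep=4)
worst = max(abs(approx[i][j] - ramp[i][j]) for i in range(8) for j in range(8))
print("largest reconstruction error:", worst)
```

A smaller `keep` discards more high-frequency coefficients, which raises the compression ratio at the cost of a larger reconstruction error. Production code would normally rely on an optimized DCT routine rather than these direct sums.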

3.3. Discrete Wavelet Transform Technique

The discrete wavelet transform (DWT) transforms image pixels into wavelet coefficients, which are then used for compression and coding. This technique is beneficial for compressing signals and gives better results for grayscale medical images [47,48]. By using a set of analysis functions, the DWT enables a multi-resolution representation. In fields such as medical imaging, image degradation is not tolerated because it decreases the accuracy of the final result. A wavelet-based approach is one of the best ways to extract the key information and improve the quality of signals. The DWT is continually used to solve more advanced problems because it provides information on both the frequency content and the location of the analyzed signal. In this method, the image was first converted from a matrix to a grayscale intensity image and divided into 4 bits. Then, DWT compression was applied to the medical image. Finally, the image was resized to its original size.
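As an illustration of wavelet-based compression, the sketch below applies a single-level 2D Haar transform, the simplest wavelet, and discards small detail coefficients. The choice of the Haar wavelet in its averaging form, the function names, and the threshold are assumptions made for this example; the text does not state which wavelet the study used.

```python
def haar2d(img):
    """One-level 2D Haar transform (averaging form).

    Requires even height and width. Returns the approximation band LL
    and the three detail bands LH, HL, HH, each half the size of img.
    """
    ll, lh, hl, hh = [], [], [], []
    for r in range(0, len(img), 2):
        row_ll, row_lh, row_hl, row_hh = [], [], [], []
        for c in range(0, len(img[0]), 2):
            a, b = img[r][c], img[r][c + 1]          # top-left, top-right
            d, e = img[r + 1][c], img[r + 1][c + 1]  # bottom-left, bottom-right
            row_ll.append((a + b + d + e) / 4.0)
            row_lh.append((a - b + d - e) / 4.0)
            row_hl.append((a + b - d - e) / 4.0)
            row_hh.append((a - b - d + e) / 4.0)
        ll.append(row_ll)
        lh.append(row_lh)
        hl.append(row_hl)
        hh.append(row_hh)
    return ll, lh, hl, hh

def ihaar2d(ll, lh, hl, hh):
    """Exact inverse of haar2d."""
    h, w = len(ll), len(ll[0])
    img = [[0.0] * (2 * w) for _ in range(2 * h)]
    for r in range(h):
        for c in range(w):
            s, x, y, z = ll[r][c], lh[r][c], hl[r][c], hh[r][c]
            img[2 * r][2 * c] = s + x + y + z
            img[2 * r][2 * c + 1] = s - x + y - z
            img[2 * r + 1][2 * c] = s + x - y - z
            img[2 * r + 1][2 * c + 1] = s - x - y + z
    return img

def threshold(band, t):
    """Zero the coefficients whose magnitude is below t."""
    return [[v if abs(v) >= t else 0.0 for v in row] for row in band]

def compress(img, t):
    """Keep LL, drop small detail coefficients, and reconstruct."""
    ll, lh, hl, hh = haar2d(img)
    return ihaar2d(ll, threshold(lh, t), threshold(hl, t), threshold(hh, t))

# Hypothetical 4 x 4 image.
img = [[10, 12, 50, 50],
       [10, 12, 50, 50],
       [0, 0, 7, 9],
       [0, 0, 9, 7]]
print(compress(img, t=2))
```

With all detail coefficients kept, the transform is exactly invertible; raising the threshold zeroes more coefficients, which is where the compression (and the loss) comes from.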

3.4. Run Length Encoding Technique

Run-length encoding (RLE) is perhaps the most straightforward common compression technique. It is a lossless algorithm that searches for ‘runs’ of bits, bytes, or pixels with the same value and encodes each run’s length and value. Therefore, RLE produces the best results for pictures with large contiguous areas of color, especially monochrome pictures [49,50,51,52]. It is one of the most widely used lossless encoding methods and is supported by most bitmap file formats, such as BMP, PCX, and TIFF. The technique can compress any data irrespective of its content; however, the content affects the compression ratio that RLE achieves. RLE can compress medical images without losing important information.

Meanwhile, a medical image can be flattened into a single data sequence containing long runs of identical values. Black-and-white images in particular compress well with run length encoding. In the present study, lossless compression of the medical image was obtained using RLE.
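The idea of storing each run as a value and a length can be sketched as follows. This is a minimal illustration; the function names and the (value, count) pair representation are assumptions made for the example and do not correspond to any particular file format.

```python
from itertools import groupby

def rle_encode(pixels):
    """Collapse each run of equal values into a (value, run_length) pair."""
    return [(value, len(list(run))) for value, run in groupby(pixels)]

def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original sequence."""
    out = []
    for value, count in pairs:
        out.extend([value] * count)
    return out

# Hypothetical image row with long runs of black (0) and white (255).
row = [0, 0, 0, 255, 255, 0]
encoded = rle_encode(row)
print(encoded)
print(rle_decode(encoded) == row)
```

Decoding restores the input exactly, which is what makes the method lossless. A sequence with few repeated neighbors gains nothing, because each pixel then becomes its own pair.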

3.5. Block Truncation Coding Technique

Block truncation coding (BTC) is a lossy compression technique for grayscale images. In this technique, the original image is split into blocks. A quantizer is then employed to reduce the number of gray levels in each block while preserving the block’s mean and standard deviation. RLE and BTC are often used together to achieve compressed outputs. BTC can also be used for video compression.

In this paper, the BTC technique was used to divide the medical image into blocks. This was achieved with the help of column altering because it can adjust total column values. The BTC technique is convenient because it can be implemented quickly relative to other techniques and performs suitably in the presence of channel errors.
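The moment-preserving quantizer at the heart of BTC can be sketched for a single block as follows. This follows the textbook two-level formulation and is not necessarily the variant used in the study; the function names and the use of the population standard deviation are assumptions made for the example.

```python
import math

def btc_encode(block):
    """Encode one block as (low, high, bitmap).

    The two output levels are chosen so that the decoded block keeps
    the mean and standard deviation of the original block.
    """
    pixels = [p for row in block for p in row]
    m = len(pixels)
    mean = sum(pixels) / m
    std = math.sqrt(sum((p - mean) ** 2 for p in pixels) / m)
    bitmap = [[1 if p >= mean else 0 for p in row] for row in block]
    q = sum(sum(row) for row in bitmap)  # pixels at or above the mean
    if q == 0 or q == m:                 # flat block: a single level suffices
        return mean, mean, bitmap
    low = mean - std * math.sqrt(q / (m - q))
    high = mean + std * math.sqrt((m - q) / q)
    return low, high, bitmap

def btc_decode(low, high, bitmap):
    """Rebuild the block from the two levels and the bitmap."""
    return [[high if bit else low for bit in row] for row in bitmap]

# Hypothetical 2 x 2 block.
block = [[1, 2], [3, 10]]
low, high, bitmap = btc_encode(block)
decoded = btc_decode(low, high, bitmap)
print(low, high, bitmap)
```

Each block is stored as two gray levels plus one bit per pixel, so the compression ratio is fixed by the block size. The individual pixel values are not recovered exactly, only the block statistics.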

3.6. Image Enhancement Techniques

Image enhancement is used to facilitate the visual interpretation and analysis of images. Digital imagery offers the advantage that the pixel values of an image can be manipulated. Image enhancement primarily aims to modify one or more attributes of an image to render it more suitable for a particular task and observer. Enhancement methods can improve the interpretability of images, or the perception of the information they contain, for human viewers. They can also provide better input for other automated image processing techniques. Nowadays, many images, such as geographic, medical, and aerial images, suffer from noise and poor contrast [51]. The advantages of enhancement techniques include improving the visual quality of the image, increasing contrast, and reducing blurring and noise. Additionally, these methods can enhance image sharpness and edges.

Two categories of enhancement techniques include:

  • Spatial domain techniques.
  • Frequency domain techniques.

3.7. Spatial Domain Methods

The primary purpose of enhancement is to process an image so that it yields better results for a particular application. Image enhancement is divided into two categories: spatial domain enhancement and frequency domain enhancement. The term spatial domain refers to the image plane itself, and spatial domain methods manipulate the pixels directly. In this paper, the spatial domain technique was used to manipulate image pixels. This method not only achieves image adjustment but can also enhance the quality and contrast of the compressed medical images. This study used adaptive histogram equalization and morphological operations to improve the quality of the compressed medical images.

3.8. Adaptive Histogram Equalization (AHE)

Global histogram equalization does not work effectively for images containing regions that are locally low in contrast, whether bright or dark. Adaptive histogram equalization (AHE) is a modification of histogram equalization that gives better results on these images [53]. AHE takes only small regions into account and increases the contrast of each region based on its local cumulative distribution function (CDF). AHE can be implemented in various ways, each with several variations. In this project, we implemented AHE using tiled windows with interpolated mapping [52,54,55,56].

The medical image was enhanced with AHE using MATLAB commands and functions. AHE is a widely applicable and efficient method for contrast enhancement.
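The role of the local CDF can be illustrated with a deliberately simplified sketch that equalizes each tile independently. It omits the interpolation between neighboring tiles that the implementation described above uses, so tile borders may be visible in its output; the function names, the tile size, and the 256-level integer pixel range are assumptions made for the example.

```python
def equalize(tile, levels=256):
    """Histogram-equalize a 2D list of integer gray levels in [0, levels - 1]."""
    flat = [p for row in tile for p in row]
    hist = [0] * levels
    for p in flat:
        hist[p] += 1
    # Cumulative distribution function of the tile's gray levels.
    cdf, total = [], 0
    for count in hist:
        total += count
        cdf.append(total)
    cdf_min = next(c for c in cdf if c > 0)
    n = len(flat)
    if n == cdf_min:  # flat tile: nothing to stretch
        return [row[:] for row in tile]
    # Map each gray level through the normalized local CDF.
    lut = [max(0, round((c - cdf_min) * (levels - 1) / (n - cdf_min))) for c in cdf]
    return [[lut[p] for p in row] for row in tile]

def tiled_equalize(img, tile=8):
    """Equalize each tile on its own (no interpolation between tiles)."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for r0 in range(0, h, tile):
        for c0 in range(0, w, tile):
            block = [row[c0:c0 + tile] for row in img[r0:r0 + tile]]
            for dr, row in enumerate(equalize(block)):
                out[r0 + dr][c0:c0 + len(row)] = row
    return out

# Hypothetical 2 x 4 image: a low-contrast tile next to a flat tile.
img = [[10, 12, 200, 200],
       [14, 16, 200, 200]]
print(tiled_equalize(img, tile=2))
```

In the low-contrast tile, gray levels that originally spanned only 10 to 16 are stretched over the full range, which is the local contrast gain that a global equalization would not provide.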

3.9. Morphological Operations (MO)

Morphological operations are easy to use and are based on set theory. They aim to remove imperfections in the image structure by taking the form and structure of the image into account [57,58,59]. Most of the operations used here combine two basic processes, dilation and erosion. The operation uses a small matrix called the structuring element, whose shape and size significantly affect the final result.
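Erosion, dilation, and their combination into an opening can be sketched as follows. The sketch assumes a flat square structuring element and ignores neighbors that fall outside the image; the function names and the 3 × 3 default size are choices made for the example.

```python
def _filter(img, size, reducer):
    """Apply reducer (min or max) over a size x size neighborhood of each pixel.

    Neighbors that fall outside the image are ignored.
    """
    h, w = len(img), len(img[0])
    r = size // 2
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [img[yy][xx]
                      for yy in range(max(0, y - r), min(h, y + r + 1))
                      for xx in range(max(0, x - r), min(w, x + r + 1))]
            out[y][x] = reducer(window)
    return out

def erode(img, size=3):
    """Erosion: each pixel becomes the minimum of its neighborhood."""
    return _filter(img, size, min)

def dilate(img, size=3):
    """Dilation: each pixel becomes the maximum of its neighborhood."""
    return _filter(img, size, max)

def opening(img, size=3):
    """Erosion followed by dilation; removes bright details smaller than the element."""
    return dilate(erode(img, size), size)

# Hypothetical binary image: a single bright speck on a dark background.
speck = [[0] * 5 for _ in range(5)]
speck[2][2] = 1
print(opening(speck))
```

A bright object at least as large as the structuring element survives the opening unchanged, while a smaller speck is removed, which shows why the size and shape of the element determine the result.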

3.10. Evaluation Metrics

Medical image compression is an essential image processing step, and comparing images to evaluate the quality of compression is an essential part of measuring improvement. Metric selection is one of the challenges in evaluating medical image compression [37,53]. Using the right evaluation metrics to measure the compression and enhancement techniques is critical; otherwise, a method may appear to work well when it does not. We used three evaluation criteria, as follows:

  • Structural similarity index measure (SSIM).
  • Mean squared error (MSE).
  • Peak signal-to-noise ratio (PSNR).

Structural Similarity Index Measure (SSIM): SSIM is computed from three basic terms: luminance, contrast, and structure. It is a multiplicative combination of these three terms:
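In the standard formulation of SSIM, which uses the same symbols as the surrounding text, the combination and its three terms can be written as follows. The constants C_1, C_2, and C_3 are small positive values that stabilize the divisions.

```latex
\[
\mathrm{SSIM}(x, y) = \left[ l(x, y) \right]^{\alpha} \left[ c(x, y) \right]^{\beta} \left[ s(x, y) \right]^{\gamma}
\]
\[
l(x, y) = \frac{2 \mu_x \mu_y + C_1}{\mu_x^2 + \mu_y^2 + C_1}, \qquad
c(x, y) = \frac{2 \sigma_x \sigma_y + C_2}{\sigma_x^2 + \sigma_y^2 + C_2}, \qquad
s(x, y) = \frac{\sigma_{xy} + C_3}{\sigma_x \sigma_y + C_3}
\]
```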

In the equations above, μ_x, μ_y, σ_x, σ_y, and σ_xy represent the local means, standard deviations, and cross-covariance for images x and y, respectively. If α = β = γ = 1 (the default values for the exponents) and C_3 = C_2/2 (the default value for C_3), the index can be simplified as follows:
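Under these defaults, the contrast and structure terms multiply to (2σ_xy + C_2)/(σ_x² + σ_y² + C_2), which gives the standard simplified form:

```latex
\[
\mathrm{SSIM}(x, y) =
\frac{\left( 2 \mu_x \mu_y + C_1 \right) \left( 2 \sigma_{xy} + C_2 \right)}
     {\left( \mu_x^2 + \mu_y^2 + C_1 \right) \left( \sigma_x^2 + \sigma_y^2 + C_2 \right)}
\]
```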

This metric evaluates the similarity between two images. It was developed as an improvement on measures such as MSE and PSNR.
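A simplified sketch of the index is given below. It evaluates the simplified SSIM expression once over the whole image, whereas standard implementations compute it in local windows and average the results, so its values will generally differ from theirs. The function name and the constants C_1 = (0.01 · peak)² and C_2 = (0.03 · peak)² are assumptions (the latter are commonly used defaults, not values stated in the text).

```python
def ssim_global(x, y, peak=255.0):
    """Simplified SSIM computed from global image statistics."""
    xs = [p for row in x for p in row]
    ys = [p for row in y for p in row]
    n = len(xs)
    mu_x = sum(xs) / n
    mu_y = sum(ys) / n
    var_x = sum((p - mu_x) ** 2 for p in xs) / n
    var_y = sum((q - mu_y) ** 2 for q in ys) / n
    cov = sum((p - mu_x) * (q - mu_y) for p, q in zip(xs, ys)) / n
    c1 = (0.01 * peak) ** 2  # assumed default stabilizing constant
    c2 = (0.03 * peak) ** 2  # assumed default stabilizing constant
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator

# Hypothetical 2 x 2 checkerboard and its inverse.
a = [[0, 255], [255, 0]]
b = [[255, 0], [0, 255]]
print(ssim_global(a, a), ssim_global(a, b))
```

Identical images give a value of 1, and the value falls as the structure of the two images diverges; for the inverted pattern the covariance is negative, so the index is negative as well.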

Mean Squared Error (MSE): The mean squared error is an evaluation metric most commonly applied to regression models. It can also be used to evaluate compression and enhancement techniques:
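In its standard form for an M × N image, with I the original image and I′ the recovered image, the MSE can be written as:

```latex
\[
\mathrm{MSE} = \frac{1}{M N} \sum_{x=1}^{M} \sum_{y=1}^{N}
\left[ I(x, y) - I'(x, y) \right]^2
\]
```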

In the equation above, I(x, y) and I′(x, y) denote the values of the original and recovered pixels at row x and column y of the M × N image, respectively.

Peak Signal-to-Noise Ratio (PSNR): The peak signal-to-noise ratio (PSNR) is one of the appropriate reconstruction quality criteria for medical image compression and enhancement. PSNR is the ratio between the maximum possible value (power) of a signal and the power of the distorting noise that affects the quality of its representation:
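In the usual definition, PSNR is expressed in decibels in terms of the MSE. The symbol MAX_I, the maximum possible pixel value (255 for 8-bit images), is introduced here for the example and does not appear in the text.

```latex
\[
\mathrm{PSNR} = 10 \log_{10} \left( \frac{\mathrm{MAX}_I^{2}}{\mathrm{MSE}} \right)
\]
```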

A medical image was chosen from the MedPix® database to show the results. After compression and enhancement were applied to the sample medical images, specific outputs from each image were obtained and analyzed. MedPix® is an open-access online database of medical images, teaching cases, and clinical topics that integrates images and textual metadata, including over 12,000 patient case scenarios, 9000 topics, and about 59,000 images. Its primary target audience includes physicians and nurses, allied health professionals, medical students, nursing students, and others interested in medical knowledge. The content is organized by disease location (organ system), pathology category, patient profiles, image classification, and image captions. The collection is searchable by patient symptoms and signs, diagnosis, organ system, image modality, image description, keywords, contributing authors, and many other search options. The values obtained from this medical image are categorized in Table 1.

Table 1. Performance metrics for the sample medical image.

Enhanced and Compressed Output

Figure 1, Figure 2, Figure 3, Figure 4, and Figure 5 display the enhanced and compressed output of the sample medical image.

Figure 1. Lossy techniques: DCT and DWT compression.

Figure 2. Lossy techniques: AHE and MO enhancement for DCT and DWT.

Figure 3. Lossless techniques: Lossless compression utilizing RLE and BTC.

Figure 4. Lossless techniques: Enhancement of BTC using AHE and MO.

Figure 5. Lossless techniques: Enhancement of RLE using AHE and MO.

5. Discussion

Image compression is the application of data compression to digital images; its purpose is to reduce the redundancy of the image content so that the information can be stored or transferred in an efficient form. Image compression can be lossless or lossy. Lossless compression is sometimes preferred for images such as technical drawings and icons because lossy compression methods compromise image quality, especially at low bit rates. Lossless compression methods may also be preferred for high-value content, such as medical images or scans made for archiving purposes. Lossy methods are especially suitable for natural photographs in applications where a minor (sometimes imperceptible) loss of fidelity is acceptable in order to reduce the bit rate substantially. To store images, the amount of information must be reduced as much as possible, and the basis of compression methods is the exclusion of redundant parts of the information and data. The compression ratio determines the amount and percentage of information discarded. Compression simplifies data storage and transmission and reduces the required bandwidth.

PSNR, MSE, and SSIM were the three performance metrics used for the sample medical images. As shown in Table 1, morphological operations are not a suitable algorithm for improving image quality after compression with either lossless or lossy techniques. Further examination and comparison of the PSNR values obtained with the morphological operations algorithm confirmed that MO is not appropriate for enhancing images after compression. In general, the SSIM and PSNR values after the MO and AHE methods are lower than the PSNR and SSIM values after compression alone; therefore, these two methods, AHE and MO, are not suitable for medical image enhancement.

Figure 6 illustrates the performance of the presented compression and enhancement methods. Based on our findings, the DCT method has a higher PSNR than the other methods and is well suited to compression. Among the enhancement methods, AHE performs better than the others. The SSIM values indicate that DWT and block truncation coding are weaker at compressing X-ray images than the DCT and RLE techniques. Comparing the presented methods with state-of-the-art image compression approaches suggests that the presented techniques have higher accuracy. Regarding Table 2, the lowest MSE values belong to the presented DCT and RLE methods, and the PSNR values are 89.98 and 54.77 for DCT and DWT, respectively. These values are higher than those reported in the literature.

Figure 6. The performance of the presented methods: (a) PSNR criteria; (b) SSIM criteria.

Table 2. The comparison between the presented methods and the state-of-the-art.

6. Conclusions

This paper mainly aimed to obtain an efficient medical image output. For this purpose, a comprehensive literature review was conducted to understand the different features and functions of the available methods. This review provided explicit knowledge of the enhancement and compression methods. Additionally, how they work on grayscale medical images was investigated.

In the first step, compression was performed by employing both lossless and lossy methods, followed by enhancement. Four techniques, BTC, DCT, DWT, and RLE, were applied for compression. Based on MSE, SSIM, and PSNR, lossy compression using DWT showed better results after enhancement than the DCT technique, without losing more information. The RLE and BTC techniques compressed well without losing much data, and the analysis showed that RLE achieved a reasonable compression rate compared with BTC. Each compression technique was further enhanced using the two techniques AHE and MO. The results of the analysis showed that the compression and enhancement techniques work well together. In terms of PSNR and SSIM, the RLE technique showed higher values and better image quality following enhancement than the BTC technique. The experiments showed that the combination of AHE and RLE gave more satisfactory enhancement results than the other combinations. The AHE technique also considerably improved the image compressed with DWT. Morphological operations were utilized to improve the quality of the background rather than to sharpen the image or increase its contrast. In particular, such techniques were used to improve a particular region of interest.

Medical imaging is developing rapidly owing to advances in image processing techniques, including image recognition, testing, and improvement. Image processing increases the percentage and number of problems detected. In future work, machine learning techniques, including supervised, unsupervised, and reinforcement learning algorithms, meta-heuristic algorithms, approximate algorithms, and deep learning algorithms, can be applied to image processing and image optimization with different parameters.


Author Contributions

Conceptualization, Y.P. and F.C.; methodology, Y.P.; software, F.C.; validation, Y.P. and F.C.; formal analysis, F.C.; investigation, Y.P.; resources, F.C.; data curation, F.C.; writing—original draft preparation, Y.P.; writing—review and editing, Y.P.; visualization, Y.P.; supervision, Y.P.; project administration, F.C.; funding acquisition, F.C. All authors have read and agreed to the published version of the manuscript.

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Data availability statement, conflicts of interest.

The authors declare no conflict of interest.

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

image preprocessing Recently Published Documents

Degraded document image preprocessing using local adaptive sharpening and illumination compensation

Quantitative identification cracks of heritage rock based on digital image technology.

Digital image processing technologies are used in this paper to extract and evaluate the cracks of heritage rock. First, the image goes through a series of preprocessing operations, such as graying, enhancement, filtering, and binarization, to filter out a large part of the noise. Then, to extract the crack area accurately, the image is segmented again to isolate the crack area and processed with morphological filtering. After evaluation, the obtained fracture area can provide data support for the restoration and protection of heritage rock. In this paper, the cracks of heritage rock are extracted at three different locations. The results show that the three groups of rock fractures affect the rocks differently, but all need to be repaired to maintain the appearance of the heritage rock.

Image Preprocessing Method in Radiographic Inspection for Automatic Detection of Ship Welding Defects

Welding defects must be inspected to verify that welds meet the requirements for ship welded joints. Among nondestructive inspection methods, radiographic inspection is widely applied during the production process. To perform nondestructive inspection, the completed weldment must be transported to the inspection station, which is expensive; consequently, automation of welding defect detection is required. Recently, companies have been making continuous attempts at several processing sites to incorporate deep learning and detect defects more accurately. To detect welding defects automatically with deep learning during radiographic nondestructive inspection, preprocessing of the radiographic inspection images should be prioritized. In this study, by analyzing the pixel values, we developed an image preprocessing method that can integrate the defect features. After the contrast between the defect and the background in radiographic images was maximized through CLAHE (contrast-limited adaptive histogram equalization), denoising, thresholding, and concatenation were performed sequentially. The improvement in detection performance due to preprocessing was verified by comparing the results of applying the algorithm to raw images, typically preprocessed images, and images preprocessed with the proposed method. The mAP for the training and test data was 84.9% and 51.2% for the model trained on images preprocessed with the proposed method, 82.0% and 43.5% for the model trained on typically preprocessed images, and 78.0% and 40.8% for the model trained on raw images. Object detection algorithms advance every year, improving the mAP by approximately 3% to 10%. This study achieved a comparable performance improvement through data preprocessing alone.


A measurement of visual complexity for heterogeneity in the built environment based on fractal dimension and its application in two gardens.

In this study, a fractal dimension-based method has been developed to compute the visual complexity of the heterogeneity in the built environment. The built environment is a very complex combination, structurally consisting of both natural and artificial elements. Its fractal dimension computation is often disturbed by the homogenous visual redundancy, which is textured but needs less attention to process, so that it leads to a pseudo-evaluation of visual complexity in the built environment. Based on human visual perception, the study developed a method: fractal dimension of heterogeneity in the built environment, which includes Potts segmentation and Canny edge detection as image preprocessing procedure and fractal dimension as computation procedure. This proposed method effectively extracts perceptually meaningful edge structures in the visual image and computes its visual complexity which is consistent with human visual characteristics. In addition, an evaluation system combining the proposed method and the traditional method has been established to classify and assess the visual complexity of the scenario more comprehensively. Two different gardens had been computed and analyzed to demonstrate that the proposed method and the evaluation system provide a robust and accurate way to measure the visual complexity in the built environment.

Extracting Weld Bead Shapes from Radiographic Testing Images with U-Net

Metals created by melting basic metal and welding rods in welding operations are referred to as weld beads. The weld bead shape allows the observation of pores and defects such as cracks in the weld zone. Radiographic testing images are used to determine the quality of the weld zone. The extraction of only the weld bead to determine the generative pattern of the bead can help efficiently locate defects in the weld zone. However, manual extraction of the weld bead from weld images is not time and cost-effective. Efficient and rapid welding quality inspection can be conducted by automating weld bead extraction through deep learning. As a result, objectivity can be secured in the quality inspection and determination of the weld zone in the shipbuilding and offshore plant industry. This study presents a method for detecting the weld bead shape and location from the weld zone image using image preprocessing and deep learning models, and extracting the weld bead through image post-processing. In addition, to diversify the data and improve the deep learning performance, data augmentation was performed to artificially expand the image data. Contrast limited adaptive histogram equalization (CLAHE) is used as an image preprocessing method, and the bead is extracted using U-Net, a pixel-based deep learning model. Consequently, the mean intersection over union (mIoU) values are found to be 90.58% and 85.44% in the train and test experiments, respectively. Successful extraction of the bead from the radiographic testing image through post-processing is achieved.

An Improved Homomorphic Filtering Algorithm for Face Image Preprocessing

Automatic detection of mesiodens on panoramic radiographs using artificial intelligence.

This study aimed to develop an artificial intelligence model that can detect mesiodens on panoramic radiographs of various dentition groups. Panoramic radiographs of 612 patients were used for training. A convolutional neural network (CNN) model based on YOLOv3 for detecting mesiodens was developed. The model performance according to three dentition groups (primary, mixed, and permanent dentition) was evaluated, both internally (130 images) and externally (118 images), using a multi-center dataset. To investigate the effect of image preprocessing, contrast-limited adaptive histogram equalization (CLAHE) was applied to the original images. The accuracy of the internal test dataset was 96.2% and that of the external test dataset was 89.8% in the original images. For the primary, mixed, and permanent dentition, the accuracy of the internal test dataset was 96.7%, 97.5%, and 93.3%, respectively, and the accuracy of the external test dataset was 86.7%, 95.3%, and 86.7%, respectively. The CLAHE images yielded less accurate results than the original images in both test datasets. The proposed model showed good performance in the internal and external test datasets and had the potential for clinical use to detect mesiodens on panoramic radiographs of all dentition types. The CLAHE preprocessing had a negligible effect on model performance.

Vehicle License Plate Image Preprocessing Strategy Under Fog/Hazy Weather Conditions

An image preprocessing model of coal and gangue in high dust and low light conditions based on the joint enhancement algorithm.

In coal mines, heavy dust pollution impairs the lighting, which makes coal and gangue images dim, shadowed, or reflective and makes it difficult to distinguish coal and gangue from the background. To solve these problems, this paper proposes a preprocessing model for low-quality images of coal and gangue based on a joint enhancement algorithm. First, the characteristics of coal and gangue images are analyzed in detail, and ways of improving them are put forward. Second, an image preprocessing flow for coal and gangue is established based on local features. Finally, a joint image enhancement algorithm is proposed based on bilateral filtering. In the experiments, K-means clustering segmentation is used to compare the segmentation results of different preprocessing methods in terms of information entropy and structural similarity. Simulation experiments on six scenes show that the proposed preprocessing model can effectively reduce noise, improve overall brightness and contrast, and enhance image details, while also giving better segmentation results. Together, these provide a better basis for target recognition.

  • 12 February 2024

How journals are fighting back against a wave of questionable images

  • Nicola Jones

Journals are making an effort to detect manipulated images of the gels used to analyse proteins and DNA. Credit: Shutterstock

It seems that every month brings a fresh slew of high-profile allegations against researchers whose papers, some of them years old, contain signs of possible image manipulation.

Scientist sleuths are using their own trained eyes, along with commercial software based on artificial intelligence (AI), to spot image duplication and other issues that might hint at sloppy record-keeping or worse. They are bringing these concerns to light in places like PubPeer, an online forum featuring many new posts every day flagging image concerns.

Some of these efforts have led to action. Last month, for example, the Dana-Farber Cancer Institute (DFCI) in Boston, Massachusetts, said that it would ask journals to retract or correct a slew of papers authored by its staff members. The disclosure came after an observer raised concerns about images in the papers. The institute says it is continuing to investigate the concerns.

That incident was just one of many. In the face of public scrutiny, academic journals are increasingly adopting tricks and tools, including commercial AI-based systems, to spot problematic imagery before, rather than after, publication. Here, Nature reviews the problem and how publishers are attempting to tackle it.

What sorts of imagery problem are being spotted?

Questionable image practices include the use of the same data across several graphs, the replication of photos or portions of photos, and the deletion or splicing of images. Such issues can indicate an intent to mislead, but can also result from an innocent attempt to improve a figure’s aesthetics, for example. Nonetheless, even innocent mistakes can be damaging to the integrity of science, experts say.

How prevalent are these issues, and are they on the rise?

The precise number of such incidents is unknown. A database maintained by the website Retraction Watch lists more than 51,000 documented retractions, corrections or expressions of concern. Of those, about 4% flag a concern about images.

One of the largest efforts to quantify the problem was carried out by Elisabeth Bik, a scientific image sleuth and consultant in San Francisco, California, and her colleagues¹. They examined images in more than 20,000 papers that were published between 1995 and 2014. Overall, they found that nearly 4% of the papers contained problematic figures. The study also revealed an increase in inappropriate image duplications starting around 2003, probably because digital photography made it easier to alter photos, Bik says.

Modern papers also contain more images than do those from decades ago, notes Bik. “Combine all of this with many more papers being published per day compared to ten years ago, and the increased pressure put on scientists to publish, and there will just be many more problems that can be found.”

The high rate of reports of image issues might also be driven by “a rise in whistle-blowing because of the global community’s increased awareness of integrity issues”, says Renee Hoch, who works for the PLOS Publication Ethics team in San Francisco, California.

What happened at the Dana-Farber Cancer Institute?

In January, biologist and investigator Sholto David, based in Pontypridd, UK, blogged about possible image manipulation in more than 50 biology papers published by scientists at the DFCI, which is affiliated with Harvard University in Cambridge, Massachusetts. Among the authors were DFCI president Laurie Glimcher and her deputy, William Hahn; a DFCI spokesperson said they are not speaking to reporters. David’s blog highlighted what seemed to be duplications or other image anomalies in papers spanning almost 20 years. The post was first reported by The Harvard Crimson.

The DFCI, which had already been investigating some of these issues, is seeking retractions for several papers and corrections for many others. Barrett Rollins, the DFCI’s research-integrity officer, says that “moving as quickly as possible to correct the scientific record is important and a common practice of institutions with strong research integrity”.

“It bears repeating that the presence of image duplications or discrepancies in a paper is not evidence of an author’s intent to deceive,” Rollins adds.

What are journals doing to improve image integrity?

In an effort to reduce publication of mishandled images, some journals, including the Journal of Cell Science, PLOS Biology and PLOS ONE, either require or ask that authors submit raw images in addition to the cropped or processed images in their figures.

Many publishers are also incorporating AI-based tools including ImageTwin, ImaCheck and Proofig into consistent or spot pre-publication checks. The Science family of journals announced in January it is now using Proofig to screen all its submissions. Holden Thorp, editor in chief of the Science family of journals, says Proofig has spotted things that led editors to decide against publishing papers. He says authors are usually grateful to have their errors identified.

What kinds of issues do these AI-based systems flag?

All these systems can, for example, quickly detect duplicates of images in the same paper, even if those images have been rotated, stretched or cropped or had their colour altered.
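As a rough illustration of how a duplicate can still be matched after rotation or a brightness change, here is a minimal sketch using a simple average hash. It is an assumption for teaching purposes only: the article does not describe how Proofig, ImageTwin or ImaCheck work internally, and the function names (`average_hash`, `looks_duplicated`) and the `max_distance` threshold are hypothetical.

```python
from typing import List, Tuple

Image = List[List[int]]  # grayscale pixels, row-major


def rotate90(img: Image) -> Image:
    """Rotate a grayscale image 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


def average_hash(img: Image) -> Tuple[int, ...]:
    """One bit per pixel: 1 if the pixel is brighter than the image mean."""
    flat = [p for row in img for p in row]
    mean = sum(flat) / len(flat)
    return tuple(1 if p > mean else 0 for p in flat)


def hamming(a: Tuple[int, ...], b: Tuple[int, ...]) -> int:
    """Number of positions where two equal-length hashes differ."""
    return sum(x != y for x, y in zip(a, b))


def looks_duplicated(a: Image, b: Image, max_distance: int = 2) -> bool:
    """Compare `a` against all four right-angle rotations of `b`."""
    hash_a = average_hash(a)
    candidate = b
    for _ in range(4):
        same_shape = (len(candidate) == len(a)
                      and len(candidate[0]) == len(a[0]))
        if same_shape and hamming(hash_a, average_hash(candidate)) <= max_distance:
            return True
        candidate = rotate90(candidate)
    return False


# Toy 3x3 "panels": the second is the first, brightened and rotated.
panel = [[10, 200, 30], [40, 50, 220], [230, 20, 60]]
brightened_rotated = rotate90([[p + 15 for p in row] for row in panel])
unrelated = [[250, 10, 240], [20, 230, 15], [245, 5, 235]]
```

Because each bit is taken relative to the image mean, a uniform brightness shift leaves the hash unchanged, and trying all four rotations covers right-angle turns. Cropping, stretching and arbitrary-angle rotation would need more robust features than this sketch provides.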

Different systems have different merits. Proofig, for example, can spot splices created by chopping out or stitching together portions of images. ImageTwin, says Bik, has the advantage of allowing users to cross-check an image against a large data set of other papers. Some publishers, including Springer Nature, are developing their own AI image-integrity software. (Nature’s news team is editorially independent of its publisher, Springer Nature.)

Many of the errors flagged by AI tools seem to be innocent. In a study of more than 1,300 papers submitted to 9 American Association for Cancer Research journals in 2021 and early 2022, Proofig flagged 15% as having possible image duplications that required follow-up with authors. Author responses indicated that 28% of the 207 duplications were intentional — driven, for example, by authors using the same image to illustrate multiple points. Sixty-three per cent were unintentional mistakes.

How well do these AI systems work?

Users report that AI-based systems definitely make it faster and easier to spot some kinds of image problems. The Journal of Clinical Investigation trialled Proofig from 2021 to 2022 and found that it tripled the proportion of manuscripts flagged for potentially problematic images, from 1% to 3% (ref. 2).

But these systems are less adept at spotting more complex manipulations or AI-generated fakery, says Bik. The tools are “useful to detect mistakes and low-level integrity breaches, but that is but one small aspect of the bigger issue”, agrees Bernd Pulverer, chief editor of EMBO Reports. “The existing tools are at best showing the tip of an iceberg that may grow dramatically, and current approaches will soon be largely obsolete.”

Are pre-publication checks stemming image issues?

A combination of expert teams, technology tools and increased vigilance seems to be working — for the time being. “We have applied systematic screening now for over a decade and for the first time see detection rates decline,” says Pulverer.

But as image manipulation gets more sophisticated, catching it will become ever harder, he says. “In a couple of years all of our current image-integrity screening will still be useful for filtering out mistakes, but certainly not for detecting fraud,” Pulverer says.

How can image manipulation best be tackled in the long run?

Ultimately, stamping out image manipulation will involve complex changes to how science is done, says Bik, with more focus on rigour and reproducibility, and repercussions for bad behaviour. “There are too many stories of bullying and highly demanding PIs spending too little time in their labs, and that just creates a culture where cheating is ok,” she says. “This needs to change.”

doi: https://doi.org/10.1038/d41586-024-00372-6

1. Bik, E. M., Casadevall, A. & Fang, F. C. mBio 7, e00809-16 (2016).

2. Jackson, S., Williams, C. L., Collins, K. L. & McNally, E. M. J. Clin. Invest. 132, e162884 (2022).



  • Today, we’re publicly releasing the Video Joint Embedding Predictive Architecture (V-JEPA) model, a crucial step in advancing machine intelligence with a more grounded understanding of the world.
  • This early example of a physical world model excels at detecting and understanding highly detailed interactions between objects.
  • In the spirit of responsible open science, we’re releasing this model under a Creative Commons NonCommercial license for researchers to further explore.

As humans, much of what we learn about the world around us, particularly in our early stages of life, is gleaned through observation. Take gravity: even an infant (or a cat) can intuit, after knocking several items off a table and observing the results, that what goes up must come down. You don’t need hours of instruction or to read thousands of books to arrive at that result. Your internal world model, a contextual understanding based on a mental model of the world, predicts these consequences for you, and it’s highly efficient.

“V-JEPA is a step toward a more grounded understanding of the world so machines can achieve more generalized reasoning and planning,” says Meta’s VP & Chief AI Scientist Yann LeCun, who proposed the original Joint Embedding Predictive Architectures (JEPA) in 2022. “Our goal is to build advanced machine intelligence that can learn more like humans do, forming internal models of the world around them to learn, adapt, and forge plans efficiently in the service of completing complex tasks.”

Video JEPA in focus

V-JEPA is a non-generative model that learns by predicting missing or masked parts of a video in an abstract representation space. This is similar to how our Image Joint Embedding Predictive Architecture (I-JEPA) compares abstract representations of images (rather than comparing the pixels themselves). Unlike generative approaches that try to fill in every missing pixel, V-JEPA has the flexibility to discard unpredictable information, which leads to improved training and sample efficiency by a factor between 1.5x and 6x.

Because it takes a self-supervised learning approach, V-JEPA is pre-trained entirely with unlabeled data. Labels are only used to adapt the model to a particular task after pre-training. This type of architecture proves more efficient than previous models, both in terms of the number of labeled examples needed and the total amount of effort put into learning even the unlabeled data. With V-JEPA, we’ve seen efficiency boosts on both of these fronts.

With V-JEPA, we mask out a large portion of a video so the model is only shown a little bit of the context. We then ask the predictor to fill in the blanks of what’s missing—not in terms of the actual pixels, but rather as a more abstract description in this representation space.
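To make the idea of predicting in representation space rather than pixel space concrete, here is a minimal sketch. Everything in it is an assumption for illustration: `encode` is a hypothetical stand-in for a learned encoder, and the mean-absolute-error loss is one plausible choice, not a detail stated in this post.

```python
from typing import List, Sequence

Vector = List[float]


def encode(patch: Sequence[float]) -> Vector:
    """Hypothetical stand-in encoder: summarize a patch as (mean, range)."""
    return [sum(patch) / len(patch), max(patch) - min(patch)]


def latent_loss(predicted: List[Vector], targets: List[Vector]) -> float:
    """Mean absolute error between predicted and target representations."""
    total, count = 0.0, 0
    for p, t in zip(predicted, targets):
        for pi, ti in zip(p, t):
            total += abs(pi - ti)
            count += 1
    return total / count


# Pixels of the masked patches are only ever seen through the encoder.
masked_patches = [[0.1, 0.2, 0.3, 0.4], [0.9, 0.8, 0.9, 0.8]]
targets = [encode(p) for p in masked_patches]

# A real predictor would compute these from the visible context;
# here they are fixed guesses so the example stays self-contained.
predictions = [[0.25, 0.3], [0.85, 0.1]]
loss = latent_loss(predictions, targets)
```

The loss never compares individual pixels, so two patches with the same summary but different pixel-level detail count as equally well predicted. That is the sense in which unpredictable detail can be discarded.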


Masking methodology

V-JEPA wasn’t trained to understand one specific type of action. Instead it used self-supervised training on a range of videos and learned a number of things about how the world works. The team also carefully considered the masking strategy—if you don’t block out large regions of the video and instead randomly sample patches here and there, it makes the task too easy and your model doesn’t learn anything particularly complicated about the world.

It’s also important to note that, in most videos, things evolve somewhat slowly over time. If you mask a portion of the video but only for a specific instant in time and the model can see what came immediately before and/or immediately after, it also makes things too easy and the model almost certainly won’t learn anything interesting. As such, the team used an approach where it masked portions of the video in both space and time, which forces the model to learn and develop an understanding of the scene.
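The masking idea described above can be sketched as follows. This is a simplified illustration under stated assumptions: the grid size, block size and the `tube_mask` helper are hypothetical, and the actual V-JEPA masking strategy may differ in its details.

```python
import random
from typing import List

Mask = List[List[List[bool]]]  # mask[t][y][x], True means hidden


def tube_mask(frames: int, height: int, width: int,
              block_h: int, block_w: int, seed: int = 0) -> Mask:
    """Hide one large contiguous spatial block in every frame.

    Because the same region is hidden at all time steps, the model
    cannot recover it by looking at the frame just before or after.
    """
    rng = random.Random(seed)
    top = rng.randrange(height - block_h + 1)
    left = rng.randrange(width - block_w + 1)
    frame_mask = [
        [top <= y < top + block_h and left <= x < left + block_w
         for x in range(width)]
        for y in range(height)
    ]
    return [[row[:] for row in frame_mask] for _ in range(frames)]


# A 4-frame clip on a 6x6 patch grid with a 4x4 block hidden throughout.
mask = tube_mask(frames=4, height=6, width=6, block_h=4, block_w=4)
hidden = sum(cell for frame in mask for row in frame for cell in row)
```

Here 64 of the 144 patch positions are hidden, and the hidden region is identical in every frame, which is what distinguishes this from sampling scattered patches at a single instant.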

Efficient predictions

Making these predictions in the abstract representation space is important because it allows the model to focus on the higher-level conceptual information of what the video contains without worrying about the kind of details that are most often unimportant for downstream tasks. After all, if a video shows a tree, you’re likely not concerned about the minute movements of each individual leaf.

One of the reasons why we’re excited about this direction is that V-JEPA is the first model for video that’s good at “frozen evaluations,” which means we do all of our self-supervised pre-training on the encoder and the predictor, and then we don’t touch those parts of the model anymore. When we want to adapt them to learn a new skill, we just train a small lightweight specialized layer or a small network on top of that, which is very efficient and quick.


Previous work had to do full fine-tuning, which means that after pre-training your model, when you want the model to get really good at fine-grained action recognition while you’re adapting your model to take on that task, you have to update the parameters or the weights in all of your model. And then that model overall becomes specialized at doing that one task and it’s not going to be good for anything else anymore. If you want to teach the model a different task, you have to use different data, and you have to specialize the entire model for this other task. With V-JEPA, as we’ve demonstrated in this work, we can pre-train the model once without any labeled data, fix that, and then reuse those same parts of the model for several different tasks, like action classification, recognition of fine-grained object interactions, and activity localization.
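The contrast between frozen evaluation and full fine-tuning can be illustrated with a toy example. The sketch below is an assumption-laden stand-in, not V-JEPA code: `frozen_encoder` is a hypothetical fixed feature extractor, and each task trains only a tiny logistic-regression probe on top of it.

```python
import math
from typing import List, Tuple

Clip = List[float]


def frozen_encoder(clip: Clip) -> List[float]:
    """Hypothetical pre-trained encoder; it is never updated below."""
    return [sum(clip) / len(clip), clip[-1] - clip[0]]


def train_probe(data: List[Tuple[Clip, int]],
                epochs: int = 500, lr: float = 0.5) -> List[float]:
    """Fit logistic-regression weights (two features plus bias)."""
    features = [(frozen_encoder(x), y) for x, y in data]  # computed once
    w = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        for f, y in features:
            z = w[0] * f[0] + w[1] * f[1] + w[2]
            err = 1.0 / (1.0 + math.exp(-z)) - y
            w[0] -= lr * err * f[0]
            w[1] -= lr * err * f[1]
            w[2] -= lr * err
    return w


def predict(w: List[float], clip: Clip) -> int:
    """Classify a clip using the frozen features and a trained probe."""
    f = frozen_encoder(clip)
    return int(w[0] * f[0] + w[1] * f[1] + w[2] > 0)


# Two different tasks share the same frozen encoder.
rising_vs_falling = [([0.1, 0.5, 0.9], 1), ([0.9, 0.5, 0.1], 0)]
bright_vs_dark = [([0.8, 0.9, 0.8], 1), ([0.1, 0.2, 0.1], 0)]
motion_probe = train_probe(rising_vs_falling)
brightness_probe = train_probe(bright_vs_dark)
```

Only three numbers are learned per task, and the encoder is shared unchanged between them. Under full fine-tuning, each task would instead produce its own modified copy of the whole model.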


Avenues for future research...

While the “V” in V-JEPA stands for “video,” it only accounts for the visual content of videos thus far. A more multimodal approach is an obvious next step, so we’re thinking carefully about incorporating audio along with the visuals.

As a proof of concept, the current V-JEPA model excels at fine-grained object interactions and distinguishing detailed object-to-object interactions that happen over time. For example, if the model needs to be able to distinguish between someone putting down a pen, picking up a pen, and pretending to put down a pen but not actually doing it, V-JEPA is quite good compared to previous methods for that high-grade action recognition task. However, those things work on relatively short time scales. If you show V-JEPA a video clip of a few seconds, maybe up to 10 seconds, it’s great for that. So another important step for us is thinking about planning and the model’s ability to make predictions over a longer time horizon.

...and the path toward AMI

To date, our work with V-JEPA has been primarily about perception—understanding the contents of various video streams in order to obtain some context about the world immediately surrounding us. The predictor in this Joint Embedding Predictive Architecture serves as an early physical world model: You don’t have to see everything that’s happening in the frame, and it can tell you conceptually what’s happening there. As a next step, we want to show how we can use this kind of a predictor or world model for planning or sequential decision-making.

We know that it’s possible to train JEPA models on video data without requiring strong supervision and that they can watch videos in the way an infant might—just observing the world passively, learning a lot of interesting things about how to understand the context of those videos in such a way that, with a small amount of labeled data, you can quickly acquire a new task and ability to recognize different actions.

V-JEPA is a research model, and we’re exploring a number of future applications. For example, we expect that the context V-JEPA provides could be useful for our embodied AI work as well as our work to build a contextual AI assistant for future AR glasses. We firmly believe in the value of responsible open science, and that’s why we’re releasing the V-JEPA model under the CC BY-NC license so other researchers can extend this work.


OpenAI teases an amazing new generative video model called Sora

The firm is sharing Sora with a small group of safety testers but the rest of us will have to wait to learn more.

By Will Douglas Heaven

OpenAI has built a striking new generative video model called Sora that can take a short text description and turn it into a detailed, high-definition film clip up to a minute long.

Based on four sample videos that OpenAI shared with MIT Technology Review ahead of today’s announcement, the San Francisco–based firm has pushed the envelope of what’s possible with text-to-video generation (a hot new research direction that we flagged as a trend to watch in 2024).

“We think building models that can understand video, and understand all these very complex interactions of our world, is an important step for all future AI systems,” says Tim Brooks, a scientist at OpenAI.

But there’s a disclaimer. OpenAI gave us a preview of Sora (which means sky in Japanese) under conditions of strict secrecy. In an unusual move, the firm would only share information about Sora if we agreed to wait until after news of the model was made public to seek the opinions of outside experts. [Editor’s note: We’ve updated this story with outside comment below.] OpenAI has not yet released a technical report or demonstrated the model actually working. And it says it won’t be releasing Sora anytime soon. [Update: OpenAI has now shared more technical details on its website.]

The first generative models that could produce video from snippets of text appeared in late 2022. But early examples from Meta, Google, and a startup called Runway were glitchy and grainy. Since then, the tech has been getting better fast. Runway’s Gen-2 model, released last year, can produce short clips that come close to matching big-studio animation in their quality. But most of these examples are still only a few seconds long.

The sample videos from OpenAI’s Sora are high-definition and full of detail. OpenAI also says it can generate videos up to a minute long. One video of a Tokyo street scene shows that Sora has learned how objects fit together in 3D: the camera swoops into the scene to follow a couple as they walk past a row of shops.

OpenAI also claims that Sora handles occlusion well. One problem with existing models is that they can fail to keep track of objects when they drop out of view. For example, if a truck passes in front of a street sign, the sign might not reappear afterward.  

In a video of a papercraft underwater scene, Sora has added what look like cuts between different pieces of footage, and the model has maintained a consistent style between them.

It’s not perfect. In the Tokyo video, cars to the left look smaller than the people walking beside them. They also pop in and out between the tree branches. “There’s definitely some work to be done in terms of long-term coherence,” says Brooks. “For example, if someone goes out of view for a long time, they won’t come back. The model kind of forgets that they were supposed to be there.”

Impressive as they are, the sample videos shown here were no doubt cherry-picked to show Sora at its best. Without more information, it is hard to know how representative they are of the model’s typical output.   

It may be some time before we find out. OpenAI’s announcement of Sora today is a tech tease, and the company says it has no current plans to release it to the public. Instead, OpenAI will today begin sharing the model with third-party safety testers for the first time.

In particular, the firm is worried about the potential misuses of fake but photorealistic video. “We’re being careful about deployment here and making sure we have all our bases covered before we put this in the hands of the general public,” says Aditya Ramesh, a scientist at OpenAI, who created the firm’s text-to-image model DALL-E.

But OpenAI is eyeing a product launch sometime in the future. As well as safety testers, the company is also sharing the model with a select group of video makers and artists to get feedback on how to make Sora as useful as possible to creative professionals. “The other goal is to show everyone what is on the horizon, to give a preview of what these models will be capable of,” says Ramesh.

To build Sora, the team adapted the tech behind DALL-E 3, the latest version of OpenAI’s flagship text-to-image model. Like most text-to-image models, DALL-E 3 uses what’s known as a diffusion model. These are trained to turn a fuzz of random pixels into a picture.

Sora takes this approach and applies it to videos rather than still images. But the researchers also added another technique to the mix. Unlike DALL-E or most other generative video models, Sora combines its diffusion model with a type of neural network called a transformer.

Transformers are great at processing long sequences of data, like words. That has made them the special sauce inside large language models like OpenAI’s GPT-4 and Google DeepMind’s Gemini. But videos are not made of words. Instead, the researchers had to find a way to cut videos into chunks that could be treated as if they were. The approach they came up with was to dice videos up across both space and time. “It’s like if you were to have a stack of all the video frames and you cut little cubes from it,” says Brooks.
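Brooks’s picture of cutting little cubes out of a stack of frames can be sketched in a few lines. This is only an illustration of the general idea: OpenAI had not published implementation details at the time of the preview, so the function name, the cube sizes and the choice of non-overlapping cubes are all assumptions.

```python
from typing import List

Video = List[List[List[int]]]  # video[t][y][x], one value per pixel


def spacetime_patches(video: Video, pt: int, ph: int, pw: int) -> List[List[int]]:
    """Cut a video into non-overlapping pt x ph x pw cubes.

    Each cube is flattened into one list, playing the role that a
    word-like token plays in a language model's input sequence.
    """
    frames, height, width = len(video), len(video[0]), len(video[0][0])
    if frames % pt or height % ph or width % pw:
        raise ValueError("video dimensions must be divisible by the cube size")
    patches = []
    for t0 in range(0, frames, pt):
        for y0 in range(0, height, ph):
            for x0 in range(0, width, pw):
                cube = [video[t][y][x]
                        for t in range(t0, t0 + pt)
                        for y in range(y0, y0 + ph)
                        for x in range(x0, x0 + pw)]
                patches.append(cube)
    return patches


# A 4-frame, 4x4 toy video whose pixel values encode their own position.
video = [[[t * 100 + y * 10 + x for x in range(4)]
          for y in range(4)] for t in range(4)]
tokens = spacetime_patches(video, pt=2, ph=2, pw=2)
```

The toy video yields eight cubes of eight values each. A video of a different duration or aspect ratio simply yields a different number of cubes, which hints at why a sequence model can accept varied inputs.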

The transformer inside Sora can then process these chunks of video data in much the same way that the transformer inside a large language model processes words in a block of text. The researchers say that this let them train Sora on many more types of video than other text-to-video models, varied in terms of resolution, duration, aspect ratio, and orientation. “It really helps the model,” says Brooks. “That is something that we’re not aware of any existing work on.”

“From a technical perspective it seems like a very significant leap forward,” says Sam Gregory, executive director at Witness, a human rights organization that specializes in the use and misuse of video technology. “But there are two sides to the coin,” he says. “The expressive capabilities offer the potential for many more people to be storytellers using video. And there are also real potential avenues for misuse.” 

OpenAI is well aware of the risks that come with a generative video model. We are already seeing the large-scale misuse of deepfake images. Photorealistic video takes this to another level.

Gregory notes that you could use technology like this to misinform people about conflict zones or protests. The range of styles is also interesting, he says. If you could generate shaky footage that looked like something shot with a phone, it would come across as more authentic.

The tech is not there yet, but generative video has gone from zero to Sora in just 18 months. “We’re going to be entering a universe where there will be fully synthetic content, human-generated content and a mix of the two,” says Gregory.

The OpenAI team plans to draw on the safety testing it did last year for DALL-E 3. Sora already includes a filter that runs on all prompts sent to the model that will block requests for violent, sexual, or hateful images, as well as images of known people. Another filter will look at frames of generated videos and block material that violates OpenAI’s safety policies.

OpenAI says it is also adapting a fake-image detector developed for DALL-E 3 to use with Sora. And the company will embed industry-standard C2PA tags, metadata that states how an image was generated, into all of Sora’s output. But these steps are far from foolproof. Fake-image detectors are hit-or-miss. Metadata is easy to remove, and most social media sites strip it from uploaded images by default.

“We’ll definitely need to get more feedback and learn more about the types of risks that need to be addressed with video before it would make sense for us to release this,” says Ramesh.

Brooks agrees. “Part of the reason that we’re talking about this research now is so that we can start getting the input that we need to do the work necessary to figure out how it could be safely deployed,” he says.

Update 2/15: Comments from Sam Gregory were added.


Our next-generation model: Gemini 1.5

Feb 15, 2024

The model delivers dramatically enhanced performance, with a breakthrough in long-context understanding across modalities.


A note from Google and Alphabet CEO Sundar Pichai:

Last week, we rolled out our most capable model, Gemini 1.0 Ultra, and took a significant step forward in making Google products more helpful, starting with Gemini Advanced. Today, developers and Cloud customers can begin building with 1.0 Ultra too, with our Gemini API in AI Studio and in Vertex AI.

Our teams continue pushing the frontiers of our latest models with safety at the core. They are making rapid progress. In fact, we’re ready to introduce the next generation: Gemini 1.5. It shows dramatic improvements across a number of dimensions and 1.5 Pro achieves comparable quality to 1.0 Ultra, while using less compute.

This new generation also delivers a breakthrough in long-context understanding. We’ve been able to significantly increase the amount of information our models can process — running up to 1 million tokens consistently, achieving the longest context window of any large-scale foundation model yet.

Longer context windows show us the promise of what is possible. They will enable entirely new capabilities and help developers build much more useful models and applications. We’re excited to offer a limited preview of this experimental feature to developers and enterprise customers. Demis shares more on capabilities, safety and availability below.

Introducing Gemini 1.5

By Demis Hassabis, CEO of Google DeepMind, on behalf of the Gemini team

This is an exciting time for AI. New advances in the field have the potential to make AI more helpful for billions of people over the coming years. Since introducing Gemini 1.0, we’ve been testing, refining and enhancing its capabilities.

Today, we’re announcing our next-generation model: Gemini 1.5.

Gemini 1.5 delivers dramatically enhanced performance. It represents a step change in our approach, building upon research and engineering innovations across nearly every part of our foundation model development and infrastructure. This includes making Gemini 1.5 more efficient to train and serve, with a new Mixture-of-Experts (MoE) architecture.

The first Gemini 1.5 model we’re releasing for early testing is Gemini 1.5 Pro. It’s a mid-size multimodal model, optimized for scaling across a wide range of tasks, and performs at a similar level to 1.0 Ultra, our largest model to date. It also introduces a breakthrough experimental feature in long-context understanding.

Gemini 1.5 Pro comes with a standard 128,000 token context window. But starting today, a limited group of developers and enterprise customers can try it with a context window of up to 1 million tokens via AI Studio and Vertex AI in private preview.

As we roll out the full 1 million token context window, we’re actively working on optimizations to improve latency, reduce computational requirements and enhance the user experience. We’re excited for people to try this breakthrough capability, and we share more details on future availability below.

These continued advances in our next-generation models will open up new possibilities for people, developers and enterprises to create, discover and build using AI.

Context lengths of leading foundation models

Highly efficient architecture

Gemini 1.5 is built upon our leading research on Transformer and MoE architecture. While a traditional Transformer functions as one large neural network, MoE models are divided into smaller “expert” neural networks.

Depending on the type of input given, MoE models learn to selectively activate only the most relevant expert pathways in their neural networks. This specialization massively enhances the model’s efficiency. Google has been an early adopter and pioneer of the MoE technique for deep learning through research such as Sparsely-Gated MoE, GShard-Transformer, Switch-Transformer, M4 and more.
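The selective activation described above can be illustrated with a toy top-k router. This is a generic sketch of sparse Mixture-of-Experts routing, not Gemini’s architecture: the gate weights, the three toy experts and the `moe_layer` helper are all hypothetical.

```python
import math
from typing import Callable, List, Sequence, Tuple

Expert = Callable[[Sequence[float]], List[float]]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a short list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def moe_layer(x: Sequence[float], gate_weights: List[List[float]],
              experts: List[Expert], k: int = 1) -> Tuple[List[float], List[int]]:
    """Route x to the k highest-scoring experts and mix their outputs."""
    scores = [sum(w * xi for w, xi in zip(row, x)) for row in gate_weights]
    top = sorted(range(len(experts)), key=lambda i: scores[i], reverse=True)[:k]
    probs = softmax([scores[i] for i in top])
    out = [0.0] * len(x)
    for p, i in zip(probs, top):
        y = experts[i](x)  # only the selected experts are evaluated
        out = [o + p * yi for o, yi in zip(out, y)]
    return out, top


# Three toy experts; in a real model each would be a neural network.
experts: List[Expert] = [
    lambda v: [2.0 * a for a in v],  # doubles its input
    lambda v: [-a for a in v],       # negates its input
    lambda v: [0.0 for _ in v],      # outputs zeros
]
gate_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]  # one row per expert
output, chosen = moe_layer([2.0, 0.5], gate_weights, experts, k=1)
```

With `k=1`, only one of the three experts runs for this input, so the compute spent per input stays roughly constant even as more experts are added.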

Our latest innovations in model architecture allow Gemini 1.5 to learn complex tasks more quickly and maintain quality, while being more efficient to train and serve. These efficiencies are helping our teams iterate, train and deliver more advanced versions of Gemini faster than ever before, and we’re working on further optimizations.

Greater context, more helpful capabilities

An AI model’s “context window” is made up of tokens, which are the building blocks used for processing information. Tokens can be entire parts or subsections of words, images, videos, audio or code. The bigger a model’s context window, the more information it can take in and process in a given prompt — making its output more consistent, relevant and useful.

Through a series of machine learning innovations, we’ve increased 1.5 Pro’s context window capacity far beyond the original 32,000 tokens for Gemini 1.0. We can now run up to 1 million tokens in production.

This means 1.5 Pro can process vast amounts of information in one go — including 1 hour of video, 11 hours of audio, codebases with over 30,000 lines of code or over 700,000 words. In our research, we’ve also successfully tested up to 10 million tokens.
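The figures above imply a rough words-to-tokens ratio that can be used for back-of-the-envelope budgeting. The sketch below assumes that ratio (700,000 words per 1 million tokens) holds for ordinary English text, which is a simplification because real tokenizers vary with language and content. The helper names are hypothetical.

```python
def estimated_tokens(words: int, words_per_million_tokens: int = 700_000) -> int:
    """Estimate tokens from a word count using the quoted ratio, rounded up."""
    return -(-words * 1_000_000 // words_per_million_tokens)


def fits(words: int, window_tokens: int) -> bool:
    """Would a document of this many words fit in the given context window?"""
    return estimated_tokens(words) <= window_tokens


STANDARD_WINDOW = 128_000   # standard 1.5 Pro window quoted in this post
PREVIEW_WINDOW = 1_000_000  # limited-preview window quoted in this post

novel_words = 100_000
novel_tokens = estimated_tokens(novel_words)
```

Under this assumption, a 100,000-word manuscript comes to roughly 143,000 tokens, which is too large for the standard window but comfortably inside the 1 million token preview.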

Complex reasoning about vast amounts of information

1.5 Pro can seamlessly analyze, classify and summarize large amounts of content within a given prompt. For example, when given the 402-page transcripts from Apollo 11’s mission to the moon, it can reason about conversations, events and details found across the document.

Reasoning across a 402-page transcript: Gemini 1.5 Pro Demo

Gemini 1.5 Pro can understand, reason about and identify curious details in the 402-page transcripts from Apollo 11’s mission to the moon.

Better understanding and reasoning across modalities

1.5 Pro can perform highly sophisticated understanding and reasoning tasks for different modalities, including video. For instance, when given a 44-minute silent Buster Keaton movie, the model can accurately analyze various plot points and events, and even reason about small details in the movie that could easily be missed.

Multimodal prompting with a 44-minute movie: Gemini 1.5 Pro Demo

Gemini 1.5 Pro can identify a scene in a 44-minute silent Buster Keaton movie when given a simple line drawing as reference material for a real-life object.

Relevant problem-solving with longer blocks of code

1.5 Pro can perform more relevant problem-solving tasks across longer blocks of code. When given a prompt with more than 100,000 lines of code, it can better reason across examples, suggest helpful modifications and give explanations about how different parts of the code work.

Problem solving across 100,633 lines of code | Gemini 1.5 Pro Demo

Gemini 1.5 Pro can reason across 100,000 lines of code giving helpful solutions, modifications and explanations.

Enhanced performance

When tested on a comprehensive panel of text, code, image, audio and video evaluations, 1.5 Pro outperforms 1.0 Pro on 87% of the benchmarks used for developing our large language models (LLMs). And when compared to 1.0 Ultra on the same benchmarks, it performs at a broadly similar level.

Gemini 1.5 Pro maintains high levels of performance even as its context window increases. In the Needle In A Haystack (NIAH) evaluation, where a small piece of text containing a particular fact or statement is purposely placed within a long block of text, 1.5 Pro found the embedded text 99% of the time, in blocks of data as long as 1 million tokens.
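
To make the setup concrete, here is a minimal sketch of how a needle-in-a-haystack test could be assembled and scored. It is an illustration only, not the evaluation harness used for Gemini. The names `build_haystack`, `recall` and `toy_model`, the filler sentences and the "magic number" needle are all hypothetical, and `toy_model` is a plain substring search standing in for a real model call.

```python
FILLER = [
    "The sky was clear that morning.",
    "Nothing unusual happened on the road.",
    "The report was filed on time.",
]
NEEDLE = "The magic number is 7481."  # hypothetical fact to retrieve
SECRET = "7481"
DEPTHS = [0.0, 0.25, 0.5, 0.75, 1.0]  # relative positions of the needle


def build_haystack(filler, needle, depth, n_sentences):
    """Return a long text with `needle` inserted at relative position `depth` (0.0 to 1.0)."""
    body = [filler[i % len(filler)] for i in range(n_sentences)]
    body.insert(int(depth * len(body)), needle)
    return " ".join(body)


def recall(answer_fn, needle, secret, depths, filler, n_sentences):
    """Fraction of needle positions at which the answer contains the secret."""
    hits = 0
    for depth in depths:
        prompt = build_haystack(filler, needle, depth, n_sentences)
        if secret in answer_fn(prompt):
            hits += 1
    return hits / len(depths)


def toy_model(prompt):
    """Hypothetical stand-in for a model call: returns the sentence mentioning the needle."""
    for sentence in prompt.split(". "):
        if "magic number" in sentence:
            return sentence
    return ""


print(recall(toy_model, NEEDLE, SECRET, DEPTHS, FILLER, 1000))
```

The toy search finds the needle at every depth by construction. Replacing `toy_model` with a function that only reads part of the prompt (for example, one that truncates the input) lowers the score, which is the kind of degradation this evaluation is designed to expose.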

Gemini 1.5 Pro also shows impressive “in-context learning” skills, meaning that it can learn a new skill from information given in a long prompt, without needing additional fine-tuning. We tested this skill on the Machine Translation from One Book (MTOB) benchmark, which shows how well the model learns from information it’s never seen before. When given a grammar manual for Kalamang, a language with fewer than 200 speakers worldwide, the model learns to translate English to Kalamang at a similar level to a person learning from the same content.

As 1.5 Pro’s long context window is the first of its kind among large-scale models, we’re continuously developing new evaluations and benchmarks for testing its novel capabilities.

For more details, see our Gemini 1.5 Pro technical report.

Extensive ethics and safety testing

In line with our AI Principles and robust safety policies, we’re ensuring our models undergo extensive ethics and safety tests. We then integrate these research learnings into our governance processes and model development and evaluations to continuously improve our AI systems.

Since introducing 1.0 Ultra in December, our teams have continued refining the model, making it safer for a wider release. We’ve also conducted novel research on safety risks and developed red-teaming techniques to test for a range of potential harms.

In advance of releasing 1.5 Pro, we've taken the same approach to responsible deployment as we did for our Gemini 1.0 models, conducting extensive evaluations across areas including content safety and representational harms, and will continue to expand this testing. Beyond this, we’re developing further tests that account for the novel long-context capabilities of 1.5 Pro.

Build and experiment with Gemini models

We’re committed to bringing each new generation of Gemini models to billions of people, developers and enterprises around the world responsibly.

Starting today, we’re offering a limited preview of 1.5 Pro to developers and enterprise customers via AI Studio and Vertex AI. Read more about this on our Google for Developers blog and Google Cloud blog.

We’ll introduce 1.5 Pro with a standard 128,000 token context window when the model is ready for a wider release. Coming soon, we plan to introduce pricing tiers that start at the standard 128,000 context window and scale up to 1 million tokens, as we improve the model.

Early testers can try the 1 million token context window at no cost during the testing period, though they should expect longer latency times with this experimental feature. Significant improvements in speed are also on the horizon.

Developers interested in testing 1.5 Pro can sign up now in AI Studio, while enterprise customers can reach out to their Vertex AI account team.

Learn more about Gemini’s capabilities and see how it works.

AI gone wild —

Scientists aghast at bizarre AI rat with huge genitals in peer-reviewed article

It's unclear how such egregiously bad images made it through peer-review.

Beth Mole - Feb 15, 2024 11:16 pm UTC

An actual laboratory rat, who is intrigued.

Appall and scorn ripped through scientists' social media networks Thursday as several egregiously bad AI-generated figures circulated from a peer-reviewed article recently published in a reputable journal. Those figures—which the authors acknowledge in the article's text were made by Midjourney—are all uninterpretable. They contain gibberish text and, most strikingly, one includes an image of a rat with grotesquely large and bizarre genitals, as well as a text label of "dck."

AI-generated Figure 1 of the paper. This image is supposed to show spermatogonial stem cells isolated, purified, and cultured from rat testes.

The article in question is titled "Cellular functions of spermatogonial stem cells in relation to JAK/STAT signaling pathway," which was authored by three researchers in China, including the corresponding author Dingjun Hao of Xi’an Honghui Hospital. It was published online Tuesday in the journal Frontiers in Cell and Developmental Biology.

Frontiers did not immediately respond to Ars' request for comment, but we will update this post with any response.

Figure 2 is supposed to be a diagram of the JAK-STAT signaling pathway.

But the rat's package is far from the only problem. Figure 2 is less graphic but equally mangled. While it's intended to be a diagram of a complex signaling pathway, it instead is a jumbled mess. One scientific integrity expert questioned whether it provided an overly complicated explanation of "how to make a donut with colorful sprinkles." Like the first image, the diagram is rife with nonsense text and baffling images. Figure 3 is no better, offering a collage of small circular images that are densely annotated with gibberish. The image is supposed to provide visual representations of how the signaling pathway from Figure 2 regulates the biological properties of spermatogonial stem cells.

Some scientists online questioned whether the article's text was also AI-generated. One user noted that AI detection software determined that it was likely to be AI-generated; however, as Ars has reported previously, such software is unreliable.

Figure 3 is supposed to show the regulation of biological properties of spermatogonial stem cells by JAK/STAT signaling pathway.

The images, while egregious examples, highlight a growing problem in scientific publishing. A scientist's success relies heavily on their publication record, with a large volume of publications, frequent publishing, and articles appearing in top-tier journals, all of which earn scientists more prestige. The system incentivizes less-than-scrupulous researchers to push through low-quality articles, which, in the era of AI chatbots, could potentially be generated with the help of AI. Researchers worry that the growing use of AI will make published research less trustworthy. As such, research journals have recently set new authorship guidelines for AI-generated text to try to address the problem. But for now, as the Frontiers article shows, there are clearly some gaps.





  1. Image Processing Lecture(7) ~Dr-Sameh Zareef

  2. Digital Image Processing (16) || Image Sampling || Quantization

  3. Image processing Lec 1&2

  4. Digital Image Processing (15) || Image Representation || Image Acquisition || Urdu || Hindi

  5. Digital Image Processing (02) || Image Classification || Syntax and Semantics || Urdu || Hindi

  6. Turnpike Sports® Spotlight


  1. Image processing

    Latest Research and Reviews Cluster-based histopathology phenotype representation learning by self-supervised multi-class-token hierarchical ViT Jiarong Ye Shivam Kalra Mohammad Saleh Miri...

  2. IEEE Transactions on Image Processing


  3. Recent Trends in Image Processing and Pattern Recognition

    The RTIP2R will take place at the Texas A&M University—Kingsville, Texas (USA), on November 22-23, 2022, in collaboration with the 2AI Research Lab—Computer Science, University of South Dakota (USA).

  4. Image Processing Technology Based on Machine Learning

    This paper introduces machine learning into image processing, and studies the image processing technology based on machine learning. This paper summarizes the current popular image processing technology, compares various image technology in detail, and explains the limitations of each image processing method.

  5. Editorial: Current Trends in Image Processing and Pattern Recognition

    With this theme, we opened a call for papers on Current Trends in Image Processing & Pattern Recognition that exactly followed the third International Conference on Recent Trends in Image Processing & Pattern Recognition (RTIP2R), 2020 (URL: http://rtip2r-conference.org). Our call was not limited to RTIP2R 2020; it was open to all.

  6. Frontiers

    A Brief Historical Perspective We first briefly discuss a few key milestones in the field of image processing. Key inventions in the development of photography and motion pictures can be traced to the 19th century. The earliest surviving photograph of a real-world scene was made by Nicéphore Niépce in 1827 ( Hirsch, 1999 ).

  7. Search for image processing

    1 code implementation • 23 Jul 2014. scikit-image is an image processing library that implements algorithms and utilities for use in research, education and industry applications. 5,767. Paper. Code.

  8. Advances in image processing using machine learning techniques

    The paper ' An Unsupervised Monocular Image Depth Prediction Algorithm Using Fourier Domain Analysis ', by Lifang Chen and Xiaojiao Tang (SPR-2021-12-0186), is dedicated to image depth estimation, which is an important method to understand the geometric structure in a scene in various artificial intelligence products such as, for example, driver...

  9. Recent trends in image processing and pattern recognition

    In "Research on Fundus Image Registration and Fusion Method based on Nonsubsampled Contourlet and Adaptive Pulse Coupled Neural Network," authors presented a registration and fusion method of fluorescein fundus angiography image and color fundus image that combines Nonsubsampled Contourlet (NSCT) and adaptive Pulse Coupled Neural Network (PCNN).

  10. A comprehensive survey of recent trends in deep learning for digital

    2.2 Rotation. Rotation (Sifre and Mallat 2013) is another type of classical geometric image data augmentation; the rotation process is done by rotating the image around an axis, whether in the right direction or the left direction, by angles between 1 and 359. Rotation may be applied to images by a certain angle degree in an additive way. For example, rotate the image at about 30-degree angles.

  11. Image and Video Processing authors/titles Jun 2023

    Title: NNMobile-Net: Rethinking CNN Design for Deep Learning-Based Retinopathy Research Authors: Wenhui Zhu, Peijie Qiu, Natasha Lepore, Oana M. Dumitrascu, Yalin Wang. ... Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Neurons and Cognition (q-bio.NC)

  12. 471383 PDFs

    Jan 2024 Bayile Getu Taye Neeraj Goel Image Processing Based Implementation of Unmanned aerial vehicle (UAV) for crop monitoring Using Drone Technology Jan 2024 Richa Srivastava Anil Kumar O P...

  13. Deep Learning-based Image Text Processing Research

    Deep learning is a powerful multi-layer architecture that has important applications in image processing and text classification. This paper first introduces the development of deep learning and two important algorithms of deep learning: convolutional neural networks and recurrent neural networks. The paper then introduces three applications of deep learning for image recognition, image ...

  14. 267349 PDFs

    Jan 2024 Naillah Gul Amandeep Kaur Purpose Hyperspectral data are the most widely used remote sensing datasets. Hyperspectral Pan-Sharpening suffers from spectral distortion; the purpose of...


    results, then, finally, Section V concludes the paper. II. BACKGROUND AND RELATED WORKS. Over the past years, research has renewed interest in modeling image compression as a learning problem, giving a series of pioneering works [5]-[9], [14], [28]-[30] that have contributed to a universal fashion effect, and have

  16. Image Processing: Research Opportunities and Challenges

    The objectives of this article are to define the meaning and scope of image processing, discuss the various steps and methodologies involved in a typical image processing, and applications...

  17. Image forgery detection: a survey of recent deep-learning approaches

    Note that k = 0 means that the image is pristine. As can be observed, the number of possible histories grows exponentially with the number of available attacks. A possible solution can be found in [14, 60, 61], where the authors formulated the problem of determining the processing history as a multi-class classification problem. Therein, each of the N histories corresponds to a class, and a ...

  18. digital image processing Latest Research Papers

    Recently Published Documents TOTAL DOCUMENTS 2864 (FIVE YEARS 556) H-INDEX 55 (FIVE YEARS 5) Latest Documents Most Cited Documents Contributed Authors Related Sources Related Keywords Developing Digital Photomicroscopy Cells 10.3390/cells11020296 2022 Vol 11 (2) pp. 296 Author (s): Kingsley Micklem Keyword (s): Image Processing

  19. IOPscience

    direction of digital image processing technology is expressed. This paper is beneficial to understand the latest technology and development trends in digital image processing, and can promote in-depth research of this technology and apply it to real life. 2. Digital image processing Technology

  20. A Novel Image Processing Approach to Enhancement and Compression of X

    5. Discussion. Image compression is an application of information compression on digital images; in other words, the purpose of this work is to reduce the redundancy of the contents of the image for the ability to store or transfer information in optimal form. Photo compression can be done without loss and total loss.

  21. FULL PAPER on Image processing & Cryptography on Hardware CU

    Rest of the paper consists of three sections i.e. Hardware architecture and implementation design, results and observation followed by conclusion. A brief theory and previous work Case 1: Image thresholding as a segmentation step: The first stage that we can think of in all stage of image processing and analysis is image binarization

  22. image preprocessing Latest Research Papers

    18 (FIVE YEARS 4) Latest Documents Most Cited Documents Contributed Authors Related Sources Related Keywords Degraded document image preprocessing using local adaptive sharpening and illumination compensation Pattern Analysis and Applications 10.1007/s10044-021-01038-z 2022 Author (s): Hong Xia Wang Bang Song Jian Chen Yi Yang

  23. How journals are fighting back against a wave of questionable ...

    In a study of more than 1,300 papers submitted to 9 American Association for Cancer Research journals in 2021 and early 2022, Proofig flagged 15% as having possible image duplications that ...

  24. V-JEPA: The next step toward advanced machine intelligence

    V-JEPA is a non-generative model that learns by predicting missing or masked parts of a video in an abstract representation space. This is similar to how our Image Joint Embedding Predictive Architecture (I-JEPA) compares abstract representations of images (rather than comparing the pixels themselves). Unlike generative approaches that try to fill in every missing pixel, V-JEPA has the ...

  25. OpenAI teases an amazing new generative video model called Sora

    OpenAI has built a striking new generative video model called Sora that can take a short text description and turn it into a detailed, high-definition film clip up to a minute long.. Based on four ...

  26. Introducing Stable Cascade

    Today marks the launch of Stable Cascade in its research preview. This innovative text to image model introduces an interesting three-stage approach, setting new benchmarks for quality, flexibility, fine-tuning, and efficiency with a focus on further eliminating hardware barriers.

  27. Introducing Gemini 1.5, Google's next-generation AI model

    Gemini 1.5 delivers dramatically enhanced performance. It represents a step change in our approach, building upon research and engineering innovations across nearly every part of our foundation model development and infrastructure. This includes making Gemini 1.5 more efficient to train and serve, with a new Mixture-of-Experts (MoE) architecture.

  28. Gartner Emerging Technologies and Trends Impact Radar for 2024

    Use this year's Gartner Emerging Tech Impact Radar to: ☑️Enhance your competitive edge in the smart world ☑️Prioritize prevalent and impactful GenAI use cases that already deliver real value to users ☑️Balance stimulating growth and mitigating risk ☑️Identify relevant emerging technologies that support your strategic product roadmap Explore all 30 technologies and trends: www ...

  29. Scientists aghast at bizarre AI rat with huge genitals in peer-reviewed

    Many researchers expressed surprise and dismay that such a blatantly bad AI-generated image could pass through the peer-review system and whatever internal processing is in place at the journal.