Detecting deepfakes and generative AI: Report on standards for AI watermarking and multimedia authenticity workshop

Existing deepfake detection tools can be broadly categorized into two groups:

i)   Methods that exploit semantic inconsistencies, such as irregular eye reflections, or known generation artifacts in the spatial or frequency domain.
ii)  Methods that use neural networks to learn a feature representation in which real images can be distinguished from AI-generated ones. For instance, training a standard convolutional neural network (CNN) on real and fake images from a single GAN yields a classifier capable of detecting images generated by a variety of unseen GANs. Given the rapid evolution of generative AI models, developing detectors that generalize to new generative models is crucial and therefore a major field of research.
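The frequency-domain route in group (i) can be illustrated with a minimal sketch: GAN upsampling often leaves periodic high-frequency artifacts, so the fraction of spectral energy far from the image's low-frequency centre can serve as a crude cue. The radius and any decision threshold here are illustrative choices, not calibrated values from the report.

```python
import numpy as np

def high_freq_energy_ratio(img: np.ndarray, radius_frac: float = 0.25) -> float:
    """Fraction of spectral energy outside a central low-frequency disc.

    A toy frequency-domain artifact cue: generated images with periodic
    upsampling artifacts tend to score higher than natural, smooth images.
    """
    spec = np.abs(np.fft.fftshift(np.fft.fft2(img))) ** 2
    h, w = img.shape
    cy, cx = h // 2, w // 2
    yy, xx = np.ogrid[:h, :w]
    dist = np.sqrt((yy - cy) ** 2 + (xx - cx) ** 2)
    low = dist <= radius_frac * min(h, w)   # low-frequency disc
    return float(spec[~low].sum() / spec.sum())
```

A smooth gradient concentrates energy near DC and scores low; a checkerboard pattern concentrates energy at high frequencies and scores high.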

Deepfake detection techniques use machine learning, and deep learning in particular, to analyse patterns and anomalies in multimedia content and identify signs of manipulation. They can be split into two families: CNN-based methods and recurrent convolutional neural network (R-CNN)-based methods, which combine a CNN with a recurrent neural network (RNN). CNN-based techniques crop faces from individual video frames and feed them into a CNN for training and prediction, producing an image-level result. Such algorithms use only the spatial information within a single frame.

R-CNN-based techniques, on the other hand, are trained on sequences of video frames and produce a video-level result. By combining convolutional and recurrent layers, they can exploit both the spatial and the temporal information in a deepfake video.
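The recurrent half of such a detector can be sketched with a minimal tanh RNN that aggregates per-frame feature vectors into one video-level score. In a real system the features would come from a CNN backbone and the weights would be learned; here they are random, fixed by a seed, purely for illustration.

```python
import numpy as np

def video_score(frame_feats: np.ndarray, seed: int = 0) -> float:
    """One video-level score from a (frames x features) sequence.

    A toy RNN stands in for the recurrent part of a CNN+RNN detector;
    weights are random and illustrative, not trained.
    """
    t, d = frame_feats.shape
    rng = np.random.default_rng(seed)
    hdim = 8
    W = rng.normal(0, 0.1, (hdim, d))     # input-to-hidden
    U = rng.normal(0, 0.1, (hdim, hdim))  # hidden-to-hidden
    v = rng.normal(0, 0.1, hdim)          # readout
    h = np.zeros(hdim)
    for x in frame_feats:                 # temporal aggregation
        h = np.tanh(W @ x + U @ h)
    return float(1.0 / (1.0 + np.exp(-v @ h)))  # sigmoid -> (0, 1)
```

The point of the sketch is the loop: the hidden state carries information across frames, which a per-frame CNN cannot do.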

In addition, several deepfake detection methods are based on classical machine learning, for example extracting hand-crafted features, including biological signals, and feeding them to a support vector machine (SVM) classifier. For example, video of a person's face contains subtle shifts in colour caused by the pulse of blood circulation. These colour changes form the basis of a technique called photoplethysmography (PPG), which can be used to detect synthetic media: deepfakes cannot recreate the colour shifts with high fidelity, biological signals are not coherently preserved across different synthetic facial parts, and deepfake videos do not contain frames with a stable PPG signal.
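The coherence cue can be sketched as follows: extract a PPG-like signal (mean green-channel intensity over time) from two facial regions and correlate them. On a real face both regions share one pulse, so correlation is high; on a deepfake it tends to break down. The region coordinates and the green channel are conventional but illustrative choices.

```python
import numpy as np

def region_signal(frames: np.ndarray, region) -> np.ndarray:
    """Mean green-channel intensity of one region over time.
    frames: (T, H, W, 3) array; region: (y0, y1, x0, x1)."""
    y0, y1, x0, x1 = region
    return frames[:, y0:y1, x0:x1, 1].mean(axis=(1, 2))

def ppg_coherence(frames: np.ndarray, region_a, region_b) -> float:
    """Pearson correlation of PPG-like signals from two face regions.
    Near 1 for a shared pulse; incoherent for many deepfakes."""
    a = region_signal(frames, region_a)
    b = region_signal(frames, region_b)
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return float((a * b).sum() / denom) if denom > 0 else 0.0
```

A classifier (the SVM mentioned above) would then consume features such as this coherence score rather than raw pixels.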

Current deepfake video detection methods have several limitations. First, they cannot always be relied upon in real-world situations, especially when images or videos are manipulated with new techniques that were not represented in the training data. Most methods fail to adequately model the natural structure and movement of human faces, which are crucial cues for accurate detection. Some methods rely heavily on mismatches between the auditory and visual modalities around the mouth, so their performance degrades when mouth motion is limited or unaltered. These limitations highlight the need for detection methods that handle real-world scenarios effectively, generalize to unseen samples, and capture the natural cues of human faces.


Touradj Ebrahimi, Professor at EPFL and Chair of JPEG, presented a new framework for deepfake detection in still images that improves a detector's robustness to various real-world perturbations (e.g., JPEG compression artifacts, brightness and contrast changes, blurring, and Gaussian noise). He presented two methods:

a)   Stochastic degradation-based augmentation.
b)   Degradation-based amplitude-phase switch augmentation.
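The first idea, stochastic degradation-based augmentation, can be sketched as randomly perturbing each training image with one of the listed real-world degradations so the detector learns to be robust to them. The degradation set and strength ranges below are illustrative assumptions, not the parameters used in the presented work.

```python
import numpy as np

def degrade(img: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Apply one randomly chosen degradation to a [0, 1] grayscale image.

    Illustrative sketch of stochastic degradation-based augmentation:
    brightness/contrast shift, Gaussian noise, or a 3x3 box blur.
    """
    choice = rng.integers(3)
    if choice == 0:    # brightness/contrast shift
        out = img * rng.uniform(0.7, 1.3) + rng.uniform(-0.1, 0.1)
    elif choice == 1:  # additive Gaussian noise
        out = img + rng.normal(0.0, 0.05, img.shape)
    else:              # 3x3 box blur via shifted views (edge-padded)
        padded = np.pad(img, 1, mode="edge")
        out = sum(padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
                  for dy in range(3) for dx in range(3)) / 9.0
    return np.clip(out, 0.0, 1.0)
```

During training, each batch would pass through `degrade` before the detector's forward pass, so the classifier never sees only pristine images.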

He concluded by presenting a detection technique for content synthesized entirely by generative AI.



