-
Problem it solves: The MesoNet GitHub repository provides two models, Meso-4 and MesoInception-4, which specialize in detecting videos whose faces were edited by DeepFake or Face2Face. More specifically, the models select frames from a video, extract the faces, and focus on features such as the eyes to determine the final classification score.
-
Input → output:
- Images (ideally 256x256) → an array of scores and the expected class, with values between 0.0 and 1.0 (1 = real, 0 = fake)
- Directory of mp4, avi, or mov videos → a dictionary mapping video names to scores between 0.0 and 1.0 (1 = real, 0 = fake)
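The score-to-label convention above can be sketched in a few lines of Python. `label_from_score` and the 0.5 decision threshold are illustrative assumptions for these notes, not part of the MesoNet repository's API:

```python
def label_from_score(score: float, threshold: float = 0.5) -> str:
    """Map a MesoNet-style score in [0.0, 1.0] to a class label.

    Convention from the notes above: 1 = real, 0 = fake.
    The 0.5 threshold is an assumption, not taken from the repository.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be in [0.0, 1.0], got {score}")
    return "real" if score >= threshold else "fake"

# Example: per-video scores as the directory-mode output would provide them
scores = {"clip_a.mp4": 0.93, "clip_b.avi": 0.12}
labels = {name: label_from_score(s) for name, s in scores.items()}
print(labels)  # {'clip_a.mp4': 'real', 'clip_b.avi': 'fake'}
```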
-
Why it's relevant for AI video detection / deepfake detection: MesoNet provides pretrained weights trained directly on a DeepFake dataset, and it is specifically designed to identify patterns in DeepFake-edited videos. The weights trained on Face2Face also achieved near-equal detection rates on DeepFake videos.
- Paper title: MesoNet: a Compact Facial Video Forgery Detection Network
- Authors / year: Darius Afchar, Vincent Nozick, Junichi Yamagishi, Isao Echizen (4 Sept 2018)
- Link: https://arxiv.org/abs/1809.00888
- Key ideas (bullet points):
- Handles videos after strong degradation from video compression
- In depth analysis of DeepFake, Face2Face, and their generation process
- Architecture summary (high-level): MesoNet uses a deep learning approach to detect patterns in edited videos, with a very low number of trainable parameters, around 28,000 for each network. Meso-4 first performs four blocks of convolution, batch normalization, and pooling. MesoInception-4 replaces the first two convolutional layers with a variant of the inception module by Szegedy et al. (cited in the MesoNet paper). A diagram of the Meso-4 architecture was also provided and is attached below.
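The ~28,000-parameter figure can be sanity-checked by hand. The sketch below counts trainable parameters for Meso-4, assuming the layer shapes published in the repository's `classifiers.py` (8/8/16/16 conv filters, 'same' padding, pooling factors 2, 2, 2, 4 on a 256x256x3 input, then Dense(16) and Dense(1)); treat these shapes as assumptions to verify against the actual source:

```python
def conv_params(filters, kh, kw, in_ch):
    # One kh x kw x in_ch kernel plus one bias per filter
    return filters * (kh * kw * in_ch + 1)

def dense_params(units, in_features):
    return units * in_features + units

# Meso-4 layer shapes as assumed from the repository's classifiers.py
total = 0
total += conv_params(8, 3, 3, 3)    # conv1: 8 filters, 3x3, RGB input
total += conv_params(8, 5, 5, 8)    # conv2: 8 filters, 5x5
total += conv_params(16, 5, 5, 8)   # conv3: 16 filters, 5x5
total += conv_params(16, 5, 5, 16)  # conv4: 16 filters, 5x5
total += 2 * (8 + 8 + 16 + 16)      # batch norm gamma/beta per conv block
# Spatial size: 256 -> 128 -> 64 -> 32 -> 8 after pooling (2, 2, 2, 4),
# so the flattened feature vector is 8 * 8 * 16 = 1024
total += dense_params(16, 8 * 8 * 16)
total += dense_params(1, 16)
print(total)  # 27977, i.e. roughly the 28,000 quoted above
```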
- Important details:
- The model was originally developed in Python 3.5 and should run in its own virtual environment to isolate its package dependencies
- The detail of the eyes strongly helps MesoNet identify real videos
- The detail of the background strongly helps identify DeepFake videos
- DeepFake-edited videos primarily change faces and may keep the background unchanged, producing videos that are identical to real videos except for the faces.
- Gotchas / assumptions:
- Pretrained MesoNet weights are trained on older videos, but present-day DeepFake videos may be more sophisticated.
- Strengths:
- The original developers achieved very high detection rates on DeepFake and Face2Face, even on each other's test sets.
- MesoNet has a very low number of trainable parameters compared to many other models.
- Weaknesses:
- Will most likely fail to classify videos that do not focus on faces.
- Requires legacy software and libraries, many of which are no longer supported or do not support newer modules.
- Fully generated videos, where not only the faces but also the backgrounds are synthetic, may greatly reduce accuracy.
- Expected preprocessing:
- An optional training set of images to fine-tune the weights to be more relevant
- Potentially cutting the video down to a shorter length, or extracting certain frames
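The frame-extraction step above can be reduced to index arithmetic. This is a minimal sketch, assuming evenly spaced sampling; the actual decoding (e.g. with OpenCV or ffmpeg) is omitted, and the function name is hypothetical, not from the repository:

```python
def sample_frame_indices(total_frames: int, n_samples: int) -> list[int]:
    """Evenly spaced frame indices for sampling a video before face extraction.

    Pure index arithmetic; pass the indices to whatever decoder is used
    (e.g. OpenCV's VideoCapture) to pull out the actual frames.
    """
    if total_frames <= 0 or n_samples <= 0:
        raise ValueError("total_frames and n_samples must be positive")
    n = min(n_samples, total_frames)
    step = total_frames / n
    return [int(i * step) for i in range(n)]

print(sample_frame_indices(300, 4))  # [0, 75, 150, 225]
```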
- Expected input format: mp4, avi, or mov files, placed in a known directory
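The directory-of-videos → dictionary-of-scores workflow can be sketched as below. `score_video` is a hypothetical callable standing in for the actual MesoNet inference step; it is not part of the repository's API:

```python
from pathlib import Path
import tempfile

VIDEO_EXTS = {".mp4", ".avi", ".mov"}

def score_directory(video_dir, score_video):
    """Map each video file in video_dir to a score in [0.0, 1.0].

    Non-video files are skipped; `score_video` is a stand-in for the
    model's per-video inference (assumed, not the repository's API).
    """
    results = {}
    for path in sorted(Path(video_dir).iterdir()):
        if path.suffix.lower() in VIDEO_EXTS:
            results[path.name] = score_video(path)
    return results

# Usage with empty placeholder files and a stub scorer
with tempfile.TemporaryDirectory() as d:
    for name in ("a.mp4", "b.mov", "notes.txt"):
        Path(d, name).touch()
    out = score_directory(d, lambda p: 1.0)
print(out)  # {'a.mp4': 1.0, 'b.mov': 1.0}
```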
- Metrics typically reported: a score from 0.0 to 1.0 giving the model's prediction of fake or real
- Questions to ask in weekly meeting:
- Things to verify:
- Can the model function properly on slightly newer versions of Python and its packages?
- Can the model accurately detect videos generated by present-day DeepFake tools?
