HiDF: A Human-Indistinguishable Deepfake Dataset


DSAIL-SKKU/HiDF


💡 HiDF: A Human-Indistinguishable Deepfake Dataset

[Figure: Sample deepfake images of HiDF]


$HiDF$ is a high-quality, human-indistinguishable deepfake dataset comprising 30K images and 4K videos. It covers diverse subjects and has undergone rigorous quality checks. HiDF addresses the limitations of existing datasets by providing deepfakes that are more realistic and harder to detect, which makes it a demanding benchmark for deepfake detection research. The data and code are publicly available for future studies.

The samples folder contains 100 deepfake images and 10 deepfake videos from HiDF. For access to the entire dataset, please refer to Request for HiDF below.


💡 News

  • [05/16/2025] Our paper on the HiDF dataset has been accepted to KDD 2025 (Datasets & Benchmarks Track)!
    • With this acceptance, the dataset is now officially available for research use.
    • Please refer to the Request for HiDF section below for access instructions.
    • 📌 The DOI and citation information will be updated here once officially available.

💡 Quantitative comparison of HiDF and existing deepfake datasets

| Dataset | # Real | # Fake | # Total | # Subject | DType | Tool | Quality |
| --- | --- | --- | --- | --- | --- | --- | --- |
| FF++ | 1,000 | 4,000 | 5,000 | N/A | Image, Video (w/o audio) | X | N/A |
| DFDC | 23,654 | 104,500 | 128,154 | 960 | Video (w/ audio) | X | N/A |
| KoDF | 62,166 | 175,776 | 237,942 | 403 | Video (w/ audio) | X | Q |
| FakeAVCeleb | 500 | 19,500 | 20,000 | 500 | Video (w/ audio) | X | N/A |
| DFGC | 2,019 | 3,270 | 5,289 | 40 | Video (w/ audio) | O | N/A |
| HiDF | 35,611 | 35,611 | 71,222 | 6,127 + α | Image, Video (w/ audio) | O | QQ |

Quantitative comparison of HiDF and existing deepfake datasets. Real, Fake, and Total for HiDF represent the combined count of images and videos. Tool indicates whether commercial tools were used to generate the deepfake data (O: yes, X: no), and Quality denotes whether a quality assessment of the dataset was performed. Q: quantitative only (using evaluation metrics such as FID, PSNR, SSIM); QQ: both quantitative and qualitative (including pilot studies such as human surveys); N/A: not applicable.


💡 Data Description

HiDF provides high-quality deepfake images and videos, along with the corresponding real data. The detailed quantities are as follows.

  • Image
    • # of Real: 30,250
    • # of Fake: 30,250
  • Video
    • # of Real: 4,241
    • # of Fake: 4,241

When swapping the face of image A with that of image B, we refer to image A as the base image and the image to be swapped (i.e., image B) as the target image. The filenames of HiDF deepfake images follow the format (base_image_id)_(target_image_id).jpg. Similarly, the filenames of deepfake videos follow the format (base_video_id)_(target_image_id).mp4.
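The naming scheme above can be unpacked programmatically. The following is a minimal sketch, not part of the HiDF code base: the `SwapPair` type and `parse_hidf_name` function are hypothetical names, and the sketch assumes the target image ID never contains an underscore, so the last underscore in the filename separates the two IDs.

```python
from pathlib import Path
from typing import NamedTuple


class SwapPair(NamedTuple):
    base_id: str    # image or video whose face is replaced
    target_id: str  # image that supplies the new face
    is_video: bool  # True for .mp4 files, False for .jpg files


def parse_hidf_name(filename: str) -> SwapPair:
    """Split '(base_id)_(target_image_id).ext' into its two IDs."""
    path = Path(filename)
    suffix = path.suffix.lower()
    if suffix not in {".jpg", ".mp4"}:
        raise ValueError(f"unexpected extension: {filename}")
    # Assumption: the target image ID has no underscore, so the last
    # underscore is the separator even if a base video ID contains one.
    base_id, sep, target_id = path.stem.rpartition("_")
    if not sep or not base_id or not target_id:
        raise ValueError(f"not in base_target form: {filename}")
    return SwapPair(base_id, target_id, suffix == ".mp4")


# Hypothetical filename built from two IDs that appear in the metadata sample.
pair = parse_hidf_name("c01213_f00105.jpg")
print(pair.base_id, pair.target_id, pair.is_video)
```

Keeping the two IDs separate makes it straightforward to look up both the base and the target in HiDF_metadata.csv.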


To support comprehensive deepfake detection research, we provide the race, gender, and age of the synthesized individuals in the generated deepfake images and videos. This information is included in the HiDF_metadata.csv file, structured as follows. For the annotation procedures regarding race, gender, and age, please refer to the paper 'HiDF: A Human-Indistinguishable Deepfake Dataset.'


  • Configuration of HiDF_metadata.csv
    | Image ID | Race | Gender | Age |
    | --- | --- | --- | --- |
    | c01213 | white | female | child |
    | f00105 | Asian | male | Adult |
    | ... | ... | ... | ... |
  • Image ID
    • This column refers to the unique ID of the image. Each ID consists of one letter and five digits. The letters 'c' and 'f' indicate the source dataset from which the image was extracted (i.e., CelebA-HQ and FFHQ, respectively).
  • Race
    • This column indicates the race of the individuals appearing in the image. Race is divided into five categories: White, Black, Asian, Latino, and Indian.
  • Gender
    • This column indicates the gender of the individuals appearing in the image.
  • Age
    • This column indicates the age group of the individuals appearing in the image, divided into three categories: child, middle-aged adult, and elderly.
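As an illustration of how the metadata might be consumed, here is a minimal sketch using only the Python standard library. It assumes the CSV is comma-delimited and that its header row uses exactly the column names shown in the table above; `load_metadata` and `SOURCE_BY_PREFIX` are hypothetical names. Labels are lower-cased on load because the sample rows mix capitalization.

```python
import csv
import io
from collections import Counter

# Prefix letter of the image ID -> source dataset, as described above.
SOURCE_BY_PREFIX = {"c": "CelebA-HQ", "f": "FFHQ"}


def load_metadata(fileobj):
    """Return {image_id: {"source", "race", "gender", "age"}} from a CSV file object."""
    rows = {}
    for row in csv.DictReader(fileobj):
        image_id = row["Image ID"].strip()
        rows[image_id] = {
            "source": SOURCE_BY_PREFIX.get(image_id[:1], "unknown"),
            # Normalize case so "Asian" and "asian" count as one label.
            "race": row["Race"].strip().lower(),
            "gender": row["Gender"].strip().lower(),
            "age": row["Age"].strip().lower(),
        }
    return rows


# In-memory stand-in for HiDF_metadata.csv, using the two sample rows above.
sample = io.StringIO(
    "Image ID,Race,Gender,Age\n"
    "c01213,white,female,child\n"
    "f00105,Asian,male,Adult\n"
)
metadata = load_metadata(sample)
race_counts = Counter(entry["race"] for entry in metadata.values())
print(metadata["c01213"]["source"], dict(race_counts))
```

For the real file, replace the `io.StringIO` stand-in with `open("HiDF_metadata.csv", newline="")`.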

💡 Evaluation

1. Installation

git clone https://github.com/DSAIL-SKKU/HiDF.git
cd HiDF/Code/AVAD

Install the dependencies listed in the requirements file:

pip install -r requirements.txt

2. Inference

Run the Python script directly:

python detect_implementation_code.py --input_dir /SampleData/HiDF/Fake --output_dir ./save

Download the checkpoint sync_model.pth from here and place it in the folder where the code resides.

input_dir is the path to the directory containing the evaluation data, and output_dir is the path to the save folder created in the ./AVAD directory.

When the run finishes, a file named `{evaluation data}_{Real/Fake}_score.csv` is generated under output_dir, recording the scores for all test videos.
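After several runs, the save folder may hold score files for more than one evaluation set. The sketch below pairs them up by the naming pattern described above. It is an illustration only: `find_score_files` is a hypothetical helper, not part of the repository, and it assumes the label in the filename is spelled exactly `Real` or `Fake`.

```python
from pathlib import Path


def find_score_files(output_dir):
    """Group score CSVs as {evaluation_data: {"Real": path, "Fake": path}}."""
    groups = {}
    for path in sorted(Path(output_dir).glob("*_score.csv")):
        # Drop the trailing "_score", then split off the Real/Fake label.
        prefix = path.stem[: -len("_score")]
        name, sep, label = prefix.rpartition("_")
        if sep and label in ("Real", "Fake"):
            groups.setdefault(name, {})[label] = path
    return groups


if __name__ == "__main__":
    # "./save" mirrors the --output_dir used in the inference command above.
    for name, files in find_score_files("./save").items():
        print(name, sorted(files))
```

An evaluation set with only one of the two labels present has not yet been scored on both real and fake data.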

3. Performance evaluation

Finally, you can evaluate performance by running the following command:

python APnAUC.py
# Average Precision (AP): 0.xxxx
# Area Under the Curve (AUC): 0.xxxx

If the save folder contains multiple CSV files, specify which Real and Fake score files to use inside the APnAUC.py script.
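For readers who want to see what the two reported metrics measure, the following is a minimal pure-Python sketch. It is not the implementation in APnAUC.py. It assumes that a higher score means "more likely fake" and treats fake videos as the positive class; reading the scores from the CSV files is omitted because their column layout is not described here. The function names and the example scores are hypothetical.

```python
def roc_auc(real_scores, fake_scores):
    """Probability that a random fake outscores a random real (ties count 0.5)."""
    wins = 0.0
    for f in fake_scores:
        for r in real_scores:
            if f > r:
                wins += 1.0
            elif f == r:
                wins += 0.5
    return wins / (len(fake_scores) * len(real_scores))


def average_precision(real_scores, fake_scores):
    """Mean of the precision values measured at the rank of each fake sample."""
    ranked = sorted(
        [(s, 1) for s in fake_scores] + [(s, 0) for s in real_scores],
        key=lambda pair: pair[0],
        reverse=True,
    )
    hits, total = 0, 0.0
    for rank, (_, is_fake) in enumerate(ranked, start=1):
        if is_fake:
            hits += 1
            total += hits / rank  # precision among the top `rank` samples
    return total / len(fake_scores)


# Made-up scores, for illustration only.
real = [0.10, 0.40, 0.35]
fake = [0.80, 0.30, 0.90]
print(f"Average Precision (AP): {average_precision(real, fake):.4f}")
print(f"Area Under the Curve (AUC): {roc_auc(real, fake):.4f}")
```

The pairwise AUC loop is quadratic in the number of videos, and this AP does not treat tied scores specially, so results can differ slightly from a library implementation. For larger evaluations, scikit-learn's `roc_auc_score` and `average_precision_score` are the usual choice.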


Acknowledgments

Our code is adapted from AVAD. We thank the authors for sharing their code and models.


💡 Request for $HiDF$

To access the HiDF dataset, please visit the following link.

The HiDF dataset is available under the Creative Commons Attribution-NonCommercial 4.0 International Public License. Any violation of this license agreement may result in legal action. By downloading HiDF, the user agrees to the terms of the CC BY-NC 4.0 license.


💡 Maintenance

This repository is maintained by Chaewon Kang and Seoyoon Jeong. Any feedback, extensions & suggestions are welcome! Please send an email to codnjs3@g.skku.edu.


💡 License

The HiDF dataset is available under the Creative Commons Attribution-NonCommercial 4.0 International Public License: https://creativecommons.org/licenses/by-nc/4.0/. The code is released under the MIT license.
