Longterm Wiki

Deepfake-Eval-2024 Benchmark

paper

Authors

Nuria Alina Chandra·Ryan Murtfeldt·Lin Qiu·Arnab Karmakar·Hannah Lee·Emmanuel Tanumihardja·Kevin Farhat·Ben Caffee·Sejin Paik·Changyeon Lee·Jongwook Choi·Aerin Kim·Oren Etzioni

Credibility Rating

3/5 (Good)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: arXiv

Relevant to AI safety discussions around synthetic media, disinformation, and the gap between benchmark performance and real-world robustness of detection systems.

Paper Details

Citations: 40 (8 influential)
Year: 2025

Metadata

Importance: 62/100 · arXiv preprint · dataset

Abstract

In the age of increasingly realistic generative AI, robust deepfake detection is essential for mitigating fraud and disinformation. While many deepfake detectors report high accuracy on academic datasets, we show that these academic benchmarks are out of date and not representative of real-world deepfakes. We introduce Deepfake-Eval-2024, a new deepfake detection benchmark consisting of in-the-wild deepfakes collected from social media and deepfake detection platform users in 2024. Deepfake-Eval-2024 consists of 45 hours of videos, 56.5 hours of audio, and 1,975 images, encompassing the latest manipulation technologies. The benchmark contains diverse media content from 88 different websites in 52 different languages. We find that the performance of open-source state-of-the-art deepfake detection models drops precipitously when evaluated on Deepfake-Eval-2024, with AUC decreasing by 50% for video, 48% for audio, and 45% for image models compared to previous benchmarks. We also evaluate commercial deepfake detection models and models finetuned on Deepfake-Eval-2024, and find that they have superior performance to off-the-shelf open-source models, but do not yet reach the accuracy of deepfake forensic analysts. The dataset is available at https://github.com/nuriachandra/Deepfake-Eval-2024.

Summary

Deepfake-Eval-2024 introduces a benchmark of in-the-wild deepfakes collected from social media in 2024, revealing that state-of-the-art open-source detectors suffer 45-50% AUC drops compared to academic benchmarks. The dataset spans 45 hours of video, 56.5 hours of audio, and 1,975 images across 52 languages from 88 websites. Commercial and finetuned models improve but still fall short of human forensic analysts.

Key Points

  • Existing academic deepfake benchmarks are outdated; open-source SOTA models see 45-50% AUC drops when evaluated on real-world 2024 deepfakes.
  • Dataset covers video (45 hrs), audio (56.5 hrs), and images (1,975) from 88 websites in 52 languages, representing the latest manipulation technologies.
  • Commercial and finetuned models outperform off-the-shelf open-source models but still lag behind human deepfake forensic analysts.
  • The benchmark highlights a critical deployment gap: high lab accuracy does not translate to real-world detection performance.
  • Dataset is publicly available, enabling community-wide evaluation and improvement of deepfake detection systems.
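To make the headline AUC comparison concrete, here is a minimal sketch of how a relative AUC drop between an academic benchmark and in-the-wild data can be computed. All labels and detector scores below are invented placeholders, not data from Deepfake-Eval-2024, and the rank-based AUC here is a standard formulation rather than the paper's exact evaluation code.

```python
# Rank-based (Mann-Whitney) AUC: the probability that a randomly chosen
# deepfake receives a higher detector score than a randomly chosen real item.
def auc(labels, scores):
    pos = [s for y, s in zip(labels, scores) if y == 1]  # deepfakes
    neg = [s for y, s in zip(labels, scores) if y == 0]  # real media
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def relative_auc_drop(auc_academic, auc_wild):
    """Percent decrease in AUC from an academic benchmark to in-the-wild data."""
    return 100.0 * (auc_academic - auc_wild) / auc_academic

# Invented scores: the detector separates an academic test set perfectly
# but is close to chance on in-the-wild deepfakes.
academic = auc([0, 0, 0, 1, 1, 1], [0.10, 0.20, 0.15, 0.90, 0.80, 0.95])
wild     = auc([0, 0, 0, 1, 1, 1], [0.40, 0.60, 0.55, 0.50, 0.70, 0.45])
print(f"academic AUC={academic:.2f}, in-the-wild AUC={wild:.2f}, "
      f"drop={relative_auc_drop(academic, wild):.0f}%")
```

With these placeholder scores the computed drop happens to land near the paper's reported range; that is coincidental, since the real numbers come from evaluating detectors on the benchmark itself.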

Cited by 2 pages

| Page | Type | Quality |
|---|---|---|
| AI Content Authentication | Approach | 58.0 |
| AI-Era Epistemic Security | Approach | 63.0 |

Cached Content Preview

HTTP 200 · Fetched Mar 20, 2026 · 84 KB

[License: CC BY-SA 4.0](https://info.arxiv.org/help/license/index.html#licenses-available)

arXiv:2503.02857v1 \[cs.CV\] 04 Mar 2025

# Deepfake-Eval-2024: A Multi-Modal In-the-Wild Benchmark of Deepfakes Circulated in 2024


Nuria Alina Chandra (TrueMedia.org)
Ryan Murtfeldt (TrueMedia.org; University of Washington, Seattle)
Lin Qiu (TrueMedia.org; University of Washington, Seattle)
Arnab Karmakar (TrueMedia.org; University of Washington, Seattle)
Hannah Lee (TrueMedia.org)
Emmanuel Tanumihardja (TrueMedia.org; University of Washington, Seattle)
Kevin Farhat (TrueMedia.org; University of Washington, Seattle)
Ben Caffee (TrueMedia.org; University of Washington, Seattle)
Sejin Paik (TrueMedia.org; Georgetown University, Washington D.C.)
Changyeon Lee (Miraflow AI; Yonsei University, Seoul)
Jongwook Choi (TrueMedia.org; Chung-Ang University, Seoul)
Aerin Kim (TrueMedia.org; Miraflow AI)
Oren Etzioni (TrueMedia.org; University of Washington, Seattle)



... (truncated, 84 KB total)