Preserving the Evidence: Artifact-Aware Preprocessing for Robust Deepfake Detection
This study argues that intelligent preprocessing, not just model architecture, is a key driver of robust deepfake detection. Using an adaptive wavelet-based pipeline, we achieved 99.96% AUROC and 100% precision at the video level on the Celeb-DF v2 dataset.
Andrew Alfonso Lie

Amid a 700% surge in deepfake-related fraud cases in the fintech sector in 2023, most research continues to focus on a single dimension: model architecture. Yet there is a far more fundamental stage that has long been overlooked: preprocessing.
This research sets out to answer a simple but critical question: how much does preprocessing actually affect deepfake detection performance?
Background
In early 2024, a finance employee in Hong Kong transferred roughly $25 million to fraudsters after attending a video conference with someone who turned out to be a deepfake of the company's CFO. This was not an isolated incident. It reflects a growing real-world threat. Deloitte projects that losses from generative AI-driven fraud, including deepfakes, could reach $40 billion in the United States by 2027.
Meanwhile, sophisticated models like EfficientNetV2 and Vision Transformers continue to be refined, yet one foundational question has rarely been answered systematically: what happens before an image ever reaches the model?
What We Did
We designed a controlled comparative analysis using EfficientNetV2-S as the backbone on the Celeb-DF v2 dataset (5,639 videos, 1:1 real-to-fake ratio). Two preprocessing scenarios were evaluated head-to-head:
Scenario 1: Baseline. Minimal preprocessing consisting of BlazeFace face detection, photometric normalization, and light augmentation (horizontal flip, rotation within plus or minus 5 degrees, and minor color variation).
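The baseline's light augmentation step can be sketched in plain NumPy as follows. Rotation is omitted here because it requires an interpolation library, and the function name is illustrative rather than the project's actual API:

```python
import numpy as np

def light_augment(frame: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Light augmentation for an HxWx3 uint8 frame (illustrative sketch)."""
    out = frame.astype(np.float32)
    # Horizontal flip with 50% probability.
    if rng.random() < 0.5:
        out = out[:, ::-1, :]
    # Minor color variation: per-channel brightness scaling within +/-5%.
    out *= rng.uniform(0.95, 1.05, size=(1, 1, 3))
    return np.clip(out, 0, 255).astype(np.uint8)

rng = np.random.default_rng(0)
frame = np.full((8, 8, 3), 128, dtype=np.uint8)
aug = light_augment(frame, rng)
print(aug.shape, aug.dtype)
```

Keeping augmentation this mild is deliberate: heavier transforms risk destroying the very artifacts the detector is supposed to find.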
Scenario 2: Adaptive Multi-Scale Enhancement (Proposed). A two-stage pipeline:
- CPU-side: gamma correction (gamma in [0.95, 1.05]), JPEG compression simulation (quality 60 to 85), adaptive sharpening, and eye masking.
- GPU-side: Symlet-4 level-3 wavelet decomposition to amplify high-frequency inconsistencies that serve as manipulation traces in deepfake facial regions, followed by spatial attention fusion.
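The CPU-side gamma correction and sharpening steps can be sketched as below. JPEG compression simulation and eye masking are omitted because they require a codec and facial landmarks respectively, and these function names are illustrative, not the pipeline's actual API:

```python
import numpy as np

def gamma_correct(frame: np.ndarray, gamma: float) -> np.ndarray:
    """Gamma correction on a uint8 frame; the pipeline draws gamma from [0.95, 1.05]."""
    norm = frame.astype(np.float32) / 255.0
    return np.clip((norm ** gamma) * 255.0, 0, 255).astype(np.uint8)

def unsharp_mask(frame: np.ndarray, amount: float = 0.5) -> np.ndarray:
    """Simple sharpening: add back the residual against a 3x3 box blur."""
    f = frame.astype(np.float32)
    pad = np.pad(f, ((1, 1), (1, 1), (0, 0)), mode="edge")
    # Average of the nine 3x3-shifted views = box blur without external deps.
    blur = sum(pad[i:i + f.shape[0], j:j + f.shape[1]]
               for i in range(3) for j in range(3)) / 9.0
    return np.clip(f + amount * (f - blur), 0, 255).astype(np.uint8)

frame = np.zeros((4, 4, 3), dtype=np.uint8)
frame[1:3, 1:3] = 200  # a bright square whose edges sharpening accentuates
out = unsharp_mask(gamma_correct(frame, 1.02))
print(out.shape, out.dtype)
```

The JPEG simulation in the real pipeline serves a similar purpose to augmentation noise: it teaches the model that compression blocking is benign, so only manipulation-specific artifacts raise suspicion.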
The key insight behind this approach: no new parameters are added to the backbone. Instead, the input signal is enriched so that manipulation artifacts become more legible to the model.
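To make the wavelet stage concrete, here is a single-level 2-D Haar decomposition in plain NumPy. The actual pipeline uses a Symlet-4, level-3 decomposition (e.g. via PyWavelets); Haar is substituted here only to keep the sketch dependency-free, but the principle is the same: split the image into a low-frequency approximation (LL) and high-frequency detail bands (LH, HL, HH) where manipulation traces concentrate.

```python
import numpy as np

def haar_dwt2(x: np.ndarray):
    """One level of a 2-D Haar DWT (illustrative stand-in for Symlet-4, level 3)."""
    # Rows: average / difference of adjacent pixel pairs.
    lo = (x[:, 0::2] + x[:, 1::2]) / 2.0
    hi = (x[:, 0::2] - x[:, 1::2]) / 2.0
    # Columns: same split, yielding approximation and detail bands.
    ll = (lo[0::2, :] + lo[1::2, :]) / 2.0  # low-frequency approximation
    lh = (lo[0::2, :] - lo[1::2, :]) / 2.0  # horizontal detail
    hl = (hi[0::2, :] + hi[1::2, :]) / 2.0  # vertical detail
    hh = (hi[0::2, :] - hi[1::2, :]) / 2.0  # diagonal detail
    return ll, (lh, hl, hh)

# A smooth gradient image has no diagonal high-frequency energy.
gray = np.arange(64, dtype=np.float32).reshape(8, 8)
ll, (lh, hl, hh) = haar_dwt2(gray)
print(ll.shape)
```

In the pipeline, the detail bands are amplified and fused back via spatial attention, so the backbone sees an input in which high-frequency inconsistencies stand out.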

Results
The differences were substantial and consistent across all metrics.
At the frame level, the proposed method achieved an AUROC of 99.62% compared to 99.07% in the baseline. More notably, false positives were reduced by 59.2%, dropping from 240 cases to just 98. The Equal Error Rate fell from 4.78% to 3.20%, a relative reduction of 33%.
At the video level, the improvements were even more striking. The proposed method reached an AUROC of 99.96% with perfect precision of 100% on the test set, meaning zero real videos were misclassified as fake. The Equal Error Rate dropped by nearly 67%, from 1.81% to just 0.60%.
These results are benchmark-specific to Celeb-DF v2 and do not imply universal perfect precision, but they offer strong empirical evidence that preprocessing is not a trivial step.
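For readers unfamiliar with the Equal Error Rate figures above, here is a small NumPy sketch of how EER is computed: it is the operating point where the false-positive rate equals the false-negative rate. A real evaluation would more likely use scikit-learn's `roc_curve`; the synthetic scores below are only for demonstration.

```python
import numpy as np

def equal_error_rate(scores: np.ndarray, labels: np.ndarray) -> float:
    """EER for 'fake' probability scores; labels are 1 for fake, 0 for real."""
    eer, gap = 1.0, 1.0
    for t in np.unique(scores):
        pred = scores >= t
        fpr = np.mean(pred[labels == 0])   # real frames flagged as fake
        fnr = np.mean(~pred[labels == 1])  # fake frames missed
        if abs(fpr - fnr) < gap:           # keep the most balanced threshold
            gap, eer = abs(fpr - fnr), (fpr + fnr) / 2.0
    return eer

# Synthetic, well-separated score distributions for illustration only.
rng = np.random.default_rng(1)
labels = np.repeat([0, 1], 500)
scores = np.concatenate([rng.normal(0.3, 0.1, 500), rng.normal(0.7, 0.1, 500)])
print(round(equal_error_rate(scores, labels), 3))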

Where the Model Still Struggles
Despite strong overall performance, failure cases reveal important limitations.
The 98 remaining false positives at the frame level were predominantly caused by extreme lighting (overexposure or underexposure), severe motion blur, and partial face occlusion. Certain visual patterns, such as heavy makeup, glasses reflections, or capture-induced compression artifacts, were consistently misread as signs of manipulation.
On the false negative side, 172 missed detections involved high-quality deepfakes with advanced post-processing techniques like color grading and noise injection that successfully mimicked natural video characteristics. Some manipulations were also limited to non-critical facial regions such as the forehead or chin, leaving core facial features largely authentic and harder to flag.
Key Takeaways
This research makes three concrete contributions to the deepfake detection field.
First, it introduces a structured evaluation framework that rigorously quantifies the impact of preprocessing on artifact integrity, addressing a methodological gap identified in prior literature.
Second, it proposes an adaptive multi-scale preprocessing pipeline that integrates CPU-side enhancements with GPU-side wavelet feature extraction, tuned to dataset-specific statistics. This improves class separability without adding any parameters to the backbone, making it suitable for resource-constrained environments.
Third, it provides empirical evidence that preprocessing is a strategic determinant of model performance, not merely a preparatory technical step.
What Comes Next
Several directions are planned for future work, including cross-dataset evaluation on benchmarks like FaceForensics++, adversarial robustness testing against post-processing attacks, explainability analysis using Grad-CAM, multimodal audio-visual fusion, and replacing fixed fusion weights with learnable, adaptive weighting mechanisms.
In an era of exponentially advancing deepfake sophistication, progress will come not only from building more complex models, but from being smarter about the evidence we feed into them.