Healthcare communication increasingly relies on digital platforms, creating new vulnerabilities for misinformation through sophisticated deepfake technologies. The proliferation of synthetic medical content poses serious risks to patient safety, public health policies, and trust in healthcare institutions.
We introduce SynthMed, a comprehensive framework for generating and detecting multimodal deepfakes specifically designed for healthcare communication scenarios. Our approach combines state-of-the-art generative models with advanced detection mechanisms using multimodal late-fusion strategies across video, audio, and textual fact-checking modalities.
Our evaluation demonstrates that SynthMed achieves robust detection capabilities through multimodal fusion, significantly outperforming single-modality approaches. These results highlight the framework's potential as both a research tool for understanding deepfake vulnerabilities and a defense mechanism against malicious synthetic healthcare content.
- PUBHEALTH: Public health fact-checking dataset
- COVID-Fact: COVID-19 related claims and fact-checks
- SciFact: Scientific claim verification dataset
- HealthVer: Health-related claim verification corpus
The framework processes over 31k healthcare claims through multiple synthesis pipelines and evaluation protocols:
- 510 Generated Deepfakes: Using advanced TTS/voice cloning and lip-sync technologies
- 40 In-the-Wild Deepfakes: Web-sourced videos from Duke Reporters' Lab and YouTube (reported in `datasets/HealthcareDeepfakesInTheWild.xlsx`)
- 150 Real Videos: Authentic content for balanced evaluation
- Balanced Dataset Splits: Synthetic, in-the-wild, and combined configurations
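The claim-curation step merges heterogeneous fact-checking corpora into one schema. A minimal sketch of that normalization, assuming hypothetical column names and label mappings (the real PUBHEALTH, COVID-Fact, SciFact, and HealthVer schemas differ and must each be mapped):

```python
import pandas as pd

# Hypothetical per-corpus frames standing in for the real datasets;
# column names and label vocabularies are illustrative only.
pubhealth = pd.DataFrame({"claim": ["Vitamin C cures colds"], "label": ["false"]})
covidfact = pd.DataFrame({"sentence": ["Masks reduce transmission"], "verdict": ["SUPPORTED"]})

def normalize(df, claim_col, label_col, source, label_map):
    """Map a corpus onto a shared (claim, label, source) schema."""
    out = df.rename(columns={claim_col: "claim", label_col: "label"})[["claim", "label"]].copy()
    out["label"] = out["label"].map(label_map)  # unify label vocabulary to {0: false, 1: true}
    out["source"] = source
    return out

merged = pd.concat([
    normalize(pubhealth, "claim", "label", "PUBHEALTH", {"false": 0, "true": 1}),
    normalize(covidfact, "sentence", "verdict", "COVID-Fact", {"REFUTED": 0, "SUPPORTED": 1}),
], ignore_index=True).drop_duplicates(subset="claim")
```

Deduplicating on the claim text keeps overlapping claims (e.g. COVID-19 statements appearing in several corpora) from being counted twice.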
- Video Generation & Detection:
  - VideoReTalking for photorealistic lip-sync synthesis
  - DeepfakeBench repository models for comprehensive detection
- Audio Synthesis & Detection:
  - OpenVoice, WhisperSpeech, OuteTTS for voice cloning and TTS
  - deepfake-whisper-features for audio deepfake detection
- Text Generation & Fact-Checking:
  - PALMYRA-MED-70B-32K, LLAMA-3.1-NEMOTRON-70B-INSTRUCT for claim elaboration (results in `datasets/Dataset_Palmyra.csv` and `datasets/Dataset_llama.csv`)
  - Transformer-based models (BERT variants, generative models) for fact verification (results in `datasets/Elaborated_DatasetClaim_BART.csv` and `datasets/Elaborated_DatasetClaim_T5.csv`)
- Multimodal Fusion: Logistic Regression, Random Forest, XGBoost for late-fusion strategies
- Speaker Processing: Whisper ASR, VAD, TitaNet for speaker diarization and voice extraction
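As a concrete sketch of the late-fusion idea: per-modality detector scores are stacked into one feature vector per video, and a meta-classifier learns the final real/fake decision. The scores below are synthetic placeholders, not outputs of the actual detectors:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, n)  # 0 = real, 1 = fake (simulated labels)

# Placeholder fake-probabilities per modality; in the framework these
# would come from the video, audio, and text detectors on held-out data.
video_p = np.clip(y * 0.6 + rng.normal(0.2, 0.2, n), 0, 1)
audio_p = np.clip(y * 0.5 + rng.normal(0.25, 0.2, n), 0, 1)
text_p = np.clip(y * 0.4 + rng.normal(0.3, 0.2, n), 0, 1)

# Decision-level (late) fusion: stack modality scores, fit a meta-classifier.
X = np.column_stack([video_p, audio_p, text_p])
fusion = LogisticRegression().fit(X, y)
acc = fusion.score(X, y)
```

Random Forest or XGBoost drops in for `LogisticRegression` unchanged, since all three consume the same stacked score matrix.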
Our SynthMed framework operates through a dual-phase architecture addressing both generation and detection of healthcare deepfakes:
- Deepfake Generation Pipeline:
  - Claim Dataset Curation: Merging multiple healthcare fact-checking datasets
  - LLM-based Elaboration: Converting claims into persuasive spoken sentences
  - Synthetic Audio Production: TTS/voice cloning with speaker diarization
  - Video Lip-Sync: Photorealistic mouth movement alignment
- Single-Task Models Framework:
  - Video Forensics: Spatial, geometric, and frequency-domain artifact analysis
  - Audio Forensics: TTS/voice-cloning detection and audio-visual synchrony
  - Textual Fact-Checking: Semantic veracity verification of spoken claims
- Late-Fusion Models Framework:
  - Meta-Classifier Engines: Logistic Regression, Random Forest, XGBoost
  - Monomodal Fusion: Within-modality (video, audio, and text) ensemble of multiple detectors
  - Multimodal Fusion: Cross-modal integration using meta-classifiers
  - Feature Importance Analysis: Understanding modality contributions to decisions
- Evaluation Protocol:
  - Incremental Assessment: Sequential video → audio → text analysis pipeline
  - Cross-Domain Validation: Testing across different distribution scenarios
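The incremental assessment protocol can be sketched as an early-exit cascade that consults modalities in video → audio → text order and stops once the running score is confidently real or fake. The thresholds and stopping rule here are illustrative assumptions, not the paper's exact procedure:

```python
def incremental_decision(video_p, audio_p, text_p, hi=0.9, lo=0.1):
    """Early-exit cascade over modality fake-probabilities.

    Consults modalities sequentially; returns (label, modalities_used).
    Thresholds hi/lo are illustrative, not values from the framework.
    """
    scores = []
    for p in (video_p, audio_p, text_p):
        scores.append(p)
        avg = sum(scores) / len(scores)  # running mean fake-probability
        if avg >= hi:
            return "fake", len(scores)   # confidently fake: stop early
        if avg <= lo:
            return "real", len(scores)   # confidently real: stop early
    # All modalities consulted; fall back to a simple majority threshold.
    return ("fake" if avg >= 0.5 else "real"), len(scores)

# A confident video score resolves in one step; an ambiguous one
# falls through to the audio and text detectors.
```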
The system processes healthcare content through generation and detection pipelines, with late fusion combining complementary modality-specific signals for robust deepfake identification.
The generation workflow demonstrates the multi-stage approach for creating realistic synthetic healthcare communications from curated claims to final video output.
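The LLM-based elaboration stage turns a terse fact-checked claim into presenter-style speech. A hypothetical prompt template for that step; the actual prompts used with the Palmyra and Llama models are not reproduced here, so the wording below is illustrative only:

```python
# Hypothetical template for the claim-elaboration step; structure and
# wording are assumptions, not the framework's published prompts.
TEMPLATE = (
    "You are drafting a short spoken statement for a health video.\n"
    "Rewrite the following claim as one persuasive, natural-sounding "
    "sentence a presenter could say on camera, without adding new facts.\n"
    "Claim: {claim}\n"
    "Spoken sentence:"
)

def build_elaboration_prompt(claim: str) -> str:
    """Fill the template with a single fact-checking claim."""
    return TEMPLATE.format(claim=claim.strip())

prompt = build_elaboration_prompt("Vitamin D prevents respiratory infections")
```

The same prompt would be sent to each elaboration model (e.g. via its chat-completion API), and the returned sentence feeds the TTS stage.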
Late-fusion feature importance analysis reveals how different meta-classifiers (LR, RF, XGBoost) exploit distinct synergy patterns across video, audio, and text modalities.
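Feature-importance analysis of a fused model can be sketched with a random forest over toy modality scores; all data here is simulated, with video deliberately made the most informative channel:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
n = 300
y = rng.integers(0, 2, n)  # 0 = real, 1 = fake (simulated)

# Simulated modality scores: video is the cleanest signal, audio is
# noisier, and text is pure noise in this toy setup.
video = y + rng.normal(0, 0.3, n)
audio = y + rng.normal(0, 0.8, n)
text = rng.normal(0, 1.0, n)

X = np.column_stack([video, audio, text])
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Impurity-based importances sum to 1 across the three modalities.
importances = dict(zip(["video", "audio", "text"], rf.feature_importances_))
```

For the logistic-regression meta-classifier, the analogous quantity is the magnitude of each modality's coefficient; model-agnostic alternatives such as permutation importance work for all three fusion engines.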
- Healthcare-Specific Pipeline: Specialized framework for medical communication synthesis and detection
- Multimodal Late Fusion: Decision-level aggregation exploiting complementary error profiles
- Comprehensive Benchmarking: Integration of 15 video, 9 audio, and 6 text detection systems
- Cross-Domain Evaluation: Testing across synthetic, in-the-wild, and combined datasets
- Interpretable Fusion: Feature importance analysis revealing modality contributions
- Scalable Architecture: Subject-agnostic generation requiring no per-identity training
```
SynthMed/
├── README.md                              # This file
├── LICENSE.txt                            # CC BY-NC 4.0 license
├── code/                                  # Implementation code (coming soon)
├── datasets/                              # Healthcare datasets and synthetic content
│   ├── Dataset_llama.csv                  # Llama-generated elaborations
│   ├── Dataset_Palmyra.csv                # Palmyra-generated elaborations
│   ├── Elaborated_DatasetClaim_BART.csv
│   ├── Elaborated_DatasetClaim_T5.csv
│   └── HealthcareDeepfakesInTheWild.xlsx
└── img/                                   # Methodology diagrams and visualizations
    ├── ISM_DeepfakeGeneration.png
    ├── ISM_LF_FeatureImportanceAnalysis.png
    └── ISM_Methodology.png
```
Our comprehensive evaluation demonstrates significant advances in healthcare deepfake detection:
- Multimodal Superiority: Late fusion consistently outperformed single-modality approaches across all evaluation metrics
- Cross-Domain Robustness: Effective detection across synthetic and in-the-wild distribution scenarios
- Complementary Modalities: Each modality captures distinct artifact patterns, enabling synergistic detection
- Meta-Classifier Performance: XGBoost achieved optimal balance with ~89% accuracy and 0.84 AUC
- Balanced Classification: Strong performance on both authentic and synthetic content detection
The results confirm that multimodal integration provides a more reliable and robust approach for medical deepfake detection, demonstrating the importance of ensemble strategies in high-stakes healthcare communication scenarios.
- Video Synthesis: VideoReTalking for subject-agnostic lip-sync generation
- Audio Synthesis: OpenVoice (IVC), WhisperSpeech (TTS), OuteTTS (TTS) with controllable parameters
- Text Elaboration: PALMYRA-MED-70B-32K and LLAMA-3.1-NEMOTRON-70B-INSTRUCT
- Video Models: Spatial, frequency-aware, and forensic detection architectures
- Audio Models: Feature-based approaches combining spectral and neural representations
- Text Models: Transformer-based fact-checkers and semantic verification systems
- Logistic Regression: Interpretable linear combination for transparent decisions
- Random Forest: Ensemble approach robust to noisy modality scores
- XGBoost: Advanced gradient boosting for complex cross-modal pattern learning
The framework includes comprehensive assessment tools:
- Classification Metrics: Precision, Recall, F1-score per class (True/Fake)
- Macro-Averaged Metrics: Macro-P, Macro-R, Macro-F1 for balanced evaluation
- Threshold-Independent Metrics: AUC and Equal Error Rate (EER), alongside overall Accuracy
- Stratified Splits: 70/15/15% train/validation/test partitions preserving class balance
- Feature Importance: Model-agnostic analysis of modality contributions
- Domain Analysis: Performance across synthetic, in-the-wild, and combined scenarios
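The threshold-independent metrics above, including EER, can be computed directly from raw detector scores. A small sketch with toy labels and scores:

```python
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score, roc_curve

# Toy ground truth and detector scores (0 = real, 1 = fake).
y_true = np.array([0, 0, 0, 1, 1, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.65, 0.9, 0.2, 0.7])
y_pred = (y_score >= 0.5).astype(int)  # hard decisions at a 0.5 threshold

macro_f1 = f1_score(y_true, y_pred, average="macro")
auc = roc_auc_score(y_true, y_score)

# EER: operating point where the false-positive rate equals the
# false-negative rate (1 - TPR), approximated from the ROC curve.
fpr, tpr, _ = roc_curve(y_true, y_score)
eer = fpr[np.argmin(np.abs(fpr - (1 - tpr)))]
```

Here the toy scores separate the classes perfectly, so AUC is 1.0 and EER is 0.0; real detector scores would land between these extremes.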
This research addresses the critical need for deepfake detection in healthcare while acknowledging the dual-use nature of synthetic content generation:
- Research Purpose Only: All synthetic content generated solely for detection research
- Responsible Disclosure: Results shared to enhance healthcare cybersecurity awareness
- Safeguards Required: Clear labeling and guidelines for any data release
- Privacy Protection: All datasets ethically sourced with proper anonymization
- Healthcare Context: Special attention to patient safety and public health implications
- Transparency: Open methodology enabling validation and responsible deployment
The complete implementation code, including generation pipelines, detection models, and fusion frameworks, will be released upon paper acceptance. The codebase will include:
- Multimodal deepfake generation scripts with healthcare-specific prompting
- Comprehensive detection model implementations and benchmarking tools
- Late-fusion framework with interpretability analysis
- Evaluation protocols and dataset preprocessing utilities
- Jupyter notebooks for reproducible experiments and analysis
We welcome contributions to advance healthcare deepfake detection research! Please feel free to submit pull requests, report issues, or suggest improvements. All contributions should align with our ethical guidelines for responsible AI in healthcare.
This project was developed by Mariano Barone, Francesco Di Serio, Antonio Romano, Giuseppe Riccio, Marco Postiglione, and Vincenzo Moscato at the University of Naples Federico II (PRAISE Lab - PICUS).
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.