SWAN-DF database of audio-video deepfakes

SWAN-DF is the first publicly available high-fidelity dataset of realistic audio-visual deepfakes in which both the faces and the voices look and sound like the target person. It is based on the public SWAN database of real videos recorded in HD on an iPhone and an iPad Pro in 2019. For 30 manually selected pairs of people from SWAN, we swapped faces using several autoencoder-based face-swapping models and blending techniques from the well-known open-source repository DeepFaceLab, and swapped voices using several voice conversion (or voice cloning) methods, including zero-shot YourTTS, DiffVC, HiFiVC, and several tuned FreeVC models.

For each pair of people, you can watch their original videos and four deepfake variants in which the faces and voices of the two identities are swapped. After the video samples, you can listen to examples of audio-only deepfakes generated with text-to-speech and voice conversion methods from original utterances of the LibriTTS dataset.
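As a rough illustration of how the per-pair samples on this page are organized, the sketch below models one swap pair as two original videos plus four deepfake variants. The `Sample` class, its field names, and the helper functions are hypothetical and chosen only for this example; they do not describe the actual file layout or metadata of the SWAN-DF release.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Sample:
    """One clip shown for a swap pair (hypothetical structure, not the dataset's schema)."""
    pair: int                 # swap pair index as shown on this page
    kind: str                 # "original" or "deepfake"
    role: Optional[str]       # "source" or "target" for originals, None for deepfakes
    face_px: Optional[int]    # face model resolution for deepfakes, None for originals
    audio: Optional[str]      # voice conversion method for deepfakes, None for originals


# Face resolutions of the four deepfake variants listed per pair on this page.
# Two variants carry the same 160px label; the page does not say how they differ.
FACE_SIZES = (320, 256, 160, 160)


def samples_for_pair(pair: int) -> List[Sample]:
    """Return the two originals followed by the four deepfake variants."""
    originals = [
        Sample(pair, "original", "source", None, None),
        Sample(pair, "original", "target", None, None),
    ]
    fakes = [Sample(pair, "deepfake", None, px, "FreeVC") for px in FACE_SIZES]
    return originals + fakes


def caption(sample: Sample) -> str:
    """Rebuild the caption text used for each clip on this page."""
    if sample.kind == "original":
        if sample.role == "source":
            return "Original source video"
        return "Original target speaker"
    return f"Deepfake, {sample.face_px}px face, {sample.audio} audio"


if __name__ == "__main__":
    for s in samples_for_pair(1):
        print(caption(s))
```

Running the script prints the six captions listed for swap pair 1, in page order.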

If you use the dataset, please cite the following paper:


@INPROCEEDINGS{Korshunov_IJCB_2023,
  author = {Korshunov, Pavel and Chen, Haolin and Garner, Philip N. and Marcel, S{\'{e}}bastien},
  projects = {Idiap, NAST, Biometrics Center},
  month = sep,
  title = {Vulnerability of Automatic Identity Recognition to Audio-Visual Deepfakes},
  booktitle = {IEEE International Joint Conference on Biometrics (IJCB)},
  year = {2023},
}

SWAN-DF deepfakes for swap pair 1

Original source video

Original target speaker

Deepfake, 320px face, FreeVC audio

Deepfake, 256px face, FreeVC audio

Deepfake, 160px face, FreeVC audio

Deepfake, 160px face, FreeVC audio

SWAN-DF deepfakes for swap pair 2

Original source video

Original target speaker

Deepfake, 320px face, FreeVC audio

Deepfake, 256px face, FreeVC audio

Deepfake, 160px face, FreeVC audio

Deepfake, 160px face, FreeVC audio

SWAN-DF deepfakes for swap pair 3

Original source video

Original target speaker

Deepfake, 320px face, FreeVC audio

Deepfake, 256px face, FreeVC audio

Deepfake, 160px face, FreeVC audio

Deepfake, 160px face, FreeVC audio

LibriTTS-DF Samples

Compare the voices of the original speakers from the LibriTTS dataset with three selected voice deepfakes generated using adapted Adaspeech, tuned TorToiSe, and the zero-shot YourTTS model used for voice conversion.

Text (speaker ID: 121) | Ground Truth | Adaspeech | TorToiSe | YourTTS (1188 to 121)
Also, a popular contrivance whereby love-making may be suspended but not stopped during the picnic season.

Text (speaker ID: 1188) | Ground Truth | Adaspeech | TorToiSe | YourTTS (121 to 1188)
You see how doubly, how intimately, opposed the ideas are; yet how difficult to explain without apparent contradiction.