Auditory deception through latent diffusion and text-to-speech models
Category: Data science
Location: Thun / Zurich / Lausanne
Contact:
Raphael Meier
Recent advances in deep learning models for text-to-audio and text-to-speech generation may pose new challenges and opportunities for the generation of synthetic audio data. In particular, methods based on Contrastive Language-Image Pretraining (CLIP) [1] and latent diffusion models (LDMs) [2] enable the generation of diverse synthetic audio data in a zero-shot fashion. AudioLDM combines an adaptation of CLIP to audio data (Contrastive Language-Audio Pretraining, CLAP) with LDMs to generate audio from a given text prompt [3]. AudioLDM opens up many interesting applications, including style transfer for audio and audio inpainting. Moreover, it enables the synthesis of auditory context (e.g., the sound of someone speaking in a small room versus a large hall). Combining AudioLDM with recent text-to-speech models (e.g., PaddleSpeech [4]) may give threat actors a sophisticated tool for synthesizing deceptive audio data.
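For illustration, the minimal sketch below shows zero-shot text-to-audio generation with AudioLDM through the Hugging Face diffusers library. The checkpoint name (cvssp/audioldm-s-full-v2) and the prompt are assumptions for the example, not part of the project specification.

```python
# Minimal sketch: zero-shot text-to-audio with AudioLDM via Hugging Face
# diffusers. The checkpoint name "cvssp/audioldm-s-full-v2" is an assumption;
# substitute whichever AudioLDM checkpoint is available.
import torch
import scipy.io.wavfile
from diffusers import AudioLDMPipeline

pipe = AudioLDMPipeline.from_pretrained(
    "cvssp/audioldm-s-full-v2", torch_dtype=torch.float16
).to("cuda")

# CLAP embeds the prompt into a shared language-audio space that
# conditions the latent diffusion model.
prompt = "A person speaking in a large, echoing hall"
audio = pipe(prompt, num_inference_steps=50, audio_length_in_s=5.0).audios[0]

# AudioLDM generates mono audio at 16 kHz.
scipy.io.wavfile.write("hall_speech.wav", rate=16000, data=audio)
```

Varying only the prompt (e.g., "small room" versus "large hall") already demonstrates the synthesis of different auditory contexts mentioned above.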
Goal
The goal of this project is to explore the capabilities of AudioLDM (and the more recent AudioLDM 2 [5]) for a range of applications in military deception and information warfare. In addition, recent text-to-speech models should be reviewed and their potential combination with AudioLDM explored. The final outcome of the project is a demonstrator illustrating a set of well-defined threat scenarios involving synthetic audio generation.
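As a rough sketch of what such a combination could look like, the hypothetical pipeline below overlays PaddleSpeech output with AudioLDM-generated ambience. The model tags (fastspeech2_ljspeech, pwgan_ljspeech), the checkpoint name, and the mixing weights are assumptions; a real demonstrator would need careful resampling and level matching.

```python
# Hypothetical demonstrator sketch: overlay text-to-speech output with
# AudioLDM-generated auditory context. Model and checkpoint names are
# assumptions; substitute whatever is available.
import numpy as np
import scipy.io.wavfile
import scipy.signal
import torch
from diffusers import AudioLDMPipeline
from paddlespeech.cli.tts.infer import TTSExecutor

# 1) Synthesize the spoken message with PaddleSpeech.
tts = TTSExecutor()
tts(text="The meeting has been moved to tomorrow.",
    am="fastspeech2_ljspeech", voc="pwgan_ljspeech", lang="en",
    output="speech.wav")
sr, speech = scipy.io.wavfile.read("speech.wav")
speech = speech.astype(np.float32) / np.iinfo(np.int16).max  # assumes 16-bit PCM
speech = scipy.signal.resample_poly(speech, 16000, sr)       # match AudioLDM's 16 kHz

# 2) Generate background ambience for the desired auditory context.
pipe = AudioLDMPipeline.from_pretrained(
    "cvssp/audioldm-s-full-v2", torch_dtype=torch.float16
).to("cuda")
ambience = pipe(
    "People talking in a large, echoing hall",
    num_inference_steps=50,
    audio_length_in_s=len(speech) / 16000,
).audios[0]  # mono waveform at 16 kHz

# 3) Mix the two tracks into a single deceptive audio scene.
n = min(len(speech), len(ambience))
mix = 0.8 * speech[:n] + 0.2 * ambience[:n]
scipy.io.wavfile.write("scene.wav", rate=16000, data=mix.astype(np.float32))
```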
Requirements
- Good programming skills (Python)
- Basic knowledge of machine learning
- Interest in social cybersecurity
- Knowledge of sound engineering is a plus
If you are interested and want to hear more about the project, please contact us.
References
[1] Radford, Alec, et al. “Learning transferable visual models from natural language supervision.” International Conference on Machine Learning. PMLR, 2021.
[2] Rombach, Robin, et al. “High-resolution image synthesis with latent diffusion models.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.
[3] Liu, Haohe, et al. “AudioLDM: Text-to-Audio Generation with Latent Diffusion Models.” arXiv preprint arXiv:2301.12503 (2023).
[4] Zhang, Hui, et al. “PaddleSpeech: An Easy-to-Use All-in-One Speech Toolkit.” arXiv preprint arXiv:2205.12007 (2022).
[5] Liu, Haohe, et al. “AudioLDM 2: Learning holistic audio generation with self-supervised pretraining.” arXiv preprint arXiv:2308.05734 (2023).