Recent advances in deep learning models for text-to-audio and text-to-speech generation pose new challenges and opportunities for the generation of synthetic audio data. In particular, methods based on Contrastive Language-Image Pretraining (CLIP) [1] and latent diffusion models (LDMs) [2] enable the generation of diverse synthetic audio data in a zero-shot fashion. AudioLDM combines an audio adaptation of CLIP, Contrastive Language-Audio Pretraining (CLAP), with LDMs to generate audio from a given text prompt [3]. AudioLDM opens up many interesting applications, including audio style transfer and audio inpainting. Moreover, AudioLDM enables the synthesis of auditory context (e.g., the sound of someone speaking in a small room versus a large hall). Combining AudioLDM with recent text-to-speech models (e.g., PaddleSpeech [4]) may provide threat actors with a sophisticated tool for synthesizing deceptive audio data.
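As a rough illustration of the zero-shot workflow, the sketch below generates a short clip from a free-form prompt, assuming the Hugging Face diffusers port of AudioLDM; the checkpoint name, prompt, and sampling parameters are illustrative choices, not project requirements.

    # A minimal sketch of zero-shot text-to-audio generation with the
    # diffusers port of AudioLDM (checkpoint and prompt are illustrative).
    import scipy.io.wavfile as wavfile
    from diffusers import AudioLDMPipeline

    pipe = AudioLDMPipeline.from_pretrained("cvssp/audioldm-s-full-v2")
    pipe = pipe.to("cuda")  # optional: remove if no GPU is available

    # Generate five seconds of audio conditioned on a prompt describing
    # the desired auditory context.
    audio = pipe(
        prompt="a person speaking in a large reverberant hall",
        num_inference_steps=50,
        audio_length_in_s=5.0,
    ).audios[0]

    # AudioLDM produces 16 kHz mono waveforms in [-1, 1].
    wavfile.write("hall_speech.wav", rate=16000, data=audio)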

Goal

The goal of this project is to explore the capabilities of AudioLDM (and its successor, AudioLDM 2 [5]) across a range of applications in military deception and information warfare. In addition, recent text-to-speech models are to be reviewed and their potential combination with AudioLDM explored. The final outcome of the project is a demonstrator illustrating a set of well-defined threat scenarios involving synthetic audio generation.
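One naive form such a combination could take is sketched below: speech synthesized with PaddleSpeech's Python API is additively mixed with AudioLDM-generated ambience. This is simple signal mixing for illustration only, not AudioLDM's own acoustic style transfer; the model names, mixing weight, and sample-rate assumptions are all hypothetical starting points.

    # Hypothetical sketch of chaining a TTS model with AudioLDM:
    # PaddleSpeech synthesizes speech, AudioLDM generates ambience,
    # and the two signals are naively mixed (illustration only).
    import numpy as np
    import scipy.io.wavfile as wavfile
    from scipy.signal import resample
    from diffusers import AudioLDMPipeline
    from paddlespeech.cli.tts.infer import TTSExecutor

    # 1. Synthesize the spoken content (English model names are
    #    illustrative; see the PaddleSpeech docs for alternatives).
    tts = TTSExecutor()
    tts(text="The convoy departs at dawn.", output="speech.wav",
        am="fastspeech2_ljspeech", voc="pwgan_ljspeech", lang="en")
    sr, speech = wavfile.read("speech.wav")
    if speech.dtype == np.int16:  # normalize to float in [-1, 1]
        speech = speech.astype(np.float32) / 32768.0

    # 2. Generate ambience for the target auditory context
    #    (AudioLDM outputs 16 kHz audio).
    pipe = AudioLDMPipeline.from_pretrained("cvssp/audioldm-s-full-v2")
    ambience = pipe(
        prompt="crowd murmuring in a large hall",
        num_inference_steps=50,
        audio_length_in_s=5.0,
    ).audios[0]

    # 3. Resample the ambience to the speech sample rate, loop it to
    #    the speech duration, and mix it in at a low level.
    ambience = resample(ambience, int(len(ambience) * sr / 16000))
    reps = int(np.ceil(len(speech) / len(ambience)))
    ambience = np.tile(ambience, reps)[: len(speech)]
    mix = speech + 0.3 * ambience
    mix /= max(1.0, np.abs(mix).max())  # avoid clipping
    wavfile.write("deceptive_scene.wav", rate=sr, data=mix.astype(np.float32))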

Requirements

  • Good programming skills (Python)
  • Basic knowledge of machine learning
  • Interest in social cybersecurity
  • Knowledge of sound engineering is a plus

If you are interested and want to hear more about the project, please contact us.

References

[1] Radford, Alec, et al. “Learning Transferable Visual Models from Natural Language Supervision.” International Conference on Machine Learning. PMLR, 2021.

[2] Rombach, Robin, et al. “High-resolution image synthesis with latent diffusion models.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.

[3] Liu, Haohe, et al. “AudioLDM: Text-to-Audio Generation with Latent Diffusion Models.” arXiv preprint arXiv:2301.12503 (2023).

[4] Zhang, Hui, et al. “PaddleSpeech: An Easy-to-Use All-in-One Speech Toolkit.” arXiv preprint arXiv:2205.12007 (2022).

[5] Liu, Haohe, et al. “AudioLDM 2: Learning Holistic Audio Generation with Self-Supervised Pretraining.” arXiv preprint arXiv:2308.05734 (2023).