Recent advances in cross-modal deep learning, such as Contrastive Language-Image Pre-training (CLIP) [1], enable a far more effective assessment of similarity across modalities (e.g. text and images) than was previously possible. In particular, measuring the similarity between a given text and a set of images opens up a range of interesting applications, including zero- and few-shot image classification, image retrieval, and image clustering. More recently, image tagging models such as RAM [2] have been released that demonstrate strong performance in open-set recognition.

Investigators and analysts in the armed forces, intelligence agencies, and law enforcement are routinely confronted with the forensic analysis of video content. When large amounts of video data must be screened for a particular vehicle or behavior, valuable analyst time is consumed that could be spent elsewhere. Recent deep learning methods may therefore provide an important tool for making forensic analysis in security and defense more efficient.
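
To illustrate how such cross-modal similarity supports zero-shot tagging, the sketch below scores a single image against a handful of candidate tags with CLIP via Hugging Face Transformers. The checkpoint name, tag list, and image path are illustrative assumptions, not part of the project specification.

# Minimal sketch: zero-shot tagging of one image with CLIP.
# Checkpoint, candidate tags, and image path are illustrative assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

tags = ["a photo of a truck", "a photo of a motorcycle", "a photo of a boat"]
image = Image.open("frame.jpg")  # e.g. a frame extracted from a video

inputs = processor(text=tags, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the image-text similarity scores; a softmax
# turns them into a distribution over the candidate tags.
probs = outputs.logits_per_image.softmax(dim=-1)
for tag, p in zip(tags, probs[0]):
    print(f"{tag}: {p:.3f}")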

Goal

The goal of this project is to extend recent deep learning methods for image tagging to the video domain, to evaluate their open-set recognition performance for a series of relevant entities (e.g. specific actions, vehicles), and to investigate ways of making these models more task-specific (e.g. through prompt ensembling [3] or fine-tuning). The final outcome of the project is a validated Python toolkit for the forensic analysis of video content based on a recent image tagging method.
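
Two of these ideas lend themselves to a short sketch: applying an image-level model to sampled video frames, and prompt ensembling [3], where several prompt templates per tag are encoded and their text embeddings averaged into one class embedding. The CLIP checkpoint, templates, tag list, and video path below are assumptions for illustration; RAM or another tagging model could take CLIP's place in the actual toolkit.

# Hedged sketch: frame sampling plus prompt ensembling with CLIP.
# Checkpoint, templates, tags, and video path are illustrative assumptions.
import cv2
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

templates = ["a photo of a {}.", "a video still of a {}.", "a blurry photo of a {}."]
tags = ["truck", "motorcycle", "boat"]

# Prompt ensembling: average the normalized text embeddings of all
# templates for each tag into a single class embedding.
with torch.no_grad():
    per_tag = []
    for tag in tags:
        tok = processor(text=[t.format(tag) for t in templates],
                        return_tensors="pt", padding=True)
        emb = model.get_text_features(**tok)
        emb = emb / emb.norm(dim=-1, keepdim=True)
        per_tag.append(emb.mean(dim=0))
    text_embeds = torch.stack(per_tag)
    text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)

# Sample roughly one frame per second and score it against the tags.
cap = cv2.VideoCapture("evidence.mp4")  # hypothetical input video
fps = int(cap.get(cv2.CAP_PROP_FPS)) or 25
frame_idx = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if frame_idx % fps == 0:
        image = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        pix = processor(images=image, return_tensors="pt")
        with torch.no_grad():
            img_emb = model.get_image_features(**pix)
        img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
        scores = (img_emb @ text_embeds.T).squeeze(0)
        best = int(scores.argmax())
        print(f"t={frame_idx // fps}s: {tags[best]} ({scores[best]:.3f})")
    frame_idx += 1
cap.release()

Per-frame scores could then be aggregated over time (e.g. max-pooled per shot) to obtain video-level tags; choosing such an aggregation strategy is one natural direction for the extension to video.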

Requirements

  • Good programming skills (Python)
  • Basic knowledge of machine learning
  • Interest in forensic content analysis

If you are interested and want to hear more about the project, please contact us.

References

[1] Radford, Alec, et al. “Learning transferable visual models from natural language supervision.” International Conference on Machine Learning. PMLR, 2021.

[2] Zhang, Youcai, et al. “Recognize Anything: A Strong Image Tagging Model.” arXiv preprint arXiv:2306.03514 (2023).

[3] Zhou, Kaiyang, et al. “Learning to prompt for vision-language models.” International Journal of Computer Vision 130.9 (2022): 2337-2348.