Conditional Generation of Audio from Video via Foley Analogies

Yuexi Du¹,²

Ziyang Chen¹

Justin Salamon³

Bryan Russell³

Andrew Owens¹

University of Michigan¹

Yale University²

Adobe Research³

CVPR 2023

[Paper]

[Github]

[Video]

[Poster]

We generate a soundtrack for a silent input video, given a user-provided conditional example specifying what its audio should "sound like."

Abstract

The sound effects that designers add to videos are designed to convey a particular artistic effect and, thus, may be quite different from a scene's true sound. Inspired by the challenges of creating a soundtrack for a video that differs from its true sound, but that nonetheless matches the actions occurring on screen, we propose the problem of conditional Foley. We present the following contributions to address this problem. First, we propose a pretext task for training our model to predict sound for an input video clip using a conditional audio-visual clip sampled from another time within the same source video. Second, we propose a model for generating a soundtrack for a silent input video, given a user-supplied example that specifies what the video should "sound like". We show through human studies and automated evaluation metrics that our model successfully generates sound from video, while varying its output according to the content of a supplied example.

Video Demo

*Please wear earphones or a headset and turn up the volume slightly for the best quality.

Learning from Foley Analogies

We propose a self-supervised pretext task for learning conditional Foley. Our pretext task exploits the fact that natural videos tend to contain repeated events that produce closely related sounds. During training, we randomly sample two audio-visual clips from a video and use one as the conditional example for the other. Our model then learns to infer the types of actions within the scene from the conditional example, and to generate analogous sounds to match the input example.
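The clip sampling described above can be illustrated with a minimal sketch. It is not the released training code: the function name `sample_clip_pair`, the use of start times in seconds, and the requirement that the two clips do not overlap are all assumptions made for illustration.

```python
import random


def sample_clip_pair(video_duration, clip_duration, rng=None):
    """Pick start times (in seconds) for two clips from the same video.

    Returns (conditional_start, target_start). The non-overlap constraint
    is an assumption of this sketch, not something stated on this page.
    """
    rng = rng or random.Random()
    if video_duration < 2 * clip_duration:
        raise ValueError("video is too short for two non-overlapping clips")

    # Sample two positions inside the slack that remains after reserving
    # room for both clips, then shift the later one by one clip length.
    slack = video_duration - 2 * clip_duration
    a, b = sorted(rng.uniform(0.0, slack) for _ in range(2))
    first, second = a, b + clip_duration

    # Either clip may serve as the conditional example for the other.
    if rng.random() < 0.5:
        return first, second
    return second, first


if __name__ == "__main__":
    cond_start, target_start = sample_clip_pair(10.0, 2.0, random.Random(0))
    print(f"conditional clip starts at {cond_start:.2f}s, "
          f"target clip starts at {target_start:.2f}s")
```

Because both clips come from the same video, the audio of the target clip serves as a free training signal for "what this kind of action sounds like in this recording."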

Our model takes the silent input video and the conditional audio-visual example as input. It autoregressively generates a spectrogram for the input video using a VQ-GAN, then converts the spectrogram to a waveform.
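As a rough illustration of the autoregressive step only, the sketch below samples a sequence of discrete codebook indices one at a time, each conditioned on a context and on the codes produced so far. Everything here is a hypothetical stand-in: `sample_codes`, `next_logits`, and the toy predictor are not part of the released code, and the actual network, the VQ-GAN decoder that turns codes into a spectrogram, and the spectrogram-to-waveform step are not shown.

```python
import math
import random


def sample_codes(next_logits, context, num_codes, rng=None, temperature=1.0):
    """Autoregressively sample `num_codes` codebook indices.

    `next_logits(context, codes)` is a hypothetical callable that returns
    one logit per codebook entry, given the conditioning context and the
    codes generated so far.
    """
    rng = rng or random.Random()
    codes = []
    for _ in range(num_codes):
        logits = next_logits(context, codes)
        # Numerically stable softmax weights with temperature.
        top = max(logits)
        weights = [math.exp((l - top) / temperature) for l in logits]
        codes.append(rng.choices(range(len(logits)), weights=weights, k=1)[0])
    return codes


def toy_next_logits(context, codes):
    """Toy predictor: strongly prefers (previous code + 1) modulo vocab size."""
    vocab_size = context["vocab_size"]
    prev = codes[-1] if codes else context["start"]
    preferred = (prev + 1) % vocab_size
    return [50.0 if i == preferred else 0.0 for i in range(vocab_size)]


if __name__ == "__main__":
    ctx = {"vocab_size": 4, "start": 0}
    print(sample_codes(toy_next_logits, ctx, 5, random.Random(0)))
```

In a real system the context would encode the silent input video and the conditional audio-visual example, and the sampled codes would be decoded into a spectrogram.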

Inference-time Audio Re-ranking

Inspired by other work in cross-modal generation, we use re-ranking to improve our model's predictions. We generate a large number of sounds, then select the best one, as judged by a separate classifier. Instead of using a classifier that judges the multimodal agreement between the input and output, we propose to use an off-the-shelf audio-visual synchronization model to measure the temporal alignment between the predicted sound and the input video.
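The re-ranking step amounts to scoring each candidate and keeping the highest-scoring one. The sketch below shows that selection logic under loud assumptions: `rerank` and `toy_sync_score` are hypothetical names, and the toy score (distance between lists of onset times) is only a placeholder for a learned audio-visual synchronization model, which is not reproduced here.

```python
def rerank(candidates, score_fn):
    """Return (best_candidate, best_score) under `score_fn` (higher is better)."""
    if not candidates:
        raise ValueError("need at least one candidate")
    scores = [score_fn(c) for c in candidates]
    best_idx = max(range(len(scores)), key=scores.__getitem__)
    return candidates[best_idx], scores[best_idx]


def toy_sync_score(audio_onsets, video_onsets):
    """Placeholder score: negative mean gap between each visual onset and
    the nearest audio onset. A real system would use a learned model."""
    if not audio_onsets or not video_onsets:
        return float("-inf")
    gaps = [min(abs(v - a) for a in audio_onsets) for v in video_onsets]
    return -sum(gaps) / len(gaps)


if __name__ == "__main__":
    video_onsets = [0.5, 1.0, 1.5]   # hypothetical action times in seconds
    candidates = [                   # hypothetical onset times per generated sound
        [0.1, 0.9, 1.9],
        [0.5, 1.0, 1.5],
        [0.6, 1.1, 1.4],
    ]
    best, score = rerank(candidates, lambda c: toy_sync_score(c, video_onsets))
    print("best candidate onsets:", best)
```

Keeping the scorer as a plain callable means the same selection code works whether the score comes from a toy heuristic or from a pretrained synchronization network.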

Example Results

Internet Videos

*generated with model trained on the CountixAV dataset

Greatest Hits Dataset Results

Generation with Different Conditions


Paper and Supplementary Material

Yuexi Du, Ziyang Chen, Justin Salamon, Bryan Russell, Andrew Owens.
Conditional Generation of Audio from Video via Foley Analogies.
CVPR 2023.
(Arxiv)

[Bibtex]

Acknowledgements

We thank Jon Gillick, Daniel Geng, and Chao Feng for the helpful discussions. Our code base builds on two amazing projects by Vladimir Iashin et al.; check out those projects here ([SpecVQGAN], [SparseSync]). This work was funded in part by DARPA Semafor and Cisco Systems, and by a gift from Adobe. The views, opinions and/or findings expressed are those of the authors and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government. The videos that appear on the webpage and in the video demo are credited here. The webpage template was originally made by Phillip Isola and Richard Zhang for a Colorization project.