We generate a soundtrack for a silent input video, given a user-provided conditional example specifying what its audio should "sound like."
The sound effects that designers add to videos are crafted to convey a particular artistic effect and thus may be quite different from a scene's true sound. Inspired by the challenge of creating a soundtrack for a video that differs from its true sound, yet still matches the actions occurring on screen, we propose the problem of conditional Foley. We present the following contributions to address this problem. First, we propose a pretext task for training our model to predict sound for an input video clip, using a conditional audio-visual clip sampled from another time within the same source video. Second, we propose a model for generating a soundtrack for a silent input video, given a user-supplied example that specifies what the video should "sound like." We show through human studies and automated evaluation metrics that our model successfully generates sound from video, while varying its output according to the content of the supplied example.
We propose a self-supervised pretext task for learning conditional Foley. Our pretext task exploits the fact that natural videos tend to contain repeated events that produce closely related sounds. During training, we randomly sample two audio-visual clips from a video and then use one as the conditional example for the other. Our model then learns to infer the types of actions within the scene from the conditional example, and to generate analogous sounds for the input video.
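In sketch form, the sampling step of this pretext task might look like the following (the `video` accessors, clip length, and gap threshold below are illustrative assumptions, not the paper's actual code):

```python
import random

def sample_training_pair(video, clip_len=2.0, min_gap=1.0):
    """Sample two non-overlapping audio-visual clips from one source video.

    One clip serves as the conditional example; the other provides the
    silent input video plus its true audio as the training target.
    `video.duration`, `video.frames(...)`, and `video.audio(...)` are
    hypothetical accessors; the video is assumed long enough to hold
    two separated clips.
    """
    t1 = random.uniform(0, video.duration - clip_len)
    # Resample the second start time until the clips are separated in time.
    while True:
        t2 = random.uniform(0, video.duration - clip_len)
        if abs(t2 - t1) >= clip_len + min_gap:
            break
    condition = (video.frames(t1, t1 + clip_len),
                 video.audio(t1, t1 + clip_len))
    input_frames = video.frames(t2, t2 + clip_len)
    target_audio = video.audio(t2, t2 + clip_len)  # supervision signal
    return condition, input_frames, target_audio
```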
Our model takes the silent input video and the conditional audio-visual example as input. It autoregressively generates a spectrogram for the input video using a VQ-GAN, then converts this spectrogram to a waveform.
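A minimal sketch of this inference pipeline follows; the module names and interfaces are assumptions for illustration, not the released implementation:

```python
import torch

@torch.no_grad()
def generate_soundtrack(model, input_frames, cond_frames, cond_audio):
    """Generate a waveform for a silent clip, guided by a conditional example.

    `model` is assumed to bundle a VQ-GAN spectrogram codebook, an
    autoregressive transformer over discrete code indices, and a
    spectrogram-to-waveform converter.
    """
    # Encode the conditional audio into discrete VQ-GAN codes.
    cond_codes = model.vqgan.encode(cond_audio)

    # Autoregressively predict spectrogram codes for the input video,
    # conditioned on the example's frames and audio codes.
    codes = model.transformer.generate(
        input_frames=input_frames,
        cond_frames=cond_frames,
        cond_codes=cond_codes,
    )

    # Decode the codes to a spectrogram, then invert it to a waveform.
    spectrogram = model.vqgan.decode(codes)
    return model.to_waveform(spectrogram)
```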
Inspired by other work in cross-modal generation, we use re-ranking to improve our model's predictions. We generate a large number of sounds, then select the best one, as judged by a separate classifier. Instead of using a classifier that judges the multimodal agreement between the input and output, we propose to use an off-the-shelf audio-visual synchronization model to measure the temporal alignment between the predicted sound and the input video.
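The re-ranking step amounts to sampling many candidates and keeping the one the synchronization model scores highest. A sketch (the function names and candidate count are hypothetical):

```python
def rerank(sample_fn, sync_model, input_frames, n_candidates=100):
    """Pick the candidate soundtrack best synchronized with the video.

    `sample_fn()` draws one candidate waveform from the generative model
    (e.g., the generate_soundtrack sketch above); `sync_model(frames, wave)`
    is assumed to return a scalar alignment score (higher is better).
    """
    best_wave, best_score = None, float("-inf")
    for _ in range(n_candidates):
        wave = sample_fn()
        score = sync_model(input_frames, wave)
        if score > best_score:
            best_wave, best_score = wave, score
    return best_wave
```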
Internet Videos (generated with a model trained on the CountixAV dataset):

Greatest Hits Dataset Results:
Yuexi Du, Ziyang Chen, Justin Salamon, Bryan Russell, Andrew Owens. Conditional Generation of Audio from Video via Foley Analogies. CVPR 2023. (arXiv)
Acknowledgements |