We present LatentSync, an end-to-end lip-sync framework based on audio-conditioned latent diffusion models, without any intermediate motion representation, diverging from previous diffusion-based lip-sync methods that rely on pixel-space diffusion or two-stage generation. Our framework leverages the powerful generative capabilities of Stable Diffusion to directly model complex audio-visual correlations.
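To make the audio-conditioned latent diffusion idea concrete, below is a minimal sketch of one training step: VAE face latents are noised, and a denoiser attends to audio tokens via cross-attention to predict the noise. All module names (`AudioEncoder`, `CrossAttnDenoiser`), shapes, and the noise schedule are illustrative assumptions, not the actual LatentSync implementation.

```python
# Minimal sketch of one audio-conditioned latent-diffusion training step.
# Module names, shapes, and the schedule are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AudioEncoder(nn.Module):
    """Maps a mel-spectrogram window to a fixed number of conditioning tokens."""
    def __init__(self, n_mels=80, dim=768, n_tokens=16):
        super().__init__()
        self.proj = nn.Linear(n_mels, dim)
        self.n_tokens = n_tokens

    def forward(self, mel):                           # mel: (B, T, n_mels)
        x = self.proj(mel)                            # (B, T, dim)
        x = F.adaptive_avg_pool1d(x.transpose(1, 2), self.n_tokens)
        return x.transpose(1, 2)                      # (B, n_tokens, dim)


class CrossAttnDenoiser(nn.Module):
    """Toy stand-in for a UNet that denoises VAE latents given audio tokens
    (timestep embedding omitted for brevity)."""
    def __init__(self, latent_ch=4, dim=768):
        super().__init__()
        self.to_seq = nn.Conv2d(latent_ch, dim, kernel_size=1)
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.to_latent = nn.Conv2d(dim, latent_ch, kernel_size=1)

    def forward(self, z_t, t, audio_tokens):          # z_t: (B, 4, H, W)
        b, _, h, w = z_t.shape
        q = self.to_seq(z_t).flatten(2).transpose(1, 2)    # (B, H*W, dim)
        out, _ = self.attn(q, audio_tokens, audio_tokens)  # audio cross-attention
        out = out.transpose(1, 2).reshape(b, -1, h, w)
        return self.to_latent(out)                    # predicted noise


def training_step(z0, mel, audio_enc, denoiser, n_steps=1000):
    """Standard epsilon-prediction diffusion loss on face latents z0."""
    b = z0.size(0)
    t = torch.randint(0, n_steps, (b,), device=z0.device)
    # Simple linear alpha-bar schedule, for illustration only.
    alpha_bar = (1.0 - t.float() / n_steps).view(b, 1, 1, 1)
    noise = torch.randn_like(z0)
    z_t = alpha_bar.sqrt() * z0 + (1 - alpha_bar).sqrt() * noise
    eps_pred = denoiser(z_t, t, audio_enc(mel))
    return F.mse_loss(eps_pred, noise)
```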
| Original video | Lip-synced video |
| --- | --- |
| demo1_video.mp4 | demo1_output.mp4 |
| demo2_video.mp4 | demo2_output.mp4 |
| demo3_video.mp4 | demo3_output.mp4 |
| demo4_video.mp4 | demo4_output.mp4 |
| demo5_video.mp4 | demo5_output.mp4 |
(Photorealistic videos are filmed by contracted models, and anime videos are from VASA-1 and EMO.)
- Inference code and checkpoints
- Data processing pipeline
- Training code