We present LatentSync, an end-to-end lip-sync framework based on audio-conditioned latent diffusion models, without any intermediate motion representation, diverging from previous diffusion-based lip-sync methods that rely on pixel-space diffusion or two-stage generation. Our framework leverages the powerful generative capabilities of Stable Diffusion to directly model complex audio-visual correlations.
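To make the audio-conditioned latent diffusion idea concrete, below is a minimal sketch of one training step: VAE face latents are noised, and a denoiser attends to audio tokens via cross-attention to predict the noise. All module names (`AudioEncoder`, `CrossAttnDenoiser`), shapes, and the noise schedule are illustrative assumptions, not the actual LatentSync implementation.

```python
# Minimal sketch of one audio-conditioned latent-diffusion training step.
# Module names, shapes, and the schedule are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AudioEncoder(nn.Module):
    """Maps a mel-spectrogram window to a fixed number of conditioning tokens."""
    def __init__(self, n_mels=80, dim=768, n_tokens=16):
        super().__init__()
        self.proj = nn.Linear(n_mels, dim)
        self.n_tokens = n_tokens

    def forward(self, mel):                           # mel: (B, T, n_mels)
        x = self.proj(mel)                            # (B, T, dim)
        x = F.adaptive_avg_pool1d(x.transpose(1, 2), self.n_tokens)
        return x.transpose(1, 2)                      # (B, n_tokens, dim)


class CrossAttnDenoiser(nn.Module):
    """Toy stand-in for a UNet that denoises VAE latents given audio tokens
    (timestep embedding omitted for brevity)."""
    def __init__(self, latent_ch=4, dim=768):
        super().__init__()
        self.to_seq = nn.Conv2d(latent_ch, dim, kernel_size=1)
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.to_latent = nn.Conv2d(dim, latent_ch, kernel_size=1)

    def forward(self, z_t, t, audio_tokens):          # z_t: (B, 4, H, W)
        b, _, h, w = z_t.shape
        q = self.to_seq(z_t).flatten(2).transpose(1, 2)    # (B, H*W, dim)
        out, _ = self.attn(q, audio_tokens, audio_tokens)  # audio cross-attention
        out = out.transpose(1, 2).reshape(b, -1, h, w)
        return self.to_latent(out)                    # predicted noise


def training_step(z0, mel, audio_enc, denoiser, n_steps=1000):
    """Standard epsilon-prediction diffusion loss on face latents z0."""
    b = z0.size(0)
    t = torch.randint(0, n_steps, (b,), device=z0.device)
    # Simple linear alpha-bar schedule, for illustration only.
    alpha_bar = (1.0 - t.float() / n_steps).view(b, 1, 1, 1)
    noise = torch.randn_like(z0)
    z_t = alpha_bar.sqrt() * z0 + (1 - alpha_bar).sqrt() * noise
    eps_pred = denoiser(z_t, t, audio_enc(mel))
    return F.mse_loss(eps_pred, noise)
```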
| Original video | Lip-synced video |
| --- | --- |
| demo1_video.mp4 | demo1_output.mp4 |
| demo2_video.mp4 | demo2_output.mp4 |
| demo3_video.mp4 | demo3_output.mp4 |
| demo4_video.mp4 | demo4_output.mp4 |
| demo5_video.mp4 | demo5_output.mp4 |
(Photorealistic videos are filmed by contracted models, and anime videos are from VASA-1 and EMO.)
- Inference code and checkpoints
- Data processing pipeline
- Training code