Academic Paper
FaceChain-ImagineID: Freely Crafting High-Fidelity Diverse Talking Faces from Disentangled Audio
Document Type
Conference
Author
Source
2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1292-1302, Jun. 2024
Subject
Language
ISSN
2575-7075
Abstract
In this paper, we abstract the process of people hearing speech, extracting meaningful cues, and creating various dynamically audio-consistent talking faces, termed Listening and Imagining, into the task of high-fidelity, diverse talking face generation from a single audio clip. Specifically, it involves two critical challenges: one is to effectively decouple identity, content, and emotion from entangled audio, and the other is to maintain intra-video diversity and inter-video consistency. To tackle these issues, we first examine the intricate relationships among facial factors and simplify the decoupling process, tailoring a Progressive Audio Disentanglement for accurate facial geometry and semantics learning, where each stage incorporates a customized training module responsible for a specific factor. Secondly, to achieve visually diverse and audio-synchronized animation solely from input audio within a single model, we introduce Controllable Coherent Frame generation, which flexibly integrates three trainable adapters with frozen Latent Diffusion Models (LDMs) to maintain facial geometry and semantics, as well as texture and temporal coherence between frames. In this way, we inherit high-quality diverse generation from LDMs while significantly improving their controllability at a low training cost. Extensive experiments demonstrate the flexibility and effectiveness of our method in handling this paradigm. The code will be released at FaceChain.
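The abstract's central design, keeping the pretrained LDM frozen while optimizing only lightweight adapters for geometry/semantics, texture, and temporal coherence, can be illustrated with a minimal PyTorch-style sketch. The module names, input dimensions, and additive injection point below are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

# Minimal sketch (assumptions, not the paper's code): a toy frozen denoiser
# stands in for the LDM UNet; three small trainable adapters inject control
# signals for geometry/semantics, texture, and temporal coherence.

class FrozenDenoiser(nn.Module):
    """Toy stand-in for the pretrained LDM denoising network (kept frozen)."""
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.SiLU(), nn.Linear(dim, dim))

    def forward(self, z_t, cond):
        # For illustration, adapter output is simply added to the noisy latent.
        return self.net(z_t + cond)

class Adapter(nn.Module):
    """Lightweight trainable adapter mapping a control signal into latent space."""
    def __init__(self, in_dim, dim=64):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(in_dim, dim), nn.SiLU(), nn.Linear(dim, dim))

    def forward(self, x):
        return self.proj(x)

denoiser = FrozenDenoiser()
for p in denoiser.parameters():          # freeze the pretrained LDM weights
    p.requires_grad_(False)

# Hypothetical control inputs: geometry codes, a texture/identity reference
# embedding, and features from the previous frame for temporal coherence.
geo_adapter, tex_adapter, tmp_adapter = Adapter(32), Adapter(128), Adapter(64)

params = (list(geo_adapter.parameters())
          + list(tex_adapter.parameters())
          + list(tmp_adapter.parameters()))
optimizer = torch.optim.AdamW(params, lr=1e-4)   # only adapters are optimized

# One illustrative denoising-style training step.
z_t = torch.randn(4, 64)                 # noisy latent
target = torch.randn(4, 64)              # e.g., noise to be predicted
geo, tex, prev = torch.randn(4, 32), torch.randn(4, 128), torch.randn(4, 64)

cond = geo_adapter(geo) + tex_adapter(tex) + tmp_adapter(prev)
loss = nn.functional.mse_loss(denoiser(z_t, cond), target)
loss.backward()
optimizer.step()
```

Because only the adapters receive gradients, training cost stays low while the frozen LDM retains its generative diversity, which is the trade-off the abstract describes.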