Alibaba's Tongyi Laboratory has released and open-sourced Fun-CineForge, the first multimodal large model to support professional multi-scene dubbing for film and television. Alongside the model, the team also provides a method for constructing high-quality datasets. Through this integrated "data + model" design, Fun-CineForge aims to address the key problems facing AI dubbing in film and television production.
In real film and television production scenarios, high-quality dubbing needs to pass four rigorous tests:
Lip synchronization: the synthesized speech must be tightly synchronized with the on-screen characters' lip movements;
Emotional expression: emotion and tone must be rendered per character and freely controllable, driven by the character's facial imagery and instruction descriptions;
Voice consistency: each character's voice must remain similar and consistent across complex multi-character scenes;
Time alignment: even when the speaker is occluded or off-screen, the speech must be synthesized within the correct time window.
However, existing AI dubbing methods generally face two major bottlenecks:
01 Scarcity of high-quality multimodal datasets
High-quality dubbing data relies on information from multiple modalities, but current dubbing datasets fall short on three fronts: they are small and carry limited annotation types, making it hard to train large models effectively; they depend heavily on costly manual annotation and are difficult to produce at scale; and they lack multi-character dialogue and long-video data, leaving large models ill-equipped for complex dubbing scenarios.
02 Insufficient model capabilities
Traditional dubbing models learn audio-visual synchronization solely from the visible lip movements in video frames. Real film and television production, however, involves many complex situations, such as multi-person dialogue, frequent camera cuts, and occluded or blurred faces, so current methods struggle to keep audio and video synchronized when the speaker's face is missing from the frame.
To address these issues, Tongyi Laboratory proposed Fun-CineForge. The open-source release consists of two core components that together close the data-and-model loop for film dubbing:
Model side: A multimodal large model for dubbing in complex film scenarios
Data side: The process for constructing a large-scale multimodal dubbing dataset (CineDub)
Building on this data foundation, Fun-CineForge leverages the speech-synthesis capabilities of CosyVoice3 to build a dubbing model for complex film scenarios that performs the video-plus-text-to-speech task.
The inputs include:
Silent video clips
Dubbing text
Character attributes and emotional cues
Time information
Given a reference speech sample, the model synthesizes speech in that voice, aligned with the time and video information; a sketch of the interface follows.
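As an illustration of how these inputs could fit together, here is a minimal sketch of an inference call. Every name in it (DubbingRequest, dub, synthesize, and all parameters) is a hypothetical stand-in, not the actual Fun-CineForge API; the real interface is defined in the open-source repository.

```python
from dataclasses import dataclass

@dataclass
class DubbingRequest:
    """One dubbing job, bundling the inputs listed above (illustrative only)."""
    video_path: str      # silent video clip
    text: str            # dubbing script for this clip
    character: str       # character attributes, e.g. "young woman, gentle"
    emotion: str         # emotional cue, e.g. "suppressed anger"
    start_ms: int        # when the line should begin
    end_ms: int          # when the line should end
    reference_wav: str   # reference speech for voice cloning

def dub(model, req: DubbingRequest) -> bytes:
    # Hypothetical call: condition on video, text, a character/emotion prompt,
    # the time window, and a reference voice; return synthesized audio bytes.
    return model.synthesize(
        video=req.video_path,
        text=req.text,
        prompt=f"{req.character}; emotion: {req.emotion}",
        time_span=(req.start_ms, req.end_ms),
        voice_reference=req.reference_wav,
    )
```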
Fun-CineForge first establishes an automated dataset production pipeline that turns raw film footage into structured multimodal data.
The pipeline covers vocal separation, text transcription, long-video segmentation, and audio-visual joint speaker diarization. A bidirectional correction mechanism, driven by the chain-of-thought reasoning of a general-purpose large model, sharply reduces the error rates of both the transcripts and the speaker-diarization results:
Chinese character error rate decreased from 4.53% to 0.94%;
English word error rate decreased from 9.35% to 2.12%;
Speaker diarization error rate decreased from 8.38% to 1.20%.
The data covers monologues, narrations, dialogues, multi-speaker scenes, and more. Each sample includes the transcribed line, frame-level face and lip data, character attributes and emotional cues, millisecond-level timestamps, and a clean voice track.
These complementary, interdependent multimodal annotations provide a solid foundation for training large models with professional dubbing capabilities.
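To make the sample structure concrete, here is a minimal sketch of what one CineDub record could look like. The field names and types are assumptions for illustration, not the released schema.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class CineDubSample:
    """Hypothetical schema for one CineDub record (illustrative, not official)."""
    text: str                      # transcribed line
    face_lip_frames: List[str]     # frame-level face/lip crops (file paths)
    character_attributes: str      # e.g. "elderly man, gravelly voice"
    emotion_cue: str               # e.g. "weary resignation"
    time_span_ms: Tuple[int, int]  # millisecond-level start/end timestamps
    clean_vocal_path: str          # separated, clean voice track
    scene_type: str = "monologue"  # monologue / narration / dialogue / multi-speaker
```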
Note: statistics of the CineDub dataset, built from over 350 Chinese and English films and TV dramas, across scene categories, age distribution, personality distribution, and frequently used dubbing terms.
Fun-CineForge's most important technical innovation is introducing a "time modality" into the dubbing model for the first time. Traditional TTS models typically attend only to text content, audio features, or visual information, but film and television dubbing has one more crucial dimension: time.
For example:
When does the speech start
When does the speech end
Which character is speaking during that period
Together, this information tells the model directly "during which time span, which character says what." In scenes where the speaker is not visible to the visual modality, the time modality serves as a strong supervision target, ensuring that speech lands in the correct time window.
This is what equips the model to dub complex scenarios; the sketch below illustrates the idea.
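One simple way to picture the time modality is as a frame-level supervision mask derived from the timestamps. The sketch below is a guess at the general idea rather than the paper's actual formulation: it converts per-character (start, end) spans into a per-frame speaker-activity matrix that a model could be supervised against, even when no face is visible.

```python
import numpy as np

def speaker_activity_mask(spans_ms, num_speakers, clip_ms, frame_ms=40):
    """Turn per-speaker (speaker_id, start_ms, end_ms) spans into a 0/1 mask.

    Returns an array of shape (num_frames, num_speakers) where entry [t, s]
    is 1 if speaker s should be speaking during frame t.
    """
    num_frames = clip_ms // frame_ms
    mask = np.zeros((num_frames, num_speakers), dtype=np.int8)
    for speaker_id, start_ms, end_ms in spans_ms:
        start_f = start_ms // frame_ms
        end_f = min(num_frames, -(-end_ms // frame_ms))  # ceil, clipped to clip end
        mask[start_f:end_f, speaker_id] = 1
    return mask

# Two speakers in a 2-second clip: speaker 0 talks first, then speaker 1.
mask = speaker_activity_mask([(0, 0, 800), (1, 1200, 2000)],
                             num_speakers=2, clip_ms=2000)
print(mask.shape)  # (50, 2) at 40 ms per frame
```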
To achieve these capabilities, the Fun-CineForge model jointly uses four kinds of information that complement and reinforce one another:
Visual modality: Learns lip movements, understands facial expressions;
Text modality: Provides dialogue content, describes character attributes and emotional tones;
Audio modality: Acts as the model's prediction target;
Time modality: Controls the occurrence of speech, indicating the speaker's identity in dialogue scenes.
Experimental results show that Fun-CineForge's dubbing model outperforms existing open-source dubbing models in several key indicators, including:
Speech naturalness
Word error rate
Emotional expression ability
Voice similarity
Lip synchronization
Time alignment ability
Instruction-following ability
Fun-CineForge performs best in single-person monologue and narration scenes, and it is the first model to support two-person and multi-person dialogue dubbing while achieving accurate time alignment, audio-visual synchronization, and voice consistency.
Tongyi Laboratory evaluated Fun-CineForge comprehensively on the self-built CineDub dataset, covering typical film and television dubbing scenarios such as monologues, narrations, dialogues, and multi-person scenes. Single-person scenarios perform best, with Chinese character error rates of only 1.49% (monologue) and 1.90% (narration) and precise audio-visual synchronization.
In monologue scenes, Tongyi Laboratory compared Fun-CineForge with DeepDubber-V1 and InstructDubber. The results show that Fun-CineForge outperforms the baseline models in word error rate, lip synchronization, time alignment, voice similarity, and other indicators.
Note: CER/WER are the Chinese character / English word error rates (lower is better); SPK-SIM is voice similarity (higher is better); SPK-TL is the time-alignment error (lower is better); LSE-C/D measure lip synchronization (higher C and lower D are better).
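For readers unfamiliar with these metrics, character/word error rate is the edit distance between the recognized transcript and the reference text, normalized by the reference length. The following self-contained sketch is a generic implementation, not the paper's evaluation code:

```python
def error_rate(reference, hypothesis):
    """Edit-distance error rate over token sequences.

    Pass character lists for CER or word lists for WER.
    """
    m, n = len(reference), len(hypothesis)
    # dp[i][j] = minimum edits turning reference[:i] into hypothesis[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[m][n] / max(m, 1)

# CER over characters, WER over words:
print(error_rate(list("今天天气很好"), list("今天天汽很好")))     # 1/6 ≈ 0.167
print(error_rate("the cat sat".split(), "the cat sit".split()))  # 1/3 ≈ 0.333
```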
Fun-CineForge is now open source, and developers can try its dubbing capabilities across complex Chinese and English film and television scenarios (including emotional expression, camera cuts, face occlusion, and more).
(The project page provides rich samples of monologues, narrations, dialogues, multi-speaker clips, voice cloning, and instruction control. The samples cover the complex situations of real productions: emotional expression, frequent camera cuts, frequent speaker changes, occluded speakers or the camera focusing on another character, dark scenes, and multiple people in frame.)
Technical paper: Fun-CineForge: A Unified Dataset Toolkit and Model for Zero-Shot Movie Dubbing in Diverse Cinematic Scenes
Dataset examples: example CineDub data with the original videos removed has been open-sourced, including the bilingual CineDub-CN and CineDub-EN subsets for reference.
AI speech technology is already widely deployed in customer service and assistant scenarios, but professional film, animation, and post-production work imposes higher demands. For longer videos, as the number of given timestamp intervals and reference character audio clips grows, audio-visual synchronization and voice-cloning accuracy degrade, and robustness drops in multi-person dialogue scenes.
Fun-CineForge provides a new technical path for audio large models in professional dubbing production; it currently supports inference on video clips of up to 30 seconds.
In the future, with the continuous improvement of multi-modal large model capabilities, it is hoped that AI can play a greater role in content production in fields such as film, animation, and gaming.