Alibaba Cloud releases Wan2.7-Video video generation model.

15:14 03/04/2026
GMT Eight
On April 3, Alibaba Cloud's Tongyi Lab announced the official launch of its video generation model Wan2.7-Video. The model accepts text, image, video, and audio inputs across all modalities and targets the full creative workflow, covering generation, editing, reproduction, reshaping, driving, rewriting, and referencing. Alibaba claims it is more controllable, more versatile, and delivers leading performance. Users can control visual structure, plot development, local details, and temporal changes, making video editable like a document.

Users can make local adjustments to video frames through natural-language commands, with the edited regions blending seamlessly into the original footage in lighting and texture. The model supports adding or deleting elements (e.g., "remove the train from the video"), replacing objects (e.g., "replace the film with a plate"), and modifying object attributes (such as the color of a building), as well as precise insertions based on reference images (a hypothetical sketch of what such an instruction-based edit request could look like appears at the end of this article).

It also supports environment and style transfer: with the characters' motion unchanged, the background season can shift from summer to deep autumn, or the scene can be instantly restyled as felt wool, as if jumping between parallel universes. Beyond that, it handles video quality enhancement (such as colorizing black-and-white footage), visual understanding tasks (such as subject segmentation), and shooting-style adjustments (such as changing focus), covering a wide range of editing needs.

For existing footage, whether filmed or generated, plot content and shooting choices can be changed through descriptive commands. Wan2.7 can make sweeping changes to character behavior, dialogue, and even camera perspective without altering the original identities and scenes, enabling flexible secondary creation. It can rewrite a character's dialogue while keeping emotion, lip movements, and voice consistent with the new lines, or change a behavior, for example "keep everything else the same, but have the girl sitting on the sofa stand up to play the game", altering only that specified action. It can also reinterpret characters within the same scene, for instance swapping players for medieval knights and their controllers for melee weapons while preserving the original grip posture. Camera settings (position, perspective, background, lens type, focal length, etc.) can be modified as well, for example "have the camera rise gradually from the ground", giving the same material a completely different viewing experience.

Through first-and-last-frame conditioning, video continuation, and continuation combined with an ending frame, the model offers precise control over plot development, composition, and lighting, balancing dynamic continuity with structural controllability. It also supports multimodal references (images, videos, and audio) to lock appearance and voice, with up to five video subject references; each character can have a distinctive voice while keeping consistent features across multiple camera angles (a hypothetical payload sketch illustrating these controls follows below).
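The instruction-driven editing workflow described above maps naturally onto a simple API call: upload a clip, attach a natural-language command, and receive the edited video back. Alibaba Cloud has not published an endpoint or schema for Wan2.7-Video in this announcement, so the URL, field names, and parameters in the following minimal Python sketch are illustrative assumptions only, not the actual API.

```python
# Hypothetical sketch of an instruction-based video edit request.
# The endpoint, field names, and auth scheme below are illustrative
# assumptions, NOT a published Wan2.7-Video API.
import requests

API_URL = "https://example-api.aliyun.example/v1/video/edit"  # assumed endpoint
API_KEY = "your-api-key"  # placeholder credential


def edit_video(video_path: str, instruction: str) -> bytes:
    """Send a source video plus a natural-language edit command,
    e.g. "remove the train from the video", and return the edited clip."""
    with open(video_path, "rb") as f:
        response = requests.post(
            API_URL,
            headers={"Authorization": f"Bearer {API_KEY}"},
            files={"video": f},
            data={
                "instruction": instruction,  # natural-language edit command
                "blend": "seamless",         # assumed flag: match lighting/texture
            },
            timeout=600,
        )
    response.raise_for_status()
    return response.content  # assumed: edited video bytes


if __name__ == "__main__":
    clip = edit_video("street.mp4", "remove the train from the video")
    with open("street_edited.mp4", "wb") as out:
        out.write(clip)
```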
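Similarly, the control surface described in the announcement, first/last-frame conditioning, video continuation, camera commands, and up to five subject references, suggests a generation request along the following lines. Again, this is a hedged sketch: every field name here (first_frame, last_frame, continue_from, reference_subjects, camera) is an assumption made for illustration, as the announcement documents capabilities rather than a schema.

```python
# Hypothetical request payload illustrating the controls described in the
# announcement: first/last-frame conditioning, video continuation, camera
# commands, and multi-subject references. All field names are assumptions.
import json

generation_request = {
    "model": "wan2.7-video",  # assumed model identifier
    "prompt": "The knight walks toward the castle gate at dusk",
    # First/last-frame conditioning: lock the opening and closing
    # compositions while the model fills in the motion between them.
    "first_frame": "frames/opening.png",
    "last_frame": "frames/closing.png",
    # Continuation: extend an existing clip rather than starting from scratch.
    "continue_from": "clips/scene_01.mp4",
    # Up to 5 subject references locking each character's appearance and
    # voice, kept consistent across camera angles per the announcement.
    "reference_subjects": [
        {"image": "refs/knight.png", "audio": "refs/knight_voice.wav"},
        {"image": "refs/squire.png", "audio": "refs/squire_voice.wav"},
    ],
    # Camera command, mirroring the article's example instruction.
    "camera": {"motion": "rise gradually from the ground"},
}

print(json.dumps(generation_request, indent=2))
```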