Google has announced Lumiere, a cutting-edge text-to-video diffusion model that generates realistic video from text or image prompts.
Unlike previous text-to-video (TTV) models, Lumiere produces spatially and temporally consistent video, a significant leap forward: scenes stay visually consistent and motion remains smooth across every frame.
The model's capabilities include text-to-video generation, image-to-video conversion, video editing, video inpainting, and stylized rendering, letting users produce a wide range of videos from text prompts or reference images.
Here are the highlights of Lumiere:
- Text to video: Given a text prompt, Lumiere generates a 5-second clip of 80 frames at 16 frames per second.
- Image to video: Lumiere takes an image as a prompt and animates it into a video.
- Stylized generation: Given a reference image, Lumiere can generate video in that image's style.
- Video inpainting: Lumiere can take a video with a masked region and fill it in, enabling localized edits.
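The clip-length figures quoted above are easy to sanity-check: 80 frames played back at 16 frames per second gives 5 seconds of video. A trivial sketch of that arithmetic (the constants are the article's figures, not an official Lumiere API):

```python
# Clip-length arithmetic using the figures quoted in the article.
FRAMES = 80   # total frames per generated clip
FPS = 16      # playback rate, frames per second

duration_s = FRAMES / FPS
print(duration_s)  # 5.0 seconds
```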
Under the hood, Lumiere uses a Space-Time U-Net (STUNet) architecture that learns to downsample the video in both space and time, doing most of its computation in a compact spatio-temporal representation and generating the clip's full duration in a single pass. This design is what yields motion that stays consistent across both space and time.
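The core idea of joint space-time downsampling can be illustrated with a toy example. The sketch below is not Google's implementation, just a minimal NumPy illustration: it average-pools a video tensor over time, height, and width simultaneously, producing the kind of compact spatio-temporal representation in which a STUNet does most of its work.

```python
import numpy as np

def spacetime_downsample(video, t=2, s=2):
    """Average-pool a (T, H, W, C) video by factor t in time and s in space.

    Toy illustration of joint space-time downsampling; a real STUNet uses
    learned convolutions, not fixed average pooling.
    """
    T, H, W, C = video.shape
    # Crop so the dimensions divide evenly, then block-reshape and average.
    v = video[: T - T % t, : H - H % s, : W - W % s]
    v = v.reshape(T // t, t, H // s, s, W // s, s, C)
    return v.mean(axis=(1, 3, 5))

video = np.random.rand(16, 64, 64, 3)   # 16 frames of 64x64 RGB
coarse = spacetime_downsample(video)
print(coarse.shape)  # (8, 32, 32, 3): halved in time and in each spatial axis
```

Downsampling the temporal axis alongside the spatial ones is the key departure from earlier TTV architectures, which typically kept the full frame count throughout the network.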

In a user study, Google Research found that Lumiere was preferred over other TTV models. Lumiere can produce longer, more coherent videos, but it still has limitations when handling scene transitions and multi-shot scenes.
Noting the risk of misuse, Google said it hopes to find ways to effectively watermark generated videos and avoid copyright issues, with the goal of ensuring Lumiere is used safely and ethically. What do you think of Google's announcement of its text-to-video model Lumiere?
