Google VLOGGER
The system is similar to Alibaba’s EMO application, which we have introduced to you before. VLOGGER’s goal is to generate a variable-length photorealistic video of a talking target person, including head motion and gestures. It works in two stages. The first network takes an audio waveform as input and generates intermediate body-motion controls, responsible for gaze, facial expressions, and pose, across the full length of the target video. The second network is a temporal image-to-image translation model that extends large image diffusion models by conditioning on the estimated body controls; to tie the output to a specific identity, this network also takes a reference image of the person.
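A minimal PyTorch sketch of this two-stage structure may help make the data flow concrete. VLOGGER’s code has not been released, so every class name, layer choice, and dimension below is a hypothetical stand-in for illustration, not the actual architecture:

```python
import torch
import torch.nn as nn

class AudioToMotion(nn.Module):
    """Stage 1 (hypothetical): maps audio features to per-frame
    body-motion controls (gaze, expression, pose)."""
    def __init__(self, audio_dim=128, control_dim=64):
        super().__init__()
        self.encoder = nn.GRU(audio_dim, control_dim, batch_first=True)

    def forward(self, audio_features):          # (B, T, audio_dim)
        controls, _ = self.encoder(audio_features)
        return controls                         # (B, T, control_dim)

class TemporalImageToImage(nn.Module):
    """Stage 2 (hypothetical): renders video frames from the motion
    controls, conditioned on a reference-image embedding for identity."""
    def __init__(self, control_dim=64, id_dim=64, frame_pixels=3 * 64 * 64):
        super().__init__()
        self.renderer = nn.Linear(control_dim + id_dim, frame_pixels)

    def forward(self, controls, identity):      # identity: (B, id_dim)
        B, T, _ = controls.shape
        ident = identity.unsqueeze(1).expand(-1, T, -1)
        frames = self.renderer(torch.cat([controls, ident], dim=-1))
        return frames.view(B, T, 3, 64, 64)     # (B, T, C, H, W)

# Toy end-to-end pass over 2 seconds of "audio" at 25 fps.
audio = torch.randn(1, 50, 128)    # stand-in for audio features
identity = torch.randn(1, 64)      # stand-in for a reference-image embedding
motion = AudioToMotion()(audio)
video = TemporalImageToImage()(motion, identity)
print(video.shape)                 # torch.Size([1, 50, 3, 64, 64])
```

The key design point this sketch illustrates is the separation of concerns: audio only has to explain motion, while the image model only has to render identity-consistent frames from that motion.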
Diversity is an important measure of the model’s success: VLOGGER produces videos with a significant amount of motion and realism while still generating a diverse distribution of outputs for the same original subject. Furthermore, VLOGGER’s video-editing applications are also quite impressive. For example, it can take an existing video and close the subject’s mouth or eyes to change their expression, keeping the edit consistent with the original, unaltered pixels, as sketched below.
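The editing behavior described above amounts to regenerating only a masked region while copying everything else from the source frames. Here is a minimal sketch of that compositing idea; the function name, mask coordinates, and tensor shapes are illustrative assumptions, not part of any released VLOGGER API:

```python
import torch

def composite_edit(original, generated, mask):
    """original, generated: (T, 3, H, W); mask: (T, 1, H, W), 1 inside the
    region to edit. Output matches the original everywhere outside the mask,
    so edits stay consistent with the unaltered pixels."""
    return mask * generated + (1 - mask) * original

frames = torch.rand(50, 3, 64, 64)     # original video frames
edited = torch.rand(50, 3, 64, 64)     # stand-in for model output
mouth_mask = torch.zeros(50, 1, 64, 64)
mouth_mask[:, :, 40:55, 20:44] = 1.0   # hypothetical mouth region
result = composite_edit(frames, edited, mouth_mask)
print(result.shape)                    # torch.Size([50, 3, 64, 64])
```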
Google and Artificial Intelligence
VLOGGER represents an important step forward in AI-driven talking-human video generation. It stands out from other state-of-the-art methods in image quality, identity preservation, and temporal consistency, and it could shape future developments in this field and open up impressive application areas. You can follow the link for a more detailed review and to see the work done.