Microsoft Research has recently published VASA-1, a model for generating lifelike talking-head videos. Given only a single face image and a speech audio clip as input, it produces natural lip motion synchronized with the audio, along with realistic facial expressions and head movements. The highlight of this model is its real-time capability, allowing faces to be animated with very low latency.
The VASA-1 model generates 512×512 video at 45 frames per second in offline batch mode, and up to 40 frames per second in online streaming mode with a preceding latency of only about 170 ms, measured on a desktop PC with a single NVIDIA RTX 4090 GPU.
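To put those throughput figures in perspective, each frame rate implies a hard per-frame compute budget. The short Python sketch below just does that arithmetic; the FPS values come from the paragraph above, and the script itself is purely illustrative:

```python
# Back-of-the-envelope per-frame time budgets implied by the reported
# throughput figures. These are derived from the FPS numbers above,
# not measurements of VASA-1 itself.

def frame_budget_ms(fps: float) -> float:
    """Return the maximum time (in ms) each frame may take at a given FPS."""
    return 1000.0 / fps

for mode, fps in [("offline batch", 45), ("online streaming", 40)]:
    print(f"{mode}: {fps} FPS -> {frame_budget_ms(fps):.1f} ms per frame")

# Output:
# offline batch: 45 FPS -> 22.2 ms per frame
# online streaming: 40 FPS -> 25.0 ms per frame
```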
Additionally, VASA-1 accepts optional control signals, such as eye gaze direction, head distance, and emotion offsets. In their experiments, the researchers also tested out-of-distribution inputs, such as an image of the Mona Lisa paired with non-English audio, and the model produced convincing results even though such data was absent from its training set.
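Microsoft has released no code or API for VASA-1, so there is no real interface to demonstrate. Purely as a sketch of the input/output shape described above, here is a hypothetical Python signature; every name in it (`ControlSignals`, `generate_talking_head`, and all parameters) is invented for illustration and does not correspond to any released library:

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical control signals mirroring the optional conditions described
# above (gaze direction, head distance, emotion offset). These types are
# illustrative only; no such public API exists.

@dataclass
class ControlSignals:
    gaze_direction: tuple[float, float] = (0.0, 0.0)  # e.g. (yaw, pitch) in degrees
    head_distance: float = 1.0                        # relative distance to camera
    emotion_offset: Optional[str] = None              # e.g. "happy", "neutral"

def generate_talking_head(face_image_path: str,
                          audio_path: str,
                          controls: ControlSignals | None = None,
                          streaming: bool = False):
    """Hypothetical entry point: one face image plus one audio clip in,
    a 512x512 talking-head video (or a frame stream) out."""
    raise NotImplementedError("VASA-1 has not been publicly released")

# Illustrative usage, matching the inputs described in the article:
# video = generate_talking_head("mona_lisa.jpg", "speech.wav",
#                               controls=ControlSignals(emotion_offset="neutral"))
```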
Capabilities like these may raise concerns, especially on the heels of other recent AI advances such as OpenAI's speech synthesis: face-cloning clips can now be generated in real time. Microsoft has clarified that videos created with VASA-1 still contain identifiable artifacts and can be distinguished from real footage. Nevertheless, in light of the potential for misuse, Microsoft has no plans to commercialize the technology or release an API, a product, or additional implementation details until appropriate usage guidelines and legal regulations are established.
Source: Microsoft Research
TLDR: Microsoft Research introduces VASA-1, a model that generates talking-head videos in real time with natural facial motion from a single image and an audio clip, supporting optional control signals and high-resolution output, but it will not be publicly released until appropriate usage guidelines are in place.