Marijan Hassan - Tech Journalist

Microsoft’s VASA-1 AI can make photos talk and sing

Microsoft has announced a new AI model called VASA-1, capable of transforming a single image and audio clip of a person into a realistic video of them lip-syncing, complete with facial expressions and head movements.

The portraits in Microsoft's demos are AI-generated faces from tools like DALL·E-3 rather than photos of real people. VASA-1 combines such an image with an audio clip to produce a lifelike talking face. Microsoft's researchers claim that their method surpasses existing techniques in terms of quality and realism.

Unlike its competitors, such as Runway and Nvidia, Microsoft's model can handle audio clips of any length, generating synchronized talking faces accordingly.

In a notable experiment, the researchers used the Mona Lisa, an image not included in the model's training data, and made it lip-sync to Anne Hathaway's "Paparazzi." This demonstrates the model's ability to manipulate images beyond its training scope, including artistic photos and audio in various languages.

The researchers showcased the model's real-time capabilities in a demo video, illustrating how it can instantly animate images with head movements and facial expressions.

Acknowledging concerns that such technology could be misused to create deepfakes, the researchers emphasized their opposition to misleading or harmful content, saying they aim to use their technique to advance forgery detection.

“Our research focuses on generating visual affective skills for virtual AI avatars, aiming for positive applications. It is not intended to create content that is used to mislead or deceive. We are opposed to any behavior to create misleading or harmful contents of real persons, and are interested in applying our technique for advancing forgery detection,” the researchers wrote on the project website.

The researchers also highlighted some positive applications of their technology, including:

Enhancing educational equity
Improving accessibility for individuals with communication challenges
Offering companionship or therapeutic support to those in need

Microsoft has said that it will not release the AI model in any form until it is certain that the technology will be used responsibly and in accordance with proper regulations.

Google also presented a similar project recently, demonstrating an AI capable of turning a photo into a controllable video, complete with head movements, blinks, and hand gestures, using only the user's voice commands.
