More specifically, the audio in the video is separate from the visuals, which the researchers assembled to match the audio. This lip-sync capability is a breakthrough because earlier attempts in this field have been less successful. Matching audio to a person speaking on video is tough: in most cases, even a casual viewer can see that something is not right. Researchers call this the uncanny valley, where human replicas look almost real but come across as creepy or unnatural. At the University of Washington, researchers set out to create a realistic human replica that aligned perfectly with the audio speech. Using 14 hours of audio from Obama’s weekly addresses, the team trained a neural network on his speech.
Once training was complete, the system could generate mouth shapes that synced with the audio. Next, the AI produced a realistic-looking mouth modeled on Obama’s own. This mouth was synced to the audio and superimposed onto a separate video of Obama.
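For readers who want to picture the steps, here is a toy sketch of the same three stages: audio in, mouth shape out, mouth pasted onto a different video frame. It is an illustration under heavy assumptions, not the researchers' method. The function names are hypothetical, the "model" is a hand-written loudness rule standing in for the trained neural network, and frames are tiny grids of grayscale numbers rather than real video.

```python
# Toy sketch only: nothing here is the researchers' code or model.

def predict_mouth_openness(audio_energy):
    """Hypothetical stand-in for the trained network: louder audio -> wider mouth (0..1)."""
    peak = max(audio_energy) or 1.0  # avoid dividing by zero on silent audio
    return [energy / peak for energy in audio_energy]

def render_mouth_patch(openness, height=4, width=6):
    """Draw a tiny grayscale patch whose dark 'open mouth' rows grow with openness."""
    open_rows = round(openness * height)
    return [[0 if r < open_rows else 200 for _ in range(width)] for r in range(height)]

def composite(frame, patch, top, left, alpha=1.0):
    """Alpha-blend the patch onto a copy of the frame at (top, left)."""
    out = [row[:] for row in frame]  # leave the source frame untouched
    for r, patch_row in enumerate(patch):
        for c, value in enumerate(patch_row):
            base = out[top + r][left + c]
            out[top + r][left + c] = round(alpha * value + (1 - alpha) * base)
    return out

def lip_sync(audio_energy, target_frames, top=2, left=1):
    """Make one mouth patch per audio frame and paste it onto the matching video frame."""
    shapes = predict_mouth_openness(audio_energy)
    return [
        composite(frame, render_mouth_patch(shape), top, left)
        for shape, frame in zip(shapes, target_frames)
    ]
```

Calling `lip_sync([0.0, 0.5, 1.0], frames)` with three 8x8 frames returns three new frames in which the pasted mouth region gets darker as the audio gets louder. A real system would replace the loudness rule with a learned model and the hard paste with careful blending.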
Getting it Right
Speech comes from the mouth, but nuances in head and jaw movement also matter. For extra realism, the team used the system to adjust the timing of head and jaw movements. As the videos show, the results are not 100% perfect, but they are more advanced than anything we have seen before, and the quality improves the more the system learns. Still, the team says there are occasional syncing mistakes, and the jaw sometimes glitches (two-chin Obama, anyone?). The team will present its work in ACM Transactions on Graphics.