Artificial intelligence makes voice cloning easy and ‘the monster is already on the loose’
In a video from a Jan. 25 news report, President Joe Biden talks about tanks. But a doctored version of the video has amassed hundreds of thousands of views this week on social media, making it appear he gave a speech that attacks transgender people.
Digital forensics experts say the video was created using a new generation of artificial intelligence tools, which allow anyone to quickly generate audio simulating a person’s voice with a few clicks of a button. And while the Biden clip on social media may have failed to fool most users this time, the clip shows how easy it now is for people to generate hateful and disinformation-filled “deepfake” videos that could do real-world harm.
“Tools like this are going to basically add more fuel to fire,” said Hafiz Malik, a professor of electrical and computer engineering at the University of Michigan who focuses on multimedia forensics. “The monster is already on the loose.”
The technology arrived last month with the beta phase of ElevenLabs’ voice synthesis platform, which allowed users to generate realistic audio of any person’s voice by uploading a few minutes of audio samples and typing in any text for it to say.
The startup says the technology was developed to dub audio in different languages for movies, audiobooks and gaming to preserve the speaker’s voice and emotions.
Social media users quickly began sharing an AI-generated audio sample of Hillary Clinton reading the same transphobic text featured in the Biden clip, along with fake audio clips of Bill Gates supposedly saying that the COVID-19 vaccine causes AIDS and actress Emma Watson purportedly reading Hitler’s manifesto “Mein Kampf.”
Shortly after, ElevenLabs tweeted that it was seeing “an increasing number of voice cloning misuse cases,” and announced that it was now exploring safeguards to tamp down on abuse. One of the first steps was to make the feature available only to those who provide payment information. Initially, anonymous users were able to access the voice cloning tool for free. The company also claims that if there are issues, it can trace any generated audio back to the creator.
But even the ability to track creators won’t mitigate the tool’s harm, said Hany Farid, a professor at the University of California, Berkeley, who focuses on digital forensics and misinformation.
“The damage is done,” he said.
As an example, Farid said bad actors could move the stock market with fake audio of a top CEO saying profits are down. And already there’s a clip on YouTube that used the tool to alter a video to make it appear Biden said the U.S. was launching a nuclear attack against Russia.
Free and open-source software with the same capabilities has also emerged online, meaning paywalls on commercial tools aren’t an impediment. Using one free online model, the AP generated audio samples to sound like actors Daniel Craig and Jennifer Lawrence in just a few minutes.
“The question is where to point the finger and how to put the genie back in the bottle?” Malik said. “We can’t do it.”
When deepfakes first made headlines about five years ago, they were easy enough to detect since the subject didn’t blink and audio sounded robotic. That’s no longer the case as the tools become more sophisticated.
The altered video of Biden making derogatory comments about transgender people, for instance, combined the AI-generated audio with a real clip of the president, taken from a Jan. 25 CNN live broadcast announcing the U.S. dispatch of tanks to Ukraine. Biden’s mouth was manipulated in the video to match the audio. While most Twitter users recognized that the content was not something Biden was likely to say, they were nevertheless shocked at how realistic it appeared. Others appeared to believe it was real – or at least didn’t know what to believe.
Hollywood studios have long been able to distort reality, but access to that technology has been democratized without considering the implications, said Farid.
“It’s a combination of the very, very powerful AI based technology, the ease of use, and then the fact that the model seems to be: let’s put it on the internet and see what happens next,” Farid said.
Audio is just one area where AI-generated misinformation poses a threat.
Free online AI image generators like Midjourney and DALL-E can churn out photorealistic images of war and natural disasters in the style of legacy media outlets with a simple text prompt. Last month, some school districts in the U.S. began blocking ChatGPT, which can produce readable text – like student term papers – on demand.
ElevenLabs did not respond to a request for comment.