Jan 27, 2022 7:00 AM

Synthetic Voices Want to Take Over Audiobooks

Publishers hope computer-generated voices can help them tap surging demand, but some fans—and Amazon—are resisting the robots.

When voice actor Heath Miller sits down in his boatshed-turned-home studio in Maine to record a new audiobook narration, he has already read the text through carefully at least once. To deliver his best performance, he takes notes on each character and any hints of how they should sound. Over the past two years, audiobook roles, like narrating popular fantasy series He Who Fights With Monsters, have become Miller’s main source of work. But in December he briefly turned online detective after he saw a tweet from UK sci-fi author Jon Richter disclosing that his latest audiobook had no need for the kind of artistry Miller offers: It was narrated by a synthetic voice.

Richter’s book listing on Amazon’s Audible credited that voice as “Nicholas Smith” without disclosing that it wasn’t human. To Miller’s surprise, he found that “Smith” had voiced around half a dozen titles on the site from multiple publishers, breaching Audible rules that say audiobooks “must be narrated by a human.” Although “Smith” sounded more expressive than a typical synthetic voice, to Miller’s ear it was plainly artificial and offered a worse experience than a human narrator. It made giveaway mistakes, like pronouncing Covid as “kah-viid” when referring to the pandemic.

Miller tracked down “Smith”—the voice matched a sample posted to SoundCloud by Speechki, a San Francisco startup that offers more than 300 synthetic voices for audiobook publishing across 77 dialects and languages. He and other narrators and audio fans who discussed the artificial audiobooks online reported the titles to Audible, which eventually removed them. Although it wasn’t a large number, discovering that synthetic voices were good enough for some publishers to put them to work prompted Miller to wonder about the future of his art and income. “It’s a little terrifying because it’s my livelihood and that of many people I respect,” he says.

Richter says he chose an artificial voice because the concept and its “uncanny valley” sound suited his book, which has a piece of intelligence software as one of its main characters, and that he was unaware of Audible’s policies. “My intention was never to upset or offend anyone,” he says. Speechki says it recommends publishers identify that narrations are synthetic and that it informs them of Audible’s policies. Will Farrell-Green, a senior director at Audible, said in an emailed statement that the company uses automated and manual processes to enforce its rules but that “due to the volume of content on our service, titles that are not compliant do slip through from time to time.” Audible’s “humans only” policy dates back to at least 2014, when synthetic voices were much less convincing, and the company has said the rule helps provide listeners the performances they expect.

Synthetic voices have become less grating in recent years, in part due to artificial intelligence research by companies such as Google and Amazon, which compete to offer virtual assistants and cloud services with smoother artificial tones. Those advances have also been used to make reality-spoofing “deepfakes.” Speechki is one of several startups developing speech synthesis for audiobooks. It analyzes text with in-house software to mark up how to inflect different words, voices it with technology adapted from cloud providers including Amazon, Microsoft, and Google, and employs proof listeners who check for mistakes. Google is testing its own “auto-narration” service that publishers can use to generate English audiobooks for free, using more than 20 different synthetic voices. Audiobooks published through the program include an academic history of theater and a novelist’s exploration of cultural attitudes to sex. Google spokesperson Dan Jackson says its auto-narrated books supplement rather than replace professionally narrated books. “Our goal with auto-narration is to make it possible to create a low-cost audiobook for any ebook title and increase content accessibility for those that are unable to read via ebook,” he says.

Listen to a sample of WIRED’s feature about AI researcher Timnit Gebru’s ejection from Google, narrated by technology from Speechki.

Some publishers see synthetic voices as a way to tap the growing demand for audiobooks, a segment healthier than other parts of the book business. Total US book publisher revenue declined slightly between 2015 and 2020 and ebook revenue shrank, but audiobook revenue surged by 157 percent, according to the Association of American Publishers. Consumers have steadily grown more comfortable with the format, helped along by technical improvements to mobile apps, smart speakers, and wireless headphones. But due to the cost of a narrator and audio production, most titles never become audiobooks, particularly at smaller publishers, says Brian Carroll, rights manager at Indiana University Press.

IU Press licenses a fraction of its catalog for traditional audio production but is now a customer of Speechki. It plans to release its first synthetically narrated audiobooks later this year. “All the other books at last have a chance of becoming audiobooks now,” Carroll says.

Speechki’s technology has been impressive in tests so far, Carroll says, navigating the academic language of titles on paleontology and philosophy. One book chosen for production is Around the World in 80 Toasts, in which the software has to handle text sprinkled with words from other languages. “We thought if it can do this it will probably be able to do anything, and it did a pretty good job,” Carroll says.

Taylan Kamis, CEO of London-based DeepZen, says synthetic narration can compensate for a global imbalance in audiobooks, the majority of which are in English. “A large backlist of titles never gets converted into audio, or are converted only into English,” Kamis says. DeepZen uses in-house speech synthesis technology to clone the voices of professional narrators, producing synthetic voices that clients can then put to work. The company’s software looks for cues in a book’s text to apply seven different emotional tones, including fear and anger.

Both startups say they are not a threat to professional narrators because their technology will be used to make audiobooks that would not otherwise have been recorded. “Human and synthetic narration can thrive side by side—there’s plenty of work,” says Bill Wolfsthal, a book industry veteran helping Speechki with business development. But the economics can look alarming to professional narrators, who might receive around $250 per finished hour of audio they send a publisher. DeepZen charges publishers around $120 for each finished hour, or less for clients willing to skimp on quality control.

Kamis of DeepZen claims his technology can increase earnings for narrators who allow him to clone their voice because they will receive royalties. Edward Herrmann, who starred in the Gilmore Girls television series and narrated books from authors including Stephen King and Walter Isaacson, died in 2014 but still narrates new books today through DeepZen, which struck a deal with the actor’s estate and cloned his voice using old recordings. New audiobooks read by Herrmann but disclosing that he is a “synthesized voice,” like a history of the Battle of Stalingrad, can be bought on Apple and Google’s digital stores.

You won’t find Herrmann’s digital reincarnation on Amazon’s Audible, which dominates audiobooks much as the company’s store does sales of print and digital books. Audible’s longstanding rule requiring human narrators poses a major limitation to the ambitions of synthetic voice providers. Wolfsthal, who works with Speechki, predicts that once synthetic voices become more common on competing stores Audible will feel pressured to allow them.

Audible has not detailed the automatic and manual processes it uses to keep out synthetic voices. Even after the cleanup prompted by Miller and others, WIRED found nonfiction audiobooks on the service from UK company IT Governance Publishing that were made with a DeepZen voice and did not disclose they were synthesized. “Alice White” was listed as the narrator of titles on topics including computer security and EU data protection law, and the voice matched a sample on DeepZen’s homepage.

Those books are now gone. “Audible does not produce or retail titles narrated by artificial intelligence; therefore, these titles have been removed,” Farrell-Green said. Andreas Chrysostomou, publishing relations manager at IT Governance Publishing, told WIRED that Google and Apple listings for the synthetic audiobooks lacked disclosures because of a mix-up with a distributor. He said the company had tried DeepZen’s technology to get audiobooks to market more quickly but that after mixed reviews, it does not plan to produce more titles this way. Last year, one buyer wrote in a one-star review that it was “virtually impossible to listen to this robot murder the English language.” Chrysostomou said the company hopes to eventually use both human and synthetic narration, depending on the books and advancements in AI.

If audiobooks with artificial narrators start to receive more favorable reviews, the small number available today could grow fast—software can generate audio more quickly than humans can.

Eline Blackman, who runs a blog on audiobooks and who, with Miller and others, hunted down and reported synthetic audiobooks on Audible, has mixed feelings about seeing them become common. She doesn’t think the technology will get good enough to threaten existing narrators but worries that cheaper, less evocative AI production could prevent some books or authors from getting the recognition they deserve from listeners and critics. She can also see benefits if publishers use the technology with care. “If it means that more books will be available in audio format and that makes them more accessible, I’m all for that,” Blackman says.