AI-Powered Text-to-Speech Is a Game Changer for the Entertainment Industry

Colin Tedards
|
Feb 16, 2024
|
Bleeding Edge
|
6 min read

Colin’s Note: AI is going to transform the entertainment industry…

It’s something I’ve been writing about in these pages for a while now. I believe the future of Hollywood will see us using AI to insert ourselves into our own films, with our own soundtracks, and our own stories.

And Amazon’s Base TTS – that’s text-to-speech – AI is an impressive step toward that reality. With a relatively small dataset – just 10,000 hours of speech data – it started to understand nuances in speech, including emphasizing certain parts of sentences and even whispering.

It’s very impressive… And a little scary when you consider what that kind of technology can do when used for the wrong reasons.

We get into all the details – and some of the biggest risks – in today’s video. Just click below to watch…


Hello everyone, and welcome back to The Bleeding Edge.

Today, we’re going to be talking about two separate research papers that dropped this week in the world of artificial intelligence (“AI”).

One came from OpenAI, the company behind ChatGPT. They released research with stunning short-form video clips – up to about a minute long – that were all made using AI. The model behind those clips is called Sora.

But as I was watching those video clips, what was missing was sound and dialogue. So while they had the visual effects that looked amazing, there was no sound. There was no music.

My vision for Hollywood is that we are going to be generating a lot of entertainment in the future using AI. And I think one of the ways we’ll do that is by dropping ourselves, our friends, and our family into the entertainment.

So if we want to star in the movie alongside our favorite actor, in our favorite setting, with our favorite soundtrack, we’re going to be able to put ourselves into the scenes.

We’re going to make ourselves a part of Hollywood.

But this comes with some major risks. At the end of this video, I’ll talk about one way in particular… And how you can protect yourself from some of the things that are going to be coming with AI.

Now, the second research paper is what we will mostly be talking about today. We’re going to be exploring Amazon’s Base TTS. TTS stands for text-to-speech… And this Base TTS could be a groundbreaking new model as it relates to generating speech from text.

Here’s how these large language models work… Actually, first, let’s look at something like ChatGPT. With a large language model, you take text from the internet – Wikipedia, other websites, blogs – and you feed it into a computer system. Then the computer system makes sense of all this.

A speech model works very similarly. You’re just pulling speech instead of text. So you’re probably scraping different podcasts that are publicly available and free. They might even take a video like this and train a computer model on the way I’m speaking or other people are speaking.

Amazon took 100,000 hours, 10,000 hours, and 1,000 hours of different people talking, and fed each set into a computer system. The first thing they found was particularly interesting. Surprisingly, it was the 10,000-hour model, not the 100,000-hour model, that performed the best.

The 10,000-hour model was able to understand punctuation, some non-English words, and certain emotions, and to emphasize certain things in sentences.

Remember, Amazon is simply taking people’s voices and feeding them into the computer. They’re not explicitly telling the computer about punctuation, how commas work, or how exclamation points work.

Amazon simply feeds the computer system these voices, and then types in sentences with punctuation… And the computer system recognizes that.

Now, more impressive than that is that it started to understand nuances in speech, including emphasizing certain parts of sentences and even whispering.

So listen to this sample that was included in the research paper. Again, the researchers over at Amazon didn’t teach the computer system about whispering. It just happened to pick it up in the nuances of the language that it studied…

[Audio at 3:27] A profound sense of realization washed over Maddie as he whispered, “You’ve been there for me all along, haven’t you? I never truly appreciated you until now.”

That is super impressive. Using hours of speech data, the computer system can figure out what whispering is largely on its own, much like a child picks up different things in school.

From tackling compound nouns to conveying emotional expressions, Amazon’s Base TTS is very impressive. And combined with something like OpenAI’s Sora, the two could create some very entertaining media.

It’s also believed this system outperformed other large text-to-speech systems, including Meta’s – the company behind Facebook and Instagram. They have an open-source model called MetaVoice that is currently the front-runner in the field of text-to-speech.

And as I said, I think this advancement in AI video and AI text-to-speech and dialogue is going to revolutionize the entertainment business, but it doesn’t come without its challenges and real concerns.

Just a few weeks ago, there was a fake Joe Biden voice message being sent out to people voting in a particular state. It was convincing enough. People thought it was Joe Biden on the call.

Now, this is a real concern, not just with politics, but I think this hits a little closer to home. Text-to-speech could be used to mimic the voice of your children, your spouse, your coworkers, or your boss. Tech experts need to figure out a way to protect people against these deepfakes.

Maybe there’s some kind of watermarking or technology that will identify these text-to-speech products and be able to mark them as fake as we listen to them. I think a simpler way to protect yourself is to have a code word, just like you have a password on your computer, just like you probably have a password on your phone.

Talk to your children, talk to your spouse in particular, and come up with a code word.

So if you get a call late at night and it’s someone saying that it’s your child or it’s your wife and they’re in trouble and they need money or they need help, have a code word you can ask them. Chances are the text-to-speech model will be able to respond to you, but it won’t know your specific code word.
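For readers who like to see an idea spelled out, here is a minimal Python sketch of that code-word check, treated the same way a computer treats a password. Everything in it is an assumption made for illustration: the function names (`normalize`, `enroll`, `caller_is_verified`), the sample code word, and the choice of a salted PBKDF2 hash are hypothetical, not something from Amazon’s research or any real product.

```python
import hashlib
import hmac
import os

ITERATIONS = 100_000  # illustrative work factor, not a recommendation


def normalize(phrase: str) -> bytes:
    """Lowercase and collapse whitespace so small slips in phrasing still match."""
    return " ".join(phrase.lower().split()).encode("utf-8")


def enroll(code_word: str) -> tuple[bytes, bytes]:
    """Return (salt, digest) so the code word itself is never stored."""
    salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", normalize(code_word), salt, ITERATIONS)
    return salt, digest


def caller_is_verified(answer: str, salt: bytes, digest: bytes) -> bool:
    """Hash the caller's answer the same way and compare in constant time."""
    candidate = hashlib.pbkdf2_hmac("sha256", normalize(answer), salt, ITERATIONS)
    return hmac.compare_digest(candidate, digest)


# Hypothetical usage: the family agrees on a phrase ahead of time.
salt, digest = enroll("Purple Giraffe")
print(caller_is_verified("purple giraffe", salt, digest))
print(caller_is_verified("I'm in trouble, send money", salt, digest))
```

The point is the same as in everyday life: a cloned voice can sound right and keep a conversation going, but it cannot produce a secret it was never given.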

I actually know of family members who have already been targeted in this manner. One member of my family had just sent their son off to college when they got a call late at night saying that their son was in jail. Well, it turned out to be a scam.

These text-to-speech models are going to get more sophisticated over time. We need to do everything we can to protect ourselves because there’s going to be no slowing down of this technology.

Now, it’s exciting, it’s interesting, and I think it’s the future of entertainment. But again, it comes with some concerns. And I wanted to just bring that to you so you know both sides of the story, folks. That was The Bleeding Edge for today. I hope you have a wonderful weekend, and I’ll see you again soon.

