Will AI Replace The Narrator?

In just 10 years, AI voice technology has made a jump from comedic to largely convincing. That’s like an infant graduating senior school in 10 years. You’d call them a genius.

With the advent of AI and artificial voices becoming more and more convincing, it can feel that being ‘good’ isn’t quite good enough any more. Leverage AI or be left behind, the narrative goes. The progress of the technology just in the field of AI voices is foreboding for professional human voices.

By 2016, AI voices had developed realistic timbre instead of sounding like the stereotypical “computer robot voice”. Think of Google Translate about 10 years ago - the time when AlphaGo beat world champion Go master Lee Sedol. But still, in 2016 the delivery was stilted, and easily distinguished within about one second. It wasn’t comical, but it was understandable and highly functional.

By about 2021, before the advent of ChatGPT but well alongside LLM developments, artificial voices had developed realistic flow. Their elevation to the point of convincing a casual listener happened before similar developments in image generation or generative language transformer models. By this time you could imagine such a voice belonging to a human being. But a focused ear could pick them out over a short sentence. It became more difficult if the AI voice was set alongside music or sfx.

Funnily, from this point onwards, the punchy processing applied to human voices to normalise volumes and make them pop on little phone speakers was baked into the AI models, and they always exhibited the same processed quality. What had once been sought after became a catalyst for derision among some - though this was before any popular fearful sentiment about a rise of the machines.

Note the pattern here. What could be distinguished in a single word now requires a sentence.

How soon will an AI voice be better than mine? Or is it already, and I just don’t know?

This image is real. Read to the end for its story.

There was one thing in 2021 that seemingly kept the jobs of audiobook narrators safe: the AI models had very limited ‘characters’. They were generally a narrow band of nondescript, safe, generic voiceover styles (much like a Toyota Corolla), designed to be functional for the masses, focusing on intelligibility (a Corolla’s reliability) rather than strong characterisation (a howling Ferrari V12). A skilled audiobook narrator could do both, or land on any point on the spectrum in between, on the spot.

However, in 2025, AI voices have that vocal range. Actually, it’s more - you can have a facsimile of pretty much anyone. Want Morgan Freeman to narrate your meditation? Or Diana to direct you through the GPS? So many people train an AI model on their own voices, as well as the voices of others (e.g. celebrities or public vocalists), with or without their permission. [I will not go into the rabbit cavern of copyright and artistic infringement here.]

Self-centredly, I consider that having a range of character voices was always something I practised and enjoyed. But now, someone with a vision could create a multicast audiobook, delivering something with more convincingly different ‘voices’ than I ever could. Unless I made the recordings decades apart to really get that aged texture? That wouldn’t help me record a child character, though.

There was another development that became widely applied by 2025. (It was probably developed much earlier.) The voices were no longer confined to a relatively narrow vocal dynamic range. The AI voices could speak softly and shout, with a convincing display of emotion - certainly in a soundscape with background sfx and music. Well, if that was so, how could I tell it was an AI voice? Apart from the fact they were rife in social media adverts where all the imagery and video were clearly AI, there was still a disconnect between the voices and what was actually happening in the video. Context, and contextual behaviours and emotions, were lacking. These are the fine details of human behaviour that we have learned to distinguish over decades of real life, interacting with real people.

What about young viewers, who make up a huge proportion of the user base on social media platforms? While a part of me worries that distinguishing fact from fiction will be more difficult for this age bracket, I remind myself that they will continue to be human beings with real experiences in the real world, and they probably have experimented or will experiment with these technologies more than those who treat them with suspicion. I hope this means they will become, or remain, even better at identifying ‘artificial reality’.

In terms of the pattern identified before, a focused ear now requires many sentences, perhaps even paragraphs, to distinguish an AI voice. And the chance of false positives is ever higher.

I was recently sent a YouTube video which was filled with music and sfx, and I couldn’t work out why the narrator made me feel agitated, and why I simply didn’t trust them, even though the video did contain correct information. It was an AI voice of course, and not only that, the whole video was compiled using AI software - not generative AI, but AI-powered editing. I was once again surprised at how much AI had advanced in the field of videomaking during the year I had been away from it.

Consider that in just 10 years, the technology has made a jump from comedic to largely convincing. That’s like an infant graduating senior school in 10 years. You’d call them a genius.

So where does it fall short now? How was I certain the voice in that video was artificial? Apart from the aforementioned power of context, there was also a lack of changing dynamics as it flowed through the script. Each sentence was given the same weight - of course, they had all likely been generated with a prompt along the lines of “Write a video script optimised for YouTube retention”. Pauses were equal throughout, giving the impression the voice didn’t care for the relative significance of the points it was making.


And this again is the key - context, emotion and significance. At the moment, many can notice an AI voiceover with focus but wouldn’t pick it up casually unless they’ve used the tool or identified it before - and then follows an emotional reaction as much as an intellectual one. Today, the best AI voices are like someone reading a script with high accuracy and with punchy sound processing applied afterwards, but who actually knows nothing about the topic. Some names are mispronounced, the dynamics stay inherently predictable throughout, and it feels like being informed about a story rather than experiencing storytelling from someone who might really have been there.

So, I don’t worry about whether AI will replace the narrator. Because if it does, it will mean that it has grown to be genuinely better, to have won over the currently significant sceptical part of the population. And if it’s better than me… well done. I can always find another way to be useful, and you can be sure I’ll keep telling stories anyway.

The reality of this cover image. The human story here cannot be known by an AI.

Putting the camera on self timer and pointing it right. Going into a silly pose to get my face next to the monitor to line up with the background I arranged beforehand.

It was harder to take this photo than generate it. But this was real, and it meant something to me, and is more satisfying than generating some random image that anyone could prompt for.
