34 Replies
Danny Stefanic

Mike is right: it's a balancing act, and if you want the best quality, use professional narration for sure. If budget or rapid production tilts the balance in the other direction, then text to speech can serve as a reasonable alternative, for example where you need accessibility compliance in order to deploy.

I also hear from our ResponsiveVoice users that TTS (Text to Speech) is a very useful timesaver in the design of elearning prior to sending off the script for final narration.

We built an add-on for SL2 if you are interested in trying it; I'd love to hear community feedback on it.

Joel Harband

Our Speech-Over Professional text to speech software (www.speechover.com) quickly adds professional e-learning narration to PowerPoint-based e-learning and training, saving course development time and costs. Speech-Over works with Articulate Presenter as well as with iSpring and Camtasia.

Ordinary text-to-speech applications can only read the text on the PowerPoint slide; Speech-Over does a lot more: the user enters narration texts that describe and explain individual text bullets and graphic objects on the slide, as a real presenter would. The texts are stored off-screen. When an object is animated in the slide show, Flash, or video, Speech-Over speaks its narration text in perfect sync. The result has the learning impact of a live presentation.

A text-based audio editor can adjust the diction, inflection and phrasing of the voices - eliminating the pronunciation problems that Mike refers to in his article above.

Speech-Over Professional includes two voices from Acapela-Group with a commercial license.

I invite you to try it.

Mike Harrison

Joel, I can certainly appreciate your intent and the time and effort you've put into Speech-Over. Absolutely. I've just listened to the samples in the website video and, my apologies, but while Speech-Over may well be a solution for those whose only concern is cost, this is still far from impactful or compelling speech.

The biggest misunderstanding in the hiring of voice talent is that it is not at all the SOUND of the voice that is most important. Many people have "nice" voices. The most important facet of speech is the articulation: the way words flow naturally out of the speaker's mouth, with the correct amount of emphasis on only the words and syllables where it belongs... and not where it does not belong. In many cases, emphasis placed on the wrong words can change the very meaning of a statement. And, at the very least, naturally articulated speech instills confidence in the listener. They feel that the speaker knows intimately what he or she is talking about. Anything short of that and the listener is left with reservations over the authenticity of what is being presented.

So, while the "speakers" in the samples I heard had pleasant-sounding voices, not only was the emphasis unnatural in many areas, but the rhythm of their words still sounded mechanical. There are places where syllables whiz by almost unintelligibly. Learners should not have to replay portions in order to understand them.

eLearning shares one very important basic goal with radio and TV commercials: to motivate people to action. Commercials are successful when they are responsible for sales going up. eLearning is successful when those taking the courses not only remember but are able to apply what they've learned. And those whose task it is to instruct others (especially at the corporate level, where employee performance is everything) had better be engaging enough to make people want to listen, and should sound absolutely convincing.

The #1 quality sought in the casting of voices for practically all genres of voice-over is to find the person who is best able to connect with the subject matter so as to convince the listener that what they're hearing is the real deal. We want to have a voice that is pleasant to listen to, yes, but first and foremost we must trust implicitly that what we are hearing is genuine.

Again, my apologies, but my opinion is that only human speech can be regarded as genuine because there is a mixture of emotion and point of view behind it. TTS – software – is not innately capable of emotion or point of view. And even as diction, inflection and phrasing may be adjustable, will someone be willing to spend the time and money to listen to and evaluate every word and then adjust these qualities as necessary in lessons of considerable length? By the time these adjustments are completed, total expenditure would probably equal the fee of a talented professional who would have a finished and more compelling product in less time.

I would suggest to any company preparing to enter into eLearning that they conduct a test with perhaps six minutes of typical training content. Enter the first three minutes of the text into any TTS application and give the remainder to a professional narrator. When each has completed the audio, randomly select a small group of employees to listen: first to the TTS portion of the lesson and then, without any break or discussion, to the professionally spoken portion of the lesson. Then ask the employees for their impressions, specifically with the intent of discovering which they would choose to listen to (especially for extended periods) and which they would trust more.

Because the success of any eLearning content hinges solely on what employees are able to remember and later apply, I have a very difficult time understanding why any company would even consider cutting costs in the very area that has the power to make employees and, ultimately, their company the best they can be.

Suggested reading: http://www.scilearn.com/blog/prosody-matters-reading-aloud-with-expression

Joel Harband

Mike,

Thanks for your reply to my comment about our product Speech-Over (www.speechover.com) for adding text to speech (TTS) narration to e-learning and training presentations to save time and costs.

I am happy you have set down your objections to text to speech in e-learning so clearly so we can address them one by one.

1. Articulation. A couple of years ago I would have agreed with you, but Speech-Over engineers have made a breakthrough in improving the articulation of TTS voices: by entering simple punctuation in the text, the user can make the voice articulate like the best public speakers. To hear what I am talking about, see the video https://www.youtube.com/watch?v=4fuD15hpUbg (which is the sample on our website).

2. Motivate people to action. I submit that corporate students come to their e-learning already motivated (the boss wants it!). The most important thing is to present the material clearly and consistently, including correct diction and articulation so that they can easily understand and retain the material. Our customers report that the retention of the material is the same with TTS as with human voice.

3. Authentic. Our customers have found that once the student begins to learn, they accept and trust the TTS voice just as they would a lecturer with a regional accent. One customer actually notified the learners at the beginning of the course that the voice would be TTS so they knew what to expect.

4. Time required to adjust the articulation. Here you make a good point. To improve the articulation as in #1 above, the Speech-Over user has to enter special pause punctuation (a vertical bar, |) in the text. The user also has to know the rules of effective public speaking to know where to put the pauses (these rules are presented in a Speech-Over tutorial). I expect that e-learning text editors can do this quickly, but the time required does need to be taken into account.
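
For anyone who wants to experiment with the vertical-bar idea outside Speech-Over, here is a minimal sketch. It does not show how Speech-Over itself stores or processes pauses, which is not described in this thread. It assumes a TTS engine that accepts standard SSML, and it swaps in the SSML `<break>` element as the pause mechanism. The function name `bars_to_ssml` and the 300 ms default pause are hypothetical choices made only for illustration.

```python
from xml.sax.saxutils import escape


def bars_to_ssml(text: str, pause_ms: int = 300) -> str:
    """Turn 'phrase | phrase' narration text into SSML with explicit pauses.

    Assumption: the target TTS engine accepts SSML. The '|' convention is
    borrowed from the discussion above; the mapping to <break> is only a guess
    at an equivalent, not Speech-Over's actual implementation.
    """
    # Split on the pause marker and drop empty fragments (e.g. from '||').
    phrases = [p.strip() for p in text.split("|")]
    phrases = [escape(p) for p in phrases if p]

    # Insert one fixed-length break between consecutive phrases.
    pause = f'<break time="{pause_ms}ms"/>'
    return "<speak>" + f" {pause} ".join(phrases) + "</speak>"


if __name__ == "__main__":
    print(bars_to_ssml("Safety first | always"))
```

A fixed pause length is the simplest choice; an editor following public-speaking rules would probably want longer pauses at some markers than at others, which is where the editing time mentioned above comes in.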

I agree with your suggestion that companies that want to use TTS voices in developing e-learning and training should run tests to evaluate the savings and to compare retention of material with courses developed with human voices. Free Speech-Over trials are available for this purpose.

Mike Harrison

As for articulation, the examples of TTS I've heard over the past year or so and more recently have been capable of what appears to be only three levels of tone: the standard mid-tone, a slightly raised tone for emphasis, and a slightly lowered tone for endings of sentences. There are several occurrences in the Speech-Over sample where the incorrect syllable or word was emphasized. Despite these three levels of tone and the insertion of pauses, TTS is not even close to approaching compelling speech. (In my personal opinion, although he's a very smart guy, neither is that given by Bill Gates. The pauses in his talk were due only to his having to occasionally refer to the paper he was reading his speech from; they were not purposely placed for impact because what he was talking about at that time was not so enlightening as to warrant dramatic pauses.)

With regard to the proper placement of emphasis, I submit the following example. Read aloud each sentence, emphasizing the word in boldface, to hear how the meaning of the sentence changes from the previous.

"**I** never said she ate your sandwich." (Somebody else said it)
"I **never** said she ate your sandwich." (I definitely did not say anything)
"I never **said** she ate your sandwich." (I implied it)
"I never said **she** ate your sandwich." (I said someone else did)
"I never said she **ate** your sandwich." (I said she did something else with the sandwich)
"I never said she ate **your** sandwich." (I said she ate someone else’s sandwich)
"I never said she ate your **sandwich**." (I said she ate something else)

"The boss wants it" is motivation only enough to make someone do something. But people who are motivated by fear of consequences are still reluctant and, thus, will not totally commit. Yes, they might force themselves to sit through the lessons, but we can hardly call this being interested and engaged. They might even receive a passing grade at the conclusion of each lesson. But the real test is whether – six months down the road – they can remember and apply what they heard to satisfactory result.

Some comments of colleagues of mine who have heard TTS:

"If i had to listen to more than 60 seconds of this as employee training, i'd quit."

"A man spent years training his dog to walk on its hind legs. When he showed the trick to a friend, the observation was, 'Yes, very impressive... but tell me, why? It will only ever be a curiosity as a dog, and a pale and pointless imitation of a man.'"

"If people don’t benefit from the courses, those producers who are trying to skimp by using TTS will go out of business. The market will decide whether or not TTS is a viable option. I frankly don’t see a future for TTS for anything worth listening to."

Yes, the market will ultimately decide. But TTS still has a LONG way to go before it can be considered engaging enough to generate genuine enthusiasm in the listener. And genuine enthusiasm makes for great learning.

Steve Flowers

Agree with Mike, here.

I find TTS tremendously helpful in generating scratch audio for stakeholder review before we send it off to have the pro voice it. Saves a ton of time. I really only want to have my narrator read it once. 

There are some places where I might use TTS intentionally. Like if I wanted to personify a machine. Used strategically as a production element, there are some situations where the insertion of TTS as a spice and not as a substitute could be really successful.

Unfortunately, most TTS output is pretty awful. Some folks have a high tolerance for awful or their focus is elsewhere so they have extra grace for artificial production elements or "cheapness." To each their own:) I take Joel's point above well. Let folks know up front that it's bot-read and own it if you're going to use TTS as a substitute. At least then they can vote with their feet or use the mute option.

TTS is getting better. But most of it is still stuck in the last decade. There aren't that many voice vendors. Far fewer than there are tools that use the same licensed voices in their TTS generators. Some are better than others. Loquendo and Ivona were pushing things in the right direction but they both had a long way to go...

I'm in the "If i had to listen to more than 60 seconds of this as employee training, i'd ___________." camp... I'd actually just mute the darn thing, as I have with lots of stuff that isn't well selected or well produced. There are pro narrators (including great ones that get commissioned to execute bad scripts) that scratch my brain too. Or stuff that just simply doesn't need to be narrated at all. These will get the same treatment as high doses of TTS on my machine. I have a volume control for a reason;)

YMMV

Joel Harband

Mike (and Steve)

Thanks for submitting a specific test and challenge for articulation and emphasis in text to speech voices.

I agree that "out of the box" text to speech voices would not pass your test. However, when used with Speech-Over's rhetorical pauses the TTS voices can emphasize the right words and capture the meaning of each sentence in your test.

These results are demonstrated in a video I made in which a number of TTS voices from different vendors take your test using Speech-Over's rhetorical pauses with good results.

The video is narrated by TTS voices Ryan and Heather from Acapela-Group where rhetorical pauses have been used in the narration as well. The pauses that were inserted are shown by vertical bars "|" on the screen text. This is for instructional purposes only; in fact, pauses are stored internally and do not need to appear on the screen.

Click on this link to see the video:

Mike's Challenge to Speech-Over Text to Speech Voices

Again, I recommend using Speech-Over text to speech voices to save time and costs for applications where voice clarity and consistency together with correct diction, emphasis and phrasing are the main requirements - as in e-learning and training, where people are pre-motivated to learn the material.

For any kind of application, Speech-Over is perfect for generating "scratch audio" for review prior to final professional voice recording - a method that Steve recommends in his comment above.

Using Steve's terminology, I submit that Speech-Over has brought TTS into this decade - and the next.

Joel Harband

Steve

Your observation:

"Used strategically as a production element, there are some situations where the insertion of TTS as a spice and not as a substitute could be really successful."

intrigued me. We are always looking for new applications for TTS. Could you expand a little on this idea, maybe with some examples?

Might this also be applicable to live presentations and not just to e-learning?

Thanks,

Joel

Steve Flowers

All of the TTS I've seen to date carries an unmistakeable synthetic quality. If I wanted to produce a synthetic character in a situation or scenario or to have the application "think out loud", I might find it in my heart to use TTS:) Since TTS simply can't pretend to be non-synthetic, a clear intentional use of synthetic matched with a synthetic design choice could be successful.

Ivana Vayleux

I would like to buy IVONA software with 2 British accent voices, but I cannot find information on the official website about how to buy it or how much it costs. I can see there are development and commercial licences.

I would use the voices only for e-learning courses for internal purposes (employee onboarding and learning inside the company). What kind of licence would that be?

Does anyone know what the price is and how I can get it easily?

Thanks a lot!