Text to Speech software

Mar 14, 2012

Can anyone recommend a good "Text to Speech" software that you have used with Articulate?  We are looking for a software that sounds as close to a real voice as possible.

Thanks, Joy

37 Replies
Steve Flowers

I use Ivona for temporary voices during review periods. It's relatively inexpensive for personal use. For commercial use it's quite a bit more expensive ($700 last I checked). I'll be moving to the built-in Mac voices for my temp work, triggering voice to file from Terminal. These are slightly lower quality than Ivona or Loquendo. Loquendo is quite good.

Good voices are not cheap. Most of the best ones offer a "pay as you go" program. If you're looking for a good quality voice on the cheap for commercial purposes, I haven't found one yet. I haven't found an application of narration where I'd rather use a synthetic voice than a real one other than for scratch and review.

James Henderson

I'll throw in iSpeech's text to speech. In my opinion they have the most human-sounding voices available, even a tad better than Ivona (although Ivona's Polish voices are amazing). iSpeech offers quite a bit for free or for a very low fee. Might be just what you are looking for. They also offer some limited speech recognition should you ever need that. Their text to speech is top notch though. I used the service to turn my college reading into mp3s to listen to while cooking and at similar times when I couldn't read.

Michael Case

Hello Joy,

I have found that no matter what TTS software or voice I use, there are far too many oddities in pronunciation to use one efficiently and cost-effectively for narration. Besides, finding a professional narrator is easy, and depending on who you choose, it can be inexpensive as well.

Check out The Narrator Files. They price narration by the page, and they have exemplary voice talent.



Steve Flowers

I am. I've switched over entirely from other TTS programs to Mac voices. I use a pretty neat trick to batch each file using Terminal. It takes a little while to set up my transcript input files; I haven't automated that part yet.

Basically, when the script is approved, I generate a .txt file for each bit of audio (on the plus side, I have found a way to use this as a transcript feeder). Then I set up a batch file for Terminal to automatically generate the outputs. The batch template lines look something like this:

say -v lee -f /Users/sflowers/Desktop/Dropbox/projectname/production/scratch_audio_scenarios/s1_c1.txt -o /Users/sflowers/Desktop/Dropbox/projectname/production/scratch_audio_scenarios/s1_c1.aiff

Copying and pasting this line into Terminal will grab the text file and output an audio file in the voice I've selected. Copying and pasting multiple lines will do it multiple times. It only fails if there's a funny character or the text file is missing. Failures are easy to pick up from the file size of the output .aiff. All in all it's pretty fast, and really easy to update: just update the .txt file and copy / paste the batch line into Terminal.
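
For anyone who would rather not paste one line per clip, the same idea can probably be wrapped in a small loop. This is only a rough sketch, assuming the macOS `say` command with the same `-v`, `-f` and `-o` options as the line above. The function name `generate_scratch_audio` and the `DRY_RUN` switch are made up for illustration and are not part of `say` or of the workflow described here.

```shell
#!/bin/sh
# Sketch: make one scratch .aiff for every .txt transcript in a folder.
# Assumes macOS `say`. generate_scratch_audio and DRY_RUN are hypothetical names.

generate_scratch_audio() {
    dir=$1      # folder holding the .txt transcript files
    voice=$2    # voice name, for example "lee" as in the line above

    for txt in "$dir"/*.txt; do
        # When nothing matches, the pattern is left as-is; skip it.
        [ -e "$txt" ] || continue

        # Same base name as the transcript, with an .aiff extension.
        out="${txt%.txt}.aiff"

        if [ -n "${DRY_RUN:-}" ]; then
            # Dry run: print the command instead of running it.
            echo "say -v $voice -f $txt -o $out"
        else
            say -v "$voice" -f "$txt" -o "$out"
            # A missing or empty output file usually means the clip failed.
            [ -s "$out" ] || echo "warning: no audio written for $txt" >&2
        fi
    done
}
```

After loading the function into a shell session, something like `generate_scratch_audio /path/to/scratch_audio_scenarios lee` would process the whole folder. Setting `DRY_RUN=1` first prints the `say` lines without running them, which is a cheap way to check the paths before generating audio.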

Ron Starc

The current best text to speech software is Text Speaker. It has customizable pronunciation, reads anything on your screen, and even has talking reminders. It is great for learning, as it highlights the words as they are being read. The bundled voices are well priced and sound very human. Voices are available in English, French, Italian, Spanish, German, and more. It easily converts blogs, email, e-books, and more to MP3, or reads them aloud instantly.

Mike Harrison

It is not at all because I'm a voice-over/narrator that I am opposed to the use of text-to-speech technology. It is solely out of concern over results. The goal of instruction of any kind is to either simply share information or to change behavior/performance. And, just like TV and radio commercials, success hinges squarely on whether the message is able to not only grab but hold the attention of the listener/viewer/learner so that they will ABSORB what they saw and/or heard, and that they will also RETAIN that information so that they can later APPLY it. In the case of commercials, advertisers hope to motivate people to buy their product or service. In instruction, it is hoped that learners will be able to use what they learned to better their performance.

Thus, eLearning success cannot be measured by the quantity of material that was produced in X number of hours or that X number of dollars were saved. Success is measured in what people remember and are able to apply.

I wonder if any company that uses eLearning has done an analysis to compare the money spent on producing the learning content against whether there was a marked improvement in employee performance. It would seem to me that if a company's goal was to maximize employee performance, the LAST place they'd consider cutting costs would be the tools and methods used to achieve that goal. When the goal is to get to the finish line faster, we don't use cheaper fuel.

It seems that proponents of text-to-speech don't understand that inflection, NATURAL inflection, is the key, and that synthesized speech, no matter how human-like the "voice" may sound, will never be able to place correct inflection where it is needed and leave it out where it is not.

Here's a practice exercise for everyone: spend some time listening to people engaged in conversation. People you know, even people you don't know. When we speak among ourselves, we add inflection without thinking about it, placing importance on some words, less on others, a smile here, some compassion there, a sudden burst of excited whisper, the occasional dramatic pause to build anticipation or allow a point to sink in before moving on, etc. We do these things automatically and it turns our speech into music. And it makes those listening more prone to continue listening. Until such time as the algorithms behind synthesized speech are able to "understand" and contextualize the words (which, to a computer, are just more ones and zeroes), synthesized speech will never be able to add the NATURAL engagement factor called inflection where it's supposed to be and, thus, will never reach the effectiveness level of human speech.

If there is no connection, no grasp of the material so as to place the proper inflection where it belongs, what is spoken is cold and completely non-engaging. Just like some of the boring teachers we all had in school. If there is no natural engagement, there is no hope of holding the attention of a learner. And without their focused attention, the effort and money spent on everything that went into the eLearning are wasted. I'll say it again:

eLearning success cannot be measured by the quantity of material that was produced in X number of hours or that X number of dollars were saved. eLearning success is measured ONLY in what people remember and are able to apply in order to make a difference.