Forum Discussion

JohnCooper-be3c
Community Member
19 days ago

Advances in TTS Technology and AI voice generation

There have been a number of posts here about the use of SSML (Speech Synthesis Markup Language) with AI-generated voices and Text-to-Speech narration. Several members and staff have pointed out that ElevenLabs voices in Storyline do not fully support SSML. In particular, <speak> tags may cause the contained text to be skipped, and other tags are not applied consistently. In reality, the reduced usefulness of SSML is not just a Storyline 360 or even an ElevenLabs issue: it reflects a broader shift in how modern AI TTS engines work.

To understand why SSML is only partially supported with certain voices, it helps to understand how TTS technology has evolved.

Traditional TTS engines (Amazon Polly, Azure Neural TTS, Google TTS) use a structured, rule-based pipeline:

Text → SSML parser → phoneme engine → waveform

In this process, the text is first converted into phonemes, which are the smallest individual units of sound in a language (for example, the sounds /k/, /æ/, and /t/ in “cat”), and these phonemes are then used to generate the spoken audio waveform. SSML allows developers to explicitly control this conversion by specifying pronunciation, pauses, emphasis, and other speech characteristics.
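As a concrete illustration, a traditional engine accepts markup along these lines (a minimal sketch using standard SSML elements; the IPA string and timings are illustrative, not taken from any particular platform's documentation):

```xml
<speak>
  Please read the report <break time="500ms"/> carefully.
  <!-- Explicit pronunciation via IPA, honored by rule-based engines -->
  I'd like a <phoneme alphabet="ipa" ph="təˈmɑːtoʊ">tomato</phoneme>.
  This step is <emphasis level="strong">mandatory</emphasis>.
</speak>
```

A rule-based pipeline parses each tag and applies it deterministically; a neural engine may ignore some or all of these tags, or skip the enclosed text entirely, which matches the behavior members have reported with ElevenLabs voices in Storyline.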

Modern AI TTS engines such as ElevenLabs use neural network models that generate speech more holistically. Rather than strictly following SSML instructions, the model predicts pronunciation, timing, and prosody directly from the text based on patterns learned during training. As a result, pronunciation control via <phoneme> tags is limited and inconsistent, and many traditional SSML tags are ignored or unsupported.

In practice, pronunciation is often better controlled through phonetic respelling, punctuation, and clear sentence structure rather than relying on SSML tags.
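Since the workflow is plain text in, one practical approach is a small pre-processing pass that swaps trouble words for phonetic respellings before the script reaches the voice engine. Here is a minimal Python sketch; the respelling table is purely illustrative and would be built up from whichever words your chosen voice mispronounces:

```python
import re

# Hypothetical respelling table: words a neural voice tends to mispronounce,
# mapped to spellings it reads more predictably.
RESPELLINGS = {
    "cache": "cash",
    "SQL": "sequel",
}

def respell(text: str) -> str:
    """Replace known trouble words with phonetic respellings before TTS."""
    for word, spoken in RESPELLINGS.items():
        # Case-insensitive whole-string substitution of the literal word.
        text = re.sub(re.escape(word), spoken, text, flags=re.IGNORECASE)
    return text

print(respell("Clear the cache before running the SQL query."))
# → Clear the cash before running the sequel query.
```

Because the substitution happens before the text is sent anywhere, this works with any TTS tool that accepts plain text, including the Storyline-to-ElevenLabs path, and sidesteps SSML support questions entirely.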

(Additionally, Storyline sends plain text to ElevenLabs rather than a fully parsed SSML stream, so even tags that may be supported in ElevenLabs’ native interface may not work reliably when used inside Storyline.)

For precise pronunciation control, many developers now generate audio externally using dedicated TTS platforms such as ElevenLabs, Azure Neural TTS, or Murf.ai, and then import the finished audio into Storyline. Some platforms, such as Murf.ai, provide alternative pronunciation control mechanisms — for example, its “Say It Like I Say It” feature allows you to directly specify pronunciation using phonetic guidance without relying on SSML, which can be more reliable when working with neural voices.

I have written a number of articles on my website explaining how AI TTS is evolving, comparing different engines, and outlining practical workflows for use in e-learning development, including Storyline-based projects:

Articles

AI Voice Generation: Murf.ai in Focus – Profile Learning Technologies

From Narration to Conversation: How ElevenLabs Elevates AI Speech for Online Learning – Profile Learning Technologies

These articles may be helpful if you are trying to decide which approach provides the best balance of realism, pronunciation control, and workflow efficiency.
