Forum Discussion
Advances in TTS Technology and AI voice generation
There have been a number of posts here about the use of SSML (Speech Synthesis Markup Language) with AI-generated voices and Text-to-Speech narration. Several members and staff have pointed out that ElevenLabs voices in Storyline do not fully support SSML. In particular, <speak> tags may cause the contained text to be skipped, and other tags are not consistently applied. In reality, the reduced usefulness of SSML is not just a Storyline 360 or even an ElevenLabs issue — it reflects a broader shift in how modern AI TTS engines work.
To understand why SSML is only partially supported with certain voices, it helps to understand how TTS technology has evolved.
Traditional TTS engines (Amazon Polly, Azure Neural TTS, Google TTS) use a structured, rule-based pipeline:
Text → SSML parser → phoneme engine → waveform
In this process, the text is first converted into phonemes, which are the smallest individual units of sound in a language (for example, the sounds /k/, /æ/, and /t/ in “cat”), and these phonemes are then used to generate the spoken audio waveform. SSML allows developers to explicitly control this conversion by specifying pronunciation, pauses, emphasis, and other speech characteristics.
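For instance, with a traditional engine such as Amazon Polly or Azure Neural TTS, markup along these lines would insert a pause, force a specific pronunciation via IPA, and add emphasis (the phoneme strings here are illustrative):

```xml
<speak>
  <!-- A half-second pause before the sentence -->
  <break time="500ms"/>
  You say <phoneme alphabet="ipa" ph="təˈmɑːtoʊ">tomato</phoneme>,
  I say <phoneme alphabet="ipa" ph="təˈmeɪtoʊ">tomato</phoneme>.
  <emphasis level="strong">That settles it.</emphasis>
</speak>
```

A rule-based pipeline parses these tags deterministically, which is exactly the control that neural engines tend to discard.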
Modern AI TTS engines such as ElevenLabs use neural network models that generate speech more holistically. Rather than strictly following SSML instructions, the model predicts pronunciation, timing, and prosody directly from the text based on patterns learned during training. As a result, pronunciation control via <phoneme> tags is limited and inconsistent, and many traditional SSML tags are ignored or unsupported.
In practice, pronunciation is often better controlled through phonetic respelling, punctuation, and clear sentence structure rather than relying on SSML tags.
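For example, a respelled script line might look like this (the respelling itself is improvised for illustration, not a vendor feature):

```text
Script text:  Dr. Lee will review the live results.
Respelled:    Doctor Lee will review the lyve results.
```

Expanding abbreviations and respelling ambiguous words ("live" as /laɪv/ rather than /lɪv/) guides a neural voice without any markup at all.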
(Additionally, Storyline sends plain text to ElevenLabs rather than a fully parsed SSML stream, so even tags that may be supported in ElevenLabs’ native interface may not work reliably when used inside Storyline.)
For precise pronunciation control, many developers now generate audio externally using dedicated TTS platforms such as ElevenLabs, Azure Neural TTS, or Murf.ai, and then import the finished audio into Storyline. Some platforms, such as Murf.ai, provide alternative pronunciation control mechanisms — for example, its “Say It Like I Say It” feature allows you to directly specify pronunciation using phonetic guidance without relying on SSML, which can be more reliable when working with neural voices.
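As a minimal sketch of that external workflow, the snippet below requests plain-text narration from ElevenLabs' public v1 text-to-speech endpoint and saves the result as an MP3 for import into Storyline. The voice ID and API key are placeholders, and the request shape should be verified against the current ElevenLabs documentation:

```python
# Hedged sketch, not a definitive implementation: build a plain-text TTS
# request for the ElevenLabs v1 API. VOICE_ID and YOUR_KEY are placeholders.
import json
import urllib.request

API_BASE = "https://api.elevenlabs.io/v1/text-to-speech"

def build_tts_request(voice_id: str, api_key: str, text: str) -> urllib.request.Request:
    """Build the POST request. Note: plain text only -- no SSML tags."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{API_BASE}/{voice_id}",
        data=payload,
        headers={"xi-api-key": api_key, "Content-Type": "application/json"},
        method="POST",
    )

req = build_tts_request("VOICE_ID", "YOUR_KEY", "Welcome to the course.")
# Sending the request returns audio bytes, e.g.:
# with urllib.request.urlopen(req) as resp:
#     open("narration.mp3", "wb").write(resp.read())
```

The saved MP3 can then be imported into Storyline like any other audio asset, sidestepping SSML support questions entirely.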
I have written a number of articles on my website explaining how AI TTS is evolving, comparing different engines, and outlining practical workflows for use in e-learning development, including Storyline-based projects:
AI Voice Generation: Murf.ai in Focus – Profile Learning Technologies
These articles may be helpful if you are trying to decide which approach provides the best balance of realism, pronunciation control, and workflow efficiency.
2 Replies
- RonPrice (Partner)
I recommend this article if you're using the AI TTS -
Especially this section on prompting/tagging -
- JohnCooper-be3c (Community Member)
Thanks RonPrice - helpful articles. I don't subscribe to Storyline 360 AI (although I understand next year I will have to?). We have developed our own tools and an app that can extract voices from multiple different sources (including ElevenLabs and Azure Neural TTS) with an appropriate set of controls not dissimilar to Storyline AI - although different engines have different commands.
I'm excited that Storyline AI can generate sound effects and background noise - that saves time. I think sound and music are undervalued in eLearning development. Does Storyline AI have (or is it planning to have) full music generation capabilities? This is the kind of thing we use:
John