Forum Discussion

AnnaBertoncini
Community Member
1 year ago

AI TTS and SSML functionality

Hello everyone,

I would like to bring the community's attention to AI TTS and the limited support for SSML with it.

I know that it is not supported because you built AI voices "to understand the relationship between words and adjust delivery accordingly". 

However, AI voices mispronounce acronyms and other words like company names and such. 

Because of that, I am forced to use the old TTS voices, which is a bit upsetting: the AI voices do sound more natural and "human", a big benefit in e-learning for ensuring a more pleasant learning experience for our users.

This is a request to work on SSML for AI voices because I strongly believe it is needed. 

Anna

17 Replies

  • Hello AnnaBertoncini,

    Thanks for reaching out and sharing your thoughts on AI TTS and SSML functionality.

    You are correct that AI Assistant has limited support for speech synthesis markup language (SSML) because AI-generated voices are designed to understand the relationship between words and adjust delivery accordingly.

    I believe this is a good request, so I shared your feedback with our product team. We'll let you know if there are any changes in the future regarding this area in Articulate AI.

    Enjoy the rest of your day!

    • suzarina
      Community Member

      Any update on this? I'm also having issues trying to get the AI voices to pronounce acronyms correctly, or interject pauses so that the speech sounds more natural. 
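      For reference, the pauses I mean are what SSML provides through its break element in the classic voices (a sketch; the timing value is just illustrative):

      ```xml
      <speak>
        Welcome to the course. <break time="600ms"/> Let's begin with module one.
      </speak>
      ```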

      • EricSantos
        Staff

        Hi suzarina,

        Thanks for following up on this request.

        There are no updates at the moment, but we'll make sure to post in this thread if there are any developments regarding expanded SSML support in Articulate AI text-to-speech.

  • arabellas
    Community Member

    I'd like to piggyback on this request. The AI voices overall sound more natural - but it's more challenging to adjust pronunciation. It would also be really nice to be able to adjust the inflection/emphasis and expression. 
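    For anyone unfamiliar, the kind of emphasis and inflection control I mean already exists in standard SSML as the emphasis and prosody elements (a sketch; the attribute values are illustrative):

    ```xml
    <speak>
      <emphasis level="strong">Always</emphasis> save your work before exiting.
      <prosody rate="slow" pitch="+2st">Please read the next screen carefully.</prosody>
    </speak>
    ```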

    • LaurenDuvall
      Staff

      Hi BrendtWaters! Thanks for letting us know that this feature would be beneficial for you, too. I know others have shared interest in more control when it comes to pronunciation with AI voices. To ensure I'm sharing the correct information with our Product team, would you mind sharing where you've been stuck with the lack of control with AI voices? Is it certain words, languages, or something else?

      • BrendtWaters
        Community Member

        By far, the biggest loss is the phoneme tag. Even the smartest of AI voices stumbles on pronunciations. While sometimes, you can "fake" it out to say what you want (e.g., with creative spelling), when emphasis is put on the wrong syllable, there's no way to fix that other than the phoneme tag.

        Case in point: The word "conduct" can be a verb ("Bob will conduct the orchestra") or a noun ("Mary has good conduct"). Same word, two different pronunciations.

        While we were overhauling all our courses (before moving to on-board AI TTS), we were using an external tool (creating mp3s) that allowed the phoneme tag. So that we didn't have to keep figuring out the same IPA over and over, we made a library of phoneme tags. It exceeded 50 tags. So this is no minor loss.

        I'm truly confused why, what is supposed to be an upgrade, *loses* an ability that on-board (regular) TTS had.
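        For the "conduct" example, the phoneme tag looks like this in standard SSML (the IPA transcriptions shown are illustrative, and engine support for the tag varies):

        ```xml
        <speak>
          <!-- Verb: stress on the second syllable -->
          Bob will <phoneme alphabet="ipa" ph="kənˈdʌkt">conduct</phoneme> the orchestra.
          <!-- Noun: stress on the first syllable -->
          Mary has good <phoneme alphabet="ipa" ph="ˈkɒndʌkt">conduct</phoneme>.
        </speak>
        ```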

  • arabellas
    Community Member

    Seconding Brendt's reply. The phoneme tag is a huge help. It's not about just certain words - my organization uses lots of acronyms that are pronounced as words, but not always in a way that's easy to "fake" with spelling. It's especially frustrating when it's part of a bigger phrase and everything else about the phrase is perfect, but just that one word is totally wrong and you have to regenerate the whole thing. And if you're using two words with unusual pronunciations in the same script, that can be extra frustrating.

    • LucianaPiazza
      Staff

      Hi arabellas,

      We appreciate you sharing your insight as well! We understand that you'd like more precise control over how specific words are spoken so you can get accurate results without rework. Totally makes sense! We’ve shared your feedback with our product team so they understand your experience. We'll share any future updates in this thread so everyone is aware! 

  • BrendtWaters
    Community Member

    Another example: "lead" (as in the element, Pb). Half the time, the TTS says it with a long "e" (as in, "He will lead the parade"). In the regular (non-AI) TTS, spelling it as "led" fixes the issue. But the AI TTS is too smart for its own good, recognizes that "led" doesn't fit the context, and assumes "this must be an acronym (despite the lower case letters)" and reads it as individual letters. A phoneme tag would get around this.
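    In an SSML-capable engine, either the phoneme tag or a sub alias pins this down (a sketch; the IPA here assumes the metal, and the sub element is the standard SSML way to substitute the spoken form while keeping the displayed text):

    ```xml
    <speak>
      <!-- Explicit pronunciation via IPA -->
      Exposure to <phoneme alphabet="ipa" ph="lɛd">lead</phoneme> is a hazard.
      <!-- Or: speak "led" while displaying "lead" -->
      Exposure to <sub alias="led">lead</sub> is a hazard.
    </speak>
    ```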

    Fortunately, I found a different workaround. This time.

    • LucianaPiazza
      Staff

      Appreciate you sharing another example with us, BrendtWaters. I've passed this along to our Product Team for awareness. We'll be sure to share any updates in this thread. 

  • PeterGrennan
    Community Member

    Adding another vote for this. I have to spend silly amounts of time regenerating AI voices to counter the way they try to pronounce certain words, mainly ones that exist as both a noun and a verb with different pronunciations (e.g. "record"). We also work in healthcare, so a number of words and abbreviations need specific pronunciations. Also, some words are pronounced the American way even with British voices. We have figured many of them out by writing them strangely, but even then, longer text often needs multiple attempts to get right. I've only discovered SSML today but can already tell it would be a godsend. It's hard for AI to work out the context of words when, for example, we produce software simulations, so verbs can appear in screen headings (e.g. the Record Referral screen). AI isn't very good at understanding that "Record" there is a verb. We need more control.

    • BrendtWaters
      Community Member

      Yeah, I've run into a similar thing as Peter with titles/headings. They're usually just a phrase, so the AI just doesn't have enough context info.

  • Hi PeterGrennan, BrendtWaters, and arabellas,

    You’re right to call this out, especially with the examples you’ve all shared around pronunciation and context.

    I appreciate you checking in on this as well. While I don’t have a specific update to share right now, this is still something we’re actively tracking.

    At the moment, AI text-to-speech is designed to infer pronunciation based on context, but it doesn’t offer the same level of control as traditional TTS when it comes to fine-tuning output. That’s where cases like acronyms, industry-specific terms, or words with multiple pronunciations can become challenging, especially when context is limited, like in headings or short phrases.

    The examples you’ve all provided, from healthcare terminology to words like “record” and “lead,” are really helpful in highlighting where more control is needed. I can also see how features like phoneme support or a pronunciation library would make a big difference in reducing rework and improving consistency.

    I’ve added your feedback to the existing request, including these newer use cases. We’re continuing to gather input as the team explores ways to improve pronunciation control in AI voices.