Forum Discussion
AI Audio Consistency
There's a lot to love about AI Audio, but the inconsistency is a major frustration. I don't just mean how the voices change their pronunciation between generations or within a single generation.
I'm talking about the QUALITY of the generation.
I can have text, click generate with my chosen voice and I can get WILDLY different results each time.
By different results, I mean the audio may be loud in one generation and quiet in the next. It can sound great, like studio quality, one time and then sound like it's on a speakerphone the next. The voice can sound like the example one time and like a completely different voice another time. It can start out excited and then sound more and more disinterested as it gets to the end of the generation. I use the default settings.
I've attached a video to show you one of these examples. The only way to fix this, I've found, is to copy the script to another slide and generate the audio there, insert it, then copy and paste it on the slide it needs to go on.
For something that is supposed to save time, I spend an awful lot of time regenerating to get a usable file.
1 Reply
Hello AdamPeterson-5b,
Thanks for sharing the video and the details of what you are experiencing. I understand you are seeing inconsistencies in the quality of the generated audio.
For comparison, I ran a test on my end using the settings shown in the screenshot below (Stability—1.00, Similarity—1.00, Style exaggeration—0.50). I did not see the issue with those settings. Each generation produced the same text-to-speech audio.
As outlined in this article: Create Content with AI Assistant
- Stability: Controls how stable the voice is and how much randomness appears between each generation. Lower values can sound more emotional, while higher values sound more professional and formal.
- Similarity: Controls how closely the AI should match the original voice when replicating it.
- Style exaggeration: Adjusts how much the style of the original voice is amplified. Higher values can increase generation time.
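If it helps when comparing setups, here is a minimal Python sketch for recording the three slider values and spotting which ones differ from the values I tested with. The `VoiceSettings` class and `diff_settings` helper are hypothetical names made up for illustration; they are not part of any product API, and the 0.0 to 1.0 range is an assumption about the sliders rather than something the article confirms.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class VoiceSettings:
    """Hypothetical record of the three sliders (not a real product API).

    Defaults are the values used in the test described in this reply.
    """
    stability: float = 1.00
    similarity: float = 1.00
    style_exaggeration: float = 0.50

    def __post_init__(self):
        # Assumption: each slider runs from 0.0 to 1.0.
        for name, value in asdict(self).items():
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be between 0.0 and 1.0, got {value}")


def diff_settings(mine: VoiceSettings, tested: VoiceSettings) -> dict:
    """Return {setting_name: (my_value, tested_value)} for settings that differ."""
    mine_d, tested_d = asdict(mine), asdict(tested)
    return {
        name: (mine_d[name], tested_d[name])
        for name in mine_d
        if mine_d[name] != tested_d[name]
    }


if __name__ == "__main__":
    tested = VoiceSettings()  # values from the screenshot
    # Placeholder values for illustration only; read the real ones off your sliders.
    mine = VoiceSettings(stability=0.5, similarity=0.75, style_exaggeration=0.5)
    for name, (my_value, tested_value) in diff_settings(mine, tested).items():
        print(f"{name}: yours={my_value}, tested={tested_value}")
```

Noting the values this way before each test run makes it easier to tell whether a change in audio quality lines up with a change in settings, or happens even when the settings are identical.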
Could you try the same settings I used in the screenshot and let us know whether that makes a difference for you?