What is text-to-speech (TTS)?
Text-to-speech is AI technology that converts written text into spoken audio. It's how AI voice agents 'talk' to callers — turning the agent's generated response into natural human-sounding speech.
Written By Catherine Weir
Text-to-speech, usually shortened to TTS, is the technology that converts written text into spoken audio. It's how AI voice agents actually talk to callers — turning the response a language model produces into sound that goes over the phone line.
TTS has improved dramatically over the past decade. Before neural TTS arrived in the late 2010s, synthetic speech was the robotic, unnatural voice everyone recognized from phone menus ("your call is... important to us"). Modern neural TTS can sound nearly indistinguishable from a human speaker, to the point where businesses often have to tell callers that the voice they're hearing is synthetic rather than assume callers will notice on their own.
Generations of TTS
•Concatenative TTS (dominant through the mid-2010s): splices together short recorded units of speech, such as phonemes or diphones. Cheap, reliable, robotic.
•Parametric TTS (2000s through mid-2010s): uses a statistical model to generate speech parameters, which a vocoder turns into audio, rather than splicing recordings. Smoother than concatenative, still obviously synthetic.
•Neural TTS (2016–): neural networks generate the audio from text. The leap in quality was enormous. This is what modern voice assistants, audiobook services, and AI phone agents use.
•Generative TTS (2020–): large neural models trained on enormous datasets, capable of voice cloning and emotional variation. This is the state of the art today.
What good TTS sounds like
•Natural prosody — pitch, stress, and timing match how humans actually speak
•Smooth pronunciation — no jarring transitions, correct handling of numbers, names, and abbreviations
•Consistent voice identity — the voice sounds like the same person from sentence to sentence
•Appropriate pauses and breathing — natural rhythm, not machine-gun delivery
•Handles interruptions gracefully — can stop mid-sentence when a caller jumps in
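Correct handling of numbers and abbreviations usually depends on a text-normalization step that runs before synthesis, rewriting what is written into what should be spoken. The Python sketch below is a minimal illustration only, not how any particular TTS engine works: the ABBREVIATIONS table and the function names are hypothetical, it only spells out whole numbers from 0 to 999, and it ignores harder cases such as dates, times, decimals, and ambiguous abbreviations ("St." can mean "Street" or "Saint").

```python
import re

# Hypothetical abbreviation table for illustration; a production system
# would need a far larger lexicon and context to resolve ambiguity.
ABBREVIATIONS = {"Dr.": "Doctor", "St.": "Street", "appt.": "appointment"}

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n: int) -> str:
    """Spell out an integer in the range 0..999."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    hundreds, rest = divmod(n, 100)
    words = ONES[hundreds] + " hundred"
    return words + (" " + number_to_words(rest) if rest else "")


def normalize_for_tts(text: str) -> str:
    """Expand known abbreviations and spell out standalone 1-3 digit numbers."""
    for abbr, full in ABBREVIATIONS.items():
        text = text.replace(abbr, full)
    # Longer digit runs (phone numbers, years) are left untouched here.
    return re.sub(r"\b\d{1,3}\b",
                  lambda m: number_to_words(int(m.group())), text)


print(normalize_for_tts("Dr. Lee is at 42 Main St. at 3"))
# prints: Doctor Lee is at forty-two Main Street at three
```

Keeping normalization as a separate step makes pronunciation fixes easy to test on plain strings, without generating any audio.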
TTS in a business phone context
When an AI voice agent is on a call, TTS is generating audio in real time. The agent's language model produces the text of what it wants to say, TTS converts that text to audio, and the audio is streamed over the phone line. Latency matters — if TTS takes too long to generate audio, the caller experiences awkward pauses.
Production TTS systems used in AI phone agents are optimized for a low time to first audio byte (usually under 200 milliseconds between receiving text and emitting the first audio) and for continuous streaming, so the caller hears speech as soon as possible.
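The flow described above (language model text in, streamed audio out, with the time to first audio as the key measurement) can be sketched in Python. Everything here is an assumption made for illustration: fake_synthesize is a stand-in that emits silence rather than a real TTS engine, the 8 kHz, 16-bit mono format and 20 ms chunk size are merely typical of narrowband phone audio, and the should_stop callback is one hypothetical way to stop mid-sentence when a caller interrupts.

```python
import re
import time
from typing import Callable, Iterable, Iterator, Optional, Tuple


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Group streamed text fragments into sentences as soon as each one ends."""
    buffer = ""
    for token in tokens:
        buffer += token
        # Emit every complete sentence currently sitting in the buffer.
        while (match := re.search(r"[.!?]\s+", buffer)):
            yield buffer[:match.end()].strip()
            buffer = buffer[match.end():]
    if buffer.strip():
        yield buffer.strip()  # flush whatever text is left at the end


def fake_synthesize(sentence: str, chunk_ms: int = 20) -> Iterator[bytes]:
    """Stand-in for a streaming TTS engine: yields chunks of silent PCM."""
    bytes_per_chunk = 8000 * 2 * chunk_ms // 1000  # 8 kHz, 16-bit mono
    for _ in range(max(1, len(sentence) // 5)):
        yield b"\x00" * bytes_per_chunk


def stream_reply(tokens: Iterable[str],
                 should_stop: Callable[[], bool] = lambda: False
                 ) -> Iterator[bytes]:
    """Yield audio sentence by sentence; stop early if the caller interrupts."""
    for sentence in sentences_from_tokens(tokens):
        for chunk in fake_synthesize(sentence):
            if should_stop():
                return
            yield chunk


def time_to_first_audio(tokens: Iterable[str]) -> Tuple[Optional[float], int]:
    """Return (seconds until the first audio chunk, total audio bytes)."""
    start = time.perf_counter()
    first: Optional[float] = None
    total = 0
    for chunk in stream_reply(tokens):
        if first is None:
            first = time.perf_counter() - start
        total += len(chunk)
    return first, total


latency, total_bytes = time_to_first_audio(
    ["Hi ", "there. ", "How can ", "I help?"])
print(f"first audio after {latency * 1000:.2f} ms, {total_bytes} bytes total")
```

Synthesizing sentence by sentence, instead of waiting for the full reply, is what lets the first audio go out while the language model is still producing the rest of its response.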
Voice choice and branding
Most platforms offer a curated library of voices — male and female, multiple accents, different tones (warm, authoritative, energetic). Higher-end platforms support voice cloning, letting you brand your AI agent with a custom voice you own the rights to.
Related concepts
•Generative voice AI — the modern category TTS falls into
•Voice cloning — custom-voice TTS
•Voice AI — the broader category
•Speech-to-text (STT) — the opposite direction
See it in action
The Receptionist Agent at 365agents uses premium generative TTS optimized for phone audio. Choose a voice from our library or brand your agent with a custom cloned voice. Book a demo and hear it in your own browser.