What is speech-to-text (STT)?

Speech-to-text is the AI technology that converts spoken audio into written text, often in real time. It's how AI voice agents "hear" what a caller is saying on the phone.

Written By Catherine Weir

Last updated About 1 hour ago

Speech-to-text, usually shortened to STT, is the technology that converts spoken audio into written text. It's also known as automatic speech recognition (ASR) and voice-to-text. When you dictate a message to your phone, use live captions on a video call, or hear an AI voice agent respond to what you just said, STT is the first step in the chain.

In an AI phone agent, STT is how the agent "hears" the caller. It runs continuously throughout the call, transcribing what the caller says so the language model in the middle of the stack has something to reason over.
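As a rough illustration of where STT sits in that chain, the sketch below runs each caller turn through a transcription step and then hands the text to a reasoning step. Everything in it is hypothetical: `run_call`, `fake_transcribe`, and `fake_reason` are stand-ins invented for this example, not the API of any real STT or language-model service.

```python
from typing import Callable, Iterable, Iterator


def run_call(
    caller_turns: Iterable[bytes],
    transcribe: Callable[[bytes], str],
    reason: Callable[[str], str],
) -> Iterator[tuple[str, str]]:
    """Yield (transcript, reply) pairs, one per caller turn."""
    for audio in caller_turns:
        text = transcribe(audio)  # STT step: audio in, text out
        reply = reason(text)      # language-model step: text in, reply text out
        yield text, reply


# Hypothetical stand-ins so the sketch runs without any real service.
def fake_transcribe(audio: bytes) -> str:
    # Assumption: pretend the "audio" bytes are already UTF-8 words.
    return audio.decode("utf-8")


def fake_reason(text: str) -> str:
    return f"You said: {text}"


turns = [b"hi, I'd like to book an appointment", b"tomorrow at ten"]
log = list(run_call(turns, fake_transcribe, fake_reason))
```

The point of the sketch is only the ordering: the reasoning step never sees audio, so anything the transcription step gets wrong is passed straight through to the reply.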

What makes STT hard on a phone call

Transcribing studio-quality audio is relatively easy. Transcribing a phone call is much harder, for a few reasons:

  • Compressed audio: traditional phone networks carry narrowband audio, typically sampled at 8 kHz, which is far lower fidelity than a podcast recording

  • Background noise: callers are often in cars, at airports, in noisy offices, or on speakerphone

  • Accents and dialects: STT models need to handle the full range of how people actually speak

  • Domain-specific vocabulary: medical terms, legal terms, product SKUs, and proper names are often mistranscribed

  • Real-time constraints: the transcription has to be both accurate and fast, because latency kills conversational flow

Modern STT systems trained on phone-channel audio can reach high accuracy even in noisy conditions, but quality varies significantly between providers.
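To put a number on the audio-quality gap, here is a small back-of-the-envelope calculation. It assumes classic narrowband telephony (8 kHz sampling, 8 bits per sample, as in G.711) and, for comparison, a mono recording at 44.1 kHz and 16 bits. Both figures are illustrative assumptions: many VoIP calls use other codecs, and recording setups vary. The helper names are made up for this example.

```python
def pcm_bitrate_kbps(sample_rate_hz: int, bits_per_sample: int, channels: int = 1) -> float:
    """Raw (uncompressed) bitrate in kilobits per second."""
    return sample_rate_hz * bits_per_sample * channels / 1000


def nyquist_hz(sample_rate_hz: int) -> float:
    """Highest frequency a given sample rate can represent."""
    return sample_rate_hz / 2


# Assumed narrowband phone channel: 8 kHz, 8 bits per sample, mono.
phone_kbps = pcm_bitrate_kbps(8_000, 8)
phone_max_hz = nyquist_hz(8_000)

# Assumed studio-style recording: 44.1 kHz, 16 bits per sample, mono.
studio_kbps = pcm_bitrate_kbps(44_100, 16)
studio_max_hz = nyquist_hz(44_100)

print(f"phone:  {phone_kbps} kbit/s, frequencies up to {phone_max_hz} Hz")
print(f"studio: {studio_kbps} kbit/s, frequencies up to {studio_max_hz} Hz")
```

Under these assumptions the phone channel carries roughly a tenth of the raw data and cannot represent anything above 4 kHz, which is where much of the detail that separates similar-sounding consonants lives.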

Streaming STT vs. batch STT

STT operates in two modes:

  • Streaming STT produces a running transcript as the caller speaks. This is required for live voice agents, because the agent can't wait until the caller has finished speaking to start working out its response.

  • Batch STT processes a complete audio file and returns a transcript. This is used for post-call transcription, search, and analytics, not for live conversation.

Both are used in production AI voice platforms. Streaming handles the live call; batch produces the cleaner post-call transcript you see in call logs.
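The difference between the two modes can be sketched in a few lines. This is a toy model, not a real recognizer: `recognize_chunk` is a hypothetical stand-in that pretends each audio chunk decodes to exactly one word, and the function names are invented for this example.

```python
from typing import Iterable, Iterator


def recognize_chunk(chunk: bytes) -> str:
    """Hypothetical recognizer: pretends each audio chunk is one word."""
    return chunk.decode("utf-8")


def batch_transcribe(audio_chunks: Iterable[bytes]) -> str:
    """Batch mode: consume the whole recording, then return one transcript."""
    return " ".join(recognize_chunk(c) for c in audio_chunks)


def streaming_transcribe(audio_chunks: Iterable[bytes]) -> Iterator[str]:
    """Streaming mode: yield the running transcript after every chunk."""
    words: list[str] = []
    for chunk in audio_chunks:
        words.append(recognize_chunk(chunk))
        yield " ".join(words)


call_audio = [b"I", b"need", b"to", b"reschedule"]

partials = list(streaming_transcribe(call_audio))  # one partial per chunk
final = batch_transcribe(call_audio)               # one result at the end
```

In the streaming case the caller of `streaming_transcribe` can act on each partial as it arrives, while `batch_transcribe` returns nothing until all the audio has been processed. A real batch pass can also use the full recording as context, which is one reason post-call transcripts tend to be cleaner.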

Accuracy metrics

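  • Word Error Rate (WER) is the standard accuracy measure: the number of word substitutions, deletions, and insertions needed to turn the transcript into the correct text, divided by the number of words in the correct text. Lower is better. Good phone-channel STT lands in the single digits (as a percentage) on English.

  • Turn-taking detection is separate from WER: it measures how well the STT system recognizes when the caller has finished speaking. Poor turn-taking makes an AI agent feel laggy (it waits too long to respond) or rude (it interrupts the caller).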
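For readers who want to see how WER is computed, the sketch below uses the usual word-level edit distance. The only text normalization is lowercasing and whitespace splitting, which is an assumption made for brevity; real evaluations usually also normalize punctuation, numbers, and spelling variants. The function name is made up for this example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / words in reference."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


reference = "i would like to book an appointment for next tuesday"
hypothesis = "i would like to book appointment for next thursday"
wer = word_error_rate(reference, hypothesis)  # one deletion + one substitution
```

Here the transcript drops "an" and hears "thursday" instead of "tuesday", so 2 errors over 10 reference words gives a WER of 0.2, or 20%. Because insertions count as errors, WER can exceed 100% when the transcript contains many extra words.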

See it in action

The Receptionist Agent from 365agents uses streaming STT tuned for phone-channel audio, with automatic vocabulary adaptation for your business's specific terms. You can review the exact transcript of every call in your admin dashboard.