Foundational Terms
Master these essential Voice AI concepts that form the building blocks of any voice-enabled application.
Core Concepts
STT (Speech-to-Text)
Converts spoken language into written text, typically using Automatic Speech Recognition (ASR) algorithms.
TTS (Text-to-Speech)
Technology that converts written text into synthetic speech (artificial voice output).
TTFT (Time to First Token)
A latency metric representing the time from a user's query (end of spoken input) to the system producing the first part of its response (first output token).
Turn (Dialogue Turn)
In a conversation, one turn is a single participant's utterance before the other party responds (i.e. speaking one-at-a-time in alternating turns).
Wake Word
A specific word or phrase that, when detected, causes a voice assistant or device to "wake up" and start listening for commands (e.g. "Hey Siri", "Alexa").
NLU (Natural Language Understanding)
A branch of AI that enables machines to comprehend meaning and intent from human language input (converting raw text into structured data like intents and entities).
Audio Processing
AEC (Acoustic Echo Cancellation)
Client-side filter that removes speaker bleed-through before STT.
AGC (Automatic Gain Control)
Mic-level amplifier that evens loud/quiet input; can hide pauses.
Opus
Low-latency audio codec baked into WebRTC; 6–510 kbps.
PCM (Pulse-Code Modulation)
Raw, uncompressed audio (e.g., 24 kHz × 16-bit = 384 kbps).
Jitter Buffer
Client queue that re-orders late packets; bigger buffer ⇒ more delay.
Networking & Protocols
PSTN (Public Switched Telephone Network)
Legacy phone network your bot must join to reach real numbers.
SIP (Session Initiation Protocol)
VoIP signalling standard; used for carrier, PBX or call-center hand-offs.
DTMF (Dual-Tone Multi-Frequency)
Key-press tones; bots send these to navigate IVRs.
Edge Routing
Technique that sends user packets to nearest PoP, then across a private backbone.
QUIC
UDP-based transport behind HTTP/3; removes TCP head-of-line blocking.
MoQ (Media over QUIC)
Emerging IETF spec for low-latency media distribution on QUIC.
P50 / P95 Percentiles
Latency statistics; P95 ≅ "worst typical", design for both.
Optimization Techniques
Token Caching
Provider-side reuse of previous prompt tokens; cuts cost and TTFT.
Warm Transfer
Live hand-off: bot briefs the human agent before connecting caller.
Word-level Timestamps
TTS metadata mapping each word to exact playback time; vital for rollback.
Context Summarization
LLM-generated abridged history inserted to keep prompts under token limits.