Foundational Terms

Master these essential Voice AI concepts that form the building blocks of any voice-enabled application.

Core Concepts

STT (Speech-to-Text)

Technology that converts spoken language into written text; also called Automatic Speech Recognition (ASR).

TTS (Text-to-Speech)

Technology that converts written text into synthetic speech (artificial voice output).

TTFT (Time to First Token)

A latency metric: the time from the end of the user's spoken input to the first token of the system's response.
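
A minimal sketch of how TTFT might be measured, assuming the response arrives as an iterable of tokens. The names `measure_ttft` and `fake_stream` are hypothetical and not part of any particular SDK; in a real voice pipeline the start time would be taken when the end of speech is detected.

```python
import time
from typing import Iterable, Iterator, Tuple


def measure_ttft(token_stream: Iterable[str], started_at: float) -> Tuple[str, float]:
    """Return the first token and the seconds elapsed since `started_at`.

    `started_at` is a time.perf_counter() reading taken when the user's
    input ended (an assumption about where the clock starts).
    """
    first_token = next(iter(token_stream))
    return first_token, time.perf_counter() - started_at


def fake_stream() -> Iterator[str]:
    """Hypothetical stand-in for a streaming model response."""
    time.sleep(0.05)  # simulated processing delay before the first token
    yield "Hello"
    yield " there"


start = time.perf_counter()
token, ttft_s = measure_ttft(fake_stream(), start)
print(f"first token {token!r} after {ttft_s * 1000:.0f} ms")
```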

Turn (Dialogue Turn)

In a conversation, one turn is a single participant's utterance before the other party responds (i.e., participants speak one at a time, alternating).

Wake Word

A specific word or phrase that, when detected, causes a voice assistant or device to "wake up" and start listening for commands (e.g. "Hey Siri", "Alexa").

NLU (Natural Language Understanding)

A branch of AI that enables machines to comprehend meaning and intent from human language input (converting raw text into structured data like intents and entities).

AEC (Acoustic Echo Cancellation)

Client-side filter that removes the device's own loudspeaker output from the microphone signal before STT, so the bot does not transcribe itself.

AGC (Automatic Gain Control)

Mic-level amplifier that evens loud/quiet input; can hide pauses.

Opus

Low-latency audio codec that every WebRTC implementation must support; bitrates range from 6 to 510 kbps.

PCM (Pulse-Code Modulation)

Raw, uncompressed audio samples (e.g., 24 kHz × 16-bit, mono = 384 kbps).
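
The bitrate arithmetic can be checked with a short calculation. This is a sketch assuming a single (mono) channel by default; the function names and the 20 ms frame size are illustrative choices, not taken from any library.

```python
def pcm_bitrate_kbps(sample_rate_hz: int, bit_depth: int, channels: int = 1) -> float:
    """Bitrate of uncompressed PCM in kilobits per second."""
    return sample_rate_hz * bit_depth * channels / 1000


def pcm_frame_bytes(sample_rate_hz: int, bit_depth: int, frame_ms: int, channels: int = 1) -> int:
    """Size in bytes of one PCM frame lasting `frame_ms` milliseconds."""
    samples = sample_rate_hz * frame_ms // 1000
    return samples * (bit_depth // 8) * channels


print(pcm_bitrate_kbps(24_000, 16))     # 384.0
print(pcm_frame_bytes(24_000, 16, 20))  # 960
```

Doubling the channel count or the sample rate doubles both numbers, which is why raw PCM is usually compressed before it crosses a network.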

Jitter Buffer

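Client-side queue that briefly holds incoming packets so that late or out-of-order packets can be played in sequence; a bigger buffer absorbs more jitter but adds more delay.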
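
A minimal sketch of the idea, assuming packets carry a sequence number. The `JitterBuffer` class is hypothetical and deliberately simplified: it releases packets once a fixed number are queued, whereas production buffers are usually time-based and adapt their depth to network conditions.

```python
import heapq
from typing import List, Optional, Tuple

Packet = Tuple[int, bytes]  # (sequence number, audio payload)


class JitterBuffer:
    """Holds packets until `depth` are queued, then releases them in order."""

    def __init__(self, depth: int = 3) -> None:
        self.depth = depth
        self._heap: List[Packet] = []

    def push(self, seq: int, payload: bytes) -> None:
        # The heap keeps the lowest sequence number at the front.
        heapq.heappush(self._heap, (seq, payload))

    def pop(self) -> Optional[Packet]:
        # Wait until enough packets are buffered; this wait is the added delay.
        if len(self._heap) < self.depth:
            return None
        return heapq.heappop(self._heap)


buf = JitterBuffer(depth=3)
for seq, payload in [(2, b"b"), (1, b"a"), (3, b"c")]:  # arrives out of order
    buf.push(seq, payload)
print(buf.pop())  # (1, b'a')
```

A larger `depth` tolerates packets that arrive further out of order, at the cost of holding audio longer before playback.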

PSTN (Public Switched Telephone Network)

The traditional telephone network; a bot must connect to it, typically through a telephony provider, to place or receive calls on real phone numbers.

SIP (Session Initiation Protocol)

VoIP signalling standard; used for carrier, PBX or call-center hand-offs.

DTMF (Dual-Tone Multi-Frequency)

The tones a phone keypad produces, each a pair of frequencies; bots send these to navigate IVR (Interactive Voice Response) menus.
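
Each key is encoded as the sum of one low-group (row) and one high-group (column) frequency. The sketch below generates the in-band tone for a key, assuming a 12-key keypad and 8 kHz audio; the function names are illustrative. In practice a bot often sends DTMF through its telephony provider or as RTP telephone events instead of synthesizing audio itself.

```python
import math
from typing import List, Tuple

ROW_HZ = (697, 770, 852, 941)   # low-group frequencies, one per keypad row
COL_HZ = (1209, 1336, 1477)     # high-group frequencies, one per keypad column
KEYPAD = ("123", "456", "789", "*0#")


def dtmf_frequencies(key: str) -> Tuple[int, int]:
    """Return the (row, column) frequency pair in Hz for one keypad key."""
    if len(key) == 1:
        for row, keys in enumerate(KEYPAD):
            col = keys.find(key)
            if col != -1:
                return ROW_HZ[row], COL_HZ[col]
    raise ValueError(f"not a DTMF key: {key!r}")


def dtmf_samples(key: str, duration_s: float = 0.1, sample_rate_hz: int = 8000) -> List[float]:
    """Return float samples in [-1.0, 1.0] for the tone of `key`."""
    low, high = dtmf_frequencies(key)
    count = round(duration_s * sample_rate_hz)
    return [
        0.5 * (math.sin(2 * math.pi * low * i / sample_rate_hz)
               + math.sin(2 * math.pi * high * i / sample_rate_hz))
        for i in range(count)
    ]


print(dtmf_frequencies("5"))   # (770, 1336)
print(len(dtmf_samples("5")))  # 800
```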

Edge Routing

Technique that sends user packets to the nearest point of presence (PoP), then across a private backbone.

QUIC

UDP-based transport protocol underlying HTTP/3; its independent streams avoid TCP's head-of-line blocking, where one lost packet stalls everything queued behind it.

MoQ (Media over QUIC)

Emerging IETF spec for low-latency media distribution on QUIC.

P50 / P95 Percentiles

Latency statistics: P50 is the median, and P95 is the value that 95% of requests come in under (roughly the "worst typical" case). Design for both.
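
A small worked example shows why both numbers matter. This sketch uses the nearest-rank method on made-up latency samples; other percentile definitions interpolate between samples and give slightly different values.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Illustrative response latencies in milliseconds (invented for this example).
latencies_ms = [420, 450, 480, 500, 510, 530, 560, 600, 750, 1900]

print(percentile(latencies_ms, 50))  # 510
print(percentile(latencies_ms, 95))  # 1900
```

Here the median looks healthy while the P95 reveals a slow outlier that some users will hit, which is the case a P50-only dashboard hides.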

Token Caching

Provider-side reuse of already-processed prompt tokens (typically a repeated prefix) across requests; cuts cost and TTFT.

Warm Transfer

Live hand-off: the bot briefs the human agent before connecting the caller.

Word-level Timestamps

TTS metadata mapping each word to its playback time; vital for rollback, i.e. trimming the bot's recorded reply to the words the user actually heard before interrupting.
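
A minimal sketch of rollback after an interruption, under the assumption that rollback means keeping only the words whose audio finished playing before the user cut in. `WordTiming` and `spoken_before` are hypothetical names; real TTS services return timing data in their own formats.

```python
from typing import List, NamedTuple


class WordTiming(NamedTuple):
    word: str
    start_s: float  # playback start, in seconds from the start of the audio
    end_s: float    # playback end, in seconds from the start of the audio


def spoken_before(timings: List[WordTiming], interrupted_at_s: float) -> str:
    """Return the text whose audio had fully played by the interruption time."""
    return " ".join(t.word for t in timings if t.end_s <= interrupted_at_s)


timings = [
    WordTiming("Your", 0.0, 0.2),
    WordTiming("order", 0.2, 0.6),
    WordTiming("ships", 0.6, 1.0),
    WordTiming("Tuesday", 1.0, 1.5),
]

# The user interrupts 0.7 s into playback, partway through "ships".
print(spoken_before(timings, 0.7))  # Your order
```

Storing only the returned text in the conversation history keeps the model from assuming the user heard the rest of the sentence.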

Context Summarization

LLM-generated abridged history inserted to keep prompts under token limits.
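
One way this can be sketched: when the history exceeds a budget, replace the older messages with a summary and keep the most recent ones verbatim. Everything here is an assumption for illustration: `compact_history` is a hypothetical helper, `summarize` stands in for a call to an LLM, and counting whitespace-separated words is only a crude proxy for real tokenization.

```python
from typing import Callable, List


def rough_token_count(messages: List[str]) -> int:
    """Crude token estimate: whitespace-separated words (not a real tokenizer)."""
    return sum(len(m.split()) for m in messages)


def compact_history(
    messages: List[str],
    max_tokens: int,
    keep_recent: int,
    summarize: Callable[[List[str]], str],
) -> List[str]:
    """Replace older messages with a summary when the history is over budget."""
    if keep_recent < 1:
        raise ValueError("keep_recent must be at least 1")
    if rough_token_count(messages) <= max_tokens:
        return list(messages)
    older, recent = messages[:-keep_recent], messages[-keep_recent:]
    if not older:
        return list(recent)
    return ["Summary of earlier conversation: " + summarize(older)] + list(recent)


def fake_summarize(older: List[str]) -> str:
    """Hypothetical stand-in for an LLM summarization call."""
    return f"{len(older)} earlier messages about a delayed order."


history = [
    "Hi, I am calling about my order from last week.",
    "Sure, can you give me the order number?",
    "It is 1234 and it has not arrived yet.",
    "Thanks, I see it was delayed at the warehouse.",
]
print(compact_history(history, max_tokens=20, keep_recent=2, summarize=fake_summarize))
```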

HIPAA BAA (Business Associate Agreement)

Contract required under U.S. HIPAA rules before a vendor may process protected health information (PHI) on behalf of a healthcare organization.

COPPA (Children's Online Privacy Protection Act)

U.S. law restricting the collection of personal data from children under 13; affects call recording and evals.