"Diarization" is the technical word for something your brain does without thinking: telling voices apart in a conversation. For a machine, it's one of the hardest problems in audio.
Why it's hard
- Similar voices (siblings, same gender, same accent).
- Overlaps: people talk over each other.
- Channel changes: someone moves away from the computer mic and comes back sounding different.
- Noise: air conditioning, a second conversation in another language in the background.
Transcription can hit 99% word accuracy, but if the words are assigned to the wrong speakers, the resulting note is useless.
How we approach it
Per-segment embedding. We split the audio into 1-3 second chunks and pass each one through a speaker-embedding model (trained on Spanish, English, and Portuguese). Each chunk yields a ~512-dimensional vector.
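For illustration, here's a minimal Python sketch of the chunk-and-embed step. The body of `embed_chunk` is a stand-in (our actual model isn't shown); the chunk length, sample rate, and dimension follow the numbers above:

```python
import numpy as np

CHUNK_SECONDS = 2.0      # within the 1-3 s range described above
SAMPLE_RATE = 16_000
EMB_DIM = 512

def embed_chunk(chunk: np.ndarray) -> np.ndarray:
    """Stand-in for the speaker-embedding model: any encoder that maps
    a waveform chunk to a fixed-size, L2-normalized vector fits here."""
    # Deterministic pseudo-embedding for illustration only; a real
    # system calls a trained network instead.
    rng = np.random.default_rng(abs(hash(chunk.tobytes())) % (2**32))
    vec = rng.standard_normal(EMB_DIM)
    return vec / np.linalg.norm(vec)

def embed_segments(audio: np.ndarray) -> np.ndarray:
    """Split the waveform into fixed-length chunks and embed each one."""
    step = int(CHUNK_SECONDS * SAMPLE_RATE)
    chunks = [audio[i:i + step] for i in range(0, len(audio), step)]
    # Drop a trailing fragment too short to embed reliably
    return np.stack([embed_chunk(c) for c in chunks if len(c) >= step // 2])
```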
Progressive clustering. We group the vectors into clusters. The cluster count isn't fixed in advance; it's inferred from the density of the embedding space, so a 2-person meeting doesn't produce five phantom speakers, and an 8-person meeting surfaces all eight.
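To make the density idea concrete, here's a sketch using scikit-learn's DBSCAN as a stand-in for our progressive variant; the `eps` and `min_samples` values are illustrative, not production settings:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_speakers(embeddings: np.ndarray) -> np.ndarray:
    """Group chunk embeddings into speakers without fixing the count.
    DBSCAN infers the number of clusters from density alone."""
    labels = DBSCAN(eps=0.35, min_samples=4, metric="cosine").fit_predict(embeddings)
    # -1 marks low-density chunks (noise, overlaps) left unassigned;
    # every other label is a speaker index.
    return labels
```

In production the clustering runs progressively, updating assignments as new chunks arrive; a batch method like DBSCAN just shows why no speaker count has to be chosen up front.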
Prosody refinement. We use rhythm, pauses, and intonation as secondary signals to sharpen speaker boundaries when the embeddings alone are ambiguous.
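Conceptually, this refinement is a tie-breaker that only fires in the ambiguous zone. A toy decision rule, with made-up thresholds (the real ones are tuned per language):

```python
def is_speaker_change(similarity: float, pause_ms: float, pitch_jump_hz: float) -> bool:
    """Tie-breaker for a chunk boundary; all thresholds are illustrative."""
    if similarity > 0.75:   # embeddings clearly agree: same speaker
        return False
    if similarity < 0.45:   # embeddings clearly disagree: new speaker
        return True
    # Ambiguous zone: lean on prosody. A long pause or a large
    # intonation jump makes a speaker change more likely.
    return pause_ms > 400 or pitch_jump_hz > 40.0
```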
Optional re-identification. If the user labels a speaker ("this is Juan"), the embedding is stored encrypted in their account. In future meetings, Juan is recognized automatically.
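The matching itself is a similarity lookup. A sketch, assuming the stored embeddings have already been decrypted into memory; the names and the 0.7 threshold are illustrative:

```python
import numpy as np

def identify_speaker(centroid: np.ndarray,
                     enrolled: dict[str, np.ndarray],
                     threshold: float = 0.7) -> str | None:
    """Match a cluster centroid against the user's labeled embeddings.
    `enrolled` maps a label ("Juan") to a stored embedding, decrypted
    in memory for the comparison."""
    best_name, best_sim = None, threshold
    for name, ref in enrolled.items():
        sim = float(np.dot(centroid, ref) /
                    (np.linalg.norm(centroid) * np.linalg.norm(ref)))
        if sim > best_sim:
            best_name, best_sim = name, sim
    return best_name   # None means a new, unlabeled speaker
```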
What we DON'T do
We don't upload your voice to any public repository. Embeddings belong to you alone and are deleted on request. Cross-account identification doesn't exist.
Results
On internal benchmarks with typical Zoom audio:
- Speaker error rate: 4.7% in Spanish, 5.1% in English.
- With more than 5 speakers, it rises to ~9%.
- Overlaps detected at 87% precision.
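Speaker error rate here means, roughly, the fraction of speech time attributed to the wrong speaker. A simplified frame-level scorer, for reference (real scoring uses an optimal one-to-one label mapping and forgiveness collars around boundaries):

```python
import numpy as np

def speaker_error_rate(ref: np.ndarray, hyp: np.ndarray) -> float:
    """Fraction of frames assigned to the wrong speaker, after mapping
    each hypothesis label to the reference label it overlaps most."""
    mapping = {}
    for h in np.unique(hyp):
        vals, counts = np.unique(ref[hyp == h], return_counts=True)
        mapping[h] = vals[np.argmax(counts)]   # most-overlapped ref label
    mapped = np.array([mapping[h] for h in hyp])
    return float(np.mean(mapped != ref))
```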
Compared to generic services, which sit at 12-18% error in Spanish, the gap is notable. The reason is simple: we optimize for the real use case, team meetings and interviews, not professional podcasts.