"Diarization" is the technical word for something your brain does without thinking: telling voices apart in a conversation. For a machine, it's one of the hardest problems in audio.
Why it's hard
- Similar voices (siblings, same gender, same accent).
- Overlaps: people talk over each other.
- Channel changes: someone moves away from the computer mic and comes back sounding different.
- Noise: air conditioning, a second conversation in another language in the background.
Transcription can hit 99% word accuracy, but if the words are assigned to the wrong speakers, the resulting note is useless.
How we approach it
Per-segment embedding. We split the audio into 1-3 second chunks and pass each one through a speaker-embedding model (trained on Spanish, English, and Portuguese). Each chunk yields a ~512-dimensional vector.
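For illustration, here's a minimal Python sketch of the chunk-and-embed step. The body of `embed_chunk` is a stand-in (our actual model isn't shown); the chunk length, sample rate, and dimension follow the numbers above:

```python
import numpy as np

CHUNK_SECONDS = 2.0      # within the 1-3 s range described above
SAMPLE_RATE = 16_000
EMB_DIM = 512

def embed_chunk(chunk: np.ndarray) -> np.ndarray:
    """Stand-in for the speaker-embedding model: any encoder that maps
    a waveform chunk to a fixed-size, L2-normalized vector fits here."""
    # Deterministic pseudo-embedding for illustration only; a real
    # system calls a trained network instead.
    rng = np.random.default_rng(abs(hash(chunk.tobytes())) % (2**32))
    vec = rng.standard_normal(EMB_DIM)
    return vec / np.linalg.norm(vec)

def embed_segments(audio: np.ndarray) -> np.ndarray:
    """Split the waveform into fixed-length chunks and embed each one."""
    step = int(CHUNK_SECONDS * SAMPLE_RATE)
    chunks = [audio[i:i + step] for i in range(0, len(audio), step)]
    # Drop a trailing fragment too short to embed reliably
    return np.stack([embed_chunk(c) for c in chunks if len(c) >= step // 2])
```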
Progressive clustering. We group the vectors into clusters. The cluster count isn't fixed in advance; it's inferred from the density of the embedding space, so a 2-person meeting doesn't produce five phantom speakers, and an 8-person meeting surfaces all eight.
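To make the density idea concrete, here's a sketch using scikit-learn's DBSCAN as a stand-in for our progressive variant; the `eps` and `min_samples` values are illustrative, not production settings:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_speakers(embeddings: np.ndarray) -> np.ndarray:
    """Group chunk embeddings into speakers without fixing the count.
    DBSCAN infers the number of clusters from density alone."""
    labels = DBSCAN(eps=0.35, min_samples=4, metric="cosine").fit_predict(embeddings)
    # -1 marks low-density chunks (noise, overlaps) left unassigned;
    # every other label is a speaker index.
    return labels
```

In production the clustering runs progressively, updating assignments as new chunks arrive; a batch method like DBSCAN just shows why no speaker count has to be chosen up front.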
Prosody refinement. We use rhythm, pauses, and intonation as secondary signals to sharpen speaker boundaries when the embeddings alone are ambiguous.
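Conceptually, this refinement is a tie-breaker that only fires in the ambiguous zone. A toy decision rule, with made-up thresholds (the real ones are tuned per language):

```python
def is_speaker_change(similarity: float, pause_ms: float, pitch_jump_hz: float) -> bool:
    """Tie-breaker for a chunk boundary; all thresholds are illustrative."""
    if similarity > 0.75:   # embeddings clearly agree: same speaker
        return False
    if similarity < 0.45:   # embeddings clearly disagree: new speaker
        return True
    # Ambiguous zone: lean on prosody. A long pause or a large
    # intonation jump makes a speaker change more likely.
    return pause_ms > 400 or pitch_jump_hz > 40.0
```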
Optional re-identification. If the user labels a speaker ("this is Juan"), the embedding is stored encrypted in their account. In future meetings, Juan is recognized automatically.
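The matching itself is a similarity lookup. A sketch, assuming the stored embeddings have already been decrypted into memory; the names and the 0.7 threshold are illustrative:

```python
import numpy as np

def identify_speaker(centroid: np.ndarray,
                     enrolled: dict[str, np.ndarray],
                     threshold: float = 0.7) -> str | None:
    """Match a cluster centroid against the user's labeled embeddings.
    `enrolled` maps a label ("Juan") to a stored embedding, decrypted
    in memory for the comparison."""
    best_name, best_sim = None, threshold
    for name, ref in enrolled.items():
        sim = float(np.dot(centroid, ref) /
                    (np.linalg.norm(centroid) * np.linalg.norm(ref)))
        if sim > best_sim:
            best_name, best_sim = name, sim
    return best_name   # None means a new, unlabeled speaker
```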
What we DON'T do
We don't upload your voice to any public repository. Embeddings belong to you alone and are deleted on request. Cross-account identification doesn't exist.
Results
On internal benchmarks with typical Zoom audio:
- Speaker error rate: 4.7% in Spanish, 5.1% in English.
- With more than 5 speakers, it rises to ~9%.
- Overlaps detected at 87% precision.
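Speaker error rate here means, roughly, the fraction of speech time attributed to the wrong speaker. A simplified frame-level scorer, for reference (real scoring uses an optimal one-to-one label mapping and forgiveness collars around boundaries):

```python
import numpy as np

def speaker_error_rate(ref: np.ndarray, hyp: np.ndarray) -> float:
    """Fraction of frames assigned to the wrong speaker, after mapping
    each hypothesis label to the reference label it overlaps most."""
    mapping = {}
    for h in np.unique(hyp):
        vals, counts = np.unique(ref[hyp == h], return_counts=True)
        mapping[h] = vals[np.argmax(counts)]   # most-overlapped ref label
    mapped = np.array([mapping[h] for h in hyp])
    return float(np.mean(mapped != ref))
```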
Compared to generic services, which sit at 12-18% error in Spanish, the gap is notable. The reason is simple: we optimize for the real use case, team meetings and interviews, not professional podcasts.