Building a great notes product isn't about picking the newest model. It's about orchestrating three models in the right order and knowing when each one wins.
Layer 1 — Transcription
We start with Whisper Large v3 (OpenAI) self-hosted on our own GPUs. Reasons:
- Native multilingual (>90 languages at good quality).
- Noise robustness with no commercial equivalent.
- Predictable cost when we run it.
For long audio (>30 min), we pre-segment with VAD (voice activity detection) so the model doesn't drift. For meetings that mix languages, we detect the language per chunk and route each chunk to a language-specific transcription pass.
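A minimal sketch of that flow, assuming the open-source openai-whisper package plus Silero VAD; the helper structure and parameters are illustrative, not our production pipeline:

```python
# Sketch: VAD pre-segmentation + per-chunk language detection before Whisper.
# Library choices (openai-whisper, silero-vad) and defaults are assumptions.
import torch
import whisper

SAMPLE_RATE = 16_000

# Load once: Whisper Large v3 plus the Silero VAD model from torch.hub.
asr = whisper.load_model("large-v3")
vad_model, vad_utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps, _, read_audio, _, _ = vad_utils

def transcribe_long_audio(path: str) -> list[dict]:
    wav = read_audio(path, sampling_rate=SAMPLE_RATE)
    # 1) Pre-segment with VAD so each chunk is bounded speech, not 90 minutes of audio.
    speech = get_speech_timestamps(wav, vad_model, sampling_rate=SAMPLE_RATE)
    results = []
    for seg in speech:
        chunk = wav[seg["start"]:seg["end"]].numpy()
        # 2) Detect this chunk's language (on its first 30 s), then run a language-specific pass.
        mel = whisper.log_mel_spectrogram(
            whisper.pad_or_trim(chunk), n_mels=asr.dims.n_mels
        ).to(asr.device)
        _, probs = asr.detect_language(mel)
        lang = max(probs, key=probs.get)
        out = asr.transcribe(chunk, language=lang)
        results.append({
            "start_s": seg["start"] / SAMPLE_RATE,
            "end_s": seg["end"] / SAMPLE_RATE,
            "language": lang,
            "text": out["text"].strip(),
        })
    return results
```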
Layer 2 — Diarization + alignment
As we covered in the dedicated article, this layer is in-house: embeddings + clustering + prosody. The output is Whisper's transcript enriched with speaker labels per turn.
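To give the shape of the idea, here is a schematic sketch of the embeddings + clustering step. embed_speaker() stands in for a voice-embedding model, the clustering settings are illustrative, and the prosody features of the real layer are omitted:

```python
# Schematic sketch of diarization as embeddings + clustering; not the in-house layer.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def diarize(segments: list[dict], audio, embed_speaker, n_speakers: int | None = None) -> list[dict]:
    """segments: Whisper output dicts with 'start', 'end', 'text'."""
    # One voice embedding per transcript segment (embed_speaker is hypothetical).
    embs = np.stack([embed_speaker(audio, s["start"], s["end"]) for s in segments])
    clustering = AgglomerativeClustering(
        n_clusters=n_speakers,                              # known speaker count, or...
        distance_threshold=None if n_speakers else 1.0,     # ...estimate it (threshold is illustrative)
        metric="cosine",
        linkage="average",
    ).fit(embs)
    # Attach a speaker label to each turn of the transcript.
    return [
        {**seg, "speaker": f"SPEAKER_{label:02d}"}
        for seg, label in zip(segments, clustering.labels_)
    ]
```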
Layer 3 — Synthesis (LLM)
Here we use Gemini 2.5 Pro as the main model for three reasons:
- Long context: 1M tokens. A 90-minute meeting fits without truncation.
- Spanish quality: in our benchmarks, it beats GPT-4 on real Spanish meeting summaries.
- Cost/perf: in practice it delivers what the premium models deliver, at a lower cost.
For specific tasks (task classification, date extraction) we use smaller, cheaper models — you don't need a cannon for that work.
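A sketch of how that routing can look with the google-genai SDK; the prompts and the choice of gemini-2.5-flash as the small tier are placeholders, not our exact setup:

```python
# Sketch: route heavy synthesis to Gemini 2.5 Pro, narrow tasks to a smaller model.
from google import genai

client = genai.Client()  # reads the API key from the environment

def summarize_meeting(labeled_transcript: str) -> str:
    # Long-context synthesis: the full diarized transcript goes to Gemini 2.5 Pro.
    resp = client.models.generate_content(
        model="gemini-2.5-pro",
        contents=f"Summarize this meeting with decisions and action items:\n\n{labeled_transcript}",
    )
    return resp.text

def extract_dates(labeled_transcript: str) -> str:
    # Narrow task: a smaller, cheaper model is enough (model choice is an example).
    resp = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=f"List every date or deadline mentioned, one per line:\n\n{labeled_transcript}",
    )
    return resp.text
```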
Layer 4 — Chat over the note
This is the layer we're most fond of. We run embedding-based retrieval over the diarized transcript: when you ask something, we pull up the exact fragments and feed them to the model. That means every answer carries verifiable citations (timestamp + literal text).
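A compact sketch of the retrieval step, assuming sentence-transformers for the embeddings; the model name and top_k are illustrative choices, not ours:

```python
# Sketch: embed diarized turns once, then answer questions from the best-matching
# fragments, keeping timestamp + literal text for citations.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def index_turns(turns: list[dict]) -> np.ndarray:
    """turns: [{'start_s': ..., 'speaker': ..., 'text': ...}, ...]"""
    return encoder.encode([t["text"] for t in turns], normalize_embeddings=True)

def retrieve(question: str, turns: list[dict], index: np.ndarray, top_k: int = 3) -> list[dict]:
    q = encoder.encode([question], normalize_embeddings=True)[0]
    scores = index @ q  # cosine similarity, since vectors are normalized
    best = np.argsort(scores)[::-1][:top_k]
    # Each hit keeps its timestamp + literal text, so the answer can cite it.
    return [
        {"timestamp": turns[i]["start_s"], "speaker": turns[i]["speaker"], "quote": turns[i]["text"]}
        for i in best
    ]
```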
The principle
Each model sits where it's best, not where it's flashiest. The stack gets updated when a new model wins real evals vs. the current one — not when an announcement drops.