AI · 6 min

Whisper, Gemini and LLMs: the technology behind your notes

Building a great notes product isn't about picking the newest model. It's about orchestrating three models in the right order and knowing when each one wins.

Layer 1 — Transcription

We start with Whisper Large v3 (OpenAI) self-hosted on our own GPUs. Reasons:

  • Native multilingual support (90+ languages at good quality).
  • Noise robustness without a commercial equivalent.
  • Predictable cost when we run it ourselves.

For long audio (>30 min), we pre-segment with VAD (voice activity detection) so the model doesn't drift. For meetings that mix languages, we detect the language per chunk and route each chunk through a language-specific pass.
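A rough sketch of that idea, using faster-whisper (an open-source Whisper runtime) and its built-in Silero VAD filter; this is illustrative, not our exact production pipeline:

```python
# Illustrative: VAD-gated long-audio transcription with faster-whisper.
# Chunking and per-chunk language routing are simplified here.
from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda", compute_type="float16")

def transcribe_long_audio(path: str):
    # vad_filter=True runs Silero VAD first, so long silences and noise
    # never reach Whisper and it is much less likely to drift.
    segments, info = model.transcribe(
        path,
        vad_filter=True,
        vad_parameters={"min_silence_duration_ms": 500},
    )
    print(f"Detected language: {info.language} (p={info.language_probability:.2f})")
    return [
        {"start": s.start, "end": s.end, "text": s.text.strip()}
        for s in segments
    ]
```

The timestamped segments coming out of this step are what the next layer enriches with speaker labels.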

Layer 2 — Diarization + alignment

As we covered in the dedicated article, this layer is in-house: embeddings + clustering + prosody. The output is Whisper's transcript enriched with speaker labels per turn.
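To make that concrete, here is a minimal sketch of the clustering step, assuming you already have one speaker embedding per segment (from any speaker-embedding model, e.g. ECAPA-TDNN); the prosody signals and production heuristics are left out:

```python
# Illustrative: turning per-segment speaker embeddings into speaker labels.
# `embeddings` is an (n_segments, d) array from any speaker-embedding model.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def assign_speakers(segments, embeddings, distance_threshold=0.7):
    # Cosine-distance agglomerative clustering: segments whose voices are
    # close enough end up in the same cluster (= the same speaker).
    clustering = AgglomerativeClustering(
        n_clusters=None,
        distance_threshold=distance_threshold,
        metric="cosine",
        linkage="average",
    )
    labels = clustering.fit_predict(embeddings)
    return [
        {**seg, "speaker": f"Speaker {label + 1}"}
        for seg, label in zip(segments, labels)
    ]
```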

Layer 3 — Synthesis (LLM)

Here we use Gemini 2.5 Pro as the main model for three reasons:

  1. Long context: 1M tokens. A 90-minute meeting fits without truncation.
  2. Spanish quality: in our benchmarks, it beats GPT-4 on real Spanish meeting summaries.
  3. Cost/performance: in practice, it does what the premium models do at a lower cost.
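As a simplified sketch of the synthesis call (using the google-genai SDK; the prompt and output format here are illustrative, not our production prompts):

```python
# Simplified sketch of the synthesis step with Gemini 2.5 Pro.
from google import genai

client = genai.Client()  # picks up the API key from the environment

def summarize_meeting(diarized_transcript: str) -> str:
    prompt = (
        "You are a meeting assistant. From the diarized transcript below, "
        "produce: (1) a short summary, (2) decisions, (3) action items with "
        "owner and due date when mentioned. Answer in the meeting's language.\n\n"
        + diarized_transcript
    )
    # 1M-token context: even a 90-minute transcript fits in a single call.
    response = client.models.generate_content(
        model="gemini-2.5-pro",
        contents=prompt,
    )
    return response.text
```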

For narrower tasks (task classification, date extraction) we use smaller, cheaper models; that work doesn't call for heavy artillery.
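A tiny example of that routing, reusing the client above (the lighter model named here is an assumption for illustration, not a statement about our stack):

```python
# Sketch: a narrow extraction task sent to a lighter, cheaper model.
def extract_due_dates(action_items: str) -> str:
    response = client.models.generate_content(
        model="gemini-2.5-flash",  # assumed lighter model, for illustration
        contents=(
            "Extract every due date mentioned below as ISO 8601 dates, "
            "one per line:\n\n" + action_items
        ),
    )
    return response.text
```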

Layer 4 — Chat over the note

This is the layer we're most fond of. We run embedding-based retrieval over the diarized transcript: when you ask something, we retrieve the exact fragments and feed them to the model. That means every answer comes with verifiable citations (timestamp + literal text).
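A condensed sketch of that retrieval step (sentence-transformers is an assumption for the embedding model; chunking, reranking and the final answer prompt are omitted):

```python
# Illustrative: retrieve the most relevant transcript turns for a question,
# keeping timestamps and literal text so the answer can cite them.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def retrieve_fragments(question: str, turns: list[dict], top_k: int = 3) -> str:
    # turns: [{"start": 61.4, "speaker": "Speaker 2", "text": "..."}, ...]
    corpus_emb = encoder.encode([t["text"] for t in turns], normalize_embeddings=True)
    query_emb = encoder.encode([question], normalize_embeddings=True)[0]

    # Cosine similarity (embeddings are normalized, so a dot product suffices).
    scores = corpus_emb @ query_emb
    best = np.argsort(scores)[::-1][:top_k]

    # Each fragment keeps its timestamp + literal text, which is what makes
    # the final answer's citations verifiable.
    return "\n".join(
        f"[{turns[i]['start']:.0f}s] {turns[i]['speaker']}: {turns[i]['text']}"
        for i in best
    )
```

The returned fragments are passed to the model together with the question, so the answer can only lean on what was actually said.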

The principle

Each model sits where it's best, not where it's flashiest. The stack gets updated when a new model wins real evals vs. the current one — not when an announcement drops.

Ready to try it?

Record your next meeting and get an actionable summary in seconds.

Start free