Audio pipeline design¶
The audio → topics pipeline lives in autorag.agent. It’s
structured as five focused stages so each LLM call has one job and the
heavy stages share an identical prompt prefix for cache reuse.
Stages¶
1. Whisper -> list[WordSpan] 1 call
2. L1 boundaries (single LLM call) -> list[{s,e}] 1 LLM
3a Decide subdivide (per long L1) -> list[bool] N LLM
3b L2 boundaries (per yes-L1, batched) -> list[list[{s,e}]] M LLM (M<=N)
4. Summarize nodes (per L1+L2, batched)-> {title,summary} per node K LLM
5. L0 aggregate -> {title, summary} 1 LLM
Total LLM calls per clip: roughly
2 + N1_long + N1_yes + N1 + N2_total — about 20 calls for a
seven-minute clip.
Boundary calls receive the transcript as a time-bucketed view
(autorag.blocks.format_blocks(), 30-second windows by default,
tunable via boundary_block_seconds — one
MM:SS-MM:SS Speaker K: <words> line per turn instead of one
timestamped line per word, which keeps the boundary prompts compact).
They emit {s, e} as MM:SS strings copied straight from those
range markers; autorag.agent._parse_ts() converts them back to
float seconds before tiling — the model never does the arithmetic.
Per-node summary calls operate on the slice’s plain text (no
timestamps) and emit {title, summary}. The K = N1 + N2 summary
calls share an identical prompt prefix so Ollama’s per-slot prefix
cache pays once.
Final shape: {"topics": [L0]} with L0.children = [L1...],
each L1.children = [L2...] or []. The L0 root is the explicit
“what is this audio about” node.
Default LLM model: gemma4:latest (8B Q4_K_M, ~9.6 GB), a
thinking-capable model. Override via the --llm-model flag on
the CLI or the llm_model kwarg on the SDK methods.
All five stages do mechanical JSON extraction (boundaries, yes/no
subdivide decisions, {title, summary}), so the agent sets
reasoning=False by default — the same determinism/latency
rationale as temperature=0.0. This sends think: false to
Ollama and suppresses gemma4’s chain-of-thought preamble, which would
otherwise be pure latency and a structured-output parse hazard.
reasoning is an overridable kwarg on build_topic_runnable /
build_agent / AutoRAG.generate_topics (default False);
pass reasoning=True to trade latency for chain-of-thought, or with
a non-thinking model where it is a harmless no-op.
Whisper backend¶
autorag.whisper_runner runs whisperX — faster-whisper
(CTranslate2) for transcription plus a wav2vec2 forced-alignment pass
for frame-accurate word timestamps.
After each transcribe_segment call:
The CTranslate2 model is removed from the module cache so Python GC can free VRAM.
The wav2vec2 alignment model is offloaded to CPU via PyTorch
.to("cpu"); the next call restores it to CUDA.On a CUDA error, the runner falls back to CPU.
Diarization¶
autorag.diarize uses pyannote/speaker-diarization-3.1,
which is HuggingFace-gated. HF_TOKEN must be set. Without it (or
on a load / runtime failure), every word is labelled "0" and the
agent logs a warning — output then matches pre-diarization behaviour.
Each WordSpan carries a speaker field
normalized to "0", "1", … in first-appearance order. Both
transcript views the agent feeds the LLM build on
autorag.blocks.group_by_speaker() to coalesce consecutive
same-speaker spans into turns: the boundary stages use
autorag.blocks.format_blocks() (MM:SS-MM:SS Speaker K:
<words>) and the per-node summary input uses Speaker N: <words>,
so the LLM always sees explicit turn-taking.
After each _run_diarization call the pyannote pipeline is
offloaded to CPU and VRAM freed; _ensure_pipeline_on_cuda restores
it on the next call.
LLM model residency¶
Whisper and pyannote are offloaded to CPU between calls; the Ollama
LLM gets the opposite treatment. All five topic stages share one
num_ctx and keep_alive="5m", so the model stays resident in
VRAM for the entire run rather than reloading per stage (a ~15 GB
disk→VRAM load each time). When the run finishes — or any stage
raises — _build_tree issues one throwaway keep_alive=0 call
that evicts the model so it doesn’t squat VRAM during the downstream
embed / /viz step. See Ollama tuning for the num_ctx
uniformity rationale and the num_ctx_l1 escape hatch for very long
audio.
Why split boundaries from summaries¶
Earlier versions of the agent asked one LLM call to do “find the L1
sections AND title and summarize each one.” That confused models on
long clips: section boundaries drifted as the model spent attention
on the prose. Splitting boundary detection (a constrained
[{s, e}] output) from summarization (per-section {title,
summary}) gives:
One focused prompt per call (boundaries OR prose).
A constant prompt prefix across the K summary calls, so the prefix-cache slot stays warm.
Independent retry: a bad boundary call can be replayed without redoing all the summarization work.