Audio pipeline design

The audio → topics pipeline lives in autorag.agent. It’s structured as five focused stages so each LLM call has one job and the heavy stages share an identical prompt prefix for cache reuse.

Stages

1. Whisper                                -> list[WordSpan]              1 call
2. L1 boundaries  (single LLM call)       -> list[{s, e}]                1 LLM
3a. Decide subdivide  (per long L1)       -> list[bool]                  N LLM
3b. L2 boundaries  (per yes-L1, batched)  -> list[list[{s, e}]]          M LLM (M<=N)
4. Summarize nodes  (per L1+L2, batched)  -> {title, summary} per node   K LLM
5. L0 aggregate                           -> {title, summary}            1 LLM

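Total LLM calls per clip: roughly 2 + N1_long + N1_yes + N1 + N2_total, where the 2 covers the L1 boundary call and the L0 aggregate, N1_long is N in the table, N1_yes is M, and N1 + N2_total is K. That works out to about 20 calls for a seven-minute clip.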
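The call count can be checked with a few lines of arithmetic. This is a minimal sketch, not code from autorag: estimate_llm_calls is a hypothetical helper, and the section counts in the example are illustrative numbers rather than measurements.

```python
def estimate_llm_calls(n1: int, n1_long: int, n1_yes: int, n2_total: int) -> int:
    """Estimate LLM calls per clip from the stage table (hypothetical helper).

    n1       : number of L1 sections
    n1_long  : L1 sections long enough to get a subdivide decision (N)
    n1_yes   : L1 sections that were actually subdivided (M, M <= N)
    n2_total : total number of L2 sections across all subdivided L1s
    """
    if not (0 <= n1_yes <= n1_long <= n1):
        raise ValueError("expected 0 <= n1_yes <= n1_long <= n1")
    fixed = 2                      # stage 2 (L1 boundaries) + stage 5 (L0 aggregate)
    summaries = n1 + n2_total      # stage 4: one summary per L1 and per L2 node (K)
    return fixed + n1_long + n1_yes + summaries


# Illustrative only: 6 L1 sections, 4 of them long, 2 subdivided into 3 L2s each.
total = estimate_llm_calls(n1=6, n1_long=4, n1_yes=2, n2_total=6)
print(total)  # 20
```

Because stages 3b and 4 are batched, the real number of round-trips may differ; the formula is an estimate of the work, which is why the text says "roughly".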

Boundary calls receive the transcript as a time-bucketed view built by autorag.blocks.format_blocks(). It uses 30-second windows by default (tunable via boundary_block_seconds) and emits one MM:SS-MM:SS Speaker K: <words> line per turn instead of one timestamped line per word, which keeps the boundary prompts compact. The calls emit {s, e} as MM:SS strings copied straight from those range markers; autorag.agent._parse_ts() converts them back to float seconds before tiling, so the model never does the arithmetic. Per-node summary calls operate on the slice’s plain text (no timestamps) and emit {title, summary}. The K = N1 + N2_total summary calls share an identical prompt prefix, so Ollama’s per-slot prefix cache pays the prefix cost only once.
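The MM:SS round trip described above can be sketched as follows. The parse_ts and boundaries_to_seconds functions here are stand-ins written for illustration; the real autorag.agent._parse_ts() may accept more formats or validate differently.

```python
def parse_ts(ts: str) -> float:
    """Convert an 'MM:SS' marker to float seconds (illustrative stand-in)."""
    minutes, seconds = ts.strip().split(":")
    return int(minutes) * 60 + float(seconds)


def boundaries_to_seconds(raw: list[dict]) -> list[tuple[float, float]]:
    """Turn the model's [{s, e}] output into (start, end) pairs in seconds."""
    return [(parse_ts(b["s"]), parse_ts(b["e"])) for b in raw]


# Example model output: timestamps copied verbatim from the range markers.
raw = [{"s": "00:00", "e": "01:30"}, {"s": "01:30", "e": "07:05"}]
sections = boundaries_to_seconds(raw)
print(sections)  # [(0.0, 90.0), (90.0, 425.0)]
```

Keeping the conversion in Python means the model only has to copy strings it has already seen, which is the property the design relies on.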

Final shape: {"topics": [L0]} with L0.children = [L1...], each L1.children = [L2...] or []. The L0 root is the explicit “what is this audio about” node.
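As an illustration of that shape, the sketch below builds a small tree by hand. Only title, summary, and children are taken from the description above; the node helper, the depth function, and the sample titles are made up for the example, and real nodes may carry additional fields.

```python
import json


def node(title: str, summary: str, children=()) -> dict:
    """Build one topic node (hypothetical helper; real nodes may have more keys)."""
    return {"title": title, "summary": summary, "children": list(children)}


def depth(n: dict) -> int:
    """Number of levels below and including this node."""
    return 1 + max((depth(c) for c in n["children"]), default=0)


# An L1 section that was not subdivided has an empty children list.
l1_intro = node("Introductions", "Host introduces the guest.")
# An L1 section that was subdivided holds its L2 nodes.
l1_main = node(
    "Main discussion",
    "Guest explains the project.",
    [node("Background", "How the project started."),
     node("Roadmap", "What comes next.")],
)
# The single L0 root answers "what is this audio about".
tree = {"topics": [node("Project interview", "An interview about the project.",
                        [l1_intro, l1_main])]}

print(json.dumps(tree, indent=2))
```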

Default LLM model: gemma4:latest (8B Q4_K_M, ~9.6 GB), a thinking-capable model. Override via the --llm-model flag on the CLI or the llm_model kwarg on the SDK methods.

All five LLM stages (2, 3a, 3b, 4, and 5) do mechanical JSON extraction (boundaries, yes/no subdivide decisions, {title, summary}), so the agent sets reasoning=False by default, for the same determinism and latency reasons as temperature=0.0. This sends think: false to Ollama and suppresses gemma4’s chain-of-thought preamble, which would otherwise be pure latency and a structured-output parse hazard. reasoning is an overridable kwarg on build_topic_runnable / build_agent / AutoRAG.generate_topics (default False). Pass reasoning=True to trade latency for chain-of-thought; with a non-thinking model the flag is a harmless no-op.

Whisper backend

autorag.whisper_runner runs whisperX — faster-whisper (CTranslate2) for transcription plus a wav2vec2 forced-alignment pass for frame-accurate word timestamps.

After each transcribe_segment call:

  • The CTranslate2 model is removed from the module cache so Python GC can free VRAM.

  • The wav2vec2 alignment model is offloaded to CPU via PyTorch .to("cpu"); the next call restores it to CUDA.

  • On a CUDA error, the runner falls back to CPU.
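The CPU fallback in the last bullet can be sketched without any GPU libraries. This is an assumption-laden illustration: run_with_cpu_fallback and the transcribe callable are hypothetical, and the sketch assumes CUDA failures surface as RuntimeError, which is how PyTorch reports them. The real runner may catch a narrower or different exception.

```python
def run_with_cpu_fallback(transcribe, audio, devices=("cuda", "cpu")):
    """Try each device in order and return the first successful result.

    `transcribe` is any callable taking (audio, device=...). Hypothetical
    helper, not the autorag.whisper_runner implementation.
    """
    last_error = None
    for device in devices:
        try:
            return transcribe(audio, device=device)
        except RuntimeError as exc:   # assumed shape of a CUDA failure
            last_error = exc
    raise last_error


# Fake transcriber that fails on CUDA, to exercise the fallback path.
def fake_transcribe(audio, device):
    if device == "cuda":
        raise RuntimeError("CUDA out of memory")
    return f"transcribed {audio} on {device}"


result = run_with_cpu_fallback(fake_transcribe, "clip.wav")
print(result)  # transcribed clip.wav on cpu
```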

Diarization

autorag.diarize uses pyannote/speaker-diarization-3.1, which is HuggingFace-gated. HF_TOKEN must be set. Without it (or on a load / runtime failure), every word is labelled "0" and the agent logs a warning — output then matches pre-diarization behaviour.

Each WordSpan carries a speaker field normalized to "0", "1", … in first-appearance order. Both transcript views the agent feeds the LLM build on autorag.blocks.group_by_speaker() to coalesce consecutive same-speaker spans into turns: the boundary stages use autorag.blocks.format_blocks() (MM:SS-MM:SS Speaker K: <words>) and the per-node summary input uses Speaker N: <words>, so the LLM always sees explicit turn-taking.
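The speaker normalization and turn coalescing can be sketched as below. Everything here is a stand-in: the WordSpan fields, normalize_speakers, group_by_speaker, and format_turn are written for illustration and are not the autorag.blocks implementations, which additionally bucket turns into time windows.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass
class WordSpan:
    """Hypothetical field layout; the real WordSpan may differ."""
    text: str
    start: float
    end: float
    speaker: str


def normalize_speakers(spans):
    """Relabel speakers as "0", "1", ... in first-appearance order."""
    mapping = {}
    out = []
    for s in spans:
        label = mapping.setdefault(s.speaker, str(len(mapping)))
        out.append(WordSpan(s.text, s.start, s.end, label))
    return out


def group_by_speaker(spans):
    """Coalesce consecutive same-speaker spans into (speaker, start, end, text)."""
    turns = []
    for speaker, run in groupby(spans, key=lambda s: s.speaker):
        run = list(run)
        text = " ".join(w.text for w in run)
        turns.append((speaker, run[0].start, run[-1].end, text))
    return turns


def mmss(t: float) -> str:
    minutes, seconds = divmod(int(t), 60)
    return f"{minutes:02d}:{seconds:02d}"


def format_turn(turn) -> str:
    speaker, start, end, text = turn
    return f"{mmss(start)}-{mmss(end)} Speaker {speaker}: {text}"


spans = normalize_speakers([
    WordSpan("hello", 0.0, 0.5, "SPEAKER_01"),
    WordSpan("there", 0.6, 1.2, "SPEAKER_01"),
    WordSpan("hi", 61.0, 61.4, "SPEAKER_00"),
    WordSpan("welcome", 62.0, 62.8, "SPEAKER_01"),
])
turns = group_by_speaker(spans)
lines = [format_turn(t) for t in turns]
print("\n".join(lines))
```

Note that the first speaker heard becomes "0" regardless of the label the diarizer assigned, and that a speaker who returns later starts a new turn rather than being merged with their earlier one.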

After each _run_diarization call the pyannote pipeline is offloaded to CPU and VRAM freed; _ensure_pipeline_on_cuda restores it on the next call.

LLM model residency

Whisper and pyannote are offloaded to CPU between calls; the Ollama LLM gets the opposite treatment. All five topic stages share one num_ctx and keep_alive="5m", so the model stays resident in VRAM for the entire run rather than reloading per stage (a ~15 GB disk→VRAM load each time). When the run finishes — or any stage raises — _build_tree issues one throwaway keep_alive=0 call that evicts the model so it doesn’t squat VRAM during the downstream embed / /viz step. See Ollama tuning for the num_ctx uniformity rationale and the num_ctx_l1 escape hatch for very long audio.
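The residency pattern amounts to a try/finally around the stage loop. The sketch below is illustrative, not the _build_tree code: run_stages and call_llm are hypothetical names, the num_ctx value is a placeholder, and the transport is injected so the example runs without a server. The keep_alive field and the use of a prompt-less request with keep_alive set to 0 to unload a model are standard Ollama behaviour.

```python
def run_stages(prompts, call_llm, model="gemma4:latest", num_ctx=8192):
    """Run every stage with the same num_ctx and keep_alive, then evict.

    `call_llm` is any callable that sends one request body to Ollama and
    returns the response. The num_ctx default is a placeholder value.
    """
    options = {"num_ctx": num_ctx}   # identical for every stage, so no reload
    results = []
    try:
        for prompt in prompts:
            results.append(call_llm({
                "model": model,
                "prompt": prompt,
                "stream": False,
                "options": options,
                "keep_alive": "5m",  # keep the model resident between stages
            }))
        return results
    finally:
        # Runs on success and on failure: one throwaway call that unloads the model.
        call_llm({"model": model, "keep_alive": 0})


# Fake transport that records requests instead of contacting a server.
sent = []

def fake_call(body):
    sent.append(body)
    return {"response": "ok"}

results = run_stages(["stage 2 prompt", "stage 4 prompt"], fake_call)
print(len(sent), sent[-1])
```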

Why split boundaries from summaries

Earlier versions of the agent asked one LLM call to do “find the L1 sections AND title and summarize each one.” That confused models on long clips: section boundaries drifted as the model spent attention on the prose. Splitting boundary detection (a constrained [{s, e}] output) from summarization (per-section {title, summary}) gives:

  • One focused prompt per call (boundaries OR prose).

  • A constant prompt prefix across the K summary calls, so the prefix-cache slot stays warm.

  • Independent retry: a bad boundary call can be replayed without redoing all the summarization work.
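The constant-prefix point can be illustrated with a small prompt builder. This is a sketch under assumptions: SUMMARY_PREFIX and build_summary_prompts are invented for the example, and the actual prompt wording in autorag is not shown in this document. The idea is only that the fixed instructions come first and the per-node slice comes last, so every summary call starts with the same text.

```python
import os

# Hypothetical instruction text; the real prompt wording is not documented here.
SUMMARY_PREFIX = (
    "You summarize one section of a transcript.\n"
    "Reply with JSON: {\"title\": ..., \"summary\": ...}.\n"
    "Section:\n"
)


def build_summary_prompts(slices):
    """Put the fixed instructions first and the variable slice last."""
    return [SUMMARY_PREFIX + text for text in slices]


prompts = build_summary_prompts([
    "Speaker 0: welcome to the show",
    "Speaker 1: thanks for having me",
])
shared = os.path.commonprefix(prompts)
print(len(shared) >= len(SUMMARY_PREFIX))  # True
```

Putting the variable text before the instructions would shrink the shared prefix to almost nothing, and each of the K calls would pay the full prompt-processing cost.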