Audio pipeline design
=====================

The audio → topics pipeline lives in :mod:`autorag.agent`. It's structured as
five focused stages so each LLM call has one job and the heavy stages share an
identical prompt prefix for cache reuse.

Stages
------

::

    1. Whisper                              -> list[WordSpan]            1 call
    2. L1 boundaries (single LLM call)      -> list[{s,e}]               1 LLM
    3a Decide subdivide (per long L1)       -> list[bool]                N LLM
    3b L2 boundaries (per yes-L1, batched)  -> list[list[{s,e}]]         M LLM (M<=N)
    4. Summarize nodes (per L1+L2, batched) -> {title,summary} per node  K LLM
    5. L0 aggregate                         -> {title, summary}          1 LLM

Total LLM calls per clip: roughly ``2 + N1_long + N1_yes + N1 + N2_total``,
which is about 20 calls for a seven-minute clip.

Boundary calls receive the transcript as a time-bucketed view built by
:func:`autorag.blocks.format_blocks` (30-second windows by default, tunable via
``boundary_block_seconds``). The view has one ``MM:SS-MM:SS Speaker K:`` line
per turn instead of one timestamped line per word, which keeps the boundary
prompts compact. The boundary calls emit ``{s, e}`` as ``MM:SS`` strings copied
straight from those range markers; :func:`autorag.agent._parse_ts` converts
them back to float seconds before tiling, so the model never does the
arithmetic.

Per-node summary calls operate on the slice's plain text (no timestamps) and
emit ``{title, summary}``. The ``K = N1 + N2`` summary calls share an identical
prompt prefix, so the prefix cost is paid only once through Ollama's per-slot
prefix cache.

Final shape: ``{"topics": [L0]}`` with ``L0.children = [L1...]``, each
``L1.children = [L2...]`` or ``[]``. The L0 root is the explicit "what is this
audio about" node.

Default LLM model: ``gemma4:latest`` (8B Q4_K_M, ~9.6 GB), a
``thinking``-capable model. Override via the ``--llm-model`` flag on the CLI or
the ``llm_model`` kwarg on the SDK methods.

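The ``MM:SS`` to float-seconds conversion that the boundary stages rely on can
be sketched as follows. This is a minimal illustration, not the source of
:func:`autorag.agent._parse_ts`: the names ``parse_ts`` and
``spans_to_seconds`` are hypothetical, and the sketch assumes the model always
returns two-part ``MM:SS`` strings under the ``s`` and ``e`` keys.

```python
def parse_ts(ts: str) -> float:
    """Convert an "MM:SS" string to float seconds (hypothetical helper)."""
    parts = ts.strip().split(":")
    if len(parts) != 2:
        raise ValueError(f"expected MM:SS, got {ts!r}")
    minutes, seconds = int(parts[0]), float(parts[1])
    if minutes < 0 or not 0 <= seconds < 60:
        raise ValueError(f"out-of-range timestamp: {ts!r}")
    return minutes * 60 + seconds


def spans_to_seconds(spans: list[dict[str, str]]) -> list[tuple[float, float]]:
    """Turn the LLM's [{"s": "MM:SS", "e": "MM:SS"}, ...] into (start, end) floats."""
    result = []
    for span in spans:
        start, end = parse_ts(span["s"]), parse_ts(span["e"])
        if end <= start:
            # Reject empty or reversed spans instead of guessing a repair.
            raise ValueError(f"empty or reversed span: {span!r}")
        result.append((start, end))
    return result


if __name__ == "__main__":
    llm_output = [{"s": "00:00", "e": "02:30"}, {"s": "02:30", "e": "07:05"}]
    print(spans_to_seconds(llm_output))
```

Doing the conversion in Python keeps the model's job to copying markers it has
already seen in the prompt, which is the point of the design described above.
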
All five stages do mechanical JSON extraction (boundaries, yes/no subdivide
decisions, ``{title, summary}``), so the agent sets ``reasoning=False`` by
default, for the same determinism/latency rationale as ``temperature=0.0``.
This sends ``think: false`` to Ollama and suppresses gemma4's chain-of-thought
preamble, which would otherwise be pure latency and a structured-output parse
hazard. ``reasoning`` is an overridable kwarg on ``build_topic_runnable`` /
``build_agent`` / ``AutoRAG.generate_topics`` (default ``False``). Pass
``reasoning=True`` to trade latency for chain-of-thought; with a non-thinking
model the flag is a harmless no-op.

Whisper backend
---------------

:mod:`autorag.whisper_runner` runs whisperX: faster-whisper (CTranslate2) for
transcription plus a wav2vec2 forced-alignment pass for frame-accurate word
timestamps. After each ``transcribe_segment`` call:

* The CTranslate2 model is removed from the module cache so Python GC can free
  VRAM.
* The wav2vec2 alignment model is offloaded to CPU via PyTorch ``.to("cpu")``;
  the next call restores it to CUDA.
* On a CUDA error, the runner falls back to CPU.

Diarization
-----------

:mod:`autorag.diarize` uses ``pyannote/speaker-diarization-3.1``, which is
HuggingFace-gated, so ``HF_TOKEN`` must be set. Without it (or on a load /
runtime failure), every word is labelled ``"0"`` and the agent logs a warning;
output then matches pre-diarization behaviour.

Each :data:`~autorag.types.WordSpan` carries a ``speaker`` field normalized to
``"0"``, ``"1"``, … in first-appearance order. Both transcript views the agent
feeds the LLM build on :func:`autorag.blocks.group_by_speaker` to coalesce
consecutive same-speaker spans into turns: the boundary stages use
:func:`autorag.blocks.format_blocks` (``MM:SS-MM:SS Speaker K:``) and the
per-node summary input uses ``Speaker N:``, so the LLM always sees explicit
turn-taking.

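The turn coalescing described above can be sketched as follows. This is an
illustration under stated assumptions, not the code in :mod:`autorag.blocks`:
``Word`` is a hypothetical stand-in for ``WordSpan`` (its field names are
assumed), ``turns``, ``boundary_view`` and ``summary_view`` are hypothetical
names, and the sketch emits one line per turn without the 30-second bucketing
that ``format_blocks`` applies.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass
class Word:
    """Hypothetical stand-in for autorag.types.WordSpan (field names assumed)."""
    text: str
    start: float  # seconds
    end: float    # seconds
    speaker: str  # "0", "1", ... in first-appearance order


def mmss(seconds: float) -> str:
    """Render float seconds as a zero-padded MM:SS marker (truncating)."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


def turns(words: list[Word]) -> list[tuple[str, float, float, str]]:
    """Coalesce consecutive same-speaker words into (speaker, start, end, text)."""
    out = []
    for speaker, group in groupby(words, key=lambda w: w.speaker):
        chunk = list(group)
        text = " ".join(w.text for w in chunk)
        out.append((speaker, chunk[0].start, chunk[-1].end, text))
    return out


def boundary_view(words: list[Word]) -> str:
    """One 'MM:SS-MM:SS Speaker K: ...' line per turn (no time bucketing here)."""
    return "\n".join(
        f"{mmss(s)}-{mmss(e)} Speaker {spk}: {text}" for spk, s, e, text in turns(words)
    )


def summary_view(words: list[Word]) -> str:
    """One 'Speaker N: ...' line per turn, with no timestamps."""
    return "\n".join(f"Speaker {spk}: {text}" for spk, _, _, text in turns(words))


if __name__ == "__main__":
    demo = [
        Word("hello", 0.0, 0.4, "0"),
        Word("there", 0.5, 0.9, "0"),
        Word("hi", 61.2, 61.5, "1"),
    ]
    print(boundary_view(demo))
    print(summary_view(demo))
```

``itertools.groupby`` only merges adjacent items, so a speaker who returns
later in the clip starts a new turn, which is what preserves the turn-taking
order in both views.
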
After each ``_run_diarization`` call the pyannote pipeline is offloaded to CPU
and VRAM freed; ``_ensure_pipeline_on_cuda`` restores it on the next call.

LLM model residency
-------------------

Whisper and pyannote are offloaded to CPU between calls; the Ollama LLM gets
the opposite treatment. All five topic stages share one ``num_ctx`` and
``keep_alive="5m"``, so the model stays resident in VRAM for the entire run
rather than reloading per stage (a ~15 GB disk→VRAM load each time). When the
run finishes, or any stage raises, ``_build_tree`` issues one throwaway
``keep_alive=0`` call that evicts the model so it doesn't squat VRAM during the
downstream embed / ``/viz`` step. See :doc:`ollama-tuning` for the ``num_ctx``
uniformity rationale and the ``num_ctx_l1`` escape hatch for very long audio.

Why split boundaries from summaries
-----------------------------------

Earlier versions of the agent asked one LLM call to do "find the L1 sections
AND title and summarize each one." That confused models on long clips: section
boundaries drifted as the model spent attention on the prose. Splitting
boundary detection (a constrained ``[{s, e}]`` output) from summarization
(per-section ``{title, summary}``) gives:

* One focused prompt per call (boundaries OR prose).
* A constant prompt prefix across the K summary calls, so the prefix-cache slot
  stays warm.
* Independent retry: a bad boundary call can be replayed without redoing all
  the summarization work.
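
The constant-prefix property in the list above can be sketched as follows.
This is an illustration only: ``SUMMARY_PREFIX`` and ``build_summary_prompts``
are hypothetical names, and the prompt wording is invented for the example
rather than taken from the agent.

```python
import os

# Hypothetical instruction block; the agent's real prompt text is not shown here.
SUMMARY_PREFIX = (
    "You summarize one section of an audio transcript.\n"
    'Reply with JSON only: {"title": "...", "summary": "..."}.\n'
    "Section transcript:\n"
)


def build_summary_prompts(node_texts: list[str]) -> list[str]:
    """Build one prompt per node, keeping all node-specific text after the prefix."""
    return [SUMMARY_PREFIX + text for text in node_texts]


def shared_prefix_length(prompts: list[str]) -> int:
    """Number of leading characters that every prompt has in common."""
    # os.path.commonprefix compares character by character, so it works on any strings.
    return len(os.path.commonprefix(prompts))


if __name__ == "__main__":
    prompts = build_summary_prompts(
        ["Speaker 0: welcome to the show", "Speaker 1: thanks for having me"]
    )
    print(shared_prefix_length(prompts) >= len(SUMMARY_PREFIX))
```

Placing everything that varies per node at the end of the prompt is what lets
a prefix cache reuse the work done for the instruction block; anything
node-specific placed earlier would shorten the shared prefix.
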