Audio→topics agent (autorag.agent)
Multi-pass L0/L1/L2 topic extraction pipeline. Surfaces:
- transcribe_audio(): Whisper + diarization → WordSpan list.
- generate_topics(): pure-LLM topic extraction on a pre-computed transcript.
- build_topic_runnable(): the LangChain Runnable[list[WordSpan], TopicTree] used by generate_topics.
- build_agent(): the combined Whisper + diarization + topics Runnable[Path | str, TranscriptionResult].
Most callers should go through AutoRAG rather
than importing this module directly. The pipeline design is documented
in Audio pipeline design.
Audio → hierarchical topic tree. The single agent for AutoRAG.
Multi-pass L0 / L1 / L2 extractor — each LLM stage has one focused job:
1.  Whisper                                -> list[WordSpan]             1 call
2.  L1 boundaries (single LLM call)        -> list[{s,e}]                1 LLM
3a. Decide subdivide (per long L1)         -> list[bool]                 N LLM
3b. L2 boundaries (per yes-L1, batched)    -> list[list[{s,e}]]          M LLM (M<=N)
4.  Summarize nodes (per L1+L2, batched)   -> {title,summary} per node   K LLM
5.  L0 aggregate                           -> {title, summary}           1 LLM
Final shape: {"topics": [L0]} with L0.children = [L1...], each
L1.children = [L2...] or []. The L0 root is the explicit “what is
this audio about” node.
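The final shape can be illustrated with a small literal and a depth-first walk. This is a sketch only: the titles and summaries are made-up placeholders, and the node layout assumes just the title, summary and children keys named in this docstring (real nodes may carry more fields, such as time bounds).

```python
from typing import Any, Iterator

# Illustrative tree; all titles/summaries are placeholders, not real output.
tree: dict[str, Any] = {
    "topics": [
        {   # L0 root: "what is this audio about"
            "title": "Quarterly planning call",
            "summary": "Team reviews roadmap and budget.",
            "children": [
                {   # long L1 that was subdivided into L2 nodes
                    "title": "Roadmap",
                    "summary": "Feature priorities for next quarter.",
                    "children": [
                        {"title": "Search", "summary": "Search rework."},
                        {"title": "Export", "summary": "CSV export."},
                    ],
                },
                {   # short L1: children is an empty list
                    "title": "Budget",
                    "summary": "Spending review.",
                    "children": [],
                },
            ],
        }
    ]
}


def walk(node: dict[str, Any], depth: int = 0) -> Iterator[tuple[int, str]]:
    """Yield (depth, title) for a node and its descendants, depth-first."""
    yield depth, node["title"]
    # L2 leaves may have no "children" key in this sketch, hence .get().
    for child in node.get("children", []):
        yield from walk(child, depth + 1)


outline = [(d, t) for root in tree["topics"] for d, t in walk(root)]
```

Depth 0 is the L0 root, depth 1 the L1 topics, and depth 2 the optional L2 subtopics.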
Boundary calls receive a time-bucketed transcript (format_blocks;
boundary_block_seconds, default 30s) and emit {s, e} as MM:SS strings.
Those strings are parsed back to float seconds here, in code, never by the
LLM, so there is no model-side arithmetic. Per-node summary calls operate on
the slice’s plain text (no timestamps) and emit {title, summary}. The
K = N1 + N2 summary calls (one per L1 node plus one per L2 node) share an
identical prompt prefix for cache reuse.
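The bucketing and the MM:SS round trip can be sketched as follows. This is not the module's format_blocks; the Word class, the "[MM:SS] text" line layout, and the helper names are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """Hypothetical stand-in for WordSpan: text plus start time in seconds."""
    text: str
    start: float


def to_mmss(seconds: float) -> str:
    """Format whole seconds as MM:SS (minutes may exceed 59 on long audio)."""
    total = int(seconds)
    return f"{total // 60:02d}:{total % 60:02d}"


def parse_mmss(stamp: str) -> float:
    """Parse an MM:SS string emitted by the model back to float seconds."""
    minutes, secs = stamp.strip().split(":")
    return float(int(minutes) * 60 + int(secs))


def bucket_transcript(words: list[Word], block_seconds: int = 30) -> list[str]:
    """Group words into fixed windows, one line per non-empty window."""
    buckets: dict[int, list[str]] = {}
    for w in words:
        idx = int(w.start // block_seconds)
        buckets.setdefault(idx, []).append(w.text)
    return [
        f"[{to_mmss(idx * block_seconds)}] {' '.join(buckets[idx])}"
        for idx in sorted(buckets)
    ]


words = [Word("hello", 0.5), Word("world", 12.0), Word("next", 31.0), Word("late", 95.2)]
lines = bucket_transcript(words)
```

Keeping the string-to-seconds conversion in code means the model only copies anchors it has already seen in the prompt.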
- autorag.agent.build_agent(*, whisper_model='base', language='en', llm_model='gemma4:latest', ollama_base_url=None, num_ctx_l1=8192, num_ctx_fanout=8192, max_concurrency=8, min_subdivide_duration_s=120.0, reasoning=False, boundary_block_seconds=30)
Build a Runnable mapping audio file -> {transcription, topics:{topics:[L0]}}.
- autorag.agent.build_stage_handlers(*, llm_model='gemma4:latest', ollama_base_url=None, num_ctx_l1=8192, num_ctx_fanout=8192, max_concurrency=8, min_subdivide_duration_s=120.0, reasoning=False, boundary_block_seconds=30)
Return the per-stage closures keyed by canonical stage name.
The distributed/queued pipeline (autorag.services) runs one stage at a time, batched across many concurrent requests, so it needs the individual stage functions rather than the sequential _build_tree that build_topic_runnable() composes. Both share _build_stage_closures(), so the warm-chain construction and the keep_alive=0 eviction are identical to the in-process path.
Keys: "l1", "decide", "l2", "summarize", "l0" (the boundary/summary LLM stages) and "evict" (the zero-arg keep_alive=0 model-eviction call the GPU arbiter owns once a distributed run’s L0 stage completes). Stage 1 (Whisper) and the persist stage are not LLM stages and live in autorag.whisper_runner / autorag.core.AutoRAG.
- autorag.agent.build_topic_runnable(*, llm_model='gemma4:latest', ollama_base_url=None, num_ctx_l1=8192, num_ctx_fanout=8192, max_concurrency=8, min_subdivide_duration_s=120.0, reasoning=False, boundary_block_seconds=30)
Build a Runnable mapping list[WordSpan] -> TopicTree (L0/L1/L2 hierarchy).
Notes on Ollama settings (server-side, controlled outside this module):
Every stage uses the same num_ctx (num_ctx_fanout, default 8192) and keep_alive="5m", so the model stays resident across the sub-second inter-stage gaps. Ollama reloads a model whenever num_ctx changes between requests, so a uniform context size is what actually keeps it warm: there are zero mid-run reloads. After Stage 5 (and on any stage error) _build_tree issues one throwaway keep_alive=0 call that evicts the model so it doesn’t squat VRAM during the downstream embed/viz step. The finite 5-minute keep_alive is a crash-safety fallback: if the run dies before the explicit eviction, Ollama still unloads the model on its own.
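The "evict after Stage 5 and on any stage error" rule maps naturally onto try/finally. The sketch below is an assumption about the pattern, not the body of _build_tree. The stage keys are the ones listed for build_stage_handlers(), but the handler signature (one state value in, one out) and the stubs are hypothetical.

```python
STAGE_ORDER = ["l1", "decide", "l2", "summarize", "l0"]


def run_stages(handlers, state):
    """Run the LLM stages in order and always call the zero-arg "evict" last."""
    try:
        for name in STAGE_ORDER:
            state = handlers[name](state)
        return state
    finally:
        # Runs on success and on any stage error, mirroring the eviction rule.
        handlers["evict"]()


# Stub handlers standing in for the real LLM calls (hypothetical signature).
calls = []


def make_stub(name):
    def stub(state):
        calls.append(name)
        return state + [name]
    return stub


stub_handlers = {name: make_stub(name) for name in STAGE_ORDER}
stub_handlers["evict"] = lambda: calls.append("evict")

result = run_stages(stub_handlers, [])
```

If a stage raises, the exception still propagates to the caller, but only after the eviction call has run.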
temperature=0.0 plus identical system prompts per chain give per-slot prefix-cache hits across all calls inside a single chain. (This works with Ollama’s default per-slot cache — it does not require OLLAMA_MULTIUSER_CACHE, which must stay unset alongside the devcontainer’s FLASH_ATTENTION=1 + concurrent slots; see CLAUDE.md “Ollama tuning”.)
reasoning=False (default) disables thinking on thinking-capable models. The default gemma4:latest is a thinking model; all five stages do mechanical JSON extraction (boundaries / yes-no / {title, summary}) where a chain-of-thought preamble is pure latency and a structured-output parse hazard — the same rationale as temperature=0.0. Pass reasoning=True to benchmark the quality/latency trade-off (the agent-lab gemma4-thinking design) or with a non-thinking model where it is a no-op.
num_ctx_l1 is still overridable. The Stage 2 (L1) call sees the whole time-bucketed transcript; on very long audio (≈1 hr+) 8192 tokens can truncate it and degrade L1 boundaries. Raising num_ctx_l1 back to e.g. 16384 fixes that at the cost of exactly one model reload at the Stage 2→3a boundary (the L1 call then differs in num_ctx).
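The reload arithmetic can be checked with a toy model. It assumes, as stated above, that a reload happens exactly when num_ctx differs between consecutive requests, that Stage 2 is the only call using num_ctx_l1, and it ignores the initial load and the final eviction. The function names and the fanout_calls count are illustrative.

```python
def stage_contexts(num_ctx_l1=8192, num_ctx_fanout=8192, fanout_calls=6):
    """Context size per LLM call: Stage 2 first, then every later call."""
    return [num_ctx_l1] + [num_ctx_fanout] * fanout_calls


def count_reloads(num_ctx_per_call):
    """Count context-size changes between consecutive calls."""
    pairs = zip(num_ctx_per_call, num_ctx_per_call[1:])
    return sum(1 for prev, cur in pairs if prev != cur)


uniform = count_reloads(stage_contexts())                  # defaults: all 8192
raised = count_reloads(stage_contexts(num_ctx_l1=16384))   # larger L1 context
```

Under these assumptions the default settings give zero mid-run reloads, and raising only num_ctx_l1 gives exactly one, at the Stage 2→3a boundary.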
boundary_block_seconds (default 30) sizes the time-bucketed transcript fed to the L1/L2 boundary prompts. Smaller windows give more frequent MM:SS anchors (finer possible boundaries) but more lines (more boundary-prompt tokens); larger windows are terser but coarser. It does not affect the per-node summary input (plain text, no timestamps).
With OLLAMA_NUM_PARALLEL=1 the server serializes batched requests, so Stage 3a/3b wall-clock is N x per-call, not N/4 x per-call. Raising NUM_PARALLEL requires more VRAM (the server reserves all slots’ KV-cache up front at the request’s num_ctx). See CLAUDE.md “Ollama tuning notes”.
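A back-of-the-envelope model of this trade-off, assuming every call takes the same time and the server runs calls in full waves of NUM_PARALLEL. Real latencies vary, so treat the numbers as rough estimates. The function names are illustrative.

```python
import math


def fanout_wall_clock(n_calls: int, per_call_s: float, num_parallel: int = 1) -> float:
    """Idealized wall-clock time when calls run in waves of num_parallel."""
    return math.ceil(n_calls / num_parallel) * per_call_s


def reserved_kv_tokens(num_ctx: int, num_parallel: int) -> int:
    """Tokens of KV-cache reserved up front: one num_ctx window per slot."""
    return num_ctx * num_parallel


serial = fanout_wall_clock(8, 2.0, num_parallel=1)
parallel = fanout_wall_clock(8, 2.0, num_parallel=4)
```

With 8 calls of 2 s each, the serialized server takes about 16 s and four slots take about 4 s, while the reserved KV-cache grows from one 8192-token window to four.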
- autorag.agent.generate_topics(words, **kwargs)
Build the topic runnable and invoke it once.
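Read literally, that description suggests a body along these lines. This is a sketch, not the actual source: it assumes kwargs are forwarded unchanged to build_topic_runnable() and that the result is run through the standard LangChain Runnable.invoke(). The wrapper name topics_for is hypothetical.

```python
def topics_for(words, **kwargs):
    """Sketch of generate_topics: build the runnable, then invoke it once."""
    # Imported lazily so this snippet loads even without autorag installed.
    from autorag.agent import build_topic_runnable

    # Assumption: kwargs (e.g. boundary_block_seconds=15) pass straight through.
    runnable = build_topic_runnable(**kwargs)
    return runnable.invoke(words)
```

Callers that process many transcripts with the same settings may prefer to call build_topic_runnable() once and reuse the returned Runnable.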