Audio→topics agent (`autorag.agent`)

Multi-pass L0/L1/L2 topic extraction pipeline. Surfaces:

transcribe_audio() — Whisper + diarization → WordSpan list.
generate_topics() — pure-LLM topic extraction on a pre-computed transcript.
build_topic_runnable() — the LangChain Runnable[list[WordSpan], TopicTree] used by generate_topics.
build_agent() — the combined Whisper + diarization + topics Runnable[Path | str, TranscriptionResult].

Most callers should go through AutoRAG rather than importing this module directly. The pipeline design is documented in Audio pipeline design.

Audio → hierarchical topic tree. The single agent for AutoRAG.

Multi-pass L0 / L1 / L2 extractor — each LLM stage has one focused job:

1. Whisper                              -> list[WordSpan]               1 call
2. L1 boundaries  (single LLM call)     -> list[{s,e}]                  1 LLM
3a Decide subdivide  (per long L1)      -> list[bool]                   N LLM
3b L2 boundaries  (per yes-L1, batched) -> list[list[{s,e}]]            M LLM (M<=N)
4. Summarize nodes  (per L1+L2, batched)-> {title,summary} per node     K LLM
5. L0 aggregate                         -> {title, summary}             1 LLM

Final shape: {"topics": [L0]} with L0.children = [L1...], each L1.children = [L2...] or []. The L0 root is the explicit “what is this audio about” node.
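To make the final shape concrete, here is a minimal sketch that assumes the tree is a plain nested dict whose nodes carry title, summary, and children keys. The sample titles and the walk helper are hypothetical and are not part of autorag.agent; the real TopicTree may carry further fields (for example timestamps) that are not shown here.

```python
from typing import Any, Iterator

# Hypothetical sample data: one L0 root, two L1 topics, one of them subdivided.
tree: dict[str, Any] = {
    "topics": [
        {
            "title": "Quarterly planning call",
            "summary": "What the whole recording is about.",
            "children": [
                {
                    "title": "Roadmap review",
                    "summary": "A long L1 topic that was subdivided.",
                    "children": [
                        {"title": "Backend milestones", "summary": "L2 topic.", "children": []},
                        {"title": "Frontend milestones", "summary": "L2 topic.", "children": []},
                    ],
                },
                {"title": "Hiring update", "summary": "A short L1 topic.", "children": []},
            ],
        }
    ]
}


def walk(node: dict[str, Any], depth: int = 0) -> Iterator[tuple[int, str]]:
    """Yield (level, title) pairs depth-first; level 0 is the L0 root."""
    yield depth, node["title"]
    for child in node.get("children", []):
        yield from walk(child, depth + 1)


# Print an indented outline of the tree, starting at the single L0 root.
for level, title in walk(tree["topics"][0]):
    print(f"{'  ' * level}L{level}: {title}")
```

Because there is exactly one L0 root, tree["topics"][0] is the natural entry point for any traversal.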

Boundary calls receive a time-bucketed transcript (format_blocks; block size boundary_block_seconds, default 30 s) and emit {s, e} as MM:SS strings. This module parses those strings back to float seconds, so the model never does timestamp arithmetic. Per-node summary calls operate on the slice’s plain text (no timestamps) and emit {title, summary}. The K = N1 + N2 summary calls (one per L1 node plus one per L2 node) share an identical prompt prefix for cache reuse.
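The round trip described above can be sketched as follows. This is an illustration only: the Word class is a hypothetical stand-in for WordSpan, the "[MM:SS] text" line layout is an assumed format, and bucket_transcript is not the module's actual format_blocks implementation.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """Hypothetical stand-in for WordSpan: a token and its start time in seconds."""
    text: str
    start: float


def to_mmss(seconds: float) -> str:
    """Render whole seconds as an MM:SS anchor."""
    total = int(seconds)
    return f"{total // 60:02d}:{total % 60:02d}"


def parse_mmss(stamp: str) -> float:
    """Parse an MM:SS string emitted by the model back to float seconds."""
    minutes, seconds = stamp.strip().split(":")
    return int(minutes) * 60.0 + float(seconds)


def bucket_transcript(words: list[Word], block_seconds: int = 30) -> str:
    """Group words into fixed-size time blocks, one '[MM:SS] text' line per block."""
    blocks: dict[int, list[str]] = {}
    for word in words:
        blocks.setdefault(int(word.start // block_seconds), []).append(word.text)
    return "\n".join(
        f"[{to_mmss(index * block_seconds)}] {' '.join(texts)}"
        for index, texts in sorted(blocks.items())
    )


words = [Word("hello", 0.5), Word("world", 12.0), Word("next", 31.0), Word("late", 95.2)]
print(bucket_transcript(words))

# A boundary as the model would emit it, converted on the Python side.
boundary = {"s": "00:30", "e": "01:30"}
start_s, end_s = parse_mmss(boundary["s"]), parse_mmss(boundary["e"])
```

Keeping the conversion in code means the model only ever copies anchors it has already seen in the prompt.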

autorag.agent.build_agent(*, whisper_model='base', language='en', llm_model='gemma4:latest', ollama_base_url=None, num_ctx_l1=8192, num_ctx_fanout=8192, max_concurrency=8, min_subdivide_duration_s=120.0, reasoning=False, boundary_block_seconds=30)

Build a Runnable mapping audio file -> {transcription, topics:{topics:[L0]}}.

Parameters:

whisper_model (str)
language (str | None)
llm_model (str)
ollama_base_url (str | None)
num_ctx_l1 (int)
num_ctx_fanout (int)
max_concurrency (int)
min_subdivide_duration_s (float)
reasoning (bool)
boundary_block_seconds (int)

Return type:

Runnable[Path | str, TranscriptionResult]

autorag.agent.build_stage_handlers(*, llm_model='gemma4:latest', ollama_base_url=None, num_ctx_l1=8192, num_ctx_fanout=8192, max_concurrency=8, min_subdivide_duration_s=120.0, reasoning=False, boundary_block_seconds=30)

Return the per-stage closures keyed by canonical stage name.

The distributed/queued pipeline (autorag.services) runs one stage at a time, batched across many concurrent requests, so it needs the individual stage functions rather than the sequential _build_tree that build_topic_runnable() composes. Both share _build_stage_closures(), so the warm-chain construction and the keep_alive=0 eviction are identical to the in-process path.

Keys: "l1", "decide", "l2", "summarize", "l0" (the boundary/summary LLM stages) and "evict" (the zero-arg keep_alive=0 model-eviction call the GPU arbiter owns once a distributed run’s L0 stage completes). Stage 1 (Whisper) and the persist stage are not LLM stages and live in autorag.whisper_runner / autorag.core.AutoRAG.

Parameters:

llm_model (str)
ollama_base_url (str | None)
num_ctx_l1 (int)
num_ctx_fanout (int)
max_concurrency (int)
min_subdivide_duration_s (float)
reasoning (bool)
boundary_block_seconds (int)

Return type:

dict[str, Callable[..., Any]]
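The sketch below shows one way a caller might drive a handler dict with these keys in the in-process stage order, evicting the model on success and on error alike. The argument and return shapes of each handler are assumptions made for illustration (the documented type is only Callable[..., Any]), and run_stages and the stub handlers are hypothetical. In the distributed pipeline the GPU arbiter, not the caller, owns the "evict" call.

```python
from typing import Any, Callable


def run_stages(handlers: dict[str, Callable[..., Any]], transcript: str) -> Any:
    """Run the LLM stages in order and always evict the model afterwards."""
    try:
        l1 = handlers["l1"](transcript)            # Stage 2: L1 boundaries
        flags = handlers["decide"](l1)             # Stage 3a: subdivide yes/no per L1
        to_split = [seg for seg, yes in zip(l1, flags) if yes]
        l2 = handlers["l2"](to_split)              # Stage 3b: L2 boundaries for yes-L1s
        nodes = handlers["summarize"](l1, l2)      # Stage 4: {title, summary} per node
        return handlers["l0"](nodes)               # Stage 5: L0 aggregate
    finally:
        handlers["evict"]()                        # zero-arg eviction, also on error


# Hypothetical stubs standing in for the real closures; they only record call order.
calls: list[str] = []


def _stub(name: str, result: Any) -> Callable[..., Any]:
    def handler(*args: Any) -> Any:
        calls.append(name)
        return result
    return handler


stubs = {
    "l1": _stub("l1", [{"s": "00:00", "e": "03:00"}, {"s": "03:00", "e": "04:00"}]),
    "decide": _stub("decide", [True, False]),
    "l2": _stub("l2", [[{"s": "00:00", "e": "01:30"}, {"s": "01:30", "e": "03:00"}]]),
    "summarize": _stub("summarize", [{"title": "t", "summary": "s"}]),
    "l0": _stub("l0", {"title": "root", "summary": "all"}),
    "evict": _stub("evict", None),
}

root = run_stages(stubs, "[00:00] ...")
```

The try/finally mirrors the documented behavior that eviction happens after the last stage and on any stage error.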

autorag.agent.build_topic_runnable(*, llm_model='gemma4:latest', ollama_base_url=None, num_ctx_l1=8192, num_ctx_fanout=8192, max_concurrency=8, min_subdivide_duration_s=120.0, reasoning=False, boundary_block_seconds=30)

Build a Runnable mapping list[WordSpan] -> TopicTree (L0/L1/L2 hierarchy).

Notes on Ollama settings (server-side, controlled outside this module):

Every stage uses the same num_ctx (num_ctx_fanout, default 8192) and keep_alive="5m", so the model stays resident across the sub-second inter-stage gaps. Ollama reloads a model whenever num_ctx changes between requests, so a uniform context size is what actually keeps it warm: there are zero mid-run reloads. After Stage 5 (and on any stage error) _build_tree issues one throwaway keep_alive=0 call that evicts the model so it doesn’t squat VRAM during the downstream embed/viz step. The finite 5-minute keep_alive is a crash-safety fallback: if the run dies before the explicit eviction, Ollama still unloads the model on its own.
temperature=0.0 plus identical system prompts per chain give per-slot prefix-cache hits across all calls inside a single chain. (This works with Ollama’s default per-slot cache — it does not require OLLAMA_MULTIUSER_CACHE, which must stay unset alongside the devcontainer’s FLASH_ATTENTION=1 + concurrent slots; see CLAUDE.md “Ollama tuning”.)
reasoning=False (default) disables thinking on thinking-capable models. The default gemma4:latest is a thinking model; all five stages do mechanical JSON extraction (boundaries / yes-no / {title, summary}) where a chain-of-thought preamble is pure latency and a structured-output parse hazard — the same rationale as temperature=0.0. Pass reasoning=True to benchmark the quality/latency trade-off (the agent-lab gemma4-thinking design) or with a non-thinking model where it is a no-op.
num_ctx_l1 is still overridable. The Stage 2 (L1) call sees the whole time-bucketed transcript; on very long audio (≈1 hr+) 8192 tokens can truncate it and degrade L1 boundaries. Raising num_ctx_l1 back to e.g. 16384 fixes that at the cost of exactly one model reload at the Stage 2→3a boundary (the L1 call then differs in num_ctx).
boundary_block_seconds (default 30) sizes the time-bucketed transcript fed to the L1/L2 boundary prompts. Smaller windows give more frequent MM:SS anchors (finer possible boundaries) but more lines (more boundary-prompt tokens); larger windows are terser but coarser. It does not affect the per-node summary input (plain text, no timestamps).
With OLLAMA_NUM_PARALLEL=1 the server serializes batched requests, so Stage 3a/3b wall-clock is N x per-call, not N/4 x per-call. Raising NUM_PARALLEL requires more VRAM (the server reserves all slots’ KV-cache up front at the request’s num_ctx). See CLAUDE.md “Ollama tuning notes”.
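Two of the notes above reduce to simple arithmetic, sketched here under stated assumptions. count_reloads assumes the L1 stage uses num_ctx_l1 and every later stage uses num_ctx_fanout, and counts a reload at each change between consecutive stages. fanout_wall_clock is an idealized model in which per-call latency does not degrade as parallel slots are added. Both helpers are hypothetical and not part of autorag.agent.

```python
import math

# LLM stage order: Stage 2, 3a, 3b, 4, 5.
STAGES = ("l1", "decide", "l2", "summarize", "l0")


def count_reloads(num_ctx_l1: int = 8192, num_ctx_fanout: int = 8192) -> int:
    """Count mid-run model reloads: one per num_ctx change between consecutive stages."""
    per_stage = [num_ctx_l1] + [num_ctx_fanout] * (len(STAGES) - 1)
    return sum(1 for prev, cur in zip(per_stage, per_stage[1:]) if prev != cur)


def fanout_wall_clock(
    n_calls: int,
    per_call_s: float,
    num_parallel: int = 1,
    max_concurrency: int = 8,
) -> float:
    """Estimate fan-out wall-clock: calls run in waves of min(server slots, client concurrency)."""
    lanes = max(1, min(num_parallel, max_concurrency))
    return math.ceil(n_calls / lanes) * per_call_s


print(count_reloads())                                 # uniform context size
print(count_reloads(num_ctx_l1=16384))                 # larger L1 context for long audio
print(fanout_wall_clock(8, 2.0, num_parallel=1))       # server serializes the batch
print(fanout_wall_clock(8, 2.0, num_parallel=4))       # four server slots
```

Under this model, client-side max_concurrency above the server's slot count buys nothing, which is why the OLLAMA_NUM_PARALLEL setting dominates Stage 3a/3b latency.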

Parameters:

llm_model (str)
ollama_base_url (str | None)
num_ctx_l1 (int)
num_ctx_fanout (int)
max_concurrency (int)
min_subdivide_duration_s (float)
reasoning (bool)
boundary_block_seconds (int)

Return type:

Runnable[list[WordSpan], TopicTree]

autorag.agent.generate_topics(words, **kwargs)

Build the topic runnable and invoke it once.

Parameters:

words (list[WordSpan])
kwargs (Any)

Return type:

TopicTree

autorag.agent.transcribe(file, **kwargs)

Build the agent and invoke it once.

Parameters:

file (Path | str)
kwargs (Any)

Return type:

TranscriptionResult
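Since transcribe() builds the agent on every call, a caller processing several files may prefer to build once and reuse the Runnable. The helper below is a hypothetical sketch, not part of autorag.agent; it relies only on build_agent() returning a LangChain Runnable, which exposes invoke(). The build parameter exists purely so the sketch can be exercised without the package installed.

```python
from pathlib import Path
from typing import Any, Callable, Iterable, Optional, Union


def transcribe_many(
    files: Iterable[Union[Path, str]],
    build: Optional[Callable[..., Any]] = None,
    **kwargs: Any,
) -> list[Any]:
    """Build the agent once, then invoke it once per audio file."""
    if build is None:
        # Imported lazily so this sketch can be defined without autorag installed.
        from autorag.agent import build_agent as build
    agent = build(**kwargs)
    return [agent.invoke(audio) for audio in files]
```

A call such as transcribe_many(["a.wav", "b.wav"], whisper_model="base") would then return one TranscriptionResult per file, in input order.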

autorag.agent.transcribe_audio(file, *, whisper_model='base', language='en')

Run Whisper + diarization on a local audio file, returning word spans.

Parameters:

file (Path | str)
whisper_model (str)
language (str | None)

Return type:

list[WordSpan]
