Ollama tuning¶
The agent’s batched stages parallelize requests to Ollama, so the
relevant server-side knob is OLLAMA_NUM_PARALLEL. The right value
depends on whether you’re tuning for parallelism or for a bigger
single-stream model.
Tuning contract (single source of truth)¶
The host ollama service in docker-compose.yml owns the tuned set
as its only copy — each ${VAR:-default}-overridable from the
host .env, then applied by ./scripts/stack.sh up:
OLLAMA_FLASH_ATTENTION=1
OLLAMA_KV_CACHE_TYPE=q8_0
OLLAMA_NUM_PARALLEL=8
OLLAMA_MAX_LOADED_MODELS=1
These are server-side only; the Python agent still sets num_ctx and
keep_alive per request. (Ollama is a host compose service now, not
devcontainer-native; the thin devcontainer joins the stack’s shared
autorag-net network and reaches it by name at
http://ollama:11434.) Flash attention plus a q8_0 KV cache
roughly halve per-slot KV VRAM at near-lossless quality and cut
attention memory bandwidth, so the eight agent slots stay concurrent
and each call runs faster. MAX_LOADED_MODELS=1 pins the single
agent LLM so it is never evicted to load a second model. The sections
below explain the reasoning behind each value.
Note
./scripts/stack.sh up pulls the models after Ollama is
healthy — its ollama ls healthcheck reports healthy with zero
models. A bare docker compose up therefore races the pull and
every LLM stage fails with a 404 (“model … try pulling it first”).
The async pipeline rewrites this into a legible job error
(services.stages._legible_error) — “Ollama model not available …
./scripts/stack.sh up“ — instead of an opaque DLQ traceback.
It is a string-match only: no retry/DLQ topology change, and any
other error still falls back to repr.
The default agent LLM is gemma4:latest (8B Q4_K_M, ~9.6 GB), a
thinking-capable model. The agent disables thinking
(reasoning=False, sent to Ollama as think: false) for all five
mechanical-JSON stages — that is a client-side per-request setting, not
a server env knob, but it is the dominant gemma4 latency lever, so it
is noted here for anyone tuning for speed. Validation caveat: the
agent-lab LEDGER’s gemma4 rows were measured under Ollama’s default
server env (flash attention default-on, f16 KV), not this tuned
q8_0-KV + explicit FLASH_ATTENTION=1 + NUM_PARALLEL=4
combination. Gemma-family models use interleaved sliding-window
attention, historically a sensitive pairing with flash attention in
llama.cpp / Ollama. The settings are sound and each is overridable;
re-run bench.py to confirm gemma4 quality holds under them.
OLLAMA_NUM_PARALLEL¶
≥ 8 for the agent’s batched stages (Stage 3a “decide”, Stage 3b L2 boundaries, Stage 4 per-node summaries). Matches the langchain-side default
max_concurrency=8inAudioJobRequestand the in-process agent. Required forRunnable.batchto actually parallelize across that many calls.= 1 for one-shot calls on a bigger model. Ollama pre-reserves all
NUM_PARALLELslots’ KV cache at the configurednum_ctx, so 8 idle slots steal VRAM that the bigger model needs.
On a 24 GB GPU the default gemma4:latest (~9.6 GB) plus eight
num_ctx=8192 slots at the q8_0 KV cache still leaves room for
whisper’s ~6 GB tenancy budget on tenancy flip. The bump from 4→8 is
worth verifying per-host: while a job is mid-l1 run
docker exec autorag-ollama-1 nvidia-smi --query-gpu=memory.used,memory.free --format=csv,noheader
and confirm ≥ 6 GiB free. Rollback if the new
autorag.gpu.preload.whisper_ct2.cuda span flips
preload.cuda_succeeded to false on this host: set
OLLAMA_NUM_PARALLEL=4 in .env and docker restart the
ollama service. The NUM_PARALLEL=1 case still applies to the
bigger gemma4:26b (the 25.8B sibling, ~17 GB): a single stream
gets the freed slot KV at num_ctx=8192 with full offload, and a
single-stream num_ctx=16384 still fits. Verify with ollama ps
after a load.
OLLAMA_FLASH_ATTENTION and OLLAMA_MULTIUSER_CACHE¶
Flash attention is on by default (see Devcontainer defaults), which
also unlocks the q8_0 KV cache. Do not combine
OLLAMA_FLASH_ATTENTION=1 with OLLAMA_MULTIUSER_CACHE=true and
concurrent slots — it triggers:
GGML_ASSERT(is_full && "seq_cp() is only supported for full KV buffers")
Because the ollama compose service ships FLASH_ATTENTION=1 with
NUM_PARALLEL=8 (concurrent slots), MULTIUSER_CACHE must stay
unset — docker-compose.yml deliberately omits it. The per-slot
prefix cache still works without it, which is what the agent’s K
identical summary prompts benefit from.
Per-slot KV-cache sizing¶
Every stage uses the same context size — num_ctx=8192 — chosen to
fit the typical “8 slots × KV + ~9 GB model” budget on a 24 GB card.
With the default q8_0 KV cache each slot’s KV is roughly half its
f16 size, so doubling slot count from 4 to 8 still leaves headroom for
the whisper tenancy’s ~6 GB budget on tenancy flip. A uniform
num_ctx is
deliberate: Ollama reloads a model whenever num_ctx changes
between requests, so keeping it constant is what lets the model stay
resident across all five stages (see Model residency during a run
below).
num_ctx_l1 remains an overridable kwarg
(autorag.agent.build_topic_runnable() /
autorag.core.AutoRAG.generate_topics()). The Stage 2 (L1) call
sees the whole time-bucketed transcript; on very long audio
(≈1 hr+) 8192 tokens can truncate it and degrade L1 boundary quality.
Raising num_ctx_l1 back to e.g. 16384 fixes that, at the cost
of exactly one model reload at the Stage 2→3a boundary (the L1 call
then differs in num_ctx from the fan-out stages).
These values are conservative enough that bumping the LLM to the
bigger gemma4:26b (~17 GB) typically just needs NUM_PARALLEL=1
and no other changes.
Model residency during a run¶
The topic agent keeps the LLM resident in VRAM for the whole run instead of reloading it per stage. Two settings make that work:
keep_alive="5m"on every chat client — long enough to span the sub-second gaps between stages, so Ollama never unloads mid-run. It doubles as a crash-safety fallback: if the run dies before the explicit eviction below, Ollama still unloads the model on its own after five idle minutes.a uniform
num_ctxacross all stages (see Per-slot KV-cache sizing) — without this the 16 K→8 K transition at the Stage 2→3a boundary would force a reload even withkeep_aliveset.
When the run finishes (or any stage raises), _build_tree issues
one throwaway keep_alive=0 generation that evicts the model so it
doesn’t squat VRAM during the downstream embed / /viz step. This
is the LLM analogue of the whisper / pyannote “offload to CPU after
use” idiom.
Because all stages now share one num_ctx and the model stays
warm, OLLAMA_NUM_PARALLEL ≥ 8 is unambiguously beneficial: the
batched stages parallelize across all eight slots and there is no
per-stage reload cost to trade off against.
Resolving the Ollama URL¶
Both the agent (LLM chat) and autorag.embed.Embedder
(embeddings) read AUTORAG_OLLAMA_BASE_URL (default
http://localhost:11434). The compose workers and the thin
devcontainer all set it to http://ollama:11434 — they are on the
same shared autorag-net network, so the container-network service
name resolves from the sandbox too (no host.docker.internal hop).
The embedding model is separately controlled with
AUTORAG_EMBED_MODEL (default nomic-embed-text).