Ollama tuning
=============

The agent's batched stages parallelize requests to Ollama, so the relevant
server-side knob is ``OLLAMA_NUM_PARALLEL``. The right value depends on whether
you're tuning for parallelism or for a bigger single-stream model.

Tuning contract (single source of truth)
----------------------------------------

The host ``ollama`` service in ``docker-compose.yml`` owns the tuned set as its
**only** copy, each ``${VAR:-default}``-overridable from the host ``.env``,
then applied by ``./scripts/stack.sh up``:

.. code-block:: bash

   OLLAMA_FLASH_ATTENTION=1
   OLLAMA_KV_CACHE_TYPE=q8_0
   OLLAMA_NUM_PARALLEL=8
   OLLAMA_MAX_LOADED_MODELS=1

These are server-side only; the Python agent still sets ``num_ctx`` and
``keep_alive`` per request. (Ollama is a host compose service now, not
devcontainer-native; the thin devcontainer joins the stack's shared
``autorag-net`` network and reaches it by name at ``http://ollama:11434``.)

Flash attention plus a ``q8_0`` KV cache roughly halve per-slot KV VRAM at
near-lossless quality and cut attention memory bandwidth, so the eight agent
slots stay concurrent *and* each call runs faster. ``MAX_LOADED_MODELS=1`` pins
the single agent LLM so it is never evicted to load a second model. The
sections below explain the reasoning behind each value.

.. note::

   ``./scripts/stack.sh up`` pulls the models **after** Ollama is healthy: its
   ``ollama ls`` healthcheck reports healthy with *zero* models. A bare
   ``docker compose up`` therefore races the pull and every LLM stage fails
   with a 404 ("model … try pulling it first"). The async pipeline rewrites
   this into a legible job error (``services.stages._legible_error``),
   *"Ollama model not available …* ``./scripts/stack.sh up``\ *"*, instead of
   an opaque DLQ traceback. It is a string-match only: no retry/DLQ topology
   change, and any other error still falls back to ``repr``.

The default agent LLM is ``gemma4:latest`` (8B Q4_K_M, ~9.6 GB), a
``thinking``-capable model.
The agent disables thinking (``reasoning=False``, sent to Ollama as
``think: false``) for all five mechanical-JSON stages. That is a client-side
per-request setting, not a server env knob, but it is the dominant gemma4
latency lever, so it is noted here for anyone tuning for speed.

**Validation caveat:** the agent-lab LEDGER's gemma4 rows were measured under
Ollama's *default* server env (flash attention default-on, f16 KV), **not**
this tuned ``q8_0``-KV + explicit ``FLASH_ATTENTION=1`` + ``NUM_PARALLEL=8``
combination. Gemma-family models use interleaved sliding-window attention,
historically a sensitive pairing with flash attention in llama.cpp / Ollama.
The settings are sound and each is overridable; re-run ``bench.py`` to confirm
gemma4 quality holds under them.

``OLLAMA_NUM_PARALLEL``
-----------------------

* **≥ 8** for the agent's batched stages (Stage 3a "decide", Stage 3b L2
  boundaries, Stage 4 per-node summaries). Matches the langchain-side default
  ``max_concurrency=8`` in :class:`~autorag.services.schemas.AudioJobRequest`
  and the in-process agent. Required for ``Runnable.batch`` to actually
  parallelize across that many calls.
* **= 1** for one-shot calls on a *bigger* model. Ollama pre-reserves all
  ``NUM_PARALLEL`` slots' KV cache at the configured ``num_ctx``, so 8 idle
  slots steal VRAM that the bigger model needs.

On a 24 GB GPU the default ``gemma4:latest`` (~9.6 GB) plus eight
``num_ctx=8192`` slots at the ``q8_0`` KV cache still leaves room for whisper's
~6 GB tenancy budget on tenancy flip. The bump from 4→8 is worth verifying
per-host: while a job is mid-``l1``, run
``docker exec autorag-ollama-1 nvidia-smi --query-gpu=memory.used,memory.free --format=csv,noheader``
and confirm ≥ 6 GiB free. Roll back if the new
``autorag.gpu.preload.whisper_ct2.cuda`` span flips ``preload.cuda_succeeded``
to false on this host: set ``OLLAMA_NUM_PARALLEL=4`` in ``.env`` and re-apply
it with ``./scripts/stack.sh up``; a plain ``docker restart`` of the ``ollama``
service reuses the container's existing environment and will not pick up the
change.
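
The headroom check above can be scripted. The following is a minimal sketch,
not part of the project: it assumes a single-GPU host, the
``autorag-ollama-1`` container name quoted above, and that ``nvidia-smi``
prints ``<used> MiB, <free> MiB`` rows for this query. The function names are
hypothetical.

.. code-block:: python

   import subprocess

   MIB_PER_GIB = 1024
   WHISPER_BUDGET_GIB = 6.0  # the ~6 GB whisper tenancy budget quoted above


   def parse_memory_csv(csv_output: str) -> list[tuple[int, int]]:
       """Parse 'used, free' rows such as '9876 MiB, 14567 MiB' into MiB integers."""
       rows = []
       for line in csv_output.strip().splitlines():
           # Each field looks like '9876 MiB'; keep only the leading number.
           used, free = (field.strip().split()[0] for field in line.split(","))
           rows.append((int(used), int(free)))
       return rows


   def has_whisper_headroom(csv_output: str,
                            min_free_gib: float = WHISPER_BUDGET_GIB) -> bool:
       """True when at least one GPU row exists and each has min_free_gib free."""
       rows = parse_memory_csv(csv_output)
       return bool(rows) and all(
           free / MIB_PER_GIB >= min_free_gib for _, free in rows
       )


   def query_gpu(container: str = "autorag-ollama-1") -> str:
       """Run the nvidia-smi query from the text inside the Ollama container."""
       cmd = [
           "docker", "exec", container, "nvidia-smi",
           "--query-gpu=memory.used,memory.free", "--format=csv,noheader",
       ]
       return subprocess.run(
           cmd, check=True, capture_output=True, text=True
       ).stdout

Calling ``has_whisper_headroom(query_gpu())`` while a job is mid-``l1`` returns
``False`` when less than 6 GiB is free, which is the cue to fall back to
``OLLAMA_NUM_PARALLEL=4``.
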
The ``NUM_PARALLEL=1`` case still applies to the bigger ``gemma4:26b`` (the
25.8B sibling, ~17 GB): a single stream gets the freed slot KV at
``num_ctx=8192`` with full offload, and a single-stream ``num_ctx=16384`` still
fits. Verify with ``ollama ps`` after a load.

``OLLAMA_FLASH_ATTENTION`` and ``OLLAMA_MULTIUSER_CACHE``
---------------------------------------------------------

Flash attention is on by default (see *Tuning contract* above), which also
unlocks the ``q8_0`` KV cache. **Do not** combine ``OLLAMA_FLASH_ATTENTION=1``
with ``OLLAMA_MULTIUSER_CACHE=true`` and concurrent slots, because it
triggers::

   GGML_ASSERT(is_full && "seq_cp() is only supported for full KV buffers")

Because the ``ollama`` compose service ships ``FLASH_ATTENTION=1`` with
``NUM_PARALLEL=8`` (concurrent slots), ``MULTIUSER_CACHE`` must stay unset;
``docker-compose.yml`` deliberately omits it. The per-slot prefix cache still
works without it, which is what the agent's K identical summary prompts
benefit from.

Per-slot KV-cache sizing
------------------------

Every stage uses the same context size, ``num_ctx=8192``, chosen to fit the
typical "8 slots × KV + ~9 GB model" budget on a 24 GB card. With the default
``q8_0`` KV cache each slot's KV is roughly half its f16 size, so doubling slot
count from 4 to 8 still leaves headroom for the whisper tenancy's ~6 GB budget
on tenancy flip.

A uniform ``num_ctx`` is deliberate: Ollama reloads a model whenever
``num_ctx`` changes between requests, so keeping it constant is what lets the
model stay resident across all five stages (see *Model residency during a run*
below).

``num_ctx_l1`` remains an overridable kwarg
(:func:`autorag.agent.build_topic_runnable` /
:meth:`autorag.core.AutoRAG.generate_topics`). The Stage 2 (L1) call sees the
*whole* time-bucketed transcript; on very long audio (≈1 hr+) 8192 tokens can
truncate it and degrade L1 boundary quality. Raising ``num_ctx_l1`` back to e.g.
``16384`` fixes that, at the cost of exactly one model reload at the Stage
2→3a boundary (the L1 call then differs in ``num_ctx`` from the fan-out
stages).

These values are conservative enough that bumping the LLM to the bigger
``gemma4:26b`` (~17 GB) typically just needs ``NUM_PARALLEL=1`` and no other
changes.

Model residency during a run
----------------------------

The topic agent keeps the LLM resident in VRAM for the whole run instead of
reloading it per stage. Two settings make that work:

* ``keep_alive="5m"`` on every chat client: long enough to span the sub-second
  gaps between stages, so Ollama never unloads mid-run. It doubles as a
  crash-safety fallback: if the run dies before the explicit eviction below,
  Ollama still unloads the model on its own after five idle minutes.
* a uniform ``num_ctx`` across all stages (see *Per-slot KV-cache sizing*):
  without this the 16 K→8 K transition at the Stage 2→3a boundary would force
  a reload even with ``keep_alive`` set.

When the run finishes (or any stage raises), ``_build_tree`` issues one
throwaway ``keep_alive=0`` generation that evicts the model so it doesn't
squat VRAM during the downstream embed / ``/viz`` step. This is the LLM
analogue of the whisper / pyannote "offload to CPU after use" idiom.

Because all stages now share one ``num_ctx`` and the model stays warm,
``OLLAMA_NUM_PARALLEL`` ≥ 8 is unambiguously beneficial: the batched stages
parallelize across all eight slots and there is no per-stage reload cost to
trade off against.

Resolving the Ollama URL
------------------------

Both the agent (LLM chat) and :class:`autorag.embed.Embedder` (embeddings)
read ``AUTORAG_OLLAMA_BASE_URL`` (default ``http://localhost:11434``). The
compose workers **and** the thin devcontainer all set it to
``http://ollama:11434``. They are on the same shared ``autorag-net`` network,
so the container-network service name resolves from the sandbox too (no
``host.docker.internal`` hop).
The embedding model is separately controlled with ``AUTORAG_EMBED_MODEL`` (default ``nomic-embed-text``).
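
The lookup described above can be sketched as follows. This is an illustrative
helper, not the project's implementation: ``resolve_ollama_settings`` is a
hypothetical name, and only the two environment variables and their defaults
come from the text.

.. code-block:: python

   import os
   from typing import Mapping

   DEFAULT_BASE_URL = "http://localhost:11434"
   DEFAULT_EMBED_MODEL = "nomic-embed-text"


   def resolve_ollama_settings(env: Mapping[str, str] = os.environ) -> dict[str, str]:
       """Return the Ollama base URL and embedding model.

       Falls back to the documented defaults when a variable is unset.
       """
       return {
           "base_url": env.get("AUTORAG_OLLAMA_BASE_URL", DEFAULT_BASE_URL),
           "embed_model": env.get("AUTORAG_EMBED_MODEL", DEFAULT_EMBED_MODEL),
       }

Inside the compose workers or the devcontainer, where
``AUTORAG_OLLAMA_BASE_URL`` is set, the returned ``base_url`` is
``http://ollama:11434``; on a bare host with nothing set it is
``http://localhost:11434``.
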