Running the HTTP server
=======================

The ``autorag serve`` command runs a FastAPI server with uvicorn. Requires the
``[server]`` extra; the ``/viz`` endpoints additionally need ``[rag]``.

.. code-block:: bash

   autorag serve --host 0.0.0.0 --port 8000

   # With auto-reload (development)
   autorag serve --reload

Endpoints
---------

.. list-table::
   :header-rows: 1
   :widths: 8 22 70

   * - Method
     - Path
     - Description
   * - ``GET``
     - ``/health``
     - Liveness probe; always ``{"status": "ok"}``.
   * - ``POST``
     - ``/ingest``
     - Document ingestion. Body: :class:`~autorag.schemas.IngestRequest`.
   * - ``POST``
     - ``/query``
     - RAG query. Body: :class:`~autorag.schemas.QueryRequest`.
   * - ``GET``
     - ``/viz``
     - React 3-D scatter (needs ``[rag]``).
   * - ``GET``
     - ``/viz/data``
     - UMAP coordinates + cluster labels + edges as JSON.
   * - ``GET``
     - ``/viz/search``
     - Semantic search over topic embeddings.
   * - ``GET``
     - ``/viz-assets/*``
     - Static file mount for the React bundle.
   * - ``POST``
     - ``/jobs/audio``
     - Enqueue an audio→topics job; returns ``202`` + ``job_id`` (async path;
       needs ``[broker,rag]``).
   * - ``GET``
     - ``/jobs/{job_id}``
     - Job status + per-stage state.
   * - ``GET``
     - ``/jobs/{job_id}/result``
     - The finished clip row; ``409`` until the job is ``done``.

Calling the API
---------------

.. code-block:: bash

   curl http://localhost:8000/health
   # {"status":"ok"}

   curl -X POST http://localhost:8000/ingest \
     -H 'content-type: application/json' \
     -d '{"paths": ["./notes"]}'

   curl -X POST http://localhost:8000/query \
     -H 'content-type: application/json' \
     -d '{"question":"What did we decide about retries?","top_k":5}'

Async job pipeline
------------------

The synchronous endpoints above process inline and never need a broker. For
running **many** audio→topics requests concurrently there is an optional async
path behind the ``[broker]`` extra: a RabbitMQ broker, a single GPU worker
(Whisper + every LLM stage), and an IO worker (the ``persist`` stage).
It is fully decoupled: installing or running it changes nothing about the
synchronous SDK / CLI / API.

The repo-root ``docker-compose.yml`` is the **single source of truth** for the
whole host stack, which consists of five services on a shared ``autorag-net``
network that the devcontainer also joins:

.. list-table::
   :header-rows: 1
   :widths: 16 40 44

   * - Service
     - What it is
     - Notes
   * - ``rabbitmq``
     - ``rabbitmq:3-management`` broker
     - AMQP on ``:5672``, management UI on ``:15672``; healthcheck-gated so
       the workers wait for it to be ready.
   * - ``ollama``
     - ``ollama/ollama`` LLM + embedding server
     - Now a compose service (no longer host/devcontainer-native). Owns the
       server-side tuning contract as its only copy; reserves the NVIDIA GPU;
       models persist in the ``ollama-models`` volume.
   * - ``gpu-worker``
     - Whisper + every LLM stage
     - Lean ``.devcontainer/worker.Dockerfile`` (CUDA + extras, deps only).
       Only ``./src`` + ``./pyproject.toml`` are bind-mounted **read-only**
       (not the repo root, so ``.env`` never enters a worker container) and
       run via ``uv run``, so a code edit needs only a restart, not a rebuild.
       Reserves the GPU; ``replicas: 1`` because there is one physical GPU, so
       do not scale it.
   * - ``io-worker``
     - The ``persist`` stage (SQLite + Chroma writes)
     - Same image, no GPU. Shares ``./.stack-data`` with ``gpu-worker``.
   * - ``docker-socket-proxy``
     - Filtered Docker-API control plane
     - Pinned ``tecnativa/docker-socket-proxy:0.3.0``, the **only** container
       mounting ``/var/run/docker.sock`` (read-only, **no published port**,
       reachable only by service name on ``autorag-net``). All endpoint groups
       default-deny except the few ``docker compose ps|logs|restart`` need;
       ``BUILD``/``EXEC``/``IMAGES``/… are refused **at the proxy**, so there
       is no host-code-exec path. The devcontainer points ``DOCKER_HOST``
       here; that *is* the per-edit control loop, with no bespoke control
       service and no token.
Deploying with ``scripts/stack.sh``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The stack is brought up on the **host** with one idempotent command. The
devcontainer is a thin sandbox with no dockerd, only a CLI pointed at the
read-only proxy, which refuses ``build``/``up``, so the initial ``up`` is
inherently host-side:

.. code-block:: bash

   cp .env.example .env            # optional; only for HF_TOKEN (diarization)
   ./scripts/stack.sh up           # create net → build → wait healthy → pull models
   ./scripts/stack.sh rebuild      # dependency-only image rebuild, then up
   ./scripts/stack.sh down -v      # stop (and drop named volumes)

``up`` creates the shared ``autorag-net`` network, builds the images, waits for
``rabbitmq`` + ``ollama`` to report healthy, then pulls the LLM + embedding
models into the ``ollama-models`` volume (the first run downloads several GB
and takes several minutes; ``AUTORAG_SKIP_MODEL_PULL=1`` opts out).

Prerequisites:

* **The NVIDIA container runtime**, for the ``ollama`` + ``gpu-worker`` device
  reservations.
* ``HF_TOKEN`` (optional, in ``.env``) is passed through for diarization;
  without it every word is labelled ``"0"`` (same as the synchronous path).
  There is no control-plane token.

.. note::

   ``./scripts/stack.sh up`` does **not** start the HTTP API. The stack runs
   the broker, Ollama, the two workers and ``docker-socket-proxy``. To drive
   the pipeline you have two options:

   * Run ``autorag serve`` yourself (needs the ``[server]`` extra) pointed at
     the **same** ``AUTORAG_DB_PATH`` and ``AUTORAG_BROKER_URL`` as the
     workers, then use the ``/jobs/*`` HTTP endpoints below; or
   * Skip the API entirely and use the ``autorag jobs submit`` /
     ``autorag jobs status`` CLI, which talks to the broker directly.

.. warning::

   If the host ``autorag serve`` uses a *different* ``AUTORAG_DB_PATH`` than
   the workers (the default is ``~/.autorag/autorag.db``, **not**
   ``./.stack-data/autorag.db``), the broker still delivers the job, but the
   worker's status writes land in a DB the API never reads. Every
   ``/jobs/{id}`` is then **stuck on** ``queued`` forever, even after the job
   has failed and dead-lettered. Put the absolute shared path in the repo
   ``.env`` (``autorag serve`` reads it via ``SettingsConfigDict``) so it
   can't drift::

      AUTORAG_DB_PATH=/abs/path/to/repo/.stack-data/autorag.db

   Also: the ``gpu-worker`` image must install the ``[youtube]`` extra for URL
   inputs (``.devcontainer/worker.Dockerfile``). Without it the whisper stage
   raises ``MissingExtraError`` on every YouTube job.

With the API running (same DB + broker as the workers):

.. code-block:: bash

   curl -X POST http://localhost:8000/jobs/audio \
     -H 'content-type: application/json' \
     -d '{"source":"https://youtu.be/VIDEO","title":"Demo"}'
   # {"job_id":"…","session_id":"…","status":"queued"}

   curl http://localhost:8000/jobs/JOB_ID          # status + per-stage state
   curl http://localhost:8000/jobs/JOB_ID/result   # the finished clip (409 until done)

Or without an API server, straight from the CLI:

.. code-block:: bash

   autorag jobs submit https://youtu.be/VIDEO --title Demo
   autorag jobs status JOB_ID

Workers and whatever serves the API / ``/viz`` share state through the
repo-local ``./.stack-data`` bind (``AUTORAG_DB_PATH=/data/autorag.db`` inside
the containers). A finished async job writes the **same** SQLite / Chroma rows
a CLI run would, so ``/viz`` and every other reader work unchanged. Because
``.stack-data`` is a repo-local directory it is also visible from the
devcontainer (at ``/workspace/autorag/.stack-data``) for direct cross-process
job/error inspection.

Without the ``[broker]`` / ``[rag]`` extras the ``/jobs/*`` endpoints return
``503`` with an install hint and the rest of the API is unaffected.
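
The submit-then-poll flow from the ``curl`` examples in this section can also be scripted. The following is a minimal sketch, not part of AutoRAG: ``wait_for_job``, ``http_get_json`` and ``BASE_URL`` are hypothetical names, the ``status`` key on ``GET /jobs/{job_id}`` is assumed to match the one in the ``POST /jobs/audio`` response, and ``"failed"`` is an assumed name for the failure state. The poll limit matters because a mismatched ``AUTORAG_DB_PATH`` leaves a job on ``queued`` indefinitely.

```python
import json
import time
import urllib.request
from typing import Callable

# Assumption: the API is served locally, as in the curl examples.
BASE_URL = "http://localhost:8000"


def http_get_json(url: str) -> dict:
    """Fetch a URL and decode its JSON body."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def wait_for_job(
    job_id: str,
    fetch: Callable[[str], dict] = http_get_json,
    interval: float = 2.0,
    max_polls: int = 300,
    terminal: tuple = ("done", "failed"),  # "failed" is an assumed state name
) -> dict:
    """Poll GET /jobs/{job_id} until its status is terminal, then return it."""
    for _ in range(max_polls):
        job = fetch(f"{BASE_URL}/jobs/{job_id}")
        # Assumption: the status document carries a top-level "status" key.
        if job.get("status") in terminal:
            return job
        time.sleep(interval)
    # Give up instead of spinning forever on a job that never leaves "queued".
    raise TimeoutError(f"job {job_id} not finished after {max_polls} polls")
```

Once the returned status is ``done``, ``/jobs/JOB_ID/result`` can be fetched the same way. Polling the status first avoids the ``409`` that the result endpoint returns for unfinished jobs, which ``urllib`` would surface as an ``HTTPError``.
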
See the "Async pipeline" section of ``CLAUDE.md`` for the architecture.

Devcontainer (thin sandbox) and the per-edit loop
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The devcontainer is a **thin sandbox: no dockerd, no mounted socket, no GPU**.
It **joins** the stack's shared ``autorag-net`` network (declared ``external``
in ``docker-compose.yml``; created idempotently by both ``./scripts/stack.sh``
and the devcontainer ``initializeCommand``), so it reaches services by name
(``AUTORAG_BROKER_URL=amqp://rabbitmq:5672``,
``AUTORAG_OLLAMA_BASE_URL=http://ollama:11434``), and its **client-only**
docker CLI points at the read-only ``docker-socket-proxy``
(``DOCKER_HOST=tcp://docker-socket-proxy:2375``). Its
``.devcontainer/check-stack.sh`` postStartCommand only *probes* via
``docker compose ps`` and **always exits 0**, because the async path is
optional and the synchronous SDK/CLI/API never need the stack.

After a code edit, bounce a worker from inside the sandbox with plain compose
verbs (the proxy allows only ps/logs/restart, so there is no helper and no
token):

.. code-block:: bash

   docker compose -p autorag ps                      # per-service status
   docker compose -p autorag restart gpu-worker      # picks up a bind-mounted edit
   docker compose -p autorag logs --tail 200 gpu-worker

Workers ``uv run`` the bind-mounted live source, so ``restart`` (not a rebuild)
makes an edit live. The proxy refuses ``build``/``up``/``exec``, so a
dependency change requires the host-side ``./scripts/stack.sh rebuild``. The
sandbox's project venv is ``UV_PROJECT_ENVIRONMENT=/opt/autorag-venv`` (outside
the bind-mounted workspace) so host and sandbox never thrash a shared
``./.venv``.

Mounting the app yourself
-------------------------

If you need to embed AutoRAG inside a larger FastAPI app, import
``autorag.api:app`` directly and re-mount it or merge its routers:

.. code-block:: python

   from fastapi import FastAPI
   from autorag.api import app as autorag_app

   parent = FastAPI()
   # Serve every AutoRAG route under the /autorag prefix.
   parent.mount("/autorag", autorag_app)

The :func:`~autorag.api.get_rag` helper returns a process-wide ``AutoRAG``
singleton via ``functools.lru_cache``, so reusing the app across requests
doesn't re-instantiate the embedder or vector store.
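
The caching behaviour described above can be illustrated in isolation. This is a minimal sketch of the ``functools.lru_cache`` singleton pattern, not the actual ``autorag.api`` code: ``FakeRAG`` is a hypothetical stand-in for ``AutoRAG``, and the ``get_rag`` defined here only mirrors the name of the real helper.

```python
from functools import lru_cache


class FakeRAG:
    """Hypothetical stand-in for the real AutoRAG class."""

    instances = 0  # counts how many times the constructor has run

    def __init__(self) -> None:
        # In the real class this is where the embedder and vector store
        # would be created, which is the cost the cache avoids repeating.
        FakeRAG.instances += 1


@lru_cache(maxsize=1)
def get_rag() -> FakeRAG:
    """Build the object on the first call; return the cached one afterwards."""
    return FakeRAG()


first = get_rag()
second = get_rag()
print(first is second, FakeRAG.instances)  # prints: True 1
```

Because the cache lives on the function object, ``get_rag.cache_clear()`` (a standard ``lru_cache`` method) forces the next call to build a fresh instance, which can be useful in tests.
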