Running the HTTP server

The autorag serve command runs a FastAPI server with uvicorn. It requires the [server] extra; the /viz endpoints additionally need [rag].

autorag serve --host 0.0.0.0 --port 8000

# With auto-reload (development)
autorag serve --reload

Endpoints

| Method | Path | Description |
|---|---|---|
| GET | /health | Liveness probe; always {"status": "ok"}. |
| POST | /ingest | Document ingestion. Body: IngestRequest. |
| POST | /query | RAG query. Body: QueryRequest. |
| GET | /viz | React 3-D scatter (needs [rag]). |
| GET | /viz/data | UMAP coordinates + cluster labels + edges as JSON. |
| GET | /viz/search | Semantic search over topic embeddings. |
| GET | /viz-assets/* | Static file mount for the React bundle. |
| POST | /jobs/audio | Enqueue an audio→topics job → 202 + job_id (async path; needs [broker,rag]). |
| GET | /jobs/{job_id} | Job status + per-stage state. |
| GET | /jobs/{job_id}/result | The finished clip row; 409 until the job is done. |

Calling the API

curl http://localhost:8000/health
# {"status":"ok"}

curl -X POST http://localhost:8000/ingest \
     -H 'content-type: application/json' \
     -d '{"paths": ["./notes"]}'

curl -X POST http://localhost:8000/query \
     -H 'content-type: application/json' \
     -d '{"question":"What did we decide about retries?","top_k":5}'
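The same calls can be made from Python. The following is a minimal sketch using only the standard library; the helper names build_json_post and send are hypothetical, the base URL assumes the host and port from the serve example, and nothing is assumed about the response fields.

```python
import json
import urllib.request

# Assumption: the server was started on localhost:8000 as in the serve example.
BASE_URL = "http://localhost:8000"


def build_json_post(path: str, payload: dict, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a POST request with a JSON body, mirroring the curl calls."""
    return urllib.request.Request(
        url=f"{base_url}{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"content-type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 30.0) -> dict:
    """Send the request and decode the JSON response body.

    urllib raises urllib.error.HTTPError on 4xx/5xx responses.
    """
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, send(build_json_post("/query", {"question": "What did we decide about retries?", "top_k": 5})) would return the decoded JSON body of the /query response.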

Async job pipeline

The synchronous endpoints above process inline and never need a broker. For running many audio→topics requests concurrently there is an optional async path behind the [broker] extra: a RabbitMQ broker, a single GPU worker (Whisper + every LLM stage), and an IO worker (the persist stage). It is fully decoupled — installing or running it changes nothing about the synchronous SDK / CLI / API.

The repo-root docker-compose.yml is the single source of truth for the whole host stack — five services on a shared autorag-net network that the devcontainer also joins:

| Service | What it is | Notes |
|---|---|---|
| rabbitmq | rabbitmq:3-management broker | AMQP on :5672, management UI on :15672; healthcheck-gated so the workers wait for it to be ready. |
| ollama | ollama/ollama LLM + embedding server | Now a compose service (no longer host/devcontainer-native). Owns the server-side tuning contract as its only copy; reserves the NVIDIA GPU; models persist in the ollama-models volume. |
| gpu-worker | Whisper + every LLM stage | Lean .devcontainer/worker.Dockerfile (CUDA + extras, deps only). Only ./src + ./pyproject.toml are bind-mounted read-only (not the repo root, so .env never enters a worker container) and run via uv run, so a code edit needs only a restart, not a rebuild. Reserves the GPU; replicas: 1 (one physical GPU; do not scale it). |
| io-worker | The persist stage (SQLite + Chroma writes) | Same image, no GPU. Shares ./.stack-data with gpu-worker. |
| docker-socket-proxy | Filtered Docker-API control plane | Pinned tecnativa/docker-socket-proxy:0.3.0; the only container mounting /var/run/docker.sock (read-only, no published port, reachable only by service name on autorag-net). All endpoint groups default-deny except the few that docker compose ps/logs/restart need; BUILD/EXEC/IMAGES/… are refused at the proxy, so there is no host-code-exec path. The devcontainer points DOCKER_HOST here: that is the per-edit control loop; there is no bespoke control service and no token. |

Deploying with scripts/stack.sh

The stack is brought up on the host with one idempotent command (the devcontainer is a thin sandbox with no dockerd — only a CLI pointed at the read-only proxy, which refuses build/up — so the initial up is inherently host-side):

cp .env.example .env        # optional — only for HF_TOKEN (diarization)
./scripts/stack.sh up       # create net → build → wait healthy → pull models
./scripts/stack.sh rebuild  # dependency-only image rebuild, then up
./scripts/stack.sh down -v  # stop (and drop named volumes)

up creates the shared autorag-net network, builds the images, waits for rabbitmq + ollama to report healthy, then pulls the LLM + embedding models into the ollama-models volume (first run is multi-GB / several minutes; AUTORAG_SKIP_MODEL_PULL=1 opts out).

Prerequisites:

  • The NVIDIA container runtime, for the ollama + gpu-worker device reservations.

  • HF_TOKEN (optional, in .env) is passed through for diarization; without it every word is labelled "0" (same as the synchronous path). There is no control-plane token.

Note

./scripts/stack.sh up does not start the HTTP API. The stack runs the broker, Ollama, the two workers and docker-socket-proxy. To drive the pipeline you have two options:

  • Run autorag serve yourself (needs the [server] extra) pointed at the same AUTORAG_DB_PATH and AUTORAG_BROKER_URL as the workers, then use the /jobs/* HTTP endpoints below; or

  • Skip the API entirely and use the autorag jobs submit / autorag jobs status CLI, which talks to the broker directly.

Warning

If the host autorag serve uses a different AUTORAG_DB_PATH than the workers (the default is ~/.autorag/autorag.db, not ./.stack-data/autorag.db), the broker still delivers the job but the worker’s status writes land in a DB the API never reads — every /jobs/{id} is stuck on queued forever, even after the job has failed and dead-lettered. Put the absolute shared path in the repo .env (autorag serve reads it via SettingsConfigDict) so it can’t drift:

AUTORAG_DB_PATH=/abs/path/to/repo/.stack-data/autorag.db

Also: the gpu-worker image must install the [youtube] extra for URL inputs (.devcontainer/worker.Dockerfile); without it the whisper stage raises MissingExtraError on every YouTube job.
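The DB-path drift described in this warning can be caught before starting autorag serve. The sketch below is a hypothetical pre-flight check, not part of AutoRAG: db_path_problems is an invented helper that compares a mapping of settings (for example os.environ, or values parsed from .env) against the repo-local .stack-data/autorag.db path. It assumes POSIX paths and does not touch the filesystem.

```python
from pathlib import PurePosixPath
from typing import List, Mapping


def db_path_problems(env: Mapping[str, str], repo_root: str) -> List[str]:
    """Return human-readable problems with AUTORAG_DB_PATH, or [] if it looks right."""
    # The shared DB the workers write to, as seen from the host.
    expected = PurePosixPath(repo_root) / ".stack-data" / "autorag.db"
    raw = env.get("AUTORAG_DB_PATH")
    if raw is None:
        return ["AUTORAG_DB_PATH is unset, so the default ~/.autorag/autorag.db applies"]
    configured = PurePosixPath(raw)
    if not configured.is_absolute():
        return [f"AUTORAG_DB_PATH={raw!r} is relative; use an absolute path"]
    if configured != expected:
        return [f"AUTORAG_DB_PATH={raw!r} differs from the shared path {str(expected)!r}"]
    return []
```

Calling db_path_problems(os.environ, "/abs/path/to/repo") (after importing os) and refusing to start on a non-empty result is one way to make the mismatch fail loudly instead of leaving jobs stuck on queued.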

With the API running (same DB + broker as the workers):

curl -X POST http://localhost:8000/jobs/audio \
     -H 'content-type: application/json' \
     -d '{"source":"https://youtu.be/VIDEO","title":"Demo"}'
# {"job_id":"…","session_id":"…","status":"queued"}

curl http://localhost:8000/jobs/JOB_ID          # status + per-stage state
curl http://localhost:8000/jobs/JOB_ID/result   # the finished clip (409 until done)
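Since /jobs/{job_id}/result answers 409 until the job is done, a client can poll it. The following is a rough sketch with standard-library calls only. The names http_get and wait_for_result are hypothetical, and it assumes a finished job answers 200 with a JSON body, which the text above does not state. It does not inspect /jobs/{job_id} for failure, because status names other than queued are not documented here, so a failed job would simply run into the attempt limit.

```python
import json
import time
import urllib.error
import urllib.request
from typing import Callable, Tuple

Fetch = Callable[[str], Tuple[int, str]]


def http_get(url: str) -> Tuple[int, str]:
    """GET a URL and return (status_code, body) without raising on 4xx/5xx."""
    try:
        with urllib.request.urlopen(url, timeout=30) as response:
            return response.status, response.read().decode("utf-8")
    except urllib.error.HTTPError as err:
        return err.code, err.read().decode("utf-8")


def wait_for_result(
    job_id: str,
    base_url: str = "http://localhost:8000",
    fetch: Fetch = http_get,
    interval: float = 2.0,
    max_attempts: int = 150,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll the result endpoint until it stops answering 409."""
    url = f"{base_url}/jobs/{job_id}/result"
    for _ in range(max_attempts):
        status, body = fetch(url)
        if status == 200:  # assumption: success is reported as 200 + JSON
            return json.loads(body)
        if status != 409:  # e.g. 503 when the [broker] / [rag] extras are missing
            raise RuntimeError(f"unexpected status {status} from {url}: {body}")
        sleep(interval)  # 409: not done yet, wait and retry
    raise TimeoutError(f"job {job_id} not finished after {max_attempts} attempts")
```

The fetch and sleep parameters are injectable so the loop can be exercised without a running server.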

Or without an API server, straight from the CLI:

autorag jobs submit https://youtu.be/VIDEO --title Demo
autorag jobs status JOB_ID

Workers and whatever serves the API / /viz share state through the repo-local ./.stack-data bind (AUTORAG_DB_PATH=/data/autorag.db inside the containers) — a finished async job writes the same SQLite / Chroma rows a CLI run would, so /viz and every other reader work unchanged. Because .stack-data is a repo-local directory it is also visible from the devcontainer (at /workspace/autorag/.stack-data) for direct cross-process job/error inspection. Without the [broker] / [rag] extras the /jobs/* endpoints return 503 with an install hint and the rest of the API is unaffected. See the “Async pipeline” section of CLAUDE.md for the architecture.

Devcontainer (thin sandbox) and the per-edit loop

The devcontainer is a thin sandbox: no dockerd, no mounted socket, no GPU. It joins the stack’s shared autorag-net network (declared external in docker-compose.yml; created idempotently by both ./scripts/stack.sh and the devcontainer initializeCommand), so it reaches services by name (AUTORAG_BROKER_URL=amqp://rabbitmq:5672, AUTORAG_OLLAMA_BASE_URL=http://ollama:11434), and its client-only docker CLI points at the read-only docker-socket-proxy (DOCKER_HOST=tcp://docker-socket-proxy:2375). Its .devcontainer/check-stack.sh postStartCommand only probes via docker compose ps and always exits 0, because the async path is optional and the synchronous SDK/CLI/API never need the stack.

After a code edit, bounce a worker from inside the sandbox with plain compose verbs (the proxy enforces ps/logs/restart-only — no helper, no token):

docker compose -p autorag ps                  # per-service status
docker compose -p autorag restart gpu-worker  # picks up a bind-mounted edit
docker compose -p autorag logs --tail 200 gpu-worker

Workers uv run the bind-mounted live source (./src and ./pyproject.toml), so a restart (not a rebuild) makes an edit live. The proxy refuses build/up/exec, so a dependency change requires the host-side ./scripts/stack.sh rebuild. The sandbox’s project venv is UV_PROJECT_ENVIRONMENT=/opt/autorag-venv (outside the bind-mounted workspace) so host and sandbox never thrash a shared ./.venv.
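If the per-edit loop is scripted, it can help to reject unsupported verbs locally rather than wait for the proxy to refuse them. The sketch below is illustrative only: compose_command and run_compose are hypothetical helpers, and the allowlist just mirrors the ps/logs/restart verbs named above; the proxy remains the actual enforcement point.

```python
import subprocess
from typing import List, Optional, Sequence

# The compose verbs this page says the proxy permits from the sandbox.
ALLOWED_VERBS = {"ps", "logs", "restart"}


def compose_command(
    verb: str,
    service: Optional[str] = None,
    project: str = "autorag",
    extra: Sequence[str] = (),
) -> List[str]:
    """Build a `docker compose -p <project> <verb> ...` argv list."""
    if verb not in ALLOWED_VERBS:
        raise ValueError(f"{verb!r} is refused by the proxy; run it host-side instead")
    cmd = ["docker", "compose", "-p", project, verb, *extra]
    if service is not None:
        cmd.append(service)
    return cmd


def run_compose(verb: str, service: Optional[str] = None, **kwargs) -> subprocess.CompletedProcess:
    """Run the command; raises CalledProcessError on a non-zero exit."""
    return subprocess.run(
        compose_command(verb, service, **kwargs),
        check=True,
        capture_output=True,
        text=True,
    )
```

For example, run_compose("restart", "gpu-worker") corresponds to the restart line shown earlier, and compose_command("logs", "gpu-worker", extra=["--tail", "200"]) builds the logs call.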

Mounting the app yourself

If you need to embed AutoRAG inside a larger FastAPI app, import autorag.api:app directly and re-mount it or merge its routers:

from fastapi import FastAPI
from autorag.api import app as autorag_app

parent = FastAPI()
parent.mount("/autorag", autorag_app)

The get_rag() helper returns a process-wide AutoRAG singleton via functools.lru_cache, so reusing the app across requests doesn’t re-instantiate the embedder or vector store.
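The caching behaviour described here can be illustrated in isolation. This is a sketch of the general functools.lru_cache singleton pattern, not AutoRAG's actual code: FakeRAG and get_rag_sketch are stand-ins invented for the example.

```python
from functools import lru_cache


class FakeRAG:
    """Stand-in for an expensive object (hypothetical; not the real AutoRAG class)."""

    instances = 0  # counts how many times the constructor has run

    def __init__(self) -> None:
        FakeRAG.instances += 1


@lru_cache(maxsize=1)
def get_rag_sketch() -> FakeRAG:
    """Return one cached instance per process; the body runs only on a cache miss."""
    return FakeRAG()


first = get_rag_sketch()
second = get_rag_sketch()
print(first is second)  # both calls return the same cached object
```

Because the cache is per process, each uvicorn worker process would hold its own instance. Calling get_rag_sketch.cache_clear() forces the next call to build a fresh object, which can be useful in tests.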