Running the HTTP server¶
The autorag serve command runs a FastAPI server with uvicorn.
It requires the [server] extra; the /viz endpoints additionally
need [rag].
autorag serve --host 0.0.0.0 --port 8000
# With auto-reload (development)
autorag serve --reload
Endpoints¶
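| Method | Path | Description |
|---|---|---|
| GET | /health | Liveness probe; always returns {"status":"ok"}. |
| POST | /ingest | Document ingestion. Body: e.g. {"paths": ["./notes"]} |
| POST | /query | RAG query. Body: e.g. {"question":"What did we decide about retries?","top_k":5} |
|  |  | React 3-D scatter (needs … |
|  |  | UMAP coordinates + cluster labels + edges as JSON. |
|  |  | Semantic search over topic embeddings. |
|  |  | Static file mount for the React bundle. |
| POST | /jobs/audio | Enqueue an audio→topics job → {"job_id":"…","session_id":"…","status":"queued"} |
| GET | /jobs/{id} | Job status + per-stage state. |
| GET | /jobs/{id}/result | The finished clip row; 409 until done. |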
Calling the API¶
curl http://localhost:8000/health
# {"status":"ok"}
curl -X POST http://localhost:8000/ingest \
-H 'content-type: application/json' \
-d '{"paths": ["./notes"]}'
curl -X POST http://localhost:8000/query \
-H 'content-type: application/json' \
-d '{"question":"What did we decide about retries?","top_k":5}'
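The same calls can be made from Python. The following is a minimal sketch using only the standard library; it assumes the server is listening on localhost:8000 as in the serve example, and the helper names build_post, call and demo are illustrative, not part of AutoRAG. Response shapes beyond the /health example are not assumed.

```python
import json
import urllib.request

# Assumption: host and port match the `autorag serve` example above.
BASE_URL = "http://localhost:8000"


def build_post(path: str, payload: dict) -> urllib.request.Request:
    """Build a JSON POST request equivalent to the curl calls above."""
    return urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call(request: urllib.request.Request, timeout: float = 30.0) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def demo() -> None:
    """Mirror the three curl calls; needs a running server."""
    print(call(urllib.request.Request(BASE_URL + "/health")))
    print(call(build_post("/ingest", {"paths": ["./notes"]})))
    print(call(build_post(
        "/query",
        {"question": "What did we decide about retries?", "top_k": 5},
    )))
```

Calling demo() with the server running prints the three decoded responses; without a server, urlopen raises urllib.error.URLError.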
Async job pipeline¶
The synchronous endpoints above process inline and never need a broker.
For running many audio→topics requests concurrently there is an
optional async path behind the [broker] extra: a RabbitMQ broker, a
single GPU worker (Whisper + every LLM stage), and an IO worker (the
persist stage). It is fully decoupled — installing or running it
changes nothing about the synchronous SDK / CLI / API.
The repo-root docker-compose.yml is the single source of truth
for the whole host stack — five services on a shared autorag-net
network that the devcontainer also joins:
| Service | What it is | Notes |
|---|---|---|
| rabbitmq |  | AMQP on … |
| ollama |  | Now a compose service (no longer host/devcontainer-native). Owns the server-side tuning contract as its only copy; reserves the NVIDIA GPU; models persist in the ollama-models volume. |
| gpu-worker | Whisper + every LLM stage | Lean … |
|  | The persist stage | Same image, no GPU. Shares … |
| docker-socket-proxy | Filtered Docker-API control plane | Pinned … |
Deploying with scripts/stack.sh¶
The stack is brought up on the host with one idempotent command. The
devcontainer is a thin sandbox with no dockerd, only a CLI pointed at
the read-only proxy, which refuses build/up, so the initial up is
inherently host-side:
cp .env.example .env # optional — only for HF_TOKEN (diarization)
./scripts/stack.sh up # create net → build → wait healthy → pull models
./scripts/stack.sh rebuild # dependency-only image rebuild, then up
./scripts/stack.sh down -v # stop (and drop named volumes)
up creates the shared autorag-net network, builds the images,
waits for rabbitmq + ollama to report healthy, then pulls the
LLM + embedding models into the ollama-models volume (first run is
multi-GB / several minutes; AUTORAG_SKIP_MODEL_PULL=1 opts out).
Prerequisites:
- The NVIDIA container runtime, for the ollama + gpu-worker device reservations.
- HF_TOKEN (optional, in .env) is passed through for diarization; without it every word is labelled "0" (same as the synchronous path). There is no control-plane token.
Note
./scripts/stack.sh up does not start the HTTP API. The stack
runs the broker, Ollama, the two workers and docker-socket-proxy.
To drive the pipeline you have two options:
- Run autorag serve yourself (needs the [server] extra) pointed at the same AUTORAG_DB_PATH and AUTORAG_BROKER_URL as the workers, then use the /jobs/* HTTP endpoints below; or
- Skip the API entirely and use the autorag jobs submit / autorag jobs status CLI, which talks to the broker directly.
Warning
If the host autorag serve uses a different AUTORAG_DB_PATH
than the workers (the default is ~/.autorag/autorag.db, not
./.stack-data/autorag.db), the broker still delivers the job but
the worker’s status writes land in a DB the API never reads — every
/jobs/{id} is stuck on queued forever, even after the job
has failed and dead-lettered. Put the absolute shared path in the
repo .env (autorag serve reads it via SettingsConfigDict)
so it can’t drift:
AUTORAG_DB_PATH=/abs/path/to/repo/.stack-data/autorag.db
Also: the gpu-worker image must install the [youtube] extra
for URL inputs (.devcontainer/worker.Dockerfile); without it the
whisper stage raises MissingExtraError on every YouTube job.
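The DB-path drift described in the warning can be caught before starting the server. The sketch below compares the path the API process would use against the repo-local .stack-data/autorag.db. It is a hedged illustration: the function names are hypothetical, and it only inspects the mapping it is given (for example os.environ), so it does not reproduce how autorag serve itself loads .env.

```python
from pathlib import Path
from typing import Mapping

# Default quoted in the warning above; used when AUTORAG_DB_PATH is unset.
DEFAULT_DB_PATH = "~/.autorag/autorag.db"


def effective_db_path(env: Mapping[str, str]) -> Path:
    """Resolve the DB path implied by the given environment mapping."""
    raw = env.get("AUTORAG_DB_PATH", DEFAULT_DB_PATH)
    return Path(raw).expanduser().resolve()


def shares_worker_db(env: Mapping[str, str], repo_root: str) -> bool:
    """True when the environment points at <repo_root>/.stack-data/autorag.db."""
    expected = (Path(repo_root) / ".stack-data" / "autorag.db").resolve()
    return effective_db_path(env) == expected
```

A typical call would be shares_worker_db(os.environ, "/abs/path/to/repo") after importing os; a False result means /jobs/{id} would stay on queued as described above.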
With the API running (same DB + broker as the workers):
curl -X POST http://localhost:8000/jobs/audio \
-H 'content-type: application/json' \
-d '{"source":"https://youtu.be/VIDEO","title":"Demo"}'
# {"job_id":"…","session_id":"…","status":"queued"}
curl http://localhost:8000/jobs/JOB_ID # status + per-stage state
curl http://localhost:8000/jobs/JOB_ID/result # the finished clip (409 until done)
Or without an API server, straight from the CLI:
autorag jobs submit https://youtu.be/VIDEO --title Demo
autorag jobs status JOB_ID
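A client usually polls the status endpoint until the job finishes. The following is a rough sketch of such a loop, not AutoRAG's own client. Only the "queued" status appears in this page, so the terminal names "done" and "failed" are assumptions, as are the helper names. The timeout matters because a mismatched AUTORAG_DB_PATH leaves a job on queued indefinitely.

```python
import json
import time
import urllib.request
from typing import Callable

# Assumption: terminal status names are hypothetical placeholders;
# only "queued" is documented on this page.
TERMINAL_STATUSES = frozenset({"done", "failed"})


def http_status_fetcher(base_url: str, job_id: str) -> Callable[[], dict]:
    """Return a callable that GETs /jobs/{job_id} and decodes the JSON body."""
    def fetch() -> dict:
        with urllib.request.urlopen(f"{base_url}/jobs/{job_id}") as response:
            return json.load(response)
    return fetch


def wait_for_job(
    fetch_status: Callable[[], dict],
    interval: float = 2.0,
    timeout: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until the status is terminal; raise TimeoutError otherwise."""
    deadline = time.monotonic() + timeout
    while True:
        state = fetch_status()
        if state.get("status") in TERMINAL_STATUSES:
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"job still {state.get('status')!r} after {timeout}s"
            )
        sleep(interval)
```

For example, wait_for_job(http_status_fetcher("http://localhost:8000", "JOB_ID")) blocks until the job reports one of the assumed terminal statuses. Passing the fetcher as a callable keeps the loop testable without a server.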
Workers and whatever serves the API / /viz share state through the
repo-local ./.stack-data bind (AUTORAG_DB_PATH=/data/autorag.db
inside the containers) — a finished async job writes the same SQLite
/ Chroma rows a CLI run would, so /viz and every other reader work
unchanged. Because .stack-data is a repo-local directory it is also
visible from the devcontainer (at /workspace/autorag/.stack-data)
for direct cross-process job/error inspection. Without the [broker]
/ [rag] extras the /jobs/* endpoints return 503 with an
install hint and the rest of the API is unaffected. See the “Async
pipeline” section of CLAUDE.md for the architecture.
Devcontainer (thin sandbox) and the per-edit loop¶
The devcontainer is a thin sandbox: no dockerd, no mounted socket, no
GPU. It joins the stack’s shared autorag-net network (declared
external in docker-compose.yml; created idempotently by both
./scripts/stack.sh and the devcontainer initializeCommand), so
it reaches services by name —
AUTORAG_BROKER_URL=amqp://rabbitmq:5672,
AUTORAG_OLLAMA_BASE_URL=http://ollama:11434 — and its client-only
docker CLI points at the read-only docker-socket-proxy
(DOCKER_HOST=tcp://docker-socket-proxy:2375). Its
.devcontainer/check-stack.sh postStartCommand only probes via
docker compose ps and always exits 0 — the async path is
optional and the synchronous SDK/CLI/API never need the stack.
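Because the sandbox reaches services by name, a plain TCP connect is one way to see whether the stack is reachable from inside it. This is a different technique from check-stack.sh, which probes via docker compose ps; the sketch below is only an illustration, with the hosts and ports taken from the URLs quoted above and hypothetical function names.

```python
import socket

# Service names and ports as they appear in the URLs quoted above.
STACK_ENDPOINTS = {
    "rabbitmq": ("rabbitmq", 5672),
    "ollama": ("ollama", 11434),
    "docker-socket-proxy": ("docker-socket-proxy", 2375),
}


def can_connect(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unresolvable names.
        return False


def probe_stack() -> dict:
    """Map each service name to whether its port accepted a connection."""
    return {
        name: can_connect(host, port)
        for name, (host, port) in STACK_ENDPOINTS.items()
    }
```

A successful connect only shows that the port is open, not that the service is healthy, so it complements rather than replaces the compose-level status.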
After a code edit, bounce a worker from inside the sandbox with plain compose verbs (the proxy enforces ps/logs/restart-only — no helper, no token):
docker compose -p autorag ps # per-service status
docker compose -p autorag restart gpu-worker # picks up a bind-mounted edit
docker compose -p autorag logs --tail 200 gpu-worker
Workers execute the bind-mounted live repo through uv run, so a
restart (not a rebuild) makes an edit live. The proxy refuses
build/up/exec, so a dependency change requires the host-side
./scripts/stack.sh rebuild.
The sandbox’s project venv is UV_PROJECT_ENVIRONMENT=/opt/autorag-venv
(outside the bind-mounted workspace) so host and sandbox never thrash a
shared ./.venv.
Mounting the app yourself¶
If you need to embed AutoRAG inside a larger FastAPI app, import
autorag.api:app directly and re-mount it or merge its routers:
from fastapi import FastAPI
from autorag.api import app as autorag_app
parent = FastAPI()
parent.mount("/autorag", autorag_app)
The get_rag() helper returns a process-wide
AutoRAG singleton via functools.lru_cache, so reusing the app
across requests doesn’t re-instantiate the embedder or vector store.
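The caching pattern behind get_rag() can be illustrated in isolation. The sketch below is not AutoRAG's code: FakeRAG is a hypothetical stand-in for the real object, and the actual get_rag() may take different arguments or a different cache size.

```python
from functools import lru_cache


class FakeRAG:
    """Hypothetical stand-in for the AutoRAG object; counts constructions."""

    instances = 0

    def __init__(self) -> None:
        FakeRAG.instances += 1


@lru_cache(maxsize=1)
def get_rag() -> FakeRAG:
    """Build the expensive object once; later calls return the cached one."""
    return FakeRAG()
```

Every call to get_rag() in the same process returns the same object, and get_rag.cache_clear() forces a fresh instance, which is handy in tests. The cache is per process, so each worker process of a multi-process server builds its own instance.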