Running the HTTP server
=======================

The ``autorag serve`` command runs a FastAPI server with uvicorn.
It requires the ``[server]`` extra; the ``/viz`` endpoints additionally
need ``[rag]``.

.. code-block:: bash

    autorag serve --host 0.0.0.0 --port 8000

    # With auto-reload (development)
    autorag serve --reload

Endpoints
---------

.. list-table::
   :header-rows: 1
   :widths: 8 22 70

   * - Method
     - Path
     - Description
   * - ``GET``
     - ``/health``
     - Liveness probe; always returns ``{"status": "ok"}``.
   * - ``POST``
     - ``/ingest``
     - Document ingestion. Body:
       :class:`~autorag.schemas.IngestRequest`.
   * - ``POST``
     - ``/query``
     - RAG query. Body: :class:`~autorag.schemas.QueryRequest`.
   * - ``GET``
     - ``/viz``
     - React 3-D scatter (needs ``[rag]``).
   * - ``GET``
     - ``/viz/data``
     - UMAP coordinates + cluster labels + edges as JSON.
   * - ``GET``
     - ``/viz/search``
     - Semantic search over topic embeddings.
   * - ``GET``
     - ``/viz-assets/*``
     - Static file mount for the React bundle.
   * - ``POST``
     - ``/jobs/audio``
     - Enqueues an audio→topics job and returns ``202`` with a
       ``job_id`` (async path; needs ``[broker,rag]``).
   * - ``GET``
     - ``/jobs/{job_id}``
     - Job status + per-stage state.
   * - ``GET``
     - ``/jobs/{job_id}/result``
     - The finished clip row; ``409`` until the job is ``done``.

Calling the API
---------------

.. code-block:: bash

    curl http://localhost:8000/health
    # {"status":"ok"}

    curl -X POST http://localhost:8000/ingest \
         -H 'content-type: application/json' \
         -d '{"paths": ["./notes"]}'

    curl -X POST http://localhost:8000/query \
         -H 'content-type: application/json' \
         -d '{"question":"What did we decide about retries?","top_k":5}'
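
The same calls can be made from Python. The following is a minimal
sketch using only the standard library. The helper names
``build_json_post`` and ``post_json`` are illustrative and not part of
AutoRAG, and the request bodies simply mirror the ``curl`` examples
above. The response schema is not described on this page, so the sketch
returns the parsed JSON without interpreting it.

.. code-block:: python

    import json
    import urllib.request

    BASE_URL = "http://localhost:8000"  # assumption: server started as shown above


    def build_json_post(path, payload, base_url=BASE_URL):
        """Build (but do not send) a JSON POST request for ``path``."""
        return urllib.request.Request(
            base_url + path,
            data=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="POST",
        )


    def post_json(path, payload, base_url=BASE_URL, timeout=30.0):
        """Send the request and decode the JSON response body."""
        request = build_json_post(path, payload, base_url)
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return json.loads(response.read().decode("utf-8"))

With a server running, ``post_json("/query", {"question": "What did we
decide about retries?", "top_k": 5})`` sends the same body as the last
``curl`` call. Non-2xx responses surface as ``urllib.error.HTTPError``.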

Async job pipeline
------------------

The synchronous endpoints above process inline and never need a broker.
For running **many** audio→topics requests concurrently there is an
optional async path behind the ``[broker]`` extra: a RabbitMQ broker, a
single GPU worker (Whisper + every LLM stage), and an IO worker (the
``persist`` stage). It is fully decoupled — installing or running it
changes nothing about the synchronous SDK / CLI / API.

The repo-root ``docker-compose.yml`` is the **single source of truth**
for the whole host stack — five services on a shared ``autorag-net``
network that the devcontainer also joins:

.. list-table::
   :header-rows: 1
   :widths: 16 40 44

   * - Service
     - What it is
     - Notes
   * - ``rabbitmq``
     - ``rabbitmq:3-management`` broker
     - AMQP on ``:5672``, management UI on ``:15672``; healthcheck-gated
       so the workers wait for it to be ready.
   * - ``ollama``
     - ``ollama/ollama`` LLM + embedding server
     - Now a compose service (no longer host/devcontainer-native). Owns
       the server-side tuning contract as its only copy; reserves the
       NVIDIA GPU; models persist in the ``ollama-models`` volume.
   * - ``gpu-worker``
     - Whisper + every LLM stage
     - Lean ``.devcontainer/worker.Dockerfile`` (CUDA + extras, deps
       only). Only ``./src`` + ``./pyproject.toml`` are bind-mounted
       **read-only** (not the repo root — ``.env`` never enters a worker
       container) and run via ``uv run``, so a code edit needs only a
       restart — no rebuild. Reserves the GPU; ``replicas: 1`` — one
       physical GPU, do not scale it.
   * - ``io-worker``
     - The ``persist`` stage (SQLite + Chroma writes)
     - Same image, no GPU. Shares ``./.stack-data`` with ``gpu-worker``.
   * - ``docker-socket-proxy``
     - Filtered Docker-API control plane
     - Pinned ``tecnativa/docker-socket-proxy:0.3.0`` — the **only**
       container mounting ``/var/run/docker.sock`` (read-only, **no
       published port**, reachable only by service name on
       ``autorag-net``). All endpoint groups default-deny except the few
       ``docker compose ps|logs|restart`` need;
       ``BUILD``/``EXEC``/``IMAGES``/… are refused **at the proxy**, so
       there is no host-code-exec path. The devcontainer points
       ``DOCKER_HOST`` here — that *is* the per-edit control loop; there
       is no bespoke control service and no token.

Deploying with ``scripts/stack.sh``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The stack is brought up on the **host** with one idempotent command
(the devcontainer is a thin sandbox with no dockerd — only a CLI
pointed at the read-only proxy, which refuses ``build``/``up`` — so the
initial ``up`` is inherently host-side):

.. code-block:: bash

    cp .env.example .env        # optional — only for HF_TOKEN (diarization)
    ./scripts/stack.sh up       # create net → build → wait healthy → pull models
    ./scripts/stack.sh rebuild  # dependency-only image rebuild, then up
    ./scripts/stack.sh down -v  # stop (and drop named volumes)

``up`` creates the shared ``autorag-net`` network, builds the images,
waits for ``rabbitmq`` + ``ollama`` to report healthy, then pulls the
LLM + embedding models into the ``ollama-models`` volume (first run is
multi-GB / several minutes; ``AUTORAG_SKIP_MODEL_PULL=1`` opts out).

Prerequisites:

* **The NVIDIA container runtime**, for the ``ollama`` + ``gpu-worker``
  device reservations.
* ``HF_TOKEN`` (optional, in ``.env``) is passed through for
  diarization; without it every word is labelled ``"0"`` (same as the
  synchronous path). There is no control-plane token.

.. note::

   ``./scripts/stack.sh up`` does **not** start the HTTP API. The stack
   runs the broker, Ollama, the two workers and ``docker-socket-proxy``.
   To drive the pipeline you have two options:

   * Run ``autorag serve`` yourself (needs the ``[server]`` extra)
     pointed at the **same** ``AUTORAG_DB_PATH`` and
     ``AUTORAG_BROKER_URL`` as the workers, then use the ``/jobs/*``
     HTTP endpoints below; or
   * Skip the API entirely and use the ``autorag jobs submit`` /
     ``autorag jobs status`` CLI, which talks to the broker directly.

.. warning::

   If the host ``autorag serve`` uses a *different* ``AUTORAG_DB_PATH``
   than the workers (the default is ``~/.autorag/autorag.db``, **not**
   ``./.stack-data/autorag.db``), the broker still delivers the job but
   the worker's status writes land in a DB the API never reads — every
   ``/jobs/{job_id}`` is **stuck on** ``queued`` forever, even after the job
   has failed and dead-lettered. Put the absolute shared path in the
   repo ``.env`` (``autorag serve`` reads it via ``SettingsConfigDict``)
   so it can't drift::

       AUTORAG_DB_PATH=/abs/path/to/repo/.stack-data/autorag.db

   Also: the ``gpu-worker`` image must install the ``[youtube]`` extra
   for URL inputs (``.devcontainer/worker.Dockerfile``); without it the
   whisper stage raises ``MissingExtraError`` on every YouTube job.
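
A quick pre-flight check can catch the path drift described in the
warning before any job is submitted. This is only a sketch:
``effective_db_path`` and ``uses_stack_db`` are hypothetical helpers,
not AutoRAG functions, and they inspect only the mapping they are given
(for example ``os.environ``). They do not read the repo ``.env`` file
the way ``autorag serve`` does.

.. code-block:: python

    from pathlib import Path

    # Default quoted in the warning above.
    DEFAULT_DB_PATH = "~/.autorag/autorag.db"


    def effective_db_path(env):
        """Return the DB path a process with environment ``env`` would use."""
        raw = env.get("AUTORAG_DB_PATH", DEFAULT_DB_PATH)
        return Path(raw).expanduser().resolve()


    def uses_stack_db(env, repo_root):
        """True if ``env`` points at <repo_root>/.stack-data/autorag.db."""
        expected = (Path(repo_root) / ".stack-data" / "autorag.db").resolve()
        return effective_db_path(env) == expected

Calling ``uses_stack_db(os.environ, "/abs/path/to/repo")`` in the shell
that will run ``autorag serve`` returns ``False`` when the variable is
unset or points somewhere else.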

With the API running (same DB + broker as the workers):

.. code-block:: bash

    curl -X POST http://localhost:8000/jobs/audio \
         -H 'content-type: application/json' \
         -d '{"source":"https://youtu.be/VIDEO","title":"Demo"}'
    # {"job_id":"…","session_id":"…","status":"queued"}

    curl http://localhost:8000/jobs/JOB_ID          # status + per-stage state
    curl http://localhost:8000/jobs/JOB_ID/result   # the finished clip (409 until done)

Or without an API server, straight from the CLI:

.. code-block:: bash

    autorag jobs submit https://youtu.be/VIDEO --title Demo
    autorag jobs status JOB_ID
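
The HTTP submit-then-poll flow can also be scripted. The sketch below
assumes the status response is a JSON object with a ``status`` field (as
in the submit response shown above) and that ``done`` is the success
state. ``failed`` is a guess at the failure state's name, so adjust
``TERMINAL_STATES`` to whatever the API actually reports.
``fetch_json``, ``wait_for_job`` and the polling parameters are
illustrative names, not AutoRAG APIs.

.. code-block:: python

    import json
    import time
    import urllib.request

    BASE_URL = "http://localhost:8000"  # assumption: API started with `autorag serve`
    # "done" comes from the endpoint table; "failed" is an assumed name.
    TERMINAL_STATES = frozenset({"done", "failed"})


    def fetch_json(url, timeout=30.0):
        """GET ``url`` and decode the JSON body."""
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return json.loads(response.read().decode("utf-8"))


    def wait_for_job(job_id, fetch=fetch_json, base_url=BASE_URL,
                     interval=2.0, max_polls=300, sleep=time.sleep):
        """Poll /jobs/{job_id} until a terminal state; return the last body."""
        url = f"{base_url}/jobs/{job_id}"
        for attempt in range(max_polls):
            body = fetch(url)
            if body.get("status") in TERMINAL_STATES:
                return body
            if attempt < max_polls - 1:
                sleep(interval)  # back off before asking again
        raise TimeoutError(f"job {job_id} not finished after {max_polls} polls")

The bounded ``max_polls`` matters here: if the API and the workers
disagree on ``AUTORAG_DB_PATH``, the status never leaves ``queued``, and
an unbounded loop would spin forever. Once the status is ``done``,
fetch ``/jobs/{job_id}/result``; requesting it earlier yields ``409``,
which ``urllib`` raises as ``urllib.error.HTTPError``.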

Workers and whatever serves the API / ``/viz`` share state through the
repo-local ``./.stack-data`` bind (``AUTORAG_DB_PATH=/data/autorag.db``
inside the containers) — a finished async job writes the **same** SQLite
/ Chroma rows a CLI run would, so ``/viz`` and every other reader work
unchanged. Because ``.stack-data`` is a repo-local directory it is also
visible from the devcontainer (at ``/workspace/autorag/.stack-data``)
for direct cross-process job/error inspection. Without the ``[broker]``
/ ``[rag]`` extras the ``/jobs/*`` endpoints return ``503`` with an
install hint and the rest of the API is unaffected. See the "Async
pipeline" section of ``CLAUDE.md`` for the architecture.

Devcontainer (thin sandbox) and the per-edit loop
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The devcontainer is a **thin sandbox: no dockerd, no mounted socket, no
GPU**. It **joins** the stack's shared ``autorag-net`` network (declared
``external`` in ``docker-compose.yml``; created idempotently by both
``./scripts/stack.sh`` and the devcontainer ``initializeCommand``), so
it reaches services by name —
``AUTORAG_BROKER_URL=amqp://rabbitmq:5672``,
``AUTORAG_OLLAMA_BASE_URL=http://ollama:11434`` — and its **client-only**
docker CLI points at the read-only ``docker-socket-proxy``
(``DOCKER_HOST=tcp://docker-socket-proxy:2375``). Its
``.devcontainer/check-stack.sh`` postStartCommand only *probes* via
``docker compose ps`` and **always exits 0** — the async path is
optional and the synchronous SDK/CLI/API never need the stack.

After a code edit, bounce a worker from inside the sandbox with plain
compose verbs (the proxy enforces ps/logs/restart-only — no helper, no
token):

.. code-block:: bash

    docker compose -p autorag ps                  # per-service status
    docker compose -p autorag restart gpu-worker  # picks up a bind-mounted edit
    docker compose -p autorag logs --tail 200 gpu-worker

Workers ``uv run`` the bind-mounted ``./src`` (not the whole repo), so
``restart`` (not a rebuild) makes an edit live. The proxy refuses
``build``/``up``/``exec``
— a dependency change is the host-side ``./scripts/stack.sh rebuild``.
The sandbox's project venv is ``UV_PROJECT_ENVIRONMENT=/opt/autorag-venv``
(outside the bind-mounted workspace) so host and sandbox never thrash a
shared ``./.venv``.

Mounting the app yourself
-------------------------

If you need to embed AutoRAG inside a larger FastAPI app, import ``app``
from ``autorag.api`` and either mount it or merge its routers:

.. code-block:: python

    from fastapi import FastAPI
    from autorag.api import app as autorag_app

    parent = FastAPI()
    parent.mount("/autorag", autorag_app)

The :func:`~autorag.api.get_rag` helper returns a process-wide
``AutoRAG`` singleton via ``functools.lru_cache``, so reusing the app
across requests doesn't re-instantiate the embedder or vector store.
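
To illustrate why an ``lru_cache``-wrapped, zero-argument factory behaves
as a process-wide singleton, here is a small standalone sketch. It is
not the AutoRAG source: ``FakeRAG`` and ``get_fake_rag`` are stand-ins
invented for the example, and the real ``get_rag`` may differ in its
details.

.. code-block:: python

    from functools import lru_cache


    class FakeRAG:
        """Stand-in for an expensive object (embedder + vector store)."""

        instances = 0  # counts how many times the constructor ran

        def __init__(self):
            FakeRAG.instances += 1


    @lru_cache(maxsize=1)
    def get_fake_rag():
        """Build the object on the first call; return the cached one afterwards."""
        return FakeRAG()


    first = get_fake_rag()
    second = get_fake_rag()
    assert first is second          # same object on every call
    assert FakeRAG.instances == 1   # constructor ran only once

In this sketch, ``get_fake_rag.cache_clear()`` (standard ``functools``
behaviour) drops the cached instance so the next call builds a fresh
one, which can be handy in tests.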