Stress testing the async pipeline
=================================

``scripts/stress.py`` and ``scripts/stress-suite.sh`` drive the
broker-backed pipeline with concurrent ``POST /jobs/audio`` requests
and snapshot Prometheus metrics and Jaeger traces into a structured
artifact directory.
The point is to exercise the parallelism surfaces (broker fan-out,
``OLLAMA_NUM_PARALLEL``, ``max_concurrency``, the GPU arbiter) and
**persist the raw observability data** so deeper dashboards can be
built off the same run later.

Scenarios
---------

The suite defines three scenarios, each writing to its own subdirectory
of ``stress-runs/<ISO-8601>/``.

* **Soak / correctness (A)** — 12 jobs at concurrency 4. Pass criteria:
  12/12 reach ``status=done``, every job emits all seven stages
  (``whisper → l1 → decide → summarize_l1 → l2 → summarize_l2 → l0``)
  in its ``stage_states`` row, no exceptions in worker logs.

* **Parallelization A/B (B)** — same 12-job workload, run once per
  ``OLLAMA_NUM_PARALLEL`` setting. Switching legs is **manual** because the
  knob is a server-boot env var on the ``ollama`` container and the
  devcontainer's docker-socket-proxy does not allow ``compose
  --force-recreate``; the recreate must happen on the host. See
  :ref:`stress-ab-procedure` below.

* **Saturation ramp (C)** — concurrency 4 → 8 → 16 → 24, eight jobs
  per level. Stops at the first level with any failures so the
  artifact captures the *first* breaking point cleanly rather than
  cascading errors across levels.
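
The stop-at-first-failure rule of the ramp can be sketched as follows.
This is a minimal illustration, not the harness's code: ``run_ramp`` and
the ``run_level`` callback (which stands in for "submit the jobs for one
level and count the failures") are hypothetical names.

.. code-block:: python

   from typing import Callable, Iterable


   def run_ramp(
       levels: Iterable[int],
       run_level: Callable[[int], int],
   ) -> list[tuple[int, int]]:
       """Run levels in order and stop after the first one with failures.

       ``run_level`` is a hypothetical callback that submits the jobs for
       one concurrency level and returns how many of them failed.
       """
       results: list[tuple[int, int]] = []
       for level in levels:
           failures = run_level(level)
           results.append((level, failures))
           if failures:
               # Keep the breaking level in the results, skip higher levels.
               break
       return results


   # Stand-in for a real run: pretend jobs start failing at concurrency 16.
   fake_failures = {4: 0, 8: 0, 16: 3, 24: 8}
   outcome = run_ramp([4, 8, 16, 24], lambda level: fake_failures[level])
   print(outcome)  # [(4, 0), (8, 0), (16, 3)]

Level 24 never runs in this example, so the recorded data ends at the
first level that broke.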

Prerequisites
-------------

The harness assumes the full async stack is up with observability and
the FastAPI server is reachable:

.. code-block:: shell

   # on the HOST:
   ./scripts/stack.sh up --with-observability

   # inside the devcontainer (or wherever the dev venv lives):
   AUTORAG_OTEL_ENABLED=true uv run autorag serve --host 0.0.0.0 --port 8000 &

The harness verifies ``http://localhost:8000/health`` before
submitting; the Prometheus (``:9090``) and Jaeger (``:16686``) probes
are soft — the run continues without them but the
``prometheus/`` / ``jaeger/`` artifact dirs will be empty.
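
The difference between the hard health check and the soft probes can
be sketched like this. It is an illustration under assumptions: only
``/health`` on port 8000 comes from this document, the Prometheus and
Jaeger probe paths are guesses, and ``preflight`` / ``http_ok`` are
hypothetical names.

.. code-block:: python

   import urllib.error
   import urllib.request
   from typing import Callable

   # Hard dependency: the run aborts if this does not answer.
   HARD_PROBES = {"api": "http://localhost:8000/health"}
   # Soft dependencies (assumed probe paths): the run continues without them.
   SOFT_PROBES = {
       "prometheus": "http://localhost:9090/-/ready",
       "jaeger": "http://localhost:16686/",
   }


   def http_ok(url: str, timeout: float = 3.0) -> bool:
       """Return True when the URL answers with a 2xx status."""
       try:
           with urllib.request.urlopen(url, timeout=timeout) as resp:
               return 200 <= resp.status < 300
       except (urllib.error.URLError, OSError):
           return False


   def preflight(probe: Callable[[str], bool] = http_ok) -> set[str]:
       """Raise on a dead hard dependency; return the reachable soft ones."""
       for name, url in HARD_PROBES.items():
           if not probe(url):
               raise RuntimeError(f"{name} is not reachable at {url}")
       return {name for name, url in SOFT_PROBES.items() if probe(url)}


   # Demo with a fake probe: API and Jaeger up, Prometheus down.
   available = preflight(lambda url: ":9090" not in url)
   print(available)  # {'jaeger'}

A caller could then skip the Prometheus snapshot when ``"prometheus"``
is missing from the returned set, which matches the empty artifact
directory described above.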

Running
-------

End-to-end (soak + ramp):

.. code-block:: shell

   ./scripts/stress-suite.sh

Single scenarios:

.. code-block:: shell

   ./scripts/stress-suite.sh --soak-only
   ./scripts/stress-suite.sh --ramp-only
   ./scripts/stress-suite.sh --ab np-8

Fine-grained (Typer):

.. code-block:: shell

   uv run python scripts/stress.py soak --out stress-runs/manual/soak
   uv run python scripts/stress.py ramp --out stress-runs/manual/ramp --levels 4,8,16
   uv run python scripts/stress.py ab   --out stress-runs/manual/ab/np-8 --num-parallel 8
   uv run python scripts/stress.py capture --out stress-runs/manual/adhoc \
        --start "$(date -d '5 min ago' +%s)" --end "$(date +%s)"
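
The ``capture`` window above is a pair of Unix timestamps, and
``date -d '5 min ago'`` is GNU ``date`` syntax. As a portable
alternative, the sketch below computes the same window in Python and
builds a Prometheus ``query_range`` URL with a 5-second step. The
helper name is hypothetical and ``up`` is used only as a placeholder
query; which queries the harness actually issues is not shown here.

.. code-block:: python

   import time
   from urllib.parse import urlencode

   PROMETHEUS = "http://localhost:9090"  # port from the prerequisites


   def query_range_url(query: str, start: int, end: int, step: str = "5s") -> str:
       """Build a URL for Prometheus' /api/v1/query_range endpoint."""
       params = urlencode(
           {"query": query, "start": start, "end": end, "step": step}
       )
       return f"{PROMETHEUS}/api/v1/query_range?{params}"


   # Same window as the shell example: the last five minutes.
   end = int(time.time())
   start = end - 5 * 60
   print(start, end)
   print(query_range_url("up", start, end))

The two printed timestamps can be passed to ``--start`` and ``--end``
directly.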

Fixture staging
---------------

The worker container bind-mounts only ``./src`` and ``./.stack-data``
(see ``docker-compose.yml``), so the test fixtures at ``tests/*.webm``
are not directly visible. The harness stages copies (or hardlinks,
where the FS permits) into ``.stack-data/fixtures/<prefix>/`` and
submits the worker-visible path ``/data/fixtures/<prefix>/NNN.webm``.

Each submitted job gets its own staged file. Re-using the same source
string across jobs collides on ``derive_session_id`` and collapses the
parallel workload into a single session, which masks races. The staged
directory is reused on subsequent runs (staging is idempotent). It does
not live under ``stress-runs/``, but ``.stack-data/`` is already
gitignored, so it stays out of commits as well.
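
A minimal sketch of the staging step, assuming one source fixture is
fanned out into numbered files. ``stage_fixtures`` and ``worker_path``
are hypothetical names, and the zero-based three-digit numbering is an
assumption about what ``NNN`` stands for.

.. code-block:: python

   import os
   import shutil
   import tempfile
   from pathlib import Path


   def stage_fixtures(source: Path, staged_dir: Path, count: int) -> list[Path]:
       """Create ``count`` distinct staged files from one source fixture."""
       staged_dir.mkdir(parents=True, exist_ok=True)
       staged = []
       for i in range(count):
           target = staged_dir / f"{i:03d}.webm"
           if not target.exists():  # reuse files staged by an earlier run
               try:
                   os.link(source, target)  # hardlink where the FS permits
               except OSError:
                   shutil.copy2(source, target)  # otherwise fall back to a copy
           staged.append(target)
       return staged


   def worker_path(prefix: str, staged: Path) -> str:
       """Translate a staged host file to the path the worker sees."""
       return f"/data/fixtures/{prefix}/{staged.name}"


   # Demo in a temporary directory standing in for .stack-data/.
   with tempfile.TemporaryDirectory() as tmp:
       src = Path(tmp) / "sample.webm"
       src.write_bytes(b"placeholder bytes, not real audio")
       first = stage_fixtures(src, Path(tmp) / "fixtures" / "soak", 3)
       second = stage_fixtures(src, Path(tmp) / "fixtures" / "soak", 3)
       submitted = [worker_path("soak", p) for p in first]

   print(submitted)

Because every job is submitted with a different path, each source
string is unique, which is what keeps the jobs from collapsing into
one session.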

Artifact layout
---------------

::

   stress-runs/2026-05-20T22-30-00Z/
     report.md                       # top-level summary + links
     .git-sha                        # repo state at run start
     soak/
       jobs.json                     # per-job timings, status, stage_states
       manifest.json                 # scenario metadata + Prom/Jaeger window
       report.md                     # per-scenario summary
       prometheus/
         stage_duration_bucket.json  # raw query_range envelopes, 5s step
         stage_duration_count.json
         stage_duration_sum.json
         queue_wait_bucket.json
         queue_wait_count.json
         queue_wait_sum.json
         gpu_tenancy_bucket.json
         gpu_tenancy_transitions.json
         jobs_completed.json
         rabbitmq_queue_ready.json
       jaeger/
         autorag-api.json
         autorag-gpu-worker.json
         autorag-io-worker.json
     ramp/
       by-concurrency/lvl-04/  (same layout per level)
       by-concurrency/lvl-08/
       by-concurrency/lvl-16/
       by-concurrency/lvl-24/
       report.md
     ab/                             # only present when --ab was used
       np-8/  (same layout)
       np-4/  (same layout)

Everything under ``stress-runs/`` is local-only; the ``.gitignore``
entry keeps it out of commits.
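
Because the raw per-job data is persisted, the soak pass criteria can
be re-checked offline. The sketch below assumes a shape for
``jobs.json`` that this document does not specify: a list of objects
with ``id``, ``status`` and a ``stage_states`` mapping keyed by stage
name. Treat the field names as hypothetical and adjust them to the
real file. The worker-log criterion is not covered.

.. code-block:: python

   EXPECTED_STAGES = [
       "whisper", "l1", "decide", "summarize_l1", "l2", "summarize_l2", "l0",
   ]


   def soak_failures(jobs: list[dict]) -> list[str]:
       """Return one message per violated pass criterion (empty means pass)."""
       problems = []
       for job in jobs:
           job_id = job.get("id", "<unknown>")  # assumed field name
           if job.get("status") != "done":
               problems.append(f"{job_id}: status={job.get('status')}")
           seen = job.get("stage_states", {})  # assumed: mapping by stage name
           missing = [stage for stage in EXPECTED_STAGES if stage not in seen]
           if missing:
               problems.append(f"{job_id}: missing stages {missing}")
       return problems


   # Inline sample in the assumed shape; a real check would json.load()
   # the soak/jobs.json file of a run instead.
   sample = [
       {"id": "job-1", "status": "done",
        "stage_states": {stage: "done" for stage in EXPECTED_STAGES}},
       {"id": "job-2", "status": "failed",
        "stage_states": {"whisper": "done", "l1": "done"}},
   ]
   for line in soak_failures(sample):
       print(line)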

.. _stress-ab-procedure:

A/B procedure (manual)
----------------------

``OLLAMA_NUM_PARALLEL`` is baked into the ollama container's env at
boot. The devcontainer cannot recreate the container through the
docker-socket-proxy (``compose up --force-recreate`` requires the
``BUILD``/``CONTAINER_CREATE`` endpoints, which are denied), so each
leg is a manual two-step: recreate ``ollama`` on the host, then run the
suite from the devcontainer:

.. code-block:: shell

   # HOST — leg 1: current default (8)
   OLLAMA_NUM_PARALLEL=8 docker compose -p autorag up -d --force-recreate ollama
   # DEVCONTAINER:
   ./scripts/stress-suite.sh --ab np-8

   # HOST — leg 2: rollback comparison (4)
   OLLAMA_NUM_PARALLEL=4 docker compose -p autorag up -d --force-recreate ollama
   # DEVCONTAINER:
   ./scripts/stress-suite.sh --ab np-4

   # HOST — restore the default
   OLLAMA_NUM_PARALLEL=8 docker compose -p autorag up -d --force-recreate ollama

Each leg writes to ``stress-runs/<ts>/ab/np-N/``. Compare side-by-side
through the ``report.md`` of each leg (per-stage mean duration) and the
``prometheus/queue_wait_*.json`` series (queue pressure should
*increase* at np=4 if 8 was actually helping).
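
One way to turn the ``queue_wait_sum`` and ``queue_wait_count`` series
into a single number per leg is to divide the growth of the sum by the
growth of the count over the run window. The sketch below works on the
standard Prometheus ``query_range`` response shape and assumes the
files hold raw cumulative counters with no counter reset inside the
window. The function names and the sample numbers are made up for
illustration.

.. code-block:: python

   from typing import Optional


   def window_increase(envelope: dict) -> float:
       """Sum over all series of (last sample - first sample)."""
       total = 0.0
       for series in envelope["data"]["result"]:
           values = series["values"]  # [[unix_ts, "value-as-string"], ...]
           total += float(values[-1][1]) - float(values[0][1])
       return total


   def mean_over_window(sum_env: dict, count_env: dict) -> Optional[float]:
       """Mean observation in the window, or None if nothing was observed."""
       observations = window_increase(count_env)
       if observations <= 0:
           return None
       return window_increase(sum_env) / observations


   def envelope(first: float, last: float) -> dict:
       """Build a one-series query_range envelope for the demo."""
       return {
           "status": "success",
           "data": {
               "resultType": "matrix",
               "result": [
                   {"metric": {}, "values": [[0, str(first)], [5, str(last)]]}
               ],
           },
       }


   # Made-up numbers: 12 jobs per leg, 12 s vs 36 s of total queue wait.
   np8 = mean_over_window(envelope(10.0, 22.0), envelope(100, 112))
   np4 = mean_over_window(envelope(10.0, 46.0), envelope(100, 112))
   print(np8, np4)  # 1.0 3.0

With real data, the two envelopes for each leg would be loaded from
that leg's ``prometheus/`` directory with ``json.load``.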