Stress testing the async pipeline
=================================

``scripts/stress.py`` and ``scripts/stress-suite.sh`` drive the broker-backed
pipeline with concurrent ``POST /jobs/audio`` requests and snapshot Prometheus
and Jaeger into a structured artifact directory. The point is to exercise the
parallelism surfaces (broker fan-out, ``OLLAMA_NUM_PARALLEL``,
``max_concurrency``, the GPU arbiter) and **persist the raw observability
data** so deeper dashboards can be built off the same run later.

Scenarios
---------

The suite runs three scenarios, each writing to its own subdirectory of
``stress-runs//``.

* **Soak / correctness (A)**: 12 jobs at concurrency 4. Pass criteria: 12/12
  reach ``status=done``, every job emits all seven stages
  (``whisper → l1 → decide → summarize_l1 → l2 → summarize_l2 → l0``) in its
  ``stage_states`` row, and no exceptions appear in the worker logs.
* **Parallelization A/B (B)**: the same 12-job workload, run once per
  ``OLLAMA_NUM_PARALLEL`` setting. Switching between legs is **manual**
  because the knob is a server-boot env var on the ``ollama`` container and
  the devcontainer's docker-socket-proxy does not allow
  ``compose --force-recreate``; the recreate must happen on the host. See
  :ref:`stress-ab-procedure` below.
* **Saturation ramp (C)**: concurrency 4 → 8 → 16 → 24, eight jobs per level.
  The ramp stops at the first level with any failures, so the artifact
  captures the *first* breaking point cleanly rather than cascading errors
  across levels.

Prerequisites
-------------

The harness assumes the full async stack is up with observability and the
FastAPI server is reachable:

.. code-block:: shell

   # on the HOST:
   ./scripts/stack.sh up --with-observability

   # inside the devcontainer (or wherever the dev venv lives):
   AUTORAG_OTEL_ENABLED=true uv run autorag serve --host 0.0.0.0 --port 8000 &

The harness verifies ``http://localhost:8000/health`` before submitting. The
Prometheus (``:9090``) and Jaeger (``:16686``) probes are soft: the run
continues without them, but the ``prometheus/`` and ``jaeger/`` artifact dirs
will be empty.

Running
-------

End-to-end (soak + ramp):

.. code-block:: shell

   ./scripts/stress-suite.sh

Single scenarios:

.. code-block:: shell

   ./scripts/stress-suite.sh --soak-only
   ./scripts/stress-suite.sh --ramp-only
   ./scripts/stress-suite.sh --ab np-8

Fine-grained (Typer):

.. code-block:: shell

   uv run python scripts/stress.py soak --out stress-runs/manual/soak
   uv run python scripts/stress.py ramp --out stress-runs/manual/ramp --levels 4,8,16
   uv run python scripts/stress.py ab --out stress-runs/manual/ab/np-8 --num-parallel 8
   uv run python scripts/stress.py capture --out stress-runs/manual/adhoc \
       --start "$(date -d '5 min ago' +%s)" --end "$(date +%s)"

Fixture staging
---------------

The worker container bind-mounts only ``./src`` and ``./.stack-data`` (see
``docker-compose.yml``), so the test fixtures at ``tests/*.webm`` are not
directly visible to it. The harness stages copies (or hardlinks, where the
filesystem permits) into ``.stack-data/fixtures//`` and submits the
worker-visible path ``/data/fixtures//NNN.webm``. One file is staged per
submitted job: re-using the same source string across jobs collides on
``derive_session_id`` and collapses the parallel workload to a single
session, masking races.

The staged directory is reused on subsequent runs (idempotent). It lives
outside ``stress-runs/``, so that gitignore entry does not cover it, but
``.stack-data/`` is already gitignored.

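
The staging step described above can be sketched roughly as follows. This is
a minimal illustration, not the harness's actual code: the function name
``stage_fixtures``, the zero-based ``NNN`` numbering, and the
hardlink-then-copy fallback order are all assumptions made for the example.

.. code-block:: python

   import os
   import shutil
   from pathlib import Path
   from typing import List


   def stage_fixtures(source: Path, staged_dir: Path, count: int) -> List[Path]:
       """Stage `count` uniquely named files backed by `source` (hypothetical helper)."""
       staged_dir.mkdir(parents=True, exist_ok=True)
       staged = []
       for index in range(count):
           # One distinct path per job, so no two jobs share a source string.
           target = staged_dir / f"{index:03d}{source.suffix}"
           if not target.exists():  # idempotent: reuse files from earlier runs
               try:
                   os.link(source, target)  # hardlink: no extra disk usage
               except OSError:
                   # Cross-device link or a filesystem without hardlink support.
                   shutil.copy2(source, target)
           staged.append(target)
       return staged

Calling ``stage_fixtures(Path("tests/sample.webm"), some_dir, 12)`` (with a
hypothetical fixture name) would yield twelve distinct paths, which is what
keeps each job on its own session in a 12-job soak run.
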
Artifact layout
---------------

::

   stress-runs/2026-05-20T22-30-00Z/
     report.md                        # top-level summary + links
     .git-sha                         # repo state at run start
     soak/
       jobs.json                      # per-job timings, status, stage_states
       manifest.json                  # scenario metadata + Prom/Jaeger window
       report.md                      # per-scenario summary
       prometheus/
         stage_duration_bucket.json   # raw query_range envelopes, 5s step
         stage_duration_count.json
         stage_duration_sum.json
         queue_wait_bucket.json
         queue_wait_count.json
         queue_wait_sum.json
         gpu_tenancy_bucket.json
         gpu_tenancy_transitions.json
         jobs_completed.json
         rabbitmq_queue_ready.json
       jaeger/
         autorag-api.json
         autorag-gpu-worker.json
         autorag-io-worker.json
     ramp/
       by-concurrency/lvl-04/         (same layout per level)
       by-concurrency/lvl-08/
       by-concurrency/lvl-16/
       by-concurrency/lvl-24/
       report.md
     ab/                              # only present when --ab was used
       np-8/                          (same layout)
       np-4/                          (same layout)

Everything under ``stress-runs/`` is local-only; the ``.gitignore`` entry
keeps it out of commits.

.. _stress-ab-procedure:

A/B procedure (manual)
----------------------

``OLLAMA_NUM_PARALLEL`` is baked into the ollama container's env at boot. The
devcontainer cannot recreate the container through the docker-socket-proxy
(``compose up --force-recreate`` requires the ``BUILD``/``CONTAINER_CREATE``
endpoints, which are denied). So the two legs are explicit:

.. code-block:: shell

   # HOST, leg 1: current default (8)
   OLLAMA_NUM_PARALLEL=8 docker compose -p autorag up -d --force-recreate ollama
   # DEVCONTAINER:
   ./scripts/stress-suite.sh --ab np-8

   # HOST, leg 2: rollback comparison (4)
   OLLAMA_NUM_PARALLEL=4 docker compose -p autorag up -d --force-recreate ollama
   # DEVCONTAINER:
   ./scripts/stress-suite.sh --ab np-4

   # HOST: restore the default
   OLLAMA_NUM_PARALLEL=8 docker compose -p autorag up -d --force-recreate ollama

Each leg writes to ``stress-runs//ab/np-N/``.
Compare side-by-side through the ``report.md`` of each leg (per-stage mean duration) and the ``prometheus/queue_wait_*.json`` series (queue pressure should *increase* at np=4 if 8 was actually helping).
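
One way to put a number on the queue-pressure comparison is to derive a mean
queue wait per leg from the captured ``queue_wait_sum.json`` and
``queue_wait_count.json`` files. The sketch below is hedged: it assumes each
file holds a standard Prometheus ``query_range`` response (``data.result[]``
with ``values`` as ``[timestamp, "value"]`` pairs) and that the captured
series are raw cumulative counters rather than pre-computed rates. The helper
names are hypothetical, and counter resets inside the window are ignored.

.. code-block:: python

   import json
   from pathlib import Path
   from typing import Optional


   def counter_increase(envelope: dict) -> float:
       """Total last-minus-first increase across every series in the envelope."""
       total = 0.0
       for series in envelope["data"]["result"]:
           values = series["values"]  # [[unix_ts, "value-as-string"], ...]
           if len(values) >= 2:
               total += float(values[-1][1]) - float(values[0][1])
       return total


   def mean_queue_wait(leg_dir: Path) -> Optional[float]:
       """Mean observed queue wait over the captured window, or None if no samples."""
       prom = leg_dir / "prometheus"
       wait_sum = counter_increase(
           json.loads((prom / "queue_wait_sum.json").read_text())
       )
       wait_count = counter_increase(
           json.loads((prom / "queue_wait_count.json").read_text())
       )
       return wait_sum / wait_count if wait_count > 0 else None

Running ``mean_queue_wait`` on the ``np-8`` and ``np-4`` leg directories gives
two figures to set side by side; if the captured queries turn out to be
``rate()`` expressions instead of raw counters, the subtraction step would
need to be replaced with an average over the samples.
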