Stress testing the async pipeline

scripts/stress.py and scripts/stress-suite.sh drive the broker-backed pipeline with concurrent POST /jobs/audio requests and snapshot Prometheus and Jaeger data into a structured artifact directory. The goal is to exercise the parallelism surfaces (broker fan-out, OLLAMA_NUM_PARALLEL, max_concurrency, the GPU arbiter) and to persist the raw observability data so that richer dashboards can later be built from the same run.

Scenarios

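The suite defines three scenarios, each writing to its own subdirectory of stress-runs/<ISO-8601>/. A default run executes the soak (A) and the ramp (C); the A/B scenario (B) is run manually, one leg at a time.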

Soak / correctness (A): 12 jobs at concurrency 4. Pass criteria: 12/12 jobs reach status=done, every job records all seven stages (whisper → l1 → decide → summarize_l1 → l2 → summarize_l2 → l0) in its stage_states row, and the worker logs contain no exceptions.
Parallelization A/B (B): the same 12-job workload, run once per OLLAMA_NUM_PARALLEL setting. Switching between legs is manual because the knob is a server-boot env var on the ollama container and the devcontainer's docker-socket-proxy does not allow compose --force-recreate; the recreate must happen on the host. See A/B procedure (manual) below.
Saturation ramp (C): concurrency 4 → 8 → 16 → 24, eight jobs per level. The ramp stops at the first level with any failures, so the artifact captures the first breaking point cleanly instead of cascading errors across levels.
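The stop-at-first-failure behaviour of the saturation ramp can be sketched as below. This is a minimal illustration, not the harness's actual code: `run_ramp` and the `run_level` callable (which is assumed to submit one level's jobs and return the number that failed) are hypothetical names.

```python
from typing import Callable, Iterable


def run_ramp(
    levels: Iterable[int],
    run_level: Callable[[int], int],
) -> list[tuple[int, int]]:
    """Run concurrency levels in order and stop after the first failing one.

    run_level(concurrency) is a hypothetical callable that submits that
    level's jobs and returns how many of them failed.
    """
    results: list[tuple[int, int]] = []
    for level in levels:
        failed = run_level(level)
        results.append((level, failed))
        if failed > 0:
            # Keep the first breaking point; do not run higher levels.
            break
    return results


def fake_run_level(concurrency: int) -> int:
    """Stand-in runner: pretend failures start at concurrency 16."""
    return 0 if concurrency < 16 else 3


ramp_results = run_ramp([4, 8, 16, 24], fake_run_level)
print(ramp_results)
```

With the stand-in runner, level 24 is never attempted, which mirrors the intent of capturing only the first breaking point.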

Prerequisites

The harness assumes the full async stack is up with observability and the FastAPI server is reachable:

# on the HOST:
./scripts/stack.sh up --with-observability

# inside the devcontainer (or wherever the dev venv lives):
AUTORAG_OTEL_ENABLED=true uv run autorag serve --host 0.0.0.0 --port 8000 &

The harness verifies http://localhost:8000/health before submitting jobs. The Prometheus (:9090) and Jaeger (:16686) probes are soft: if either is unreachable the run continues, but the corresponding prometheus/ or jaeger/ artifact directory will be empty.
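The hard-versus-soft probe distinction could look roughly like the sketch below. The function names are hypothetical, and the probe paths for Prometheus (`/-/ready`) and Jaeger (`/`) are assumptions; only the ports and the /health URL come from the text above.

```python
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 3.0) -> bool:
    """Return True if the URL answers with a 2xx status, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError, ValueError):
        return False


def preflight(probe: Callable[[str], bool] = http_ok) -> dict[str, bool]:
    """Hard-fail on the API health check; report the soft probes.

    The Prometheus and Jaeger probe paths are assumptions for illustration.
    """
    if not probe("http://localhost:8000/health"):
        raise RuntimeError("API health check failed; not submitting jobs")
    return {
        "prometheus": probe("http://localhost:9090/-/ready"),
        "jaeger": probe("http://localhost:16686/"),
    }
```

Passing the probe as a parameter keeps the decision logic testable without a running stack, for example `preflight(lambda url: True)`.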

Running

End-to-end (soak + ramp):

./scripts/stress-suite.sh

Single scenarios:

./scripts/stress-suite.sh --soak-only
./scripts/stress-suite.sh --ramp-only
./scripts/stress-suite.sh --ab np-8

Fine-grained (Typer):

uv run python scripts/stress.py soak --out stress-runs/manual/soak
uv run python scripts/stress.py ramp --out stress-runs/manual/ramp --levels 4,8,16
uv run python scripts/stress.py ab   --out stress-runs/manual/ab/np-8 --num-parallel 8
uv run python scripts/stress.py capture --out stress-runs/manual/adhoc \
     --start "$(date -d '5 min ago' +%s)" --end "$(date +%s)"
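The capture window above is given in epoch seconds. As a rough sketch of how such a window maps onto a Prometheus range query with the 5s step used for the artifacts, the snippet below builds a `query_range` URL. The metric name is hypothetical, and this is not necessarily how scripts/stress.py issues its queries; the epoch arithmetic is also a portable alternative to GNU `date -d`.

```python
import time
from urllib.parse import urlencode

PROM_URL = "http://localhost:9090"  # Prometheus port from the prerequisites


def query_range_url(query: str, start: int, end: int, step: str = "5s") -> str:
    """Build a Prometheus HTTP API query_range URL for an epoch-second window."""
    params = urlencode(
        {"query": query, "start": start, "end": end, "step": step}
    )
    return f"{PROM_URL}/api/v1/query_range?{params}"


# Last five minutes, equivalent to the date(1) arithmetic in the shell example.
end = int(time.time())
start = end - 5 * 60

# Hypothetical metric name, used only for illustration.
url = query_range_url("autorag_stage_duration_seconds_bucket", start, end)
print(url)
```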

Fixture staging

The worker container bind-mounts only ./src and ./.stack-data (see docker-compose.yml), so the test fixtures at tests/*.webm are not directly visible. The harness stages copies (or hardlinks, where the FS permits) into .stack-data/fixtures/<prefix>/ and submits the worker-visible path /data/fixtures/<prefix>/NNN.webm.

One staged file is created per submitted job: reusing the same source string across jobs collides on derive_session_id and collapses the parallel workload into a single session, which masks races. The staged directory is reused on subsequent runs (staging is idempotent). It lives under .stack-data/ rather than stress-runs/, so the stress-runs/ ignore entry does not cover it, but .stack-data/ is already gitignored.
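A minimal sketch of hardlink-or-copy staging with one file per job follows. `stage_fixtures` is a hypothetical name, and numbering from 000 is an assumption; the demo runs against a temporary directory standing in for .stack-data/fixtures/.

```python
import os
import shutil
import tempfile
from pathlib import Path


def stage_fixtures(source: Path, stage_dir: Path, count: int) -> list[str]:
    """Stage `count` distinct copies of `source` and return worker-visible paths.

    Each job gets its own file so that no two jobs share a source string.
    """
    stage_dir.mkdir(parents=True, exist_ok=True)
    worker_paths: list[str] = []
    for i in range(count):
        name = f"{i:03d}.webm"  # starting at 000 is an assumption
        dest = stage_dir / name
        if not dest.exists():  # idempotent: reuse files from earlier runs
            try:
                os.link(source, dest)  # hardlink where the filesystem permits
            except OSError:
                shutil.copy2(source, dest)  # otherwise fall back to a copy
        worker_paths.append(f"/data/fixtures/{stage_dir.name}/{name}")
    return worker_paths


# Demo in a throwaway directory; the payload is not a real WebM file.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "sample.webm"
    src.write_bytes(b"placeholder")
    target = Path(tmp) / "fixtures" / "soak"
    first = stage_fixtures(src, target, count=3)
    second = stage_fixtures(src, target, count=3)  # reuses the staged files
    staged_names = sorted(p.name for p in target.iterdir())

print(first)
```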

Artifact layout

stress-runs/2026-05-20T22-30-00Z/
  report.md                       # top-level summary + links
  .git-sha                        # repo state at run start
  soak/
    jobs.json                     # per-job timings, status, stage_states
    manifest.json                 # scenario metadata + Prom/Jaeger window
    report.md                     # per-scenario summary
    prometheus/
      stage_duration_bucket.json  # raw query_range envelopes, 5s step
      stage_duration_count.json
      stage_duration_sum.json
      queue_wait_bucket.json
      queue_wait_count.json
      queue_wait_sum.json
      gpu_tenancy_bucket.json
      gpu_tenancy_transitions.json
      jobs_completed.json
      rabbitmq_queue_ready.json
    jaeger/
      autorag-api.json
      autorag-gpu-worker.json
      autorag-io-worker.json
  ramp/
    by-concurrency/lvl-04/  (same layout per level)
    by-concurrency/lvl-08/
    by-concurrency/lvl-16/
    by-concurrency/lvl-24/
    report.md
  ab/                             # only present when --ab was used
    np-8/  (same layout)
    np-4/  (same layout)

Everything under stress-runs/ is local-only; the .gitignore entry keeps it out of commits.
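The per-job data in soak/jobs.json can be checked against the soak pass criteria with something like the sketch below. The schema is an assumption: a list of job objects, each with a `status` string and a `stage_states` mapping keyed by stage name. The worker-log criterion is not covered here.

```python
import json

STAGES = ["whisper", "l1", "decide", "summarize_l1", "l2", "summarize_l2", "l0"]


def soak_problems(jobs: list[dict], expected: int = 12) -> list[str]:
    """Return human-readable violations of the soak criteria (empty = pass).

    Assumes each job dict has "status" and "stage_states" keys; the real
    jobs.json schema may differ.
    """
    problems: list[str] = []
    if len(jobs) != expected:
        problems.append(f"expected {expected} jobs, found {len(jobs)}")
    for idx, job in enumerate(jobs):
        if job.get("status") != "done":
            problems.append(f"job {idx}: status={job.get('status')!r}")
        missing = [s for s in STAGES if s not in job.get("stage_states", {})]
        if missing:
            problems.append(f"job {idx}: missing stages {missing}")
    return problems


# Synthetic data in the assumed shape; a real run would use
# json.load() on stress-runs/<ts>/soak/jobs.json instead.
good_job = {"status": "done", "stage_states": {s: "done" for s in STAGES}}
bad_job = {"status": "failed", "stage_states": {"whisper": "done"}}
sample = json.loads(json.dumps([good_job] * 11 + [bad_job]))

problems = soak_problems(sample)
print(problems)
```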

A/B procedure (manual)

OLLAMA_NUM_PARALLEL is baked into the ollama container’s env at boot. The devcontainer cannot recreate the container through the docker-socket-proxy (compose up --force-recreate requires the BUILD/CONTAINER_CREATE endpoints, which are denied). So the two legs are explicit:

# HOST — leg 1: current default (8)
OLLAMA_NUM_PARALLEL=8 docker compose -p autorag up -d --force-recreate ollama
# DEVCONTAINER:
./scripts/stress-suite.sh --ab np-8

# HOST — leg 2: rollback comparison (4)
OLLAMA_NUM_PARALLEL=4 docker compose -p autorag up -d --force-recreate ollama
# DEVCONTAINER:
./scripts/stress-suite.sh --ab np-4

# HOST — restore the default
OLLAMA_NUM_PARALLEL=8 docker compose -p autorag up -d --force-recreate ollama

Each leg writes to stress-runs/<ts>/ab/np-N/. Compare the legs side by side using each leg's report.md (per-stage mean duration) and the prometheus/queue_wait_*.json series; queue pressure should increase at np=4 if 8 was actually helping.
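One way to compare the two legs from the raw series is to derive a mean from the matching `_sum` and `_count` envelopes, as sketched below. The envelope shape follows the Prometheus `query_range` matrix response; the `stage` label name is an assumption, and counter resets (for example a worker restart mid-run) are not handled.

```python
def series_delta(envelope: dict, label: str) -> dict[str, float]:
    """Sum (last - first) sample per label value over a query_range matrix."""
    deltas: dict[str, float] = {}
    for series in envelope["data"]["result"]:
        values = series["values"]  # list of [timestamp, "value-as-string"]
        key = series["metric"].get(label, "")
        delta = float(values[-1][1]) - float(values[0][1])
        deltas[key] = deltas.get(key, 0.0) + delta
    return deltas


def mean_per_label(sum_env: dict, count_env: dict, label: str = "stage") -> dict[str, float]:
    """Mean observation per label value over the window: delta(sum) / delta(count)."""
    sums = series_delta(sum_env, label)
    counts = series_delta(count_env, label)
    return {k: sums[k] / counts[k] for k in sums if counts.get(k, 0) > 0}


def envelope(stage: str, first: str, last: str) -> dict:
    """Build a synthetic single-series envelope for the demo."""
    return {
        "status": "success",
        "data": {
            "resultType": "matrix",
            "result": [
                {"metric": {"stage": stage}, "values": [[0, first], [5, last]]}
            ],
        },
    }


# Synthetic numbers: 30 s of total duration across 10 observations.
means = mean_per_label(envelope("whisper", "10.0", "40.0"),
                       envelope("whisper", "5", "15"))
print(means)
```

Running the same calculation on each leg's queue_wait_sum.json and queue_wait_count.json gives one mean queue wait per leg to set side by side.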