Stress testing the async pipeline
scripts/stress.py + scripts/stress-suite.sh drive the
broker-backed pipeline with concurrent POST /jobs/audio requests
and snapshot Prometheus + Jaeger into a structured artifact directory.
The point is to exercise the parallelism surfaces (broker fan-out,
OLLAMA_NUM_PARALLEL, max_concurrency, the GPU arbiter) and
persist the raw observability data so deeper dashboards can be
built off the same run later.
Scenarios
The suite runs three scenarios, each writing to its own subdirectory
of stress-runs/<ISO-8601>/.
Soak / correctness (A): 12 jobs at concurrency 4. Pass criteria: 12/12 reach status=done, every job emits all seven stages (whisper → l1 → decide → summarize_l1 → l2 → summarize_l2 → l0) in its stage_states row, and no exceptions appear in the worker logs.
Parallelization A/B (B): the same 12-job workload, run once per OLLAMA_NUM_PARALLEL setting. The switch between legs is manual because the knob is a server-boot env var on the ollama container and the devcontainer's docker-socket-proxy does not allow compose --force-recreate; the recreate must happen on the host. See A/B procedure (manual) below.
Saturation ramp (C): concurrency 4 → 8 → 16 → 24, eight jobs per level. The ramp stops at the first level with any failures, so the artifact captures the first breaking point cleanly rather than cascading errors across levels.
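The stop-at-first-failure behaviour of the saturation ramp can be sketched as below. This is a minimal illustration, not the harness's actual code: `run_ramp` and the `run_level` callable are hypothetical names, and `run_level` is assumed to submit the jobs for one level and return the number that failed.

```python
from typing import Callable, Sequence


def run_ramp(
    levels: Sequence[int],
    run_level: Callable[[int], int],
) -> list[tuple[int, int]]:
    """Run each concurrency level in order and stop after the first failing one.

    run_level is a hypothetical callable: it takes a concurrency level,
    runs that level's jobs, and returns how many of them failed.
    """
    results: list[tuple[int, int]] = []
    for level in levels:
        failures = run_level(level)
        results.append((level, failures))
        if failures > 0:
            # Record the first breaking point and skip the higher levels.
            break
    return results


if __name__ == "__main__":
    # Fake runner for illustration only: pretend the stack breaks at 16.
    def fake_runner(level: int) -> int:
        return 0 if level < 16 else 3

    print(run_ramp([4, 8, 16, 24], fake_runner))
```

With the fake runner, level 24 is never attempted, which is what keeps the artifact limited to the first breaking point.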
Prerequisites
The harness assumes the full async stack is up with observability and the FastAPI server is reachable:
# on the HOST:
./scripts/stack.sh up --with-observability
# inside the devcontainer (or wherever the dev venv lives):
AUTORAG_OTEL_ENABLED=true uv run autorag serve --host 0.0.0.0 --port 8000 &
The harness verifies http://localhost:8000/health before
submitting; the Prometheus (:9090) and Jaeger (:16686) probes
are soft — the run continues without them but the
prometheus/ / jaeger/ artifact dirs will be empty.
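The difference between the hard health check and the soft observability probes might look like the sketch below. The function names are hypothetical, and the probe paths for Prometheus (`/-/healthy`) and Jaeger (`/`) are assumptions; the text above only fixes the ports and the `/health` URL.

```python
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET on url answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError, ValueError):
        return False


def preflight(probe: Callable[[str], bool] = http_ok) -> dict[str, bool]:
    """Fail hard on the API health check; only record the soft probes."""
    if not probe("http://localhost:8000/health"):
        raise RuntimeError("API health check failed; refusing to submit jobs")
    # Soft probes: a False here means the matching artifact dir stays empty.
    # The paths below are assumptions, not taken from the harness.
    return {
        "prometheus": probe("http://localhost:9090/-/healthy"),
        "jaeger": probe("http://localhost:16686/"),
    }
```

Passing the probe as a parameter keeps the decision logic testable without a running stack.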
Running
End-to-end (soak + ramp):
./scripts/stress-suite.sh
Single scenarios:
./scripts/stress-suite.sh --soak-only
./scripts/stress-suite.sh --ramp-only
./scripts/stress-suite.sh --ab np-8
Fine-grained (Typer):
uv run python scripts/stress.py soak --out stress-runs/manual/soak
uv run python scripts/stress.py ramp --out stress-runs/manual/ramp --levels 4,8,16
uv run python scripts/stress.py ab --out stress-runs/manual/ab/np-8 --num-parallel 8
uv run python scripts/stress.py capture --out stress-runs/manual/adhoc \
    --start "$(date -d '5 min ago' +%s)" --end "$(date +%s)"
Fixture staging
The worker container bind-mounts only ./src and ./.stack-data
(see docker-compose.yml), so the test fixtures at tests/*.webm
are not directly visible. The harness stages copies (or hardlinks,
where the FS permits) into .stack-data/fixtures/<prefix>/ and
submits the worker-visible path /data/fixtures/<prefix>/NNN.webm.
One staged file per submitted job — re-using the same source string
across jobs collides on derive_session_id and collapses the
parallel workload to a single session, masking races. The staged dir
is reused on subsequent runs (idempotent). It lives under .stack-data/
rather than stress-runs/, so the stress-runs/ ignore rule does not cover
it, but .stack-data/ is already gitignored.
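A staging step along these lines could be written as follows. It is a sketch under stated assumptions: `stage_fixtures` is a hypothetical name, the zero-based `NNN` numbering is a guess, and the mapping from `.stack-data/fixtures/<prefix>/` to `/data/fixtures/<prefix>/` is inferred from the paths quoted above.

```python
import os
import shutil
from pathlib import Path


def stage_fixtures(source: Path, staged_dir: Path, count: int) -> list[str]:
    """Stage `count` uniquely named copies of `source`; return worker-visible paths."""
    staged_dir.mkdir(parents=True, exist_ok=True)
    worker_paths = []
    for i in range(count):
        target = staged_dir / f"{i:03d}{source.suffix}"
        if not target.exists():  # idempotent: reuse files from earlier runs
            try:
                os.link(source, target)  # hardlink where the FS permits
            except OSError:
                shutil.copy2(source, target)  # otherwise fall back to a copy
        # Assumed mount: .stack-data/fixtures/<prefix>/ -> /data/fixtures/<prefix>/
        worker_paths.append(f"/data/fixtures/{staged_dir.name}/{target.name}")
    return worker_paths
```

Each job gets its own path, so no two submissions share a source string.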
Artifact layout
stress-runs/2026-05-20T22-30-00Z/
  report.md                           # top-level summary + links
  .git-sha                            # repo state at run start
  soak/
    jobs.json                         # per-job timings, status, stage_states
    manifest.json                     # scenario metadata + Prom/Jaeger window
    report.md                         # per-scenario summary
    prometheus/
      stage_duration_bucket.json      # raw query_range envelopes, 5s step
      stage_duration_count.json
      stage_duration_sum.json
      queue_wait_bucket.json
      queue_wait_count.json
      queue_wait_sum.json
      gpu_tenancy_bucket.json
      gpu_tenancy_transitions.json
      jobs_completed.json
      rabbitmq_queue_ready.json
    jaeger/
      autorag-api.json
      autorag-gpu-worker.json
      autorag-io-worker.json
  ramp/
    by-concurrency/lvl-04/            (same layout per level)
    by-concurrency/lvl-08/
    by-concurrency/lvl-16/
    by-concurrency/lvl-24/
    report.md
  ab/                                 # only present when --ab was used
    np-8/                             (same layout)
    np-4/                             (same layout)
Everything under stress-runs/ is local-only; the .gitignore
entry keeps it out of commits.
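Because jobs.json is kept per scenario, the soak pass criteria can be re-checked offline. The sketch below assumes a schema that is not confirmed by this page: a JSON list of job records, each with an `id`, a `status`, and a `stage_states` mapping keyed by stage name. Adjust the field names to the real file.

```python
import json
from pathlib import Path

# Stage names as listed in the soak pass criteria.
EXPECTED_STAGES = ["whisper", "l1", "decide", "summarize_l1", "l2", "summarize_l2", "l0"]


def load_jobs(path: Path) -> list[dict]:
    """Read a jobs.json file (assumed to hold a JSON list of job records)."""
    return json.loads(path.read_text())


def failing_jobs(jobs: list[dict]) -> list[str]:
    """Return one message per job that is not done or lacks an expected stage."""
    problems = []
    for job in jobs:
        job_id = str(job.get("id", "?"))
        if job.get("status") != "done":
            problems.append(f"{job_id}: status={job.get('status')}")
            continue
        # Assumed shape: stage_states is a mapping keyed by stage name.
        missing = [s for s in EXPECTED_STAGES if s not in job.get("stage_states", {})]
        if missing:
            problems.append(f"{job_id}: missing stages {missing}")
    return problems
```

An empty list from `failing_jobs(load_jobs(Path("stress-runs/<ISO-8601>/soak/jobs.json")))` covers the first two criteria; the worker-log check is not covered here.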
A/B procedure (manual)
OLLAMA_NUM_PARALLEL is baked into the ollama container’s env at
boot. The devcontainer cannot recreate the container through the
docker-socket-proxy (compose up --force-recreate requires the
BUILD/CONTAINER_CREATE endpoints, which are denied). So the
two legs are explicit:
# HOST — leg 1: current default (8)
OLLAMA_NUM_PARALLEL=8 docker compose -p autorag up -d --force-recreate ollama
# DEVCONTAINER:
./scripts/stress-suite.sh --ab np-8
# HOST — leg 2: rollback comparison (4)
OLLAMA_NUM_PARALLEL=4 docker compose -p autorag up -d --force-recreate ollama
# DEVCONTAINER:
./scripts/stress-suite.sh --ab np-4
# HOST — restore the default
OLLAMA_NUM_PARALLEL=8 docker compose -p autorag up -d --force-recreate ollama
Each leg writes to stress-runs/<ts>/ab/np-N/. Compare side-by-side
through the report.md of each leg (per-stage mean duration) and the
prometheus/queue_wait_*.json series (queue pressure should
increase at np=4 if 8 was actually helping).
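One way to turn the queue_wait series into a single number per leg is to divide the increase of the `_sum` series by the increase of the `_count` series over the run window. The sketch below assumes the JSON files hold standard Prometheus `query_range` matrix envelopes of raw (un-rated) histogram counters with no counter resets inside the window; the function names are hypothetical.

```python
from typing import Optional


def window_increase(envelope: dict) -> float:
    """Sum (last - first) sample value across every series in a matrix result."""
    total = 0.0
    for series in envelope["data"]["result"]:
        values = series["values"]  # [[unix_ts, "value-as-string"], ...]
        if len(values) >= 2:
            total += float(values[-1][1]) - float(values[0][1])
    return total


def mean_queue_wait(sum_env: dict, count_env: dict) -> Optional[float]:
    """Mean observed queue wait over the window, or None if nothing was observed."""
    observations = window_increase(count_env)
    if observations <= 0:
        return None
    return window_increase(sum_env) / observations
```

Computing this for np-8 and np-4 gives two comparable means; a clearly higher value at np-4 would match the expectation stated above.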