Observability (autorag.otel)¶
AutoRAG ships an OpenTelemetry integration that emits traces for
the async pipeline (end-to-end job hop from API → gpu-worker →
io-worker, with one autorag.stage.<name> span per stage and
nested GPU-arbiter spans) and metrics for stage durations, attempt
counts, DLQ rates, and GPU tenancy flips. The integration is opt-in:
when AUTORAG_OTEL_ENABLED=false (the default) autorag.otel.initialize_otel()
short-circuits without importing any opentelemetry modules, so a
base install and a [broker]-only install both keep booting without
the new [observability] extra.
Quick start¶
Install the extra (or rerun
uv sync --all-extras):pip install 'autorag[broker,rag,observability]'
Switch OTel on in the host environment:
echo 'AUTORAG_OTEL_ENABLED=true' >> .env
Bring up the in-stack collector + Jaeger + Prometheus + Grafana via the new compose profile:
./scripts/stack.sh up --with-observability
Without the profile the workers still try to push to
http://otel-collector:4317; the BatchSpanProcessor retries silently
and pipeline work is never blocked, but you also won’t see anything in
Jaeger / Prometheus.
The dashboards land on (everything is bound to 127.0.0.1):
Jaeger — http://localhost:16686
Prometheus — http://localhost:9090
Grafana (anonymous admin) — http://localhost:3000
Span catalog¶
All spans live under one of two tracers:
autorag.services— pipeline + transport spans.autorag.gpu— GPU-arbiter spans.
Span |
Emitter |
Notable attributes |
|---|---|---|
|
|
|
|
|
|
|
|
|
|
Nested children of |
None — they exist to attribute the time of each substep on the transition. |
|
|
|
|
|
|
|
|
Stepwise breakdown of the whisper stage’s wall time —
|
W3C trace context is propagated across the RabbitMQ envelope manually
in autorag.services.broker.RabbitBroker via
autorag.otel.inject_amqp_headers() and
autorag.otel.extract_amqp_context(). The contrib
opentelemetry-instrumentation-pika package only wraps
basic_publish and the basic_consume callback path; the broker
uses pull-mode basic_get for its drain loop, so doing the
inject/extract ourselves is the only reliable way to keep one end-to-end
trace across the API → gpu-worker → io-worker hops.
The handler-side publish (the next-stage hop and the bounded retry in
autorag.services.stages.drain_and_dispatch()) runs after the
stage span’s with block has exited, so StageOutcome carries
the stage span’s OTel context (captured at outcome-construction time)
and drain_and_dispatch re-attaches it around each broker.publish
via the small _publish_in_ctx helper. Without this re-attach,
inject_amqp_headers() would publish with no active span and every
downstream stage would start a fresh trace.
Metric catalog¶
Every metric is published under the autorag meter.
Name |
Type |
Unit |
Labels |
|---|---|---|---|
|
Counter |
1 |
|
|
Counter |
1 |
|
|
Histogram |
s |
|
|
Histogram |
1 |
|
|
Counter |
1 |
|
|
Counter |
1 |
|
|
Histogram |
s |
|
|
Histogram |
s |
|
Reading the per-LLM-call span (autorag.llm.call)¶
Without autorag.otel_callbacks.OTelSpanCallbackHandler the
batched LLM stages (decide, l2, summarize_l1,
summarize_l2) render in Jaeger as a single autorag.stage.<n>
span with N anonymous httpx children — the Runnable.batch
fan-out is nesting correctly under the stage thanks to
ThreadingInstrumentor, but the per-item calls have no label. The
callback is wired through RunnableConfig in
autorag.agent._build_stage_closures(), so every chat-model
invocation gets one autorag.llm.call span tagged with its stage and
the Ollama timings.
Read the Jaeger waterfall as:
llm.ollama.eval_duration_ms— pure GPU token-generation time.llm.ollama.prompt_eval_duration_ms— GPU prompt-eval time.llm.ollama.load_duration_ms— Ollama model swap (≈0 whenkeep_alive="5m"keeps the model warm; non-zero means a reload cost — usually a stage that changednum_ctx).httpxchild span duration −total_duration_ms≈ network round-trip to the Ollama server.autorag.llm.callduration −httpxchild duration ≈ local LangChain Python overhead (prompt-template build, structured-output parse/validate).
Queue depth is not an autorag metric — Prometheus scrapes
RabbitMQ’s built-in rabbitmq_prometheus exporter on :15692 and
exposes the queue counts as rabbitmq_queue_messages_ready{queue=…}.
The starter Grafana dashboard joins those series with the autorag
metrics so the per-stage histogram and the queue depth can be read
side-by-side. A companion autorag-queue-wait dashboard breaks
out the idle-time view — total wait per stage, p50/p95/p99 quantiles,
and a wait-duration heatmap — for when the question is “where is
backpressure piling up?” rather than “how fast is each stage running?”.
The scrape target uses the plugin’s
/metrics/per-object endpoint (set in
observability/prometheus.yml); the default /metrics emits only
aggregated totals with no queue= label, which would break the
per-queue Grafana panel.
Service-name initialisation¶
autorag.otel.initialize_otel() is idempotent — the module-level
_initialized bool means the first call wins for the lifetime of the
process. Each long-running process therefore calls it exactly once with
its own service name: autorag-api from the FastAPI lifespan,
autorag-gpu-worker / autorag-io-worker from each worker’s
main(), and autorag-cli from the Typer @app.callback. The
callback skips its init when the invoked subcommand is serve;
otherwise the callback would win the race and every API span would
register as autorag-cli.
Settings¶
The fields below live on autorag.config.Settings; their
environment variables use the standard AUTORAG_ prefix.
Setting |
Default |
|---|---|
|
|
|
|
|
|
|
|
|
|
|
|