Observability (`autorag.otel`)¶

AutoRAG ships an OpenTelemetry integration that emits traces for the async pipeline (end-to-end job hop from API → gpu-worker → io-worker, with one autorag.stage.<name> span per stage and nested GPU-arbiter spans) and metrics for stage durations, attempt counts, DLQ rates, and GPU tenancy flips. The integration is opt-in: when AUTORAG_OTEL_ENABLED=false (the default) autorag.otel.initialize_otel() short-circuits without importing any opentelemetry modules, so a base install and a [broker]-only install both keep booting without the new [observability] extra.

Quick start¶

Install the extra (or rerun uv sync --all-extras):

pip install 'autorag[broker,rag,observability]'

Switch OTel on in the host environment:

echo 'AUTORAG_OTEL_ENABLED=true' >> .env

Bring up the in-stack collector + Jaeger + Prometheus + Grafana via the new compose profile:
```
./scripts/stack.sh up --with-observability
```

Without the profile the workers still try to push to http://otel-collector:4317; the BatchSpanProcessor retries silently and pipeline work is never blocked, but you also won’t see anything in Jaeger / Prometheus.

The dashboards land on (everything is bound to 127.0.0.1):

Jaeger — http://localhost:16686
Prometheus — http://localhost:9090
Grafana (anonymous admin) — http://localhost:3000

Span catalog¶

All spans live under one of two tracers:

autorag.services — pipeline + transport spans.
autorag.gpu — GPU-arbiter spans.

Span	Emitter	Notable attributes
`autorag.job.submit`	`autorag.services.broker.submit_audio_job()`	`job.id`, `job.session_id`, `job.source`, `job.source_kind` (`youtube`/`file`), `job.llm_model`, `job.whisper_model`
`autorag.stage.<name>`	`autorag.services.stages.handle_batch()` (queue path) + `autorag.services.runner.run_job_in_process()` (in-process)	`job.id`, `stage.name`, `stage.attempt`, `messaging.system=rabbitmq`, `messaging.destination.name=stage.<name>`, `messaging.operation=process`; the persist span additionally carries `result.session_id`.
`autorag.gpu.acquire`	`autorag.services.model_manager.GpuArbiter.acquire()`	`gpu.tenant.previous`, `gpu.tenant.target`, `gpu.tenant.transition`
`autorag.gpu.evict.llm` / `.offload.whisper` / `.enforce_budget`	Nested children of `autorag.gpu.acquire`	None — they exist to attribute the time of each substep on the transition.
`autorag.queue.wait.<stage>`	`autorag.services.stages._handle_one()` — emitted retroactively between AMQP context attach and the stage span, parented under the publisher’s context. The bar in the Jaeger waterfall is “how long did this message sit in `stage.<name>`?”.	`stage.name`, `messaging.system=rabbitmq`, `messaging.destination.name=stage.<name>`, `job.id`
`autorag.llm.call`	`autorag.otel_callbacks.OTelSpanCallbackHandler` — one per LangChain chat-model call. Threads under the stage span via `ThreadingInstrumentor` even from `Runnable.batch`’s worker pool.	`llm.stage` (which agent stage emitted it), `llm.model`, `llm.input.chars`, `llm.input.message_count`, `llm.ollama.total_duration_ms`, `llm.ollama.load_duration_ms`, `llm.ollama.prompt_eval_duration_ms`, `llm.ollama.eval_duration_ms`, `llm.ollama.prompt_eval_count`, `llm.ollama.eval_count`, `llm.usage.{input_tokens,output_tokens,total_tokens}`
`autorag.whisper.<step>` / `autorag.pyannote.<step>` / `autorag.gpu.preload.<target>`	`autorag.agent` (`get_model`, `transcribe_segment`, `diarize_file`, `assign_speakers`), `autorag.whisper_runner` (`load_audio`, `load_model` cache-miss, `ct2_transcribe`, `get_align_model`, `align`, `offload_align`), `autorag.diarize` (`ensure_on_cuda`, `inference`, `offload`), `autorag.services.model_manager.GpuArbiter` preload methods.	Stepwise breakdown of the whisper stage’s wall time — `cache.hit`, `audio.duration_s`, `model.compute_type`, `model.device`, `transcribe.realtime_factor`, `align.restored_from` (`cuda`/`cpu_to_cuda_restore`/`fresh_load`), `preload.cuda_attempted`/`preload.cuda_succeeded`.

W3C trace context is propagated across the RabbitMQ envelope manually in autorag.services.broker.RabbitBroker via autorag.otel.inject_amqp_headers() and autorag.otel.extract_amqp_context(). The contrib opentelemetry-instrumentation-pika package only wraps basic_publish and the basic_consume callback path; the broker uses pull-mode basic_get for its drain loop, so doing the inject/extract ourselves is the only reliable way to keep one end-to-end trace across the API → gpu-worker → io-worker hops.

The handler-side publish (the next-stage hop and the bounded retry in autorag.services.stages.drain_and_dispatch()) runs after the stage span’s with block has exited, so StageOutcome carries the stage span’s OTel context (captured at outcome-construction time) and drain_and_dispatch re-attaches it around each broker.publish via the small _publish_in_ctx helper. Without this re-attach, inject_amqp_headers() would publish with no active span and every downstream stage would start a fresh trace.

Metric catalog¶

Every metric is published under the autorag meter.

Name	Type	Unit	Labels
`autorag.jobs.submitted`	Counter	1	`job.source_kind`
`autorag.jobs.completed`	Counter	1	`status` (`done`/`failed`)
`autorag.stage.duration`	Histogram	s	`stage.name`, `outcome` (`ok`/`error`)
`autorag.stage.attempts`	Histogram	1	`stage.name`
`autorag.stage.dlq`	Counter	1	`stage.name`, `error.class`
`autorag.gpu.tenancy.transitions`	Counter	1	`from_tenant`, `to_tenant`
`autorag.gpu.tenancy.duration`	Histogram	s	`tenant`
`autorag.queue.wait.duration`	Histogram	s	`stage.name`

Reading the per-LLM-call span (`autorag.llm.call`)¶

Without autorag.otel_callbacks.OTelSpanCallbackHandler the batched LLM stages (decide, l2, summarize_l1, summarize_l2) render in Jaeger as a single autorag.stage.<n> span with N anonymous httpx children — the Runnable.batch fan-out is nesting correctly under the stage thanks to ThreadingInstrumentor, but the per-item calls have no label. The callback is wired through RunnableConfig in autorag.agent._build_stage_closures(), so every chat-model invocation gets one autorag.llm.call span tagged with its stage and the Ollama timings.

Read the Jaeger waterfall as:

llm.ollama.eval_duration_ms — pure GPU token-generation time.
llm.ollama.prompt_eval_duration_ms — GPU prompt-eval time.
llm.ollama.load_duration_ms — Ollama model swap (≈0 when keep_alive="5m" keeps the model warm; non-zero means a reload cost — usually a stage that changed num_ctx).
httpx child span duration − total_duration_ms ≈ network round-trip to the Ollama server.
autorag.llm.call duration − httpx child duration ≈ local LangChain Python overhead (prompt-template build, structured-output parse/validate).

Queue depth is not an autorag metric — Prometheus scrapes RabbitMQ’s built-in rabbitmq_prometheus exporter on :15692 and exposes the queue counts as rabbitmq_queue_messages_ready{queue=…}. The starter Grafana dashboard joins those series with the autorag metrics so the per-stage histogram and the queue depth can be read side-by-side. A companion autorag-queue-wait dashboard breaks out the idle-time view — total wait per stage, p50/p95/p99 quantiles, and a wait-duration heatmap — for when the question is “where is backpressure piling up?” rather than “how fast is each stage running?”. The scrape target uses the plugin’s /metrics/per-object endpoint (set in observability/prometheus.yml); the default /metrics emits only aggregated totals with no queue= label, which would break the per-queue Grafana panel.

Service-name initialisation¶

autorag.otel.initialize_otel() is idempotent — the module-level _initialized bool means the first call wins for the lifetime of the process. Each long-running process therefore calls it exactly once with its own service name: autorag-api from the FastAPI lifespan, autorag-gpu-worker / autorag-io-worker from each worker’s main(), and autorag-cli from the Typer @app.callback. The callback skips its init when the invoked subcommand is serve; otherwise the callback would win the race and every API span would register as autorag-cli.

Settings¶

The fields below live on autorag.config.Settings; their environment variables use the standard AUTORAG_ prefix.

Setting	Default
`otel_enabled`	`False`
`otel_service_name`	`autorag`
`otel_exporter_endpoint`	`http://localhost:4317`
`otel_metric_export_interval_ms`	`15000`
`otel_environment`	`dev`
`otel_resource_attributes`	`""` (`"k=v,k2=v2"`)

Observability (autorag.otel)¶