Observability (autorag.otel)

AutoRAG ships an OpenTelemetry integration that emits traces and metrics for the async pipeline. Traces cover the end-to-end job hop from API → gpu-worker / io-worker, with one autorag.stage.<name> span per stage and nested GPU-arbiter spans; metrics cover stage durations, attempt counts, DLQ rates, and GPU tenancy flips. The integration is opt-in: when AUTORAG_OTEL_ENABLED=false (the default), autorag.otel.initialize_otel() short-circuits without importing any opentelemetry modules, so a base install and a [broker]-only install both keep booting without the new [observability] extra.
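The opt-in guard described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the env parameter, the set of accepted truthy strings, and the provider setup after the guard are all assumptions. The point it shows is that the opentelemetry imports sit inside the function, after the early return, so a disabled process never touches them.

```python
import os
import sys
from typing import Mapping, Optional

_TRUTHY = {"1", "true", "yes", "on"}  # assumed spelling of "enabled"


def initialize_otel(service_name: str, env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True if telemetry was set up, False if the call short-circuited."""
    env = os.environ if env is None else env
    flag = env.get("AUTORAG_OTEL_ENABLED", "false").strip().lower()
    if flag not in _TRUTHY:
        # Disabled: return before any opentelemetry import is attempted.
        return False

    # Deferred imports: only reached when the flag is on, so the
    # [observability] extra is only required in that case.
    from opentelemetry import trace
    from opentelemetry.sdk.resources import Resource
    from opentelemetry.sdk.trace import TracerProvider

    resource = Resource.create({"service.name": service_name})
    trace.set_tracer_provider(TracerProvider(resource=resource))
    return True


# With the flag absent or false the call is a no-op and imports nothing new.
modules_before = set(sys.modules)
started = initialize_otel("autorag-api", env={})
newly_imported = {m for m in set(sys.modules) - modules_before if m.startswith("opentelemetry")}
```

Keeping the imports function-local is what lets a base install boot: a module-level import would fail at import time of autorag.otel whenever the extra is missing.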

Quick start

  1. Install the extra (or rerun uv sync --all-extras):

    pip install 'autorag[broker,rag,observability]'
    
  2. Switch OTel on in the host environment:

    echo 'AUTORAG_OTEL_ENABLED=true' >> .env
    
  3. Bring up the in-stack collector + Jaeger + Prometheus + Grafana via the new compose profile:

    ./scripts/stack.sh up --with-observability
    

Without the profile the workers still try to push to http://otel-collector:4317; the BatchSpanProcessor retries silently and pipeline work is never blocked, but you also won’t see anything in Jaeger / Prometheus.

The dashboards land on (everything is bound to 127.0.0.1):

Span catalog

All spans live under one of two tracers:

  • autorag.services — pipeline + transport spans.

  • autorag.gpu — GPU-arbiter spans.

Span: autorag.job.submit

  • Emitter: autorag.services.broker.submit_audio_job()

  • Notable attributes: job.id, job.session_id, job.source, job.source_kind (youtube/file), job.llm_model, job.whisper_model

Span: autorag.stage.<name>

  • Emitter: autorag.services.stages.handle_batch() (queue path) + autorag.services.runner.run_job_in_process() (in-process)

  • Notable attributes: job.id, stage.name, stage.attempt, messaging.system=rabbitmq, messaging.destination.name=stage.<name>, messaging.operation=process; the persist span additionally carries result.session_id.

Span: autorag.gpu.acquire

  • Emitter: autorag.services.model_manager.GpuArbiter.acquire()

  • Notable attributes: gpu.tenant.previous, gpu.tenant.target, gpu.tenant.transition

Span: autorag.gpu.evict.llm / .offload.whisper / .enforce_budget

  • Emitter: Nested children of autorag.gpu.acquire

  • Notable attributes: None; they exist to attribute the time of each substep on the transition.

Span: autorag.queue.wait.<stage>

  • Emitter: autorag.services.stages._handle_one(). Emitted retroactively between AMQP context attach and the stage span, parented under the publisher’s context. The bar in the Jaeger waterfall answers “how long did this message sit in stage.<name>?”.

  • Notable attributes: stage.name, messaging.system=rabbitmq, messaging.destination.name=stage.<name>, job.id

Span: autorag.llm.call

  • Emitter: autorag.otel_callbacks.OTelSpanCallbackHandler, one per LangChain chat-model call. Threads under the stage span via ThreadingInstrumentor even from Runnable.batch’s worker pool.

  • Notable attributes: llm.stage (which agent stage emitted it), llm.model, llm.input.chars, llm.input.message_count, llm.ollama.total_duration_ms, llm.ollama.load_duration_ms, llm.ollama.prompt_eval_duration_ms, llm.ollama.eval_duration_ms, llm.ollama.prompt_eval_count, llm.ollama.eval_count, llm.usage.{input_tokens,output_tokens,total_tokens}

Span: autorag.whisper.<step> / autorag.pyannote.<step> / autorag.gpu.preload.<target>

  • Emitter: autorag.agent (get_model, transcribe_segment, diarize_file, assign_speakers), autorag.whisper_runner (load_audio, load_model cache-miss, ct2_transcribe, get_align_model, align, offload_align), autorag.diarize (ensure_on_cuda, inference, offload), autorag.services.model_manager.GpuArbiter preload methods.

  • Notable attributes: Stepwise breakdown of the whisper stage’s wall time: cache.hit, audio.duration_s, model.compute_type, model.device, transcribe.realtime_factor, align.restored_from (cuda/cpu_to_cuda_restore/fresh_load), preload.cuda_attempted/preload.cuda_succeeded.

W3C trace context is propagated across the RabbitMQ envelope manually in autorag.services.broker.RabbitBroker via autorag.otel.inject_amqp_headers() and autorag.otel.extract_amqp_context(). The contrib opentelemetry-instrumentation-pika package only wraps basic_publish and the basic_consume callback path; the broker uses pull-mode basic_get for its drain loop, so doing the inject/extract ourselves is the only reliable way to keep one end-to-end trace across the API → gpu-worker / io-worker hops.
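To make the manual propagation concrete, here is a stdlib-only sketch of what travels in the message headers. It is not the code behind inject_amqp_headers() or extract_amqp_context(), which presumably delegate to OpenTelemetry's propagation API; the helper names inject_traceparent and extract_traceparent are hypothetical. It only illustrates the W3C traceparent header format (version, 32-hex trace ID, 16-hex parent span ID, 2-hex flags) that such helpers write on publish and read back after a pull.

```python
import re
import secrets
from typing import Optional, Tuple

# W3C Trace Context, version 00: "00-<trace-id>-<parent-id>-<flags>"
_TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def inject_traceparent(headers: dict, trace_id: str, span_id: str, sampled: bool = True) -> dict:
    """Publisher side: write the active span's identity into the message headers."""
    flags = "01" if sampled else "00"
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"
    return headers


def extract_traceparent(headers: dict) -> Optional[Tuple[str, str, bool]]:
    """Consumer side: return (trace_id, parent_span_id, sampled), or None if absent or invalid."""
    match = _TRACEPARENT_RE.match(str(headers.get("traceparent", "")))
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    if set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None  # all-zero IDs are invalid per the specification
    return trace_id, span_id, bool(int(flags, 16) & 0x01)


# Round trip: the consumer recovers exactly what the publisher wrote.
trace_id = secrets.token_hex(16)  # 32 lowercase hex characters
span_id = secrets.token_hex(8)    # 16 lowercase hex characters
headers = inject_traceparent({}, trace_id, span_id)
parent = extract_traceparent(headers)
```

On the pull path the headers would come from the properties object that basic_get returns alongside the body, which is why no consume-callback instrumentation ever sees them.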

The handler-side publish (the next-stage hop and the bounded retry in autorag.services.stages.drain_and_dispatch()) runs after the stage span’s with block has exited, so StageOutcome carries the stage span’s OTel context (captured at outcome-construction time) and drain_and_dispatch re-attaches it around each broker.publish via the small _publish_in_ctx helper. Without this re-attach, inject_amqp_headers() would publish with no active span and every downstream stage would start a fresh trace.

Metric catalog

Every metric is published under the autorag meter.

Name                             Type       Unit  Labels
autorag.jobs.submitted           Counter    1     job.source_kind
autorag.jobs.completed           Counter    1     status (done/failed)
autorag.stage.duration           Histogram  s     stage.name, outcome (ok/error)
autorag.stage.attempts           Histogram  1     stage.name
autorag.stage.dlq                Counter    1     stage.name, error.class
autorag.gpu.tenancy.transitions  Counter    1     from_tenant, to_tenant
autorag.gpu.tenancy.duration     Histogram  s     tenant
autorag.queue.wait.duration      Histogram  s     stage.name

Reading the per-LLM-call span (autorag.llm.call)

Without autorag.otel_callbacks.OTelSpanCallbackHandler, the batched LLM stages (decide, l2, summarize_l1, summarize_l2) render in Jaeger as a single autorag.stage.<name> span with N anonymous httpx children: the Runnable.batch fan-out nests correctly under the stage thanks to ThreadingInstrumentor, but the per-item calls have no label. The callback is wired through RunnableConfig in autorag.agent._build_stage_closures(), so every chat-model invocation gets one autorag.llm.call span tagged with its stage and the Ollama timings.

Read the Jaeger waterfall as:

  • llm.ollama.eval_duration_ms — pure GPU token-generation time.

  • llm.ollama.prompt_eval_duration_ms — GPU prompt-eval time.

  • llm.ollama.load_duration_ms — Ollama model swap (≈0 when keep_alive="5m" keeps the model warm; non-zero means a reload cost — usually a stage that changed num_ctx).

  • httpx child span duration − llm.ollama.total_duration_ms ≈ network round-trip to the Ollama server.

  • autorag.llm.call duration − httpx child duration ≈ local LangChain Python overhead (prompt-template build, structured-output parse/validate).
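The subtractions in the list above can be written out as a small helper. This is only a worked-arithmetic sketch: the LlmCallTimings class, the breakdown function, and the sample numbers are invented for illustration, and the inputs are assumed to have been read off one autorag.llm.call span and its httpx child in Jaeger.

```python
from dataclasses import dataclass


@dataclass
class LlmCallTimings:
    span_ms: float         # autorag.llm.call span duration
    httpx_ms: float        # duration of the httpx child span
    total_ms: float        # llm.ollama.total_duration_ms
    load_ms: float         # llm.ollama.load_duration_ms
    prompt_eval_ms: float  # llm.ollama.prompt_eval_duration_ms
    eval_ms: float         # llm.ollama.eval_duration_ms


def breakdown(t: LlmCallTimings) -> dict:
    """Split one LLM call's wall time into the buckets described above."""
    return {
        "model_load_ms": t.load_ms,
        "prompt_eval_ms": t.prompt_eval_ms,
        "generation_ms": t.eval_ms,
        "network_ms": t.httpx_ms - t.total_ms,           # approx. round-trip to Ollama
        "langchain_overhead_ms": t.span_ms - t.httpx_ms,  # approx. local Python work
    }


# Hypothetical reading for a warm model (load time of zero).
example = LlmCallTimings(
    span_ms=2150.0, httpx_ms=2100.0, total_ms=2060.0,
    load_ms=0.0, prompt_eval_ms=310.0, eval_ms=1740.0,
)
parts = breakdown(example)
```

For these made-up numbers the network bucket is 40 ms and the LangChain overhead is 50 ms, so almost all of the call is GPU time inside Ollama.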

Queue depth is not an autorag metric — Prometheus scrapes RabbitMQ’s built-in rabbitmq_prometheus exporter on :15692 and exposes the queue counts as rabbitmq_queue_messages_ready{queue=…}. The starter Grafana dashboard joins those series with the autorag metrics so the per-stage histogram and the queue depth can be read side-by-side. A companion autorag-queue-wait dashboard breaks out the idle-time view — total wait per stage, p50/p95/p99 quantiles, and a wait-duration heatmap — for when the question is “where is backpressure piling up?” rather than “how fast is each stage running?”. The scrape target uses the plugin’s /metrics/per-object endpoint (set in observability/prometheus.yml); the default /metrics emits only aggregated totals with no queue= label, which would break the per-queue Grafana panel.

Service-name initialisation

autorag.otel.initialize_otel() is idempotent — the module-level _initialized bool means the first call wins for the lifetime of the process. Each long-running process therefore calls it exactly once with its own service name: autorag-api from the FastAPI lifespan, autorag-gpu-worker / autorag-io-worker from each worker’s main(), and autorag-cli from the Typer @app.callback. The callback skips its init when the invoked subcommand is serve; otherwise the callback would win the race and every API span would register as autorag-cli.

Settings

The fields below live on autorag.config.Settings; their environment variables use the standard AUTORAG_ prefix.

Setting                         Default
otel_enabled                    False
otel_service_name               autorag
otel_exporter_endpoint          http://localhost:4317
otel_metric_export_interval_ms  15000
otel_environment                dev
otel_resource_attributes        "" ("k=v,k2=v2")