Observability (``autorag.otel``) ================================ AutoRAG ships an OpenTelemetry integration that emits **traces** for the async pipeline (end-to-end job hop from API → ``gpu-worker`` → ``io-worker``, with one ``autorag.stage.`` span per stage and nested GPU-arbiter spans) and **metrics** for stage durations, attempt counts, DLQ rates, and GPU tenancy flips. The integration is opt-in: when ``AUTORAG_OTEL_ENABLED=false`` (the default) :func:`autorag.otel.initialize_otel` short-circuits without importing any ``opentelemetry`` modules, so a base install and a ``[broker]``-only install both keep booting without the new ``[observability]`` extra. Quick start ----------- #. Install the extra (or rerun ``uv sync --all-extras``): .. code-block:: bash pip install 'autorag[broker,rag,observability]' #. Switch OTel on in the host environment: .. code-block:: bash echo 'AUTORAG_OTEL_ENABLED=true' >> .env #. Bring up the in-stack collector + Jaeger + Prometheus + Grafana via the new compose profile: .. code-block:: bash ./scripts/stack.sh up --with-observability Without the profile the workers still try to push to ``http://otel-collector:4317``; the BatchSpanProcessor retries silently and pipeline work is never blocked, but you also won't see anything in Jaeger / Prometheus. The dashboards land on (everything is bound to ``127.0.0.1``): * **Jaeger** — http://localhost:16686 * **Prometheus** — http://localhost:9090 * **Grafana** (anonymous admin) — http://localhost:3000 Span catalog ------------ All spans live under one of two tracers: * ``autorag.services`` — pipeline + transport spans. * ``autorag.gpu`` — GPU-arbiter spans. .. list-table:: :header-rows: 1 :widths: 28 24 48 * - Span - Emitter - Notable attributes * - ``autorag.job.submit`` - :func:`autorag.services.broker.submit_audio_job` - ``job.id``, ``job.session_id``, ``job.source``, ``job.source_kind`` (``youtube``/``file``), ``job.llm_model``, ``job.whisper_model`` * - ``autorag.stage.`` - :func:`autorag.services.stages.handle_batch` (queue path) + :func:`autorag.services.runner.run_job_in_process` (in-process) - ``job.id``, ``stage.name``, ``stage.attempt``, ``messaging.system=rabbitmq``, ``messaging.destination.name=stage.``, ``messaging.operation=process``; the persist span additionally carries ``result.session_id``. * - ``autorag.gpu.acquire`` - :meth:`autorag.services.model_manager.GpuArbiter.acquire` - ``gpu.tenant.previous``, ``gpu.tenant.target``, ``gpu.tenant.transition`` * - ``autorag.gpu.evict.llm`` / ``.offload.whisper`` / ``.enforce_budget`` - Nested children of ``autorag.gpu.acquire`` - None — they exist to attribute the time of each substep on the transition. * - ``autorag.queue.wait.`` - :func:`autorag.services.stages._handle_one` — emitted *retroactively* between AMQP context attach and the stage span, parented under the publisher's context. The bar in the Jaeger waterfall is "how long did this message sit in ``stage.``?". - ``stage.name``, ``messaging.system=rabbitmq``, ``messaging.destination.name=stage.``, ``job.id`` * - ``autorag.llm.call`` - :class:`autorag.otel_callbacks.OTelSpanCallbackHandler` — one per LangChain chat-model call. Threads under the stage span via ``ThreadingInstrumentor`` even from ``Runnable.batch``'s worker pool. - ``llm.stage`` (which agent stage emitted it), ``llm.model``, ``llm.input.chars``, ``llm.input.message_count``, ``llm.ollama.total_duration_ms``, ``llm.ollama.load_duration_ms``, ``llm.ollama.prompt_eval_duration_ms``, ``llm.ollama.eval_duration_ms``, ``llm.ollama.prompt_eval_count``, ``llm.ollama.eval_count``, ``llm.usage.{input_tokens,output_tokens,total_tokens}`` * - ``autorag.whisper.`` / ``autorag.pyannote.`` / ``autorag.gpu.preload.`` - :mod:`autorag.agent` (``get_model``, ``transcribe_segment``, ``diarize_file``, ``assign_speakers``), :mod:`autorag.whisper_runner` (``load_audio``, ``load_model`` cache-miss, ``ct2_transcribe``, ``get_align_model``, ``align``, ``offload_align``), :mod:`autorag.diarize` (``ensure_on_cuda``, ``inference``, ``offload``), :class:`autorag.services.model_manager.GpuArbiter` preload methods. - Stepwise breakdown of the whisper stage's wall time — ``cache.hit``, ``audio.duration_s``, ``model.compute_type``, ``model.device``, ``transcribe.realtime_factor``, ``align.restored_from`` (``cuda``/``cpu_to_cuda_restore``/``fresh_load``), ``preload.cuda_attempted``/``preload.cuda_succeeded``. W3C trace context is propagated across the RabbitMQ envelope **manually** in :class:`autorag.services.broker.RabbitBroker` via :func:`autorag.otel.inject_amqp_headers` and :func:`autorag.otel.extract_amqp_context`. The contrib ``opentelemetry-instrumentation-pika`` package only wraps ``basic_publish`` and the ``basic_consume`` callback path; the broker uses pull-mode ``basic_get`` for its drain loop, so doing the inject/extract ourselves is the only reliable way to keep one end-to-end trace across the API → ``gpu-worker`` → ``io-worker`` hops. The handler-side publish (the next-stage hop and the bounded retry in :func:`autorag.services.stages.drain_and_dispatch`) runs *after* the stage span's ``with`` block has exited, so :class:`StageOutcome` carries the stage span's OTel context (captured at outcome-construction time) and ``drain_and_dispatch`` re-attaches it around each ``broker.publish`` via the small ``_publish_in_ctx`` helper. Without this re-attach, :func:`inject_amqp_headers` would publish with no active span and every downstream stage would start a fresh trace. Metric catalog -------------- Every metric is published under the ``autorag`` meter. .. list-table:: :header-rows: 1 :widths: 30 14 10 46 * - Name - Type - Unit - Labels * - ``autorag.jobs.submitted`` - Counter - 1 - ``job.source_kind`` * - ``autorag.jobs.completed`` - Counter - 1 - ``status`` (``done``/``failed``) * - ``autorag.stage.duration`` - Histogram - s - ``stage.name``, ``outcome`` (``ok``/``error``) * - ``autorag.stage.attempts`` - Histogram - 1 - ``stage.name`` * - ``autorag.stage.dlq`` - Counter - 1 - ``stage.name``, ``error.class`` * - ``autorag.gpu.tenancy.transitions`` - Counter - 1 - ``from_tenant``, ``to_tenant`` * - ``autorag.gpu.tenancy.duration`` - Histogram - s - ``tenant`` * - ``autorag.queue.wait.duration`` - Histogram - s - ``stage.name`` Reading the per-LLM-call span (``autorag.llm.call``) ---------------------------------------------------- Without :class:`autorag.otel_callbacks.OTelSpanCallbackHandler` the batched LLM stages (``decide``, ``l2``, ``summarize_l1``, ``summarize_l2``) render in Jaeger as a single ``autorag.stage.`` span with N anonymous ``httpx`` children — the ``Runnable.batch`` fan-out *is* nesting correctly under the stage thanks to ``ThreadingInstrumentor``, but the per-item calls have no label. The callback is wired through ``RunnableConfig`` in :func:`autorag.agent._build_stage_closures`, so every chat-model invocation gets one ``autorag.llm.call`` span tagged with its stage and the Ollama timings. Read the Jaeger waterfall as: * ``llm.ollama.eval_duration_ms`` — pure GPU token-generation time. * ``llm.ollama.prompt_eval_duration_ms`` — GPU prompt-eval time. * ``llm.ollama.load_duration_ms`` — Ollama model swap (≈0 when ``keep_alive="5m"`` keeps the model warm; non-zero means a reload cost — usually a stage that changed ``num_ctx``). * ``httpx`` child span duration − ``total_duration_ms`` ≈ network round-trip to the Ollama server. * ``autorag.llm.call`` duration − ``httpx`` child duration ≈ local LangChain Python overhead (prompt-template build, structured-output parse/validate). Queue depth is **not** an autorag metric — Prometheus scrapes RabbitMQ's built-in ``rabbitmq_prometheus`` exporter on ``:15692`` and exposes the queue counts as ``rabbitmq_queue_messages_ready{queue=…}``. The starter Grafana dashboard joins those series with the autorag metrics so the per-stage histogram and the queue depth can be read side-by-side. A companion ``autorag-queue-wait`` dashboard breaks out the idle-time view — total wait per stage, p50/p95/p99 quantiles, and a wait-duration heatmap — for when the question is "where is backpressure piling up?" rather than "how fast is each stage running?". The scrape target uses the plugin's ``/metrics/per-object`` endpoint (set in ``observability/prometheus.yml``); the default ``/metrics`` emits only aggregated totals with no ``queue=`` label, which would break the per-queue Grafana panel. Service-name initialisation --------------------------- :func:`autorag.otel.initialize_otel` is idempotent — the module-level ``_initialized`` bool means the first call wins for the lifetime of the process. Each long-running process therefore calls it exactly once with its own service name: ``autorag-api`` from the FastAPI lifespan, ``autorag-gpu-worker`` / ``autorag-io-worker`` from each worker's ``main()``, and ``autorag-cli`` from the Typer ``@app.callback``. The callback skips its init when the invoked subcommand is ``serve``; otherwise the callback would win the race and every API span would register as ``autorag-cli``. Settings -------- The fields below live on :class:`autorag.config.Settings`; their environment variables use the standard ``AUTORAG_`` prefix. .. list-table:: :header-rows: 1 * - Setting - Default * - ``otel_enabled`` - ``False`` * - ``otel_service_name`` - ``autorag`` * - ``otel_exporter_endpoint`` - ``http://localhost:4317`` * - ``otel_metric_export_interval_ms`` - ``15000`` * - ``otel_environment`` - ``dev`` * - ``otel_resource_attributes`` - ``""`` (``"k=v,k2=v2"``)