Observability (``autorag.otel``)
================================

AutoRAG ships an OpenTelemetry integration that emits **traces** for
the async pipeline (end-to-end job hop from API → ``gpu-worker`` →
``io-worker``, with one ``autorag.stage.<name>`` span per stage and
nested GPU-arbiter spans) and **metrics** for stage durations, attempt
counts, DLQ rates, and GPU tenancy flips. The integration is opt-in:
when ``AUTORAG_OTEL_ENABLED=false`` (the default) :func:`autorag.otel.initialize_otel`
short-circuits without importing any ``opentelemetry`` modules, so a
base install and a ``[broker]``-only install both keep booting without
the new ``[observability]`` extra.

Quick start
-----------

#. Install the extra (or rerun ``uv sync --all-extras``):

   .. code-block:: bash

      pip install 'autorag[broker,rag,observability]'

#. Switch OTel on in the host environment:

   .. code-block:: bash

      echo 'AUTORAG_OTEL_ENABLED=true' >> .env

#. Bring up the in-stack collector + Jaeger + Prometheus + Grafana via
   the new compose profile:

   .. code-block:: bash

      ./scripts/stack.sh up --with-observability

Without the profile the workers still try to push to
``http://otel-collector:4317``; the BatchSpanProcessor retries silently
and pipeline work is never blocked, but you also won't see anything in
Jaeger / Prometheus.

The dashboards land on (everything is bound to ``127.0.0.1``):

* **Jaeger** — http://localhost:16686
* **Prometheus** — http://localhost:9090
* **Grafana** (anonymous admin) — http://localhost:3000

Span catalog
------------

All spans live under one of two tracers:

* ``autorag.services`` — pipeline + transport spans.
* ``autorag.gpu`` — GPU-arbiter spans.

.. list-table::
   :header-rows: 1
   :widths: 28 24 48

   * - Span
     - Emitter
     - Notable attributes
   * - ``autorag.job.submit``
     - :func:`autorag.services.broker.submit_audio_job`
     - ``job.id``, ``job.session_id``, ``job.source``,
       ``job.source_kind`` (``youtube``/``file``), ``job.llm_model``,
       ``job.whisper_model``
   * - ``autorag.stage.<name>``
     - :func:`autorag.services.stages.handle_batch` (queue path) +
       :func:`autorag.services.runner.run_job_in_process` (in-process)
     - ``job.id``, ``stage.name``, ``stage.attempt``,
       ``messaging.system=rabbitmq``,
       ``messaging.destination.name=stage.<name>``,
       ``messaging.operation=process``; the persist span additionally
       carries ``result.session_id``.
   * - ``autorag.gpu.acquire``
     - :meth:`autorag.services.model_manager.GpuArbiter.acquire`
     - ``gpu.tenant.previous``, ``gpu.tenant.target``,
       ``gpu.tenant.transition``
   * - ``autorag.gpu.evict.llm`` / ``.offload.whisper`` /
       ``.enforce_budget``
     - Nested children of ``autorag.gpu.acquire``
     - None — they exist to attribute the time of each substep on the
       transition.
   * - ``autorag.queue.wait.<stage>``
     - :func:`autorag.services.stages._handle_one` — emitted
       *retroactively* between AMQP context attach and the stage span,
       parented under the publisher's context. The bar in the Jaeger
       waterfall is "how long did this message sit in
       ``stage.<name>``?".
     - ``stage.name``, ``messaging.system=rabbitmq``,
       ``messaging.destination.name=stage.<name>``, ``job.id``
   * - ``autorag.llm.call``
     - :class:`autorag.otel_callbacks.OTelSpanCallbackHandler` —
       one per LangChain chat-model call. Threads under the stage span
       via ``ThreadingInstrumentor`` even from ``Runnable.batch``'s
       worker pool.
     - ``llm.stage`` (which agent stage emitted it),
       ``llm.model``, ``llm.input.chars``,
       ``llm.input.message_count``,
       ``llm.ollama.total_duration_ms``,
       ``llm.ollama.load_duration_ms``,
       ``llm.ollama.prompt_eval_duration_ms``,
       ``llm.ollama.eval_duration_ms``,
       ``llm.ollama.prompt_eval_count``,
       ``llm.ollama.eval_count``,
       ``llm.usage.{input_tokens,output_tokens,total_tokens}``
   * - ``autorag.whisper.<step>`` /
       ``autorag.pyannote.<step>`` /
       ``autorag.gpu.preload.<target>``
     - :mod:`autorag.agent` (``get_model``,
       ``transcribe_segment``, ``diarize_file``, ``assign_speakers``),
       :mod:`autorag.whisper_runner` (``load_audio``,
       ``load_model`` cache-miss, ``ct2_transcribe``,
       ``get_align_model``, ``align``, ``offload_align``),
       :mod:`autorag.diarize` (``ensure_on_cuda``, ``inference``,
       ``offload``), :class:`autorag.services.model_manager.GpuArbiter`
       preload methods.
     - Stepwise breakdown of the whisper stage's wall time —
       ``cache.hit``, ``audio.duration_s``,
       ``model.compute_type``, ``model.device``,
       ``transcribe.realtime_factor``,
       ``align.restored_from`` (``cuda``/``cpu_to_cuda_restore``/``fresh_load``),
       ``preload.cuda_attempted``/``preload.cuda_succeeded``.

W3C trace context is propagated across the RabbitMQ envelope **manually**
in :class:`autorag.services.broker.RabbitBroker` via
:func:`autorag.otel.inject_amqp_headers` and
:func:`autorag.otel.extract_amqp_context`. The contrib
``opentelemetry-instrumentation-pika`` package only wraps
``basic_publish`` and the ``basic_consume`` callback path; the broker
uses pull-mode ``basic_get`` for its drain loop, so doing the
inject/extract ourselves is the only reliable way to keep one end-to-end
trace across the API → ``gpu-worker`` → ``io-worker`` hops.

The handler-side publish (the next-stage hop and the bounded retry in
:func:`autorag.services.stages.drain_and_dispatch`) runs *after* the
stage span's ``with`` block has exited, so :class:`StageOutcome` carries
the stage span's OTel context (captured at outcome-construction time)
and ``drain_and_dispatch`` re-attaches it around each ``broker.publish``
via the small ``_publish_in_ctx`` helper. Without this re-attach,
:func:`inject_amqp_headers` would publish with no active span and every
downstream stage would start a fresh trace.

Metric catalog
--------------

Every metric is published under the ``autorag`` meter.

.. list-table::
   :header-rows: 1
   :widths: 30 14 10 46

   * - Name
     - Type
     - Unit
     - Labels
   * - ``autorag.jobs.submitted``
     - Counter
     - 1
     - ``job.source_kind``
   * - ``autorag.jobs.completed``
     - Counter
     - 1
     - ``status`` (``done``/``failed``)
   * - ``autorag.stage.duration``
     - Histogram
     - s
     - ``stage.name``, ``outcome`` (``ok``/``error``)
   * - ``autorag.stage.attempts``
     - Histogram
     - 1
     - ``stage.name``
   * - ``autorag.stage.dlq``
     - Counter
     - 1
     - ``stage.name``, ``error.class``
   * - ``autorag.gpu.tenancy.transitions``
     - Counter
     - 1
     - ``from_tenant``, ``to_tenant``
   * - ``autorag.gpu.tenancy.duration``
     - Histogram
     - s
     - ``tenant``
   * - ``autorag.queue.wait.duration``
     - Histogram
     - s
     - ``stage.name``

Reading the per-LLM-call span (``autorag.llm.call``)
----------------------------------------------------

Without :class:`autorag.otel_callbacks.OTelSpanCallbackHandler` the
batched LLM stages (``decide``, ``l2``, ``summarize_l1``,
``summarize_l2``) render in Jaeger as a single ``autorag.stage.<n>``
span with N anonymous ``httpx`` children — the ``Runnable.batch``
fan-out *is* nesting correctly under the stage thanks to
``ThreadingInstrumentor``, but the per-item calls have no label. The
callback is wired through ``RunnableConfig`` in
:func:`autorag.agent._build_stage_closures`, so every chat-model
invocation gets one ``autorag.llm.call`` span tagged with its stage and
the Ollama timings.

Read the Jaeger waterfall as:

* ``llm.ollama.eval_duration_ms`` — pure GPU token-generation time.
* ``llm.ollama.prompt_eval_duration_ms`` — GPU prompt-eval time.
* ``llm.ollama.load_duration_ms`` — Ollama model swap (≈0 when
  ``keep_alive="5m"`` keeps the model warm; non-zero means a reload
  cost — usually a stage that changed ``num_ctx``).
* ``httpx`` child span duration − ``total_duration_ms`` ≈ network
  round-trip to the Ollama server.
* ``autorag.llm.call`` duration − ``httpx`` child duration ≈ local
  LangChain Python overhead (prompt-template build, structured-output
  parse/validate).

Queue depth is **not** an autorag metric — Prometheus scrapes
RabbitMQ's built-in ``rabbitmq_prometheus`` exporter on ``:15692`` and
exposes the queue counts as ``rabbitmq_queue_messages_ready{queue=…}``.
The starter Grafana dashboard joins those series with the autorag
metrics so the per-stage histogram and the queue depth can be read
side-by-side. A companion ``autorag-queue-wait`` dashboard breaks
out the idle-time view — total wait per stage, p50/p95/p99 quantiles,
and a wait-duration heatmap — for when the question is "where is
backpressure piling up?" rather than "how fast is each stage running?".
The scrape target uses the plugin's
``/metrics/per-object`` endpoint (set in
``observability/prometheus.yml``); the default ``/metrics`` emits only
aggregated totals with no ``queue=`` label, which would break the
per-queue Grafana panel.

Service-name initialisation
---------------------------

:func:`autorag.otel.initialize_otel` is idempotent — the module-level
``_initialized`` bool means the first call wins for the lifetime of the
process. Each long-running process therefore calls it exactly once with
its own service name: ``autorag-api`` from the FastAPI lifespan,
``autorag-gpu-worker`` / ``autorag-io-worker`` from each worker's
``main()``, and ``autorag-cli`` from the Typer ``@app.callback``. The
callback skips its init when the invoked subcommand is ``serve``;
otherwise the callback would win the race and every API span would
register as ``autorag-cli``.

Settings
--------

The fields below live on :class:`autorag.config.Settings`; their
environment variables use the standard ``AUTORAG_`` prefix.

.. list-table::
   :header-rows: 1

   * - Setting
     - Default
   * - ``otel_enabled``
     - ``False``
   * - ``otel_service_name``
     - ``autorag``
   * - ``otel_exporter_endpoint``
     - ``http://localhost:4317``
   * - ``otel_metric_export_interval_ms``
     - ``15000``
   * - ``otel_environment``
     - ``dev``
   * - ``otel_resource_attributes``
     - ``""`` (``"k=v,k2=v2"``)