Changelog

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

Unreleased

Added

  • observability/grafana/provisioning/dashboards/autorag-queue-wait.json — a second auto-provisioned Grafana dashboard focused on stage-queue idle time. Pairs autorag_queue_wait_duration_seconds_* rollups with rabbitmq_queue_messages_{ready,unacked} to expose total idle seconds per stage, avg wait per message, idle/work ratio, p50/p95/p99 quantiles, and a wait-duration heatmap.

  • Per-LLM-call OTel spans via a new autorag.otel_callbacks.OTelSpanCallbackHandler, wired through every batched stage’s RunnableConfig in agent._build_stage_closures. One autorag.llm.call span per chat-model run, tagged with llm.stage and stamped with Ollama’s total/load/prompt_eval/eval_duration_ms and token counts so the Jaeger waterfall attributes time to GPU eval vs. prompt-eval vs. model-load vs. network round-trip.

  • Stepwise whisper-stage spans: autorag.whisper.{get_model,load_audio,load_model,ct2_transcribe,get_align_model,align,offload_align}, autorag.pyannote.{ensure_on_cuda,inference,offload}, and autorag.gpu.preload.{align,pyannote,whisper_ct2}. Attributes include cache.hit, audio.duration_s, model.compute_type, transcribe.realtime_factor, align.restored_from, and preload.cuda_{attempted,succeeded}.

  • autorag.queue.wait.<stage> retroactive span + autorag.queue.wait.duration histogram (labelled by stage.name). RabbitBroker.publish stamps an autorag-publish-ts-ns AMQP header (plus the coarser pika.BasicProperties.timestamp as fallback); get_batch extracts it onto a new Delivery.publish_ts_ns; stages._handle_one opens the queue-wait span between AMQP-context-attach and the stage span so the Jaeger waterfall reads prev stage → queue.wait → stage. The InMemoryBroker mirrors the stamping so the test path records a non-zero wait.

  • autorag.otel.bind_current_context(fn) — public helper that wraps fn so a worker-thread call inherits the caller’s OTel context. Used at every ThreadPoolExecutor.submit site in the new preload / warm-up / offload fan-outs so child spans parent under the calling thread’s current span instead of becoming orphan roots in Jaeger. Safe no-op when opentelemetry-api is not installed.

  • New autorag.gpu.preload.fanout parent span around the three boot-time preloads (align / pyannote / whisper-CT2 CPU); paired sibling autorag.gpu.preload.whisper_ct2.cpu and autorag.gpu.preload.whisper_ct2.cuda spans (replacing the single autorag.gpu.preload.whisper_ct2 span — see Changed).
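
The context-binding helper described above can be sketched with the standard library alone. This is an illustrative stand-in, not the project's implementation: it uses a plain contextvars variable (current_span, hypothetical) in place of a real OTel span, on the assumption that OpenTelemetry's Python context is contextvars-backed.

```python
import contextvars
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")

# Hypothetical stand-in for "the current OTel span".
current_span: contextvars.ContextVar[str] = contextvars.ContextVar(
    "current_span", default="<root>"
)


def bind_current_context(fn: Callable[..., T]) -> Callable[..., T]:
    """Snapshot the caller's context now; run fn inside a copy of it later."""
    snapshot = contextvars.copy_context()

    def wrapper(*args, **kwargs):
        # A fresh copy per call lets the wrapper run on several threads at once.
        return snapshot.copy().run(fn, *args, **kwargs)

    return wrapper


def observed_parent() -> str:
    """What a child span would see as its parent on this thread."""
    return current_span.get()


def run_demo() -> Tuple[str, str]:
    token = current_span.set("autorag.gpu.preload.fanout")
    try:
        with ThreadPoolExecutor(max_workers=1) as pool:
            unbound = pool.submit(observed_parent).result()
            bound = pool.submit(bind_current_context(observed_parent)).result()
    finally:
        current_span.reset(token)
    return unbound, bound
```

Here bound is the caller's span name. Whether the unbound call also sees it depends on the interpreter's thread-context settings, which is why each submit site wraps explicitly.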

Changed

  • whisperX CT2 model now stays resident on CUDA for the worker’s lifetime: whisper_runner.transcribe_segment no longer destroys the cache per call, GpuArbiter._default_offload_whisper no longer drops it on the whisper -> llm flip (only the torch parts of the stack — wav2vec2 align + pyannote — go to CPU), and GpuArbiter._preload_whisper_ct2 now also builds the CUDA fp16 instance up-front when vram_probe shows headroom. autorag.whisper.load_model should fire at most once per worker boot — its presence on job ≥ 2 is a regression signal.

  • Async-pipeline parallelism: three new ThreadPoolExecutor fan-outs eliminate sequential waits visible in the prior trace. GpuArbiter.preload runs the three CPU-side preloads (align, pyannote, whisper-CT2 int8) concurrently under autorag.gpu.preload.fanout; the CUDA fp16 CT2 build is split into a new _preload_whisper_ct2_cuda that runs after the join (single CUDA-driver step). GpuArbiter._default_offload_whisper offloads wav2vec2 and pyannote concurrently on tenancy flip. agent._run_whisper overlaps the wav2vec2-align and pyannote CPU→CUDA restores with whisper_runner.transcribe_segment (the longest leg) via a 2-worker warm-up pool; warm-up failures fall through to the existing inline restore. The original autorag.gpu.preload.whisper_ct2 span is renamed to sibling .cpu and .cuda spans.

  • OLLAMA_NUM_PARALLEL default 4→8 in docker-compose.yml, mirrored by the AudioJobRequest.max_concurrency default and the in-process defaults on agent.build_agent / build_topic_runnable / core.AutoRAG.generate_topics / autorag generate-topics --max-concurrency. Each extra slot reserves a KV-cache copy at q8_0; verify VRAM mid-l1 via docker exec autorag-ollama-1 nvidia-smi --query-gpu=memory.used,memory.free --format=csv,noheader and roll back via OLLAMA_NUM_PARALLEL=4 in .env (restart ollama only) if the new autorag.gpu.preload.whisper_ct2.cuda span flips preload.cuda_succeeded false on a tight-VRAM host.

  • Broker fan-out: summarize split into summarize_l1 and summarize_l2. StageName.summarize is removed; summarize_l1 (L1-node titles/summaries) fans out from l1 in parallel with decide + l2, and summarize_l2 (L2-child titles/summaries) runs after l2. l0 joins both: a new stages._try_emit_l0(ctx, msg) reads the per-job stage_states row and publishes the L0 message exactly once after BOTH summarize stages reach done (and never if either flips to error, because the job is already observably failed). NEXT_STAGE: StageName -> StageName | None becomes NEXT_STAGES: StageName -> list[StageName] to express the fan-out edge; StageOutcome.next_message becomes next_messages: list[StageMessage]. decide and l2 switch from _save_state (whole-row overwrite) to a new _merge_state read-modify-write helper so they don’t clobber the disjoint l1_summaries subkey that the concurrent summarize_l1 writes; l0 then merges l1_summaries + l2_summaries back onto the live l1 node list before aggregating. The single-GPU-worker invariant makes the read+write race-free; a future multi-worker topology would need a BEGIN IMMEDIATE claim on _try_emit_l0.

    One-time RabbitMQ cleanup on existing deployments: _declare_topology will auto-create the new stage.summarize_l1 / stage.summarize_l2 queues, but the old stage.summarize / stage.summarize.dlq linger. Drain any in-flight stage.summarize messages before deploying (they’ll fail to validate against the new enum), then run docker exec autorag-rabbitmq-1 rabbitmqctl delete_queue stage.summarize stage.summarize.dlq once on the live broker.
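
The fan-in join on l0 can be illustrated with a small in-memory model. This is a sketch under stated assumptions, not the code in stages._try_emit_l0: JobRow and finish_stage are hypothetical names, the stage_states row is a plain dict, and calls are assumed to be serialized, as the single-GPU-worker invariant implies.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# The two branches that must both finish before l0 may run.
JOIN_STAGES = ("summarize_l1", "summarize_l2")


@dataclass
class JobRow:
    """Hypothetical stand-in for the per-job stage_states row."""

    stage_states: Dict[str, str] = field(default_factory=dict)


def finish_stage(row: JobRow, stage: str, state: str, published: List[str]) -> bool:
    """Record a summarize stage's terminal state, then attempt the l0 join.

    Returns True only for the call that publishes the l0 message.
    """
    row.stage_states[stage] = state
    states = [row.stage_states.get(name) for name in JOIN_STAGES]
    if any(s != "done" for s in states):
        # Either a sibling is still running or one branch errored: do not emit.
        return False
    published.append("l0")
    return True
```

Because each stage records its own state before checking, only the later of the two completions observes both branches as done, so the message goes out once. With concurrent workers the check and publish would need an atomic claim, as the entry notes.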

Fixed

  • ./scripts/stack.sh up no longer silently degrades when .stack-data is root-owned (the failure mode behind the gpu-worker “attempt to write a readonly database” reconnect loop, where every job sticks on queued). prepare_stack_data now uses chown -R (so existing files inside the dir are also fixed), sudo -n (so the non-tty up path fails fast instead of prompting), adds a docker-based chown fallback that works even when the host user can’t sudo (containers run as root in their own namespace), and dies loudly with an actionable hint if all paths fail instead of pretending the stack came up clean.

  • Database._upsert_clip_identity no longer raises IndexError when the row already exists. sqlite_utils.insert(ignore=True) reads back the row by last_rowid, which is 0 on the ignored conflict and errors on the lookup; the conflict path is the whole point of ignore=True, so the exception is now swallowed.

0.10.0 - 2026-05-20

Added

  • OpenTelemetry traces + metrics across the async pipeline (autorag.otel, new [observability] extra — not folded into [all]). Off by default — AUTORAG_OTEL_ENABLED=false makes initialize_otel short-circuit before importing any opentelemetry.* symbols, so base install and [broker]-only install keep booting unchanged. Opt-in adds one autorag.stage.<name> span per stage in services.stages.handle_batch and in the in-process services.runner.run_job_in_process, an autorag.job.submit root span in services.broker.submit_audio_job, nested autorag.gpu.acquire / .evict.llm / .offload.whisper / .enforce_budget spans on GpuArbiter.acquire, and seven custom metrics (autorag.jobs.{submitted,completed}, autorag.stage.{duration,attempts,dlq}, autorag.gpu.tenancy.{transitions,duration}). W3C trace context rides AMQP via a new Delivery.otel_ctx field + manual inject_amqp_headers / extract_amqp_context calls in RabbitBroker (the contrib pika instrumentor wraps only basic_publish/basic_consume, not the pull-mode basic_get the broker uses). The handler-side publish runs in drain_and_dispatch after the stage span has closed, so StageOutcome carries the stage span’s OTel context (captured at outcome-construction time) and drain_and_dispatch re-attaches it around broker.publish via a small _publish_in_ctx helper — without that re-attach the injected traceparent would have no parent and every stage hop would start a fresh trace. The Typer @app.callback skips OTel init when ctx.invoked_subcommand == "serve" so the API process registers spans under autorag-api instead of autorag-cli (initialize_otel is idempotent — first call wins).

  • New opt-in compose profile observability: otel-collector, jaeger, prometheus (scrapes RabbitMQ’s :15692 rabbitmq_prometheus plugin on /metrics/per-object so queue depth is labeled by queue), and grafana with a starter AutoRAG Pipeline dashboard. All bound to 127.0.0.1. Brought up with ./scripts/stack.sh up --with-observability.

  • Settings gains otel_enabled, otel_service_name, otel_exporter_endpoint, otel_metric_export_interval_ms, otel_environment, otel_resource_attributes (all AUTORAG_OTEL_*).
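
The trace-context hop over AMQP described above comes down to writing and reading one W3C traceparent header. The following is a minimal standard-library sketch of that round trip; inject_traceparent and extract_traceparent are hypothetical stand-ins, not the project's inject_amqp_headers / extract_amqp_context, which presumably delegate to OpenTelemetry's propagators.

```python
import re
from typing import Dict, Optional, Tuple

# traceparent layout: version "00", 32-hex trace id, 16-hex parent span id, 2-hex flags.
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def inject_traceparent(
    headers: Dict[str, str], trace_id: str, span_id: str, sampled: bool = True
) -> Dict[str, str]:
    """Return a copy of the message headers carrying the publisher's span identity."""
    out = dict(headers)
    out["traceparent"] = f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"
    return out


def extract_traceparent(headers: Dict[str, str]) -> Optional[Tuple[str, str, bool]]:
    """Parse (trace_id, parent_span_id, sampled) or None if absent or malformed."""
    match = _TRACEPARENT.match(str(headers.get("traceparent", "")))
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    return trace_id, span_id, bool(int(flags, 16) & 1)
```

A consumer that gets None back has no parent to attach to and would start a fresh trace, which is the failure mode the _publish_in_ctx re-attach avoids.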

Changed

  • Single-command host stack. docker-compose.yml is now the one source of truth for the entire heavy stack: it adds an ollama service (sole owner of the server-side tuning contract), a read-only tecnativa/docker-socket-proxy as the whole control plane (ps/logs/restart only — no build/exec/run/published port), and a lean .devcontainer/worker.Dockerfile for the gpu/io workers (deps baked; ./src + pyproject.toml bind-mounted read-only and uv run, so a code edit is a worker restart, not a rebuild). The host brings everything up with ./scripts/stack.sh up; the thin devcontainer joins the shared autorag-net and drives the stack through the proxy (docker compose -p autorag ps|logs|restart). .devcontainer/start-broker.sh / start-ollama.sh are removed — deployment is documented in the new SETUP.md.

  • [audio] now pins torch~=2.8.0 and adds torchcodec>=0.7,<0.8 (whisperx 3.8.5 transitively caps it <0.8); [diarize] requires pyannote.audio>=4.0.0; the imageio-ffmpeg floor is relaxed to >=0.4.0. worker.Dockerfile installs libpython3.12t64 because torchcodec’s FFmpeg-6 native lib dlopens the shared CPython library, which Ubuntu 24.04’s python3.12 package does not pull in.

  • A job needing an Ollama model the server has not pulled now fails with an actionable “Ollama model not available … ./scripts/stack.sh up” message instead of an opaque dead-letter traceback (services.stages._legible_error).

  • autorag-gpu-worker / autorag-io-worker reconnect to RabbitMQ with backoff on a transient broker fault, keeping preloaded models warm (no cold reload) instead of exiting.
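
The reconnect behaviour can be pictured as a retry loop that stays inside the worker process, so anything already loaded remains in memory. This is a generic sketch, not the workers' actual loop: run_with_reconnect and backoff_delays are hypothetical names, the built-in ConnectionError stands in for the broker client's connection exception, and the delay schedule is an assumption.

```python
import time
from typing import Callable, Iterator


def backoff_delays(base: float = 1.0, cap: float = 30.0) -> Iterator[float]:
    """Yield base, 2*base, 4*base, ... clamped to cap seconds."""
    delay = base
    while True:
        yield min(delay, cap)
        delay *= 2


def run_with_reconnect(
    consume: Callable[[], None],
    max_attempts: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run consume() until it returns; return how many reconnects it took."""
    delays = backoff_delays()
    for attempt in range(max_attempts):
        try:
            consume()
            return attempt
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of retries: surface the fault
            sleep(next(delays))  # wait, then reconnect in the same process
    raise ValueError("max_attempts must be at least 1")
```

Passing sleep as a parameter keeps the loop testable without real waiting.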

Fixed

  • AutoRAG.persist_topics / autorag generate-topics no longer null the stored transcription when persisting topics. The clip Database now uses column-scoped sqlite_utils upserts instead of a per-instance pydantic_sqlite read-modify-write, so a second Database instance (a separate worker process, or AutoRAG’s second persist call) cannot overwrite the transcript another instance wrote; create_clip is first-writer-wins.

  • Async /jobs/audio pipeline: the persist stage no longer crashes the persist-only IO worker (it built LLM handlers before the persist branch), and persisted topics are no longer orphaned to a second clip row for YouTube URLs (_default_persist now forwards source_url so the session id canonicalises to the whisper row).
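
The difference between a column-scoped upsert and a whole-row read-modify-write can be shown with the standard sqlite3 module. This sketch is illustrative only: the clips schema, column names, and upsert_column helper are hypothetical, and the project itself uses sqlite_utils rather than raw SQL.

```python
import sqlite3

ALLOWED_COLUMNS = {"transcript", "topics"}


def open_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical clips table."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE clips (id TEXT PRIMARY KEY, transcript TEXT, topics TEXT)"
    )
    return db


def upsert_column(db: sqlite3.Connection, clip_id: str, column: str, value: str) -> None:
    """Write one column of one row, leaving every other column untouched."""
    if column not in ALLOWED_COLUMNS:
        raise ValueError(f"unknown column: {column}")
    # The column name is checked against an allow-list before interpolation.
    db.execute(
        f"INSERT INTO clips (id, {column}) VALUES (?, ?) "
        f"ON CONFLICT(id) DO UPDATE SET {column} = excluded.{column}",
        (clip_id, value),
    )
    db.commit()
```

A writer that only knows about topics never touches transcript, so a stale in-memory copy of the row cannot null it out.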

0.9.0 - 2026-05-19

Added

  • Async, RabbitMQ-driven, GPU-aware pipeline — a new optional [broker] extra (pika) and an autorag.services package that runs many audio→topics requests concurrently alongside the unchanged synchronous SDK / CLI / API (which keep their direct in-process path and never need a broker):

    • RabbitMQ work-queue-per-stage topology + dead-letter exchange with bounded handler-driven retry; a dependency-free InMemoryBroker; submit_audio_job.

    • One autorag-gpu-worker (owns whisper + every LLM stage) and an autorag-io-worker (owns persist). GpuArbiter CPU-preloads model standbys and smart-unloads the prior GPU tenant on demand, reusing the existing whisper_runner / diarize offload primitives (whisperX CT2 is destroy+rebuilt — it is not a movable torch module).

    • Durable JobStore (jobs table in the existing SQLite DB, cross-process readable); transcripts travel by session_id reference (services.blobs), never in messages; the evolving tree lives in the job row.

    • POST /jobs/audio (202 + job_id), GET /jobs/{id}, GET /jobs/{id}/result added to autorag.api, plus optional autorag jobs submit / autorag jobs status CLI subcommands. The handlers import autorag.services lazily → clean 503 when [broker] / [rag] are absent; import autorag.services stays base-install safe (no torch / chromadb / pika).

    • autorag.agent.build_stage_handlers() exposes the per-stage closures, sharing _build_stage_closures with build_topic_runnable so the distributed and in-process paths build identical warm Ollama chains and the same keep_alive=0 eviction. New AUTORAG_BROKER_URL setting; persistence.load_clip (cross-process clip read); repo-root docker-compose.yml and a best-effort devcontainer start-broker.sh.

Changed

  • The bundled /viz frontend’s build tooling and runtime stack were upgraded across several majors: Vite 5→8 (Rolldown bundler), TypeScript 5→6, @vitejs/plugin-react 4→6, React 18→19, @react-three/fiber 8→9, @react-three/drei 9→10, three 0.165→0.184, and zustand 4→5. The committed src/autorag/static/viz/ bundle was rebuilt in lockstep; /viz behaviour and the Python public API are unchanged. react / react-dom are now pinned ~19.2.6 (tilde, not caret) because @react-three/fiber@9 peers require react >=19 <19.3.

0.8.0 - 2026-05-16

Added

  • autorag generate-topics now exposes the LLM tuning knobs that AutoRAG.generate_topics already accepted: --num-ctx-l1, --num-ctx-fanout, --max-concurrency, --min-subdivide-duration-s, and --reasoning/--no-reasoning. Forwarded 1:1 to the facade with the same defaults (8192 / 8192 / 4 / 120.0 / False); ollama_base_url stays env-only via AUTORAG_OLLAMA_BASE_URL.

  • New boundary_block_seconds tuning kwarg (default 30) on AutoRAG.generate_topics / agent.build_topic_runnable / agent.build_agent, exposed as --boundary-block-seconds on autorag generate-topics. Sizes the time-bucketed transcript fed to the L1/L2 boundary prompts (was the hardcoded private _BOUNDARY_BLOCK_SECONDS); smaller windows give finer MM:SS anchors at the cost of more boundary-prompt tokens.

Changed

  • Default topic LLM is now gemma4:latest (8B Q4_K_M, ~9.6 GB), replacing qwen2.5:14b-instruct-q8_0, across AutoRAG.generate_topics / agent.build_topic_runnable / build_agent and the autorag generate-topics CLI. gemma4:latest is a thinking-capable model; because all five stages do mechanical JSON extraction, the agent disables thinking by default. New overridable reasoning: bool = False kwarg on build_topic_runnable / build_agent / AutoRAG.generate_topics (sends think: false to Ollama on thinking models; harmless no-op otherwise) — pass reasoning=True to trade latency for chain-of-thought. The lighter default also frees VRAM: the 4 agent slots + model now sit at ~11 GB on a 24 GB card (was ~15 GB+ for the qwen 14B).

  • The topic agent now keeps the Ollama model resident in VRAM for the whole run instead of cold-reloading it (~15 GB) at every stage boundary. All five stages share one num_ctx and keep_alive="5m" (Ollama reloads on any num_ctx change, so a uniform size is what keeps it warm); _build_tree issues one throwaway keep_alive=0 call after the run — or on a stage error — to evict the model so it doesn’t squat VRAM during the downstream embed/viz step. Substantially cuts topic-generation wall-clock.

  • num_ctx_l1 now defaults to 8192 (was 16384) in AutoRAG.generate_topics / agent.build_topic_runnable / build_agent, so the L1 call shares the fan-out context size. Trade-off: on very long audio (≈1 hr+) the L1 transcript can truncate at 8192 and degrade boundary quality — raise num_ctx_l1 back to 16384 to restore fidelity, at the cost of one model reload at the Stage 2→3a boundary.

  • Transcription now defaults to English. --language defaults to en on autorag transcribe / generate-topics / blocks, and the language parameter defaults to "en" on AutoRAG.transcribe / AutoRAG.transcribe_blocks / agent.transcribe_audio / agent.build_agent (was Whisper auto-detect). Behavior change for SDK consumers relying on auto-detect: pass language=None (SDK) or --language "" (CLI) to restore it.

0.7.0 - 2026-05-15

Added

  • GET /viz now renders the interactive 3-D topic constellation: per-level glowing points, clip/cluster coloring, additive knowledge-graph edges, a pointer tooltip, two-way rail↔scene hover sync, and debounced semantic search with click-to-focus. The React rewrite had previously shipped only the left rail, so the page showed no embeddings; the r3f scene (frontend/src/three/) is now implemented and the committed bundle rebuilt. UMAP coordinates are recentred/scaled in three/layout.ts (raw /viz/data coords are not origin-centred), and an error boundary keeps the rail usable if WebGL is unavailable.

  • Hosted documentation at https://autologger.github.io/AutoRAG/, published to GitHub Pages on every push to main (.github/workflows/docs.yml).

  • autorag.blocks.mmss(t) — public MM:SS second-formatter (promoted from the private _mmss), now exported in autorag.blocks.__all__.

Changed

  • The topic agent’s L1/L2 boundary detection now feeds the LLM a 30-second time-bucketed transcript via blocks.format_blocks (one MM:SS-MM:SS Speaker K: <words> line per turn) instead of one timestamped line per word, and the boundary LLM emits MM:SS offsets that agent._parse_ts converts back to seconds in code. Cuts boundary-prompt size sharply; AutoRAG.generate_topics / build_agent signatures and the Runnable[list[WordSpan], TopicTree] contract are unchanged.
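
The MM:SS round trip between the boundary prompt and the code can be sketched as a pair of small helpers. These are hypothetical illustrations, not autorag.blocks.mmss or agent._parse_ts; in particular, truncating fractional seconds and letting minutes run past 59 are assumptions.

```python
def format_mmss(t: float) -> str:
    """Render a second offset as MM:SS, truncating fractional seconds."""
    minutes, seconds = divmod(int(t), 60)
    return f"{minutes:02d}:{seconds:02d}"


def parse_mmss(stamp: str) -> int:
    """Convert an MM:SS offset emitted by the model back to whole seconds."""
    minutes, seconds = stamp.strip().split(":")
    return int(minutes) * 60 + int(seconds)
```

For example, format_mmss(150) gives "02:30" and parse_mmss("02:30") gives 150, so boundaries survive the trip at one-second resolution.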

Removed

  • src/autorag/static/viz.html — the original vanilla Three.js /viz page. It was orphaned once /viz switched to the React bundle (viz.py serves static/viz/index.html, never this file) and had been shipping unused in the wheel via the static/ glob.

Fixed

  • IngestRequest (POST /ingest) is no longer left “not fully defined”: pathlib.Path is imported at runtime again so Pydantic can resolve the paths field. Restores IngestRequest.model_rebuild(), FastAPI OpenAPI schema generation, and the Sphinx autodoc build.

  • The strict docs build no longer fails under --all-extras: transformers (pulled transitively by langchain_core, a base dep) is now mocked in autodoc_mock_imports, so base+docs and all-extras builds take the same path.

0.6.0 - 2026-05-12

Changed

  • Replaced openai-whisper with whisperX (faster-whisper / CTranslate2 backend + wav2vec2 forced-alignment pass). Transcription is ~4× faster and word-level timestamps are frame-accurate rather than Whisper-estimated. The [audio] extra now pulls whisperx instead of openai-whisper; the public API (AutoRAG.transcribe, WordSpan shape) is unchanged.

0.5.0 - 2026-05-11

Changed

  • AutoRAG.generate_topics() now applies collapse_lone_children before returning, so callers always receive a normalized TopicTree regardless of whether persist_topics is called. persist_topics no longer collapses the tree itself.

Fixed

  • Suppress spurious pyannote UserWarning about std() degrees of freedom from StatsPool on single-frame diarization segments; the warning was harmless (pyannote handles the NaN internally) but polluted log output.

0.4.0 - 2026-05-11

Added

  • AutoRAG.generate_topics(words, ...) → TopicTree: pure LLM topic extraction on pre-computed list[WordSpan], no audio involved.

  • AutoRAG.persist_topics(file, topics, ...): stores the topic tree to SQLite and embeds topic titles into Chroma. Call after persist_transcription.

  • build_topic_runnable() in agent.py — LangChain Runnable[list[WordSpan], TopicTree] (Whisper-free; build_agent wraps it).

  • agent.transcribe_audio(file) → list[WordSpan] and agent.generate_topics(words) → TopicTree as standalone module-level helpers (lower-level alternatives to the AutoRAG facade).

  • autorag generate-topics CLI command: transcribes (or reads from cache), generates LLM topics, and persists transcription + topics + embeddings.

Changed

  • AutoRAG.transcribe() now returns list[WordSpan] instead of TranscriptionResult; call generate_topics() separately for the LLM topic tree.

  • AutoRAG.persist_transcription() now stores word spans only; call persist_topics() to persist the topic tree and Chroma embeddings.

  • autorag transcribe CLI now only transcribes and persists word spans (no LLM topic generation). Use autorag generate-topics for the full pipeline.

Removed

  • abs_s field removed from WordSpan dict construction in agent.py (was redundant with s and was never declared in the WordSpan TypedDict).

0.3.3 - 2026-05-11

Fixed

  • Whisper and pyannote pipeline VRAM is released immediately after inference: transcribe_segment and _run_diarization now move their models to CPU and call torch.cuda.empty_cache() so Ollama’s LLM stages start with the GPU unencumbered. Both modules restore to CUDA automatically on the next call.

0.3.2 - 2026-05-10

Changed

  • /viz rail (header / stats / legend / size legend / controls / search / topic list) now renders from the React app, fed by a typed useVizData() hook hitting /viz/data. Color-mode and edges-visible state are held in a Zustand store (frontend/src/state/vizStore.ts) so the canvas (Phase C+) can read the same toggles. Phase B: DOM only — <canvas>, raycast, tooltip, and search wiring are still in the unmodified viz.html until later phases land them in frontend/src/three/.

0.3.1 - 2026-05-10

Changed

  • /viz is now served from a Vite-built React + TypeScript bundle under src/autorag/static/viz/index.html, mounted alongside a new /viz-assets static route. Source lives in the new top-level frontend/ directory (outside src/autorag/ so uv/ruff/mypy don’t scan TypeScript). Phase A: scaffold + FastAPI wiring only — the existing Three.js scene is preserved in viz.html and will be ported to react-three-fiber in subsequent commits.

0.3.0 - 2026-05-10

Added

  • transcribe accepts YouTube URLs via the [youtube] extra; URL is downloaded to a temp .webm through autorag.audio_source.resolve_audio_input (lazy yt_dlp import).

  • AudioSource carries source_url, video_id, title, upload_date, duration_s, and uploader lifted from yt-dlp’s info dict. The CLI forwards these to persist_transcription.

  • autorag.blocks.format_blocks (re-exported as from autorag import format_blocks) renders a WordSpan list as N-second time blocks with one MM:SS-MM:SS Speaker K: ... line per speaker turn. Pure stdlib — callable from a base install.

  • AutoRAG.transcribe_blocks(file, seconds=10, ...) returns the same formatted output, reading from the SQLite cache when available and otherwise running the full transcribe + persist pipeline first. Requires [rag] for the cache path, [audio,diarize] (+ [youtube] for URLs) on cache miss.

  • autorag blocks SOURCE [-n SECONDS] CLI command wrapping transcribe_blocks.

  • autorag.persistence.derive_session_id(file_or_url) and load_transcription(db, session_id) expose the session-id derivation and the cached-transcription read path as base-safe public helpers.

Changed

  • session_id is derived deterministically from the canonical YouTube URL (youtu.be / m.youtube.com / www.youtube.com variants collapse to one form) so re-runs overwrite the same SQLite row.

  • Renamed remaining AUTOLOGGER_* env vars to AUTORAG_*; devcontainer mount updated to match.

  • Clip created_at and absolute event timestamps anchor to the YouTube upload_date (midnight UTC) when present, instead of the temp-file mtime.

  • default_title_from(source) moved from cli.py (private _default_title_from) to autorag.audio_source as a public helper.

  • group_by_speaker moved from agent.py to autorag.blocks and is now part of the public surface; agent._format_transcript re-imports it from there.
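
The deterministic session_id change above depends on collapsing YouTube URL variants to one canonical form before deriving the id. A minimal sketch of that idea follows; youtube_video_id, canonical_url, and session_id_for are hypothetical names, and the SHA-256 prefix is an assumed derivation, not necessarily what autorag.persistence.derive_session_id does.

```python
import hashlib
from typing import Optional
from urllib.parse import parse_qs, urlparse

_WATCH_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com"}


def youtube_video_id(url: str) -> Optional[str]:
    """Extract the video id from youtu.be and youtube.com/watch URLs."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host == "youtu.be":
        return parsed.path.lstrip("/").split("/")[0] or None
    if host in _WATCH_HOSTS and parsed.path == "/watch":
        return parse_qs(parsed.query).get("v", [None])[0]
    return None


def canonical_url(url: str) -> str:
    """Collapse known variants to one form; pass other inputs through unchanged."""
    video_id = youtube_video_id(url)
    if video_id is None:
        return url
    return f"https://www.youtube.com/watch?v={video_id}"


def session_id_for(url: str) -> str:
    """Derive a stable id from the canonical form (hash choice is an assumption)."""
    return hashlib.sha256(canonical_url(url).encode("utf-8")).hexdigest()[:16]
```

With this shape, the short link, the mobile link, and the desktop link for the same video map to the same id, so a re-run overwrites the same row.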

0.2.0 - 2026-05-10

Added

  • SDK facade from autorag import AutoRAG with flat methods (transcribe, build_agent, persist_transcription, ingest, query).

  • Pip-installable from GitHub: pip install "autorag[...] @ git+https://github.com/AutoLogger/AutoRAG@v0.2.0".

  • Optional extras: [audio], [diarize], [rag], [server], [all]. MissingExtraError is raised with a friendly hint when an extra is missing.

  • Speaker diarization via pyannote/speaker-diarization-3.1 (gated by [diarize] + HF_TOKEN). Each WordSpan carries a speaker field.

  • Unified multi-pass L0/L1/L2 topic agent in src/autorag/agent.py, with boundary detection separated from per-node summarization.

  • GitHub Actions CI: lint/type-check, full-extras tests, and an SDK base-install regression guard for the lazy-import contract.

Changed

  • All LLM and embedding calls migrated to langchain-ollama.

  • Topic embeddings moved from a SQLite column into a persistent Chroma store.

  • Default topic model is qwen2.5:14b-instruct-q8_0.

Removed

  • Non-Ollama LLM providers.

  • Unused replace_existing parameter from the transcription flow.