Transcription and topic extraction

AutoRAG’s audio pipeline turns an audio file (or YouTube URL) into:

  1. A list of timestamped WordSpan records via whisperX (faster-whisper + wav2vec2 forced alignment).

  2. Speaker labels on every word via pyannote (when [diarize] is installed and HF_TOKEN is set).

  3. A 3-level hierarchical TopicTree produced by an LLM in five focused passes.

Steps 1 and 2 are exposed through AutoRAG.transcribe; step 3 is AutoRAG.generate_topics.

Transcribe a local file

from autorag import AutoRAG

rag = AutoRAG()
words = rag.transcribe("meeting.wav", whisper_model="base", language="en")
print(words[:3])
# [{'w': ' Hello', 's': 0.0, 'e': 0.4, 'speaker': '0'}, …]

  • whisper_model accepts the standard Whisper sizes (tiny, base, small, medium, large).

  • language defaults to English ("en"); pass language=None (SDK) or --language "" (CLI) to let Whisper auto-detect.

  • Each WordSpan carries the word token, its start and end times in seconds, and the diarization-assigned speaker id: a string numbered from "0" in order of first appearance, and always "0" when diarization is disabled.
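To make the record shape concrete, here is a minimal sketch that collapses consecutive words from the same speaker into turns. The WordSpan TypedDict below is an assumption inferred from the printed sample above (keys w, s, e, speaker), and speaker_turns is a hypothetical helper, not part of the AutoRAG API.

```python
from itertools import groupby
from typing import List, Tuple, TypedDict


class WordSpan(TypedDict):
    """Assumed shape, inferred from the sample output; not imported from autorag."""
    w: str        # word token, may carry a leading space
    s: float      # start time in seconds
    e: float      # end time in seconds
    speaker: str  # diarization speaker id, e.g. "0"


def speaker_turns(words: List[WordSpan]) -> List[Tuple[str, float, float, str]]:
    """Collapse consecutive same-speaker words into (speaker, start, end, text)."""
    turns = []
    for speaker, group in groupby(words, key=lambda word: word["speaker"]):
        run = list(group)
        text = "".join(word["w"] for word in run).strip()
        turns.append((speaker, run[0]["s"], run[-1]["e"], text))
    return turns


sample: List[WordSpan] = [
    {"w": " Hello", "s": 0.0, "e": 0.4, "speaker": "0"},
    {"w": " there", "s": 0.4, "e": 0.8, "speaker": "0"},
    {"w": " Hi", "s": 1.0, "e": 1.2, "speaker": "1"},
]
print(speaker_turns(sample))
```

Because the tokens keep their leading spaces, joining them with an empty string and stripping once reconstructs the text without double spacing.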

The CTranslate2 model is unloaded after each call so the next run starts from a clean VRAM budget; the wav2vec2 alignment model is kept on the CPU in the meantime and moved back to the GPU on the next call.

Extract topics

topics = rag.generate_topics(words)
print(topics["topics"][0]["title"])

Internally the agent issues five distinct sets of LLM calls (L1 boundaries, “should this L1 subdivide?”, L2 boundaries, per-node summarization, and an L0 aggregate), for roughly 2 + N1_long + N1_yes + N1 + N2_total calls in total. See Audio pipeline design for why the boundaries-vs-summaries split is structured that way.
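The call-count estimate above can be worked through with a small helper. The reading of each term is an assumption, not something this page states: N1 is taken as the number of L1 topics, N1_long as the L1 topics long enough to be asked the subdivide question, N1_yes as those that are actually subdivided, N2_total as the total number of L2 topics, and the constant 2 as the single L1-boundaries call plus the single L0 aggregate call. estimate_llm_calls is a hypothetical name.

```python
def estimate_llm_calls(n1: int, n1_long: int, n1_yes: int, n2_total: int) -> int:
    """Rough LLM call count for one topic-extraction run (assumed term meanings).

    2         -> one L1-boundaries call + one L0 aggregate call
    n1_long   -> one "should this L1 subdivide?" call per long L1 topic
    n1_yes    -> one L2-boundaries call per L1 topic that subdivides
    n1        -> one summarization call per L1 topic
    n2_total  -> one summarization call per L2 topic
    """
    if not 0 <= n1_yes <= n1_long <= n1:
        raise ValueError("expected 0 <= n1_yes <= n1_long <= n1")
    if n2_total < 0:
        raise ValueError("n2_total must be non-negative")
    return 2 + n1_long + n1_yes + n1 + n2_total


# Example: 6 L1 topics, 3 long enough to consider, 2 subdivided into 5 L2 topics.
print(estimate_llm_calls(n1=6, n1_long=3, n1_yes=2, n2_total=5))  # 18
```

Under this reading, a recording whose L1 topics are all short costs 2 + N1 calls, since neither the subdivide question nor the L2 passes are triggered.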

Persist

The persistence layer requires the [rag] extra:

rag.persist_transcription("meeting.wav", words, title="Weekly sync")
rag.persist_topics("meeting.wav", topics, words=words, title="Weekly sync")

Session ids are stable: a local path maps to the UUID-5 of its resolved path, and a YouTube URL collapses to a canonical https://www.youtube.com/watch?v=<id> form. Re-running on the same input overwrites the existing row instead of duplicating it.
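A minimal sketch of how such stable ids could be derived, using only the standard library. Everything here is an assumption for illustration: the function names are hypothetical, the UUID namespace (uuid.NAMESPACE_URL) is a guess, the set of recognized YouTube hosts is not taken from AutoRAG, and the sketch simply uses the canonical URL itself as the key for YouTube inputs.

```python
import uuid
from pathlib import Path
from typing import Optional
from urllib.parse import parse_qs, urlparse


def canonical_youtube_url(url: str) -> Optional[str]:
    """Return the canonical watch URL, or None if this is not a YouTube link."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host == "youtu.be":
        # Short links carry the video id as the first path segment.
        video_id = parsed.path.lstrip("/").split("/")[0]
    elif host in {"youtube.com", "www.youtube.com", "m.youtube.com"}:
        # Watch links carry the video id in the "v" query parameter.
        video_id = parse_qs(parsed.query).get("v", [""])[0]
    else:
        return None
    return f"https://www.youtube.com/watch?v={video_id}" if video_id else None


def session_id(source: str) -> str:
    """Stable key: canonical URL for YouTube, UUID-5 of the resolved path otherwise."""
    canonical = canonical_youtube_url(source)
    if canonical is not None:
        return canonical
    # The namespace is an assumption; any fixed namespace gives stable ids.
    return str(uuid.uuid5(uuid.NAMESPACE_URL, str(Path(source).resolve())))


print(session_id("https://youtu.be/abc123?t=10"))
print(session_id("meeting.wav") == session_id("./meeting.wav"))
```

Resolving the path before hashing is what makes "meeting.wav" and "./meeting.wav" map to the same id, which in turn is what lets a re-run overwrite the existing row.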

Cached, dependency-free reads

Once a clip is in SQLite, AutoRAG.transcribe_blocks can read it back without loading Whisper or pyannote; a cache hit requires only the [rag] extra. The [audio] and [diarize] extras (and [youtube] for URLs) are imported lazily, only when the cache misses.

blocks_text = rag.transcribe_blocks("meeting.wav", seconds=10)
# 00:00-00:08 Speaker 1: Hello, welcome to the standup …

If you already have a WordSpan list in hand, autorag.blocks.format_blocks does the same formatting with no deps at all.
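For intuition about what block formatting involves, the following is a rough, dependency-free sketch. It is not the implementation of autorag.blocks.format_blocks: the function name sketch_blocks, the rule of starting a new block on a speaker change or once the window length is reached, and the 1-based "Speaker N" label are all assumptions inferred from the sample line above.

```python
def mmss(t: float) -> str:
    """Format seconds as MM:SS, truncating fractional seconds."""
    minutes, secs = divmod(int(t), 60)
    return f"{minutes:02d}:{secs:02d}"


def sketch_blocks(words, seconds=10):
    """Group word dicts (keys w, s, e, speaker) into timestamped text lines."""
    lines = []
    block = []

    def flush():
        if not block:
            return
        text = "".join(word["w"] for word in block).strip()
        # Assumes numeric string ids ("0", "1", ...) shown 1-based to the reader.
        label = int(block[0]["speaker"]) + 1
        start, end = mmss(block[0]["s"]), mmss(block[-1]["e"])
        lines.append(f"{start}-{end} Speaker {label}: {text}")
        block.clear()

    for word in words:
        new_speaker = bool(block) and word["speaker"] != block[0]["speaker"]
        window_full = bool(block) and word["s"] - block[0]["s"] >= seconds
        if new_speaker or window_full:
            flush()
        block.append(word)
    flush()
    return "\n".join(lines)


demo = [
    {"w": " Hello,", "s": 0.0, "e": 0.4, "speaker": "0"},
    {"w": " welcome", "s": 0.5, "e": 8.2, "speaker": "0"},
    {"w": " Thanks", "s": 10.5, "e": 11.0, "speaker": "1"},
]
print(sketch_blocks(demo, seconds=10))
```

The block's end timestamp comes from the last word it contains rather than from the window boundary, which is consistent with a 10-second window producing a 00:00-00:08 line.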