autorag.db

SQLite-backed database for audio clip transcription and topic storage.

class autorag.db.AudioClip(**data)[source]

Bases: BaseModel

One row of the audio_clips SQLite table.

Fields transcription and topics are JSON-encoded strings; use Database.get_clip() to fetch and decode them. The whisper_model / provider / llm_model columns are populated by Database.finalize_topics() to record which backends produced the stored data.
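As a rough illustration of what "JSON-encoded strings" means for a caller holding a raw row, the sketch below decodes the two fields by hand. The helper name decode_clip_row and the sample row contents are hypothetical; the actual decoding performed by Database.get_clip() may differ.

```python
import json


def decode_clip_row(row):
    """Return a copy of a raw row with the JSON text columns parsed.

    Hypothetical helper for illustration only; not part of autorag.db.
    """
    decoded = dict(row)
    for field in ("transcription", "topics"):
        raw = decoded.get(field)
        # NULL columns stay None; anything else is parsed from JSON text.
        decoded[field] = json.loads(raw) if raw is not None else None
    return decoded


# Sample row shaped like the AudioClip fields (values are made up).
raw_row = {
    "id": "s1",
    "title": "Demo",
    "file_path": "/tmp/demo.wav",
    "created_at": "2024-01-01T00:00:00Z",
    "transcription": '[{"word": "hello"}]',
    "topics": None,
    "whisper_model": None,
    "provider": None,
    "llm_model": None,
}
clip = decode_clip_row(raw_row)
```

Here clip["transcription"] becomes a Python list while clip["topics"] stays None, because no topics have been written yet.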

Parameters:
  • data (Any)

  • id (str)

  • title (str)

  • file_path (str)

  • created_at (str)

  • transcription (str | None)

  • topics (str | None)

  • whisper_model (str | None)

  • provider (str | None)

  • llm_model (str | None)

id: str
title: str
file_path: str
created_at: str
transcription: str | None
topics: str | None
whisper_model: str | None
provider: str | None
llm_model: str | None
model_config: ClassVar[ConfigDict] = {}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

class autorag.db.Database(db_path)[source]

Bases: object

SQLite façade for AudioRAG clip state, keyed by session_id.

Every audio_clips read and write goes through the raw sqlite_utils table handle as a column-scoped upsert or a raw read. It deliberately does not round-trip the row through pydantic_sqlite’s model registry. That registry is per-instance and in-process, so a freshly constructed Database (in a different worker process, or just a second AutoRAG persist call) would not see another instance’s rows. The old read-modify-write path, model_from_table followed by a full-object add, therefore upserted a blank model over the on-disk row and silently nulled the transcript a different process had written. Column-scoped upserts touch only the columns a method owns, so that clobber is impossible by construction, regardless of instance or process.
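The difference between a column-scoped upsert and a whole-row write can be sketched with the standard-library sqlite3 module. This is an illustration under assumptions, not the class's implementation: the real code goes through sqlite_utils, the helper upsert_columns is hypothetical, and the table here carries only a subset of the columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE audio_clips ("
    "id TEXT PRIMARY KEY, title TEXT, file_path TEXT, created_at TEXT, "
    "transcription TEXT, topics TEXT)"
)


def upsert_columns(conn, clip_id, **columns):
    """Create the row if absent, then update only the named columns.

    Hypothetical helper; column names are interpolated unsanitised,
    which is acceptable only because this is a fixed-input sketch.
    """
    conn.execute("INSERT OR IGNORE INTO audio_clips (id) VALUES (?)", (clip_id,))
    assignments = ", ".join(f"{name} = ?" for name in columns)
    conn.execute(
        f"UPDATE audio_clips SET {assignments} WHERE id = ?",
        (*columns.values(), clip_id),
    )


# Writer A stores the transcript.
upsert_columns(conn, "s1", transcription='[{"word": "hi"}]')
# Writer B knows nothing about the transcript and writes only topics.
upsert_columns(conn, "s1", topics='[{"title": "intro"}]')

row = conn.execute(
    "SELECT transcription, topics FROM audio_clips WHERE id = ?", ("s1",)
).fetchone()

# Contrast: a whole-row write built from a blank in-memory model
# replaces every column, including the transcript it never loaded.
conn.execute(
    "INSERT OR REPLACE INTO audio_clips "
    "(id, title, file_path, created_at, transcription, topics) "
    "VALUES (?, ?, ?, ?, ?, ?)",
    ("s1", "", "", "", None, None),
)
clobbered = conn.execute(
    "SELECT transcription FROM audio_clips WHERE id = ?", ("s1",)
).fetchone()
```

After the two scoped writes, row holds both the transcript and the topics. After the whole-row write, clobbered contains None, which is the failure mode the column-scoped design rules out.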

pydantic_sqlite is still used for the separate jobs table (see autorag.services.jobs.JobStore), where a whole-record write is the intended semantics; self.db is kept for that reuse.

Creates the SQLite file (and any missing parent directories) and the audio_clips schema on construction.

Parameters:

db_path (Path)

add_analytics_event(session_id, *, category, message, metadata, marked_at_utc)[source]

Build the analytics-event dict written into a clip’s topics JSON.

Does not touch the database itself — callers accumulate the returned dicts and pass them to finalize_topics().

Parameters:
  • session_id (str)

  • category

  • message

  • metadata

  • marked_at_utc

Return type:

dict[str, Any]

create_clip(session_id, *, title, file_path, created_at)[source]

Insert an AudioClip row if one doesn’t already exist.

First-writer-wins (INSERT OR IGNORE): the call is a no-op when the session_id is already present, so a later create_clip with a different title or path never overwrites the original and, crucially, never resets the transcription / topics that a different process wrote in between. Only the four identity columns are written; the rest default to NULL.
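A minimal sketch of the first-writer-wins behaviour, using plain sqlite3 rather than the sqlite_utils handle the class uses. The function create_clip_sketch is hypothetical, and it assumes that session_id is stored in the id column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE audio_clips ("
    "id TEXT PRIMARY KEY, title TEXT, file_path TEXT, created_at TEXT, "
    "transcription TEXT, topics TEXT)"
)


def create_clip_sketch(conn, session_id, *, title, file_path, created_at):
    """Insert the four identity columns; do nothing if the id already exists."""
    conn.execute(
        "INSERT OR IGNORE INTO audio_clips (id, title, file_path, created_at) "
        "VALUES (?, ?, ?, ?)",
        (session_id, title, file_path, created_at),
    )


create_clip_sketch(
    conn, "s1", title="First", file_path="/tmp/a.wav",
    created_at="2024-01-01T00:00:00Z",
)
# Another stage writes a transcript in between.
conn.execute("UPDATE audio_clips SET transcription = ? WHERE id = ?", ("[]", "s1"))
# A later create call with different values is ignored.
create_clip_sketch(
    conn, "s1", title="Second", file_path="/tmp/b.wav",
    created_at="2024-01-02T00:00:00Z",
)

first = conn.execute(
    "SELECT title, file_path, transcription FROM audio_clips WHERE id = ?", ("s1",)
).fetchone()
```

The second call leaves the original title and path in place and does not reset the transcript written between the two calls.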

Parameters:
  • session_id (str)

  • title (str)

  • file_path (str)

  • created_at (str)

Return type:

None

store_transcription(session_id, words)[source]

Persist a JSON-encoded WordSpan list on the clip.

Column-scoped: touches only transcription. Create-if-absent (upsert), so a row is materialised even if create_clip has not run yet, and a concurrent finalize_topics cannot lose it.

Parameters:
  • session_id (str)

  • words

Return type:

None

finalize_topics(session_id, transcript_end_s, *, events, provider, llm_model, whisper_model)[source]

Flatten topic events, compute durations, and write them to the clip.

Within each L1/L2 level, duration_s is derived from the gap to the next sibling (or to transcript_end_s for the last node). The provider / llm_model / whisper_model columns record which backends produced the data. Column-scoped upsert (create-if-absent): touches only those columns, so the transcription written by an earlier stage/process survives.
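The duration rule can be sketched as follows. The event shape is an assumption made for illustration: each event is taken to be a dict with a "level" key ("L1" or "L2") and a "start_s" key, and "next sibling" is read as the next event at the same level. The function assign_durations is hypothetical and the real flattening logic may differ.

```python
from collections import defaultdict


def assign_durations(events, transcript_end_s):
    """Return copies of the events with duration_s filled in per level."""
    by_level = defaultdict(list)
    for event in events:
        by_level[event["level"]].append(event)

    result = []
    for level_events in by_level.values():
        ordered = sorted(level_events, key=lambda e: e["start_s"])
        # Pair each event with its successor; the last one pairs with None.
        for current, following in zip(ordered, ordered[1:] + [None]):
            end_s = following["start_s"] if following else transcript_end_s
            result.append({**current, "duration_s": end_s - current["start_s"]})
    # Present the flattened list in timeline order, L1 before L2 on ties.
    return sorted(result, key=lambda e: (e["start_s"], e["level"]))


events = [
    {"level": "L1", "start_s": 0},
    {"level": "L2", "start_s": 0},
    {"level": "L2", "start_s": 30},
    {"level": "L1", "start_s": 60},
    {"level": "L2", "start_s": 60},
]
timed = assign_durations(events, transcript_end_s=100)
durations = [e["duration_s"] for e in timed]
```

With a transcript ending at 100 seconds, the last event at each level runs to the end of the transcript, while every earlier event ends where the next event at its level begins.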

Parameters:
  • session_id (str)

  • transcript_end_s

  • events

  • provider

  • llm_model

  • whisper_model

Return type:

None

get_clip(session_id)[source]

Return the clip as a plain dict, or None if missing.

Parameters:

session_id (str)

Return type:

dict[str, Any] | None

list_clips()[source]

Return every clip row as a plain dict (empty list on error).

Return type:

list[dict[str, Any]]