`autorag.db`

SQLite-backed database for audio clip transcription and topic storage.

class autorag.db.AudioClip(**data)

Bases: BaseModel

One row of the audio_clips SQLite table.

Fields transcription and topics are JSON-encoded strings; use Database.get_clip() to fetch and decode them. The whisper_model / provider / llm_model columns are populated by Database.finalize_topics() to record which backends produced the stored data.
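As a minimal sketch of the decoding step (not the library's implementation), the following shows how the JSON-encoded transcription and topics columns of a raw row could be turned back into Python objects. The helper name decode_clip_row and the contents of the sample row are hypothetical; the real shape of the decoded data is defined by autorag.

```python
import json
from typing import Any


def decode_clip_row(row: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of a raw audio_clips row with its JSON columns decoded.

    Hypothetical helper for illustration; NULL columns stay None.
    """
    decoded = dict(row)
    for column in ("transcription", "topics"):
        raw = decoded.get(column)
        decoded[column] = json.loads(raw) if raw is not None else None
    return decoded


# Sample row; the word-span keys are assumptions, not the real schema.
raw_row = {
    "id": "session-1",
    "title": "Standup",
    "file_path": "clips/standup.wav",
    "created_at": "2024-01-01T00:00:00Z",
    "transcription": json.dumps([{"word": "hello", "start_s": 0.0}]),
    "topics": None,
}
clip = decode_clip_row(raw_row)
```

The input row is left untouched, so the raw strings remain available if they need to be written back unchanged.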

Parameters:

data (Any)
id (str)
title (str)
file_path (str)
created_at (str)
transcription (str | None)
topics (str | None)
whisper_model (str | None)
provider (str | None)
llm_model (str | None)

id: str

title: str

file_path: str

created_at: str

transcription: str | None

topics: str | None

whisper_model: str | None

provider: str | None

llm_model: str | None

model_config: ClassVar[ConfigDict] = {}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

class autorag.db.Database(db_path)

Bases: object

SQLite façade for AudioRAG clip state, keyed by session_id.

Every audio_clips read and write goes through the raw sqlite_utils table handle as a column-scoped upsert or a raw read. Rows are deliberately not round-tripped through pydantic_sqlite’s model registry. That registry is per-instance and in-process, so a freshly constructed Database (in a different worker process, or simply a second AutoRAG persist call) would not see another instance’s rows. The old read-modify-write path (model_from_table followed by a full-object add) therefore upserted a blank model over the on-disk row, silently nulling a transcript that a different process had written. Column-scoped upserts touch only the columns a method owns, so that clobber is impossible by construction, regardless of instance or process.

pydantic_sqlite is still used for the separate jobs table (see autorag.services.jobs.JobStore), where a whole-record write is the intended semantics; self.db is kept for that reuse.

Creates the SQLite file (and any missing parent directories) and the audio_clips schema on construction.

Parameters:

db_path (Path)
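The difference between a whole-record write and a column-scoped upsert can be illustrated with the standard-library sqlite3 module. This is a sketch under stated assumptions, not the class's actual code: the reduced schema, the helper names full_row_write, column_scoped_upsert and read_transcription, and the use of plain SQL instead of sqlite_utils are all choices made for the example. The ON CONFLICT clause requires SQLite 3.24 or newer.

```python
import sqlite3

# Reduced, assumed schema; the real table has more columns.
SCHEMA = """
CREATE TABLE audio_clips (
    id TEXT PRIMARY KEY,
    title TEXT,
    file_path TEXT,
    created_at TEXT,
    transcription TEXT,
    topics TEXT
)
"""


def full_row_write(conn, row):
    """Whole-record write: every column is replaced, including None values."""
    conn.execute(
        "INSERT OR REPLACE INTO audio_clips "
        "(id, title, file_path, created_at, transcription, topics) "
        "VALUES (:id, :title, :file_path, :created_at, :transcription, :topics)",
        row,
    )


def column_scoped_upsert(conn, clip_id, column, value):
    """Create the row if absent, otherwise update only the named column."""
    if column not in {"transcription", "topics"}:
        # Column names are interpolated into SQL, so only known names pass.
        raise ValueError(f"unexpected column: {column}")
    conn.execute(
        f"INSERT INTO audio_clips (id, {column}) VALUES (?, ?) "
        f"ON CONFLICT(id) DO UPDATE SET {column} = excluded.{column}",
        (clip_id, value),
    )


def read_transcription(conn, clip_id):
    """Return the raw transcription string, or None if the row is missing."""
    row = conn.execute(
        "SELECT transcription FROM audio_clips WHERE id = ?", (clip_id,)
    ).fetchone()
    return row[0] if row else None


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)

# Writer A stores a transcript.
column_scoped_upsert(conn, "s1", "transcription", '[{"word": "hi"}]')

# Writer B touches only topics, so the transcript survives.
column_scoped_upsert(conn, "s1", "topics", "[]")
after_scoped = read_transcription(conn, "s1")

# A whole-record write built from a blank model nulls the transcript.
blank = {"id": "s1", "title": "t", "file_path": "p", "created_at": "c",
         "transcription": None, "topics": "[]"}
full_row_write(conn, blank)
after_full = read_transcription(conn, "s1")
```

After the two scoped upserts the transcript is still present; after the whole-record write it is NULL, which is the failure mode the column-scoped design avoids.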

add_analytics_event(session_id, *, category, message, metadata, marked_at_utc)

Build the analytics-event dict written into a clip’s topics JSON.

Does not touch the database itself; callers accumulate the returned dicts and pass them to finalize_topics().

Parameters:

session_id (str)
category (str)
message (str)
metadata (dict[str, Any])
marked_at_utc (Any)

Return type:

dict[str, Any]

create_clip(session_id, *, title, file_path, created_at)

Insert an AudioClip row if one doesn’t already exist.

First-writer-wins (INSERT OR IGNORE): the call is a no-op when the session_id is already present. A later create_clip with a different title or path never overwrites the original and, crucially, never resets the transcription or topics that a different process wrote in between. Only the four identity columns are written; the rest default to NULL.

Parameters:

session_id (str)
title (str)
file_path (str)
created_at (str)

Return type:

None
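The first-writer-wins behaviour described for create_clip can be sketched with plain sqlite3. The function name create_clip_sketch, the reduced schema, and the assumption that session_id is stored in the id column are illustrative only; the real method goes through sqlite_utils.

```python
import sqlite3


def create_clip_sketch(conn, session_id, *, title, file_path, created_at):
    """First-writer-wins insert: does nothing if the id already exists."""
    conn.execute(
        "INSERT OR IGNORE INTO audio_clips (id, title, file_path, created_at) "
        "VALUES (?, ?, ?, ?)",
        (session_id, title, file_path, created_at),
    )


conn = sqlite3.connect(":memory:")
# Reduced, assumed schema for the example.
conn.execute(
    "CREATE TABLE audio_clips ("
    "id TEXT PRIMARY KEY, title TEXT, file_path TEXT, created_at TEXT, "
    "transcription TEXT, topics TEXT)"
)

create_clip_sketch(conn, "s1", title="First", file_path="a.wav",
                   created_at="2024-01-01")

# Another stage writes a transcript in between.
conn.execute(
    "UPDATE audio_clips SET transcription = ? WHERE id = ?",
    ('[{"word": "hi"}]', "s1"),
)

# A later create call with different values is ignored entirely.
create_clip_sketch(conn, "s1", title="Second", file_path="b.wav",
                   created_at="2024-02-02")

row = conn.execute(
    "SELECT title, file_path, transcription FROM audio_clips WHERE id = ?",
    ("s1",),
).fetchone()
```

The row keeps the original title and path, and the transcript written between the two calls is still present.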

store_transcription(session_id, words)

Persist a JSON-encoded WordSpan list on the clip.

Column-scoped: touches only the transcription column. The write is a create-if-absent upsert, so a row is materialised even if create_clip has not run yet, and a concurrent finalize_topics cannot overwrite the stored transcription.

Parameters:

session_id (str)
words (list[dict[str, Any]])

Return type:

None

finalize_topics(session_id, transcript_end_s, *, events, provider, llm_model, whisper_model)

Flatten topic events, compute durations, and write them to the clip.

Within each L1/L2 level, duration_s is derived from the gap to the next sibling (or to transcript_end_s for the last node). The provider / llm_model / whisper_model columns record which backends produced the data. Column-scoped upsert (create-if-absent): touches only those columns, so the transcription written by an earlier stage/process survives.

Parameters:

session_id (str)
transcript_end_s (float)
events (list[dict[str, Any]])
provider (str)
llm_model (str)
whisper_model (str)

Return type:

None
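The duration rule described above (gap to the next sibling, or to transcript_end_s for the last node) can be sketched as follows. The event keys level, title and start_s and the function name add_durations are assumptions made for illustration, and the sketch treats all events at the same level as siblings, which may be simpler than what finalize_topics actually does.

```python
from typing import Any


def add_durations(
    events: list[dict[str, Any]], transcript_end_s: float
) -> list[dict[str, Any]]:
    """Return copies of the events with a duration_s key added.

    Within each level, duration_s is the gap to the next event at that
    level; the last event at a level runs until transcript_end_s.
    """
    timed = [dict(event) for event in events]

    # Group the copies by level (the "level" key is an assumed field name).
    by_level: dict[Any, list[dict[str, Any]]] = {}
    for event in timed:
        by_level.setdefault(event["level"], []).append(event)

    for siblings in by_level.values():
        siblings.sort(key=lambda event: event["start_s"])
        for current, following in zip(siblings, siblings[1:]):
            current["duration_s"] = following["start_s"] - current["start_s"]
        siblings[-1]["duration_s"] = transcript_end_s - siblings[-1]["start_s"]

    return timed


events = [
    {"level": "L1", "title": "Intro", "start_s": 0.0},
    {"level": "L2", "title": "Welcome", "start_s": 0.0},
    {"level": "L2", "title": "Agenda", "start_s": 20.0},
    {"level": "L1", "title": "Main", "start_s": 60.0},
    {"level": "L2", "title": "Details", "start_s": 60.0},
]
timed = add_durations(events, transcript_end_s=100.0)
```

The returned list keeps the input order; only the per-level groupings are sorted, and the original event dicts are not modified.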

get_clip(session_id)

Return the clip as a plain dict, or None if missing.

Parameters:

session_id (str)

Return type:

dict[str, Any] | None

list_clips()

Return every clip row as a plain dict (empty list on error).

Return type:

list[dict[str, Any]]
