Types, schemas, config, formatting

Utility modules that need none of the optional extras and are safe to import from a base install:

  • autorag.types — TypedDict shapes for transcripts and topic trees (WordSpan, TopicDict, TopicTree, TranscriptionResult).

  • autorag.schemas — Pydantic request/response models used by the HTTP API.

  • autorag.config — Settings via pydantic-settings.

  • autorag.blocks — Time-bucketed, speaker-grouped transcript formatter (stdlib only).

Types (autorag.types)

Public typed-dict shapes for the audio→topics pipeline.

Kept dependency-free so SDK consumers can reference these types without forcing the optional [audio] / [diarize] extras (langchain, whisper, pyannote) to be importable.

class autorag.types.WordSpan

Bases: TypedDict

One word emitted by the transcription pipeline.

Keys: w (word), s/e (start/end seconds), segment_id (Whisper segment id), and speaker (string id assigned by diarization; "0" when diarization is disabled).

w: str
s: float
e: float
segment_id: str
speaker: str

class autorag.types.TopicDict

Bases: TypedDict

One node in the L0/L1/L2 topic tree.

title: str
summary: str
s: float
e: float
children: list[TopicDict]

class autorag.types.TopicTree

Bases: TypedDict

Container returned by autorag.core.AutoRAG.generate_topics().

topics: list[TopicDict]

class autorag.types.TranscriptionResult

Bases: TypedDict

Combined transcript + topics, the output of build_agent.

transcription: list[WordSpan]
topics: TopicTree
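
As a minimal sketch of how these shapes nest, the block below redeclares the four TypedDicts locally so that it runs without autorag installed; real code would import them from autorag.types instead. The sample values and the walk() helper are hypothetical and exist only for illustration.

```python
from typing import TypedDict


# Local stand-ins mirroring the documented shapes; real code would use
# `from autorag.types import WordSpan, TopicDict, ...` instead.
class WordSpan(TypedDict):
    w: str
    s: float
    e: float
    segment_id: str
    speaker: str


class TopicDict(TypedDict):
    title: str
    summary: str
    s: float
    e: float
    children: list["TopicDict"]  # recursive: each node holds child nodes


class TopicTree(TypedDict):
    topics: list[TopicDict]


class TranscriptionResult(TypedDict):
    transcription: list[WordSpan]
    topics: TopicTree


def walk(nodes: list[TopicDict], depth: int = 0) -> list[tuple[int, str]]:
    """Hypothetical helper: flatten a topic tree into (depth, title) pairs."""
    out: list[tuple[int, str]] = []
    for node in nodes:
        out.append((depth, node["title"]))
        out.extend(walk(node["children"], depth + 1))
    return out


# Made-up sample data in the documented shape.
result: TranscriptionResult = {
    "transcription": [
        {"w": "hello", "s": 0.0, "e": 0.4, "segment_id": "0", "speaker": "0"},
    ],
    "topics": {
        "topics": [
            {
                "title": "Intro",
                "summary": "Opening remarks.",
                "s": 0.0,
                "e": 30.0,
                "children": [
                    {
                        "title": "Greeting",
                        "summary": "Speakers say hello.",
                        "s": 0.0,
                        "e": 5.0,
                        "children": [],
                    }
                ],
            }
        ]
    },
}

outline = walk(result["topics"]["topics"])
```

Because TypedDicts are plain dicts at runtime, a value like this can be serialized with json.dumps without any conversion step.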

Schemas (autorag.schemas)

Pydantic request/response and entity models for the RAG pipeline.

These models double as the on-the-wire schema for the HTTP API (autorag.api) and as the in-process value types passed between the embedder, store, retriever, and generator.

class autorag.schemas.Document(**data)

Bases: BaseModel

One ingested source document, before chunking.

Parameters:
id: str
source: str
text: str
metadata: dict[str, Any]
model_config: ClassVar[ConfigDict] = {}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

class autorag.schemas.Chunk(**data)

Bases: BaseModel

A retrieval-sized piece of a Document.

embedding is filled in by Embedder and remains None until the chunk has been embedded.

Parameters:
id: str
doc_id: str
text: str
metadata: dict[str, Any]
embedding: list[float] | None
model_config: ClassVar[ConfigDict] = {}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

class autorag.schemas.Retrieved(**data)

Bases: BaseModel

A chunk plus its similarity score from a vector-store search.

Parameters:
chunk: Chunk
score: float
model_config: ClassVar[ConfigDict] = {}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

class autorag.schemas.QueryRequest(**data)

Bases: BaseModel

Request body for POST /query.

Parameters:
question: str
top_k: int | None
model_config: ClassVar[ConfigDict] = {}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

class autorag.schemas.QueryResponse(**data)

Bases: BaseModel

Response body for POST /query: generated answer plus its sources.

Parameters:
answer: str
sources: list[Retrieved]
model_config: ClassVar[ConfigDict] = {}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
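
As a rough illustration of the wire format that QueryRequest and QueryResponse imply, the sketch below builds a POST /query body and reads a response using only the standard library. The field names come from the models above; the sample values and the raw response string are made up, no server is contacted, and whether the real API includes embedding in its responses is an assumption.

```python
import json

# Request body for POST /query, mirroring QueryRequest (question, top_k).
request_body = json.dumps({"question": "What is chunk overlap?", "top_k": 3})

# A made-up response shaped like QueryResponse: an answer plus a list of
# Retrieved items, each wrapping a Chunk and its similarity score.
raw_response = json.dumps(
    {
        "answer": "Sample answer text.",
        "sources": [
            {
                "chunk": {
                    "id": "c1",
                    "doc_id": "d1",
                    "text": "Sample chunk text.",
                    "metadata": {},
                    "embedding": None,  # assumption: may be omitted in practice
                },
                "score": 0.87,
            }
        ],
    }
)

response = json.loads(raw_response)
# Pair each source chunk id with its similarity score.
ranked = [(src["chunk"]["id"], src["score"]) for src in response["sources"]]
```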

class autorag.schemas.IngestRequest(**data)

Bases: BaseModel

Request body for POST /ingest: filesystem paths to ingest.

Parameters:
paths: list[str | Path]
model_config: ClassVar[ConfigDict] = {}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

class autorag.schemas.IngestResponse(**data)

Bases: BaseModel

Response body for POST /ingest: counts of documents and chunks.

Parameters:
ingested: int
chunks: int
model_config: ClassVar[ConfigDict] = {}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

Config (autorag.config)

Process-wide settings loaded from environment variables and .env.

Every field is prefixed with AUTORAG_ in the environment — e.g. AUTORAG_TOP_K=8 overrides Settings.top_k. Unrecognized variables are ignored so callers can share an environment with other tools.
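
The prefix rule can be illustrated with a standard-library sketch. This is not how pydantic-settings is implemented; it only mimics the documented behaviour (AUTORAG_ prefix, unrecognized variables ignored) for a hypothetical subset of fields. The names read_overrides, PREFIX, and FIELDS are invented for the example.

```python
PREFIX = "AUTORAG_"
# Hypothetical subset of Settings fields and the type each is coerced to.
FIELDS = {"top_k": int, "chunk_size": int, "model": str}


def read_overrides(environ: dict[str, str]) -> dict[str, object]:
    """Collect AUTORAG_-prefixed overrides, ignoring everything else."""
    overrides: dict[str, object] = {}
    for name, raw in environ.items():
        if not name.upper().startswith(PREFIX):
            continue  # belongs to some other tool sharing the environment
        field = name[len(PREFIX):].lower()
        if field in FIELDS:  # unrecognized AUTORAG_* names are skipped too
            overrides[field] = FIELDS[field](raw)
    return overrides


env = {"AUTORAG_TOP_K": "8", "AUTORAG_UNKNOWN": "x", "PATH": "/usr/bin"}
overrides = read_overrides(env)
# overrides == {"top_k": 8}
```

In real use the same effect comes from setting the variable before the process starts and calling get_settings(); the sketch only shows which names are picked up and which are skipped.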

class autorag.config.Settings(_case_sensitive=None, _nested_model_default_partial_update=None, _env_prefix=None, _env_prefix_target=None, _env_file=PosixPath('.'), _env_file_encoding=None, _env_ignore_empty=None, _env_nested_delimiter=None, _env_nested_max_split=None, _env_parse_none_str=None, _env_parse_enums=None, _cli_prog_name=None, _cli_parse_args=None, _cli_settings_source=None, _cli_parse_none_str=None, _cli_hide_none_type=None, _cli_avoid_json=None, _cli_enforce_required=None, _cli_use_class_docs_for_groups=None, _cli_exit_on_error=None, _cli_prefix=None, _cli_flag_prefix_char=None, _cli_implicit_flags=None, _cli_ignore_unknown_args=None, _cli_kebab_case=None, _cli_shortcuts=None, _secrets_dir=None, _build_sources=None, **values)

Bases: BaseSettings

Default knobs for the RAG pipeline.

Parameters:
model_config: ClassVar[SettingsConfigDict] = {'arbitrary_types_allowed': True, 'case_sensitive': False, 'cli_avoid_json': False, 'cli_enforce_required': False, 'cli_exit_on_error': True, 'cli_flag_prefix_char': '-', 'cli_hide_none_type': False, 'cli_ignore_unknown_args': False, 'cli_implicit_flags': False, 'cli_kebab_case': False, 'cli_parse_args': None, 'cli_parse_none_str': None, 'cli_prefix': '', 'cli_prog_name': None, 'cli_shortcuts': None, 'cli_use_class_docs_for_groups': False, 'enable_decoding': True, 'env_file': '.env', 'env_file_encoding': None, 'env_ignore_empty': False, 'env_nested_delimiter': None, 'env_nested_max_split': None, 'env_parse_enums': None, 'env_parse_none_str': None, 'env_prefix': 'AUTORAG_', 'env_prefix_target': 'variable', 'extra': 'ignore', 'json_file': None, 'json_file_encoding': None, 'nested_model_default_partial_update': False, 'protected_namespaces': ('model_validate', 'model_dump', 'settings_customise_sources'), 'secrets_dir': None, 'toml_file': None, 'validate_default': True, 'yaml_config_section': None, 'yaml_file': None, 'yaml_file_encoding': None}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

chunk_size: int

Target character count for each chunk.

chunk_overlap: int

Character overlap between adjacent chunks.

top_k: int

Default number of chunks to retrieve per query.

model: str

Default LLM model name for generation.

db_path: Path

Location of the SQLite clip database.

broker_url: str

RabbitMQ URL for the async pipeline (autorag.services).

Only consulted by the [broker] async path (workers + /jobs/* endpoints); every synchronous endpoint and CLI command ignores it.

otel_enabled: bool

Master switch for OpenTelemetry traces + metrics. When False (the default), autorag.otel.initialize_otel() is a no-op and no opentelemetry.* modules are imported.

otel_service_name: str

Base service.name resource attribute. Each long-running process (autorag-api / autorag-gpu-worker / autorag-io-worker / autorag-cli) overrides it via the explicit initialize_otel argument, so this only matters if a caller drops that argument.

otel_exporter_endpoint: str

OTLP/gRPC endpoint for the span + metric exporters. Defaults to the host-side collector port; the compose workers override it to http://otel-collector:4317 over autorag-net.

otel_metric_export_interval_ms: int

Periodic-metric-reader export interval. Matches the Prometheus 15-second scrape, so a metric is at most one interval stale.

otel_environment: str

Value of the deployment.environment resource attribute.

otel_resource_attributes: str

Additional resource attributes as key=val,key2=val2. Parsed at initialisation time and merged onto the built-in resource. Empty by default — callers usually rely on the explicit fields above.
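
A minimal sketch of parsing the key=val,key2=val2 format follows. The function name parse_resource_attributes is hypothetical, and two behaviours are assumptions rather than documented facts: malformed pairs raise ValueError, and extra attributes win over built-in ones when keys collide. The attribute values are sample data.

```python
def parse_resource_attributes(raw: str) -> dict[str, str]:
    """Parse 'key=val,key2=val2' into a dict; empty input yields {}."""
    attrs: dict[str, str] = {}
    for pair in raw.split(","):
        pair = pair.strip()
        if not pair:
            continue  # tolerate empty input and trailing commas
        key, sep, value = pair.partition("=")
        if not sep or not key.strip():
            # Assumption: reject malformed pairs instead of skipping them.
            raise ValueError(f"expected key=value, got {pair!r}")
        attrs[key.strip()] = value.strip()
    return attrs


# Sample built-in resource; the values are illustrative only.
builtin = {"service.name": "autorag-api", "deployment.environment": "dev"}
extra = parse_resource_attributes("team=search,region=eu-west-1")
# Assumption: on a key collision the extra attribute replaces the built-in one.
merged = {**builtin, **extra}
```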

autorag.config.get_settings()

Build a Settings instance from the current environment.

Return type:

Settings

Block formatter (autorag.blocks)

Pure-stdlib transcript-formatting helpers.

Kept dependency-free so a base install (no [audio] / [rag]) can call format_blocks() on any autorag.types.WordSpan list it already has — e.g. one loaded straight from the SQLite cache or built externally.

autorag.blocks.format_blocks(transcription, seconds)

Render transcription as N-second time blocks with per-turn speaker lines.

Buckets each WordSpan into [floor(s/N)*N, floor(s/N)*N + N). Within each non-empty bucket, groups consecutive same-speaker spans into turns via group_by_speaker() and emits one line per turn: MM:SS-MM:SS Speaker K: <words> where K is int(speaker) + 1 (1-indexed display; non-numeric labels pass through verbatim). Skips empty buckets; separates non-empty buckets by one blank line. No trailing newline.

A turn whose words span multiple buckets produces one line per bucket — each line covers only that bucket’s portion of the turn.

Raises:

ValueError – if seconds <= 0.

Parameters:

transcription (list[WordSpan])

seconds

Return type:

str
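
Two pieces of the description above, the bucket arithmetic and the speaker display label, can be sketched in isolation. The helper names bucket_bounds and display_speaker are hypothetical and are not part of autorag.blocks; the sketch deliberately leaves out line assembly because the exact timestamps printed per line are not specified here.

```python
import math


def bucket_bounds(s: float, seconds: float) -> tuple[float, float]:
    """Return the [start, end) bucket that a word starting at `s` falls into."""
    if seconds <= 0:
        raise ValueError("seconds must be > 0")
    start = math.floor(s / seconds) * seconds
    return start, start + seconds


def display_speaker(speaker: str) -> str:
    """1-indexed label for numeric ids; non-numeric labels pass through."""
    try:
        return str(int(speaker) + 1)
    except ValueError:
        return speaker


# A word starting at 12.4 s with 5-second blocks lands in [10, 15).
bounds = bucket_bounds(12.4, 5)
labels = [display_speaker("0"), display_speaker("SPEAKER_00")]
# labels == ["1", "SPEAKER_00"]
```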

autorag.blocks.group_by_speaker(spans)

Walk spans in order; coalesce consecutive same-speaker runs.

Words missing a speaker key are treated as speaker “0”, which keeps single-speaker behavior identical to pre-diarization output.

Parameters:

spans (list[WordSpan])

Return type:

list[tuple[str, list[WordSpan]]]
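
The coalescing behaviour can be approximated with itertools.groupby. This is a sketch under the documented contract, not the library's source; the name group_by_speaker_sketch and the sample spans are made up, and plain dicts stand in for WordSpan.

```python
from itertools import groupby


def group_by_speaker_sketch(spans: list[dict]) -> list[tuple[str, list[dict]]]:
    """Coalesce consecutive spans that share a speaker, preserving order."""
    return [
        (speaker, list(run))
        # A missing "speaker" key is treated as speaker "0".
        for speaker, run in groupby(spans, key=lambda sp: sp.get("speaker", "0"))
    ]


spans = [
    {"w": "hi", "s": 0.0, "e": 0.3, "speaker": "0"},
    {"w": "there", "s": 0.3, "e": 0.6},  # no speaker key: treated as "0"
    {"w": "hello", "s": 0.7, "e": 1.0, "speaker": "1"},
    {"w": "again", "s": 1.1, "e": 1.4, "speaker": "0"},
]
turns = [(spk, [sp["w"] for sp in run]) for spk, run in group_by_speaker_sketch(spans)]
# turns == [("0", ["hi", "there"]), ("1", ["hello"]), ("0", ["again"])]
```

Note that speaker "0" appears twice in the result: only consecutive runs are merged, so a speaker who returns after an interruption starts a new turn.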

autorag.blocks.mmss(t)

Format t seconds as MM:SS (minutes may exceed 99 for long audio).

Floors to whole seconds and clamps negatives to 00:00. Inverse of autorag.agent._parse_ts() at second resolution.

Parameters:

t (float)

Return type:

str
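
A minimal sketch of the documented formatting rules (floor to whole seconds, clamp negatives, let minutes grow past two digits). The name mmss_sketch is hypothetical and the body is an illustration, not the library's source.

```python
import math


def mmss_sketch(t: float) -> str:
    """Format seconds as MM:SS, flooring and clamping negatives to zero."""
    total = max(0, math.floor(t))
    minutes, secs = divmod(total, 60)
    # Zero-pad to two digits; minutes simply widen beyond 99.
    return f"{minutes:02d}:{secs:02d}"


examples = [mmss_sketch(75.9), mmss_sketch(-3.2), mmss_sketch(6000)]
# examples == ["01:15", "00:00", "100:00"]
```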