Types, schemas, config, formatting¶

Dependency-free utility modules safe to import from a base install:

autorag.types — TypedDict shapes for transcripts and topic trees (WordSpan, TopicDict, TopicTree, TranscriptionResult).
autorag.schemas — Pydantic request/response models used by the HTTP API.
autorag.config — Settings via pydantic-settings.
autorag.blocks — Time-bucketed, speaker-grouped transcript formatter (stdlib only).

Types (`autorag.types`)¶

Public typed-dict shapes for the audio→topics pipeline.

Kept dependency-free so SDK consumers can reference these types without forcing the optional [audio] / [diarize] extras (langchain, whisper, pyannote) to be importable.

class autorag.types.WordSpan[source]¶

Bases: TypedDict

One word emitted by the transcription pipeline.

Keys: w (word), s/e (start/end seconds), segment_id (Whisper segment id), and speaker (string id assigned by diarization; "0" when diarization is disabled).

w: str¶

s: float¶

e: float¶

segment_id: str¶

speaker: str¶

class autorag.types.TopicDict[source]¶

Bases: TypedDict

One node in the L0/L1/L2 topic tree.

title: str¶

summary: str¶

s: float¶

e: float¶

children: list[TopicDict]¶

class autorag.types.TopicTree[source]¶

Bases: TypedDict

Container returned by autorag.core.AutoRAG.generate_topics().

topics: list[TopicDict]¶

class autorag.types.TranscriptionResult[source]¶

Bases: TypedDict

Combined transcript + topics, the output of build_agent.

transcription: list[WordSpan]¶

topics: TopicTree¶
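As a rough illustration of how these shapes nest, the sketch below re-declares the four typed dicts locally so that it runs without the package installed. In real code they would be imported from autorag.types. The sample values and the iter_titles helper are hypothetical and not part of the library.

```python
from __future__ import annotations

from typing import Iterator, TypedDict


# Local stand-ins that mirror the documented keys; real code would use
# `from autorag.types import WordSpan, TopicDict, TopicTree, TranscriptionResult`.
class WordSpan(TypedDict):
    w: str
    s: float
    e: float
    segment_id: str
    speaker: str


class TopicDict(TypedDict):
    title: str
    summary: str
    s: float
    e: float
    children: list[TopicDict]


class TopicTree(TypedDict):
    topics: list[TopicDict]


class TranscriptionResult(TypedDict):
    transcription: list[WordSpan]
    topics: TopicTree


def iter_titles(nodes: list[TopicDict], depth: int = 0) -> Iterator[tuple[int, str]]:
    """Hypothetical helper: depth-first walk yielding (depth, title) pairs."""
    for node in nodes:
        yield depth, node["title"]
        yield from iter_titles(node["children"], depth + 1)


# Made-up sample data: two words and a two-level topic tree.
words: list[WordSpan] = [
    {"w": "hello", "s": 0.0, "e": 0.4, "segment_id": "0", "speaker": "0"},
    {"w": "world", "s": 0.5, "e": 0.9, "segment_id": "0", "speaker": "0"},
]
leaf: TopicDict = {"title": "Greeting", "summary": "A short hello.",
                   "s": 0.0, "e": 0.9, "children": []}
root: TopicDict = {"title": "Intro", "summary": "Opening remarks.",
                   "s": 0.0, "e": 0.9, "children": [leaf]}
result: TranscriptionResult = {"transcription": words, "topics": {"topics": [root]}}

titles = list(iter_titles(result["topics"]["topics"]))
```

Because TypedDict instances are plain dicts at runtime, values like these can be serialized with json.dumps without any conversion step.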

Schemas (`autorag.schemas`)¶

Pydantic request/response and entity models for the RAG pipeline.

These models double as the on-the-wire schema for the HTTP API (autorag.api) and as the in-process value types passed between the embedder, store, retriever, and generator.

class autorag.schemas.Document(**data)[source]¶

Bases: BaseModel

One ingested source document, before chunking.

Parameters:

data (Any)
id (str)
source (str)
text (str)
metadata (dict[str, Any])

id: str¶

source: str¶

text: str¶

metadata: dict[str, Any]¶

model_config: ClassVar[ConfigDict] = {}¶: Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

class autorag.schemas.Chunk(**data)[source]¶

Bases: BaseModel

A retrieval-sized piece of a Document.

embedding is filled in by Embedder and remains None until the chunk has been embedded.

Parameters:

data (Any)
id (str)
doc_id (str)
text (str)
metadata (dict[str, Any])
embedding (list[float] | None)

id: str¶

doc_id: str¶

text: str¶

metadata: dict[str, Any]¶

embedding: list[float] | None¶

model_config: ClassVar[ConfigDict] = {}¶: Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

class autorag.schemas.Retrieved(**data)[source]¶

Bases: BaseModel

A chunk plus its similarity score from a vector-store search.

Parameters:

data (Any)
chunk (Chunk)
score (float)

chunk: Chunk¶

score: float¶

model_config: ClassVar[ConfigDict] = {}¶: Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

class autorag.schemas.QueryRequest(**data)[source]¶

Bases: BaseModel

Request body for POST /query.

Parameters:

data (Any)
question (str)
top_k (int | None)

question: str¶

top_k: int | None¶

model_config: ClassVar[ConfigDict] = {}¶: Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

class autorag.schemas.QueryResponse(**data)[source]¶

Bases: BaseModel

Response body for POST /query: generated answer plus its sources.

Parameters:

data (Any)
answer (str)
sources (list[Retrieved])

answer: str¶

sources: list[Retrieved]¶

model_config: ClassVar[ConfigDict] = {}¶: Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

class autorag.schemas.IngestRequest(**data)[source]¶

Bases: BaseModel

Request body for POST /ingest: filesystem paths to ingest.

Parameters:

data (Any)
paths (list[str | Path])

paths: list[str | Path]¶

model_config: ClassVar[ConfigDict] = {}¶: Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

class autorag.schemas.IngestResponse(**data)[source]¶

Bases: BaseModel

Response body for POST /ingest: counts of documents and chunks.

Parameters:

data (Any)
ingested (int)
chunks (int)

ingested: int¶

chunks: int¶

model_config: ClassVar[ConfigDict] = {}¶: Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
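Since these models are also the wire schema, a client can talk to the API with plain JSON. The following stdlib-only sketch builds a POST /query request shaped like QueryRequest and reads chunk texts out of a QueryResponse-shaped body. The base URL, both helper names, and the sample response are assumptions made for illustration; nothing is sent over the network.

```python
from __future__ import annotations

import json
from urllib.request import Request

# Assumption: where the API happens to be listening; not stated in the docs.
BASE_URL = "http://localhost:8000"


def build_query_request(question: str, top_k: int | None = None) -> Request:
    """Hypothetical helper: build (but do not send) a POST /query request."""
    # Field names follow QueryRequest: question (str), top_k (int | None).
    body = {"question": question, "top_k": top_k}
    return Request(
        f"{BASE_URL}/query",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def source_texts(response_body: str) -> list[str]:
    """Hypothetical helper: pull chunk texts out of a QueryResponse-shaped body."""
    payload = json.loads(response_body)
    # sources is list[Retrieved]; each Retrieved carries a chunk and a score.
    return [hit["chunk"]["text"] for hit in payload["sources"]]


request = build_query_request("What is chunk overlap?", top_k=3)

# Made-up response body that follows the QueryResponse / Retrieved / Chunk fields.
sample_response = json.dumps({
    "answer": "It is the number of characters shared by adjacent chunks.",
    "sources": [{
        "chunk": {"id": "c1", "doc_id": "d1", "text": "Chunk overlap is ...",
                  "metadata": {}, "embedding": None},
        "score": 0.87,
    }],
})
texts = source_texts(sample_response)
```

Passing the request to urllib.request.urlopen would perform the call; whether top_k may be omitted entirely depends on the model's defaults, which are not shown here, so the sketch always sends it explicitly.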

Config (`autorag.config`)¶

Process-wide settings loaded from environment variables and .env.

Every field is prefixed with AUTORAG_ in the environment — e.g. AUTORAG_TOP_K=8 overrides Settings.top_k. Unrecognized variables are ignored so callers can share an environment with other tools.
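The prefix rule can be pictured with a small stdlib-only emulation. This is not how pydantic-settings is implemented, and the real loader also reads .env and validates every field; the sketch only shows the documented behaviour that AUTORAG_-prefixed names map to fields, matching is case-insensitive, and unrecognized variables are skipped. The read_overrides helper and the reduced field table are hypothetical.

```python
from __future__ import annotations

from typing import Callable, Mapping

PREFIX = "AUTORAG_"

# Assumption: a small subset of the documented Settings fields and their types.
KNOWN_FIELDS: dict[str, Callable[[str], object]] = {
    "chunk_size": int,
    "chunk_overlap": int,
    "top_k": int,
    "model": str,
}


def read_overrides(environ: Mapping[str, str]) -> dict[str, object]:
    """Hypothetical helper: collect AUTORAG_* overrides from an environment mapping."""
    overrides: dict[str, object] = {}
    for name, value in environ.items():
        upper = name.upper()  # case-insensitive, as with case_sensitive=False
        if not upper.startswith(PREFIX):
            continue  # belongs to some other tool
        field = upper[len(PREFIX):].lower()
        caster = KNOWN_FIELDS.get(field)
        if caster is None:
            continue  # unrecognized variable: ignored rather than rejected
        overrides[field] = caster(value)
    return overrides


overrides = read_overrides({
    "AUTORAG_TOP_K": "8",
    "autorag_model": "some-model",
    "AUTORAG_NOT_A_FIELD": "x",
    "PATH": "/usr/bin",
})
```

In practice the same effect comes from exporting AUTORAG_TOP_K=8 before the process starts and calling autorag.config.get_settings().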

class autorag.config.Settings(_case_sensitive=None, _nested_model_default_partial_update=None, _env_prefix=None, _env_prefix_target=None, _env_file=PosixPath('.'), _env_file_encoding=None, _env_ignore_empty=None, _env_nested_delimiter=None, _env_nested_max_split=None, _env_parse_none_str=None, _env_parse_enums=None, _cli_prog_name=None, _cli_parse_args=None, _cli_settings_source=None, _cli_parse_none_str=None, _cli_hide_none_type=None, _cli_avoid_json=None, _cli_enforce_required=None, _cli_use_class_docs_for_groups=None, _cli_exit_on_error=None, _cli_prefix=None, _cli_flag_prefix_char=None, _cli_implicit_flags=None, _cli_ignore_unknown_args=None, _cli_kebab_case=None, _cli_shortcuts=None, _secrets_dir=None, _build_sources=None, **values)[source]¶

Bases: BaseSettings

Default knobs for the RAG pipeline.

Parameters:

_case_sensitive (bool | None)
_nested_model_default_partial_update (bool | None)
_env_prefix (str | None)
_env_prefix_target (Optional[Literal['variable', 'alias', 'all']])
_env_file (Path | str | Sequence[Path | str] | None)
_env_file_encoding (str | None)
_env_ignore_empty (bool | None)
_env_nested_delimiter (str | None)
_env_nested_max_split (int | None)
_env_parse_none_str (str | None)
_env_parse_enums (bool | None)
_cli_prog_name (str | None)
_cli_parse_args (bool | list[str] | tuple[str, ...] | None)
_cli_settings_source (Optional[CliSettingsSource[Any]])
_cli_parse_none_str (str | None)
_cli_hide_none_type (bool | None)
_cli_avoid_json (bool | None)
_cli_enforce_required (bool | None)
_cli_use_class_docs_for_groups (bool | None)
_cli_exit_on_error (bool | None)
_cli_prefix (str | None)
_cli_flag_prefix_char (str | None)
_cli_implicit_flags (Union[bool, Literal['dual', 'toggle'], None])
_cli_ignore_unknown_args (bool | None)
_cli_kebab_case (Union[bool, Literal['all', 'no_enums'], None])
_cli_shortcuts (Mapping[str, str | list[str]] | None)
_secrets_dir (Path | str | Sequence[Path | str] | None)
_build_sources (tuple[tuple[PydanticBaseSettingsSource, ...], dict[str, Any]] | None)
values (Any)
chunk_size (int)
chunk_overlap (int)
top_k (int)
model (str)
db_path (Path)
broker_url (str)
otel_enabled (bool)
otel_service_name (str)
otel_exporter_endpoint (str)
otel_metric_export_interval_ms (int)
otel_environment (str)
otel_resource_attributes (str)

model_config: ClassVar[SettingsConfigDict] = {'arbitrary_types_allowed': True, 'case_sensitive': False, 'cli_avoid_json': False, 'cli_enforce_required': False, 'cli_exit_on_error': True, 'cli_flag_prefix_char': '-', 'cli_hide_none_type': False, 'cli_ignore_unknown_args': False, 'cli_implicit_flags': False, 'cli_kebab_case': False, 'cli_parse_args': None, 'cli_parse_none_str': None, 'cli_prefix': '', 'cli_prog_name': None, 'cli_shortcuts': None, 'cli_use_class_docs_for_groups': False, 'enable_decoding': True, 'env_file': '.env', 'env_file_encoding': None, 'env_ignore_empty': False, 'env_nested_delimiter': None, 'env_nested_max_split': None, 'env_parse_enums': None, 'env_parse_none_str': None, 'env_prefix': 'AUTORAG_', 'env_prefix_target': 'variable', 'extra': 'ignore', 'json_file': None, 'json_file_encoding': None, 'nested_model_default_partial_update': False, 'protected_namespaces': ('model_validate', 'model_dump', 'settings_customise_sources'), 'secrets_dir': None, 'toml_file': None, 'validate_default': True, 'yaml_config_section': None, 'yaml_file': None, 'yaml_file_encoding': None}¶: Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

chunk_size: int¶: Target character count for each chunk.

chunk_overlap: int¶: Character overlap between adjacent chunks.

top_k: int¶: Default number of chunks to retrieve per query.

model: str¶: Default LLM model name for generation.

db_path: Path¶: Location of the SQLite clip database.

broker_url: str¶

RabbitMQ URL for the async pipeline (autorag.services).

Only consulted by the [broker] async path (workers + /jobs/* endpoints); every synchronous endpoint and CLI command ignores it.

otel_enabled: bool¶: Master switch for OpenTelemetry traces + metrics. When False (the default), autorag.otel.initialize_otel() is a no-op and no opentelemetry.* modules are imported.

otel_service_name: str¶: Base service.name resource attribute. Each long-running process (autorag-api / autorag-gpu-worker / autorag-io-worker / autorag-cli) overrides it via the explicit initialize_otel argument, so this only matters if a caller drops that argument.

otel_exporter_endpoint: str¶: OTLP/gRPC endpoint for the span + metric exporters. Defaults to the host-side collector port; the compose workers override it to http://otel-collector:4317 over autorag-net.

otel_metric_export_interval_ms: int¶: Periodic-metric-reader export interval. Matches the Prometheus 15-second scrape, so a metric is at most one interval stale.

otel_environment: str¶: Value of the deployment.environment resource attribute.

otel_resource_attributes: str¶: Additional resource attributes as key=val,key2=val2. Parsed at initialisation time and merged onto the built-in resource. Empty by default — callers usually rely on the explicit fields above.
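A minimal sketch of parsing the key=val,key2=val2 format is shown below. The actual parser inside autorag.otel is not documented here, so the function name and the handling of whitespace and malformed items are assumptions.

```python
from __future__ import annotations


def parse_resource_attributes(raw: str) -> dict[str, str]:
    """Hypothetical parser for a 'key=val,key2=val2' attribute string."""
    attributes: dict[str, str] = {}
    for item in raw.split(","):
        item = item.strip()
        if not item:
            continue  # empty string or stray comma
        key, sep, value = item.partition("=")
        key = key.strip()
        if not sep or not key:
            continue  # assumption: silently skip items without 'key='
        attributes[key] = value.strip()
    return attributes


parsed = parse_resource_attributes("team=search, region=eu-west-1")
```

The empty default parses to an empty dict, so merging it onto the built-in resource changes nothing.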

autorag.config.get_settings()[source]¶

Build a Settings instance from the current environment.

Return type: Settings

Block formatter (`autorag.blocks`)¶

Pure-stdlib transcript-formatting helpers.

Kept dependency-free so a base install (no [audio] / [rag]) can call format_blocks() on any autorag.types.WordSpan list it already has — e.g. one loaded straight from the SQLite cache or built externally.

autorag.blocks.format_blocks(transcription, seconds)[source]¶

Render transcription as N-second time blocks with per-turn speaker lines.

Buckets each WordSpan into [floor(s/N)*N, floor(s/N)*N + N). Within each non-empty bucket, groups consecutive same-speaker spans into turns via group_by_speaker() and emits one line per turn: MM:SS-MM:SS Speaker K: <words> where K is int(speaker) + 1 (1-indexed display; non-numeric labels pass through verbatim). Skips empty buckets; separates non-empty buckets by one blank line. No trailing newline.

A turn whose words span multiple buckets produces one line per bucket — each line covers only that bucket’s portion of the turn.

Raises:

ValueError – if seconds <= 0.

Parameters:

transcription (list[WordSpan])
seconds (int)

Return type:

str
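The bucketing and speaker-label rules described above can be sketched in isolation. This is not the library's implementation: the three helper names are hypothetical, spans are treated as plain dicts with an s key, and the final line rendering is left out because the docs do not fully pin down the exact timestamps on each line.

```python
from __future__ import annotations

import math
from typing import Any


def bucket_start(s: float, seconds: int) -> int:
    """Start of the N-second bucket containing time s: floor(s/N)*N."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return math.floor(s / seconds) * seconds


def bucketize(spans: list[dict[str, Any]], seconds: int) -> dict[int, list[dict[str, Any]]]:
    """Group spans by bucket start; empty buckets never appear in the result."""
    buckets: dict[int, list[dict[str, Any]]] = {}
    for span in spans:
        buckets.setdefault(bucket_start(span["s"], seconds), []).append(span)
    return dict(sorted(buckets.items()))  # chronological bucket order


def display_speaker(speaker: str) -> str:
    """1-indexed label for numeric ids; other labels pass through unchanged."""
    try:
        return str(int(speaker) + 1)
    except ValueError:
        return speaker


# Made-up spans: two words in [0, 30), none in [30, 60), one in [60, 90).
spans = [
    {"w": "hi", "s": 1.0, "e": 1.3, "segment_id": "0", "speaker": "0"},
    {"w": "there", "s": 29.5, "e": 30.2, "segment_id": "0", "speaker": "0"},
    {"w": "bye", "s": 61.0, "e": 61.4, "segment_id": "1", "speaker": "1"},
]
buckets = bucketize(spans, 30)
```

Note that a word is assigned by its start time only, so the word starting at 29.5 stays in the first bucket even though it ends at 30.2.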

autorag.blocks.group_by_speaker(spans)[source]¶

Walk spans in order; coalesce consecutive same-speaker runs.

Words missing a speaker key are treated as speaker “0”, which keeps single-speaker behavior identical to pre-diarization output.

Parameters: spans (list[WordSpan])
Return type: list[tuple[str, list[WordSpan]]]
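One way to get this behaviour is itertools.groupby, which only merges adjacent items with equal keys. The sketch below is an assumption about the approach rather than the library's code, and it treats spans as plain dicts.

```python
from __future__ import annotations

from itertools import groupby
from typing import Any


def group_by_speaker_sketch(spans: list[dict[str, Any]]) -> list[tuple[str, list[dict[str, Any]]]]:
    """Coalesce consecutive same-speaker spans; a missing speaker counts as "0"."""
    def speaker_of(span: dict[str, Any]) -> str:
        return span.get("speaker", "0")

    # groupby starts a new run whenever the key changes, so a speaker who
    # returns later gets a separate turn instead of being merged.
    return [(speaker, list(run)) for speaker, run in groupby(spans, key=speaker_of)]


turns = group_by_speaker_sketch([
    {"w": "hi", "speaker": "0"},
    {"w": "there", "speaker": "0"},
    {"w": "hello", "speaker": "1"},
    {"w": "ok"},  # no speaker key: treated as "0"
])
```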

autorag.blocks.mmss(t)[source]¶

Format t seconds as MM:SS (minutes may exceed 99 for long audio).

Floors to whole seconds and clamps negatives to 00:00. Inverse of autorag.agent._parse_ts() at second resolution.

Parameters: t (float)
Return type: str
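The formatting rule is short enough to sketch directly. The function below is an illustrative stand-in named mmss_sketch, not the library source.

```python
import math


def mmss_sketch(t: float) -> str:
    """Format t seconds as MM:SS, flooring to whole seconds and clamping negatives."""
    total = max(0, math.floor(t))
    minutes, seconds = divmod(total, 60)
    # Minutes are zero-padded to two digits but allowed to grow past 99.
    return f"{minutes:02d}:{seconds:02d}"


examples = [mmss_sketch(value) for value in (0, 65.9, -3, 6000)]
```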
