autorag.whisper_runner

Lazy-loaded whisperX model cache with CUDA-preferred / CPU-fallback device selection.

The main transcription model (CTranslate2 / faster-whisper backend) is removed from the module cache after each run so that Python's garbage collector can free its VRAM. The smaller wav2vec2 alignment model is offloaded to CPU after aligning and restored on the next call (a PyTorch .to() round-trip). Both are re-created from the local Hugging Face cache on the next pipeline run, which is fast (under 1 s for models already downloaded).

autorag.whisper_runner.get_model(size, device_hint=None)

Return a cached whisperX model for size.

device_hint is advisory: it is ignored when the process is already pinned to the CPU.

Parameters:

size, device_hint

Return type:

Any
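One way to read "advisory" together with the CUDA-preferred / CPU-fallback policy is shown below. This is a sketch under stated assumptions: `choose_device` and its `cuda_available` and `cpu_pinned` flags are hypothetical names, and the real module may detect these conditions differently.

```python
from typing import Optional


def choose_device(device_hint: Optional[str],
                  cuda_available: bool,
                  cpu_pinned: bool) -> str:
    """Pick "cuda" or "cpu"; the hint is only honoured when it can be."""
    if cpu_pinned:
        # A CPU-pinned process overrides any hint.
        return "cpu"
    if device_hint == "cpu":
        # An explicit CPU request is always satisfiable.
        return "cpu"
    # No hint, or a CUDA hint: prefer CUDA, fall back to CPU if unavailable.
    return "cuda" if cuda_available else "cpu"
```

Under this reading a hint can steer the choice toward the CPU, but it can never force CUDA onto a process that is pinned to the CPU or has no GPU.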

autorag.whisper_runner.resolved_device()

Return the most recently used device (or the preferred device if nothing has been loaded yet).

Return type:

str

autorag.whisper_runner.transcribe_segment(model, file_path, language)

Transcribe file_path and return frame-aligned word dicts.

Each dict has the form {"w": str, "s": float, "e": float, "p": float}. The alignment pass uses wav2vec2 for frame-accurate word timestamps; if it fails, the unaligned faster-whisper timestamps are used as a fallback.

Parameters:

model, file_path, language

Return type:

list[dict[str, Any]]
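The fallback behaviour and the shape of the returned list can be illustrated as follows. This is a hedged sketch: `align_or_fallback`, `to_text`, and `span_seconds` are hypothetical helpers, the sample words are hand-written rather than real output, and it assumes that "w" holds the word text and that "s" and "e" are start and end times in seconds.

```python
from typing import Any, Callable

Word = dict[str, Any]


def align_or_fallback(unaligned: list[Word],
                      align: Callable[[list[Word]], list[Word]]) -> list[Word]:
    """Return the aligned words, or the unaligned ones if alignment raises."""
    try:
        return align(unaligned)
    except Exception:
        # Alignment failed: keep the original timestamps.
        return unaligned


def to_text(words: list[Word]) -> str:
    """Join the word strings into a single line of text."""
    return " ".join(w["w"] for w in words)


def span_seconds(words: list[Word]) -> float:
    """Time from the first word's start to the last word's end."""
    return words[-1]["e"] - words[0]["s"] if words else 0.0


# Hand-written sample in the documented dict shape (not real model output).
sample: list[Word] = [
    {"w": "hello", "s": 0.0, "e": 0.5, "p": 0.9},
    {"w": "world", "s": 0.5, "e": 1.25, "p": 0.8},
]


def failing_align(words: list[Word]) -> list[Word]:
    raise RuntimeError("alignment model unavailable")


result = align_or_fallback(sample, failing_align)
```

Because both paths return the same dict shape, callers such as `to_text` do not need to know whether alignment succeeded.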