
feat: add fast ASR backend #2938

Open

BBC-Esq wants to merge 10 commits into docling-project:main from BBC-Esq:add-fast-asr-backend

Conversation

BBC-Esq commented Jan 31, 2026

Add Fast WhisperS2T (based on ctranslate2) ASR Backend

This PR adds support for the whisper-s2t-reborn library as a high-performance ASR backend option, providing significantly faster transcription speeds. Within the pipeline, this backend is chosen automatically when the criteria are met (e.g., CUDA is available). Previously, the pipeline chose either MLX or a simple fallback to vanilla OpenAI Whisper; vanilla Whisper is now the final fallback if neither MLX nor WhisperS2T is available.

New Files/Classes:

  • InlineAsrWhisperS2TOptions - Configuration class for WhisperS2T with options for compute type, batch size, beam size, and more
  • _WhisperS2TModel - Model wrapper class implementing the WhisperS2T transcription pipeline

Modified Files:

  • pipeline_options_asr_model.py - Added WHISPER_S2T to InferenceAsrFramework enum; Added InlineAsrWhisperS2TOptions configuration class
  • asr_model_specs.py - Added 12 pre-configured model specs (tiny, base, small, medium, large-v3 + distilled variants); Extended AsrModelType enum
  • asr_pipeline.py - Added _WhisperS2TModel class with device parsing fix for CTranslate2 compatibility; Updated AsrPipeline to handle new backend
  • cli/main.py - Added CLI support for all WhisperS2T model variants

Notes

  • Includes a _parse_device() helper that converts device strings like "cuda:0" into ("cuda", 0) for CTranslate2 compatibility (a small sketch follows this list)
  • Supported devices: CPU, CUDA (MPS not supported by CTranslate2)

    CTranslate2 is currently working on ROCm support as well...
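
For reference, the _parse_device() helper described above behaves roughly like the sketch below (illustrative only; the actual implementation is in asr_pipeline.py):

def _parse_device(device: str) -> tuple[str, int]:
    # CTranslate2 wants a separate device name and index rather than a combined
    # string, so "cuda:0" becomes ("cuda", 0) and a plain "cpu" becomes ("cpu", 0).
    if ":" in device:
        name, index = device.split(":", 1)
        return name, int(index)
    return device, 0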

I will give you the script I used to test this below. However, PLEASE note that I always use a set_cuda_paths() helper that relies on pip-installed versions of CUDA/cuDNN. This function points the relevant CUDA-related paths at the pip-installed locations so I don't have to install/reinstall various versions of CUDA. If you want to test with a typical system-wide installation, don't call set_cuda_paths().

That said, to run the test script: (1) create a venv, (2) activate it, (3) run the following command, and (4) run the script pasted below:

pip install whisper-s2t-reborn nvidia-cuda-runtime-cu12==12.8.90 nvidia-cublas-cu12==12.8.4.1 nvidia-cudnn-cu12==9.10.2.21 https://download.pytorch.org/whl/cu128/torch-2.9.1%2Bcu128-cp312-cp312-win_amd64.whl
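
If you want a quick smoke test without the GUI, a minimal conversion looks roughly like this (same imports and model specs as the full script below; the audio path is a placeholder):

from docling.document_converter import DocumentConverter, AudioFormatOption
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import AsrPipelineOptions
from docling.datamodel.accelerator_options import AcceleratorOptions
from docling.datamodel import asr_model_specs
from docling.pipeline.asr_pipeline import AsrPipeline

# Pick one of the new WhisperS2T model specs and (optionally) override the batch size.
asr_options = asr_model_specs.WHISPER_DISTIL_LARGE_V3_S2T.model_copy(update={"batch_size": 8})

pipeline_options = AsrPipelineOptions(
    accelerator_options=AcceleratorOptions(device="cuda"),
    asr_options=asr_options,
)

converter = DocumentConverter(
    format_options={
        InputFormat.AUDIO: AudioFormatOption(
            pipeline_cls=AsrPipeline,
            pipeline_options=pipeline_options,
        )
    }
)

result = converter.convert("sample_audio.mp3")  # placeholder path
print(result.document.export_to_markdown())
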
HERE IS THE SCRIPT:
import os
import sys
import platform
import json
import traceback
import time
from pathlib import Path
from typing import Optional
from dataclasses import dataclass, field
from enum import Enum, auto

def set_cuda_paths():
    if platform.system() != "Windows":
        return
    venv_base = Path(sys.executable).parent.parent
    nvidia_base = venv_base / 'Lib' / 'site-packages' / 'nvidia'
    if not nvidia_base.exists():
        return
    cuda_path_runtime = nvidia_base / 'cuda_runtime' / 'bin'
    cuda_path_runtime_lib = nvidia_base / 'cuda_runtime' / 'lib' / 'x64'
    cuda_path_runtime_include = nvidia_base / 'cuda_runtime' / 'include'
    cublas_path = nvidia_base / 'cublas' / 'bin'
    cudnn_path = nvidia_base / 'cudnn' / 'bin'
    nvrtc_path = nvidia_base / 'cuda_nvrtc' / 'bin'
    nvcc_path = nvidia_base / 'cuda_nvcc' / 'bin'
    paths_to_add = [
        cuda_path_runtime,
        cuda_path_runtime_lib,
        cuda_path_runtime_include,
        cublas_path,
        cudnn_path,
        nvrtc_path,
        nvcc_path,
    ]
    current_value = os.environ.get('PATH', '')
    new_value = os.pathsep.join([str(p) for p in paths_to_add] + ([current_value] if current_value else []))
    os.environ['PATH'] = new_value
    triton_cuda_path = nvidia_base / 'cuda_runtime'
    current_cuda_path = os.environ.get('CUDA_PATH', '')
    new_cuda_path = os.pathsep.join([str(triton_cuda_path)] + ([current_cuda_path] if current_cuda_path else []))
    os.environ['CUDA_PATH'] = new_cuda_path
    if hasattr(os, 'add_dll_directory'):
        for path in paths_to_add:
            if path.exists():
                try:
                    os.add_dll_directory(str(path))
                except OSError:
                    pass

set_cuda_paths()

from PySide6.QtCore import Qt, QThread, Signal, QSize, QUrl
from PySide6.QtWidgets import (
    QApplication, QMainWindow, QWidget, QVBoxLayout, QHBoxLayout,
    QPushButton, QLabel, QFileDialog, QTextEdit, QTabWidget,
    QTreeWidget, QTreeWidgetItem, QSplitter, QProgressBar,
    QMessageBox, QStatusBar, QFrame, QGroupBox, QComboBox,
    QScrollArea, QSizePolicy, QSpinBox
)
from PySide6.QtGui import QFont, QColor, QPalette, QIcon

HAS_WEBENGINE = False
try:
    from PySide6.QtWebEngineWidgets import QWebEngineView
    HAS_WEBENGINE = True
except ImportError:
    pass


class ExportFormat(Enum):
    MARKDOWN = auto()
    TEXT = auto()
    HTML = auto()
    JSON = auto()
    SRT = auto()
    VTT = auto()


@dataclass
class TranscriptionSegment:
    start_time: float
    end_time: float
    text: str
    words: list = field(default_factory=list)


@dataclass
class TranscriptionResult:
    success: bool
    filename: str
    markdown: str = ""
    text: str = ""
    html: str = ""
    json_data: str = ""
    srt: str = ""
    vtt: str = ""
    segments: list = None
    processing_time: float = 0.0
    model_name: str = ""
    device: str = ""
    error_message: str = ""

    def __post_init__(self):
        if self.segments is None:
            self.segments = []


class TranscriptionWorker(QThread):
    finished = Signal(TranscriptionResult)
    progress = Signal(str)

    def __init__(self, file_path: str, model_name: str, device: str, batch_size: int):
        super().__init__()
        self.file_path = file_path
        self.model_name = model_name
        self.device = device
        self.batch_size = batch_size

    def run(self):
        try:
            self.progress.emit("Starting transcription...")
            result = self._transcribe_audio()
            self.finished.emit(result)
        except Exception as e:
            error_result = TranscriptionResult(
                success=False,
                filename=Path(self.file_path).name,
                error_message=f"{str(e)}\n\n{traceback.format_exc()}"
            )
            self.finished.emit(error_result)

    def _transcribe_audio(self) -> TranscriptionResult:
        file_path = Path(self.file_path)

        self.progress.emit("Importing Docling...")

        from docling.document_converter import DocumentConverter, AudioFormatOption
        from docling.datamodel.base_models import InputFormat
        from docling.datamodel.pipeline_options import AsrPipelineOptions
        from docling.datamodel.accelerator_options import AcceleratorOptions
        from docling.datamodel import asr_model_specs
        from docling.pipeline.asr_pipeline import AsrPipeline

        model_map = {
            "tiny": asr_model_specs.WHISPER_TINY_S2T,
            "tiny.en": asr_model_specs.WHISPER_TINY_EN_S2T,
            "base": asr_model_specs.WHISPER_BASE_S2T,
            "base.en": asr_model_specs.WHISPER_BASE_EN_S2T,
            "small": asr_model_specs.WHISPER_SMALL_S2T,
            "small.en": asr_model_specs.WHISPER_SMALL_EN_S2T,
            "distil-small.en": asr_model_specs.WHISPER_DISTIL_SMALL_EN_S2T,
            "medium": asr_model_specs.WHISPER_MEDIUM_S2T,
            "medium.en": asr_model_specs.WHISPER_MEDIUM_EN_S2T,
            "distil-medium.en": asr_model_specs.WHISPER_DISTIL_MEDIUM_EN_S2T,
            "large-v3": asr_model_specs.WHISPER_LARGE_V3_S2T,
            "distil-large-v3": asr_model_specs.WHISPER_DISTIL_LARGE_V3_S2T,
        }

        self.progress.emit(f"Loading model: {self.model_name}...")

        model_spec = model_map.get(self.model_name, asr_model_specs.WHISPER_LARGE_V3_S2T)
        model_spec = model_spec.model_copy(update={"batch_size": self.batch_size})

        device = self.device.split(":")[0] if ":" in self.device else self.device

        pipeline_options = AsrPipelineOptions(
            accelerator_options=AcceleratorOptions(device=device),
            asr_options=model_spec,
        )

        converter = DocumentConverter(
            format_options={
                InputFormat.AUDIO: AudioFormatOption(
                    pipeline_cls=AsrPipeline,
                    pipeline_options=pipeline_options,
                )
            }
        )

        self.progress.emit("Transcribing audio...")

        start_time = time.perf_counter()
        conv_result = converter.convert(str(file_path))
        end_time = time.perf_counter()

        processing_time = end_time - start_time

        self.progress.emit("Extracting content...")

        doc = conv_result.document

        segments = []
        if doc.texts:
            for text_item in doc.texts:
                start = 0.0
                end = 0.0

                if hasattr(text_item, 'source') and text_item.source:
                    track = text_item.source[0]
                    start = float(getattr(track, 'start_time', 0.0) or 0.0)
                    end = float(getattr(track, 'end_time', 0.0) or 0.0)

                segments.append(TranscriptionSegment(
                    start_time=start,
                    end_time=end,
                    text=text_item.text.strip()
                ))

        markdown_content = doc.export_to_markdown()
        text_content = "\n".join([seg.text for seg in segments])
        html_content = self._generate_html(segments, file_path.name)

        json_content = json.dumps({
            "filename": file_path.name,
            "model": self.model_name,
            "device": self.device,
            "batch_size": self.batch_size,
            "processing_time_seconds": processing_time,
            "segments": [
                {
                    "start": seg.start_time,
                    "end": seg.end_time,
                    "text": seg.text
                } for seg in segments
            ],
            "full_text": text_content
        }, indent=2)

        srt_content = self._generate_srt(segments)
        vtt_content = self._generate_vtt(segments)

        self.progress.emit("Done!")

        return TranscriptionResult(
            success=True,
            filename=file_path.name,
            markdown=markdown_content,
            text=text_content,
            html=html_content,
            json_data=json_content,
            srt=srt_content,
            vtt=vtt_content,
            segments=segments,
            processing_time=processing_time,
            model_name=self.model_name,
            device=self.device
        )

    def _format_timestamp(self, seconds: float, use_comma: bool = False) -> str:
        if seconds is None:
            seconds = 0.0
        hours = int(seconds // 3600)
        minutes = int((seconds % 3600) // 60)
        secs = int(seconds % 60)
        millis = int((seconds % 1) * 1000)
        sep = "," if use_comma else "."
        return f"{hours:02d}:{minutes:02d}:{secs:02d}{sep}{millis:03d}"

    def _generate_srt(self, segments: list) -> str:
        lines = []
        for i, seg in enumerate(segments, 1):
            lines.append(str(i))
            start = self._format_timestamp(seg.start_time, use_comma=True)
            end = self._format_timestamp(seg.end_time, use_comma=True)
            lines.append(f"{start} --> {end}")
            lines.append(seg.text)
            lines.append("")
        return "\n".join(lines)

    def _generate_vtt(self, segments: list) -> str:
        lines = ["WEBVTT", ""]
        for seg in segments:
            start = self._format_timestamp(seg.start_time, use_comma=False)
            end = self._format_timestamp(seg.end_time, use_comma=False)
            lines.append(f"{start} --> {end}")
            lines.append(seg.text)
            lines.append("")
        return "\n".join(lines)

    def _generate_html(self, segments: list, filename: str) -> str:
        segment_html = ""
        for seg in segments:
            start = self._format_timestamp(seg.start_time)
            end = self._format_timestamp(seg.end_time)
            segment_html += f"""
            <div class="segment">
                <span class="timestamp">[{start}{end}]</span>
                <span class="text">{seg.text}</span>
            </div>"""

        return f"""<!DOCTYPE html>
<html>
<head>
    <meta charset="utf-8">
    <title>Transcription: {filename}</title>
    <style>
        body {{
            font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif;
            padding: 20px;
            line-height: 1.8;
            background: #1e1e1e;
            color: #d4d4d4;
            max-width: 900px;
            margin: 0 auto;
        }}
        h1 {{
            color: #569cd6;
            border-bottom: 2px solid #3a3a3a;
            padding-bottom: 10px;
        }}
        .segment {{
            margin: 15px 0;
            padding: 10px;
            background: #252525;
            border-radius: 5px;
            border-left: 3px solid #569cd6;
        }}
        .timestamp {{
            color: #4ec9b0;
            font-family: 'Consolas', monospace;
            font-size: 0.85em;
            margin-right: 10px;
        }}
        .text {{
            color: #d4d4d4;
        }}
    </style>
</head>
<body>
    <h1>Transcription: {filename}</h1>
    {segment_html}
</body>
</html>"""


class TranscriptionStructureTree(QTreeWidget):
    def __init__(self):
        super().__init__()
        self.setHeaderLabels(["Element", "Details"])
        self.setColumnWidth(0, 200)
        self.setAlternatingRowColors(True)

    def populate(self, result: TranscriptionResult):
        self.clear()

        root = QTreeWidgetItem(self, [result.filename, ""])
        root.setExpanded(True)

        info_item = QTreeWidgetItem(root, ["Transcription Info", ""])
        info_item.setExpanded(True)
        QTreeWidgetItem(info_item, ["Model", result.model_name])
        QTreeWidgetItem(info_item, ["Device", result.device])
        QTreeWidgetItem(info_item, ["Processing Time", f"{result.processing_time:.2f}s"])

        if result.segments:
            segments_item = QTreeWidgetItem(root, ["Segments", f"({len(result.segments)})"])
            segments_item.setExpanded(True)

            for i, seg in enumerate(result.segments[:50], 1):
                start = self._format_time(seg.start_time)
                end = self._format_time(seg.end_time)
                preview = seg.text[:40] + "..." if len(seg.text) > 40 else seg.text
                seg_item = QTreeWidgetItem(segments_item, [f"Segment {i}", f"{start} → {end}"])
                QTreeWidgetItem(seg_item, ["Text", preview])

            if len(result.segments) > 50:
                QTreeWidgetItem(segments_item, ["...", f"({len(result.segments) - 50} more segments)"])

        stats_item = QTreeWidgetItem(root, ["Statistics", ""])
        stats_item.setExpanded(True)

        total_duration = 0.0
        if result.segments:
            total_duration = max((seg.end_time for seg in result.segments), default=0.0)

        text_len = len(result.text) if result.text else 0
        word_count = len(result.text.split()) if result.text else 0

        QTreeWidgetItem(stats_item, ["Duration", self._format_time(total_duration)])
        QTreeWidgetItem(stats_item, ["Segments", str(len(result.segments))])
        QTreeWidgetItem(stats_item, ["Characters", f"{text_len:,}"])
        QTreeWidgetItem(stats_item, ["Words", f"{word_count:,}"])

        if total_duration > 0:
            realtime_factor = result.processing_time / total_duration
            QTreeWidgetItem(stats_item, ["Realtime Factor", f"{realtime_factor:.2f}x"])

    def _format_time(self, seconds: float) -> str:
        if seconds is None:
            return "N/A"
        minutes = int(seconds // 60)
        secs = seconds % 60
        if minutes > 0:
            return f"{minutes}m {secs:.2f}s"
        return f"{secs:.2f}s"


class ContentViewer(QTabWidget):
    def __init__(self):
        super().__init__()

        self.markdown_view = QTextEdit()
        self.markdown_view.setReadOnly(True)
        self.markdown_view.setFont(QFont("Consolas", 10))
        self.addTab(self.markdown_view, "Markdown")

        self.text_view = QTextEdit()
        self.text_view.setReadOnly(True)
        self.text_view.setFont(QFont("Consolas", 10))
        self.addTab(self.text_view, "Plain Text")

        self.html_view = None
        if HAS_WEBENGINE:
            try:
                self.html_view = QWebEngineView()
                self.addTab(self.html_view, "HTML Preview")
            except Exception:
                self.html_view = None

        if self.html_view is None:
            self.html_preview_fallback = QTextEdit()
            self.html_preview_fallback.setReadOnly(True)
            self.html_preview_fallback.setFont(QFont("Consolas", 9))
            self.addTab(self.html_preview_fallback, "HTML Preview (Source)")

        self.srt_view = QTextEdit()
        self.srt_view.setReadOnly(True)
        self.srt_view.setFont(QFont("Consolas", 10))
        self.addTab(self.srt_view, "SRT Subtitles")

        self.vtt_view = QTextEdit()
        self.vtt_view.setReadOnly(True)
        self.vtt_view.setFont(QFont("Consolas", 10))
        self.addTab(self.vtt_view, "VTT Subtitles")

        self.json_view = QTextEdit()
        self.json_view.setReadOnly(True)
        self.json_view.setFont(QFont("Consolas", 9))
        self.addTab(self.json_view, "JSON Structure")

        self.segments_view = QTextEdit()
        self.segments_view.setReadOnly(True)
        self.segments_view.setFont(QFont("Consolas", 10))
        self.addTab(self.segments_view, "Segments")

    def display(self, result: TranscriptionResult):
        self.markdown_view.setPlainText(result.markdown)
        self.text_view.setPlainText(result.text)

        if self.html_view is not None:
            self.html_view.setHtml(result.html)
        else:
            self.html_preview_fallback.setPlainText(result.html)

        self.srt_view.setPlainText(result.srt)
        self.vtt_view.setPlainText(result.vtt)
        self.json_view.setPlainText(result.json_data)

        if result.segments:
            segments_text = ""
            for i, seg in enumerate(result.segments, 1):
                segments_text += f"{'='*60}\n"
                segments_text += f"SEGMENT {i}\n"
                segments_text += f"{'='*60}\n"
                segments_text += f"Start: {seg.start_time:.3f}s\n"
                segments_text += f"End:   {seg.end_time:.3f}s\n"
                segments_text += f"Duration: {seg.end_time - seg.start_time:.3f}s\n"
                segments_text += f"\n{seg.text}\n\n"
            self.segments_view.setPlainText(segments_text)
        else:
            self.segments_view.setPlainText("No segments found.")

    def clear_all(self):
        self.markdown_view.clear()
        self.text_view.clear()
        if self.html_view is not None:
            self.html_view.setHtml("")
        else:
            self.html_preview_fallback.clear()
        self.srt_view.clear()
        self.vtt_view.clear()
        self.json_view.clear()
        self.segments_view.clear()


class MainWindow(QMainWindow):
    SUPPORTED_FORMATS = (
        "Audio Files (*.mp3 *.wav *.flac *.m4a *.aac *.ogg *.wma *.opus *.webm *.mp4 *.mkv *.avi);;"
        "MP3 Files (*.mp3);;"
        "WAV Files (*.wav);;"
        "FLAC Files (*.flac);;"
        "M4A Files (*.m4a);;"
        "AAC Files (*.aac);;"
        "OGG Files (*.ogg);;"
        "WMA Files (*.wma);;"
        "Video Files (*.mp4 *.mkv *.avi *.webm);;"
        "All Files (*.*)"
    )

    MODEL_OPTIONS = [
        "large-v3",
        "distil-large-v3",
        "medium",
        "medium.en",
        "distil-medium.en",
        "small",
        "small.en",
        "distil-small.en",
        "base",
        "base.en",
        "tiny",
        "tiny.en",
    ]

    def __init__(self):
        super().__init__()
        self.setWindowTitle("Docling WhisperS2T Audio Transcription")
        self.setMinimumSize(1200, 800)
        self.worker: Optional[TranscriptionWorker] = None
        self.current_result: Optional[TranscriptionResult] = None
        self._setup_ui()

    def _setup_ui(self):
        central_widget = QWidget()
        self.setCentralWidget(central_widget)
        main_layout = QVBoxLayout(central_widget)
        main_layout.setSpacing(10)
        main_layout.setContentsMargins(10, 10, 10, 10)

        file_group = QGroupBox("File Selection")
        file_layout = QHBoxLayout(file_group)

        self.file_label = QLabel("No file selected")
        self.file_label.setStyleSheet("padding: 8px; background-color: palette(base); border-radius: 4px;")
        self.file_label.setSizePolicy(QSizePolicy.Expanding, QSizePolicy.Preferred)
        file_layout.addWidget(self.file_label)

        self.browse_button = QPushButton("Browse...")
        self.browse_button.setMinimumWidth(100)
        self.browse_button.clicked.connect(self._browse_file)
        file_layout.addWidget(self.browse_button)

        main_layout.addWidget(file_group)

        settings_group = QGroupBox("Transcription Settings")
        settings_layout = QHBoxLayout(settings_group)

        settings_layout.addWidget(QLabel("Model:"))
        self.model_combo = QComboBox()
        self.model_combo.addItems(self.MODEL_OPTIONS)
        self.model_combo.setCurrentText("large-v3")
        self.model_combo.setMinimumWidth(150)
        settings_layout.addWidget(self.model_combo)

        settings_layout.addSpacing(20)

        settings_layout.addWidget(QLabel("Device:"))
        self.device_combo = QComboBox()
        self.device_combo.addItems(["cuda", "cpu"])
        self.device_combo.setCurrentText("cuda")
        self.device_combo.setMinimumWidth(80)
        settings_layout.addWidget(self.device_combo)

        settings_layout.addSpacing(20)

        settings_layout.addWidget(QLabel("Batch Size:"))
        self.batch_spinbox = QSpinBox()
        self.batch_spinbox.setRange(1, 64)
        self.batch_spinbox.setValue(8)
        self.batch_spinbox.setMinimumWidth(80)
        settings_layout.addWidget(self.batch_spinbox)

        settings_layout.addStretch()

        self.transcribe_button = QPushButton("Transcribe")
        self.transcribe_button.setMinimumWidth(120)
        self.transcribe_button.setEnabled(False)
        self.transcribe_button.clicked.connect(self._start_transcription)
        settings_layout.addWidget(self.transcribe_button)

        main_layout.addWidget(settings_group)

        self.progress_bar = QProgressBar()
        self.progress_bar.setTextVisible(True)
        self.progress_bar.setFormat("")
        self.progress_bar.setMaximum(0)
        self.progress_bar.setMinimum(0)
        self.progress_bar.setVisible(False)
        main_layout.addWidget(self.progress_bar)

        splitter = QSplitter(Qt.Horizontal)

        left_panel = QWidget()
        left_layout = QVBoxLayout(left_panel)
        left_layout.setContentsMargins(0, 0, 0, 0)

        structure_label = QLabel("Transcription Structure")
        structure_label.setStyleSheet("font-weight: bold; padding: 5px;")
        left_layout.addWidget(structure_label)

        self.structure_tree = TranscriptionStructureTree()
        left_layout.addWidget(self.structure_tree)

        splitter.addWidget(left_panel)

        right_panel = QWidget()
        right_layout = QVBoxLayout(right_panel)
        right_layout.setContentsMargins(0, 0, 0, 0)

        content_label = QLabel("Transcription Content")
        content_label.setStyleSheet("font-weight: bold; padding: 5px;")
        right_layout.addWidget(content_label)

        self.content_viewer = ContentViewer()
        right_layout.addWidget(self.content_viewer)

        splitter.addWidget(right_panel)
        splitter.setSizes([300, 900])

        main_layout.addWidget(splitter, 1)

        button_layout = QHBoxLayout()

        self.export_combo = QComboBox()
        self.export_combo.addItems(["Markdown", "Plain Text", "HTML", "JSON", "SRT", "VTT"])
        self.export_combo.setMinimumWidth(120)
        button_layout.addWidget(QLabel("Export as:"))
        button_layout.addWidget(self.export_combo)

        self.export_button = QPushButton("Export...")
        self.export_button.setEnabled(False)
        self.export_button.clicked.connect(self._export_content)
        button_layout.addWidget(self.export_button)

        button_layout.addStretch()

        self.clear_button = QPushButton("Clear")
        self.clear_button.clicked.connect(self._clear_all)
        button_layout.addWidget(self.clear_button)

        main_layout.addLayout(button_layout)

        self.status_bar = QStatusBar()
        self.setStatusBar(self.status_bar)
        self.status_bar.showMessage("Ready - Select an audio file to begin")

    def _browse_file(self):
        file_path, _ = QFileDialog.getOpenFileName(
            self,
            "Select Audio File",
            "",
            self.SUPPORTED_FORMATS
        )
        if file_path:
            self.file_label.setText(file_path)
            self.transcribe_button.setEnabled(True)
            self.status_bar.showMessage(f"Selected: {Path(file_path).name}")

    def _start_transcription(self):
        file_path = self.file_label.text()
        if file_path == "No file selected":
            return

        self.transcribe_button.setEnabled(False)
        self.browse_button.setEnabled(False)
        self.export_button.setEnabled(False)
        self.model_combo.setEnabled(False)
        self.device_combo.setEnabled(False)
        self.batch_spinbox.setEnabled(False)
        self.progress_bar.setVisible(True)
        self.status_bar.showMessage("Transcribing...")

        self.content_viewer.clear_all()
        self.structure_tree.clear()

        self.worker = TranscriptionWorker(
            file_path,
            self.model_combo.currentText(),
            self.device_combo.currentText(),
            self.batch_spinbox.value()
        )
        self.worker.progress.connect(self._on_progress)
        self.worker.finished.connect(self._on_transcription_finished)
        self.worker.start()

    def _on_progress(self, message: str):
        self.progress_bar.setFormat(message)
        self.status_bar.showMessage(message)

    def _on_transcription_finished(self, result: TranscriptionResult):
        self.progress_bar.setVisible(False)
        self.transcribe_button.setEnabled(True)
        self.browse_button.setEnabled(True)
        self.model_combo.setEnabled(True)
        self.device_combo.setEnabled(True)
        self.batch_spinbox.setEnabled(True)

        self.current_result = result

        if result.success:
            self.content_viewer.display(result)
            self.structure_tree.populate(result)
            self.export_button.setEnabled(True)
            self.status_bar.showMessage(
                f"Successfully transcribed: {result.filename} in {result.processing_time:.2f}s"
            )
        else:
            QMessageBox.warning(
                self,
                "Transcription Failed",
                f"Failed to transcribe {result.filename}:\n\n{result.error_message}"
            )
            self.status_bar.showMessage(f"Transcription failed: {result.filename}")

    def _export_content(self):
        if not self.current_result or not self.current_result.success:
            return

        format_map = {
            "Markdown": ("md", self.current_result.markdown),
            "Plain Text": ("txt", self.current_result.text),
            "HTML": ("html", self.current_result.html),
            "JSON": ("json", self.current_result.json_data),
            "SRT": ("srt", self.current_result.srt),
            "VTT": ("vtt", self.current_result.vtt),
        }

        selected_format = self.export_combo.currentText()
        ext, content = format_map[selected_format]

        default_name = Path(self.current_result.filename).stem + f".{ext}"

        file_path, _ = QFileDialog.getSaveFileName(
            self,
            f"Export as {selected_format}",
            default_name,
            f"{selected_format} Files (*.{ext})"
        )

        if file_path:
            try:
                with open(file_path, "w", encoding="utf-8") as f:
                    f.write(content)
                self.status_bar.showMessage(f"Exported to: {file_path}")
                QMessageBox.information(self, "Export Successful", f"Content exported to:\n{file_path}")
            except Exception as e:
                QMessageBox.critical(self, "Export Failed", f"Failed to export:\n{str(e)}")

    def _clear_all(self):
        self.file_label.setText("No file selected")
        self.transcribe_button.setEnabled(False)
        self.export_button.setEnabled(False)
        self.content_viewer.clear_all()
        self.structure_tree.clear()
        self.current_result = None
        self.status_bar.showMessage("Ready - Select an audio file to begin")


def main():
    try:
        app = QApplication(sys.argv)
        app.setStyle("Fusion")

        palette = QPalette()
        palette.setColor(QPalette.Window, QColor(53, 53, 53))
        palette.setColor(QPalette.WindowText, QColor(255, 255, 255))
        palette.setColor(QPalette.Base, QColor(35, 35, 35))
        palette.setColor(QPalette.AlternateBase, QColor(53, 53, 53))
        palette.setColor(QPalette.ToolTipBase, QColor(25, 25, 25))
        palette.setColor(QPalette.ToolTipText, QColor(255, 255, 255))
        palette.setColor(QPalette.Text, QColor(255, 255, 255))
        palette.setColor(QPalette.Button, QColor(53, 53, 53))
        palette.setColor(QPalette.ButtonText, QColor(255, 255, 255))
        palette.setColor(QPalette.BrightText, QColor(255, 0, 0))
        palette.setColor(QPalette.Link, QColor(42, 130, 218))
        palette.setColor(QPalette.Highlight, QColor(42, 130, 218))
        palette.setColor(QPalette.HighlightedText, QColor(35, 35, 35))
        palette.setColor(QPalette.Disabled, QPalette.WindowText, QColor(127, 127, 127))
        palette.setColor(QPalette.Disabled, QPalette.Text, QColor(127, 127, 127))
        palette.setColor(QPalette.Disabled, QPalette.ButtonText, QColor(127, 127, 127))
        app.setPalette(palette)

        app.setStyleSheet("""
            QGroupBox {
                font-weight: bold;
                border: 1px solid #3a3a3a;
                border-radius: 5px;
                margin-top: 10px;
                padding-top: 10px;
            }
            QGroupBox::title {
                subcontrol-origin: margin;
                left: 10px;
                padding: 0 5px 0 5px;
            }
            QPushButton {
                padding: 8px 16px;
                border-radius: 4px;
                background-color: #2a82da;
                color: white;
                border: none;
            }
            QPushButton:hover {
                background-color: #3a92ea;
            }
            QPushButton:pressed {
                background-color: #1a72ca;
            }
            QPushButton:disabled {
                background-color: #555555;
                color: #888888;
            }
            QTabWidget::pane {
                border: 1px solid #3a3a3a;
                border-radius: 4px;
            }
            QTabBar::tab {
                padding: 8px 16px;
                margin-right: 2px;
                background-color: #3a3a3a;
                border-top-left-radius: 4px;
                border-top-right-radius: 4px;
            }
            QTabBar::tab:selected {
                background-color: #2a82da;
            }
            QTreeWidget {
                border: 1px solid #3a3a3a;
                border-radius: 4px;
            }
            QTextEdit {
                border: 1px solid #3a3a3a;
                border-radius: 4px;
            }
            QProgressBar {
                border: 1px solid #3a3a3a;
                border-radius: 4px;
                text-align: center;
            }
            QProgressBar::chunk {
                background-color: #2a82da;
            }
            QComboBox {
                padding: 5px 10px;
                border: 1px solid #3a3a3a;
                border-radius: 4px;
                background-color: #3a3a3a;
            }
            QComboBox::drop-down {
                border: none;
            }
            QComboBox QAbstractItemView {
                background-color: #3a3a3a;
                selection-background-color: #2a82da;
            }
            QSpinBox {
                padding: 5px 10px;
                border: 1px solid #3a3a3a;
                border-radius: 4px;
                background-color: #3a3a3a;
            }
        """)

        window = MainWindow()
        window.show()

        sys.exit(app.exec())

    except Exception as e:
        print(f"Fatal error: {e}")
        print(traceback.format_exc())
        input("Press Enter to exit...")
        sys.exit(1)


if __name__ == "__main__":
    main()


github-actions bot commented Jan 31, 2026

DCO Check Passed

Thanks @BBC-Esq, all your commits are properly signed off. 🎉


dosubot bot commented Jan 31, 2026

Related Documentation

Checked 7 published document(s) in 1 knowledge base(s). No updates required.



mergify bot commented Jan 31, 2026

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:

BBC-Esq changed the title from "add fast asr backend" to "Add fast ASR backend" on Jan 31, 2026
BBC-Esq force-pushed the add-fast-asr-backend branch from 3f7b50a to 85b6fef on January 31, 2026 05:29
BBC-Esq changed the title from "Add fast ASR backend" to "feat: add fast ASR backend" on Jan 31, 2026
Previously it either chose MLX or fell back to vanilla whisper.  This auto chooses whispers2t if the criteria are met.

Signed-off-by: Chintella-Esq <bbc@chintellalaw.com>

BBC-Esq commented Jan 31, 2026

Here goes nothing. I put a lot of time and effort into this PR and I jumped through all the freakin hoops as far as certifying the PR, correcting the syntax... Hopefully big-company IBM does not equal a shitty customer experience and/or online culture when simply trying to contribute to a cool idea. I don't even see a freakin button to request a review from whoever maintains this repo... Oh well, here goes nothing!

BBC-Esq left a comment

I'm trying my best to adhere to the multiple bots and their instructions...any help would be much appreciated. Thanks. lol.


ceberam commented Feb 2, 2026

Hi @BBC-Esq
Thanks for taking the time to put together this PR! It’s clear you’ve invested a lot of effort, and we appreciate that.

We’ll review it shortly and let you know if anything needs clarification or further work.
For detailed guidance on contributions, please refer to the Docling contribution guidelines, the Docling developer guideline, and also the Code of Conduct.
In particular, you may want to check the section Coding Style Guidelines to learn how pre-commit hooks can help when creating new commits.
For now, I would suggest that you keep the focus of this PR on the feature you are proposing (whisper-s2t-reborn support) and leave out optional styling changes on non-related files.

We take all contributions seriously and will keep you posted on the status of this PR. If you have any questions in the meantime, just drop a comment here or open an issue.

Thanks again for your work!

ceberam added the asr (Issues related to ASR (Automatic Speech Recognition)) label on Feb 2, 2026

BBC-Esq commented Feb 2, 2026


Thanks for your response. I can try to change the PR to remove the unrelated scripts if need be. They were included by accident when I ran ruff on my entire fork's codebase...or feel free to modify the PR to the discrete topic at hand. I was taken aback by the number of bots, all with different criteria, but I'll review the links you sent.


BBC-Esq commented Feb 11, 2026

Any chance this could be reviewed?


BBC-Esq commented Feb 16, 2026


Hello, you said you would be following up shortly. Is there anything preventing this from being merged? @ceberam

@PeterStaar-IBM

@BBC-Esq let me follow up. We were right in the middle of a big refactoring to clean up the pipelines/stages/models/etc (see the merge conflicts). This is now getting finalized.

ceberam left a comment

Thanks again @BBC-Esq for the contribution and for exploring improvements in the ASR backend.
I’ve reviewed the PR and I’m not comfortable with the current form for several reasons:

1. Licensing

The repository currently does not contain a LICENSE file. Without an explicit open-source license, there is no clear legal grant of rights to use, modify, or redistribute the code.
Since this project is part of the LF AI & Data, we need all dependencies to have a clear, identifiable open-source license. Even though this repository appears to be a fork of WhisperS2T (which is MIT-licensed), that license must be explicitly included in the forked repository for it to be valid for redistribution.
Until this is clarified and properly addressed upstream, we cannot safely introduce this dependency.

2. Project health

The repository currently has:

  • No visible community adoption (0 stars/forks)
  • Limited signals of review
  • No documented benchmark results supporting the performance claims

If performance is the main motivation, it would help to include reproducible benchmarks or comparative results demonstrating the advantage over existing ASR backend models. Without that, it is difficult to justify introducing an additional dependency with unclear maintenance status.

3. PR scope and conflicts

This PR also includes styling and formatting changes across multiple files that are unrelated to the new dependency. While some of those changes may be acceptable in isolation (e.g., I also prefer str | None to Optional[str]), they are out of scope for this feature and make the review more difficult. Our CONTRIBUTING.md guidelines already describe how to run the automatic style checks locally, which should prevent unrelated formatting diffs. In addition, as pointed out by @PeterStaar-IBM , recent changes have introduced some merge conflicts that should be addressed once the refactoring has finalized.

I would be happy to re-review again once:

  • The upstream repository clearly includes an appropriate open-source license.
  • There is objective evidence supporting the claimed performance benefits.
  • The PR is cleaned up to only include changes strictly required for the ASR feature.
  • The merge conflicts are resolved.

Thanks again for the contribution and for understanding the need to keep compliance, scope, and maintainability in mind.


BBC-Esq commented Feb 18, 2026

@ceberam I appreciate such a thorough and professional response. I, too, care about the code base and your message makes sense.

  1. Regarding the license issue, I'll fix that this week. Thanks for the link to LF AI & Data; I didn't know about that. I'll look into actually joining the org...

  2. Regarding issue 2, I maintain the WhisperS2T-Reborn project and it's actively maintained. Other than TensorRT, it is the fastest library currently available for running Whisper models, with comparable output quality. It's been approximately two years since I last benchmarked, but I'll update my benchmarking suite, update the repo, and post a link here for review.

Regarding "unclear maintenance" status, I totally get that. You don't want to modify your guys' code base, add a dependency (even an optional one), and then have a part of your code base outdated because someone abandons WhisperS2T-Reborn. At a minimum, it would just be a hassle to unwind the "integration" so to speak.

Since I maintain WhisperS2T-Reborn, it logically makes sense to explain about my background a little to hopefully assuage your concern.

First off, I'm a lawyer by trade but I program as a hobby. I originally became fascinated by LLMs and machine learning upon learning about "vector databases" and how they enable the searchability of massive amounts of data; for example, data obtained during discovery in a lawsuit. In my profession, there are frequently large amounts of data in the form of text messages, emails, audio recordings, video clips, etc.

As an offshoot of that, transcribing audio and then being able to search it "semantically" is a huge benefit for litigation purposes. It means that I no longer have to listen to hours of audio to pinpoint the minute and second where something important was said, which ultimately saves the client billables. This is essentially what led me to learn about Whisper models. The "vanilla" Whisper code, while revolutionary at the time, is painstakingly slow, which led me to faster backends like CTranslate2, upon which WhisperS2T-Reborn is based. That was 3-4 years ago, and I've been working with Whisper, vector databases, and related technologies ever since. I was frequently in contact with the maintainer of the original "WhisperS2T" repository until he abandoned it in 2024 after being offered a high-paying job, presumably having used WhisperS2T to pad his resume. As you know, this frequently happens - i.e. people use their GitHub repositories to pad their resumes to land a high-paying job. Nothing wrong with that, but often as part of a new job you can't work on competing open source projects and/or just don't have the time.

At any rate, I decided to maintain it since it was slowly becoming incompatible with other APIs such as transformers as their code bases improved/changed.

I also maintain the models that WhisperS2T-Reborn relies upon by default, which you can see here on huggingface. I'd like to highlight that all of the models are converted from the original "float32" Whisper models, NOT the "float16" versions that OpenAI has decided to upload for better or worse. See the config here to see what I'm referring to.

Originally, OpenAI uploaded float32 versions of all models but then replaced them with float16 versions. I disagreed with this because, yes, float16 is compatible with more GPUs and the file size is smaller, but if you're running on CPU it will just have to be upcast to float32 anyway, and you cannot regain the quality lost in the initial conversion from float32 to float16. Moreover, more recent GPUs support bfloat16, and people wishing for the additional stability that bfloat16 offers won't get it: when converting a float16 model to bfloat16 you cannot regain the lost precision.

In short, all of my models are converted from the original float32 versions, so there is no quality loss. The WhisperS2T-Reborn library can choose the best model based on a user's hardware: for GPUs that support bfloat16 it will choose bfloat16, and for older GPUs it will fall back to float16. CPUs use float32, or a user can still choose float32 on GPU for maximum quality.
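
Conceptually, the compute-type selection is something like the following rough sketch (not the library's actual code; the real logic may differ in detail):

import torch

def pick_compute_type() -> str:
    # Prefer bfloat16 on GPUs that support it, fall back to float16 on older
    # GPUs, and use float32 on CPU.
    if torch.cuda.is_available():
        return "bfloat16" if torch.cuda.is_bf16_supported() else "float16"
    return "float32"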

This required downloading the original float32 weights of all whisper models from the history of the various repositories, adding any updated configuration "JSON" files (e.g. if OpenAI found a bug in something), then converting them to the Ctranslate2 format.
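
For anyone curious, converting one of the original float32 checkpoints to the CTranslate2 format goes roughly like this (a sketch using CTranslate2's Transformers converter; the model ID and output directory are just examples):

from ctranslate2.converters import TransformersConverter

# Convert the original float32 weights; quantization can be "float32", "float16",
# "bfloat16", or "int8" depending on the target hardware.
converter = TransformersConverter("openai/whisper-large-v3")
converter.convert("whisper-large-v3-ct2-float32", quantization="float32")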

My repository provides float32, float16, and bfloat16 versions of all Whisper models, and the CTranslate2 backend also supports int8 if that's desirable.

Moreover, CTranslate2 has an excellent feature whereby it will convert at runtime if need be. For example, if you accidentally downloaded a bfloat16 model but your GPU only supports float16, it will automatically convert and run in float16 without a hitch. Granted, there's a very small quality loss, but at least it will not throw an error.

CTranslate2 also recently adopted ROCm support in addition to CUDA, and it already supports fast execution on both Intel and AMD CPUs.

Incidentally, there's a fascinating video with Ben Fletcher regarding document parsing and RAG versus traditional similarity searches like TF-IDF, BM25, and so on: https://www.youtube.com/watch?v=MdF--Fz0k4I&t=214s. Maybe you guys know him; he used to work at IBM.

Basically, I've been working with Whisper models since approximately 2023 and don't plan to quit anytime soon, if that's any assurance regarding the "health" of the WhisperS2T-Reborn library.

Regarding item 3, I'll redo the PR so it only pertains to the subject matter. It's just that without those additional changes your checks throw errors; I'd be willing to do a separate PR to resolve those errors, however. I'll also run the style checks locally as you suggest.

Cheers!


Labels

asr Issues related to ASR (Automatic Speech Recognition)
