Commit 372c65f

feat(whisper): integrate binary with build and docs
The Dockerfile has been refactored to a multi-stage build, allowing the `whisper.cpp` CLI binary to be compiled and embedded within the application's runtime image. This enables word-by-word highlighting functionality when deployed via Docker. The `README.md` has been updated to include installation and configuration instructions for `whisper.cpp` when running locally. Additionally, the `WHISPER_CPP_BIN` environment variable has been added to `template.env` and the package version has been bumped to v1.1.0.
1 parent b576910 commit 372c65f

4 files changed: 65 additions, 52 deletions


Dockerfile

Lines changed: 41 additions & 5 deletions
@@ -1,8 +1,18 @@
-# Use Node.js slim image
-FROM node:current-alpine
+# Stage 1: build whisper.cpp (no model download – the app handles that)
+FROM alpine:3.20 AS whisper-builder
+
+RUN apk add --no-cache git cmake build-base
+
+WORKDIR /opt
+
+RUN git clone --depth 1 https://github.com/ggml-org/whisper.cpp.git && \
+    cd whisper.cpp && \
+    cmake -B build && \
+    cmake --build build -j --config Release
 
-# Add ffmpeg and libreoffice using Alpine package manager
-RUN apk add --no-cache ffmpeg libreoffice-writer
+
+# Stage 2: build the Next.js app
+FROM node:lts-alpine AS app-builder
 
 # Install pnpm globally
 RUN npm install -g pnpm
@@ -23,8 +33,34 @@ COPY . .
 RUN pnpm exec next telemetry disable
 RUN pnpm build
 
+
+# Stage 3: minimal runtime image
+FROM node:current-alpine AS runner
+
+# Add runtime OS dependencies:
+# - ffmpeg: required for audiobook export and word-by-word alignment (/api/whisper)
+# - libreoffice-writer: required for DOCX → PDF conversion
+RUN apk add --no-cache ffmpeg libreoffice-writer
+
+# Install pnpm globally for running the app
+RUN npm install -g pnpm
+
+# App runtime directory
+WORKDIR /app
+
+# Copy built app and dependencies from the builder stage
+COPY --from=app-builder /app ./
+
+# Copy the compiled whisper.cpp build output into the runtime image
+# (includes whisper-cli and its shared libraries, e.g. libwhisper.so, libggml.so)
+COPY --from=whisper-builder /opt/whisper.cpp/build /opt/whisper.cpp/build
+
+# Point the app at the compiled whisper-cli binary and ensure its libs are discoverable
+ENV WHISPER_CPP_BIN=/opt/whisper.cpp/build/bin/whisper-cli
+ENV LD_LIBRARY_PATH=/opt/whisper.cpp/build
+
 # Expose the port the app runs on
 EXPOSE 3003
 
 # Start the application
-CMD ["pnpm", "start"]
+CMD ["pnpm", "start"]
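
The runtime stage sets `LD_LIBRARY_PATH` to the top of the copied build tree, and the Dockerfile comment says that tree holds shared libraries such as `libwhisper.so` and `libggml.so`. The diff does not show which subdirectories those libraries end up in. The sketch below is one way to check, assuming a POSIX shell with `find`, `sort`, and `paste` available; `lib_dirs` is a hypothetical helper and not part of this commit.

```bash
# Hypothetical helper (not part of the commit): print a colon-separated,
# sorted list of every directory under a build tree that contains a shared
# library (a regular file or symlink whose name matches *.so*).
lib_dirs() {
    root=$1
    find "$root" \( -type f -o -type l \) -name '*.so*' -exec dirname {} \; \
        | sort -u \
        | paste -s -d: -
}
```

Running `lib_dirs /opt/whisper.cpp/build` inside the built image would show which directories actually hold the libraries, so the result can be compared with the `LD_LIBRARY_PATH` value above.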

README.md

Lines changed: 19 additions & 45 deletions
@@ -11,65 +11,25 @@
 
 OpenReader WebUI is an open source text to speech document reader web app built using Next.js, offering a TTS read along experience with narration for **EPUB, PDF, TXT, MD, and DOCX documents**. It supports multiple TTS providers including OpenAI, Deepinfra, and custom OpenAI-compatible endpoints like [Kokoro-FastAPI](https://github.com/remsky/Kokoro-FastAPI) and [Orpheus-FastAPI](https://github.com/Lex-au/Orpheus-FastAPI)
 
-- 🧠 *(New)* **Smart Sentence-Aware Narration** merges sentences across pages/chapters for smoother TTS
-- 🎧 *(New)* **Reliable Audiobook Export** in **m4b/mp3**, with resumable, chapter-based export and regeneration
 - 🎯 *(New)* **Multi-Provider TTS Support**
   - [**Kokoro-FastAPI**](https://github.com/remsky/Kokoro-FastAPI): Supporting multi-voice combinations (like `af_heart+af_bella`)
   - [**Orpheus-FastAPI**](https://github.com/Lex-au/Orpheus-FastAPI)
   - **Custom OpenAI-compatible**: Any TTS API with `/v1/audio/voices` and `/v1/audio/speech` endpoints
   - **Cloud TTS Providers (requiring API keys)**
     - [**Deepinfra**](https://deepinfra.com/models/text-to-speech): Kokoro-82M + models with support for cloned voices and more
     - [**OpenAI API ($$)**](https://platform.openai.com/docs/pricing#transcription-and-speech): tts-1, tts-1-hd, and gpt-4o-mini-tts w/ instructions
-- 🚀 *(New)* **Optimized Next.js TTS Proxy** with audio caching and optimized repeat playback
-- 💾 *(Updated)* **Local-First Architecture** stores documents and more in-browser with Dexie.js
 - 📖 *(Updated)* **Read Along Experience** providing real-time text highlighting during playback (PDF/EPUB)
+  - *(New)* **Word-by-word** highlighting uses word-by-word timestamps generated server-side with [*whisper.cpp*](https://github.com/ggml-org/whisper.cpp) (optional)
+- 🧠 *(New)* **Smart Sentence-Aware Narration** merges sentences across pages/chapters for smoother TTS
+- 🎧 *(New)* **Reliable Audiobook Export** in **m4b/mp3**, with resumable, chapter-based export and regeneration
+- 🚀 *(New)* **Optimized Next.js TTS Proxy** with audio caching and optimized repeat playback
+- 💾 **Local-First Architecture** stores documents and more in-browser with Dexie.js
 - 🛜 **Optional Server-side documents** using backend `/docstore` for all users
 - 🎨 **Customizable Experience**
   - 🎨 Multiple app theme options
   - ⚙️ Various TTS and document handling settings
   - And more ...
 
-<details>
-<summary>
-
-### 🆕 What's New in v1.0.0
-
-</summary>
-
-- 🧠 **Smart sentence continuation**
-  - Improved NLP handling of complex structures and quoted dialogue provides more natural sentence boundaries and a smoother audio-text flow.
-  - EPUB and PDF playback now use smarter sentence splitting and continuation metadata so sentences that cross page/chapter boundaries are merged before hitting the TTS API.
-  - This yields more natural narration and fewer awkward pauses when a sentence spans multiple pages or EPUB spine items.
-- 📄 **Modernized PDF text highlighting pipeline**
-  - Real-time PDF text highlighting is now offloaded to a dedicated Web Worker so scrolling and playback controls remain responsive during narration.
-  - A new overlay-based highlighting system draws independent highlight layers on top of the PDF, avoiding interference with the underlying text layer.
-  - Upgraded fuzzy matching with Dice-based similarity improves the accuracy of mapping spoken words to on-screen text.
-  - A new per-device setting lets you enable or disable real-time PDF highlighting during playback for a more tailored reading experience.
-- 🎧 **Chapter/page-based audiobook export with resume & regeneration**
-  - Per-chapter/per-page generation to disk with persistent `bookId`
-  - Resumable generation (can cancel and continue later)
-  - Per-chapter regeneration & deletion
-  - Final combined **M4B** or **MP3** download with embedded chapter metadata.
-- 💾 **Dexie-backed local storage & sync**
-  - All document types (PDF, EPUB, TXT/MD-as-HTML) and config are stored via a unified Dexie layer on top of IndexedDB.
-  - Document lists use live Dexie queries (no manual refresh needed), and server sync now correctly includes text/markdown documents as part of the library backup.
-- 🗣️ **Kokoro multi-voice selection & utilities**
-  - Kokoro models now support multi-voice combination, with provider-aware limits and helpers (not supported on OpenAI or Deepinfra)
-- **Faster, more efficient TTS backend proxy**
-  - In-memory **LRU caching** for audio responses with configurable size/TTL
-  - **ETag** support (`304` on cache hits) + `X-Cache` headers (`HIT` / `MISS` / `INFLIGHT`)
-- 📄 **More robust DOCX → PDF conversion**
-  - DOCX conversion now uses isolated per-job LibreOffice profiles and temp directories, polls for a stable output file size, and aggressively cleans up temp files.
-  - This reduces cross-job interference and flakiness when converting multiple DOCX files in parallel.
-- **Accessibility & layout improvements**
-  - Dialogs and folder toggles expose proper roles and ARIA attributes.
-  - PDF/EPUB/HTML readers use a full-height app shell with a sticky bottom TTS bar, improved scrollbars, and refined focus styles.
-- **End-to-end Playwright test suite with TTS mocks**
-  - Deterministic TTS responses in tests via a reusable Playwright route mock.
-  - Coverage for accessibility, upload, navigation, folder management, deletion flows, audiobook generation/export and playback across all document types.
-
-</details>
-
 ## 🐳 Docker Quick Start
 
 ### Prerequisites
@@ -194,6 +154,20 @@ Optionally required for different features:
   ```bash
   brew install libreoffice
   ```
+- [whisper.cpp](https://github.com/ggml-org/whisper.cpp) (optional, required for word-by-word highlighting)
+  ```bash
+  # clone and build whisper.cpp (no model download needed – OpenReader handles that)
+  git clone https://github.com/ggml-org/whisper.cpp.git
+  cd whisper.cpp
+  cmake -B build
+  cmake --build build -j --config Release
+
+  # point OpenReader to the compiled whisper-cli binary
+  echo WHISPER_CPP_BIN=\"$(pwd)/build/bin/whisper-cli\"
+  ```
+
+> **Note:** The `WHISPER_CPP_BIN` path should be set in your `.env` file for OpenReader to use word-by-word highlighting features.
+
 ### Steps
 
 1. Clone the repository:
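
The README snippet added here only prints the `WHISPER_CPP_BIN=...` line; copying it into `.env` is left to the reader. As a minimal sketch of that step, assuming a POSIX shell, the hypothetical `set_env_var` helper below (not part of OpenReader) replaces an existing definition of a key in an env file or appends one if it is missing.

```bash
# Hypothetical helper (not part of OpenReader): set KEY=VALUE in an env file,
# replacing any existing KEY line or appending a new one.
set_env_var() {
    file=$1
    key=$2
    value=$3
    touch "$file"
    scratch=$(mktemp)
    # Keep every line that does not define KEY; grep exits non-zero when it
    # selects no lines, which is not an error here.
    grep -v "^${key}=" "$file" > "$scratch" || true
    printf '%s=%s\n' "$key" "$value" >> "$scratch"
    # Write back in place so the original file's permissions are preserved.
    cat "$scratch" > "$file"
    rm -f "$scratch"
}
```

For example, `set_env_var /path/to/openreader/.env WHISPER_CPP_BIN "$(pwd)/build/bin/whisper-cli"`, run from the whisper.cpp checkout, would record the binary path; the `.env` location shown is a placeholder.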

package.json

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 {
   "name": "openreader-webui",
-  "version": "v1.0.1",
+  "version": "v1.1.0",
   "private": true,
   "scripts": {
     "dev": "next dev --turbopack -p 3003",

template.env

Lines changed: 4 additions & 1 deletion
@@ -5,4 +5,7 @@ API_KEY=api_key_here_if_needed
 
 # OpenAI API Base URL (default)
 # To use a local TTS model server, I suggest using https://github.com/remsky/Kokoro-FastAPI
-API_BASE=https://api.openai.com/v1
+API_BASE=https://api.openai.com/v1
+
+# Path to your local whisper.cpp CLI binary
+WHISPER_CPP_BIN=/whisper.cpp/build/bin/whisper-cli
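
The `WHISPER_CPP_BIN` value in the template is a default that each user is expected to adjust to their own build location. A hedged pre-flight check, assuming a POSIX shell, could confirm that the variable points at something runnable before the app starts; `whisper_bin_ok` is a hypothetical helper and not part of OpenReader.

```bash
# Hypothetical pre-flight check (not part of OpenReader): succeed only if
# WHISPER_CPP_BIN is set and names an executable regular file.
whisper_bin_ok() {
    [ -n "${WHISPER_CPP_BIN:-}" ] \
        && [ -f "$WHISPER_CPP_BIN" ] \
        && [ -x "$WHISPER_CPP_BIN" ]
}
```

A start script might call `whisper_bin_ok || echo "word-by-word highlighting unavailable" >&2` and continue, since the feature is described as optional.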
