Commit 1c6f628

Merge pull request #67 from richardr1126/v1.1.0
v1.1.0
2 parents ac0b47d + a755a0b commit 1c6f628

46 files changed, with 2407 additions and 909 deletions

Dockerfile

Lines changed: 41 additions & 5 deletions
Original file line number | Diff line number | Diff line change
@@ -1,8 +1,18 @@
1-
# Use Node.js slim image
2-
FROM node:current-alpine
1+
# Stage 1: build whisper.cpp (no model download – the app handles that)
2+
FROM alpine:3.20 AS whisper-builder
3+
4+
RUN apk add --no-cache git cmake build-base
5+
6+
WORKDIR /opt
7+
8+
RUN git clone --depth 1 https://github.com/ggml-org/whisper.cpp.git && \
9+
cd whisper.cpp && \
10+
cmake -B build && \
11+
cmake --build build -j --config Release
312

4-
# Add ffmpeg and libreoffice using Alpine package manager
5-
RUN apk add --no-cache ffmpeg libreoffice-writer
13+
14+
# Stage 2: build the Next.js app
15+
FROM node:lts-alpine AS app-builder
616

717
# Install pnpm globally
818
RUN npm install -g pnpm
@@ -23,8 +33,34 @@ COPY . .
2333
RUN pnpm exec next telemetry disable
2434
RUN pnpm build
2535

36+
37+
# Stage 3: minimal runtime image
38+
FROM node:current-alpine AS runner
39+
40+
# Add runtime OS dependencies:
41+
# - ffmpeg: required for audiobook export and word-by-word alignment (/api/whisper)
42+
# - libreoffice-writer: required for DOCX → PDF conversion
43+
RUN apk add --no-cache ffmpeg libreoffice-writer
44+
45+
# Install pnpm globally for running the app
46+
RUN npm install -g pnpm
47+
48+
# App runtime directory
49+
WORKDIR /app
50+
51+
# Copy built app and dependencies from the builder stage
52+
COPY --from=app-builder /app ./
53+
54+
# Copy the compiled whisper.cpp build output into the runtime image
55+
# (includes whisper-cli and its shared libraries, e.g. libwhisper.so, libggml.so)
56+
COPY --from=whisper-builder /opt/whisper.cpp/build /opt/whisper.cpp/build
57+
58+
# Point the app at the compiled whisper-cli binary and ensure its libs are discoverable
59+
ENV WHISPER_CPP_BIN=/opt/whisper.cpp/build/bin/whisper-cli
60+
ENV LD_LIBRARY_PATH=/opt/whisper.cpp/build
61+
2662
# Expose the port the app runs on
2763
EXPOSE 3003
2864

2965
# Start the application
30-
CMD ["pnpm", "start"]
66+
CMD ["pnpm", "start"]
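
The runtime stage above exposes the compiled binary through `WHISPER_CPP_BIN`. As a minimal sketch of how server code might consume that variable, the helper below resolves the path and reports whether word-by-word alignment can be enabled. The function name `resolveWhisperBin` and the injected `exists` check are hypothetical and are not taken from this commit.

```typescript
import { existsSync } from 'node:fs';

// Hypothetical helper (not from the commit): resolve the whisper-cli path
// from an environment map. Returns null when the variable is unset, empty,
// or points at a file that does not exist.
function resolveWhisperBin(
  env: Record<string, string | undefined>,
  exists: (path: string) => boolean = existsSync,
): string | null {
  const bin = env.WHISPER_CPP_BIN?.trim();
  if (!bin) return null;
  return exists(bin) ? bin : null;
}

// Example: gate the optional feature on the resolved path.
const whisperBin = resolveWhisperBin(process.env);
console.log(
  whisperBin
    ? `whisper-cli found at ${whisperBin}`
    : 'WHISPER_CPP_BIN not usable; word-by-word alignment disabled',
);
```

Passing the existence check as a parameter keeps the helper easy to test without touching the filesystem.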

README.md

Lines changed: 19 additions & 45 deletions
@@ -11,65 +11,25 @@
1111

1212
OpenReader WebUI is an open source text to speech document reader web app built using Next.js, offering a TTS read along experience with narration for **EPUB, PDF, TXT, MD, and DOCX documents**. It supports multiple TTS providers including OpenAI, Deepinfra, and custom OpenAI-compatible endpoints like [Kokoro-FastAPI](https://github.com/remsky/Kokoro-FastAPI) and [Orpheus-FastAPI](https://github.com/Lex-au/Orpheus-FastAPI)
1313

14-
- 🧠 *(New)* **Smart Sentence-Aware Narration** merges sentences across pages/chapters for smoother TTS
15-
- 🎧 *(New)* **Reliable Audiobook Export** in **m4b/mp3**, with resumable, chapter-based export and regeneration
1614
- 🎯 *(New)* **Multi-Provider TTS Support**
1715
- [**Kokoro-FastAPI**](https://github.com/remsky/Kokoro-FastAPI): Supporting multi-voice combinations (like `af_heart+af_bella`)
1816
- [**Orpheus-FastAPI**](https://github.com/Lex-au/Orpheus-FastAPI)
1917
- **Custom OpenAI-compatible**: Any TTS API with `/v1/audio/voices` and `/v1/audio/speech` endpoints
2018
- **Cloud TTS Providers (requiring API keys)**
2119
- [**Deepinfra**](https://deepinfra.com/models/text-to-speech): Kokoro-82M + models with support for cloned voices and more
2220
- [**OpenAI API ($$)**](https://platform.openai.com/docs/pricing#transcription-and-speech): tts-1, tts-1-hd, and gpt-4o-mini-tts w/ instructions
23-
- 🚀 *(New)* **Optimized Next.js TTS Proxy** with audio caching and optimized repeat playback
24-
- 💾 *(Updated)* **Local-First Architecture** stores documents and more in-browser with Dexie.js
2521
- 📖 *(Updated)* **Read Along Experience** providing real-time text highlighting during playback (PDF/EPUB)
22+
- *(New)* **Word-by-word** highlighting uses word-by-word timestamps generated server-side with [*whisper.cpp*](https://github.com/ggml-org/whisper.cpp) (optional)
23+
- 🧠 *(New)* **Smart Sentence-Aware Narration** merges sentences across pages/chapters for smoother TTS
24+
- 🎧 *(New)* **Reliable Audiobook Export** in **m4b/mp3**, with resumable, chapter-based export and regeneration
25+
- 🚀 *(New)* **Optimized Next.js TTS Proxy** with audio caching and optimized repeat playback
26+
- 💾 **Local-First Architecture** stores documents and more in-browser with Dexie.js
2627
- 🛜 **Optional Server-side documents** using backend `/docstore` for all users
2728
- 🎨 **Customizable Experience**
2829
- 🎨 Multiple app theme options
2930
- ⚙️ Various TTS and document handling settings
3031
- And more ...
3132

32-
<details>
33-
<summary>
34-
35-
### 🆕 What's New in v1.0.0
36-
37-
</summary>
38-
39-
- 🧠 **Smart sentence continuation**
40-
- Improved NLP handling of complex structures and quoted dialogue provides more natural sentence boundaries and a smoother audio-text flow.
41-
- EPUB and PDF playback now use smarter sentence splitting and continuation metadata so sentences that cross page/chapter boundaries are merged before hitting the TTS API.
42-
- This yields more natural narration and fewer awkward pauses when a sentence spans multiple pages or EPUB spine items.
43-
- 📄 **Modernized PDF text highlighting pipeline**
44-
- Real-time PDF text highlighting is now offloaded to a dedicated Web Worker so scrolling and playback controls remain responsive during narration.
45-
- A new overlay-based highlighting system draws independent highlight layers on top of the PDF, avoiding interference with the underlying text layer.
46-
- Upgraded fuzzy matching with Dice-based similarity improves the accuracy of mapping spoken words to on-screen text.
47-
- A new per-device setting lets you enable or disable real-time PDF highlighting during playback for a more tailored reading experience.
48-
- 🎧 **Chapter/page-based audiobook export with resume & regeneration**
49-
- Per-chapter/per-page generation to disk with persistent `bookId`
50-
- Resumable generation (can cancel and continue later)
51-
- Per-chapter regeneration & deletion
52-
- Final combined **M4B** or **MP3** download with embedded chapter metadata.
53-
- 💾 **Dexie-backed local storage & sync**
54-
- All document types (PDF, EPUB, TXT/MD-as-HTML) and config are stored via a unified Dexie layer on top of IndexedDB.
55-
- Document lists use live Dexie queries (no manual refresh needed), and server sync now correctly includes text/markdown documents as part of the library backup.
56-
- 🗣️ **Kokoro multi-voice selection & utilities**
57-
- Kokoro models now support multi-voice combination, with provider-aware limits and helpers (not supported on OpenAI or Deepinfra)
58-
- **Faster, more efficient TTS backend proxy**
59-
- In-memory **LRU caching** for audio responses with configurable size/TTL
60-
- **ETag** support (`304` on cache hits) + `X-Cache` headers (`HIT` / `MISS` / `INFLIGHT`)
61-
- 📄 **More robust DOCX → PDF conversion**
62-
- DOCX conversion now uses isolated per-job LibreOffice profiles and temp directories, polls for a stable output file size, and aggressively cleans up temp files.
63-
- This reduces cross-job interference and flakiness when converting multiple DOCX files in parallel.
64-
- **Accessibility & layout improvements**
65-
- Dialogs and folder toggles expose proper roles and ARIA attributes.
66-
- PDF/EPUB/HTML readers use a full-height app shell with a sticky bottom TTS bar, improved scrollbars, and refined focus styles.
67-
- **End-to-end Playwright test suite with TTS mocks**
68-
- Deterministic TTS responses in tests via a reusable Playwright route mock.
69-
- Coverage for accessibility, upload, navigation, folder management, deletion flows, audiobook generation/export and playback across all document types.
70-
71-
</details>
72-
7333
## 🐳 Docker Quick Start
7434

7535
### Prerequisites
@@ -194,6 +154,20 @@ Optionally required for different features:
194154
```bash
195155
brew install libreoffice
196156
```
157+
- [whisper.cpp](https://github.com/ggml-org/whisper.cpp) (optional, required for word-by-word highlighting)
158+
```bash
159+
# clone and build whisper.cpp (no model download needed – OpenReader handles that)
160+
git clone https://github.com/ggml-org/whisper.cpp.git
161+
cd whisper.cpp
162+
cmake -B build
163+
cmake --build build -j --config Release
164+
165+
# point OpenReader to the compiled whisper-cli binary
166+
echo WHISPER_CPP_BIN=\"$(pwd)/build/bin/whisper-cli\"
167+
```
168+
169+
> **Note:** The `WHISPER_CPP_BIN` path should be set in your `.env` file for OpenReader to use word-by-word highlighting features.
170+
197171
### Steps
198172

199173
1. Clone the repository:

package.json

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
11
{
22
"name": "openreader-webui",
3-
"version": "v1.0.1",
3+
"version": "v1.1.0",
44
"private": true,
55
"scripts": {
66
"dev": "next dev --turbopack -p 3003",
File renamed without changes.
Lines changed: 38 additions & 7 deletions
@@ -1,15 +1,16 @@
11
import { NextRequest, NextResponse } from 'next/server';
22
import { spawn } from 'child_process';
3-
import { writeFile, readFile, mkdir, unlink, readdir } from 'fs/promises';
3+
import { writeFile, readFile, mkdir, unlink, readdir, rm } from 'fs/promises';
44
import { existsSync, createReadStream } from 'fs';
55
import { join } from 'path';
66
import { randomUUID } from 'crypto';
7+
import type { TTSAudioBytes, TTSAudiobookFormat } from '@/types/tts';
78

89
interface ConversionRequest {
910
chapterTitle: string;
10-
buffer: number[];
11+
buffer: TTSAudioBytes;
1112
bookId?: string;
12-
format?: 'mp3' | 'm4b';
13+
format?: TTSAudiobookFormat;
1314
chapterIndex?: number;
1415
}
1516

@@ -206,9 +207,12 @@ export async function POST(request: NextRequest) {
206207
await unlink(inputPath).catch(console.error);
207208

208209
return NextResponse.json({
210+
index: chapterIndex,
211+
title: data.chapterTitle,
212+
duration,
213+
status: 'completed' as const,
209214
bookId,
210-
chapterIndex,
211-
duration
215+
format
212216
});
213217

214218
} catch (error) {
@@ -229,7 +233,7 @@ export async function POST(request: NextRequest) {
229233
export async function GET(request: NextRequest) {
230234
try {
231235
const bookId = request.nextUrl.searchParams.get('bookId');
232-
const requestedFormat = request.nextUrl.searchParams.get('format') as 'mp3' | 'm4b' | null;
236+
const requestedFormat = request.nextUrl.searchParams.get('format') as TTSAudiobookFormat | null;
233237
if (!bookId) {
234238
return NextResponse.json({ error: 'Missing bookId parameter' }, { status: 400 });
235239
}
@@ -378,4 +382,31 @@ function streamFile(filePath: string, format: string) {
378382
'Cache-Control': 'no-cache',
379383
},
380384
});
381-
}
385+
}
386+
export async function DELETE(request: NextRequest) {
387+
try {
388+
const bookId = request.nextUrl.searchParams.get('bookId');
389+
if (!bookId) {
390+
return NextResponse.json({ error: 'Missing bookId parameter' }, { status: 400 });
391+
}
392+
393+
const docstoreDir = join(process.cwd(), 'docstore');
394+
const intermediateDir = join(docstoreDir, `${bookId}-audiobook`);
395+
396+
// If directory doesn't exist, consider it already reset
397+
if (!existsSync(intermediateDir)) {
398+
return NextResponse.json({ success: true, existed: false });
399+
}
400+
401+
// Recursively delete the entire audiobook directory
402+
await rm(intermediateDir, { recursive: true, force: true });
403+
404+
return NextResponse.json({ success: true, existed: true });
405+
} catch (error) {
406+
console.error('Error resetting audiobook:', error);
407+
return NextResponse.json(
408+
{ error: 'Failed to reset audiobook' },
409+
{ status: 500 }
410+
);
411+
}
412+
}
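
In the handlers above, `bookId` and `format` both come straight from the query string: `format` is cast with `as TTSAudiobookFormat | null`, and `bookId` is interpolated into a directory name that `rm` later deletes recursively. A hedged sketch of validating both values first is shown below. The helper names are hypothetical, the allowed formats are assumed to be the `'mp3' | 'm4b'` pair the old inline type spelled out, and the `bookId` pattern assumes UUID-like identifiers.

```typescript
import { join, resolve, sep } from 'node:path';

// Assumption: the two formats the previous inline union listed.
const AUDIOBOOK_FORMATS = ['mp3', 'm4b'] as const;
type AudiobookFormat = (typeof AUDIOBOOK_FORMATS)[number];

// Hypothetical runtime guard replacing an unchecked `as` cast.
function parseAudiobookFormat(value: string | null): AudiobookFormat | null {
  return (AUDIOBOOK_FORMATS as readonly string[]).includes(value ?? '')
    ? (value as AudiobookFormat)
    : null;
}

// Hypothetical helper: build `<docstore>/<bookId>-audiobook`, returning null
// when the id contains anything other than letters, digits, '_' or '-',
// or when the resolved path would fall outside the docstore root.
function audiobookDir(docstoreDir: string, bookId: string): string | null {
  if (!/^[A-Za-z0-9_-]+$/.test(bookId)) return null;
  const root = resolve(docstoreDir);
  const dir = resolve(join(root, `${bookId}-audiobook`));
  return dir.startsWith(root + sep) ? dir : null;
}
```

A handler could return a `400` response when either helper yields `null`, before any filesystem call is made.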

src/app/api/audio/convert/chapters/route.ts renamed to src/app/api/audiobook/status/route.ts

Lines changed: 3 additions & 30 deletions
@@ -1,7 +1,8 @@
11
import { NextRequest, NextResponse } from 'next/server';
2-
import { readdir, readFile, rm } from 'fs/promises';
2+
import { readdir, readFile } from 'fs/promises';
33
import { existsSync } from 'fs';
44
import { join } from 'path';
5+
import type { TTSAudiobookFormat } from '@/types/tts';
56

67
export async function GET(request: NextRequest) {
78
try {
@@ -26,7 +27,7 @@ export async function GET(request: NextRequest) {
2627
duration?: number;
2728
status: 'completed' | 'error';
2829
bookId: string;
29-
format?: 'mp3' | 'm4b';
30+
format?: TTSAudiobookFormat;
3031
}> = [];
3132

3233
for (const metaFile of metaFiles) {
@@ -68,31 +69,3 @@ export async function GET(request: NextRequest) {
6869
);
6970
}
7071
}
71-
72-
export async function DELETE(request: NextRequest) {
73-
try {
74-
const bookId = request.nextUrl.searchParams.get('bookId');
75-
if (!bookId) {
76-
return NextResponse.json({ error: 'Missing bookId parameter' }, { status: 400 });
77-
}
78-
79-
const docstoreDir = join(process.cwd(), 'docstore');
80-
const intermediateDir = join(docstoreDir, `${bookId}-audiobook`);
81-
82-
// If directory doesn't exist, consider it already reset
83-
if (!existsSync(intermediateDir)) {
84-
return NextResponse.json({ success: true, existed: false });
85-
}
86-
87-
// Recursively delete the entire audiobook directory
88-
await rm(intermediateDir, { recursive: true, force: true });
89-
90-
return NextResponse.json({ success: true, existed: true });
91-
} catch (error) {
92-
console.error('Error resetting audiobook:', error);
93-
return NextResponse.json(
94-
{ error: 'Failed to reset audiobook' },
95-
{ status: 500 }
96-
);
97-
}
98-
}

src/app/api/tts/route.ts

Lines changed: 9 additions & 8 deletions
@@ -4,7 +4,8 @@ import { SpeechCreateParams } from 'openai/resources/audio/speech.mjs';
44
import { isKokoroModel } from '@/utils/voice';
55
import { LRUCache } from 'lru-cache';
66
import { createHash } from 'crypto';
7-
import type { TTSRequestPayload, TTSError } from '@/types/tts';
7+
import type { TTSRequestPayload } from '@/types/client';
8+
import type { TTSError, TTSAudioBuffer } from '@/types/tts';
89

910
export const runtime = 'nodejs';
1011

@@ -13,7 +14,7 @@ type ExtendedSpeechParams = Omit<SpeechCreateParams, 'voice'> & {
1314
voice: SpeechCreateParams['voice'] | CustomVoice;
1415
instructions?: string;
1516
};
16-
type AudioBufferValue = ArrayBuffer;
17+
type AudioBufferValue = TTSAudioBuffer;
1718

1819
const TTS_CACHE_MAX_SIZE_BYTES = Number(process.env.TTS_CACHE_MAX_SIZE_BYTES || 256 * 1024 * 1024); // 256MB
1920
const TTS_CACHE_TTL_MS = Number(process.env.TTS_CACHE_TTL_MS || 1000 * 60 * 30); // 30 minutes
@@ -25,7 +26,7 @@ const ttsAudioCache = new LRUCache<string, AudioBufferValue>({
2526
});
2627

2728
type InflightEntry = {
28-
promise: Promise<ArrayBuffer>;
29+
promise: Promise<TTSAudioBuffer>;
2930
controller: AbortController;
3031
consumers: number;
3132
};
@@ -40,7 +41,7 @@ async function fetchTTSBufferWithRetry(
4041
openai: OpenAI,
4142
createParams: ExtendedSpeechParams,
4243
signal: AbortSignal
43-
): Promise<ArrayBuffer> {
44+
): Promise<TTSAudioBuffer> {
4445
let attempt = 0;
4546
const maxRetries = Number(process.env.TTS_MAX_RETRIES ?? 2);
4647
let delay = Number(process.env.TTS_RETRY_INITIAL_MS ?? 250);
@@ -135,15 +136,15 @@ export async function POST(req: NextRequest) {
135136
voice: normalizedVoice,
136137
input: text,
137138
speed: speed,
138-
response_format: format === 'aac' ? 'aac' : 'mp3',
139+
response_format: format,
139140
};
140141
// Only add instructions if model is gpt-4o-mini-tts and instructions are provided
141142
if ((model as string) === 'gpt-4o-mini-tts' && instructions) {
142143
createParams.instructions = instructions;
143144
}
144145

145146
// Compute cache key and check LRU before making provider call
146-
const contentType = format === 'aac' ? 'audio/aac' : 'audio/mpeg';
147+
const contentType = 'audio/mpeg';
147148

148149
// Preserve voice string as-is for cache key (no weight stripping)
149150
const voiceForKey = typeof createParams.voice === 'string'
@@ -245,7 +246,7 @@ export async function POST(req: NextRequest) {
245246
};
246247
req.signal.addEventListener('abort', onAbort, { once: true });
247248

248-
let buffer: ArrayBuffer;
249+
let buffer: TTSAudioBuffer;
249250
try {
250251
buffer = await entry.promise;
251252
} finally {
@@ -280,4 +281,4 @@ export async function POST(req: NextRequest) {
280281
{ status: 500 }
281282
);
282283
}
283-
}
284+
}
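
In this file's diff, `response_format` now passes `format` through unchanged while `contentType` is fixed to `'audio/mpeg'`. If the route is expected to return formats other than MP3, a lookup along the following lines could keep the response header in step with the audio. This is a sketch only: the helper name and the set of formats are assumptions, since the commit does not show which values `format` may take.

```typescript
// Assumption: an illustrative set of formats; the commit does not list
// which values the route accepts.
const CONTENT_TYPES = new Map<string, string>([
  ['mp3', 'audio/mpeg'],
  ['aac', 'audio/aac'],
  ['wav', 'audio/wav'],
  ['flac', 'audio/flac'],
]);

// Hypothetical helper: map a requested format to a Content-Type value,
// falling back to MP3 when the format is missing or unrecognized.
function contentTypeFor(format: string | undefined): string {
  return CONTENT_TYPES.get(format ?? 'mp3') ?? 'audio/mpeg';
}

console.log(contentTypeFor('aac'));
console.log(contentTypeFor(undefined));
```

A `Map` is used rather than a plain object so that keys such as `constructor` cannot match inherited properties.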
