| 1 | +# ADR-0001 Playback architecture for v1 |
| 2 | + |
| 3 | +Status: Accepted |
| 4 | +Date: 2025-11-10 |
| 5 | +Owner: @richardr1126 |
| 6 | +Related: |
| 7 | +- Plan checklist [docs/v1/todo.md](docs/v1/todo.md) |
| 8 | +- Issue triage and mapping [docs/v1/issues-to-components.md](docs/v1/issues-to-components.md) |
| 9 | + |
| 10 | +## Decision |
| 11 | +Adopt a new, single playback engine built around HTMLAudioElement with Media Source Extensions where available, replacing Howler entirely. Introduce a provider-agnostic TTS interface and document adapters that yield stable location tokens and sentence blocks. Replace the existing IndexedDB utility with Dexie.js for persistence and caching. Ship the new engine as a clean cutover without running dual engines. |
| 12 | + |
| 13 | +## Context |
| 14 | +The current 0.x implementation couples TTS, viewers, and playback control in ways that create fragile flows and race conditions. Adding a playback feature requires edits across multiple contexts, and playback is sensitive to timing between NLP, preloading, and the Howler lifecycle. Open issues highlight problems with large export downloads, dialog chunking, and PDF margin extraction, and request new features such as voice combination and chapter-based exports. |
| 15 | + |
| 16 | +Guiding constraints from v1 scope: |
| 17 | +- Streaming-first playback |
| 18 | +- Replace Howler |
| 19 | +- Dexie.js as client storage layer |
| 20 | +- Preserve m4b audiobook export and add chapter-based MP3 export |
| 21 | +- Keep server-side document sync |
| 22 | +- Browsers: Chrome, Firefox, Edge, Safari 16+ |
| 23 | + |
| 24 | +## Goals |
| 25 | +- Simplify the playback pipeline with a clear state machine and strict cancellation |
| 26 | +- Decouple document parsing from playback via adapters |
| 27 | +- Standardize provider integration behind a unified TTS interface |
| 28 | +- Improve resilience for long-running operations and large audio artifacts |
| 29 | +- Make preloading, skipping, and voice switching predictable and race-free |
| 30 | +- Persist user state and caches using Dexie repositories |
| 31 | + |
| 32 | +## Non goals |
| 33 | +- Running the legacy engine in parallel with v1 |
| 34 | +- Rewriting existing viewers wholesale beyond adapter wiring and highlighting seams |
| 35 | +- Guaranteeing true streaming for providers that only return whole file responses |
| 36 | + |
| 37 | +## Architecture overview |
| 38 | + |
| 39 | +```mermaid |
| 40 | +flowchart TD |
| 41 | + Views[PDF viewer EPUB viewer HTML viewer] --> Adapters[Document adapters] |
| 42 | + Adapters --> Splitter[Sentence splitter and mapping] |
| 43 | + Splitter --> Queue[Sentence queue and preloader] |
| 44 | + Queue --> Engine[Playback engine state machine] |
| 45 | + Engine --> Media[Media controller HTMLAudioElement MSE] |
| 46 | + Media --> Output[Audio output media session background handling] |
| 47 | + Engine --> Cache[Audio cache Dexie] |
| 48 | + Engine --> TTS[TTS providers OpenAI Deepinfra Custom] |
| 49 | + Engine --> Position[Resume position store] |
| 50 | +``` |
| 51 | + |
| 52 | +## Component responsibilities |
| 53 | + |
| 54 | +- Adapters |
| 55 | + - Yield text blocks plus stable locationToken |
| 56 | + - Handle next/previous navigation semantics per format |
| 57 | + - Provide highlight mapping strategies |
| 58 | + - Files: |
| 59 | + - [src/v1/adapters/DocumentAdapter.ts](src/v1/adapters/DocumentAdapter.ts) |
| 60 | + - [src/v1/adapters/PdfAdapter.ts](src/v1/adapters/PdfAdapter.ts) |
| 61 | + - [src/v1/adapters/EpubAdapter.ts](src/v1/adapters/EpubAdapter.ts) |
| 62 | + - [src/v1/adapters/HtmlAdapter.ts](src/v1/adapters/HtmlAdapter.ts) |
| 63 | + |
| 64 | +- NLP splitter |
| 65 | + - Builds sentence blocks with quote aware grouping |
| 66 | + - Exposes mapping to raw sentences for highlighting |
| 67 | + - Files: |
| 68 | + - [src/v1/nlp/sentences.ts](src/v1/nlp/sentences.ts) |
| 69 | + - Uses [src/utils/nlp.ts](src/utils/nlp.ts:1) |
| 70 | + |
| 71 | +- Playback engine |
| 72 | + - Drives state transitions, cancellation, preloading, and error handling |
| 73 | + - Integrates with MediaController and TTS providers |
| 74 | + - Files: |
| 75 | + - [src/v1/playback/state.ts](src/v1/playback/state.ts) |
| 76 | + - [src/v1/playback/queue.ts](src/v1/playback/queue.ts) |
| 77 | + - [src/v1/playback/engine.ts](src/v1/playback/engine.ts) |
| 78 | + - [src/v1/playback/hooks/usePlayback.ts](src/v1/playback/hooks/usePlayback.ts) |
| 79 | + |
| 80 | +- Media controller |
| 81 | + - Owns HTMLAudioElement lifecycle and Media Source Extensions when supported |
| 82 | + - Provides blob fallback and gapless segment chaining for Safari 16+ |
| 83 | + - Integrates media session and background visibility behaviors |
| 84 | + - Files: |
| 85 | + - [src/v1/playback/media/MediaController.ts](src/v1/playback/media/MediaController.ts) |
| 86 | + - [src/v1/playback/media/mediaSession.ts](src/v1/playback/media/mediaSession.ts) |
| 87 | + - [src/v1/playback/media/background.ts](src/v1/playback/media/background.ts) |
| 88 | + |
| 89 | +- TTS providers |
| 90 | + - Unified interface for synth requests and voice listing |
| 91 | + - Pass through custom voice strings, including the plus (`+`) syntax, when the provider supports them |
| 92 | + - Files: |
| 93 | + - [src/v1/tts/types.ts](src/v1/tts/types.ts) |
| 94 | + - [src/v1/tts/Provider.ts](src/v1/tts/Provider.ts) |
| 95 | + - [src/v1/tts/providers/OpenAIProvider.ts](src/v1/tts/providers/OpenAIProvider.ts) |
| 96 | + - [src/v1/tts/providers/DeepinfraProvider.ts](src/v1/tts/providers/DeepinfraProvider.ts) |
| 97 | + - [src/v1/tts/providers/CustomOpenAIProvider.ts](src/v1/tts/providers/CustomOpenAIProvider.ts) |
| 98 | + - [src/v1/tts/voices.ts](src/v1/tts/voices.ts) |
| 99 | + |
| 100 | +- Persistence and caching |
| 101 | + - Dexie schema for documents, config, audio cache, positions, voices |
| 102 | + - Repositories expose typed APIs and transactions |
| 103 | + - Files: |
| 104 | + - [src/v1/db/schema.ts](src/v1/db/schema.ts) |
| 105 | + - [src/v1/db/client.ts](src/v1/db/client.ts) |
| 106 | + - [src/v1/db/repositories/DocumentsRepo.ts](src/v1/db/repositories/DocumentsRepo.ts) |
| 107 | + - [src/v1/db/repositories/ConfigRepo.ts](src/v1/db/repositories/ConfigRepo.ts) |
| 108 | + - [src/v1/db/repositories/AudioCacheRepo.ts](src/v1/db/repositories/AudioCacheRepo.ts) |
| 109 | + - [src/v1/db/repositories/VoicesRepo.ts](src/v1/db/repositories/VoicesRepo.ts) |
| 110 | + - [src/v1/playback/positionStore.ts](src/v1/playback/positionStore.ts) |
| 111 | + |
| 112 | +- API surface |
| 113 | + - Streaming route for providers that support chunked responses |
| 114 | + - Range enabled audio download for large m4b artifacts |
| 115 | + - Files: |
| 116 | + - [src/app/api/tts/stream/route.ts](src/app/api/tts/stream/route.ts) |
| 117 | + - [src/app/api/tts/route.ts](src/app/api/tts/route.ts:1) |
| 118 | + - [src/app/api/tts/voices/route.ts](src/app/api/tts/voices/route.ts:1) |
| 119 | + - [src/app/api/audio/convert/route.ts](src/app/api/audio/convert/route.ts:1) |
| 120 | + |
| 121 | +## Playback state machine |
| 122 | + |
| 123 | +States |
| 124 | +- idle |
| 125 | +- preparing |
| 126 | +- buffering |
| 127 | +- playing |
| 128 | +- paused |
| 129 | +- stopping |
| 130 | +- error |
| 131 | + |
| 132 | +Transitions |
| 133 | +- idle -> preparing on play, when the queue has a valid head |
| 134 | +- preparing -> buffering after the first audio segment is requested |
| 135 | +- buffering -> playing once enough data is available |
| 136 | +- playing -> buffering on underflow, skip, or voice change |
| 137 | +- playing -> paused on user pause |
| 138 | +- any -> stopping on stop, which clears the queue and cancels in-flight requests |
| 139 | +- any -> error on an unrecoverable error, with context attached |
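
The transitions listed above can be sketched as a lookup table. This is a minimal illustration, not the contents of `state.ts`: the event names, the `transitions` table, and `nextState` are all hypothetical, and only the states and the listed transitions come from this ADR. Pairs the ADR does not list leave the state unchanged.

```typescript
// States are taken from the ADR; event names are assumptions made for this sketch.
type PlaybackState =
  | "idle" | "preparing" | "buffering" | "playing" | "paused" | "stopping" | "error";

type PlaybackEvent =
  | "play" | "segmentRequested" | "dataReady" | "underflow"
  | "skip" | "voiceChange" | "pause" | "stop" | "fatalError";

// Per-state transitions exactly as listed; nothing beyond the ADR is added.
const transitions: Partial<Record<PlaybackState, Partial<Record<PlaybackEvent, PlaybackState>>>> = {
  idle: { play: "preparing" },
  preparing: { segmentRequested: "buffering" },
  buffering: { dataReady: "playing" },
  playing: {
    underflow: "buffering",
    skip: "buffering",
    voiceChange: "buffering",
    pause: "paused",
  },
};

function nextState(
  state: PlaybackState,
  event: PlaybackEvent,
  hasQueueHead: boolean = true,
): PlaybackState {
  // "any -> stopping" and "any -> error" apply regardless of the current state.
  if (event === "stop") return "stopping";
  if (event === "fatalError") return "error";
  // Guard: play only leaves idle when the queue has a valid head.
  if (state === "idle" && event === "play" && !hasQueueHead) return state;
  // Unlisted (state, event) pairs are ignored.
  return transitions[state]?.[event] ?? state;
}
```

Keeping the table as data makes it easy to diff against the list above when transitions are added later.
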
| 140 | + |
| 141 | +Guards and effects |
| 142 | +- Every request carries a signal from an AbortController scoped to the current token |
| 143 | +- Config changes produce a new token and cancel in-flight requests |
| 144 | +- Preloading is capped and respects cache budgets |
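
One way to picture the token-scoped cancellation guard is a small holder that pairs each token with its own `AbortController`. The class name `TokenScope` and its members are hypothetical. The sketch assumes a numeric token, which the ADR does not specify.

```typescript
// Hypothetical helper: one AbortController per playback token.
class TokenScope {
  private controller = new AbortController();
  private token = 0;

  /** Signal to attach to every request issued under the current token. */
  get signal(): AbortSignal {
    return this.controller.signal;
  }

  /** The current token value (a plain counter in this sketch). */
  get current(): number {
    return this.token;
  }

  /** Cancels everything issued under the old token and starts a new one. */
  rotate(): number {
    this.controller.abort();
    this.controller = new AbortController();
    this.token += 1;
    return this.token;
  }
}
```

A caller would pass `scope.signal` to `fetch(url, { signal: scope.signal })` and call `scope.rotate()` on a config change, so that stale responses reject with an abort error instead of reaching the player.
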
| 145 | + |
| 146 | +## Media pipeline |
| 147 | + |
| 148 | +- Try MSE with an audio/mpeg or AAC SourceBuffer when available |
| 149 | +- Else use short blob segments and chain playback with minimal gaps |
| 150 | +- Apply player speed via playbackRate, kept separate from voice speed, which is set at synthesis time |
| 151 | +- Integrate Media Session actions: play, pause, next, and previous |
| 152 | +- Pause on background visibility and auto resume on foreground if user was playing |
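
The MSE-or-blob choice in the first two bullets could be isolated in a pure function, which keeps it testable outside a browser. `pickMediaStrategy`, `MediaStrategy`, and the candidate MIME strings are assumptions for illustration. The exact type strings the engine probes are not fixed by this ADR.

```typescript
type MediaStrategy =
  | { kind: "mse"; mimeType: string }
  | { kind: "blob" };

// Assumed candidates, in preference order; the real list may differ.
const CANDIDATE_TYPES = ["audio/mpeg", "audio/aac"];

function pickMediaStrategy(
  isTypeSupported: ((mimeType: string) => boolean) | undefined,
): MediaStrategy {
  // No MSE at all (probe unavailable): use blob segments.
  if (isTypeSupported) {
    const mimeType = CANDIDATE_TYPES.find((t) => isTypeSupported(t));
    if (mimeType) return { kind: "mse", mimeType };
  }
  // MSE present but no candidate type supported: also fall back to blobs.
  return { kind: "blob" };
}
```

In a browser, the probe would typically be `(t) => MediaSource.isTypeSupported(t)` when `typeof MediaSource !== "undefined"`, and `undefined` otherwise.
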
| 153 | + |
| 154 | +## Text and highlighting |
| 155 | + |
| 156 | +- Adapters provide raw to processed sentence mapping for highlight |
| 157 | +- PDF adapter normalizes x positions to page width and respects left and right margins |
| 158 | +- EPUB adapter yields location tokens and section navigation |
| 159 | +- HTML adapter passes text and uses markdown rendering only for view |
| 160 | + |
| 161 | +## Dexie schema outline |
| 162 | + |
| 163 | +Tables and indicative indexes |
| 164 | +- documents: id, type, name, lastModified, size, dataRef |
| 165 | +- config: key, value |
| 166 | +- audioCache: key, createdAt, expiresAt, size, bytesRef or chunkRefs |
| 167 | +- positions: docId, locationToken, sentenceIndex, updatedAt |
| 168 | +- voices: provider, model, voices, updatedAt |
| 169 | + |
| 170 | +Exact table definitions will be codified in [src/v1/db/schema.ts](src/v1/db/schema.ts) |
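
As a rough illustration, the outline above could map onto the store-declaration strings that Dexie accepts, where the first entry is the primary key and the rest are indexes. Which fields get indexed, and the compound `[provider+model]` key for voices, are assumptions. Fields that are stored but not indexed (such as `size` or `dataRef`) are simply not declared. `storesV1` and `primaryKeyOf` are hypothetical names.

```typescript
// Indicative only: index choices here are guesses, not the final schema.
const storesV1: Record<string, string> = {
  documents: "id, type, name, lastModified",
  config: "key",
  audioCache: "key, createdAt, expiresAt",
  positions: "docId, updatedAt",
  voices: "[provider+model], updatedAt",
};

/** Returns the primary key declaration, which is the first comma-separated entry. */
function primaryKeyOf(table: string): string | undefined {
  return storesV1[table]?.split(",")[0]?.trim();
}
```

With Dexie, a map of this shape would be passed to `db.version(1).stores(storesV1)`.
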
| 171 | + |
| 172 | +## API notes |
| 173 | + |
| 174 | +- TTS stream route |
| 175 | + - POST returns chunked audio where provider supports it |
| 176 | + - Fallback to full array buffer with progressive delivery |
| 177 | +- Audio convert route |
| 178 | + - Supports mp3 per chapter mode and m4b |
| 179 | + - Adds GET download with `Accept-Ranges` support for large files |
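
To make the range-enabled download concrete, here is a sketch of parsing a single-range `Range` header and producing the matching `Content-Range` value. `parseRange` and `contentRange` are hypothetical helpers, not functions from the existing route. The sketch returns `null` for absent, malformed, multi-range, and unsatisfiable headers alike, whereas a real handler would need to distinguish these cases (for example a full 200 response versus 416).

```typescript
interface ByteRange {
  start: number; // inclusive
  end: number;   // inclusive
}

function parseRange(header: string | null, size: number): ByteRange | null {
  if (!header) return null;
  // Only the single-range form "bytes=start-end" is handled in this sketch.
  const match = /^bytes=(\d*)-(\d*)$/.exec(header.trim());
  if (!match) return null;
  const rawStart = match[1] ?? "";
  const rawEnd = match[2] ?? "";
  if (rawStart === "" && rawEnd === "") return null;

  let start: number;
  let end: number;
  if (rawStart === "") {
    // Suffix form "bytes=-N": the last N bytes of the file.
    const suffix = Number(rawEnd);
    if (suffix === 0) return null;
    start = Math.max(size - suffix, 0);
    end = size - 1;
  } else {
    start = Number(rawStart);
    // An open end, or an end past the file, is clamped to the last byte.
    end = rawEnd === "" ? size - 1 : Math.min(Number(rawEnd), size - 1);
  }
  if (start > end || start >= size) return null;
  return { start, end };
}

/** Value for the Content-Range header of a 206 response. */
function contentRange(range: ByteRange, size: number): string {
  return `bytes ${range.start}-${range.end}/${size}`;
}
```

For a 1000-byte file, `bytes=500-` resolves to bytes 500 through 999, so a client can resume an interrupted m4b download without refetching the first half.
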
| 180 | + |
| 181 | +References: |
| 182 | +- Current TTS route [src/app/api/tts/route.ts](src/app/api/tts/route.ts:1) |
| 183 | +- Current voices route [src/app/api/tts/voices/route.ts](src/app/api/tts/voices/route.ts:1) |
| 184 | +- Current audio convert [src/app/api/audio/convert/route.ts](src/app/api/audio/convert/route.ts:1) |
| 185 | + |
| 186 | +## Migration plan |
| 187 | + |
| 188 | +- One time importer reads from legacy store helpers in [src/utils/indexedDB.ts](src/utils/indexedDB.ts:1) and writes to Dexie |
| 189 | +- Progress UI and retryable steps |
| 190 | +- After cutover remove legacy modules and dependencies including Howler |
| 191 | + |
| 192 | +## Issue alignment |
| 193 | + |
| 194 | +- #59 chapter MP3 export: chapterized pipeline and streamed zip |
| 195 | +- #48 large m4b download: range-enabled download endpoint and persistent temp artifacts |
| 196 | +- #47 voice combination: free-form voice string pass-through on Deepinfra and custom providers |
| 197 | +- #44 dialog chunking: quote-aware grouping in the splitter |
| 198 | +- #40 PDF margins: normalized x width and better width fallback |
| 199 | + |
| 200 | +See details in [docs/v1/issues-to-components.md](docs/v1/issues-to-components.md) |
| 201 | + |
| 202 | +## Alternatives considered |
| 203 | + |
| 204 | +- Keep Howler and harden with retries |
| 205 | + - Rejected due to continued complexity and limited streaming control |
| 206 | +- Keep raw IndexedDB helper |
| 207 | + - Rejected for poor ergonomics, difficult schema evolution, and a poor fit with the desired repository pattern |
| 208 | +- Dual engine migration |
| 209 | + - Rejected to avoid complexity and surface area during refactor |
| 210 | + |
| 211 | +## Risks and mitigations |
| 212 | + |
| 213 | +- MSE availability and Safari variance |
| 214 | + - Provide blob segment fallback and small segment chaining |
| 215 | +- Provider streaming differences |
| 216 | + - Design stream route with capability detection and fallbacks |
| 217 | +- Large artifact memory pressure |
| 218 | + - Range enabled downloads and file backed buffers where possible |
| 219 | +- Cache growth |
| 220 | + - TTL and LRU eviction on the Dexie cache, plus size budget enforcement with telemetry |
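
The cache-growth mitigation could be planned as a pure function over cache metadata: expired entries go first, then least recently used entries until the total fits the budget. This is only a sketch. `planEviction` is a hypothetical name, and `lastUsedAt` is an assumed field that does not appear in the schema outline (which lists `createdAt`, `expiresAt`, and `size`).

```typescript
interface CacheEntry {
  key: string;
  size: number;       // bytes
  expiresAt: number;  // epoch ms
  lastUsedAt: number; // epoch ms; assumed field, needed for LRU ordering
}

/** Returns the keys to delete so that live entries fit within budgetBytes. */
function planEviction(entries: CacheEntry[], budgetBytes: number, now: number): string[] {
  const evict: string[] = [];
  const live: CacheEntry[] = [];

  // TTL pass: anything past its expiry is evicted unconditionally.
  for (const entry of entries) {
    if (entry.expiresAt <= now) evict.push(entry.key);
    else live.push(entry);
  }

  // LRU pass: drop the least recently used entries until under budget.
  live.sort((a, b) => a.lastUsedAt - b.lastUsedAt);
  let total = live.reduce((sum, entry) => sum + entry.size, 0);
  for (const entry of live) {
    if (total <= budgetBytes) break;
    evict.push(entry.key);
    total -= entry.size;
  }
  return evict;
}
```

Separating the plan from the deletes would let the repository apply the returned keys in a single transaction and report the freed bytes to telemetry.
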
| 221 | + |
| 222 | +## Rollout |
| 223 | + |
| 224 | +- Alpha |
| 225 | + - HTML adapter wired end to end with engine and streaming |
| 226 | + - Basic Dexie schema and caches |
| 227 | +- Beta |
| 228 | + - PDF and EPUB adapters with highlighting and resume |
| 229 | + - Chapter mp3 export |
| 230 | + - Migration UI |
| 231 | +- GA |
| 232 | + - m4b and sync hardened |
| 233 | + - E2E and performance checks |
| 234 | + - Legacy removal |
| 235 | + |
| 236 | +## Acceptance criteria |
| 237 | + |
| 238 | +- Time from play to first speech stays low for cached sentences (target threshold not yet specified) |
| 239 | +- Voice change mid playback cancels and resumes with a single buffer rebuild |
| 240 | +- 1 to 2 GB m4b export downloads stably in Docker with Range support |
| 241 | +- Chapter zip exports are correct and stream without UI stalls |
| 242 | +- Dialog is chunked appropriately without regressing non dialog cases |
| 243 | +- PDF margins trimming is reliable across test samples |
| 244 | + |
| 245 | +## Next actions |
| 246 | + |
| 247 | +- Finalize checklist and sequencing in [docs/v1/todo.md](docs/v1/todo.md) |
| 248 | +- Create v1 code skeleton and Dexie schema |
| 249 | +- Implement engine state machine and MediaController baseline |
| 250 | +- Wire HTML adapter and stream route for first alpha milestone |