Commit 1080b52

Open AI for audio
1 parent e7738e8 commit 1080b52

7 files changed: +417 -106 lines changed


.env.example

Lines changed: 23 additions & 1 deletion
@@ -122,14 +122,36 @@ WEEKLY_SUMMARY_PROVIDER=gemini
 # WEEKLY_SUMMARY_MODEL=gemini-2.5-flash
 
 # Text-to-Speech (Optional)
-# Enable TTS generation for weekly summaries using ElevenLabs
+# Enable TTS generation for weekly summaries
 # Requires WEEKLY_SUMMARY_ENABLED=true
 TTS_ENABLED=false
+
+# TTS Provider
+# Options: "openai" or "elevenlabs"
+# OpenAI: Standard $15/1M chars (~$0.15 for 10K), HD $30/1M chars
+# 6 voices (alloy, echo, fable, onyx, nova, shimmer)
+# Good quality, very affordable for long content
+# ElevenLabs: Credit-based pricing, higher quality voices
+# More expensive for long-form content
+# Default: openai (recommended for cost)
+TTS_PROVIDER=openai
+
+# OpenAI TTS Settings (when TTS_PROVIDER=openai)
+# Voice options: alloy, echo, fable, onyx, nova, shimmer
+# Model options: tts-1 (faster, cheaper), tts-1-hd (higher quality)
+OPENAI_TTS_VOICE=alloy
+OPENAI_TTS_MODEL=tts-1
+
+# ElevenLabs Settings (when TTS_PROVIDER=elevenlabs)
 # Get your API key from: https://elevenlabs.io/
 ELEVENLABS_API_KEY=
 # Voice ID to use for TTS (default: Adam - free deep American male voice)
 # Free voices: Adam=pNInz6obpgDQGcFmaJgB, Rachel=21m00Tcm4TlvDq8ikWAM
+# Run: uv run python list_elevenlabs_voices.py to see all available voices
 ELEVENLABS_VOICE_ID=pNInz6obpgDQGcFmaJgB
+# Model options: eleven_flash_v2_5 (40K chars), eleven_turbo_v2_5 (40K), eleven_multilingual_v2 (10K)
+ELEVENLABS_MODEL_ID=eleven_flash_v2_5
+
 # Directory to store weekly summary audio files
 WEEKLY_SUMMARY_AUDIO_DIR=/var/audio-summaries
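The per-character OpenAI prices quoted in the `.env.example` comments can be checked with a few lines of arithmetic. The sketch below is illustrative only: the dictionary and function names are hypothetical and not part of this commit, and the rates are simply the figures stated in the comments above.

```python
# Illustrative sketch: rates are the per-1M-character figures quoted in the
# .env.example comments; the names below are hypothetical, not from the commit.
OPENAI_TTS_USD_PER_MILLION_CHARS = {
    "tts-1": 15.00,     # standard
    "tts-1-hd": 30.00,  # HD
}


def estimate_openai_tts_cost(text: str, model: str = "tts-1") -> float:
    """Return the estimated USD cost of synthesizing `text` with `model`."""
    rate = OPENAI_TTS_USD_PER_MILLION_CHARS[model]
    return len(text) * rate / 1_000_000


summary = "x" * 10_000  # stand-in for a 10K-character weekly summary
print(f"tts-1:    ${estimate_openai_tts_cost(summary):.2f}")
print(f"tts-1-hd: ${estimate_openai_tts_cost(summary, 'tts-1-hd'):.2f}")
```

For a 10,000-character summary this reproduces the "~$0.15 for 10K" figure for `tts-1` and twice that for `tts-1-hd`.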

README.md

Lines changed: 75 additions & 59 deletions
@@ -28,7 +28,7 @@ A powerful FastAPI application that streams audio from YouTube videos as MP3 ove
 
 #### Intelligent Summarization
 - **Video Summaries**: AI-generated summaries of each video's content
-- **Multi-Provider**: OpenAI GPT or Google Gemini (Gemini recommended for cost-effectiveness)
+- **Multi-Provider**: OpenAI GPT or Google Gemini (Gemini recommended for free tier)
 - **Knowledge Management**: Automatic posting to Trilium Notes with deduplication
 - **Rich Metadata**: Includes video title, channel, thumbnail, and YouTube link
 
@@ -37,7 +37,7 @@ A powerful FastAPI application that streams audio from YouTube videos as MP3 ove
 - **Comprehensive Analysis**: Synthesizes all videos watched during the week
 - **Key Learnings**: Extracts 15 most important insights across all content
 - **Theme Detection**: Identifies common themes and patterns in your viewing
-- **Text-to-Speech**: Optional ElevenLabs TTS generation for listening to summaries
+- **Text-to-Speech**: Optional TTS generation (OpenAI or ElevenLabs) for listening to summaries
 
 #### Smart Video Suggestions
 - **AI Content Discovery**: Analyzes your viewing history to suggest similar videos
@@ -125,9 +125,9 @@ TRANSCRIPTION_ENABLED=true
 OPENAI_API_KEY=sk-... # Get from https://platform.openai.com/api-keys
 GEMINI_API_KEY=... # Get from https://makersuite.google.com/app/apikey
 
-# Provider selection (recommended: Voxtral + Gemini for best cost/quality)
-TRANSCRIPTION_PROVIDER=mistral # "openai", "mistral", or "gemini"
-SUMMARY_PROVIDER=gemini # "gemini" (cost-effective) or "openai"
+# Provider selection (recommended: Whisper + Gemini for best cost/quality)
+TRANSCRIPTION_PROVIDER=openai # "openai" (Whisper) or "gemini"
+SUMMARY_PROVIDER=gemini # "gemini" (free tier) or "openai"
 
 # Trilium Notes integration (for saving summaries)
 TRILIUM_URL=http://localhost:8080
@@ -148,13 +148,13 @@ TTS_ENABLED=false
 - Or use your specific local IP (e.g., `10.0.0.181`)
 
 - **TRANSCRIPTION_PROVIDER**:
-  - `openai` = Whisper API ($0.006/min, very accurate, fast, 25MB limit)
-  - `mistral` = Voxtral Mini ($0.003/min, most cost-effective, 30 min limit)
-  - `gemini` = Gemini 2.5 Flash (~$0.0005-0.001/min, handles unlimited file sizes)
+  - `openai` = Whisper API ($0.006/minute, very accurate, fast, 25MB limit)
+  - `mistral` = Voxtral Mini ($0.003/minute, cost-effective, good quality, 15 min limit)
+  - `gemini` = Gemini 1.5 Flash (free tier available, good quality, no limits)
 
 - **SUMMARY_PROVIDER**:
-  - `gemini` = Gemini 2.5 Flash (recommended, very cost-effective, fast)
-  - `openai` = GPT-4o-mini (high quality)
+  - `gemini` = Gemini 2.5 Flash (recommended, free tier, fast)
+  - `openai` = GPT-4o-mini (high quality, paid)
 
 ### Step 4: Test Trilium Connection (Optional)
 
@@ -224,24 +224,20 @@ Required for Whisper transcription or GPT summarization.
 
 ### Google Gemini API Key
 
-Required for Gemini transcription or summarization. Very cost-effective pricing.
+Required for Gemini transcription or summarization. Has a generous free tier.
 
 1. Visit https://makersuite.google.com/app/apikey
 2. Sign in with your Google account
 3. Click "Create API Key"
 4. Copy the key
 5. Add to `.env` file: `GEMINI_API_KEY=...`
 
-**Pricing**:
-- Audio transcription: ~$0.30-0.50 per 1M input tokens + $0.40 per 1M output
-- Text generation: $0.15 per 1M input + $0.60 per 1M output
-- Rate limits: 15 req/min, 1,500 req/day, 1M tokens/day
+**Free Tier**:
+- 15 requests per minute
+- 1 million tokens per day
+- 1,500 requests per day
 
-**Benefits**:
-- Very cost-effective for audio transcription (~$0.0005-0.001/minute)
-- Handles large audio files automatically (uses Files API for >20MB)
-- No practical file size or duration limits
-- Good for long recordings where Whisper/Voxtral hit their limits
+For typical use, summarization and weekly summaries are essentially free.
 
 ### Mistral AI API Key
 
@@ -255,7 +251,7 @@ Required for Mistral Voxtral transcription. Cost-effective option at $0.003/minu
 
 **Cost**: Voxtral Mini is $0.003 per minute of audio. For typical use (~30 hours/month), expect ~$5-8/month (50% cheaper than Whisper).
 
-**Limitation**: Maximum 15 minutes per audio file. For longer videos, use Gemini (no limit) or split the audio.
+**Limitation**: Maximum 30 minutes per audio file. For longer videos, use Gemini (no limit) or split the audio.
 
 ### Trilium ETAPI Token
 
@@ -272,9 +268,27 @@ Required for saving transcripts and summaries to Trilium Notes.
 2. Right-click the note → "Copy Note ID"
 3. Add to `.env` file: `TRILIUM_PARENT_NOTE_ID=...`
 
-### ElevenLabs API Key (Optional)
+### Text-to-Speech API Keys (Optional)
 
-Required only if you want text-to-speech for weekly summaries.
+Required only if you want text-to-speech for weekly summaries. Choose one provider:
+
+#### OpenAI TTS (Recommended)
+**Most affordable for long-form content**
+
+- Pricing: $15 per 1M characters (~$0.15 for a 10K character summary)
+- Quality: 6 natural voices (alloy, echo, fable, onyx, nova, shimmer)
+- Models: `tts-1` (standard) or `tts-1-hd` (higher quality)
+- You already have the API key from transcription setup
+
+Set in `.env`:
+```bash
+TTS_PROVIDER=openai
+OPENAI_TTS_VOICE=alloy
+OPENAI_TTS_MODEL=tts-1
+```
+
+#### ElevenLabs (Alternative)
+**Higher quality voices, more expensive**
 
 1. Visit https://elevenlabs.io/
 2. Sign up or sign in
@@ -284,6 +298,12 @@ Required only if you want text-to-speech for weekly summaries.
 
 **Free Tier**: 10,000 characters per month (~7-10 summaries)
 
+Set in `.env`:
+```bash
+TTS_PROVIDER=elevenlabs
+ELEVENLABS_VOICE_ID=pNInz6obpgDQGcFmaJgB
+```
+
 ## Configuration Reference
 
 ### Environment Variables
@@ -347,9 +367,13 @@ All configuration is done via the `.env` file. See `.env.example` for a complete
 
 | Variable | Default | Description |
 |----------|---------|-------------|
-| `TTS_ENABLED` | `false` | Enable ElevenLabs TTS for summaries |
-| `ELEVENLABS_API_KEY` | - | ElevenLabs API key |
-| `ELEVENLABS_VOICE_ID` | `pNInz6obpgDQGcFmaJgB` | Voice ID (Adam by default) |
+| `TTS_ENABLED` | `false` | Enable TTS for summaries |
+| `TTS_PROVIDER` | `openai` | Provider: `openai` or `elevenlabs` |
+| `OPENAI_TTS_VOICE` | `alloy` | OpenAI voice (alloy, echo, fable, onyx, nova, shimmer) |
+| `OPENAI_TTS_MODEL` | `tts-1` | OpenAI model (`tts-1` or `tts-1-hd`) |
+| `ELEVENLABS_API_KEY` | - | ElevenLabs API key (if using ElevenLabs) |
+| `ELEVENLABS_VOICE_ID` | `pNInz6obpgDQGcFmaJgB` | ElevenLabs voice ID (Adam by default) |
+| `ELEVENLABS_MODEL_ID` | `eleven_flash_v2_5` | ElevenLabs model |
 | `WEEKLY_SUMMARY_AUDIO_DIR` | `/var/audio-summaries` | Where to store TTS audio files |
 
 ## API Endpoints
@@ -530,36 +554,29 @@ curl "http://localhost:8000/admin/weekly-summary/next-run"
 | gpt-4o | $2.50 | $10.00 | Higher quality |
 | whisper-1 | - | - | $0.006 per minute, 25MB limit |
 | **Mistral AI** ||||
-| voxtral-mini-latest | - | - | $0.003 per minute, 30 min limit |
+| voxtral-mini-latest | - | - | $0.003 per minute, 15 min limit |
 | **Google Gemini** ||||
-| gemini-2.5-flash | $0.15 | $0.60 | Text: Fast, comparable to gpt-4o-mini |
-| gemini-2.5-flash (audio) | $0.30-0.50 | $0.40 | Audio transcription (token-based) |
+| gemini-2.5-flash | $0.15 | $0.60 | Fast, comparable to gpt-4o-mini (recommended) |
 | gemini-1.5-flash | $0.10 | $0.40 | Slightly older, still excellent |
 | gemini-1.5-pro | $1.25 | $5.00 | Higher quality |
 
-**Note**: Gemini audio pricing is per 1M tokens. Audio duration to token conversion varies, but typically ~1 minute ≈ 1,000-1,500 tokens.
-
 ### Estimated Costs Per Operation
 
-**Transcription Options:**
+**Using recommended configuration (Whisper + Gemini 2.5 Flash):**
 
-1. **Whisper (OpenAI)** - Most accurate
-   - $0.006 per minute
-   - 10 min = $0.06 | 1 hour = $0.36
+- **Video transcription** (Whisper): $0.006 per minute of audio
+  - 10 min video = $0.06
+  - 1 hour video = $0.36
 
-2. **Voxtral (Mistral)** - Cost-effective (50% cheaper)
-   - $0.003 per minute
-   - 10 min = $0.03 | 1 hour = $0.18
+**Alternative: Cost-optimized (Voxtral + Gemini 2.5 Flash):**
 
-3. **Gemini 2.5 Flash** - Token-based, good for long files
-   - ~$0.30-0.50 per 1M input tokens + $0.40 per 1M output
-   - Estimate: ~$0.0005-0.001 per minute (varies by audio complexity)
-   - 10 min ≈ $0.005-0.01 | 1 hour ≈ $0.03-0.06
-   - Best for: Very long recordings, handles unlimited file sizes
+- **Video transcription** (Voxtral Mini): $0.003 per minute of audio (50% cheaper)
+  - 10 min video = $0.03
+  - 1 hour video = $0.18
 
-**Summarization** (Gemini 2.5 Flash text):
-- Typical: 2,000 input + 500 output tokens
-- Cost: (2,000 × $0.15 + 500 × $0.60) / 1,000,000 = **$0.0006**
+- **Video summarization** (Gemini 2.5 Flash): ~$0.0003-0.001 per summary
+  - Typical: 2,000 input tokens + 500 output tokens
+  - Cost: (2,000 × $0.15 + 500 × $0.60) / 1,000,000 = **$0.0006**
 
 - **Weekly summary** (Gemini 2.5 Flash): ~$0.003-0.01 per summary
   - Typical: 10,000 input tokens + 2,000 output tokens
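The per-operation figures in this README hunk follow from two simple formulas: token-based pricing for summaries and per-minute pricing for transcription. The sketch below reproduces them under the rates quoted in the pricing table; the class and function names are hypothetical and do not appear in the commit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenPricing:
    """USD per 1M tokens, as listed in the README pricing table."""
    input_usd_per_million: float
    output_usd_per_million: float


# Rates quoted in the README for gemini-2.5-flash text generation.
GEMINI_25_FLASH = TokenPricing(input_usd_per_million=0.15, output_usd_per_million=0.60)


def token_cost(pricing: TokenPricing, input_tokens: int, output_tokens: int) -> float:
    """Cost of one request billed per input and output token."""
    return (
        input_tokens * pricing.input_usd_per_million
        + output_tokens * pricing.output_usd_per_million
    ) / 1_000_000


def transcription_cost(minutes: float, usd_per_minute: float) -> float:
    """Cost of per-minute transcription (Whisper: 0.006, Voxtral Mini: 0.003)."""
    return minutes * usd_per_minute


print(f"video summary:   ${token_cost(GEMINI_25_FLASH, 2_000, 500):.4f}")
print(f"weekly summary:  ${token_cost(GEMINI_25_FLASH, 10_000, 2_000):.4f}")
print(f"whisper, 1 hour: ${transcription_cost(60, 0.006):.2f}")
print(f"voxtral, 1 hour: ${transcription_cost(60, 0.003):.2f}")
```

With the "typical" token counts from the README, the weekly summary works out to $0.0027, slightly under the "~$0.003-0.01" range given in the bullet.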
@@ -589,21 +606,20 @@ curl "http://localhost:8000/admin/weekly-summary/next-run"
 - Weekly summaries: 4 weeks × $0.0027 = **$0.01**
 - **Total: ~$36.10/month**
 
-### Gemini Pricing Advantages
+### Gemini Free Tier
 
-Gemini offers very competitive pricing, especially for text generation:
-- Text: $0.15 input + $0.60 output per 1M tokens
-- Audio: $0.30-0.50 input + $0.40 output per 1M tokens
-- Rate limits: 15 req/min, 1M tokens/day, 1,500 req/day
+Gemini has a generous free tier that covers most summarization needs:
+- 15 requests per minute
+- 1 million tokens per day
+- 1,500 requests per day
 
-**Cost-effective for:**
-- Video summarization (~$0.0006 per summary - nearly negligible)
-- Weekly summaries (~$0.003 per summary)
-- Smart suggestions (~$0.002 per request)
-- Long audio transcriptions (~$0.0005-0.001 per minute, cheaper than Whisper for >6 min files)
+**What's free:**
+- Video summarization (essentially unlimited for personal use)
+- Weekly summaries (4 per month)
+- Smart suggestions (as much as you need)
 
-**Costs more than alternatives:**
-- Short audio transcription: Voxtral is more cost-effective ($0.003/min fixed)
+**What costs money:**
+- Transcription with Whisper (no free option for high quality)
 
 ### Cost Tracking
 
@@ -671,7 +687,7 @@ sudo systemctl restart audio-stream
 sudo systemctl stop audio-stream
 
 # View logs
-journalctl -u audio-stream -n 100 -f
+journalctl -u audio-stream -n 1000 -f
 ```
 
 **Note:** The service automatically loads your `.env` file from the WorkingDirectory.

config.py

Lines changed: 27 additions & 3 deletions
@@ -3,7 +3,7 @@
 import os
 import threading
 import logging
-from typing import Optional
+from typing import Optional, Literal, cast
 from dataclasses import dataclass
 from dotenv import load_dotenv
 
@@ -104,8 +104,12 @@ class Config:
 
     # TTS settings
     tts_enabled: bool
+    tts_provider: Literal["openai", "elevenlabs"]
+    openai_tts_voice: str  # OpenAI voice (alloy, echo, fable, onyx, nova, shimmer)
+    openai_tts_model: str  # OpenAI model (tts-1 or tts-1-hd)
     elevenlabs_api_key: Optional[str]
     elevenlabs_voice_id: str
+    elevenlabs_model_id: str
     weekly_summary_audio_dir: str
 
     # Client-side logging settings
@@ -194,10 +198,17 @@ def load_from_env(cls) -> "Config":
             == "true",
             # TTS settings
             tts_enabled=os.getenv("TTS_ENABLED", "false").lower() == "true",
+            tts_provider=cast(
+                Literal["openai", "elevenlabs"],
+                os.getenv("TTS_PROVIDER", "openai").lower(),
+            ),
+            openai_tts_voice=os.getenv("OPENAI_TTS_VOICE", "alloy"),
+            openai_tts_model=os.getenv("OPENAI_TTS_MODEL", "tts-1"),
             elevenlabs_api_key=os.getenv("ELEVENLABS_API_KEY"),
             elevenlabs_voice_id=os.getenv(
                 "ELEVENLABS_VOICE_ID", "pNInz6obpgDQGcFmaJgB"
             ),  # Adam - free voice
+            elevenlabs_model_id=os.getenv("ELEVENLABS_MODEL_ID", "eleven_flash_v2_5"),
             weekly_summary_audio_dir=os.getenv(
                 "WEEKLY_SUMMARY_AUDIO_DIR", "/var/audio-summaries"
             ),
@@ -328,8 +339,21 @@ def validate_tts(self) -> None:
         """Validate that required configuration for TTS is present."""
         errors = []
 
-        if not self.elevenlabs_api_key:
-            errors.append("ELEVENLABS_API_KEY is required when TTS_ENABLED=true")
+        # Validate TTS provider
+        if self.tts_provider not in ["openai", "elevenlabs"]:
+            errors.append(
+                f"TTS_PROVIDER must be 'openai' or 'elevenlabs', got '{self.tts_provider}'"
+            )
+
+        # Validate provider-specific configuration
+        if self.tts_provider == "openai":
+            if not self.openai_api_key:
+                errors.append("OPENAI_API_KEY is required when TTS_PROVIDER=openai")
+        elif self.tts_provider == "elevenlabs":
+            if not self.elevenlabs_api_key:
+                errors.append(
+                    "ELEVENLABS_API_KEY is required when TTS_PROVIDER=elevenlabs"
+                )
 
         if errors:
             error_msg = "TTS configuration validation failed:\n - " + "\n - ".join(
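One detail worth noting about the `config.py` changes: `typing.cast` only informs the type checker and performs no runtime check, so an unsupported `TTS_PROVIDER` value is caught by the membership test in `validate_tts`, not at load time. The standalone sketch below mirrors that validation logic so it can be exercised in isolation. `TTSSettings`, `from_env`, and `validation_errors` are hypothetical names for illustration; the commit's real class is `Config`, of which only these hunks are visible here.

```python
from dataclasses import dataclass
from typing import List, Mapping, Optional

VALID_TTS_PROVIDERS = ("openai", "elevenlabs")


@dataclass
class TTSSettings:
    """Hypothetical, trimmed-down stand-in for the TTS fields of Config."""
    provider: str
    openai_api_key: Optional[str]
    elevenlabs_api_key: Optional[str]

    @classmethod
    def from_env(cls, env: Mapping[str, str]) -> "TTSSettings":
        # Same defaults as the diff: provider falls back to "openai".
        return cls(
            provider=env.get("TTS_PROVIDER", "openai").lower(),
            openai_api_key=env.get("OPENAI_API_KEY"),
            elevenlabs_api_key=env.get("ELEVENLABS_API_KEY"),
        )

    def validation_errors(self) -> List[str]:
        """Collect errors the same way validate_tts does, without raising."""
        errors: List[str] = []
        if self.provider not in VALID_TTS_PROVIDERS:
            errors.append(
                f"TTS_PROVIDER must be 'openai' or 'elevenlabs', got '{self.provider}'"
            )
        if self.provider == "openai" and not self.openai_api_key:
            errors.append("OPENAI_API_KEY is required when TTS_PROVIDER=openai")
        elif self.provider == "elevenlabs" and not self.elevenlabs_api_key:
            errors.append("ELEVENLABS_API_KEY is required when TTS_PROVIDER=elevenlabs")
        return errors


# An empty key (as in .env.example's `ELEVENLABS_API_KEY=`) counts as missing.
settings = TTSSettings.from_env({"TTS_PROVIDER": "ElevenLabs", "ELEVENLABS_API_KEY": ""})
print(settings.validation_errors())
```

Passing a plain mapping instead of reading `os.environ` directly is a choice made here only to keep the sketch easy to test.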

main.py

Lines changed: 20 additions & 0 deletions
@@ -36,6 +36,26 @@
 )
 logger = logging.getLogger(__name__)
 
+
+# Custom filter to suppress polling endpoint logs
+class PollingEndpointFilter(logging.Filter):
+    """Filter out frequently polled endpoint access logs to reduce noise."""
+
+    def filter(self, record: logging.LogRecord) -> bool:
+        message = record.getMessage()
+        # Filter out frequently polled endpoints
+        return not any(
+            pattern in message
+            for pattern in [
+                "GET /status HTTP",  # Stream status polling
+                "GET /transcription/status/",  # Transcription status polling
+            ]
+        )
+
+
+# Apply filter to uvicorn access logger
+logging.getLogger("uvicorn.access").addFilter(PollingEndpointFilter())
+
 # Configurable host and port
 host = os.environ.get("FASTAPI_HOST", "127.0.0.1")
 api_port = int(os.environ.get("FASTAPI_API_PORT", 8000))
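The filter added in `main.py` can be exercised without running a server. The sketch below re-declares the class from the hunk above so it runs standalone, and feeds it hand-built log records. The sample messages assume uvicorn's usual access-line shape (`client - "METHOD path HTTP/x" status`); that shape and the `access_record` helper are assumptions for illustration, not something shown in the commit.

```python
import logging


class PollingEndpointFilter(logging.Filter):
    """Same substring-matching logic as the filter added in main.py."""

    def filter(self, record: logging.LogRecord) -> bool:
        message = record.getMessage()
        return not any(
            pattern in message
            for pattern in [
                "GET /status HTTP",            # Stream status polling
                "GET /transcription/status/",  # Transcription status polling
            ]
        )


def access_record(message: str) -> logging.LogRecord:
    """Hypothetical helper: build a record resembling a uvicorn access line."""
    return logging.makeLogRecord({"name": "uvicorn.access", "msg": message})


log_filter = PollingEndpointFilter()
samples = [
    '127.0.0.1:5000 - "GET /status HTTP/1.1" 200',
    '127.0.0.1:5000 - "GET /transcription/status/abc123 HTTP/1.1" 200',
    '127.0.0.1:5000 - "GET /status?verbose=1 HTTP/1.1" 200',
    '127.0.0.1:5000 - "GET /admin/weekly-summary/next-run HTTP/1.1" 200',
]
for message in samples:
    verdict = "kept" if log_filter.filter(access_record(message)) else "dropped"
    print(f"{verdict:7} {message}")
```

Because the match is a plain substring test, `GET /status` with a query string does not match `"GET /status HTTP"` and is still logged, whereas anything under `/transcription/status/` is suppressed.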
