A complete real-time speech transcription and translation system with web interface and YouTube synchronization support.
一個完整的即時語音轉錄和翻譯系統,具備網頁介面和 YouTube 同步支援。
- Real-time Speech Transcription: Live audio capture and transcription using WhisperX, OpenAI 4o transcribe, or Groq
- Multi-language Translation: Automatic translation to multiple languages using GPT-4
- WebSocket Communication: Real-time bidirectional communication between client and server
- Web Interface: Modern Flask-based web application with multiple viewing modes
- YouTube Integration: Synchronized subtitles for YouTube videos with live streaming support
- Server-Sent Events: Real-time updates via SSE for live transcription display
- Session Management: Create and manage multiple transcription sessions
- Audio Processing: Advanced audio processing with overlap buffering for better accuracy
- Context-Aware Translation: Uses conversation context for improved translation quality
- Keyword Learning: Automatic keyword extraction and learning for better transcription accuracy
- Multiple Transcriber Support: Support for WhisperX, OpenAI, and Groq transcription engines
- Real-time View Modes: Grid and single-column layout options for different use cases
- 即時語音轉錄:使用 WhisperX、OpenAI 4o transcribe 或 Groq 進行即時音訊擷取和轉錄
- 多語言翻譯:使用 GPT 自動翻譯成多種語言
- WebSocket 通訊:客戶端和伺服器間的即時雙向通訊
- 網頁介面:基於 Flask 的現代化網頁應用程式,支援多種檢視模式
- YouTube 整合:支援 YouTube 影片同步字幕,包含直播支援
- 伺服器推送事件:透過 SSE 進行即時更新,顯示即時轉錄內容
- 會話管理:建立和管理多個轉錄會話
- 音訊處理:進階音訊處理,具備重疊緩衝區以提高準確性
- 上下文感知翻譯:使用對話上下文以提升翻譯品質
- 關鍵字學習:自動關鍵字提取和學習,提升轉錄準確性
- 多種轉錄引擎支援:支援 WhisperX、OpenAI 和 Groq 轉錄引擎
- 即時檢視模式:網格和單欄佈局選項,適用於不同使用情境
realtime_transcribe/
├── realtime_transcribe_client/ # Audio capture and transcription client
│ ├── run.py # Main client application
│ ├── output/ # Transcription output files
│ │ ├── current_keywords.txt # Dynamic keyword learning file
│ │ └── YYYY-MM-DD/ # Daily transcription logs
│ ├── pyproject.toml # Client dependencies
│ └── README.md # Client documentation
├── realtime_transcribe_server/ # Web server and API
│ ├── app/ # Flask application
│ │ ├── __init__.py # Routes and API endpoints
│ │ ├── static/ # CSS, JS, and static assets
│ │ └── templates/ # HTML templates
│ │ ├── base.html # Base template
│ │ ├── index.html # Main dashboard
│ │ ├── rt.html # Real-time view
│ │ └── yt.html # YouTube integration view
│ ├── main.py # Server entry point
│ ├── temp/ # Session data storage
│ ├── pyproject.toml # Server dependencies
│ └── README.md # Server documentation
└── README.md # This file
- Python 3.11 or higher / Python 3.11 或更高版本
- Microphone access / 麥克風存取權限
- OpenAI API key (for translation) / OpenAI API 金鑰(用於翻譯)
- Groq API key (optional, for Groq transcription) / Groq API 金鑰(選用,用於 Groq 轉錄)
- YouTube Data API key (optional, for YouTube integration) / YouTube Data API 金鑰(選用,用於 YouTube 整合)
-
Clone the repository / 複製專案:
git clone <repository-url> cd realtime_transcribe
-
Install dependencies / 安裝依賴套件:
# Install server dependencies / 安裝伺服器依賴套件 cd realtime_transcribe_server uv sync # Install client dependencies / 安裝客戶端依賴套件 cd ../realtime_transcribe_client uv sync
-
Configure environment variables / 設定環境變數:
Create
.env
files in both directories with the following variables: 在兩個目錄中建立.env
檔案,包含以下變數:# Client configuration / 客戶端設定 OPENAI_API_KEY=your_openai_api_key_here GROQ_API_KEY=your_groq_api_key_here # Optional / 選用 TRANSCRIBE_MODEL=large-v3 TRANSCRIBER=whisperx # Options: "whisperx", "openai", "groq" TRANSLATE_LANGUAGES=zh-Hant,ja,ko,en COMMON_PROMPT=This is a meeting about software development SERVER_ENDPOINT=http://127.0.0.1:5000/api/sync/ AI_MODEL=gpt-4.1-nano RECORD_TIMEOUT=5 RECORD_ENERGY_THRESHOLD=150 RECORD_PAUSE_THRESHOLD_MS=1000 SECRET_KEY=secret-key-to-check-is-admin # Server configuration / 伺服器設定 YOUTUBE_API_KEY=your_youtube_api_key_here # Optional / 選用 SECRET_KEY=secret-key-to-check-is-admin
-
Run the system / 執行系統:
Terminal 1 - Start the server / 終端機 1 - 啟動伺服器:
cd realtime_transcribe_server python main.py
Terminal 2 - Start the client / 終端機 2 - 啟動客戶端:
cd realtime_transcribe_client python run.py -t your_session_id
- Start the server: The web interface will be available at
http://localhost:5000
- Start the client: Run with a session ID to begin transcription
- Access the web interface:
- Visit
http://localhost:5000
for the main dashboard - Use
/rt/{session_id}
for real-time transcription view - Use
/yt/{session_id}
for YouTube integration view
- Visit
- Create sessions: Enter a unique session ID to create a new transcription session
- View transcriptions: Real-time updates will appear in the web interface
- Switch view modes: Use the dropdown in real-time view to switch between languages and layouts
- 啟動伺服器:網頁介面將在
http://localhost:5000
可用 - 啟動客戶端:使用會話 ID 執行以開始轉錄
- 存取網頁介面:
- 造訪
http://localhost:5000
進入主要儀表板 - 使用
/rt/{session_id}
檢視即時轉錄 - 使用
/yt/{session_id}
檢視 YouTube 整合
- 造訪
- 建立會話:輸入唯一的會話 ID 以建立新的轉錄會話
- 檢視轉錄:即時更新將在網頁介面中顯示
- 切換檢視模式:在即時檢視中使用下拉選單切換語言和佈局
// Join a session / 加入會話
socket.emit('join_session', {'session_id': 'your_session_id'});
// Send transcription data / 發送轉錄資料
socket.emit('sync', {
'id': 'session_id',
'message': 'transcription text',
'start_time': 10.5,
'end_time': 12.3,
'created_at': '2024-01-01T12:00:00Z',
'result': {
'corrected': 'corrected transcription',
'translated': {
'en': 'Hello world',
'zh-tw': '你好世界',
'ja': 'こんにちは世界'
},
'special_keywords': ['keyword1', 'keyword2']
}
});
// Connection confirmation / 連線確認
socket.on('connected', (data) => {
console.log('Connected:', data.client_id);
});
// Session join confirmation / 會話加入確認
socket.on('joined_session', (data) => {
console.log('Joined session:', data.session_id);
});
// Real-time transcription updates / 即時轉錄更新
socket.on('transcription_update', (data) => {
console.log('New transcription:', data);
});
POST /api/sync/{session_id}
Content-Type: application/json
{
"message": "transcription text",
"start_time": 10.5,
"end_time": 12.3,
"created_at": "2024-01-01T12:00:00Z",
"result": {
"corrected": "corrected transcription",
"translated": {
"en": "Hello world",
"zh-tw": "你好世界",
"ja": "こんにちは世界"
},
"special_keywords": ["keyword1", "keyword2"]
}
}
GET /rt/{session_id} # Real-time transcription view
GET /yt/{session_id} # YouTube integration view
GET / # Main dashboard
Variable / 變數 | Description / 說明 | Default / 預設值 | Options / 選項 |
---|---|---|---|
TRANSCRIBE_MODEL |
Whisper model to use / 使用的 Whisper 模型 | large-v3 |
tiny , base , small , medium , large , large-v2 , large-v3 |
TRANSCRIBER |
Transcription engine / 轉錄引擎 | whisperx |
whisperx , openai , groq |
TRANSLATE_LANGUAGES |
Comma-separated target languages / 目標語言(逗號分隔) | en,zh-tw |
IETF BCP 47 format |
COMMON_PROMPT |
Context prompt for better transcription / 轉錄上下文提示 | - | Free text |
SERVER_ENDPOINT |
Server API endpoint / 伺服器 API 端點 | http://127.0.0.1:5000/api/sync/ |
URL |
AI_MODEL |
AI model for translation / 翻譯用的 AI 模型 | gpt-4.1-nano |
OpenAI model name |
RECORD_TIMEOUT |
Recording timeout in seconds / 錄音超時時間(秒) | 5 |
Integer |
RECORD_ENERGY_THRESHOLD |
Energy threshold for speech detection / 語音偵測能量閾值 | 150 |
Integer |
RECORD_PAUSE_THRESHOLD_MS |
Pause threshold in milliseconds / 暫停閾值(毫秒) | 1000 |
Integer |
Variable / 變數 | Description / 說明 | Default / 預設值 |
---|---|---|
YOUTUBE_API_KEY |
YouTube Data API key / YouTube Data API 金鑰 | - |
- English: Dedicated view for live transcription display with auto-scrolling
- Features:
- Grid layout with multiple languages
- Single column layout
- Language selection dropdown
- Real-time WebSocket updates
- QR code generation for mobile access
- 繁體中文:專門用於即時轉錄顯示的檢視,具備自動捲動功能
- 功能:
- 多語言網格佈局
- 單欄佈局
- 語言選擇下拉選單
- 即時 WebSocket 更新
- 行動裝置存取 QR 碼生成
- English: Embeds YouTube videos with synchronized subtitles and live streaming support
- Features:
- YouTube player integration
- Synchronized subtitle display
- Live stream support
- Automatic start time detection
- 繁體中文:嵌入 YouTube 影片,具備同步字幕和直播支援
- 功能:
- YouTube 播放器整合
- 同步字幕顯示
- 直播支援
- 自動開始時間偵測
- English: Uses overlap buffering to ensure continuous transcription accuracy
- Features:
- Configurable energy threshold for speech detection
- Adjustable pause threshold for sentence boundaries
- Overlap buffering to prevent audio gaps
- Multiple audio format support
- 繁體中文:使用重疊緩衝區確保連續轉錄準確性
- 功能:
- 可配置的語音偵測能量閾值
- 可調整的句子邊界暫停閾值
- 重疊緩衝區防止音訊間隙
- 多種音訊格式支援
- English: Context-aware translation using conversation history and extracted keywords
- Features:
- Dynamic keyword learning and extraction
- Context-aware translation using recent conversation history
- Multi-language support with IETF BCP 47 format
- Real-time translation with threading for performance
- 繁體中文:使用對話歷史和提取的關鍵字進行上下文感知翻譯
- 功能:
- 動態關鍵字學習和提取
- 使用近期對話歷史的上下文感知翻譯
- IETF BCP 47 格式的多語言支援
- 使用執行緒提升效能的即時翻譯
- English: WebSocket and Server-Sent Events (SSE) for real-time updates across multiple clients
- Features:
- Bidirectional WebSocket communication
- Room-based session management
- Automatic reconnection handling
- Fallback to HTTP POST for reliability
- 繁體中文:使用 WebSocket 和伺服器推送事件 (SSE) 在多重客戶端間進行即時更新
- 功能:
- 雙向 WebSocket 通訊
- 基於房間的會話管理
- 自動重連處理
- HTTP POST 備援機制確保可靠性
- English: Advanced Whisper implementation with improved accuracy
- Features: Batch processing, word-level timestamps, language detection
- 繁體中文:進階 Whisper 實作,具備提升的準確性
- 功能:批次處理、詞級時間戳、語言偵測
- English: Cloud-based transcription with high accuracy
- Features: Server-side VAD, confidence scoring, chunking strategy
- 繁體中文:基於雲端的高準確性轉錄
- 功能:伺服器端 VAD、信心評分、分塊策略
- English: Fast inference with Whisper Large V3 Turbo
- Features: High-speed processing, verbose JSON output, prompt customization
- 繁體中文:使用 Whisper Large V3 Turbo 的快速推論
- 功能:高速處理、詳細 JSON 輸出、提示自訂
faster-whisper
- Fast Whisper implementationopenai
- OpenAI API clientgroq
- Groq API clientpyaudio
- Audio capturespeechrecognition
- Speech recognition utilitiesopencc
- Chinese text conversionhttpx
- HTTP clientwhisperx
- Advanced Whisper featurespython-socketio
- WebSocket clienttorch
- PyTorch for WhisperXtorchvision
- Computer vision utilities
flask
- Web frameworkflask-socketio
- WebSocket support for Flaskrequests
- HTTP requests
- Daily logs:
output/YYYY-MM-DD/HH-MM-SS.json
- Timestamped transcription files - Keywords:
output/current_keywords.txt
- Dynamic keyword learning file - Format: JSON with transcriptions, timestamps, translations, and metadata
- Session data:
temp/{session_id}.json
- Individual session transcriptions - Cache: In-memory transcription cache with 1-hour TTL
- YouTube data: Cached YouTube metadata for live streams
This project is open source. Please check the license file for details. 本專案為開源專案。請查看授權檔案以了解詳細資訊。
Contributions are welcome! Please feel free to submit issues and pull requests. 歡迎貢獻!請隨時提交問題和拉取請求。
For support and questions, please open an issue in the repository. 如需支援和問題,請在儲存庫中開啟問題。