This is an integrated audio/video processing tool that transcribes long audio or video files into text and generates SRT subtitles using Google Gemini. The tool handles the entire pipeline automatically: audio extraction from video, audio splitting, transcription, translation, and subtitle generation.
- Video Processing: Automatically extract audio from video files
- Intelligent Audio Splitting: Automatically split long audio files into smaller segments based on silence detection
- Audio Transcription and Translation: Use Google Gemini AI to transcribe audio to text and translate into multiple languages
- Multilingual Support: Support for translating audio content into Simplified Chinese, Traditional Chinese, English, Japanese, Korean, and other languages
- Multilingual Interface: Support for Chinese and English interfaces, switchable at any time
- Subtitle Timestamp Generation: Automatically add precise timestamps to transcriptions and translations
- SRT Subtitle Generation: Merge all transcription segments to generate standard SRT format subtitle files
- Graphical User Interface: Provide an intuitive interface to simplify the processing workflow
- Flexible Output Options: Support for transcription-only, translation-only, or both in subtitle files
- Process Interruption: Support for forcibly terminating ongoing processing tasks at any time
- Python 3.8+
- FFmpeg (for video processing)
- Required Python libraries (see Installation section)
- Google AI API Key (Gemini 2.5 model)
- Clone or download this repository
- Install the necessary Python dependencies:
  ```bash
  pip install pydub librosa soundfile google-genai numpy psutil mutagen
  ```
- Ensure FFmpeg is installed on your system (for video processing):

  ```bash
  # On Ubuntu/Debian systems
  sudo apt-get install ffmpeg

  # On macOS (using Homebrew)
  brew install ffmpeg

  # On Windows
  # Download from https://ffmpeg.org/download.html and add to your system PATH
  ```
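  Once FFmpeg is available, the tool invokes it to pull the audio track out of video inputs; the equivalent manual command looks roughly like this (file names illustrative):

  ```bash
  # Extract the audio track from a video, dropping the video stream
  ffmpeg -i video.mp4 -vn -acodec libmp3lame extracted_audio.mp3
  ```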
- Configure Google AI API Key:
  - Obtain a Google Gemini API key
  - Set the `GOOGLE_API_KEY` environment variable, or enter the key manually when using the tool
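  For example, the environment variable can be set before launching the tool:

  ```bash
  # Linux/macOS
  export GOOGLE_API_KEY="your-api-key"

  # Windows (PowerShell)
  $env:GOOGLE_API_KEY = "your-api-key"
  ```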
- Run the GUI application:

  ```bash
  python audio_processor_gui.py
  ```
- In the interface:
- Select interface language (Chinese or English)
- Select input file (audio or video)
- Enter Google AI API key
- Adjust processing parameters (if needed)
- Click "Start Processing"
- To interrupt processing, click the "Stop Processing" button; this forcibly terminates all related processes
```bash
python process_audio.py input_file.mp3 --api-key YOUR_API_KEY [other options]
```
Examples:

```bash
# Basic usage
python process_audio.py recording.mp3 --api-key YOUR_API_KEY

# Process video and include both transcription and translation
python process_audio.py video.mp4 --api-key YOUR_API_KEY --content both

# Use a different target language (translate to English)
python process_audio.py chinese_speech.mp3 --api-key YOUR_API_KEY --target-language "English"

# Use Japanese as the target language
python process_audio.py speech.mp3 --api-key YOUR_API_KEY --target-language "Japanese"

# Adjust audio splitting parameters
python process_audio.py long_audio.mp3 --api-key YOUR_API_KEY --max-length 240 --silence-length 700 --silence-threshold -45

# Specify output directory and clean up intermediate files
python process_audio.py speech.mp3 --api-key YOUR_API_KEY --output-dir ./output_directory --cleanup
```
- `--api-key`: Google AI API Key (required)
- `--output-dir`: Output directory (defaults to a directory named after the input file)
- `--target-language`: Target language for translation (default: "Simplified Chinese"; options include Traditional Chinese, English, Japanese, Korean, etc.)
- `--content`: Subtitle content type
  - `transcript`: Transcription only
  - `translation`: Translation only
  - `both`: Both transcription and translation (default)
- `--model-name`: Gemini model to use for transcription (default: gemini-2.5-pro-preview-03-25)
- `--max-length`: Maximum audio segment length (seconds, default 300)
- `--silence-length`: Minimum silence length for detection (milliseconds, default 500)
- `--silence-threshold`: Silence detection threshold (dB, default -40; see the sketch after this list)
- `--first-chunk-offset`: Time offset for the first audio segment (seconds, default 0)
- `--cleanup`: Delete intermediate files after processing
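To make the splitting parameters concrete, here is a minimal sketch of silence detection with pydub (one of the tool's dependencies); the two keyword arguments correspond to `--silence-length` and `--silence-threshold`, though the actual logic in split_audio.py may differ:

```python
from pydub import AudioSegment
from pydub.silence import detect_silence

audio = AudioSegment.from_file("long_audio.mp3")

# min_silence_len ~ --silence-length (ms); silence_thresh ~ --silence-threshold (dBFS)
silences = detect_silence(audio, min_silence_len=500, silence_thresh=-40)

# detect_silence returns [start_ms, end_ms] pairs; the midpoint of each
# silent region is a natural candidate split point
split_points = [(start + end) // 2 for start, end in silences]
print(split_points)
```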
- Preprocessing:
  - If the input is a video file, extract the audio with FFmpeg
- Audio Splitting:
  - Detect silence points in the audio
  - Split the audio at suitable silence points
  - Produce multiple smaller audio segments
- Audio Transcription:
  - Process each audio segment with Google Gemini AI
  - Generate a transcription and translation for each segment
  - Embed timestamps in the transcription and translation text
- Subtitle Generation:
  - Calculate cumulative time offsets from the audio segment lengths (see the sketch after this list)
  - Merge all transcription files
  - Generate subtitle files in SRT format
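As an illustration of the offset step, here is a minimal sketch (with hypothetical chunk durations) of shifting per-chunk timestamps onto a single SRT timeline:

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    h, rem = divmod(total_ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

# Hypothetical segment lengths in seconds
chunk_durations = [298.4, 301.2, 245.0]

# Each chunk's subtitles are shifted by the total length of the chunks before it
offsets = [sum(chunk_durations[:i]) for i in range(len(chunk_durations))]

# A caption 12.5 s into the second chunk lands here on the merged timeline:
print(srt_timestamp(offsets[1] + 12.5))  # 00:05:10,900
```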
- `split_audio.py`: Audio splitting module
- `transcript.py`: Audio transcription and translation module
- `combine_transcripts.py`: Subtitle merging module
- `process_audio.py`: Main processing workflow coordination module
- `audio_processor_gui.py`: Graphical user interface module
- `verify_durations.py`: Utility script to verify audio chunk durations (for debugging)
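For a quick check of the split results, chunk durations can be read with mutagen (one of the listed dependencies); verify_durations.py likely does something along these lines, though the actual script may differ and the file names below are hypothetical:

```python
from mutagen import File as MutagenFile

def duration_seconds(path: str) -> float:
    """Read an audio file's duration (in seconds) from its metadata."""
    return MutagenFile(path).info.length

for chunk in ["chunk_001.mp3", "chunk_002.mp3"]:  # hypothetical file names
    print(chunk, f"{duration_seconds(chunk):.2f} s")
```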
- Processing long audio/video files may take some time
- API calls may incur costs, depending on your Google AI API usage plan
- The accuracy of subtitle timestamps may vary depending on audio quality
- Internet connection is required to use the Google AI API
- Why is it necessary to split audio?
  - The Google Gemini API has limits on file size and processing duration
  - Splitting into smaller segments improves transcription accuracy and reliability
- How to adjust timestamp offsets?
  - If the generated subtitles are not synchronized with the video, use the `--first-chunk-offset` parameter to adjust them (example below)
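    For example, to shift the first chunk's timestamps by 1.5 seconds (assuming positive values delay the subtitles):

    ```bash
    python process_audio.py input_file.mp3 --api-key YOUR_API_KEY --first-chunk-offset 1.5
    ```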
- How to handle audio in different languages?
  - The system automatically detects the audio language, transcribes it, and then translates it into the specified target language
  - The default translation target is Simplified Chinese; change it with the `--target-language` parameter
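    Under the hood, each chunk is sent to Gemini for transcription and translation. A minimal sketch with the google-genai SDK might look like this (the actual prompts and response parsing in transcript.py are more elaborate, and the file name is hypothetical):

    ```python
    from google import genai

    client = genai.Client(api_key="YOUR_API_KEY")

    # Upload one audio chunk, then request a transcription plus translation
    audio = client.files.upload(file="chunk_001.mp3")
    response = client.models.generate_content(
        model="gemini-2.5-pro-preview-03-25",
        contents=[audio, "Transcribe this audio with timestamps, then translate it into Simplified Chinese."],
    )
    print(response.text)
    ```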
- What target languages are supported?
  - Many languages are supported, including Simplified Chinese, Traditional Chinese, English, Japanese, Korean, Russian, Spanish, French, and German
  - In the GUI, select the language from a dropdown menu; on the command line, specify it with `--target-language`
- FFmpeg installation issues?
  - Ensure FFmpeg is correctly installed and added to your system PATH
  - Verify the installation by running `ffmpeg -version` in the command line
- How to stop ongoing processing?
  - In the GUI, click the "Stop Processing" button
  - The program will forcibly terminate all related processes
  - Note: forced stopping loses the current processing progress
MIT License
Copyright (c) 2025
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Contributions to this project are welcome! If you'd like to participate in development, you can follow these steps:
- Fork this repository
- Create your feature branch (`git checkout -b feature/AmazingFeature`)
- Commit your changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
If you find any bugs or have any suggestions for improvements, please feel free to submit an issue.
- Improve recognition when multiple people are speaking, with possible speaker role identification
- Address potential degradation issues when repetitive words appear in speech
- Add API error handling with retry mechanism
- Allow selection between single conversation mode and multiple conversation mode
- Allow selection of different Gemini models (e.g., 2.5 Pro and 2.5 Flash)
- Improve prompts to make the 2.0 Flash model usable