This is an integrated audio/video processing tool that transcribes long audio or video files into text and generates SRT subtitles using Google Gemini. The tool handles the entire pipeline automatically: audio extraction from video, audio splitting, transcription, translation, and subtitle generation.
- Video Processing: Automatically extract audio from video files
- Intelligent Audio Splitting: Automatically split long audio files into smaller segments based on silence detection
- Audio Transcription and Translation: Use Google Gemini AI to transcribe audio to text and translate into multiple languages
- Multilingual Support: Support for translating audio content into Simplified Chinese, Traditional Chinese, English, Japanese, Korean, and other languages
- Multilingual Interface: Support for Chinese and English interfaces, switchable at any time
- Subtitle Timestamp Generation: Automatically add precise timestamps to transcriptions and translations
- SRT Subtitle Generation: Merge all transcription segments to generate standard SRT format subtitle files
- Graphical User Interface: Provide an intuitive interface to simplify the processing workflow
- Flexible Output Options: Support for transcription-only, translation-only, or both in subtitle files
- Process Interruption: Support for forcibly terminating ongoing processing tasks at any time
- Python 3.8+
- FFmpeg (for video processing)
- Required Python libraries (see Installation section)
- Google AI API Key (Gemini 2.5 model)
- Clone or download this repository
- Install the necessary Python dependencies:
  ```bash
  pip install pydub librosa soundfile google-genai numpy psutil mutagen
  ```
- Ensure FFmpeg is installed on your system (for video processing):

  ```bash
  # On Ubuntu/Debian systems
  sudo apt-get install ffmpeg

  # On macOS (using Homebrew)
  brew install ffmpeg

  # On Windows
  # Download from https://ffmpeg.org/download.html and add to your system PATH
  ```
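  Once FFmpeg is available, the tool invokes it to pull the audio track out of video inputs; the equivalent manual command looks roughly like this (file names illustrative):

  ```bash
  # Extract the audio track from a video, dropping the video stream
  ffmpeg -i video.mp4 -vn -acodec libmp3lame extracted_audio.mp3
  ```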
- Configure Google AI API Key:
  - Obtain a Google Gemini API key
  - Set the `GOOGLE_API_KEY` environment variable, or enter the key manually when using the tool
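  For example, the environment variable can be set before launching the tool:

  ```bash
  # Linux/macOS
  export GOOGLE_API_KEY="your-api-key"

  # Windows (PowerShell)
  $env:GOOGLE_API_KEY = "your-api-key"
  ```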
- Run the GUI application:

  ```bash
  python audio_processor_gui.py
  ```
- In the interface:
- Select interface language (Chinese or English)
- Select input file (audio or video)
- Enter Google AI API key
- Adjust processing parameters (if needed)
- Click "Start Processing"
- To interrupt processing, click the "Stop Processing" button; this forcibly terminates all related processes
```bash
python process_audio.py input_file.mp3 --api-key YOUR_API_KEY [other options]
```
Examples:

```bash
# Basic usage
python process_audio.py recording.mp3 --api-key YOUR_API_KEY

# Process video and include both transcription and translation
python process_audio.py video.mp4 --api-key YOUR_API_KEY --content both

# Use a different target language (translate to English)
python process_audio.py chinese_speech.mp3 --api-key YOUR_API_KEY --target-language "English"

# Use Japanese as the target language
python process_audio.py speech.mp3 --api-key YOUR_API_KEY --target-language "Japanese"

# Adjust audio splitting parameters
python process_audio.py long_audio.mp3 --api-key YOUR_API_KEY --max-length 240 --silence-length 700 --silence-threshold -45

# Specify output directory and clean up intermediate files
python process_audio.py speech.mp3 --api-key YOUR_API_KEY --output-dir ./output_directory --cleanup
```
- `--api-key`: Google AI API Key (required)
- `--output-dir`: Output directory (defaults to a directory named after the input file)
- `--target-language`: Target language for translation (default: "Simplified Chinese"; options include Traditional Chinese, English, Japanese, Korean, etc.)
- `--content`: Subtitle content type
  - `transcript`: Transcription only
  - `translation`: Translation only
  - `both`: Both transcription and translation (default)
- `--model-name`: Gemini model to use for transcription (default: gemini-2.5-pro-preview-03-25)
- `--max-length`: Maximum audio segment length (seconds, default 300)
- `--silence-length`: Minimum silence length for detection (milliseconds, default 500)
- `--silence-threshold`: Silence detection threshold (dB, default -40; see the sketch after this list)
- `--first-chunk-offset`: Time offset for the first audio segment (seconds, default 0)
- `--cleanup`: Delete intermediate files after processing
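To make the splitting parameters concrete, here is a minimal sketch of silence detection with pydub (one of the tool's dependencies); the two keyword arguments correspond to `--silence-length` and `--silence-threshold`, though the actual logic in split_audio.py may differ:

```python
from pydub import AudioSegment
from pydub.silence import detect_silence

audio = AudioSegment.from_file("long_audio.mp3")

# min_silence_len ~ --silence-length (ms); silence_thresh ~ --silence-threshold (dBFS)
silences = detect_silence(audio, min_silence_len=500, silence_thresh=-40)

# detect_silence returns [start_ms, end_ms] pairs; the midpoint of each
# silent region is a natural candidate split point
split_points = [(start + end) // 2 for start, end in silences]
print(split_points)
```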
- Preprocessing:
  - If the input is a video file, extract the audio with FFmpeg
- Audio Splitting:
  - Detect silence points in the audio
  - Split the audio at suitable silence points
  - Produce multiple smaller audio segments
- Audio Transcription:
  - Process each audio segment with Google Gemini AI
  - Generate a transcription and translation for each segment
  - Embed timestamps in the transcription and translation text
- Subtitle Generation:
  - Calculate cumulative time offsets from the audio segment lengths (see the sketch after this list)
  - Merge all transcription files
  - Generate subtitle files in SRT format
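As an illustration of the offset step, here is a minimal sketch (with hypothetical chunk durations) of shifting per-chunk timestamps onto a single SRT timeline:

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    h, rem = divmod(total_ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

# Hypothetical segment lengths in seconds
chunk_durations = [298.4, 301.2, 245.0]

# Each chunk's subtitles are shifted by the total length of the chunks before it
offsets = [sum(chunk_durations[:i]) for i in range(len(chunk_durations))]

# A caption 12.5 s into the second chunk lands here on the merged timeline:
print(srt_timestamp(offsets[1] + 12.5))  # 00:05:10,900
```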
- `split_audio.py`: Audio splitting module
- `transcript.py`: Audio transcription and translation module
- `combine_transcripts.py`: Subtitle merging module
- `process_audio.py`: Main processing workflow coordination module
- `audio_processor_gui.py`: Graphical user interface module
- `verify_durations.py`: Utility script to verify audio chunk durations (for debugging)
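For a quick check of the split results, chunk durations can be read with mutagen (one of the listed dependencies); verify_durations.py likely does something along these lines, though the actual script may differ and the file names below are hypothetical:

```python
from mutagen import File as MutagenFile

def duration_seconds(path: str) -> float:
    """Read an audio file's duration (in seconds) from its metadata."""
    return MutagenFile(path).info.length

for chunk in ["chunk_001.mp3", "chunk_002.mp3"]:  # hypothetical file names
    print(chunk, f"{duration_seconds(chunk):.2f} s")
```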
- Processing long audio/video files may take some time
- API calls may incur costs, depending on your Google AI API usage plan
- The accuracy of subtitle timestamps may vary depending on audio quality
- Internet connection is required to use the Google AI API
- Why is it necessary to split audio?
  - The Google Gemini API has limits on file size and processing duration
  - Splitting into smaller segments improves transcription accuracy and reliability
- How to adjust timestamp offsets?
  - If the generated subtitles are not synchronized with the video, use the `--first-chunk-offset` parameter to adjust them (example below)
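    For example, to shift the first chunk's timestamps by 1.5 seconds (assuming positive values delay the subtitles):

    ```bash
    python process_audio.py input_file.mp3 --api-key YOUR_API_KEY --first-chunk-offset 1.5
    ```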
- How to handle audio in different languages?
  - The system automatically detects the audio language, transcribes it, and then translates it into the specified target language
  - The default translation target is Simplified Chinese; change it with the `--target-language` parameter
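    Under the hood, each chunk is sent to Gemini for transcription and translation. A minimal sketch with the google-genai SDK might look like this (the actual prompts and response parsing in transcript.py are more elaborate, and the file name is hypothetical):

    ```python
    from google import genai

    client = genai.Client(api_key="YOUR_API_KEY")

    # Upload one audio chunk, then request a transcription plus translation
    audio = client.files.upload(file="chunk_001.mp3")
    response = client.models.generate_content(
        model="gemini-2.5-pro-preview-03-25",
        contents=[audio, "Transcribe this audio with timestamps, then translate it into Simplified Chinese."],
    )
    print(response.text)
    ```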
- What target languages are supported?
  - Many languages are supported, including Simplified Chinese, Traditional Chinese, English, Japanese, Korean, Russian, Spanish, French, and German
  - In the GUI, select the language from a dropdown menu; on the command line, specify it with `--target-language`
- FFmpeg installation issues?
  - Ensure FFmpeg is correctly installed and added to your system PATH
  - Verify the installation by running `ffmpeg -version` in the command line
- How to stop ongoing processing?
  - In the GUI, click the "Stop Processing" button
  - The program will forcibly terminate all related processes
  - Note: forced stopping loses the current processing progress
MIT License
Copyright (c) 2025
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Contributions to this project are welcome! If you'd like to participate in development, you can follow these steps:
- Fork this repository
- Create your feature branch (`git checkout -b feature/AmazingFeature`)
- Commit your changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
If you find any bugs or have any suggestions for improvements, please feel free to submit an issue.
- Improve recognition when multiple people are speaking, with possible speaker role identification
- Address potential degradation issues when repetitive words appear in speech
- Add API error handling with retry mechanism
- Allow selection between single conversation mode and multiple conversation mode
- Allow selection of different Gemini models (e.g., 2.5 Pro and 2.5 Flash)
- Improve prompts to make the 2.0 Flash model usable