Your task is to develop a program that can be hosted on GitHub or a similar platform. The program should meet the following specifications:
- Input: A video file in MP4 format.
- Output: Two separate audio tracks:
  - Track 1: Cleaned vocals, with as much background noise removed as possible.
  - Track 2: Isolated background noises.
- Utilize any open-source libraries or tools you are familiar with.
- Deep learning solutions like U-Net architectures are commonly used for such tasks, but classical digital signal processing methods are also an option, depending on their effectiveness within the project's time constraints.
- The primary focus should be on separating vocals from background noises, specifically those that are disruptive to STT systems.
- Approach: Briefly explain your approach and the rationale behind your choices.
- Demonstration of Effectiveness: If possible, demonstrate the effectiveness of your solution by running a simple STT on the original and cleaned audio tracks. You may use any open-source STT for this purpose. Then, compare the transcription accuracy.
- Potential Improvements and Limitations: Discuss any potential improvements and limitations of your solution.
- Integration into a Larger STT Pipeline: Consider how this audio separation module could be integrated into a larger speech-to-text pipeline, taking into account factors such as real-time processing, scalability, and adaptability to different video sources.
- Consider using tools/libraries like TensorFlow, PyTorch, and librosa.
- Pretrained models and solutions are available for these kinds of tasks. While these can serve as a starting point, ensure you properly cite any sources if you borrow code or ideas.
This guide will help you set up the environment to run the audio separation program, which uses resources from Google Drive and can be executed in Google Colab.
- A Google account for access to Google Drive and Google Colab.
- Python 3.6 or higher (if running locally).
- pip (Python package installer).
First, clone the repository to your local machine or open it in Google Colab:
git clone https://github.com/POPE001/dubme_task.git
cd dubme_task

Alternatively, you can open the notebook directly in Google Colab:
https://colab.research.google.com/github/POPE001/dubme_task
Next, if you are running the program locally, install the required packages using pip:
pip install -r requirements.txt

This will install all the dependencies listed in the requirements.txt file.
If you are using Google Colab, the dependencies will be installed directly in the notebook using !pip install commands.
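For example, a Colab setup cell might look like the following. The authoritative dependency list is requirements.txt; the packages below are an assumption based on the tools described in this README:

```python
# Illustrative Colab setup cell -- the definitive list lives in requirements.txt.
# Assumed core dependencies: Spleeter (separation), librosa (analysis),
# noisereduce (denoising), and jiwer (WER scoring).
!pip install spleeter librosa noisereduce jiwer
```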
Since the program uses audio files from Google Drive, you'll need to mount your Google Drive in the Colab notebook:
from google.colab import drive
drive.mount('/content/drive')

After mounting, you can access files in your Google Drive under the /content/drive/My Drive/ directory.
In Google Colab, run the notebook cells sequentially. The notebook will execute the audio separation on the provided video file adream.mp4, which is included in the repository's data folder.
For local execution, ensure you are in the project directory, and then run the program using:
python dubme_task.py

The video file adream.mp4 has been added to the data folder of this GitHub repository. Ensure you have the correct path set in your script or notebook to access the video file for processing.
- Create an 'AudioProcessing' Folder:
  - Navigate to your Google Drive.
  - Create a new folder and name it AudioProcessing.
  - This folder will be used to store output files and any intermediate files the program creates.
- Add the Video File:
  - Place the adream.mp4 video file into the root directory of your Google Drive.
  - Ensure that the video file is named correctly as adream.mp4, as the program will be looking for this specific file.
The audio separation was performed on a segment of the "I Have a Dream" speech by Martin Luther King Jr. The original video is titled "I Have a Dream speech by Martin Luther King .Jr HD (subtitled)" and is available on YouTube.
The specific segment used for this project was extracted from the full speech and should be included in the Google Drive root directory as adream.mp4. The video segment is utilized under fair use for educational purposes, as part of this audio processing project to improve Speech-to-Text transcription accuracy.
This project targets improving audio quality for Speech-to-Text (STT) systems. Tools and methodologies were selected for their effectiveness in audio processing (a minimal code sketch of the resulting pipeline follows the list below):
- FFmpeg: Chosen for its reliability in audio extraction, ensuring the original quality is retained for subsequent processing steps.
- Spleeter: Utilized for its pretrained source-separation models, enabling efficient separation of vocals from background noise.
- Noise Reduction Techniques: Implemented to refine the vocal track clarity, a critical factor for the accuracy of STT transcription.
- Open-source STT Tool: Employed for an objective demonstration of the project's effectiveness in improving transcription quality.
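The sketch below illustrates this pipeline end to end. It is a minimal example, not the repository's exact code: the file names, the Spleeter model choice, and the noise-reduction parameters are assumptions.

```python
# Illustrative pipeline: FFmpeg extraction -> Spleeter separation -> noise reduction.
# File names and parameters are assumptions, not the repository's exact code.
import subprocess

import librosa
import noisereduce as nr
import soundfile as sf
from spleeter.separator import Separator

VIDEO = "adream.mp4"           # input video segment
RAW_AUDIO = "adream_raw.wav"   # hypothetical intermediate file name

# 1. Extract a mono 44.1 kHz WAV track from the video.
subprocess.run(
    ["ffmpeg", "-y", "-i", VIDEO, "-vn", "-ac", "1", "-ar", "44100", RAW_AUDIO],
    check=True,
)

# 2. Split vocals from accompaniment with Spleeter's pretrained 2-stem model.
#    This writes separated/adream_raw/vocals.wav (Track 1 before denoising)
#    and separated/adream_raw/accompaniment.wav (Track 2, background noises).
separator = Separator("spleeter:2stems")
separator.separate_to_file(RAW_AUDIO, "separated")

# 3. Apply spectral-gating noise reduction to the vocal stem before STT.
vocals, sr = librosa.load("separated/adream_raw/vocals.wav", sr=None)
cleaned = nr.reduce_noise(y=vocals, sr=sr)
sf.write("vocals_cleaned.wav", cleaned, sr)
```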
Transcription accuracy was compared between the original and processed audio tracks using an open-source STT tool. The Word Error Rate figures reported below show mixed results on this sample: cleaning did not reduce the error rate, a finding that motivated the limitations and future work discussed later.
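The comparison can be reproduced along the lines of the sketch below. Whisper and jiwer are used here only as examples of an open-source STT engine and a WER scorer; the README does not pin down the exact tools used:

```python
# Illustrative accuracy comparison: transcribe both tracks and score them
# against a ground-truth transcript. Whisper and jiwer are assumptions here.
import jiwer
import whisper

# Ground-truth transcript of the segment (shortened placeholder).
REFERENCE = "i have a dream that one day ..."

model = whisper.load_model("base")
original_text = model.transcribe("adream_raw.wav")["text"]
cleaned_text = model.transcribe("vocals_cleaned.wav")["text"]

# Lower WER means higher transcription accuracy.
print("WER, original audio :", jiwer.wer(REFERENCE, original_text))
print("WER, cleaned vocals :", jiwer.wer(REFERENCE, cleaned_text))
```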
Comparative waveform analysis provided a visual representation of the audio quality enhancement:
- Original Audio Waveform: Exhibited a dense signal, reflecting a mix of vocals and noise.
- Processed Audio Waveform: Displayed clearer segmentation, indicating successful noise reduction.
Spectrograms were analyzed to understand the frequency distribution changes post-processing (a reproduction sketch follows the list below):
- Original Audio Spectrogram: Showed a dense frequency spread, typical of mixed audio content.
- Processed Audio Spectrogram: Revealed a discernible reduction in background noise frequencies, particularly in silent segments.
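Both visualisations can be reproduced with librosa, as sketched below; the file names follow the illustrative pipeline sketch above.

```python
# Illustrative waveform and spectrogram comparison of the original vs. processed audio.
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

for title, path in [("Original", "adream_raw.wav"), ("Processed", "vocals_cleaned.wav")]:
    y, sr = librosa.load(path, sr=None)
    fig, (ax_wave, ax_spec) = plt.subplots(2, 1, figsize=(10, 6))

    # Time-domain view: dense regions indicate mixed vocals and noise.
    librosa.display.waveshow(y, sr=sr, ax=ax_wave)
    ax_wave.set_title(f"{title} audio waveform")

    # Log-frequency spectrogram in dB: background noise shows up between speech segments.
    S_db = librosa.amplitude_to_db(np.abs(librosa.stft(y)), ref=np.max)
    librosa.display.specshow(S_db, sr=sr, x_axis="time", y_axis="log", ax=ax_spec)
    ax_spec.set_title(f"{title} audio spectrogram")

    fig.tight_layout()

plt.show()
```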
The transcription's Word Error Rate (WER) provided quantitative evidence of the audio processing impact:
- First 30 Seconds: A higher WER in cleaned vocals (16.67%) compared to the original audio (10%) prompted a reevaluation of the audio processing approach.
- Second 30 Seconds: An unchanged WER (36.67%) for both cleaned and original audio suggested that our processing maintained speech integrity.
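For reference, WER is computed as WER = (S + D + I) / N, where S, D, and I are the numbers of substituted, deleted, and inserted words relative to the reference transcript and N is the number of words in the reference; lower values indicate more accurate transcriptions.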
While the project achieved its primary objectives to a degree, several limitations were identified that open up opportunities for further enhancement:
- Advanced Noise Reduction: Exploring more sophisticated algorithms or deep learning models could yield further improvements in vocal clarity and noise suppression.
- Real-time Processing: The current approach is not optimized for real-time audio processing. Enhancing the methodology to function effectively in a live environment remains a priority for future development.
- Diverse Video Samples: The project was tested with a limited scope, using only one sample video into which noise was artificially introduced using Inshot. This presents a limitation in terms of the robustness and adaptability of the system to different types of audio environments and noise profiles.
- Variable Noise Impact: In testing with other video samples outside the project, noise removal did not consistently improve STT accuracy. This variability indicates that the current noise removal technique may not be universally effective across different types of background noise. It is crucial to investigate the conditions under which noise removal is beneficial and to develop criteria or algorithms that can predict and optimize the application of noise reduction techniques for each specific case.
- Sample Variety and Naturalism: To better assess the system's performance and create a more robust audio processing solution, future work should include testing with a wider variety of video samples. These samples should ideally contain natural background noise and represent a diverse set of recording conditions, speakers, and speech nuances.
To integrate this module into a full-fledged STT system, several factors need consideration (an illustrative batch-processing sketch follows the list):
- Real-time Processing: Essential for applications requiring immediate transcription.
- Scalability: Capability to handle increased load efficiently.
- Adaptability: Flexibility to work with various audio conditions and video sources.
- Quality Assurance: Continuous performance monitoring to ensure consistent STT accuracy.
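As one illustration of the scalability point, the sketch below wraps the separation-plus-transcription steps in a single callable and fans it out over a directory of videos with a process pool. The function names are hypothetical placeholders, not part of this repository:

```python
# Hypothetical wrapper showing how the separation module could slot into a
# larger STT pipeline and scale across many videos. The per-video body is a
# placeholder for the FFmpeg/Spleeter/noisereduce and STT steps sketched above.
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path
from typing import Dict


def separate_and_transcribe(video_path: str) -> str:
    """One pipeline unit: extract -> separate -> denoise -> transcribe."""
    raise NotImplementedError("chain the extraction, separation, and STT steps here")


def run_batch(video_dir: str, workers: int = 4) -> Dict[str, str]:
    """Process every MP4 in a directory concurrently and return transcripts."""
    videos = [str(p) for p in Path(video_dir).glob("*.mp4")]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return dict(zip(videos, pool.map(separate_and_transcribe, videos)))
```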
The following resources were referenced during the development of this project:
- Jansson, Andreas, et al. "Singing Voice Separation with Deep U-Net Convolutional Networks." ISMIR, 2017.
- Grais, Emad M., and Plumbley, Mark D. "Single Channel Audio Source Separation using Convolutional Denoising Autoencoders." 2017.
- Ronneberger, Olaf, Fischer, Philipp, and Brox, Thomas. "U-Net: Convolutional Networks for Biomedical Image Segmentation." MICCAI, 2015.
- Piczak, Karol J. "ESC: Dataset for Environmental Sound Classification." Proceedings of the 23rd Annual ACM Conference on Multimedia, Brisbane, Australia, 2015.