A GPT/Whisper-based system for identification and transcription of non-vocal sounds, such as sirens, falling objects, collisions, vehicle engines, etc.
This project is part of a research work aimed at developing a system capable of transforming non-vocal sounds into textual descriptions. We use a model based on OpenAI's Whisper, fine-tuned to identify different categories of environmental sounds.
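At a high level, the approach reuses Whisper's audio encoder as a feature extractor and attaches a classification head for sound categories. The sketch below is illustrative only; the class name, head architecture, and label count are assumptions, not the project's actual code.

```python
# Illustrative sketch (not the project's actual code): Whisper's audio
# encoder as a feature extractor with a linear classification head.
import torch
import torch.nn as nn
import whisper

class SoundClassifier(nn.Module):
    def __init__(self, num_classes: int, model_size: str = "small"):
        super().__init__()
        self.encoder = whisper.load_model(model_size).encoder
        d_model = self.encoder.ln_post.normalized_shape[0]
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        features = self.encoder(mel)      # (batch, frames, d_model)
        pooled = features.mean(dim=1)     # average over time
        return self.head(pooled)          # (batch, num_classes) logits

# Example: classify one clip (file name is hypothetical)
model = SoundClassifier(num_classes=5)
audio = whisper.pad_or_trim(whisper.load_audio("siren.wav"))  # 16 kHz, 30 s
mel = whisper.log_mel_spectrogram(audio).unsqueeze(0)
print(model(mel).softmax(dim=-1))
```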
```
audio-classifier/
├── src/
│   ├── ml/                # Machine learning modules
│   │   └── trained_model/ # Whisper model
│   ├── backend/           # FastAPI API
│   └── frontend/          # Web interface (Flask)
└── data/
    ├── reports/           # Project documentation
    └── sounds/            # Training data
```
- Web interface for uploading or recording audio
- REST API for processing and classifying audio
- Support for `.wav` files
- Automatic resampling to 16 kHz (see the preprocessing sketch below)
- 30-second limit per clip
- Model trained to identify various categories of non-vocal sounds
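For reference, the 16 kHz / 30-second preprocessing could be implemented as in this minimal sketch using torchaudio; the function name is an assumption, not the project's actual code.

```python
# Minimal preprocessing sketch: load a .wav file, downmix to mono,
# resample to 16 kHz, and pad/trim to Whisper's 30-second window.
import torchaudio
import whisper

def preprocess(path: str):
    waveform, sr = torchaudio.load(path)   # (channels, samples)
    waveform = waveform.mean(dim=0)        # downmix to mono
    if sr != 16000:
        waveform = torchaudio.functional.resample(waveform, sr, 16000)
    return whisper.pad_or_trim(waveform)   # exactly 30 s at 16 kHz
```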
The project requires:

- Python 3.8+
- PyTorch
- Whisper
- FastAPI
- Flask
- Other dependencies listed in `requirements.txt`
- Clone the repository:

```
git clone https://github.com/rubemalmeida/audio-classifier.git
cd audio-classifier
```
- Install dependencies:

```
pip install -r requirements.txt
```
The trained model files are not included in the repository due to their size. Download them from Google Drive:
- Download the trained model: Google Drive Link
- Extract the downloaded zip file
- Place the model files in the `src/ml/` directory
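Once the files are extracted, a quick sanity check confirms the checkpoint is where the backend can find it; the filename `model.pt` is an assumption.

```python
# Verify the downloaded checkpoint is in the expected location.
from pathlib import Path
import torch

ckpt = Path("src/ml/trained_model/model.pt")   # assumed filename
state = torch.load(ckpt, map_location="cpu")
print(type(state))
```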
To download our research report:
- Access the PDF directly: relatorio.pdf
- Or find it in the `data/reports/` directory after cloning the repository
Start the FastAPI backend server:
```
python -m src.backend.main
```
The API will be available at http://localhost:8000.
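With the backend running, the API can be exercised directly. The endpoint path and response fields below are assumptions for illustration; check FastAPI's interactive docs at http://localhost:8000/docs for the actual schema.

```python
# Illustrative client call; the /classify endpoint name and the response
# fields ("label", "confidence") are assumptions, not a documented contract.
import requests

with open("siren.wav", "rb") as f:
    resp = requests.post(
        "http://localhost:8000/classify",
        files={"file": ("siren.wav", f, "audio/wav")},
    )
resp.raise_for_status()
print(resp.json())   # e.g. {"label": "siren", "confidence": 0.93}
```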
In a separate terminal, start the Flask frontend server:
```
python -m src.frontend.app
```

The web interface will be available at http://localhost:5000.
- Ensure both backend and frontend servers are running
- Open your web browser and navigate to http://localhost:5000
- Upload an audio file (.wav format) or record a new one using your microphone
- Click on "Classify Sound"
- View the classification results showing the detected sound type and confidence level
To train a new model:
```
# Method 1: Using the Jupyter notebook
jupyter notebook src/ml/train.ipynb

# Method 2: Using the training script
python src/ml/train.py --audio_dir "data/sounds" --model_size "small" --epochs 10
```
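For orientation, a fine-tuning loop consistent with these flags might look like the sketch below; the dataset interface and hyperparameters are assumptions, not a transcript of `src/ml/train.py`.

```python
# Hypothetical fine-tuning loop: cross-entropy over sound categories.
import torch
from torch.utils.data import DataLoader

def train(model, dataset, epochs: int = 10, lr: float = 1e-4):
    # dataset is assumed to yield (mel_spectrogram, class_index) pairs
    loader = DataLoader(dataset, batch_size=8, shuffle=True)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    for epoch in range(epochs):
        total = 0.0
        for mel, label in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(mel), label)
            loss.backward()
            optimizer.step()
            total += loss.item()
        print(f"epoch {epoch + 1}: mean loss {total / len(loader):.4f}")
```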
Some of the results obtained from the training process are shown below:
Figure 1: Accuracy per epoch

Figure 2: Loss per epoch