203 changes: 203 additions & 0 deletions narrative-audio-system/README.md
@@ -0,0 +1,203 @@
# TableTalk Narrative Audio System

This repository contains the Technical Evaluation Test for the **GSoC 2026 HumanAI: TableTalk** project. It implements an end-to-end pipeline for processing, classifying, and retrieving narrative audio for interactive storytelling.

## 📄 Final Submission Documents
* **[Technical Report (PDF)](./TableTalk%20Narrative%20Audio%20System_%20Technical%20Report.pdf)** - Detailed analysis of methodology, results, and storytelling heuristics.
* **[Implementation Roadmap](./TableTalk%20Narrative%20Audio%20System_%20Technical%20Report.pdf#page=4)** - 12-week GSoC project plan.

---

## 🌟 Key Results
* **Task 2 (Classification):** Trained a feedforward neural network to **38.6% test accuracy** on the RAVDESS subset and identified key markers that distinguish "Calm" from "Fearful" tones.
* **Task 3 (Transcription):** Achieved an average **Word Error Rate (WER) of 16.67%** using OpenAI Whisper.
* **Bonus Task:** Developed a storytelling detection heuristic based on **mean pitch variation (123.01 Hz)** and **mean pause ratio (0.952)**, the averages of the `pitch_std_hz` and `pause_ratio` columns in `examples/storytelling_analysis.csv`.

---

## 🛠️ Setup & Installation

### 1. System Requirements
This project requires **FFmpeg** for audio processing.
* **macOS:** `brew install ffmpeg`
* **Ubuntu/Linux:** `sudo apt install ffmpeg`
* **Windows:** Install via [ffmpeg.org](https://ffmpeg.org/download.html) and add to PATH.

### 2. Python Environment
```bash
# Clone the repository
git clone [YOUR_REPO_URL]
cd [REPO_NAME]

# Install dependencies
pip install -r requirements.txt
```

---

## Project Structure

- `run_pipeline.py`: End-to-end pipeline for all required tasks and bonus analysis
- `task1_audio_pipeline/audio_pipeline.py`: Task 1 audio preprocessing and feature extraction
- `task2_tone_classification/train_classifier.py`: Task 2 tone classification model training and evaluation
- `task3_transcription/whisper_transcriber.py`: Task 3 batch transcription and WER measurement
- `task4_audio_retrieval/retrieval_prototype.py`: Task 4 retrieval prototype (filtering + semantic ranking)
- `task_bonus_storytelling/storytelling_analysis.py`: Bonus storytelling feature analysis and scoring
- `examples/`: Input recordings, labels, and generated output artifacts

---

## Task Summary

### Task 1: Audio Processing Pipeline

The Task 1 pipeline:

1. Loads `.wav` files from the input directory
2. Normalizes audio amplitude for consistent feature extraction
3. Segments audio into fixed windows when needed
4. Extracts machine-learning-ready features
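
The normalization and segmentation steps above can be sketched as follows. This is a minimal illustration, not the code in `audio_pipeline.py`: the function names, the peak target of 0.95, the 3-second window, and the zero-padding of the last window are all assumptions made for the example.

```python
import numpy as np


def peak_normalize(samples, target_peak=0.95):
    """Scale a waveform so its largest absolute sample equals target_peak (assumed value)."""
    samples = np.asarray(samples, dtype=np.float32)
    peak = float(np.max(np.abs(samples))) if samples.size else 0.0
    if peak == 0.0:
        return samples  # silent or empty clip: nothing to scale
    return (samples * (target_peak / peak)).astype(np.float32)


def segment_fixed(samples, sample_rate, window_seconds=3.0):
    """Split a waveform into consecutive fixed-length windows.

    Clips that already fit in one window are returned unchanged; the final
    window of a longer clip is zero-padded to full length (an assumption).
    """
    samples = np.asarray(samples)
    window = int(round(sample_rate * window_seconds))
    if len(samples) <= window:
        return [samples]
    segments = []
    for start in range(0, len(samples), window):
        chunk = samples[start:start + window]
        if len(chunk) < window:
            chunk = np.pad(chunk, (0, window - len(chunk)))
        segments.append(chunk)
    return segments
```

Normalizing before segmenting keeps every window of a clip on the same amplitude scale, which is one reasonable way to get consistent features across windows.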

Extracted features include:

1. MFCC coefficients
2. Pitch (fundamental frequency summary)
3. Spectral centroid
4. RMS energy
5. Duration
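
Three of the listed features (duration, RMS energy, spectral centroid) can be computed with NumPy alone, as in the hedged sketch below. It uses a single FFT over the whole clip, which is a simplification; the project may well compute these per frame with an audio library, and MFCCs and pitch are left out because they need one. The function name and dictionary keys are hypothetical.

```python
import numpy as np


def basic_features(samples, sample_rate):
    """Return duration, RMS energy, and a whole-clip spectral centroid."""
    samples = np.asarray(samples, dtype=np.float64)
    if samples.size == 0:
        return {"duration_seconds": 0.0, "rms_energy": 0.0, "spectral_centroid_hz": 0.0}

    duration = len(samples) / sample_rate
    rms = float(np.sqrt(np.mean(samples ** 2)))

    # Spectral centroid: magnitude-weighted mean frequency of the spectrum.
    magnitudes = np.abs(np.fft.rfft(samples))
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / sample_rate)
    total = magnitudes.sum()
    centroid = float((freqs * magnitudes).sum() / total) if total > 0 else 0.0

    return {
        "duration_seconds": duration,
        "rms_energy": rms,
        "spectral_centroid_hz": centroid,
    }
```

For a pure 440 Hz tone the centroid sits at about 440 Hz, which makes a convenient sanity check before running the function on real recordings.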

Primary outputs:

- `examples/task1_features_dataset.csv`
- `examples/normalized_audio/`

### Task 2: Narrative Tone Classification

The classifier uses MFCC-based features and a feedforward neural network to predict emotional tone labels. The training pipeline includes:

1. Stratified train/test split
2. Feature standardization using train-set statistics
3. Neural model training with cross-entropy loss
4. Test evaluation
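
The first two steps, a stratified split followed by standardization with train-set statistics, can be sketched in NumPy as below. The helper names, the 20% test fraction, and the fixed seed are assumptions for illustration; `train_classifier.py` may rely on a library routine instead.

```python
import numpy as np


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) with each class split in the same proportion."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    train_idx, test_idx = [], []
    for cls in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == cls))
        # Keep at least one test sample per class.
        n_test = max(1, int(round(len(idx) * test_fraction)))
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return np.array(sorted(train_idx)), np.array(sorted(test_idx))


def standardize(train_x, test_x):
    """Scale both sets using the mean and std of the training set only."""
    train_x = np.asarray(train_x, dtype=np.float64)
    test_x = np.asarray(test_x, dtype=np.float64)
    mean = train_x.mean(axis=0)
    std = train_x.std(axis=0)
    std[std == 0] = 1.0  # constant features: avoid division by zero
    return (train_x - mean) / std, (test_x - mean) / std
```

Computing the mean and standard deviation on the training rows only keeps test-set information out of the model inputs, which is the point of step 2.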

Reported metrics:

1. Accuracy
2. Weighted F1 score
3. Per-class report
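
For reference, accuracy and weighted F1 can be computed from two label lists as in the following sketch. The function names are illustrative; the weighting (per-class F1 weighted by the number of true samples in each class) follows the usual definition of a support-weighted F1, and the labels in the usage example are made up.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights proportional to class support."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for cls, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (n / total) * f1
    return score


# Hypothetical labels, for illustration only.
y_true = ["calm", "calm", "fearful", "fearful"]
y_pred = ["calm", "fearful", "fearful", "fearful"]
print(accuracy(y_true, y_pred))     # 0.75
print(weighted_f1(y_true, y_pred))  # 0.5 * (2/3) + 0.5 * 0.8, about 0.733
```

Weighted F1 is reported alongside accuracy because the classes in the label file are not perfectly balanced.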

### Task 3: AI-Based Transcription

The transcription module uses OpenAI Whisper to:

1. Transcribe multiple recordings in batch
2. Save transcripts to a text output file
3. Measure transcription quality with Word Error Rate (WER) on a subset
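
WER is the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The sketch below implements that definition directly; the text normalization (lowercasing and stripping punctuation) is an assumption, and `whisper_transcriber.py` may normalize differently or use a dedicated library.

```python
import re


def normalize(text):
    """Lowercase, drop punctuation other than apostrophes, and split into words."""
    return re.sub(r"[^\w\s']", "", text.lower()).split()


def word_error_rate(reference, hypothesis):
    """(substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference transcript is empty")

    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical hypothesis with one substituted word out of six.
print(word_error_rate("Dogs are sitting by the door.",
                      "Dogs are sitting by the floor"))  # 1/6, about 0.1667
```

Because the denominator is the reference length, a single wrong word in a short six-word sentence already moves WER by almost 17 percentage points, so averages over short clips are sensitive to individual errors.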

Primary output:

- `examples/transcripts.txt`

### Task 4: Narrative Audio Retrieval (TableTalk Simulation)

The retrieval system uses a hybrid strategy:

1. Structured filtering from query constraints (duration, energy, pitch, tone)
2. Semantic ranking over generated recording descriptions

Example queries:

1. `calm narration longer than 4 seconds`
2. `high-energy speech`
3. `dramatic dialogue`
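
The two-stage strategy can be illustrated with the minimal sketch below. Two things in it are stand-ins and not the prototype's actual method: constraints are passed in explicitly instead of being parsed from the query text, and ranking uses bag-of-words cosine similarity in place of a semantic embedding model. The record fields, filenames, and descriptions are hypothetical.

```python
import math
import re
from collections import Counter


def tokens(text):
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(records, query, tone=None, min_duration=None, top_k=3):
    # Stage 1: structured filtering on explicit constraints.
    candidates = [
        r for r in records
        if (tone is None or r["tone"] == tone)
        and (min_duration is None or r["duration_seconds"] > min_duration)
    ]
    # Stage 2: rank the survivors by similarity between query and description.
    query_vec = Counter(tokens(query))
    scored = [(cosine(query_vec, Counter(tokens(r["description"]))), r)
              for r in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [r["filename"] for _, r in scored[:top_k]]


# Hypothetical records, for illustration only.
records = [
    {"filename": "a.wav", "tone": "calm", "duration_seconds": 4.5,
     "description": "calm slow narration with long pauses"},
    {"filename": "b.wav", "tone": "calm", "duration_seconds": 3.2,
     "description": "calm short narration"},
    {"filename": "c.wav", "tone": "angry", "duration_seconds": 5.0,
     "description": "loud high energy angry speech"},
    {"filename": "d.wav", "tone": "calm", "duration_seconds": 4.1,
     "description": "quiet speech"},
]
print(retrieve(records, "calm narration", tone="calm", min_duration=4.0))
# ['a.wav', 'd.wav']
```

Filtering first keeps hard constraints such as "longer than 4 seconds" exact, while the ranking stage only has to order the clips that already satisfy them.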

### Bonus: Storytelling Audio Analysis

The bonus module analyzes several recordings for storytelling-related cues:

1. Pacing and pauses
2. Pitch variation
3. Energy dynamics
4. Sentence-length characteristics from transcripts

It also computes a heuristic `storytelling_score` and ranks clips by storytelling-like expressiveness.
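
As an illustration of the pacing cue, the sketch below derives a pause ratio and a pause count from frame-level RMS energy. The definitions are assumptions (pause ratio as the fraction of frames below a silence threshold, a pause event as a silent run of at least 0.2 seconds), as are the threshold and frame length; the values in `storytelling_analysis.csv` may be computed differently, and the `storytelling_score` formula itself is not reproduced here.

```python
import numpy as np


def pause_stats(samples, sample_rate, frame_seconds=0.025,
                silence_threshold=0.01, min_pause_seconds=0.2):
    """Estimate pause ratio and pause events from frame RMS (assumed definitions)."""
    samples = np.asarray(samples, dtype=np.float64)
    frame = max(1, int(round(sample_rate * frame_seconds)))
    n_frames = len(samples) // frame
    if n_frames == 0:
        return {"pause_ratio": 0.0, "pause_events": 0}

    # Frame the signal (dropping the incomplete tail) and compute RMS per frame.
    frames = samples[:n_frames * frame].reshape(n_frames, frame)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    silent = rms < silence_threshold

    # Count runs of consecutive silent frames that last long enough.
    min_frames = max(1, int(round(min_pause_seconds / frame_seconds)))
    events, run = 0, 0
    for is_silent in silent:
        if is_silent:
            run += 1
        else:
            if run >= min_frames:
                events += 1
            run = 0
    if run >= min_frames:
        events += 1

    return {"pause_ratio": float(silent.mean()), "pause_events": events}
```

A fixed threshold only makes sense on amplitude-normalized audio, so this kind of measure is best applied to the normalized clips produced by Task 1.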

Primary output:

- `examples/storytelling_analysis.csv`

---

## Run Instructions

### Run the full pipeline

From the `narrative-audio-system/` directory:

```bash
python run_pipeline.py examples/03-01-04-02-01-01-11.wav
```

### Run tasks individually

Task 1:

```bash
python task1_audio_pipeline/audio_pipeline.py
```

Task 2:

```bash
python task2_tone_classification/train_classifier.py
```

Task 3:

```bash
python task3_transcription/whisper_transcriber.py
```

Task 4:

```bash
python task4_audio_retrieval/retrieval_prototype.py
```

Bonus task:

```bash
python task_bonus_storytelling/storytelling_analysis.py
```

---

## Example Outputs

Generated artifacts include:

1. `examples/task1_features_dataset.csv`
2. `examples/normalized_audio/`
3. `examples/transcripts.txt`
4. `examples/storytelling_analysis.csv`

Console outputs include:

1. Task 2 test metrics (accuracy, weighted F1, class report)
2. Task 3 WER summary
3. Task 4 retrieval results for sample queries
4. Bonus storytelling summary and top-ranked clips

---

## Approach and Discussion

This project is designed as a practical, reproducible end-to-end prototype for narrative audio processing.

- Task 1 converts raw recordings into structured numerical features.
- Task 2 demonstrates tone classification from audio-derived features.
- Task 3 provides scalable transcription with measurable quality.
- Task 4 combines explicit filtering with semantic retrieval for narrative-style queries.
- The bonus task explores prosodic and transcript-level cues for distinguishing storytelling narration from conversational speech.

Current limitations include dataset scale, CPU transcription speed, and the heuristic nature of storytelling scoring. Future improvements include pretrained audio embeddings, stronger ranking objectives, and dedicated storytelling annotations.
Binary file not shown.
129 changes: 129 additions & 0 deletions narrative-audio-system/examples/emotion_labels.json
@@ -0,0 +1,129 @@
{
"03-01-01-01-01-01-01.wav": "neutral",
"03-01-01-01-01-01-04.wav": "neutral",
"03-01-01-01-01-01-11.wav": "neutral",
"03-01-01-01-01-02-01.wav": "neutral",
"03-01-01-01-01-02-04.wav": "neutral",
"03-01-01-01-01-02-11.wav": "neutral",
"03-01-01-01-02-01-01.wav": "neutral",
"03-01-01-01-02-01-04.wav": "neutral",
"03-01-01-01-02-01-11.wav": "neutral",
"03-01-01-01-02-02-01.wav": "neutral",
"03-01-01-01-02-02-04.wav": "neutral",
"03-01-01-01-02-02-11.wav": "neutral",
"03-01-02-01-01-01-01.wav": "calm",
"03-01-02-01-01-01-04.wav": "calm",
"03-01-02-01-01-01-11.wav": "calm",
"03-01-02-01-01-02-01.wav": "calm",
"03-01-02-01-01-02-04.wav": "calm",
"03-01-02-01-01-02-11.wav": "calm",
"03-01-02-01-02-01-01.wav": "calm",
"03-01-02-01-02-01-04.wav": "calm",
"03-01-02-01-02-01-11.wav": "calm",
"03-01-02-01-02-02-04.wav": "calm",
"03-01-02-01-02-02-11.wav": "calm",
"03-01-02-02-01-01-04.wav": "calm",
"03-01-02-02-01-01-11.wav": "calm",
"03-01-02-02-01-02-04.wav": "calm",
"03-01-02-02-01-02-11.wav": "calm",
"03-01-02-02-02-01-04.wav": "calm",
"03-01-02-02-02-01-11.wav": "calm",
"03-01-02-02-02-02-04.wav": "calm",
"03-01-02-02-02-02-11.wav": "calm",
"03-01-03-01-01-01-04.wav": "happy",
"03-01-03-01-01-01-11.wav": "happy",
"03-01-03-01-01-02-04.wav": "happy",
"03-01-03-01-01-02-11.wav": "happy",
"03-01-03-01-02-01-04.wav": "happy",
"03-01-03-01-02-01-11.wav": "happy",
"03-01-03-01-02-02-04.wav": "happy",
"03-01-03-01-02-02-11.wav": "happy",
"03-01-03-02-01-01-04.wav": "happy",
"03-01-03-02-01-01-11.wav": "happy",
"03-01-03-02-01-02-04.wav": "happy",
"03-01-03-02-01-02-11.wav": "happy",
"03-01-03-02-02-01-04.wav": "happy",
"03-01-03-02-02-01-11.wav": "happy",
"03-01-03-02-02-02-04.wav": "happy",
"03-01-03-02-02-02-11.wav": "happy",
"03-01-04-01-01-01-04.wav": "sad",
"03-01-04-01-01-01-11.wav": "sad",
"03-01-04-01-01-02-04.wav": "sad",
"03-01-04-01-01-02-11.wav": "sad",
"03-01-04-01-02-01-04.wav": "sad",
"03-01-04-01-02-01-11.wav": "sad",
"03-01-04-01-02-02-04.wav": "sad",
"03-01-04-01-02-02-11.wav": "sad",
"03-01-04-02-01-01-04.wav": "sad",
"03-01-04-02-01-01-11.wav": "sad",
"03-01-04-02-01-02-04.wav": "sad",
"03-01-04-02-01-02-11.wav": "sad",
"03-01-04-02-02-01-04.wav": "sad",
"03-01-04-02-02-01-11.wav": "sad",
"03-01-04-02-02-02-04.wav": "sad",
"03-01-04-02-02-02-11.wav": "sad",
"03-01-05-01-01-01-04.wav": "angry",
"03-01-05-01-01-01-11.wav": "angry",
"03-01-05-01-01-02-04.wav": "angry",
"03-01-05-01-01-02-11.wav": "angry",
"03-01-05-01-02-01-04.wav": "angry",
"03-01-05-01-02-01-11.wav": "angry",
"03-01-05-01-02-02-04.wav": "angry",
"03-01-05-01-02-02-11.wav": "angry",
"03-01-05-02-01-01-04.wav": "angry",
"03-01-05-02-01-01-11.wav": "angry",
"03-01-05-02-01-02-04.wav": "angry",
"03-01-05-02-01-02-11.wav": "angry",
"03-01-05-02-02-01-04.wav": "angry",
"03-01-05-02-02-01-11.wav": "angry",
"03-01-05-02-02-02-04.wav": "angry",
"03-01-05-02-02-02-11.wav": "angry",
"03-01-06-01-01-01-04.wav": "fearful",
"03-01-06-01-01-01-11.wav": "fearful",
"03-01-06-01-01-02-04.wav": "fearful",
"03-01-06-01-01-02-11.wav": "fearful",
"03-01-06-01-02-01-04.wav": "fearful",
"03-01-06-01-02-01-11.wav": "fearful",
"03-01-06-01-02-02-04.wav": "fearful",
"03-01-06-01-02-02-11.wav": "fearful",
"03-01-06-02-01-01-04.wav": "fearful",
"03-01-06-02-01-01-11.wav": "fearful",
"03-01-06-02-01-02-04.wav": "fearful",
"03-01-06-02-01-02-11.wav": "fearful",
"03-01-06-02-02-01-04.wav": "fearful",
"03-01-06-02-02-01-11.wav": "fearful",
"03-01-06-02-02-02-04.wav": "fearful",
"03-01-06-02-02-02-11.wav": "fearful",
"03-01-07-01-01-01-04.wav": "disgust",
"03-01-07-01-01-01-11.wav": "disgust",
"03-01-07-01-01-02-04.wav": "disgust",
"03-01-07-01-01-02-11.wav": "disgust",
"03-01-07-01-02-01-04.wav": "disgust",
"03-01-07-01-02-01-11.wav": "disgust",
"03-01-07-01-02-02-04.wav": "disgust",
"03-01-07-01-02-02-11.wav": "disgust",
"03-01-07-02-01-01-04.wav": "disgust",
"03-01-07-02-01-01-11.wav": "disgust",
"03-01-07-02-01-02-04.wav": "disgust",
"03-01-07-02-01-02-11.wav": "disgust",
"03-01-07-02-02-01-04.wav": "disgust",
"03-01-07-02-02-01-11.wav": "disgust",
"03-01-07-02-02-02-04.wav": "disgust",
"03-01-07-02-02-02-11.wav": "disgust",
"03-01-08-01-01-01-04.wav": "surprised",
"03-01-08-01-01-01-11.wav": "surprised",
"03-01-08-01-01-02-04.wav": "surprised",
"03-01-08-01-01-02-11.wav": "surprised",
"03-01-08-01-02-01-04.wav": "surprised",
"03-01-08-01-02-01-11.wav": "surprised",
"03-01-08-01-02-02-04.wav": "surprised",
"03-01-08-01-02-02-11.wav": "surprised",
"03-01-08-02-01-01-04.wav": "surprised",
"03-01-08-02-01-01-11.wav": "surprised",
"03-01-08-02-01-02-04.wav": "surprised",
"03-01-08-02-01-02-11.wav": "surprised",
"03-01-08-02-02-01-04.wav": "surprised",
"03-01-08-02-02-01-11.wav": "surprised",
"03-01-08-02-02-02-04.wav": "surprised",
"03-01-08-02-02-02-11.wav": "surprised"
}
Binary file added narrative-audio-system/examples/input2.wav
Binary file not shown.
Binary file added narrative-audio-system/examples/sample_audio.wav
Binary file not shown.
9 changes: 9 additions & 0 deletions narrative-audio-system/examples/storytelling_analysis.csv
@@ -0,0 +1,9 @@
filename,transcript,duration_seconds,tempo_bpm,pause_ratio,pause_events,pitch_mean_hz,pitch_std_hz,energy_mean,energy_std,energy_dynamic_range,word_count,sentence_count,avg_sentence_words,max_sentence_words,storytelling_score
03-01-01-01-01-01-01.wav,Kids are talking by the door.,3.3033125,156.25,0.9516908212560387,3,267.0305984870316,142.3334142857007,0.0021737448405474424,0.003347895573824644,0.006854305077217761,6,1,6.0,6,49.65
03-01-01-01-01-01-04.wav,Kids are talking by the door.,3.3033125,144.23076923076923,0.9420289855072463,3,321.2472546577193,98.68349404954499,0.0023810872808098793,0.003555573523044586,0.008169263228774072,6,1,6.0,6,29.88
03-01-01-01-01-01-11.wav,Kids are talking by the door.,3.1365,170.45454545454547,0.9898477157360406,1,182.08479750808337,135.69343514278123,0.0017422254895791411,0.0024538985453546047,0.005624081086716618,6,1,6.0,6,48.22
03-01-01-01-01-02-01.wav,Kids are talking by the door.,3.3366875,144.23076923076923,0.9473684210526315,3,257.914998939382,146.12262952905198,0.0023219622671604156,0.003586930688470602,0.007503410851813899,6,1,6.0,6,56.52
03-01-01-01-01-02-04.wav,Kids are talking by the door.,3.3700625,144.23076923076923,0.9383886255924171,3,308.02964854572264,106.9491687631,0.0026571406051516533,0.0038461738731712103,0.008434683084487915,6,1,6.0,6,36.36
03-01-01-01-01-02-11.wav,Kids are talking by the door.,3.103125,170.45454545454547,0.9639175257731959,2,147.45442111024389,118.9600294179293,0.0024527707137167454,0.003387857461348176,0.008178922370461809,6,1,6.0,6,51.37
03-01-01-01-02-01-01.wav,Dogs are sitting by the door.,3.2699375,125.0,0.9365853658536586,2,272.5633234391762,143.07343874895489,0.002737249480560422,0.004259149543941021,0.008850548467989938,6,1,6.0,6,63.02
03-01-01-01-02-01-04.wav,Dogs are sitting by the door.,3.2699375,110.29411764705883,0.9463414634146341,1,326.6884042851927,92.25911778869441,0.0024951701052486897,0.003717334009706974,0.007492591347545385,6,1,6.0,6,21.04