
Commit 690c376

feat: Integrate AssemblyAI for transcription services
- Add `ASSEMBLYAI_API_KEY` to the config options
- Update usage instructions for the `transcribe-me` tool
- Change `.env.example` to `.env.dev` in the `init` target in the Makefile
- Add a flag to use AssemblyAI for transcription in `.transcribe.yaml`
- Include features related to AssemblyAI outputs and transcription in README.md
1 parent 797ec15 · commit 690c376

File tree: 9 files changed (+217, -73 lines)


.env.dev

Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
+TWINE_PASSWORD=pypi_api_token
+GITHUB_TOKEN=github_api_token
+GITHUB_REPOSITORY=echohello-dev/transcribe-me
+GITHUB_ACTOR=echohello-dev

.env.example

Lines changed: 3 additions & 4 deletions
@@ -1,4 +1,3 @@
-TWINE_PASSWORD=pypi_api_token
-GITHUB_TOKEN=github_api_token
-GITHUB_REPOSITORY=echohello-dev/transcribe-me
-GITHUB_ACTOR=echohello-dev
+OPENAI_API_KEY=your_openai_api_key_here
+ANTHROPIC_API_KEY=your_anthropic_api_key_here
+ASSEMBLYAI_API_KEY=your_assemblyai_api_key_here
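
The new `ASSEMBLYAI_API_KEY` entry follows the same pattern as the existing keys. This diff doesn't show how the keys are wired into the clients inside `transcribe_me`, so the following is only a minimal sketch, assuming the keys are exported as environment variables as the README describes and that the AssemblyAI SDK is configured via its module-level settings:

```python
import os

import assemblyai as aai

# Provider keys documented in .env.example; `transcribe-me install` prompts
# for any that are missing from the environment.
openai_key = os.getenv("OPENAI_API_KEY")
anthropic_key = os.getenv("ANTHROPIC_API_KEY")
assemblyai_key = os.getenv("ASSEMBLYAI_API_KEY")

# The AssemblyAI SDK reads its key from module-level settings, which is what
# allows aai.Transcriber() to be constructed with no arguments later on.
if assemblyai_key:
    aai.settings.api_key = assemblyai_key
```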

.transcribe.yaml

Lines changed: 2 additions & 0 deletions
@@ -1,3 +1,5 @@
+use_assemblyai: true
+
 openai:
   models:
     - temperature: 0.1
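
The new flag is a plain top-level boolean next to the existing provider sections. A minimal sketch of how it would be read, assuming PyYAML and an illustrative `load_config` helper (the actual loader in `transcribe_me` may differ); it mirrors the `config.get("use_assemblyai", False)` lookup added to `transcription.py` below:

```python
import yaml  # assumption: PyYAML is available for parsing .transcribe.yaml


def load_config(path: str = ".transcribe.yaml") -> dict:
    """Parse the transcription config file into a plain dict."""
    with open(path, "r", encoding="utf-8") as fh:
        return yaml.safe_load(fh) or {}


config = load_config()
# Same default as transcription.py: a missing or false flag keeps the OpenAI Whisper path.
use_assemblyai = config.get("use_assemblyai", False)
```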

Makefile

Lines changed: 4 additions & 4 deletions
@@ -8,7 +8,7 @@ VERSION ?= $(shell git describe --tags --always)
 export
 
 init:
-	cp .env.example .env
+	cp .env.dev .env
 
 check-ffmpeg:
 ifeq (, $(shell which ffmpeg))
@@ -79,13 +79,13 @@ else
 	docker compose build --push
 endif
 
-transcribe: install
+transcribe:
 	$(VENV) python -m transcribe_me.main
 
-transcribe-archive: install
+transcribe-archive:
 	$(VENV) python -m transcribe_me.main archive
 
-transcribe-install: install
+transcribe-install:
 	$(VENV) python -m transcribe_me.main install
 
 release-version:

README.md

Lines changed: 31 additions & 21 deletions
@@ -4,28 +4,33 @@
 
 [![Build](https://github.com/echohello-dev/transcribe-me/actions/workflows/build.yaml/badge.svg)](https://github.com/echohello-dev/transcribe-me/actions/workflows/build.yaml)
 
-Transcribe Me is a CLI-driven Python application that transcribes audio files using the OpenAI Whisper API and generates summaries of the transcriptions using both OpenAI's GPT-4 and Anthropic's Claude models.
+Transcribe Me is a CLI-driven Python application that transcribes audio files using either the OpenAI Whisper API or AssemblyAI, and generates summaries of the transcriptions using OpenAI's GPT-4 and Anthropic's Claude models.
 
 ```mermaid
 graph TD
 A[Load Config] --> B[Get Audio Files]
 B --> C{Audio File Exists?}
-C --Yes--> D[Transcribe Audio File]
-D --> E[Generate Summaries]
-E --> F[Save Transcription]
-F --> G[Save Summaries]
-G --> H[Clean Up Temporary Files]
-H --> B
-C --No--> I[Print Warning]
-I --> B
+C --Yes--> D{Use AssemblyAI?}
+D --Yes--> E[Transcribe with AssemblyAI]
+D --No--> F[Transcribe with OpenAI]
+E --> G[Generate Additional Outputs]
+F --> H[Generate Summaries]
+G --> I[Save Transcription and Outputs]
+H --> J[Save Transcription and Summaries]
+I --> K[Clean Up Temporary Files]
+J --> K
+K --> B
+C --No--> L[Print Warning]
+L --> B
 ```
 
 ## :key: Key Features
 
-- **Audio Transcription**: Transcribes audio files using the OpenAI Whisper API. It supports both MP3 and M4A formats and can handle large files by splitting them into smaller chunks for transcription.
-- **Summary Generation**: Generates summaries of the transcriptions using both OpenAI's GPT-4 and Anthropic's Claude models. The summaries are saved in Markdown format and include key points in bold and a "Next Steps" section.
+- **Audio Transcription**: Transcribes audio files using either the OpenAI Whisper API or AssemblyAI. It supports both MP3 and M4A formats.
+- **Summary Generation**: Generates summaries of the transcriptions using both OpenAI's GPT-4 and Anthropic's Claude models when using OpenAI for transcription.
+- **AssemblyAI Features**: When using AssemblyAI, provides additional outputs including Speaker Diarization, Summary, Sentiment Analysis, Key Phrases, and Topic Detection.
 - **Configurable Models**: Supports multiple models for OpenAI and Anthropic, with configurable temperature, max_tokens, and system prompts.
-- **Supports Audio Files**: Supports audio files `.m4a` and `.mp3` formats.
+- **Supports Audio Files**: Supports audio files in `.m4a` and `.mp3` formats.
 - **Supports Docker**: Can be run in a Docker container for easy deployment and reproducibility.
 
 ## :package: Installation
@@ -65,11 +70,12 @@ This has been tested with macOS, your mileage may vary on other operating system
 transcribe-me install
 ```
 
-This command will also prompt you to enter your API keys for OpenAI and Anthropic if they are not already provided in environment variables. You can also set the API keys in environment variables:
+This command will prompt you to enter your API keys for OpenAI, Anthropic, and AssemblyAI if they are not already provided in environment variables. You can also set the API keys in environment variables:
 
 ```bash
 export OPENAI_API_KEY=your_api_key
 export ANTHROPIC_API_KEY=your_api_key
+export ASSEMBLYAI_API_KEY=your_api_key
 ```
 
 2. Place your audio files in the `input` directory (or any other directory specified in the configuration).
@@ -117,6 +123,7 @@ You can also run the application using Docker:
   --rm \
   -e OPENAI_API_KEY \
   -e ANTHROPIC_API_KEY \
+  -e ASSEMBLYAI_API_KEY \
   -v $(pwd)/archive:/app/archive \
   -v $(pwd)/input:/app/input \
   -v $(pwd)/output:/app/output \
@@ -136,6 +143,7 @@ You can also run the application using Docker:
 environment:
   - OPENAI_API_KEY
   - ANTHROPIC_API_KEY
+  - ASSEMBLYAI_API_KEY
 volumes:
   - ./input:/app/input
   - ./output:/app/output
@@ -151,7 +159,7 @@ You can also run the application using Docker:
 
 This command mounts the `input`, `output`, `archive`, and `.transcribe.yaml` configuration file into the Docker container. See [`compose.example.yaml`](./compose.example.yaml) for an example configuration.
 
-Make sure to replace `OPENAI_API_KEY` and `ANTHROPIC_API_KEY` with your actual API keys. Also make sure to create the `.transcribe.yaml` configuration file in the same directory as the `docker-compose.yml` file.
+Make sure to replace `OPENAI_API_KEY`, `ANTHROPIC_API_KEY`, and `ASSEMBLYAI_API_KEY` with your actual API keys. Also make sure to create the `.transcribe.yaml` configuration file in the same directory as the `docker-compose.yml` file.
 
 ## :rocket: How it Works
 
@@ -160,21 +168,23 @@ The Transcribe Me application follows a straightforward workflow:
 1. **Load Configuration**: The application loads the configuration from the `.transcribe.yaml` file, which includes settings for input/output directories, models, and their configurations.
 2. **Get Audio Files**: The application gets a list of audio files from the input directory specified in the configuration.
 3. **Check Existing Transcriptions**: For each audio file, the application checks if there is an existing transcription file. If a transcription file exists, it skips to the next audio file.
-4. **Transcribe Audio File**: If no transcription file exists, the application transcribes the audio file using the OpenAI Whisper API. It splits the audio file into smaller chunks for efficient transcription.
-5. **Generate Summaries**: After transcription, the application generates summaries of the transcription using the configured models (OpenAI GPT-4 and Anthropic Claude).
-6. **Save Transcription and Summaries**: The application saves the transcription to a text file and the summaries from each configured model to separate Markdown files in the output directory.
+4. **Transcribe Audio File**: If no transcription file exists, the application transcribes the audio file using either the OpenAI Whisper API or AssemblyAI, based on the configuration.
+5. **Generate Outputs**:
+   - For OpenAI: The application generates summaries of the transcription using the configured models (OpenAI GPT-4 and Anthropic Claude).
+   - For AssemblyAI: The application generates additional outputs including Speaker Diarization, Summary, Sentiment Analysis, Key Phrases, and Topic Detection.
+6. **Save Transcription and Outputs**: The application saves the transcription and all generated outputs to separate files in the output directory.
 7. **Clean Up Temporary Files**: The application removes any temporary files generated during the transcription process.
 8. **Repeat**: The process repeats for each audio file in the input directory.
 
 ## :gear: Configuration
 
 The application uses a configuration file (`.transcribe.yaml`) to specify settings such as input/output directories, API keys, models, and their configurations. The configuration file is created automatically when you run the `transcribe-me install` command.
 
-> `max_tokens` is the maximum number of tokens to generate in the summary. The default is dynamic based on the model.
-
 Here is an example configuration file:
 
 ```yaml
+use_assemblyai: false # Set to true to use AssemblyAI instead of OpenAI for transcription
+
 openai:
   models:
     - temperature: 0.1
@@ -226,7 +236,7 @@ output_folder: output
 make install
 ```
 
-3. Run the `transcribe-me install` command to create the `.transcribe.yaml` configuration file and provide your API keys for OpenAI and Anthropic:
+3. Run the `transcribe-me install` command to create the `.transcribe.yaml` configuration file and provide your API keys for OpenAI, Anthropic, and AssemblyAI:
 
 ```bash
 make transcribe-install
@@ -277,4 +287,4 @@ To release a new version:
 
 ## Star History
 
-[![Star History Chart](https://api.star-history.com/svg?repos=echohello-dev/transcribe-me&type=Date)](https://star-history.com/#echohello-dev/transcribe-me&Date)
+[![Star History Chart](https://api.star-history.com/svg?repos=echohello-dev/transcribe-me&type=Date)](https://star-history.com/#echohello-dev/transcribe-me&Date)

requirements.txt

Lines changed: 3 additions & 0 deletions
@@ -2,7 +2,9 @@ annotated-types==0.6.0
 anthropic==0.21.3
 anyio==4.4.0
 argcomplete==3.2.3
+assemblyai==0.34.0
 astroid==3.1.0
+autopep8==2.3.1
 black==24.4.0
 build==1.2.1
 certifi==2024.7.4
@@ -72,5 +74,6 @@ twine==5.0.0
 typing_extensions==4.10.0
 urllib3==2.2.1
 wcwidth==0.2.13
+websockets==13.1
 yamale==5.1.0
 zipp==3.19.1

transcribe_me/audio/transcription.py

Lines changed: 94 additions & 7 deletions
@@ -2,6 +2,7 @@
 from glob import glob
 from typing import Dict, Any
 import openai
+import assemblyai as aai
 from tqdm import tqdm
 from colorama import Fore
 from tenacity import retry, wait_exponential, stop_after_attempt
@@ -29,20 +30,33 @@ def transcribe_chunk(file_path: str) -> str:
         raise e
 
 
-def transcribe_audio(file_path: str, output_path: str) -> None:
+def transcribe_audio(file_path: str, output_path: str, config: Dict[str, Any]) -> None:
     """
-    Transcribe an audio file using the OpenAI Whisper API.
+    Transcribe an audio file using either OpenAI Whisper API or AssemblyAI.
 
     Args:
         file_path (str): Path to the audio file to transcribe.
         output_path (str): Path to the output file for the transcription.
+        config (Dict[str, Any]): Configuration dictionary.
+    """
+    use_assemblyai = config.get("use_assemblyai", False)
+
+    if use_assemblyai:
+        transcribe_with_assemblyai(file_path, output_path, config)
+    else:
+        transcribe_with_openai(file_path, output_path)
+
+
+def transcribe_with_openai(file_path: str, output_path: str) -> None:
+    """
+    Transcribe an audio file using the OpenAI Whisper API.
     """
     chunk_files = split_audio(file_path)
     full_transcription = ""
 
     progress_bar = tqdm(
         chunk_files,
-        desc="Transcribing",
+        desc="Transcribing with OpenAI",
         unit="chunk",
         bar_format="{l_bar}{bar}| {n_fmt}/{total_fmt}",
     )
@@ -61,6 +75,78 @@ def transcribe_audio(file_path: str, output_path: str) -> None:
         file.write(full_transcription)
 
 
+def transcribe_with_assemblyai(
+    file_path: str, output_path: str, config: Dict[str, Any]
+) -> None:
+    """
+    Transcribe an audio file using AssemblyAI.
+    """
+    transcription_config = aai.TranscriptionConfig(
+        speaker_labels=True,
+        summarization=True,
+        sentiment_analysis=True,
+        auto_highlights=True,
+        iab_categories=True,
+    )
+    transcriber = aai.Transcriber()
+
+    transcript = transcriber.transcribe(file_path, config=transcription_config)
+
+    # Write transcription to file
+    with open(output_path, "w", encoding="utf-8") as file:
+        file.write(transcript.text)
+
+    # Write additional information to separate files
+    base_name = os.path.splitext(output_path)[0]
+
+    # Speaker Diarization
+    with open(f"{base_name}_speakers.txt", "w", encoding="utf-8") as file:
+        for utterance in transcript.utterances:
+            file.write(f"Speaker {utterance.speaker}: {utterance.text}\n")
+
+    # Auto Highlights
+    with open(f"{base_name}_auto_highlights.txt", "w", encoding="utf-8") as file:
+        for highlight in transcript.auto_highlights_result.results:
+            file.write(f"{highlight.text}\n")
+
+    # Summary
+    with open(f"{base_name}_summary.txt", "w", encoding="utf-8") as file:
+        file.write(transcript.summary)
+
+    # Sentiment Analysis
+    if transcript.sentiment_analysis:
+        with open(f"{base_name}_sentiment.txt", "w", encoding="utf-8") as file:
+            for result in transcript.sentiment_analysis:
+                file.write(f"Text: {result.text}\n")
+                file.write(f"Sentiment: {result.sentiment}\n")
+                file.write(f"Confidence: {result.confidence}\n")
+                file.write(f"Timestamp: {result.start} - {result.end}\n\n")
+
+    # Key Phrases
+    with open(f"{base_name}_key_phrases.txt", "w", encoding="utf-8") as file:
+        for phrase in transcript.auto_highlights_result.results:
+            file.write(f"{phrase.text}\n")
+
+    # Topic Detection
+    if transcript.iab_categories:
+        with open(f"{base_name}_topics.txt", "w", encoding="utf-8") as file:
+            # Detailed results
+            file.write("Detailed Topic Results:\n")
+            for result in transcript.iab_categories.results:
+                file.write(f"Text: {result.text}\n")
+                file.write(
+                    f"Timestamp: {result.timestamp.start} - {result.timestamp.end}\n"
+                )
+                for label in result.labels:
+                    file.write(f" {label.label} (Relevance: {label.relevance})\n")
+                file.write("\n")
+
+            # Summary of all topics
+            file.write("\nTopic Summary:\n")
+            for topic, relevance in transcript.iab_categories.summary.items():
+                file.write(f"Audio is {relevance * 100:.2f}% relevant to {topic}\n")
+
+
 def process_audio_files(
     input_folder: str, output_folder: str, config: Dict[str, Any]
 ) -> None:
@@ -84,11 +170,12 @@ def process_audio_files(
         try:
             if not os.path.exists(output_file):
                 print(f"{Fore.BLUE}Transcribing audio file: {file_path}\n")
-                transcribe_audio(file_path, output_file)
+                transcribe_audio(file_path, output_file, config)
         except Exception as e:
             print(f"{Fore.RED}An error occurred while processing {file_path}: {e}")
             raise e
         finally:
-            # Delete the _part* MP3 files
-            for file in glob(f"{file_path.partition('.')[0]}_part*.mp3"):
-                os.remove(file)
+            # Delete the _part* MP3 files if using OpenAI
+            if not config.get("use_assemblyai", False):
+                for file in glob(f"{file_path.partition('.')[0]}_part*.mp3"):
+                    os.remove(file)
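
With the new signature, callers only need to thread the parsed config dict through; below is a hedged usage sketch (the file names are illustrative, and the config shown is trimmed to the flag rather than the full `.transcribe.yaml` contents):

```python
from transcribe_me.audio.transcription import process_audio_files, transcribe_audio

# Route every audio file in a folder through the backend chosen by the flag.
config = {"use_assemblyai": True}
process_audio_files("input", "output", config)

# Or transcribe a single file; with the flag absent or false, the chunked
# OpenAI Whisper path is used and the temporary _part*.mp3 chunks are cleaned up.
transcribe_audio("input/meeting.m4a", "output/meeting.txt", {"use_assemblyai": False})
```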
