
Commit 6b04f3f

✨ New recording modes!
1 parent 550e595 commit 6b04f3f

5 files changed: +72 −44 lines


CHANGELOG.md

Lines changed: 10 additions & 8 deletions
```diff
@@ -7,23 +7,25 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).
 ## [Unreleased]
 ### Added
 - New message to identify whether Whisper was being called using the API or running locally.
-- New configuration options to choose which sound device and sample rate to use.
-- Push to talk
-- New configuration option to enable/disable push to talk
-- New configuration option to hide status window
-- New configuration options to readme along with their descriptions
+- Additional hold-to-talk ([PR #28](https://github.com/savbell/whisper-writer/pull/28)) and press-to-toggle recording methods ([Issue #21](https://github.com/savbell/whisper-writer/issues/21)).
+- New configuration options to:
+  - Choose recording method (defaulting to voice activity detection).
+  - Choose which sound device and sample rate to use.
+  - Hide the status window ([PR #28](https://github.com/savbell/whisper-writer/pull/28)).
 
 ### Changed
-- Migrated from `whisper` to `faster-whisper` (Issue #11).
-- Migrated from `pyautogui` to `pynput` (PR #10).
-- Migrated from `webrtcvad` to `webrtcvad-wheels` (PR #17).
+- Migrated from `whisper` to `faster-whisper` ([Issue #11](https://github.com/savbell/whisper-writer/issues/11)).
+- Migrated from `pyautogui` to `pynput` ([PR #10](https://github.com/savbell/whisper-writer/pull/10)).
+- Migrated from `webrtcvad` to `webrtcvad-wheels` ([PR #17](https://github.com/savbell/whisper-writer/pull/17)).
 - Changed default activation key combo from `ctrl+alt+space` to `ctrl+shift+space`.
 - Changed to using a local model rather than the API by default.
 - Revamped README.md, including new Roadmap, Contributing, and Credits sections.
 
 ### Fixed
 - Local model is now only loaded once at start-up, rather than every time the activation key combo was pressed.
 - Default configuration now auto-chooses compute type for the local model to avoid warnings.
+- Graceful degradation to CPU if CUDA isn't available ([PR #30](https://github.com/savbell/whisper-writer/pull/30)).
+- Removed long prefix of spaces in transcription ([PR #19](https://github.com/savbell/whisper-writer/pull/19)).
 
 ## [1.0.0] - 2023-05-29
 ### Added
```
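The "graceful degradation to CPU" fix in the changelog follows a common try-the-GPU-then-fall-back pattern. Below is a minimal, hypothetical sketch of that idea: `create_model_with_fallback` and its `build_model` factory argument are illustrative stand-ins, not the project's actual API (in whisper-writer the real call is `faster_whisper.WhisperModel(...)` inside `create_local_model`).

```python
def create_model_with_fallback(build_model, preferred_device='cuda'):
    """Try to build the model on the preferred device; fall back to CPU on failure.

    `build_model` is a hypothetical factory taking a device name — a stand-in
    for the real faster_whisper.WhisperModel(...) call.
    """
    try:
        return build_model(preferred_device), preferred_device
    except Exception as e:
        print(f'Error initializing model on {preferred_device}: {e}')
        print('Falling back to CPU.')
        return build_model('cpu'), 'cpu'


# Stub factory that pretends CUDA is unavailable:
def cpu_only_factory(device):
    if device != 'cpu':
        raise RuntimeError('CUDA not available')
    return 'model'

model, device = create_model_with_fallback(cpu_only_factory)
print(device)  # cpu
```

The key design point is that the fallback wraps only model construction, so a missing CUDA runtime degrades the device choice without aborting the whole script.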

README.md

Lines changed: 17 additions & 8 deletions
````diff
@@ -8,11 +8,16 @@
 
 WhisperWriter is a small speech-to-text app that uses [OpenAI's Whisper model](https://openai.com/research/whisper) to auto-transcribe recordings from a user's microphone.
 
-Once started, the script runs in the background and waits for a keyboard shortcut to be pressed (`ctrl+shift+space` by default, but this can be changed in the [Configuration Options](#configuration-options)). When the shortcut is pressed, the app starts recording from your microphone. It will continue recording until you stop speaking or there is a long enough pause in your speech. While it is recording, a small status window is displayed that shows the current stage of the transcription process. Once the transcription is complete, the transcribed text will be automatically written to the active window.
+Once started, the script runs in the background and waits for a keyboard shortcut to be pressed (`ctrl+shift+space` by default). When the shortcut is pressed, the app starts recording from your microphone. There are three options to stop recording:
+- `voice_activity_detection`, which stops recording once it detects a long enough pause in your speech.
+- `press_to_toggle`, which stops recording when the activation key is pressed again.
+- `hold_to_record`, which stops recording when the activation key is released.
+
+You can change the activation key and recording mode in the [Configuration Options](#configuration-options). While recording and transcribing, a small status window is displayed that shows the current stage of the process (but this can be turned off). Once the transcription is complete, the transcribed text will be automatically written to the active window.
 
 The transcription can either be done locally through the [faster-whisper Python package](https://github.com/SYSTRAN/faster-whisper/) or through a request to [OpenAI's API](https://platform.openai.com/docs/guides/speech-to-text). By default, the app will use a local model, but you can change this in the [Configuration Options](#configuration-options). If you choose to use the API, you will need to provide your OpenAI API key in a `.env` file.
 
-**Fun fact:** Almost the entirety of this project was pair-programmed with [ChatGPT-4](https://openai.com/product/gpt-4) and [GitHub Copilot](https://github.com/features/copilot) using VS Code. Practically every line, including most of this README, was written by AI. After the initial prototype was finished, WhisperWriter was used to write a lot of the prompts as well!
+**Fun fact:** Almost the entirety of the initial release of the project was pair-programmed with [ChatGPT-4](https://openai.com/product/gpt-4) and [GitHub Copilot](https://github.com/features/copilot) using VS Code. Practically every line, including most of this README, was written by AI. After the initial prototype was finished, WhisperWriter was used to write a lot of the prompts as well!
 
 ## Getting Started
 
@@ -22,6 +27,11 @@ Before you can run this app, you'll need to have the following software installed:
 - Git: [https://git-scm.com/downloads](https://git-scm.com/downloads)
 - Python `3.11`: [https://www.python.org/downloads/](https://www.python.org/downloads/)
 
+If you want to run `faster-whisper` on your GPU, you'll also need to install the following NVIDIA libraries:
+
+- [cuBLAS for CUDA 11](https://developer.nvidia.com/cublas)
+- [cuDNN 8 for CUDA 11](https://developer.nvidia.com/cudnn)
+
 ### Installation
 To set up and run the project, follow these steps:
 
@@ -54,7 +64,7 @@ pip install -r requirements.txt
 To switch between running Whisper locally and using the OpenAI API, you need to modify the `src\config.json` file:
 
 - If you prefer using the OpenAI API, set `"use_api"` to `true`. You will also need to set up your OpenAI API key in the next step.
-- If you prefer using a local Whisper model, set `"use_api"` to `false`. You may also want to change the device that the model uses; see the [Model Options](#model-options).
+- If you prefer using a local Whisper model, set `"use_api"` to `false`. You may also want to change the device that the model uses; see the [Model Options](#model-options). Note that you need to have the [NVIDIA libraries installed](https://github.com/SYSTRAN/faster-whisper/#gpu) to run the model on your GPU.
 
 ```
 {
@@ -109,6 +119,7 @@ WhisperWriter uses a configuration file to customize its behaviour.
     "vad_filter": false
   },
   "activation_key": "ctrl+shift+space",
+  "recording_mode": "voice_activity_detection",
   "sound_device": null,
   "sample_rate": 16000,
   "silence_duration": 900,
@@ -137,6 +148,7 @@ WhisperWriter uses a configuration file to customize its behaviour.
 - `vad_filter`: Set to `true` to use [a voice activity detection (VAD) filter](https://github.com/snakers4/silero-vad) to remove silence from the recording. (Default: `false`)
 #### Customization Options
 - `activation_key`: The keyboard shortcut to activate the recording and transcribing process. (Default: `"ctrl+shift+space"`)
+- `recording_mode`: The recording mode to use: `voice_activity_detection` stops recording after a long enough pause in your speech, `press_to_toggle` stops when the activation key is pressed again, and `hold_to_record` records only while the activation key is held down. (Default: `"voice_activity_detection"`)
 - `sound_device`: The name of the sound device to use for recording. Set to `null` to let the system automatically choose the default device. To find a device number, run `python -m sounddevice`. (Default: `null`)
 - `sample_rate`: The sample rate in Hz to use for recording. (Default: `16000`)
 - `silence_duration`: The duration in milliseconds to wait for silence before stopping the recording. (Default: `900`)
@@ -145,7 +157,6 @@ WhisperWriter uses a configuration file to customize its behaviour.
 - `add_trailing_space`: Set to `true` to add a trailing space to the transcribed text. (Default: `true`)
 - `remove_capitalization`: Set to `true` to convert the transcribed text to lowercase. (Default: `false`)
 - `print_to_terminal`: Set to `true` to print the script status and transcribed text to the terminal. (Default: `true`)
-- `push_to_talk`: Set to `true` to enable push to talk. Recording starts when activation-key is pressed down. When activation-key is released, recording stops and transcription starts.
 - `hide_window`: Set to `true` to hide the status window.
 
 If any of the configuration options are invalid or not provided, the program will use the default values.
@@ -164,9 +175,6 @@ Below are features I am planning to add in the near future:
 - [ ] Updating GUI
 - [ ] Creating standalone executable file
 
-Below are features I plan on investigating and may end up adding in the future:
-- [ ] Push-to-talk option
-
 Below are features not currently planned:
 - [ ] Pipelining audio files
 
@@ -176,8 +184,9 @@ Contributions are welcome! I created this project for my own personal use and di
 
 ## Credits
 
-- [OpenAI](https://openai.com/) for creating the Whisper model and providing the API.
+- [OpenAI](https://openai.com/) for creating the Whisper model and providing the API. Plus [ChatGPT](https://chat.openai.com/), which was used to write a lot of the initial code for this project.
 - [Guillaume Klein](https://github.com/guillaumekln) for creating the [faster-whisper Python package](https://github.com/SYSTRAN/faster-whisper).
+- All of our [contributors](https://github.com/savbell/whisper-writer/graphs/contributors)!
 
 ## License
 
````
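The README states that invalid or missing configuration options fall back to the default values. A minimal sketch of how such a merge can work — `merge_config`, `DEFAULTS`, and `VALID_RECORDING_MODES` here are hypothetical illustrations covering only a subset of options, not the project's actual `load_config_with_defaults`:

```python
import json

# A subset of the defaults, for illustration only.
DEFAULTS = {
    'activation_key': 'ctrl+shift+space',
    'recording_mode': 'voice_activity_detection',
    'sample_rate': 16000,
    'silence_duration': 900,
    'hide_status_window': False,
}

VALID_RECORDING_MODES = {'voice_activity_detection', 'press_to_toggle', 'hold_to_record'}

def merge_config(user_json):
    """Overlay known keys from a user config string onto the defaults.

    Unknown keys are ignored; an invalid recording_mode reverts to the default.
    """
    user = json.loads(user_json) if user_json else {}
    config = dict(DEFAULTS)
    config.update({k: v for k, v in user.items() if k in DEFAULTS})
    if config['recording_mode'] not in VALID_RECORDING_MODES:
        config['recording_mode'] = DEFAULTS['recording_mode']
    return config

cfg = merge_config('{"recording_mode": "hold_to_record", "sample_rate": 44100}')
print(cfg['recording_mode'], cfg['sample_rate'])  # hold_to_record 44100
```

Starting from a copy of the defaults and overlaying only recognized keys is what makes a partial or partly invalid `config.json` safe to load.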

src/config.json

Lines changed: 1 addition & 1 deletion
```diff
@@ -17,6 +17,7 @@
     "vad_filter": false
   },
   "activation_key": "ctrl+shift+space",
+  "recording_mode": "voice_activity_detection",
   "sound_device": null,
   "sample_rate": 16000,
   "silence_duration": 900,
@@ -25,6 +26,5 @@
   "add_trailing_space": true,
   "remove_capitalization": false,
   "print_to_terminal": true,
-  "push_to_talk": false,
   "hide_status_window": false
 }
```

src/main.py

Lines changed: 18 additions & 7 deletions
```diff
@@ -40,6 +40,7 @@ def load_config_with_defaults():
         'vad_filter': False,
     },
     'activation_key': 'ctrl+shift+space',
+    'recording_mode': 'voice_activity_detection',  # 'voice_activity_detection', 'press_to_toggle', or 'hold_to_record'
     'sound_device': None,
     'sample_rate': 16000,
     'silence_duration': 900,
@@ -48,7 +49,6 @@ def load_config_with_defaults():
     'add_trailing_space': False,
     'remove_capitalization': False,
     'print_to_terminal': True,
-    'push_to_talk': False,
     'hide_status_window': False
 }
@@ -109,20 +109,31 @@ def typewrite(text, interval):
 # Main script
 
 config = load_config_with_defaults()
-method = 'OpenAI\'s API' if config['use_api'] else 'a local model'
-status_queue = queue.Queue()
 
-keyboard.add_hotkey(config['activation_key'], on_shortcut)
-pyinput_keyboard = Controller()
+model_method = 'OpenAI\'s API' if config['use_api'] else 'a local model'
+print(f'Script activated. Whisper is set to run using {model_method}. To change this, modify the "use_api" value in the src\\config.json file.')
 
-print(f'Script activated. Whisper is set to run using {method}. To change this, modify the "use_api" value in the src\\config.json file.')
+# Set up local model if needed
 local_model = None
 if not config['use_api']:
     print('Creating local model...')
     local_model = create_local_model(config)
     print('Local model created.')
 
-print(f'Press {format_keystrokes(config["activation_key"])} to start recording and transcribing. Press Ctrl+C on the terminal window to quit.')
+print(f'WhisperWriter is set to record using {config["recording_mode"]}. To change this, modify the "recording_mode" value in the src\\config.json file.')
+print(f'The activation key combo is set to {format_keystrokes(config["activation_key"])}.', end='')
+if config['recording_mode'] == 'voice_activity_detection':
+    print(' When it is pressed, recording will start, and will stop when you stop speaking.')
+elif config['recording_mode'] == 'press_to_toggle':
+    print(' When it is pressed, recording will start, and will stop when you press the key combo again.')
+elif config['recording_mode'] == 'hold_to_record':
+    print(' When it is pressed, recording will start, and will stop when you release the key combo.')
+print('Press Ctrl+C on the terminal window to quit.')
+
+# Set up status window and keyboard listener
+status_queue = queue.Queue()
+pyinput_keyboard = Controller()
+keyboard.add_hotkey(config['activation_key'], on_shortcut)
 try:
     keyboard.wait()  # Keep the script running to listen for the shortcut
 except KeyboardInterrupt:
```

src/transcription.py

Lines changed: 26 additions & 20 deletions
```diff
@@ -22,13 +22,13 @@ def create_local_model(config):
                              device=config['local_model_options']['device'],
                              compute_type=config['local_model_options']['compute_type'])
     except Exception as e:
-        print(f"Error initializing WhisperModel with CUDA: {e}")
-        print("Falling back to CPU.")
+        print(f'Error initializing WhisperModel with CUDA: {e}') if config['print_to_terminal'] else ''
+        print('Falling back to CPU.') if config['print_to_terminal'] else ''
         model = WhisperModel(config['local_model_options']['model'],
                              device='cpu',
                              compute_type=config['local_model_options']['compute_type'])
     else:
-        print("CUDA not available, using CPU.")
+        print('CUDA not available, using CPU.') if config['print_to_terminal'] else ''
         model = WhisperModel(config['local_model_options']['model'],
                              device='cpu',
                              compute_type=config['local_model_options']['compute_type'])
@@ -78,7 +78,7 @@ def record(status_queue, cancel_flag, config):
     buffer_duration = 300  # 300ms
     silence_duration = config['silence_duration'] if config else 900  # 900ms
 
-    push_to_talk = config['push_to_talk']
+    recording_mode = config['recording_mode']
     activation_key = config['activation_key']
 
     vad = webrtcvad.Vad(3)  # Aggressiveness mode: 3 (highest)
@@ -97,22 +97,28 @@ def record(status_queue, cancel_flag, config):
 
             frame = buffer[:sample_rate * frame_duration // 1000]
             buffer = buffer[sample_rate * frame_duration // 1000:]
-
-            if push_to_talk:
-                recording.extend(frame)
-                if not keyboard.is_pressed(activation_key):
-                    break
-            else:
-                is_speech = vad.is_speech(np.array(frame).tobytes(), sample_rate)
-                if is_speech:
-                    recording.extend(frame)
-                    num_silent_frames = 0
-                else:
-                    if len(recording) > 0:
-                        num_silent_frames += 1
-
-                    if num_silent_frames >= num_silence_frames:
+
+            if not cancel_flag():
+                if recording_mode == 'press_to_toggle':
+                    if len(recording) > 0 and keyboard.is_pressed(activation_key):
+                        break
+                    else:
+                        recording.extend(frame)
+                if recording_mode == 'hold_to_record':
+                    if keyboard.is_pressed(activation_key):
+                        recording.extend(frame)
+                    else:
                         break
+                elif recording_mode == 'voice_activity_detection':
+                    is_speech = vad.is_speech(np.array(frame).tobytes(), sample_rate)
+                    if is_speech:
+                        recording.extend(frame)
+                        num_silent_frames = 0
+                    else:
+                        if len(recording) > 0:
+                            num_silent_frames += 1
+                        if num_silent_frames >= num_silence_frames:
+                            break
 
     if cancel_flag():
         status_queue.put(('cancel', ''))
@@ -184,4 +190,4 @@ def record_and_transcribe(status_queue, cancel_flag, config, local_model=None):
     if cancel_flag():
         return ''
     result = transcribe(status_queue, cancel_flag, config, audio_file, local_model)
-    return result
\ No newline at end of file
+    return result
```
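The per-frame mode dispatch added to `record()` can be summarized as a pure decision function. The sketch below is a hypothetical simplification — `should_stop` is not part of the project, and the keyboard and VAD state are passed in as plain booleans rather than read from the real `keyboard` and `webrtcvad` libraries:

```python
def should_stop(recording_mode, key_pressed, is_speech,
                has_audio, num_silent_frames, num_silence_frames):
    """Return True when recording should stop under the given mode."""
    if recording_mode == 'press_to_toggle':
        # Stop when the activation key is pressed again after audio was captured.
        return has_audio and key_pressed
    if recording_mode == 'hold_to_record':
        # Stop as soon as the activation key is released.
        return not key_pressed
    if recording_mode == 'voice_activity_detection':
        # Stop once enough consecutive silent frames follow captured speech.
        return (not is_speech and has_audio
                and num_silent_frames >= num_silence_frames)
    return False

# Releasing the key ends a hold_to_record session:
print(should_stop('hold_to_record', key_pressed=False, is_speech=False,
                  has_audio=True, num_silent_frames=0, num_silence_frames=30))  # True
```

Factoring the stop conditions out like this makes the three modes easy to compare: only `voice_activity_detection` depends on the silence counters, while the other two depend solely on the activation key and whether any audio has been captured yet.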
