
Commit 6b04f3f

✨ New recording modes!
1 parent 550e595 commit 6b04f3f

5 files changed: +72 −44 lines


CHANGELOG.md

Lines changed: 10 additions & 8 deletions
```diff
@@ -7,23 +7,25 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).
 ## [Unreleased]
 ### Added
 - New message to identify whether Whisper was being called using the API or running locally.
-- New configuration options to choose which sound device and sample rate to use.
-- Push to talk
-- New configuration option to enable/disable push to talk
-- New configuration option to hide status window
-- New configuration options to readme along with their descriptions
+- Additional hold-to-talk ([PR #28](https://github.com/savbell/whisper-writer/pull/28)) and press-to-toggle recording methods ([Issue #21](https://github.com/savbell/whisper-writer/issues/21)).
+- New configuration options to:
+  - Choose recording method (defaulting to voice activity detection).
+  - Choose which sound device and sample rate to use.
+  - Hide the status window ([PR #28](https://github.com/savbell/whisper-writer/pull/28)).
 
 ### Changed
-- Migrated from `whisper` to `faster-whisper` (Issue #11).
-- Migrated from `pyautogui` to `pynput` (PR #10).
-- Migrated from `webrtcvad` to `webrtcvad-wheels` (PR #17).
+- Migrated from `whisper` to `faster-whisper` ([Issue #11](https://github.com/savbell/whisper-writer/issues/11)).
+- Migrated from `pyautogui` to `pynput` ([PR #10](https://github.com/savbell/whisper-writer/pull/10)).
+- Migrated from `webrtcvad` to `webrtcvad-wheels` ([PR #17](https://github.com/savbell/whisper-writer/pull/17)).
 - Changed default activation key combo from `ctrl+alt+space` to `ctrl+shift+space`.
 - Changed to using a local model rather than the API by default.
 - Revamped README.md, including new Roadmap, Contributing, and Credits sections.
 
 ### Fixed
 - Local model is now only loaded once at start-up, rather than every time the activation key combo was pressed.
 - Default configuration now auto-chooses compute type for the local model to avoid warnings.
+- Graceful degradation to CPU if CUDA isn't available ([PR #30](https://github.com/savbell/whisper-writer/pull/30)).
+- Removed long prefix of spaces in transcription ([PR #19](https://github.com/savbell/whisper-writer/pull/19)).
 
 ## [1.0.0] - 2023-05-29
 ### Added
```
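The "graceful degradation to CPU" fix in the changelog follows a common try-the-GPU-then-fall-back pattern. Below is a minimal, hypothetical sketch of that idea: `create_model_with_fallback` and its `build_model` factory argument are illustrative stand-ins, not the project's actual API (in whisper-writer the real call is `faster_whisper.WhisperModel(...)` inside `create_local_model`).

```python
def create_model_with_fallback(build_model, preferred_device='cuda'):
    """Try to build the model on the preferred device; fall back to CPU on failure.

    `build_model` is a hypothetical factory taking a device name — a stand-in
    for the real faster_whisper.WhisperModel(...) call.
    """
    try:
        return build_model(preferred_device), preferred_device
    except Exception as e:
        print(f'Error initializing model on {preferred_device}: {e}')
        print('Falling back to CPU.')
        return build_model('cpu'), 'cpu'


# Stub factory that pretends CUDA is unavailable:
def cpu_only_factory(device):
    if device != 'cpu':
        raise RuntimeError('CUDA not available')
    return 'model'

model, device = create_model_with_fallback(cpu_only_factory)
print(device)  # cpu
```

The key design point is that the fallback wraps only model construction, so a missing CUDA runtime degrades the device choice without aborting the whole script.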

README.md

Lines changed: 17 additions & 8 deletions
````diff
@@ -8,11 +8,16 @@
 
 WhisperWriter is a small speech-to-text app that uses [OpenAI's Whisper model](https://openai.com/research/whisper) to auto-transcribe recordings from a user's microphone.
 
-Once started, the script runs in the background and waits for a keyboard shortcut to be pressed (`ctrl+shift+space` by default, but this can be changed in the [Configuration Options](#configuration-options)). When the shortcut is pressed, the app starts recording from your microphone. It will continue recording until you stop speaking or there is a long enough pause in your speech. While it is recording, a small status window is displayed that shows the current stage of the transcription process. Once the transcription is complete, the transcribed text will be automatically written to the active window.
+Once started, the script runs in the background and waits for a keyboard shortcut to be pressed (`ctrl+shift+space` by default). When the shortcut is pressed, the app starts recording from your microphone. There are three options to stop recording:
+- `voice_activity_detection`, which stops recording once it detects a long enough pause in your speech.
+- `press_to_toggle`, which stops recording when the activation key is pressed again.
+- `hold_to_record`, which stops recording when the activation key is released.
+
+You can change the activation key and recording mode in the [Configuration Options](#configuration-options). While recording and transcribing, a small status window is displayed that shows the current stage of the process (but this can be turned off). Once the transcription is complete, the transcribed text will be automatically written to the active window.
 
 The transcription can either be done locally through the [faster-whisper Python package](https://github.com/SYSTRAN/faster-whisper/) or through a request to [OpenAI's API](https://platform.openai.com/docs/guides/speech-to-text). By default, the app will use a local model, but you can change this in the [Configuration Options](#configuration-options). If you choose to use the API, you will need to provide your OpenAI API key in a `.env` file.
 
-**Fun fact:** Almost the entirety of this project was pair-programmed with [ChatGPT-4](https://openai.com/product/gpt-4) and [GitHub Copilot](https://github.com/features/copilot) using VS Code. Practically every line, including most of this README, was written by AI. After the initial prototype was finished, WhisperWriter was used to write a lot of the prompts as well!
+**Fun fact:** Almost the entirety of the initial release of the project was pair-programmed with [ChatGPT-4](https://openai.com/product/gpt-4) and [GitHub Copilot](https://github.com/features/copilot) using VS Code. Practically every line, including most of this README, was written by AI. After the initial prototype was finished, WhisperWriter was used to write a lot of the prompts as well!
 
 ## Getting Started
 
@@ -22,6 +27,11 @@ Before you can run this app, you'll need to have the following software installed:
 - Git: [https://git-scm.com/downloads](https://git-scm.com/downloads)
 - Python `3.11`: [https://www.python.org/downloads/](https://www.python.org/downloads/)
 
+If you want to run `faster-whisper` on your GPU, you'll also need to install the following NVIDIA libraries:
+
+- [cuBLAS for CUDA 11](https://developer.nvidia.com/cublas)
+- [cuDNN 8 for CUDA 11](https://developer.nvidia.com/cudnn)
+
 ### Installation
 To set up and run the project, follow these steps:
 
@@ -54,7 +64,7 @@ pip install -r requirements.txt
 To switch between running Whisper locally and using the OpenAI API, you need to modify the `src\config.json` file:
 
 - If you prefer using the OpenAI API, set `"use_api"` to `true`. You will also need to set up your OpenAI API key in the next step.
-- If you prefer using a local Whisper model, set `"use_api"` to `false`. You may also want to change the device that the model uses; see the [Model Options](#model-options).
+- If you prefer using a local Whisper model, set `"use_api"` to `false`. You may also want to change the device that the model uses; see the [Model Options](#model-options). Note that you need to have the [NVIDIA libraries installed](https://github.com/SYSTRAN/faster-whisper/#gpu) to run the model on your GPU.
 
 ```
 {
@@ -109,6 +119,7 @@ WhisperWriter uses a configuration file to customize its behaviour.
     "vad_filter": false
   },
   "activation_key": "ctrl+shift+space",
+  "recording_mode": "voice_activity_detection",
   "sound_device": null,
   "sample_rate": 16000,
   "silence_duration": 900,
@@ -137,6 +148,7 @@ WhisperWriter uses a configuration file to customize its behaviour.
 - `vad_filter`: Set to `true` to use [a voice activity detection (VAD) filter](https://github.com/snakers4/silero-vad) to remove silence from the recording. (Default: `false`)
 #### Customization Options
 - `activation_key`: The keyboard shortcut to activate the recording and transcribing process. (Default: `"ctrl+shift+space"`)
+- `recording_mode`: The recording mode to use: `voice_activity_detection` stops recording after a long enough pause in your speech, `press_to_toggle` stops when the activation key is pressed again, and `hold_to_record` records only while the activation key is held down. (Default: `"voice_activity_detection"`)
 - `sound_device`: The name of the sound device to use for recording. Set to `null` to let the system automatically choose the default device. To find a device number, run `python -m sounddevice`. (Default: `null`)
 - `sample_rate`: The sample rate in Hz to use for recording. (Default: `16000`)
 - `silence_duration`: The duration in milliseconds to wait for silence before stopping the recording. (Default: `900`)
@@ -145,7 +157,6 @@ WhisperWriter uses a configuration file to customize its behaviour.
 - `add_trailing_space`: Set to `true` to add a trailing space to the transcribed text. (Default: `true`)
 - `remove_capitalization`: Set to `true` to convert the transcribed text to lowercase. (Default: `false`)
 - `print_to_terminal`: Set to `true` to print the script status and transcribed text to the terminal. (Default: `true`)
-- `push_to_talk`: Set to `true` to enable push to talk. Recording starts when activation-key is pressed down. When activation-key is released, recording stops and transcription starts.
 - `hide_window`: Set to `true` to hide the status window.
 
 If any of the configuration options are invalid or not provided, the program will use the default values.
@@ -164,9 +175,6 @@ Below are features I am planning to add in the near future:
 - [ ] Updating GUI
 - [ ] Creating standalone executable file
 
-Below are features I plan on investigating and may end up adding in the future:
-- [ ] Push-to-talk option
-
 Below are features not currently planned:
 - [ ] Pipelining audio files
 
@@ -176,8 +184,9 @@ Contributions are welcome! I created this project for my own personal use and di
 
 ## Credits
 
-- [OpenAI](https://openai.com/) for creating the Whisper model and providing the API.
+- [OpenAI](https://openai.com/) for creating the Whisper model and providing the API. Plus [ChatGPT](https://chat.openai.com/), which was used to write a lot of the initial code for this project.
 - [Guillaume Klein](https://github.com/guillaumekln) for creating the [faster-whisper Python package](https://github.com/SYSTRAN/faster-whisper).
+- All of our [contributors](https://github.com/savbell/whisper-writer/graphs/contributors)!
 
 ## License
 
````
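The README states that invalid or missing configuration options fall back to the default values. A minimal sketch of how such a merge can work — `merge_config`, `DEFAULTS`, and `VALID_RECORDING_MODES` here are hypothetical illustrations covering only a subset of options, not the project's actual `load_config_with_defaults`:

```python
import json

# A subset of the defaults, for illustration only.
DEFAULTS = {
    'activation_key': 'ctrl+shift+space',
    'recording_mode': 'voice_activity_detection',
    'sample_rate': 16000,
    'silence_duration': 900,
    'hide_status_window': False,
}

VALID_RECORDING_MODES = {'voice_activity_detection', 'press_to_toggle', 'hold_to_record'}

def merge_config(user_json):
    """Overlay known keys from a user config string onto the defaults.

    Unknown keys are ignored; an invalid recording_mode reverts to the default.
    """
    user = json.loads(user_json) if user_json else {}
    config = dict(DEFAULTS)
    config.update({k: v for k, v in user.items() if k in DEFAULTS})
    if config['recording_mode'] not in VALID_RECORDING_MODES:
        config['recording_mode'] = DEFAULTS['recording_mode']
    return config

cfg = merge_config('{"recording_mode": "hold_to_record", "sample_rate": 44100}')
print(cfg['recording_mode'], cfg['sample_rate'])  # hold_to_record 44100
```

Starting from a copy of the defaults and overlaying only recognized keys is what makes a partial or partly invalid `config.json` safe to load.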

src/config.json

Lines changed: 1 addition & 1 deletion
```diff
@@ -17,6 +17,7 @@
     "vad_filter": false
   },
   "activation_key": "ctrl+shift+space",
+  "recording_mode": "voice_activity_detection",
   "sound_device": null,
   "sample_rate": 16000,
   "silence_duration": 900,
@@ -25,6 +26,5 @@
   "add_trailing_space": true,
   "remove_capitalization": false,
   "print_to_terminal": true,
-  "push_to_talk": false,
   "hide_status_window": false
 }
```

src/main.py

Lines changed: 18 additions & 7 deletions
```diff
@@ -40,6 +40,7 @@ def load_config_with_defaults():
         'vad_filter': False,
     },
     'activation_key': 'ctrl+shift+space',
+    'recording_mode': 'voice_activity_detection',  # 'voice_activity_detection', 'press_to_toggle', or 'hold_to_record'
     'sound_device': None,
     'sample_rate': 16000,
     'silence_duration': 900,
@@ -48,7 +49,6 @@ def load_config_with_defaults():
     'add_trailing_space': False,
     'remove_capitalization': False,
     'print_to_terminal': True,
-    'push_to_talk': False,
     'hide_status_window': False
 }
@@ -109,20 +109,31 @@ def typewrite(text, interval):
 # Main script
 
 config = load_config_with_defaults()
-method = 'OpenAI\'s API' if config['use_api'] else 'a local model'
-status_queue = queue.Queue()
 
-keyboard.add_hotkey(config['activation_key'], on_shortcut)
-pyinput_keyboard = Controller()
+model_method = 'OpenAI\'s API' if config['use_api'] else 'a local model'
+print(f'Script activated. Whisper is set to run using {model_method}. To change this, modify the "use_api" value in the src\\config.json file.')
 
-print(f'Script activated. Whisper is set to run using {method}. To change this, modify the "use_api" value in the src\\config.json file.')
+# Set up local model if needed
 local_model = None
 if not config['use_api']:
     print('Creating local model...')
     local_model = create_local_model(config)
     print('Local model created.')
 
-print(f'Press {format_keystrokes(config["activation_key"])} to start recording and transcribing. Press Ctrl+C on the terminal window to quit.')
+print(f'WhisperWriter is set to record using {config["recording_mode"]}. To change this, modify the "recording_mode" value in the src\\config.json file.')
+print(f'The activation key combo is set to {format_keystrokes(config["activation_key"])}.', end='')
+if config['recording_mode'] == 'voice_activity_detection':
+    print(' When it is pressed, recording will start, and will stop when you stop speaking.')
+elif config['recording_mode'] == 'press_to_toggle':
+    print(' When it is pressed, recording will start, and will stop when you press the key combo again.')
+elif config['recording_mode'] == 'hold_to_record':
+    print(' When it is pressed, recording will start, and will stop when you release the key combo.')
+print('Press Ctrl+C on the terminal window to quit.')
+
+# Set up status window and keyboard listener
+status_queue = queue.Queue()
+pyinput_keyboard = Controller()
+keyboard.add_hotkey(config['activation_key'], on_shortcut)
 try:
     keyboard.wait()  # Keep the script running to listen for the shortcut
 except KeyboardInterrupt:
```

src/transcription.py

Lines changed: 26 additions & 20 deletions
```diff
@@ -22,13 +22,13 @@ def create_local_model(config):
                              device=config['local_model_options']['device'],
                              compute_type=config['local_model_options']['compute_type'])
     except Exception as e:
-        print(f"Error initializing WhisperModel with CUDA: {e}")
-        print("Falling back to CPU.")
+        print(f'Error initializing WhisperModel with CUDA: {e}') if config['print_to_terminal'] else ''
+        print('Falling back to CPU.') if config['print_to_terminal'] else ''
         model = WhisperModel(config['local_model_options']['model'],
                              device='cpu',
                              compute_type=config['local_model_options']['compute_type'])
     else:
-        print("CUDA not available, using CPU.")
+        print('CUDA not available, using CPU.') if config['print_to_terminal'] else ''
         model = WhisperModel(config['local_model_options']['model'],
                              device='cpu',
                              compute_type=config['local_model_options']['compute_type'])
@@ -78,7 +78,7 @@ def record(status_queue, cancel_flag, config):
     buffer_duration = 300  # 300ms
     silence_duration = config['silence_duration'] if config else 900  # 900ms
 
-    push_to_talk = config['push_to_talk']
+    recording_mode = config['recording_mode']
     activation_key = config['activation_key']
 
     vad = webrtcvad.Vad(3)  # Aggressiveness mode: 3 (highest)
@@ -97,22 +97,28 @@ def record(status_queue, cancel_flag, config):
 
             frame = buffer[:sample_rate * frame_duration // 1000]
             buffer = buffer[sample_rate * frame_duration // 1000:]
-
-            if push_to_talk:
-                recording.extend(frame)
-                if not keyboard.is_pressed(activation_key):
-                    break
-            else:
-                is_speech = vad.is_speech(np.array(frame).tobytes(), sample_rate)
-                if is_speech:
-                    recording.extend(frame)
-                    num_silent_frames = 0
-                else:
-                    if len(recording) > 0:
-                        num_silent_frames += 1
-
-                    if num_silent_frames >= num_silence_frames:
+
+            if not cancel_flag():
+                if recording_mode == 'press_to_toggle':
+                    if len(recording) > 0 and keyboard.is_pressed(activation_key):
+                        break
+                    else:
+                        recording.extend(frame)
+                if recording_mode == 'hold_to_record':
+                    if keyboard.is_pressed(activation_key):
+                        recording.extend(frame)
+                    else:
                         break
+                elif recording_mode == 'voice_activity_detection':
+                    is_speech = vad.is_speech(np.array(frame).tobytes(), sample_rate)
+                    if is_speech:
+                        recording.extend(frame)
+                        num_silent_frames = 0
+                    else:
+                        if len(recording) > 0:
+                            num_silent_frames += 1
+                        if num_silent_frames >= num_silence_frames:
+                            break
 
     if cancel_flag():
         status_queue.put(('cancel', ''))
@@ -184,4 +190,4 @@ def record_and_transcribe(status_queue, cancel_flag, config, local_model=None):
     if cancel_flag():
         return ''
     result = transcribe(status_queue, cancel_flag, config, audio_file, local_model)
-    return result
\ No newline at end of file
+    return result
```
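The per-frame mode dispatch added to `record()` can be summarized as a pure decision function. The sketch below is a hypothetical simplification — `should_stop` is not part of the project, and the keyboard and VAD state are passed in as plain booleans rather than read from the real `keyboard` and `webrtcvad` libraries:

```python
def should_stop(recording_mode, key_pressed, is_speech,
                has_audio, num_silent_frames, num_silence_frames):
    """Return True when recording should stop under the given mode."""
    if recording_mode == 'press_to_toggle':
        # Stop when the activation key is pressed again after audio was captured.
        return has_audio and key_pressed
    if recording_mode == 'hold_to_record':
        # Stop as soon as the activation key is released.
        return not key_pressed
    if recording_mode == 'voice_activity_detection':
        # Stop once enough consecutive silent frames follow captured speech.
        return (not is_speech and has_audio
                and num_silent_frames >= num_silence_frames)
    return False

# Releasing the key ends a hold_to_record session:
print(should_stop('hold_to_record', key_pressed=False, is_speech=False,
                  has_audio=True, num_silent_frames=0, num_silence_frames=30))  # True
```

Factoring the stop conditions out like this makes the three modes easy to compare: only `voice_activity_detection` depends on the silence counters, while the other two depend solely on the activation key and whether any audio has been captured yet.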
