Commit b16b5f1

Add OpenAI as second cloud transcription provider (#7)
Introduce WhisperAPIEngine base class that handles WAV encoding, multipart HTTP, and response parsing. GroqEngine and OpenAIEngine are thin subclasses that only supply provider-specific config (URL, models, keychain key).

Engine changes:
- WAVEncoder.swift: shared base class, WAVEncoder, WhisperAPIConfig, errors
- OpenAIEngine.swift: OpenAI subclass with whisper-1, gpt-4o-transcribe, gpt-4o-mini-transcribe (default) models
- GroqEngine.swift: rewritten as WhisperAPIEngine subclass (removes ~150 lines of duplicated HTTP/WAV/multipart code)
- TranscriptionEngine.swift: add .openAI case, update EngineResolver (auto priority: Groq -> OpenAI -> Apple Speech)

UI changes:
- SettingsView: add OpenAI section (API key, model picker), extract shared apiKeyField ViewBuilder
- OnboardingView: rename apiKeyConfigured to anyCloudKeyConfigured, update engine badge to reflect all three providers
- TranscriptionManager: generalize error messages for multiple cloud providers

Docs:
- README.md: rewrite for three-backend support
- AGENTS.md: update architecture, patterns, project structure
- architecture.md: add WhisperAPIEngine hierarchy, OpenAI engine section
- release.yml: update release notes template
1 parent c7266f3 commit b16b5f1
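As background on what the new base class centralizes, a multipart/form-data upload body for a Whisper-style transcription endpoint can be sketched as follows. This is illustrative only, not the commit's actual WhisperAPIEngine code; the `file` and `model` field names follow the OpenAI/Groq transcription API convention.

```swift
import Foundation

/// Minimal sketch of a multipart/form-data body for a Whisper-style
/// /audio/transcriptions endpoint. Not the project's actual code.
func makeMultipartBody(wavData: Data, model: String, boundary: String) -> Data {
    var body = Data()
    let crlf = "\r\n"
    // Model field
    body.append("--\(boundary)\(crlf)".data(using: .utf8)!)
    body.append("Content-Disposition: form-data; name=\"model\"\(crlf)\(crlf)".data(using: .utf8)!)
    body.append("\(model)\(crlf)".data(using: .utf8)!)
    // Audio file field (the WAV bytes produced by the encoder)
    body.append("--\(boundary)\(crlf)".data(using: .utf8)!)
    body.append("Content-Disposition: form-data; name=\"file\"; filename=\"audio.wav\"\(crlf)".data(using: .utf8)!)
    body.append("Content-Type: audio/wav\(crlf)\(crlf)".data(using: .utf8)!)
    body.append(wavData)
    body.append(crlf.data(using: .utf8)!)
    // Closing boundary
    body.append("--\(boundary)--\(crlf)".data(using: .utf8)!)
    return body
}
```

Since this body format is identical across Groq and OpenAI, hoisting it into one base class is what removes the ~150 duplicated lines mentioned above.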

File tree

12 files changed: +776 −406 lines changed


.github/workflows/release.yml

Lines changed: 8 additions & 8 deletions

@@ -98,25 +98,25 @@ jobs:
           body: |
             ## AudioType v${{ steps.version.outputs.version }}

-            Voice-to-text for macOS powered by Groq cloud transcription (Whisper Large V3).
+            Voice-to-text for macOS with multiple transcription backends: Groq Whisper, OpenAI Whisper, and Apple Speech (on-device).

             ### What's New
-            - Cloud-powered transcription via Groq API for significantly better accuracy
-            - Self-serve: bring your own free Groq API key
-            - Simplified build — no more whisper.cpp compilation required
+            - Cloud-powered transcription via Groq or OpenAI Whisper APIs
+            - On-device fallback via Apple Speech — works without an API key
+            - Self-serve: bring your own Groq (free tier) or OpenAI API key

             ### Installation
             1. Download `AudioType.dmg` or `AudioType.zip`
             2. Extract and move `AudioType.app` to your Applications folder
-            3. Open the app and grant Microphone and Accessibility permissions
-            4. Enter your free Groq API key (get one at https://console.groq.com/keys)
+            3. Open the app and grant Microphone, Accessibility, and Speech Recognition permissions
+            4. Optionally enter a Groq or OpenAI API key for cloud transcription
             5. Hold the fn key to dictate

             ### Requirements
             - macOS 14.0 or later
             - Apple Silicon or Intel Mac
-            - Internet connection
-            - Free Groq API key
+            - Internet connection (for cloud engines; not needed for Apple Speech)
+            - API key optional (Groq free tier or OpenAI)

             > Looking for the offline/local version? See [v1.1.1](https://github.com/PatelUtkarsh/audio-type/releases/tag/v1.1.1)
           files: |

AGENTS.md

Lines changed: 35 additions & 20 deletions
@@ -4,7 +4,7 @@

 ## Project Overview

-AudioType is a **native macOS menu bar app** for voice-to-text. Users hold the `fn` key to record, release to transcribe, and the result is typed into the focused app. It supports two transcription backends: **Groq Whisper** (cloud) and **Apple Speech** (on-device). If no Groq API key is configured, the app falls back to Apple's on-device `SFSpeechRecognizer` automatically. It runs as an `LSUIElement` (no dock icon), built with Swift Package Manager (not Xcode projects).
+AudioType is a **native macOS menu bar app** for voice-to-text. Users hold the `fn` key to record, release to transcribe, and the result is typed into the focused app. It supports three transcription backends: **Groq Whisper** (cloud), **OpenAI Whisper** (cloud), and **Apple Speech** (on-device). If no cloud API key is configured, the app falls back to Apple's on-device `SFSpeechRecognizer` automatically. It runs as an `LSUIElement` (no dock icon), built with Swift Package Manager (not Xcode projects).

 ## Build Commands

@@ -72,23 +72,26 @@ Releases (`.github/workflows/release.yml`) trigger on `v*` tags and produce `Aud

 ### Transcription Engine System

-The app uses a **protocol-based engine abstraction** to support multiple speech-to-text backends:
+The app uses a **protocol-based engine abstraction** with a shared base class to support multiple speech-to-text backends:

 ```
 TranscriptionEngine (protocol)
-├── GroqEngine — Cloud-based, Groq Whisper API, requires API key
+├── WhisperAPIEngine (base class) — shared WAV encoding, multipart HTTP, response parsing
+│   ├── GroqEngine — Cloud, Groq Whisper API, requires API key
+│   └── OpenAIEngine — Cloud, OpenAI Whisper/GPT-4o API, requires API key
 └── AppleSpeechEngine — On-device, Apple SFSpeechRecognizer, no API key needed
 ```

 **`EngineResolver`** selects the active engine at runtime based on user preference (`TranscriptionEngineType`):

 | Mode | Behavior |
 |------|----------|
-| **Auto** (default) | Groq if API key exists, otherwise Apple Speech |
+| **Auto** (default) | Groq if key exists → OpenAI if key exists → Apple Speech |
 | **Groq Whisper** | Always use Groq (fails if no key) |
+| **OpenAI Whisper** | Always use OpenAI (fails if no key) |
 | **Apple Speech** | Always use on-device recognition |

-Both engines implement a single method: `transcribe(samples: [Float]) async throws -> String` — accepting 16 kHz mono Float32 PCM samples from `AudioRecorder`.
+All engines implement a single method: `transcribe(samples: [Float]) async throws -> String` — accepting 16 kHz mono Float32 PCM samples from `AudioRecorder`.

 ### Data Flow
@@ -100,11 +103,11 @@ fn key held → HotKeyManager → TranscriptionManager.startRecording()
 fn key released → TranscriptionManager.stopRecordingAndTranscribe()

 EngineResolver.resolve() → TranscriptionEngine
-
-GroqEngine                AppleSpeechEngine
-(HTTP multipart →         (SFSpeechAudioBuffer-
- Groq Whisper API)         RecognitionRequest)
-
+
+GroqEngine           OpenAIEngine          AppleSpeechEngine
+(WhisperAPIEngine    (WhisperAPIEngine     (SFSpeechAudioBuffer-
+ → Groq API)          → OpenAI API)         RecognitionRequest)
+
 transcribed text

 TextPostProcessor (corrections)
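The resolver step in the flow above can be sketched in Swift. This is a hypothetical shape: only `transcribe(samples:)`, `displayName`, `isAvailable`, and the Groq → OpenAI → Apple Speech auto order come from the docs in this commit; the rest is assumed for illustration.

```swift
// Sketch of the engine abstraction described in AGENTS.md.
// Names match the docs; signatures beyond transcribe(samples:) are assumptions.
protocol TranscriptionEngine {
    var displayName: String { get }
    var isAvailable: Bool { get }
    /// 16 kHz mono Float32 PCM samples in, transcribed text out.
    func transcribe(samples: [Float]) async throws -> String
}

enum EngineResolverSketch {
    /// Auto mode: first cloud engine with a configured key wins,
    /// otherwise fall back to on-device Apple Speech.
    static func resolveAuto() -> TranscriptionEngine {
        if GroqEngine.isConfigured { return GroqEngine() }
        if OpenAIEngine.isConfigured { return OpenAIEngine() }
        return AppleSpeechEngine()
    }
}
```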
@@ -122,7 +125,7 @@ fn key released → TranscriptionManager.stopRecordingAndTranscribe()
 | Accessibility | Keyboard simulation (TextInserter) | Granted via System Settings |
 | Speech Recognition | Apple Speech engine (on-device) | `NSSpeechRecognitionUsageDescription` |

-Speech recognition permission is requested on-demand the first time the Apple Speech engine is used. The Groq engine does not require this permission.
+Speech recognition permission is requested on-demand the first time the Apple Speech engine is used. Cloud engines (Groq, OpenAI) do not require this permission.

 ## Project Structure

@@ -135,19 +138,21 @@ AudioType/
   Core/                       # Business logic & transcription engines
     AudioRecorder.swift       # AVAudioEngine capture, PCM→16kHz resampling, RMS level
     TranscriptionEngine.swift # TranscriptionEngine protocol, TranscriptionEngineType, EngineResolver
-    GroqEngine.swift          # Groq Whisper API client, WAV encoding, multipart upload
+    WAVEncoder.swift          # WhisperAPIEngine base class, WAVEncoder, WhisperAPIConfig, Data helpers
+    GroqEngine.swift          # GroqEngine subclass, GroqModel enum, TranscriptionLanguage
+    OpenAIEngine.swift        # OpenAIEngine subclass, OpenAIModel enum
     AppleSpeechEngine.swift   # Apple SFSpeechRecognizer on-device transcription
     HotKeyManager.swift       # CGEventTap for fn key hold detection
     TextInserter.swift        # CGEvent keyboard simulation to type into focused app
     TextPostProcessor.swift   # Post-transcription corrections (tech terms, punctuation)
   UI/                         # SwiftUI views
     RecordingOverlay.swift    # Floating waveform (recording) / thinking dots (processing)
     OnboardingView.swift      # First-launch permission setup (API key optional)
-    SettingsView.swift        # Engine picker, API key, model, language, permissions
+    SettingsView.swift        # Engine picker, API keys, models, language, permissions
     Theme.swift               # Brand color system (coral palette, adaptive dark/light)
   Utilities/
     Permissions.swift         # Microphone, Accessibility, Speech Recognition permission helpers
-    KeychainHelper.swift      # File-based secret storage (Application Support, 0600 perms)
+    KeychainHelper.swift      # macOS Keychain-based secret storage
 Resources/
   Assets.xcassets/            # Asset catalog (currently empty)
@@ -170,9 +175,9 @@ Resources/

 ### Types & Naming
 - **Protocols** for abstractions with multiple implementations: `TranscriptionEngine`
-- **Classes** for stateful objects with reference semantics: `TranscriptionManager`, `AudioRecorder`, `GroqEngine`, `AppleSpeechEngine`
-- **Enums** for namespaced constants and error types: `AudioTypeTheme`, `GroqEngineError`, `AppleSpeechError`, `TranscriptionEngineType`, `KeychainHelper`
-- **Structs** for SwiftUI views: `RecordingOverlay`, `SettingsView`
+- **Classes** for stateful objects with reference semantics: `TranscriptionManager`, `AudioRecorder`, `WhisperAPIEngine`, `GroqEngine`, `OpenAIEngine`, `AppleSpeechEngine`
+- **Enums** for namespaced constants and error types: `AudioTypeTheme`, `WhisperAPIError`, `AppleSpeechError`, `TranscriptionEngineType`, `KeychainHelper`
+- **Structs** for SwiftUI views and config: `RecordingOverlay`, `SettingsView`, `WhisperAPIConfig`
 - camelCase for properties/methods, PascalCase for types
 - Identifier names: min 1 char, max 50 chars; `x`, `y`, `i`, `j`, `k` are allowed
@@ -184,7 +189,8 @@ Resources/
 - Errors shown to user go through `TranscriptionState.error(String)`

 ### Patterns Used
-- **Protocol abstraction**: `TranscriptionEngine` with `GroqEngine` and `AppleSpeechEngine` implementations
+- **Protocol abstraction**: `TranscriptionEngine` with `WhisperAPIEngine` base class and `AppleSpeechEngine`
+- **Inheritance for shared logic**: `WhisperAPIEngine` base class handles WAV encoding, multipart HTTP, response parsing; `GroqEngine` and `OpenAIEngine` are thin subclasses supplying config
 - **Resolver pattern**: `EngineResolver.resolve()` picks the engine at runtime based on config
 - **Singleton**: `TranscriptionManager.shared`, `TextPostProcessor.shared`, `AudioLevelMonitor.shared`
 - **`@MainActor`** on `TranscriptionManager` — all state mutations on main thread
@@ -193,7 +199,16 @@ Resources/
 - **Closures** for callbacks: `HotKeyManager(callback:)`, `audioRecorder.onLevelUpdate`
 - **`os.log` Logger** with subsystem `"com.audiotype"` — use per-class categories

-### Adding a New Transcription Engine
+### Adding a New Cloud Transcription Provider
+1. Create a new subclass of `WhisperAPIEngine` in `AudioType/Core/`
+2. Override `config` (with `WhisperAPIConfig`) and `currentModel` — that's it for the engine
+3. Add a model enum if the provider has multiple models
+4. Add static convenience methods (`isConfigured`, `setApiKey`, `clearApiKey`)
+5. Add a case to `TranscriptionEngineType` and update `EngineResolver.resolve()`
+6. Update `EngineResolver.anyEngineAvailable` if the engine has standalone availability
+7. Add UI in `SettingsView.swift` (API key field, model picker)
+
+### Adding a Non-Whisper Engine
 1. Create a new class conforming to `TranscriptionEngine` in `AudioType/Core/`
 2. Implement `displayName`, `isAvailable`, and `transcribe(samples:)`
 3. Add a case to `TranscriptionEngineType` and update `EngineResolver.resolve()`
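Under the provider checklist above, a new subclass might look roughly like this. The `config` and `currentModel` override points come from the steps in the diff; the engine name, URL, and `WhisperAPIConfig` field names are hypothetical assumptions for illustration.

```swift
import Foundation

// Hypothetical provider following the checklist in AGENTS.md.
// WhisperAPIConfig's exact fields are assumptions based on the commit description.
final class ExampleEngine: WhisperAPIEngine {
    override var config: WhisperAPIConfig {
        WhisperAPIConfig(
            endpoint: URL(string: "https://api.example.com/v1/audio/transcriptions")!,
            keychainKey: "example-api-key"   // where KeychainHelper stores the secret
        )
    }
    // Single-model provider, so no model enum needed (step 3).
    override var currentModel: String { "example-whisper-large" }
}
```

Because the base class owns WAV encoding, the multipart request, and response parsing, this is the entire engine; the remaining steps only touch `TranscriptionEngineType`, `EngineResolver`, and `SettingsView`.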
@@ -214,7 +229,7 @@ All colors live in `AudioType/UI/Theme.swift` (`AudioTypeTheme` enum). Never use
 - **Error**: `exclamationmark.triangle.fill`, tinted `.systemRed`

 ### Security
-- API keys stored in `~/Library/Application Support/AudioType/.secrets` with `0600` permissions
+- API keys stored in macOS Keychain via `KeychainHelper` (Security framework)
 - Never commit `.env`, credentials, or API keys
 - Audio is recorded in-memory only — never written to disk
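Keychain storage of this kind typically goes through the Security framework's SecItem API. A minimal sketch (not the actual KeychainHelper implementation; the service name and account handling are assumptions):

```swift
import Foundation
import Security

// Minimal generic-password save/read, sketching what a KeychainHelper
// built on the Security framework typically does.
enum KeychainSketch {
    static func save(_ secret: String, account: String) -> Bool {
        let query: [String: Any] = [
            kSecClass as String: kSecClassGenericPassword,
            kSecAttrService as String: "com.audiotype",   // assumed service name
            kSecAttrAccount as String: account,
            kSecValueData as String: Data(secret.utf8),
        ]
        SecItemDelete(query as CFDictionary)   // replace any existing item
        return SecItemAdd(query as CFDictionary, nil) == errSecSuccess
    }

    static func read(account: String) -> String? {
        let query: [String: Any] = [
            kSecClass as String: kSecClassGenericPassword,
            kSecAttrService as String: "com.audiotype",
            kSecAttrAccount as String: account,
            kSecReturnData as String: true,
        ]
        var result: AnyObject?
        guard SecItemCopyMatching(query as CFDictionary, &result) == errSecSuccess,
              let data = result as? Data else { return nil }
        return String(data: data, encoding: .utf8)
    }
}
```

Unlike the previous file-based `.secrets` approach, Keychain items are encrypted at rest and access-controlled per app by the OS.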

AudioType/App/TranscriptionManager.swift

Lines changed: 3 additions & 3 deletions

@@ -60,7 +60,7 @@ class TranscriptionManager: ObservableObject {

         if !EngineResolver.anyEngineAvailable {
             logger.warning("No transcription engine available")
-            setState(.error("No engine available — add a Groq key or enable Apple Speech"))
+            setState(.error("No engine available — add a cloud API key or enable Apple Speech"))
         } else {
             logger.info("Transcription engine ready: \(engine.displayName)")
         }
@@ -93,7 +93,7 @@ class TranscriptionManager: ObservableObject {
             setState(.idle)
             logger.info("Engine config changed, active engine: \(engine.displayName)")
         } else {
-            setState(.error("No engine available — add a Groq key or enable Apple Speech"))
+            setState(.error("No engine available — add a cloud API key or enable Apple Speech"))
         }
     }

@@ -118,7 +118,7 @@ class TranscriptionManager: ObservableObject {
         }

         guard EngineResolver.anyEngineAvailable else {
-            setState(.error("No engine available — add a Groq key or enable Apple Speech"))
+            setState(.error("No engine available — add a cloud API key or enable Apple Speech"))
             return
         }

AudioType/Core/AppleSpeechEngine.swift

Lines changed: 1 addition & 1 deletion

@@ -30,7 +30,7 @@ enum AppleSpeechError: Error, LocalizedError {

 /// On-device speech-to-text using Apple's Speech framework (`SFSpeechRecognizer`).
 ///
 /// This engine requires no API key and works offline when on-device recognition is
-/// available (macOS 13+). It acts as the fallback when no Groq API key is configured.
+/// available (macOS 13+). It acts as the fallback when no cloud API key is configured.
 class AppleSpeechEngine: TranscriptionEngine {

     private let logger = Logger(subsystem: "com.audiotype", category: "AppleSpeechEngine")
