Commit f7c8d59

Add Apple Speech as fallback when no Groq API key is configured (#5)
* Add AGENTS.md with build, lint, style, and architecture guidelines

* Add language dropdown with auto-detect and 26 languages

  Support multilingual transcription via Groq's Whisper API. Default is auto-detect (omits the language param, letting Whisper infer). Users can pin a specific language in Settings for better accuracy and latency.

* Add Apple Speech as fallback transcription engine when no Groq API key is configured

  Introduce a TranscriptionEngine protocol to abstract speech-to-text backends, with GroqEngine (cloud) and AppleSpeechEngine (on-device via SFSpeechRecognizer) as implementations. In Auto mode, Groq is preferred when a key exists; otherwise Apple Speech is used, making the app fully functional without any API key.

  - Add TranscriptionEngine protocol and EngineResolver
  - Add AppleSpeechEngine using SFSpeechAudioBufferRecognitionRequest
  - Make API key optional in onboarding (skip to use Apple Speech)
  - Add engine picker in Settings (Auto / Groq / Apple Speech)
  - Add Speech framework linking and NSSpeechRecognitionUsageDescription
  - Add speech recognition permission handling in Permissions

* Update AGENTS.md with engine abstraction architecture and Apple Speech docs

* Update architecture.md with dual-engine system and Apple Speech documentation
1 parent b1ba14b commit f7c8d59

File tree

12 files changed: +709 −108 lines changed


AGENTS.md

Lines changed: 81 additions & 13 deletions

@@ -4,7 +4,7 @@
 
 ## Project Overview
 
-AudioType is a **native macOS menu bar app** for voice-to-text. Users hold the `fn` key to record, release to transcribe via Groq's Whisper API, and the result is typed into the focused app. It runs as an `LSUIElement` (no dock icon), built with Swift Package Manager (not Xcode projects).
+AudioType is a **native macOS menu bar app** for voice-to-text. Users hold the `fn` key to record, release to transcribe, and the result is typed into the focused app. It supports two transcription backends: **Groq Whisper** (cloud) and **Apple Speech** (on-device). If no Groq API key is configured, the app falls back to Apple's on-device `SFSpeechRecognizer` automatically. It runs as an `LSUIElement` (no dock icon), built with Swift Package Manager (not Xcode projects).
 
 ## Build Commands

@@ -68,6 +68,62 @@ CI runs on every push/PR to `main` (`.github/workflows/ci.yml`):
 
 Releases (`.github/workflows/release.yml`) trigger on `v*` tags and produce `AudioType.dmg` + `AudioType.zip`.
 
+## Architecture
+
+### Transcription Engine System
+
+The app uses a **protocol-based engine abstraction** to support multiple speech-to-text backends:
+
+```
+TranscriptionEngine (protocol)
+├── GroqEngine — Cloud-based, Groq Whisper API, requires API key
+└── AppleSpeechEngine — On-device, Apple SFSpeechRecognizer, no API key needed
+```
+
+**`EngineResolver`** selects the active engine at runtime based on user preference (`TranscriptionEngineType`):
+
+| Mode | Behavior |
+|------|----------|
+| **Auto** (default) | Groq if API key exists, otherwise Apple Speech |
+| **Groq Whisper** | Always use Groq (fails if no key) |
+| **Apple Speech** | Always use on-device recognition |
+
+Both engines implement a single method: `transcribe(samples: [Float]) async throws -> String` — accepting 16 kHz mono Float32 PCM samples from `AudioRecorder`.
+
+### Data Flow
+
+```
+fn key held → HotKeyManager → TranscriptionManager.startRecording()
+        ↓
+AudioRecorder (AVAudioEngine, 16kHz mono PCM)
+        ↓
+fn key released → TranscriptionManager.stopRecordingAndTranscribe()
+        ↓
+EngineResolver.resolve() → TranscriptionEngine
+       ↓                    ↓
+  GroqEngine         AppleSpeechEngine
+  (HTTP multipart →  (SFSpeechAudioBuffer-
+   Groq Whisper API)  RecognitionRequest)
+       ↓                    ↓
+        transcribed text
+        ↓
+TextPostProcessor (corrections)
+        ↓
+TextInserter (CGEvent keyboard simulation)
+        ↓
+text typed into focused app
+```
+
+### Permission Requirements
+
+| Permission | Required for | Plist key |
+|------------|-------------|-----------|
+| Microphone | Audio recording | `NSMicrophoneUsageDescription` |
+| Accessibility | Keyboard simulation (TextInserter) | Granted via System Settings |
+| Speech Recognition | Apple Speech engine (on-device) | `NSSpeechRecognitionUsageDescription` |
+
+Speech recognition permission is requested on-demand the first time the Apple Speech engine is used. The Groq engine does not require this permission.
+
 ## Project Structure

@@ -76,24 +132,26 @@ AudioType/
   AudioTypeApp.swift          # @main, AppDelegate, onboarding flow
   MenuBarController.swift     # NSStatusItem, state-driven icon tinting, overlay windows
   TranscriptionManager.swift  # State machine (idle→recording→processing→idle/error)
-  Core/                       # Business logic
-    AudioRecorder.swift       # AVAudioEngine capture, PCM→16kHz resampling, RMS level
-    GroqEngine.swift          # Groq Whisper API client, WAV encoding, multipart upload
-    HotKeyManager.swift       # CGEventTap for fn key hold detection
-    TextInserter.swift        # CGEvent keyboard simulation to type into focused app
-    TextPostProcessor.swift   # Post-transcription corrections (tech terms, punctuation)
+  Core/                       # Business logic & transcription engines
+    AudioRecorder.swift       # AVAudioEngine capture, PCM→16kHz resampling, RMS level
+    TranscriptionEngine.swift # TranscriptionEngine protocol, TranscriptionEngineType, EngineResolver
+    GroqEngine.swift          # Groq Whisper API client, WAV encoding, multipart upload
+    AppleSpeechEngine.swift   # Apple SFSpeechRecognizer on-device transcription
+    HotKeyManager.swift       # CGEventTap for fn key hold detection
+    TextInserter.swift        # CGEvent keyboard simulation to type into focused app
+    TextPostProcessor.swift   # Post-transcription corrections (tech terms, punctuation)
   UI/                         # SwiftUI views
     RecordingOverlay.swift    # Floating waveform (recording) / thinking dots (processing)
-    OnboardingView.swift      # First-launch permission + API key setup
-    SettingsView.swift        # API key, model picker, permissions, launch-at-login
+    OnboardingView.swift      # First-launch permission setup (API key optional)
+    SettingsView.swift        # Engine picker, API key, model, language, permissions
     Theme.swift               # Brand color system (coral palette, adaptive dark/light)
   Utilities/
-    Permissions.swift         # Microphone + Accessibility permission helpers
+    Permissions.swift         # Microphone, Accessibility, Speech Recognition permission helpers
     KeychainHelper.swift      # File-based secret storage (Application Support, 0600 perms)
   Resources/
     Assets.xcassets/          # Asset catalog (currently empty)
 Resources/
-  Info.plist                  # Bundle config (LSUIElement, mic usage description)
+  Info.plist                  # Bundle config (LSUIElement, mic + speech recognition usage descriptions)
   AppIcon.icns                # App icon (coral gradient)
 ```

@@ -111,8 +169,9 @@ Resources/
 - Use `// MARK: -` sections to organize classes (`// MARK: - Private`, `// MARK: - Transcription`)
 
 ### Types & Naming
-- **Classes** for stateful objects with reference semantics: `TranscriptionManager`, `AudioRecorder`
-- **Enums** for namespaced constants and error types: `AudioTypeTheme`, `GroqEngineError`, `KeychainHelper`
+- **Protocols** for abstractions with multiple implementations: `TranscriptionEngine`
+- **Classes** for stateful objects with reference semantics: `TranscriptionManager`, `AudioRecorder`, `GroqEngine`, `AppleSpeechEngine`
+- **Enums** for namespaced constants and error types: `AudioTypeTheme`, `GroqEngineError`, `AppleSpeechError`, `TranscriptionEngineType`, `KeychainHelper`
 - **Structs** for SwiftUI views: `RecordingOverlay`, `SettingsView`
 - camelCase for properties/methods, PascalCase for types
 - Identifier names: min 1 char, max 50 chars; `x`, `y`, `i`, `j`, `k` are allowed

@@ -125,13 +184,22 @@ Resources/
 - Errors shown to user go through `TranscriptionState.error(String)`
 
 ### Patterns Used
+- **Protocol abstraction**: `TranscriptionEngine` with `GroqEngine` and `AppleSpeechEngine` implementations
+- **Resolver pattern**: `EngineResolver.resolve()` picks the engine at runtime based on config
 - **Singleton**: `TranscriptionManager.shared`, `TextPostProcessor.shared`, `AudioLevelMonitor.shared`
 - **`@MainActor`** on `TranscriptionManager` — all state mutations on main thread
 - **NotificationCenter** for decoupled state communication (`transcriptionStateChanged`, `audioLevelChanged`)
 - **`@Published` + ObservableObject** for SwiftUI reactivity
 - **Closures** for callbacks: `HotKeyManager(callback:)`, `audioRecorder.onLevelUpdate`
 - **`os.log` Logger** with subsystem `"com.audiotype"` — use per-class categories
 
+### Adding a New Transcription Engine
+1. Create a new class conforming to `TranscriptionEngine` in `AudioType/Core/`
+2. Implement `displayName`, `isAvailable`, and `transcribe(samples:)`
+3. Add a case to `TranscriptionEngineType` and update `EngineResolver.resolve()`
+4. Update `EngineResolver.anyEngineAvailable` if the engine has standalone availability
+5. Add any needed permissions to `Permissions.swift` and `Info.plist`
+
 ### Colors & Theming
 All colors live in `AudioType/UI/Theme.swift` (`AudioTypeTheme` enum). Never use hardcoded color literals in views. The palette:
 - **Coral** `#FF6B6B` — brand color, waveform bars, accents, checkmarks
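The engine abstraction described in the diff above can be sketched in a few lines of Swift. This is a simplified, hypothetical reconstruction built only from the names in the diff: the real `GroqEngine` and `AppleSpeechEngine` do network uploads and Speech-framework work (stubbed here), and the real `resolve()` reads stored settings rather than taking parameters.

```swift
import Foundation

// Hypothetical sketch: the protocol and resolver named in the diff,
// with stub engines in place of the real implementations.
protocol TranscriptionEngine {
    var displayName: String { get }
    var isAvailable: Bool { get }
    func transcribe(samples: [Float]) async throws -> String
}

enum TranscriptionEngineType: String {
    case auto, groq, appleSpeech
}

struct StubGroqEngine: TranscriptionEngine {
    let apiKey: String?
    var displayName: String { "Groq Whisper" }
    var isAvailable: Bool { apiKey != nil }   // cloud engine needs a key
    func transcribe(samples: [Float]) async throws -> String { "cloud result" }
}

struct StubAppleSpeechEngine: TranscriptionEngine {
    var displayName: String { "Apple Speech" }
    var isAvailable: Bool { true }            // on-device, no key required
    func transcribe(samples: [Float]) async throws -> String { "on-device result" }
}

enum EngineResolver {
    // Auto mode prefers Groq when a key exists, else falls back on-device.
    static func resolve(mode: TranscriptionEngineType,
                        groqKey: String?) -> TranscriptionEngine {
        switch mode {
        case .groq:        return StubGroqEngine(apiKey: groqKey)
        case .appleSpeech: return StubAppleSpeechEngine()
        case .auto:
            let groq = StubGroqEngine(apiKey: groqKey)
            return groq.isAvailable ? groq : StubAppleSpeechEngine()
        }
    }
}
```

With a key present, Auto resolves to Groq; without one, it resolves to Apple Speech, which is what makes the app usable with no API key at all.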

AudioType/App/AudioTypeApp.swift

Lines changed: 2 additions & 2 deletions

@@ -44,8 +44,8 @@ class AppDelegate: NSObject, NSApplicationDelegate {
         let micPermission = await Permissions.checkMicrophone()
         let accessibilityPermission = Permissions.checkAccessibility()
 
-        if !micPermission || !accessibilityPermission || !GroqEngine.isConfigured {
-            // Show onboarding window
+        // Show onboarding if permissions are missing or no engine is usable
+        if !micPermission || !accessibilityPermission || !EngineResolver.anyEngineAvailable {
             DispatchQueue.main.async {
                 self.showOnboarding()
             }
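The launch gate in this hunk boils down to one boolean: onboarding appears only when a permission or a usable engine is missing. A condensed, hypothetical form of that check (`shouldShowOnboarding` is an illustrative name, not a function from the codebase):

```swift
// Hypothetical condensation of the onboarding gate above.
// Before this commit the third term was `GroqEngine.isConfigured`;
// now any usable engine (including keyless Apple Speech) suffices.
func shouldShowOnboarding(micGranted: Bool,
                          accessibilityGranted: Bool,
                          anyEngineAvailable: Bool) -> Bool {
    return !micGranted || !accessibilityGranted || !anyEngineAvailable
}
```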

AudioType/App/TranscriptionManager.swift

Lines changed: 34 additions & 25 deletions

@@ -11,7 +11,7 @@ enum TranscriptionState: Equatable {
         switch (lhs, rhs) {
         case (.idle, .idle), (.recording, .recording), (.processing, .processing):
             return true
-        case let (.error(a), .error(b)):
+        case (.error(let a), .error(let b)):
             return a == b
         default:
             return false

@@ -26,8 +26,8 @@ class TranscriptionManager: ObservableObject {
     @Published private(set) var state: TranscriptionState = .idle
     @Published private(set) var isInitialized = false
     @Published private(set) var audioLevel: Float = 0.0
+    @Published private(set) var activeEngineName: String = ""
 
-    private var groqEngine: GroqEngine?
     private var audioRecorder: AudioRecorder?
     private var hotKeyManager: HotKeyManager?
     private var textInserter: TextInserter?

@@ -53,14 +53,16 @@
         }
         textInserter = TextInserter()
 
-        // Initialize Groq engine (lightweight — no model download needed)
-        groqEngine = GroqEngine()
+        // Resolve which engine we will use and log it
+        let engine = EngineResolver.resolve()
+        activeEngineName = engine.displayName
+        logger.info("Active transcription engine: \(engine.displayName)")
 
-        if !GroqEngine.isConfigured {
-            logger.warning("Groq API key not configured")
-            setState(.error("API key required — open Settings"))
+        if !EngineResolver.anyEngineAvailable {
+            logger.warning("No transcription engine available")
+            setState(.error("No engine available — add a Groq key or enable Apple Speech"))
         } else {
-            logger.info("Groq engine ready")
+            logger.info("Transcription engine ready: \(engine.displayName)")
         }
 
         // Start hotkey listener

@@ -72,28 +74,34 @@
         hotKeyManager?.startListening()
 
         isInitialized = true
-        if GroqEngine.isConfigured {
+        if EngineResolver.anyEngineAvailable {
             setState(.idle)
         }
         logger.info("TranscriptionManager initialized successfully")
     }
 
     func cleanup() {
         hotKeyManager?.stopListening()
-        groqEngine = nil
         audioRecorder = nil
     }
 
-    /// Called when the user saves an API key — re-validate and clear error state.
-    func onApiKeyChanged() {
-        if GroqEngine.isConfigured {
+    /// Called when the user saves an API key or changes engine preference — re-evaluate.
+    func onEngineConfigChanged() {
+        let engine = EngineResolver.resolve()
+        activeEngineName = engine.displayName
+        if EngineResolver.anyEngineAvailable {
             setState(.idle)
-            logger.info("API key configured, engine ready")
+            logger.info("Engine config changed, active engine: \(engine.displayName)")
         } else {
-            setState(.error("API key required — open Settings"))
+            setState(.error("No engine available — add a Groq key or enable Apple Speech"))
         }
     }
 
+    /// Backwards-compatible alias used by SettingsView.
+    func onApiKeyChanged() {
+        onEngineConfigChanged()
+    }
+
     private func handleHotKeyEvent(_ event: HotKeyEvent) {
         switch event {
         case .keyDown:

@@ -109,8 +117,8 @@
             return
         }
 
-        guard GroqEngine.isConfigured else {
-            setState(.error("API key required — open Settings"))
+        guard EngineResolver.anyEngineAvailable else {
+            setState(.error("No engine available — add a Groq key or enable Apple Speech"))
             return
         }
 

@@ -146,19 +154,20 @@
     }
 
     private func transcribeAndInsert(samples: [Float]) async {
-        guard let groqEngine = groqEngine else {
-            await MainActor.run {
-                self.setState(.error("Groq engine not initialized"))
-            }
-            return
+        let engine = EngineResolver.resolve()
+
+        await MainActor.run {
+            self.activeEngineName = engine.displayName
         }
 
         let startTime = CFAbsoluteTimeGetCurrent()
 
         do {
-            let text = try await groqEngine.transcribe(samples: samples)
+            let text = try await engine.transcribe(samples: samples)
             let elapsed = CFAbsoluteTimeGetCurrent() - startTime
-            logger.info("Transcription completed in \(elapsed, format: .fixed(precision: 2))s: \(text)")
+            logger.info(
+                "[\(engine.displayName)] Transcription completed in \(elapsed, format: .fixed(precision: 2))s: \(text)"
+            )
 
             // Ensure processing indicator is visible for at least 0.5s
             let minDisplayTime = 0.5

@@ -177,7 +186,7 @@
                 self.setState(.idle)
             }
         } catch {
-            logger.error("Transcription failed: \(error.localizedDescription)")
+            logger.error("[\(engine.displayName)] Transcription failed: \(error.localizedDescription)")
             await MainActor.run {
                 self.setState(.error("Transcription failed"))
                 // Auto-reset to idle after 2 seconds
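One detail worth noting in `transcribeAndInsert` is the minimum-display-time pattern: the processing indicator stays up for at least 0.5 s even when transcription returns faster. A standalone sketch of that timing logic (function names are illustrative, not from the codebase):

```swift
import Foundation

// Sketch of the "indicator visible for at least minDisplayTime" pattern.
// If transcription finished early, we sleep for the remainder before
// returning the UI to idle.
func remainingDisplayTime(elapsed: TimeInterval,
                          minDisplayTime: TimeInterval = 0.5) -> TimeInterval {
    max(0, minDisplayTime - elapsed)
}

func finishProcessing(startTime: CFAbsoluteTime) async {
    let elapsed = CFAbsoluteTimeGetCurrent() - startTime
    let remaining = remainingDisplayTime(elapsed: elapsed)
    if remaining > 0 {
        // Pad out the processing state so the overlay does not flicker.
        try? await Task.sleep(nanoseconds: UInt64(remaining * 1_000_000_000))
    }
    // ...then setState(.idle) on the main actor in the real code.
}
```

The payoff is purely cosmetic: without the pad, a 0.1 s cloud round trip would flash the thinking-dots overlay for a single frame.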
