feat(api): add batch STT transcription endpoint #2146
Conversation
Add POST /transcribe endpoint for batch speech-to-text transcription via file upload. This mirrors the existing real-time WebSocket proxy pattern, but for batch processing.

Features:
- Support for Deepgram, AssemblyAI, and Soniox providers via the `?provider=` query param
- Normalized BatchResponse format matching `owhisper_interface::batch::Response`
- Polling for async providers (AssemblyAI, Soniox)
- OpenAPI documentation with Zod schemas
- Sentry tracing and metrics integration

Usage:

```
POST /transcribe?provider=deepgram&language=en
Content-Type: audio/wav

<audio data>
```

Co-Authored-By: yujonglee <[email protected]>
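As a rough illustration of calling this endpoint from TypeScript: the base URL, token handling, and helper names below are assumptions; only the path and the `provider`/`language` query parameters come from the description above.

```typescript
// Hypothetical client sketch for the batch endpoint described in this PR.
// buildTranscribeUrl and the example base URL are illustrative, not from the PR.
function buildTranscribeUrl(
  baseUrl: string,
  provider: "deepgram" | "assemblyai" | "soniox",
  language?: string,
): string {
  const url = new URL("/transcribe", baseUrl);
  url.searchParams.set("provider", provider);
  if (language) url.searchParams.set("language", language);
  return url.toString();
}

// POST the raw audio bytes with a matching Content-Type header.
async function transcribe(audio: ArrayBuffer, token: string): Promise<unknown> {
  const res = await fetch(buildTranscribeUrl("https://api.example.com", "deepgram", "en"), {
    method: "POST",
    headers: {
      Authorization: `Bearer ${token}`, // assumed auth scheme
      "Content-Type": "audio/wav",
    },
    body: audio,
  });
  if (!res.ok) throw new Error(`transcribe failed: ${res.status}`);
  return res.json();
}
```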
📝 Walkthrough

A batch transcription feature is introduced with support for multiple providers (Deepgram, AssemblyAI, Soniox). New types define the batch response structure. Provider-specific modules implement audio transcription workflows. A dispatcher routes requests to the appropriate provider. A new POST /transcribe endpoint exposes this functionality via the public API.
Sequence Diagram(s)

sequenceDiagram
actor Client
participant API as POST /transcribe<br/>Endpoint
participant Dispatcher as transcribeBatch<br/>Dispatcher
participant Provider as Provider-Specific<br/>Module
participant ExtAPI as External API<br/>(AssemblyAI/Deepgram/Soniox)
participant DB as Response Mapping
Client->>API: Audio + params<br/>(provider, language,<br/>keywords, model)
API->>API: Validate audio data<br/>(400 if missing)
API->>Dispatcher: transcribeBatch(provider,<br/>audioData, contentType, params)
Dispatcher->>Provider: Route to provider impl
alt Provider: Deepgram
Provider->>ExtAPI: POST batch listen<br/>(audio, params)
ExtAPI-->>Provider: BatchResponse
else Provider: AssemblyAI
Provider->>ExtAPI: Upload audio
ExtAPI-->>Provider: Upload URL
Provider->>ExtAPI: Create transcript
ExtAPI-->>Provider: Transcript ID
loop Poll (max attempts)
Provider->>ExtAPI: Check status
ExtAPI-->>Provider: Status/Result
break On completion or error
end
end
Provider->>DB: convertToResponse()
DB-->>Provider: Mapped BatchResponse
else Provider: Soniox
Provider->>ExtAPI: Upload audio file
ExtAPI-->>Provider: File ID
Provider->>ExtAPI: Create transcription
ExtAPI-->>Provider: Transcription ID
loop Poll (max attempts)
Provider->>ExtAPI: Check status
ExtAPI-->>Provider: Status/Result
break On completion or error
end
end
Provider->>ExtAPI: Retrieve transcript
ExtAPI-->>Provider: Full transcript
Provider->>DB: convertToResponse()
DB-->>Provider: Mapped BatchResponse
end
Provider-->>Dispatcher: BatchResponse or error
Dispatcher-->>API: BatchResponse or error
alt Success
API-->>Client: 200 + BatchResponseSchema
API->>API: Record latency metric
else Upstream Error
API-->>Client: 502 + BatchErrorSchema
else Other Error
API-->>Client: 500 + BatchErrorSchema
end
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 2
🧹 Nitpick comments (11)
apps/api/src/stt/batch-deepgram.ts (1)
35-43: Consider adding a timeout to prevent indefinite hangs.

The fetch call to Deepgram has no timeout configured. If the upstream service becomes unresponsive, this could block indefinitely.

```diff
+  const controller = new AbortController();
+  const timeoutId = setTimeout(() => controller.abort(), 120_000); // 2 min timeout for batch
+
-  const response = await fetch(url.toString(), {
+  const response = await fetch(url.toString(), {
     method: "POST",
     headers: {
       Authorization: `Token ${env.DEEPGRAM_API_KEY}`,
       "Content-Type": contentType,
       Accept: "application/json",
     },
     body: audioData,
+    signal: controller.signal,
-  });
+  }).finally(() => clearTimeout(timeoutId));
```

apps/api/src/routes.ts (3)
426-430: Provider validation is missing - invalid values silently fall through to default.

The `provider` query parameter is cast directly to `BatchProvider` without validation. An invalid provider like `?provider=invalid` will silently use Deepgram (the default case in `transcribeBatch`). Consider validating the provider value.

```diff
- type BatchProvider = "deepgram" | "assemblyai" | "soniox";
+ const VALID_PROVIDERS = ["deepgram", "assemblyai", "soniox"] as const;
+ type BatchProvider = (typeof VALID_PROVIDERS)[number];

  const clientUrl = new URL(c.req.url, "http://localhost");
- const provider =
-   (clientUrl.searchParams.get("provider") as BatchProvider) ?? "deepgram";
+ const providerParam = clientUrl.searchParams.get("provider");
+ const provider: BatchProvider =
+   providerParam && VALID_PROVIDERS.includes(providerParam as BatchProvider)
+     ? (providerParam as BatchProvider)
+     : "deepgram";
```

Alternatively, import `BatchProvider` from `./stt` to avoid the duplicate type declaration.
471-473: Fragile upstream error detection based on substring matching.

The check `errorMessage.includes("failed:")` is brittle and could misclassify errors if error message formats change. Consider using a custom error class or error codes from the provider modules instead.

```ts
// In provider modules, throw a typed error:
class UpstreamError extends Error {
  constructor(message: string, public readonly provider: string) {
    super(message);
    this.name = "UpstreamError";
  }
}

// In route handler:
const isUpstreamError = error instanceof UpstreamError;
```
58-89: Zod schemas duplicate TypeScript types from batch-types.ts.

The schemas mirror the types defined in `batch-types.ts`. While necessary for OpenAPI documentation, this creates a maintenance burden where changes must be synchronized manually. Consider deriving types from schemas using `z.infer<>` or generating schemas from types.
37-39: Consider specifying MIME type for Blob.

The Blob is created without a MIME type, which could cause issues with the file upload if Soniox expects a specific content type. Consider passing the `contentType` parameter (currently unused) to the Blob constructor.

```diff
-const uploadFile = async (
-  audioData: ArrayBuffer,
-  fileName: string,
-): Promise<string> => {
+const uploadFile = async (
+  audioData: ArrayBuffer,
+  fileName: string,
+  contentType?: string,
+): Promise<string> => {
   const formData = new FormData();
-  const blob = new Blob([audioData]);
+  const blob = new Blob([audioData], contentType ? { type: contentType } : undefined);
   formData.append("file", blob, fileName);
```
175-188: Default confidence of 1.0 may be misleading.

When Soniox doesn't provide confidence values, defaulting to 1.0 (100% confidence) could mislead consumers of the API. Consider using `null` or a clearly marked default value, or document this behavior.
23-23: Consider consolidating `SttProvider` and `BatchProvider`.

Both types represent the same set of providers (`"deepgram" | "assemblyai" | "soniox"`). While they serve different contexts (streaming vs batch), consolidating to a single `Provider` type could reduce duplication.

```diff
-export type SttProvider = "deepgram" | "assemblyai" | "soniox";
+// Use BatchProvider for both streaming and batch contexts
+export type { BatchProvider as SttProvider } from "./batch-types";
```
11-31: Config constants and basic AssemblyAI types look good; minor typing nit on `speaker`.

The constants and transcript shape line up with AssemblyAI's async STT API surface, and the 3s × 200 polling budget (≈10 minutes) is a reasonable default.

One small TypeScript nit: AssemblyAI can return `speaker: null` when diarization is disabled, so widening the type to include `null` would better match the API response and avoid surprises at call sites.

```diff
 type AssemblyAIWord = {
   text: string;
   start: number;
   end: number;
   confidence: number;
-  speaker?: string;
+  speaker?: string | null;
 };
```
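To make the typing concrete, here is a small sketch of how such a word might be mapped downstream. The `toBatchWord` helper name and the letter-to-number speaker scheme are illustrative assumptions, not the PR's actual code; the field names follow this review's references.

```typescript
// Hypothetical mapping: AssemblyAI word (milliseconds, letter speaker labels,
// possibly speaker: null) → the project's BatchWord shape (seconds, numeric
// speaker ID). Exact field lists are assumptions.
type AssemblyAIWord = {
  text: string;
  start: number; // milliseconds
  end: number; // milliseconds
  confidence: number;
  speaker?: string | null;
};

type BatchWord = {
  word: string;
  start: number; // seconds
  end: number; // seconds
  confidence: number;
  speaker?: number;
  punctuated_word: string;
};

function toBatchWord(w: AssemblyAIWord): BatchWord {
  // Assumed scheme: "A" → 0, "B" → 1, …; undefined when the label is
  // absent, null, or unparseable.
  const speaker =
    w.speaker && /^[A-Z]$/.test(w.speaker)
      ? w.speaker.charCodeAt(0) - 65
      : undefined;
  return {
    word: w.text,
    start: w.start / 1000, // ms → s
    end: w.end / 1000,
    confidence: w.confidence ?? 1.0,
    speaker,
    punctuated_word: w.text,
  };
}
```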
33-52: Upload flow and error handling look solid; consider guarding against hung requests.

The `/upload` call is straightforward and you surface rich error details from AssemblyAI, which is great for debugging.

One thing you might want to layer in (either here or at a higher level) is a `fetch` timeout/abort mechanism so a stuck network connection doesn't hold onto a worker indefinitely, especially given large audio payloads.
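One way to layer in such a guard is a generic abort-based wrapper. This is a minimal sketch, assuming nothing about the PR's code; the `withTimeout` name and the 120-second budget are illustrative.

```typescript
// Illustrative sketch: run any signal-aware async operation with an
// AbortController-based timeout. The timer is always cleared, whether the
// operation resolves, rejects, or is aborted.
async function withTimeout<T>(
  run: (signal: AbortSignal) => Promise<T>,
  ms: number,
): Promise<T> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), ms);
  try {
    return await run(controller.signal);
  } finally {
    clearTimeout(timer);
  }
}

// Hypothetical usage against the upload call discussed above:
// const res = await withTimeout(
//   (signal) => fetch("https://api.assemblyai.com/v2/upload", { method: "POST", body, signal }),
//   120_000,
// );
```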
102-138: Polling loop is correct; ensure the 10-minute cap matches your API timeouts.

The poller correctly:

- Retrieves `/v2/transcript/{id}` in a loop.
- Exits immediately on `status === "completed"` or throws on `status === "error"`.
- Throws a clear timeout error after `MAX_POLL_ATTEMPTS`.

Given `POLL_INTERVAL_MS = 3000` and `MAX_POLL_ATTEMPTS = 200`, you cap a transcription at ~10 minutes. That's reasonable, but if you expect long audio files and slower models (e.g. Slam-1 with keyterms prompting), it's worth double-checking that:

- This window is sufficient for your typical workloads, and
- It doesn't conflict with upstream HTTP/server timeouts for the batch endpoint.
180-189: End-to-end orchestration is clear; watch overall latency and memory usage.

The `transcribeWithAssemblyAI` pipeline (upload → create transcript → poll → convert) is easy to follow and keeps provider-specific logic well factored.

A couple of higher-level considerations:

- The function holds onto the full `ArrayBuffer` for the duration of the upload. If you expect very large audio files or high concurrency, streaming from disk/Blob (where possible) could reduce memory pressure.
- Combined with the 10-minute polling window, this endpoint can tie up an API worker for a long time per request. If batch jobs are expected to be long-running, you may eventually want to push transcript creation/polling into a background worker and return a job handle instead of blocking the HTTP request.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- `apps/api/src/routes.ts` (2 hunks)
- `apps/api/src/stt/batch-assemblyai.ts` (1 hunks)
- `apps/api/src/stt/batch-deepgram.ts` (1 hunks)
- `apps/api/src/stt/batch-soniox.ts` (1 hunks)
- `apps/api/src/stt/batch-types.ts` (1 hunks)
- `apps/api/src/stt/index.ts` (2 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.ts
📄 CodeRabbit inference engine (CLAUDE.md)
**/*.ts: Agent implementations should use TypeScript and follow the established architectural patterns defined in the agent framework
Agent communication should use defined message protocols and interfaces
Files:
- `apps/api/src/stt/batch-types.ts`
- `apps/api/src/stt/batch-assemblyai.ts`
- `apps/api/src/stt/batch-soniox.ts`
- `apps/api/src/stt/index.ts`
- `apps/api/src/stt/batch-deepgram.ts`
- `apps/api/src/routes.ts`
**/*.{ts,tsx}
📄 CodeRabbit inference engine (AGENTS.md)
**/*.{ts,tsx}: Avoid creating a bunch of types/interfaces if they are not shared. Especially for function props, just inline them instead.
Never do manual state management for form/mutation. Use useForm (from tanstack-form) and useQuery/useMutation (from tanstack-query) instead for 99% of cases. Avoid patterns like setError.
If there are many classNames with conditional logic, use `cn` (import from `@hypr/utils`). It is similar to `clsx`. Always pass an array and split by logical grouping.
Use `motion/react` instead of `framer-motion`.
Files:
- `apps/api/src/stt/batch-types.ts`
- `apps/api/src/stt/batch-assemblyai.ts`
- `apps/api/src/stt/batch-soniox.ts`
- `apps/api/src/stt/index.ts`
- `apps/api/src/stt/batch-deepgram.ts`
- `apps/api/src/routes.ts`
🧬 Code graph analysis (6)
apps/api/src/stt/batch-types.ts (1)
apps/api/src/stt/index.ts (3)
- `BatchResponse` (15-15)
- `BatchProvider` (15-15)
- `BatchParams` (15-15)
apps/api/src/stt/batch-assemblyai.ts (1)
apps/api/src/stt/batch-types.ts (6)
- `BatchParams` (31-35)
- `BatchResponse` (24-27)
- `BatchWord` (1-8)
- `BatchAlternatives` (10-14)
- `BatchChannel` (16-18)
- `BatchResults` (20-22)
apps/api/src/stt/batch-soniox.ts (2)
apps/api/src/stt/batch-types.ts (6)
- `BatchParams` (31-35)
- `BatchResponse` (24-27)
- `BatchWord` (1-8)
- `BatchAlternatives` (10-14)
- `BatchChannel` (16-18)
- `BatchResults` (20-22)

apps/api/src/stt/index.ts (3)
- `BatchParams` (15-15)
- `BatchResponse` (15-15)
- `transcribeWithSoniox` (18-18)
apps/api/src/stt/index.ts (4)
apps/api/src/stt/batch-types.ts (3)
- `BatchProvider` (29-29)
- `BatchParams` (31-35)
- `BatchResponse` (24-27)

apps/api/src/stt/batch-assemblyai.ts (1)
- `transcribeWithAssemblyAI` (180-189)

apps/api/src/stt/batch-soniox.ts (1)
- `transcribeWithSoniox` (205-216)

apps/api/src/stt/batch-deepgram.ts (1)
- `transcribeWithDeepgram` (6-53)
apps/api/src/stt/batch-deepgram.ts (2)
apps/api/src/stt/index.ts (3)
- `transcribeWithDeepgram` (16-16)
- `BatchParams` (15-15)
- `BatchResponse` (15-15)

apps/api/src/stt/batch-types.ts (2)
- `BatchParams` (31-35)
- `BatchResponse` (24-27)
apps/api/src/routes.ts (3)
apps/api/src/stt/batch-types.ts (1)
- `BatchProvider` (29-29)

apps/api/src/stt/index.ts (2)
- `BatchProvider` (15-15)
- `transcribeBatch` (25-41)

apps/api/src/sentry/metrics.ts (1)
- `Metrics` (3-35)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: Redirect rules - hyprnote
- GitHub Check: Header rules - hyprnote
- GitHub Check: Pages changed - hyprnote
- GitHub Check: fmt
- GitHub Check: Devin
🔇 Additional comments (9)
apps/api/src/stt/batch-deepgram.ts (2)
17-17: Verify that `mip_opt_out=false` is intentional.

Setting `mip_opt_out` to `false` allows Deepgram to use the audio data for model improvement. Depending on your data privacy requirements or customer agreements, you may want to set this to `true` to opt out.
52-52: Type assertion trusts upstream response structure.

The response is cast directly to `BatchResponse` without runtime validation. If Deepgram's API returns an unexpected structure, it could cause runtime errors downstream. This is acceptable if you trust the upstream API contract, but consider adding validation for critical paths.

apps/api/src/stt/batch-types.ts (1)
1-35: Well-structured type definitions for batch transcription.

The types provide a clean, unified contract for all providers to map their responses to. The structure mirrors Deepgram's format while allowing flexibility for metadata.
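For orientation, here is a sketch of what these unified types might look like. The type names come from this review's code graph references (`BatchWord`, `BatchAlternatives`, `BatchChannel`, `BatchResults`, `BatchResponse`); the exact field lists are assumptions modeled on Deepgram's response shape, which the review says the structure mirrors.

```typescript
// Assumed shape of the unified batch types; field details are illustrative.
type BatchWord = {
  word: string;
  start: number; // seconds
  end: number; // seconds
  confidence: number;
  speaker?: number;
  punctuated_word?: string;
};

type BatchAlternatives = {
  transcript: string;
  confidence: number;
  words: BatchWord[];
};

type BatchChannel = { alternatives: BatchAlternatives[] };
type BatchResults = { channels: BatchChannel[] };

type BatchResponse = {
  metadata: Record<string, unknown>;
  results: BatchResults;
};

// Each provider module maps its native response into this one shape, so the
// route handler can return the same format regardless of ?provider=.
const example: BatchResponse = {
  metadata: {},
  results: {
    channels: [
      {
        alternatives: [
          {
            transcript: "hello world",
            confidence: 0.98,
            words: [
              { word: "hello", start: 0.0, end: 0.4, confidence: 0.99 },
              { word: "world", start: 0.4, end: 0.9, confidence: 0.97 },
            ],
          },
        ],
      },
    ],
  },
};
```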
apps/api/src/routes.ts (1)
373-423: Route documentation and structure look good.

The endpoint is well-documented with OpenAPI descriptions, appropriate status codes (400, 401, 500, 502), and security requirements. The Sentry instrumentation and metrics collection provide good observability.
apps/api/src/stt/batch-soniox.ts (2)
102-138: Polling implementation is well-structured.

The polling logic correctly handles all status transitions (completed, error, queued, processing), includes a reasonable timeout (200 attempts × 3s = 10 minutes), and throws descriptive errors for unexpected states.
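The poll-until-terminal pattern described here can be sketched generically. This is an illustrative helper, not the PR's actual code; the `pollUntilDone` name and `PollOutcome` shape are assumptions.

```typescript
// Generic sketch of the polling loop: check status, return on completion,
// throw on error, and give up after a bounded number of attempts.
type PollOutcome<T> =
  | { status: "completed"; result: T }
  | { status: "error"; error: string }
  | { status: "pending" };

async function pollUntilDone<T>(
  check: () => Promise<PollOutcome<T>>,
  intervalMs: number,
  maxAttempts: number,
): Promise<T> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const outcome = await check();
    if (outcome.status === "completed") return outcome.result;
    if (outcome.status === "error") throw new Error(`transcription failed: ${outcome.error}`);
    await new Promise((r) => setTimeout(r, intervalMs));
  }
  throw new Error(`transcription timed out after ${maxAttempts} attempts`);
}
```

With `intervalMs = 3000` and `maxAttempts = 200`, this gives the same ~10-minute budget the review notes above.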
205-216: Clean orchestration of the Soniox workflow.

The main function properly sequences upload → create → poll → fetch → convert, with each step's errors propagating naturally. The `_contentType` parameter is correctly prefixed to indicate intentional non-use.

apps/api/src/stt/index.ts (2)
25-41: Clean dispatcher implementation.

The switch statement correctly routes to provider-specific implementations. The default fallback to Deepgram is reasonable. The `fileName` parameter is appropriately only forwarded to Soniox, which is the only provider requiring it for file uploads.
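A minimal sketch of the dispatch pattern described here, with the provider functions stubbed out (the real ones wrap the Deepgram/AssemblyAI/Soniox HTTP APIs). The signatures are assumptions, and a handler record is used in place of the PR's switch statement; the routing behavior, including the default to Deepgram, is the same idea.

```typescript
// Hypothetical dispatcher sketch; stub handlers stand in for real providers.
type BatchProvider = "deepgram" | "assemblyai" | "soniox";

type TranscribeFn = (
  audio: ArrayBuffer,
  params: { language?: string; fileName?: string },
) => Promise<string>;

const handlers: Record<BatchProvider, TranscribeFn> = {
  deepgram: async () => "deepgram-result",
  assemblyai: async () => "assemblyai-result",
  // Only Soniox needs the fileName, mirroring the review's observation.
  soniox: async (_audio, params) => `soniox-result:${params.fileName ?? "audio"}`,
};

// Route to the provider implementation, defaulting to Deepgram.
async function transcribeBatch(
  provider: BatchProvider | undefined,
  audio: ArrayBuffer,
  params: { language?: string; fileName?: string },
): Promise<string> {
  const fn = handlers[provider ?? "deepgram"];
  return fn(audio, params);
}
```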
15-18: Good public API surface design.

Re-exporting both types and individual provider functions gives consumers flexibility to either use the unified `transcribeBatch` dispatcher or call providers directly when needed.
140-178: BatchResponse mapping is consistent across STT providers.

The conversion to `BatchWord`/`BatchAlternatives`/`BatchChannel`/`BatchResults` is clean and aligns with other providers:

- Converting `start`/`end` from milliseconds to seconds matches the standard used by Soniox and keeps units consistent across the application.
- Parsing the `speaker` label (e.g., `"A"`, `"B"`) into a numeric ID while falling back to `undefined` on parse failure is robust.
- Falling back to `""` when `result.text` is absent and to `1.0` when `confidence` is missing matches Soniox's implementation and is the project's standard.
- `punctuated_word` is consistently mapped to the raw `word` text across all providers.
Wire params.model to AssemblyAI's speech_model parameter for model selection, matching the behavior of Deepgram and Soniox handlers. Addresses CodeRabbit review feedback. Co-Authored-By: yujonglee <[email protected]>
Apply the same auth middleware pattern as /listen to ensure the /transcribe endpoint requires Supabase authentication as documented in the OpenAPI spec. Addresses CodeRabbit review feedback. Co-Authored-By: yujonglee <[email protected]>
feat(api): add batch STT transcription endpoint
Summary
Adds a new `POST /transcribe` HTTP endpoint for batch speech-to-text transcription via file upload. This mirrors the existing real-time WebSocket proxy (`GET /listen`) but for batch processing.

New files:

- `batch-types.ts` - TypeScript types matching `owhisper_interface::batch::Response`
- `batch-deepgram.ts` - Synchronous POST to Deepgram's batch API
- `batch-assemblyai.ts` - Upload → Create transcript → Poll until complete
- `batch-soniox.ts` - Upload file → Create transcription → Poll → Get transcript

Usage:
Updates since last revision
- Added `requireSupabaseAuth` middleware to the `/transcribe` endpoint in `index.ts` (same pattern as `/listen`)
- Wired `params.model` to AssemblyAI's `speech_model` parameter for model selection consistency across all providers

Review & Testing Checklist for Human
- Verify the `BatchResponse` TypeScript type matches what the Rust client (`owhisper_interface::batch::Response`) expects to deserialize.
- Polling can run long (200 attempts × 3s). Verify this won't hit infrastructure timeouts (load balancer, edge proxy, etc.).
- The `NODE_ENV !== "development"` check: confirm this matches the `/listen` endpoint behavior and is intentional.

Recommended test plan:
```shell
curl -X POST "https://api.staging.hyprnote.com/transcribe?provider=deepgram" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: audio/wav" \
  --data-binary @test.wav
```

- Repeat with `?provider=assemblyai` and `?provider=soniox`

Notes