This document helps AI agents work effectively in this codebase.
Parakeet ASR Server - A Go-based automatic speech recognition (ASR) server using NVIDIA's Parakeet TDT 0.6B model in ONNX format. Provides an OpenAI Whisper-compatible API for audio transcription.
- Language: Go 1.25+
- ML Runtime: ONNX Runtime 1.21.x (CPU inference)
- Model: NVIDIA Parakeet TDT 0.6B (Conformer-based encoder with Token-and-Duration Transducer decoder)
- API: REST, OpenAI Whisper-compatible
```bash
# Build
make build              # Build to ./bin/parakeet

# Run
make run                # Build and run with debug mode
make run-dev            # Run with custom port (5092) for development
./bin/parakeet          # Run binary directly
./bin/parakeet -port 8080 -models /path/to/models -debug=true

# Download models
make models             # Download int8 models (default, ~670MB)
make models-int8        # Download int8 quantized models
make models-fp32        # Download full precision models (~2.5GB)

# Test
make test               # Run tests
make test-coverage      # Run with coverage

# Code quality
make fmt                # Format code
make vet                # Run go vet
make lint               # Run all linters (vet + fmt)

# Dependencies
make deps               # Download Go dependencies
make deps-tidy          # Tidy dependencies
make deps-onnxruntime   # Install ONNX Runtime library

# Docker
make docker-build-int8  # Build image with int8 models
make docker-build-fp32  # Build image with fp32 models
make docker-run-int8    # Run container with int8 models
make docker-run-fp32    # Run container with fp32 models

# Release
make release            # Build all platforms
make release-linux      # Build Linux binaries (amd64/arm64)
make release-darwin     # Build macOS binaries (amd64/arm64)
make release-windows    # Build Windows binary (amd64)
```

```
parakeet/
├── main.go                  # Entry point, CLI flags, server initialization
├── internal/
│   ├── asr/
│   │   ├── transcriber.go   # ONNX inference pipeline, TDT decoding
│   │   ├── mel.go           # Mel filterbank feature extraction (FFT, windowing)
│   │   └── audio.go         # WAV parsing, resampling to 16kHz
│   └── server/
│       ├── server.go        # HTTP server, route setup, lifecycle management
│       ├── handlers.go      # API endpoint handlers, response formatting
│       └── types.go         # Request/response type definitions
├── models/                  # ONNX models (downloaded separately)
├── bin/                     # Build output directory
├── Makefile                 # Build recipes
├── Dockerfile               # Multi-stage container build
├── .github/
│   └── workflows/
│       ├── ci.yaml          # CI pipeline (lint, test, build)
│       └── release.yaml     # Release pipeline (binaries, docker)
└── README.md
```
- Parses CLI flags: `-port`, `-models`, `-debug`
- Creates and runs the server
- Default port: 5092; default models dir: `./models`
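The flag setup described above can be sketched as follows. This is an illustrative reconstruction, not the actual code in main.go; the `parseFlags` helper name is an assumption, though the flag names and defaults match the CLI documented here.

```go
package main

import (
	"flag"
	"fmt"
)

// parseFlags mirrors the documented CLI: -port (default 5092),
// -models (default ./models), -debug. A FlagSet keeps it testable.
func parseFlags(args []string) (port int, modelsDir string, debug bool, err error) {
	fs := flag.NewFlagSet("parakeet", flag.ContinueOnError)
	fs.IntVar(&port, "port", 5092, "HTTP listen port")
	fs.StringVar(&modelsDir, "models", "./models", "directory containing ONNX models")
	fs.BoolVar(&debug, "debug", false, "enable verbose logging")
	err = fs.Parse(args)
	return
}

func main() {
	port, models, debug, _ := parseFlags([]string{"-port", "8080", "-debug=true"})
	fmt.Println(port, models, debug) // 8080 ./models true
}
```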
- `Config` struct: Port, ModelsDir, Debug settings
- `Server` struct: wraps config, transcriber, and HTTP mux
- `New()` - Initializes transcriber and routes
- `Run()` - Starts HTTP listener
- `Close()` - Releases resources
- `handleTranscription()` - Main endpoint; parses multipart form, returns transcription
- `handleTranslation()` - Delegates to transcription (Parakeet is English-focused)
- `handleModels()` - Returns available models (parakeet-tdt-0.6b, whisper-1 alias)
- `handleHealth()` - Health check endpoint
- Response format helpers: `formatSRTTime()`, `formatVTTTime()`
- CORS and error response utilities
- `TranscriptionResponse` - Simple JSON response with text
- `VerboseTranscriptionResponse` - Detailed response with segments and timing
- `Segment` - Transcription segment with timing info
- `ErrorResponse`, `ErrorDetail` - OpenAI-compatible error format
- `ModelInfo`, `ModelsResponse` - Model listing types
- `DebugMode` - Global flag for verbose logging
- `Config` - Model configuration (features_size, subsampling_factor)
- `Transcriber` - Main inference struct
- `NewTranscriber()` - Loads config and vocab, initializes ONNX Runtime
- `Transcribe()` - Main entry: audio -> mel -> encoder -> TDT decode -> text
- `loadAudio()` - Format detection and parsing
- `runInference()` - Encoder ONNX session execution
- `tdtDecode()` - TDT greedy decoding loop with state management
- `tokensToText()` - Token IDs to text with cleanup
- `MelFilterbank` - Mel-scale filterbank feature extractor
- `NewMelFilterbank()` - Creates filterbank with NeMo defaults (128 mels, 512 FFT)
- `Extract()` - Computes mel features with Hann windowing
- `normalize()` - Per-utterance mean/variance normalization
- `fft()` - Radix-2 Cooley-Tukey FFT implementation
- Mel/Hz conversion helpers
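The mel/Hz conversion helpers typically use a formula like the HTK variant below. This is a sketch only: the actual helpers in mel.go may use a different mel scale (librosa/NeMo default to the slaney variant), and the function names here are illustrative.

```go
package main

import (
	"fmt"
	"math"
)

// hzToMel/melToHz use the common HTK formula mel = 2595*log10(1 + hz/700).
// The variant actually used in internal/asr/mel.go may differ.
func hzToMel(hz float64) float64 {
	return 2595.0 * math.Log10(1.0+hz/700.0)
}

func melToHz(mel float64) float64 {
	return 700.0 * (math.Pow(10.0, mel/2595.0) - 1.0)
}

func main() {
	m := hzToMel(1000.0)
	fmt.Printf("1000 Hz ≈ %.1f mel, round-trip %.1f Hz\n", m, melToHz(m))
}
```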
- `parseWAV()` - WAV parser supporting multiple chunk layouts
- `convertToFloat32()` - Supports 8/16/24/32-bit PCM and 32-bit float
- `resample()` - Linear-interpolation resampling to 16kHz
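Linear-interpolation resampling, as done by `resample()`, can be sketched like this. The function name and exact edge handling are illustrative, not the repository's implementation:

```go
package main

import "fmt"

// resampleLinear resamples in from srcRate to dstRate by linear
// interpolation between neighboring input samples.
func resampleLinear(in []float32, srcRate, dstRate int) []float32 {
	if srcRate == dstRate || len(in) == 0 {
		return in
	}
	ratio := float64(srcRate) / float64(dstRate)
	outLen := int(float64(len(in)) / ratio)
	out := make([]float32, outLen)
	for i := range out {
		pos := float64(i) * ratio
		j := int(pos)
		frac := float32(pos - float64(j))
		if j+1 < len(in) {
			out[i] = in[j]*(1-frac) + in[j+1]*frac
		} else {
			out[i] = in[len(in)-1] // clamp at the final sample
		}
	}
	return out
}

func main() {
	in := make([]float32, 44100) // one second at 44.1kHz
	out := resampleLinear(in, 44100, 16000)
	fmt.Println(len(out)) // 16000
}
```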
| Method | Path | Description |
|---|---|---|
| POST | `/v1/audio/transcriptions` | Transcribe audio (OpenAI-compatible) |
| POST | `/v1/audio/translations` | Translate audio (delegates to transcription) |
| GET | `/v1/models` | List available models |
| GET | `/health` | Health check |
- `file` (required) - Audio file (multipart form, max 25MB)
- `model` - Accepted but ignored (only one model)
- `language` - ISO-639-1 code (default: "en")
- `response_format` - json, text, srt, vtt, verbose_json (default: "json")
- `prompt`, `temperature` - Accepted but ignored
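A client can build the multipart request for this endpoint as sketched below. The helper name and server URL (the default port 5092) are assumptions; send the returned request with `http.DefaultClient.Do(req)`.

```go
package main

import (
	"bytes"
	"fmt"
	"mime/multipart"
	"net/http"
)

// newTranscriptionRequest builds a multipart/form-data request for
// POST /v1/audio/transcriptions with the documented fields.
func newTranscriptionRequest(serverURL string, wav []byte) (*http.Request, error) {
	var body bytes.Buffer
	w := multipart.NewWriter(&body)
	part, err := w.CreateFormFile("file", "audio.wav")
	if err != nil {
		return nil, err
	}
	part.Write(wav)
	w.WriteField("response_format", "json")
	w.WriteField("language", "en")
	if err := w.Close(); err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPost, serverURL+"/v1/audio/transcriptions", &body)
	if err != nil {
		return nil, err
	}
	// The boundary-bearing content type comes from the multipart writer.
	req.Header.Set("Content-Type", w.FormDataContentType())
	return req, nil
}

func main() {
	req, _ := newTranscriptionRequest("http://localhost:5092", []byte("RIFF..."))
	fmt.Println(req.Method, req.URL.Path) // POST /v1/audio/transcriptions
}
```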
- Go standard naming (camelCase for unexported, PascalCase for exported)
- Descriptive function names: `parseWAV`, `convertToFloat32`, `tdtDecode`
- Type suffixes for ONNX tensors: `inputTensor`, `outputTensor`, `lengthTensor`
- Wrap errors with `fmt.Errorf("context: %w", err)`
- Return early on error
- Clean up resources with `defer` (`tensor.Destroy()`, `file.Close()`)
- Create tensors with `ort.NewTensor(shape, data)`
- Use `ort.NewAdvancedSession()` for named inputs/outputs
- Always call `.Destroy()` on tensors and sessions after use
- Memory-conscious: tensors are created and destroyed per inference step in the decode loop
- JSON structs use tags: `json:"field_name"` with `omitempty` where appropriate
- OpenAI-compatible response structures
- Conformer architecture with 1024-dim output
- Input: mel features [batch, 128 features, time]
- Output: encoded features [batch, 1024, time/8]
- Subsampling factor: 8x
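The 8x subsampling means the encoder emits roughly one output frame per eight mel frames. A small helper makes the relationship concrete; whether the exported model rounds up or down at the boundary is an assumption here (ceiling division shown):

```go
package main

import "fmt"

// encoderFrames estimates encoder output length for a given number of mel
// frames under the documented 8x subsampling. Exact boundary rounding
// depends on the model export.
func encoderFrames(melFrames, subsampling int) int {
	return (melFrames + subsampling - 1) / subsampling
}

func main() {
	fmt.Println(encoderFrames(1000, 8)) // 125
}
```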
- Token-and-Duration Transducer
- Vocab size: 8193 tokens (8192 + blank)
- Duration classes: 5 (predicts how many encoder steps to advance)
- LSTM state: 2 layers x 640 dim
- Greedy decoding with max 10 tokens per timestep
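The decoding properties above combine into a control-flow loop roughly like the sketch below. This is not the repository's `tdtDecode()`: the `joint` callback stands in for the real decoder_joint ONNX session, LSTM state handling is omitted, and only the token/duration stepping logic is illustrated.

```go
package main

import "fmt"

const (
	blankID          = 8192 // <blk> index noted in the vocab section
	maxTokensPerStep = 10   // cap on emissions per encoder frame
)

// joint is a stand-in for the decoder_joint network: given the current
// encoder frame and previously emitted tokens, it returns a token ID and
// a duration (how many encoder frames to advance).
type joint func(t int, prev []int) (int, int)

func tdtGreedy(numFrames int, j joint) []int {
	var tokens []int
	t, emittedAtT := 0, 0
	for t < numFrames {
		tok, dur := j(t, tokens)
		if tok != blankID {
			tokens = append(tokens, tok)
			emittedAtT++
		}
		// Force progress on blanks or when the per-frame cap is hit.
		if dur < 1 && (tok == blankID || emittedAtT >= maxTokensPerStep) {
			dur = 1
		}
		if dur > 0 {
			t += dur
			emittedAtT = 0
		}
	}
	return tokens
}

func main() {
	stub := func(t int, prev []int) (int, int) {
		if t == 0 && len(prev) == 0 {
			return 42, 2 // emit token 42, skip 2 frames
		}
		return blankID, 1 // otherwise blank, advance 1 frame
	}
	fmt.Println(tdtGreedy(4, stub)) // [42]
}
```

Skipping frames via the predicted duration is what makes TDT faster than a plain RNN-T greedy loop, which advances one frame at a time.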
- SentencePiece format with `▁` (U+2581) as the word-boundary marker, e.g. a piece like `▁token` at index 123
- Special token: `<blk>` (blank) at index 8192
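Detokenization of SentencePiece pieces reduces to replacing `▁` with spaces. A minimal sketch; the real `tokensToText()` also maps IDs through the vocab, drops `<blk>`, and may do further cleanup:

```go
package main

import (
	"fmt"
	"strings"
)

// detokenize joins SentencePiece pieces, turns ▁ (U+2581) into spaces,
// and trims the leading boundary marker's space.
func detokenize(pieces []string) string {
	s := strings.Join(pieces, "")
	s = strings.ReplaceAll(s, "\u2581", " ")
	return strings.TrimSpace(s)
}

func main() {
	fmt.Println(detokenize([]string{"▁hel", "lo", "▁wor", "ld"})) // hello world
}
```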
- Must be installed separately (not vendored)
- Set the `ONNXRUNTIME_LIB` env var if the library is not in standard paths
- Auto-detection checks common paths in the Makefile and transcriber.go
- Use `make deps-onnxruntime` to install (requires sudo)
- Compatible version: 1.21.x for onnxruntime_go v1.19.0
- `encoder-model.int8.onnx` (~652MB) or `encoder-model.onnx` (~2.5GB)
- `decoder_joint-model.int8.onnx` (~18MB) or `decoder_joint-model.onnx` (~72MB)
- `config.json`, `vocab.txt`, `nemo128.onnx`
- Download via `make models` or manually from HuggingFace
- Currently only WAV format is supported
- WebM, OGG, MP3, M4A return "requires ffmpeg conversion - not yet implemented"
- All audio resampled to 16kHz mono internally
- Minimum audio length: 100ms (1600 samples at 16kHz)
- Tensors must be destroyed manually (no GC)
- The TDT decode loop creates/destroys tensors each iteration
- Memory usage: ~2GB RAM for int8 models, ~6GB for fp32
- `model` parameter accepted but ignored (only one model)
- `prompt` and `temperature` parameters accepted but ignored
- `language` defaults to "en" if not specified
- Translation endpoint just calls transcription
| Variable | Description | Default |
|---|---|---|
| `ONNXRUNTIME_LIB` | Path to libonnxruntime.so | Auto-detect |
From go.mod:

- `go 1.25.5`
- `github.com/yalue/onnxruntime_go v1.19.0`
No other external Go dependencies. Standard library used for HTTP, JSON, audio processing, FFT.
- Runs on push/PR to main/master
- Jobs: lint (Go 1.22), test (Go 1.25), build (Go 1.25)
- Lint checks: go vet, gofmt
- Triggers on version tags (v*)
- Builds binaries for linux/darwin/windows (amd64/arm64)
- Creates GitHub release with checksums
- Pushes Docker images to ghcr.io (int8 and fp32 variants)
- Multi-stage build with golang:1.25-bookworm builder
- Runtime: debian:bookworm-slim with ONNX Runtime 1.21.0
- Models embedded in image during build
- Health check included
- Exposed port: 5092
- Add a case in `internal/asr/transcriber.go:loadAudio()`
- Implement the parser in `internal/asr/audio.go`
- Ensure output is `[]float32` normalized to [-1, 1] at 16kHz
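The output contract for a new format parser looks like the conversion below for 16-bit PCM. This mirrors what `convertToFloat32()` is described as doing; the helper name here is illustrative:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// pcm16ToFloat32 converts little-endian 16-bit PCM bytes to []float32
// normalized to [-1, 1], the contract loadAudio() expects.
func pcm16ToFloat32(raw []byte) []float32 {
	out := make([]float32, len(raw)/2)
	for i := range out {
		s := int16(binary.LittleEndian.Uint16(raw[2*i:]))
		out[i] = float32(s) / 32768.0
	}
	return out
}

func main() {
	// Bytes for int16 samples -32768 and 32767.
	samples := pcm16ToFloat32([]byte{0x00, 0x80, 0xFF, 0x7F})
	fmt.Println(samples[0], samples[1]) // -1 and just under 1
}
```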
- Add/modify structs in `internal/server/types.go`
- Update the relevant handler in `internal/server/handlers.go`
- Follow OpenAI response format conventions
- Add a handler method to `internal/server/handlers.go`
- Register the route in `internal/server/server.go:setupRoutes()`
- Add types to `internal/server/types.go` if needed
- Encoder dim: `internal/asr/transcriber.go:247` (`encoderDim := int64(1024)`)
- LSTM state: `internal/asr/transcriber.go:314-315` (`stateDim`, `numLayers`)
- Max tokens per step: `internal/asr/transcriber.go:39` (`maxTokensPerStep: 10`)
- Mel features: `internal/asr/mel.go:25-27` (`nFFT`, `hopLength`, `winLength`)
- Add the target with a `## Description` comment for help output
- Use the `@` prefix for silent commands
- Add the target to `.PHONY` if it is not a file target
- Tag with semver: `git tag v1.0.0`
- Push the tag: `git push origin v1.0.0`
- The release pipeline builds and publishes automatically