AGENTS.md

This document helps AI agents work effectively in this codebase.

Project Overview

Parakeet ASR Server - A Go-based automatic speech recognition (ASR) server using NVIDIA's Parakeet TDT 0.6B model in ONNX format. Provides an OpenAI Whisper-compatible API for audio transcription.

Key Technologies

Language: Go 1.25+
ML Runtime: ONNX Runtime 1.21.x (CPU inference)
Model: NVIDIA Parakeet TDT 0.6B (Conformer-based encoder with Token-and-Duration Transducer decoder)
API: REST, OpenAI Whisper-compatible

Essential Commands

# Build
make build                  # Build to ./bin/parakeet

# Run
make run                    # Build and run with debug mode
make run-dev                # Run on port 5092 for development
./bin/parakeet              # Run binary directly
./bin/parakeet -port 8080 -models /path/to/models -debug=true

# Download models
make models                 # Download int8 models (default, ~670MB)
make models-int8            # Download int8 quantized models
make models-fp32            # Download full precision models (~2.5GB)

# Test
make test                   # Run tests
make test-coverage          # Run with coverage

# Code quality
make fmt                    # Format code
make vet                    # Run go vet
make lint                   # Run all linters (vet + fmt)

# Dependencies
make deps                   # Download Go dependencies
make deps-tidy              # Tidy dependencies
make deps-onnxruntime       # Install ONNX Runtime library

# Docker
make docker-build-int8      # Build image with int8 models
make docker-build-fp32      # Build image with fp32 models
make docker-run-int8        # Run container with int8 models
make docker-run-fp32        # Run container with fp32 models

# Release
make release                # Build all platforms
make release-linux          # Build Linux binaries (amd64/arm64)
make release-darwin         # Build macOS binaries (amd64/arm64)
make release-windows        # Build Windows binary (amd64)

Project Structure

parakeet/
├── main.go                 # Entry point, CLI flags, server initialization
├── internal/
│   ├── asr/
│   │   ├── transcriber.go  # ONNX inference pipeline, TDT decoding
│   │   ├── mel.go          # Mel filterbank feature extraction (FFT, windowing)
│   │   └── audio.go        # WAV parsing, resampling to 16kHz
│   └── server/
│       ├── server.go       # HTTP server, route setup, lifecycle management
│       ├── handlers.go     # API endpoint handlers, response formatting
│       └── types.go        # Request/response type definitions
├── models/                 # ONNX models (downloaded separately)
├── bin/                    # Build output directory
├── Makefile                # Build recipes
├── Dockerfile              # Multi-stage container build
├── .github/
│   └── workflows/
│       ├── ci.yaml         # CI pipeline (lint, test, build)
│       └── release.yaml    # Release pipeline (binaries, docker)
└── README.md

Code Organization

`main.go` (Entry Point)

Parses CLI flags: -port, -models, -debug
Creates and runs the server
Default port: 5092, default models dir: ./models
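
A minimal sketch of how the three flags could be parsed with the standard `flag` package. The `parseFlags` helper, the field types, and the `false` default for `-debug` are assumptions for illustration; only the flag names and the port and models defaults come from this document.

```go
package main

import (
	"flag"
	"fmt"
)

// Config mirrors the fields listed for the server Config struct.
// The field types are assumed.
type Config struct {
	Port      int
	ModelsDir string
	Debug     bool
}

// parseFlags is a hypothetical helper that parses the documented flags
// from args (typically os.Args[1:]).
func parseFlags(args []string) (Config, error) {
	var cfg Config
	fs := flag.NewFlagSet("parakeet", flag.ContinueOnError)
	fs.IntVar(&cfg.Port, "port", 5092, "HTTP listen port")
	fs.StringVar(&cfg.ModelsDir, "models", "./models", "directory containing the ONNX models")
	fs.BoolVar(&cfg.Debug, "debug", false, "enable verbose logging (default assumed)")
	if err := fs.Parse(args); err != nil {
		return Config{}, fmt.Errorf("parse flags: %w", err)
	}
	return cfg, nil
}
```

Using a dedicated `FlagSet` rather than the global one keeps the parser testable with arbitrary argument slices.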

`internal/server/` (HTTP Server Package)

`server.go`

Config struct: Port, ModelsDir, Debug settings
Server struct: wraps config, transcriber, and HTTP mux
New() - Initializes transcriber and routes
Run() - Starts HTTP listener
Close() - Releases resources

`handlers.go`

handleTranscription() - Main endpoint, parses multipart form, returns transcription
handleTranslation() - Delegates to transcription (Parakeet is English-focused)
handleModels() - Returns available models (parakeet-tdt-0.6b, whisper-1 alias)
handleHealth() - Health check endpoint
Response format helpers: formatSRTTime(), formatVTTTime()
CORS and error response utilities
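
The two timestamp helpers differ only in the millisecond separator: SRT uses a comma, WebVTT uses a period. The sketch below assumes both take an offset in seconds as `float64`; the real signatures in `handlers.go` may differ, and `splitTime` is a hypothetical shared helper.

```go
package main

import (
	"fmt"
	"math"
)

// splitTime breaks an offset in seconds into hours, minutes, seconds, and
// milliseconds. Negative inputs are clamped to zero.
func splitTime(seconds float64) (h, m, s, ms int) {
	total := int(math.Round(seconds * 1000))
	if total < 0 {
		total = 0
	}
	ms = total % 1000
	total /= 1000
	s = total % 60
	total /= 60
	m = total % 60
	h = total / 60
	return h, m, s, ms
}

// formatSRTTime renders HH:MM:SS,mmm as used in SubRip cues.
func formatSRTTime(seconds float64) string {
	h, m, s, ms := splitTime(seconds)
	return fmt.Sprintf("%02d:%02d:%02d,%03d", h, m, s, ms)
}

// formatVTTTime renders HH:MM:SS.mmm as used in WebVTT cues.
func formatVTTTime(seconds float64) string {
	h, m, s, ms := splitTime(seconds)
	return fmt.Sprintf("%02d:%02d:%02d.%03d", h, m, s, ms)
}
```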

`types.go`

TranscriptionResponse - Simple JSON response with text
VerboseTranscriptionResponse - Detailed response with segments, timing
Segment - Transcription segment with timing info
ErrorResponse, ErrorDetail - OpenAI-compatible error format
ModelInfo, ModelsResponse - Model listing types
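
As a rough illustration of the OpenAI-compatible shapes, the simple response and the error envelope might look like the following. The type names come from the list above; the field sets are assumptions modeled on the OpenAI error format, and `toJSON` is a hypothetical helper used only to show the encoded output.

```go
package main

import "encoding/json"

// TranscriptionResponse is the plain JSON response carrying only the text.
type TranscriptionResponse struct {
	Text string `json:"text"`
}

// ErrorDetail follows the OpenAI error object; fields here are assumed.
type ErrorDetail struct {
	Message string `json:"message"`
	Type    string `json:"type"`
	Param   string `json:"param,omitempty"`
	Code    string `json:"code,omitempty"`
}

// ErrorResponse wraps ErrorDetail under the top-level "error" key.
type ErrorResponse struct {
	Error ErrorDetail `json:"error"`
}

// toJSON encodes v and returns it as a string.
func toJSON(v any) (string, error) {
	b, err := json.Marshal(v)
	if err != nil {
		return "", err
	}
	return string(b), nil
}
```

With `omitempty`, an error that sets only `Message` and `Type` encodes without the `param` and `code` keys.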

`internal/asr/` (ASR Package)

`transcriber.go`

DebugMode - Global flag for verbose logging
Config - Model configuration (features_size, subsampling_factor)
Transcriber - Main inference struct
NewTranscriber() - Loads config, vocab, initializes ONNX Runtime
Transcribe() - Main entry: audio -> mel -> encoder -> TDT decode -> text
loadAudio() - Format detection and parsing
runInference() - Encoder ONNX session execution
tdtDecode() - TDT greedy decoding loop with state management
tokensToText() - Token IDs to text with cleanup

`mel.go`

MelFilterbank - Mel-scale filterbank feature extractor
NewMelFilterbank() - Creates filterbank with NeMo defaults (128 mels, 512 FFT)
Extract() - Computes mel features with Hann windowing
normalize() - Per-utterance mean/variance normalization
fft() - Radix-2 Cooley-Tukey FFT implementation
Mel/Hz conversion helpers
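
Per-utterance normalization can be sketched as below. This is not the project's `normalize()` itself: it assumes features are laid out as `[mel bin][time frame]`, uses the population variance, and picks an epsilon of 1e-5, all of which may differ from the actual implementation.

```go
package main

import "math"

// normalize rescales each mel bin to zero mean and unit variance across the
// time axis, in place. The [mel][time] layout and epsilon are assumptions.
func normalize(features [][]float32) {
	const eps = 1e-5
	for _, row := range features {
		if len(row) == 0 {
			continue
		}
		var sum float64
		for _, v := range row {
			sum += float64(v)
		}
		mean := sum / float64(len(row))

		var sq float64
		for _, v := range row {
			d := float64(v) - mean
			sq += d * d
		}
		std := math.Sqrt(sq / float64(len(row)))

		// eps keeps constant rows from dividing by zero.
		for i, v := range row {
			row[i] = float32((float64(v) - mean) / (std + eps))
		}
	}
}
```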

`audio.go`

parseWAV() - WAV parser supporting multiple chunk layouts
convertToFloat32() - Supports 8/16/24/32-bit PCM and 32-bit float
resample() - Linear interpolation resampling to 16kHz
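
The 16-bit conversion path and the linear-interpolation resampler can be sketched roughly as follows. The function names `pcm16ToFloat32` and `resampleLinear` are hypothetical, and the real `convertToFloat32()` and `resample()` may handle edge cases differently.

```go
package main

import "encoding/binary"

// pcm16ToFloat32 converts little-endian signed 16-bit PCM to float32 samples
// in [-1, 1). A trailing odd byte is ignored.
func pcm16ToFloat32(data []byte) []float32 {
	out := make([]float32, len(data)/2)
	for i := range out {
		s := int16(binary.LittleEndian.Uint16(data[2*i:]))
		out[i] = float32(s) / 32768
	}
	return out
}

// resampleLinear converts in from srcRate to dstRate by interpolating
// linearly between neighboring input samples.
func resampleLinear(in []float32, srcRate, dstRate int) []float32 {
	if srcRate == dstRate || len(in) == 0 {
		return in
	}
	n := int(int64(len(in)) * int64(dstRate) / int64(srcRate))
	out := make([]float32, n)
	ratio := float64(srcRate) / float64(dstRate)
	for i := range out {
		pos := float64(i) * ratio
		j := int(pos)
		frac := float32(pos - float64(j))
		if j+1 < len(in) {
			out[i] = in[j]*(1-frac) + in[j+1]*frac
		} else {
			// Past the last pair of samples: hold the final value.
			out[i] = in[len(in)-1]
		}
	}
	return out
}
```

Linear interpolation is cheap but applies no low-pass filter, so downsampling from high rates can alias; that trade-off is worth keeping in mind when touching this code.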

API Endpoints

| Method | Path | Description |
|--------|------|-------------|
| POST | `/v1/audio/transcriptions` | Transcribe audio (OpenAI-compatible) |
| POST | `/v1/audio/translations` | Translate audio (delegates to transcription) |
| GET | `/v1/models` | List available models |
| GET | `/health` | Health check |

Transcription Parameters

file (required) - Audio file (multipart form, max 25MB)
model - Accepted but ignored (only one model)
language - ISO-639-1 code (default: "en")
response_format - json, text, srt, vtt, verbose_json (default: "json")
prompt, temperature - Accepted but ignored

Code Patterns & Conventions

Naming

Go standard naming (camelCase for private, PascalCase for exported)
Descriptive function names: parseWAV, convertToFloat32, tdtDecode
Type suffixes for ONNX tensors: inputTensor, outputTensor, lengthTensor

Error Handling

Wrap errors with fmt.Errorf("context: %w", err)
Return early on error
Cleanup resources with defer (tensor.Destroy(), file.Close())
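
The three conventions above combine as in this small sketch. `readAudioFile` and `isMissing` are hypothetical helpers, not functions from the codebase.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"os"
)

// readAudioFile opens path, reads it fully, and wraps any failure with
// context while preserving the underlying error via %w.
func readAudioFile(path string) ([]byte, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, fmt.Errorf("open audio %q: %w", path, err)
	}
	defer f.Close() // released on every return path below

	data, err := io.ReadAll(f)
	if err != nil {
		return nil, fmt.Errorf("read audio %q: %w", path, err)
	}
	return data, nil
}

// isMissing reports whether err, at any wrapping depth, means the file
// does not exist. This works only because the wrap used %w.
func isMissing(err error) bool {
	return errors.Is(err, os.ErrNotExist)
}
```

Wrapping with `%w` rather than `%v` is what lets callers still match the cause with `errors.Is` or `errors.As`.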

ONNX Runtime Usage

Create tensors with ort.NewTensor(shape, data)
Use ort.NewAdvancedSession() for named inputs/outputs
Always call .Destroy() on tensors and sessions after use
Memory-conscious: tensors created and destroyed per inference step in decode loop

Response Formats

JSON structs use tags: json:"field_name" with omitempty where appropriate
OpenAI-compatible response structures

Model Architecture Details

Encoder

Conformer architecture with 1024-dim output
Input: mel features [batch, 128 features, time]
Output: encoded features [batch, 1024, time/8]
Subsampling factor: 8x
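
Because of the 8x subsampling, each encoder frame spans eight mel frames. Converting an encoder frame index to a time offset could be sketched as below. `encoderFrameSeconds` is a hypothetical helper, and the hop length is passed in as a parameter because its value is set in `mel.go` and not stated here.

```go
package main

// encoderFrameSeconds returns the start time, in seconds, of an encoder
// frame given the subsampling factor, the mel hop length in samples, and
// the audio sample rate.
func encoderFrameSeconds(frame, subsampling, hopLength, sampleRate int) float64 {
	return float64(frame*subsampling*hopLength) / float64(sampleRate)
}
```

For instance, if the hop length were 160 samples (an assumed value), each encoder frame at 16kHz would cover 8 * 160 / 16000 = 0.08 seconds.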

TDT Decoder

Token-and-Duration Transducer
Vocab size: 8193 tokens (8192 + blank)
Duration classes: 5 (predicts how many encoder steps to advance)
LSTM state: 2 layers x 640 dim
Greedy decoding with max 10 tokens per timestep
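
The control flow of greedy TDT decoding can be sketched without ONNX by hiding the decoder/joint call behind a function. This is an illustration under assumptions, not the code in `tdtDecode()`: `stepFunc` is a hypothetical stand-in that already applies argmax to the token and duration logits, and LSTM state handling is reduced to passing the last emitted token.

```go
package main

// stepFunc stands in for one decoder_joint inference: given the encoder
// frame index and the last emitted token, it returns the best token ID and
// the predicted duration (number of encoder frames to advance).
type stepFunc func(frame, lastToken int) (token, duration int)

const (
	blankID          = 8192 // <blk> index
	maxTokensPerStep = 10   // cap on emissions without advancing
)

// greedyTDT walks the encoder frames, emitting non-blank tokens and
// advancing by the predicted duration after each step.
func greedyTDT(numFrames int, step stepFunc) []int {
	var tokens []int
	last := blankID
	emitted := 0 // tokens emitted since the frame index last advanced
	for t := 0; t < numFrames; {
		tok, dur := step(t, last)
		if tok != blankID {
			tokens = append(tokens, tok)
			last = tok // decoder state only changes on non-blank tokens
			emitted++
		}
		// Guarantee progress: a blank with zero duration, or hitting the
		// per-step cap, forces a move to the next frame.
		if dur == 0 && (tok == blankID || emitted >= maxTokensPerStep) {
			dur = 1
		}
		if dur > 0 {
			emitted = 0
		}
		t += dur
	}
	return tokens
}
```

The duration head is what distinguishes TDT from a plain transducer: a confident prediction can skip several encoder frames at once instead of stepping one frame per blank.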

Vocab Format

▁token 123

SentencePiece format with ▁ (U+2581) as word boundary marker
Special token: <blk> (blank) at index 8192

Important Gotchas

ONNX Runtime Library

Must be installed separately (not vendored)
Set ONNXRUNTIME_LIB env var if not in standard paths
Auto-detection checks common paths in Makefile and transcriber.go
Use make deps-onnxruntime to install (requires sudo)
Compatible version: 1.21.x for onnxruntime_go v1.19.0

Model Files Required

encoder-model.int8.onnx (~652MB) or encoder-model.onnx (~2.5GB)
decoder_joint-model.int8.onnx (~18MB) or decoder_joint-model.onnx (~72MB)
config.json, vocab.txt, nemo128.onnx
Download via make models or manually from HuggingFace

Audio Format Limitations

Currently only WAV format is supported
WebM, OGG, MP3, M4A return "requires ffmpeg conversion - not yet implemented"
All audio resampled to 16kHz mono internally
Minimum audio length: 100ms (1600 samples at 16kHz)
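
Format detection by magic bytes and the minimum-length check might be sketched like this. `detectFormat` and `checkLength` are hypothetical names, and the real `loadAudio()` may detect formats differently (for example by file extension).

```go
package main

import (
	"bytes"
	"fmt"
)

const minSamples = 1600 // 100ms at 16kHz

// detectFormat guesses the container from leading magic bytes. The EBML
// signature also matches Matroska, and "ftyp" matches MP4 in general.
func detectFormat(data []byte) string {
	switch {
	case len(data) >= 12 && string(data[0:4]) == "RIFF" && string(data[8:12]) == "WAVE":
		return "wav"
	case len(data) >= 4 && string(data[0:4]) == "OggS":
		return "ogg"
	case len(data) >= 4 && bytes.Equal(data[0:4], []byte{0x1A, 0x45, 0xDF, 0xA3}):
		return "webm"
	case len(data) >= 8 && string(data[4:8]) == "ftyp":
		return "m4a"
	case len(data) >= 3 && string(data[0:3]) == "ID3":
		return "mp3"
	case len(data) >= 2 && data[0] == 0xFF && data[1]&0xE0 == 0xE0:
		return "mp3" // bare MPEG audio frame sync
	}
	return "unknown"
}

// checkLength rejects audio shorter than the documented 100ms minimum.
func checkLength(samples []float32) error {
	if len(samples) < minSamples {
		return fmt.Errorf("audio too short: %d samples, need at least %d", len(samples), minSamples)
	}
	return nil
}
```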

Tensor Memory Management

Tensors must be destroyed manually with .Destroy(); Go's garbage collector does not release the underlying ONNX Runtime memory
The TDT decode loop creates/destroys tensors each iteration
Memory usage: ~2GB RAM for int8 models, ~6GB for fp32

API Compatibility Notes

model parameter accepted but ignored (only one model)
prompt and temperature parameters accepted but ignored
language defaults to "en" if not specified
Translation endpoint just calls transcription

Environment Variables

| Variable | Description | Default |
|----------|-------------|---------|
| `ONNXRUNTIME_LIB` | Path to the ONNX Runtime shared library (e.g. libonnxruntime.so) | Auto-detect |

Dependencies

From go.mod:

go 1.25.5
github.com/yalue/onnxruntime_go v1.19.0

No other external Go dependencies. HTTP and JSON come from the standard library; WAV parsing, resampling, and the FFT are implemented in this repository without third-party packages.

CI/CD

CI Pipeline (`.github/workflows/ci.yaml`)

Runs on push/PR to main/master
Jobs: lint (Go 1.22), test (Go 1.25), build (Go 1.25)
Lint checks: go vet, gofmt

Release Pipeline (`.github/workflows/release.yaml`)

Triggers on version tags (v*)
Builds binaries for linux and darwin (amd64/arm64) and windows (amd64)
Creates GitHub release with checksums
Pushes Docker images to ghcr.io (int8 and fp32 variants)

Docker Build

Multi-stage build with golang:1.25-bookworm builder
Runtime: debian:bookworm-slim with ONNX Runtime 1.21.0
Models embedded in image during build
Health check included
Exposed port: 5092

Common Tasks for Agents

Adding a New Audio Format

Add case in internal/asr/transcriber.go:loadAudio()
Implement parser in internal/asr/audio.go
Ensure output is []float32 normalized to [-1, 1] at 16kHz

Modifying API Response

Add/modify structs in internal/server/types.go
Update relevant handler in internal/server/handlers.go
Follow OpenAI response format conventions

Adding a New Endpoint

Add handler method to internal/server/handlers.go
Register route in internal/server/server.go:setupRoutes()
Add types to internal/server/types.go if needed

Changing Inference Parameters

Encoder dim: internal/asr/transcriber.go:247 (encoderDim := int64(1024))
LSTM state: internal/asr/transcriber.go:314-315 (stateDim, numLayers)
Max tokens per step: internal/asr/transcriber.go:39 (maxTokensPerStep: 10)
Mel features: internal/asr/mel.go:25-27 (nFFT, hopLength, winLength)

Adding a New Makefile Target

Add target with ## Description comment for help
Use @ prefix for silent commands
Add to .PHONY if not a file target

Creating a Release

Tag with semver: git tag v1.0.0
Push tag: git push origin v1.0.0
Release pipeline builds and publishes automatically
