All notable changes to the Born ML Framework will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
0.7.14 - 2026-03-04
First external contribution! Thanks to @jsully1720.
Added:
- ONNX `Equal` operator — binary element-wise comparison returning bool tensor
- New `comparison_ops.go` category for ONNX comparison operators
- `registerComparisonOps()` wired into operator registry
ONNX operators: 38 → 39
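The operator's element-wise semantics can be sketched in plain Go (slices stand in for Born's tensor types; `equal` is an illustrative name, not the framework API):

```go
package main

import "fmt"

// equal compares two equally shaped float32 tensors (flattened)
// element-wise and returns a bool tensor, mirroring the semantics
// of the ONNX Equal operator.
func equal(a, b []float32) []bool {
	out := make([]bool, len(a))
	for i := range a {
		out[i] = a[i] == b[i]
	}
	return out
}

func main() {
	fmt.Println(equal([]float32{1, 2, 3}, []float32{1, 0, 3})) // [true false true]
}
```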
Links:
0.7.13 - 2026-03-02
Update WebGPU backend to v0.4.1 with critical ABI compliance fixes.
Updated Dependencies:
- `go-webgpu/webgpu` v0.4.0 → v0.4.1
- `go-webgpu/goffi` v0.4.0 → v0.4.1 (indirect)
Upstream Bug Fixes (ABI compliance):
- Float32 encoding: correct XMM bit patterns via `math.Float32bits`
- AMD64 Unix stack: arguments beyond 6 GP registers properly pushed to stack
- ARM64 Unix stack: arguments beyond 8 GP registers correctly spilled to stack
- AMD64 struct returns (9-16 bytes): RAX+RDX register pair properly assembled
- AMD64 sret pointer: structs > 16 bytes use caller buffer as first argument (RDI)
- ARM64 HFA spilling: Homogeneous Floating-Point Aggregate overflow follows AAPCS64
Upstream Enhancements:
- `runtime.KeepAlive` prevents GC of argument pointers during FFI calls
- `ErrTooManyArguments` overflow detection for calls exceeding 15 arguments
Impact: Critical ABI correctness fixes for multi-platform GPU backend reliability.
Links:
- Upstream release: go-webgpu v0.4.1
0.7.12 - 2026-02-27
Update WebGPU backend to v0.4.0 with FFI hardening and improved library loading.
Updated Dependencies:
- `go-webgpu/webgpu` v0.3.2 → v0.4.0
Upstream Improvements:
- Null handle guards on 27 public FFI methods — prevents SIGSEGV on nil/released objects
- `ptrFromUintptr` helper — eliminates all `go vet` `unsafe.Pointer` warnings
- `WGPU_NATIVE_PATH` env var for custom wgpu-native library path
- `loadLibrary` returns `(Library, error)` with proper error propagation
- Windows DLL eager loading — errors surface at init, not at first use
- Enhanced `Init()` error messages with library path and remediation suggestions
- 85 new null guard test cases
Impact: Significantly improved safety and debuggability of GPU backend initialization.
Links:
- Upstream release: go-webgpu v0.4.0
0.7.11 - 2026-02-27
Update WebGPU backend to v0.3.2 with crosscall2 callback integration.
Updated Dependencies:
- `go-webgpu/webgpu` v0.3.1 → v0.3.2
- `go-webgpu/goffi` v0.3.9 → v0.4.0 (indirect)
Upstream Improvements:
- crosscall2 integration — callbacks now work from C-library-created threads (Metal, wgpu-native)
- fakecgo trampoline register fixes synced with purego v0.10.0
Impact: Improved callback reliability on macOS Metal and native WebGPU implementations.
Links:
- Upstream release: go-webgpu v0.3.2
0.7.10 - 2026-02-18
Update WebGPU backend to v0.3.1 with critical ARM64 callback fix.
Updated Dependencies:
- `go-webgpu/webgpu` v0.3.0 → v0.3.1
- `go-webgpu/goffi` v0.3.8 → v0.3.9 (indirect)
Upstream Fixes:
- ARM64 callback trampoline rewrite — fixes LR corruption for callbacks at index > 0
- Symbol rename to prevent linker collision with purego
Code Quality:
- Removed 101 unused `//nolint:gosec` directives (gosec linter updated, no longer flags these)
- Standardized remaining nolint comments to short format
Impact: Critical fix for macOS Apple Silicon and Linux ARM64 users.
Links:
- Upstream release: go-webgpu v0.3.1
0.7.9 - 2026-02-09
Update WebGPU backend to v0.3.0 with new capability-querying API and typed errors.
Updated Dependencies:
- `go-webgpu/webgpu` v0.2.1 → v0.3.0
New Upstream Features Available:
- `Surface.GetCapabilities()` — query supported formats, present modes, alpha modes
- `Device.GetFeatures()` / `Device.HasFeature()` — feature enumeration
- `Device.GetLimits()` — device limits (experimental)
- Typed errors with `errors.Is()` / `errors.As()` support (`ErrValidation`, `ErrOutOfMemory`, `ErrInternal`, `ErrDeviceLost`)
- Resource leak detection via `SetDebugMode(true)` / `ReportLeaks()`
Links:
- Upstream release: go-webgpu v0.3.0
0.7.8 - 2026-01-29
Migrate WebGPU backend to use unified gputypes for future dual-backend support.
Updated Dependencies:
- `go-webgpu/webgpu` v0.1.4 → v0.2.1
- `go-webgpu/goffi` v0.3.7 → v0.3.8
- `dlclark/regexp2` v1.10.0 → v1.11.5
- `google/uuid` v1.3.0 → v1.6.0
- Added `gogpu/gputypes` v0.2.0 (new dependency)
Changes:
- Migrated all WebGPU types from `wgpu.*` to `gputypes.*`: `BufferUsage`, `BufferUsageStorage`, `BufferUsageCopySrc`, `BufferUsageCopyDst`, `PowerPreferenceHighPerformance`
- Updated 10 files in `internal/backend/webgpu/`
- Fixed 3 prealloc linter warnings (examples + internal/nn)
Why This Matters:
- Prepares codebase for Pure Go WebGPU backend (`gogpu/wgpu`)
- Unified type system enables future dual-backend architecture
- Build tags will allow: `go build` (Rust FFI) vs `go build -tags purego` (Pure Go)
Links:
- Upstream release: go-webgpu v0.2.0
- GoGPU ecosystem: github.com/gogpu
- Integration plan: TASK-110
0.7.7 - 2026-01-06
Refactored public API packages to use proper Go interfaces instead of type aliases where possible.
Improvements:
- `tensor/`: Added `Backend` interface with 40+ methods (was type alias)
- `nn/`: Added `Module` interface with full method definitions
- `onnx/`: Added `Model` interface for ONNX model operations
- `optim/`: Now uses public `nn.Parameter` in function signatures
- `autodiff/`: Now uses public `tensor` types
- `backend/cpu`, `backend/webgpu`: Added compile-time interface checks
Technical Details:
- Improves pkg.go.dev documentation by hiding internal paths
- External packages can now properly import and use the public API
- Some interfaces (`Optimizer`, `ModelReader`) remain as type aliases due to Go's type system constraints
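The compile-time interface checks mentioned above use the standard Go idiom of assigning a concrete value to a blank interface-typed variable. A minimal sketch (this trimmed `Backend` is illustrative, not the real 40+ method interface):

```go
package main

import "fmt"

// Backend is a trimmed stand-in for the tensor.Backend interface.
type Backend interface {
	Name() string
}

type cpuBackend struct{}

func (cpuBackend) Name() string { return "cpu" }

// Compile-time check: the build fails if cpuBackend ever stops
// satisfying Backend, with no runtime cost.
var _ Backend = cpuBackend{}

func main() {
	var b Backend = cpuBackend{}
	fmt.Println(b.Name())
}
```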
Fixed Issues:
- #25 — ONNX package not accessible from external packages
0.7.6 - 2026-01-03
Comprehensive ARM64 Darwin support with enhanced struct handling, tested on M3 Pro hardware.
Updated Dependencies:
- `go-webgpu/webgpu` v0.1.3 → v0.1.4
- `go-webgpu/goffi` v0.3.6 → v0.3.7
Improvements:
- Proper layout for nested and complex struct types
- Automatic struct layout computation for integer/float combinations
- Enhanced struct return handling (9-16 bytes) utilizing X0 and X1 registers
Fixed Issues:
- Resolved segmentation fault in string output benchmarks on Darwin systems
Contributors:
- @ppoage — ARM64 Darwin implementation, Objective-C test suite, assembly verification
Links:
- Upstream release: go-webgpu v0.1.4
0.7.5 - 2025-12-29
Update GPU backend dependencies with critical ARM64 fixes for Apple Silicon.
Updated Dependencies:
- `go-webgpu/webgpu` v0.1.2 → v0.1.3
- `go-webgpu/goffi` v0.3.5 → v0.3.6
Fixed Issues:
- ARM64 HFA returns (NSRect with 4×float64 now correctly returns all values on Apple Silicon)
- Large struct returns (structs exceeding 16 bytes now properly use X8 register)
- macOS ARM64 display (blank window issue where GPU dimensions returned 0×0)
Links:
- Upstream release: go-webgpu v0.1.3
0.7.4 - 2025-12-27
Add WithBias option to nn.NewLinear for creating Linear layers without bias term.
New API:
```go
// With bias (default, backwards compatible)
layer := nn.NewLinear(784, 128, backend)

// Without bias (for LLaMA-style models, LM head, etc.)
lmHead := nn.NewLinear(hiddenSize, vocabSize, backend, nn.WithBias(false))
```

Changes:
- Add `LinearOption` type and `WithBias(bool)` functional option
- Add `HasBias()` method for introspection
- Update `SwiGLUFFN` to use public API
- Export `WithBias` in public `nn` package
Use Cases:
- LM Head in language models (GPT, LLaMA, HRM)
- Attention projections (some architectures)
- SwiGLU FFN layers
Links:
- PR: #22
0.7.3 - 2025-12-27
Hotfix release updating GPU backend dependencies to latest versions.
Updated Dependencies:
- `go-webgpu/webgpu` v0.1.1 → v0.1.2
- `go-webgpu/goffi` v0.3.3 → v0.3.5
Links:
- PR: #21
0.7.2 - 2025-12-24
Hotfix release updating GPU backend dependencies for improved stability.
Updated Dependencies:
- `go-webgpu/webgpu` v0.1.0 → v0.1.1
- `go-webgpu/goffi` v0.3.1 → v0.3.3
Documentation:
- Updated `.claude/CLAUDE.md` to v3.0 (optimized structure, accurate project info)
- Added `TASK-110-backend-strategy-gogpu.md` for future GPU backend strategy planning
Links:
- PR: #18
0.7.1 - 2025-12-16
Patch release addressing cognitive complexity concerns raised by community contributor @marcelloh. Applied Burn framework patterns for improved code quality and maintainability.
Pre-Slice Bounds Elimination (internal/backend/cpu/conv2d.go, maxpool2d.go):
- Extract row slices BEFORE inner loops to eliminate bounds checks
- Hierarchical pre-slicing for nested loop structures
- Enables Go compiler to prove safety and optimize vectorization
Stride Specialization (internal/backend/cpu/conv2d.go):
- Separate fast path for the `stride=1, padding=0` case (most common)
- Specialized functions: `conv2dFloat32Stride1NoPad`, `conv2dInputBackwardFloat32Stride1NoPad`
- Enables compiler auto-vectorization for the common case
Flash Attention CPU Refactor (internal/nn/flash_attention.go):
- Complexity reduced: 111 → <30 (removed `//nolint:gocognit` directive)
- Extracted `FlashDims`, `FlashConfig` structs for configuration
- Helper functions: `flashAttentionScoreBlock`, `flashAttentionExtractValues`, `flashAttentionProcessQuery`
- Each helper under 50 AST nodes (Go compiler inlines automatically)
Autodiff Orchestration (internal/autodiff/ops/):
- Separated orchestration from computation (Burn pattern)
- New files: `conv2d_backward.go`, `maxpool2d_backward.go` in CPU backend
- `autodiff/ops/conv2d.go`: 409 → 67 lines (delegation only)
- Extended Backend interface with backward operation methods
Parallel Execution Utilities (internal/parallel/):
- New package for reusable parallel execution patterns
- `parallel.Config` - configurable parallelism settings
- `parallel.For()` - parallel for-loop with automatic sequential fallback
- `parallel.ForBatch()` - optimized for batch×channels iteration pattern
- Ready for integration into CPU backend operations
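The automatic sequential fallback can be sketched as follows (illustrative `parallelFor`; the real `parallel.For` signature may differ):

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// parallelFor splits [0, n) across GOMAXPROCS goroutines, falling back
// to a plain sequential loop when n is below a threshold, so small
// workloads avoid goroutine overhead.
func parallelFor(n, minParallel int, body func(i int)) {
	if n < minParallel {
		for i := 0; i < n; i++ {
			body(i)
		}
		return
	}
	workers := runtime.GOMAXPROCS(0)
	chunk := (n + workers - 1) / workers
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		lo, hi := w*chunk, min((w+1)*chunk, n)
		if lo >= hi {
			continue
		}
		wg.Add(1)
		go func(lo, hi int) {
			defer wg.Done()
			for i := lo; i < hi; i++ {
				body(i)
			}
		}(lo, hi)
	}
	wg.Wait()
}

func main() {
	out := make([]int, 8)
	parallelFor(8, 4, func(i int) { out[i] = i * i })
	fmt.Println(out) // [0 1 4 9 16 25 36 49]
}
```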
Backend Interface Extended (internal/tensor/backend.go):
- `Conv2DInputBackward(input, kernel, grad, stride, padding)` - gradient w.r.t. input
- `Conv2DKernelBackward(input, kernel, grad, stride, padding)` - gradient w.r.t. kernel
- `MaxPool2DBackward(input, grad, maxIndices, kernelSize, stride)` - gradient propagation
- WebGPU backend updated with stub implementations
Code Quality:
- All files properly formatted (`go fmt ./...`)
- 0 linter issues (golangci-lint)
- All tests passing
- No performance regression
Files Changed: 13 files, +994/-460 lines
Links:
- Issue: #14
- PR: #15
- Community: Thanks to @marcelloh for the detailed analysis!
0.7.0 - 2025-12-10
Major release focused on inference optimization for LLM deployment.
Flash Attention 2 (internal/nn/flash_attention.go, internal/nn/online_softmax.go):
- O(N) Memory - Tiled computation never materializes full N×N attention matrix
- Online Softmax - Incremental softmax with rescaling for numerical stability
- WebGPU Shader - WGSL compute shader with workgroup shared memory
- Configurable Tiles - Block sizes 64 and 128 supported
- Head Dimensions - Supports 64, 96, 128, 256
- Causal Masking - Built-in support for autoregressive models
- CPU Reference - Validation implementation for correctness testing
- 2x+ Speedup - On sequences 8K+ vs standard attention
Speculative Decoding (internal/generate/speculative.go):
- Draft Model - Small model generates K candidate tokens speculatively
- Parallel Verification - Target model verifies all candidates in single batch
- Modified Rejection Sampling - Mathematically correct token acceptance
- 2-4x Speedup - For autoregressive text generation
- Configurable - Draft steps (K), temperature, sampling parameters
GGUF Import (internal/gguf/):
- Parser - Complete GGUF v3 format parsing (types, metadata, tensor info)
- Loader - Memory-mapped tensor data loading
- K-Quant Dequantization - Q4_K, Q5_K, Q6_K, Q8_0, Q4_0, Q4_1, Q5_0, Q5_1
- Converter - GGUF tensors to Born tensor format
- llama.cpp Ecosystem - Load LLaMA, Mistral, DeepSeek, Qwen models
Code Quality:
- Fixed 226 gosec G115 integer overflow warnings across codebase
- All files properly formatted (gofmt)
- 0 linter issues (golangci-lint)
Tests:
- Flash Attention: GPU vs CPU correctness validation (< 1e-4 error)
- Speculative Decoding: 11 tests, 93.1% coverage
- GGUF: 52 tests, 75% coverage
Files Added:
- `internal/nn/flash_attention.go` - Flash Attention module
- `internal/nn/online_softmax.go` - Online softmax implementation
- `internal/nn/flash_attention_test.go` - CPU tests
- `internal/nn/flash_attention_gpu_test.go` - GPU tests
- `internal/backend/webgpu/flash_attention.go` - GPU execution
- `internal/backend/webgpu/shaders.go` - Added flashAttentionShader
- `internal/generate/speculative.go` - Speculative decoding
- `internal/generate/speculative_test.go` - Speculative tests
- `internal/gguf/` - Complete GGUF package (types, parser, loader, dequant, convert)
0.6.0 - 2025-12-04
Major release adding ONNX model import and GPU-resident lazy evaluation for dramatically improved performance.
ONNX Import API (internal/onnx/):
- ONNX Parser - Parse `.onnx` model files (protobuf format)
- Model Loader - Load weights and construct computation graph
- 30+ Operators - Standard ONNX operator support:
- Activations: ReLU, Sigmoid, Tanh, Softmax, GELU, LeakyReLU
- Math: MatMul, Add, Mul, Div, Sub, Sqrt, Pow, Exp, Log
- Shape: Reshape, Transpose, Squeeze, Unsqueeze, Concat, Split
- Utility: Gather, Slice, Cast, Constant, Identity, Flatten
- Operator Registry - Extensible operator registration system
Lazy GPU Evaluation (internal/tensor/lazy_gpu.go):
- GPU-Resident Tensors - Data stays on GPU until explicitly needed
- LazyGPUData - Reference to GPU buffer with lazy CPU transfer
- Automatic Memory Management - `runtime.SetFinalizer` for GPU buffer cleanup
- Zero CPU Round-trips - Chained operations stay entirely on GPU
Command Batching (internal/backend/webgpu/):
- Batch GPU Commands - Accumulate commands instead of immediate submit
- Reduced Sync Overhead - ~200 submits → 1-2 per operation chain
- FlushCommands() - Explicit synchronization when needed
- Performance Impact: ~90s/step → <5s/step for model training
GPU-to-GPU Copy:
- CopyBufferToBuffer - Direct GPU memory transfer
- No CPU Round-trip - Eliminated GPU→CPU→GPU transfers in lazy chains
- ~100x Speedup - Per-operation transfer overhead eliminated
Raw Tensor Operations (internal/tensor/raw_ops.go):
- 50+ Operations - Comprehensive tensor manipulation
- Argmax, TopK - Selection operations
- Type Conversions - Float32, Int32, Bool conversions
- Broadcasting - NumPy-style shape broadcasting
- Advanced Indexing - Gather, Scatter operations
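The broadcasting rule can be stated independently of tensors: align shapes from the right, and each pair of dimensions must be equal or one of them 1. An illustrative helper, not the `raw_ops.go` API:

```go
package main

import "fmt"

// broadcastShape computes the NumPy-style broadcast of two shapes:
// align from the right; dimensions must match or one of them be 1.
// Returns false for incompatible shapes.
func broadcastShape(a, b []int) ([]int, bool) {
	n := max(len(a), len(b))
	out := make([]int, n)
	for i := 1; i <= n; i++ {
		da, db := 1, 1
		if i <= len(a) {
			da = a[len(a)-i]
		}
		if i <= len(b) {
			db = b[len(b)-i]
		}
		switch {
		case da == db, db == 1:
			out[n-i] = da
		case da == 1:
			out[n-i] = db
		default:
			return nil, false // incompatible dimensions
		}
	}
	return out, true
}

func main() {
	s, ok := broadcastShape([]int{8, 1, 6}, []int{7, 1})
	fmt.Println(s, ok) // [8 7 6] true
}
```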
Bug Fixes:
- Fixed GPU memory leak when lazy tensors go out of scope
- Fixed typed accessors (AsInt32, AsInt64, etc.) bypassing lazy realization
- Fixed Where and Sum operations missing lazy mode support
Tests:
- 15+ new ONNX tests (parser, loader, operators)
- Lazy mode chain tests
- Command batching tests
Files Added:
internal/onnx/- Complete ONNX import packageinternal/tensor/lazy_gpu.go- Lazy GPU data structuresinternal/tensor/raw_ops.go- Raw tensor operationsinternal/backend/webgpu/lazy_compute.go- Lazy GPU operationsinternal/backend/webgpu/gpu_*.go- GPU tensor and autodiff support
0.5.5 - 2025-12-03
Critical performance fix for transformer training on WebGPU backend.
Problem Fixed:
- Multi-dimensional Transpose operations (3D+) were falling back to CPU
- Expand (broadcasting) was CPU-only
- Result: ~60s/batch for small transformer models (should be <1s)
New GPU Operations:
- TransposeND shader - N-dimensional transpose on GPU (up to 6D)
- Expand shader - NumPy-style broadcasting on GPU
- Both support `float32` and `int32` data types
Performance Impact:
- ~60x speedup for attention operations
- Transformer training now usable on WebGPU
Tests:
- 9 new tests: `TestTranspose3D`, `TestTranspose4D`, `TestTranspose5D`, `TestExpandBroadcast`, etc.
Files Changed:
- `internal/backend/webgpu/shaders.go` - Added WGSL shaders
- `internal/backend/webgpu/compute.go` - Added `runTransposeND`, `runExpand`
- `internal/backend/webgpu/ops.go` - Removed CPU fallback
- `internal/backend/webgpu/ops_extended.go` - Removed CPU fallback
- `internal/backend/webgpu/ops_nd_test.go` - New test file
0.5.4 - 2025-12-03
Production-ready model serialization with Format v2 best practices.
New Features:
- Born Native Format v2 (`.born`) - SHA-256 checksum, security validation
- Checkpoint API - Save/resume training with optimizer state
- SafeTensors Export - HuggingFace ecosystem compatibility
- Memory-Mapped Reader - Efficient loading for 70GB+ models
API:
- `nn.Save(model, "model.born", "ModelType", metadata)` - Save model
- `nn.Load("model.born", backend, model)` - Load model
- `nn.SaveCheckpoint(path, model, optimizer, epoch, step, loss)` - Save checkpoint
- `nn.LoadCheckpoint(path, backend, model, optimizer)` - Resume training
- `serialization.WriteSafeTensors(path, tensors, metadata)` - Export for HuggingFace
New Package:
- `internal/serialization` - Format writer/reader, validation, mmap
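The checksum idea can be sketched with the standard library: append a SHA-256 digest to the payload on write, then recompute and compare on read. An illustrative layout, not the actual `.born` format:

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

// writeWithChecksum frames a payload as [len | payload | sha256(payload)].
func writeWithChecksum(payload []byte) []byte {
	sum := sha256.Sum256(payload)
	out := binary.LittleEndian.AppendUint32(nil, uint32(len(payload)))
	out = append(out, payload...)
	return append(out, sum[:]...)
}

// readWithChecksum verifies the trailing digest before returning data,
// so corrupted or truncated files are rejected at load time.
func readWithChecksum(blob []byte) ([]byte, error) {
	n := binary.LittleEndian.Uint32(blob[:4])
	payload, sum := blob[4:4+n], blob[4+n:]
	if sha256.Sum256(payload) != [32]byte(sum) {
		return nil, fmt.Errorf("checksum mismatch")
	}
	return payload, nil
}

func main() {
	blob := writeWithChecksum([]byte("weights"))
	payload, err := readWithChecksum(blob)
	fmt.Println(string(payload), err)
}
```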
Tests:
- 26 new tests for serialization, checkpoints, SafeTensors
0.5.3 - 2025-12-02
Bug Fixes:
- Comparison ops - Now always return `float32` (0.0/1.0), even for `int32` inputs
- Sum int32 - Added WGSL shader for int32 sum reduction
- Sum scalar shape - Fixed return shape from `[1]` to `[]` for proper scalar handling
- Where int32 condition - Added support for int32 condition tensors
- Where broadcasting - Added NumPy-style broadcasting (like Burn)
- Gather backward - Support for int32, int64, float32 index tensors
New Functions:
- `runComparisonOp` - Dedicated function for comparison operations
- `int32ToFloat32` - Helper for int32 to float32 conversion
Tests:
- 3 new Gather backward tests (int64 indices, boundary, dim0 2D)
0.5.2 - 2025-12-01
- Added public `backend/webgpu` package with `NewBackend()` function
- Windows build tag support for WebGPU
- Updated README with WebGPU API example
0.5.1 - 2025-12-01
- Minor fixes after v0.5.0 release
0.5.0 - 2025-12-01
Major release adding complete LLM inference support! Run LLaMA, Mistral, DeepSeek, and other modern language models with Born.
Grouped Query Attention (GQA) (internal/nn/gqa.go):
- GroupedQueryAttention - Memory-efficient attention for LLaMA 2/3, Mistral
- RepeatKV - KV head broadcasting (e.g., 8 KV heads → 32 Q heads)
- MQA helper - Multi-Query Attention config (extreme GQA with 1 KV head)
- Full RoPE integration with KV-cache support
- 4:1 memory savings for KV-cache vs standard MHA
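The `RepeatKV` broadcast can be sketched as follows (per-head slices stand in for tensors; note that repeated heads share memory, which is where the KV-cache saving comes from):

```go
package main

import "fmt"

// repeatKV broadcasts nKV key/value heads to nQ query heads by
// repeating each KV head nQ/nKV times, e.g. 8 KV heads serving
// 32 Q heads in LLaMA-style GQA.
func repeatKV(heads [][]float32, nQ int) [][]float32 {
	groups := nQ / len(heads)
	out := make([][]float32, 0, nQ)
	for _, h := range heads {
		for g := 0; g < groups; g++ {
			out = append(out, h) // shared slice, not copied: the memory saving
		}
	}
	return out
}

func main() {
	kv := [][]float32{{1}, {2}}       // 2 KV heads
	fmt.Println(len(repeatKV(kv, 8))) // 8 Q heads
}
```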
SwiGLU & GLU Variants (internal/nn/glu.go, internal/nn/swiglu_ffn.go):
- SwiGLU - `x * SiLU(gate)` activation (LLaMA, Mistral)
- GeGLU - `x * GELU(gate)` activation
- ReGLU - `x * ReLU(gate)` activation
- GLU - `x * sigmoid(gate)` (classic)
- SwiGLUFFN - Complete feed-forward module with gate/up/down projections
- Configurable bias (LLaMA uses no bias)
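The SwiGLU computation itself is small; a standalone sketch over slices (not the `SwiGLUFFN` module, which adds the gate/up/down projections around it):

```go
package main

import (
	"fmt"
	"math"
)

// silu is x * sigmoid(x), the Swish activation.
func silu(x float32) float32 {
	return x * float32(1/(1+math.Exp(-float64(x))))
}

// swiGLU gates x with SiLU(gate); x and gate would come from two
// separate linear projections of the same input in a real FFN.
func swiGLU(x, gate []float32) []float32 {
	out := make([]float32, len(x))
	for i := range x {
		out[i] = x[i] * silu(gate[i])
	}
	return out
}

func main() {
	fmt.Println(swiGLU([]float32{1, 2}, []float32{0, 1}))
}
```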
Model Loader (internal/loader/):
- GGUF format support - Read LLaMA, Mistral, DeepSeek model files
- GGUFReader - Parse metadata and tensor info
- Weight Mappers - Architecture-specific weight name translation
- `LLaMAMapper` - LLaMA 1/2/3 models
- `MistralMapper` - Mistral 7B and variants
- `DeepSeekMapper` - DeepSeek models
- DetectArchitecture - Auto-detect model type from tensor names
- Support for F32, F16 dtypes (quantized types require dequant)
Tokenizer Integration (internal/tokenizer/):
- TikToken - OpenAI's BPE tokenizer (GPT-3.5, GPT-4)
- BPE Tokenizer - Generic Byte Pair Encoding
- HuggingFace format - Load tokenizer.json from HF models
- Chat Templates - Format multi-turn conversations
- ChatML (OpenAI style)
- LLaMA (Meta format)
- Mistral (with [INST] tags)
- Special tokens - BOS, EOS, PAD, UNK handling
- AutoLoad - Auto-detect tokenizer type from path
Sampling Strategies (internal/generate/sampling.go):
- Temperature - Control randomness (0 = greedy)
- Top-K - Sample from top K tokens
- Top-P (nucleus) - Sample from smallest set with P cumulative probability
- Min-P - Filter tokens below P * max_prob threshold
- Repetition Penalty - Penalize repeated tokens
- Frequency Penalty - Penalize based on token frequency
- Presence Penalty - Penalize based on token presence
- Configurable seed - Reproducible sampling
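Top-K filtering with temperature can be sketched as follows (illustrative function, not the `generate.SamplingConfig` implementation):

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// topK keeps the K highest logits, applies temperature, and softmax-
// normalizes them into a probability distribution; everything outside
// the top K gets probability zero.
func topK(logits []float32, k int, temperature float32) []float64 {
	type tok struct {
		id int
		v  float32
	}
	toks := make([]tok, len(logits))
	for i, v := range logits {
		toks[i] = tok{i, v}
	}
	sort.Slice(toks, func(a, b int) bool { return toks[a].v > toks[b].v })
	probs := make([]float64, len(logits))
	var sum float64
	for _, t := range toks[:k] {
		p := math.Exp(float64(t.v / temperature))
		probs[t.id] = p
		sum += p
	}
	for i := range probs {
		probs[i] /= sum
	}
	return probs
}

func main() {
	p := topK([]float32{2, 1, 0.5, -1}, 2, 1.0)
	fmt.Println(p[3] == 0, p[0] > p[1]) // true true
}
```

Top-P (nucleus) works the same way except the cutoff is the smallest prefix whose cumulative probability reaches P rather than a fixed count.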
Text Generation (internal/generate/generator.go):
- TextGenerator - High-level API for text generation
- Streaming API - Token-by-token generation with channels
- Chat API - Multi-turn conversation with templates
- GenerateConfig - Max tokens, min tokens, stop strings/tokens
- GenerateResult - Token, token ID, done flag, reason
- KV-cache integration - Efficient autoregressive generation
- Echo prompt - Optionally include prompt in output
Multi-Output Autodiff (internal/autodiff/ops/):
- MultiOutputOperation - Interface for ops with multiple outputs
- BackwardMulti - Compute gradients for multi-output ops
- ChunkOp - Fixed backward pass for tensor chunking
- GatherOp - Scatter-add gradient computation
Public API (nn/, generate/, tokenizer/, loader/):
- Complete public wrappers for all new types
- Type aliases for seamless internal/public integration
- Documentation with examples
- 100+ new unit tests across all LLM modules
- Comprehensive sampling tests - All strategies validated
- Generator tests - Streaming, stop conditions, chat
- Tokenizer tests - Encode/decode roundtrip, special tokens
- 0 golangci-lint issues
| Package | Tests | Status |
|---|---|---|
| internal/nn (GQA, SwiGLU) | 35+ | ✅ |
| internal/tokenizer | 27 | ✅ |
| internal/generate | 17 | ✅ |
| internal/loader | 10+ | ✅ |
| internal/autodiff/ops | 20+ | ✅ |
```go
import (
	"github.com/born-ml/born/generate"
	"github.com/born-ml/born/tokenizer"
	"github.com/born-ml/born/loader"
)

// Load tokenizer
tok, _ := tokenizer.NewTikTokenForModel("gpt-4")

// Load model
model, _ := loader.OpenModel("llama-7b.gguf")

// Create generator
gen := generate.NewTextGenerator(model, tok, generate.SamplingConfig{
	Temperature: 0.7,
	TopP:        0.9,
	TopK:        40,
})

// Generate text
result, _ := gen.Generate("Hello!", generate.GenerateConfig{MaxTokens: 100})

// Or stream tokens
stream, _ := gen.GenerateStream("Once upon", generate.GenerateConfig{MaxTokens: 50})
for chunk := range stream {
	fmt.Print(chunk.Token)
}

// Chat with templates
messages := []tokenizer.ChatMessage{
	{Role: "user", Content: "What is 2+2?"},
}
response, _ := gen.Chat(messages, tokenizer.NewChatMLTemplate(), config)
```

| Feature | Benchmark |
|---|---|
| GQA 32Q/8KV | 4x KV-cache memory savings |
| SwiGLU FFN | 2.7x expansion (vs 4x standard) |
| TikToken | ~1M tokens/sec encoding |
| Top-P sampling | O(n log n) sorting |
0.4.0 - 2025-12-01
Major release adding complete transformer architecture support! Build GPT, LLaMA, BERT, and modern LLM architectures with Born.
Attention Mechanisms (internal/nn/):
- Scaled Dot-Product Attention (SDPA) - Core attention with optional mask and dropout
- Multi-Head Attention (MHA) - Full implementation with WQ, WK, WV, WO projections
- KV-Cache - Efficient autoregressive generation (3.94x speedup for 100 tokens)
Normalization Layers (internal/nn/):
- LayerNorm - Classic layer normalization with learnable gamma/beta
- RMSNorm - Root Mean Square normalization (LLaMA style)
Positional Encodings (internal/nn/):
- RoPE (Rotary Position Embedding) - Used by LLaMA, Mistral, DeepSeek
- ALiBi (Attention with Linear Biases) - Used by BLOOM, MPT
- Sinusoidal - Original Transformer positional encoding
- Learned - Trainable position embeddings (GPT-2 style)
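Of these, the sinusoidal variant is easy to show concretely. A standalone sketch of the original Transformer formula, PE[pos, 2i] = sin(pos / 10000^(2i/d)) and PE[pos, 2i+1] = cos(...):

```go
package main

import (
	"fmt"
	"math"
)

// sinusoidalPE builds the original Transformer positional encoding
// for seqLen positions and embedding dimension d: even indices get
// sine, odd indices get cosine, with geometrically spaced frequencies.
func sinusoidalPE(seqLen, d int) [][]float32 {
	pe := make([][]float32, seqLen)
	for pos := range pe {
		pe[pos] = make([]float32, d)
		for i := 0; i < d; i += 2 {
			freq := math.Pow(10000, -float64(i)/float64(d))
			pe[pos][i] = float32(math.Sin(float64(pos) * freq))
			if i+1 < d {
				pe[pos][i+1] = float32(math.Cos(float64(pos) * freq))
			}
		}
	}
	return pe
}

func main() {
	pe := sinusoidalPE(4, 8)
	fmt.Println(pe[0][0], pe[0][1]) // position 0: sin(0)=0, cos(0)=1
}
```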
Transformer Building Blocks (internal/nn/):
- TransformerBlock - Complete transformer layer with:
- Pre-Norm (LLaMA style) and Post-Norm (original) support
- RMSNorm or LayerNorm selection
- Configurable attention and FFN dimensions
- FFN (Feed-Forward Network) - SiLU activation (LLaMA style)
- ForwardWithCache - Efficient inference with KV-cache
Tensor Operations (internal/tensor/, internal/backend/cpu/):
- BatchMatMul - Native 3D/4D batched matrix multiplication
  - `[B, M, K] @ [B, K, N] → [B, M, N]` (3D)
  - `[B, H, M, K] @ [B, H, K, N] → [B, H, M, N]` (4D)
- Refactored SDPA to use BatchMatMul (-40% code)
- Scalar gradient broadcasting - Fixed `reduceBroadcast` panic when propagating scalar gradients
- Multi-dim Softmax backward - Now supports 3D/4D tensors (not just 2D)
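The BatchMatMul shape contract can be sketched over flat row-major slices (illustrative, not the backend implementation):

```go
package main

import "fmt"

// batchMatMul multiplies [B, M, K] by [B, K, N] giving [B, M, N],
// all stored row-major in flat slices, one independent matmul per batch.
func batchMatMul(a, b []float32, B, M, K, N int) []float32 {
	out := make([]float32, B*M*N)
	for bi := 0; bi < B; bi++ {
		ao, bo, co := bi*M*K, bi*K*N, bi*M*N
		for m := 0; m < M; m++ {
			for k := 0; k < K; k++ {
				av := a[ao+m*K+k]
				row := b[bo+k*N : bo+(k+1)*N]
				for n, bv := range row {
					out[co+m*N+n] += av * bv
				}
			}
		}
	}
	return out
}

func main() {
	// Two batches of [1,2] @ [2,1].
	a := []float32{1, 2, 3, 4}
	b := []float32{5, 6, 7, 8}
	fmt.Println(batchMatMul(a, b, 2, 1, 2, 1)) // [17 53]
}
```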
- 70+ new unit tests across attention modules
- Comprehensive benchmarks for all new components
- 0 golangci-lint issues
- KV-Cache: 3.94x speedup verified
- Parameter counts verified (7.1M per transformer block, matching GPT-2)
```go
import (
	"github.com/born-ml/born/nn"
	"github.com/born-ml/born/tensor"
)

// Create a transformer block (GPT-2 style)
config := nn.TransformerConfig{
	EmbedDim:   768,
	NumHeads:   12,
	FFNDim:     3072,
	NormFirst:  true, // Pre-Norm (LLaMA)
	UseRMSNorm: true, // RMSNorm (LLaMA)
	NormEps:    1e-5,
}
block := nn.NewTransformerBlock(config, backend)

// Forward pass
x := tensor.Randn[float32](tensor.Shape{1, 512, 768}, backend)
output := block.Forward(x, nil)

// With KV-Cache for generation
cache := nn.NewKVCache(1, 12, 2048, 64, backend)
for i := 0; i < 100; i++ {
	token := getNextToken()
	output := block.ForwardWithCache(token, cache)
}
```

| Operation | Benchmark |
|---|---|
| SDPA (512 seq) | 89.2% coverage |
| MHA (768d/12h) | 2.3M params verified |
| KV-Cache (100 tokens) | 3.94x speedup |
| TransformerBlock | ~7.1M params/block |
| RoPE (2048 seq) | Pre-computed cos/sin |
0.3.0 - 2025-11-30
Major release adding essential operations for modern transformer architectures (LLaMA, Mistral, GPT), the HRM Model, and 31 type-safe public API operations!
Math Operations (internal/backend/cpu/math.go, internal/autodiff/ops/):
- `Exp()` - Exponential function with gradient support
- `Sqrt()` - Square root with stable gradients
- `Rsqrt()` - Reciprocal square root (1/√x) for normalization layers
- `Cos()` - Cosine for RoPE (Rotary Position Embedding)
- `Sin()` - Sine for RoPE implementations
Reduction Operations (internal/backend/cpu/reduce.go):
- `SumDim(dim, keepDim)` - Sum along dimension with optional keepDim
- `MeanDim(dim, keepDim)` - Mean along dimension with optional keepDim
- Broadcasting-aware for gradient computation
Tensor Manipulation (internal/backend/cpu/manipulation.go):
- `Cat(tensors, dim)` - Concatenate tensors along dimension
- `Chunk(n, dim)` - Split tensor into n equal chunks
- `Unsqueeze(dim)` - Add dimension of size 1
- `Squeeze(dim)` - Remove dimensions of size 1
Indexing Operations (internal/backend/cpu/indexing.go):
- `Gather(dim, index)` - Select elements using index tensor
- `Where(condition, x, y)` - Conditional element selection
Neural Network Layers (internal/nn/):
- SiLU (Swish) activation: `x * sigmoid(x)` with autodiff
- RMSNorm layer: Root Mean Square Normalization with learnable gamma
- Embedding layer: Token lookup table for NLP models
Gradient Control (internal/autodiff/):
- `NoGrad(func)` - Context manager to disable gradient recording (inference mode)
- `Detach()` - Break gradient chain while keeping tensor values
Public API Operations (internal/tensor/ops_extended.go, tensor/):
31 type-safe operations now available via github.com/born-ml/born/tensor:
- Scalar (4): `MulScalar`, `AddScalar`, `SubScalar`, `DivScalar`
- Math (6): `Log`, `Exp`, `Sqrt`, `Rsqrt`, `Cos`, `Sin`
- Activation (1): `Softmax(dim)`
- Comparison (12): `Greater`/`Gt`, `Lower`/`Lt`, `GreaterEqual`/`Ge`, `LowerEqual`/`Le`, `Equal`/`Eq`, `NotEqual`/`Ne`
- Boolean (3): `Or`, `And`, `Not`
- Reduction (2): `Sum`, `Argmax`
- Type Conversion (6): `Int32`, `Int64`, `Float32`, `Float64`, `Uint8`, `Bool`
- Shape (1): `Expand`
Example usage:
```go
import "github.com/born-ml/born/tensor"

x := tensor.Randn[float32](tensor.Shape{2, 3}, backend)
y := x.MulScalar(2.0) // Scalar operations
mask := x.Greater(y)  // Comparison (returns Tensor[bool, B])
z := x.Softmax(-1)    // Activation
total := x.Sum()      // Reduction
i := x.Int32()        // Type conversion
```

- 112 new unit tests added across all features
- 0 golangci-lint issues (maintained strict quality standards)
- All autodiff operations validated with numerical gradient checking
- Comprehensive edge case coverage (negative dims, broadcasting, etc.)
| Package | Coverage | Tests |
|---|---|---|
| backend/cpu (math) | 79.0% | 23 |
| backend/cpu (reduce) | 80.2% | 17 |
| backend/cpu (manipulation) | - | 29 |
| backend/cpu (indexing) | - | 11 |
| autodiff/ops | 69.6% | - |
| nn (SiLU, RMSNorm, Embedding) | - | 18 |
| Total Phase 2.5 | - | 112 |
- Updated `tensor.Backend` interface with new operations
- Extended `.golangci.yml` with exclusions for intentional patterns
- WebGPU backend stubs added for all new operations (CPU-only for now)
```
internal/backend/cpu/
├── math.go              # Exp, Sqrt, Rsqrt, Cos, Sin
├── math_test.go         # 23 tests
├── reduce.go            # SumDim, MeanDim
├── reduce_test.go       # 17 tests
├── manipulation.go      # Cat, Chunk, Unsqueeze, Squeeze
├── indexing.go          # Gather, Where
└── indexing_test.go     # 11 tests

internal/autodiff/ops/
├── exp.go, sqrt.go, rsqrt.go, cos.go, sin.go
├── sumdim.go, meandim.go
├── silu.go
├── embedding.go
├── math_test.go
├── reduce_test.go
└── silu_test.go

internal/nn/
├── rmsnorm.go           # RMSNorm layer
├── rmsnorm_test.go      # 8 tests
├── embedding.go         # Embedding layer
├── embedding_test.go    # 8 tests
└── activation.go        # Added SiLU

internal/tensor/
└── ops_extended.go      # 31 public API wrappers (470 lines)

internal/backend/cpu/
├── scalar.go            # MulScalar, AddScalar, SubScalar, DivScalar
├── activation.go        # Softmax (n-dimensional, numerically stable)
├── comparison.go        # Greater, Lower, Equal, etc.
├── boolean.go           # Or, And, Not
├── conversion.go        # Cast for all dtype pairs
└── shape.go             # Expand with broadcasting

internal/backend/webgpu/
└── ops_extended.go      # Stubs + working Softmax
```
With Phase 2.5 primitives, Born can now support:
Transformer Components:
- ✅ RoPE (Rotary Position Embedding) - built from `Cos`, `Sin`, `Cat`
- ✅ SwiGLU activation - built from `Linear`, `SiLU`, `Chunk`
- ✅ RMSNorm - directly available as layer
- ✅ Stablemax (HRM) - built from `Where`, `SumDim`, `Gather`
Modern LLM Architectures:
- ✅ LLaMA (Meta)
- ✅ Mistral AI models
- ✅ GPT-style transformers
- ✅ HRM (Hierarchical Reasoning Model)
Inference Capabilities:
- ✅ Token embedding lookup
- ✅ Position encoding (RoPE)
- ✅ Layer normalization (RMSNorm)
- ✅ Modern activations (SiLU/Swish)
- ✅ Gradient control for inference (`NoGrad`, `Detach`)
- Multi-head attention (MHA) layer
- Layer normalization variants
- More positional encodings (Absolute, Learned)
- KV-cache for efficient inference
- Linux/macOS WebGPU support
0.2.0 - 2025-11-28
Major release introducing GPU acceleration via WebGPU - the first production-ready Go ML framework with zero-CGO GPU support!
WebGPU Backend (internal/backend/webgpu/):
- Zero-CGO GPU acceleration via go-webgpu v0.1.0
- WGSL compute shaders for all tensor operations
- Buffer pool with size-based categorization for memory efficiency
- Memory statistics tracking (allocations, peak usage, pool hits/misses)
- Graceful degradation when wgpu_native.dll not available (panic recovery)
GPU Operations:
- Element-wise: `Add`, `Sub`, `Mul`, `Div`
- Matrix: `MatMul` (tiled algorithm, 16x16 workgroups)
- Shape: `Reshape`, `Transpose`
- Activations: `ReLU`, `Sigmoid`, `Tanh`, `Softmax`
CPU Backend Enhancements:
- `Softmax` operation added
- Backend now implements full `tensor.Backend` interface
Examples:
- `examples/mnist-gpu/` - CPU vs WebGPU benchmark (~123x MatMul speedup)
Documentation:
- `docs/PHILOSOPHY.md` - Framework philosophy and design principles
- `docs/USE_CASES.md` - Real-world use cases and deployment scenarios
- Updated README with performance benchmarks
Benchmarks (NVIDIA RTX GPU vs CPU):
| Operation | Size | CPU | WebGPU | Speedup |
|---|---|---|---|---|
| MatMul | 1024×1024 | 847ms | 6.9ms | 123x |
| MatMul | 512×512 | 105ms | 2.1ms | 50x |
| MatMul | 256×256 | 13ms | 1.3ms | 10x |
| Add | 1M elements | 1.2ms | 0.15ms | 8x |
MNIST MLP Inference (batch=256):
- CPU: ~45ms/batch
- WebGPU: ~4.1ms/batch
- Speedup: 10.9x
- Build tags added for Windows-only WebGPU code (`//go:build windows`)
- `go.sum` now committed (was incorrectly in .gitignore)
- Updated all documentation for v0.2.0 milestone
- 13 new WebGPU operation tests (ops_test.go)
- 7 buffer pool tests (buffer_pool_test.go)
- 26 benchmark functions for CPU vs GPU comparison
- All tests pass on Ubuntu, macOS, Windows
- WebGPU tests skip gracefully on systems without GPU support
```
internal/backend/webgpu/
├── backend.go           # WebGPU backend initialization
├── ops.go               # Operation implementations
├── compute.go           # Compute pipeline management
├── shaders.go           # WGSL shader sources
├── buffer_pool.go       # GPU buffer pooling
├── *_test.go            # Tests and benchmarks

examples/mnist-gpu/
└── main.go              # GPU benchmark example

docs/
├── PHILOSOPHY.md        # Framework philosophy
└── USE_CASES.md         # Use cases
```
- Windows: Full WebGPU support (requires wgpu_native.dll)
- Linux/macOS: CPU backend only (WebGPU builds skipped)
- WebGPU on Linux/macOS planned for future release
- BatchNorm2D for training stability
- Dropout for regularization
- Model serialization (save/load)
- Linux WebGPU support via Vulkan
- ONNX model import
0.1.1 - 2025-11-17
BREAKING (but necessary): v0.1.0 had no usable public API! All packages were in internal/ which cannot be imported by external projects. This hotfix adds proper public packages.
Public API Packages:
- `github.com/born-ml/born/tensor` - Type-safe tensor operations
- `github.com/born-ml/born/nn` - Neural network modules (Linear, Conv2D, MaxPool2D, etc.)
- `github.com/born-ml/born/optim` - Optimizers (SGD, Adam)
- `github.com/born-ml/born/backend/cpu` - CPU backend
- `github.com/born-ml/born/autodiff` - Automatic differentiation
Documentation:
- Comprehensive package documentation for pkg.go.dev
- Usage examples in each package
- API reference comments on all public types/functions
- Updated examples to use public API
- README updated with correct import paths
Before (v0.1.0 - broken for external use):

```go
import "github.com/born-ml/born/internal/tensor" // ❌ Cannot import!
```

After (v0.1.1 - works!):

```go
import "github.com/born-ml/born/tensor" // ✅ Public API
```

- All tests pass (internal tests unchanged)
- golangci-lint: 0 issues
- Public packages compile successfully
- Examples work with new imports
- +876 lines of public API code
- 9 new public files (doc.go + package wrappers)
- 5 public packages created
0.1.0 - 2025-11-17
First public release of Born ML Framework - a modern, type-safe machine learning framework for Go.
Released in celebration of Go's 16th anniversary (November 10, 2009 - 2025) 🎂
- Tensor API with generic type safety (`Tensor[T, B]`)
- Shape validation with NumPy-style broadcasting
- Zero-copy operations where possible
- Device abstraction (CPU, with GPU planned)
- Tape-based reverse-mode autodiff
- Decorator pattern (wraps any backend with autodiff)
- Gradient tape with operation recording
- Backward pass with efficient chain rule
- Linear layers with Xavier initialization
- Conv2D (2D convolution) with im2col algorithm
- MaxPool2D (2D max pooling)
- Activation functions: ReLU, Sigmoid, Tanh
- Loss functions: CrossEntropyLoss with numerical stability
- Parameter management for optimization
- SGD with momentum
- Adam with bias correction
- CPU Backend with optimized implementations
- Im2col algorithm for efficient convolutions
- Float32 and Float64 support
- Batch processing
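The im2col idea: unroll each receptive field into its own row so that the convolution reduces to a single matrix multiply. A single-channel, stride-1, no-padding sketch:

```go
package main

import "fmt"

// im2col unrolls kH×kW patches of a single-channel H×W image into
// rows, so convolution becomes one MatMul against the flattened kernel.
func im2col(img []float32, H, W, kH, kW int) [][]float32 {
	outH, outW := H-kH+1, W-kW+1
	cols := make([][]float32, 0, outH*outW)
	for y := 0; y < outH; y++ {
		for x := 0; x < outW; x++ {
			patch := make([]float32, 0, kH*kW)
			for dy := 0; dy < kH; dy++ {
				for dx := 0; dx < kW; dx++ {
					patch = append(patch, img[(y+dy)*W+(x+dx)])
				}
			}
			cols = append(cols, patch)
		}
	}
	return cols
}

func main() {
	img := []float32{1, 2, 3, 4, 5, 6, 7, 8, 9} // 3×3 image
	cols := im2col(img, 3, 3, 2, 2)
	fmt.Println(len(cols), cols[0]) // 4 [1 2 4 5]
}
```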
MNIST Classification:
- MLP (2-layer): 97.44% accuracy (101,770 parameters)
- CNN (LeNet-5): 98.18% accuracy (44,426 parameters)
- MNIST MLP - Fully connected network example
- MNIST CNN - Convolutional neural network example (LeNet-5 style)
- 33 new tests for Conv2D and MaxPool2D
- Numerical gradient verification for all autodiff operations
- Integration tests for end-to-end workflows
- Overall test coverage: 53.7%
Zero External Dependencies (core framework):
- Pure Go implementation
- Standard library only
- Type-safe generics (Go 1.25+)
- Comprehensive README with quickstart
- Example code with detailed comments
- API documentation in code
- ReshapeOp - Enables gradient flow through reshape operations (critical for Conv2D bias)
- TransposeOp - Proper gradient propagation for matrix transposes
- Im2col Algorithm - Efficient convolution via matrix multiplication
- Max Index Tracking - For MaxPool2D gradient routing
- Xavier Initialization - For stable training
- CPU-only (GPU support planned for v0.2.0)
- No model save/load yet
- Limited data augmentation
- No distributed training
- BatchNorm2D for training stability
- Dropout for regularization
- Model serialization
- Data augmentation
- GPU backend (CUDA)
None (initial release)
N/A (initial release)
- Claude Code AI Assistant
- Born ML Project Team