Changelog

All notable changes to the Born ML Framework will be documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

0.7.14 - 2026-03-04

🎉 Community Contribution — ONNX Equal Operator

First external contribution! Thanks to @jsully1720.

Added:

  • ONNX Equal operator — binary element-wise comparison returning bool tensor
  • New comparison_ops.go category for ONNX comparison operators
  • registerComparisonOps() wired into operator registry

ONNX operators: 38 → 39
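The semantics of an element-wise Equal can be sketched in a few lines of Go. This is a minimal illustration only: `equalElementwise` is a hypothetical helper, it works on flat `float32` slices of equal length, and it skips the broadcasting that the ONNX operator allows. It is not the code in comparison_ops.go.

```go
package main

import (
	"errors"
	"fmt"
)

// equalElementwise compares two equally sized slices element by element and
// returns a bool mask. Hypothetical helper for illustration only.
func equalElementwise(a, b []float32) ([]bool, error) {
	if len(a) != len(b) {
		return nil, errors.New("equal: input lengths differ")
	}
	out := make([]bool, len(a))
	for i := range a {
		out[i] = a[i] == b[i]
	}
	return out, nil
}

func main() {
	mask, err := equalElementwise([]float32{1, 2, 3}, []float32{1, 5, 3})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(mask) // [true false true]
}
```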

Links:

  • PR: #34 by @jsully1720
  • Issue: #35

0.7.13 - 2026-03-02

🔧 Dependencies Update

Update WebGPU backend to v0.4.1 with critical ABI compliance fixes.

Updated Dependencies:

  • go-webgpu/webgpu v0.4.0 → v0.4.1
  • go-webgpu/goffi v0.4.0 → v0.4.1 (indirect)

Upstream Bug Fixes (ABI compliance):

  • Float32 encoding: correct XMM bit patterns via math.Float32bits
  • AMD64 Unix stack: arguments beyond 6 GP registers properly pushed to stack
  • ARM64 Unix stack: arguments beyond 8 GP registers correctly spilled to stack
  • AMD64 struct returns (9-16 bytes): RAX+RDX register pair properly assembled
  • AMD64 sret pointer: structs > 16 bytes use caller buffer as first argument (RDI)
  • ARM64 HFA spilling: Homogeneous Floating-Point Aggregate overflow follows AAPCS64

Upstream Enhancements:

  • runtime.KeepAlive prevents GC of argument pointers during FFI calls
  • ErrTooManyArguments overflow detection for calls exceeding 15 arguments

Impact: Critical ABI correctness fixes for multi-platform GPU backend reliability.

Links:


0.7.12 - 2026-02-27

🔧 Dependencies Update

Update WebGPU backend to v0.4.0 with FFI hardening and improved library loading.

Updated Dependencies:

  • go-webgpu/webgpu v0.3.2 → v0.4.0

Upstream Improvements:

  • Null handle guards on 27 public FFI methods — prevents SIGSEGV on nil/released objects
  • ptrFromUintptr helper — eliminates all go vet unsafe.Pointer warnings
  • WGPU_NATIVE_PATH env var for custom wgpu-native library path
  • loadLibrary returns (Library, error) with proper error propagation
  • Windows DLL eager loading — errors surface at init, not at first use
  • Enhanced Init() error messages with library path and remediation suggestions
  • 85 new null guard test cases

Impact: Significantly improved safety and debuggability of GPU backend initialization.

Links:


0.7.11 - 2026-02-27

🔧 Dependencies Update

Update WebGPU backend to v0.3.2 with crosscall2 callback integration.

Updated Dependencies:

  • go-webgpu/webgpu v0.3.1 → v0.3.2
  • go-webgpu/goffi v0.3.9 → v0.4.0 (indirect)

Upstream Improvements:

  • crosscall2 integration — callbacks now work from C-library-created threads (Metal, wgpu-native)
  • fakecgo trampoline register fixes synced with purego v0.10.0

Impact: Improved callback reliability on macOS Metal and native WebGPU implementations.

Links:


0.7.10 - 2026-02-18

🔧 Dependencies Update

Update WebGPU backend to v0.3.1 with critical ARM64 callback fix.

Updated Dependencies:

  • go-webgpu/webgpu v0.3.0 → v0.3.1
  • go-webgpu/goffi v0.3.8 → v0.3.9 (indirect)

Upstream Fixes:

  • ARM64 callback trampoline rewrite — fixes LR corruption for callbacks at index > 0
  • Symbol rename to prevent linker collision with purego

Code Quality:

  • Removed 101 unused //nolint:gosec directives (gosec linter updated, no longer flags these)
  • Standardized remaining nolint comments to short format

Impact: Critical fix for macOS Apple Silicon and Linux ARM64 users.

Links:


0.7.9 - 2026-02-09

🔧 Dependencies Update

Update WebGPU backend to v0.3.0 with new capability-querying API and typed errors.

Updated Dependencies:

  • go-webgpu/webgpu v0.2.1 → v0.3.0

New Upstream Features Available:

  • Surface.GetCapabilities() — query supported formats, present modes, alpha modes
  • Device.GetFeatures() / Device.HasFeature() — feature enumeration
  • Device.GetLimits() — device limits (experimental)
  • Typed errors with errors.Is() / errors.As() support (ErrValidation, ErrOutOfMemory, ErrInternal, ErrDeviceLost)
  • Resource leak detection via SetDebugMode(true) / ReportLeaks()

Links:


0.7.8 - 2026-01-29

🔧 GoGPU Ecosystem Integration (Phase 1)

Migrate WebGPU backend to use unified gputypes for future dual-backend support.

Updated Dependencies:

  • go-webgpu/webgpu v0.1.4 → v0.2.1
  • go-webgpu/goffi v0.3.7 → v0.3.8
  • dlclark/regexp2 v1.10.0 → v1.11.5
  • google/uuid v1.3.0 → v1.6.0
  • Added gogpu/gputypes v0.2.0 (new dependency)

Changes:

  • Migrated all WebGPU types from wgpu.* to gputypes.*:
    • BufferUsage, BufferUsageStorage, BufferUsageCopySrc, BufferUsageCopyDst
    • PowerPreferenceHighPerformance
  • Updated 10 files in internal/backend/webgpu/
  • Fixed 3 prealloc warnings in linter (examples + internal/nn)

Why This Matters:

  • Prepares codebase for Pure Go WebGPU backend (gogpu/wgpu)
  • Unified type system enables future dual-backend architecture
  • Build tags will allow: go build (Rust FFI) vs go build -tags purego (Pure Go)

Links:


0.7.7 - 2026-01-06

🔧 Public API Improvements

Refactored public API packages to use proper Go interfaces instead of type aliases where possible.

Improvements:

  • tensor/: Added Backend interface with 40+ methods (was type alias)
  • nn/: Added Module interface with full method definitions
  • onnx/: Added Model interface for ONNX model operations
  • optim/: Now uses public nn.Parameter in function signatures
  • autodiff/: Now uses public tensor types
  • backend/cpu, backend/webgpu: Added compile-time interface checks
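A compile-time interface check is a one-line Go idiom. The sketch below uses a deliberately tiny `Backend` interface and a `cpuBackend` type, both invented for illustration; Born's real interface has 40+ methods.

```go
package main

import "fmt"

// Backend is a reduced, hypothetical stand-in for a backend interface.
type Backend interface {
	Name() string
	Add(a, b []float32) []float32
}

type cpuBackend struct{}

func (*cpuBackend) Name() string { return "cpu" }

// Add assumes a and b have the same length.
func (*cpuBackend) Add(a, b []float32) []float32 {
	out := make([]float32, len(a))
	for i := range a {
		out[i] = a[i] + b[i]
	}
	return out
}

// Compile-time check: the build fails here if *cpuBackend stops satisfying
// Backend, rather than at a distant call site.
var _ Backend = (*cpuBackend)(nil)

func main() {
	var b Backend = &cpuBackend{}
	fmt.Println(b.Name(), b.Add([]float32{1, 2}, []float32{3, 4}))
}
```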

Technical Details:

  • Improves pkg.go.dev documentation by hiding internal paths
  • External packages can now properly import and use the public API
  • Some interfaces (Optimizer, ModelReader) remain as type aliases due to Go's type system constraints

Fixed Issues:

  • #25 — ONNX package not accessible from external packages

0.7.6 - 2026-01-03

🔧 ARM64 Darwin Enhancement

Comprehensive ARM64 Darwin support with enhanced struct handling, tested on M3 Pro hardware.

Updated Dependencies:

  • go-webgpu/webgpu v0.1.3 → v0.1.4
  • go-webgpu/goffi v0.3.6 → v0.3.7

Improvements:

  • Proper layout for nested and complex struct types
  • Automatic struct layout computation for integer/float combinations
  • Enhanced struct return handling (9-16 bytes) utilizing X0 and X1 registers

Fixed Issues:

  • Resolved segmentation fault in string output benchmarks on Darwin systems

Contributors:

  • @ppoage — ARM64 Darwin implementation, Objective-C test suite, assembly verification

Links:


0.7.5 - 2025-12-29

🔧 ARM64 Hotfix

Update GPU backend dependencies with critical ARM64 fixes for Apple Silicon.

Updated Dependencies:

  • go-webgpu/webgpu v0.1.2 → v0.1.3
  • go-webgpu/goffi v0.3.5 → v0.3.6

Fixed Issues:

  • ARM64 HFA returns (NSRect with 4×float64 now correctly returns all values on Apple Silicon)
  • Large struct returns (structs exceeding 16 bytes now properly use X8 register)
  • macOS ARM64 display (blank window issue where GPU dimensions returned 0×0)

Links:


0.7.4 - 2025-12-27

✨ New Feature: Linear Layer Without Bias

Add WithBias option to nn.NewLinear for creating Linear layers without bias term.

New API:

// With bias (default, backwards compatible)
layer := nn.NewLinear(784, 128, backend)

// Without bias (for LLaMA-style models, LM head, etc.)
lmHead := nn.NewLinear(hiddenSize, vocabSize, backend, nn.WithBias(false))

Changes:

  • Add LinearOption type and WithBias(bool) functional option
  • Add HasBias() method for introspection
  • Update SwiGLUFFN to use public API
  • Export WithBias in public nn package
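The functional-option pattern behind `WithBias` can be sketched as follows. The names `LinearOption`, `WithBias`, and `HasBias` come from the notes above; everything else (the struct fields, the missing backend argument, `NumParams`) is a simplification made for this example and not Born's implementation.

```go
package main

import "fmt"

// Linear is a simplified stand-in layer that only tracks parameter storage.
type Linear struct {
	weight []float32
	bias   []float32 // nil when the layer has no bias
}

type linearConfig struct{ useBias bool }

// LinearOption mutates the construction-time configuration.
type LinearOption func(*linearConfig)

// WithBias enables or disables the bias term.
func WithBias(enabled bool) LinearOption {
	return func(c *linearConfig) { c.useBias = enabled }
}

// NewLinear defaults to a bias so existing call sites keep their behavior.
func NewLinear(in, out int, opts ...LinearOption) *Linear {
	cfg := linearConfig{useBias: true}
	for _, opt := range opts {
		opt(&cfg)
	}
	l := &Linear{weight: make([]float32, in*out)}
	if cfg.useBias {
		l.bias = make([]float32, out)
	}
	return l
}

func (l *Linear) HasBias() bool  { return l.bias != nil }
func (l *Linear) NumParams() int { return len(l.weight) + len(l.bias) }

func main() {
	fmt.Println(NewLinear(784, 128).NumParams())                // 100480
	fmt.Println(NewLinear(4, 3, WithBias(false)).HasBias())     // false
}
```

Because the option is variadic, calls that pass no options compile unchanged, which is what keeps the default backwards compatible.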

Use Cases:

  • LM Head in language models (GPT, LLaMA, HRM)
  • Attention projections (some architectures)
  • SwiGLU FFN layers

Links:


0.7.3 - 2025-12-27

🔧 Dependencies Update

Hotfix release updating GPU backend dependencies to latest versions.

Updated Dependencies:

  • go-webgpu/webgpu v0.1.1 → v0.1.2
  • go-webgpu/goffi v0.3.3 → v0.3.5

Links:


0.7.2 - 2025-12-24

🔧 Dependencies Update

Hotfix release updating GPU backend dependencies for improved stability.

Updated Dependencies:

  • go-webgpu/webgpu v0.1.0 → v0.1.1
  • go-webgpu/goffi v0.3.1 → v0.3.3

Documentation:

  • Updated .claude/CLAUDE.md to v3.0 (optimized structure, accurate project info)
  • Added TASK-110-backend-strategy-gogpu.md for future GPU backend strategy planning

Links:


0.7.1 - 2025-12-16

🔧 Code Quality Refactoring (Issue #14)

Patch release addressing cognitive complexity concerns raised by community contributor @marcelloh. Applied Burn framework patterns for improved code quality and maintainability.

Pre-Slice Bounds Elimination (internal/backend/cpu/conv2d.go, maxpool2d.go):

  • Extract row slices BEFORE inner loops to eliminate bounds checks
  • Hierarchical pre-slicing for nested loop structures
  • Lets the Go compiler prove accesses are in range, removing bounds checks from the inner loops
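The pre-slicing idea can be sketched on a simple row-major scaling loop. Both functions below are hypothetical and unrelated to the actual conv2d.go code; whether a given bounds check is removed depends on the compiler version, and one way to inspect it is `go build -gcflags="-d=ssa/check_bce/debug=1"`.

```go
package main

import "fmt"

// scaleIndexed recomputes a flat index on every access, so each access may
// carry its own bounds check.
func scaleIndexed(dst, src []float32, rows, cols int, k float32) {
	for r := 0; r < rows; r++ {
		for c := 0; c < cols; c++ {
			dst[r*cols+c] = src[r*cols+c] * k
		}
	}
}

// scalePreSliced extracts each row once, before the inner loop. Ranging over
// srcRow and re-slicing dstRow to the same length gives the compiler what it
// needs to prove that srcRow[c] and dstRow[c] are in range.
func scalePreSliced(dst, src []float32, rows, cols int, k float32) {
	for r := 0; r < rows; r++ {
		srcRow := src[r*cols : (r+1)*cols]
		dstRow := dst[r*cols : (r+1)*cols]
		dstRow = dstRow[:len(srcRow)] // length hint for bounds-check elimination
		for c := range srcRow {
			dstRow[c] = srcRow[c] * k
		}
	}
}

func main() {
	src := []float32{1, 2, 3, 4, 5, 6}
	dst := make([]float32, len(src))
	scalePreSliced(dst, src, 2, 3, 2)
	fmt.Println(dst) // [2 4 6 8 10 12]
}
```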

Stride Specialization (internal/backend/cpu/conv2d.go):

  • Separate fast paths for stride=1, padding=0 case (most common)
  • Specialized functions: conv2dFloat32Stride1NoPad, conv2dInputBackwardFloat32Stride1NoPad
  • Keeps stride and padding arithmetic out of the inner loop for the common case, giving the compiler more room to optimize

Flash Attention CPU Refactor (internal/nn/flash_attention.go):

  • Complexity reduced: 111 → <30 (removed //nolint:gocognit directive)
  • Extracted FlashDims, FlashConfig structs for configuration
  • Helper functions: flashAttentionScoreBlock, flashAttentionExtractValues, flashAttentionProcessQuery
  • Each helper under 50 AST nodes (small enough to be a candidate for automatic inlining by the Go compiler)

Autodiff Orchestration (internal/autodiff/ops/):

  • Separated orchestration from computation (Burn pattern)
  • New files: conv2d_backward.go, maxpool2d_backward.go in CPU backend
  • autodiff/ops/conv2d.go: 409 → 67 lines (delegation only)
  • Extended Backend interface with backward operation methods

Parallel Execution Utilities (internal/parallel/):

  • New package for reusable parallel execution patterns
  • parallel.Config - configurable parallelism settings
  • parallel.For() - parallel for-loop with automatic sequential fallback
  • parallel.ForBatch() - optimized for batch×channels iteration pattern
  • Ready for integration into CPU backend operations
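A parallel for-loop with a sequential fallback can be sketched with `sync.WaitGroup`. The names `Config` and `For` echo the list above, but the fields (`Workers`, `MinParallel`) and the chunking strategy are assumptions for this example, not the API of internal/parallel.

```go
package main

import (
	"fmt"
	"sync"
)

// Config holds hypothetical parallelism settings.
type Config struct {
	Workers     int // maximum number of goroutines
	MinParallel int // run sequentially when n is below this threshold
}

// For calls body(i) for every i in [0, n). Small workloads run on the calling
// goroutine; larger ones are split into contiguous chunks.
func For(cfg Config, n int, body func(i int)) {
	if n <= 0 {
		return
	}
	if cfg.Workers <= 1 || n < cfg.MinParallel {
		for i := 0; i < n; i++ {
			body(i)
		}
		return
	}
	workers := cfg.Workers
	if workers > n {
		workers = n
	}
	chunk := (n + workers - 1) / workers
	var wg sync.WaitGroup
	for start := 0; start < n; start += chunk {
		end := start + chunk
		if end > n {
			end = n
		}
		wg.Add(1)
		go func(lo, hi int) {
			defer wg.Done()
			for i := lo; i < hi; i++ {
				body(i)
			}
		}(start, end)
	}
	wg.Wait()
}

func main() {
	squares := make([]int, 8)
	For(Config{Workers: 4, MinParallel: 4}, len(squares), func(i int) { squares[i] = i * i })
	fmt.Println(squares) // [0 1 4 9 16 25 36 49]
}
```

The body must only touch data owned by index `i`, as in the example, since chunks run concurrently.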

Backend Interface Extended (internal/tensor/backend.go):

  • Conv2DInputBackward(input, kernel, grad, stride, padding) - gradient w.r.t. input
  • Conv2DKernelBackward(input, kernel, grad, stride, padding) - gradient w.r.t. kernel
  • MaxPool2DBackward(input, grad, maxIndices, kernelSize, stride) - gradient propagation
  • WebGPU backend updated with stub implementations

Code Quality:

  • All files properly formatted (go fmt ./...)
  • 0 linter issues (golangci-lint)
  • All tests passing
  • No performance regression

Files Changed: 13 files, +994/-460 lines

Links:


0.7.0 - 2025-12-10

⚡ Flash Attention 2 + Speculative Decoding + GGUF Import

Major release focused on inference optimization for LLM deployment.

Flash Attention 2 (internal/nn/flash_attention.go, internal/nn/online_softmax.go):

  • O(N) Memory - Tiled computation never materializes full N×N attention matrix
  • Online Softmax - Incremental softmax with rescaling for numerical stability
  • WebGPU Shader - WGSL compute shader with workgroup shared memory
  • Configurable Tiles - Block sizes 64 and 128 supported
  • Head Dimensions - Supports 64, 96, 128, 256
  • Causal Masking - Built-in support for autoregressive models
  • CPU Reference - Validation implementation for correctness testing
  • 2x+ Speedup - On sequences 8K+ vs standard attention

Speculative Decoding (internal/generate/speculative.go):

  • Draft Model - Small model generates K candidate tokens speculatively
  • Parallel Verification - Target model verifies all candidates in single batch
  • Modified Rejection Sampling - Mathematically correct token acceptance
  • 2-4x Speedup - For autoregressive text generation
  • Configurable - Draft steps (K), temperature, sampling parameters
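The acceptance rule commonly used in speculative decoding can be sketched as follows: a draft token with target probability p and draft probability q is accepted with probability min(1, p/q), and on rejection a replacement is drawn from the normalized residual max(0, p - q). The function names are hypothetical and the random draw `u` is passed in to keep the example deterministic; this is the textbook scheme, not necessarily what speculative.go does line by line.

```go
package main

import (
	"fmt"
	"math"
)

// acceptDraft decides whether to keep a draft token. p is the target model's
// probability for the token, q the draft model's, u a uniform draw in [0, 1).
func acceptDraft(p, q, u float64) bool {
	if q <= 0 {
		return false // the draft model could not have proposed this token
	}
	return u < math.Min(1, p/q)
}

// residual returns the distribution to resample from after a rejection:
// max(0, p_i - q_i), renormalized. If p equals q it falls back to p.
func residual(p, q []float64) []float64 {
	out := make([]float64, len(p))
	sum := 0.0
	for i := range p {
		if d := p[i] - q[i]; d > 0 {
			out[i] = d
			sum += d
		}
	}
	if sum == 0 {
		copy(out, p)
		return out
	}
	for i := range out {
		out[i] /= sum
	}
	return out
}

func main() {
	p := []float64{0.5, 0.3, 0.2} // target distribution
	q := []float64{0.2, 0.6, 0.2} // draft distribution
	fmt.Println(acceptDraft(p[1], q[1], 0.4), acceptDraft(p[1], q[1], 0.6))
	fmt.Println(residual(p, q))
}
```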

GGUF Import (internal/gguf/):

  • Parser - Complete GGUF v3 format parsing (types, metadata, tensor info)
  • Loader - Memory-mapped tensor data loading
  • Dequantization - K-quants (Q4_K, Q5_K, Q6_K) and legacy block formats (Q8_0, Q4_0, Q4_1, Q5_0, Q5_1)
  • Converter - GGUF tensors to Born tensor format
  • llama.cpp Ecosystem - Load LLaMA, Mistral, DeepSeek, Qwen models
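As an illustration of block dequantization, here is a sketch for Q4_0, following the ggml layout as I understand it: each block holds one scale and 32 four-bit values packed into 16 bytes, with low nibbles holding elements 0 to 15 and high nibbles elements 16 to 31. The type and function names are hypothetical, and the scale is assumed to be already converted from float16 to float32. This is not the code in internal/gguf.

```go
package main

import "fmt"

// q40Block is a hypothetical in-memory form of one Q4_0 block.
type q40Block struct {
	Scale  float32  // stored as float16 on disk; assumed pre-converted here
	Quants [16]byte // two 4-bit values per byte
}

// dequantQ40 expands one block to 32 float32 values: (nibble - 8) * scale.
func dequantQ40(b q40Block) [32]float32 {
	var out [32]float32
	for j, packed := range b.Quants {
		lo := int(packed&0x0F) - 8 // element j
		hi := int(packed>>4) - 8   // element j + 16
		out[j] = float32(lo) * b.Scale
		out[j+16] = float32(hi) * b.Scale
	}
	return out
}

func main() {
	blk := q40Block{Scale: 0.5}
	for i := range blk.Quants {
		blk.Quants[i] = 0x88 // both nibbles 8, which decodes to 0
	}
	blk.Quants[0] = 0xF0 // low nibble 0 -> -8, high nibble 15 -> 7
	vals := dequantQ40(blk)
	fmt.Println(vals[0], vals[16], vals[1]) // -4 3.5 0
}
```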

Code Quality:

  • Fixed 226 gosec G115 integer overflow warnings across codebase
  • All files properly formatted (gofmt)
  • 0 linter issues (golangci-lint)

Tests:

  • Flash Attention: GPU vs CPU correctness validation (< 1e-4 error)
  • Speculative Decoding: 11 tests, 93.1% coverage
  • GGUF: 52 tests, 75% coverage

Files Added:

  • internal/nn/flash_attention.go - Flash Attention module
  • internal/nn/online_softmax.go - Online softmax implementation
  • internal/nn/flash_attention_test.go - CPU tests
  • internal/nn/flash_attention_gpu_test.go - GPU tests
  • internal/backend/webgpu/flash_attention.go - GPU execution
  • internal/backend/webgpu/shaders.go - Added flashAttentionShader
  • internal/generate/speculative.go - Speculative decoding
  • internal/generate/speculative_test.go - Speculative tests
  • internal/gguf/ - Complete GGUF package (types, parser, loader, dequant, convert)

0.6.0 - 2025-12-04

🚀 ONNX Import & Lazy GPU Mode

Major release adding ONNX model import and GPU-resident lazy evaluation for dramatically improved performance.

ONNX Import API (internal/onnx/):

  • ONNX Parser - Parse .onnx model files (protobuf format)
  • Model Loader - Load weights and construct computation graph
  • 30+ Operators - Standard ONNX operator support:
    • Activations: ReLU, Sigmoid, Tanh, Softmax, GELU, LeakyReLU
    • Math: MatMul, Add, Mul, Div, Sub, Sqrt, Pow, Exp, Log
    • Shape: Reshape, Transpose, Squeeze, Unsqueeze, Concat, Split
    • Utility: Gather, Slice, Cast, Constant, Identity, Flatten
  • Operator Registry - Extensible operator registration system

Lazy GPU Evaluation (internal/tensor/lazy_gpu.go):

  • GPU-Resident Tensors - Data stays on GPU until explicitly needed
  • LazyGPUData - Reference to GPU buffer with lazy CPU transfer
  • Automatic Memory Management - runtime.SetFinalizer for GPU buffer cleanup
  • Zero CPU Round-trips - Chained operations stay entirely on GPU

Command Batching (internal/backend/webgpu/):

  • Batch GPU Commands - Accumulate commands instead of immediate submit
  • Reduced Sync Overhead - ~200 submits → 1-2 per operation chain
  • FlushCommands() - Explicit synchronization when needed
  • Performance Impact: ~90s/step → <5s/step for model training

GPU-to-GPU Copy:

  • CopyBufferToBuffer - Direct GPU memory transfer
  • No CPU Round-trip - Eliminated GPU→CPU→GPU transfers in lazy chains
  • ~100x Speedup - Per-operation transfer overhead eliminated

Raw Tensor Operations (internal/tensor/raw_ops.go):

  • 50+ Operations - Comprehensive tensor manipulation
  • Argmax, TopK - Selection operations
  • Type Conversions - Float32, Int32, Bool conversions
  • Broadcasting - NumPy-style shape broadcasting
  • Advanced Indexing - Gather, Scatter operations
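NumPy-style broadcasting aligns shapes from the trailing dimension; two sizes are compatible when they are equal or one of them is 1. A minimal sketch of the shape computation follows; `broadcastShapes` is a hypothetical name, not a function from raw_ops.go.

```go
package main

import "fmt"

// broadcastShapes returns the broadcast result shape of a and b, or an error
// if some aligned pair of dimensions is incompatible.
func broadcastShapes(a, b []int) ([]int, error) {
	n := len(a)
	if len(b) > n {
		n = len(b)
	}
	out := make([]int, n)
	for i := 1; i <= n; i++ {
		da, db := 1, 1 // missing leading dimensions behave like size 1
		if i <= len(a) {
			da = a[len(a)-i]
		}
		if i <= len(b) {
			db = b[len(b)-i]
		}
		switch {
		case da == db || db == 1:
			out[n-i] = da
		case da == 1:
			out[n-i] = db
		default:
			return nil, fmt.Errorf("cannot broadcast %v with %v: %d vs %d", a, b, da, db)
		}
	}
	return out, nil
}

func main() {
	shape, err := broadcastShapes([]int{8, 1, 6, 1}, []int{7, 1, 5})
	fmt.Println(shape, err) // [8 7 6 5] <nil>
}
```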

Bug Fixes:

  • Fixed GPU memory leak when lazy tensors go out of scope
  • Fixed typed accessors (AsInt32, AsInt64, etc.) bypassing lazy realization
  • Fixed Where and Sum operations missing lazy mode support

Tests:

  • 15+ new ONNX tests (parser, loader, operators)
  • Lazy mode chain tests
  • Command batching tests

Files Added:

  • internal/onnx/ - Complete ONNX import package
  • internal/tensor/lazy_gpu.go - Lazy GPU data structures
  • internal/tensor/raw_ops.go - Raw tensor operations
  • internal/backend/webgpu/lazy_compute.go - Lazy GPU operations
  • internal/backend/webgpu/gpu_*.go - GPU tensor and autodiff support

0.5.5 - 2025-12-03

⚡ WebGPU Performance Hotfix

Critical performance fix for transformer training on WebGPU backend.

Problem Fixed:

  • Multi-dimensional Transpose operations (3D+) were falling back to CPU
  • Expand (broadcasting) was CPU-only
  • Result: ~60s/batch for small transformer models (should be <1s)

New GPU Operations:

  • TransposeND shader - N-dimensional transpose on GPU (up to 6D)
  • Expand shader - NumPy-style broadcasting on GPU
  • Both support float32 and int32 data types

Performance Impact:

  • ~60x speedup for attention operations
  • Transformer training now usable on WebGPU

Tests:

  • 9 new tests: TestTranspose3D, TestTranspose4D, TestTranspose5D, TestExpandBroadcast, etc.

Files Changed:

  • internal/backend/webgpu/shaders.go - Added WGSL shaders
  • internal/backend/webgpu/compute.go - Added runTransposeND, runExpand
  • internal/backend/webgpu/ops.go - Removed CPU fallback
  • internal/backend/webgpu/ops_extended.go - Removed CPU fallback
  • internal/backend/webgpu/ops_nd_test.go - New test file

0.5.4 - 2025-12-03

💾 Model Serialization

Production-ready model serialization with Format v2 best practices.

New Features:

  • Born Native Format v2 (.born) - SHA-256 checksum, security validation
  • Checkpoint API - Save/resume training with optimizer state
  • SafeTensors Export - HuggingFace ecosystem compatibility
  • Memory-Mapped Reader - Efficient loading for 70GB+ models
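Checksum validation of a serialized file can be sketched with `crypto/sha256`. The layout used here (payload followed by a 32-byte digest trailer) and both function names are assumptions made for the example; the actual .born v2 layout is not described in this changelog.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"errors"
	"fmt"
)

// appendChecksum returns payload followed by its SHA-256 digest.
// The trailer layout is hypothetical.
func appendChecksum(payload []byte) []byte {
	sum := sha256.Sum256(payload)
	out := make([]byte, 0, len(payload)+sha256.Size)
	out = append(out, payload...)
	return append(out, sum[:]...)
}

// verifyChecksum splits off the trailer and rejects corrupted payloads.
func verifyChecksum(file []byte) ([]byte, error) {
	if len(file) < sha256.Size {
		return nil, errors.New("file too short to contain a checksum")
	}
	split := len(file) - sha256.Size
	payload, stored := file[:split], file[split:]
	sum := sha256.Sum256(payload)
	if !bytes.Equal(sum[:], stored) {
		return nil, errors.New("checksum mismatch")
	}
	return payload, nil
}

func main() {
	file := appendChecksum([]byte("weights"))
	payload, err := verifyChecksum(file)
	fmt.Println(string(payload), err)

	file[0] ^= 0xFF // simulate corruption
	_, err = verifyChecksum(file)
	fmt.Println(err)
}
```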

API:

  • nn.Save(model, "model.born", "ModelType", metadata) - Save model
  • nn.Load("model.born", backend, model) - Load model
  • nn.SaveCheckpoint(path, model, optimizer, epoch, step, loss) - Save checkpoint
  • nn.LoadCheckpoint(path, backend, model, optimizer) - Resume training
  • serialization.WriteSafeTensors(path, tensors, metadata) - Export for HuggingFace

New Package:

  • internal/serialization - Format writer/reader, validation, mmap

Tests:

  • 26 new tests for serialization, checkpoints, SafeTensors

0.5.3 - 2025-12-02

🐛 WebGPU Backend Fixes (HRM Compatibility)

Bug Fixes:

  • Comparison ops - Now always return float32 (0.0/1.0), even for int32 inputs
  • Sum int32 - Added WGSL shader for int32 sum reduction
  • Sum scalar shape - Fixed return shape from [1] to [] for proper scalar handling
  • Where int32 condition - Added support for int32 condition tensors
  • Where broadcasting - Added NumPy-style broadcasting (like Burn)
  • Gather backward - Support for int32, int64, float32 index tensors

New Functions:

  • runComparisonOp - Dedicated function for comparison operations
  • int32ToFloat32 - Helper for int32 to float32 conversion

Tests:

  • 3 new Gather backward tests (int64 indices, boundary, dim0 2D)

0.5.2 - 2025-12-01

✨ Public WebGPU API

  • Added public backend/webgpu package with NewBackend() function
  • Windows build tag support for WebGPU
  • Updated README with WebGPU API example

0.5.1 - 2025-12-01

🐛 Fixes

  • Minor fixes after v0.5.0 release

0.5.0 - 2025-12-01

🚀 Phase 5: LLM Support

Major release adding complete LLM inference support! Run LLaMA, Mistral, DeepSeek, and other modern language models with Born.

✨ Added

Grouped Query Attention (GQA) (internal/nn/gqa.go):

  • GroupedQueryAttention - Memory-efficient attention for LLaMA 2/3, Mistral
  • RepeatKV - KV head broadcasting (e.g., 8 KV heads → 32 Q heads)
  • MQA helper - Multi-Query Attention config (extreme GQA with 1 KV head)
  • Full RoPE integration with KV-cache support
  • 4:1 memory savings for KV-cache vs standard MHA
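The head mapping and the 4:1 figure can be checked with a little arithmetic. In the sketch below, `kvHeadFor` maps a query head to the KV head it shares (contiguous groups, as in LLaMA-style repeat-KV), and `kvCacheBytes` sizes a cache as 2 (K and V) × layers × KV heads × head dim × sequence length × bytes per element. Both functions and the model dimensions in `main` are illustrative assumptions.

```go
package main

import "fmt"

// kvHeadFor returns the KV head used by query head qHead, assuming
// numQHeads is a multiple of numKVHeads.
func kvHeadFor(qHead, numQHeads, numKVHeads int) int {
	groupSize := numQHeads / numKVHeads // query heads per KV head
	return qHead / groupSize
}

// kvCacheBytes estimates KV-cache size for one sequence.
func kvCacheBytes(layers, kvHeads, headDim, seqLen, bytesPerElem int64) int64 {
	return 2 * layers * kvHeads * headDim * seqLen * bytesPerElem
}

func main() {
	// Illustrative dimensions: 32 layers, head dim 128, 4096 tokens, float16.
	mha := kvCacheBytes(32, 32, 128, 4096, 2) // 32 KV heads (standard MHA)
	gqa := kvCacheBytes(32, 8, 128, 4096, 2)  // 8 KV heads (GQA)
	fmt.Println(mha/gqa, kvHeadFor(31, 32, 8)) // 4 7
}
```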

SwiGLU & GLU Variants (internal/nn/glu.go, internal/nn/swiglu_ffn.go):

  • SwiGLU - x * SiLU(gate) activation (LLaMA, Mistral)
  • GeGLU - x * GELU(gate) activation
  • ReGLU - x * ReLU(gate) activation
  • GLU - x * sigmoid(gate) (classic)
  • SwiGLUFFN - Complete feed-forward module with gate/up/down projections
  • Configurable bias (LLaMA uses no bias)

Model Loader (internal/loader/):

  • GGUF format support - Read LLaMA, Mistral, DeepSeek model files
  • GGUFReader - Parse metadata and tensor info
  • Weight Mappers - Architecture-specific weight name translation
    • LLaMAMapper - LLaMA 1/2/3 models
    • MistralMapper - Mistral 7B and variants
    • DeepSeekMapper - DeepSeek models
  • DetectArchitecture - Auto-detect model type from tensor names
  • Support for F32, F16 dtypes (quantized types require dequant)

Tokenizer Integration (internal/tokenizer/):

  • TikToken - OpenAI's BPE tokenizer (GPT-3.5, GPT-4)
  • BPE Tokenizer - Generic Byte Pair Encoding
  • HuggingFace format - Load tokenizer.json from HF models
  • Chat Templates - Format multi-turn conversations
    • ChatML (OpenAI style)
    • LLaMA (Meta format)
    • Mistral (with [INST] tags)
  • Special tokens - BOS, EOS, PAD, UNK handling
  • AutoLoad - Auto-detect tokenizer type from path

Sampling Strategies (internal/generate/sampling.go):

  • Temperature - Control randomness (0 = greedy)
  • Top-K - Sample from top K tokens
  • Top-P (nucleus) - Sample from the smallest set of tokens whose cumulative probability reaches P
  • Min-P - Filter tokens below P * max_prob threshold
  • Repetition Penalty - Penalize repeated tokens
  • Frequency Penalty - Penalize based on token frequency
  • Presence Penalty - Penalize based on token presence
  • Configurable seed - Reproducible sampling
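Two of the filters above, Top-P and Min-P, can be sketched as functions that return the indices of the tokens that remain eligible for sampling. The function names and the index-based return value are choices made for this example, not the API of sampling.go.

```go
package main

import (
	"fmt"
	"sort"
)

// topP returns token indices, most probable first, forming the smallest
// prefix whose cumulative probability reaches p.
func topP(probs []float64, p float64) []int {
	idx := make([]int, len(probs))
	for i := range idx {
		idx[i] = i
	}
	// Sorting is the O(n log n) step.
	sort.SliceStable(idx, func(a, b int) bool { return probs[idx[a]] > probs[idx[b]] })
	cum := 0.0
	for n, i := range idx {
		cum += probs[i]
		if cum >= p {
			return idx[:n+1]
		}
	}
	return idx
}

// minP keeps tokens whose probability is at least p times the maximum.
func minP(probs []float64, p float64) []int {
	maxProb := 0.0
	for _, v := range probs {
		if v > maxProb {
			maxProb = v
		}
	}
	var keep []int
	for i, v := range probs {
		if v >= p*maxProb {
			keep = append(keep, i)
		}
	}
	return keep
}

func main() {
	probs := []float64{0.1, 0.5, 0.2, 0.2}
	fmt.Println(topP(probs, 0.6)) // [1 2]
	fmt.Println(minP(probs, 0.3)) // [1 2 3]
}
```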

Text Generation (internal/generate/generator.go):

  • TextGenerator - High-level API for text generation
  • Streaming API - Token-by-token generation with channels
  • Chat API - Multi-turn conversation with templates
  • GenerateConfig - Max tokens, min tokens, stop strings/tokens
  • GenerateResult - Token, token ID, done flag, reason
  • KV-cache integration - Efficient autoregressive generation
  • Echo prompt - Optionally include prompt in output

Multi-Output Autodiff (internal/autodiff/ops/):

  • MultiOutputOperation - Interface for ops with multiple outputs
  • BackwardMulti - Compute gradients for multi-output ops
  • ChunkOp - Fixed backward pass for tensor chunking
  • GatherOp - Scatter-add gradient computation

Public API (nn/, generate/, tokenizer/, loader/):

  • Complete public wrappers for all new types
  • Type aliases for seamless internal/public integration
  • Documentation with examples

📊 Testing

  • 100+ new unit tests across all LLM modules
  • Comprehensive sampling tests - All strategies validated
  • Generator tests - Streaming, stop conditions, chat
  • Tokenizer tests - Encode/decode roundtrip, special tokens
  • 0 golangci-lint issues

🧪 Test Coverage

| Package | Tests | Status |
|---|---|---|
| internal/nn (GQA, SwiGLU) | 35+ | |
| internal/tokenizer | 27 | |
| internal/generate | 17 | |
| internal/loader | 10+ | |
| internal/autodiff/ops | 20+ | |

🎯 What You Can Build Now

import (
    "fmt"

    "github.com/born-ml/born/generate"
    "github.com/born-ml/born/tokenizer"
    "github.com/born-ml/born/loader"
)

// Load tokenizer
tok, _ := tokenizer.NewTikTokenForModel("gpt-4")

// Load model
model, _ := loader.OpenModel("llama-7b.gguf")

// Create generator
gen := generate.NewTextGenerator(model, tok, generate.SamplingConfig{
    Temperature: 0.7,
    TopP:        0.9,
    TopK:        40,
})

// Generate text
result, _ := gen.Generate("Hello!", generate.GenerateConfig{MaxTokens: 100})

// Or stream tokens
stream, _ := gen.GenerateStream("Once upon", generate.GenerateConfig{MaxTokens: 50})
for chunk := range stream {
    fmt.Print(chunk.Token)
}

// Chat with templates
messages := []tokenizer.ChatMessage{
    {Role: "user", Content: "What is 2+2?"},
}
response, _ := gen.Chat(messages, tokenizer.NewChatMLTemplate(), config)

📈 Performance

| Feature | Benchmark |
|---|---|
| GQA 32Q/8KV | 4x KV-cache memory savings |
| SwiGLU FFN | 2.7x expansion (vs 4x standard) |
| TikToken | ~1M tokens/sec encoding |
| Top-P sampling | O(n log n) sorting |

0.4.0 - 2025-12-01

🚀 Phase 4: Attention Mechanisms

Major release adding complete transformer architecture support! Build GPT, LLaMA, BERT, and modern LLM architectures with Born.

✨ Added

Attention Mechanisms (internal/nn/):

  • Scaled Dot-Product Attention (SDPA) - Core attention with optional mask and dropout
  • Multi-Head Attention (MHA) - Full implementation with WQ, WK, WV, WO projections
  • KV-Cache - Efficient autoregressive generation (3.94x speedup for 100 tokens)

Normalization Layers (internal/nn/):

  • LayerNorm - Classic layer normalization with learnable gamma/beta
  • RMSNorm - Root Mean Square normalization (LLaMA style)

Positional Encodings (internal/nn/):

  • RoPE (Rotary Position Embedding) - Used by LLaMA, Mistral, DeepSeek
  • ALiBi (Attention with Linear Biases) - Used by BLOOM, MPT
  • Sinusoidal - Original Transformer positional encoding
  • Learned - Trainable position embeddings (GPT-2 style)

Transformer Building Blocks (internal/nn/):

  • TransformerBlock - Complete transformer layer with:
    • Pre-Norm (LLaMA style) and Post-Norm (original) support
    • RMSNorm or LayerNorm selection
    • Configurable attention and FFN dimensions
  • FFN (Feed-Forward Network) - SiLU activation (LLaMA style)
  • ForwardWithCache - Efficient inference with KV-cache

Tensor Operations (internal/tensor/, internal/backend/cpu/):

  • BatchMatMul - Native 3D/4D batched matrix multiplication
    • [B, M, K] @ [B, K, N] → [B, M, N] (3D)
    • [B, H, M, K] @ [B, H, K, N] → [B, H, M, N] (4D)
  • Refactored SDPA to use BatchMatMul (-40% code)
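The 3D case `[B, M, K] @ [B, K, N] → [B, M, N]` amounts to an independent matrix product per batch entry. A reference sketch on flat row-major slices follows; the function name and signature are hypothetical and no attempt is made at the optimizations a real backend would apply.

```go
package main

import "fmt"

// batchMatMul multiplies batch pairs of (m×k) and (k×n) matrices stored
// contiguously in row-major order and returns batch (m×n) results.
func batchMatMul(a, b []float32, batch, m, k, n int) []float32 {
	out := make([]float32, batch*m*n)
	for bi := 0; bi < batch; bi++ {
		aMat := a[bi*m*k : (bi+1)*m*k]
		bMat := b[bi*k*n : (bi+1)*k*n]
		oMat := out[bi*m*n : (bi+1)*m*n]
		for i := 0; i < m; i++ {
			for p := 0; p < k; p++ {
				av := aMat[i*k+p]
				for j := 0; j < n; j++ {
					oMat[i*n+j] += av * bMat[p*n+j]
				}
			}
		}
	}
	return out
}

func main() {
	a := []float32{1, 2, 3, 4} // two 1×2 matrices
	b := []float32{5, 6, 7, 8} // two 2×1 matrices
	fmt.Println(batchMatMul(a, b, 2, 1, 2, 1)) // [17 53]
}
```

The 4D case can reuse the same routine by folding `B` and `H` into a single batch dimension of size `B*H`.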

🔧 Fixed

  • Scalar gradient broadcasting - Fixed reduceBroadcast panic when propagating scalar gradients
  • Multi-dim Softmax backward - Now supports 3D/4D tensors (not just 2D)

📊 Testing

  • 70+ new unit tests across attention modules
  • Comprehensive benchmarks for all new components
  • 0 golangci-lint issues
  • KV-Cache: 3.94x speedup verified
  • Parameter counts verified (7.1M per transformer block, matching GPT-2)

🎯 What You Can Build Now

import (
    "github.com/born-ml/born/nn"
    "github.com/born-ml/born/tensor"
)

// Create a transformer block (GPT-2 small dimensions, LLaMA-style Pre-Norm + RMSNorm)
config := nn.TransformerConfig{
    EmbedDim:   768,
    NumHeads:   12,
    FFNDim:     3072,
    NormFirst:  true,   // Pre-Norm (LLaMA)
    UseRMSNorm: true,   // RMSNorm (LLaMA)
    NormEps:    1e-5,
}
block := nn.NewTransformerBlock(config, backend)

// Forward pass
x := tensor.Randn[float32](tensor.Shape{1, 512, 768}, backend)
output := block.Forward(x, nil)

// With KV-Cache for generation
cache := nn.NewKVCache(1, 12, 2048, 64, backend)
for i := 0; i < 100; i++ {
    token := getNextToken() // placeholder for your own token source
    output = block.ForwardWithCache(token, cache) // assign, so no unused shadow variable is declared
}

📈 Performance

| Operation | Benchmark |
|---|---|
| SDPA (512 seq) | 89.2% coverage |
| MHA (768d/12h) | 2.3M params verified |
| KV-Cache (100 tokens) | 3.94x speedup |
| TransformerBlock | ~7.1M params/block |
| RoPE (2048 seq) | Pre-computed cos/sin |

0.3.0 - 2025-11-30

🚀 Phase 2.5: Transformer Primitives + Public API

Major release adding essential operations for modern transformer architectures (LLaMA, Mistral, GPT), the HRM Model, and 31 type-safe public API operations!

✨ Added

Math Operations (internal/backend/cpu/math.go, internal/autodiff/ops/):

  • Exp() - Exponential function with gradient support
  • Sqrt() - Square root with stable gradients
  • Rsqrt() - Reciprocal square root (1/√x) for normalization layers
  • Cos() - Cosine for RoPE (Rotary Position Embedding)
  • Sin() - Sine for RoPE implementations

Reduction Operations (internal/backend/cpu/reduce.go):

  • SumDim(dim, keepDim) - Sum along dimension with optional keepDim
  • MeanDim(dim, keepDim) - Mean along dimension with optional keepDim
  • Supports negative dimensions (-1 for last dimension)
  • Broadcasting-aware for gradient computation
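Negative dimensions and `keepDim` can be illustrated on a 2D tensor stored row-major. `normalizeDim` maps -1 to the last axis, and `keepDim` decides whether the reduced axis stays in the output shape with size 1. Both functions are hypothetical and limited to rank 2 for brevity.

```go
package main

import "fmt"

// normalizeDim converts a possibly negative dim into the range [0, rank).
func normalizeDim(dim, rank int) (int, error) {
	if dim < -rank || dim >= rank {
		return 0, fmt.Errorf("dim %d out of range for rank %d", dim, rank)
	}
	if dim < 0 {
		dim += rank
	}
	return dim, nil
}

// sumDim2D sums a rows×cols matrix along dim and returns data and shape.
func sumDim2D(data []float64, rows, cols, dim int, keepDim bool) ([]float64, []int, error) {
	d, err := normalizeDim(dim, 2)
	if err != nil {
		return nil, nil, err
	}
	if len(data) != rows*cols {
		return nil, nil, fmt.Errorf("data length %d does not match %dx%d", len(data), rows, cols)
	}
	outLen, shape := cols, []int{cols}
	if keepDim {
		shape = []int{1, cols}
	}
	if d == 1 {
		outLen, shape = rows, []int{rows}
		if keepDim {
			shape = []int{rows, 1}
		}
	}
	out := make([]float64, outLen)
	for r := 0; r < rows; r++ {
		for c := 0; c < cols; c++ {
			if d == 0 {
				out[c] += data[r*cols+c]
			} else {
				out[r] += data[r*cols+c]
			}
		}
	}
	return out, shape, nil
}

func main() {
	data := []float64{1, 2, 3, 4, 5, 6} // shape [2, 3]
	fmt.Println(sumDim2D(data, 2, 3, -1, true)) // [6 15] [2 1] <nil>
}
```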

Tensor Manipulation (internal/backend/cpu/manipulation.go):

  • Cat(tensors, dim) - Concatenate tensors along dimension
  • Chunk(n, dim) - Split tensor into n equal chunks
  • Unsqueeze(dim) - Add dimension of size 1
  • Squeeze(dim) - Remove dimensions of size 1

Indexing Operations (internal/backend/cpu/indexing.go):

  • Gather(dim, index) - Select elements using index tensor
  • Where(condition, x, y) - Conditional element selection

Neural Network Layers (internal/nn/):

  • SiLU (Swish) activation: x * sigmoid(x) with autodiff
  • RMSNorm layer: Root Mean Square Normalization with learnable gamma
  • Embedding layer: Token lookup table for NLP models

Gradient Control (internal/autodiff/):

  • NoGrad(func) - Context manager to disable gradient recording (inference mode)
  • Detach() - Break gradient chain while keeping tensor values

Public API Operations (internal/tensor/ops_extended.go, tensor/):

31 type-safe operations now available via github.com/born-ml/born/tensor:

  • Scalar (4): MulScalar, AddScalar, SubScalar, DivScalar
  • Math (6): Log, Exp, Sqrt, Rsqrt, Cos, Sin
  • Activation (1): Softmax(dim)
  • Comparison (12): Greater/Gt, Lower/Lt, GreaterEqual/Ge, LowerEqual/Le, Equal/Eq, NotEqual/Ne
  • Boolean (3): Or, And, Not
  • Reduction (2): Sum, Argmax
  • Type Conversion (6): Int32, Int64, Float32, Float64, Uint8, Bool
  • Shape (1): Expand

Example usage:

import "github.com/born-ml/born/tensor"

x := tensor.Randn[float32](tensor.Shape{2, 3}, backend)
y := x.MulScalar(2.0)           // Scalar operations
mask := x.Greater(y)            // Comparison (returns Tensor[bool, B])
z := x.Softmax(-1)              // Activation
total := x.Sum()                // Reduction
i := x.Int32()                  // Type conversion

📊 Testing

  • 112 new unit tests added across all features
  • 0 golangci-lint issues (maintained strict quality standards)
  • All autodiff operations validated with numerical gradient checking
  • Comprehensive edge case coverage (negative dims, broadcasting, etc.)

🧪 Test Coverage

| Package | Coverage | Tests |
|---|---|---|
| backend/cpu (math) | 79.0% | 23 |
| backend/cpu (reduce) | 80.2% | 17 |
| backend/cpu (manipulation) | - | 29 |
| backend/cpu (indexing) | - | 11 |
| autodiff/ops | 69.6% | - |
| nn (SiLU, RMSNorm, Embedding) | - | 18 |
| Total Phase 2.5 | - | 112 |

🔧 Changed

  • Updated tensor.Backend interface with new operations
  • Extended .golangci.yml with exclusions for intentional patterns
  • WebGPU backend stubs added for all new operations (CPU-only for now)

📦 New Files

internal/backend/cpu/
├── math.go              # Exp, Sqrt, Rsqrt, Cos, Sin
├── math_test.go         # 23 tests
├── reduce.go            # SumDim, MeanDim
├── reduce_test.go       # 17 tests
├── manipulation.go      # Cat, Chunk, Unsqueeze, Squeeze
├── indexing.go          # Gather, Where
└── indexing_test.go     # 11 tests

internal/autodiff/ops/
├── exp.go, sqrt.go, rsqrt.go, cos.go, sin.go
├── sumdim.go, meandim.go
├── silu.go
├── embedding.go
├── math_test.go
├── reduce_test.go
└── silu_test.go

internal/nn/
├── rmsnorm.go           # RMSNorm layer
├── rmsnorm_test.go      # 8 tests
├── embedding.go         # Embedding layer
├── embedding_test.go    # 8 tests
└── activation.go        # Added SiLU

internal/tensor/
└── ops_extended.go      # 31 public API wrappers (470 lines)

internal/backend/cpu/
├── scalar.go            # MulScalar, AddScalar, SubScalar, DivScalar
├── activation.go        # Softmax (n-dimensional, numerically stable)
├── comparison.go        # Greater, Lower, Equal, etc.
├── boolean.go           # Or, And, Not
├── conversion.go        # Cast for all dtype pairs
└── shape.go             # Expand with broadcasting

internal/backend/webgpu/
└── ops_extended.go      # Stubs + working Softmax

🎯 What This Enables

With Phase 2.5 primitives, Born can now support:

Transformer Components:

  • RoPE (Rotary Position Embedding) - built from Cos, Sin, Cat
  • SwiGLU activation - built from Linear, SiLU, Chunk
  • RMSNorm - directly available as layer
  • Stablemax (HRM) - built from Where, SumDim, Gather

Modern LLM Architectures:

  • ✅ LLaMA (Meta)
  • ✅ Mistral AI models
  • ✅ GPT-style transformers
  • HRM (Hierarchical Reasoning Model)

Inference Capabilities:

  • ✅ Token embedding lookup
  • ✅ Position encoding (RoPE)
  • ✅ Layer normalization (RMSNorm)
  • ✅ Modern activations (SiLU/Swish)
  • ✅ Gradient control for inference (NoGrad, Detach)

🚀 Coming in v0.4.0

  • Multi-head attention (MHA) layer
  • Layer normalization variants
  • More positional encodings (Absolute, Learned)
  • KV-cache for efficient inference
  • Linux/macOS WebGPU support

0.2.0 - 2025-11-28

🚀 Phase 2: WebGPU GPU Backend

Major release introducing GPU acceleration via WebGPU - the first production-ready Go ML framework with zero-CGO GPU support!

✨ Added

WebGPU Backend (internal/backend/webgpu/):

  • Zero-CGO GPU acceleration via go-webgpu v0.1.0
  • WGSL compute shaders for all tensor operations
  • Buffer pool with size-based categorization for memory efficiency
  • Memory statistics tracking (allocations, peak usage, pool hits/misses)
  • Graceful degradation when wgpu_native.dll not available (panic recovery)
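The buffer pool and its hit/miss statistics can be pictured with a small host-side sketch. The code below is only an analogy under stated assumptions: it pools ordinary byte slices instead of GPU buffers, rounds sizes up to powers of two as the size classes, and uses hypothetical names (bufferPool, sizeClass, acquire, release). Born's actual categorization scheme is not described in this changelog.

```go
package main

import "fmt"

// poolStats counts how often a request was served from the pool.
type poolStats struct{ hits, misses int }

// bufferPool keeps free buffers grouped by size class.
type bufferPool struct {
	free  map[int][][]byte
	stats poolStats
}

func newBufferPool() *bufferPool {
	return &bufferPool{free: make(map[int][][]byte)}
}

// sizeClass rounds n up to the next power of two, with a 256-byte minimum.
// Both choices are assumptions for this sketch.
func sizeClass(n int) int {
	c := 256
	for c < n {
		c <<= 1
	}
	return c
}

// acquire returns a buffer of length n, reusing a pooled one when possible.
func (p *bufferPool) acquire(n int) []byte {
	c := sizeClass(n)
	if list := p.free[c]; len(list) > 0 {
		buf := list[len(list)-1]
		p.free[c] = list[:len(list)-1]
		p.stats.hits++
		return buf[:n]
	}
	p.stats.misses++
	return make([]byte, n, c)
}

// release puts a buffer back into the list for its capacity class.
func (p *bufferPool) release(buf []byte) {
	c := cap(buf)
	p.free[c] = append(p.free[c], buf[:c])
}

func main() {
	p := newBufferPool()
	a := p.acquire(300) // miss: nothing pooled yet
	p.release(a)
	b := p.acquire(400) // hit: same 512-byte class
	fmt.Println(len(b), cap(b), p.stats.hits, p.stats.misses)
}
```

Grouping by size class trades some wasted capacity for a much higher reuse rate, since requests of slightly different sizes can share the same buffers.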

GPU Operations:

  • Element-wise: Add, Sub, Mul, Div
  • Matrix: MatMul (tiled algorithm, 16x16 workgroups)
  • Shape: Reshape, Transpose
  • Activations: ReLU, Sigmoid, Tanh, Softmax
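The tiled MatMul listed above runs as a WGSL compute shader. The same blocking idea can be sketched on the CPU in Go; this is an illustration only, not Born's shader. matMulTiled, minInt and the row-major slice layout are hypothetical, and the tile edge of 16 simply mirrors the 16x16 workgroup size.

```go
package main

import "fmt"

const tile = 16 // tile edge, mirroring the 16x16 workgroup size

func minInt(a, b int) int {
	if a < b {
		return a
	}
	return b
}

// matMulTiled multiplies a (m×k) by b (k×n), both row-major, and returns
// the m×n product. The work is split into tile×tile blocks so that each
// block of a and b is reused while it is still hot in cache.
func matMulTiled(a, b []float32, m, k, n int) []float32 {
	c := make([]float32, m*n)
	for i0 := 0; i0 < m; i0 += tile {
		for j0 := 0; j0 < n; j0 += tile {
			for k0 := 0; k0 < k; k0 += tile {
				iMax := minInt(i0+tile, m)
				jMax := minInt(j0+tile, n)
				kMax := minInt(k0+tile, k)
				for i := i0; i < iMax; i++ {
					for kk := k0; kk < kMax; kk++ {
						aik := a[i*k+kk]
						for j := j0; j < jMax; j++ {
							c[i*n+j] += aik * b[kk*n+j]
						}
					}
				}
			}
		}
	}
	return c
}

func main() {
	a := []float32{1, 2, 3, 4, 5, 6}    // 2×3
	b := []float32{7, 8, 9, 10, 11, 12} // 3×2
	fmt.Println(matMulTiled(a, b, 2, 3, 2)) // [58 64 139 154]
}
```

On a GPU the tiles are typically staged in workgroup-shared memory rather than relying on a CPU cache, but the loop structure is the same.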

CPU Backend Enhancements:

  • Softmax operation added
  • Backend now implements full tensor.Backend interface

Examples:

  • examples/mnist-gpu/ - CPU vs WebGPU benchmark (~123x MatMul speedup)

Documentation:

  • docs/PHILOSOPHY.md - Framework philosophy and design principles
  • docs/USE_CASES.md - Real-world use cases and deployment scenarios
  • Updated README with performance benchmarks

📊 Performance

Benchmarks (NVIDIA RTX GPU vs CPU):

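| Operation | Size | CPU | WebGPU | Speedup |
|-----------|------|-----|--------|---------|
| MatMul | 1024×1024 | 847ms | 6.9ms | 123x |
| MatMul | 512×512 | 105ms | 2.1ms | 50x |
| MatMul | 256×256 | 13ms | 1.3ms | 10x |
| Add | 1M elements | 1.2ms | 0.15ms | 8x |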

MNIST MLP Inference (batch=256):

  • CPU: ~45ms/batch
  • WebGPU: ~4.1ms/batch
  • Speedup: 10.9x

🔧 Changed

  • Build tags added for Windows-only WebGPU code (//go:build windows)
  • go.sum now committed (was incorrectly in .gitignore)
  • Updated all documentation for v0.2.0 milestone

🧪 Testing

  • 13 new WebGPU operation tests (ops_test.go)
  • 7 buffer pool tests (buffer_pool_test.go)
  • 26 benchmark functions for CPU vs GPU comparison
  • All tests pass on Ubuntu, macOS, Windows
  • WebGPU tests skip gracefully on systems without GPU support

📦 New Files

internal/backend/webgpu/
├── backend.go          # WebGPU backend initialization
├── ops.go              # Operation implementations
├── compute.go          # Compute pipeline management
├── shaders.go          # WGSL shader sources
├── buffer_pool.go      # GPU buffer pooling
└── *_test.go           # Tests and benchmarks
examples/mnist-gpu/
└── main.go             # GPU benchmark example
docs/
├── PHILOSOPHY.md       # Framework philosophy
└── USE_CASES.md        # Use cases

⚠️ Platform Support

  • Windows: Full WebGPU support (requires wgpu_native.dll)
  • Linux/macOS: CPU backend only (WebGPU builds skipped)
  • WebGPU on Linux/macOS planned for future release

🚀 Coming in v0.3.0

  • BatchNorm2D for training stability
  • Dropout for regularization
  • Model serialization (save/load)
  • Linux WebGPU support via Vulkan
  • ONNX model import

0.1.1 - 2025-11-17

🔥 Critical Hotfix

BREAKING (but necessary): v0.1.0 had no usable public API. All packages lived under internal/, which Go does not allow external projects to import. This hotfix adds proper public packages.

✨ Added

Public API Packages:

  • github.com/born-ml/born/tensor - Type-safe tensor operations
  • github.com/born-ml/born/nn - Neural network modules (Linear, Conv2D, MaxPool2D, etc.)
  • github.com/born-ml/born/optim - Optimizers (SGD, Adam)
  • github.com/born-ml/born/backend/cpu - CPU backend
  • github.com/born-ml/born/autodiff - Automatic differentiation

Documentation:

  • Comprehensive package documentation for pkg.go.dev
  • Usage examples in each package
  • API reference comments on all public types/functions

🔧 Changed

  • Updated examples to use public API
  • README updated with correct import paths

📦 Migration from v0.1.0

Before (v0.1.0 - broken for external use):

import "github.com/born-ml/born/internal/tensor"  // ❌ Cannot import!

After (v0.1.1 - works!):

import "github.com/born-ml/born/tensor"  // ✅ Public API

🧪 Testing

  • All tests pass (internal tests unchanged)
  • golangci-lint: 0 issues
  • Public packages compile successfully
  • Examples work with new imports

📊 Statistics

  • +876 lines of public API code
  • 9 new public files (doc.go + package wrappers)
  • 5 public packages created

0.1.0 - 2025-11-17

🎉 Initial Release

First public release of Born ML Framework - a modern, type-safe machine learning framework for Go.

Released in celebration of Go's 16th anniversary (November 10, 2009 - 2025) 🎂

✨ Features

Core Framework

  • Tensor API with generic type safety (Tensor[T, B])
  • Shape validation with NumPy-style broadcasting
  • Zero-copy operations where possible
  • Device abstraction (CPU, with GPU planned)
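NumPy-style broadcasting follows a simple rule: align shapes from the trailing dimension, and treat two sizes as compatible when they are equal or one of them is 1. The sketch below shows that rule in plain Go. broadcastShape is a hypothetical helper for illustration and is not part of Born's API.

```go
package main

import "fmt"

// broadcastShape returns the shape that results from broadcasting a and b,
// or an error if some pair of dimensions is incompatible.
func broadcastShape(a, b []int) ([]int, error) {
	n := len(a)
	if len(b) > n {
		n = len(b)
	}
	out := make([]int, n)
	// Walk both shapes from the last dimension; missing dimensions count as 1.
	for i := 1; i <= n; i++ {
		da, db := 1, 1
		if i <= len(a) {
			da = a[len(a)-i]
		}
		if i <= len(b) {
			db = b[len(b)-i]
		}
		switch {
		case da == db:
			out[n-i] = da
		case da == 1:
			out[n-i] = db
		case db == 1:
			out[n-i] = da
		default:
			return nil, fmt.Errorf("cannot broadcast %v with %v", a, b)
		}
	}
	return out, nil
}

func main() {
	fmt.Println(broadcastShape([]int{8, 1, 6, 1}, []int{7, 1, 5})) // [8 7 6 5] <nil>
	fmt.Println(broadcastShape([]int{3}, []int{4}))
}
```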

Automatic Differentiation

  • Tape-based reverse-mode autodiff
  • Decorator pattern (wraps any backend with autodiff)
  • Gradient tape with operation recording
  • Backward pass with efficient chain rule

Neural Network Modules

  • Linear layers with Xavier initialization
  • Conv2D (2D convolution) with im2col algorithm
  • MaxPool2D (2D max pooling)
  • Activation functions: ReLU, Sigmoid, Tanh
  • Loss functions: CrossEntropyLoss with numerical stability
  • Parameter management for optimization

Optimizers

  • SGD with momentum
  • Adam with bias correction

Backend

  • CPU Backend with optimized implementations
  • Im2col algorithm for efficient convolutions
  • Float32 and Float64 support
  • Batch processing

📊 Validated Performance

MNIST Classification:

  • MLP (2-layer): 97.44% accuracy (101,770 parameters)
  • CNN (LeNet-5): 98.18% accuracy (44,426 parameters)
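The parameter counts above can be reproduced with simple arithmetic. The layer sizes used below are assumptions, since this changelog does not list them: a 784-128-10 MLP, and a LeNet-5 style CNN on 28×28 inputs with two unpadded 5×5 convolutions (1→6 and 6→16 channels), 2×2 pooling after each, and fully connected layers 256→120→84→10. Under those assumptions the totals match the reported figures.

```go
package main

import "fmt"

// linearParams counts weights plus biases of a fully connected layer.
func linearParams(in, out int) int { return in*out + out }

// conv2dParams counts weights plus biases of a k×k convolution.
func conv2dParams(inC, outC, k int) int { return inC*outC*k*k + outC }

// mlpParams assumes a 784-128-10 network (hypothetical layer sizes).
func mlpParams() int {
	return linearParams(784, 128) + linearParams(128, 10)
}

// lenetParams assumes a LeNet-5 style layout on 28×28 inputs, where the
// flattened feature map after the second pool has 16*4*4 = 256 values.
func lenetParams() int {
	return conv2dParams(1, 6, 5) + // 156
		conv2dParams(6, 16, 5) + // 2,416
		linearParams(256, 120) + // 30,840
		linearParams(120, 84) + // 10,164
		linearParams(84, 10) // 850
}

func main() {
	fmt.Println(mlpParams())   // 101770
	fmt.Println(lenetParams()) // 44426
}
```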

📚 Examples

  • MNIST MLP - Fully connected network example
  • MNIST CNN - Convolutional neural network example (LeNet-5 style)

🧪 Testing

  • 33 new tests for Conv2D and MaxPool2D
  • Numerical gradient verification for all autodiff operations
  • Integration tests for end-to-end workflows
  • Overall test coverage: 53.7%

🏗️ Architecture

Zero External Dependencies (core framework):

  • Pure Go implementation
  • Standard library only
  • Type-safe generics (Go 1.25+)

📖 Documentation

  • Comprehensive README with quickstart
  • Example code with detailed comments
  • API documentation in code

🔧 Technical Highlights

  1. ReshapeOp - Enables gradient flow through reshape operations (critical for Conv2D bias)
  2. TransposeOp - Proper gradient propagation for matrix transposes
  3. Im2col Algorithm - Efficient convolution via matrix multiplication
  4. Max Index Tracking - For MaxPool2D gradient routing
  5. Xavier Initialization - For stable training
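The im2col idea from the list above can be sketched for the simplest case: one channel, stride 1, no padding. Each output position gets a row containing its k×k input patch, so the convolution reduces to a matrix-vector product. The functions im2col and conv2d below are hypothetical illustrations and do not mirror Born's internal signatures.

```go
package main

import "fmt"

// im2col unfolds a single-channel h×w image (row-major) into a matrix with
// one row per output position and k*k columns. Stride 1, no padding.
func im2col(img []float64, h, w, k int) (cols []float64, rows, width int) {
	outH, outW := h-k+1, w-k+1
	rows, width = outH*outW, k*k
	cols = make([]float64, rows*width)
	for oy := 0; oy < outH; oy++ {
		for ox := 0; ox < outW; ox++ {
			r := oy*outW + ox
			for ky := 0; ky < k; ky++ {
				for kx := 0; kx < k; kx++ {
					cols[r*width+ky*k+kx] = img[(oy+ky)*w+ox+kx]
				}
			}
		}
	}
	return cols, rows, width
}

// conv2d computes a cross-correlation as (im2col matrix) × (flattened kernel).
func conv2d(img, kernel []float64, h, w, k int) []float64 {
	cols, rows, width := im2col(img, h, w, k)
	out := make([]float64, rows)
	for r := 0; r < rows; r++ {
		for c := 0; c < width; c++ {
			out[r] += cols[r*width+c] * kernel[c]
		}
	}
	return out
}

func main() {
	img := []float64{1, 2, 3, 4, 5, 6, 7, 8, 9} // 3×3
	kernel := []float64{1, 0, 0, 1}             // 2×2, sums the diagonal of each patch
	fmt.Println(conv2d(img, kernel, 3, 3, 2))   // [6 8 12 14]
}
```

The cost of this approach is memory: overlapping patches are duplicated in the unfolded matrix, in exchange for being able to reuse a fast matrix multiplication routine.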

⚠️ Known Limitations

  • CPU-only (GPU support planned for v0.2.0)
  • No model save/load yet
  • Limited data augmentation
  • No distributed training

🚀 Coming in v0.2.0

  • BatchNorm2D for training stability
  • Dropout for regularization
  • Model serialization
  • Data augmentation
  • GPU backend (CUDA)

Release Notes

Breaking Changes

None (initial release)

Migration Guide

N/A (initial release)

Contributors

  • Claude Code AI Assistant
  • Born ML Project Team