Albor (Spanish: "dawn") — A sovereign Python code completion model trained from first principles using only the Sovereign AI stack.
Specification Book · Big Code Leaderboard · Full Spec
A 350M-parameter decoder-only transformer trained entirely in Rust with zero Python dependencies. It targets Python only, following the phi-1 playbook: maximal concentration on a single language, distilled from Qwen3-Coder-Next (an 80B MoE), then optimized through fine-tuning, merging, pruning, and quantization into a fast, local, zero-dependency code completion engine.
The goal is twofold:
- Produce a usable Python code assist model that runs anywhere Rust compiles
- Identify and fix every gap in the Sovereign AI stack that blocks end-to-end LLM development
Current status (2026-03-03): Phase 3 — 350M retraining with v2 data (139M tokens). The first run failed (ALB-060: with epochs=1, training stopped after 43 of 5,000 steps); fixed via the C-TRAINCFG-001 contract plus an expanded dataset (68K sequences), and a 50-step smoke test verified the fix (loss 10.39 → 5.92). 24+ upstream gaps fixed; 8 provable contracts pass audit. Qwen2.5-Coder-3B validated as interim teacher for distillation.
Big Code Models Leaderboard — no sub-1B model has ever appeared on this board. Albor aims to be the first.
| Model | Params | HumanEval pass@1 | On Leaderboard |
|---|---|---|---|
| phi-1 | 1.3B | 50.6% | Yes |
| DeciCoder-1B | 1.0B | 19.3% | Yes (smallest) |
| SantaCoder | 1.1B | 18.1% | Yes |
| StarCoderBase-1B | 1.0B | 15.2% | Yes |
| albor-distill (target) | 350M | >15% | Submission target |
| CodeGen-350M-mono | 350M | 12.8% | No |
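The HumanEval column above reports pass@1. For context, a minimal sketch of the standard unbiased pass@k estimator from the HumanEval/Codex evaluation (the function name and shape here are illustrative, not part of the Albor stack):

```rust
/// Unbiased pass@k estimator: pass@k = 1 - C(n - c, k) / C(n, k),
/// where n = samples generated per problem and c = samples that pass.
fn pass_at_k(n: u64, c: u64, k: u64) -> f64 {
    if n - c < k {
        return 1.0; // every size-k subset contains at least one passing sample
    }
    // 1 - prod_{i = n-c+1}^{n} (i - k) / i, computed incrementally for stability.
    let mut prob_all_fail = 1.0;
    for i in (n - c + 1)..=n {
        prob_all_fail *= (i - k) as f64 / i as f64;
    }
    1.0 - prob_all_fail
}

fn main() {
    // With k = 1 the estimator reduces to the plain pass rate c / n.
    let p = pass_at_k(10, 5, 1);
    println!("pass@1 = {p:.3}");
    assert!((p - 0.5).abs() < 1e-9);
}
```

Leaderboard numbers are typically pass@1 under greedy or low-temperature sampling, so the >15% target corresponds to solving roughly 25 of HumanEval's 164 problems on the first try.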
```
LLaMA-style decoder-only transformer
├── 24 layers, 1024 hidden dim, 16 attention heads, 4 KV heads (GQA)
├── SwiGLU FFN (4096 intermediate), RoPE, RMSNorm (pre-norm)
├── 32,768 vocab (ByteLevel BPE v2), 1024 context (GPU-resident; 2048 arch max)
├── ~370M parameters, GPU-resident with AdamW optimizer on 4090 (12 GB VRAM)
└── Fill-in-the-middle (FIM) trained for code completion
```
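FIM training rearranges each document so the model conditions on both sides of a gap. A minimal sketch of the common "PSM" (prefix-suffix-middle) layout used by SantaCoder/StarCoder-style models — the sentinel strings below are placeholders, not necessarily Albor's actual special tokens:

```rust
/// PSM fill-in-the-middle formatting sketch. The sentinel strings are
/// assumptions; Albor's real special tokens live in its 32,768-token
/// ByteLevel BPE vocabulary and may differ.
const FIM_PREFIX: &str = "<fim_prefix>";
const FIM_SUFFIX: &str = "<fim_suffix>";
const FIM_MIDDLE: &str = "<fim_middle>";

/// Build one FIM training sequence: the model sees prefix and suffix,
/// then learns to emit the middle span after the <fim_middle> sentinel.
fn to_fim_example(prefix: &str, middle: &str, suffix: &str) -> String {
    format!("{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{middle}")
}

fn main() {
    let src = "def add(a, b):\n    return a + b\n";
    // Split the document at arbitrary byte offsets into (prefix, middle, suffix).
    let (prefix, rest) = src.split_at(15);
    let (middle, suffix) = rest.split_at(11);
    let example = to_fim_example(prefix, middle, suffix);
    assert!(example.starts_with(FIM_PREFIX));
    println!("{example}");
}
```

At inference time the editor sends the text before and after the cursor as prefix and suffix, and the completion engine generates the middle.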
Stage 1: Pre-train base model → albor-base (~8% HumanEval)
Stage 2: Distill from Qwen3-Coder-Next → albor-distill (~13-15%)
Stage 3: Instruction fine-tune (LoRA) → albor-instruct (~14-16%)
Stage 4: Merge with complementary model → albor-merged (~15-17%)
Stage 5: Prune for efficiency → albor-pruned (~12-14%)
Stage 6: Quantize for deployment → albor-q4 (~14-16%, <50ms/tok CPU)
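Stage 6 compresses weights to 4 bits for fast CPU inference. A minimal sketch of symmetric per-block Q4 quantization in that spirit — the block size, value range, and layout are illustrative assumptions, not Albor's on-disk format:

```rust
/// Symmetric per-block 4-bit quantization sketch (Stage 6 flavor).
/// Block size and encoding are assumptions for illustration only.
const BLOCK: usize = 32;

/// Quantize one block of f32 weights to signed 4-bit values in [-7, 7]
/// plus a single f32 scale for the block.
fn quantize_q4(block: &[f32]) -> (f32, Vec<i8>) {
    let amax = block.iter().fold(0.0f32, |m, &x| m.max(x.abs()));
    let scale = if amax > 0.0 { amax / 7.0 } else { 1.0 };
    let q = block
        .iter()
        .map(|&x| (x / scale).round().clamp(-7.0, 7.0) as i8)
        .collect();
    (scale, q)
}

fn dequantize_q4(scale: f32, q: &[i8]) -> Vec<f32> {
    q.iter().map(|&v| v as f32 * scale).collect()
}

fn main() {
    let weights: Vec<f32> = (0..BLOCK).map(|i| (i as f32 * 0.37).sin()).collect();
    let (scale, q) = quantize_q4(&weights);
    let back = dequantize_q4(scale, &q);
    // Round-trip error is bounded by half a quantization step (scale / 2).
    for (w, b) in weights.iter().zip(&back) {
        assert!((w - b).abs() <= scale * 0.5 + 1e-6);
    }
    println!("scale = {scale:.4}");
}
```

The expected small accuracy dip and partial recovery in the stage estimates (~14-16% after quantization vs. ~15-17% merged) reflects exactly this bounded per-weight rounding error.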
Every component is pure Rust. No PyTorch, no Python, no external ML frameworks.
| Component | Role |
|---|---|
| aprender (apr) | Unified CLI for all model operations |
| entrenar | Training engine, autograd, optimizers, LoRA |
| trueno | SIMD/GPU tensor backend |
| realizar | Inference engine (teacher model, eval, serving) |
| alimentar | Data pipeline, Parquet I/O, HF Hub import |
| forjar | Pipeline orchestration (DAG engine, multi-machine) |
| presentar | Training visualization (TUI + WASM dashboards) |
| repartir | Distributed compute |
| batuta | Stack orchestration, falsification |
| bashrs | Shell fragment validation |
| provable-contracts | Design-by-contract verification |
| pmat | TDG scoring, compliance, fault patterns |
| certeza | Three-tier test effectiveness |
| Machine | Role | Key Spec |
|---|---|---|
| lambda | Student training (GPU) | RTX 4090 (24 GB), Threadripper |
| intel | Teacher inference, eval, data | 300 GB RAM, Xeon W-3245, 2x W5700X |
```sh
apr pipeline plan  configs/pipeline/albor.yaml   # Show full DAG, estimate everything
apr pipeline apply configs/pipeline/albor.yaml   # Execute (resumable, multi-machine)
apr pipeline status                              # What's converged / pending / failed
```

Apache-2.0