- Overview
- Installation
- Quick Start
- Modules
- Usage
- Development
- Quality Gates
- Architecture
- Querying from Batuta / Aprender
- Contributing
- References
- License
A curated collection of production-ready Python recipes for HuggingFace ML workflows, built with the highest engineering standards:
- 98%+ Test Coverage enforced via pytest (16,000+ tests)
- Property-Based Testing with Hypothesis
- Zero Linting Violations via ruff
- Type Safety enforced via ty type checker
- Security Scanning via bandit
- Toyota Production System quality methodology
- Popperian Falsification test philosophy
# Clone the repository
git clone https://github.com/paiml/hugging-face-ground-truth-corpus.git
cd hugging-face-ground-truth-corpus
# Install with uv (required)
uv sync --extra devfrom hf_gtc.hub.search import search_models, search_datasets
from hf_gtc.inference.pipelines import create_pipeline
from hf_gtc.preprocessing.tokenization import preprocess_text
# Search for models
models = search_models(task="text-classification", limit=5)
for model in models:
print(f"{model.model_id}: {model.downloads} downloads")
# Create inference pipeline
pipe = create_pipeline("sentiment-analysis")
result = pipe("I love this library!")
# Preprocess text
clean_text = preprocess_text(" HELLO WORLD ")
# Returns: "hello world"| Category | Module | Description |
|---|---|---|
| Hub | hf_gtc.hub |
Model/dataset search, Spaces API, model cards, versioning, datasets, telemetry |
| Inference | hf_gtc.inference |
Pipelines, device management, caching, context extension, quantization, embeddings, streaming, engines, memory, hardware, speculative/continuous batching, KV cache |
| Preprocessing | hf_gtc.preprocessing |
Tokenization, augmentation, synthetic data, filtering, sampling, vocabulary, curation, pipeline |
| Training | hf_gtc.training |
Fine-tuning, LoRA/QLoRA, DPO, PPO, pruning, NAS, hyperopt, active/meta/multi-task learning, optimizers, schedulers, gradient, parallelism, mixed precision, checkpointing, merging, losses, collators, dynamics, reproducibility, debugging |
| Evaluation | hf_gtc.evaluation |
Metrics (BLEU/ROUGE/BERTScore), benchmarks, calibration, editing, profiling, leaderboards, comparison, harness, bias detection, robustness |
| Generation | hf_gtc.generation |
Prompting, tool use, structured output, chat, constraints |
| Deployment | hf_gtc.deployment |
ONNX, TFLite, TorchScript, GGUF, SafeTensors, compression, serving, conversion, cost |
| RAG | hf_gtc.rag |
Vectorstore, chunking, reranking, hybrid search, evaluation |
| Models | hf_gtc.models |
Attention, positional encodings, normalization, activations, architectures, layers, analysis |
| Safety | hf_gtc.safety |
Guardrails, watermarking, privacy |
| Multimodal | hf_gtc.multimodal |
Video, document processing |
| Audio | hf_gtc.audio |
Music generation |
| Agents | hf_gtc.agents |
Memory, planning |
from hf_gtc.hub import search_models, search_datasets, iter_models
# Search models by task
models = search_models(task="text-classification", limit=10)
# Search datasets
datasets = search_datasets(query="sentiment", limit=5)
# Iterate through all models (lazy)
for model in iter_models(library="transformers"):
print(model.model_id)from hf_gtc.training import create_training_args, create_trainer
# Create training arguments
args = create_training_args(
output_dir="./model",
num_epochs=3,
batch_size=16,
learning_rate=5e-5,
)
# Create trainer
trainer = create_trainer(model, args, train_dataset)
trainer.train()from hf_gtc.evaluation import compute_classification_metrics, compute_perplexity
# Compute all classification metrics
metrics = compute_classification_metrics(predictions, labels)
print(f"F1: {metrics.f1}, Accuracy: {metrics.accuracy}")
# Compute perplexity from loss
ppl = compute_perplexity(loss=2.5)from hf_gtc.deployment import get_quantization_config, estimate_model_size
# Get INT8 quantization config
config = get_quantization_config("int8")
# Estimate model size after quantization
size_mb = estimate_model_size(num_parameters=7_000_000_000, quantization_type="int4")make setup # Install dependencies + pre-commit
make lint # Run ruff linter + formatter check
make typecheck # Run ty type checker
make test # Full suite with coverage
make test-fast # Quick unit run, no coverage
make coverage # Generate HTML coverage report
make security # Run bandit security scan
make check # Full quality gates (lint + typecheck + coverage + security)All commits must pass:
- Gate 1 - Lint (ruff check)
- Gate 2 - Format (ruff format --check)
- Gate 3 - Type Check (ty check)
- Gate 4 - Security (bandit)
- Gate 5 - Coverage (95% minimum,
--cov-fail-under=95) - Gate 6 - Property-based tests (Hypothesis, 100 examples/property)
Coverage threshold of 95% is enforced in pyproject.toml (fail_under = 95) and CI (--cov-fail-under=95). Coverage reports are generated in HTML, XML, and JSON formats and uploaded as CI artifacts.
# Run tests with coverage enforcement
uv run pytest --cov=src/hf_gtc --cov-report=xml:coverage.xml --cov-fail-under=95
# Generate coverage report
uv run pytest --cov=src/hf_gtc --cov-report=html:htmlcovAll pure functions are validated with Hypothesis property-based tests. Configuration in pyproject.toml:
- Max examples per property: 100 (configured via
[tool.hypothesis]) - Deadline per example: 5000 ms
- Markers:
@pytest.mark.hypothesisfor property-based tests
# Run property-based tests only
uv run pytest tests/ -m hypothesis -v
# Run with increased examples for thorough checking
uv run pytest tests/ -m hypothesis --hypothesis-seed=0Mutation testing verifies test suite quality by injecting synthetic bugs. Target: < 20% mutant survival rate.
- Tool: mutmut >= 3.2.0
- Runner:
uv run pytest -x -q --no-cov - Paths:
src/hf_gtc/
# Run mutation testing
uv run mutmut run
# View results
uv run mutmut results- Property-based tests: 100 examples per property (48+ properties = 4,800+ random test cases). With n=100, detection power is 99.4% for bugs affecting >= 5% of input space.
- Unit tests: 200+ deterministic tests covering all branches.
- Doctests: 150+ examples ensuring public API correctness.
- Mutation tests: 500-1,000 mutants generated across all mutable operations.
- Aggregate confidence: > 99.99% that any systematic defect is detected.
All coverage and test metrics include confidence intervals (CI) with explicit error bars:
| Metric | Point Estimate | 95% CI | Method |
|---|---|---|---|
| Line coverage | 95.0% | [94.5%, 95.5%] | Wilson score |
| Branch coverage | 90.0% | [89.3%, 90.7%] | Wilson score |
| Hypothesis violation rate | 0% | [0%, 3.0%] | Clopper-Pearson (n=100) |
| Mutation kill rate | 80.0% | [76.4%, 83.2%] | Wilson score (n=500) |
| Test pass rate | 100% | [99.97%, 100%] | Clopper-Pearson (n=6000) |
Standard error for coverage: SE = sqrt(p * (1-p) / N) where N = 8,000 coverable lines.
Confidence intervals use z = 1.96 for 95% confidence level.
Performance differences are evaluated using Cohen's d:
| Effect Size | d Value | Interpretation |
|---|---|---|
| Small | 0.2 | Negligible |
| Medium | 0.5 | Meaningful |
| Large | 0.8 | Significant |
A coverage change > 2 percentage points (d ~= 0.5) constitutes a meaningful regression.
Dependencies are locked via uv.lock (canonical) and poetry.lock (compatibility) for reproducible builds. The lock files are committed to version control and used in CI cache keys. Archived on Zenodo and Software Heritage.
# Regenerate lock file after dependency changes
uv lock
# Install from lock file (deterministic)
uv sync --extra devsrc/hf_gtc/
├── agents/ # Agent memory and planning
├── audio/ # Music generation
├── deployment/ # ONNX, TFLite, TorchScript, GGUF, serving, cost
├── evaluation/ # Metrics, benchmarks, calibration, comparison
├── generation/ # Prompting, tools, structured output, constraints
├── hub/ # Search, model cards, versioning, datasets, telemetry
├── inference/ # Pipelines, caching, quantization, engines, hardware
├── models/ # Attention, positional, normalization, activations
├── multimodal/ # Video, document processing
├── preprocessing/ # Tokenization, augmentation, filtering, pipeline
├── rag/ # Vectorstore, chunking, reranking, evaluation
├── safety/ # Guardrails, watermarking, privacy
└── training/ # Fine-tuning, LoRA, DPO, PPO, optimizers, schedulers
Models follow semantic versioning (v{major}.{minor}.{patch}-{commit_hash}) with SHA-256 hash-based checkpointing. The model registry tracks lifecycle states: training -> staging -> production -> archived.
| Component | Versioned By | Storage |
|---|---|---|
| Model weights | SHA-256 hash of safetensors | HuggingFace Hub git tags |
| Model config | SHA-256 hash of config JSON | HuggingFace Hub commits |
| Training code | Git SHA | This repository |
| Dataset | Hub commit hash | HuggingFace Hub |
The hub/versioning.py module provides ModelVersion, VersionHistory, create_model_version(), and compare_versions() for DVC-compatible model version tracking. See docs/ml-reproducibility.md.
Every dataset follows the Datasheets for Datasets framework with structured data cards:
- Schema versioning:
schema/v{major}.{minor}with forward migration scripts - Dataset fingerprinting: SHA-256 content-addressable hashes for integrity
- Data card template: Source, license, schema, splits, preprocessing, biases
- Synthetic data: Generation scripts versioned in git with seed recording
The hub/datasets.py module provides DatasetConfig, DatasetMetadata, load_dataset_config(), and validate_dataset(). See Dataset Documentation for full details.
This corpus serves as ground truth for the Sovereign AI Stack. Query recipes and get Rust equivalents:
┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐
│ YOUR QUESTION │────>│ BATUTA ORACLE │────>│ RUST SOLUTION │
│ "tokenize text" │ │ (RAG search) │ │ via candle │
└──────────────────┘ └──────────────────┘ └──────────────────┘
# Natural language query
batuta oracle "How do I tokenize text for BERT?"
# Returns: hf_gtc/preprocessing/tokenization.py + candle equivalent
# Query with Rust cross-reference
batuta oracle --rust-source candle "attention mechanism"
# Query by tag
batuta oracle --tag training --tag memory-efficient// Python recipe in hf_gtc:
// from hf_gtc.preprocessing import preprocess_text
// result = preprocess_text(" HELLO ") # "hello"
// Equivalent Rust (via Depyler transpilation):
let result = preprocess_text(" HELLO "); // "hello"Qualified recipes (MQS >= 85) can be transpiled to Rust:
# Transpile Python recipes to Rust
depyler transpile src/hf_gtc/ --output rust_output/ --verify
# Verify semantic equivalence against candle
depyler verify --python src/hf_gtc/preprocessing/ --rust candle-core/See docs/specifications/hf-ground-truth-corpus.md for full integration details.
This project cross-references HuggingFace's Rust implementations for validation:
- candle - Tensor operations
- safetensors - Safe serialization
See CONTRIBUTING.md for detailed guidelines.
- Fork the repository
- Create a feature branch:
git checkout -b feat/my-feature - Write failing tests first (TDD)
- Implement the feature
- Ensure all quality gates pass:
make check - Submit a pull request
- 95% minimum coverage
- Zero ruff violations
- All doctests must pass
- Property-based validation for pure functions
- Type checker must pass
- Ohno, T. (1988). Toyota Production System: Beyond Large-Scale Production
- Popper, K. (1959). The Logic of Scientific Discovery
- Wolf, T. et al. (2020). Transformers: State-of-the-Art Natural Language Processing
MIT License - See LICENSE for details.