Commit 1ce16f2
feat: Add vital articles wiki example and trainer tooling
* Update examples/wiki.rs so the crawler seeds come from Wikipedia’s Level 1–5 Vital Articles lists instead of a few topical pages, improving coverage when running the example.
* Introduce a new wiki_trainer/ Python project (Typer CLI, README, pyproject.toml, uv.lock, artifacts scaffolding) that can 1) filter Fastcrawl’s chunk JSONL into shuffled train/eval splits and 2) fine-tune Hugging Face causal LMs with configurable hyperparameters via `prepare-data` and `train` commands.
* Implement dataset cleaning/splitting helpers (data.py), structured config objects (config.py), and a transformers-based training pipeline (training.py) that handles tokenization, batching, evaluation/save cadence, and tokenizer/model persistence.
* Note: the new directory currently also tracks __pycache__/ and .pyc artifacts under wiki_trainer/src/wiki_trainer/, which will end up in the commit unless they’re removed or gitignored.
1 parent a04f95a commit 1ce16f2

16 files changed: +2169 −4 lines

examples/wiki.rs

Lines changed: 9 additions & 4 deletions

@@ -4,10 +4,15 @@ use std::sync::Arc;
 use url::Url;
 
 const SEEDS: &[&str] = &[
-    "https://en.wikipedia.org/wiki/Web_crawler",
-    "https://en.wikipedia.org/wiki/Hypertext_Transfer_Protocol",
-    "https://en.wikipedia.org/wiki/Capybara",
-    "https://en.wikipedia.org/wiki/Cat",
+    // "https://en.wikipedia.org/wiki/Web_crawler",
+    // "https://en.wikipedia.org/wiki/Hypertext_Transfer_Protocol",
+    // "https://en.wikipedia.org/wiki/Capybara",
+    // "https://en.wikipedia.org/wiki/Cat",
+    "https://en.wikipedia.org/wiki/Wikipedia:Vital_articles/Level/1",
+    "https://en.wikipedia.org/wiki/Wikipedia:Vital_articles/Level/2",
+    "https://en.wikipedia.org/wiki/Wikipedia:Vital_articles/Level/3",
+    "https://en.wikipedia.org/wiki/Wikipedia:Vital_articles/Level/4",
+    "https://en.wikipedia.org/wiki/Wikipedia:Vital_articles/Level/5",
 ];
 
 fn main() -> Result<(), Box<dyn std::error::Error + Send + Sync>> {

wiki_trainer/.gitignore

Lines changed: 1 addition & 0 deletions

@@ -0,0 +1 @@
+artifacts/

wiki_trainer/.python-version

Lines changed: 1 addition & 0 deletions

@@ -0,0 +1 @@
+3.13

wiki_trainer/README.md

Lines changed: 66 additions & 0 deletions

@@ -0,0 +1,66 @@
# Wiki Trainer

Utilities for turning Fastcrawl's Wikipedia chunks into Hugging Face datasets and fine-tuning a causal language model using `transformers` + `uv`.

## Prerequisites

- Python 3.13 (already provided by the `uv` shim installed at repository root)
- `uv` >= 0.9 for dependency + virtualenv management
- GPU drivers/tooling that can run PyTorch (install CUDA/cuDNN or use CPU for smoke tests)
- A local snapshot of chunks, e.g. `data/wiki_embeddings.jsonl` produced by Fastcrawl's embedder pipeline

## Setup

```sh
cd wiki_trainer
UV_CACHE_DIR=../.cache/uv uv sync  # creates .venv and installs dependencies (torch via the `training` extra)
source .venv/bin/activate
```

`uv sync` respects the `pyproject.toml` optional dependency group named `training`, so PyTorch + bitsandbytes are installed automatically. Adjust `UV_CACHE_DIR` if you keep cache files elsewhere (the repo root already has `.cache/uv`).

## Converting chunks to train/eval JSONL

Run the `prepare-data` subcommand to down-select and split the chunk corpus. By default it expects OpenAI-style embedding JSONL rows (with `text`, `url`, etc.), but it also works with normalized Fastcrawl pages that include `body_text` or `chunks[].text`.

```sh
uv run wiki-trainer prepare-data \
  ../data/wiki_embeddings.jsonl \
  --output-dir artifacts/datasets \
  --min-chars 200 \
  --max-chars 1600 \
  --max-chunks 50000 \
  --eval-ratio 0.02
```

The command writes `train.jsonl` and `eval.jsonl` into `artifacts/datasets`. Each row keeps the original text plus metadata columns (`source_url`, `chunk_id`, `section_path`) so you can trace model behavior back to specific chunks.

## Fine-tuning a model

Once the dataset exists, call `wiki-trainer train` with your preferred Hugging Face checkpoint. The defaults target `distilgpt2`, but you can swap in any causal LM (TinyLlama, Mistral, etc.) so long as it fits on your hardware.

```sh
uv run wiki-trainer train \
  artifacts/datasets \
  --model-name TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
  --output-dir artifacts/checkpoints/tinyllama \
  --context-length 1024 \
  --epochs 2 \
  --batch-size 1 \
  --grad-accum 16 \
  --learning-rate 1e-4 \
  --eval-steps 100
```

The CLI wraps Hugging Face's `Trainer` so standard knobs (batch size, gradient accumulation, precision flags) are exposed. Logs/checkpoints land under `artifacts/checkpoints/...` by default.

## Tips

- **Filtering.** Increase `--min-chars` to drop stubby chunks or pass `--max-chunks`/`--eval-ratio` to control dataset size.
- **Precision.** Use `--bf16` or `--fp16` once your hardware + drivers support it; otherwise leave them disabled for CPU proof-of-life runs.
- **Custom schedules.** Edit `wiki_trainer/config.py` to add weight-decay or warmup strategies, then re-export the CLI arguments if you need more control.
- **Streaming/large corpora.** `prepare-data` currently loads the filtered samples into memory before shuffling. For multi-million chunk runs consider chunked pre-processing or swapping the implementation for a disk-backed shuffle buffer.

## Repository integration

The project stays isolated inside `wiki_trainer/` so it can evolve independently of the Rust crawler. Use `uv run wiki-trainer --help` to see every flag, and keep data artifacts under `wiki_trainer/artifacts/` (already referenced in the defaults) so they stay out of the Rust workspace.
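
Reviewer note: the committed `data.py` is not rendered in this excerpt, so below is a minimal sketch of the filtering/splitting behaviour the README describes, assuming the `prepare_dataset` entry point exported by the package `__init__.py`. The field fallbacks (`text` vs `body_text`), the drop-rather-than-truncate policy for over-long chunks, and the fixed `seed` default are assumptions, not the committed implementation.

```python
import json
import random
from pathlib import Path


def prepare_dataset(
    input_path: Path,
    output_dir: Path,
    *,
    min_chars: int = 200,
    max_chars: int = 1600,
    max_chunks: int | None = 50_000,
    eval_ratio: float = 0.02,
    seed: int = 42,
) -> tuple[Path, Path]:
    """Filter chunk JSONL rows by length, shuffle, and split into train/eval files."""
    rows = []
    with input_path.open() as fh:
        for line in fh:
            record = json.loads(line)
            # Accept either embedding-style rows (`text`) or normalized pages (`body_text`).
            text = record.get("text") or record.get("body_text") or ""
            if not (min_chars <= len(text) <= max_chars):
                continue  # assumption: over-long chunks are dropped, not truncated
            rows.append(
                {
                    "text": text,
                    "source_url": record.get("url"),
                    "chunk_id": record.get("chunk_id"),
                    "section_path": record.get("section_path"),
                }
            )

    # Deterministic shuffle, then cap the corpus size before splitting.
    random.Random(seed).shuffle(rows)
    if max_chunks is not None:
        rows = rows[:max_chunks]

    split = max(1, int(len(rows) * eval_ratio))
    eval_rows, train_rows = rows[:split], rows[split:]

    output_dir.mkdir(parents=True, exist_ok=True)
    train_path, eval_path = output_dir / "train.jsonl", output_dir / "eval.jsonl"
    for path, subset in ((train_path, train_rows), (eval_path, eval_rows)):
        with path.open("w") as fh:
            for row in subset:
                fh.write(json.dumps(row, ensure_ascii=False) + "\n")
    return train_path, eval_path
```

As the Tips section admits, the shuffle is in-memory; a disk-backed shuffle buffer would replace the `random.Random(seed).shuffle(rows)` step for multi-million-chunk corpora.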

wiki_trainer/pyproject.toml

Lines changed: 34 additions & 0 deletions

@@ -0,0 +1,34 @@
[project]
name = "wiki-trainer"
version = "0.1.0"
description = "Train custom language models on Fastcrawl Wikipedia chunks"
readme = "README.md"
requires-python = ">=3.13"
dependencies = [
    "typer[all]>=0.12.5",
    "datasets>=2.19.1",
    "transformers>=4.45.0",
    "accelerate>=0.34.0",
    "sentencepiece>=0.2.0",
    "tqdm>=4.66.0",
    "numpy>=1.26.0",
]

[project.optional-dependencies]
training = ["torch>=2.4.1", "bitsandbytes>=0.43.1"]

[dependency-groups]
training = ["torch>=2.4.1", "bitsandbytes>=0.43.1"]

[project.scripts]
wiki-trainer = "wiki_trainer.cli:app"

[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[tool.hatch.build.targets.wheel]
packages = ["src/wiki_trainer"]

[tool.uv]
default-groups = ["training"]
wiki_trainer/src/wiki_trainer/__init__.py

Lines changed: 7 additions & 0 deletions

@@ -0,0 +1,7 @@
"""Utilities for preparing Wikipedia chunks and fine-tuning local language models."""

from .config import DatasetConfig, TrainingConfig
from .data import prepare_dataset
from .training import train_model

__all__ = ["DatasetConfig", "TrainingConfig", "prepare_dataset", "train_model"]
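
Likewise, `config.py` and `training.py` are referenced but not rendered above. Here is a minimal sketch of the described pipeline (tokenization, batching, evaluation/save cadence, tokenizer/model persistence) built on `datasets` plus the `transformers` `Trainer`; the `TrainingConfig` fields mirror the README's CLI flags, and all defaults and `TrainingArguments` choices below are assumptions rather than the committed code:

```python
from dataclasses import dataclass
from pathlib import Path

from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)


@dataclass
class TrainingConfig:
    dataset_dir: Path
    model_name: str = "distilgpt2"
    output_dir: Path = Path("artifacts/checkpoints/run")
    context_length: int = 1024
    epochs: int = 1
    batch_size: int = 1
    grad_accum: int = 8
    learning_rate: float = 1e-4
    eval_steps: int = 100
    bf16: bool = False
    fp16: bool = False


def train_model(config: TrainingConfig) -> None:
    """Tokenize the prepared JSONL splits and fine-tune a causal LM with `Trainer`."""
    tokenizer = AutoTokenizer.from_pretrained(config.model_name)
    if tokenizer.pad_token is None:
        # Causal LMs such as GPT-2 ship without a pad token; reuse EOS for padding.
        tokenizer.pad_token = tokenizer.eos_token
    model = AutoModelForCausalLM.from_pretrained(config.model_name)

    raw = load_dataset(
        "json",
        data_files={
            "train": str(config.dataset_dir / "train.jsonl"),
            "eval": str(config.dataset_dir / "eval.jsonl"),
        },
    )

    def tokenize(batch):
        return tokenizer(batch["text"], truncation=True, max_length=config.context_length)

    # Drop the metadata columns so only token IDs reach the collator.
    tokenized = raw.map(tokenize, batched=True, remove_columns=raw["train"].column_names)
    collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

    args = TrainingArguments(
        output_dir=str(config.output_dir),
        num_train_epochs=config.epochs,
        per_device_train_batch_size=config.batch_size,
        per_device_eval_batch_size=config.batch_size,
        gradient_accumulation_steps=config.grad_accum,
        learning_rate=config.learning_rate,
        eval_strategy="steps",
        eval_steps=config.eval_steps,
        save_steps=config.eval_steps,
        logging_steps=max(1, config.eval_steps // 10),
        bf16=config.bf16,
        fp16=config.fp16,
        report_to="none",
    )

    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=tokenized["train"],
        eval_dataset=tokenized["eval"],
        data_collator=collator,
    )
    trainer.train()
    trainer.save_model(str(config.output_dir))
    tokenizer.save_pretrained(str(config.output_dir))
```

The `--bf16`/`--fp16` flags would map directly onto `TrainingArguments(bf16=..., fp16=...)`, consistent with the README's advice to leave both off for CPU smoke tests.
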
4 binary files not shown (468 Bytes, 5.09 KB, 3.93 KB, 7.26 KB) — likely the tracked __pycache__/*.pyc artifacts under wiki_trainer/src/wiki_trainer/ noted in the commit message.
