|
3 | 3 | <picture> |
4 | 4 | <img alt="TrainCheck logo" width="55%" src="./docs/assets/images/traincheck_logo.png"> |
5 | 5 | </picture> |
6 | | -<h1>Silent Error Detection for Deep Learning Training</h1> |
| 6 | +<h1>TrainCheck: Training with Confidence</h1> |
7 | 7 |
|
8 | 8 | [](https://github.com/OrderLab/traincheck/actions/workflows/pre-commit-checks.yml) |
9 | 9 | [](https://discord.gg/ZvYewjsQ9D) |
10 | 10 |
|
11 | 11 | </div> |
12 | 12 |
|
13 | | -> ***Training with Confidence*** |
14 | 13 |
|
15 | | -TrainCheck is a lightweight, extensible tool for runtime monitoring of “silent” bugs in deep‑learning training pipelines. Instead of waiting for a crash or a bad model, TrainCheck: |
16 | | -1. **Automatically instruments** your existing training scripts (e.g., from [pytorch/examples](https://github.com/pytorch/examples) or [huggingface/transformers/examples](https://github.com/huggingface/transformers/tree/main/examples)), inserting tracing hooks with minimal code changes. |
17 | | -2. **Learns precise invariants**–precise properties that should hold during training across API calls and model updates-by analyzing executions of known-good runs. |
18 | | -3. **Catches silent issues early**–by checking invariants on new or modified training jobs, alerting you immediately if something didn't happen as expected (e.g., model weight inconsistency, mixed precision not applied successfully, unexpected tensor shapes). On violation, TrainCheck flags the point of divergence—so users can diagnose silent issues before they derail your model. |
| 14 | +**TrainCheck** is a lightweight tool for proactively catching **silent errors** in deep learning training runs. It detects correctness issues caused by code bugs or faulty hardware early in a run and pinpoints their root cause.
| 15 | + |
| 16 | +TrainCheck has detected silent errors in a wide range of real-world training scenarios, from large-scale LLM pretraining (such as BLOOM-176B) to small-scale tutorial runs by deep learning beginners. |
| 17 | + |
| 18 | +📌 For a list of successful cases, see: TODO |
| 19 | + |
| 20 | +## What It Does |
| 21 | + |
| 22 | +TrainCheck uses **training invariants**, which are semantic rules that describe expected behavior during training, to detect bugs as they happen. These invariants can be extracted from any correct run, including those produced by official examples and tutorials. There is no need to curate inputs or write manual assertions. |
| 23 | + |
| 24 | +TrainCheck performs three core functions: |
| 25 | + |
| 26 | +1. **Instruments your training code** |
| 27 | + Inserts lightweight tracing into existing scripts (such as [pytorch/examples](https://github.com/pytorch/examples) or [transformers](https://github.com/huggingface/transformers/tree/main/examples)) with minimal code changes. |
| 28 | + |
| 29 | +2. **Learns invariants from correct runs** |
| 30 | + Discovers expected relationships across APIs, tensors, and training steps to build a model of normal behavior. |
| 31 | + |
| 32 | +3. **Checks new or modified runs** |
| 33 | + Validates behavior against the learned invariants and flags silent errors, such as missing gradient clipping, weight desynchronization, or broken mixed precision, right when they occur. |
| 34 | + |
| 35 | +The figure below illustrates the TrainCheck workflow:
19 | 36 |
|
20 | 37 |  |
21 | 38 |
|
22 | 39 | Under the hood, TrainCheck decomposes into three CLI tools: |
23 | 40 | - **Instrumentor** (`traincheck-collect`) |
24 | 41 | Wraps target training programs with lightweight tracing logic. It produces an instrumented version of the target program that logs API calls and model states without altering training semantics. |
25 | 42 | - **Inference Engine** (`traincheck-infer`) |
26 | | - Consumes one or more trace logs from successful runs to infer low‑level invariants. |
| 43 | + Consumes one or more trace logs from successful runs to infer training invariants. |
27 | 44 | - **Checker** (`traincheck-check`) |
28 | 45 | Runs alongside or after new training jobs to verify that each recorded event satisfies the inferred invariants. |
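
To make the infer-then-check split concrete, here is a minimal conceptual sketch in Python. It is not TrainCheck's implementation or API: the trace format (a list of steps, each a list of recorded API names), the helper names `infer_invariants` and `check_trace`, and the single invariant type ("this API is called in every step") are all assumptions chosen for illustration. Real training invariants also cover relationships across tensors and model state.

```python
from typing import Dict, List, Set

# Hypothetical trace format: one inner list of API names per training step.
Trace = List[List[str]]


def infer_invariants(good_traces: List[Trace]) -> Set[str]:
    """Return the API names that occur in every step of every known-good trace."""
    steps = [set(step) for trace in good_traces for step in trace]
    if not steps:
        return set()
    return set.intersection(*steps)


def check_trace(trace: Trace, invariants: Set[str]) -> Dict[int, Set[str]]:
    """Map each violating step index to the expected APIs missing from it."""
    violations: Dict[int, Set[str]] = {}
    for index, step in enumerate(trace):
        missing = invariants - set(step)
        if missing:
            violations[index] = missing
    return violations


# A known-good run that clips gradients in every step.
good_run: Trace = [
    ["forward", "backward", "clip_grad_norm_", "optimizer.step"],
    ["forward", "backward", "clip_grad_norm_", "optimizer.step"],
]

# A modified run that silently skips gradient clipping in its second step.
buggy_run: Trace = [
    ["forward", "backward", "clip_grad_norm_", "optimizer.step"],
    ["forward", "backward", "optimizer.step"],
]

invariants = infer_invariants([good_run])
violations = check_trace(buggy_run, invariants)
print(violations)  # step 1 is flagged as missing clip_grad_norm_
```

The buggy run still trains without crashing, which is what makes the error silent; the check reports the first step where behavior diverges from the known-good run.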
29 | 46 |
|
30 | | -## Status |
31 | | - |
32 | | -TrainCheck is under active development. Features may be incomplete and the documentation is evolving—if you give it a try, please join our 💬 [Discord server](https://discord.gg/VwxpJDvB) or file a GitHub issue for support. Currently, the **Checker** operates in a semi‑online mode: you invoke it against the live, growing trace output to catch silent bugs as they appear. Fully automatic monitoring is on the roadmap, and we welcome feedback and contributions from early adopters. |
33 | | - |
34 | | -## Try TrainCheck |
| 47 | +## 🔥 Try TrainCheck |
35 | 48 |
|
36 | | -1. **Install** |
37 | | - Follow the [Installation Guide](./docs/installation-guide.md) to get TrainCheck set up on your machine. |
38 | | - |
39 | | -2. **Explore** |
40 | | - Work through our "[5‑Minute Experience with TrainCheck](./docs/5-min-tutorial.md)" tutorial. You’ll learn how to: |
| 49 | +Work through [5‑Minute Experience with TrainCheck](./docs/5-min-tutorial.md). You’ll learn how to: |
41 | 50 | - Instrument a training script and collect a trace |
42 | | - - Automatically infer low‑level invariants |
43 | | - - Run the Checker in semi‑online mode to uncover silent bugs |
| 51 | + - Automatically infer invariants |
| 52 | + - Uncover silent bugs in the training script |
44 | 53 |
|
45 | 54 | ## Documentation |
46 | 55 |
|
47 | | -Please visit [TrainCheck Technical Doc](./docs/technical-doc.md). |
| 56 | +- **[Installation Guide](./docs/installation-guide.md)** |
| 57 | +- **[Usage Guide: Scenarios and Limitations](./docs/usage-guide.md)** |
| 58 | +- **[TrainCheck Technical Doc](./docs/technical-doc.md)** |
| 59 | +- **[TrainCheck Dev Roadmap](./ROADMAP.md)**
| 60 | + |
| 61 | +## Status |
| 62 | + |
| 63 | +TrainCheck is under active development. Please join our 💬 [Discord server](https://discord.gg/VwxpJDvB) or file a GitHub issue for support. |
| 64 | +We welcome feedback and contributions from early adopters. |
48 | 65 |
|
49 | 66 | ## Contributing |
50 | 67 |
|
51 | 68 | We welcome and value any contributions and collaborations. Please check out [Contributing to TrainCheck](./CONTRIBUTING.md) for how to get involved. |
52 | 69 |
|
| 70 | +## License |
| 71 | + |
| 72 | +TrainCheck is licensed under the [Apache License 2.0](./LICENSE). |
| 73 | + |
53 | 74 | ## Citation |
54 | 75 |
|
55 | 76 | If TrainCheck is relevant to your work, please cite our paper: |
|