Commit 23e6d5c

Add usage advice and expand roadmap (#5)
* Refine docs for open source prep
* reduce README clutter and update the concise workflow figure
* update usage guide
1 parent 3bf1d2e commit 23e6d5c

5 files changed: 110 additions and 21 deletions

CONTRIBUTING.md

Lines changed: 1 addition & 1 deletion
@@ -12,7 +12,7 @@ We encourage contributions in the following areas:
1212
- 🔍 **Testing**: Adding realistic traces and increasing coverage for components that are not thoroughly tested.
1313
- 🚧 **Engineering Improvements**: Enhancing log formatting, improving CLI usability, and performing code cleanup.
1414

15-
**For specific tasks and upcoming features where we need assistance, please see our [ROADMAP (TBD)](./ROADMAP.md) for planned directions and priorities.**
15+
**For specific tasks and upcoming features where we need assistance, please see our [ROADMAP](./ROADMAP.md) for planned directions and priorities.**
1616

1717
## ⚠️ Important Information for Contributors
1818

README.md

Lines changed: 41 additions & 20 deletions
@@ -3,53 +3,74 @@
33
<picture>
44
<img alt="TrainCheck logo" width="55%" src="./docs/assets/images/traincheck_logo.png">
55
</picture>
6-
<h1>Silent Error Detection for Deep Learning Training</h1>
6+
<h1>TrainCheck: Training with Confidence</h1>
77

88
[![format and types](https://github.com/OrderLab/traincheck/actions/workflows/pre-commit-checks.yml/badge.svg)](https://github.com/OrderLab/traincheck/actions/workflows/pre-commit-checks.yml)
99
[![Chat on Discord](https://img.shields.io/discord/1362661016760090736?label=Discord&logo=discord&style=flat)](https://discord.gg/ZvYewjsQ9D)
1010

1111
</div>
1212

13-
> ***Training with Confidence***
1413

15-
TrainCheck is a lightweight, extensible tool for runtime monitoring of “silent” bugs in deep‑learning training pipelines. Instead of waiting for a crash or a bad model, TrainCheck:
16-
1. **Automatically instruments** your existing training scripts (e.g., from [pytorch/examples](https://github.com/pytorch/examples) or [huggingface/transformers/examples](https://github.com/huggingface/transformers/tree/main/examples)), inserting tracing hooks with minimal code changes.
17-
2. **Learns precise invariants**–precise properties that should hold during training across API calls and model updates-by analyzing executions of known-good runs.
18-
3. **Catches silent issues early**–by checking invariants on new or modified training jobs, alerting you immediately if something didn't happen as expected (e.g., model weight inconsistency, mixed precision not applied successfully, unexpected tensor shapes). On violation, TrainCheck flags the point of divergence—so users can diagnose silent issues before they derail your model.
14+
**TrainCheck** is a lightweight tool for proactively catching **silent errors** in deep learning training runs. It detects correctness issues early, whether they stem from code bugs or faulty hardware, and pinpoints their root cause.
15+
16+
TrainCheck has detected silent errors in a wide range of real-world training scenarios, from large-scale LLM pretraining (such as BLOOM-176B) to small-scale tutorial runs by deep learning beginners.
17+
18+
📌 For a list of successful cases, see: TODO
19+
20+
## What It Does
21+
22+
TrainCheck uses **training invariants**, which are semantic rules that describe expected behavior during training, to detect bugs as they happen. These invariants can be extracted from any correct run, including those produced by official examples and tutorials. There is no need to curate inputs or write manual assertions.
23+
24+
TrainCheck performs three core functions:
25+
26+
1. **Instruments your training code**
27+
Inserts lightweight tracing into existing scripts (such as [pytorch/examples](https://github.com/pytorch/examples) or [transformers](https://github.com/huggingface/transformers/tree/main/examples)) with minimal code changes.
28+
29+
2. **Learns invariants from correct runs**
30+
Discovers expected relationships across APIs, tensors, and training steps to build a model of normal behavior.
31+
32+
3. **Checks new or modified runs**
33+
Validates behavior against the learned invariants and flags silent errors, such as missing gradient clipping, weight desynchronization, or broken mixed precision, right when they occur.
34+
35+
This picture illustrates the TrainCheck workflow:
1936

2037
![Workflow](docs/assets/images/workflow.png)
2138

2239
Under the hood, TrainCheck decomposes into three CLI tools:
2340
- **Instrumentor** (`traincheck-collect`)
2441
Wraps target training programs with lightweight tracing logic. It produces an instrumented version of the target program that logs API calls and model states without altering training semantics.
2542
- **Inference Engine** (`traincheck-infer`)
26-
Consumes one or more trace logs from successful runs to infer low‑level invariants.
43+
Consumes one or more trace logs from successful runs to infer training invariants.
2744
- **Checker** (`traincheck-check`)
2845
Runs alongside or after new training jobs to verify that each recorded event satisfies the inferred invariants.
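
To make the infer-then-check idea concrete, here is a toy sketch in plain Python. It is not TrainCheck's implementation or trace format: the event layout (`step`, `api`), the API names, and the single invariant type ("this API is called in every step") are assumptions chosen only for illustration.

```python
from collections import defaultdict


def apis_per_step(trace):
    """Group the set of API names observed in each training step."""
    steps = defaultdict(set)
    for event in trace:
        steps[event["step"]].add(event["api"])
    return steps


def infer_invariants(reference_traces):
    """Infer 'API X is called in every step' rules that hold in all reference traces."""
    always_called = None
    for trace in reference_traces:
        for apis in apis_per_step(trace).values():
            always_called = set(apis) if always_called is None else always_called & apis
    return always_called or set()


def check(trace, invariants):
    """Return (step, missing_api) pairs where a run violates an inferred rule."""
    violations = []
    for step, apis in sorted(apis_per_step(trace).items()):
        for api in sorted(invariants - apis):
            violations.append((step, api))
    return violations


# Hypothetical known-good run: the same four calls happen in every step.
good_run = [
    {"step": s, "api": api}
    for s in range(3)
    for api in ("zero_grad", "backward", "clip_grad_norm_", "optimizer_step")
]
# A modified run that silently drops gradient clipping from step 1 onward.
buggy_run = [
    e for e in good_run if not (e["api"] == "clip_grad_norm_" and e["step"] >= 1)
]

invariants = infer_invariants([good_run])
violations = check(buggy_run, invariants)
print(violations)  # [(1, 'clip_grad_norm_'), (2, 'clip_grad_norm_')]
```

The buggy run never crashes, yet the check reports the first step at which clipping went missing, which is the kind of early, localized signal the Checker is described as providing.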
2946

30-
## Status
31-
32-
TrainCheck is under active development. Features may be incomplete and the documentation is evolving—if you give it a try, please join our 💬 [Discord server](https://discord.gg/VwxpJDvB) or file a GitHub issue for support. Currently, the **Checker** operates in a semi‑online mode: you invoke it against the live, growing trace output to catch silent bugs as they appear. Fully automatic monitoring is on the roadmap, and we welcome feedback and contributions from early adopters.
33-
34-
## Try TrainCheck
47+
## 🔥 Try TrainCheck
3548

36-
1. **Install**
37-
Follow the [Installation Guide](./docs/installation-guide.md) to get TrainCheck set up on your machine.
38-
39-
2. **Explore**
40-
Work through our "[5‑Minute Experience with TrainCheck](./docs/5-min-tutorial.md)" tutorial. You’ll learn how to:
49+
Work through [5‑Minute Experience with TrainCheck](./docs/5-min-tutorial.md). You’ll learn how to:
4150
- Instrument a training script and collect a trace
42-
- Automatically infer low‑level invariants
43-
- Run the Checker in semi‑online mode to uncover silent bugs
51+
- Automatically infer invariants
52+
- Uncover silent bugs in the training script
4453

4554
## Documentation
4655

47-
Please visit [TrainCheck Technical Doc](./docs/technical-doc.md).
56+
- **[Installation Guide](./docs/installation-guide.md)**
57+
- **[Usage Guide: Scenarios and Limitations](./docs/usage-guide.md)**
58+
- **[TrainCheck Technical Doc](./docs/technical-doc.md)**
59+
- **[TrainCheck Dev Roadmap](./ROADMAP.md)**
60+
61+
## Status
62+
63+
TrainCheck is under active development. Please join our 💬 [Discord server](https://discord.gg/VwxpJDvB) or file a GitHub issue for support.
64+
We welcome feedback and contributions from early adopters.
4865

4966
## Contributing
5067

5168
We welcome and value any contributions and collaborations. Please check out [Contributing to TrainCheck](./CONTRIBUTING.md) for how to get involved.
5269

70+
## License
71+
72+
TrainCheck is licensed under the [Apache License 2.0](./LICENSE).
73+
5374
## Citation
5475

5576
If TrainCheck is relevant to your work, please cite our paper:

ROADMAP.md

Lines changed: 28 additions & 0 deletions
@@ -0,0 +1,28 @@
1+
# TrainCheck Roadmap
2+
3+
This document outlines planned directions for the TrainCheck project. The roadmap is aspirational and subject to change as we gather feedback from the community.
4+
5+
## Short Term
6+
7+
- **Online monitoring** – integrate the checker directly into the collection process so violations are reported immediately during training.
8+
- **Pre-inferred invariant library** – ship a curated set of invariants for common PyTorch and HuggingFace workflows to reduce the need for manual inference.
9+
- **Improved distributed support** – better handling of multi-GPU and multi-node runs, including tracing of distributed backends.
10+
- **High-quality invariants** – publish well-tested invariants for PyTorch, DeepSpeed, and Transformers out of the box.
11+
- **Demo assets** – publish a short demo video and GIFs illustrating the TrainCheck workflow.
12+
- **Expanded documentation** – add guidance on choosing reference runs and diagnosing issues, plus deeper technical docs.
13+
- **Stability fixes and tests** – resolve proxy dump bugs and add end-to-end tests for the full instrumentation→inference→checking pipeline.
14+
- **Call graph updates** – document the call-graph generation process and keep graphs in sync with recent PyTorch versions.
15+
- **Repository cleanup** – remove obsolete files and artifacts.
16+
17+
## Medium Term
18+
19+
- **Extensible instrumentation** – allow plugins for third-party libraries and custom frameworks.
20+
- **Smarter invariant filtering** – tooling to help users manage large numbers of invariants and suppress benign ones.
21+
- **Performance improvements** – explore parallel inference and more efficient trace storage formats.
22+
23+
## Long Term
24+
25+
- **Cross-framework support** – expand beyond PyTorch to additional deep learning frameworks.
26+
- **Automated root-cause analysis** – provide hints or suggested fixes when a violation is detected.
27+
28+
We welcome contributions in any of these areas. If you have ideas or want to help, please check the [CONTRIBUTING guide](./CONTRIBUTING.md) and open an issue to discuss!

docs/assets/images/workflow.png

109 KB

docs/usage-guide.md

Lines changed: 40 additions & 0 deletions
@@ -0,0 +1,40 @@
1+
# 🧪 TrainCheck: Usage Guide
2+
3+
TrainCheck helps detect and diagnose silent errors in deep learning training runs—issues that don't crash your code but silently break correctness.
4+
5+
## 🚀 Quick Start
6+
7+
Check out the [5-minute guide](./5-min-tutorial.md) for a minimal working example.
8+
9+
## ✅ Common Use Cases
10+
11+
TrainCheck is useful when your training process doesn’t converge, behaves inconsistently, or silently fails. It can help you:
12+
13+
- **Monitor** long-running training jobs and catch issues early
14+
- **Debug** finished runs and pinpoint where things went wrong
15+
- **Sanity-check** new pipelines, code changes, or infrastructure upgrades
16+
17+
TrainCheck detects a range of correctness issues—like misused APIs, incorrect training logic, or hardware faults—without requiring labels or modifications to your training code.
18+
19+
**While TrainCheck focuses on correctness, it’s also useful for *ruling out bugs* so you can focus on algorithm design with confidence.**
20+
21+
## 🧠 Tips for Effective Use
22+
23+
1. **Use short runs to reduce overhead.**
24+
If your hardware is stable, you can validate just the beginning of training. Use smaller models and fewer iterations to speed up turnaround time.
25+
26+
2. **Choose good reference runs for inference.**
27+
- If you have a past run of the same code that worked well, just use that.
28+
- You can also use small-scale example pipelines that cover different features of the framework (e.g., various optimizers, mixed precision, optional flags).
29+
- If you're debugging a new or niche feature with limited history, try using the official example as a reference. Even if the example is not bug-free, invariant violations can still highlight behavioral differences between your run and the example, helping you debug faster.
30+
31+
3. **Minimize scale when collecting traces.**
32+
- Shrink the pipeline by using a smaller model, running for only ~10 iterations, and using the minimal necessary compute setup (e.g., 2 nodes for distributed training).
33+
34+
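
One way to apply the scale-down tip is to derive a reduced configuration from the full one before collecting traces. The sketch below is only an illustration: `TrainConfig`, its fields, and the default limits are hypothetical and should be mapped onto whatever settings your own pipeline exposes.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class TrainConfig:
    # Hypothetical fields; map them onto your own pipeline's settings.
    num_layers: int
    hidden_size: int
    max_steps: int
    num_nodes: int


def shrink_for_tracing(cfg, max_steps=10, max_layers=2, max_hidden=128):
    """Return a scaled-down copy of cfg for trace collection."""
    return replace(
        cfg,
        num_layers=min(cfg.num_layers, max_layers),
        hidden_size=min(cfg.hidden_size, max_hidden),
        max_steps=min(cfg.max_steps, max_steps),
        # Keep distributed code paths exercised, but with as few nodes as possible.
        num_nodes=min(cfg.num_nodes, 2),
    )


full = TrainConfig(num_layers=48, hidden_size=4096, max_steps=100_000, num_nodes=16)
small = shrink_for_tracing(full)
print(small)
```

Using `min` rather than fixed values means a pipeline that is already small (for example, a single-node run) is left unchanged.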
35+
## 🚧 Current Limitations
36+
37+
- **Eager mode only.** The TrainCheck Instrumentor currently works only in PyTorch eager mode. Features like `torch.compile` are disabled during instrumentation.
38+
39+
- **Not fully real-time (yet).** Invariant checking is semi-online: you run the Checker against the trace output while it is still growing. Fully automatic real-time monitoring is planned but not yet available.
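
As a rough picture of what semi-online checking involves, the sketch below polls a growing trace file and checks only the events appended since the last poll. The JSON-lines layout, the `grad_norm` field, and the `is_ok` rule are assumptions for illustration; they are not TrainCheck's trace format or its invariants.

```python
import json
import os
import tempfile


def read_new_events(path, offset):
    """Read complete JSON lines appended to path since the given byte offset."""
    events = []
    with open(path, "rb") as f:
        f.seek(offset)
        while True:
            line = f.readline()
            if not line.endswith(b"\n"):
                break  # end of file or a half-written line; retry on the next poll
            offset = f.tell()
            if line.strip():
                events.append(json.loads(line))
    return events, offset


def is_ok(event):
    # Hypothetical rule standing in for an inferred invariant.
    return event["grad_norm"] <= 1.0


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "trace.jsonl")
    offset, flagged = 0, []
    # Simulate a writer that flushes mid-line between two polls.
    chunks = [
        '{"step": 0, "grad_norm": 0.9}\n{"step": 1, "gr',
        'ad_norm": 3.5}\n',
    ]
    for chunk in chunks:
        with open(path, "a") as f:
            f.write(chunk)
        events, offset = read_new_events(path, offset)
        flagged += [e["step"] for e in events if not is_ok(e)]

print(flagged)  # [1]
```

Tracking a byte offset and skipping incomplete lines lets each poll resume where the previous one stopped, so a partially written event is never parsed or checked twice.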
40+
