Commit f760d6f

add: static website generation workflow
1 parent 08b24f4 commit f760d6f

3 files changed: 127 additions, 0 deletions

.github/workflows/deploy-docs.yml

Lines changed: 25 additions & 0 deletions
@@ -0,0 +1,25 @@
+name: Deploy Docs
+
+on:
+  push:
+    branches:
+      - main
+
+jobs:
+  deploy:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - name: Configure Git
+        run: |
+          git config user.name "GitHub Actions Bot"
+          git config user.email "github-actions[bot]@users.noreply.github.com"
+      - name: Set up Python
+        uses: actions/setup-python@v5
+        with:
+          python-version: 3.x
+      - name: Install dependencies
+        run: |
+          pip install mkdocs mkdocs-readthedocs-theme
+      - name: Deploy docs
+        run: mkdocs gh-deploy --force

docs/README.md

Lines changed: 93 additions & 0 deletions
@@ -0,0 +1,93 @@
+
+<div align="center">
+<picture>
+  <img alt="TrainCheck logo" width="55%" src="./assets/images/traincheck_logo.png">
+</picture>
+<h1>TrainCheck: Training with Confidence</h1>
+
+</div>
+
+[![format and types](https://github.com/OrderLab/traincheck/actions/workflows/pre-commit-checks.yml/badge.svg)](https://github.com/OrderLab/traincheck/actions/workflows/pre-commit-checks.yml)
+[![Chat on Discord](https://img.shields.io/discord/1362661016760090736?label=Discord&logo=discord&style=flat)](https://discord.gg/ZvYewjsQ9D)
+
+
+**TrainCheck** is a lightweight tool for proactively catching **silent errors** in deep learning training runs. It detects correctness issues, such as code bugs and faulty hardware, early and pinpoints their root cause.
+
+TrainCheck has detected silent errors in a wide range of real-world training scenarios, from large-scale LLM pretraining (such as BLOOM-176B) to small-scale tutorial runs by deep learning beginners.
+
+📌 For a list of successful cases, see: TODO
+
+## What It Does
+
+TrainCheck uses **training invariants**, which are semantic rules that describe expected behavior during training, to detect bugs as they happen. These invariants can be extracted from any correct run, including those produced by official examples and tutorials. There is no need to curate inputs or write manual assertions.
+
+TrainCheck performs three core functions:
+
+1. **Instruments your training code**
+   Inserts lightweight tracing into existing scripts (such as [pytorch/examples](https://github.com/pytorch/examples) or [transformers](https://github.com/huggingface/transformers/tree/main/examples)) with minimal code changes.
+
+2. **Learns invariants from correct runs**
+   Discovers expected relationships across APIs, tensors, and training steps to build a model of normal behavior.
+
+3. **Checks new or modified runs**
+   Validates behavior against the learned invariants and flags silent errors, such as missing gradient clipping, weight desynchronization, or broken mixed precision, right when they occur.
+
+This picture illustrates the TrainCheck workflow:
+
+![Workflow](assets/images/workflow.png)
+
+Under the hood, TrainCheck decomposes into three CLI tools:
+- **Instrumentor** (`traincheck-collect`)
+  Wraps target training programs with lightweight tracing logic. It produces an instrumented version of the target program that logs API calls and model states without altering training semantics.
+- **Inference Engine** (`traincheck-infer`)
+  Consumes one or more trace logs from successful runs to infer training invariants.
+- **Checker** (`traincheck-check`)
+  Runs alongside or after new training jobs to verify that each recorded event satisfies the inferred invariants.
+
+## 🔥 Try TrainCheck
+
+Work through [5‑Minute Experience with TrainCheck](./5-min-tutorial.md). You’ll learn how to:
+- Instrument a training script and collect a trace
+- Automatically infer invariants
+- Uncover silent bugs in the training script
+
+## Documentation
+
+- **[Installation Guide](./installation-guide.md)**
+- **[Usage Guide: Scenarios and Limitations](./usage-guide.md)**
+- **[TrainCheck Technical Doc](./technical-doc.md)**
+- **[TrainCheck Dev RoadMap](./ROADMAP.md)**
+
+## Status
+
+TrainCheck is under active development. Please join our 💬 [Discord server](https://discord.gg/VwxpJDvB) or file a GitHub issue for support.
+We welcome feedback and contributions from early adopters.
+
+## Contributing
+
+We welcome and value any contributions and collaborations. Please check out [Contributing to TrainCheck](./CONTRIBUTING.md) for how to get involved.
+
+## License
+
+TrainCheck is licensed under the [Apache License 2.0](./LICENSE).
+
+## Citation
+
+If TrainCheck is relevant to your work, please cite our paper:
+```bib
+@inproceedings{TrainCheckOSDI2025,
+  author = {Jiang, Yuxuan and Zhou, Ziming and Xu, Boyu and Liu, Beijie and Xu, Runhui and Huang, Peng},
+  title = {Training with Confidence: Catching Silent Errors in Deep Learning Training with Automated Proactive Checks},
+  booktitle = {Proceedings of the 19th USENIX Symposium on Operating Systems Design and Implementation},
+  series = {OSDI '25},
+  month = {July},
+  year = {2025},
+  address = {Boston, MA, USA},
+  publisher = {USENIX Association},
+}
+```
+
+
+## Artifact Evaluation
+
+🕵️‍♀️ OSDI AE members, please see [TrainCheck AE Guide](./ae.md).
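
The README added above describes training invariants as rules learned from a correct run and then checked against new runs. The following is a minimal sketch of that idea only, not TrainCheck's API, trace format, or inference algorithm. The `(step, api_name)` event tuples and the functions `calls_by_step`, `infer_followed_by`, and `check` are hypothetical names chosen for illustration. The sketch infers "whenever API `a` is called in a step, API `b` is called later in that step" relations from a reference trace and reports the steps of another trace that break them.

```python
from collections import defaultdict
from itertools import permutations

# Assumption: a trace is a list of (step, api_name) tuples in call order.
# This is an illustrative format, not the one TrainCheck records.


def calls_by_step(trace):
    """Group API names by training step, preserving call order."""
    steps = defaultdict(list)
    for step, api in trace:
        steps[step].append(api)
    return steps


def infer_followed_by(trace):
    """Infer pairs (a, b) such that, in every step where `a` is called,
    `b` is called somewhere after the first call to `a`."""
    steps = calls_by_step(trace)
    apis = {api for calls in steps.values() for api in calls}
    invariants = set()
    for a, b in permutations(apis, 2):
        relevant = [calls for calls in steps.values() if a in calls]
        if relevant and all(b in calls[calls.index(a) + 1:] for calls in relevant):
            invariants.add((a, b))
    return invariants


def check(trace, invariants):
    """Return (step, a, b) for each step where `a` ran but `b` did not follow."""
    violations = []
    for step, calls in sorted(calls_by_step(trace).items()):
        for a, b in sorted(invariants):
            if a in calls and b not in calls[calls.index(a) + 1:]:
                violations.append((step, a, b))
    return violations


# Reference run: every step clips gradients before the optimizer update.
good = [
    (0, "backward"), (0, "clip_grad_norm_"), (0, "optimizer.step"),
    (1, "backward"), (1, "clip_grad_norm_"), (1, "optimizer.step"),
]
# Modified run: gradient clipping is missing in step 1.
bad = [
    (0, "backward"), (0, "clip_grad_norm_"), (0, "optimizer.step"),
    (1, "backward"), (1, "optimizer.step"),
]

invariants = infer_followed_by(good)
violations = check(bad, invariants)
print(violations)
```

With these two toy traces, `violations` is `[(1, "backward", "clip_grad_norm_")]`: the missing gradient clipping is flagged at the step where it first occurs, which mirrors the "missing gradient clipping" example in the README text.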

mkdocs.yml

Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
+site_name: TrainCheck
+theme:
+  name: readthedocs
+nav:
+  - Home: README.md
+  - "Installation Guide": ./installation-guide.md
+  - "5 Minute Quick Start": ./5-min-tutorial.md
+  - "Technical Documentation": ./technical-doc.md
+  - "Usage Tips": usage-guide.md
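
Each `nav` entry above names a Markdown file that MkDocs resolves relative to its docs directory, and this commit only adds `docs/README.md`. A small sketch for checking that the listed pages exist before deploying, assuming the default `docs` directory; the helper `missing_nav_targets` and the hard-coded `NAV` list are illustrative and not part of the commit or of MkDocs.

```python
from pathlib import Path

# Nav entries copied by hand from the mkdocs.yml in this commit.
NAV = [
    {"Home": "README.md"},
    {"Installation Guide": "./installation-guide.md"},
    {"5 Minute Quick Start": "./5-min-tutorial.md"},
    {"Technical Documentation": "./technical-doc.md"},
    {"Usage Tips": "usage-guide.md"},
]


def missing_nav_targets(nav, docs_dir):
    """Return the nav targets that do not exist as files under docs_dir."""
    root = Path(docs_dir)
    missing = []
    for entry in nav:
        for _title, target in entry.items():
            if not (root / target).is_file():
                missing.append(target)
    return missing


# Assumption: run from the repository root, with docs in ./docs.
for target in missing_nav_targets(NAV, "docs"):
    print(f"missing nav target: {target}")
```

Running this from the repository root prints one line per page that the nav references but the docs directory lacks. MkDocs' own `mkdocs build --strict` turns build warnings into errors and can serve a similar purpose inside the workflow.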
