This document outlines planned directions for the TrainCheck project. The roadmap is aspirational and subject to change as we gather feedback from the community.

## North Star

TrainCheck should be a holistic, production-ready monitoring tool for ML training: low overhead, actionable diagnostics, and the flexibility to integrate with real-world training stacks.

## Near Term (Top Priorities)

- **Overhead & selective tracking** – make selective variable tracking in checking mode production-ready, and tighten micro/macro overhead numbers against clear baselines.
- **Explainability** – generate better invariant descriptions at inference time, and present violations with clearer pointers to the triggering API/variable context.
- **Debuggability & flexibility** – support dynamic queries at violation time (e.g., show which variables did not change and their properties) and collect global snapshots early in training to ground debugging; a sketch follows this list.
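
To make the debuggability item concrete, here is a minimal sketch of what a violation-time query hook could look like. Everything in it (`ViolationContext`, `on_violation`, the snapshot format) is hypothetical and not part of TrainCheck's current API; it only illustrates the intended workflow, assuming the checker keeps recent snapshots of tracked variables and can hand them to a user callback.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class ViolationContext:
    """Hypothetical context object handed to a violation-time callback."""

    invariant: str                   # human-readable invariant description
    step: int                        # training step at which the check failed
    snapshots: dict[str, list[Any]]  # per-variable value history kept by the checker

    def unchanged_variables(self) -> dict[str, Any]:
        """Dynamic query: variables whose last two snapshots are identical."""
        return {
            name: history[-1]
            for name, history in self.snapshots.items()
            if len(history) >= 2 and history[-1] == history[-2]
        }


def on_violation(ctx: ViolationContext) -> None:
    """User-supplied hook: inspect checker state instead of a bare error line."""
    print(f"[step {ctx.step}] violated: {ctx.invariant}")
    for name, value in ctx.unchanged_variables().items():
        print(f"  {name} did not change (current value: {value!r})")


# Toy driver standing in for the checker's dispatch loop.
ctx = ViolationContext(
    invariant="optimizer.step() updates all trainable parameters",
    step=120,
    snapshots={"model.fc.weight.sum": [4.2, 4.2], "loss": [2.31, 2.29]},
)
on_violation(ctx)
```
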

## Near Term (Supporting Work)

- **Online monitoring** – integrate the checker directly into the collection process so violations are reported immediately during training (a sketch follows this list).
- **Improved distributed support** – better handling of multi-GPU and multi-node runs, including tracing of distributed backends.
- **Stability fixes and tests** – add end-to-end tests for the full instrumentation→inference→checking pipeline and resolve known instrumentation edge cases.
- **Expanded documentation** – guidance on choosing reference runs and diagnosing issues, plus deeper technical docs.
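
To illustrate the online-monitoring direction, the sketch below folds checking into the collection loop so a violation surfaces on the very step that produced it, rather than after an offline pass over trace files. The event format and checker interface here are invented for the example and are not TrainCheck's.

```python
from typing import Callable, Iterable

# Toy trace model: each event is (step, variable_name, value).
Event = tuple[int, str, float]
Invariant = Callable[[dict[str, float]], bool]


def monitor_online(events: Iterable[Event], invariants: dict[str, Invariant]) -> None:
    """Run checks inside the collection loop instead of in a post-hoc pass."""
    state: dict[str, float] = {}
    for step, name, value in events:
        state[name] = value                     # collection side
        for desc, holds in invariants.items():  # checking side, same iteration
            if not holds(state):
                print(f"[step {step}] violation: {desc}")


events = [(0, "lr", 0.01), (0, "loss", 2.30), (1, "loss", float("nan"))]
invariants = {
    # NaN is the only float value that is not equal to itself.
    "loss is not NaN": lambda s: s.get("loss", 0.0) == s.get("loss", 0.0),
    "lr stays positive": lambda s: s.get("lr", 1.0) > 0,
}
monitor_online(events, invariants)  # reports the NaN at step 1, as it happens
```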

## Medium Term

- **Invariant management** – tooling to filter, group, and suppress benign invariants at scale.
- **Extensible instrumentation** – plugins for third-party libraries and custom frameworks (a sketch follows this list).
- **Performance improvements** – parallel inference and more efficient trace storage formats.
- **Pre-inferred invariant library** – curated, well-tested invariants for common PyTorch and HuggingFace workflows.
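
As a sketch of what extensible instrumentation could look like: a third-party integration ships a small installer that tells the tracer which of its entry points to wrap. The registry and decorator below, and the toy `math.sqrt` target, are all hypothetical; TrainCheck's actual plugin mechanism is still to be designed.

```python
import functools
import math
from typing import Any, Callable

# Hypothetical plugin registry: library name -> installer that wraps its APIs.
_PLUGINS: dict[str, Callable[[], None]] = {}


def register_plugin(library: str):
    """Decorator a third-party integration would use to hook into instrumentation."""
    def decorator(install: Callable[[], None]) -> Callable[[], None]:
        _PLUGINS[library] = install
        return install
    return decorator


def traced(fn: Callable) -> Callable:
    """Minimal stand-in tracer: log entry and exit of the wrapped function."""
    @functools.wraps(fn)
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        print(f"TRACE enter {fn.__name__}")
        result = fn(*args, **kwargs)
        print(f"TRACE exit  {fn.__name__}")
        return result
    return wrapper


@register_plugin("mymathlib")
def install_mymathlib_tracing() -> None:
    """The plugin author decides which entry points of their library to wrap."""
    math.sqrt = traced(math.sqrt)  # type: ignore[assignment]


# The tool would discover and run installers before training starts.
for library, install in _PLUGINS.items():
    print(f"installing instrumentation plugin for {library}")
    install()

print(math.sqrt(16.0))  # now emits TRACE lines around the call
```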

## Long Term

- **Automated root-cause analysis** – provide hints or suggested fixes when a violation is detected (a sketch follows this list).
- **Cross-framework support** – expand beyond PyTorch to additional deep learning frameworks.
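
A first cut at root-cause hints could be as simple as a rule table keyed on violation text, long before any deeper analysis exists. The table below is a purely hypothetical illustration of the shape of the feature, not a design commitment.

```python
# Hypothetical rule table pairing violation text patterns with suggested fixes.
HINTS: dict[str, str] = {
    "did not change": (
        "Check that the parameter is registered with the optimizer and that "
        "requires_grad is True."
    ),
    "nan": (
        "Inspect the learning rate and consider gradient clipping; NaNs often "
        "originate a few steps before the first visible violation."
    ),
}


def suggest_fix(violation_message: str) -> str:
    """Return the first hint whose trigger substring appears in the violation."""
    text = violation_message.lower()
    for trigger, hint in HINTS.items():
        if trigger in text:
            return hint
    return "No automated hint available; inspect the surrounding trace window."


print(suggest_fix("model.embed.weight did not change across steps 117 -> 118"))
```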

We welcome contributions in any of these areas. If you have ideas or want to help, please check the [CONTRIBUTING guide](./CONTRIBUTING.md) and open an issue to discuss!