Commit e11a961

Revise roadmap for production monitoring priorities
1 parent 3051402 commit e11a961

1 file changed: +18 -12 lines changed

ROADMAP.md

Lines changed: 18 additions & 12 deletions
@@ -2,27 +2,33 @@
 
 This document outlines planned directions for the TrainCheck project. The roadmap is aspirational and subject to change as we gather feedback from the community.
 
-## Short Term
+## North Star
+
+TrainCheck should be a holistic, production-ready monitoring tool for ML training: low overhead, actionable diagnostics, and flexible enough to integrate with real-world training stacks.
+
+## Near Term (Top Priorities)
+
+- **Overhead & selective tracking** – make selective variable tracking in checking mode production-ready, and tighten micro/macro overhead numbers with clear baselines.
+- **Explainability** – generate better invariant descriptions on inference, and present violations with clearer pointers to the triggering API/variable context.
+- **Debuggability & flexibility** – support dynamic queries at violation time (e.g., show which variables did not change and their properties) and collect global snapshots early in training to ground debugging.
+
+## Near Term (Supporting Work)
 
 - **Online monitoring** – integrate the checker directly into the collection process so violations are reported immediately during training.
-- **Pre-inferred invariant library** – ship a curated set of invariants for common PyTorch and HuggingFace workflows to reduce the need for manual inference.
 - **Improved distributed support** – better handling of multi-GPU and multi-node runs, including tracing of distributed backends.
-- **High-quality invariants** – publish well-tested invariants for PyTorch, DeepSpeed, and Transformers out of the box.
-- **Demo assets** – publish a short demo video and GIFs illustrating the TrainCheck workflow.
-- **Expanded documentation** – add guidance on choosing reference runs and diagnosing issues, plus deeper technical docs.
-- **Stability fixes and tests** – resolve proxy dump bugs and add end-to-end tests for the full instrumentation→inference→checking pipeline.
-- **Call graph updates** – document the call-graph generation process and keep graphs in sync with recent PyTorch versions.
-- **Repository cleanup** – remove obsolete files and artifacts.
+- **Stability fixes and tests** – add end-to-end tests for the full instrumentation→inference→checking pipeline and resolve known instrumentation edge cases.
+- **Expanded documentation** – guidance on choosing reference runs and diagnosing issues, plus deeper technical docs.
 
 ## Medium Term
 
-- **Extensible instrumentation** – allow plugins for third-party libraries and custom frameworks.
-- **Smarter invariant filtering** – tooling to help users manage large numbers of invariants and suppress benign ones.
-- **Performance improvements** – explore parallel inference and more efficient trace storage formats.
+- **Invariant management** – tooling to filter, group, and suppress benign invariants at scale.
+- **Extensible instrumentation** – plugins for third-party libraries and custom frameworks.
+- **Performance improvements** – parallel inference and more efficient trace storage formats.
+- **Pre-inferred invariant library** – curated, well-tested invariants for common PyTorch and HuggingFace workflows.
 
 ## Long Term
 
-- **Cross-framework support** – expand beyond PyTorch to additional deep learning frameworks.
 - **Automated root-cause analysis** – provide hints or suggested fixes when a violation is detected.
+- **Cross-framework support** – expand beyond PyTorch to additional deep learning frameworks.
 
 We welcome contributions in any of these areas. If you have ideas or want to help, please check the [CONTRIBUTING guide](./CONTRIBUTING.md) and open an issue to discuss!
