DeepTrace is a diagnostic tool for distributed training tasks. It uses a client-agent architecture: a lightweight agent is deployed alongside each training task with minimal intrusion, and the client communicates with agents over gRPC, which supports real-time data streaming and command control. The distributed design scales to large training clusters and leaves room for new troubleshooting and analysis features.
- Distributed Architecture: Client-agent structure suitable for large training clusters
- Lightweight Agent: Minimal intrusion on training tasks, easy deployment
- Real-time Data Streaming: gRPC-based protocol supports real-time data transmission
- Multi-dimensional Diagnostics: Supports log analysis and stack tracing
- Extensible Design: Modular architecture for easy feature additions
- CLI Interface: Provides command-line tools for operations and maintenance
- Client Components
  - CLI Interface: Command-line tool for maintenance personnel
  - Analysis Engine: Core logic that processes data from agents
- Training Cluster Layer
  - Diagnostic Agent: Monitors training tasks (via logs and stacks)
- Communication Protocol
  - Uses gRPC
- Go 1.24+
make build
Built binaries contain the version information specified in the Makefile.
After modifying protobuf definitions, run:
make generate
Execute hang detection:
client check-hang --job-id my_job -a address_file --threshold 120 --interval 5
For the detailed API reference, see the proto file.
We welcome contributions! Before submitting a PR:
- Write test cases
- Update relevant documentation
- Ensure all tests pass
- Fork the project
- Clone the repository
- Create a feature branch
- Commit your changes
- Push to the branch
- Create a Pull Request
Apache License 2.0 - See LICENSE.
Please submit issues or contact maintainers for support.