DeepTrace: A lightweight, scalable real-time diagnostic and analysis tool for distributed training tasks.

DeepLink-org/DeepTrace

DeepTrace

Project Overview

DeepTrace is a diagnostic tool for distributed training tasks. It uses a client-agent architecture: a lightweight agent is deployed alongside the training tasks with minimal intrusion, and the client communicates with the agents over gRPC, which supports both real-time data streaming and command control. The distributed design scales to large training clusters and leaves room for adding new troubleshooting and analysis features.

Key Features

  • Distributed Architecture: Client-agent structure suitable for large training clusters
  • Lightweight Agent: Minimal intrusion on training tasks, easy deployment
  • Real-time Data Streaming: gRPC-based protocol supports real-time data transmission
  • Multi-dimensional Diagnostics: Supports log analysis and stack tracing
  • Extensible Design: Modular architecture for easy feature additions
  • CLI Interface: Provides command-line tools for operations and maintenance

Architecture Design

Architecture Components:

  1. Client Components
    • CLI Interface: Command-line tool for operations and maintenance staff
    • Analysis Engine: Core logic processing data from Agents
  2. Training Cluster Layer
    • Diagnostic Agent: Monitors training tasks (via logs/stacks)
  3. Communication Protocol
    • Uses gRPC protocol

Installation Guide

Requirements

  • Go 1.24+

Build

make build

The built binaries embed the version information specified in the Makefile.
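
A common way for a Go build to carry version information is for the Makefile to pass `-ldflags "-X ..."` to `go build`, which overwrites string variables at link time. The sketch below assumes that approach. The variable names `version` and `commit` and the example build line are illustrative and are not taken from DeepTrace's Makefile.

```go
package main

import "fmt"

// These defaults are overwritten at link time, for example with a
// (hypothetical) Makefile recipe such as:
//
//	go build -ldflags "-X main.version=$(VERSION) -X main.commit=$(COMMIT)"
//
// The variable names are assumptions made for illustration.
var (
	version = "dev"
	commit  = "unknown"
)

// versionString formats the embedded build metadata for display.
func versionString() string {
	return fmt.Sprintf("version %s (commit %s)", version, commit)
}

func main() {
	fmt.Println(versionString())
}
```

Built with plain `go build`, this program reports the defaults. Built with the `-X` flags, it reports whatever values the build supplied.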

Protobuf Generation

After modifying protobuf definitions, run:

make generate

Usage

Client Commands

Execute hang detection:

client check-hang --job-id my_job -a address_file --threshold 120 --interval 5

API Documentation

For the detailed API reference, see the proto file.

Contributing

We welcome contributions! Before submitting a PR:

  1. Write test cases
  2. Update relevant documentation
  3. Ensure all tests pass

Development Setup

  1. Fork the project
  2. Clone the repository
  3. Create a feature branch
  4. Commit your changes
  5. Push to the branch
  6. Create a Pull Request

License

Apache License 2.0 - See LICENSE.

Contact

Please submit issues or contact maintainers for support.
