AI-Powered CI/CD Pipeline for AMD ROCm PyTorch Builds
Automated build orchestration with intelligent failure triage, reducing debugging time by 70% through ML-driven log analysis and root cause inference.
- π€ ML-Powered Analysis - NLP-based log parsing with semantic similarity matching and BERT embeddings
- β‘ GPU-Accelerated - Native AMD ROCm support with Kubernetes device plugins
- π Production Monitoring - Prometheus metrics + OpenTelemetry distributed tracing
- π Auto-Scaling - Kubernetes HPA with intelligent load balancing
- π Enterprise Security - JWT auth, RBAC, secret management, data encryption
- π§ Smart Recommendations - Automated fix suggestions based on historical patterns
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β GPU Build Intelligence β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β ββββββββββββββββ ββββββββββββββββ ββββββββββββββββ β
β β API Layer β β Orchestrator β β Builder β β
β β (FastAPI) βββββΊβ (Coordinator)βββββΊβ (Executor) β β
β ββββββββββββββββ ββββββββββββββββ ββββββββββββββββ β
β β β β β
β βΌ βΌ βΌ β
β ββββββββββββββββ ββββββββββββββββ ββββββββββββββββ β
β β Analyzer β β Storage β β Monitoring β β
β β (ML Engine) βββββΊβ (MongoDB) βββββΊβ (Prometheus) β β
β ββββββββββββββββ ββββββββββββββββ ββββββββββββββββ β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
| Service | Description |
|---|---|
| API | FastAPI server with JWT auth, rate limiting, CORS |
| Orchestrator | Build coordination, resource allocation, load balancing |
| Builder | Environment setup, build execution, artifact management |
| Analyzer | ML-powered log parsing, pattern matching, root cause analysis |
| Storage | MongoDB for builds, Redis for caching, S3 for artifacts |
| Monitoring | Prometheus metrics, OpenTelemetry tracing, alerting |
- Python 3.11+
- MongoDB 7.0+
- Redis 7.0+
- Docker & Kubernetes (optional)
# Clone the repository
git clone https://github.com/Onchana01/PyTorch-gpu-build-AI.git
cd PyTorch-gpu-build-AI
# Create virtual environment
python -m venv venv
source venv/bin/activate # Linux/Mac
# or: venv\Scripts\activate # Windows
# Install dependencies
pip install -r requirements.txt
# Configure environment
cp .env.example .env
# Edit .env with your settings
# Run the server
python -m src.api.main| Endpoint | Method | Description |
|---|---|---|
/health |
GET | Health check |
/api/v1/builds |
POST | Submit a new build |
/api/v1/builds/{id} |
GET | Get build status |
/api/v1/builds/{id}/logs |
GET | Get build logs |
/api/v1/analysis/{id} |
GET | Get failure analysis |
/docs |
GET | Interactive API docs |
Build Logs β Log Parser β Pattern Matcher β Root Cause Analyzer β Recommendations
β β β β
βΌ βΌ βΌ βΌ
Error Extraction Semantic Match Bayesian Inference Fix Suggestions
- Compilation Errors - Syntax, type, template errors
- Linking Errors - Undefined references, library conflicts
- Runtime Errors - Segfaults, memory issues, GPU errors
- Configuration Errors - CMake, environment, dependency issues
- Test Failures - Unit test, integration test failures
The system uses Bayesian causal inference to determine the most likely root cause:
# Example analysis output
{
"failure_category": "compilation_error",
"root_cause": "Missing ROCm HIP headers",
"confidence": 0.92,
"recommendations": [
"Install ROCm 6.0 development headers",
"Add /opt/rocm/include to CMAKE_PREFIX_PATH",
"Verify HIP_PLATFORM environment variable"
],
"similar_failures": 47,
"fix_success_rate": 0.89
}# Key metrics exposed
- build_requests_total
- build_duration_seconds
- build_success_rate
- analysis_latency_seconds
- gpu_utilization_percent
- queue_depthOpenTelemetry integration provides end-to-end request tracing:
API Request β Orchestrator β Builder β Analyzer β Storage
β β β β β
span1 span2 span3 span4 span5
| Variable | Description | Default |
|---|---|---|
MONGODB_URL |
MongoDB connection string | mongodb://localhost:27017 |
REDIS_URL |
Redis connection string | redis://localhost:6379 |
JWT_SECRET_KEY |
JWT signing key | (required) |
ROCM_DEFAULT_VERSION |
Default ROCm version | 6.0 |
MAX_CONCURRENT_BUILDS |
Max parallel builds | 10 |
BUILD_TIMEOUT_SECONDS |
Build timeout | 7200 |
# Apply Kubernetes manifests
kubectl apply -f kubernetes/
# Deploy with Helm
helm install gpu-build-intel ./helm/gpu-build-intelligence \
--namespace rocm-cicd \
--create-namespaceresources:
requests:
cpu: "500m"
memory: "512Mi"
limits:
cpu: "2000m"
memory: "2Gi"
amd.com/gpu: 1 # For GPU-enabled builds| Metric | Value |
|---|---|
| Build Log Processing | 10K+ logs/day |
| Analysis Latency | <2 seconds |
| Fix Recommendation Accuracy | 85% |
| MTTR Reduction | 70% (45min β 12min) |
| Concurrent Builds | 50+ GPU-enabled |
- Python 3.11+ - Core language
- FastAPI - Async web framework
- Pydantic - Data validation
- Motor - Async MongoDB driver
- Redis - Distributed caching
- Kubernetes - Container orchestration
- Docker - Containerization
- Helm - Package management
- Prometheus - Metrics collection
- OpenTelemetry - Distributed tracing
- Grafana - Dashboards
- PyTorch - ML framework
- Sentence Transformers - Text embeddings
- scikit-learn - Pattern matching
gpu-build-intelligence/
βββ src/
β βββ api/ # FastAPI endpoints & middleware
β βββ orchestrator/ # Build coordination & scheduling
β βββ builder/ # Build execution & environments
β βββ analyzer/ # ML-powered log analysis
β βββ storage/ # Database & cache management
β βββ notification/ # Alerts & notifications
β βββ monitoring/ # Metrics & tracing
β βββ common/ # Shared utilities & config
βββ kubernetes/ # K8s manifests
βββ helm/ # Helm charts
βββ tests/ # Test suites
βββ docs/ # Documentation
βββ scripts/ # Utility scripts
- Fork the repository
- Create a feature branch (
git checkout -b feature/amazing-feature) - Commit changes (
git commit -m 'Add amazing feature') - Push to branch (
git push origin feature/amazing-feature) - Open a Pull Request
This project is licensed under the MIT License - see LICENSE file.
- AMD ROCm Team for GPU compute platform
- PyTorch Team for the ML framework
- FastAPI for the excellent async web framework
Built with β€οΈ for the ML/GPU community