A platform for end-to-end ML model development.
XRay gives ML engineers self-service development environments on Kubernetes. Create a workspace, SSH in, write code, train models -- without waiting for ops to provision machines.
This project is a work in progress. Some features described below are planned but not yet implemented.
- Workspaces -- Spin up GPU-enabled dev environments with one command. SSH in from your terminal or IDE (VS Code, Cursor). Stop and start them as needed.
- Multi-tenant -- Organizations, teams, and projects. Each user gets their own workspaces scoped to a project.
- Multi-cluster -- Connect multiple Kubernetes clusters. Workspaces are created on the cluster you choose.
- Job submission -- Submit training jobs to shared clusters. (planned)
- Queuing and priority -- Fair-share scheduling across teams with priority levels. (planned)
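As a rough illustration of the multi-tenant and multi-cluster features above, the sketch below models a workspace that belongs to one user, is scoped to one project, and runs on a chosen cluster. All type, field, and function names here (`Project`, `Workspace`, `OwnedIn`) and the sample values are assumptions made for illustration; they are not taken from the XRay codebase or its database schema.

```go
package main

import "fmt"

// Project identifies the bottom level of an organization/team/project
// hierarchy. Field names are hypothetical.
type Project struct {
	Org  string
	Team string
	Name string
}

// Workspace is a hypothetical record for one dev environment: it has an
// owner, is scoped to a project, and runs on the cluster the user chose.
type Workspace struct {
	Name    string
	Owner   string
	Project Project
	Cluster string
}

// OwnedIn returns the workspaces that belong to owner within project p.
func OwnedIn(all []Workspace, owner string, p Project) []Workspace {
	var out []Workspace
	for _, ws := range all {
		if ws.Owner == owner && ws.Project == p {
			out = append(out, ws)
		}
	}
	return out
}

func main() {
	vision := Project{Org: "acme", Team: "research", Name: "vision"}
	all := []Workspace{
		{Name: "dev-a", Owner: "alice", Project: vision, Cluster: "gpu-east"},
		{Name: "dev-b", Owner: "bob", Project: vision, Cluster: "gpu-west"},
	}
	for _, ws := range OwnedIn(all, "alice", vision) {
		fmt.Println(ws.Name, "on", ws.Cluster)
	}
}
```

Keeping the project as a comparable value makes the scoping check a single equality test; a real implementation would more likely filter in PostgreSQL by project and user IDs.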
There are three main components:
- Control Plane -- A Go server with a REST API and web UI. Manages users, teams, clusters, and workspaces. Stores state in PostgreSQL.
- Cluster Agent -- A lightweight binary that runs in each Kubernetes cluster. Connects to the control plane over WebSocket and creates or deletes workspace pods on command.
- CLI -- A command-line tool (xray) for logging in, managing workspaces, and configuring SSH access.
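To make the control plane / agent split more concrete, here is a minimal sketch of how an agent might decode a command received over its WebSocket connection and route it to pod creation or deletion. The message shape (`action`, `workspace`), the `PodManager` interface, and the in-memory `fakePods` stand-in are all assumptions for illustration; the actual agent protocol is described in the architecture documentation and may differ. The WebSocket transport and the Kubernetes client are deliberately left out.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Command is a hypothetical message from the control plane to an agent.
type Command struct {
	Action    string `json:"action"`    // assumed values: "create", "delete"
	Workspace string `json:"workspace"` // workspace identifier
}

// PodManager abstracts the Kubernetes calls an agent would make.
type PodManager interface {
	CreatePod(workspace string) error
	DeletePod(workspace string) error
}

// Dispatch decodes one raw message and routes it to the pod manager.
func Dispatch(raw []byte, pm PodManager) error {
	var cmd Command
	if err := json.Unmarshal(raw, &cmd); err != nil {
		return fmt.Errorf("decode command: %w", err)
	}
	if cmd.Workspace == "" {
		return fmt.Errorf("command %q has no workspace", cmd.Action)
	}
	switch cmd.Action {
	case "create":
		return pm.CreatePod(cmd.Workspace)
	case "delete":
		return pm.DeletePod(cmd.Workspace)
	default:
		return fmt.Errorf("unknown action %q", cmd.Action)
	}
}

// fakePods is an in-memory stand-in for a cluster, used only for the demo.
type fakePods struct{ pods map[string]bool }

func (f *fakePods) CreatePod(ws string) error { f.pods[ws] = true; return nil }
func (f *fakePods) DeletePod(ws string) error { delete(f.pods, ws); return nil }

func main() {
	pm := &fakePods{pods: map[string]bool{}}
	msg := []byte(`{"action":"create","workspace":"ws-1"}`)
	if err := Dispatch(msg, pm); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("pods:", len(pm.pods))
}
```

Putting pod operations behind an interface keeps the dispatch logic testable without a cluster, which is one plausible way to structure an agent of this kind.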
For detailed architecture documentation, see .memory/architecture/:
- Architecture Overview - Tech stack, project structure, and architecture layers
- Authentication - OAuth flows (web + CLI PKCE), token refresh, user-identity pattern
- Teams & Organizations - 3-level hierarchy, memberships, authorization model
- Clusters & Agent - Cluster registration, agent WebSocket protocol, bastion, pod management, storage
- Compute Templates - Admin-defined resource presets, scheduling hints, admin/user views
- Workspaces - Workspace lifecycle, SSH key management, CLI SSH config, connect-info
- Jobs - Batch compute, multi-node execution, submission patterns
- Database Schema - ER diagram, migration history, query patterns
- Frontend - React SPA structure, auth state machine, API communication
- E2E Testing - Kind cluster setup, Docker images, test infrastructure
Work in progress. Contributions and feedback welcome.