diff --git a/.github/copilot-instructions.md b/.github/copilot-instructions.md new file mode 100644 index 0000000000..9bd67c8b72 --- /dev/null +++ b/.github/copilot-instructions.md @@ -0,0 +1,74 @@ +# GitHub Copilot Code Review Instructions + +## Review Philosophy + +- Only comment when you have HIGH CONFIDENCE (>80%) that an issue exists +- Be concise: one sentence per comment when possible +- Focus on actionable feedback, not observations +- When reviewing text, only comment on clarity issues if the text is genuinely confusing or could lead to errors. + "Could be clearer" is not the same as "is confusing" - stay silent unless HIGH confidence it will cause problems + +## Priority Areas (Review These) + +### Security & Safety + +- Unsafe code blocks without justification +- Command injection risks (shell commands, user input) +- Path traversal vulnerabilities +- Credential exposure or hardcoded secrets +- Missing input validation on external data +- Improper error handling that could leak sensitive info + +### Correctness Issues + +- Logic errors that could cause panics or incorrect behavior +- Resource leaks (files, connections, memory) +- Off-by-one errors or boundary conditions +- Optional types that don't need to be optional +- Booleans that should default to false but are set as optional +- Overly defensive code that adds unnecessary checks +- Unnecessary comments that just restate what the code already shows (remove them) + +### Architecture & Patterns + +- Code that violates existing patterns in the codebase +- Missing error handling +- Code that is not following [Effective Go](https://go.dev/doc/effective_go), [Python PEP8](https://peps.python.org/pep-0008/), or [Rust API guidelines](https://rust-lang.github.io/api-guidelines/) +- Violating [Kubernetes API guidelines](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api_changes.md) + +## Project-Specific Context + +- See [AGENTS.md](../AGENTS.md) in the root directory for project guidelines and architecture decisions. + +## CI Pipeline Context + +**Important**: You review PRs immediately, before CI completes. Do not flag issues that CI will catch. + +### What Our CI Checks + +- `.github/workflows/test-go.yaml` - Code generation, linting, and tests for Go source code +- `.github/workflows/test-python.yaml` - unit and integration tests for Python source code +- `.github/workflows/test-rust.yaml` - unit and integration tests for Rust source code +- `.github/workflows/test-e2e.yaml` - e2e tests +- `.github/workflows/build-and-push-images.yaml` - build and push container images + +## Skip These (IMPORTANT) + +Do not comment on: + +- **Auto generated code** - CI handles this (make generate) +- **Style/formatting** - CI handles this (gofmt, black, prettier) +- **Test failures** - CI handles this (full test suite) +- **Missing dependencies** - CI handles this + +## Response Format + +When you identify an issue: + +1. **State the problem** (1 sentence) +2. **Why it matters** (1 sentence, only if not obvious) +3. **Suggested fix** (code snippet or specific action) + +## When to Stay Silent + +If you're uncertain whether something is an issue, don't comment. False positives create noise and reduce trust in the review process. diff --git a/AGENTS.md b/AGENTS.md new file mode 100644 index 0000000000..7391acd8a8 --- /dev/null +++ b/AGENTS.md @@ -0,0 +1,346 @@ +## Who This Is For + +- **AI agents**: Automate repository tasks with minimal context +- **Contributors**: Humans using AI assistants or working directly +- **Maintainers**: Ensure assistants follow project conventions and CI rules + +## Agent Behavior Policy + +AI agents should: + +- Make atomic, minimal, and reversible changes. +- Prefer local analysis (e.g. `make generate`, `make fmt`, `make test`, `make test-python`) before proposing commits. +- NEVER modify configuration, CI/CD, or release automation unless explicitly requested. +- Scan the generated code for vulnerabilities and dependency upgrades. +- Avoid non-deterministic code or random seeds without fixtures. +- Use `AGENTS.md` and `Makefile` as the source of truth for development commands. + +Agents must NOT: + +- Bypass tests or linters +- Introduce dependencies without updating `go.mod` (Go) or `pyproject.toml` (Python) or `Cargo.toml` (Rust) +- Generate or commit large autogenerated files +- Modify CRD schemas or API versions without explicit instruction + +### Context Awareness + +Before writing code, agents should: + +- Read existing test cases and docstrings for pattern alignment +- Match import patterns from neighboring files +- Preserve existing logging and error-handling conventions +- Understand the plugin architecture and extension framework before modifying runtime code +- Review CRD schemas in `pkg/apis/` before changing API structures +- Call out for any breaking changes introduced and follow the deprecation policy + +For additional context see [the Kubeflow Trainer docs](https://www.kubeflow.org/docs/components/trainer/overview/). + +## Repository Map + +``` +kubeflow/trainer/ +├── .github/ # GitHub actions for CI/CD +├── charts/ # Helm charts for deployment +├── cmd/ # Command-line applications and binaries +│ ├── trainer-controller-manager/ # Main Trainer controller (Go) +│ ├── initializers/ # Dataset/model initializers (Python) +│ │ ├── dataset/ +│ │ └── model/ +│ ├── runtimes/ # Builtin ML training runtimes +│ │ ├── deepspeed/ # DeepSpeed runtime +│ │ └── mlx/ # MLX runtime +│ ├── trainers/ # Builtin trainers for LLM fine-tuning +│ │ └── torchtune/ # TorchTune fine-tuning trainer +│ └── data_cache/ # Distributed data cache service (Rust) +└── docs/ # Documentation and proposals +├── examples/ # Examples with TrainJobs +├── hack/ # Scripts to manage CI/CD and installation +├── manifests/ # Kustomize manifests for deployment +├── pkg/ # Core library packages (Go) +│ ├── apis/ # Kubernetes CRD API definitions +│ │ ├── trainer/v1alpha1/ # TrainJob, TrainingRuntime, and ClusterTrainingRuntime APIs +│ │ └── config/v1alpha1/ # Trainer config APIs +│ ├── config/ # Trainer config logic +│ ├── controller/ # Trainer Kubernetes controllers logic +│ ├── runtime/ # Trainer Extension Framework +│ │ ├── core/ # Core runtime implementation +│ │ └── framework/ # Implementation for the framework +│ │ ├── plugins/ # Implementation for the builtin plugins +│ │ │ ├── torch/ # PyTorch plugin +│ │ │ ├── mpi/ # MPI plugin +│ │ │ ├── jobset/ # JobSet plugin +│ │ │ └── ... +│ │ └── interface.go # Framework interface definitions +│ │ └── runtime.go # Implementation of Info object which carries information trough the plugin chain. +│ ├── webhooks/ # Kubernetes validation/mutation webhooks for Trainer +│ ├── data_cache/ # Distributed in-memory cache (Rust) +│ ├── initializers/ # Dataset and model initializers (Python) +│ └── util/ # Utility functions (Go) +├── test/ # Integration and E2E tests +│ ├── integration/ # Ginkgo integration tests +│ └── e2e/ # End-to-end tests +``` + +## Environment & Tooling + +- **Go**: primary language for controller, APIs, plugins +- **Python**: dataset and model initializer +- **Rust**: data cache +- **Build**: `make` (orchestration), `go build`, `cargo`, `docker` +- **Lint/format**: `golangci-lint`, `gofmt` (Go), `ruff` (Python), `cargo fmt` (Rust) +- **Tests**: `go test`, `ginkgo` (integration), `pytest` (Python), `cargo test` (Rust) +- **Code generation**: `controller-gen`, `openapi-gen` +- **Pre-commit**: Config provided and enforced in CI + +## Commands + + + +### Build + +Use available container runtime to build an image. For example: + +```bash +docker build . -f cmd/trainer-controller-manager/Dockerfile -t trainer-controller-manager:test +docker build . -f cmd/runtimes/deepspeed/Dockerfile -t deepspeed-runtime:test +``` + +### Testing + +```bash +make test # Go unit tests +make test-integration # Go integration test +make test-python # Python unit tests +make test-python-integration # Python integration tests +make test-rust # Rust unit tests +make test-e2e # End-to-end tests (requires Kind cluster) + +# Targeted tests +go test ./pkg/controller/... # Run all controller tests +go test -v -run TestTrainJobController ./pkg/controller/ # Run specific test function +``` + +### Local lint/format + +```bash +make fmt # Format Go code +make vet # Vet the Go code +make golangci-lint # Verify the Go code +``` + +**Code generation** (always run after modifying the APIs): + +```bash +make generate # Generate the required files +``` + +**Pre-commit**: + +```bash +pre-commit install # Install hooks +pre-commit run --all-files # Run all hooks manually +``` + + + +## Development Workflow for AI Agents + +**Preferred commands**: Use `make` targets to ensure consistency with CI + +**Before making changes**: + +1. Read existing code patterns, comments, and tests for alignment +2. Check the Core Development Principles below +3. Run quick start command for validation and testing + +**Commit/PR hygiene**: + +- Follow Conventional Commits in titles and messages. +- See the [check-pr-title.yaml](./.github/workflows/check-pr-title.yaml) for PR titles conventions. +- Include rationale ("why") in commit messages/PR descriptions +- Do not push secrets or change git config +- Scope discipline: only modify files relevant to the task; keep diffs minimal + +## Core Development Principles + +### 1. Maintain Stable Public Interfaces ⚠️ CRITICAL + +**Always preserve API compatibility for released versions. APIs are in alpha and evolving.** + +**API Stability Rules**: + +- **CRD schemas** (`pkg/apis/trainer/v1alpha1`): Changes require careful review + - Adding fields: Use `+optional` marker and provide defaults + - ALWAYS use [the CEL validation](https://kubernetes.io/docs/reference/using-api/cel/) whenever applicable + - Removing/renaming fields: Requires API version bump and migration plan + - Changing field types: Breaking change, requires deprecation period +- **Go public APIs**: Exported types, functions, interfaces + - Check if exported (capitalized names) + - Look for usage in examples, tests, and documentation + - Use deprecation comments for gradual removal +- **Plugin interfaces** (`pkg/runtime/framework/interface.go`): Breaking changes affect all plugins + +❌ **Bad - Breaking Change:** + +```go +// Changed field name in CRD without migration +type TrainJobSpec struct { + // Changed from `Suspend` to `Paused` + Paused *bool `json:"paused,omitempty"` +} +``` + +✅ **Good - Backward Compatible:** + +```go +type TrainJobSpec struct { + // Suspend pauses job execution without deleting resources. + // Useful for debugging or resource optimization. + // +optional + Suspend *bool `json:"suspend,omitempty"` + + // NewFeature enables experimental capability + // +optional + NewFeature *bool `json:"newFeature,omitempty"` +} +``` + +### 2. Code Quality Standards + +ALWAYS follow the existing patterns in the codebase. + +#### Go Code Standards + +❌ **Bad:** + +```go +func p(u, d interface{}) interface{} { + return u +} +``` + +✅ **Good:** + +```go +// ReconcileTrainJob reconciles a TrainJob object +func (r *TrainJobReconciler) ReconcileTrainJob(ctx context.Context, trainJob *trainv1alpha1.TrainJob) error { + log := ctrl.LoggerFrom(ctx) + log.V(1).Info("Reconciling TrainJob", "name", trainJob.Name, "namespace", trainJob.Namespace) + + // Implementation... + return nil +} +``` + +**Go Style Requirements**: + +- Follow + [Kubernetes code conventions](https://github.com/kubernetes/community/blob/master/contributors/guide/coding-conventions.md), + [Effective Go](https://go.dev/doc/effective_go), and + [Kubernetes API](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api_changes.md) best practices. +- Use structured logging with `ctrl.LoggerFrom(ctx)` (Zap-based) +- Error handling: Always check errors, use `fmt.Errorf` for wrapping +- Naming: `camelCase` for unexported, `PascalCase` for exported +- Package names: Short, lowercase, no underscores + +#### Python Code Standards + +❌ **Bad - Missing provider pattern:** + +```python +class CustomModel: # Not inheriting from ModelProvider(ABC) + def download(self): + pass +``` + +✅ **Good - Following provider pattern:** + +```python +class HuggingFace(utils.ModelProvider): + """HuggingFace model initializer.""" + + def load_config(self) -> None: + config_dict = utils.get_config_from_env(types.HuggingFaceModelInitializer) + self.config = types.HuggingFaceModelInitializer(**config_dict) + + def download_model(self) -> None: + """Download model from HuggingFace Hub.""" + # Implementation... +``` + +**Python Style Requirements**: + +- Line length 100, Python 3.11+, double quotes, spaces indent +- Imports: isort via ruff; prefer absolute imports +- Naming: `snake_case` for functions/vars, `PascalCase` for classes, `UPPER_SNAKE_CASE` for constants +- Use descriptive variable names; break up complex functions (>20 lines) +- Use logging module (not print statements) for output + +#### Rust Code Standards + +- Follow Cargo conventions and rustfmt defaults + +```rust +/// Distributed cache server implementation +pub struct CacheServer { + config: ServerConfig, + state: Arc>, +} + +impl CacheServer { + /// Create new cache server instance + pub fn new(config: ServerConfig) -> Result { + Ok(Self { + config, + state: Arc::new(RwLock::new(CacheState::default())), + }) + } +} +``` + +### 3. Testing Requirements + +- Every new feature or bugfix MUST be covered by tests +- Every new test MUST follow the existing tests structure +- Unit tests should go to the same folder as source code +- Integration tests should go to the `test/integration/` directory + +#### Go Testing Patterns + +- File names must have `*_test.go` postfix +- Use dictionary to define test cases +- Every new function must have a corresponding test function prefixed with `Test` + - Example: `func RunEnforceMLPolicyPlugins()` -> `func TestRunEnforceMLPolicyPlugins()` +- Integration tests use Ginkgo framework + +#### Python Testing Patterns + +- File names must have `*_test.py` postfix +- Use pytest with fixtures +- Every new function must have a corresponding test function prefixed with `test_` + - Example: `def calculate_total()` -> `def test_calculate_total()` +- Use `pytest.mark.parametrize` with `TestCase` dataclass for multiple test scenarios: + +```python +@pytest.mark.parametrize( + "test_case", + [ + TestCase( + name="valid dataset URI", + expected_status=SUCCESS, + config={"uri": "hf://meta-llama/model"}, + expected_output={"scheme": "hf"}, + ), + TestCase( + name="invalid URI format", + expected_status=FAILED, + config={"uri": "invalid"}, + expected_error=ValueError, + ), + ], +) +def test_parse_dataset_uri(test_case): + # Test implementation using test_case attributes + result = parse_dataset_uri(**test_case.config) + assert result == test_case.expected_output +```