diff --git a/.github/copilot-instructions.md b/.github/copilot-instructions.md
new file mode 100644
index 0000000000..9bd67c8b72
--- /dev/null
+++ b/.github/copilot-instructions.md
@@ -0,0 +1,74 @@
+# GitHub Copilot Code Review Instructions
+
+## Review Philosophy
+
+- Only comment when you have HIGH CONFIDENCE (>80%) that an issue exists
+- Be concise: one sentence per comment when possible
+- Focus on actionable feedback, not observations
+- When reviewing text, only comment on clarity issues if the text is genuinely confusing or could lead to errors.
+  "Could be clearer" is not the same as "is confusing" - stay silent unless HIGH confidence it will cause problems
+
+## Priority Areas (Review These)
+
+### Security & Safety
+
+- Unsafe code blocks without justification
+- Command injection risks (shell commands, user input)
+- Path traversal vulnerabilities
+- Credential exposure or hardcoded secrets
+- Missing input validation on external data
+- Improper error handling that could leak sensitive info
+
+### Correctness Issues
+
+- Logic errors that could cause panics or incorrect behavior
+- Resource leaks (files, connections, memory)
+- Off-by-one errors or boundary conditions
+- Optional types that don't need to be optional
+- Booleans that should default to false but are set as optional
+- Overly defensive code that adds unnecessary checks
+- Unnecessary comments that just restate what the code already shows (remove them)
+
+### Architecture & Patterns
+
+- Code that violates existing patterns in the codebase
+- Missing error handling
+- Code that is not following [Effective Go](https://go.dev/doc/effective_go), [Python PEP8](https://peps.python.org/pep-0008/), or [Rust API guidelines](https://rust-lang.github.io/api-guidelines/)
+- Violating [Kubernetes API guidelines](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api_changes.md)
+
+## Project-Specific Context
+
+- See [AGENTS.md](../AGENTS.md) in the root directory for project guidelines and architecture decisions.
+
+## CI Pipeline Context
+
+**Important**: You review PRs immediately, before CI completes. Do not flag issues that CI will catch.
+
+### What Our CI Checks
+
+- `.github/workflows/test-go.yaml` - Code generation, linting, and tests for Go source code
+- `.github/workflows/test-python.yaml` - unit and integration tests for Python source code
+- `.github/workflows/test-rust.yaml` - unit and integration tests for Rust source code
+- `.github/workflows/test-e2e.yaml` - e2e tests
+- `.github/workflows/build-and-push-images.yaml` - build and push container images
+
+## Skip These (IMPORTANT)
+
+Do not comment on:
+
+- **Auto generated code** - CI handles this (make generate)
+- **Style/formatting** - CI handles this (gofmt, black, prettier)
+- **Test failures** - CI handles this (full test suite)
+- **Missing dependencies** - CI handles this
+
+## Response Format
+
+When you identify an issue:
+
+1. **State the problem** (1 sentence)
+2. **Why it matters** (1 sentence, only if not obvious)
+3. **Suggested fix** (code snippet or specific action)
+
+## When to Stay Silent
+
+If you're uncertain whether something is an issue, don't comment. False positives create noise and reduce trust in the review process.
diff --git a/AGENTS.md b/AGENTS.md
new file mode 100644
index 0000000000..7391acd8a8
--- /dev/null
+++ b/AGENTS.md
@@ -0,0 +1,346 @@
+## Who This Is For
+
+- **AI agents**: Automate repository tasks with minimal context
+- **Contributors**: Humans using AI assistants or working directly
+- **Maintainers**: Ensure assistants follow project conventions and CI rules
+
+## Agent Behavior Policy
+
+AI agents should:
+
+- Make atomic, minimal, and reversible changes.
+- Prefer local analysis (e.g. `make generate`, `make fmt`, `make test`, `make test-python`) before proposing commits.
+- NEVER modify configuration, CI/CD, or release automation unless explicitly requested.
+- Scan the generated code for vulnerabilities and dependency upgrades.
+- Avoid non-deterministic code or random seeds without fixtures.
+- Use `AGENTS.md` and `Makefile` as the source of truth for development commands.
+
+Agents must NOT:
+
+- Bypass tests or linters
+- Introduce dependencies without updating `go.mod` (Go) or `pyproject.toml` (Python) or `Cargo.toml` (Rust)
+- Generate or commit large autogenerated files
+- Modify CRD schemas or API versions without explicit instruction
+
+### Context Awareness
+
+Before writing code, agents should:
+
+- Read existing test cases and docstrings for pattern alignment
+- Match import patterns from neighboring files
+- Preserve existing logging and error-handling conventions
+- Understand the plugin architecture and extension framework before modifying runtime code
+- Review CRD schemas in `pkg/apis/` before changing API structures
+- Call out for any breaking changes introduced and follow the deprecation policy
+
+For additional context see [the Kubeflow Trainer docs](https://www.kubeflow.org/docs/components/trainer/overview/).
+
+## Repository Map
+
+```
+kubeflow/trainer/
+├── .github/                         # GitHub actions for CI/CD
+├── charts/                          # Helm charts for deployment
+├── cmd/                             # Command-line applications and binaries
+│   ├── trainer-controller-manager/    # Main Trainer controller (Go)
+│   ├── initializers/                  # Dataset/model initializers (Python)
+│   │   ├── dataset/
+│   │   └── model/
+│   ├── runtimes/                      # Builtin ML training runtimes
+│   │   ├── deepspeed/                   # DeepSpeed runtime
+│   │   └── mlx/                         # MLX runtime
+│   ├── trainers/                      # Builtin trainers for LLM fine-tuning
+│   │   └── torchtune/                   # TorchTune fine-tuning trainer
+│   └── data_cache/                    # Distributed data cache service (Rust)
+└── docs/                            # Documentation and proposals
+├── examples/                        # Examples with TrainJobs
+├── hack/                            # Scripts to manage CI/CD and installation
+├── manifests/                       # Kustomize manifests for deployment
+├── pkg/                             # Core library packages (Go)
+│   ├── apis/                          # Kubernetes CRD API definitions
+│   │   ├── trainer/v1alpha1/            # TrainJob, TrainingRuntime, and ClusterTrainingRuntime APIs
+│   │   └── config/v1alpha1/             # Trainer config APIs
+│   ├── config/                        # Trainer config logic
+│   ├── controller/                    # Trainer Kubernetes controllers logic
+│   ├── runtime/                     # Trainer Extension Framework
+│   │   ├── core/                      # Core runtime implementation
+│   │   └── framework/                 # Implementation for the framework
+│   │       ├── plugins/                 # Implementation for the builtin plugins
+│   │       │   ├── torch/                 # PyTorch plugin
+│   │       │   ├── mpi/                   # MPI plugin
+│   │       │   ├── jobset/                # JobSet plugin
+│   │       │   └── ...
+│   │       └── interface.go           # Framework interface definitions
+│   │       └── runtime.go             # Implementation of Info object which carries information trough the plugin chain.
+│   ├── webhooks/                    # Kubernetes validation/mutation webhooks for Trainer
+│   ├── data_cache/                  # Distributed in-memory cache (Rust)
+│   ├── initializers/                # Dataset and model initializers (Python)
+│   └── util/                        # Utility functions (Go)
+├── test/                          # Integration and E2E tests
+│   ├── integration/                 # Ginkgo integration tests
+│   └── e2e/                         # End-to-end tests
+```
+
+## Environment & Tooling
+
+- **Go**: primary language for controller, APIs, plugins
+- **Python**: dataset and model initializer
+- **Rust**: data cache
+- **Build**: `make` (orchestration), `go build`, `cargo`, `docker`
+- **Lint/format**: `golangci-lint`, `gofmt` (Go), `ruff` (Python), `cargo fmt` (Rust)
+- **Tests**: `go test`, `ginkgo` (integration), `pytest` (Python), `cargo test` (Rust)
+- **Code generation**: `controller-gen`, `openapi-gen`
+- **Pre-commit**: Config provided and enforced in CI
+
+## Commands
+
+<!-- BEGIN: AGENT_COMMANDS -->
+
+### Build
+
+Use available container runtime to build an image. For example:
+
+```bash
+docker build . -f cmd/trainer-controller-manager/Dockerfile -t trainer-controller-manager:test
+docker build . -f cmd/runtimes/deepspeed/Dockerfile -t deepspeed-runtime:test
+```
+
+### Testing
+
+```bash
+make test                     # Go unit tests
+make test-integration         # Go integration test
+make test-python              # Python unit tests
+make test-python-integration  # Python integration tests
+make test-rust                # Rust unit tests
+make test-e2e                 # End-to-end tests (requires Kind cluster)
+
+# Targeted tests
+go test ./pkg/controller/...                              # Run all controller tests
+go test -v -run TestTrainJobController ./pkg/controller/  # Run specific test function
+```
+
+### Local lint/format
+
+```bash
+make fmt                      # Format Go code
+make vet                      # Vet the Go code
+make golangci-lint            # Verify the Go code
+```
+
+**Code generation** (always run after modifying the APIs):
+
+```bash
+make generate                # Generate the required files
+```
+
+**Pre-commit**:
+
+```bash
+pre-commit install            # Install hooks
+pre-commit run --all-files    # Run all hooks manually
+```
+
+<!-- END: AGENT_COMMANDS -->
+
+## Development Workflow for AI Agents
+
+**Preferred commands**: Use `make` targets to ensure consistency with CI
+
+**Before making changes**:
+
+1. Read existing code patterns, comments, and tests for alignment
+2. Check the Core Development Principles below
+3. Run quick start command for validation and testing
+
+**Commit/PR hygiene**:
+
+- Follow Conventional Commits in titles and messages.
+- See the [check-pr-title.yaml](./.github/workflows/check-pr-title.yaml) for PR titles conventions.
+- Include rationale ("why") in commit messages/PR descriptions
+- Do not push secrets or change git config
+- Scope discipline: only modify files relevant to the task; keep diffs minimal
+
+## Core Development Principles
+
+### 1. Maintain Stable Public Interfaces ⚠️ CRITICAL
+
+**Always preserve API compatibility for released versions. APIs are in alpha and evolving.**
+
+**API Stability Rules**:
+
+- **CRD schemas** (`pkg/apis/trainer/v1alpha1`): Changes require careful review
+  - Adding fields: Use `+optional` marker and provide defaults
+  - ALWAYS use [the CEL validation](https://kubernetes.io/docs/reference/using-api/cel/) whenever applicable
+  - Removing/renaming fields: Requires API version bump and migration plan
+  - Changing field types: Breaking change, requires deprecation period
+- **Go public APIs**: Exported types, functions, interfaces
+  - Check if exported (capitalized names)
+  - Look for usage in examples, tests, and documentation
+  - Use deprecation comments for gradual removal
+- **Plugin interfaces** (`pkg/runtime/framework/interface.go`): Breaking changes affect all plugins
+
+❌ **Bad - Breaking Change:**
+
+```go
+// Changed field name in CRD without migration
+type TrainJobSpec struct {
+    // Changed from `Suspend` to `Paused`
+    Paused *bool `json:"paused,omitempty"`
+}
+```
+
+✅ **Good - Backward Compatible:**
+
+```go
+type TrainJobSpec struct {
+    // Suspend pauses job execution without deleting resources.
+    // Useful for debugging or resource optimization.
+    // +optional
+    Suspend *bool `json:"suspend,omitempty"`
+
+    // NewFeature enables experimental capability
+    // +optional
+    NewFeature *bool `json:"newFeature,omitempty"`
+}
+```
+
+### 2. Code Quality Standards
+
+ALWAYS follow the existing patterns in the codebase.
+
+#### Go Code Standards
+
+❌ **Bad:**
+
+```go
+func p(u, d interface{}) interface{} {
+    return u
+}
+```
+
+✅ **Good:**
+
+```go
+// ReconcileTrainJob reconciles a TrainJob object
+func (r *TrainJobReconciler) ReconcileTrainJob(ctx context.Context, trainJob *trainv1alpha1.TrainJob) error {
+    log := ctrl.LoggerFrom(ctx)
+    log.V(1).Info("Reconciling TrainJob", "name", trainJob.Name, "namespace", trainJob.Namespace)
+
+    // Implementation...
+    return nil
+}
+```
+
+**Go Style Requirements**:
+
+- Follow
+  [Kubernetes code conventions](https://github.com/kubernetes/community/blob/master/contributors/guide/coding-conventions.md),
+  [Effective Go](https://go.dev/doc/effective_go), and
+  [Kubernetes API](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api_changes.md) best practices.
+- Use structured logging with `ctrl.LoggerFrom(ctx)` (Zap-based)
+- Error handling: Always check errors, use `fmt.Errorf` for wrapping
+- Naming: `camelCase` for unexported, `PascalCase` for exported
+- Package names: Short, lowercase, no underscores
+
+#### Python Code Standards
+
+❌ **Bad - Missing provider pattern:**
+
+```python
+class CustomModel:  # Not inheriting from ModelProvider(ABC)
+    def download(self):
+        pass
+```
+
+✅ **Good - Following provider pattern:**
+
+```python
+class HuggingFace(utils.ModelProvider):
+    """HuggingFace model initializer."""
+
+    def load_config(self) -> None:
+        config_dict = utils.get_config_from_env(types.HuggingFaceModelInitializer)
+        self.config = types.HuggingFaceModelInitializer(**config_dict)
+
+    def download_model(self) -> None:
+        """Download model from HuggingFace Hub."""
+        # Implementation...
+```
+
+**Python Style Requirements**:
+
+- Line length 100, Python 3.11+, double quotes, spaces indent
+- Imports: isort via ruff; prefer absolute imports
+- Naming: `snake_case` for functions/vars, `PascalCase` for classes, `UPPER_SNAKE_CASE` for constants
+- Use descriptive variable names; break up complex functions (>20 lines)
+- Use logging module (not print statements) for output
+
+#### Rust Code Standards
+
+- Follow Cargo conventions and rustfmt defaults
+
+```rust
+/// Distributed cache server implementation
+pub struct CacheServer {
+    config: ServerConfig,
+    state: Arc<RwLock<CacheState>>,
+}
+
+impl CacheServer {
+    /// Create new cache server instance
+    pub fn new(config: ServerConfig) -> Result<Self> {
+        Ok(Self {
+            config,
+            state: Arc::new(RwLock::new(CacheState::default())),
+        })
+    }
+}
+```
+
+### 3. Testing Requirements
+
+- Every new feature or bugfix MUST be covered by tests
+- Every new test MUST follow the existing tests structure
+- Unit tests should go to the same folder as source code
+- Integration tests should go to the `test/integration/` directory
+
+#### Go Testing Patterns
+
+- File names must have `*_test.go` postfix
+- Use dictionary to define test cases
+- Every new function must have a corresponding test function prefixed with `Test`
+  - Example: `func RunEnforceMLPolicyPlugins()` -> `func TestRunEnforceMLPolicyPlugins()`
+- Integration tests use Ginkgo framework
+
+#### Python Testing Patterns
+
+- File names must have `*_test.py` postfix
+- Use pytest with fixtures
+- Every new function must have a corresponding test function prefixed with `test_`
+  - Example: `def calculate_total()` -> `def test_calculate_total()`
+- Use `pytest.mark.parametrize` with `TestCase` dataclass for multiple test scenarios:
+
+```python
+@pytest.mark.parametrize(
+    "test_case",
+    [
+        TestCase(
+            name="valid dataset URI",
+            expected_status=SUCCESS,
+            config={"uri": "hf://meta-llama/model"},
+            expected_output={"scheme": "hf"},
+        ),
+        TestCase(
+            name="invalid URI format",
+            expected_status=FAILED,
+            config={"uri": "invalid"},
+            expected_error=ValueError,
+        ),
+    ],
+)
+def test_parse_dataset_uri(test_case):
+    # Test implementation using test_case attributes
+    result = parse_dataset_uri(**test_case.config)
+    assert result == test_case.expected_output
+```