| 1 | +# AI Agent Guidance for Cluster Monitoring Operator |
| 2 | + |
| 3 | +This file provides guidance to AI agents when working with code in this repository. |
| 4 | + |
| 5 | +This is the Cluster Monitoring Operator (CMO) - the operator that manages the Prometheus-based monitoring stack in |
| 6 | +OpenShift. CMO is deployed by the Cluster Version Operator (CVO). |
| 7 | + |
| 8 | +## Architecture Overview |
| 9 | + |
| 10 | +### Jsonnet Manifest Generation |
| 11 | + |
| 12 | +CMO generates Kubernetes manifests using Jsonnet (`jsonnet/`): |
| 13 | + |
| 14 | +- Source: `jsonnet/components/*.libsonnet` and vendored upstream jsonnet |
| 15 | +- Output: `assets/` directory with generated YAML manifests |
| 16 | +- The operator reads manifests from `assets/` at runtime via `pkg/manifests/` |
| 17 | + |
| 18 | +**Critical**: When modifying manifests, changes must be made in jsonnet source files, then regenerated with |
| 19 | +`make generate`. Direct edits to `assets/*.yaml` will be overwritten. **DO NOT** edit them directly. |
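One way to catch accidental hand edits early is to look at which paths a change touches: generated YAML under `assets/` that changed with nothing changed under `jsonnet/` is suspicious. The sketch below is a hypothetical helper, not part of the repository. The function name `assets_edited_without_jsonnet` is an assumption made for illustration, and `make check-assets` remains the authoritative check.

```bash
# Hypothetical helper (not part of the repository): reads changed file paths
# on stdin, one per line, and succeeds (exit 0) when generated YAML under
# assets/ changed without any accompanying change under jsonnet/.
assets_edited_without_jsonnet() {
  assets_changed=0
  jsonnet_changed=0
  while IFS= read -r path; do
    case "$path" in
      assets/*.yaml) assets_changed=1 ;;
      jsonnet/*) jsonnet_changed=1 ;;
    esac
  done
  [ "$assets_changed" -eq 1 ] && [ "$jsonnet_changed" -eq 0 ]
}
```

It could be fed from the staged file list, for example `git diff --cached --name-only | assets_edited_without_jsonnet && echo "assets/ changed without jsonnet/ changes"`.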
| 20 | + |
| 21 | +### Configuration API |
| 22 | + |
| 23 | +Two ConfigMaps control monitoring behavior: |
| 24 | + |
| 25 | +- `cluster-monitoring-config` (openshift-monitoring) - Platform monitoring |
| 26 | +- `user-workload-monitoring-config` (openshift-user-workload-monitoring) - User workload monitoring |
| 27 | + |
| 28 | +The configuration types are defined in `pkg/manifests/types.go`, together with their validation rules. |
| 29 | + |
| 30 | +- **DO NOT** invent new config keys; use only the fields defined in `pkg/manifests/types.go` |
| 31 | + |
| 32 | +## Development Commands |
| 33 | + |
| 34 | +### Local Development |
| 35 | + |
| 36 | +**Prerequisites**: You need access to an OpenShift cluster. You can provision one using |
| 37 | +[cluster-bot](https://github.com/openshift/ci-chat-bot) via Slack (Red Hat internal). In Slack, message `@cluster-bot` |
| 38 | +with `launch 4.17` (or desired version) to get a temporary cluster with credentials. |
| 39 | + |
| 40 | +```bash |
| 41 | +export KUBECONFIG=/path/to/kubeconfig # Requires OpenShift cluster |
| 42 | +make run-local # Build and run locally as CMO service account |
| 43 | +make run-local SWITCH_TO_CMO=false # Run as current user (e.g., kube:admin) |
| 44 | +``` |
| 45 | + |
| 46 | +### Jsonnet Workflow |
| 47 | + |
| 48 | +```bash |
| 49 | +# Modify jsonnet source files in jsonnet/ |
| 50 | +make generate # Regenerate manifests, docs, and metadata |
| 51 | +make docs # Regenerate documentation only (api.md, resources.md) |
| 52 | +make check-assets # Verify assets are up to date |
| 53 | +``` |
| 54 | + |
| 55 | +**Rapid iteration**: For quick local testing only, you can temporarily modify YAML files in `assets/` and run the operator with |
| 56 | +`hack/local-cmo.sh` (no rebuild needed). Port the changes back to jsonnet and run `make generate` before committing. See |
| 57 | +`Documentation/development.md` for the detailed workflow. |
| 58 | + |
| 59 | +**Two-release annotation/label removal**: To remove a label/annotation from a resource managed by `CreateOrUpdateXXX` |
| 60 | +functions: |
| 61 | + |
| 62 | +1. First release: Add the suffix `"-"` to the annotation/label key (CMO then deletes it from the live resource via library-go) |
| 63 | +2. Second release: Remove the suffixed entry from the jsonnet source |
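The removal rule can be illustrated on a flat list of `key=value` lines. This is only a sketch of the semantics, assuming the `"-"` suffix is applied to the key; it is not how CMO or library-go implement the merge, and the function name `apply_required_label` is hypothetical.

```bash
# Illustration only: mimics the "-" suffix rule on key=value lines.
# Reads the existing labels on stdin; $1 is a required key, $2 its value.
apply_required_label() {
  key=$1
  value=$2
  case "$key" in
    *-)
      # Trailing "-": drop the label whose key is "$key" minus the suffix.
      awk -F= -v drop="${key%-}" '$1 != drop'
      ;;
    *)
      # Plain key: remove any existing entry, then emit the required one.
      awk -F= -v k="$key" '$1 != k'
      printf '%s=%s\n' "$key" "$value"
      ;;
  esac
}
```

For example, `printf 'app=cmo\nlegacy=true\n' | apply_required_label 'legacy-' ''` keeps `app=cmo` and drops `legacy=true`, which is why the suffixed entry has to ship in one release before it can be removed from the source.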
| 64 | + |
| 65 | +### Testing |
| 66 | + |
| 67 | +```bash |
| 68 | +make test # All tests (requires OpenShift cluster with KUBECONFIG) |
| 69 | +make test-unit # Unit tests only |
| 70 | +make test-e2e # E2E tests (requires OpenShift cluster) |
| 71 | +make test-ginkgo # Ginkgo tests (ported from openshift-tests-private) |
| 72 | + |
| 73 | +# Specific tests |
| 74 | +go test -v ./pkg/... -run TestName # Specific unit test |
| 75 | +go test -v -timeout=120m -run TestName ./test/e2e/ --kubeconfig "$KUBECONFIG" # Specific e2e test |
| 76 | +``` |
| 77 | + |
| 78 | +**openshift-tests-extension**: CMO integrates with the OpenShift conformance test framework via the `tests-ext` binary. |
| 79 | +Run `make tests-ext-update` after modifying Ginkgo tests to update metadata. |
| 80 | + |
| 81 | +### Verification |
| 82 | + |
| 83 | +```bash |
| 84 | +make verify # Run all checks |
| 85 | +make format # Format code (go fmt, jsonnet fmt, shellcheck) |
| 86 | +make golangci-lint # Lint Go code |
| 87 | +make check-rules # Validate Prometheus rules with promtool |
| 88 | +``` |
| 89 | + |
| 90 | +## OpenShift Conventions |
| 91 | + |
| 92 | +### Pull Requests |
| 93 | + |
| 94 | +- **Title format**: `OCPBUGS-12345: descriptive title` (bugs) or `MON-1234: descriptive title` (features) |
| 95 | + - Example: `MON-4435: Add RBAC permission for endpointslice resource in UWM prometheus-operator` |
| 96 | + - Example: `OCPBUGS-61088: create networkpolicy settings for in-cluster monitoring` |
| 97 | +- **Commit format**: `<subsystem>: <what changed>` |
| 98 | + - Example: `jsonnet: update prometheus version` |
| 99 | + - Example: `e2e: add e2e test to verify endpointslice discovery in uwm` |
| 100 | +- All PRs require a Jira ticket reference |
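The title and commit conventions above can be checked mechanically before pushing. The following is a minimal sketch using hypothetical helper names (`valid_pr_title`, `valid_commit_subject`). The character set accepted for `<subsystem>` is an assumption, since the conventions only give examples.

```bash
# Hypothetical helper: succeeds when a PR title starts with a Jira key from
# the OCPBUGS or MON project, followed by ": " and a non-empty description.
valid_pr_title() {
  printf '%s\n' "$1" | grep -Eq '^(OCPBUGS|MON)-[0-9]+: .+'
}

# Hypothetical helper: succeeds when a commit subject looks like
# "<subsystem>: <what changed>". The allowed subsystem characters
# (letters, digits, "_", ".", "/", "-") are an assumption.
valid_commit_subject() {
  printf '%s\n' "$1" | grep -Eq '^[A-Za-z0-9_./-]+: .+'
}
```

A local hook could call, for instance, `valid_commit_subject "$(git log -1 --format=%s)" || echo "unexpected commit subject format"`.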
| 101 | + |
| 102 | +### Jira Integration |
| 103 | + |
| 104 | +- **Automatic linking**: PRs are automatically linked to Jira when the issue key is in the PR title |
| 105 | +- **Lifecycle automation**: [jira-lifecycle-plugin](https://github.com/openshift-eng/jira-lifecycle-plugin) updates |
| 106 | + Jira status based on PR events |
| 107 | +- **Jira commands** (comment on PR): |
| 108 | + - `/jira refresh` - Manually sync PR with Jira issue |
| 109 | + - `/jira cc @username` - CC someone on the Jira issue |
| 110 | + - `/jira backport <branch>` - Create backport PR to target branch (e.g., `/jira backport release-4.17`) |
| 111 | + - `/jira assign <user>` - Assign the Jira issue to specified user |
| 112 | + - `/jira unassign` - Remove current assignee from Jira issue |
| 113 | + - `/jira comment <comment>` - Add comment to the Jira issue |
| 114 | + - `/jira close` - Close the Jira issue |
| 115 | + - `/jira reopen` - Reopen the Jira issue |
| 116 | +- **Creating tickets**: Use OCPBUGS project for bugs, MON project for features |
| 117 | +- **Required fields**: Component (Monitoring), Target Version, Priority |
| 118 | +- **Status workflow**: To Do → In Progress → Code Review → Done |
| 119 | + |
| 120 | +### Prow CI |
| 121 | + |
| 122 | +- **Triggering tests**: Tests run automatically on PR creation/update |
| 123 | +- **Useful commands** (comment on PR): |
| 124 | + - `/retest` - Retry all failed tests |
| 125 | + - `/test <job-name>` - Run specific job (e.g., `/test e2e-aws`) |
| 126 | + - `/test-with <job-name>` - Run specific job with additional tests |
| 127 | + - `/retitle <new-title>` - Change PR title |
| 128 | + - `/assign @username` - Assign reviewer |
| 129 | + - `/cc @username` - Request review without assignment |
| 130 | + - `/hold` - Prevent auto-merge; use `/hold cancel` to remove the hold |
| 131 | + - `/lgtm` - Add the `lgtm` label to signal the change looks good (maintainers only) |
| 132 | + - `/approve` - Add the `approved` label to approve the PR for merge (maintainers only) |
| 133 | + - `/cherry-pick <branch>` - Cherry-pick to another branch after merge |
| 134 | +- **Important jobs**: |
| 135 | + - `ci/prow/images` - Builds container images |
| 136 | + - `ci/prow/e2e-*` - E2E test variants |
| 137 | + - `ci/prow/verify` - Runs `make verify` |
| 138 | + - `ci/prow/unit` - Unit tests |
| 139 | +- **Viewing results**: Click "Details" next to job to see Prow logs |
| 140 | +- **Common failures**: |
| 141 | + - `ci/prow/images` fails if `make verify` would fail (run locally first) |
| 142 | + - E2E timeouts may be transient (retry with `/retest`) |
| 143 | +- **More commands**: See [prow.ci.openshift.org/command-help](https://prow.ci.openshift.org/command-help) |
| 144 | + |
| 145 | +### Feature Development |
| 146 | + |
| 147 | +- **FeatureGate integration**: CMO integrates with OpenShift FeatureGates for controlling feature availability |
| 148 | + - Example: `MetricsCollectionProfiles` feature gate controls collection profile functionality |
| 149 | + - Check in `pkg/operator/operator.go`: `featureGates.Enabled(features.FeatureGateMetricsCollectionProfiles)` |
| 150 | + - Pass to config: `CollectionProfilesFeatureGateEnabled` flag in `pkg/manifests/config.go` |
| 151 | +- **TechPreview → GA lifecycle**: |
| 152 | + - TechPreview: Feature gated, requires explicit enablement |
| 153 | + - GA: Feature gate removed, enabled by default |
| 154 | +- **Adding new features**: |
| 155 | + 1. Add FeatureGate check in `pkg/operator/operator.go` |
| 156 | + 2. Pass enabled state through config |
| 157 | + 3. Conditionally create resources based on gate state (e.g., `serviceMonitors()` helper) |
| 158 | + 4. Update `pkg/manifests/types.go` if new config fields needed |
| 159 | + |
| 160 | +## Updating Jsonnet Dependencies |
| 161 | + |
| 162 | +Example: updating the kube-prometheus bundle: |
| 163 | + |
| 164 | +```bash |
| 165 | +cd jsonnet |
| 166 | + |
| 167 | +# Edit jsonnetfile.json, update version for desired component |
| 168 | +jb update |
| 169 | + |
| 170 | +# Stage only the version/sum changes for target bundle in jsonnetfile.lock.json |
| 171 | +git add -p jsonnetfile.lock.json |
| 172 | + |
| 173 | +# Discard the remaining unstaged changes (staged hunks are kept) |
| 174 | +git restore jsonnetfile.json jsonnetfile.lock.json |
| 175 | + |
| 176 | +# Reinstall with updated lockfile |
| 177 | +rm -rf vendor && jb install |
| 178 | + |
| 179 | +cd .. |
| 180 | +make generate |
| 181 | +``` |
| 182 | + |
| 183 | +See `Documentation/development.md` for detailed workflow. |
| 184 | + |
| 185 | +## Common Pitfalls |
| 186 | + |
| 187 | +1. **Forgetting `make generate`**: Modifying jsonnet without regenerating assets causes CI failures |
| 188 | +2. **Missing KUBECONFIG**: E2E tests fail silently if KUBECONFIG isn't set, even if `~/.kube/config` exists |
| 189 | +3. **Asset sync issues**: Run `make clean` before `make generate` if vendored jsonnet behaves unexpectedly |
| 190 | +4. **Wrong cluster type**: Tests require OpenShift, not vanilla Kubernetes |
| 191 | +5. **Stale local CMO**: When running the operator locally, make sure it has the right permissions. Otherwise it may |
| 192 | + get stuck in the reconcile loop because it cannot list or modify resources. |
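The missing-KUBECONFIG pitfall can be made loud with a small guard that runs before the test targets. This is a sketch, not something the repository provides: the function name `require_kubeconfig` is hypothetical, and it assumes `KUBECONFIG` holds a single file path rather than a colon-separated list.

```bash
# Hypothetical guard: fail loudly when KUBECONFIG is unset, empty, or does
# not point at a readable file, instead of letting e2e tests fail silently.
# Assumes KUBECONFIG is a single path, not a colon-separated list.
require_kubeconfig() {
  if [ -z "${KUBECONFIG:-}" ]; then
    echo "KUBECONFIG is not set; export it before running e2e tests" >&2
    return 1
  fi
  if [ ! -r "$KUBECONFIG" ]; then
    echo "KUBECONFIG points at an unreadable file: $KUBECONFIG" >&2
    return 1
  fi
}
```

Typical use would be `require_kubeconfig && make test-e2e`.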
| 193 | + |
| 194 | +## Documentation |
| 195 | + |
| 196 | +- `CONTRIBUTING.md` - Contribution guidelines and workflow details |
| 197 | +- `Documentation/development.md` - Detailed development workflows |
| 198 | +- [OpenShift Monitoring Docs](https://docs.redhat.com/en/documentation/openshift_container_platform/latest/html/monitoring/) |
| 199 | + \- User-facing monitoring documentation |
| 200 | + |
| 201 | +## Important Files |
| 202 | + |
| 203 | +- `Makefile` - All build and test targets |
| 204 | +- `VERSION` - Operator version string |
| 205 | +- `manifests/` - Deployment manifests |
| 206 | +- `OWNERS` and `OWNERS_ALIASES` - Code ownership definitions (reviewers and approvers) |