Optimize sync throughput, fix CRD/RBAC bugs, and add redeployLabelSelector#47

Open
felipesabadini wants to merge 6 commits into phasehq:main from felipesabadini:main

Conversation

@felipesabadini
Contributor

Summary

This PR improves the Kubernetes Secrets Operator behavior under higher load (many PhaseSecret resources), reduces unnecessary API work, adds clearer operational telemetry, and fixes two bugs that caused the operator to silently stop processing new PhaseSecrets.

Bug Fixes

CRD status validation (422 Unprocessable Entity)

The CRD schema had required: ["conditions"] on status, but the operator does not populate this field on creation. This caused the API server to reject status updates for newly created PhaseSecrets with:

422 Unprocessable Entity: PhaseSecret.secrets.phase.dev "xxx" is invalid:
status.conditions: Required value

Fix: Removed required: ["conditions"] from the status schema in crd-template.yaml.

RBAC missing phasesecrets/status (403 Forbidden)

The operator's ClusterRole was missing the phasesecrets/status sub-resource permission. When the operator tried to update the status of a PhaseSecret, it received:

403 Forbidden: phasesecrets.secrets.phase.dev "xxx" is forbidden:
User "system:serviceaccount:phase-system:phase-kubernetes-operator"
cannot patch resource "phasesecrets/status"

After this error, kopf silently stopped processing the affected PhaseSecret daemon — no retries, no further syncs.

Fix: Added the phasesecrets/status sub-resource, with verbs [get, patch, update], to the ClusterRole.

What changed

1) Operator scalability and runtime behavior

  • Added Kopf startup configuration to increase worker capacity:
    • settings.execution.max_workers = 50
  • Updated the runtime command to use an explicit cluster-wide scope:
    • kopf run --all-namespaces /app/main.py
  • Added startup jitter per PhaseSecret daemon to avoid synchronized bursts at startup.
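
The startup jitter can be sketched roughly as follows, assuming each daemon sleeps for a random delay before its first sync. The names startup_jitter and run_daemon and the 30-second ceiling are illustrative and are not taken from the PR.

```python
import random
import time
from typing import Callable, Optional


def startup_jitter(max_delay_s: float = 30.0,
                   rng: Optional[random.Random] = None) -> float:
    """Return a random delay in [0, max_delay_s] to spread first syncs apart."""
    if max_delay_s <= 0:
        return 0.0
    rng = rng or random.Random()
    return rng.uniform(0.0, max_delay_s)


def run_daemon(sync_once: Callable[[], None],
               max_delay_s: float = 30.0,
               sleep: Callable[[float], None] = time.sleep) -> None:
    """Hypothetical daemon entry point: wait a jittered delay, then sync once."""
    sleep(startup_jitter(max_delay_s))
    sync_once()
```

With many PhaseSecret resources, a uniform delay spreads the first round of Phase API calls over the whole window instead of firing them all at operator start.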

2) Sync loop observability and log quality

  • Added structured cycle log summary with:
    • status
    • duration_s
    • phase_api_calls
  • Downgraded repetitive "No sync needed" message from INFO to DEBUG to reduce log noise.
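
As an illustration of the cycle summary, here is a minimal sketch that emits the three fields listed above as one JSON log line. The logger name, the helper names, and the status values are assumptions; the PR does not show the exact log format.

```python
import json
import logging
import time

logger = logging.getLogger("phase-operator")  # logger name is an assumption


def cycle_summary(status: str, started_at: float, phase_api_calls: int) -> dict:
    """Build the per-cycle summary: status, duration_s, phase_api_calls."""
    return {
        "status": status,
        "duration_s": round(time.monotonic() - started_at, 3),
        "phase_api_calls": phase_api_calls,
    }


def log_cycle(status: str, started_at: float, phase_api_calls: int,
              synced: bool) -> dict:
    """Log one summary line per cycle; keep the routine message at DEBUG."""
    if not synced:
        # Repetitive and low-value, so it stays out of INFO-level logs.
        logger.debug("No sync needed")
    summary = cycle_summary(status, started_at, phase_api_calls)
    logger.info("sync cycle %s", json.dumps(summary, sort_keys=True))
    return summary
```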

3) Phase API efficiency and HTTP robustness

  • Introduced shared requests.Session with retry/backoff and timeout support:
    • PHASE_HTTP_TIMEOUT (default: 10)
    • PHASE_HTTP_RETRIES (default: 2)
    • PHASE_HTTP_BACKOFF (default: 0.3)
  • Added timeout error handling.
  • Added API call counters in Phase client for per-cycle visibility.
  • Reduced duplicate API work by reusing context (phase_client, user_data, resolved_context) during sync.
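
A sketch of how such a session might be wired up, assuming the three environment variables above and a timeout expressed in seconds. The helper names and the set of retried status codes are assumptions, not the PR's actual code. The requests imports are deferred so the settings parser can be exercised on its own.

```python
import os


def http_settings(env=os.environ) -> dict:
    """Read the HTTP env vars, falling back to the defaults listed above."""
    return {
        "timeout": float(env.get("PHASE_HTTP_TIMEOUT", "10")),
        "retries": int(env.get("PHASE_HTTP_RETRIES", "2")),
        "backoff": float(env.get("PHASE_HTTP_BACKOFF", "0.3")),
    }


def build_session(settings: dict):
    """Create one shared requests.Session whose adapters retry with backoff."""
    import requests
    from requests.adapters import HTTPAdapter
    from urllib3.util.retry import Retry

    retry = Retry(
        total=settings["retries"],
        backoff_factor=settings["backoff"],
        status_forcelist=(502, 503, 504),  # assumed set of retryable statuses
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    return session


def get_json(session, url: str, settings: dict):
    """requests has no session-wide timeout, so it is passed on every call."""
    import requests

    try:
        response = session.get(url, timeout=settings["timeout"])
    except requests.exceptions.Timeout as exc:
        raise TimeoutError(f"Phase API request timed out: {url}") from exc
    response.raise_for_status()
    return response.json()
```

Reusing one session also reuses pooled connections, which tends to matter when many PhaseSecret daemons call the same host.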

4) Secret reference resolution performance

  • Added prebuilt secret index (build_secrets_index) and fetch cache to avoid repeated recomputation and repeated lookups when resolving references.
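
To illustrate the idea, here is a minimal sketch of an index plus a per-sync fetch cache. Only the name build_secrets_index comes from the PR; its signature, the secret dict shape (environment, path, key, value), and the FetchCache class are assumptions made for this example.

```python
from typing import Callable, Dict, List, Tuple

Secret = Dict[str, str]
IndexKey = Tuple[str, str, str]


def build_secrets_index(secrets: List[Secret]) -> Dict[IndexKey, str]:
    """Map (environment, path, key) to value once, so each lookup is O(1)."""
    return {(s["environment"], s["path"], s["key"]): s["value"] for s in secrets}


class FetchCache:
    """Memoize fetches per (environment, path) for the duration of one sync."""

    def __init__(self, fetch: Callable[[str, str], List[Secret]]):
        self._fetch = fetch
        self._cache: Dict[Tuple[str, str], List[Secret]] = {}
        self.calls = 0  # number of real fetches performed

    def get(self, environment: str, path: str) -> List[Secret]:
        key = (environment, path)
        if key not in self._cache:
            self.calls += 1
            self._cache[key] = self._fetch(environment, path)
        return self._cache[key]
```

Without the index, every reference triggers a linear scan of the secret list; without the cache, every cross-environment reference may trigger another API call.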

5) Kubernetes secret update behavior

  • Replaced delete+create pattern with atomic upsert behavior (replace_namespaced_secret with resourceVersion when existing, create when missing), avoiding transient secret deletion windows where pods could read a missing secret.
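
The upsert could look roughly like this. It assumes api behaves like the Kubernetes Python client's CoreV1Api and that a missing Secret surfaces as an exception carrying status == 404, as kubernetes.client.ApiException does. The function name upsert_secret is hypothetical.

```python
def upsert_secret(api, namespace: str, name: str, body: dict) -> str:
    """Replace the Secret if it exists, otherwise create it; never delete."""
    try:
        existing = api.read_namespaced_secret(name, namespace)
    except Exception as exc:
        # Assumption: a missing Secret raises an exception with status 404.
        if getattr(exc, "status", None) != 404:
            raise
        api.create_namespaced_secret(namespace, body)
        return "created"

    # Copy before mutating so the caller's body is left untouched.
    body = dict(body)
    metadata = dict(body.get("metadata", {}))
    # Carrying resourceVersion makes the API server reject the replace
    # if the Secret changed since it was read.
    metadata["resourceVersion"] = existing.metadata.resource_version
    body["metadata"] = metadata
    api.replace_namespaced_secret(name, namespace, body)
    return "replaced"
```

Because the Secret object is never removed, pods that mount or read it during a sync see either the old or the new contents.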

6) Optional redeploy scope optimization

  • Added optional CR field:
    • spec.redeployLabelSelector
  • When present, deployment scan for autoredeploy is scoped by label selector, reducing unnecessary Kubernetes API calls in namespaces with many deployments.
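
A minimal sketch of the scoped scan, assuming spec.redeployLabelSelector holds a plain selector string such as "app=web" and that apps_api behaves like the Kubernetes Python client's AppsV1Api. The function name is hypothetical.

```python
from typing import Optional


def list_redeploy_candidates(apps_api, namespace: str,
                             label_selector: Optional[str] = None) -> list:
    """List deployments to check for redeploy, filtered server-side if asked."""
    kwargs = {}
    if label_selector:
        # Filtering happens in the API server, so fewer objects are returned.
        kwargs["label_selector"] = label_selector
    return apps_api.list_namespaced_deployment(namespace, **kwargs).items
```

When the field is omitted, no selector is passed and the call lists every deployment in the namespace, matching the previous behavior.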

7) Safety fix in Phase.get()

  • Added explicit guard:
    • when resolved_context is provided, user_data must also be provided.
  • Raises a clear ValueError instead of failing later with an AttributeError.
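
The guard described above might be as small as the following sketch. The helper name check_get_args and the error message are assumptions; the real check lives inside Phase.get().

```python
def check_get_args(resolved_context=None, user_data=None) -> None:
    """Fail fast when resolved_context is passed without user_data."""
    if resolved_context is not None and user_data is None:
        raise ValueError(
            "user_data must be provided when resolved_context is provided"
        )
```

Validating at the top of the call keeps the error next to its cause, rather than surfacing later as an attribute lookup on None.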

8) Docs / charts / CRD updates

  • Updated:
    • Dockerfile command
    • README.md usage examples
    • CR template and CRDs with redeployLabelSelector
    • Helm values.yaml with HTTP env defaults
    • Helm chart bumped to 1.5.0

9) Tests

  • Added tests for secret reference indexing/cache:
    • src/tests/test_secret_referencing.py
  • Full test suite passing.

Validation

python3 -m pytest -q
Result: 89 passed

Compatibility

  • No breaking API changes. Existing CRs continue working without modification.
  • redeployLabelSelector is optional — omitting it preserves the current behavior (scan all deployments in namespace).
  • HTTP timeout/retry env vars have sensible defaults and require no configuration changes.
  • Deployments using custom image tags should update Helm values accordingly:
    image:
      repository: <your-registry>/phase-kubernetes-operator
      tag: v1.5.0

felipesabadini and others added 6 commits March 3, 2026 14:38
Includes fixes from 953795a:
- Remove required status.conditions from CRD (fixes 422 on new PhaseSecrets)
- Add phasesecrets/status permission to ClusterRole (fixes status update denied)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…version

Includes:
- Add Phase Kubernetes Operator v1.5.0 with updated metadata, icon, and keywords
- Bump chart version to 1.5.0 and appVersion to 1.5.0 in Chart.yaml
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@nimish-ks nimish-ks self-requested a review March 3, 2026 18:34
@nimish-ks nimish-ks self-assigned this Mar 3, 2026
@felipesabadini
Contributor Author

Hi! Just following up on this PR.

I’m currently running this version self-hosted in my cluster and it has been working well so far.

Since it includes fixes for the CRD status validation and the RBAC issue with phasesecrets/status, I wanted to check if the approach looks good to you or if you'd prefer any adjustments before moving forward.

Happy to update anything if needed.


2 participants