Skip to content

Conversation

@DannyLiCom
Copy link
Collaborator

@DannyLiCom DannyLiCom commented Nov 4, 2025

Description

Adding Webhooks, Injector, and connection-operator installations during cluster creation.

Testing

Test with the cluster create command and add the flag --managed-mldiagnostics. The cluster will then install the required ML Diagnostics components: cert-manager, the injection-webhook, and the connection-operator.
like this: https://paste.googleplex.com/6009656740806656

@DannyLiCom DannyLiCom marked this pull request as draft November 4, 2025 03:27
@DannyLiCom
Copy link
Collaborator Author

@xibinliu and @Shuang-cnt please review it.

@DannyLiCom DannyLiCom added the release-features features label Nov 4, 2025
@DannyLiCom DannyLiCom changed the base branch from main to develop November 4, 2025 04:02
@scaliby scaliby changed the base branch from develop to main November 4, 2025 08:13
@jamOne-
Copy link
Collaborator

jamOne- commented Nov 4, 2025

  1. What's the justification for this change? Why are we adding this for every cluster?
  2. What's diagon?
  3. Could you cover the changes with unit tests please? https://github.com/AI-Hypercomputer/xpk/blob/main/docs/testing.md#unit-test

@jamOne-
Copy link
Collaborator

jamOne- commented Nov 4, 2025

Also, please merge the current main branch, because the PR changes are mixed with the sub-slicing changes, making it harder to review.

@DannyLiCom
Copy link
Collaborator Author

DannyLiCom commented Nov 4, 2025

Also, please merge the current main branch, because the PR changes are mixed with the sub-slicing changes, making it harder to review.

When I run git checkout -b, should I branch off of main or develop? My file changes only touch 3 files, but if I merge into main, there are now 6 changes.

@DannyLiCom
Copy link
Collaborator Author

  1. What's the justification for this change? Why are we adding this for every cluster?
  2. What's diagon?
  3. Could you cover the changes with unit tests please? https://github.com/AI-Hypercomputer/xpk/blob/main/docs/testing.md#unit-test

Perhaps @xibinliu needs to explain for it to be clearer.

@scaliby
Copy link
Member

scaliby commented Nov 4, 2025

Also, please merge the current main branch, because the PR changes are mixed with the sub-slicing changes, making it harder to review.

When I run git checkout -b, should I branch off of main or develop? My file changes only touch 3 files, but if I merge into main, there are now 6 changes.

The main. Please just pull latest main and merge it into your branch - that should do the job.

@DannyLiCom DannyLiCom force-pushed the diagon-setup-and-workload branch from 681f902 to 20b94c7 Compare November 4, 2025 09:13
@scaliby scaliby changed the base branch from main to reusable-nightly November 4, 2025 09:31
@scaliby scaliby changed the base branch from reusable-nightly to main November 4, 2025 09:31
@DannyLiCom DannyLiCom force-pushed the diagon-setup-and-workload branch 2 times, most recently from 3c46565 to b4aaa13 Compare November 5, 2025 01:01
@DannyLiCom
Copy link
Collaborator Author

Also, please merge the current main branch, because the PR changes are mixed with the sub-slicing changes, making it harder to review.

When I run git checkout -b, should I branch off of main or develop? My file changes only touch 3 files, but if I merge into main, there are now 6 changes.

The main. Please just pull latest main and merge it into your branch - that should do the job.

Done

@DannyLiCom DannyLiCom changed the title Diagon and XPK Integration Managed ml diagnostics and xpk integration Nov 5, 2025
@DannyLiCom
Copy link
Collaborator Author

  1. What's the justification for this change? Why are we adding this for every cluster?
  2. What's diagon?
  3. Could you cover the changes with unit tests please? https://github.com/AI-Hypercomputer/xpk/blob/main/docs/testing.md#unit-test

ML Diagnostics components

For point 1: Add the --managed-mldiagnostics flag to allow the user to decide whether or not to install these ML Diagnostics components.

@DannyLiCom DannyLiCom force-pushed the diagon-setup-and-workload branch 2 times, most recently from 8997011 to de0c17b Compare November 6, 2025 03:52
@DannyLiCom
Copy link
Collaborator Author

DannyLiCom commented Nov 6, 2025

@jamOne- @scaliby The unit tests are failing because our installation involves the Kubernetes API server. Does the installation process require unit tests? If so, it will get stuck at get-credentials.

[XPK] ********************************************************************************
[XPK] b'E1106 04:14:54.063296    1983 memcache.go:265] "Unhandled Error" err="couldn\'t get current server API group list: Get \\"http://localhost:8080/api?timeout=32s\\": dial tcp [::1]:8080: connect: connection refused"\nE1106 04:14:54.063919    1983 memcache.go:265] "Unhandled Error" err="couldn\'t get current server API group list: Get \\"http://localhost:8080/api?timeout=32s\\": dial tcp [::1]:8080: connect: connection refused"\nE1106 04:14:54.065489    1983 memcache.go:265] "Unhandled Error" err="couldn\'t get current server API group list: Get \\"http://localhost:8080/api?timeout=32s\\": dial tcp [::1]:8080: connect: connection refused"\nE1106 04:14:54.065944    1983 memcache.go:265] "Unhandled Error" err="couldn\'t get current server API group list: Get \\"http://localhost:8080/api?timeout=32s\\": dial tcp [::1]:8080: connect: connection refused"\nThe connection to the server localhost:8080 was refused - did you specify the right host or port?\n'
[XPK] ********************************************************************************

@scaliby
Copy link
Member

scaliby commented Nov 6, 2025

@DannyLiCom please just mock this behavior on command level, so there will be no actual command execution happening.

@DannyLiCom DannyLiCom force-pushed the diagon-setup-and-workload branch 3 times, most recently from a41448d to a7c8f82 Compare November 7, 2025 01:40
@DannyLiCom DannyLiCom force-pushed the diagon-setup-and-workload branch 5 times, most recently from 244d511 to 6c8005b Compare November 7, 2025 02:53
…ad.py

Add wait_for_deployment_ready()

Added unit test

update goldens.yaml

update goldens.yaml

update goldens.yaml

Fixed parser/cluster.py

update goldens.yaml

fixed linter

fixed linter

pyink

Test unit test
@DannyLiCom DannyLiCom force-pushed the diagon-setup-and-workload branch from 6c8005b to 0596553 Compare November 7, 2025 03:07
@xibinliu
Copy link

xibinliu commented Nov 7, 2025

  1. What's the justification for this change? Why are we adding this for every cluster?
  2. What's diagon?
  3. Could you cover the changes with unit tests please? https://github.com/AI-Hypercomputer/xpk/blob/main/docs/testing.md#unit-test

Perhaps @xibinliu needs to explain for it to be clearer.

Hi @jamOne- Google Cloud ML Diagnostics is an end-to-end managed platform for ML Engineers to optimize and diagnose their AI/ML workloads on Google Cloud. ML Engineers need to integrate their ML workload with google-cloud-mldiagnostics open source SDK (see the PR in MaxText) as well as deploy some GKE webhooks and operators in GKE cluster (this PR) to get a seamless workload tracking and profiling experience.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants