Managed ml diagnostics and xpk integration #801
@xibinliu and @Shuang-cnt please review it. |
Also, please merge the current main branch, because the PR changes are mixed with the sub-slicing changes, making it harder to review. |
When I run |
Perhaps @xibinliu can explain this to make it clearer. |
The |
681f902 to 20b94c7
3c46565 to b4aaa13
Done |
For point 1: Add the |
8997011 to de0c17b
@jamOne- @scaliby The unit tests are failing because our installation calls the Kubernetes API server. Does the installation process require unit tests? If so, the test gets stuck at:
[XPK] ********************************************************************************
[XPK] b'E1106 04:14:54.063296 1983 memcache.go:265] "Unhandled Error" err="couldn\'t get current server API group list: Get \\"http://localhost:8080/api?timeout=32s\\": dial tcp [::1]:8080: connect: connection refused"\nE1106 04:14:54.063919 1983 memcache.go:265] "Unhandled Error" err="couldn\'t get current server API group list: Get \\"http://localhost:8080/api?timeout=32s\\": dial tcp [::1]:8080: connect: connection refused"\nE1106 04:14:54.065489 1983 memcache.go:265] "Unhandled Error" err="couldn\'t get current server API group list: Get \\"http://localhost:8080/api?timeout=32s\\": dial tcp [::1]:8080: connect: connection refused"\nE1106 04:14:54.065944 1983 memcache.go:265] "Unhandled Error" err="couldn\'t get current server API group list: Get \\"http://localhost:8080/api?timeout=32s\\": dial tcp [::1]:8080: connect: connection refused"\nThe connection to the server localhost:8080 was refused - did you specify the right host or port?\n'
[XPK] ******************************************************************************** |
@DannyLiCom please just mock this behavior at the command level, so that no actual command is executed. |
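As a rough illustration of the suggestion above, a unit test can replace the process call with a mock so that nothing reaches the Kubernetes API server. This is a minimal sketch, not the project's actual code: `run_command`, `install_cert_manager`, and the manifest name `cert-manager.yaml` are hypothetical stand-ins, and the real command helper in xpk may have a different name and signature.

```python
import subprocess
from unittest import mock


def run_command(command: str) -> tuple[int, str]:
  """Hypothetical stand-in for the project's command helper."""
  result = subprocess.run(
      command, shell=True, capture_output=True, text=True, check=False
  )
  return result.returncode, result.stdout + result.stderr


def install_cert_manager() -> int:
  """Hypothetical install step; the manifest name is a placeholder."""
  return_code, _ = run_command('kubectl apply -f cert-manager.yaml')
  return return_code


def test_install_cert_manager_runs_no_real_command():
  fake_result = subprocess.CompletedProcess(
      args='kubectl', returncode=0, stdout='', stderr=''
  )
  # Replace subprocess.run so no kubectl process is ever started.
  with mock.patch.object(
      subprocess, 'run', return_value=fake_result
  ) as fake_run:
    assert install_cert_manager() == 0
  # The install step should have issued exactly one command.
  fake_run.assert_called_once()
  assert fake_run.call_args.args[0] == 'kubectl apply -f cert-manager.yaml'
```

Patching at this level keeps the test independent of any cluster: the assertion checks which command would have been issued, not its effect.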
a41448d to a7c8f82
244d511 to 6c8005b
…ad.py
Add wait_for_deployment_ready()
Added unit test
update goldens.yaml
update goldens.yaml
update goldens.yaml
Fixed parser/cluster.py
update goldens.yaml
fixed linter
fixed linter
pyink
Test unit test
6c8005b to 0596553
Hi @jamOne- Google Cloud ML Diagnostics is an end-to-end managed platform that lets ML engineers optimize and diagnose their AI/ML workloads on Google Cloud. To get seamless workload tracking and profiling, ML engineers need to integrate their workload with the open-source google-cloud-mldiagnostics SDK (see the PR in MaxText) and deploy some webhooks and operators in the GKE cluster (this PR). |
Description
Adds installation of the webhooks, the injector, and the connection-operator during cluster creation.
Testing
Run the cluster create command with the --managed-mldiagnostics flag. Cluster creation then installs the required ML Diagnostics components: cert-manager, the injection-webhook, and the connection-operator. For example: https://paste.googleplex.com/6009656740806656