[WIP] Interactive Ray support #1428
Conversation
Add support for creating remote kernels via the Ray operator by introducing a RayOperatorProcessProxy. Fixes jupyter-server#939
    if self.assigned_host:
        ready_to_connect = await self.receive_connection_info()
        self.log.debug(
            f">>> container.confirm_remote_startup(): ready to connect => {ready_to_connect}"
I'm cool with leaving these debug statements in - perhaps we can remove some of them later. However, can we avoid having two statements display the same message? This statement is the same as that on L229, so it would be good to differentiate these.
    self.container_name = pod_info.metadata.name
    if pod_info.status:
        pod_status = pod_info.status.phase.lower()
        self.log.debug(f">>> k8s.get_container_status: {pod_status}")
Same comment here - see L132 below - let's differentiate these.
    from .k8s import KubernetesProcessProxy


    class RayOperatorProcessProxy(KubernetesProcessProxy):
Should this derive from CustomResourceProcessProxy?
    await super().launch_process(kernel_cmd, **kwargs)
    return self

    def get_container_status(self, iteration: int | None) -> str:
The code here looks very generic and feels like it belongs on the superclass. From speaking with you offline, it sounds like the RayOperator CRD expects different status values for running, ready, etc. Perhaps we can introduce override methods to account for CRDs with different status requirements in their lifecycle.
We would need to be aware of the Spark operator, but the default implementations on the base class would match what that operator expects anyway - so perhaps it's not too risky.
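The override-method idea could look something like the sketch below. The class and method names here are hypothetical illustrations of the suggestion, not Enterprise Gateway's actual API: a base class supplies the default status vocabulary (matching what the Spark operator expects), and the Ray proxy overrides only the mapping.

```python
# Hypothetical sketch: template-method hooks so CRDs with different status
# vocabularies (Ray vs. Spark operator) override only the status mapping.
# Class and method names are illustrative, not Enterprise Gateway's API.

class CustomResourceStatusMixin:
    """Default behavior matching what the Spark operator reports."""

    def crd_ready_states(self) -> set[str]:
        # CRD states that count as "ready"; subclasses override this hook.
        return {"running"}

    def is_ready(self, crd_state: str) -> bool:
        return crd_state.lower() in self.crd_ready_states()


class RayOperatorStatusMixin(CustomResourceStatusMixin):
    """The RayCluster CRD reports 'ready' rather than 'running'."""

    def crd_ready_states(self) -> set[str]:
        return {"ready"}
```

With this shape, `get_container_status()` on the superclass could stay generic and simply consult `is_ready()`.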
    if result:
        self._reset_connection_info()
This seems useful in the superclass, and then this method could be removed. Is there harm in moving _reset_connection_info() there as well? If so, perhaps the default implementation of "reset" could be a no-op that this process proxy overrides.
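A minimal sketch of that no-op-default suggestion, with hypothetical class names and attributes standing in for the real process proxy hierarchy:

```python
# Hypothetical sketch: the superclass provides a no-op _reset_connection_info()
# that derived process proxies override when a restart must clear previously
# received connection info. Names are illustrative, not Enterprise Gateway's API.

class ContainerProcessProxyBase:
    def __init__(self):
        self.assigned_host = "host-a"
        self.connection_info = {"shell_port": 12345}

    def _reset_connection_info(self):
        # Default is a no-op; most proxies keep their connection info.
        pass

    def restart(self) -> bool:
        result = True  # stand-in for the actual restart outcome
        if result:
            self._reset_connection_info()
        return result


class RayLikeProcessProxy(ContainerProcessProxyBase):
    def _reset_connection_info(self):
        # Ray-style override: force a fresh connection handshake after restart.
        self.assigned_host = None
        self.connection_info = {}
```

The base class keeps the `if result:` guard, and only proxies that need the reset pay for it.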
    cp -r kernelspecs ../build
    # Operator kernelspecs get launcher files after the override to preserve scripts
    @echo ../build/kernelspecs/spark_python_operator | xargs -t -n 1 cp -r kernel-launchers/operators/*
    @rm -f ../build/kernelspecs/spark_python_operator/scripts/ray.io-v1alpha1.yaml.j2
These rm statements (and on L72) are confusing to me. Ah, is it because we have a single directory for these scripts, and this is just pruning the unnecessary files for the given context?
No worries - this seems okay. The option to introduce another directory is probably too heavyweight.
    ENV SPARK_VER $SPARK_VERSION
    ENV HADOOP_VER 3.3.1
    ENV SPARK_VER=$SPARK_VERSION
Thanks for updating this (and elsewhere)!
    "process_proxy": {
      "class_name": "enterprise_gateway.services.processproxies.ray_operator.RayOperatorProcessProxy",
      "config": {
        "image_name": "lresende/kernel-ray-py:VERSION",
Can we move these to the elyra repo prior to merge?
    value: !!str {{ .Values.kernel.launchTimeout }}
    - name: EG_KERNEL_INFO_TIMEOUT
      value: !!str {{ .Values.kernel.infoTimeout }}
    - name: EG_REQUEST_TIMEOUT
👍
kevin-bates left a comment
Good stuff @lresende - thank you! This will be useful.
Just a couple of nits about logging. I'm cool with keeping the debug statements in place, since this area is a common source of problems when bootstrapping new process proxies.
The other comment suggests moving some of the CRD-based code to the superclass as we better understand the patterns of CRDs.
Add support for launching interactive Jupyter kernels on Ray clusters managed by the Kubernetes Ray Operator. This introduces a new RayOperatorProcessProxy that manages the kernel lifecycle through Ray Custom Resource Definitions (CRDs). The process proxy monitors both the RayCluster CRD state (managed by the ray.io/v1alpha1 API) and the underlying Kubernetes pod status, implementing a two-phase readiness check that first verifies the Ray cluster reaches the "ready" state before confirming the kernel head pod is running. The container status detection returns unified pod states compatible with Enterprise Gateway's existing lifecycle management while properly handling Ray-specific states such as cluster initialization, head pod scheduling, and worker node scaling.
The commit provides a complete end-to-end solution: a custom Jinja2 template (ray.io-v1alpha1.yaml.j2) generates RayCluster CR manifests tuned for kernel workloads, with a configurable idle timeout and resource limits. A new Docker image (kernel-ray-py) packages the Ray-enabled Python kernel with all necessary dependencies, and the ray_python_operator kernel specification integrates with Enterprise Gateway's existing kernel management infrastructure. The changes also update the Helm charts for proper RBAC configuration (adding Ray CRD permissions to the cluster role) and enhance the Makefile to build the Ray kernel image. The implementation maintains backward compatibility with existing kernel types while letting data scientists use Ray's distributed computing capabilities directly from interactive Jupyter notebooks, with full kernel lifecycle management, automatic cleanup, and proper resource isolation.
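The two-phase readiness check described above can be sketched as a pure status-mapping function. The function name and state strings below are illustrative assumptions; the actual proxy reads these values from the RayCluster CR and the head pod via the Kubernetes API:

```python
# Hypothetical sketch of the two-phase readiness check: the RayCluster CRD
# state gates the result, and only once the cluster is "ready" does the head
# pod's phase determine the unified status reported to Enterprise Gateway.

def unified_kernel_status(ray_cluster_state: str, head_pod_phase: str) -> str:
    """Map Ray CRD state plus head pod phase onto a single pod-like status."""
    # Phase 1: the RayCluster CR must report "ready" before the pod matters.
    if ray_cluster_state.lower() != "ready":
        return "pending"
    # Phase 2: surface the head pod's own phase (e.g. "pending", "running").
    return head_pod_phase.lower()
```

This keeps Ray-specific states (cluster initialization, head pod scheduling) collapsed into values the existing lifecycle management already understands.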
Fixes #939