


@lresende lresende commented Nov 4, 2025

Add support for launching interactive Jupyter kernels on Ray clusters managed by the Kubernetes Ray Operator. The change introduces a new RayOperatorProcessProxy that manages the kernel lifecycle through Ray Custom Resource Definitions (CRDs). The process proxy monitors both the RayCluster CRD state (served by the ray.io/v1alpha1 API) and the underlying Kubernetes pod status, implementing a two-phase readiness check: it first verifies that the Ray cluster reaches the "ready" state, then confirms that the kernel head pod is running. Container status detection returns unified pod states compatible with Enterprise Gateway's existing lifecycle management while handling Ray-specific states such as cluster initialization, head pod scheduling, and worker node scaling.
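
A rough sketch of what such a two-phase check can look like is below; the attribute names (kernel_namespace, kernel_resource_name) and the KubeRay label selectors are assumptions for illustration, not necessarily the PR's exact code:

# Sketch of the two-phase readiness check; attribute names and labels are
# illustrative assumptions rather than the implementation in this PR.
from kubernetes import client

from enterprise_gateway.services.processproxies.k8s import KubernetesProcessProxy


class RayOperatorProcessProxy(KubernetesProcessProxy):
    def get_container_status(self, iteration):
        """Return a pod-phase-like status only after the RayCluster CR is ready."""
        # Phase 1: check the RayCluster custom resource state.
        ray_cluster = client.CustomObjectsApi().get_namespaced_custom_object(
            group="ray.io",
            version="v1alpha1",
            namespace=self.kernel_namespace,
            plural="rayclusters",
            name=self.kernel_resource_name,
        )
        cluster_state = (ray_cluster.get("status") or {}).get("state", "")
        if cluster_state.lower() != "ready":
            return "pending"  # cluster still initializing or scaling

        # Phase 2: confirm the kernel head pod is actually running.
        pods = client.CoreV1Api().list_namespaced_pod(
            namespace=self.kernel_namespace,
            label_selector=f"ray.io/cluster={self.kernel_resource_name},ray.io/node-type=head",
        )
        if pods.items and pods.items[0].status and pods.items[0].status.phase:
            return pods.items[0].status.phase.lower()
        return "pending"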

The commit provides a complete end-to-end solution: a custom Jinja2 template (ray.io-v1alpha1.yaml.j2) generates RayCluster CR manifests tuned for kernel workloads, with configurable idle timeout and resource limits; a new Docker image (kernel-ray-py) packages the Ray-enabled Python kernel with its dependencies; and the ray_python_operator kernel specification integrates with Enterprise Gateway's
existing kernel management infrastructure. The changes also update the Helm charts for proper RBAC configuration (adding Ray CRD permissions to the cluster role) and enhance the Makefile to build the Ray kernel image. The implementation maintains backward compatibility with existing kernel types while letting data scientists use Ray's distributed computing capabilities directly from interactive Jupyter notebooks, with full kernel lifecycle management, automatic cleanup, and proper resource isolation.

Fixes #939

@lresende lresende requested a review from kevin-bates November 4, 2025 19:05
Add support for creating remote kernels via Ray operator
by introducing a RayOperatorProcessProxy

Fixes jupyter-server#939
if self.assigned_host:
ready_to_connect = await self.receive_connection_info()
self.log.debug(
f">>> container.confirm_remote_startup(): ready to connect => {ready_to_connect}"

I'm cool with leaving these debug statements in - perhaps we can remove some of them later. However, can we avoid having two statements display the same message? This statement is the same as that on L229, so it would be good to differentiate these.

self.container_name = pod_info.metadata.name
if pod_info.status:
pod_status = pod_info.status.phase.lower()
self.log.debug(f">>> k8s.get_container_status: {pod_status}")

Same comment here - see L132 below - let's differentiate these.

from .k8s import KubernetesProcessProxy


class RayOperatorProcessProxy(KubernetesProcessProxy):

Should this derive from CustomResourceProcessProxy?

await super().launch_process(kernel_cmd, **kwargs)
return self

def get_container_status(self, iteration: int | None) -> str:

The code here looks very generic and feels like it belongs on the superclass. From speaking with you offline, it sounds like the RayOperator CRD expects different status values for running and ready, etc. Perhaps we can introduce override methods to account for CRDs with different status requirements in their lifecycle.

We would need to be aware of the Spark operator, but the default implementations on the base class would match what that operator expects anyway - so perhaps it's not too risky.
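
For illustration only, such override hooks might take a shape like the following (the method names and state values are hypothetical, not existing Enterprise Gateway APIs):

# Purely illustrative: these hook names and state values are hypothetical.
from enterprise_gateway.services.processproxies.k8s import KubernetesProcessProxy


class CustomResourceProcessProxy(KubernetesProcessProxy):
    def get_initial_states(self) -> set:
        # Default lifecycle vocabulary (roughly what a Spark-style operator reports).
        return {"submitted", "running"}

    def get_error_states(self) -> set:
        return {"failed", "submission_failed"}


class RayOperatorProcessProxy(CustomResourceProcessProxy):
    def get_initial_states(self) -> set:
        # The RayCluster CRD reports a different set of lifecycle values.
        return {"pending", "ready"}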

Comment on lines +138 to +139
if result:
self._reset_connection_info()

These seem useful in the superclass; then this method could be removed. Is there harm in moving _reset_connection_info() there as well? If so, perhaps the default implementation of "reset" could be a no-op that this process proxy overrides.
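
A minimal sketch of that idea, with assumed attribute names:

# Sketch only: the superclass default is a no-op, and the Ray-specific proxy
# overrides it to clear cached connection state; attribute names are assumptions.
class ContainerProcessProxySketch:  # stand-in for the real superclass
    def _reset_connection_info(self) -> None:
        """No-op by default; subclasses override when a restart needs a fresh handshake."""
        pass


class RayOperatorProcessProxySketch(ContainerProcessProxySketch):
    assigned_host: str = ""
    assigned_ip: str | None = None

    def _reset_connection_info(self) -> None:
        # Clear cached connection state so a restarted head pod re-registers.
        self.assigned_host = ""
        self.assigned_ip = None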

cp -r kernelspecs ../build
# Operator kernelspecs get launcher files after the override to preserve scripts
@echo ../build/kernelspecs/spark_python_operator | xargs -t -n 1 cp -r kernel-launchers/operators/*
@rm -f ../build/kernelspecs/spark_python_operator/scripts/ray.io-v1alpha1.yaml.j2

These rm statements (and on L72) are confusing to me. Ah, is it because we have a single directory for these scripts, and this is just pruning the unnecessary files for the given context?

No worries - this seems okay. The option to introduce another directory is probably too heavyweight.


ENV SPARK_VER $SPARK_VERSION
ENV HADOOP_VER 3.3.1
ENV SPARK_VER=$SPARK_VERSION

Thanks for updating this (and elsewhere)!

"process_proxy": {
"class_name": "enterprise_gateway.services.processproxies.ray_operator.RayOperatorProcessProxy",
"config": {
"image_name": "lresende/kernel-ray-py:VERSION",

Can we move these to the elyra repo prior to merge?

value: !!str {{ .Values.kernel.launchTimeout }}
- name: EG_KERNEL_INFO_TIMEOUT
value: !!str {{ .Values.kernel.infoTimeout }}
- name: EG_REQUEST_TIMEOUT

👍

@kevin-bates kevin-bates left a comment


Good stuff @lresende - thank you! This will be useful.

Just a couple of nits about logging. I'm cool with keeping the debug statements in place since this area is a common source of problems when bootstrapping new process proxies.

The other comment is about perhaps moving some of the CRD-based code to the superclass as we better understand the patterns of CRDs.

