


@lresende lresende commented Nov 4, 2025

Add support for launching interactive Jupyter kernels on Ray clusters managed by the Kubernetes Ray Operator. The change introduces a new RayOperatorProcessProxy that manages the kernel lifecycle through Ray Custom Resource Definitions (CRDs). The process proxy monitors both the RayCluster CRD state (served by the ray.io/v1alpha1 API) and the underlying Kubernetes pod status, implementing a two-phase readiness check: it first verifies that the Ray cluster reaches the "ready" state, then confirms that the kernel head pod is running. Container status detection returns unified pod states compatible with Enterprise Gateway's existing lifecycle management while handling Ray-specific states such as cluster initialization, head pod scheduling, and worker node scaling.
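
A rough sketch of what such a two-phase check can look like is below; the attribute names (kernel_namespace, kernel_resource_name) and the KubeRay label selectors are assumptions for illustration, not necessarily the PR's exact code:

# Sketch of the two-phase readiness check; attribute names and labels are
# illustrative assumptions rather than the implementation in this PR.
from kubernetes import client

from enterprise_gateway.services.processproxies.k8s import KubernetesProcessProxy


class RayOperatorProcessProxy(KubernetesProcessProxy):
    def get_container_status(self, iteration):
        """Return a pod-phase-like status only after the RayCluster CR is ready."""
        # Phase 1: check the RayCluster custom resource state.
        ray_cluster = client.CustomObjectsApi().get_namespaced_custom_object(
            group="ray.io",
            version="v1alpha1",
            namespace=self.kernel_namespace,
            plural="rayclusters",
            name=self.kernel_resource_name,
        )
        cluster_state = (ray_cluster.get("status") or {}).get("state", "")
        if cluster_state.lower() != "ready":
            return "pending"  # cluster still initializing or scaling

        # Phase 2: confirm the kernel head pod is actually running.
        pods = client.CoreV1Api().list_namespaced_pod(
            namespace=self.kernel_namespace,
            label_selector=f"ray.io/cluster={self.kernel_resource_name},ray.io/node-type=head",
        )
        if pods.items and pods.items[0].status and pods.items[0].status.phase:
            return pods.items[0].status.phase.lower()
        return "pending"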

The commit provides a complete end-to-end solution: a custom Jinja2 template (ray.io-v1alpha1.yaml.j2) generates RayCluster CR manifests tuned for kernel workloads, with configurable idle timeout and resource limits; a new Docker image (kernel-ray-py) packages the Ray-enabled Python kernel with its dependencies; and the ray_python_operator kernel specification integrates with Enterprise Gateway's
existing kernel management infrastructure. The changes also update the Helm charts for proper RBAC configuration (adding Ray CRD permissions to the cluster role) and enhance the Makefile to build the Ray kernel image. The implementation maintains backward compatibility with existing kernel types while letting data scientists use Ray's distributed computing capabilities directly from interactive Jupyter notebooks, with full kernel lifecycle management, automatic cleanup, and proper resource isolation.

Fixes #939

@lresende lresende requested a review from kevin-bates November 4, 2025 19:05
Add support for creating remote kernels via Ray operator
by introducing a RayOperatorProcessProxy

Fixes jupyter-server#939
if self.assigned_host:
ready_to_connect = await self.receive_connection_info()
self.log.debug(
f">>> container.confirm_remote_startup(): ready to connect => {ready_to_connect}"

I'm cool with leaving these debug statements in - perhaps we can remove some of them later. However, can we avoid having two statements display the same message? This statement is the same as that on L229, so it would be good to differentiate these.

self.container_name = pod_info.metadata.name
if pod_info.status:
pod_status = pod_info.status.phase.lower()
self.log.debug(f">>> k8s.get_container_status: {pod_status}")

Same comment here - see L132 below - let's differentiate these.

from .k8s import KubernetesProcessProxy


class RayOperatorProcessProxy(KubernetesProcessProxy):

Should this derive from CustomResourceProcessProxy?

await super().launch_process(kernel_cmd, **kwargs)
return self

def get_container_status(self, iteration: int | None) -> str:

The code here looks very generic and feels like it belongs on the superclass. From speaking with you offline, it sounds like the RayOperator CRD expects different status values for running and ready, etc. Perhaps we can introduce override methods to account for CRDs with different status requirements in their lifecycle.

We would need to be aware of the Spark operator, but the default implementations on the base class would match what that operator expects anyway - so perhaps it's not too risky.
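
For illustration only, such override hooks might take a shape like the following (the method names and state values are hypothetical, not existing Enterprise Gateway APIs):

# Purely illustrative: these hook names and state values are hypothetical.
from enterprise_gateway.services.processproxies.k8s import KubernetesProcessProxy


class CustomResourceProcessProxy(KubernetesProcessProxy):
    def get_initial_states(self) -> set:
        # Default lifecycle vocabulary (roughly what a Spark-style operator reports).
        return {"submitted", "running"}

    def get_error_states(self) -> set:
        return {"failed", "submission_failed"}


class RayOperatorProcessProxy(CustomResourceProcessProxy):
    def get_initial_states(self) -> set:
        # The RayCluster CRD reports a different set of lifecycle values.
        return {"pending", "ready"}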

Comment on lines +138 to +139
if result:
self._reset_connection_info()

These seem useful in the superclass; then this method could be removed. Is there harm in moving _reset_connection_info() there as well? If so, perhaps the default implementation of "reset" could be a no-op that this process proxy overrides.
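
A minimal sketch of that idea, with assumed attribute names:

# Sketch only: the superclass default is a no-op, and the Ray-specific proxy
# overrides it to clear cached connection state; attribute names are assumptions.
class ContainerProcessProxySketch:  # stand-in for the real superclass
    def _reset_connection_info(self) -> None:
        """No-op by default; subclasses override when a restart needs a fresh handshake."""
        pass


class RayOperatorProcessProxySketch(ContainerProcessProxySketch):
    assigned_host: str = ""
    assigned_ip: str | None = None

    def _reset_connection_info(self) -> None:
        # Clear cached connection state so a restarted head pod re-registers.
        self.assigned_host = ""
        self.assigned_ip = None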

cp -r kernelspecs ../build
# Operator kernelspecs get launcher files after the override to preserve scripts
@echo ../build/kernelspecs/spark_python_operator | xargs -t -n 1 cp -r kernel-launchers/operators/*
@rm -f ../build/kernelspecs/spark_python_operator/scripts/ray.io-v1alpha1.yaml.j2

These rm statements (and on L72) are confusing to me. Ah, is it because we have a single directory for these scripts, and this is just pruning the unnecessary files for the given context?

No worries - this seems okay. The option to introduce another directory is probably too heavyweight.


ENV SPARK_VER $SPARK_VERSION
ENV HADOOP_VER 3.3.1
ENV SPARK_VER=$SPARK_VERSION

Thanks for updating this (and elsewhere)!

"process_proxy": {
"class_name": "enterprise_gateway.services.processproxies.ray_operator.RayOperatorProcessProxy",
"config": {
"image_name": "lresende/kernel-ray-py:VERSION",

Can we move these to the elyra repo prior to merge?

value: !!str {{ .Values.kernel.launchTimeout }}
- name: EG_KERNEL_INFO_TIMEOUT
value: !!str {{ .Values.kernel.infoTimeout }}
- name: EG_REQUEST_TIMEOUT

👍

@kevin-bates kevin-bates left a comment


Good stuff @lresende - thank you! This will be useful.

Just a couple of nits about logging. I'm cool with keeping the debug statements in place since this area is a common source of problems when bootstrapping new process proxies.

The other comment is about perhaps moving some of the CRD-based code to the superclass as we better understand the patterns of CRDs.

