Skip to content

Add the option to terminate pending kubernetes kernels if they have events preventing them from starting #1357

@OrenZ1

Description

@OrenZ1

Problem

I am facing a problem when using JEG on kubernetes.
I have set kernel launch timeout to 5 mins (because I am using large images), and set MAX_KERNELS_PER_USER to 2 to prevent spamming of kernels.
When a user submits a request to launch a kernel, it gets started over a remote pod. Sometimes, the pod remains stuck on pending, i.e. due to a lack of resources which is currently affective. In this case, the user can’t submit a new kernel (with a lower resources demand), and has to wait for 5 minutes for the timeout to be affective, before using another kernel. I even thought about setting up a service which watches pending kernel pods, and if they have events which prevent them from starting, it would send a DELETE request to the gateway to kill the kernel. The problem is that when kernels are pending, the gateway can’t receive DELETE requests to kernels.
In addition, the kernel is not aware to actions done on the kubernetes cluster, so I can’t delete the pods using kubernetes API, because JEG would still wait for timeout for this kernel.

Proposed Solution

For starters, I would expect JEG to have awareness of the Kubernetes cluster it is running on, so that when kernel pods are deleted, it would stop sampling them.
For the other issue I’ve stated I can see two possible solutions:
The first one (and in my opinion, the easier one), is to allow receiving DELETE requests to kernels which are pending.
The second one is to allow to configure the JEG to kill pending kernels when they have events (or certain events) on its own. But this seems a bit trickier to think about properly.

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions