Experiment for safe-mode #2414

jabolina · 2026-01-05T14:25:00Z

I was experimenting with this idea for toggling safe mode on the operator side.

The idea is to expose pod-level parameters on the Infinispan CR. The offset is the ordinal of the target pod, so offset 1 targets infinispan-1. It looks like:

apiVersion: infinispan.org/v1
kind: Infinispan
metadata:
  name: infinispan
spec:
  replicas: 3
  version: 16.0.3
  overrides:
    targets:
      - offset: 1
        safeMode: true
  service:
    type: DataGrid

This opens the possibility of including extra parameters in the future.

When the user updates the CR with the overrides arguments, the operator intercepts the change and updates the ConfigMap, so that an overrides.env file is written to the mounted volume. This file will look something like:

$ cat server/conf/operator/overrides.env 
infinispan-1=--safe-mode

The file uses each node's hostname as the key for the parameters we define. It is written to ALL pods in the cluster, which for my example above are:

NAME                                                      READY   STATUS      RESTARTS   AGE     IP            NODE                 NOMINATED NODE   READINESS GATES
infinispan-0                                              1/1     Running     0          2m48s   10.244.0.16   kind-control-plane   <none>           <none>
infinispan-1                                              1/1     Running     0          2m32s   10.244.0.18   kind-control-plane   <none>           <none>
infinispan-2                                              1/1     Running     0          2m23s   10.244.0.20   kind-control-plane   <none>           <none>

They all have the overrides.env with the same content.
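As a rough illustration of the mapping from the CR to the file, here is a minimal Python sketch. The operator itself is not written in Python, and the function name `render_overrides_env`, the dict layout, and the `<CR name>-<offset>` hostname rule are assumptions made for this example, chosen to match the output shown above.

```python
def render_overrides_env(cluster_name, targets):
    """Render overrides.env content: one '<hostname>=<args>' line per target.

    Assumption: the pod hostname is '<cluster_name>-<offset>', which matches
    the infinispan-0..2 pods listed above.
    """
    lines = []
    for target in sorted(targets, key=lambda t: t["offset"]):
        args = []
        if target.get("safeMode"):
            args.append("--safe-mode")
        # Targets with no active override produce no line at all.
        if args:
            lines.append(f"{cluster_name}-{target['offset']}={' '.join(args)}")
    return "\n".join(lines) + ("\n" if lines else "")


# Hypothetical in-memory form of spec.overrides.targets from the CR above.
targets = [{"offset": 1, "safeMode": True}]
print(render_overrides_env("infinispan", targets), end="")
# infinispan-1=--safe-mode
```

Because every pod receives the same file, the content does not depend on which pod reads it; only the lookup by hostname differs.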

The second part requires a change on Infinispan's side to search for the overrides.env file, parse it, and collect all the options specified by the operator when starting the server.
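The lookup on the server side could be sketched roughly as below. This is only an illustration of the parsing logic in Python, not the actual server change; the function name `args_for_host` and the whitespace-separated argument format are assumptions.

```python
def args_for_host(overrides_text, hostname):
    """Return the extra server arguments configured for this hostname.

    Each line is expected to look like '<hostname>=<arg> <arg> ...'.
    Lines for other hosts, blank lines, and malformed lines are ignored.
    """
    for raw in overrides_text.splitlines():
        line = raw.strip()
        if not line:
            continue
        # Split on the first '=' only, so arguments may contain '=' themselves.
        key, sep, value = line.partition("=")
        if sep and key.strip() == hostname:
            return value.split()
    return []


content = "infinispan-1=--safe-mode\n"
print(args_for_host(content, "infinispan-1"))  # ['--safe-mode']
print(args_for_host(content, "infinispan-0"))  # []
```

A pod that is not listed gets an empty argument list and starts as usual.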

I think this approach also requires the operator to delete the targeted pods to trigger a restart in some cases. If the pod is already in a crash loop state, then it is all good; the pod will restart with safe mode enabled. But if the pod is alive but not ready, toggling safe-mode would require an additional step of forcing the pod restart.
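The restart decision described above might be expressed as follows. This is a sketch under assumptions: the helper name `needs_forced_restart` is hypothetical, and the input is a dict shaped like the JSON returned by `kubectl get pod -o json`, where `CrashLoopBackOff` appears as the waiting reason of a container.

```python
def needs_forced_restart(pod):
    """Decide whether the operator must delete the pod to apply the override.

    A pod in CrashLoopBackOff will be restarted by the kubelet anyway and
    picks up the override on its next start. Any other pod, including one
    that is alive but not ready, keeps running its current process and
    needs an explicit restart.
    """
    statuses = pod.get("status", {}).get("containerStatuses", [])
    for container in statuses:
        waiting = container.get("state", {}).get("waiting") or {}
        if waiting.get("reason") == "CrashLoopBackOff":
            return False
    return True


crash_looping = {"status": {"containerStatuses": [
    {"ready": False, "state": {"waiting": {"reason": "CrashLoopBackOff"}}}]}}
alive_not_ready = {"status": {"containerStatuses": [
    {"ready": False, "state": {"running": {}}}]}}

print(needs_forced_restart(crash_looping))    # False
print(needs_forced_restart(alive_not_ready))  # True
```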

openshift-ci · 2026-01-05T14:25:03Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

ryanemerson · 2026-01-05T17:04:25Z

I like the approach for getting new commands to the server without changes to the pod spec.

> But if the pod is alive but not ready, toggling safe-mode would require an additional step of forcing the pod restart.

I guess that depends on how we manage the lifecycle of the FileWatcher. If it's one of the first things that we initialise, then I think we should be fine for most scenarios, as the ConfigMap mounted file will eventually update:

https://kubernetes.io/docs/tutorials/configuration/updating-configuration-via-a-configmap/#rollout-configmap-volume

> The second part requires a change on Infinispan's side to search for the overrides.env file, parse it, and collect all the options specified by the operator when starting the server.

Safely applying the configured changes above, via an internal server restart etc., is going to be tricky I suspect!

jabolina · 2026-01-06T22:48:38Z

@ryanemerson, I've opened infinispan/infinispan#16458 with the rough idea of how to enable safe-mode together with the operator.

I've created the same three-node cluster. After all nodes were running, I applied:

apiVersion: infinispan.org/v1
kind: Infinispan
metadata:
  name: infinispan
spec:
  image: quay.io/infinispan-test/server:16458
  replicas: 3
  version: 16.0.3
  overrides:
    targets:
      - offset: 1
        safeMode: true
  service:
    type: DataGrid

Only this won't trigger safe-mode, it requires the server to stop and start again, so the bash script is executed. After I did kubectl delete pod infinispan-1 -n infinispan-operator-system, we have the following logs in pod infinispan-1:

Toggling safe mode for node
22:39:04,490 INFO  (main) [BOOT] Running batch files: [/etc/security/conf/operator-security/identities.cli]
...
22:39:04,803 INFO  (main) [BOOT] JVM arguments = [-server, --add-exports, java.naming/com.sun.jndi.ldap=ALL-UNNAMED, --add-opens, java.base/java.util=ALL-UNNAMED, --add-opens, java.base/java.util.concurrent=ALL-UNNAMED, --enable-native-access=ALL-UNNAMED, -XX:+UseCompactObjectHeaders, -XX:AOTMode=on, -XX:AOTCache=/opt/infinispan/server/cache/infinispan-server.x86_64.aot, -Xlog:gc*:file=/opt/infinispan/server/log/gc.log:time,uptimemillis:filecount=5,filesize=3M, -Djava.awt.headless=true, -Djava.net.preferIPv4Stack=true, -XX:+ExitOnOutOfMemoryError, -XX:MetaspaceSize=64M, -Xms64m, -Xmx512m, -Dvisualvm.display.name=infinispan-server, -Djava.util.logging.manager=org.infinispan.server.loader.LogManager, -Dinfinispan.server.home.path=/opt/infinispan, -jar, /opt/infinispan/lib/infinispan-server-runtime-16.1.0-SNAPSHOT.jar, --bind-address=0.0.0.0, --pre-start-batch=/etc/security/conf/operator-security/identities.cli, -l, /opt/infinispan/server/conf/operator/log4j.xml, -c, operator/infinispan-base.xml, -c, operator/infinispan-admin.xml, --safe-mode]

The first line is the one I've added in the Infinispan PR, and we have the --safe-mode argument included in the JVM arguments log, too. I've tested restarting the other pods, and they correctly ignore the arguments in the overrides file.

I am a bit torn about the coupling between the operator and the server script, but I couldn't think of anything else that would target specific pods without restarting the entire cluster.

ryanemerson · 2026-01-08T09:37:33Z

> Only this won't trigger safe-mode, it requires the server to stop and start again, so the bash script is executed.

What behaviour do we expect when a pod goes into safe-mode?

My concern with the above is that it's going to trigger a rebalance. If we have a FileWatcher in the server that looks for updates to the overrides file, then we can potentially enter safe mode without triggering a rebalance, as we don't need to shut down the original process.
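To make the watcher idea concrete, a polling sketch in Python is shown below. It is not Infinispan's FileWatcher; the class name `OverridesWatcher` and the callback shape are hypothetical. It compares a content digest rather than relying on file modification events, on the assumption that this copes better with ConfigMap volumes, where Kubernetes swaps a symlink when the mounted content is updated.

```python
import hashlib
from pathlib import Path


class OverridesWatcher:
    """Poll a file and invoke a callback whenever its content changes."""

    def __init__(self, path, on_change):
        self.path = Path(path)
        self.on_change = on_change
        self._digest = None  # No content seen yet, so the first poll fires.

    def poll(self):
        """Check the file once; return True if the callback was invoked."""
        try:
            data = self.path.read_bytes()
        except FileNotFoundError:
            # A missing file is treated the same as an empty one.
            data = b""
        digest = hashlib.sha256(data).hexdigest()
        if digest == self._digest:
            return False
        self._digest = digest
        self.on_change(data.decode())
        return True
```

A server would call `poll()` on a timer from early in its startup and have the callback decide whether the entry for its own hostname changed, for example to enter or leave safe mode without stopping the process.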

Experiment for safe-mode

8f9d437
