@jabolina
Member

@jabolina jabolina commented Jan 5, 2026

I was experimenting with this idea for toggling safe mode on the operator side.

The idea is to expose pod-level parameters on the Infinispan CR. The offset is the ordinal index of the target pod, so offset: 1 targets infinispan-1. It looks like:

apiVersion: infinispan.org/v1
kind: Infinispan
metadata:
  name: infinispan
spec:
  replicas: 3
  version: 16.0.3
  overrides:
    targets:
      - offset: 1
        safeMode: true
  service:
    type: DataGrid

This opens the possibility of including extra parameters in the future.

When the user updates the CR with the overrides arguments, the operator intercepts the change and updates the ConfigMap, so that an overrides.env file is written to the mounted volume. This file will look something like:

$ cat server/conf/operator/overrides.env 
infinispan-1=--safe-mode 

It uses the node's hostname as the key for the parameters we define. This file is written to ALL pods in the cluster. In my example above:

NAME                                                      READY   STATUS      RESTARTS   AGE     IP            NODE                 NOMINATED NODE   READINESS GATES
infinispan-0                                              1/1     Running     0          2m48s   10.244.0.16   kind-control-plane   <none>           <none>
infinispan-1                                              1/1     Running     0          2m32s   10.244.0.18   kind-control-plane   <none>           <none>
infinispan-2                                              1/1     Running     0          2m23s   10.244.0.20   kind-control-plane   <none>           <none>

They all have the overrides.env with the same content.
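
As a rough sketch of the rendering step on the operator side (Python is used here only for brevity, it is not the operator's language; `render_overrides` and the dictionary shape of the targets are hypothetical, not the operator's API), the file content could be derived from the CR name and the `overrides.targets` list along these lines:

```python
def render_overrides(cr_name, targets):
    """Render overrides.env content: one `<pod-hostname>=<args>` line per target.

    Assumption: pod hostnames follow `<cr_name>-<offset>`, as in the
    example above. Only `safeMode` is handled; other keys are ignored.
    """
    lines = []
    for target in sorted(targets, key=lambda t: t["offset"]):
        args = []
        if target.get("safeMode"):
            args.append("--safe-mode")
        if args:  # skip targets that end up with no arguments
            lines.append(f"{cr_name}-{target['offset']}={' '.join(args)}")
    # Terminate the file with a newline only when there is content.
    return "\n".join(lines) + ("\n" if lines else "")
```

Because the output depends only on the CR, every pod can mount the same file and pick out its own line.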

The second part requires a change on Infinispan's side to search for the overrides.env file, parse it, and collect all the options specified by the operator when starting the server.
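
The lookup itself should be small. A minimal sketch of the parsing logic, in Python purely for illustration (the real change would live in the server's own startup code; `args_for_host` is a hypothetical name, and treating `#` lines as comments is my assumption, not part of the format above):

```python
import shlex


def args_for_host(overrides_text, hostname):
    """Return the extra server arguments listed for `hostname`, or [] if none."""
    for raw in overrides_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):  # assumed comment syntax
            continue
        key, sep, value = line.partition("=")
        if sep and key.strip() == hostname:
            # Split like a shell would, so quoted values stay together.
            return shlex.split(value)
    return []
```

A pod whose hostname has no entry gets an empty list and starts as usual.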

I think this approach also requires the operator to delete the targeted pods to trigger a restart in some cases. If the pod is already in a crash loop state, then it is all good; the pod will restart with safe mode enabled. But if the pod is alive but not ready, toggling safe-mode would require an additional step of forcing the pod restart.
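
To make the two cases concrete, here is a hedged sketch of the decision (Python for illustration only; `needs_forced_restart` is a hypothetical helper, its inputs mirror the pod phase, the Ready condition, and the container waiting reason reported by Kubernetes, and the fallback for every other case is an assumption):

```python
def needs_forced_restart(phase, ready, waiting_reason=None):
    """Decide whether the operator should delete the pod to apply an override.

    phase: pod phase, e.g. "Running"
    ready: whether the pod's Ready condition is true
    waiting_reason: container waiting reason, e.g. "CrashLoopBackOff"
    """
    if waiting_reason == "CrashLoopBackOff":
        return False  # the next automatic restart picks up the override
    if phase == "Running" and not ready:
        return True   # alive but not ready: nothing restarts it on its own
    return False      # other cases are left undecided in this sketch
```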

@openshift-ci

openshift-ci bot commented Jan 5, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@ryanemerson
Contributor

I like the approach for getting new commands to the server without changes to the pod spec.

> But if the pod is alive but not ready, toggling safe-mode would require an additional step of forcing the pod restart.

I guess that depends on how we manage the lifecycle of the FileWatcher. If it's one of the first things that we initialise, then I think we should be fine for most scenarios, as the ConfigMap-mounted file will eventually update:

https://kubernetes.io/docs/tutorials/configuration/updating-configuration-via-a-configmap/#rollout-configmap-volume
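
A minimal polling sketch of that idea, in Python only for illustration (this is not the server's FileWatcher; `read_if_changed` is a hypothetical helper, and treating a missing file as empty is an assumption). It compares content digests rather than timestamps, so it does not depend on how the mounted file gets replaced:

```python
import hashlib
from pathlib import Path


def read_if_changed(path, last_digest):
    """Return (content, digest) if the file changed, else (None, last_digest).

    A missing file is treated as empty content.
    """
    try:
        content = Path(path).read_text()
    except FileNotFoundError:
        content = ""
    digest = hashlib.sha256(content.encode()).hexdigest()
    if digest == last_digest:
        return None, last_digest
    return content, digest
```

A caller would invoke this on a timer, keep the returned digest, and react only when the content is not `None`.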

> The second part requires a change on Infinispan's side to search for the overrides.env file, parse it, and collect all the options specified by the operator when starting the server.

Safely applying the ^ configured changes, via an internal server restart etc., is going to be tricky, I suspect!

@jabolina
Member Author

jabolina commented Jan 6, 2026

@ryanemerson, I've opened infinispan/infinispan#16458 with the rough idea of how to enable safe-mode together with the operator.

I've created the same three-node cluster. After all nodes were running, I applied:

apiVersion: infinispan.org/v1
kind: Infinispan
metadata:
  name: infinispan
spec:
  image: quay.io/infinispan-test/server:16458
  replicas: 3
  version: 16.0.3
  overrides:
    targets:
      - offset: 1
        safeMode: true
  service:
    type: DataGrid

This alone won't trigger safe-mode; the server has to stop and start again so that the bash script is executed. After I ran kubectl delete pod infinispan-1 -n infinispan-operator-system, we have the following logs in pod infinispan-1:

Toggling safe mode for node
22:39:04,490 INFO  (main) [BOOT] Running batch files: [/etc/security/conf/operator-security/identities.cli]
...
22:39:04,803 INFO  (main) [BOOT] JVM arguments = [-server, --add-exports, java.naming/com.sun.jndi.ldap=ALL-UNNAMED, --add-opens, java.base/java.util=ALL-UNNAMED, --add-opens, java.base/java.util.concurrent=ALL-UNNAMED, --enable-native-access=ALL-UNNAMED, -XX:+UseCompactObjectHeaders, -XX:AOTMode=on, -XX:AOTCache=/opt/infinispan/server/cache/infinispan-server.x86_64.aot, -Xlog:gc*:file=/opt/infinispan/server/log/gc.log:time,uptimemillis:filecount=5,filesize=3M, -Djava.awt.headless=true, -Djava.net.preferIPv4Stack=true, -XX:+ExitOnOutOfMemoryError, -XX:MetaspaceSize=64M, -Xms64m, -Xmx512m, -Dvisualvm.display.name=infinispan-server, -Djava.util.logging.manager=org.infinispan.server.loader.LogManager, -Dinfinispan.server.home.path=/opt/infinispan, -jar, /opt/infinispan/lib/infinispan-server-runtime-16.1.0-SNAPSHOT.jar, --bind-address=0.0.0.0, --pre-start-batch=/etc/security/conf/operator-security/identities.cli, -l, /opt/infinispan/server/conf/operator/log4j.xml, -c, operator/infinispan-base.xml, -c, operator/infinispan-admin.xml, --safe-mode]

The first line is the one I've added in the Infinispan PR, and we have the --safe-mode argument included in the JVM arguments log, too. I've tested restarting the other pods, and they correctly ignore the arguments in the overrides file.

I am a bit torn about the coupling between the operator and the server script, but I couldn't think of anything else that would target specific pods without restarting the entire cluster.

@ryanemerson
Contributor

> This alone won't trigger safe-mode; the server has to stop and start again so that the bash script is executed.

What behaviour do we expect when a pod goes into safe-mode?

My concern with the above is that it's going to trigger a rebalance. If we have a FileWatcher in the server that looks for updates to the overrides file, then we can potentially enter safe mode without triggering a rebalance, as we don't need to shut down the original process.
