Setup: 4 gateways, 3 subsystems with 20 namespaces each.
The test scales down from 4 gateways to 2 by removing gateways nvmeof.a and nvmeof.b. During namespace load balancing, the following traceback shows up on gateway nvmeof.d:
[04-Nov-2025 11:35:14] INFO rebalance.py:147 (2): Scale-down rebalance is ongoing for ANA group 1 current load 14
[04-Nov-2025 11:35:14] INFO rebalance.py:155 (2): Found optimized ana group 1 that handles the group of deleted GW. Number NS in group 14 - Start NS rebalance
[04-Nov-2025 11:35:14] INFO rebalance.py:165 (2): Start rebalance (scale down) destination ana group 3, subsystem nqn.2016-06.io.spdk:cnode1
[04-Nov-2025 11:35:14] INFO rebalance.py:248 (2): == rebalance started == for subsystem 0, anagrp 1, destination anagrp 3, num ns 1 time 1762256114.5813808
[04-Nov-2025 11:35:14] INFO rebalance.py:255 (2): nsid for change_load_balancing: 8, nqn.2016-06.io.spdk:cnode1, anagrpid: 1
[04-Nov-2025 11:35:14] INFO grpc.py:2578 (2): Received auto request to change load balancing group for namespace with ID 8 in nqn.2016-06.io.spdk:cnode1 to 3, context: context
[04-Nov-2025 11:35:14] ERROR grpc.py:1239 (2): Failure while executing rebalance_logic()
Traceback (most recent call last):
  File "/src/control/state.py", line 526, in lock_omap
    self.omap_state.ioctx.lock_exclusive(self.omap_state.omap_name,
  File "rados.pyx", line 3935, in rados.Ioctx.lock_exclusive
rados.ObjectExists: [errno 17] RADOS object exists (Ioctx.rados_lock_exclusive(mypool): failed to set lock omap_file_lock on nvmeof.mygroup0.state)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/src/control/grpc.py", line 1236, in execute_grpc_function
    rc = self.omap_lock.execute_omap_locking_function(
  File "/src/control/state.py", line 430, in execute_omap_locking_function
    return grpc_func(omap_locking_func, request, context)
  File "/src/control/grpc.py", line 1213, in _grpc_function_with_lock
    rc = func(request, context)
  File "/src/control/rebalance.py", line 168, in rebalance_logic
    self.ns_rebalance(context, ana_grp, min_ana_grp, 1, "0")
  File "/src/control/rebalance.py", line 263, in ns_rebalance
    ret = self.gw_srv.namespace_change_load_balancing_group_safe(change_lb_group_req,
  File "/src/control/grpc.py", line 2609, in namespace_change_load_balancing_group_safe
    with omap_lock:
  File "/src/control/state.py", line 764, in __enter__
    self.omap_lock.lock_omap()
  File "/src/control/state.py", line 553, in lock_omap
    raise RuntimeError("An attempt to lock OMAP exclusively twice from "
RuntimeError: An attempt to lock OMAP exclusively twice from the same thread
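The failure mode is a nested acquisition of the non-reentrant OMAP lock: rebalance_logic() already runs under the lock taken by execute_omap_locking_function(), and namespace_change_load_balancing_group_safe() then enters "with omap_lock:" on the same thread. A minimal sketch of that pattern (illustrative only, not the actual ceph-nvmeof classes):

# Minimal sketch of a non-reentrant exclusive lock that refuses a second
# acquisition from the thread that already holds it, like lock_omap() above.
import threading


class OmapLockSketch:
    def __init__(self):
        self._owner = None  # ident of the thread currently holding the lock

    def lock_omap(self):
        me = threading.get_ident()
        if self._owner == me:
            raise RuntimeError("An attempt to lock OMAP exclusively twice "
                               "from the same thread")
        # the real code would call Ioctx.lock_exclusive() on the OMAP object here
        self._owner = me

    def unlock_omap(self):
        self._owner = None

    def __enter__(self):
        self.lock_omap()
        return self

    def __exit__(self, *exc):
        self.unlock_omap()


lock = OmapLockSketch()
try:
    with lock:          # outer: execute_omap_locking_function() around rebalance_logic()
        with lock:      # nested: "with omap_lock:" in namespace_change_load_balancing_group_safe()
            pass
except RuntimeError as e:
    print(e)            # An attempt to lock OMAP exclusively twice from the same thread
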
In the test, we deploy 4 gateways and then apply a spec that removes 2 of them. We see the above traceback, and "ceph nvme-gw show" reports 3 gateways (nvmeof.a is still in DELETING state 7 minutes after the scale-down):
2025-11-04T11:29:20.434 INFO:tasks.workunit.client.3.smithi178.stdout:[nvmeof.scale] SCALE DOWN: Setting up config to remove gateways nvmeof.a,nvmeof.b
....
2025-11-04T11:29:22.353 INFO:tasks.workunit.client.3.smithi178.stdout:[nvmeof.scale] SCALE DOWN: Starting scale testing by removing nvmeof.a,nvmeof.b
....
2025-11-04T11:29:22.353 INFO:tasks.workunit.client.3.smithi178.stdout:[nvmeof.scale] Verifying that everything is working with 4 gateways
....
2025-11-04T11:29:27.112 INFO:tasks.workunit.client.3.smithi178.stdout:[nvmeof.scale] Verified successfully that everything is working with 4 gateways
....
2025-11-04T11:29:27.113 INFO:tasks.workunit.client.3.smithi178.stderr:+ ceph orch apply -i /tmp/nvmeof-gw-new.yaml
....
2025-11-04T11:36:08.108 INFO:tasks.workunit.client.3.smithi178.stdout:[nvmeof.scale] Verifying that everything is working with 2 gateways
2025-11-04T11:36:08.108 INFO:tasks.workunit.client.3.smithi178.stderr:++ ceph nvme-gw show mypool mygroup0
2025-11-04T11:36:08.973 INFO:tasks.workunit.client.3.smithi178.stderr:+ output='{
2025-11-04T11:36:08.973 INFO:tasks.workunit.client.3.smithi178.stderr: "epoch": 127,
2025-11-04T11:36:08.974 INFO:tasks.workunit.client.3.smithi178.stderr: "pool": "mypool",
2025-11-04T11:36:08.974 INFO:tasks.workunit.client.3.smithi178.stderr: "group": "mygroup0",
2025-11-04T11:36:08.974 INFO:tasks.workunit.client.3.smithi178.stderr: "features": "LB",
2025-11-04T11:36:08.974 INFO:tasks.workunit.client.3.smithi178.stderr: "rebalance_ana_group": 1,
2025-11-04T11:36:08.974 INFO:tasks.workunit.client.3.smithi178.stderr: "num gws": 3,
2025-11-04T11:36:08.974 INFO:tasks.workunit.client.3.smithi178.stderr: "GW-epoch": 22,
2025-11-04T11:36:08.974 INFO:tasks.workunit.client.3.smithi178.stderr: "Anagrp list": "[ 3 4 ]",
2025-11-04T11:36:08.974 INFO:tasks.workunit.client.3.smithi178.stderr: "num-namespaces": 60,
2025-11-04T11:36:08.974 INFO:tasks.workunit.client.3.smithi178.stderr: "Created Gateways:": [
2025-11-04T11:36:08.975 INFO:tasks.workunit.client.3.smithi178.stderr: {
2025-11-04T11:36:08.975 INFO:tasks.workunit.client.3.smithi178.stderr: "gw-id": "client.nvmeof.nvmeof.a",
2025-11-04T11:36:08.975 INFO:tasks.workunit.client.3.smithi178.stderr: "anagrp-id": 1,
2025-11-04T11:36:08.975 INFO:tasks.workunit.client.3.smithi178.stderr: "num-namespaces": 14,
2025-11-04T11:36:08.975 INFO:tasks.workunit.client.3.smithi178.stderr: "performed-full-startup": 0,
2025-11-04T11:36:08.975 INFO:tasks.workunit.client.3.smithi178.stderr: "Availability": "DELETING",
2025-11-04T11:36:08.975 INFO:tasks.workunit.client.3.smithi178.stderr: "ana states": " 1: STANDBY , 3: STANDBY , 4: STANDBY "
2025-11-04T11:36:08.975 INFO:tasks.workunit.client.3.smithi178.stderr: },
2025-11-04T11:36:08.976 INFO:tasks.workunit.client.3.smithi178.stderr: {
2025-11-04T11:36:08.976 INFO:tasks.workunit.client.3.smithi178.stderr: "gw-id": "client.nvmeof.nvmeof.c",
2025-11-04T11:36:08.976 INFO:tasks.workunit.client.3.smithi178.stderr: "anagrp-id": 3,
2025-11-04T11:36:08.976 INFO:tasks.workunit.client.3.smithi178.stderr: "num-namespaces": 22,
2025-11-04T11:36:08.976 INFO:tasks.workunit.client.3.smithi178.stderr: "performed-full-startup": 1,
2025-11-04T11:36:08.976 INFO:tasks.workunit.client.3.smithi178.stderr: "Availability": "AVAILABLE",
2025-11-04T11:36:08.976 INFO:tasks.workunit.client.3.smithi178.stderr: "num-listeners": 3,
2025-11-04T11:36:08.976 INFO:tasks.workunit.client.3.smithi178.stderr: "ana states": " 1: STANDBY , 3: ACTIVE , 4: STANDBY "
2025-11-04T11:36:08.976 INFO:tasks.workunit.client.3.smithi178.stderr: },
2025-11-04T11:36:08.976 INFO:tasks.workunit.client.3.smithi178.stderr: {
2025-11-04T11:36:08.976 INFO:tasks.workunit.client.3.smithi178.stderr: "gw-id": "client.nvmeof.nvmeof.d",
2025-11-04T11:36:08.976 INFO:tasks.workunit.client.3.smithi178.stderr: "anagrp-id": 4,
2025-11-04T11:36:08.977 INFO:tasks.workunit.client.3.smithi178.stderr: "num-namespaces": 24,
2025-11-04T11:36:08.977 INFO:tasks.workunit.client.3.smithi178.stderr: "performed-full-startup": 1,
2025-11-04T11:36:08.977 INFO:tasks.workunit.client.3.smithi178.stderr: "Availability": "AVAILABLE",
2025-11-04T11:36:08.977 INFO:tasks.workunit.client.3.smithi178.stderr: "num-listeners": 3,
2025-11-04T11:36:08.977 INFO:tasks.workunit.client.3.smithi178.stderr: "ana states": " 1: ACTIVE , 3: STANDBY , 4: ACTIVE "
2025-11-04T11:36:08.977 INFO:tasks.workunit.client.3.smithi178.stderr: }
2025-11-04T11:36:08.977 INFO:tasks.workunit.client.3.smithi178.stderr: ]
2025-11-04T11:36:08.977 INFO:tasks.workunit.client.3.smithi178.stderr:}'
Full test logs: https://qa-proxy.ceph.com/teuthology/vallariag-2025-11-04_10:37:49-nvmeof-main-distro-default-smithi/8582236/teuthology.log
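For completeness, the stuck gateway can be detected mechanically from the show output; a small sketch (hypothetical helper, not part of the test suite, and assuming "ceph nvme-gw show" prints the JSON captured above):

# Hypothetical helper: count gateways per Availability state and flag
# anything still stuck in DELETING after the scale-down.
import json
import subprocess
from collections import Counter


def gateway_states(pool: str, group: str) -> Counter:
    out = subprocess.check_output(["ceph", "nvme-gw", "show", pool, group],
                                  text=True)
    gateways = json.loads(out)["Created Gateways:"]   # key as printed above
    return Counter(gw["Availability"] for gw in gateways)


states = gateway_states("mypool", "mygroup0")
print(states)   # with this bug: Counter({'AVAILABLE': 2, 'DELETING': 1})
assert "DELETING" not in states, "gateway still in DELETING state after scale-down"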