Setup: 4 gateways, 3 subsystems with 20 namespaces each.
The test scales down from 4 gateways to 2 by removing gateways nvmeof.a and nvmeof.b. During namespace load balancing, the following traceback shows up on gateway nvmeof.d:
[04-Nov-2025 11:35:14] INFO rebalance.py:147 (2): Scale-down rebalance is ongoing for ANA group 1 current load 14
[04-Nov-2025 11:35:14] INFO rebalance.py:155 (2): Found optimized ana group 1 that handles the group of deleted GW. Number NS in group 14 - Start NS rebalance
[04-Nov-2025 11:35:14] INFO rebalance.py:165 (2): Start rebalance (scale down) destination ana group 3, subsystem nqn.2016-06.io.spdk:cnode1
[04-Nov-2025 11:35:14] INFO rebalance.py:248 (2): == rebalance started == for subsystem 0, anagrp 1, destination anagrp 3, num ns 1 time 1762256114.5813808
[04-Nov-2025 11:35:14] INFO rebalance.py:255 (2): nsid for change_load_balancing: 8, nqn.2016-06.io.spdk:cnode1, anagrpid: 1
[04-Nov-2025 11:35:14] INFO grpc.py:2578 (2): Received auto request to change load balancing group for namespace with ID 8 in nqn.2016-06.io.spdk:cnode1 to 3, context: context
[04-Nov-2025 11:35:14] ERROR grpc.py:1239 (2): Failure while executing rebalance_logic()
Traceback (most recent call last):
  File "/src/control/state.py", line 526, in lock_omap
    self.omap_state.ioctx.lock_exclusive(self.omap_state.omap_name,
  File "rados.pyx", line 3935, in rados.Ioctx.lock_exclusive
rados.ObjectExists: [errno 17] RADOS object exists (Ioctx.rados_lock_exclusive(mypool): failed to set lock omap_file_lock on nvmeof.mygroup0.state)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/src/control/grpc.py", line 1236, in execute_grpc_function
    rc = self.omap_lock.execute_omap_locking_function(
  File "/src/control/state.py", line 430, in execute_omap_locking_function
    return grpc_func(omap_locking_func, request, context)
  File "/src/control/grpc.py", line 1213, in _grpc_function_with_lock
    rc = func(request, context)
  File "/src/control/rebalance.py", line 168, in rebalance_logic
    self.ns_rebalance(context, ana_grp, min_ana_grp, 1, "0")
  File "/src/control/rebalance.py", line 263, in ns_rebalance
    ret = self.gw_srv.namespace_change_load_balancing_group_safe(change_lb_group_req,
  File "/src/control/grpc.py", line 2609, in namespace_change_load_balancing_group_safe
    with omap_lock:
  File "/src/control/state.py", line 764, in __enter__
    self.omap_lock.lock_omap()
  File "/src/control/state.py", line 553, in lock_omap
    raise RuntimeError("An attempt to lock OMAP exclusively twice from "
RuntimeError: An attempt to lock OMAP exclusively twice from the same thread
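The failure mode is a nested acquisition of the non-reentrant OMAP lock: rebalance_logic() already runs under the lock taken by execute_omap_locking_function(), and namespace_change_load_balancing_group_safe() then enters "with omap_lock:" on the same thread. A minimal sketch of that pattern (illustrative only, not the actual ceph-nvmeof classes):

# Minimal sketch of a non-reentrant exclusive lock that refuses a second
# acquisition from the thread that already holds it, like lock_omap() above.
import threading


class OmapLockSketch:
    def __init__(self):
        self._owner = None  # ident of the thread currently holding the lock

    def lock_omap(self):
        me = threading.get_ident()
        if self._owner == me:
            raise RuntimeError("An attempt to lock OMAP exclusively twice "
                               "from the same thread")
        # the real code would call Ioctx.lock_exclusive() on the OMAP object here
        self._owner = me

    def unlock_omap(self):
        self._owner = None

    def __enter__(self):
        self.lock_omap()
        return self

    def __exit__(self, *exc):
        self.unlock_omap()


lock = OmapLockSketch()
try:
    with lock:          # outer: execute_omap_locking_function() around rebalance_logic()
        with lock:      # nested: "with omap_lock:" in namespace_change_load_balancing_group_safe()
            pass
except RuntimeError as e:
    print(e)            # An attempt to lock OMAP exclusively twice from the same thread
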
In the test, we deploy 4 gateways and then apply a spec that removes 2 of them. We see the above traceback, and "ceph nvme-gw show" reports 3 gateways (nvmeof.a is still in DELETING state 7 minutes after the scale-down):
2025-11-04T11:29:20.434 INFO:tasks.workunit.client.3.smithi178.stdout:[nvmeof.scale] SCALE DOWN: Setting up config to remove gateways nvmeof.a,nvmeof.b
....
2025-11-04T11:29:22.353 INFO:tasks.workunit.client.3.smithi178.stdout:[nvmeof.scale] SCALE DOWN: Starting scale testing by removing nvmeof.a,nvmeof.b
....
2025-11-04T11:29:22.353 INFO:tasks.workunit.client.3.smithi178.stdout:[nvmeof.scale] Verifying that everything is working with 4 gateways
....
2025-11-04T11:29:27.112 INFO:tasks.workunit.client.3.smithi178.stdout:[nvmeof.scale] Verified successfully that everything is working with 4 gateways
....
2025-11-04T11:29:27.113 INFO:tasks.workunit.client.3.smithi178.stderr:+ ceph orch apply -i /tmp/nvmeof-gw-new.yaml
....
2025-11-04T11:36:08.108 INFO:tasks.workunit.client.3.smithi178.stdout:[nvmeof.scale] Verifying that everything is working with 2 gateways
2025-11-04T11:36:08.108 INFO:tasks.workunit.client.3.smithi178.stderr:++ ceph nvme-gw show mypool mygroup0
2025-11-04T11:36:08.973 INFO:tasks.workunit.client.3.smithi178.stderr:+ output='{
2025-11-04T11:36:08.973 INFO:tasks.workunit.client.3.smithi178.stderr: "epoch": 127,
2025-11-04T11:36:08.974 INFO:tasks.workunit.client.3.smithi178.stderr: "pool": "mypool",
2025-11-04T11:36:08.974 INFO:tasks.workunit.client.3.smithi178.stderr: "group": "mygroup0",
2025-11-04T11:36:08.974 INFO:tasks.workunit.client.3.smithi178.stderr: "features": "LB",
2025-11-04T11:36:08.974 INFO:tasks.workunit.client.3.smithi178.stderr: "rebalance_ana_group": 1,
2025-11-04T11:36:08.974 INFO:tasks.workunit.client.3.smithi178.stderr: "num gws": 3,
2025-11-04T11:36:08.974 INFO:tasks.workunit.client.3.smithi178.stderr: "GW-epoch": 22,
2025-11-04T11:36:08.974 INFO:tasks.workunit.client.3.smithi178.stderr: "Anagrp list": "[ 3 4 ]",
2025-11-04T11:36:08.974 INFO:tasks.workunit.client.3.smithi178.stderr: "num-namespaces": 60,
2025-11-04T11:36:08.974 INFO:tasks.workunit.client.3.smithi178.stderr: "Created Gateways:": [
2025-11-04T11:36:08.975 INFO:tasks.workunit.client.3.smithi178.stderr: {
2025-11-04T11:36:08.975 INFO:tasks.workunit.client.3.smithi178.stderr: "gw-id": "client.nvmeof.nvmeof.a",
2025-11-04T11:36:08.975 INFO:tasks.workunit.client.3.smithi178.stderr: "anagrp-id": 1,
2025-11-04T11:36:08.975 INFO:tasks.workunit.client.3.smithi178.stderr: "num-namespaces": 14,
2025-11-04T11:36:08.975 INFO:tasks.workunit.client.3.smithi178.stderr: "performed-full-startup": 0,
2025-11-04T11:36:08.975 INFO:tasks.workunit.client.3.smithi178.stderr: "Availability": "DELETING",
2025-11-04T11:36:08.975 INFO:tasks.workunit.client.3.smithi178.stderr: "ana states": " 1: STANDBY , 3: STANDBY , 4: STANDBY "
2025-11-04T11:36:08.975 INFO:tasks.workunit.client.3.smithi178.stderr: },
2025-11-04T11:36:08.976 INFO:tasks.workunit.client.3.smithi178.stderr: {
2025-11-04T11:36:08.976 INFO:tasks.workunit.client.3.smithi178.stderr: "gw-id": "client.nvmeof.nvmeof.c",
2025-11-04T11:36:08.976 INFO:tasks.workunit.client.3.smithi178.stderr: "anagrp-id": 3,
2025-11-04T11:36:08.976 INFO:tasks.workunit.client.3.smithi178.stderr: "num-namespaces": 22,
2025-11-04T11:36:08.976 INFO:tasks.workunit.client.3.smithi178.stderr: "performed-full-startup": 1,
2025-11-04T11:36:08.976 INFO:tasks.workunit.client.3.smithi178.stderr: "Availability": "AVAILABLE",
2025-11-04T11:36:08.976 INFO:tasks.workunit.client.3.smithi178.stderr: "num-listeners": 3,
2025-11-04T11:36:08.976 INFO:tasks.workunit.client.3.smithi178.stderr: "ana states": " 1: STANDBY , 3: ACTIVE , 4: STANDBY "
2025-11-04T11:36:08.976 INFO:tasks.workunit.client.3.smithi178.stderr: },
2025-11-04T11:36:08.976 INFO:tasks.workunit.client.3.smithi178.stderr: {
2025-11-04T11:36:08.976 INFO:tasks.workunit.client.3.smithi178.stderr: "gw-id": "client.nvmeof.nvmeof.d",
2025-11-04T11:36:08.976 INFO:tasks.workunit.client.3.smithi178.stderr: "anagrp-id": 4,
2025-11-04T11:36:08.977 INFO:tasks.workunit.client.3.smithi178.stderr: "num-namespaces": 24,
2025-11-04T11:36:08.977 INFO:tasks.workunit.client.3.smithi178.stderr: "performed-full-startup": 1,
2025-11-04T11:36:08.977 INFO:tasks.workunit.client.3.smithi178.stderr: "Availability": "AVAILABLE",
2025-11-04T11:36:08.977 INFO:tasks.workunit.client.3.smithi178.stderr: "num-listeners": 3,
2025-11-04T11:36:08.977 INFO:tasks.workunit.client.3.smithi178.stderr: "ana states": " 1: ACTIVE , 3: STANDBY , 4: ACTIVE "
2025-11-04T11:36:08.977 INFO:tasks.workunit.client.3.smithi178.stderr: }
2025-11-04T11:36:08.977 INFO:tasks.workunit.client.3.smithi178.stderr: ]
2025-11-04T11:36:08.977 INFO:tasks.workunit.client.3.smithi178.stderr:}'
Full test logs: https://qa-proxy.ceph.com/teuthology/vallariag-2025-11-04_10:37:49-nvmeof-main-distro-default-smithi/8582236/teuthology.log
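For completeness, the stuck gateway can be detected mechanically from the show output; a small sketch (hypothetical helper, not part of the test suite, and assuming "ceph nvme-gw show" prints the JSON captured above):

# Hypothetical helper: count gateways per Availability state and flag
# anything still stuck in DELETING after the scale-down.
import json
import subprocess
from collections import Counter


def gateway_states(pool: str, group: str) -> Counter:
    out = subprocess.check_output(["ceph", "nvme-gw", "show", pool, group],
                                  text=True)
    gateways = json.loads(out)["Created Gateways:"]   # key as printed above
    return Counter(gw["Availability"] for gw in gateways)


states = gateway_states("mypool", "mygroup0")
print(states)   # with this bug: Counter({'AVAILABLE': 2, 'DELETING': 1})
assert "DELETING" not in states, "gateway still in DELETING state after scale-down"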