Skip to content

Commit aa1c9fd

Browse files
authored
Fix a data race between clear_keep_alive and state. (#1191)
In JAX CI, we saw the following data race: ``` WARNING: ThreadSanitizer: data race (pid=36276) Read of size 4 at 0x7ad434c76774 by thread T4 (mutexes: read M0): #0 nanobind::detail::nb_type_get(std::type_info const*, _object*, unsigned char, nanobind::detail::cleanup_list*, void**) /proc/self/cwd/external/nanobind/src/nb_type.cpp:1536:17 (libjax_common.so+0xa5991a) (BuildId: 2414288b76007439f199e80c1b7b912642188753) #1 nanobind::detail::type_caster_base::from_python(nanobind::handle, unsigned char, nanobind::detail::cleanup_list*) /proc/self/cwd/external/nanobind/include/nanobind/nb_cast.h:481:16 (libjax_common.so+0x5d96184) (BuildId: 2414288b76007439f199e80c1b7b912642188753) #2 _object* nanobind::detail::func_create(xla::Layout const& (xla::PjRtLayout::*)() const, nanobind::scope const&, nanobind::name const&, nanobind::is_method const&)::'lambda'(xla::PjRtLayout const*), xla::Layout const&, xla::PjRtLayout const*, 0ul, nanobind::scope, nanobind::name, nanobind::is_method>(xla::PjRtLayout&&, xla::Layout const& (*)(nanobind::scope, nanobind::name, nanobind::is_method), std::integer_sequence, nanobind::scope const&, nanobind::name const&, nanobind::is_method const&)::'lambda'(void*, _object**, unsigned char*, nanobind::rv_policy, nanobind::detail::cleanup_list*)::operator()(void*, _object**, unsigned char*, nanobind::rv_policy, nanobind::detail::cleanup_list*) const /proc/self/cwd/external/nanobind/include/nanobind/nb_func.h:254:41 (libjax_common.so+0x5d96184) ... Previous write of size 4 at 0x7ad434c76774 by thread T5 (mutexes: read M0): #0 nanobind::detail::keep_alive(_object*, void*, void (*)(void*) noexcept) /proc/self/cwd/external/nanobind/src/nb_type.cpp:1661:47 (libjax_common.so+0xa5a508) (BuildId: 2414288b76007439f199e80c1b7b912642188753) #1 nanobind::detail::shared_from_cpp(std::shared_ptr&&, _object*) /proc/self/cwd/external/nanobind/include/nanobind/stl/shared_ptr.h:56:5 (libjax_common.so+0x8c6bc6) (BuildId: 2414288b76007439f199e80c1b7b912642188753) #2 nanobind::detail::type_caster, int>::from_cpp(std::shared_ptr const&, nanobind::rv_policy, nanobind::detail::cleanup_list*) /proc/self/cwd/external/nanobind/include/nanobind/stl/shared_ptr.h:129:13 (libjax_common.so+0x8c6899) (BuildId: 2414288b76007439f199e80c1b7b912642188753) #3 _object* nanobind::detail::func_create> (), jax::PyArray>, std::shared_ptr, jax::PyArray&, 0ul, nanobind::is_method, nanobind::is_getter>(xla::ValueOrThrowWrapper> (), jax::PyArray>&&, std::shared_ptr (*)(jax::PyArray&), std::integer_sequence, nanobind::is_method const&, nanobind::is_getter const&)::'lambda'(void*, _object**, unsigned char*, nanobind::rv_policy, nanobind::detail::cleanup_list*)::operator()(void*, _object**, unsigned char*, nanobind::rv_policy, nanobind::detail::cleanup_list*) const /proc/self/cwd/external/nanobind/include/nanobind/nb_func.h:274:22 (libjax_common.so+0x8c63d5) (BuildId: 2414288b76007439f199e80c1b7b912642188753) ``` The race looks like this: * two threads both attempt to return the same std::shared_ptr<> object concurrently. * thread (a) calls nb_type_put and gets the is_new=true case. Thread (b) calls nb_type_put and gets the is_new=false case. * thread (b) returns the object to Python, and then passes that object as an argument to a nanobind-bound function, which reads inst->state. * thread (a) reaches the is_new branch and calls keep_alive, which writes inst->clear_keep_alive. `state` and `clear_keep_alive` are two bitfields on the same object accessed concurrently, so this is a data race, per the C++20 standard: [intro.races] (6.9.2.1) Data races: Paragraph 21 defines a data race and its consequence: "The execution of a program contains a data race if it contains two potentially concurrent conflicting actions, at least one of which is not atomic, and neither happens before the other... Any such data race results in undefined behavior.". Paragraph 2 defines conflicting actions as modifying or reading/modifying the same memory location. [intro.memory] (6.7.1) Memory model: This section precisely defines what a "memory location" is for bitfields. Paragraph 3 states: "A memory location is either an object of scalar type or a maximal sequence of adjacent bit-fields all having nonzero width.". This PR splits clear_keep_alive into a separate non-bitfield member. This may or may not be a sufficient fix, in particular, one can imagine that it may be possible for multiple threads to call keep_alive concurrently, so it may need to be an atomic value.
1 parent 105340e commit aa1c9fd

File tree

5 files changed

+55
-11
lines changed

5 files changed

+55
-11
lines changed

docs/changelog.rst

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -18,6 +18,10 @@ below inherit that of the preceding release.
1818
Version TBD (not yet released)
1919
------------------------------
2020

21+
- ABI version 17. The layout of the `nb_inst` class was changed to fix a data race
22+
between `clear_keep_alive` and other bitfields in the same struct (PR `#1191
23+
<https://github.com/wjakob/nanobind/pull/1191>`__)
24+
2125
- Nanobind now officially supports **MinGW-w64** and **Intel ICX** (the modern
2226
Clang-based Intel compiler). Continuous integration tests have been added to
2327
ensure compatibility with these compilers on an ongoing basis.

src/nb_abi.h

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -14,7 +14,7 @@
1414

1515
/// Tracks the version of nanobind's internal data structures
1616
#ifndef NB_INTERNALS_VERSION
17-
# define NB_INTERNALS_VERSION 16
17+
# define NB_INTERNALS_VERSION 17
1818
#endif
1919

2020
#if defined(__MINGW32__)

src/nb_internals.h

Lines changed: 12 additions & 10 deletions
Original file line numberDiff line numberDiff line change
@@ -56,7 +56,7 @@ struct nb_inst { // usually: 24 bytes
5656

5757
/// State of the C++ object this instance points to: is it constructed?
5858
/// can we use it?
59-
uint32_t state : 2;
59+
uint8_t state : 2;
6060

6161
// Values for `state`. Note that the numeric values of these are relied upon
6262
// for an optimization in `nb_type_get()`.
@@ -70,25 +70,27 @@ struct nb_inst { // usually: 24 bytes
7070
* relative offset to a pointer that must be dereferenced to get to the
7171
* instance data. 'direct' is 'true' in the former case.
7272
*/
73-
uint32_t direct : 1;
73+
uint8_t direct : 1;
7474

7575
/// Is the instance data co-located with the Python object?
76-
uint32_t internal : 1;
76+
uint8_t internal : 1;
7777

7878
/// Should the destructor be called when this instance is GCed?
79-
uint32_t destruct : 1;
79+
uint8_t destruct : 1;
8080

8181
/// Should nanobind call 'operator delete' when this instance is GCed?
82-
uint32_t cpp_delete : 1;
83-
84-
/// Does this instance hold references to others? (via internals.keep_alive)
85-
uint32_t clear_keep_alive : 1;
82+
uint8_t cpp_delete : 1;
8683

8784
/// Does this instance use intrusive reference counting?
88-
uint32_t intrusive : 1;
85+
uint8_t intrusive : 1;
86+
87+
/// Does this instance hold references to others? (via internals.keep_alive)
88+
/// This may be accessed concurrently to 'state', so it must not be in
89+
/// the same bitfield as 'state'.
90+
uint8_t clear_keep_alive;
8991

9092
// That's a lot of unused space. I wonder if there is a good use for it..
91-
uint32_t unused : 24;
93+
uint16_t unused;
9294
};
9395

9496
static_assert(sizeof(nb_inst) == sizeof(PyObject) + sizeof(uint32_t) * 2);

tests/test_thread.cpp

Lines changed: 22 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1,4 +1,8 @@
11
#include <nanobind/nanobind.h>
2+
#include <nanobind/stl/shared_ptr.h>
3+
4+
#include <memory>
5+
#include <vector>
26

37
namespace nb = nanobind;
48
using namespace nb::literals;
@@ -32,6 +36,11 @@ class ClassWithClassProperty {
3236
ClassWithProperty value_;
3337
};
3438

39+
struct AnInt {
40+
int value;
41+
AnInt(int v) : value(v) {}
42+
};
43+
3544

3645
NB_MODULE(test_thread_ext, m) {
3746
nb::class_<Counter>(m, "Counter")
@@ -68,4 +77,17 @@ NB_MODULE(test_thread_ext, m) {
6877
new (self) ClassWithClassProperty(std::move(value));
6978
}, nb::arg("value"))
7079
.def_prop_ro("prop1", &ClassWithClassProperty::get_prop);
80+
81+
nb::class_<AnInt>(m, "AnInt")
82+
.def(nb::init<int>())
83+
.def_rw("value", &AnInt::value);
84+
85+
std::vector<std::shared_ptr<AnInt>> shared_ints;
86+
for (int i = 0; i < 5; ++i) {
87+
shared_ints.push_back(std::make_shared<AnInt>(i));
88+
}
89+
m.def("fetch_shared_int", [shared_ints](int i) {
90+
return shared_ints.at(i);
91+
});
92+
m.def("consume_an_int", [](AnInt* p) { return p->value; });
7193
}

tests/test_thread.py

Lines changed: 16 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1,3 +1,6 @@
1+
import random
2+
import threading
3+
14
import test_thread_ext as t
25
from test_thread_ext import Counter, GlobalData, ClassWithProperty, ClassWithClassProperty
36
from common import parallelize
@@ -100,3 +103,16 @@ def f():
100103
_ = c2.prop1.prop2
101104

102105
parallelize(f, n_threads=n_threads)
106+
107+
108+
def test08_shared_ptr_threaded_access(n_threads=8):
109+
# Test for keep_alive racing with other fields.
110+
def f(barrier):
111+
i = random.randint(0, 4)
112+
barrier.wait()
113+
p = t.fetch_shared_int(i)
114+
assert t.consume_an_int(p) == i
115+
116+
for _ in range(100):
117+
barrier = threading.Barrier(n_threads)
118+
parallelize(lambda: f(barrier), n_threads=n_threads)

0 commit comments

Comments
 (0)