Skip to content

Conversation

hawkinsp
Copy link
Contributor

In JAX CI, we saw the following data race:

WARNING: ThreadSanitizer: data race (pid=36276)
  Read of size 4 at 0x7ad434c76774 by thread T4 (mutexes: read M0):
    #0 nanobind::detail::nb_type_get(std::type_info const*, _object*, unsigned char, nanobind::detail::cleanup_list*, void**) /proc/self/cwd/external/nanobind/src/nb_type.cpp:1536:17 (libjax_common.so+0xa5991a) (BuildId: 2414288b76007439f199e80c1b7b912642188753)
    #1 nanobind::detail::type_caster_base::from_python(nanobind::handle, unsigned char, nanobind::detail::cleanup_list*) /proc/self/cwd/external/nanobind/include/nanobind/nb_cast.h:481:16 (libjax_common.so+0x5d96184) (BuildId: 2414288b76007439f199e80c1b7b912642188753)
    #2 _object* nanobind::detail::func_create(xla::Layout const& (xla::PjRtLayout::*)() const, nanobind::scope const&, nanobind::name const&, nanobind::is_method const&)::'lambda'(xla::PjRtLayout const*), xla::Layout const&, xla::PjRtLayout const*, 0ul, nanobind::scope, nanobind::name, nanobind::is_method>(xla::PjRtLayout&&, xla::Layout const& (*)(nanobind::scope, nanobind::name, nanobind::is_method), std::integer_sequence, nanobind::scope const&, nanobind::name const&, nanobind::is_method const&)::'lambda'(void*, _object**, unsigned char*, nanobind::rv_policy, nanobind::detail::cleanup_list*)::operator()(void*, _object**, unsigned char*, nanobind::rv_policy, nanobind::detail::cleanup_list*) const /proc/self/cwd/external/nanobind/include/nanobind/nb_func.h:254:41 (libjax_common.so+0x5d96184)

...

  Previous write of size 4 at 0x7ad434c76774 by thread T5 (mutexes: read M0):
    #0 nanobind::detail::keep_alive(_object*, void*, void (*)(void*) noexcept) /proc/self/cwd/external/nanobind/src/nb_type.cpp:1661:47 (libjax_common.so+0xa5a508) (BuildId: 2414288b76007439f199e80c1b7b912642188753)
    #1 nanobind::detail::shared_from_cpp(std::shared_ptr&&, _object*) /proc/self/cwd/external/nanobind/include/nanobind/stl/shared_ptr.h:56:5 (libjax_common.so+0x8c6bc6) (BuildId: 2414288b76007439f199e80c1b7b912642188753)
    #2 nanobind::detail::type_caster, int>::from_cpp(std::shared_ptr const&, nanobind::rv_policy, nanobind::detail::cleanup_list*) /proc/self/cwd/external/nanobind/include/nanobind/stl/shared_ptr.h:129:13 (libjax_common.so+0x8c6899) (BuildId: 2414288b76007439f199e80c1b7b912642188753)
    #3 _object* nanobind::detail::func_create> (), jax::PyArray>, std::shared_ptr, jax::PyArray&, 0ul, nanobind::is_method, nanobind::is_getter>(xla::ValueOrThrowWrapper> (), jax::PyArray>&&, std::shared_ptr (*)(jax::PyArray&), std::integer_sequence, nanobind::is_method const&, nanobind::is_getter const&)::'lambda'(void*, _object**, unsigned char*, nanobind::rv_policy, nanobind::detail::cleanup_list*)::operator()(void*, _object**, unsigned char*, nanobind::rv_policy, nanobind::detail::cleanup_list*) const /proc/self/cwd/external/nanobind/include/nanobind/nb_func.h:274:22 (libjax_common.so+0x8c63d5) (BuildId: 2414288b76007439f199e80c1b7b912642188753)

The race looks like this:

  • two threads both attempt to return the same std::shared_ptr<> object concurrently.
  • thread (a) calls nb_type_put and gets the is_new=true case. Thread (b) calls nb_type_put and gets the is_new=false case.
  • thread (b) returns the object to Python, and then passes that object as an argument to a nanobind-bound function, which reads inst->state.
  • thread (a) reaches the is_new branch and calls keep_alive, which writes inst->clear_keep_alive.

state and clear_keep_alive are two bitfields on the same object accessed concurrently, so this is a data race, per the C++20 standard:

[intro.races] (6.9.2.1) Data races: Paragraph 21 defines a data race and its consequence: "The execution of a program contains a data race if it contains two potentially concurrent conflicting actions, at least one of which is not atomic, and neither happens before the other... Any such data race results in undefined behavior.". Paragraph 2 defines conflicting actions as modifying or reading/modifying the same memory location.

[intro.memory] (6.7.1) Memory model: This section precisely defines what a "memory location" is for bitfields. Paragraph 3 states: "A memory location is either an object of scalar type or a maximal sequence of adjacent bit-fields all having nonzero width.".

This PR splits clear_keep_alive into a separate non-bitfield member. This may or may not be a sufficient fix, in particular, one can imagine that it may be possible for multiple threads to call keep_alive concurrently, so it may need to be an atomic value.

In JAX CI, we saw the following data race:

```
WARNING: ThreadSanitizer: data race (pid=36276)
  Read of size 4 at 0x7ad434c76774 by thread T4 (mutexes: read M0):
    #0 nanobind::detail::nb_type_get(std::type_info const*, _object*, unsigned char, nanobind::detail::cleanup_list*, void**) /proc/self/cwd/external/nanobind/src/nb_type.cpp:1536:17 (libjax_common.so+0xa5991a) (BuildId: 2414288b76007439f199e80c1b7b912642188753)
    wjakob#1 nanobind::detail::type_caster_base::from_python(nanobind::handle, unsigned char, nanobind::detail::cleanup_list*) /proc/self/cwd/external/nanobind/include/nanobind/nb_cast.h:481:16 (libjax_common.so+0x5d96184) (BuildId: 2414288b76007439f199e80c1b7b912642188753)
    wjakob#2 _object* nanobind::detail::func_create(xla::Layout const& (xla::PjRtLayout::*)() const, nanobind::scope const&, nanobind::name const&, nanobind::is_method const&)::'lambda'(xla::PjRtLayout const*), xla::Layout const&, xla::PjRtLayout const*, 0ul, nanobind::scope, nanobind::name, nanobind::is_method>(xla::PjRtLayout&&, xla::Layout const& (*)(nanobind::scope, nanobind::name, nanobind::is_method), std::integer_sequence, nanobind::scope const&, nanobind::name const&, nanobind::is_method const&)::'lambda'(void*, _object**, unsigned char*, nanobind::rv_policy, nanobind::detail::cleanup_list*)::operator()(void*, _object**, unsigned char*, nanobind::rv_policy, nanobind::detail::cleanup_list*) const /proc/self/cwd/external/nanobind/include/nanobind/nb_func.h:254:41 (libjax_common.so+0x5d96184)

...

  Previous write of size 4 at 0x7ad434c76774 by thread T5 (mutexes: read M0):
    #0 nanobind::detail::keep_alive(_object*, void*, void (*)(void*) noexcept) /proc/self/cwd/external/nanobind/src/nb_type.cpp:1661:47 (libjax_common.so+0xa5a508) (BuildId: 2414288b76007439f199e80c1b7b912642188753)
    wjakob#1 nanobind::detail::shared_from_cpp(std::shared_ptr&&, _object*) /proc/self/cwd/external/nanobind/include/nanobind/stl/shared_ptr.h:56:5 (libjax_common.so+0x8c6bc6) (BuildId: 2414288b76007439f199e80c1b7b912642188753)
    wjakob#2 nanobind::detail::type_caster, int>::from_cpp(std::shared_ptr const&, nanobind::rv_policy, nanobind::detail::cleanup_list*) /proc/self/cwd/external/nanobind/include/nanobind/stl/shared_ptr.h:129:13 (libjax_common.so+0x8c6899) (BuildId: 2414288b76007439f199e80c1b7b912642188753)
    wjakob#3 _object* nanobind::detail::func_create> (), jax::PyArray>, std::shared_ptr, jax::PyArray&, 0ul, nanobind::is_method, nanobind::is_getter>(xla::ValueOrThrowWrapper> (), jax::PyArray>&&, std::shared_ptr (*)(jax::PyArray&), std::integer_sequence, nanobind::is_method const&, nanobind::is_getter const&)::'lambda'(void*, _object**, unsigned char*, nanobind::rv_policy, nanobind::detail::cleanup_list*)::operator()(void*, _object**, unsigned char*, nanobind::rv_policy, nanobind::detail::cleanup_list*) const /proc/self/cwd/external/nanobind/include/nanobind/nb_func.h:274:22 (libjax_common.so+0x8c63d5) (BuildId: 2414288b76007439f199e80c1b7b912642188753)

```

The race looks like this:
* two threads both attempt to return the same std::shared_ptr<> object
  concurrently.
* thread (a) calls nb_type_put and gets the is_new=true case. Thread (b)
  calls nb_type_put and gets the is_new=false case.
* thread (b) returns the object to Python, and then passes that object
  as an argument to a nanobind-bound function, which reads inst->state.
* thread (a) reaches the is_new branch and calls keep_alive, which
  writes inst->clear_keep_alive.

`state` and `clear_keep_alive` are two bitfields on the same object
accessed concurrently, so this is a data race, per the C++20 standard:

[intro.races] (6.9.2.1) Data races: Paragraph 21 defines a data race and its consequence: "The execution of a program contains a data race if it contains two potentially concurrent conflicting actions, at least one of which is not atomic, and neither happens before the other... Any such data race results in undefined behavior.". Paragraph 2 defines conflicting actions as modifying or reading/modifying the same memory location.

[intro.memory] (6.7.1) Memory model: This section precisely defines what a "memory location" is for bitfields. Paragraph 3 states: "A memory location is either an object of scalar type or a maximal sequence of adjacent bit-fields all having nonzero width.".

This PR splits clear_keep_alive into a separate non-bitfield member.
This may or may not be a sufficient fix, in particular, one can imagine
that it may be possible for multiple threads to call keep_alive
concurrently, so it may need to be an atomic value.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant