diff --git a/doc/contributing/concurrency_guide.md b/doc/contributing/concurrency_guide.md new file mode 100644 index 00000000000000..c4e5cc7700f17a --- /dev/null +++ b/doc/contributing/concurrency_guide.md @@ -0,0 +1,154 @@ +# Concurrency Guide + +This is a guide to thinking about concurrency in the native CRuby source code, whether you're +contributing to Ruby by writing C or Rust. This doesn't touch on native extensions, only the core +language. It will go over: + +* What needs synchronizing? +* How to use the VM lock, and what you can and can't do when you've acquired this lock. +* What you can and can't do when you've acquired other native locks. +* The difference between the VM lock and the GVL. +* What a VM barrier is and when to use it. +* The lock ordering of some important locks. +* How ruby interrupt handling works. +* The timer thread and what it's responsible for. + +## What needs synchronizing? + +Before ractors, only one ruby thread could run at once. That didn't mean you could forget about concurrency issues, though. The timer thread +is a native thread that interacts with other ruby threads and changes some VM internals, so if these changes could be done in parallel by both the timer +thread and a ruby thread, they needed to be synchronized. + +When you add ractors to the mix, it gets more complicated. However, ractors allow you to forget about synchronization for non-shareable objects because +they aren't used across ractors. Only one ruby thread can touch the object at once. Shareable objects are deeply frozen, so there isn't any +mutation on the objects themselves. However, something like reading/writing constants across ractors does need to be synchronized. Ruby threads need to see a consistent +view of the VM in these situations, so if publishing the update takes two steps or even two separate instructions, synchronization is required. + +Most synchronization is to protect VM internals. 
These internals include structures for the thread scheduler on each ractor, the global ractor scheduler, the +coordination between ruby threads and ractors, global tables (for `fstrings`, encodings, symbols and global vars), etc. Anything that can be mutated by 2 or more +ractors needs locks or atomics. + +## The VM Lock + +There's only one VM lock and it is for critical sections that can only be entered by one ractor at a time. +Without ractors, the VM lock is useless. It does not stop all ractors from running, as ractors can run +without trying to acquire this lock. If you're updating global (shared) data between ractors and aren't using +atomics, you need a lock and this is a convenient one to use. Unlike other locks, you can allocate ruby-managed +memory with it held. When you take the VM lock, there are things you can and can't do during your critical section: + +You can (as long as no other locks are also held before the VM lock): + +* Create ruby objects, call `ruby_xmalloc`, etc. + +You can't: + +* Context switch to another ruby thread or ractor. This is important, as many things can cause ruby-level context switches including: + + * Calling any ruby method through, for example, `rb_funcall`. If you execute ruby code, a context switch could happen. + This also applies to ruby methods defined in C, as they can be redefined in Ruby. Things that call ruby methods such as + `rb_obj_respond_to` are also disallowed. + + * Calling `rb_raise`. This will call `initialize` on the new exception object. With the VM lock + held, nothing you call should be able to raise an exception. `NoMemoryError` is allowed, however. + + * Calling `rb_nogvl` or a ruby-level mechanism that can context switch like `rb_mutex_lock`. + + * Entering any blocking operation managed by ruby. This will context switch to another ruby thread using `rb_nogvl` or + something equivalent. + +Internally, the VM lock is the `vm->ractor.sync.lock`. + +You need to be on a ruby thread to take the VM lock. 
You also can't take it inside any functions that could be called during sweeping, as MMTk sweeps +on another thread and you need a valid `ec` to grab the lock. For this same reason (among others), you can't take it from the timer thread either. + +## Other Locks + +All native locks that aren't the VM lock share a stricter set of rules for what's allowed during the critical section. By native locks, we mean +anything that uses `rb_native_mutex_lock`. Some important locks include the `interrupt_lock`, the ractor scheduling lock (protects global scheduling data structures), +the thread scheduling lock (local to each ractor, protects per-ractor scheduling data structures) and the ractor lock (local to each ractor, protects ractor data structures). + +When you acquire one of these locks, + +You can: + +* Allocate memory through non-ruby allocation such as raw `malloc` or the standard library. But be careful, some functions like `strdup` use +ruby allocation through the use of macros! + +* Use `ccan` lists, as they don't allocate. + +* Do the usual things like set variables or struct fields, manipulate linked lists, signal condition variables, etc. + +You can't: + +* Allocate ruby-managed memory. This includes creating ruby objects or using `ruby_xmalloc` or `st_insert`. This +is disallowed because, if that allocation causes a GC, all other ruby threads must join a VM barrier as soon as possible +(when they next check interrupts or acquire the VM lock). This is so that no other ractors are running during GC. If a ruby thread +is waiting (blocked) on this same native lock, it can't join the barrier and a deadlock occurs because the barrier will never finish. + +* Raise exceptions or use `EC_JUMP_TAG` if it jumps out of the critical section. + +* Context switch. See the `VM Lock` section for more info. + +## Difference Between VM Lock and GVL + +The VM Lock is a particular lock in the source code. There is only one VM Lock. 
The GVL, on the other hand, is more of a combination of locks. +It is "acquired" when a ruby thread is about to run or is running. Since many ruby threads can run at the same time if they're in different ractors, +there are many GVLs (1 per `SNT` + 1 for the main ractor). It can no longer be thought of as a "Global VM Lock" as it was before ractors. + +## VM Barriers + +Sometimes, taking the VM Lock isn't enough and you need a guarantee that all ractors have stopped. This happens when running `GC`, for instance. +A `VM barrier` is designed for this use case. It's not used often as taking a barrier slows ractor performance down considerably, but it's useful to +know about and is sometimes the only solution. + +## Lock Orderings + +It's a good idea not to hold more than two locks at once on the same thread. Locking multiple locks can introduce deadlocks, so do it with care. When locking +multiple locks at once, follow an ordering that is consistent across the program. Here are the orderings of some important locks: + +* VM lock before ractor_sched_lock +* thread_sched_lock before ractor_sched_lock +* interrupt_lock before timer_th.waiting_lock +* timer_th.waiting_lock before ractor_sched_lock + +These orderings are subject to change, so check the source if you're not sure. On top of this: + +* The VM lock can be taken around a call to a `ubf` (unblock function) in some circumstances. This happens during VM shutdown, for example. +See the "Ruby Interrupt Handling" section for more details. + +## Ruby Interrupt Handling + +When the VM runs ruby code, ruby's threads intermittently check ruby-level interrupts. These software interrupts +are for various things in ruby and they can be set by other ruby threads or the timer thread. + +* Ruby threads check when they should give up their timeslice. The native thread switches to another ruby thread when their time is up. 
+* The timer thread sends a "trap" interrupt to the main thread if any ruby-level signal handlers are pending. +* Ruby threads can have other ruby threads run tasks for them by sending them an interrupt. For instance, ractors send +the main thread an interrupt when they need to `require` a file so that it's done on the main thread. They wait for the +main thread's result. +* During VM shutdown, a "terminate" interrupt is sent to all ractor main threads to stop them as soon as possible. +* When calling `Thread#raise`, the caller sends an interrupt to that thread telling it which exception to raise. +* Unlocking a mutex sends the next waiter (if any) an interrupt telling it to grab the lock. +* Signalling or broadcasting on a condition variable tells the waiter(s) to wake up. + +This isn't a complete list. + +When an interrupt is sent to a ruby thread, that thread may be blocked. For example, it could be in the middle of a `TCPSocket#read` call. If so, +the receiving thread's `ubf` (unblock function) gets called from the thread (ruby thread or timer thread) that sent the interrupt. +Each ruby thread has a `ubf` that is set when it enters a blocking operation and is unset after returning from it. By default, this `ubf` function sends a +`SIGVTALRM` to the receiving thread to try to unblock it from the kernel so it can check its interrupts. There are other `ubfs` that +aren't associated with a syscall, such as when calling `Ractor#join` or `sleep`. All `ubfs` are called with the `interrupt_lock` held, +so take that into account when using locks inside `ubfs`. + +Remember, `ubfs` can be called from the timer thread so you cannot assume an `ec` inside them. The `ec` (execution context) is only set on ruby threads. + +## The Timer Thread + +The timer thread has a few responsibilities: + +* Send interrupts to ruby threads that have run for their whole timeslice. +* Wake up M:N ruby threads (threads in non-main ractors) blocked on IO or after a specified timeout. 
This +uses `kqueue` or `epoll`, depending on the OS, to receive IO events on behalf of the threads. +* Continue sending the `SIGVTALRM` signal if a thread is still blocked on a syscall after the first `ubf` call. +* Signal native threads (`SNT`) waiting on a ractor if there are ractors waiting in the global run queue. +* Create more `SNT`s if some are blocked, for example on IO or on `Ractor#join`.
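
The first responsibility above, together with the interrupt checks described under "Ruby Interrupt Handling", can be sketched in plain C. This is only an illustration under stated assumptions, not CRuby's implementation: every `demo_` and `DEMO_` name is hypothetical, the "ruby thread" is reduced to a bitmask of pending interrupts, and a single atomic operation stands in for whatever locking the real VM does. A timer thread sets a "timeslice" bit every millisecond and finally a "terminate" bit, while the worker takes and clears its pending bits each time around its loop.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <time.h>

/* Hypothetical interrupt bits for this sketch; CRuby's real masks differ. */
enum {
    DEMO_TIMESLICE_INTERRUPT = 0x1,
    DEMO_TERMINATE_INTERRUPT = 0x2
};

#define DEMO_TIMESLICES 5

/* Hypothetical stand-in for a ruby thread: just a pending-interrupt bitmask. */
typedef struct {
    atomic_uint interrupt_flags; /* set by the timer, consumed by the worker */
    unsigned long yields;        /* only touched by the worker thread */
} demo_thread_t;

/* Sender side: in this sketch a single atomic OR publishes the interrupt. */
static void demo_send_interrupt(demo_thread_t *th, unsigned int bit)
{
    atomic_fetch_or(&th->interrupt_flags, bit);
}

/* Receiver side: take all pending bits and clear them in one atomic step. */
static unsigned int demo_take_interrupts(demo_thread_t *th)
{
    return atomic_exchange(&th->interrupt_flags, 0u);
}

static void *demo_timer_main(void *arg)
{
    demo_thread_t *th = arg;
    struct timespec slice = { 0, 1000000L }; /* 1ms "timeslice" */

    for (int i = 0; i < DEMO_TIMESLICES; i++) {
        nanosleep(&slice, NULL);
        demo_send_interrupt(th, DEMO_TIMESLICE_INTERRUPT);
    }
    demo_send_interrupt(th, DEMO_TERMINATE_INTERRUPT);
    return NULL;
}

static void *demo_worker_main(void *arg)
{
    demo_thread_t *th = arg;

    for (;;) {
        /* ... a slice of "ruby code" would run here ... */
        unsigned int pending = demo_take_interrupts(th);
        if (pending & DEMO_TIMESLICE_INTERRUPT) {
            th->yields++; /* a real VM would switch to another thread here */
        }
        if (pending & DEMO_TERMINATE_INTERRUPT) {
            break;
        }
    }
    return NULL;
}

/* Runs one worker and one timer thread; returns how often the worker yielded. */
unsigned long demo_run(void)
{
    demo_thread_t th;
    pthread_t timer, worker;
    int rc;

    atomic_init(&th.interrupt_flags, 0u);
    th.yields = 0;

    rc = pthread_create(&worker, NULL, demo_worker_main, &th);
    assert(rc == 0);
    rc = pthread_create(&timer, NULL, demo_timer_main, &th);
    assert(rc == 0);
    (void)rc;

    pthread_join(timer, NULL);
    pthread_join(worker, NULL);
    return th.yields; /* safe to read: both threads have been joined */
}
```

Calling `demo_run()` from a `main` function (typically built with `-pthread`) returns a value between 1 and `DEMO_TIMESLICES`. It is not always exactly `DEMO_TIMESLICES` because several interrupts of the same kind can be set before the worker next checks, in which case they coalesce into one. The sketch deliberately leaves out the blocked-thread case, where a `ubf` would be needed to get the receiver back to its interrupt check.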