Bug #21612

Make sure we never context switch while holding the VM lock

Added by luke-gru (Luke Gruber) about 6 hours ago. Updated about 5 hours ago.

Status:
Open
Assignee:
-
Target version:
[ruby-core:123306]

Description

The Problem

We're seeing errors in our application that uses ractors. The errors look like:

[BUG] unexpected situation - recordd:1 current:0
error.c:1097 rb_bug_without_die_internal
vm_sync.c:275 disallow_reentry
eval_intern.h:136 rb_ec_vm_lock_rec_check
eval_intern.h:147 rb_ec_tag_state
vm.c:2619 rb_vm_exec
vm.c:1702 rb_yield
eval.c:1173 rb_ensure

We concluded that there was context switching going on while a thread held the VM lock. During the investigation into the issue, we added assertions in the code that we never yield to another thread with the VM lock held. We enabled these VM lock assertions even in single-ractor mode. These assertions were failing in a few places, but most notably in finalizers. Finalizers run with the VM lock held, and they were context switching and causing this issue.

Why Is This Bad?

There are a few reasons we shouldn't be able to context switch while holding the VM lock.

In single-ractor mode with threads A and B:

  1. Anything in this critical section should be thought of as a transaction on the memory that's changed inside it. If A holds the lock, modifies some global memory, and yields to B before finishing its updates (with the lock still held), and B then takes the lock and starts writing to the same memory, that global memory can end up corrupted.

Currently we don't actually take the VM lock in single-ractor mode, but that doesn't mean these issues can't happen. Yielding to another thread in the middle of manipulating global memory can still happen and it causes similar issues.
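The single-ractor case can be illustrated with a deterministic toy model. This is only a sketch: it uses no real threads, and `toy_table`, `toy_table_push` and `toy_thread_b` are made-up names that do not exist in CRuby. The "context switch" is simulated by calling a callback in the middle of a two-step update.

```c
#include <stddef.h>

/* Hypothetical global table standing in for VM-wide memory.
 * Invariant: slots[0..len) hold every value pushed so far. */
struct toy_table {
    int slots[4];
    size_t len;
};

static struct toy_table g_table;

typedef void (*toy_yield_fn)(void);

static void toy_table_reset(void)
{
    g_table.len = 0;
}

/* Append in two steps. If yield_point is non-NULL it is called between
 * the steps, simulating a context switch in the middle of the update. */
static void toy_table_push(int value, toy_yield_fn yield_point)
{
    size_t idx = g_table.len;          /* step 1: pick the next free slot */
    if (yield_point) yield_point();    /* "thread B" runs here */
    g_table.slots[idx] = value;        /* step 2: write using a possibly stale index */
    g_table.len = idx + 1;
}

/* What "thread B" does when it gets scheduled: a complete push of its own. */
static void toy_thread_b(void)
{
    toy_table_push(20, NULL);
}
```

Calling `toy_table_push(10, toy_thread_b)` on an empty table leaves `len` at 1 with `slots[0] == 10`: B's value was written and then silently overwritten, even though no lock was involved at all.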

In multi-ractor mode with ractors A and B:

  1. We get the same issues as in single-ractor mode.

  2. We can also get deadlocks if A has the lock, yields to B and B is blocked waiting on the lock.

Unfortunately, many things can cause context switching in Ruby, so what is safe to call when the VM lock is taken?

Guidelines

I've come up with some guidelines. With the VM lock held,

You should be able to:

  • Create ruby objects, call ruby_xmalloc, etc.

  • Jump using EC_JUMP_TAG. The lock will automatically be unlocked depending on how far up the call stack you locked it and where you're jumping to.
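One way to picture the automatic unlock on a jump is the toy model below. It assumes that a tag records the lock's recursion depth when it is pushed and that the depth is restored when a jump lands on that tag. `toy_tag`, `toy_lock_rec` and both functions are illustrative names, not CRuby's, and `setjmp`/`longjmp` stand in for the real tag machinery.

```c
#include <setjmp.h>

/* Toy recursion depth of the VM lock for the current thread. */
static unsigned toy_lock_rec;

struct toy_tag {
    jmp_buf buf;
    unsigned lock_rec;   /* depth recorded when the tag was pushed */
};

/* Takes the lock and then jumps away without releasing it. */
static void toy_locked_jump(struct toy_tag *tag)
{
    toy_lock_rec++;
    longjmp(tag->buf, 1);
}

/* Pushes a tag, runs the body, and returns the lock depth after landing. */
static unsigned toy_run_with_tag(void)
{
    struct toy_tag tag;
    tag.lock_rec = toy_lock_rec;       /* set before setjmp, never modified after */
    if (setjmp(tag.buf) == 0) {
        toy_locked_jump(&tag);
    }
    else if (toy_lock_rec != tag.lock_rec) {
        toy_lock_rec = tag.lock_rec;   /* drop locks taken below the tag */
    }
    return toy_lock_rec;
}
```

In this model, a lock taken below the tag is dropped when the jump lands, while a lock taken above the tag (before `toy_run_with_tag` is called) stays held.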

You shouldn't be able to:

  • Check interrupts.

  • Call any ruby method or enter Ruby's VM loop. For example, rb_funcall is not allowed, nor is rb_warn (it can call ruby code). rb_sprintf is not allowed because it can call rb_inspect.

  • Call rb_nogvl.

  • Enter any blocking operation managed by Ruby.

  • Call a ruby-level mechanism that can context switch, like rb_mutex_lock.
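The "shouldn't" items all come down to one check at every point that can context switch: does the current thread hold the VM lock? A minimal sketch of such a check follows, using a toy recursive lock. Every name here (`toy_vm_lock`, `toy_may_context_switch`, and so on) is hypothetical; in a real debug build the check would presumably be an assertion that aborts rather than a function returning a boolean.

```c
#include <stdbool.h>

#define TOY_NO_OWNER (-1)

/* Toy recursive VM lock: which thread owns it and how deeply. */
struct toy_vm_lock {
    int owner;       /* thread id, or TOY_NO_OWNER */
    unsigned rec;    /* recursion depth */
};

static void toy_vm_lock_enter(struct toy_vm_lock *lock, int thread_id)
{
    /* Toy model only: no blocking, the caller is assumed to be allowed in. */
    lock->owner = thread_id;
    lock->rec++;
}

static void toy_vm_lock_leave(struct toy_vm_lock *lock)
{
    if (lock->rec == 0) return;                      /* not held: nothing to do */
    if (--lock->rec == 0) lock->owner = TOY_NO_OWNER;
}

static bool toy_vm_locked_by(const struct toy_vm_lock *lock, int thread_id)
{
    return lock->rec > 0 && lock->owner == thread_id;
}

/* Guard to run before checking interrupts, calling Ruby code, blocking,
 * or anything else that may switch threads. Returns false when the switch
 * would happen with the VM lock held. */
static bool toy_may_context_switch(const struct toy_vm_lock *lock, int thread_id)
{
    return !toy_vm_locked_by(lock, thread_id);
}
```

Because the lock is recursive, the guard keeps failing until the outermost `toy_vm_lock_leave` has run, not just the innermost one.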

The Fix

Of course, unlocking during finalizers is the main fix, but other places also need unlocking. I think adding assertions that the VM lock is not held will be important for finding these bugs and for preventing regressions in the future. We don't need many of these, only in a few places. These assertions, which only run in debug mode, should also run in single-ractor mode.
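As a rough sketch of what "unlocking during finalizers" could look like, the toy helper below saves the lock's recursion depth, releases the lock completely, runs the finalizer, and then restores the saved depth. This shape is an assumption made for illustration, not the actual patch, and `toy_lock_state`, `toy_run_finalizer_unlocked` and `toy_record_depth` are hypothetical names.

```c
/* Toy recursion depth of the VM lock for the current thread. */
struct toy_lock_state {
    unsigned rec;
};

typedef void (*toy_finalizer_fn)(const struct toy_lock_state *lock, void *data);

/* Runs fn with the lock fully released, then restores the previous depth. */
static void toy_run_finalizer_unlocked(struct toy_lock_state *lock,
                                       toy_finalizer_fn fn, void *data)
{
    unsigned saved = lock->rec;   /* remember how deeply we were locked */
    lock->rec = 0;                /* release completely before running user code */
    fn(lock, data);               /* the finalizer may now context switch safely */
    lock->rec = saved;            /* re-acquire to the same depth */
}

/* Example finalizer: records the lock depth it observed while running. */
static void toy_record_depth(const struct toy_lock_state *lock, void *data)
{
    *(unsigned *)data = lock->rec;
}
```

A real implementation would also have to re-validate any state read before the unlock, since other threads or ractors may have changed it in the meantime.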

Future Work

I think some documentation would be helpful on what is and isn't allowed while holding the VM lock and other locks in the CRuby source. I am currently working on a Concurrency Guide for CRuby developers that includes this info. It will not go over every lock, just the VM lock and the "all other locks" category.

Updated by ko1 (Koichi Sasada) about 6 hours ago · Edited

We concluded that there was context switching going on while a thread held the VM lock. During the investigation into the issue, we added assertions in the code that we never yield to another thread with the VM lock held.

I agree that we should avoid it. Context switches are triggered by the CHECK_INTS macro. Where is the macro invoked while the VM lock is held?

...but most notably in finalizers

Also CHECK_INTS invokes finalizers.


In other words, we should not run any Ruby code on a thread while the VM is locked, because any Ruby code can run anything.

Updated by ko1 (Koichi Sasada) about 6 hours ago

Check ruby interrupts. Since jumping can pop ruby frames and popping frames checks interrupts, you are allowed. It should never context switch with the VM lock held, even if the ruby thread's quantum

Should this be allowed?

Updated by luke-gru (Luke Gruber) about 5 hours ago · Edited

Well, I'm not sure if it should be allowed. The reason I said it should be is that currently, EC_JUMP_TAG is supported due to auto lock_rec. I thought that could cause ruby frames to be popped, but I think I was wrong.

Raising with rb_raise definitely shouldn't be allowed, because it calls initialize on the new error object. I'll update the guidelines.

Updated by luke-gru (Luke Gruber) about 5 hours ago

  • Description updated (diff)