Feature #20057

Updated by kjtsanaktsidis (KJ Tsanaktsidis) 5 months ago

This ticket is to discuss some changes to `rb_postponed_job_register` that @ko1 and I propose to make for Ruby 3.3. The motivation for this work is to fix a bug in the current implementation, which can cause registered functions to be called with the wrong `data` argument (https://bugs.ruby-lang.org/issues/19991).

 There's a long discussion on the associated PRs (https://github.com/ruby/ruby/pull/8949 and https://github.com/ruby/ruby/pull/9041), but in the end we concluded that the best way to fix this bug involves changing the current semantics of `rb_postponed_job_register`. I'm opening this issue to get feedback on this approach and to see if anybody knows of a reason why we should not release this in Ruby 3.3.

 ## Current behaviour in Ruby 3.2 

 Currently, Ruby has two functions for interacting with postponed jobs. These jobs can be enqueued from anywhere (including signal handlers), and will be executed the next time Ruby runs `RUBY_VM_CHECK_INTS()`.

 * `rb_postponed_job_register(func, data)`: Schedules `func(data)` to be executed the next time `RUBY_VM_CHECK_INTS` is checked. 
 * `rb_postponed_job_register_once(func, data)`: Works like `rb_postponed_job_register`, _except_ if `func` is already scheduled to be executed (either with this `data` or with different `data`), in which case it does nothing. 

 The postponed jobs are stored in a fixed-size array (of length 1024), so enqueuing can fail if the buffer is full. In that case, these functions return `0`; otherwise, they return `1` for a successful enqueue, or `2` when `rb_postponed_job_register_once` did nothing because `func` was already in the queue.
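The return-value contract described above can be modelled in a few lines of plain C. This is only a sketch of the documented behaviour, not Ruby's implementation: the names `model_register`, `model_register_once`, `model_flush` and `count_call` are hypothetical, the model is single-threaded, and it ignores signal safety entirely.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_POSTPONED_JOBS 1024  /* buffer length stated in the text */

typedef void (*job_func_t)(void *);

typedef struct {
    job_func_t func;
    void *data;
} job_t;

static job_t job_buffer[MAX_POSTPONED_JOBS];
static size_t job_count;

/* Models rb_postponed_job_register: 1 on success, 0 if the buffer is full. */
int model_register(job_func_t func, void *data) {
    if (job_count == MAX_POSTPONED_JOBS) return 0;
    job_buffer[job_count].func = func;
    job_buffer[job_count].data = data;
    job_count++;
    return 1;
}

/* Models rb_postponed_job_register_once: 2 if func is already queued,
   regardless of which data it was queued with. */
int model_register_once(job_func_t func, void *data) {
    for (size_t i = 0; i < job_count; i++) {
        if (job_buffer[i].func == func) return 2;
    }
    return model_register(func, data);
}

/* Models the interrupt check: run every queued job, then empty the queue.
   Returns the number of jobs that ran. */
size_t model_flush(void) {
    size_t ran = job_count;
    for (size_t i = 0; i < ran; i++) {
        assert(job_buffer[i].func != NULL);
        job_buffer[i].func(job_buffer[i].data);
    }
    job_count = 0;
    return ran;
}

/* Hypothetical callback: counts how often it was invoked. */
void count_call(void *data) { ++*(int *)data; }
```

In this model, calling `model_register` twice with the same `func` queues it twice, while `model_register_once` leaves the queue unchanged on the second call and reports `2`.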

 Unfortunately, as mentioned above, the implementation of these functions is subject to a race condition because `func` and `data` are not written into the postponed job buffer together atomically (they are two separate variables, and CPUs tend not to have double-word atomic instructions). Again, see https://bugs.ruby-lang.org/issues/19991 for the full details.
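To make the hazard concrete, here is a deliberately simplified, single-threaded illustration of a two-word write being observed halfway through. It is not the Ruby code and does not reproduce the exact interleaving from the linked issue; the `between` hook is a hypothetical stand-in for an interruption (for example a signal) landing between the two stores, and all names are made up for the example.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*job_func_t)(void *);

typedef struct {
    job_func_t func;  /* stored first  */
    void *data;       /* stored second */
} job_slot_t;

/* What the "interrupting" reader saw when it looked at the slot. */
job_slot_t last_seen;

/* Hypothetical observer: copies the slot exactly as it is right now. */
void observe(const job_slot_t *slot) { last_seen = *slot; }

/* Overwrites a slot with two separate stores. `between` runs after the
   first store, standing in for an interruption at that point. */
void slot_write_two_step(job_slot_t *slot, job_func_t func, void *data,
                         void (*between)(const job_slot_t *)) {
    assert(slot != NULL);
    slot->func = func;
    if (between) between(slot);
    slot->data = data;
}

/* Two distinct hypothetical jobs. */
void job_a(void *data) { *(int *)data += 1; }
void job_b(void *data) { *(int *)data += 100; }
```

If the slot initially holds `(job_a, &data_a)` and is overwritten with `(job_b, &data_b)`, the observer sees `job_b` paired with `&data_a`: a function together with a `data` pointer that was never registered for it.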

 ## What we have done 

 Whilst working on this issue, we had a look at all of the in-the-wild usages of these APIs on rubygems. The only real usage of these APIs is for profiling tools, and the following was true for essentially all of them: 

 * Each gem registers only a single callback function.
 * Almost all of the usages either make no use of the `data` argument at all, or pass some kind of never-changing global context into it.
 * Only a very small handful of gems use these APIs at all.

 Thus, we concluded that the current behaviour of allowing scheduling and execution of arbitrary `(func, data)` pairs is actually not really needed. Instead, we could offer a more limited API which would meet the needs of all current users, whilst making it easy to avoid the race conditions in the current implementation. 

 The new API is as follows: 

 * `rb_postponed_job_preregister(func, data)`: This function registers `func`/`data` into a small, fixed-size table, and returns a handle to this registration. Subsequent calls to this function with the same `func` will return the same handle, and overwrite the `data` with the new data if it is different. The size of the table is 32 entries on most systems, which is still enough to load literally every gem on rubygems that actually uses these APIs at the same time. The intention is that libraries call this function in their initialization routines and store the handle for later.
 * `rb_postponed_job_trigger(handle)`: This function takes the handle from `rb_postponed_job_preregister` and schedules it for execution the next time `RUBY_VM_CHECK_INTS` is called. If the handle is already scheduled, this will not cause it to be scheduled twice; in effect, each `func` is called at most once per call to `RUBY_VM_CHECK_INTS`.
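A rough way to picture the two functions is a small table plus a bitmask of triggered handles. The sketch below is a standalone model under that assumption, not Ruby's source: `pj_preregister`, `pj_trigger`, `pj_flush` and `take_sample` are hypothetical names, handles are plain `int` indexes, and `pj_preregister` is assumed to be called only during single-threaded initialization.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define PJ_TABLE_SIZE 32        /* table size quoted in the text */
#define PJ_INVALID_HANDLE (-1)  /* hypothetical "table full" marker */

typedef void (*pj_func_t)(void *);

typedef struct {
    pj_func_t func;
    void *data;
} pj_entry_t;

static pj_entry_t pj_table[PJ_TABLE_SIZE];
static int pj_count;
static atomic_uint pj_triggered;  /* bit i set => handle i is scheduled */

/* Registers func/data and returns a handle. Registering the same func
   again returns the same handle and overwrites data. */
int pj_preregister(pj_func_t func, void *data) {
    for (int i = 0; i < pj_count; i++) {
        if (pj_table[i].func == func) {
            pj_table[i].data = data;
            return i;
        }
    }
    if (pj_count == PJ_TABLE_SIZE) return PJ_INVALID_HANDLE;
    pj_table[pj_count].func = func;
    pj_table[pj_count].data = data;
    return pj_count++;
}

/* Schedules a handle. Only a single atomic OR happens here, so there is
   no func/data pair to tear; triggering twice just sets the same bit. */
void pj_trigger(int handle) {
    assert(handle >= 0 && handle < pj_count);
    atomic_fetch_or(&pj_triggered, 1u << handle);
}

/* Models the interrupt check: atomically take the pending set, then run
   each triggered entry once. Returns the number of functions called. */
int pj_flush(void) {
    unsigned int pending = atomic_exchange(&pj_triggered, 0u);
    int ran = 0;
    for (int i = 0; i < pj_count; i++) {
        if (pending & (1u << i)) {
            pj_table[i].func(pj_table[i].data);
            ran++;
        }
    }
    return ran;
}

/* Hypothetical profiler callback: counts samples. */
void take_sample(void *data) { ++*(int *)data; }
```

A library would call `pj_preregister(take_sample, &ctx)` once at load time, keep the handle, and call `pj_trigger(handle)` from its signal handler; triggering twice before the next `pj_flush()` still runs `take_sample` only once.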

 All of the usages of the old `rb_postponed_job_register{,_once}` functions in the Ruby tree have been replaced by calls to the above two functions, and the two old functions have been marked with the deprecated attribute. They have also been re-implemented in terms of the new functions: `rb_postponed_job_register` and `rb_postponed_job_register_once` are now both equivalent to `rb_postponed_job_trigger(rb_postponed_job_preregister(func, data))`. This means that:

 * `rb_postponed_job_register` now works like `rb_postponed_job_register_once`, i.e. `func` can only be executed one time per `RUBY_VM_CHECK_INTS`, no matter how many times it is registered.
 * `func` is now called with the _last_ `data` to be registered, not the first (which is how `rb_postponed_job_register_once` previously worked).
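The two consequences listed above can be checked against a compact model of the compatibility path. This is a sketch with hypothetical names (`compat_register`, `demo_flush`, `record_value`), a plain non-atomic bitmask, and an assumed return value of `1` on success and `0` when the table is full; none of that is taken from the Ruby source.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_TABLE_SIZE 32

typedef void (*job_func_t)(void *);

static struct { job_func_t func; void *data; } demo_table[DEMO_TABLE_SIZE];
static int demo_len;
static unsigned int demo_pending;  /* simplified: a real version needs atomics */

/* Same func => same slot, with data overwritten by the latest call. */
static int demo_preregister(job_func_t func, void *data) {
    for (int i = 0; i < demo_len; i++) {
        if (demo_table[i].func == func) {
            demo_table[i].data = data;
            return i;
        }
    }
    if (demo_len == DEMO_TABLE_SIZE) return -1;
    demo_table[demo_len].func = func;
    demo_table[demo_len].data = data;
    return demo_len++;
}

static void demo_trigger(int handle) {
    assert(handle >= 0 && handle < demo_len);
    demo_pending |= 1u << handle;
}

/* Stand-in for both deprecated functions after the change:
   trigger(preregister(func, data)). */
int compat_register(job_func_t func, void *data) {
    int handle = demo_preregister(func, data);
    if (handle < 0) return 0;
    demo_trigger(handle);
    return 1;
}

/* Runs each triggered entry once and clears the pending set. */
int demo_flush(void) {
    unsigned int pending = demo_pending;
    int ran = 0;
    demo_pending = 0;
    for (int i = 0; i < demo_len; i++) {
        if (pending & (1u << i)) {
            demo_table[i].func(demo_table[i].data);
            ran++;
        }
    }
    return ran;
}

/* Hypothetical callback that records what it was called with. */
int calls;
int last_value;
void record_value(void *data) { calls++; last_value = *(int *)data; }
```

Registering `record_value` twice with two different `data` pointers before a flush results in a single call, and that call receives the second pointer.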

 I verified that stackprof still builds & works correctly with the new implementation of `rb_postponed_job_register`. 

 ## What else we tried 

 I tried a couple of things to keep the current semantics of `rb_postponed_job_register{,_once}` intact, without introducing new APIs. 

 * First, I tried protecting the postponed job buffer by masking signals around the critical section and using a POSIX semaphore instead of a pthread mutex: https://github.com/ruby/ruby/pull/8856. However, there was a concern that this would be too slow (since `RUBY_VM_CHECK_INTS` is called very often, and both the semaphore and the signal mask require calling into the kernel).
 * Then, I implemented a lock-free ringbuffer to store the postponed job queue: https://github.com/ruby/ruby/compare/master...KJTsanaktsidis:ruby:old_circular_ringbuffer. However, the concern with this implementation was that it was too complex. 

 ## Ruby 3.3 

 As of right now, we have merged these changes (from https://github.com/ruby/ruby/pull/8949 and https://github.com/ruby/ruby/pull/9041), and @ko1 plans for them to go out in 3.3-rc1. The point of opening this issue is to ask: does anybody foresee any problem with our approach?
