Feature #16792

Make Mutex held per Fiber instead of per Thread

Added by Eregon (Benoit Daloze) 3 months ago. Updated about 2 months ago.

Status:
Open
Priority:
Normal
Assignee:
-
Target version:
-
[ruby-core:97907]

Description

Currently, Mutex in CRuby is held per Thread.
In JRuby and TruffleRuby, Mutex is held per Fiber (because it's simply easier implementation-wise).

While a user could theoretically notice the difference, it seems extremely uncommon in practice, and code relying on it is probably using incorrect synchronization.

The usage pattern for a Mutex is #synchronize, or explicit lock + unlock.
Such a pattern protects/surrounds a region of code using some resource, and such a region of code is always on the same Fiber since it's on a given Ruby "stack".
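For illustration, here are the two equivalent forms of that pattern (plain, existing Ruby API; nothing here is new):

```ruby
mutex = Mutex.new
counter = 0

# Form 1: Mutex#synchronize wraps the critical section and always
# unlocks on exit, even if the block raises.
mutex.synchronize do
  counter += 1
end

# Form 2: explicit lock/unlock; the ensure clause guarantees the unlock
# runs on the same stack, and therefore on the same Fiber, as the lock.
mutex.lock
begin
  counter += 1
ensure
  mutex.unlock
end

puts counter # => 2
```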

With #16786 it becomes more relevant to have Mutex held per Fiber; otherwise Mutex#lock will significantly hurt the scalability of that proposal.
Concretely: if a Fiber calls Mutex#lock while the Mutex is already held by another Fiber of the same Thread, and the Thread#scheduler is enabled, then instead of raising an error (which made sense before, because it would be a deadlock, but that is no longer the case with a scheduler),
or disabling fiber scheduling entirely until #unlock (the current state in #16786, which makes Mutex#lock special and hurts scalability),
we would just call into the scheduler and schedule another Fiber (for instance, the one holding that Mutex, or any other Fiber that is ready to run).
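A rough sketch of that control flow in plain Ruby (this is not CRuby's implementation; lock_with_scheduler and wait_mutex are hypothetical names, and only Mutex#try_lock is a real API):

```ruby
# Hypothetical sketch of what Mutex#lock could do under this proposal.
# `scheduler` stands in for the Thread#scheduler from #16786.
def lock_with_scheduler(mutex, scheduler)
  until mutex.try_lock
    if scheduler
      # Yield to the scheduler so another Fiber (for example, the owner
      # of the Mutex) can run and eventually release it.
      scheduler.wait_mutex(mutex)
    else
      # Without a scheduler, blocking here on the same Thread would
      # deadlock, so raising remains the sensible behavior.
      raise ThreadError, "deadlock; Mutex already held on this Thread"
    end
  end
end
```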

This is not a new idea; Crystal already does this with its non-blocking Fibers, which are very similar to #16786:
https://github.com/crystal-lang/crystal/blob/612825a53c831ce7d17368c8211342b199ca02ff/src/mutex.cr#L72

Mutex#lock is just like other blocking operations, so let's make it non-blocking in the same way, building on #16786.
I believe it's the natural and intuitive thing to do for Fiber concurrency with a scheduler.

Queue#pop and SizedQueue#push could be other candidates to handle in a similar way.

Here is an early commit to make Mutex held per Fiber; it's quite trivial, as you can see:
https://github.com/ruby/ruby/compare/master...eregon:mutex-per-fiber
It passes test-all and test-spec.


Related issues

Related to Ruby master - Feature #16786: Light-weight scheduler for improved concurrency. (Open)
#1

Updated by Eregon (Benoit Daloze) 3 months ago

  • Backport deleted (2.5: UNKNOWN, 2.6: UNKNOWN, 2.7: UNKNOWN)
  • Tracker changed from Bug to Feature
#2

Updated by Eregon (Benoit Daloze) 3 months ago

  • Related to Feature #16786: Light-weight scheduler for improved concurrency. added
#3

Updated by Eregon (Benoit Daloze) 3 months ago

  • Description updated (diff)
#4

Updated by Eregon (Benoit Daloze) 3 months ago

  • Description updated (diff)
#5

Updated by ioquatix (Samuel Williams) 3 months ago

Characterising this as trivial hides its impact, both on the scheduler design and on application code which expects a per-thread Mutex.

Once the scheduler lands into master, we will need to scope out something like wait_mutex and all the associated scheduling that needs to happen.

Can you check the Crystal implementation? How do they implement fair scheduling?

Finally, as this breaks assumptions in application code, I proposed Mutex.new(blocking: false/true). I'm on the fence regarding this interface, but at least it should be clear that this kind of change has a track record of breaking application code. So if we decide to make the mutex per-fibre, we need to anticipate this problem, either through analysis of existing source code, issuing warnings, or something else.

Ultimately, I think keeping it simple would be great. But I'm hesitant to do so since it may break existing code. headius (Charles Nutter) maybe you can comment on what user code was affected so we can see if there is some other way to mitigate it.
#6

Updated by Eregon (Benoit Daloze) 3 months ago

Application code which expects per-thread mutex

I'm happy to see examples of that.
I've never seen it in 5+ years of bug reports on TruffleRuby, and JRuby AFAIK has always had Mutex per Fiber, yet no real issue because of it.
The only cases that would rely on this seem to be broken synchronization
(e.g., taking the lock in one Fiber and unlocking it in another Fiber, which I believe is always a bug, as lock/unlock needs to be used like lock; begin; ...; ensure; unlock; end, which is always on the same stack and so on the same Fiber).
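To make the broken pattern concrete: on a Ruby where Mutex ownership is per Fiber (JRuby, TruffleRuby, and CRuby if this proposal is adopted), unlocking from a different Fiber fails loudly. A small illustration:

```ruby
m = Mutex.new

# Take the lock on one Fiber and keep that Fiber alive while it holds it.
locker = Fiber.new do
  m.lock
  Fiber.yield
  m.unlock
end
locker.resume # m is now locked by `locker`

# Try to unlock on another Fiber of the same Thread: with per-Fiber
# ownership this raises ThreadError, exposing the bug immediately.
error = nil
Fiber.new do
  begin
    m.unlock
  rescue ThreadError => e
    error = e
  end
end.resume

puts error ? "ThreadError: cross-Fiber unlock rejected" : "cross-Fiber unlock allowed"
locker.resume # let the owning Fiber release the Mutex properly
```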

#7

Updated by Eregon (Benoit Daloze) 3 months ago

ioquatix (Samuel Williams) wrote in #note-5:

Can you check the Crystal implementation? How do they implement fair scheduling?

It's linked in the description.
Mutex is already not fair so I don't think it matters.

Finally, as this breaks assumptions about user code, I proposed Mutex.new(blocking: false/true).

It doesn't.

I'm on the fence regarding this interface, but at least it should be clear that such a change has a track record of breaking application code.

Any example? Please don't claim such things if there is no example.
#8

Updated by ioquatix (Samuel Williams) 3 months ago

Any example? Please don't claim such things if there is no example.

I'll have to defer to headius (Charles Nutter) for examples, since he was the one that mentioned it. I can only assume he is telling the truth when he said some application code was broken.

The only reason to look at it is to check if it's a valid use case or not. If we can't find any valid use case, it confirms your assessment, which is a good thing.

It doesn't.

I did play around with this a bit, and I also wondered if we should add it to Ruby spec before making changes.

https://github.com/ruby/ruby/pull/3032/files#diff-1a99f5621b805d95b67228708b652ff9R25

That is the current semantics, and if we introduce non-blocking fibers, it can cause deadlock, as shown in the test, if you disable the mutex -> blocking relationship.
#9

Updated by ioquatix (Samuel Williams) 3 months ago

My initial reply might have come across as overly critical and that was not my intention.

I want to say, I agree with this proposal, and I think it's a good idea.

I would like to see that we merge the scheduler proposal first, which preserves the current semantics of Mutex. Then, shortly afterwards and before Ruby 3 is released, we should implement this.

One more point:

will hurt scalability of that proposal significantly.

There is no evidence to suggest this. I do agree it can hurt the scalability of a system but it's entirely possible to build such a system without a Mutex.
#10

Updated by ioquatix (Samuel Williams) 3 months ago

In order to experiment with this, I'd like to propose the following hooks to the scheduler:

class Scheduler
  # A fiber has tried to lock a mutex, but it failed.
  def wait_mutex(mutex)
  end

  # A mutex on which `wait_mutex` was previously called may now be available to lock.
  def notify_mutex(mutex)
  end
end

Eregon (Benoit Daloze) do you think this is sufficient? I'm a little bit concerned about the reentrancy guarantees of notify_mutex. Should we define that method to be reentrant/thread safe? Ideally, we have a way to notify the scheduler to reschedule the fiber to try and lock the mutex again.
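To make the discussion concrete, here is a minimal sketch of the bookkeeping such a scheduler might do (wait_mutex/notify_mutex are the hypothetical hook names proposed above; the actual fiber transfer and rescheduling is deliberately elided):

```ruby
# Toy illustration only: tracks which Fibers are waiting on which Mutex.
class ToyScheduler
  def initialize
    # Each Mutex maps to a FIFO queue of waiting Fibers.
    @waiting = Hash.new { |h, k| h[k] = [] }
  end

  # Called when a Fiber tried to lock `mutex` and failed: record it as a
  # waiter. A real scheduler would also transfer control to another
  # runnable Fiber here instead of simply returning.
  def wait_mutex(mutex)
    @waiting[mutex] << Fiber.current
  end

  # Called when `mutex` may be available again: pick the first waiter.
  # A real scheduler would resume it so it can retry the lock.
  def notify_mutex(mutex)
    @waiting[mutex].shift
  end
end
```

The reentrancy question above then becomes: may notify_mutex be called from another Thread while wait_mutex is mutating @waiting? In this sketch, the Hash would need its own native lock.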

cc jeremyevans0 (Jeremy Evans)
#11

Updated by ko1 (Koichi Sasada) about 2 months ago

I think this seems difficult to implement.
If the interpreter manages everything, it is easy (at least I can imagine how to implement it).

(1) API

I'm not sure we can implement Mutex scheduling with the hooks introduced in #10.
If there is only one thread, maybe it is easy to implement.

Maybe notify_mutex(mutex) will be called by the interpreter while the mutex is still locked. This means we need to introduce lock_nonblock or a similar API for Mutex (maybe _nonblock is not a good name because it behaves differently from IO#read_nonblock). Mutex#notice_when_locked?

Also, we need to wait for this notification alongside waiting for IO readiness. How do we write that? Using the pipe trick, or an interrupt mechanism?

(2) Queue/SizedQueue/CV

There are several operations that block because of thread synchronization; how should we treat them?
Introduce a unified hook method like wait_synchronization(sync_object) and notify_synchronization(sync_object)?
Or introduce pop_nonblock, similar to Mutex#sync_nonblock?
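For reference, Queue#pop and SizedQueue#push already have non-blocking variants via their non_block argument (these are real, existing APIs), which raise ThreadError instead of blocking; the open question is how a scheduler hook would relate to them:

```ruby
q = Queue.new
q << :job

p q.pop(true)  # non_block = true: returns :job without blocking

begin
  q.pop(true)  # queue is now empty: raises instead of blocking
rescue ThreadError
  puts "empty: a scheduler hook could suspend this Fiber here"
end
```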

(3) implemented by all schedulers?

Maybe most of the code would be the same between scheduler implementations. Can we provide a framework to implement it?
