Feature #16786

Light-weight scheduler for improved concurrency.

Added by ioquatix (Samuel Williams) about 2 months ago. Updated 4 days ago.

Status:
Open
Priority:
Normal
Assignee:
-
Target version:
-
[ruby-core:97878]

Description

Abstract

We propose to introduce a light weight fiber scheduler, to improve the concurrency of Ruby code with minimal changes.

Background

We have been discussing and considering options to improve Ruby scalability for several years. More context can be provided by the following discussions:

The final Ruby Concurrency report provides some background on the various issues considered in the latest iteration: https://www.codeotaku.com/journal/2020-04/ruby-concurrency-final-report/index

Proposal

We propose to introduce the following concepts:

  • A Scheduler interface which provides hooks for user-supplied event loops.
  • Non-blocking Fiber which can invoke the scheduler when it would otherwise block.

Scheduler

The per-thread fiber scheduler interface is used to intercept blocking operations. A typical implementation would be a wrapper for a gem like EventMachine or Async. This design provides separation of concerns between the event loop implementation and application code. It also allows for layered schedulers which can perform instrumentation, enforce constraints (e.g. during testing) and provide additional logging. You can see a sample implementation here.

class Scheduler
  # Wait for the given file descriptor to become readable.
  def wait_readable(io)
  end

  # Wait for the given file descriptor to become writable.
  def wait_writable(io)
  end

  # Wait for the given file descriptor to match the specified events within
  # the specified timeout.
  # @param events [Integer] a bit mask of `IO::WAIT_READABLE`,
  #   `IO::WAIT_WRITABLE` and `IO::WAIT_PRIORITY`.
  # @param timeout [#to_f] the amount of time to wait for the event.
  def wait_any(io, events, timeout)
  end

  # Sleep the current task for the specified duration, or forever if not
  # specified.
  # @param duration [#to_f] the amount of time to sleep.
  def wait_sleep(duration = nil)
  end

  # The Ruby virtual machine is going to enter a system level blocking
  # operation.
  def enter_blocking_region
  end

  # The Ruby virtual machine has completed the system level blocking
  # operation.
  def exit_blocking_region
  end

  # Intercept the creation of a non-blocking fiber.
  def fiber(&block)
    Fiber.new(blocking: false, &block)
  end

  # Invoked when the thread exits.
  def run
    # Implement event loop here.
  end
end

A thread has a non-blocking fiber scheduler. All blocking operations on non-blocking fibers are hooked by the scheduler, and the scheduler can switch to another fiber. If a fiber holds any mutex, the scheduler is not called, which is the same behaviour as a blocking Fiber.

Schedulers can be written in Ruby. This is a desirable property as it allows them to be used in different implementations of Ruby easily.

To enable non-blocking fiber switching on blocking operations:

  • Specify a scheduler: Thread.current.scheduler = Scheduler.new.
  • Create several non-blocking fibers: Fiber.new(blocking:false) {...}.
  • As the main fiber exits, Thread.current.scheduler.run is invoked which begins executing the event loop until all fibers are finished.
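As a rough illustration of the steps above, here is a minimal sketch of an IO.select based event loop. It is not the implementation from the pull request: the hook names follow the interface above, but SelectScheduler is a hypothetical class, the hooks are called by hand (no interpreter support is assumed), and every fiber is assumed to be started from the main fiber.

```ruby
# Hypothetical sketch of a scheduler built on IO.select.
class SelectScheduler
  def initialize
    @readable = {} # IO => fiber waiting for it to become readable
    @sleeping = [] # [wake_time, fiber] pairs
  end

  # Create a fiber and start it immediately (one possible start policy).
  def fiber(&block)
    fiber = Fiber.new(&block)
    fiber.resume
    fiber
  end

  # Park the current fiber until io is readable.
  def wait_readable(io)
    @readable[io] = Fiber.current
    Fiber.yield
  end

  # Park the current fiber for duration seconds.
  def wait_sleep(duration)
    @sleeping << [now + duration, Fiber.current]
    Fiber.yield
  end

  # Event loop: runs until no fiber is waiting any more.
  def run
    until @readable.empty? && @sleeping.empty?
      timeout = @sleeping.map { |time, _| [time - now, 0].max }.min
      ready, = IO.select(@readable.keys, nil, nil, timeout)

      (ready || []).each { |io| @readable.delete(io).resume }

      due, @sleeping = @sleeping.partition { |time, _| time <= now }
      due.each { |_, fiber| fiber.resume }
    end
  end

  private

  def now
    Process.clock_gettime(Process::CLOCK_MONOTONIC)
  end
end

scheduler = SelectScheduler.new
reader, writer = IO.pipe
log = []

scheduler.fiber do
  scheduler.wait_readable(reader)
  log << reader.read_nonblock(5)
end

scheduler.fiber do
  scheduler.wait_sleep(0.01)
  writer.write("hello")
  log << :written
end

scheduler.run
p log
```

The reading fiber is parked first and only resumes after the sleeping fiber has written, so the two fibers interleave on a single thread without either one blocking it.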

Time/Duration Arguments

Tony Arcieri suggested against using floating point values for time/durations, because they can accumulate rounding errors and other issues. He has a wealth of experience in this area so his advice should be considered carefully. However, I have yet to see these issues happen in an event loop. That being said, round tripping between struct timeval and double/VALUE seems a bit inefficient. One option is to have an opaque argument that responds to to_f as well as potentially seconds and microseconds or some other such interface (could be opaque argument supported by IO.select for example).
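To make the trade-off concrete, here is a small sketch of the kind of opaque argument described above. The Duration class and its method names are hypothetical and only illustrate the idea: integer microseconds are kept internally, and to_f is offered for schedulers that want a float.

```ruby
# Repeatedly adding a float interval drifts away from the exact value.
float_total = 0.0
10.times { float_total += 0.1 }
puts float_total == 1.0 # false

# Hypothetical opaque duration: integer microseconds inside, #to_f outside.
class Duration
  attr_reader :total_microseconds

  def initialize(seconds, microseconds = 0)
    @total_microseconds = seconds * 1_000_000 + microseconds
  end

  def seconds
    total_microseconds / 1_000_000
  end

  def microseconds
    total_microseconds % 1_000_000
  end

  def +(other)
    Duration.new(0, total_microseconds + other.total_microseconds)
  end

  def to_f
    total_microseconds / 1_000_000.0
  end
end

tick = Duration.new(0, 100_000) # 0.1 seconds
total = Array.new(10, tick).reduce(:+)
puts total.seconds      # 1
puts total.microseconds # 0
puts total.to_f         # 1.0
```

The integer representation maps directly onto the two fields of struct timeval, which would avoid the round trip through double.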

File Descriptor Arguments

Because of the public C interface we may need to support a specific set of wrappers for CRuby.

int rb_io_wait_readable(int);
int rb_io_wait_writable(int);
int rb_wait_for_single_fd(int fd, int events, struct timeval *tv);

One option is to introduce hooks specific to CRuby:

class Scheduler
  # Wrapper for rb_io_wait_readable(int) C function.
  def wait_readable_fd(fd)
    wait_readable(::IO.for_fd(fd, autoclose: false))
  end

  # Wrapper for rb_io_wait_writable(int) C function.
  def wait_writable_fd(fd)
    wait_writable(::IO.for_fd(fd, autoclose: false))
  end

  # Wrapper for rb_wait_for_single_fd(int, int, struct timeval *) C function.
  def wait_for_single_fd(fd, events, duration)
    wait_any(::IO.for_fd(fd, autoclose: false), events, duration)
  end
end

Alternatively, in CRuby, it may be possible to map from fd -> IO instance. Most C schedulers only care about file descriptor, so such a mapping will introduce a small performance penalty. In addition, most C level schedulers will not care about IO instance.
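A Ruby-level sketch of such a mapping might look like the following. DescriptorTable is a hypothetical name, and the sketch ignores invalidation: a real table would have to drop entries when a descriptor is closed and reused, for example by holding the IO objects weakly.

```ruby
# Hypothetical lookup table from file descriptor to IO instance.
class DescriptorTable
  def initialize
    @ios = {}
  end

  # Remember an IO that Ruby code already owns.
  def register(io)
    @ios[io.fileno] = io
  end

  # Return the known IO for fd, or lazily wrap the raw descriptor.
  # autoclose: false keeps the wrapper from closing the fd when collected.
  def lookup(fd)
    @ios[fd] ||= ::IO.for_fd(fd, autoclose: false)
  end
end

table = DescriptorTable.new
reader, writer = IO.pipe
table.register(reader)

puts table.lookup(reader.fileno).equal?(reader) # true: original object
puts table.lookup(writer.fileno).equal?(writer) # false: lazily built wrapper
```

The second lookup shows the cost mentioned above: a descriptor that was never registered needs a freshly constructed IO, even if the scheduler only wants the number back.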

Non-blocking Fiber

We propose to introduce per-fiber flag blocking: true/false.

A fiber created by Fiber.new(blocking: true) (the default Fiber.new) becomes a "blocking Fiber" and has no changes from current Fiber implementation. This includes the root fiber.

A fiber created by Fiber.new(blocking: false) becomes a "non-blocking Fiber" and it will be scheduled by the per-thread scheduler when the blocking operations (blocking I/O, sleep, and so on) occurs.

Fiber.new(blocking: false) do
  puts Fiber.current.blocking? # false

  # May invoke `Thread.scheduler&.wait_readable`.
  io.read(...)

  # May invoke `Thread.scheduler&.wait_writable`.
  io.write(...)

  # Will invoke `Thread.scheduler&.wait_sleep`.
  sleep(n)
end.resume

Non-blocking fibers also support Fiber#resume, Fiber#transfer and Fiber.yield, which are necessary to create a scheduler.
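The "may invoke" comments in the example above can be pictured with a small sketch. read_with_scheduler is a hypothetical helper written in plain Ruby; the proposal performs the equivalent step inside the interpreter, so this only illustrates the control flow, not the actual implementation.

```ruby
# Hypothetical helper: try a non-blocking read, and only when it would
# block either hand control to the scheduler or block the whole thread.
def read_with_scheduler(io, length, scheduler = nil)
  loop do
    result = io.read_nonblock(length, exception: false)
    return result unless result == :wait_readable

    if scheduler
      scheduler.wait_readable(io) # another fiber may run in the meantime
    else
      IO.select([io])             # no scheduler: ordinary blocking wait
    end
  end
end

reader, writer = IO.pipe
writer.write("ready")
puts read_with_scheduler(reader, 5) # data is available, no hook needed
```

When data is already available the hook is never called, which is why the comments say "may" rather than "will" for I/O.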

Fiber Method

We also introduce a new method which simplifies the creation of these non-blocking fibers:

Fiber do
  puts Fiber.current.blocking? # false
end

This method invokes Scheduler#fiber(...). The purpose of this method is to allow the scheduler to internally decide the policy for when to start the fiber, and whether to use symmetric or asymmetric fibers.

If no scheduler is specified, it is an error: RuntimeError.new("No scheduler is available").
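The dispatch described here can be sketched in a few lines. ScheduledFiber and EagerScheduler are hypothetical stand-ins (the sketch avoids defining a method literally named Fiber), and a thread variable plays the role of Thread.current.scheduler.

```ruby
# Hypothetical stand-in for the proposed Fiber() method.
def ScheduledFiber(&block)
  scheduler = Thread.current.thread_variable_get(:sketch_scheduler)
  raise RuntimeError, "No scheduler is available" unless scheduler

  # The scheduler decides how and when the fiber is started.
  scheduler.fiber(&block)
end

# Trivial scheduler whose policy is to start the fiber immediately.
class EagerScheduler
  def fiber(&block)
    fiber = Fiber.new(&block)
    fiber.resume
    fiber
  end
end

begin
  ScheduledFiber { puts "never runs" }
rescue RuntimeError => error
  puts error.message # No scheduler is available
end

Thread.current.thread_variable_set(:sketch_scheduler, EagerScheduler.new)
ScheduledFiber { puts "running inside a scheduled fiber" }
```

A different scheduler could queue the fiber instead of resuming it, or use transfer rather than resume, without any change to the calling code.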

In the future we may expand this to support some kind of default scheduler.

Non-blocking I/O

IO#nonblock is an existing interface to control whether I/O uses blocking or non-blocking system calls. We can take advantage of this:

  • IO#nonblock = false prevents that particular IO from utilising the scheduler. This should be the default for stderr.
  • IO#nonblock = true enables that particular IO to utilise the scheduler. We should enable this where possible.
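A short sketch of how the flag could act as a per-IO switch, using the existing io/nonblock extension. scheduler_eligible? is a hypothetical predicate and not part of the proposal.

```ruby
require 'io/nonblock'

# Hypothetical predicate: would this IO be routed through the scheduler?
def scheduler_eligible?(io)
  io.nonblock?
end

reader, writer = IO.pipe
reader.nonblock = true  # opt in: waits may go through the scheduler
writer.nonblock = false # opt out: always use blocking system calls

puts scheduler_eligible?(reader) # true
puts scheduler_eligible?(writer) # false
```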

As proposed by Eric Wong, we believe that making I/O non-blocking by default is the right approach. We have expanded his work in the current implementation. By doing this, when the user writes Fiber do ... end they are guaranteed the best possible concurrency, without any further changes to code. As an example, one of the tests shows Net::HTTP.get being used in this way with no further modifications required.

To support this further, consider the counterpoint, that Net::HTTP.get(..., blocking: false) is required for concurrent requests. Library code may not expose the relevant options, severely limiting the user's ability to improve concurrency, even if that is what they desire.

Implementation

We have an evolving implementation here: https://github.com/ruby/ruby/pull/3032 which we will continue to update as the proposal changes.

Evaluation

This proposal provides the hooks for scheduling fibers. With regards to performance, there are several things to consider:

  • The impact of the scheduler design on non-concurrent workloads. We believe it's acceptable.
  • The impact of the scheduler design on concurrent workloads. Our results are promising.
  • The impact of different event loops on throughput and latency. We have independent tests which confirm the scalability of the approach.

We can control for the first two in this proposal, and depending on the design we may help or hinder the wrapper implementation.

In the tests, we provide a basic implementation using IO.select. As this proposal is finalised, we will introduce some basic benchmarks using this approach.

Discussion

The following points are good ones for discussion:

  • Handling of file descriptors vs IO instances.
  • Handling of time/duration arguments.
  • General design and naming conventions.
  • Potential platform issues (e.g. CRuby vs JRuby vs TruffleRuby, etc).

The following is planned to be described by Eregon (Benoit Daloze) in another design document:

  • Semantics of non-blocking mutex (e.g. Mutex.new(blocking: false) or some other approach).

In the future we hope to extend the scheduler to handle other blocking operations, including name resolution, file I/O (by io_uring) and others. We may need to introduce additional hooks. If these hooks are not defined on the scheduler implementation, we will revert back to the blocking implementation where possible.


Related issues

Related to Ruby master - Feature #16792: Make Mutex held per Fiber instead of per Thread (Open)
Related to Ruby master - Bug #16892: Reconsider the test directory name for scheduler (Closed, ioquatix (Samuel Williams))

Updated by shevegen (Robert A. Heiler) about 2 months ago

One issue I see is that this adds another API (Scheduler) for people to have to
remember. They will have to know how/when to use Mutex, Thread, Fibers, perhaps
Guilds, and now Scheduler.

Is this really what we want to have in ruby? Aspects such as class Hash, String,
Array etc... are quite simple to use and understand. The whole parallelism part,
on the other hand, seems to spawn more and more complexity on its own.

Updated by ioquatix (Samuel Williams) about 2 months ago

One issue I see is that this adds another API (Scheduler) for people to have to

In practice, users do not see this interface. If you check the Async implementation, it's completely hidden from the user, but allows Async to handle native Ruby I/O in its own reactor/event loop (on the scheduler branch using the proposed implementation here).


Updated by ioquatix (Samuel Williams) about 2 months ago

  • Description updated (diff)

Updated by ioquatix (Samuel Williams) about 2 months ago

  • Description updated (diff)

Updated by headius (Charles Nutter) about 2 months ago

Notes from recent discussions about this on Slack:

Scheduler API should pass IO objects, not file descriptors

The current design calls for the Scheduler methods like wait_readable to pass only a numeric file descriptor as an argument. While this might model how the C code works (every IO boils down to a file descriptor), it does not match how Ruby code works. If the goal is that this API can be implemented from Ruby code, the implementation must receive IO objects. If it does not, all sorts of problems result:

  • There's nothing you can do with a raw file descriptor, so it would have to be passed back out to C or wrapped in a new IO object.
  • The new IO object would have to be autoclose: false or else it would end up closing the original fd when collected.
  • The new IO object would not reflect the original type of IO that created the fd, which means no calling File or Socket-specific APIs, and any subclassed behavior would be lost.
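The last two points can be checked with a short sketch using only existing APIs. wrap_descriptor is a hypothetical helper name for what a scheduler would have to do if it received only a number.

```ruby
require 'tempfile'

# Hypothetical helper: rebuild an IO from nothing but a descriptor number.
def wrap_descriptor(io)
  IO.for_fd(io.fileno, autoclose: false)
end

Tempfile.create('sketch') do |file|
  wrapper = wrap_descriptor(file)

  puts file.class          # File
  puts wrapper.class       # IO: the File type is lost
  puts wrapper.is_a?(File) # false
  puts wrapper.autoclose?  # false: collecting it will not close the fd
end
```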

There are no other places in Ruby where you work with raw file descriptors. They occasionally leak out into Ruby via IO#fileno and IO.for_fd but only in the process of turning them back into IO objects.

A final minor point is that not all implementations will support raw file descriptors. JRuby running in non-native mode uses only JVM channels, which do not expose file descriptors. Most of the time JRuby is run in native mode, but this should be considered; leaking the file descriptor into Ruby-land is very un-Ruby.

Mutex will have to be addressed

This proposal punts on enhancing Mutex and considers any fiber that has locked a mutex as now being blocking. I assume this means it goes back to cooperative scheduling until the lock has been released.

I think this is going to limit many/most uses of this API. Given that context switches between threads can now occur on any IO operation, the need for synchronizing access to mutable data is even more important. There will be more locking required to ensure scheduled fibers are not stepping on each others' work, which means more cases will be forced back into blocking mode.

I understand this omission is to keep the scope small, but I think it's a big risk to go forward with this feature before making it mutex-aware.

Don't introduce Fiber()

The new Fiber do form is confusing to me and I'm pretty sure it will be confusing to users. I guarantee people will be asking when to use Fiber.new do versus Fiber do and there's no indication why this special form has been added nor what it does differently. These two forms will also be easily mistaken and result in people calling the wrong one.

In addition, I think we need an API form that always produces a nonblocking fiber. In this proposal, Fiber() calls Scheduler#fiber, which as stated:

If no scheduler is specified, it creates a normal blocking fiber. An alternative is to make it an error.

So calling the same method will have different behaviors depending on whether there's a Scheduler installed, and depending on what that Scheduler chooses to do. As with the old proposal, where fibers would magically become nonblocking when a Scheduler is installed, now we have the reverse case: fibers intended to be nonblocking will not be nonblocking depending on the behavior of the Scheduler.

If Fiber.new(blocking: false) is too much to expect, perhaps Fiber.nonblock do or similar would be a good choice?

This intertwining of behaviors between Fiber and Scheduler seems problematic to me.

Updated by headius (Charles Nutter) about 2 months ago

A thought occurs: if you have to create a "special" fiber to get nonblocking behavior anyway, why isn't it named something else? How about something like Worker?

  • Create a new nonblocking worker fiber with Worker.new do
  • Workers and only workers are schedulable

I believe AutoFiber has been suggested in the past, along with other names.

We've already decided we can't change behavior of existing Fiber code. It makes sense to me that there should be a new name for this concept.

Updated by enebo (Thomas Enebo) about 2 months ago

I have not had much time to digest this from a single reading but a question immediately screams out at me in reading this: Why not differentiate what we think of as Fiber today with this new type of Fiber (e.g. ScheduledFiber)?

It feels like a boolean tacked on to fiber makes the notion that it is scheduled pretty opaque. nonblocking: true does not give me any sense of how that non-blocking fiber is scheduled (since we can change the scheduling behavior). I would assume Ruby just magically handles it, but then I would also expect a specific set behavior for that scheduling. The fact that we can change that scheduling makes me think the noun used to describe it could make that clearer.

Updated by ioquatix (Samuel Williams) about 2 months ago

Why not differentiate what we think of as Fiber today with this new type of Fiber (e.g. ScheduledFiber)?

From the user's point of view, it's still a fiber, and can be scheduled like a fiber: resume/yield/transfer and so on. The scheduler also sees it as a fiber and uses fiber methods for scheduling.

Scheduler API should pass IO objects, not file descriptors

We really only have two options that I can think of:

  • Internally have a table of fd -> IO and use this, although there are C extensions where this still won't work because there was never an IO instance so we still need to construct it. The details of constructing an IO instance in this case are trivial but there is still a cost.
  • Expose this detail in the scheduler design and leave it up to the implementation. Most scheduler designs just need the file descriptor and don't care about IO so there is little value in reconstructing the full IO object when it's immediately discarded or unused.

Don't introduce Fiber()

I'm okay with this. In Async we already have constructs that users are familiar with. I'll have to defer to matz (Yukihiro Matsumoto) for specifically what kind of interface he wants to expose. This interface was based on our initial discussion at RWC2019.

There are two benefits from introducing such a name:

  • It hides the implementation of symmetric/asymmetric switching by the scheduler.
  • It provides a uniform interface which high level libraries like Async and EventMachine can hook into. They do not need to use framework-specific constructs/methods for task construction.

Mutex will have to be addressed

The entire Async stack including Falcon works without depending on Mutex working the way you suggest it needs to. So I respectfully disagree with your assertions. The only place it's used is in signal handling setup IIRC.

Semantically, the proposed implementation doesn't change the behaviour of Mutex. We want to avoid introducing changes that break user code.

Given that context switches between threads can now occur on any IO operation

That's either wrong (do you mean non-blocking fibers?) or irrelevant to this proposal (yes, threads can always context switch on I/O operation and that's not changed by this proposal).

Updated by ioquatix (Samuel Williams) about 2 months ago

In addition, I thought about it more, and I think Fiber do ... end without a scheduler should be a hard error, otherwise, the default implementation needs to expose symmetric/asymmetric co-routine (resume/transfer). In addition, even if we choose hard error now, we can extend it in the future with default scheduler or some other similar idea (not in this proposal please).


Updated by ioquatix (Samuel Williams) about 2 months ago

  • Description updated (diff)

Updated by ioquatix (Samuel Williams) about 2 months ago

  • Description updated (diff)

Added autoclose: false to reflect discussion point from @headius.

Updated by ioquatix (Samuel Williams) about 2 months ago

I looked at C interface.

We can introduce new interface, something like:

int rb_wait_readable(VALUE io);
int rb_wait_writable(VALUE io);

// Similar to wait_for_single_fd:
int rb_wait_events(VALUE io, int events, struct timeval * timeout);

We can make adaptors for existing C interface in the scheduler:

e.g.

def wait_readable(io)
end

def wait_readable_fd(fd)
  wait_readable(IO.for_fd(fd, autoclose: false))
end

An alternative implementation could wrap rb_io_wait_readable(fd) and use an internal lookup table fd -> IO. But an entry won't always exist, so it must lazily construct IO instances. It is a fact that kqueue/epoll/io_uring don't care about IO, only the file descriptor. So in 99.9% of code the scheduler will immediately call IO#fileno, and there is a small performance cost.

VALUE descriptors[MAX] = {Qnil}; // should be implemented by weak map.

void rb_io_wait_readable(int fd) {
  VALUE io = descriptors[fd];
  if (io == Qnil) {
    descriptors[fd] = io = rb_io_from_fd(fd);
  }
  rb_wait_readable(io);
}

It's just idea, but I am happy to try it out. I want to hear some feedback first about which design makes most sense.


Updated by Eregon (Benoit Daloze) about 2 months ago

  • Related to Feature #16792: Make Mutex held per Fiber instead of per Thread added

Updated by Eregon (Benoit Daloze) about 2 months ago

I created #16792 to change Mutex to be held per Fiber instead of per Thread.
Based on that it should be easy to integrate with the Scheduler.
I agree that seems an important case to address, and I think we shouldn't have any builtin operation disabling the scheduler as that is both more complicated to understand and a large limitation for scalability of the model.

Updated by Eregon (Benoit Daloze) about 2 months ago

I think this is a great proposal.

I think we need to try to support existing code as much as possible, because all the existing Ruby code will never be rewritten to use a different pattern.
So we need the proposal to compose really well with existing code which might use threads and Mutex for good reasons, and I think part of that is making Mutex#lock reschedule.

I'm also a bit concerned about Fiber() being rather unclear.
I would prefer to be explicit here like Thread.current.scheduler.fiber {} but that's indeed quite long.
Maybe we can make Fiber.new(blocking: false) call Thread.current.scheduler.fiber {} or Thread.current.scheduler.register(fiber_instance) ?

Updated by ioquatix (Samuel Williams) about 2 months ago

Thanks Eregon (Benoit Daloze) for your feedback.

Maybe we can make Fiber.new(blocking: false) call Thread.current.scheduler.fiber {} or Thread.current.scheduler.register(fiber_instance) ?

Fiber.new is a constructor and is independent of the scheduler in every way.

The blocking or non-blocking state is simply stored into the fiber itself.

Because of that, I disagree with the constructor doing anything complicated or invoking any hook, at least at this time. Because then, the time at which you construct the fiber might impact its behaviour in the scheduler, which I think is unnecessary and maybe confusing to the user.

Additionally, we should not expose user to Fiber.new(blocking: true/false) because it's detail of scheduler implementation and to avoid breaking existing code (where Fiber.new defaults to blocking fiber which preserves existing behaviour).

Users need a simple entry point for concurrency. This is proposed as Fiber {}. I cannot make it any simpler than that. Fiber as a name is already reserved by Ruby, so making a method of the same name is similar to class Integer/Integer(...).

I've had feedback from developers over several years who told me Async {} is so simple and easy. So the ergonomics are good for users and the feedback supports that.

I cannot see any value in making it longer, more explicit, tied to the scheduler, or adding arguments to make it blocking (which we want users to avoid). Most users don't understand blocking/nonblocking so we should avoid forcing them to deal with it.

Updated by ioquatix (Samuel Williams) about 2 months ago

  • Description updated (diff)

Make wait_readable, wait_writable and wait_any take IO argument. Add explicit wrappers for CRuby.

Updated by ioquatix (Samuel Williams) about 2 months ago

  • Description updated (diff)

Tidy up proposal.

Updated by ioquatix (Samuel Williams) about 1 month ago

I asked for more feedback from community.

https://twitter.com/ioquatix/status/1251024336502190081

  • Fiber do ... end: ~49% like it.
  • AsyncFiber do ... end: ~27% like it.
  • Fiber.new(blocking:false) do ... end.resume: ~20% like it.
  • Thread.scheduler.fiber do ... end: ~5% like it.

Some options were truncated in the poll because Twitter limits the length of the option. The sample size was ~280 people. It's not super scientific, but my experience is that polls do not change significantly after the first 100 votes.

So, I'm confident that Fiber do ... end is the right approach and the community showed strong support for it.

Also a few more notes:

  • I'm against AsyncFiber as it's confusing naming w.r.t. Async the gem I maintain.
  • I checked Worker but there is a gem called that already.

Updated by Eregon (Benoit Daloze) about 1 month ago

I think Fiber() is OK after your replies, if it raises if there is no scheduler so there is no silent error. It would just need good documentation.

Fiber.new(blocking: false) should be clearly marked as "should only be used by scheduler not directly by user code", as that would miss the registration.

Updated by sam.saffron (Sam Saffron) about 1 month ago

My big concern here is that this does not cover why #13618 was deficient and this complete greenfield implementation solves the issues it had?

#13618 had kqueue and epoll implementations which this would leave unimplemented, as far as I recall we were simply stuck on naming with 13618, there was nothing fundamentally objectionable there to ko1 (Koichi Sasada) and matz (Yukihiro Matsumoto)

Updated by Eregon (Benoit Daloze) about 1 month ago

sam.saffron (Sam Saffron) I'll let ioquatix (Samuel Williams) reply in more details but my point of view on that is:
#13618 is not flexible, and rather hardcodes everything including the scheduler, the IO selectors, etc, which would add a huge implementation cost to alternative Ruby implementations.
This proposal is far more flexible, and a much smaller change which is far easier to review, maintain and evolve.
Also the author of that proposal seems rather inactive recently (not blaming, just a fact), which would be an issue to maintain that code.

That said, #13618 is quite similar to this proposal and so in essence not so different: they both use rb_io_wait_readable/rb_io_wait_writable/rb_wait_for_single_fd as a way to reschedule on blocking IO method calls.

Efficient selectors with kqueue/epoll can be provided by nio4r which already works on CRuby, JRuby and TruffleRuby.

Updated by ioquatix (Samuel Williams) 26 days ago

My big concern here is that this does not cover why #13618 was deficient and this complete greenfield implementation solves the issues it had?

This proposal is really an evolution of #13618. The reasons why that proposal did not move forward have already been outlined.

Personally, I'd like to take the event loop implementations from #13618 and put them into a gem, so CRuby, TruffleRuby and JRuby can all benefit. The proposal here is for the interface which allows that to happen, and we already have a proof of concept using NIO4r which has been used in production for years.

Updated by chrisseaton (Chris Seaton) 23 days ago

I recently did a deep dive into this approach and how it would fit into the Ruby ecosystem as part of my work on TruffleRuby.

I think what is being proposed here looks like a very practical idea for improving concurrency on Ruby, for the common use-case of applications doing a lot of IO with many clients. The core idea is simple to explain and understand, which I think is the real strong point here.

I also appreciate how the proposal has been architectured to have a pluggable backend. As a researcher that means we're open to experiment with some more radical ideas but running the same user code.

If you didn't know, TruffleRuby implements fibres as threads, which is also what JRuby does. This is because the JVM doesn't have any lightweight threading mechanism at the moment. The JVM will get fibres through a project called Loom, and some experimental work we did to integrate this into TruffleRuby was promising. It should work the same on JRuby. I'm planning to implement this issue in TruffleRuby for experimentation even if we don't have the expected performance characteristics of fibres yet.

Updated by ko1 (Koichi Sasada) 20 days ago

Sorry for the late response.
First of all, I agree to merge it and try it before the next release (please wait for Matz's comment before merging).

There are several considerations.

non-blocking fiber creation API

For me, the name Fiber() is not clear because there are traditional fibers. How about Fiber.schedule{ ... } or something which shows the fiber will be scheduled? It should raise if the scheduler is not set to the thread.

mixing with blocking fiber

  • I heard that the root fiber (a default fiber per thread) is blocking. It should be noted.
  • What happens when a non-blocking fiber creates a blocking fiber (Enumerator, etc) and it runs a blocking I/O operation? I think it should be blocking though. However, resuming a blocking fiber will be a blocking operation.

Scheduler class

Your example shows that the Scheduler class only inherits Object class. Do we need a Fiber::Scheduler class as base class?

At least it can provide Scheduler#fiber method.

Fiber#resume/.yield for non-blocking fiber

I understand they are needed to make scheduler in Ruby, but it is confusing. I think non-blocking fiber should not have an ability to context switch by users outside of scheduler.

For example, if a fiber F1 is waiting for a network IO and a fiber F2 resume F1, then it will be blocking until the IO is ready.

How about prohibiting context switches by Fiber class methods, but providing Fiber::Scheduler methods?

# like that
class Fiber::Scheduler
  def resume(fib) = native_impl
  # or transfer?
end

class SelectScheduler < Fiber::Scheduler
  def wait_readable io
    ready_io = select(...)
    ready_fiber = ...
    resume(ready_fiber)
  end
end

BTW, hooks are called on root fiber (if a fiber F1 calls blocking IO operations, back to the root fiber and call the hook?) sorry if I missed the explanation.

Scheduler hooks

  • wait_readable, wait_writable hooks are easy to understand. However, wait_any is not clear for me.
  • There is a wait_sleep, but I'm not sure the hooks are enough or not.
  • enter_blocking_region/exit_blocking_region are strongly connected to MRI, so I'm not sure we should provide them here. For example, def notice(type, ...) which is called by an interpreter with information can hide the details from the method names (user should know the details to utilize the information).

Context switch predictability for atomic operations

I'm still negative about this proposal because we cannot predict the context switch completely. Compared with threading, the predictability is very high, but we cannot predict context switch timing 100% because most non-blocking IO operations can be context switch points. This can violate atomic operations such as mutating Hash/Array/Object/... twice or more at once.

I know most people, including Samuel and Matz, are optimistic about this issue.
I also agree the danger of this kind of violation is very low compared with threading.

How to provide a safety

There are several ideas.

  • (1) Users understand the code deeply where are context-switching points.
    • Pros. we don't need to introduce any mechanism.
    • Cons. difficult to make it perfect (human-readable is not perfect)
  • (2) Use Mutex correctly, and non-blocking fibers take care of it.
    • Pros. it is highly compatible with threading. It means we can use the same code in multi-threading and multi-nonblocking fiber apps.
    • Cons. users need to use Mutex correctly. Schedulers should manage Mutexes.
  • (3) Introduce a new context-switch control mechanism such as Fiber.exclude{ ... } like Ruby 1.8 or Fiber.blocking{ ... } to prevent Fiber scheduling (context-switching) in a block.
    • Pros. easy to implement.
    • Cons. users need to use this method correctly.
  • (4) Introduce a non-context-switch assertion mechanism such as Fiber.should_not_switch{ ... } (user asserts that there is no context-switching point). If there is an IO operation, it causes an assertion violation error even if there is only one (itself) non-blocking fiber.
    • Pros. easy to implement.
    • Cons. users need to use this method correctly.
  • (5) ((2) + (4)) Assume locking Mutex as an assertion.
    • Pros. compatible with Mutex code.
    • Cons. users need to use Mutex correctly.
  • (6) Restrict the non-blocking IOs more, for example, only net/http enables it.
    • Pros. make more predictable.
    • Cons. concurrency will be reduced.

mmm, (5) seems fine? (if any Mutex is locked by a fiber, then fiber context switch will be an error).
In general, holding Mutex's lock long time is not recommended.
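Idea (5) could be sketched roughly as follows. AssertingLock is a hypothetical wrapper around Mutex, not an existing or proposed class, and the sketch assumes the scheduler would call assert_switchable! at every potential context-switch point. It relies on Thread#[] being fiber-local storage.

```ruby
# Hypothetical lock that treats "holding a lock" as a no-switch assertion.
class AssertingLock
  class SwitchWhileLocked < StandardError; end

  KEY = :asserting_locks_held

  # A scheduler would call this before switching away from the fiber.
  def self.assert_switchable!
    held = Thread.current[KEY].to_i
    raise SwitchWhileLocked, "context switch while holding a lock" if held > 0
  end

  def initialize
    @mutex = Mutex.new
  end

  def synchronize
    @mutex.synchronize do
      begin
        # Thread#[] is fiber-local, so the count belongs to this fiber.
        Thread.current[KEY] = Thread.current[KEY].to_i + 1
        yield
      ensure
        Thread.current[KEY] -= 1
      end
    end
  end
end

lock = AssertingLock.new
AssertingLock.assert_switchable! # fine: no lock is held

begin
  lock.synchronize { AssertingLock.assert_switchable! }
rescue AssertingLock::SwitchWhileLocked => error
  puts error.message
end
```

Running existing programs with such a check enabled would be one way to perform the survey described in the next section.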

((7) is using Ractor, a position talk ;p)

How to survey the existing program?

Implementing (5) and running some programs can show how much code needs atomic operations and how often blocking IO operations are called inside such atomic operations.

Updated by ioquatix (Samuel Williams) 20 days ago

non-blocking fiber creation API

It was voted by community, strongly in favour of Fiber do ... end. If you think your suggestion is better, we should confirm with community.

Scheduler class

We can introduce Fiber::Scheduler however I don't think it's necessary. The benefit would be adding default functionality to existing schedulers.

In fact, such a default implementation could be provided by a gem or some other code which can be shared between implementations.

I heard that the root fiber (a default fiber per thread) is blocking. It should be noted.

Yes, this is outlined in the proposal, the default fiber is blocking, including the root fiber.

What happens when a non-blocking fiber creates a blocking fiber

Resuming a blocking fiber is a blocking operation. This is good behaviour and ensures things like Enumerator won't be broken by this proposal.
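To make the Enumerator point concrete, here is a small sketch of external enumeration. In MRI, Enumerator#next runs the block in a fiber internally, so it depends on resume returning only when the next value has been produced.

```ruby
# External enumeration: each call to #next resumes the enumerator's
# internal fiber until it yields the next value.
numbers = Enumerator.new do |yielder|
  yielder << 1
  yielder << 2
  yielder << 3
end

first_three = [numbers.next, numbers.next, numbers.next]

# After the block finishes, #next raises StopIteration.
finished = begin
  numbers.next
  false
rescue StopIteration
  true
end
```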

For example, if a fiber F1 is waiting for network IO and a fiber F2 resumes F1, then it will block until the IO is ready.

Scheduler will require access to resume/transfer/yield operations, so removing them is not realistic. In addition, Async uses these operations in Async::Semaphore & Async::Queue implementations, as well as other places.

The scheduler should be robust against spurious wake-ups (example using Async given below). However, a user who calls #resume without any care will suffer the consequences if the code is not robust.

require 'async'
require 'async/io'

Async do
  i, o = Async::IO.pipe

  f1 = Async do
    puts i.read(1024)
  end

  f2 = Async do
    10.times do
      puts "Resuming f1"
      f1.instance_variable_get(:@fiber).resume
      puts "f1 yielded"
    end
  end

  o.write("Hello World")
  o.close
end
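The same idea without Async, as a minimal sketch using only core Fiber: the waiting fiber re-checks its condition every time it is resumed, so an extra #resume does no harm. The ready flag and log array are stand-ins for real IO readiness.

```ruby
ready = false
log = []

waiter = Fiber.new do
  # Re-check the condition after every wake-up instead of assuming
  # that being resumed means the condition holds.
  until ready
    log << :waiting
    Fiber.yield
  end
  log << :done
end

waiter.resume   # first run: condition not met, fiber yields
waiter.resume   # spurious wake-up: condition still not met, yields again
ready = true
waiter.resume   # real wake-up: the loop exits and the fiber finishes
```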

BTW, are hooks called on the root fiber? If a fiber F1 calls a blocking IO operation, does it go back to the root fiber and call the hook?

My interpretation of this is you are asking if the root fiber (which is blocking) will enter a recursive loop when invoking non-blocking operations. The answer is no.

However, wait_any is not clear to me.

wait_any is modelled after https://ruby-doc.org/stdlib-2.7.1/libdoc/io/wait/rdoc/IO.html#method-i-wait

We can change name to something else, do you have better idea?
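For reference, a small sketch of the existing IO#wait behaviour that wait_any is modelled after, using a pipe and a short timeout. Only the truthiness of the return value is relied on here, since the exact return value differs between Ruby versions.

```ruby
require 'io/wait'

reader, writer = IO.pipe

# Nothing has been written yet, so the wait times out and returns a falsy value.
ready_before = reader.wait(0.05) ? true : false

writer.write("x")

# Now there is data to read, so the wait returns a truthy value.
ready_after = reader.wait(0.05) ? true : false

reader.close
writer.close
```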

There is a wait_sleep, but I'm not sure whether the hooks are enough or not.

What do you mean it is enough or not? Do you mean there are other ways to sleep?

enter_blocking_region/leave_blocking_region are strongly connected to the MRI.

Yes, agreed. These hooks were added as a result of our meeting in Japan.

I'd be happy to remove it but I think it provides very valuable insight into blocking operations within MRI. Maybe other implementations can comment on whether it's useful or not. Even if it's not supported, not much functionality is lost, so I don't see the point in removing it - it's not critical, but it's very useful.

user should know the details to utilize the information

To get the method name, you can use caller, which is shown in Async::Scheduler implementation: https://github.com/socketry/async/blob/c173f5880c566724f104855941f9af12fbf4d7e7/lib/async/scheduler.rb#L100-L112

I think it's best to avoid preparing the arguments (e.g. method name) when it may not be used, to avoid the overhead on critical path.

I'm still negative about this proposal because we cannot predict the context switches completely.

To me, this is actually a major feature of the proposal. We provide the tools to make a concurrent context which enables us to progressively improve concurrency. e.g.

Fiber do
  # Any operation in here may be scheduling point.
  connect(resolve("ruby-lang.org"))
end

This proposal doesn't include concurrent DNS resolution. But with Ruby 3.1, we can introduce such a feature without impacting user code. That means, resolve("ruby-lang.org") can become switching point. The same change in Node.js requires rewriting the code, which we want to avoid. In the future I want to introduce non-blocking DNS, File, other system calls, etc.

So users should not rely on blocking operations for synchronisation.

To retain compatibility with Mutex, when a Mutex is locked on a thread, that entire thread becomes blocking w.r.t. non-blocking operations. This ensures existing code continues to work correctly, at the cost of reduced concurrency when holding a Mutex.

The next step, as proposed by Eregon (Benoit Daloze), is to make Mutex fiber aware. This improves the opportunity for concurrency but does not change the semantics of user code.
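As a reminder of the kind of existing code this rule is meant to keep working, here is an ordinary threaded counter guarded by Mutex#synchronize. It is only a sketch of the pattern, not a demonstration of the scheduler itself.

```ruby
counter = 0
lock = Mutex.new

threads = 4.times.map do
  Thread.new do
    1000.times do
      # The read-modify-write must not be interleaved with another
      # thread (or, under this proposal, another fiber).
      lock.synchronize { counter += 1 }
    end
  end
end

threads.each(&:join)
```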

Regarding some of the other options you list, one you have not considered is this:

Fiber do
  # Non-blocking fiber.
  # io operations, etc.

  # This is effectively the same as `Fiber.exclusive`.
  Fiber.new do
    # Modify shared mutable state, any operation here is blocking so is guaranteed to be sequential.
  end.resume
end

So no new construct is required to force sequential execution.
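A standalone sketch of the inner part, assuming a Ruby where Fiber.new accepts the blocking: keyword and Fiber#blocking? exists (3.0 or later). In those released versions blocking: true has to be passed explicitly, unlike the default described in this proposal.

```ruby
require 'fiber' # needed for Fiber.current on older Rubies, harmless on newer ones

shared = []

# A blocking fiber: operations inside it do not hand control to a scheduler,
# so the two appends below cannot be interleaved with other fibers.
critical = Fiber.new(blocking: true) do
  shared << :start
  shared << :finish
  Fiber.current.blocking?
end

was_blocking = critical.resume
```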

So, in effect, this proposal already implements (2) + (4) / (5).

In general, holding a Mutex's lock for a long time is not recommended.

Agreed.

Updated by ioquatix (Samuel Williams) 20 days ago

  • Description updated (diff)

Specify the root fiber is also blocking.

Updated by ioquatix (Samuel Williams) 20 days ago

  • Description updated (diff)

Add clarification about introducing new hooks.

Updated by ko1 (Koichi Sasada) 20 days ago

ioquatix (Samuel Williams) wrote in #note-26:

non-blocking fiber creation API

It was voted on by the community, strongly in favour of Fiber do ... end. If you think your suggestion is better, we should confirm with the community.

I don't think we should refer to this kind of result, because the people who voted do not know the concerns.

Scheduler class

We can introduce Fiber::Scheduler however I don't think it's necessary. The benefit would be adding default functionality to existing schedulers.

In fact, such a default implementation could be provided by a gem or some other code which can be shared between implementations.

It can be. Maybe we should discuss it later, after more trials.

What happens when a non-blocking fiber creates a blocking fiber

Resuming a blocking fiber is a blocking operation. This is good behaviour and ensures things like Enumerator won't be broken by this proposal.

OK.

For example, if a fiber F1 is waiting for network IO and a fiber F2 resumes F1, then it will block until the IO is ready.

Scheduler will require access to resume/transfer/yield operations, so removing them is not realistic. In addition, Async uses these operations in Async::Semaphore & Async::Queue implementations, as well as other places.

Not sure it should be implemented independently (it should be a scheduler's duty IMO). But I agree it is more flexible.

The scheduler should be robust against spurious wake-ups (example using Async given below). However, a user who calls #resume without any care will suffer the consequences if the code is not robust.

ok.

BTW, are hooks called on the root fiber? If a fiber F1 calls a blocking IO operation, does it go back to the root fiber and call the hook?

My interpretation of this is you are asking if the root fiber (which is blocking) will enter a recursive loop when invoking non-blocking operations. The answer is no.

My question is: which fiber context is used for the wait_xxx methods? I asked Samuel, and the answer is the fiber which called the blocking IO.

However, wait_any is not clear to me.

wait_any is modelled after https://ruby-doc.org/stdlib-2.7.1/libdoc/io/wait/rdoc/IO.html#method-i-wait

We can change name to something else, do you have better idea?

I have no knowledge about it...

There is a wait_sleep, but I'm not sure whether the hooks are enough or not.

What do you mean it is enough or not? Do you mean there are other ways to sleep?

Sorry, my question is: if we recognize that other hooks are needed after releasing it, what happens?

the answer was: #28

enter_blocking_region/leave_blocking_region are strongly connected to the MRI.

Yes, agreed. These hooks were added as a result of our meeting in Japan.

I'd be happy to remove it but I think it provides very valuable insight into blocking operations within MRI. Maybe other implementations can comment on whether it's useful or not. Even if it's not supported, not much functionality is lost, so I don't see the point in removing it - it's not critical, but it's very useful.

user should know the details to utilize the information

To get the method name, you can use caller, which is shown in Async::Scheduler implementation: https://github.com/socketry/async/blob/c173f5880c566724f104855941f9af12fbf4d7e7/lib/async/scheduler.rb#L100-L112

I think it's best to avoid preparing the arguments (e.g. method name) when it may not be used, to avoid the overhead on critical path.

I doubt it is a performance issue.

I'm still negative about this proposal because we cannot predict the context switches completely.

To me, this is actually a major feature of the proposal. We provide the tools to make a concurrent context which enables us to progressively improve concurrency. e.g.

Fiber do
  # Any operation in here may be scheduling point.
  connect(resolve("ruby-lang.org"))
end

This proposal doesn't include concurrent DNS resolution. But with Ruby 3.1, we can introduce such a feature without impacting user code. That means, resolve("ruby-lang.org") can become switching point. The same change in Node.js requires rewriting the code, which we want to avoid. In the future I want to introduce non-blocking DNS, File, other system calls, etc.

So users should not rely on blocking operations for synchronisation.

To retain compatibility with Mutex, when a Mutex is locked on a thread, that entire thread becomes blocking w.r.t. non-blocking operations. This ensures existing code continues to work correctly, at the cost of reduced concurrency when holding a Mutex.

The next step, as proposed by Eregon (Benoit Daloze), is to make Mutex fiber aware. This improves the opportunity for concurrency but does not change the semantics of user code.

Regarding some of the other options you list, one you have not considered is this:

Fiber do
  # Non-blocking fiber.
  # io operations, etc.

  # This is effectively the same as `Fiber.exclusive`.
  Fiber.new do
    # Modify shared mutable state, any operation here is blocking so is guaranteed to be sequential.
  end.resume
end

So no new construct is required to force sequential execution.

Creating a blocking fiber is an interesting idea, but the backtrace is separated, so it shouldn't be used IMO.

Updated by ko1 (Koichi Sasada) 20 days ago

note that I missed:

If any mutex is acquired by a fiber, then a scheduler is not called; the same behaviour as blocking Fiber.

in description, so I agree there is no issue if Mutex is used correctly.

Checking for such situations (IO operations while a Mutex is locked) will be a good survey.

Updated by Dan0042 (Daniel DeLorme) 20 days ago

Really looking forward to this API, it's very promising.

What exactly are the implications of enter_blocking_region/exit_blocking_region? Does it mean the scheduler should not resume fibers even if IO is ready? Since the scheduler is meant to be written in ruby, it would be nice to provide some guidance for rubyists who may not be deeply knowledgeable about MRI internals.

I am also one of those against the Fiber method name because intuitively it sounds like it's equivalent to Fiber.new. I would actually prefer any of the alternatives since they indicate what is different about this fiber; the fact that it doesn't block on IO. So in that respect I think NonblockingFiber or Fiber.nonblocking would also be good alternatives imho. I understand that a Twitter poll is not meant to be scientific but, in addition to issues with selection bias (inherent to the internet), the questions were such that the "non-blocking name" votes (51%) were split among 3 options while the "generic name" votes all went to 1 option. The results should be taken with a pinch of salt or two.

Updated by ioquatix (Samuel Williams) 19 days ago

I don't think we should refer to this kind of result, because the people who voted do not know the concerns.

I think you are underestimating the collective knowledge of the community. That poll had almost 300 responses. I've also been working on this proposal for ~3 years and talked to many developers. So I believe Fiber do ... end is the right approach. Matz can make his decision, but my job is to present to him the proposal and the evidence.

That poll shows that the community doesn't like the alternative names presented here. Can you list the concerns you have so we can see if changing the name is the right way to address them? Or maybe it's a matter of clear documentation, etc.

What exactly are the implications of enter_blocking_region/exit_blocking_region?

It's a mechanism for detecting blocking operations that release the GVL. It can allow us to report back to the user that they are performing blocking operations on the event loop thread, which will cause some pain depending on the duration of the operation.

You can see a sample implementation here: https://github.com/socketry/async/blob/scheduler/lib/async/scheduler.rb#L100-L112
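For rubyists who have not looked at MRI internals, a simplified sketch of what a scheduler could do with these hooks. BlockingReporter and its threshold are hypothetical, the hook names are taken from the question above, and the hooks are invoked by hand here because this sketch does not assume the interpreter calls them.

```ruby
# Hypothetical reporter: records blocking regions that last longer than a threshold.
class BlockingReporter
  attr_reader :reports

  def initialize(threshold: 0.01)
    @threshold = threshold
    @reports = []
    @started_at = nil
  end

  # Would be called just before a blocking operation releases the GVL.
  def enter_blocking_region
    @started_at = now
  end

  # Would be called when the blocking operation returns.
  def exit_blocking_region
    duration = now - @started_at
    return unless duration > @threshold

    # caller is only evaluated when there is something to report.
    @reports << { duration: duration, location: caller(1, 1).first }
  end

  private

  def now
    Process.clock_gettime(Process::CLOCK_MONOTONIC)
  end
end

reporter = BlockingReporter.new(threshold: 0.01)

# Simulate a blocking operation on the event loop thread.
reporter.enter_blocking_region
sleep 0.05
reporter.exit_blocking_region
```

Nothing about resuming fibers changes in this sketch; the hooks are used purely for reporting.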

Updated by matz (Yukihiro Matsumoto) 19 days ago

Accepted for experimentation.

We still have some concerns, for example, the mixture of blocking and non-blocking fibers. mame (Yusuke Endoh) will describe the concern.
In addition, I don't like the method name Fiber, since the fiber created by the method is not the original fiber at all. It is not a good idea to steal the role of existing concept in the language. We need a new name.

Matz.

Updated by ioquatix (Samuel Williams) 19 days ago

Thanks Matz.

since the fiber created by the method is not the original fiber at all.

Can you clarify "not the original fiber at all"? It's the same way Integer(...) creates an instance of class Integer.
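The convention being referred to, as a short sketch: capitalised Kernel methods such as Integer(...) and Array(...) return instances of the class with the same name.

```ruby
# Kernel#Integer and Kernel#Array are methods named after the class
# whose instances they return.
number = Integer("42")
list = Array(nil)

number_class = number.class
list_class = list.class
```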

Updated by duerst (Martin Dürst) 19 days ago

ioquatix (Samuel Williams) wrote in #note-34:

Thanks Matz.

since the fiber created by the method is not the original fiber at all.

Can you clarify "not the original fiber at all"? It's the same way Integer(...) creates an instance of class Integer.

I can't speak for Matz, but my guess is that he meant "not the original type of fiber", i.e. not the same as you'd get e.g. with Fiber.new.

Updated by ioquatix (Samuel Williams) 19 days ago

Using latest master:

class Scheduler
  def fiber(&block)
    fiber = Fiber.new(blocking: false, &block)

    fiber.resume

    return fiber
  end
end

Thread.current.scheduler = Scheduler.new

f1 = Fiber do
  puts "Hello World"
end

puts f1.class
# Fiber

f2 = Fiber.new do
  puts "Hello World"
end

f2.resume

puts f2.class
# Fiber

Updated by hsbt (Hiroshi SHIBATA) 18 days ago

  • Related to Bug #16892: Reconsider the test directory name for scheduler added

Updated by matz (Yukihiro Matsumoto) 18 days ago

ioquatix (Samuel Williams) I was well represented by Martin-sensei (duerst (Martin Dürst)).
The fiber created by Fiber() do ... end does context-switch on I/O operations. The traditional (or original) fibers don't.
So naming the function Fiber may indicate that all fibers can switch context upon I/O operations.
Am I worrying too much?

Matz.

Updated by midnight (Sarun R) 18 days ago

Hi, I am a Ruby user who would probably have voted for

Fiber do
end

if only I had seen the poll back then.

The result should be taken with a grain of salt.
I second ko1 (Koichi Sasada) on the point that the people who voted for the choice just don't know the concerns.

If everything were decided by voting, we would not have the sane language that we grew to love.
Reasons should be the first choice to decide, and if reasons just don't cut it, relying on the gut feeling of someone knowledgeable in the area should be better than using voting results.

Don't worry too much about the community because the language is already flexible enough for wrapping or sugar-coating the core API into the form we like at the expense of some performance cost. (For someone who is truly opinionated about it.)

Updated by mgomes (Mauricio Gomes) 4 days ago

What about borrowing a little from Crystal? The non-blocking API could be:

spawn do
  # non-blocking fiber
end

I like how it has a completely different interface without introducing a new term.
