Project

General

Profile

Feature #17325

Adds Fiber#cancel, which forces a Fiber to break/return

Added by nevans (Nicholas Evans) 2 months ago. Updated about 2 months ago.

Status:
Open
Priority:
Normal
Assignee:
-
Target version:
-
[ruby-core:100849]

Description

Calling Fiber#cancel will force a fiber to return, skipping rescue and catch blocks but running all ensure blocks. It behaves as if a break or return were used to jump from the last suspension point to the top frame of the fiber. Control will be transferred to the canceled fiber so it can run its ensure blocks.

Propagation from resuming to resumed fibers

Any non-root living fiber can be canceled and cancellation will propagate to child (resumed) fibers. In this way, a suspended task can be canceled even if it is e.g. resuming into an enumerator, and the enumerator will be canceled as well. Transfer of control should match #17221's (much improved) transfer/resume semantics. After the cancellation propagates all the way to the bottom of the fiber resume stack, the last fiber in the chain will then be resumed. Resuming fibers will not run until they are yielded back into.

Suspension of canceled fibers

Canceled fibers can still transfer control with resume, yield, and transfer, which may be necessary in order to release resources from ensure blocks. For simplicity, subsequent cancels will behave similarly to calling break or return inside an ensure block, and the last cancellation reason will overwrite earlier reasons.

Alternatives

Fiber#raise could be used, but:

  • Can only raise on resumable fibers.
  • Cannot propagate cancellation down to resumed fibers.
  • Exceptions are bigger and slower than break.
  • #raise can't (and shouldn't) be sent to resuming fibers. (It can't propagate.)
  • Exceptions can be caught. This might be desirable, but that should be at the discretion of the calling fiber.

Catch/Throw could be used (with an anonymous Object.new), but:

  • We would need to add Fiber#throw (or wrap/intercept Fiber.yield).
  • A hypothetical Fiber#throw should probably have similar semantics to #resume and thus only be allowed on resumable fibers.
    • In that case, it wouldn't propagate down to resumed fibers.
  • catch adds an extra stack frame.

We could use go-style "Context" objects that contain a "done?" queue/future.

  • These would need to be explicitly passed around.
  • Although their usage could be enforced via linters like rubocop, I think that placing it off to the side will give developers the impression that it is optional Some sort of cancel propagation mechanism is not optional for structured concurrency.
  • It should built into any task-scheduler library, which would allow application code to use it explicitly.
  • But this suffers the same problem as current Fiber wrappers: it works fine if your code uses the wrapper, but code that uses fibers without the wrapper can be incompatible and introduce bugs (e.g. fibers that are released without running their ensure blocks).
  • This make sense for a language like go which doesn't have exceptions but does have a convention of returning an "error" value. It feels out of place in ruby, IMO. Letting the fiber-task-scheduler mitigates that... for code that uses the fiber-task-scheduler.

We could add a keyword option to Fiber#raise that gives it similar propagation semantics to this.

  • IMO, the simplicity of Fiber#raise simply being a specialized version of Fiber#resume is worth preserving.
  • The propagation changes alone are enough of a semantic difference to warrant a new method.

We could implement Fiber#cancel by using fiber.raise(FiberCancellationError) on the bottom fiber and catching that exception during termination of the canceled fiber.

  • This would have the "benefit" that the exception could be rescued.
  • I might be wrong, but I think that doing this would mostly duplicate my PR, but with some added complexity around exception construction and catching.
  • It might be a good keyword option? e.g. Fiber#cancel(with_exception: [true,#exception,#to_str])

Just let the task-fiber-scheduler library handle this.

  • That's what I'm already doing now. It's mostly fine. It works in my code.
  • Putting it into ruby core should lead to a small performance boost on very commonly repeated code.
    • There's probably a better way to store the cancel_reason that doesn't require the overhead of adding another VALUE to rb_fiber_struct. Maybe it can be placed directly into errinfo?
  • Although the common cases can be handled via a trampoline fiber or #17221, there can still be situations where your application's fiber-scheduler library might not know about fibers created by other libraries. This adds interoperability to a common scenario.
  • Coroutine cancellation is IMO a core feature. It's important to have something like this for all applications and libraries to use as a baseline for interoperability.

Implementation: https://github.com/ruby/ruby/pull/3766


Related issues

Related to Ruby master - Bug #17331: Let Fiber#raise work with transferring fibersClosedActions

Updated by nevans (Nicholas Evans) 2 months ago

This is my first big foray into the ruby C API, so I hope it makes sense what I was trying to do. Did I handle vm_fiber_break and fiber_start appropriately? Is there is somewhere more appropriate to save the cancel reason than a new VALUE on rb_fiber_struct? E.g. could I safely assign the canceled fiber's ec->errinfo from the calling Fiber, instead of using cancel_reason?

When this compiles, it warns about:

vm_insnhelper.h:251:1: warning: ‘vm_call_iseq_optimizable_p’ defined but not used [-Wunused-function]
  251 | vm_call_iseq_optimizable_p(const struct rb_callinfo *ci, const struct rb_callcache *cc)     

So I'm probably doing something wrong there, too.

#3

Updated by nevans (Nicholas Evans) 2 months ago

  • Description updated (diff)
#4

Updated by nevans (Nicholas Evans) 2 months ago

  • File deleted (0001-Added-Fiber-cancel-which-forces-a-Fiber-to-return.patch)

Updated by nevans (Nicholas Evans) 2 months ago

  • Description updated (diff)

updated description and github PR

Updated by nevans (Nicholas Evans) 2 months ago

  • Description updated (diff)

I've updated the description to 1) better describe the motivation and 2) match the PR's current behavior when a Fiber is canceled multiple times.

Updated by Eregon (Benoit Daloze) 2 months ago

This sounds like Thread#kill but for Fiber.
However, Fiber is cooperative so it seems this shouldn't be needed.

What's a concrete issue with Fiber#raise? It seems redundant with it to me.
If the application swallows Exception I think that's the application problem, and I think we should not add a second way to unwind a Fiber just for that.
The performance difference is likely not very relevant, as I expect exiting a Fiber is not a common operation, and most of the cost is likely unwinding the stack & running ensure.
Because this runs ensure, it's still possible for that Fiber to hang, or to raise another exception in ensure which can then be caught.
So, it's not a guarantee either.
fiber.raise(Exception) seems enough, and this seems to be only marginally different, at the cost of another method and yet another way to terminate a Fiber.

Regarding propagation, I'm not sure what's the difference, but exceptions in Fibers are propagated:

$ ruby -e 'f=Fiber.new { raise "foo" }; p((f.resume rescue $!))'
#<RuntimeError: foo>

Updated by nevans (Nicholas Evans) about 2 months ago

Thanks for taking a look at this, Benoit. I agree it's not obvious why this is necessary with Fiber#raise, so I'll try to explain my reasoning in more detail:

Yes, a library (e.g. async) could write a "suspend" function that wraps resume, yield, and transfer. That library could implement cancel (and raise) by tunneling resume types & values through a struct. That library can't directly control fibers that are created outside its wrapper (e.g. Enumerator), but our "suspend" wrapper could implement a "trampoline" fiber to interoperate with most simple cases like Enumerable. I know all of this is possible, because I've been using a suspend wrapper like this on one of my work projects for a few years now! :) Still, it's not hard to imagine incompatible fiber-creating libraries that circumvent our wrapper in unpredictable ways. Putting these into Fiber gives all fiber-using applications and libraries a shared baseline they can rely upon.

I disagree on "not a common operation". I think that relatively short-lived fibers could become as common-place as relatively short-lived goroutines are in go (and I hope they do). And I think non-exceptional cancellation is a very important concept for asynchronous code to handle explicitly. Other languages with coroutines special case "cancel" (even if some of them tunnel cancelation via error types). E.g. go's Context type uses Done() and the go vet tool checks that CancelFuncs are used on all control-flow paths. Kotlin coroutine's Job class has cancel methods but not raise methods.

Propagation

The most important difference from Fiber#raise isn't that it is unrescuable or uses fewer resources, but how propagation works:

  • You can cancel any living fiber except the root-fiber.
    • I've disallowed canceling the root to avoid accidentally shutting the thread down.
    • I'm currently raising FiberError for terminated fibers, but I think I'd like to change that. I think canceling dead fibers should simply return nil or false. That way we can safely call cancel on any fiber, and the only reason it would raise an exception is if the child fiber terminates with an exception during cancellation.
  • Cancel propagates down through the linked list of resumed fibers.
  • Execution of ensure still runs from bottom up.
  • It does not propagate an error or cancel up from the canceled fiber.

E.g. if an async task fiber is resuming into an lazy enumerator which is resuming into another fiber (and so on) and your thread scheduler wants to cancel your task's fiber but doesn't know about those other fibers, Fiber#raise won't work. If your fiber is transferring, Fiber#raise won't work. Fiber#cancel will work on any living fiber while still following the rules (#17221) for transferring control.

Just as break doesn't generally need to know or care about the implementation of intervening library code, canceling a fiber shouldn't need to know or care about what the implementation of any sub-fibers may have been resumed.

semantics of raise/rescue vs unexceptional break/return

I'm not against temporarily or explicitly blocking cancellation. Avoiding swallowed exceptions is not the most important feature, but it's still a useful one I think. And any small performance gain would be desirable, but not the primary driver. Aside from either of those, raise has different semantics from break or return (or throw). (As currently written) this is only for semantically unexceptional flow-control. And this isn't a matter of "application code can handle it" because applications can't control their intervening library code, nor can they control fibers created by intervening library code. It's quite common to see code like the following:

def foo
  some_library_code_runs_this_block_many_stack_layers_deep do
    result = etc_etc_etc
    return result.retval if result.finished?
  end
end

We could wrap this in an exception handler, but it would be more confusing to the casual reader than simply using return or break (or maybe catch and throw). For jumping to the ensure block of a particular method we use return. For block scope, break. For fiber-scope: Fiber#cancel.

I expect that return statement to non-exceptionally rewind the stack without being caught by any catch or rescue. I don't want a library's hidden rescue Exception to subvert my break or return (libraries shouldn't do this, but sometimes they do).

It's not as simple as an "application problem". Task cancellation could be triggered by application code or by library code. A task-scheduler library might call Fiber#cancel, and the fibers being canceled might be in application code or in library code or might be suspended by resuming into fibers that are uncontrolled by or even unknown to task-scheduler. None of that should matter.

Wrapping a task with catch {|tag| ... } would be conceptually better than exception handling... but throw tag from an Enumerator won't propagate up to the return fiber. (I don't want to change this behavior.)

ruby -e 'f = Fiber.new { throw :foo }; p((catch(:foo) { f.resume } rescue nil))'
#<UncaughtThrowError: uncaught throw :foo>

Examples

To be clear, these are toy examples and I'd want most of the following to be handled for me by a fiber-task-scheduler library (like async). But that library itself should have a mechanism for canceling resuming tasks, even when it doesn't (or can't) know about the resumed child fibers of those tasks. Fiber#raise (as currently written) can't do that.

def run_server
  server = MyFiberyTCPServer.open
  # Do stuff. e.g. accept clients, assign connections to fibers, etc.
  # Those connections can create their own sub-fibers.
  # The server may know nothing about those sub-fibers. It shouldn't need to.
  # Those subfibers might even use an entirely different scheduler. That's okay.
  # Connection fibers might be un-resumable because they are resuming. No prob.
  wait_for_shutdown_signal # => transfers to some sort of fiber scheduler
ensure
  # cancels all connection-handler fibers
  server.connections.each do |c|
    # Are those connection fibers resuming other sub-fibers tasks?
    # Do we even know about those sub-tasks?
    # Can we even know about them from here?
    # Who cares? Those need to be canceled too!
    c.cancel :closing if c.alive?
    # I'd like to make dead_fiber.cancel unexceptional too
  end
end

# fetching a resource may depend on fetching several other resources first
def resource_client
  a = schedule_future { http.get("/a") }
  b = schedule_future { http.get("/b") }
  items = a.value.item_ids.map {|id| http.get("/items/#{id}") }
  combine_results(b, ary)
ensure
  # if any of the above raises an exception
  # or if *this* fiber is canceled
  # of if combine_results completed successfully before all subtasks complete
  a&.cancel rescue nil # is it resuming another fiber? don't know, don't care.
  b&.cancel rescue nil # is it resuming another fiber? don't know, don't care.
  ary&.each do |item| item.cancel rescue nil end # ditto
end

# yes, task library code would normally provide a better pattern for this
def with_timeout(seconds)
  timer = Task.schedule do
    sleep seconds
  ensure
    task.cancel :timeout
  end
  task = Task.schedule do
    yield # does this resume into sub-tasks? we shouldn't need to know.
  ensure
    timer.cancel
  end
  task.value
end

No guarantees

And yes, we can always have misbehaving code and I'm not trying to guarantee against every case. We can't guard against certain categories of bugs nor infinite loops. It's always possible someone's written:

def foo
  bar
ensure
  while true
    begin
      while true
        Fiber.yield :misbehaving
      end
    rescue Exception
      # evil code being evil
    end
  end
end

But that's entirely outside the scope of this. :)

We can have bugs here just like any code can have bugs. But in my experience, ensure code is usually much shorter and simpler than other code. Shut down, clean up, release, and reset.

Updated by nevans (Nicholas Evans) about 2 months ago

  • Description updated (diff)

Added some other alternatives to the description.

Updated by nevans (Nicholas Evans) about 2 months ago

  • Description updated (diff)

added more to alternatives

Updated by nevans (Nicholas Evans) about 2 months ago

  • Description updated (diff)

one more update to alternatives

Updated by nevans (Nicholas Evans) about 2 months ago

Like I wrote above, I think I'd like to change my PR so that terminated_fiber.cancel returns nil or false instead of raising FiberError. That way any fiber on the current thread, except for the root fiber, can be canceled without exceptions (unless an exception is raised during cancellation).

Also, this PR as written is for non-exceptional cancellation (doesn't raise in return_fiber) that are not catchable (like break). It might be worthwhile to add keywords options for all four possibilities:

use break or raise to cancel on completion, raise error in return_fiber method call
break (unrescuable) unexceptional fiber.cancel this PR. the common case
break (unrescuable) exceptional fiber.cancel(raise: err) exceptional cancellation, e.g. most timeouts
raise (rescuable) unexceptional fiber.cancel(with_exception: err) preventable cancellation
raise (rescuable) exceptional fiber.cancel(with_exception: err, raise: err) Fiber#raise with "resuming" propagation

I think the first one is most useful, but the others can be useful too. The others could be in future PRs... if this (or something like it) is accepted. But if people don't agree that this version is useful, I wasn't going to implement the others. :)

Updated by nevans (Nicholas Evans) about 2 months ago

n.b. I consider this to be related, at least in spirit, to https://bugs.ruby-lang.org/issues/595. I consider this analogous to IO#open and IO#close. We ought to have a structured/standard way to "close" fibers that have been "opened". (This analogy with IO#close is other good reason to return nil for dead fibers.)

Updated by Eregon (Benoit Daloze) about 2 months ago

Could you give an example code where Fiber#raise wouldn't work?
Is it just about #17331 or there are more issues with Fiber#raise?

I don't quite understand why you need to cancel other Fibers, they are not executing anyway.
Is it to make sure to run their ensure clauses?

I guess Fiber#cancel actually waits for that Fiber to terminate (otherwise it would run Fibers concurrently, which is of course bad).
cancel doesn't sound to me like it would wait from the name.
Maybe terminate or so is better for that.
Fiber#kill might be clearer about the intent, there is no magic "cancellation", it's just the equivalent of an exception.

For the run_server example, you could wrap it in a thread, then when the main Fiber of that Thread returns, it would already automatically kill all Fibers of the Thread. Or if the shutdown ends up terminating the main thread, then it would be enough too.
On CRuby, those cases don't seem to run ensure, but that seems a bug and TruffleRuby already runs ensure in that case:
ruby -e 'Thread.new { Fiber.new { begin; p 1; Fiber.yield; ensure; p 2; end }.resume }.join'

#15

Updated by Eregon (Benoit Daloze) about 2 months ago

  • Related to Bug #17331: Let Fiber#raise work with transferring fibers added

Also available in: Atom PDF