Feature #19061

Proposal: make a concept of "consuming enumerator" explicit

Added by zverok (Victor Shepelev) over 1 year ago. Updated about 1 year ago.

Status:
Open
Assignee:
-
Target version:
-
[ruby-core:110312]

Description

The problem

Let's imagine this synthetic data:

lines = [
  "--EMAIL--",
  "From: zverok.offline@gmail.com",
  "To; bugs@ruby-lang.org",
  "Subject: Consuming Enumerators",
  "",
  "Here, I am presenting the following proposal.",
  "Let's talk about consuming enumerators..."
]

The logic of parsing it is more or less clear:

  • skip the first line
  • take lines until an empty one is met, to read the header
  • take the rest of the lines to read the body

It can be easily translated into Ruby code, almost literally:

def parse(enumerator)
  puts "Testing: #{enumerator.inspect}"
  enumerator.next
  p enumerator.take_while { !_1.empty? }
  p enumerator.to_a
end

Now, let's try this code with two different enumerators on those lines:

require 'stringio'

enumerator1 = lines.each
enumerator2 = StringIO.new(lines.join("\n")).each_line(chomp: true)

puts "Array#each"
parse(enumerator1)

puts
puts "StringIO#each_line"
parse(enumerator2)

Output (as you probably already guessed):

Array#each
Testing: #<Enumerator: [...]:each>
["--EMAIL--", "From: zverok.offline@gmail.com", "To; bugs@ruby-lang.org", "Subject: Consuming Enumerators"]
["--EMAIL--", "From: zverok.offline@gmail.com", "To; bugs@ruby-lang.org", "Subject: Consuming Enumerators", "", "Here, I am presenting the following proposal.", "Let's talk about consuming enumerators..."]

StringIO#each_line
Testing: #<Enumerator: #<StringIO:0x00005581018c50a0>:each_line(chomp: true)>
["From: zverok.offline@gmail.com", "To; bugs@ruby-lang.org", "Subject: Consuming Enumerators"]
["Here, I am presenting the following proposal.", "Let's talk about consuming enumerators..."]

Only the second enumerator behaves the way we wanted it to.
Things to notice here:

  1. Both enumerators are of the same class, "just enumerator," but they behave differently: one consumes data on each iteration method, the other does not. There is no programmatic way to tell whether a given enumerator instance is consuming.
  2. There is no easy way to make a non-consuming enumerator behave in a consuming way, which would enable a processing sequence like "skip this, take that, take the rest".
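
The difference noted above comes down to how external iteration (#next) interacts with internal iteration (#take_while, #to_a). A minimal sketch, using only core Ruby and StringIO; the variable names are illustrative and not part of the original report:

```ruby
require 'stringio'

# Array#each: #next advances an external cursor, while #to_a restarts
# internal iteration from the beginning of the array.
array_enum  = [1, 2, 3].each
array_first = array_enum.next
array_rest  = array_enum.to_a

# StringIO#each_line: every read moves the shared IO position, so
# whatever #next has read is gone for later iteration methods.
io_enum  = StringIO.new("a\nb\nc").each_line(chomp: true)
io_first = io_enum.next
io_rest  = io_enum.to_a

p array_first, array_rest  # 1 and [1, 2, 3]
p io_first, io_rest        # "a" and ["b", "c"]
```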

Concrete proposal

  1. Introduce an Enumerator#consuming? method that will allow telling one from the other (and make core enumerators like #each_line properly report they are consuming).
  2. Introduce a consuming: true parameter for Enumerator.new so it would be easy for user code to specify the flag.
  3. Introduce Enumerator#consuming method to produce a consuming enumerator from a non-consuming one:
# reference implementation is trivial:
class Enumerator
  def consuming
    source = self
    Enumerator.new { |y| loop { y << source.next } }
  end
end

enumerator3 = lines.each.consuming
parse(enumerator3)

Output:

["From: zverok.offline@gmail.com", "To; bugs@ruby-lang.org", "Subject: Consuming Enumerators"]
["Here, I am presenting the following proposal.", "Let's talk about consuming enumerators..."]

Updated by nobu (Nobuyoshi Nakada) over 1 year ago

  • Description updated (diff)

Updated by mame (Yusuke Endoh) over 1 year ago

Here is my understanding:

[1, 2, 3].each.consuming?   #=> false
$stdin.each_line.consuming? #=> true

# A user must guarantee whether it is consuming or not.
Enumerator.new                  {}.consuming? #=> false
Enumerator.new(consuming: true) {}.consuming? #=> true

e = [1, 2, 3].each.consuming
p e.consuming? #=> true
p e.next #=> 1
p e.to_a #=> [2, 3]

I think there are two problems with this proposal.

Problem 1: The consuming flag depends on the underlying IO

An enumerator created from a normal file is not consuming.

e = File.foreach("normal-file")
e.next #=> "first line\n"
e.to_a #=> ["first line\n", "second line\n", "third line\n"]

However, an enumerator created from a named FIFO is consuming.

File.mkfifo("fifo-file")
fork do
  ["first line\n", "second line\n"].each do |s|
    sleep 1
    File.write("fifo-file", s)
  end
end

e = File.foreach("fifo-file")
e.next #=> "first line\n"
e.to_a #=> ["second line\n"]

I am unsure if there is a portable way to determine whether the IO is consuming or not.

Problem 2: The result of Enumerator#consuming shares the state with the original Enumerator

After Enumerator#consuming is called, calling #next and/or #rewind on the original Enumerator affects the consuming Enumerator and vice versa.

e1 = (1..5).to_enum
e2 = e1.consuming

# This call affects the state of e2
p e1.next #=> 1
p e2.next #=> 2 (is this okay?)

# Conversely, e2.next affects the state of e1
p e1.next #=> 3 (is this okay again?)

# e2.rewind has no effect (as intended), but you can still rewind e2 by calling e1.rewind
e1.rewind

p e2.next #=> 1 (rewound; is this okay?)

I don't think it is intentional, but it is very difficult to implement it correctly. One possible solution I came up with is to prohibit #next and #rewind on the original Enumerator, i.e., the right to call the methods is completely transferred to the consuming one. But it introduces yet another new type of Enumerator (unrewindable Enumerator?), which is very complicated.

Updated by zverok (Victor Shepelev) over 1 year ago

Here is my understanding

This is correct.

Problem 1: The consuming flag depends on the underlying IO

That's an interesting problem indeed! I'll look deeper into it.

But for now, I consider it an edge case that can be, in the worst case, just covered by docs. E.g. something like "File.foreach reports itself as not consuming, but depending on IO properties this might not be true...", while, say, File#each_line is consuming by design, if I understand correctly.

The distinction of "consuming"/"non-consuming" [by design] still seems helpful.

Problem 2: The result of Enumerator#consuming shares the state with the original Enumerator

It is just because my reference implementation was too naive :)
By simply changing it to

class Enumerator
  def consuming
    source = dup
    Enumerator.new { |y| loop { y << source.next } }
  end
end

...as far as I can tell, all ties with the original enumerator's state are broken, and all of the examples behave reasonably:

e1 = (1..5).to_enum
e2 = e1.consuming

p e1.next #=> 1
p e2.next #=> 1 (unaffected by e1.next)
p e1.next #=> 2 (unaffected by e2.next)

e1.rewind

p e2.next #=> 2 (unaffected by rewind)

Do you see a problem with this solution?..

Updated by ioquatix (Samuel Williams) over 1 year ago

For problem 1 you can check if an IO is seekable, and this would tell you whether you could restart from the beginning.
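
A rough sketch of what such a seekability check could look like. The helper name seekable? is hypothetical (not an existing Ruby API), and it assumes that IO#pos raises Errno::ESPIPE on non-seekable streams such as pipes; later comments in this thread question both the reliability and the relevance of this check.

```ruby
require 'tempfile'

# Hypothetical helper, not part of Ruby: probes whether the OS lets us
# query the current offset of the given IO.
def seekable?(io)
  io.pos
  true
rescue Errno::ESPIPE
  false
end

reader, writer = IO.pipe
pipe_seekable = seekable?(reader)
file_seekable = Tempfile.create("seekable-demo") { |f| seekable?(f) }

p pipe_seekable
p file_seekable
```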

Updated by Dan0042 (Daniel DeLorme) over 1 year ago

mame (Yusuke Endoh) wrote in #note-2:

But it introduces yet another new type of Enumerator (unrewindable Enumerator?), which is very complicated.

It's more complicated, but unrewindable enumerators already exist in practice (as shown by FIFO), so making them visible and explicit should be useful I think. Maybe #consuming? could return 3 values like [nil, :rewindable, :nonrewindable]

Updated by mame (Yusuke Endoh) over 1 year ago

zverok (Victor Shepelev) wrote in #note-3:

File#each_line is consuming by design, if I understand correctly.

Well, I guess so. To be honest, I'm not sure which ones are consuming and which ones are not.

Problem 2: The result of Enumerator#consuming shares the state with the original Enumerator

Do you see a problem with this solution?..

I think this is also a possible solution. Note that the Enumerator in the middle of #next will not be able to return #consuming. Is this okay?

e1 = (1..5).to_enum
e1.next
e1.consuming #=> can't copy execution context (TypeError)

ioquatix (Samuel Williams) wrote in #note-4:

For problem 1 you can check if an IO is seekable, and this would tell you whether you could restart from the beginning.

I think you misunderstand Problem 1 (maybe due to my bad explanation). Enumerator does not use IO#seek or anything similar. Calling #next and #to_a on the Enumerator created from File.foreach will each open the file separately.
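
The point about File.foreach can be checked with a small self-contained script. This is only an illustration using a temporary regular file; the file name prefix and variable names are made up for the example.

```ruby
require 'tempfile'

first = nil
all = nil

Tempfile.create("foreach-demo") do |f|
  f.write("first line\nsecond line\nthird line\n")
  f.flush

  e = File.foreach(f.path)
  first = e.next  # opens the file inside the enumerator's fiber
  all   = e.to_a  # opens the file again and reads from the top
end

p first  # "first line\n"
p all    # ["first line\n", "second line\n", "third line\n"]
```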

Dan0042 (Daniel DeLorme) wrote in #note-5:

It's more complicated, but unrewindable enumerators already exist in practice (as shown by FIFO), so making them visible and explicit should be useful I think. Maybe #consuming? could return 3 values like [nil, :rewindable, :nonrewindable]

The word "unrewindable" was not a good name, which might have confused you. I meant an Enumerator whose #next and #rewind raise an exception, say, "you cannot use #next because you have already called #consuming".
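
A minimal sketch of this "transfer" idea, assuming it is acceptable to block #next, #peek and #rewind on the original object with singleton methods. The helper transfer_to_consuming and the error message are hypothetical, not an existing or proposed Ruby API.

```ruby
# Hypothetical helper illustrating the "transfer" idea; not a Ruby API.
def transfer_to_consuming(source)
  advance = source.method(:next)  # keep access to the real #next
  %i[next peek rewind].each do |name|
    # Shadow the external-iteration methods on this one object only.
    source.define_singleton_method(name) do |*|
      raise "cannot use #{name}: consuming has already been called"
    end
  end
  Enumerator.new { |y| loop { y << advance.call } }
end

e1 = (1..5).to_enum
e2 = transfer_to_consuming(e1)

first = e2.next  # 1
rest  = e2.to_a  # [2, 3, 4, 5]

blocked = begin
  e1.next
  false
rescue RuntimeError
  true
end

p first, rest, blocked
```

Internal iteration on the original (e1.each, e1.to_a) is left untouched in this sketch, since it does not move the external cursor.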

Updated by zverok (Victor Shepelev) over 1 year ago

@mame (Yusuke Endoh)

To be honest, I'm not sure which ones are consuming and which ones are not.

Which is one of the points of this ticket! The distinction is internally present (as displayed in original code samples) but never spelled out and can't be introspected. I believe that introducing the explicit concept will make it much more obvious and make people aware of it.

Note that the Enumerator in the middle of #next will not be able to return #consuming. Is this okay?

I think it is totally Ok for the first implementation, especially if #consuming will raise a bit more friendly error like "The enumerator is mid-enumeration and can't be turned into consuming" or something.

Updated by zverok (Victor Shepelev) over 1 year ago

@knu (Akinori MUSHA)

Re:

"But I'm skeptical about the usefulness of the consuming? flag" (from dev.log)

I believe it is extremely useful for introspection. For example, a method like the one shown in the original ticket:

def parse(enumerator)
  puts "Testing: #{enumerator.inspect}"
  enumerator.next
  p enumerator.take_while { !_1.empty? }
  p enumerator.to_a
end

...will work properly (enumerator.next and enumerator.take[_while] advance the enumerator) with a consuming enumerator and behave surprisingly with a non-consuming one. As it is too late to make all enumerators consuming :), at least the presence of the explicit notion of "consuming-ness" will make it somewhat easier to explain and understand.

It would also allow adjusting when needed, with enumerator = enumerator.consuming unless enumerator.consuming? or something similar.
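
Since neither #consuming nor #consuming? exists today, the "adjust when needed" idiom can only be approximated. The sketch below assumes a hypothetical helper to_consuming (a copy of the reference implementation from the description) and wraps unconditionally instead of checking a flag; the sample data is shortened and made up for the example.

```ruby
require 'stringio'

# Hypothetical stand-in for the proposed Enumerator#consuming,
# based on the reference implementation from the description.
def to_consuming(source)
  Enumerator.new { |y| loop { y << source.next } }
end

def parse(enumerator)
  enumerator = to_consuming(enumerator)          # normalize unconditionally
  enumerator.next                                # skip the marker line
  header = enumerator.take_while { !_1.empty? }  # up to the empty line
  body   = enumerator.to_a                       # everything that is left
  [header, body]
end

lines = ["--EMAIL--", "From: a@example.com", "", "Body text"]

from_array = parse(lines.each)
from_io    = parse(StringIO.new(lines.join("\n")).each_line(chomp: true))

p from_array == from_io  # true
p from_array             # [["From: a@example.com"], ["Body text"]]
```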

Updated by mame (Yusuke Endoh) over 1 year ago

zverok (Victor Shepelev) wrote in #note-7:

I think it is totally Ok for the first implementation

Not only "the first implementation". I think it is impossible to implement the method even in the future because a Fiber cannot be duplicated.

Updated by zverok (Victor Shepelev) over 1 year ago

I think it is impossible to implement the method even in the future because a Fiber cannot be duplicated.

Of course, it is impossible directly.
I can just imagine that if it becomes a common stumbling block for consuming enumerators (hardly so, but who knows), there might be some workarounds, like, IDK, trying to duplicate the initial state and making the consuming enumerator start from the beginning if possible, or something like that.

Anyway, it is out of the scope of the current proposal :)


Updated by hsbt (Hiroshi SHIBATA) over 1 year ago

  • Related to Feature #19069: Default value assignment with `Hash.new` in block form added

Updated by hsbt (Hiroshi SHIBATA) over 1 year ago

  • Related to deleted (Feature #19069: Default value assignment with `Hash.new` in block form)

Updated by matz (Yukihiro Matsumoto) about 1 year ago

Regarding the concrete proposals:

  1. Introduce an Enumerator#consuming? method

    The consuming information is not reliable, especially with I/O (some IOs may not be rewindable, yet lseek(2) may not return an error for them, e.g. on macOS). Thus we cannot implement a trustworthy consuming? method.

  2. Introduce consuming: true parameter for Enumerator.new

    Since the consuming? state of enumerators is unreliable, this keyword argument is useless.

  3. Introduce Enumerator#consuming method to produce a consuming enumerator from a non-consuming one

    The original PoC code modifies the original enumerator, while the modified version raises an error when duping the internal fiber. The latter is not acceptable behavior (though the former may be). In theory, we can overhaul the implementation of enumerators, but I don't think it's worth the cost.

The final decision may be up to the actual use-case. But I doubt the benefit.

Matz.

Updated by zverok (Victor Shepelev) about 1 year ago

@matz (Yukihiro Matsumoto) Thanks for your answer. I'll gather more evidence/real-life examples and will adjust the proposal.

My main concern though was not as much some particular usage but general awareness of the difference between the two types of enumerators.

The latest evidence of the fact that it is a problem is bug #19294 in the new feature of Ruby 3.2, where even the core team member implementing new functionality hasn't considered that some enumerators would be "consumed" by the first iteration.

I believe it to be a pretty important distinction frequently leading to idiosyncrasies and not just a random feature request. But I need to think about how to communicate my intentions and proposals better.
