Feature #18262

Enumerator::Lazy#partition

Added by zverok (Victor Shepelev) over 2 years ago. Updated over 2 years ago.

Status:
Open
Assignee:
-
Target version:
-
[ruby-core:105750]

Description

(Part of my set of proposals about making .lazy more useful/popular.)

Currently:

file = File.open('very-large-file.txt')
lines_with_errors, lines_without_errors = file.lazy.partition { _1.start_with?('E:') }
lines_with_errors.class
# => Array; the whole file has been read by this point

This might not be very practical performance-wise or memory-wise.

I am thinking that maybe returning a pair of lazy enumerators might be a good addition to Enumerator::Lazy.

Naive prototype:

class Enumerator::Lazy
  def partition(&block)
    # Items pulled from the shared source that belong to the *other*
    # enumerator are parked in these buffers until it asks for them.
    buffer1 = []
    buffer2 = []
    source = self

    [
      Enumerator.new { |y|
        loop do
          if buffer1.empty?
            begin
              item = source.next
              if block.call(item)
                y.yield(item)
              else
                buffer2.push(item)
              end
            rescue StopIteration
              break
            end
          else
            y.yield buffer1.shift
          end
        end
      }.lazy,
      Enumerator.new { |y|
        loop do
          if buffer2.empty?
            begin
              item = source.next
              if !block.call(item)
                y.yield(item)
              else
                buffer1.push(item)
              end
            rescue StopIteration
              break
            end
          else
            y.yield buffer2.shift
          end
        end
      }.lazy
    ]
  end
end

Testing it:

Enumerator.produce(1) { |i| puts "processing #{i}"; i + 1 }.lazy
  .take(30)
  .partition(&:odd?)
  .then { |odd, even|
    p odd.first(3), even.first(3)
  }
# Prints:
# processing 1
# processing 2
# processing 3
# processing 4
# processing 5
# [1, 3, 5]
# [2, 4, 6]

As the "processing" log shows, it fetched only as many entries as the produced enumerators required.

The drawback, as my prototype implementation shows, is the need for internal "buffering" (I don't think it is possible to implement a lazy partition without it), but it still might be worth a shot?

Updated by Dan0042 (Daniel DeLorme) over 2 years ago

+1
Since a lazy enumerator is produced for both #select and #reject, it would make sense for #partition as well.
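For reference, a quick check of the current behavior this comment alludes to (a minimal sketch, not from the thread): #select and #reject on a lazy enumerator stay lazy, while #partition eagerly returns two Arrays.

```ruby
# #select and #reject preserve laziness even on an infinite source:
lazy = (1..Float::INFINITY).lazy
p lazy.select(&:odd?).class  # => Enumerator::Lazy
p lazy.reject(&:odd?).class  # => Enumerator::Lazy

# #partition, inherited from Enumerable, forces full evaluation
# and returns two plain Arrays:
p (1..10).lazy.partition(&:odd?).map(&:class)  # => [Array, Array]
```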

Updated by knu (Akinori MUSHA) over 2 years ago

I agree this would be a good addition, and I think the existing users of lazy would understand that the incompatibility this would bring is a necessary step to make partition more useful.

However, the buffering could be a pitfall for new users. In today's developer meeting, Matz and I agreed to suggest that the behavior should be well documented. If you were dividing a huge (or infinite) list into two, where one enumerator yields values far less often than the other, and you read from the sparse one, the buffer for the other side could become huge. That is not what you would normally expect from "lazy", so it should be noted in the documentation.
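The pitfall described above can be made concrete with a minimal, self-contained sketch (this is an illustration of the buffering idea, not the proposal's actual code): pulling a single element from the "rare" side forces every "common" element encountered along the way into a buffer.

```ruby
# Sketch of one half of a buffered lazy partition over an infinite source.
# Elements failing the (rare) predicate are parked for the other side.
common_buffer = []
source = (1..Float::INFINITY).lazy

rare = Enumerator.new do |y|
  loop do
    item = source.next
    if (item % 1000).zero?    # a rare predicate: matches 1 in 1000
      y << item
    else
      common_buffer << item   # buffered for the never-consumed other side
    end
  end
end

p rare.first(1)        # => [1000]
p common_buffer.size   # => 999 elements already held in memory
```

Reading from only one side of a skewed partition thus accumulates memory proportional to the skew, which is exactly why the behavior deserves explicit documentation.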

Updated by Dan0042 (Daniel DeLorme) over 2 years ago

I don't think buffering is a problem. If you consider that the current implementation fully buffers the two sub-lists before returning them, the lazy implementation literally cannot cause more buffering than the current non-lazy implementation.

Updated by knu (Akinori MUSHA) over 2 years ago

Dan0042 (Daniel DeLorme) wrote in #note-3:

I don't think buffering is a problem. If you consider that the current implementation fully buffers the two sub-lists before returning them, the lazy implementation literally cannot cause more buffering than the current non-lazy implementation.

Buffering may be a whole new idea in the history of Enumerator::Lazy, so it won't hurt to document that the new Lazy#partition implementation can consume a lot of memory.

Updated by Eregon (Benoit Daloze) over 2 years ago

Mmh, I think the whole point of Enumerator::Lazy is that it uses constant memory and does not buffer an arbitrary number of elements.

Shouldn't it simply behave like:

class Enumerator::Lazy
  def partition(&block)
    [select(&block), reject(&block)]
  end
end

No buffering, no surprises, consistent with general Lazy methods.
The block will be executed twice per element if both enumerators iterate that element.
That seems completely fine since it's a predicate, and it's expected given that you'd do the same with select/reject manually.
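The double evaluation this approach implies can be sketched as follows (a hypothetical predicate with a call counter, not code from the thread):

```ruby
# With the select/reject approach, each resulting enumerator re-runs the
# predicate independently, so a consumed element may be tested twice.
calls = 0
pred = ->(n) { calls += 1; n.odd? }

src  = (1..6).lazy
odd  = src.select(&pred)
even = src.reject(&pred)

p odd.first(3)   # => [1, 3, 5]  (tests elements 1..5: 5 calls)
p even.first(3)  # => [2, 4, 6]  (tests elements 1..6: 6 more calls)
p calls          # => 11, rather than 6 with a single shared pass
```

Note this also relies on the source being re-enumerable; both sides restart iteration from the beginning, which is fine for a Range but not for one-shot sources like IO.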

Updated by Dan0042 (Daniel DeLorme) over 2 years ago

I wouldn't say that constant memory is the "whole point" of Enumerator::Lazy. It's more about performing the minimum amount of computation needed, only when needed. Executing the block twice for each element is not minimal, and much more surprising to me than any amount of buffering. What if there are side effects in the block?

Updated by zverok (Victor Shepelev) over 2 years ago

@Eregon (Benoit Daloze) your version is not the same for effectful enumerators (where .lazy is extremely useful):

require 'stringio'

str = StringIO.new(<<~ROWS)
1: OK
2: Err
3: OK
4: Err
5: OK
6: Err
ROWS


err, ok = str.each_line(chomp: true).lazy.partition { _1.include?('Err') }

p [err.first(2), ok.first(2)]
# mine: [["2: Err", "4: Err"], ["1: OK", "3: OK"]]
# yours: [["2: Err", "4: Err"], ["5: OK"]]

...because yours consumes both kinds of rows while producing errs.

I agree that, if not the "whole point", "it consumes much less memory" is an implicit expectation of a lazy enumerator. But I believe having a (well-documented) quirk in partition is better than not having a lazy partition at all.
