Project

General

Profile

Actions

Feature #18262

open

Enumerator::Lazy#partition

Added by zverok (Victor Shepelev) about 1 month ago. Updated 16 days ago.

Status:
Open
Priority:
Normal
Assignee:
-
Target version:
-
[ruby-core:105750]

Description

(Part of my set of proposals about making .lazy more useful/popular.)

Currently:

file = File.open('very-large-file.txt')
lines_with_errors, lines_without_errors = file.lazy.partition { _1.start_with?('E:') }
lines_with_errors.class
# => Array, all file is read by this moment

This might be not very practical performance-wise and memory-wise.

I am thinking that maybe returning a pair of lazy enumerators might be a good addition to Enumerator::Lazy

Naive prototype:

class Enumerator::Lazy
  def partition(&block)
    buffer1 = []
    buffer2 = []
    source = self

    [
      Enumerator.new { |y|
        loop do
          if buffer1.empty?
            begin
              item = source.next
              if block.call(item)
                y.yield(item)
              else
                buffer2.push(item)
              end
            rescue StopIteration
              break
            end
          else
            y.yield buffer1.shift
          end
        end
      }.lazy,
      Enumerator.new { |y|
        loop do
          if buffer2.empty?
            begin
              item = source.next
              if !block.call(item)
                y.yield(item)
              else
                buffer1.push(item)
              end
            rescue StopIteration
              break
            end
          else
            y.yield buffer2.shift
          end
        end
      }.lazy
    ]
  end
end

Testing it:

Enumerator.produce(1) { |i| puts "processing #{i}"; i + 1 }.lazy
  .take(30)
  .partition(&:odd?)
  .then { |odd, even|
    p odd.first(3), even.first(3)
  }
# Prints:
# processing 1
# processing 2
# processing 3
# processing 4
# processing 5
# [1, 3, 5]
# [2, 4, 6]

As you might notice by the "processing" log, it only fetched the amount of entries that was required by produced enumerators.

The drawback would be—as my prototype implementation shows—the need of internal "buffering" (I don't think it is possible to implement lazy partition without it), but it still might be worth a shot?

Updated by Dan0042 (Daniel DeLorme) about 1 month ago

+1
Since a lazy enumerator is produced for both #select and #reject, it would make sense for #partition as well.

Updated by knu (Akinori MUSHA) 18 days ago

I agree this would be a good addition, and I think the existing users of lazy would understand the incompatibility this would bring is a necessary step to make partition more useful.

However, the buffering could be a pitfall for new users. In today's developer meeting, Matz and I agreed to suggest that the behavior should be well documented. If you were dividing a huge (or infinite) list into two where one enumerator would yield a value extremely less likely than the other and you read from the first enumerator, the buffer could become huge. That is not straightforward from what you normally expect from "lazy", so it should be noted in the documentation.

Updated by Dan0042 (Daniel DeLorme) 17 days ago

I don't think buffering is a problem. If you consider that the current implementation fully buffers the two sub-lists before returning them, the lazy implementation literally cannot cause more buffering than the current non-lazy implementation.

Updated by knu (Akinori MUSHA) 17 days ago

Dan0042 (Daniel DeLorme) wrote in #note-3:

I don't think buffering is a problem. If you consider that the current implementation fully buffers the two sub-lists before returning them, the lazy implementation literally cannot cause more buffering than the current non-lazy implementation.

Buffering may be a whole new idea in the history of Enumerator::Lazy, so it won't hurt to document that the new Lazy#partition implementation can consume a lot of memory.

Updated by Eregon (Benoit Daloze) 16 days ago

Mmh, I think the whole point of Enumerator::Lazy is it uses constant memory and does not buffer arbitrary amount of elements.

Shouldn't it simply behave like:

class Enumerator::Lazy
  def partition(&block)
    [select(&block), reject(&block)]
  end
end

No buffering, no surprises, consistent with general Lazy methods.
The block will be executed 2 times per element if both enumerators iterate that element.
That seems completely fine since it's a predicate, and expected given you'd do the same with select/reject manually.

Updated by Dan0042 (Daniel DeLorme) 16 days ago

I wouldn't say that constant memory is the "whole point" of Enumerator::Lazy. It's more about performing the minimum amount of computation needed, only when needed. Executing the block twice for each element is not minimal, and much more surprising to me than any amount of buffering. What if there are side effects in the block?

Updated by zverok (Victor Shepelev) 16 days ago

Eregon (Benoit Daloze) your version is not the same for effectfull enumerators (where .lazy is extremely useful):

require 'stringio'

str = StringIO.new(<<~ROWS)
1: OK
2: Err
3: OK
4: Err
5: OK
6: Err
ROWS


err, ok = str.each_line(chomp: true).lazy.partition { _1.include?('Err') }

p [err.first(2), ok.first(2)]
# mine: [["2: Err", "4: Err"], ["1: OK", "3: OK"]]
# yours: [["2: Err", "4: Err"], ["5: OK"]]

...because yours is consuming both kinds of rows while producing errs.

I agree that if not a "whole point", the "it consumes much less memory" is an implicit expectation of a lazy enumerator. But I believe having (well-documented) quirk in partition is better than not having lazy partition at all.

Actions

Also available in: Atom PDF