Project

General

Profile

Actions

Feature #18136

open

take_while_after

Added by zverok (Victor Shepelev) over 3 years ago. Updated almost 2 years ago.

Status:
Open
Assignee:
-
Target version:
-
[ruby-core:105077]

Description

Sorry, I already tried that once (#16441) but I failed to produce the persuasive example.
So I am back with a couple of them, much simpler and clear than my initial.

The proposal itself: Have take_while_after which behaves like take_while but also includes the last element (first where the condition failed). Reason: there are a lot of cases where "the last good item" in enumeration is the distinctive one (one where enumeration should stop, but the item is still good.

Example 1: Take pages from paginated API, the last page will have less items than the rest (and that's how we know it is the last):

(0..).lazy
  .map { |offset| get_page(offset, limit) }
  .take_while_after { |response| response.count == limit } # the last will have, say, 10 items, but should still be included!
  .map { process response somehow }

Example 2: Same as above, but "we should continue pagination" is specified with a separate data key "can_continue":

(0..).lazy
  .map { |offset| get_page(offset, limit) }
  .take_while_after { |response| response['can_continue'] } # the last will have can_continue=false, but still has data
  .map { process response somehow }

Exampe 3: Taking a sentence from a list of tokens like this:

tokens = [
  {text: 'Ruby', type: :word},
  {text: 'is', type: :word},
  {text: 'cool', type: :word},
  {text: '.', type: :punctuation, ends_sentence: true},
  {text: 'Rust', type: :word},
  # ...
]

sentence = tokens.take_while_after { !_1[:ends_sentence] }

(I can get more if it is necessary!)

Neither of those can be solved by "Using take_while with proper condition.", as @matz (Yukihiro Matsumoto) suggested here: https://bugs.ruby-lang.org/issues/16441#note-9

I typically solve it by slice_after { condition }.first, but that's a) uglier and b) greedy when we are working with lazy enumerator (so for API examples, all paginated pages would be fetched at once, and only then processed).

Another consideration in #16441 was an unfortunate naming.
I am leaving it to discussion, though I tend to like #take_upto from #16446.

Updated by duerst (Martin Dürst) over 3 years ago

zverok (Victor Shepelev) wrote:

Example 1: Take pages from paginated API, the last page will have less items than the rest (and that's how we know it is the last):

(0..).lazy
  .map { |offset| get_page(offset, limit) }
  .take_while_after { |response| response.count == limit } # the last will have, say, 10 items, but should still be included!
  .map { process response somehow }

Couldn't this be written with .take_while { |response| response.count > 0 }

Example 2: Same as above, but "we should continue pagination" is specified with a separate data key "can_continue":

(0..).lazy
  .map { |offset| get_page(offset, limit) }
  .take_while_after { |response| response['can_continue'] } # the last will have can_continue=false, but still has data
  .map { process response somehow }

The problem may also be that the overall interface doesn't seem to be designed very well. Marking the last piece of data as special looks wrong; introducing a next object that is marked as not being part of data seems much more appropriate.

Also, a higher-level starting interface (one that produces pages rather than have to start with a series of integers) seems more appropriate.

Updated by zverok (Victor Shepelev) over 3 years ago

(0..).lazy
  .map { |offset| get_page(offset, limit) }
  .take_while_after { |response| response.count == limit } # the last will have, say, 10 items, but should still be included!
  .map { process response somehow }

Couldn't this be written with .take_while { |response| response.count > 0 }

No. Target API (and it is the real one we are working in production currently, and it is an API of a well-known service!) has "(less than limit) results" as the only designation of the last page. It is not the only one doing so. And relying on the fact that the next page would be empty is a) unwanted (we do at least one extra request) and b) not always works: Some of those APIs do throw errors on attempt to access "page after last", and at least one of others does repeat the last page (10 items in my example) infinitely.

The problem may also be that the overall interface doesn't seem to be designed very well. Marking the last piece of data as special looks wrong; introducing a next object that is marked as not being part of data seems much more appropriate.

From a high theoretical point of view, it might be right (though, even this can be argued against: why do additional HTTP call to fetch empty "designates end" object, when we can pass "there is no more" with the data? I believe MOST of pagination APIs work this way!)

Anyway, there is still a lot of APIs/data structures in the wild that explicitly use "the last object also says it is last" approach. And without take_while_after there is no way to represent them as Ruby enumerator.

Updated by nobu (Nobuyoshi Nakada) over 3 years ago

My understanding is that this is “edge-trigger” while take_while is “level-trigger”.
The latter needs the next iteration to know if it changes the “level”, and the former not.
Is this correct?

Updated by zverok (Victor Shepelev) over 3 years ago

@nobu (Nobuyoshi Nakada) yeah, seems about right

(One aside notice is that we have several options for slice_...: after, before, when... But not for take_)

Updated by ufuk (Ufuk Kayserilioglu) over 3 years ago

I don't want to detract from the content of the proposal, nor do I want to bikeshed the name, but personally, take_while_after does not convey the meaning of this behaviour to me. Can I suggest an alternative name of take_upto that works in the opposite way? The Integer#upto method takes all the items including the limit, so Ruby devs should be familiar with what that means. So take_upto should be more obvious a name as to what it is doing: take all the elements up to when the condition succeeds.

Your examples become:

(0..).lazy
  .map { |offset| get_page(offset, limit) }
  .take_upto { |response| response.count < limit } # the last will have, say, 10 items, but should still be included!
  .map { process response somehow }
(0..).lazy
  .map { |offset| get_page(offset, limit) }
  .take_upto { |response| !response['can_continue'] } # the last will have can_continue=false, but still has data
  .map { process response somehow }

Note that your line comments always mention the negative condition, so the code examples are now more inline with their verbalizations.

Updated by zverok (Victor Shepelev) over 3 years ago

@ufuk
I see two problems with your proposal:

  • the upto, as somebody already pointed to me, is existing in Ruby with different meaning (1.upto(5)), so, unfortunately can't be reused here
  • "like while, but with condition negated" is usually called until; but the "also takes the last item" is NOT part of until definition people used to

I don't like _after postfix much, but I don't see a definitely good alternative: in imperative code, the problem is usually solved with "postfix while" (e.g. do action while condition), I don't recall the term for "while including the matched item" in either Ruby or other language.

Updated by Dan0042 (Daniel DeLorme) over 3 years ago

ufuk (Ufuk Kayserilioglu) wrote in #note-5:

I don't want to detract from the content of the proposal, nor do I want to bikeshed the name, but personally, take_while_after does not convey the meaning of this behaviour to me. Can I suggest an alternative name of take_upto that works in the opposite way? The Integer#upto method takes all the items including the limit, so Ruby devs should be familiar with what that means. So take_upto should be more obvious a name as to what it is doing: take all the elements up to when the condition succeeds.
Note that your line comments always mention the negative condition, so the code examples are now more inline with their verbalizations.

+1
take_upto just makes more sense grammatically. And also I find it easier to reason about; we look for the boundary element (or "edge" as nobu says) by having a block that returns true, and iterate on all elements up to and including that boundary. I believe clear semantics can influence the chance of success of a proposal.

Updated by knu (Akinori MUSHA) over 3 years ago

This behavior of take_upto feels to me like drop_after. 🤔

Updated by knu (Akinori MUSHA) over 3 years ago

I want to see use cases without lazy. If typical uses cases you could think of were always with lazy, then a lazy-minded API would be what you'd need in the first place.

Updated by mame (Yusuke Endoh) over 3 years ago

In the dev meeting, we discussed this proposal in more than a hour, and reached no conclusion.

  • The listed use cases are not so clearly convincing. For the Example 2, "can_continue" field often contains a URL to receive the subsequent results (like "can_continue": "http://api.example.com/foo?since=TIMESTAMP"). In this case, this proposal is not usable.
  • Matz said that take_upto is a better name than take_while_after, However, the predicate of take_upto was very arguable. Some people think that [1, 2, 3, 4, 5].take_upto {|v| v == 3 } returns [1, 2, 3], and others think that [1, 2, 3, 4, 5].take_upto {|v| v != 3 } does so.

Updated by zverok (Victor Shepelev) over 3 years ago

  • Description updated (diff)

@knu (Akinori MUSHA)

feels to me like drop_after

Hmm, this actually sounds like a good name.

I want to see use cases without lazy.

Of course.

The first case of my initial ticket (take tokens till the token that ends sentence)

sentence = tokens.take_while_after { !_1[:ends_sentence] }
# or...
sentence = tokens.drop_after { _1[:ends_sentence] }

Here is a code from my other project, navigating some tree (think HTML DOM) and gathering the sequence of parents up to matching one:

class Node
  def route_to_parent(selector)
    Enumerator.produce(parent, &parent).take_while_after { _1 && !_1.match?(selector) }
  end
end

# E.g. some_link.route_to_parent('div#section') # => span, p, div#subsection, div#section

Dates till next Monday, included (this "included" is the gist of what I am proposing!):

Enumerator.produce(Date.today, &:succ).take_while_after{ !_1.monday? }.each { schedule_job(_1) }
# Of course, this can be replaced with take_while { !_1.tuesday? }, but in reality the code was more like 
...take_while_after { !is_first_monday_of_next_month?(_1) }
# ...which is hard to convert to take_while

One of the examples from #14781 initially had #take_while, but it was a bug :)

require 'strscan'
scanner = StringScanner.new('7+38/6')
p Enumerator.produce { scanner.scan(%r{\d+|[-+*/]}) }.slice_after { scanner.eos? }.first
# => ["7", "+", "38", "/", "6"]
# I'd actually prefer
p Enumerator.produce { scanner.scan(%r{\d+|[-+*/]}) }.take_while_after { !scanner.eos? }
# ...or, even
p Enumerator.produce { scanner.scan(%r{\d+|[-+*/]}) }.drop_after { scanner.eos? }

The listed use cases are not so clearly convincing. For the Example 2, "can_continue" field often contains a URL to receive the subsequent results (like "can_continue": "http://api.example.com/foo?since=TIMESTAMP"). In this case, this proposal is not usable.

I am not sure about the value of this argument. Yes, some APIs work this way. Some work the way I describe. If my proposal can't cover 100% of APIs in the world, it is not usable?.. Or, should I just come with a list of real-life APIs that provide continue: true or something along the lines, otherwise nobody believes they exist?..

Matz said that take_upto is a better name than take_while_after

Unfortunately, the upto, if shorter, clashes with other usages of upto (1.upto(20), which implies counting).

I think drop_after (with condition inverted regarding my initial proposal) seems to be quite a good alternative.

Updated by nobu (Nobuyoshi Nakada) over 3 years ago

Then what about take_till?

Enumerator.produce(Date.today, &:succ).take_till(&:monday?).each { schedule_job(_1) }

Updated by mame (Yusuke Endoh) over 3 years ago

zverok (Victor Shepelev) wrote in #note-11:

The listed use cases are not so clearly convincing. For the Example 2, "can_continue" field often contains a URL to receive the subsequent results (like "can_continue": "http://api.example.com/foo?since=TIMESTAMP"). In this case, this proposal is not usable.

I am not sure about the value of this argument. Yes, some APIs work this way. Some work the way I describe. If my proposal can't cover 100% of APIs in the world, it is not usable?.. Or, should I just come with a list of real-life APIs that provide continue: true or something along the lines, otherwise nobody believes they exist?..

In other words, your use case examples looked not so frequent to make it built-in.

Updated by zverok (Victor Shepelev) over 3 years ago

@mame (Yusuke Endoh)

In other words, your use case examples looked not so frequent to make it built-in.

Soooo... Let me summarize :)

In defense of the idea, I provided:

  1. Two big domains: API clients with pagination ("stop after current page"), and working with tokenized sequences ("stop after breaking token"), both well inside Ruby's intended usage;
  2. A few additional cases from production codebases (choosing dates with inclusion; navigating trees)
  3. Explanation how in some cases there is no way to rewrite code with different Enumerable methods (lazy), and in other cases rewritten code is much less clear (slice_after{}.first); so realistically, people tend to switch back to just each with break, or while etc. (which, I believe, is against the best practices)

The answer from the core team is "(empirically,) not all API clients use this kind of pagination, therefore the case is not frequent enough, therefore we can't have take_while_after/drop_after (despite already having several kinds of slice_, chunk_, take_ and drop_)".

Is that what you are saying?

Is there a way to show the case is frequent enough?
Should I provide links to public APIs of popular services which require this approach? If so, how many should be enough?
Should I provide more diverse examples (besides two big domains and some additional cases)? How many? How diverse?
Should I just stop bothering the core team? :)

Updated by Eregon (Benoit Daloze) almost 3 years ago

I recently wanted to have something like this for advent of code, and worked around by using Enumerator.produce + find + side effect but it feels clearly less elegant.
https://github.com/eregon/adventofcode/blob/a582daa1f2c549d459d6b0017b5940b2e96838d4/2021/5a.rb#L17-L20
Problem description at https://adventofcode.com/2021/day/5, in short it counts how many lines cross at a given position.

What I wanted:

  Enumerator.produce(from) { _1 + dir }.take_till { _1 == to }.each { crossings[_1] += 1 }

Workaround:

  Enumerator.produce(from) { _1 + dir }.find {
    crossings[_1] += 1
    _1 == to
  }

Range#step would work, but a creating a Range of Complex raises bad value for range (ArgumentError).
Numeric#step could have worked, but it's not defined for Complex.

For the name, take_until came to mind ("take all elements until the condition is true, including that last element for which it's true").
It seems a nice counterpart to take_while ("take all elements while/for-which the condition is true").
Also in code the similarity with until is clearer:

# take_while
begin
  e = enum.next
  c = condition(e)
  array << e if c
end while c

# take_until
begin
  e = enum.next
  array << e
end until condition(e)

take_till seems fine too, although it sounds less nice than take_until to me.
take_upto is fine by me as well.

drop_after seems confusing, because the use case is to select all elements until a given condition changes, i.e., the focus is to take/select, not to drop/reject.

take_while_after is quite a mouthful and rather unclear from the name what it does.

Updated by Dan0042 (Daniel DeLorme) almost 3 years ago

until is the negative of while, so I would expect take_until{cond} (or the synonym take_till) to behave like take_while{!cond}. So (1..9).take_until{_1==5} would/should produce [1,2,3,4], unlike the current proposal to get [1,2,3,4,5]. Changing the semantics of "until" like this would be extremely confusing.

Updated by Eregon (Benoit Daloze) almost 3 years ago

take_until, from reading the method name, clearly implies to me including the element for which the condition became true, i.e., it includes that "last" element, like 1..3 includes 3.
But if that's too confusing for others then it seems best to use another name.
As I said above, I'm fine with take_till and take_upto too.

Updated by knu (Akinori MUSHA) almost 3 years ago

I'd rather choose an obvious name, like take_until_after.

Updated by knu (Akinori MUSHA) almost 3 years ago

In today's developer meeting, Matz said the only acceptable name so far was "drop_after" (if he had to choose). He's still not quite convinced of the need for this feature.

Updated by rubyFeedback (robert heiler) almost 2 years ago

I think one big problem with the proposed name such as the three word one is that they
are somewhat rare and difficult to remember. In ruby we often have one word, or two words
for methods.

We may have some three words if I remember correctly, but these are quite rare, and it
is a LOT harder to understand "take_while_after" or "do_this_that", than comparable
".upto()" or ".downto()" and so forth.

We kind of have combined "instructions" such as:

[3,4,5].each.with_index

These are a bit different though.

We could of course call such a combined instruction via:

[3,4,5].each_with_index

so just a single method with three words.

But I think semantically it is easier to use .each,
standalone as we do now, and then "aggregate" onto it,
such as via .with_index. That conveys the meaning
quite clearly.

With three word names, as a single method, I think it
quite a lot more difficult to convey meaning.

I'd rather choose an obvious name, like take_until_after.

I think it has a very similar problem as take_while_after.

Actions

Also available in: Atom PDF

Like0
Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0