Feature #15446
Open
Add a method `String#each_match` to the Ruby core
Description
`String#each_match` would have two forms:
each_match(pattern) { |match| block } → str
each_match(pattern) → an_enumerator
The latter would be identical to the `gsub(pattern) → enumerator` form of `String#gsub`. The former would simply yield each match to the block and return the receiver.
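To make the two forms concrete, here is a minimal sketch of how the proposed method could behave. It is an assumption for illustration only, written by reopening `String` in plain Ruby; it is not the proposed core implementation. It relies on `String#scan` with a block setting `$~` for each match.

```ruby
# Hypothetical sketch: String#each_match is only a proposal, not part of Ruby core.
class String
  def each_match(pattern)
    # Without a block, return an enumerator over the matched substrings.
    return enum_for(:each_match, pattern) unless block_given?

    # scan with a block sets $~ for every match; $~[0] is the whole match,
    # whether or not the pattern contains capture groups.
    scan(pattern) { yield $~[0] }
    self # the block form returns the receiver, with no substitution
  end
end

s = "Bob met Eve"
pattern = /\b([a-z])[a-z]*\1\b/i

found = []
returned = s.each_match(pattern) { |word| found << word }
```

Here `found` collects the whole matches even though the pattern has a capture group, and `returned` is the receiver itself.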
I frequently use the form of `gsub` that returns an enumerator instead of `scan` when chaining to Enumerable methods. That's because `scan` returns an unneeded temporary array. This use of `gsub` can also be useful when the pattern contains capture groups, which can be a complication when using `scan`, as in the following example.
Suppose we are given a string and wish to count the number of occurrences of each word that begins and ends with the same letter (case-insensitive).
str = "Viv and Bob are party animals. Bob and Eve are a couple who met on Christmas Eve. Bob is a regular guy."
r = /\b(?:[a-z]|([a-z])[a-z]*\1)\b/i
This regular expression reads, "match a word break, followed by one letter or by two or more letters with the last matching the first (case insensitive), all followed by a word break".
enum = str.each_match(r)
#=> #<Enumerator: "Viv and Bob are party...a regular guy.":gsub(/\b(?:[a-z]|([a-z])[a-z]*\1)\b/i)>
We can convert `enum` to an array to see the words that will be generated by the enumerator and passed to the block.
enum.to_a
#=> ["Viv", "Bob", "Bob", "Eve", "a", "Eve", "Bob", "a", "regular"]
Continuing,
enum.each_with_object(Hash.new(0)) { |word, h| h[word] += 1 }
#=> {"Viv"=>1, "Bob"=>3, "Eve"=>2, "a"=>2, "regular"=>1}
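Because `each_match` is only proposed, the enumerator form shown so far can be tried in current Ruby by calling `gsub` with a pattern and no block or replacement. The snippet below is a sketch of that existing behaviour; the variable names `words` and `counts` are mine.

```ruby
str = "Viv and Bob are party animals. Bob and Eve are a couple who met on Christmas Eve. Bob is a regular guy."
r = /\b(?:[a-z]|([a-z])[a-z]*\1)\b/i

# With only a pattern, gsub returns an Enumerator that yields each whole match.
enum = str.gsub(r)

# The whole matches, ignoring the capture group.
words = enum.to_a

# Count occurrences without building an intermediate array of matches.
counts = str.gsub(r).each_with_object(Hash.new(0)) { |word, h| h[word] += 1 }
```

On Ruby 2.7 or later, `str.gsub(r).tally` should give the same hash.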
We could alternatively use `each_match` with a block.
h = Hash.new(0)
str.each_match(r) { |word| h[word] += 1 }
#=> "Viv and Bob are party animals. Bob and Eve are a couple who met on Christmas Eve. Bob is a regular guy."
h #=> {"Viv"=>1, "Bob"=>3, "Eve"=>2, "a"=>2, "regular"=>1}
This form of `each_match` has no counterpart with `gsub`.
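A short illustration of why: when `gsub` is given a block, it builds a new string from the block's return values rather than returning the receiver. The closest existing block form is `scan`, which does return the receiver but yields capture groups instead of the whole match when the pattern has any. The strings and names below are my own small example.

```ruby
s = "aa b aa"

# gsub with a block: each match is replaced by the block's return value,
# here the running count, so the result is a different string.
h = Hash.new(0)
substituted = s.gsub(/\w+/) { |word| h[word] += 1 }

# scan with a block returns the receiver and yields the whole match,
# but only because this pattern has no capture groups.
h2 = Hash.new(0)
returned = s.scan(/\w+/) { |word| h2[word] += 1 }
```

After this runs, `substituted` is `"1 1 2"` while `returned` is `s` itself; both hashes hold the same counts.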
Consider now how `scan` would be used here. Because of the way `scan` treats capture groups, we cannot write
str.scan(r)
#=> [["V"], ["B"], ["B"], ["E"], [nil], ["E"], ["B"], [nil], ["r"]]
Instead we must add a second capture group.
arr = str.scan(/\b((?:[a-z]|([a-z])[a-z]*\2))\b/i)
#=> [["Viv", "V"], ["Bob", "B"], ["Bob", "B"], ["Eve", "E"], ["a", nil], ["Eve", "E"], ["Bob", "B"], ["a", nil], ["regular", "r"]]
Then
arr.each_with_object(Hash.new(0)) { |(word,_),h| h[word] += 1 }
#=> {"Viv"=>1, "Bob"=>3, "Eve"=>2, "a"=>2, "regular"=>1}
This works but it's a bit of a dog's breakfast when compared to the use of the proposed method.
The problem with using `gsub` in this way is that it confuses readers who expect character substitutions to be performed. I also believe that the name of this method (the "sub" in `gsub`) has caused the form that returns an enumerator to be under-appreciated and under-used.
Some comments below propose that this suggestion be adopted and, in time, the form of `gsub` that returns an enumerator be deprecated.
Updated by duerst (Martin Dürst) about 6 years ago
This looks like a good idea. Actually, I might suggest that we go even further: we introduce a new method and deprecate (and ultimately remove) the functionality of producing an enumerator by gsub.
(I wouldn't mind keeping producing an enumerator with gsub, but only if that resulted in actual substitutions.)
Updated by shevegen (Robert A. Heiler) about 6 years ago
The idea suggested by Cary seems fine to me. We have to ask matz what he thinks about the proposed idea, the name choice and the functionality.
I would suggest, however, that any deprecation be done at a later time, if necessary, or be decoupled from the suggestion here for now. The reason is mostly that deprecation (and then removing functionality) is a little bit different from the proposal of adding new functionality (e.g. #matches or any other name on class String). I think the deprecation could be done at a later step or in another proposal. (I don't know if anyone depends on producing an enumerator by gsub, but in my opinion it would be simpler to bypass that question for now and focus only on the method addition Cary proposed.)
Updated by sos4nt (Stefan Schüßler) about 6 years ago
Regarding the name – I'd prefer `String#each_match`.
And it should accept an optional block which yields the matches and (as opposed to `gsub`) returns the receiver (i.e. no substitution):
each_match(pattern) { |match| block } → str
each_match(pattern) → an_enumerator
Updated by CaryInVictoria (Cary Swoveland) about 6 years ago
Stefan, I've incorporated both of your suggestions. Thanks.
Updated by CaryInVictoria (Cary Swoveland) about 6 years ago
- Subject changed from Add a method `String#matches` to the Ruby core to Add a method `String#each_match` to the Ruby core
- Description updated
Updated by sawa (Tsuyoshi Sawada) almost 6 years ago
I would rather propose to have `String#scan` take an optional second argument that is comparable to the optional second argument `capture` of `String#[]` after a regexp argument:
r = /\b([a-z]|([a-z])[a-z]*\2)\b/i
str[r] # => "Viv"
str[r, 0] # => "Viv"
str[r, 1] # => "Viv"
str[r, 2] # => "V"
so that it should work like this:
str.scan(r) # => [["Viv", "V"], ["Bob", "B"], ["Bob", "B"], ["Eve", "E"], ["a", nil], ["Eve", "E"], ["Bob", "B"], ["a", nil], ["regular", "r"]]
str.scan(r, 0) # => ["Viv", "Bob", "Bob", "Eve", "a", "Eve", "Bob", "a", "regular"]
str.scan(r, 1) # => ["Viv", "Bob", "Bob", "Eve", "a", "Eve", "Bob", "a", "regular"]
str.scan(r, 2) # => ["V", "B", "B", "E", nil, "E", "B", nil, "r"]
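The behaviour suggested here can be approximated today with a small helper. This is only a sketch: `scan_capture` is a hypothetical name of mine, not an existing or proposed method, and it assumes the inner backreference in the pattern refers to the second group (`\2`) so that group 1 wraps the whole word.

```ruby
# Hypothetical helper approximating scan(pattern, capture); not part of Ruby.
def scan_capture(str, pattern, capture)
  results = []
  # scan with a block sets $~ for every match; index it like String#[] does:
  # 0 is the whole match, 1..n are the capture groups.
  str.scan(pattern) { results << $~[capture] }
  results
end

str = "Viv and Bob are party animals. Bob and Eve are a couple who met on Christmas Eve. Bob is a regular guy."
r = /\b([a-z]|([a-z])[a-z]*\2)\b/i

whole   = scan_capture(str, r, 0) # whole matches
group1  = scan_capture(str, r, 1) # same words, since group 1 wraps the match
letters = scan_capture(str, r, 2) # first letters, nil for one-letter words
```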