Feature #15446
Open
Add a method `String#each_match` to the Ruby core
Description
String#each_match would have two forms:
each_match(pattern) { |match| block } → str
each_match(pattern) → an_enumerator
The latter would be identical to the form gsub(pattern) → enumerator of String#gsub. The former would simply yield the matches to a block and return the receiver.
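To make the two forms concrete, here is a minimal sketch of how the proposed method might behave, written as a monkey patch on top of the existing String#gsub. This is only an illustration under that assumption: each_match does not exist in Ruby core, and a core implementation would presumably avoid building the throwaway string that gsub produces in the block form.

```ruby
# Hypothetical sketch of the proposed String#each_match (not part of Ruby core).
class String
  def each_match(pattern)
    # Without a block: delegate to the enumerator form of gsub.
    return gsub(pattern) unless block_given?

    # With a block: yield each match, then discard gsub's result.
    # Returning the match from the inner block keeps gsub's replacement a String.
    gsub(pattern) do |match|
      yield match
      match
    end
    self
  end
end

words = []
result = "Viv and Bob".each_match(/\b[a-z]+\b/i) { |w| words << w }
words   #=> ["Viv", "and", "Bob"]
result  #=> "Viv and Bob" (the receiver itself)
```

The block form returns the receiver, as described above, so it can be used purely for its side effects.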
I frequently use the form of gsub that returns an enumerator instead of scan when chaining to Enumerable methods. That's because scan returns an unneeded temporary array. This use of gsub can also be useful when the pattern contains capture groups, which can be a complication when using scan, as in the following example.
Suppose we are given a string and wish to count the number of occurrences of each word that begins and ends with the same letter (case-insensitive).
str = "Viv and Bob are party animals. Bob and Eve are a couple who met on Christmas Eve. Bob is a regular guy."
r = /\b(?:[a-z]|([a-z])[a-z]*\1)\b/i
This regular expression reads, "match a word break, followed by one letter or by two or more letters with the last matching the first (case insensitive), all followed by a word break".
enum = str.each_match(r)
#=> #<Enumerator: "Viv and Bob are party...a regular guy.":gsub(/\b(?:[a-z]|([a-z])[a-z]*\1)\b/i)>
We can convert enum to an array to see the words that will be generated by the enumerator and passed to the block.
enum.to_a
#=> ["Viv", "Bob", "Bob", "Eve", "a", "Eve", "Bob", "a", "regular"]
Continuing,
enum.each_with_object(Hash.new(0)) { |word, h| h[word] += 1 }
#=> {"Viv"=>1, "Bob"=>3, "Eve"=>2, "a"=>2, "regular"=>1}
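Because each_match is only proposed, the tally above cannot be run as written. The same result can be checked on an existing Ruby by calling gsub with a pattern and no block, which is the enumerator form the proposal is modeled on. The sketch below reuses the string and regular expression from this example.

```ruby
str = "Viv and Bob are party animals. Bob and Eve are a couple who met on Christmas Eve. Bob is a regular guy."
r = /\b(?:[a-z]|([a-z])[a-z]*\1)\b/i

# String#gsub with a pattern and no block returns an enumerator over the
# full matches, regardless of any capture groups in the pattern.
counts = str.gsub(r).each_with_object(Hash.new(0)) { |word, h| h[word] += 1 }
counts #=> {"Viv"=>1, "Bob"=>3, "Eve"=>2, "a"=>2, "regular"=>1}
```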
We could alternatively use each_match with a block.
h = Hash.new(0)
str.each_match(r) { |word| h[word] += 1 }
#=> "Viv and Bob are party animals. Bob and Eve are a couple who met on Christmas Eve. Bob is a regular guy."
h #=> {"Viv"=>1, "Bob"=>3, "Eve"=>2, "a"=>2, "regular"=>1}
This form of each_match has no counterpart with gsub.
Consider now how scan would be used here. Because of the way scan treats capture groups, we cannot write
str.scan(r)
#=> [["V"], ["B"], ["B"], ["E"], [nil], ["E"], ["B"], [nil], ["r"]]
Instead we must add a second capture group.
arr = str.scan(/\b((?:[a-z]|([a-z])[a-z]*\2))\b/i)
#=> [["Viv", "V"], ["Bob", "B"], ["Bob", "B"], ["Eve", "E"], ["a", nil], ["Eve", "E"], ["Bob", "B"], ["a", nil], ["regular", "r"]]
Then
arr.each_with_object(Hash.new(0)) { |(word,_),h| h[word] += 1 }
#=> {"Viv"=>1, "Bob"=>3, "Eve"=>2, "a"=>2, "regular"=>1}
This works but it's a bit of a dog's breakfast when compared to the use of the proposed method.
The problem with using gsub in this way is that it confuses readers who expect character substitutions to be performed. I also believe that the name of the method (the "sub" in gsub) has resulted in the form that returns an enumerator being under-appreciated and under-used.
Some comments below propose that this suggestion be adopted and that, in time, the form of gsub that returns an enumerator be deprecated.