Feature #15446


Add a method `String#each_match` to the Ruby core

Added by CaryInVictoria (Cary Swoveland) almost 5 years ago. Updated almost 5 years ago.


String#each_match would have two forms:

each_match(pattern) { |match| block } → str
each_match(pattern) → an_enumerator

The latter would be identical to the form gsub(pattern) → enumerator of String#gsub. The former would simply yield the matches to a block and return the receiver.
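
A minimal sketch of how the two forms might behave, written as a pure-Ruby monkey patch on top of the existing gsub. This is only an illustration under that assumption, not the proposed core implementation; in particular, a wrapper like this probably does not make `$~` available inside the caller's block the way gsub itself does.

```ruby
# Illustrative sketch only: String#each_match is not part of Ruby.
class String
  def each_match(pattern)
    # Without a block, reuse the enumerator form of gsub.
    return gsub(pattern) unless block_given?

    # With a block, yield each match. Returning the match itself from the
    # gsub block keeps the (discarded) result string equal to the receiver.
    gsub(pattern) do |match|
      yield match
      match
    end
    self # return the receiver, as the proposal specifies
  end
end

words = []
result = "cat hat mat".each_match(/\w+/) { |w| words << w }
# words now holds the three matches; result is the original string
```

The block form here still builds a throwaway string internally, which a native implementation would presumably avoid.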

When chaining to Enumerable methods, I frequently use the form of gsub that returns an enumerator instead of scan, because scan returns an unneeded temporary array. This use of gsub is also helpful when the pattern contains capture groups, which complicate the use of scan, as in the following example.

Suppose we are given a string and wish to count the number of occurrences of each word that begins and ends with the same letter (case-insensitive).

 str = "Viv and Bob are party animals. Bob and Eve are a couple who met on Christmas Eve. Bob is a regular guy."

 r = /\b(?:[a-z]|([a-z])[a-z]*\1)\b/i

This regular expression reads, "match a word break, followed by one letter or by two or more letters with the last matching the first (case insensitive), all followed by a word break".

 enum = str.each_match(r)
    #=> #<Enumerator: "Viv and Bob are party...a regular guy.":gsub(/\b(?:[a-z]|([a-z])[a-z]*\1)\b/i)> 

We can convert enum to an array to see the words that will be generated by the enumerator and passed to the block.

 enum.to_a
    #=> ["Viv", "Bob", "Bob", "Eve", "a", "Eve", "Bob", "a", "regular"] 


enum.each_with_object(Hash.new(0)) { |word, h| h[word] += 1 }
   #=> {"Viv"=>1, "Bob"=>3, "Eve"=>2, "a"=>2, "regular"=>1} 
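
Because each_match is only proposed, the same count can be reproduced in current Ruby with the enumerator form of gsub. The sketch below uses only core methods and the string and pattern given above; the variable name `counts` is introduced here for illustration.

```ruby
str = "Viv and Bob are party animals. Bob and Eve are a couple who met on Christmas Eve. Bob is a regular guy."
r = /\b(?:[a-z]|([a-z])[a-z]*\1)\b/i

# gsub without a block or replacement returns an enumerator over the full
# matches, regardless of any capture groups in the pattern.
matches = str.gsub(r).to_a

# Chain directly on the enumerator; Hash.new(0) supplies a default count of 0.
counts = str.gsub(r).each_with_object(Hash.new(0)) { |word, h| h[word] += 1 }
```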

We could alternatively use each_match with a block.

 h = Hash.new(0)
 str.each_match(r) { |word| h[word] += 1 }
    #=> "Viv and Bob are party animals. Bob and Eve are a couple who met on Christmas Eve. Bob is a regular guy."
 h #=> {"Viv"=>1, "Bob"=>3, "Eve"=>2, "a"=>2, "regular"=>1} 

This form of each_match has no counterpart with gsub.

Consider now how scan would be used here. When the pattern contains capture groups, scan returns the captures rather than the full matches, so we cannot write

 str.scan(r)
   #=> [["V"], ["B"], ["B"], ["E"], [nil], ["E"], ["B"], [nil], ["r"]] 

Instead we must add a second capture group.

arr = str.scan(/\b((?:[a-z]|([a-z])[a-z]*\2))\b/i)
   #=> [["Viv", "V"], ["Bob", "B"], ["Bob", "B"], ["Eve", "E"], ["a", nil], ["Eve", "E"], ["Bob", "B"], ["a", nil], ["regular", "r"]]


arr.each_with_object(Hash.new(0)) { |(word,_),h| h[word] += 1 }
   #=> {"Viv"=>1, "Bob"=>3, "Eve"=>2, "a"=>2, "regular"=>1}

This works but it's a bit of a dog's breakfast when compared to the use of the proposed method.

The problem with using gsub in this way is that it confuses readers who expect character substitutions to be performed. I also believe that the name of the method (the "sub" in gsub) has caused the form that returns an enumerator to be under-appreciated and under-used.

Some comments below propose that this suggestion be adopted and, in time, the form of gsub that returns an enumerator be deprecated.

