Project

General

Profile

Actions

Feature #12745

closed

String#(g)sub(!) should pass a MatchData to the block, not a String

Added by herwin (Herwin W) about 8 years ago. Updated about 3 years ago.

Status:
Feedback
Target version:
-
[ruby-core:77233]

Description

A simplified (and stupid) example: replace some placeholders in a string with function calls

def placeholder(val)
  raise 'Incorrect value' unless val == 'three'
  '3'
end

str = '1.2.[three].4'
str.gsub!(/\[(\w+)\]/) { |m| placeholder(m) }

This raises the 'incorrect value' because we don't pass the match 'three', but the full string '[three]'. It looks like we have 3 options to fix that:

  1. Match [three] instead of three in the placeholder replacement method
  2. Pass m[1..-2] instead of m to the method (or strip it in placeholder)
  3. Use $1 in the method call, ignore the value that's passed to the block

Options 1 and 2 look kind of code duplication to me (and they're possible in the simplified example, but might get tricky in real situations). I don't like option 3 because you completely ignore the value that's been passed to the block in favor of global variables, you can't use named captures, and writing code this way makes it incompatible with Rubinius.

I think it would be more logical to pass a MatchData (like what you'd get with String#match) instead of a String to the block. The #to_s returns the whole string, so in 90% of the use cases the code could remain unaltered, but the remaining 10% makes it a change that shouldn't be backported to 2.3.

Attached is a very naive patch to pass a matchdata to the block called by String#sub. The additional change in rbinstall.rb was required to run make install, which actually shows an incompatiblity (which I hadn't anticipated)


Files

ruby_string_sub_matchdata.diff (952 Bytes) ruby_string_sub_matchdata.diff herwin (Herwin W), 09/09/2016 06:32 PM

Related issues 4 (2 open2 closed)

Related to Ruby master - Feature #6802: String#scan should have equivalent yielding MatchDataAssignedmatz (Yukihiro Matsumoto)Actions
Related to Ruby master - Feature #5749: new method String#match_all neededAssignedmatz (Yukihiro Matsumoto)Actions
Related to Ruby master - Feature #5606: String#each_match(regexp)FeedbackActions
Related to Ruby master - Feature #546: String#gsub と Strnig#scan のブロックパラメータの一致Rejectedmatz (Yukihiro Matsumoto)09/06/2008Actions

Updated by akr (Akira Tanaka) about 8 years ago

matz's old idea: add String#sg (or String#gs).
ruby-dev:33553

Updated by shyouhei (Shyouhei Urabe) about 8 years ago

We looked at this issue in developer meeting today.

While matz's suggestion of new method was excavated from the past, myself and others argued possibility of extending the gsub method that exists today. Proposals include:

  1. what about just passing a MatchData instead of a String. That hurts no one because almost every time when we need a block to pass to gsub etc, what is needed is $1.
  2. what about passing a MatchData in addition to a String like gsub(){|string, md|}. It is 100% safe even when the block parameter is used.
  3. what about extending the block parameter string with a module, like open-uri does, to define additional method.
  4. there are of course attendees who think a separate method is the cleanest solution.

My idea was #3 but in reality #2 can be a perfect and easy solution. What do you think matz?

Updated by akr (Akira Tanaka) about 8 years ago

Shyouhei Urabe wrote:

  1. what about passing a MatchData in addition to a String like gsub(){|string, md|}. It is 100% safe even when the block parameter is used.

Surprisingly, it may be possible.

% ruby -e '
p "abc".gsub(/b/) {|str| "(#{str})" }
String.prepend Module.new {
  def gsub(*args)
    super(*args) {|str| yield str, $~ }
  end
}
p "abc".gsub(/b/) {|str| "(#{str})" }
p "abc".gsub(/b/) {|str, md| "(#{md.inspect})" }
'
"a(b)c"
"a(b)c"
"a(#<MatchData \"b\">)c"

It is important that String#gsub yields two arguments
instead of an two-elements array which causes incompatiblity.

% ruby -e '
p "abc".gsub(/b/) {|str| "(#{str})" }
String.prepend Module.new {
  def gsub(*args)
    super(*args) {|str| yield [str, $~] }
  end
}
p "abc".gsub(/b/) {|str| "(#{str})" }
p "abc".gsub(/b/) {|str, md| "(#{md.inspect})" }
'
"a(b)c"
"a([\"b\", #<MatchData \"b\">])c"
"a(#<MatchData \"b\">)c"

Updated by herwin (Herwin W) about 8 years ago

About the suggestions by Shyouhei Urabe: even though option 2 is very pragmatic (it solves the problem, keeps backwards compatibility and is a small change), the idea of the following code looks a bit off to me:

str.gsub!(/\[(\w+)\]/) { |_, m| placeholder(m) }

I guess you would use either a String or a MatchObject in the block, but never both (if you'd need them both, you could get the String from the MatchObject of course). Always having to ignore the first argument to the block feels a bit like a hack. Not that I have a better suggestion though.

A separate method that passes a MatchData object would be another solution that feels a bit hacky: String#gsub and String#gs would be almost identical, and which method would be preferred if you don't add a block, but use a second parameter as the replacement string? And without looking at the documentation, I would have no idea what String#gs/String#sg would do, the names are very undescriptive.

Updated by matz (Yukihiro Matsumoto) almost 8 years ago

  • Status changed from Open to Feedback
  • Assignee set to matz (Yukihiro Matsumoto)

Out of Shyouhei's 4 options,

  1. not acceptable for compatibility's sake
  2. not excited, may cause compatibility problem. besides that, String#scan cannot be enhanced by this option
  3. looks nice but I have performance concern
  4. the best solution, if we have a good name candidate.

Matz.

Updated by herwin (Herwin W) almost 8 years ago

What about sub_md as a name?

Updated by herwin (Herwin W) almost 8 years ago

I posted this link to IRC (#ruby on freenode), just to see if anyone had a good name suggestion. The suggestions #matchsub and #msub were offered. I don't really like them, #matchsub sounds like it tries to subtitute a match (exactly what #sub does, #msub reminds me of the /m modifier or regex (multiline, just as #gsub is named after the /g modifier (global)) instead of a MatchData object. (Although I smiled at the suggestion #gsubthewayitshouldhavebeenallalong, the name reminds me to much of mysql_real_escape_string to take serious)

Another option that was suggested was adding a keyword argument to toggle the behaviour. I actually liked that proposal: it keeps backwards compatibility and doesn't result in an explosion of the number of methods

Updated by Eregon (Benoit Daloze) almost 8 years ago

Maybe this functionality should just be an extra keyword argument to these methods?

Like
str.gsub!(/[(\w+)]/, md: true)

Not as concise, but much better on compatibility and clarity of the intent.

Updated by shyouhei (Shyouhei Urabe) almost 8 years ago

Benoit Daloze wrote:

Maybe this functionality should just be an extra keyword argument to these methods?

Like
str.gsub!(/[(\w+)]/, md: true)

We already use that feature for another purpose.

p "emdash".gsub(/md/, "md" => true) #=> "etrueash"

Updated by akr (Akira Tanaka) almost 8 years ago

How about String#gsb and String#sb ?

  • gsb and sb are shorter and gsub and sub.
  • gsb and sb is readable as g-substitute and substitute.

Updated by herwin (Herwin W) almost 8 years ago

Regarding String#gsb and String#sb: It is far from clear what the difference between #sub and #sb is just by the method names. And personally I don't care that much about the length of the method names, I prefer clearness over shortness.

Updated by duerst (Martin Dürst) almost 8 years ago

Herwin W wrote:

Regarding String#gsb and String#sb: It is far from clear what the difference between #sub and #sb is just by the method names. And personally I don't care that much about the length of the method names, I prefer clearness over shortness.

Agreed. I also think that the new methods should have longer names (maybe shorter than gsub_with_match_object, but still longer than just gsub), because most of the time, the old methods will do just fine (they did so for more than 20 years of Ruby history).

Updated by akr (Akira Tanaka) over 6 years ago

How about String#subm and String#gsubm ?

It is bit longer but not so long.

Updated by matz (Yukihiro Matsumoto) over 6 years ago

subm andgsubm are acceptable.

Matz.

Updated by Eregon (Benoit Daloze) over 6 years ago

shyouhei (Shyouhei Urabe) wrote:

Benoit Daloze wrote:

Maybe this functionality should just be an extra keyword argument to these methods?

Like
str.gsub!(/[(\w+)]/, md: true)

We already use that feature for another purpose.

p "emdash".gsub(/md/, "md" => true) #=> "etrueash"

Not with a Symbol though.
So I think gsub(/regexp/, md: true) would not conflict with existing usages.

matz (Yukihiro Matsumoto) wrote:

subm andgsubm are acceptable.

Those names sound very cryptic to me.

Actions #17

Updated by shyouhei (Shyouhei Urabe) about 6 years ago

  • Related to Feature #6802: String#scan should have equivalent yielding MatchData added
Actions #18

Updated by shyouhei (Shyouhei Urabe) about 6 years ago

  • Related to Feature #5749: new method String#match_all needed added
Actions #19

Updated by shyouhei (Shyouhei Urabe) about 6 years ago

Actions #20

Updated by shyouhei (Shyouhei Urabe) about 6 years ago

  • Related to Feature #546: String#gsub と Strnig#scan のブロックパラメータの一致 added

Updated by sawa (Tsuyoshi Sawada) over 5 years ago

What about having Enumerator#with_m, such that

gsub(pattern).with_m{|match, match_data| block}  new_str
sub(pattern).with_m{|match, match_data| block}  new_str

And if we want to expand that to scan, we can introduce

scan  enumerator

And then do

scan.with_m(pattern){|(match, ...), match_data| block}  str

Updated by jeremyevans0 (Jeremy Evans) over 5 years ago

sawa (Tsuyoshi Sawada) wrote:

What about having Enumerator#with_m, such that

gsub(pattern).with_m{|match, match_data| block}  new_str
sub(pattern).with_m{|match, match_data| block}  new_str

This doesn't make sense to add to Enumerator in my opinion, since most Enumerators will not be created from gsub/sub:

[1].each.with_m{}

Possibly gsub and sub could return an instance of an Enumerator subclass that supports the with_m, but that seems like overkill.

Reading through the previously discussed alternative approaches:

  • Passing a second argument to the block breaks if the block is a lambda.
  • A keyword argument via :md is currently allowed but ignored. I don't think that breaks compatibility, but with the discussion of keyword arguments in #14183, introducing an :md keyword argument may make it harder to transition such code to Ruby 3.
  • subm and gsubm are both fairly cryptic in terms of method names, but compared to other approaches, new method names seem to be the best way to handle this if we want to add support for it.

We could leave things as they are and use $1/$2/$~ to access the captures instead the block.

Updated by shugo (Shugo Maeda) about 5 years ago

jeremyevans0 (Jeremy Evans) wrote:

  • Passing a second argument to the block breaks if the block is a lambda.
  • A keyword argument via :md is currently allowed but ignored. I don't think that breaks compatibility, but with the discussion of keyword arguments in #14183, introducing an :md keyword argument may make it harder to transition such code to Ruby 3.
  • subm and gsubm are both fairly cryptic in terms of method names, but compared to other approaches, new method names seem to be the best way to handle this if we want to add support for it.

I prefer to the last approach, but don't like names subm and gsubm.

How about repl and repl_all?
JavaScripts's replace is similar to Ruby's sub, but Ruby has already replace and it's destructive.
repl is an abbreviation for replace. grepl is confusing with grep, so I propose repl_all instead.

And I prefer to match_all for the MatchData version of scan (#5749).

Updated by jrochkind (jonathan rochkind) about 3 years ago

I would love to see this resurrected.

If you want the MatchData in a gsub block, at first it looks like you can just use Regexp.last_match.

But I think this, along with variables like $1, may not be thread-safe? See for instance related at https://bugs.ruby-lang.org/issues/12689 and https://github.com/jruby/jruby/issues/3031#issuecomment-259057232 and https://github.com/jruby/jruby/issues/4868#issuecomment-347276140

If so, it makes having a clear fixed MatchData passed as an arg to block more than just a minor convenience, but necessary to do things like this in a thread-safe way at all? Or I may be misunderstanding.

Updated by Eregon (Benoit Daloze) about 3 years ago

The fix for #12689 seems clear, such variables need to be Thread-local (or even Fiber-local).
That's already what TruffleRuby does (mentioned in that issue) and it seems JRuby is doing the same (https://github.com/jruby/jruby/issues/3031#issuecomment-660601045).
And this is only a problem if a given method activation has blocks and those blocks from that same method call are executed in different threads and one of them mutates $~, which seems already fairly rare.

Actions

Also available in: Atom PDF

Like0
Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0