Feature #13016: String#gsub(hash) - Ruby - Ruby Issue Tracking System

Added by shyouhei (Shyouhei Urabe) over 9 years ago. Updated over 9 years ago.

Status: Rejected

Assignee:

Target version:

[ruby-core:78539]

Description

Background: I wanted to drop NKF dependency of my script. By doing so I noticed that I can't purge NKF.nkf '-Z4'. It can neither be rewritten using String#tr, String#encode, nor String#unicode_normalize. It is doable using String#gsub theoretically, but that requires a hand-crafted nontrivial regular expression that exactly matches what Z4 expects to convert. This is almost impossible to do, and is definitely not something debuggable.

Proposal: extend String#gsub so that it also accepts hash as its only argument, specifying input-output mapping.

# now
def convert str
  require 'nkf'
  NKF.nkf '-Z4xm0', str
end

# proposed
def convert str
  map = {  "\u3002" => "\uFF61", "\u300C" => "\uFF62", ... }
  str.gsub map
end

Related issues 1 (0 open — 1 closed)

Updated by shyouhei (Shyouhei Urabe) over 9 years ago
#1 [ruby-core:78540]

Tracker changed from Bug to Feature

Updated by akr (Akira Tanaka) over 9 years ago
#2 [ruby-core:78541]

Is str.gsub(map) a shortcut for str.gsub(Regexp.union(map.keys)) { map[$&] } ?

Updated by shyouhei (Shyouhei Urabe) over 9 years ago
#3 [ruby-core:78542]

Akira Tanaka wrote:

> Is str.gsub(map) a shortcut for str.gsub(Regexp.union(map.keys)) { map[$&] } ?

Kind of yes. I was thinking of str.gsub(Regexp.union(map.keys), map) -equivalent behaviour.
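For reference, String#gsub already accepts a hash as its second argument, so the two spellings discussed here can be compared directly. The following is a minimal sketch that uses only the two table entries from the description; the sample string is made up for illustration.

```ruby
# Two entries taken from the proposal; the real Z4 table would be much larger.
map = { "\u3002" => "\uFF61", "\u300C" => "\uFF62" }
pattern = Regexp.union(map.keys)

# Hypothetical input: an opening corner bracket, ASCII text, a full stop.
sample = "\u300Cabc\u3002"

with_block = sample.gsub(pattern) { map[$&] }  # block form
with_hash  = sample.gsub(pattern, map)         # hash form

puts with_block == with_hash  # prints "true"
```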

Updated by duerst (Martin Dürst) over 9 years ago
#4 [ruby-core:78543]

Shyouhei Urabe wrote:

> I noticed that I can't purge NKF.nkf '-Z4'. It can neither be rewritten using String#tr, String#encode, nor String#unicode_normalize.

Can you give (a pointer to) a detailed description of what NKF, and in particular NKF.nkf -Z4, does exactly? For example, I can't find it at http://blog.layer8.sh/ja/2012/03/31/nkf_command_option/. The following may be related: "-Z converts alphanumerics and some symbols in X0208 to ASCII. -Z1 converts an X0208 space to an ASCII space. -Z2 converts an X0208 space to two ASCII spaces. Choose according to taste." (Does "X0208 space" here mean the full-width space?)

> It is doable using String#gsub theoretically, but that requires a hand-crafted nontrivial regular expression that exactly matches what Z4 expects to convert. This is almost impossible to do, and is definitely not something debuggable.

Please note that String#unicode_normalize, as currently implemented, also uses some huge regular expressions (though program-generated). It has also (hopefully) been debugged successfully, although with the help of test data from Unicode.

Updated by shyouhei (Shyouhei Urabe) over 9 years ago
#5 [ruby-core:78546]

Martin Dürst wrote:

> Shyouhei Urabe wrote:
>
>> I noticed that I can't purge NKF.nkf '-Z4'. It can neither be rewritten using String#tr, String#encode, nor String#unicode_normalize.
>
> Can you give (a pointer to) a detailed description of what NKF, and in particular NKF.nkf -Z4, does exactly? For example, I can't find it at http://blog.layer8.sh/ja/2012/03/31/nkf_command_option/.

There seem to be very few resources describing this feature online.

I learned about it from the command line: the output of "nkf --help" says "4: JISX0208 Katakana to JISX0201 Katakana".
A few minutes of googling showed me that it has been there at least since 2009. https://osdn.net/projects/nkf/news/17482 (Japanese).
It seems this is the particular commit which implemented the feature in nkf: https://github.com/nurse/nkf/commit/958de30bc09aef38f2a44b5da0dbb1bb3c79e7d3
and then copied into our repository in this commit: https://github.com/ruby/ruby/commit/086e5b1a63d77bf5a4ebe10396a430d544fbe505

So in short it converts characters into what Unicode calls the "Halfwidth" ones.
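To make the "Halfwidth" mapping concrete, here is a rough sketch of one way such a table could be generated rather than typed by hand. It assumes that inverting NFKC over the halfwidth block U+FF61..U+FF9F is a reasonable starting point. It does not cover precomposed voiced katakana, and it is not claimed to match what NKF's Z4 actually does. The names halfwidth and table are only illustrative.

```ruby
# Halfwidth CJK punctuation and halfwidth katakana occupy U+FF61..U+FF9F.
halfwidth = (0xFF61..0xFF9F).map { |cp| [cp].pack('U') }

# NFKC maps each halfwidth form to its standard-width counterpart.
# Inverting that yields a standard-width => halfwidth table (an assumption,
# not NKF's own table).
table = halfwidth.to_h { |hw| [hw.unicode_normalize(:nfkc), hw] }

# The two entries from the description appear in the generated table.
puts table["\u3002"] == "\uFF61"
puts table["\u300C"] == "\uFF62"
```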

> Please note that String#unicode_normalize, as currently implemented, also uses some huge regular expressions (though program-generated). It has also (hopefully) been debugged successfully, although with the help of test data from Unicode.

Thank you. That still sounds like a hustle to me. The proposed functionality would make it a lot easier for me to emulate NKF's Z4.

Updated by duerst (Martin Dürst) over 9 years ago
#6 [ruby-core:78547]

Shyouhei Urabe wrote:

> There seem to be very few resources describing this feature online.

Thanks!

> That still sounds like a hustle to me.

Can you write the last sentence in Japanese? The word 'hustle' has lots of meanings, some of them confusing.

> The proposed functionality would make it a lot easier for me to emulate NKF's Z4.

I agree it would make it easier. But I'm not sure about "a lot". The main work needed to implement it is to create the hash. My understanding is that you still have to do that by hand. My suggestion would be to use literal characters, not \u escapes, in most cases because that makes it much easier to spot errors.

Compared to creating the hash, the shortening from

str.gsub(Regexp.union(map.keys)) { map[$&] }

to

str.gsub(map)

seems to be a minor simplification, and one that is easily done by defining a new method:

class String
  def hsub(map)
    gsub(Regexp.union(map.keys)) { map[$&] }
  end
end

I'm not against this feature, but I think it would be good to have some more examples of where it could be useful, and some check that we don't want to use String#gsub(Hash) with some other meaning in the future.

Updated by akr (Akira Tanaka) over 9 years ago
#7 [ruby-core:79160]

Ruby already has enough features to implement String#hsub, as Martin-sensei said.

However, the performance of String#hsub is not good because it creates a regexp object on every call.
I guess that creating a regexp for a big table each time is not acceptable in most cases.
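One way to avoid that cost without changing String is to build the regexp once and reuse it. This is only a sketch: the names table, pattern and convert are hypothetical, and the table holds just the two entries from the description.

```ruby
# Build the table and its regexp once, outside the hot path.
table   = { "\u3002" => "\uFF61", "\u300C" => "\uFF62" }.freeze
pattern = Regexp.union(table.keys)

# Each call reuses the precompiled pattern instead of rebuilding it.
convert = ->(str) { str.gsub(pattern, table) }

puts convert.call("\u300Cabc\u3002") == "\uFF62abc\uFF61"  # prints "true"
```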

Updated by akr (Akira Tanaka) over 9 years ago
#8 [ruby-core:79161]

Status changed from Open to Feedback

Updated by shyouhei (Shyouhei Urabe) over 9 years ago
#9 [ruby-core:79175]

Status changed from Feedback to Rejected

We looked at this issue in yesterday's developer meeting.

While I claimed that the use of a regular expression is an implementation detail I don't want to care about, the attendees said it is better to expose the compiled structure (a regexp or otherwise) for performance. I agree with that, so I am withdrawing this proposal.

One note, however: if you have a table such as h = { 'a' => 'x', 'ab' => 'xy' }, you have to carefully avoid generating gsub(/a|ab/, h). This regexp would never match ab. You have to sort the hash keys by length, longest first, before feeding them to Regexp.union.
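The pitfall can be reproduced with the table from the note above. In this small sketch, naive and sorted are illustrative names.

```ruby
h = { 'a' => 'x', 'ab' => 'xy' }

# Keys in insertion order: the alternation tries 'a' before 'ab'.
naive = Regexp.union(h.keys)

# Longest keys first, so 'ab' is tried before its prefix 'a'.
sorted = Regexp.union(h.keys.sort_by { |k| -k.length })

puts 'ab'.gsub(naive, h)   # prints "xb"
puts 'ab'.gsub(sorted, h)  # prints "xy"
```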

Updated by shyouhei (Shyouhei Urabe) about 8 years ago
#10

Has duplicate Feature #14443: Omit 'pattern' parameter in '(g)sub(!)' when 'hash' is given added
