Feature #20576

closed

Add MatchData#bytebegin and MatchData#byteend

Added by shugo (Shugo Maeda) 6 months ago. Updated 5 months ago.

Status:
Closed
Assignee:
-
Target version:
[ruby-core:118299]

Description

I'd like to propose MatchData#bytebegin and MatchData#byteend.
These methods are similar to MatchData#begin and MatchData#end, but return offsets in bytes instead of codepoints.
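
To illustrate the difference, here is a minimal sketch that compares the character-based and byte-based offsets of the same match. The helper name `byte_bounds` is hypothetical and only for illustration; it falls back to MatchData#byteoffset on Rubies that do not have the proposed methods, and assumes a UTF-8 source file.

```ruby
# Hypothetical helper (not part of the proposal): returns [start, stop] in bytes.
# Uses the proposed MatchData#bytebegin / #byteend when they exist, otherwise
# falls back to MatchData#byteoffset, which returns both offsets as an Array.
def byte_bounds(match, group = 0)
  if match.respond_to?(:byteend)
    [match.bytebegin(group), match.byteend(group)]
  else
    match.byteoffset(group)
  end
end

# Each of these characters occupies 3 bytes in UTF-8.
m = /い+/.match("ああいいう")

char_bounds = [m.begin(0), m.end(0)]  # offsets in characters: [2, 4]
bounds = byte_bounds(m)               # offsets in bytes:      [6, 12]
```

For ASCII-only strings the two pairs are identical; they diverge only when the text before or inside the match contains multibyte characters.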

Pull request: https://github.com/ruby/ruby/pull/10973

One of the use cases is scanning strings: https://github.com/ruby/net-imap/pull/286/files
MatchData#byteend is faster than MatchData#byteoffset because there is no need to allocate an Array.
Here's a benchmark result:

voyager:ruby$ cat b.rb 
require "benchmark"
require "strscan"

text = "あ" * 100000

Benchmark.bmbm do |b|
  b.report("byteoffset(0)[1]") do
    pos = 0
    while text.byteindex(/\G./, pos)
      pos = $~.byteoffset(0)[1]
    end
  end

  b.report("byteend(0)") do
    pos = 0
    while text.byteindex(/\G./, pos)
      pos = $~.byteend(0)
    end
  end
end
voyager:ruby$ ./tool/runruby.rb b.rb           
Rehearsal ----------------------------------------------------
byteoffset(0)[1]   0.020558   0.000393   0.020951 (  0.020963)
byteend(0)         0.018149   0.000000   0.018149 (  0.018151)
------------------------------------------- total: 0.039100sec

                       user     system      total        real
byteoffset(0)[1]   0.020821   0.000000   0.020821 (  0.020822)
byteend(0)         0.017455   0.000000   0.017455 (  0.017455)
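
As a rough sketch of the scanning use case, the same loop can collect tokens by slicing between byte positions. The method name `scan_tokens` and the sample pattern are assumptions made for illustration, not code from net-imap; the pattern is expected to start with `\G` so that each match is anchored at the current position.

```ruby
# Hypothetical tokenizer sketch: repeatedly match at the current byte position
# and advance to the byte offset where the match ended.
def scan_tokens(text, pattern)
  tokens = []
  pos = 0
  while text.byteindex(pattern, pos)
    m = Regexp.last_match
    # Prefer the proposed byteend; otherwise take the second element of byteoffset.
    stop = m.respond_to?(:byteend) ? m.byteend(0) : m.byteoffset(0)[1]
    break if stop == pos  # guard against zero-width matches looping forever
    tokens << text.byteslice(pos, stop - pos)
    pos = stop
  end
  tokens
end

# Runs of non-space characters and runs of whitespace become separate tokens.
tokens = scan_tokens("あい うえ", /\G(?:\S+|\s+)/)
```

Only the end offset is needed to advance the loop, which is why a dedicated byteend avoids the intermediate Array.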

Updated by Eregon (Benoit Daloze) 6 months ago

Does this difference matter in realistic usages (e.g. that net-imap one)? How much improvement is it there?

Regarding naming, byteend seems hard to read; I think byte_begin/byte_end is much clearer.

Updated by shugo (Shugo Maeda) 6 months ago

Eregon (Benoit Daloze) wrote in #note-1:

Does this difference matter in realistic usages (e.g. that net-imap one)? How much improvement is it there?

I guess the difference doesn't matter much compared to I/O etc., but it's frustrating to write code like $~.byteoffset(0)[1] when only the end offset is needed.

Regarding naming, byteend seems hard to read; I think byte_begin/byte_end is much clearer.

I proposed byteend for consistency with existing methods such as byteoffset.
If we choose byte_end, it may be better to introduce new aliases for such existing methods.

Updated by matz (Yukihiro Matsumoto) 6 months ago

I understand the use-case. I agree with the addition of the feature, but I don't like the name. The names bytebegin and byteend follow the byteindex tradition, but they are very hard to read (especially byteend). Any other name suggestions?

Matz.

Updated by shugo (Shugo Maeda) 6 months ago

matz (Yukihiro Matsumoto) wrote in #note-3:

I understand the use-case. I agree with the addition of the feature, but I don't like the name. The names bytebegin and byteend follow the byteindex tradition, but they are very hard to read (especially byteend). Any other name suggestions?

I came up with names begin_in_bytes and end_in_bytes, but byte_begin / byte_end suggested by Eregon may be better.

Updated by matz (Yukihiro Matsumoto) 5 months ago

OK. I didn't like the names (especially byteend), but after looking at them for a while I got used to them and am ready to compromise.

Matz.

Updated by shugo (Shugo Maeda) 5 months ago

  • Status changed from Open to Closed

Applied in changeset git|e048a073a3cba04576b8f6a1673c283e4e20cd90.


Add MatchData#bytebegin and MatchData#byteend

These methods return the byte-based offset of the beginning or end of the specified match.

[Feature #20576]
