Feature #20576

Add MatchData#bytebegin and MatchData#byteend

Added by shugo (Shugo Maeda) 11 days ago. Updated 10 days ago.

Status: Open
Assignee: -
Target version:
[ruby-core:118299]

Description

I'd like to propose MatchData#bytebegin and MatchData#byteend.
These methods are similar to MatchData#begin and MatchData#end, but return offsets in bytes instead of code points.
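
As a rough illustration of the difference, the sketch below compares character offsets with byte offsets for a single match. It assumes UTF-8 strings and Ruby 3.2 or later for MatchData#byteoffset; the sample text and variable names are made up for this example, and the proposed methods are called only if the running Ruby provides them.

```ruby
# Sketch only: contrasts character offsets with byte offsets on one match.
# Assumes UTF-8 strings and Ruby 3.2+ (MatchData#byteoffset).
text = "あいう abc"          # each kana is 1 character but 3 bytes in UTF-8
m = text.match(/abc/)

char_begin = m.begin(0)      # offsets counted in characters
char_end   = m.end(0)
byte_begin, byte_end = m.byteoffset(0)   # same match, counted in bytes

puts "chars: #{char_begin}...#{char_end}"
puts "bytes: #{byte_begin}...#{byte_end}"

# The proposed methods are only called if this Ruby build provides them.
if m.respond_to?(:bytebegin) && m.respond_to?(:byteend)
  puts "proposed: #{m.bytebegin(0)}...#{m.byteend(0)}"
end
```

Byte offsets can be passed directly to byte-oriented methods such as String#byteslice and String#byteindex, which is what makes them convenient for scanning.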

Pull request: https://github.com/ruby/ruby/pull/10973

One of the use cases is scanning strings: https://github.com/ruby/net-imap/pull/286/files
MatchData#byteend is faster than MatchData#byteoffset because there is no need to allocate an Array.
Here's a benchmark result:

voyager:ruby$ cat b.rb 
require "benchmark"
require "strscan"

text = "あ" * 100000

Benchmark.bmbm do |b|
  b.report("byteoffset(0)[1]") do
    pos = 0
    while text.byteindex(/\G./, pos)
      pos = $~.byteoffset(0)[1]
    end
  end

  b.report("byteend(0)") do
    pos = 0
    while text.byteindex(/\G./, pos)
      pos = $~.byteend(0)
    end
  end
end
voyager:ruby$ ./tool/runruby.rb b.rb           
Rehearsal ----------------------------------------------------
byteoffset(0)[1]   0.020558   0.000393   0.020951 (  0.020963)
byteend(0)         0.018149   0.000000   0.018149 (  0.018151)
------------------------------------------- total: 0.039100sec

                       user     system      total        real
byteoffset(0)[1]   0.020821   0.000000   0.020821 (  0.020822)
byteend(0)         0.017455   0.000000   0.017455 (  0.017455)
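
The allocation claim can also be checked more directly than by timing. The following is a minimal sketch, assuming CRuby (for GC.stat) and Ruby 3.2 or later; the `allocations` helper is hypothetical and not part of the pull request. It counts objects allocated while calling byteoffset(0)[1] repeatedly, and does the same for byteend(0) only when that method exists.

```ruby
# Sketch only: counts object allocations instead of measuring time.
# Assumes CRuby (GC.stat) and Ruby 3.2+ (MatchData#byteoffset).
m = "あいう".match(/い/)
n = 1_000

# Hypothetical helper: number of objects allocated while the block runs.
def allocations
  before = GC.stat(:total_allocated_objects)
  yield
  GC.stat(:total_allocated_objects) - before
end

# byteoffset returns a new two-element Array on every call.
array_allocs = allocations { n.times { m.byteoffset(0)[1] } }
puts "byteoffset(0)[1]: #{array_allocs} objects over #{n} calls"

# The proposed method is measured only if this Ruby build provides it.
if m.respond_to?(:byteend)
  direct_allocs = allocations { n.times { m.byteend(0) } }
  puts "byteend(0):       #{direct_allocs} objects over #{n} calls"
end
```

The exact counts depend on the Ruby build, so the numbers are best read as a comparison between the two calls rather than as absolute values.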

Updated by Eregon (Benoit Daloze) 11 days ago

Does this difference matter in realistic usages (e.g. that net-imap one)? How much improvement is it there?

Regarding naming, byteend seems hard to read; I think byte_begin/byte_end is much clearer.

Updated by shugo (Shugo Maeda) 10 days ago

Eregon (Benoit Daloze) wrote in #note-1:

Does this difference matter in realistic usages (e.g. that net-imap one)? How much improvement is it there?

I guess the difference doesn't matter much compared to I/O etc., but it's frustrating to write code like $~.byteoffset(0)[1] when only the end offset is needed.

Regarding naming, byteend seems hard to read; I think byte_begin/byte_end is much clearer.

I proposed byteend for consistency with existing methods such as byteoffset.
If we choose byte_end, it may be better to introduce new aliases for such existing methods.

Updated by matz (Yukihiro Matsumoto) 10 days ago

I understand the use case. I agree with the addition of the feature, but I don't like the names. The names bytebegin and byteend follow the byteindex tradition, but they are very hard to read (especially byteend). Any other name suggestions?

Matz.

Updated by shugo (Shugo Maeda) 10 days ago

matz (Yukihiro Matsumoto) wrote in #note-3:

I understand the use case. I agree with the addition of the feature, but I don't like the names. The names bytebegin and byteend follow the byteindex tradition, but they are very hard to read (especially byteend). Any other name suggestions?

I came up with the names begin_in_bytes and end_in_bytes, but byte_begin / byte_end, as suggested by Eregon, may be better.
