I'd like to propose MatchData#bytebegin and MatchData#byteend.
These methods are similar to MatchData#begin and MatchData#end, but return offsets in bytes instead of codepoints.
One of the use cases is scanning strings: https://github.com/ruby/net-imap/pull/286/files
MatchData#byteend is faster than MatchData#byteoffset because there is no need to allocate an Array.
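To make the difference concrete, here is a minimal sketch of character offsets versus byte offsets. The helper match_byte_end is hypothetical and only for illustration. It uses the proposed MatchData#byteend when the running Ruby provides it, falls back to MatchData#byteoffset otherwise, and as a last resort derives the value from the character offset.

```ruby
# Hypothetical helper, not part of the proposal: returns the byte offset of
# the end of group n, preferring the proposed MatchData#byteend if available.
def match_byte_end(m, n = 0)
  return m.byteend(n) if m.respond_to?(:byteend)           # proposed: no Array allocated
  return m.byteoffset(n)[1] if m.respond_to?(:byteoffset)  # allocates [begin, end]
  m.string[0, m.end(n)].bytesize                           # fallback via character offset
end

text = "あいう"               # three characters, three bytes each in UTF-8
m = /い/.match(text)

char_end = m.end(0)           # offset in characters
byte_end = match_byte_end(m)  # offset in bytes

puts char_end  # prints 2
puts byte_end  # prints 6
```

With byteend available, the call site shrinks to m.byteend(0) and no intermediate Array is created.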
Here's a benchmark result:
voyager:ruby$ cat b.rb
require "benchmark"
require "strscan"

text = "あ" * 100000

Benchmark.bmbm do |b|
  b.report("byteoffset(0)[1]") do
    pos = 0
    while text.byteindex(/\G./, pos)
      pos = $~.byteoffset(0)[1]
    end
  end

  b.report("byteend(0)") do
    pos = 0
    while text.byteindex(/\G./, pos)
      pos = $~.byteend(0)
    end
  end
end
voyager:ruby$ ./tool/runruby.rb b.rb
Rehearsal ----------------------------------------------------
byteoffset(0)[1]   0.020558   0.000393   0.020951 (  0.020963)
byteend(0)         0.018149   0.000000   0.018149 (  0.018151)
------------------------------------------- total: 0.039100sec

                       user     system      total        real
byteoffset(0)[1]   0.020821   0.000000   0.020821 (  0.020822)
byteend(0)         0.017455   0.000000   0.017455 (  0.017455)
Does this difference matter in realistic usage (e.g. the net-imap case)? How much of an improvement is it there?
I guess the difference doesn't matter much compared to I/O etc., but it's frustrating to write code like $~.byteoffset(0)[1] when only the end offset is needed.
Regarding naming, byteend seems hard to read. I think byte_begin/byte_end is much clearer.
I proposed byteend for consistency with existing methods such as byteoffset.
If we choose byte_end, it may be better to introduce new aliases for such existing methods.
I understand the use case. I agree with adding the feature, but I don't like the name. The names bytebegin and byteend follow the byteindex tradition, but they are very hard to read (especially byteend). Any other name suggestions?
I came up with the names begin_in_bytes and end_in_bytes, but byte_begin / byte_end, as suggested by Eregon, may be better.