Bug #20512

closed

Order of magnitude performance difference in single character slicing UTF-8 strings before and after length method is executed

Added by giner (Stanislav German-Evtushenko) about 2 months ago. Updated about 2 months ago.

Status: Closed
Assignee: -
Target version: -
ruby -v: ruby 3.2.2 (2023-03-30 revision e51014f9c0) [x86_64-linux]
[ruby-core:118037]

Description

Slicing a single character out of a UTF-8 string becomes ~15 times faster after the "length" method has been called on the string.

# Single-byte characters
letters = ("a".."z").to_a
length = 100000

str = length.times.map{letters[rand(26)]}.join

# Slow
start = Time.now
length.times{|i| str[i]}
puts Time.now - start  # 0.169156201

str.length  # performance hack

# Fast
start = Time.now
length.times{|i| str[i]}
puts Time.now - start  # 0.009883919


# Multibyte UTF-8 characters
letters = ("а".."я").to_a
length = 10000

str = length.times.map{letters[rand(26)]}.join

# Slow
start = Time.now
length.times{|i| str[i]}
puts Time.now - start  # 0.326204007

str.length  # performance hack

# Fast
start = Time.now
length.times{|i| str[i]}
puts Time.now - start  # 0.016943093

Updated by byroot (Jean Boussier) about 2 months ago

What is happening here is that length triggers scanning the string coderange.

And when the coderange is unknown, String#[] is slower for variable-length character encodings (like UTF-8).
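
To make the cost concrete, here is a small sketch of why a character index maps directly to a byte offset only when every character is one byte wide. It illustrates the idea only and is not the interpreter's actual code path; the sample strings are chosen for illustration.

```ruby
ascii    = "abcdef"   # every character is 1 byte in UTF-8
cyrillic = "абвгде"   # each of these letters is 2 bytes in UTF-8

p ascii.bytesize     # 6
p cyrillic.bytesize  # 12

# In pure ASCII, the character index equals the byte offset.
p ascii[3] == ascii.byteslice(3, 1)        # true

# With 2-byte letters, character 3 starts at byte 6. In general the
# offset is only known after walking the preceding characters.
p cyrillic[3] == cyrillic.byteslice(6, 2)  # true
```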

On 3.3:

require 'json'
require 'objspace'
require 'benchmark'

# Single-byte characters
letters = ("a".."z").to_a
length = 100000

str = length.times.map{letters[rand(26)]}.join

# Slow
p Benchmark.realtime { length.times{|i| str[i]} }
p Benchmark.realtime { length.times{|i| str[i]} }
puts JSON.parse(ObjectSpace.dump(str))["coderange"]
p Benchmark.realtime { str.length }  # performance hack
puts JSON.parse(ObjectSpace.dump(str))["coderange"]

# Fast
p Benchmark.realtime { length.times{|i| str[i]} }

$ ruby -v /tmp/str.rb
ruby 3.3.1 (2024-04-23 revision c56cd86388) [arm64-darwin23]
0.17216699989512563
0.1763450000435114
unknown
5.999580025672913e-06
7bit
0.004894999787211418

Note how the coderange changes from unknown to 7bit. Once it is known to be 7bit, String#[] can treat the string as pure ASCII and compute the substring position directly with a simple offset.

The question here is whether String#[] should trigger scanning the coderange. It would definitely make some code faster but might slow other code down, so it's a bit debatable. I'd be in favor of it.
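
For code that visits every character in order, one way to sidestep the question is to iterate instead of indexing. The sketch below is only an illustration: the variable names and the string size are assumptions, and it checks that both approaches yield the same characters rather than measuring anything.

```ruby
letters = ("а".."я").to_a
count   = 10_000
str     = Array.new(count) { letters.sample }.join

# Index-based access, as in the report: each str[i] has to locate character i.
by_index = []
count.times { |i| by_index << str[i] }

# Iterator-based access: each_char walks the string once, front to back.
by_iter = []
str.each_char { |c| by_iter << c }

p by_index == by_iter  # true
```

Random access by index still needs String#[], so this only helps when the access pattern is sequential.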

Updated by nobu (Nobuyoshi Nakada) about 2 months ago

  • Status changed from Open to Closed

Applied in changeset git|7d144781a93df66379922717da711a09d1cf78ff.


[Bug #20512] Set coderange in Range#each of strings
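
The commit title suggests that strings yielded while iterating a string range now carry a known coderange. A rough way to observe this on a given Ruby build is to reuse the ObjectSpace.dump approach from the comment above. Here coderange_of is a hypothetical helper name, and the printed values depend on the Ruby version, so no output is shown.

```ruby
require 'json'
require 'objspace'

# Hypothetical helper: the cached coderange of a string as reported by
# ObjectSpace.dump. Returns nil if the dump has no "coderange" field.
def coderange_of(str)
  JSON.parse(ObjectSpace.dump(str))["coderange"]
end

letters = ("a".."z").to_a  # built via Range#each on strings
joined  = letters.join

puts coderange_of(letters.first).inspect
puts coderange_of(joined).inspect
```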
