Bug #20512: Order of magnitude performance difference in single character slicing UTF-8 strings before and after length method is executed - Ruby - Ruby Issue Tracking System

Added by giner (Stanislav German-Evtushenko) almost 2 years ago. Updated almost 2 years ago.

Status: Closed
Assignee: -
Target version: -
ruby -v: ruby 3.2.2 (2023-03-30 revision e51014f9c0) [x86_64-linux]
Backport: 3.1: UNKNOWN, 3.2: UNKNOWN, 3.3: UNKNOWN

[ruby-core:118037]

Description

Slicing a single character out of a UTF-8 string becomes ~15 times faster after the method "length" has been called on the string.

# Single-byte characters
letters = ("a".."z").to_a
length = 100000

str = length.times.map{letters[rand(26)]}.join

# Slow
start = Time.now
length.times{|i| str[i]}
puts Time.now - start  # 0.169156201

str.length  # performance hack

# Fast
start = Time.now
length.times{|i| str[i]}
puts Time.now - start  # 0.009883919


# Multibyte UTF-8 characters
letters = ("а".."я").to_a
length = 10000

str = length.times.map{letters[rand(26)]}.join

# Slow
start = Time.now
length.times{|i| str[i]}
puts Time.now - start  # 0.326204007

str.length  # performance hack

# Fast
start = Time.now
length.times{|i| str[i]}
puts Time.now - start  # 0.016943093

Updated by byroot (Jean Boussier) almost 2 years ago
#1 [ruby-core:118040]

What is happening here is that length triggers a scan of the string's coderange.

And when the coderange is unknown, String#[] is slower for variable-length character encodings (like UTF-8).
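
To make "variable-length" concrete, here is a small sketch (the sample strings and variable names are made up for illustration). In UTF-8 the characters of one string can occupy different numbers of bytes, so the byte offset of character i is not known without looking at the bytes before it. In a fixed-width encoding such as UTF-32LE the offset is simply 4 * i.

```ruby
# Hypothetical sample data: one 1-byte, two 2-byte and one 3-byte character.
mixed = "aбв€"

p mixed.length                  # => 4 (characters)
p mixed.bytesize                # => 8 (bytes)
p mixed.chars.map(&:bytesize)   # => [1, 2, 2, 3]

# In a fixed-width encoding the character index maps directly to a byte offset.
wide = mixed.encode(Encoding::UTF_32LE)
p wide.bytesize                        # => 16
p wide.byteslice(4 * 2, 4) == wide[2]  # => true
```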

On 3.3:

require 'json'
require 'objspace'
require 'benchmark'

# Single byte symbols
letters = ("a".."z").to_a
length = 100000

str = length.times.map{letters[rand(26)]}.join

# Slow
p Benchmark.realtime { length.times{|i| str[i]} }
p Benchmark.realtime { length.times{|i| str[i]} }
puts JSON.parse(ObjectSpace.dump(str))["coderange"]
p Benchmark.realtime { str.length }  # performance hack
puts JSON.parse(ObjectSpace.dump(str))["coderange"]

# Fast
p Benchmark.realtime { length.times{|i| str[i]} }

$ ruby -v /tmp/str.rb
ruby 3.3.1 (2024-04-23 revision c56cd86388) [arm64-darwin23]
0.17216699989512563
0.1763450000435114
unknown
5.999580025672913e-06
7bit
0.004894999787211418

See how coderange changes from unknown to 7bit. That allows String#[] to treat the string as pure ASCII, so it can compute the substring position directly with a simple offset.
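
The difference between the two lookup strategies can be modelled in plain Ruby. This is only an illustrative sketch: CRuby does this in C, and the method names char_at_ascii and char_at_utf8 are hypothetical, not part of any API. It also assumes valid UTF-8 input.

```ruby
# Pure-ASCII case: character index == byte offset, so one slice is enough.
def char_at_ascii(str, index)
  str.byteslice(index, 1)
end

# General UTF-8 case: walk from the start of the string and count character
# starts, skipping continuation bytes (bit pattern 10xxxxxx), until the
# requested character is found. The cost grows with the index.
def char_at_utf8(str, index)
  seen = -1
  start = nil
  str.each_byte.with_index do |byte, offset|
    next if (byte & 0xC0) == 0x80        # continuation byte, not a new character
    seen += 1
    if seen == index
      start = offset                     # first byte of the wanted character
    elsif seen == index + 1
      return str.byteslice(start, offset - start)
    end
  end
  # Either the wanted character was the last one, or the index was out of range.
  start && str.byteslice(start, str.bytesize - start)
end

p char_at_ascii("hello", 1)   # => "e"
p char_at_utf8("привет", 2)   # => "и"
```

Calling the second method for every index of a long string repeats the walk each time, which is the kind of cost the benchmark above exposes.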

The question here is whether String#[] should itself trigger scanning the coderange. It would definitely make some code faster, but may slow down other code, so it's a bit debatable. I'd be in favor of it.
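
Independently of how that question is settled, a loop that only needs every character in order does not have to index at all. As a minimal sketch (the sample string is made up), each_char walks the string once and yields the same characters as indexing position by position:

```ruby
str = "привет, world"

# Index-based: every str[i] has to locate the i-th character.
by_index = (0...str.length).map { |i| str[i] }

# Iterator-based: a single pass over the string.
by_iter = str.each_char.to_a

p by_index == by_iter   # => true
```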

Updated by nobu (Nobuyoshi Nakada) almost 2 years ago
#2

Status changed from Open to Closed

Applied in changeset git|7d144781a93df66379922717da711a09d1cf78ff.

[Bug #20512] Set coderange in Range#each of strings
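
The changeset title suggests that strings produced while iterating a String range now carry a known coderange. A sketch for checking this on a given Ruby build, reusing the ObjectSpace.dump approach from comment #1. The helper name coderange_of is hypothetical, and the printed values depend on the Ruby version, so no expected output is shown.

```ruby
require 'json'
require 'objspace'

# Hypothetical helper: report the coderange that ObjectSpace.dump shows for str.
# Returns nil if this Ruby's dump format has no "coderange" field.
def coderange_of(str)
  JSON.parse(ObjectSpace.dump(str))["coderange"]
end

# Strings yielded by a String range, as used to build the test data above.
("a".."e").each { |s| puts "#{s}: #{coderange_of(s)}" }

# The joined string that the benchmark then slices.
joined = ("a".."e").to_a.join
puts "joined: #{coderange_of(joined)}"
```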
