Bug #8129: String#index has drastically different performance when a single unicode character is included - Ruby - Ruby Issue Tracking System

closed

Added by zmoazeni (Zach Moazeni) about 13 years ago. Updated about 13 years ago.

Status: Rejected
Assignee:
Target version:
ruby -v: 2.0.0-p0
Backport:
[ruby-core:53559]

Description

=begin
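I created a simple Ruby script:

  #! /usr/bin/env ruby

  raise "need a file name" unless ARGV[0]
  contents = File.read(ARGV[0])

  # Look up one character by its character index on every iteration,
  # wrapping around so the index always stays inside the string.
  326_000.times do |i|
    contents[(i + 23) % contents.size]
  end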

I have attached two files below. One is all ASCII characters and the other has a single Unicode character in the first line (an "em dash").

String#index has dramatically different performance for the two strings. Locally, I'm seeing ~1.5 seconds with all_ascii.css and ~30 seconds with one_unicode.css on 1.9.3-p385. It gets worse with Ruby 2.0: all_ascii.css still takes ~1 second, but one_unicode.css takes ~2.5 minutes!

Any idea why the performance is so dramatically different between the two?
=end

Files

all_ascii.css (193 KB), zmoazeni (Zach Moazeni), 03/20/2013 08:23 AM
one_unicode.css (193 KB): the first line contains a Unicode "em dash", otherwise all ASCII. zmoazeni (Zach Moazeni), 03/20/2013 08:23 AM

Updated by Anonymous about 13 years ago
#1 [ruby-core:53561]

Status changed from Open to Rejected

When all the characters in a string are ASCII characters (single bytes), the byte index for any given character can be calculated in constant time.

When the string contains multibyte characters, finding the byte index given a character index becomes O(n).

If you need fast character indexing, try splitting the string into an array of characters.
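
The suggestion above can be sketched roughly as follows. This is a minimal illustration, not code from the report: the helper name build_char_table and the sample string are made up here. It assumes the string is split once and then indexed many times, so the one-time cost of the split is paid back by cheap Array lookups.

```ruby
# Hypothetical helper (not part of Ruby or of the report): split the string
# into an Array of one-character Strings once, so that later lookups by
# character index are plain Array accesses.
def build_char_table(text)
  text.chars
end

# Sample data made up for illustration: one em dash among ASCII characters.
sample = "a \u2014 b" + ("x" * 10)
table  = build_char_table(sample)

puts sample.ascii_only?              # false, because of the em dash
puts sample.bytesize - sample.size   # extra bytes used by the multibyte character

# Each lookup in the table returns the same character as String#[].
puts (0...sample.size).all? { |i| table[i] == sample[i] }
```

The trade-off is memory: the table holds one String object per character, which may matter for large inputs.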

Updated by nobu (Nobuyoshi Nakada) about 13 years ago
#2 [ruby-core:53562]

Description updated (diff)

Updated by nobu (Nobuyoshi Nakada) about 13 years ago
#3 [ruby-core:53563]

=begin
You may want to:

* use a regexp, e.g. (({scan})).
* convert to a fixed-width wide-character encoding, i.e., ((|UTF-32LE|)) or ((|UTF-32BE|)).
=end
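
Both suggestions above might look something like the sketch below. The helper names to_fixed_width and scan_chars are hypothetical and chosen only for this example. The assumption behind the first one is that, in an encoding where every character has the same byte width, a character index can be turned into a byte offset by multiplication instead of a scan.

```ruby
# Hypothetical helper: re-encode to UTF-32LE, where every character
# occupies exactly 4 bytes.
def to_fixed_width(text)
  text.encode("UTF-32LE")
end

# Hypothetical helper: walk the string once with a regexp instead of
# indexing into it repeatedly. The /m flag lets "." match newlines too.
def scan_chars(text)
  text.scan(/./m)
end

wide = to_fixed_width("a\u2014b")
puts wide.bytesize               # 3 characters * 4 bytes each
puts wide[1].encode("UTF-8")     # convert back before comparing or printing

p scan_chars("a\n\u2014")
```

Strings in UTF-32LE are not comparable with UTF-8 strings containing non-ASCII characters, so results usually need to be encoded back to UTF-8 before they are compared or printed.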

Updated by zmoazeni (Zach Moazeni) about 13 years ago
#4 [ruby-core:53564]

Thanks for the feedback, Charlie and Nobuyoshi. This came up in https://github.com/kschiess/parslet/issues/73. Parslet relies heavily on String#index (http://www.ruby-doc.org/core-2.0/String.html#method-i-index), passing a position to search from as the source content is consumed.
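
The access pattern described above, calling String#index with a start position that advances through the source, can be sketched as follows. This is an illustration only and not Parslet's actual code: the helper all_offsets and the sample string are made up here. The position passed to String#index is a character offset, so on a string with multibyte characters each call has to translate that offset into a byte offset first, which is the O(n) step discussed in this thread.

```ruby
# Hypothetical helper: collect the character offsets of every
# non-overlapping occurrence of `needle`, resuming each search from just
# past the previous match.
def all_offsets(source, needle)
  raise ArgumentError, "needle must not be empty" if needle.empty?

  offsets = []
  pos = 0
  while (found = source.index(needle, pos))
    offsets << found
    pos = found + needle.size   # continue after the match just found
  end
  offsets
end

# Sample data made up for illustration: one em dash among ASCII characters.
p all_offsets("ab \u2014 ab ab", "ab")
```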
