Bug #8129: String#index has drastically different performance when a single unicode character is included - Ruby - Ruby Issue Tracking System

Bug #8129

closed

Added by zmoazeni (Zach Moazeni) over 12 years ago. Updated over 12 years ago.

Status:

Rejected

Assignee:

Target version:

ruby -v:

2.0.0-p0

Backport:

[ruby-core:53559]

Description

=begin
I created a simple ruby script:

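  #! /usr/bin/env ruby

  raise "need a file name" unless ARGV[0]
  # Read the whole file into a single String.
  contents = File.read(ARGV[0])

  # Look up one character by character index, many times over.
  326_000.times do |i|
    contents[(i + 23) % contents.size]
  end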

I uploaded two files below. One is all ASCII characters and the other has a single Unicode character in the first line (an "em dash").

String#index has dramatically different performance for the two strings. Locally, I'm seeing ~1.5 seconds with all_ascii.css and ~30 seconds with one_unicode.css on 1.9.3-p385. It gets worse with Ruby 2.0: all_ascii.css still takes ~1 second, but one_unicode.css takes ~2.5 minutes!

Any idea why the performance is so dramatically different between the two?
=end

Files

all_ascii.css (193 KB), zmoazeni (Zach Moazeni), 03/20/2013 08:23 AM
one_unicode.css (193 KB), the first line contains a unicode "em dash", otherwise all ascii, zmoazeni (Zach Moazeni), 03/20/2013 08:23 AM

#1 [ruby-core:53561]

Updated by Anonymous over 12 years ago

Status changed from Open to Rejected

When all the characters in a string are ASCII characters (single bytes), the byte index for any given character can be calculated in constant time.

When the string contains multibyte characters, finding the byte index given a character index becomes O(n).

If you need fast character indexing, try splitting the string into an array of characters.
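
A minimal sketch of that suggestion, assuming the whole string fits comfortably in memory as an array. The helper name `char_table` and the sample text are made up for illustration and are not from the reporter's files.

```ruby
# Hypothetical helper (not from the thread): pay the O(n) cost once by
# splitting the string into an array of one-character strings.
def char_table(string)
  string.chars
end

# Made-up sample: ASCII text plus one multibyte character
# (an em dash, U+2014), similar in spirit to one_unicode.css.
contents = "/* header \u2014 styles */\n" + ("a { color: red; }\n" * 1_000)
chars = char_table(contents)

# Array#[] is constant time, so each lookup no longer needs to walk
# the string to translate a character index into a byte offset.
326_000.times do |i|
  chars[(i + 23) % chars.size]
end
```

The trade-off is memory: the array holds one String object per character, so this fits best when the same text is indexed many times.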

#2 [ruby-core:53562]

Updated by nobu (Nobuyoshi Nakada) over 12 years ago

Description updated (diff)

#3 [ruby-core:53563]

Updated by nobu (Nobuyoshi Nakada) over 12 years ago

=begin
You may want to:

* use a regexp, e.g. (({scan})).
* convert to a fixed-width wide char encoding, i.e., ((|UTF-32LE|)) or ((|UTF-32BE|)).
=end
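
Both suggestions can be sketched roughly as follows. The helper names are hypothetical, and the sample string is made up; only `String#scan`, `String#encode`, and `String#[]` are standard Ruby calls.

```ruby
# Option 1: visit the characters in a single pass with a regexp instead
# of indexing into the string repeatedly. /./m matches any character,
# including newlines.
def chars_via_scan(string)
  string.scan(/./m)
end

# Option 2: re-encode to a fixed-width encoding. Every character in
# UTF-32LE occupies exactly 4 bytes, so a character index maps directly
# to a byte offset.
def to_fixed_width(string)
  string.encode(Encoding::UTF_32LE)
end

# Fetch one character from the fixed-width copy and convert it back to
# UTF-8 so it can be compared with ordinary string literals.
def char_at(wide, index)
  wide[index].encode(Encoding::UTF_8)
end

contents = "a \u2014 b"           # made-up sample with one em dash
wide = to_fixed_width(contents)   # convert once, index many times
third = char_at(wide, 2)          # the em dash
```

The UTF-32 copy uses four bytes per character, so it takes roughly four times the space of the same ASCII-heavy text in UTF-8.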

#4 [ruby-core:53564]

Updated by zmoazeni (Zach Moazeni) over 12 years ago

Thanks for the feedback, Charlie and Nobuyoshi. This came up in https://github.com/kschiess/parslet/issues/73. Parslet relies heavily on String#index (http://www.ruby-doc.org/core-2.0/String.html#method-i-index), passing a start position that advances as the source content is consumed.
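
For illustration only, the access pattern described here looks roughly like the sketch below. This is not parslet's actual code; `token_positions` is a hypothetical helper that calls `String#index` with a start offset that moves forward as input is consumed.

```ruby
# Hypothetical helper: collect the character positions of every
# occurrence of a non-empty token by searching from an advancing offset.
def token_positions(source, token)
  raise ArgumentError, "token must not be empty" if token.empty?

  positions = []
  pos = 0
  # String#index(pattern, offset) takes a character offset. In a string
  # with multibyte characters, that offset has to be translated to a
  # byte offset on each call, which is the O(n) step described in the
  # first reply.
  while (found = source.index(token, pos))
    positions << found
    pos = found + token.size
  end
  positions
end
```

With an all-ASCII source the offset translation is trivial, which matches the gap the two uploaded files show.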
