Bug #8129

String#index has drastically different performance when a single unicode character is included

Added by Zach Moazeni almost 4 years ago. Updated almost 4 years ago.

I created a simple ruby script:

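#! /usr/bin/env ruby

raise "need a file name" unless ARGV[0]
contents = File.read(ARGV[0])

# Look up one character per iteration, at a position that moves through the file.
326_000.times do |i|
  contents[(i + 23) % contents.size]
end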

And I uploaded two files below. One is all ASCII characters and the other has a single Unicode character in the first line (an "em dash").

String#index has dramatically different performance for the two strings. Locally, on 1.9.3-p385, I'm seeing ~1.5 seconds with all_ascii.css and ~30 seconds with one_unicode.css. It gets worse with Ruby 2.0: all_ascii.css still takes ~1 second, but one_unicode.css takes ~2.5 minutes!

Any idea why the performance is so dramatically different between the two?

all_ascii.css (193 KB) Zach Moazeni, 03/20/2013 08:23 AM

one_unicode.css - The first line contains a Unicode "em dash"; otherwise all ASCII (193 KB) Zach Moazeni, 03/20/2013 08:23 AM


#1 [ruby-core:53561] Updated by Charlie Somerville almost 4 years ago

  • Status changed from Open to Rejected

When all the characters in a string are ASCII characters (single bytes), the byte index for any given character can be calculated in constant time.

When the string contains multibyte characters, finding the byte index given a character index becomes O(n).
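
The difference can be illustrated from Ruby itself. What follows is only a minimal sketch: the sample strings are made up (they are not the attached files), and byte_offset_of is a hypothetical helper, not part of String. It shows that character index and byte offset coincide in an all-ASCII string but drift apart after a single em dash.

```ruby
# Hypothetical helper: byte offset at which the character at char_index starts.
# It measures the prefix, so it has to walk the preceding characters (O(n)).
def byte_offset_of(str, char_index)
  str[0, char_index].bytesize
end

ascii   = "body { color: red; }"
unicode = "body \u2014 { color: red; }"  # one em dash, 3 bytes in UTF-8

puts ascii.ascii_only?           # true
puts unicode.ascii_only?         # false
puts byte_offset_of(ascii, 5)    # character 5 is "{", stored at byte 5
puts byte_offset_of(unicode, 7)  # character 7 is "{", stored at byte 9
```

In the all-ASCII case the answer is always the index itself, so no walk is needed; once a multibyte character is present, the offset depends on everything that comes before it.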

If you need fast character indexing, try splitting the string into an array of characters.
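
A rough sketch of that suggestion, assuming the same access pattern as the script in the report. The helper name sample_chars and the sample input are hypothetical. The string is split once, and every later lookup is a plain Array index.

```ruby
# Hypothetical helper mirroring the loop from the report, but indexing into a
# pre-split Array of characters instead of the String itself.
def sample_chars(contents, iterations)
  # One O(n) pass up front. The to_a keeps this working on 1.9, where
  # String#chars without a block returns an Enumerator rather than an Array.
  chars = contents.chars.to_a
  Array.new(iterations) { |i| chars[(i + 23) % chars.size] }  # O(1) per lookup
end

text = "a\u2014bc"         # made-up input containing one em dash
p sample_chars(text, 3)    # the characters at positions 3, 0 and 1
```

The trade-off is memory: each character becomes its own String object, so this is most attractive when the same content is indexed many times.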

#2 [ruby-core:53562] Updated by Nobuyoshi Nakada almost 4 years ago

  • Description updated (diff)

#3 [ruby-core:53563] Updated by Nobuyoshi Nakada almost 4 years ago

You may want to:
* use a regexp, e.g. scan.
* convert to a fixed-width wide-character encoding, i.e., UTF-32LE or UTF-32BE.
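
Both options might look roughly like this. It is a sketch only: the sample string and the helper name char_at_fixed_width are assumptions for illustration, and only String#encode, String#[] and String#scan are real API.

```ruby
# Hypothetical helper: look up a character in a fixed-width copy of the text
# and hand it back as UTF-8.
def char_at_fixed_width(wide, index)
  wide[index].encode("UTF-8")
end

source = "a\u2014bc"               # made-up input containing one em dash
wide   = source.encode("UTF-32LE") # every character now occupies exactly 4 bytes

puts wide.bytesize                 # 16 (4 characters * 4 bytes)
puts char_at_fixed_width(wide, 1)  # the em dash

# The regexp route: String#scan walks the string once, front to back,
# instead of re-locating a character index on every call.
p source.scan(/./m)                # one array element per character
```

The fixed-width copy costs four bytes per character, so for mostly-ASCII content it is about four times the size of the UTF-8 original.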

#4 [ruby-core:53564] Updated by Zach Moazeni almost 4 years ago

Thanks for the feedback, Charlie and Nobuyoshi. This came up in code that heavily uses String#index, passing a position to search from as the source content is consumed.
