Bug #8129
Closed
String#index has drastically different performance when a single Unicode character is included
Description
=begin
I created a simple Ruby script:

  #!/usr/bin/env ruby
  raise "need a file name" unless ARGV[0]
  contents = File.read(ARGV[0])
  326_000.times do |i|
    # Look up a single character by its character index.
    contents[(i + 23) % contents.size]
  end

And I uploaded two files below. One is all ASCII characters and the other has a single Unicode character in the first line (an "em dash").
String#index has dramatically different performance for the two strings. Locally, I'm seeing ~1.5 seconds with all_ascii.css and ~30 seconds with one_unicode.css on 1.9.3-p385. It gets worse with Ruby 2.0: all_ascii.css still takes ~1 second, but one_unicode.css takes ~2.5 minutes!
Any idea why the performance is so dramatically different between the two?
=end
Files
Updated by Anonymous about 11 years ago
- Status changed from Open to Rejected
When all the characters in a string are ASCII characters (single bytes), the byte index for any given character can be calculated in constant time.
When the string contains multibyte characters, finding the byte index given a character index becomes O(n).
If you need fast character indexing, try splitting the string into an array of characters.
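
A minimal sketch of the point above, assuming UTF-8 text. The sample strings and the helper name char_table are made up for illustration and are not part of the original report. The idea is to pay the O(n) decoding cost once and index an Array afterwards:

```ruby
# Hypothetical helper: decode the string once into an Array of
# one-character Strings, which can then be indexed in constant time.
def char_table(str)
  str.chars
end

ascii   = "a { color: red }"
unicode = "a \u2014 { color: red }" # same text plus one em dash (U+2014)

# In an all-ASCII string every character is one byte, so the character
# count and the byte count agree.
puts ascii.ascii_only?
puts ascii.size == ascii.bytesize

# A single multibyte character breaks that equality, so a character
# index no longer maps directly to a byte offset.
puts unicode.ascii_only?
puts unicode.size == unicode.bytesize

# Index the precomputed Array instead of the String.
table = char_table(unicode)
puts table[2] == "\u2014"
```

The trade-off is memory: the Array holds one String object per character, so this is probably only worthwhile when the same string is indexed many times.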
Updated by nobu (Nobuyoshi Nakada) about 11 years ago
- Description updated (diff)
Updated by nobu (Nobuyoshi Nakada) about 11 years ago
=begin
You may want to:
- use a regexp, e.g. with (({scan})).
- convert to a fixed-width wide-character encoding, i.e., ((|UTF-32LE|)) or ((|UTF-32BE|)).
=end
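
A rough sketch of both suggestions, assuming the input is valid UTF-8. The sample text and the helper name char_at_utf32 are hypothetical. In UTF-32LE every character occupies exactly four bytes, so the byte offset of a character can be derived from its index without scanning the string:

```ruby
# Hypothetical helper: fetch the character at index i from a UTF-32LE
# string and return it re-encoded as UTF-8.
def char_at_utf32(wide, i)
  wide[i].encode(Encoding::UTF_8)
end

text = "a \u2014 b" # contains one em dash (U+2014)

# Option 1: let a regexp walk the string once, e.g. with scan.
chars = text.scan(/./m)
puts chars == text.chars

# Option 2: convert once to a fixed-width encoding, then index it.
wide = text.encode(Encoding::UTF_32LE)
puts wide.bytesize == wide.size * 4
puts char_at_utf32(wide, 2) == "\u2014"
```

Note that the converted string uses about four times the memory of ASCII-heavy UTF-8 input, and anything read back out of it has to be re-encoded before it is mixed with UTF-8 strings.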
Updated by zmoazeni (Zach Moazeni) about 11 years ago
Thanks for the feedback, Charlie and Nobuyoshi. This came up from https://github.com/kschiess/parslet/issues/73, which uses String#index (http://www.ruby-doc.org/core-2.0/String.html#method-i-index) heavily, passing a position to search from as the source content is consumed.
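
For readers unfamiliar with that access pattern, here is an illustrative sketch of calling String#index with a start position in a loop. This is not parslet's actual code; the helper name each_offset and the sample input are assumptions. The start position is a character offset, so on a string with multibyte characters each call has to translate it into a byte offset first, as explained earlier in this thread:

```ruby
# Hypothetical helper: collect the character offsets of every occurrence
# of needle by repeatedly calling String#index with a start position.
def each_offset(source, needle)
  raise ArgumentError, "needle must not be empty" if needle.empty?

  offsets = []
  pos = 0
  while (found = source.index(needle, pos))
    offsets << found
    pos = found + needle.size # resume searching just past this match
  end
  offsets
end

p each_offset("a;b;c;", ";")
```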