Bug #6509

String#gsub is too slow if receiver includes a binary

Added by okkez _ almost 3 years ago. Updated almost 3 years ago.

[ruby-dev:45688]
Status:Closed
Priority:Normal
Assignee:Yui NARUSE
ruby -v:ruby 2.0.0dev (2012-05-28 trunk 35830) [x86_64-linux]
Backport:

Description

=begin

String#gsub becomes slow with code like the following. Elapsed times in seconds:

  • Case (A), b = "": 0.2840230464935303
  • Case (B), b = "\xB9": 4.183771848678589

# -*- coding: utf-8 -*-

a = ("abcde\n" * 50000).force_encoding("binary")
#b = ""                               # case (A)
b = "\xB9".force_encoding("binary")   # case (B)
c = ("efghi\n" * 50000).force_encoding("binary")

d = "#{a}#{b}#{c}"

start = Time.now.to_f
d.gsub(/\n/) { "" }
puts(Time.now.to_f - start)

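I profiled each case and have attached the results.

In case (B), search_nonascii is called about 200,000 times and accounts for 92% of the processing time.
In case (A), it is called only about 100,000 times and the processing time is short.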

=end
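
For anyone who wants to compare the two cases side by side, here is a minimal sketch of the reproduction wrapped in helper methods. The names build_input and time_gsub are hypothetical and not part of the original report, and the timings will depend on the Ruby build; the large gap reported above is expected only on builds affected by this bug.

```ruby
require "benchmark"

# Hypothetical helper: builds the test string from the report, with
# `middle` placed between the two ASCII-only halves.
def build_input(middle, repeat = 50_000)
  a = ("abcde\n" * repeat).force_encoding("binary")
  b = middle.dup.force_encoding("binary")
  c = ("efghi\n" * repeat).force_encoding("binary")
  "#{a}#{b}#{c}"
end

# Hypothetical helper: returns the wall-clock seconds spent in gsub.
def time_gsub(middle, repeat = 50_000)
  d = build_input(middle, repeat)
  Benchmark.realtime { d.gsub(/\n/) { "" } }
end

puts "(A) b = \"\":     #{time_gsub("")}"
puts "(B) b = \"\\xB9\": #{time_gsub("\xB9")}"
```

Benchmark.realtime is used here instead of subtracting Time.now.to_f values; it measures the same interval with less boilerplate.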

callgrind.out.9937 - case (A) (521 KB) okkez _, 05/29/2012 10:03 AM

callgrind.out.10091 - case (B) (521 KB) okkez _, 05/29/2012 10:03 AM

Associated revisions

Revision 35863
Added by Yui NARUSE almost 3 years ago

  • string.c (rb_enc_cr_str_buf_cat): don't reset coderange as unknown. The condition 'ptr_a8 && str_cr != ENC_CODERANGE_7BIT' means the coderange is not unknown: str is also ASCII-8BIT because str_encindex == ptr_encindex, and not (str_cr == ENC_CODERANGE_UNKNOWN) together with str_cr != ENC_CODERANGE_7BIT means str_cr is valid, because ASCII-8BIT can't be broken. [Bug #6509]
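
The reasoning in this commit message can be observed from Ruby code. The sketch below is only an illustration, not the C implementation: classify is a hypothetical helper that maps the public String#ascii_only? and String#valid_encoding? predicates onto the three known coderange states. It shows that an ASCII-8BIT string is either 7-bit or valid and never broken, whereas the same byte tagged as UTF-8 is broken.

```ruby
# Hypothetical helper: names the coderange a string would be classified as,
# using only public String predicates.
def classify(str)
  if str.ascii_only?
    :seven_bit   # corresponds to ENC_CODERANGE_7BIT
  elsif str.valid_encoding?
    :valid       # corresponds to ENC_CODERANGE_VALID
  else
    :broken      # corresponds to ENC_CODERANGE_BROKEN
  end
end

ascii  = "abcde\n".b                           # ASCII-8BIT, bytes < 0x80 only
binary = "\xB9".b                              # ASCII-8BIT with a high byte
utf8   = "\xB9".dup.force_encoding("UTF-8")    # same byte, invalid as UTF-8

p classify(ascii)           # 7-bit
p classify(binary)          # valid: any byte sequence is valid ASCII-8BIT
p classify(ascii + binary)  # still valid after concatenation
p classify(utf8)            # broken
```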

History

#1 Updated by Shyouhei Urabe almost 3 years ago

  • Category changed from core to M17N
  • Status changed from Open to Assigned
  • Assignee set to Yui NARUSE

It seems to me that once dest has become non-ASCII inside str_gsub, calling search_nonascii on it from then on is pointless, but I would like to hear an expert's opinion.
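
The idea in this comment can be modelled in a few lines of Ruby. This is a hypothetical sketch and not how string.c is written: BinaryBuffer is an invented class that remembers whether everything appended so far was 7-bit ASCII, and stops scanning new chunks once a non-ASCII byte has been seen, since the answer can no longer change for a binary buffer.

```ruby
# Hypothetical model of a destination buffer that caches its 7-bit status.
class BinaryBuffer
  attr_reader :scans   # number of chunks that had to be scanned

  def initialize
    @buf = "".b
    @seven_bit = true
    @scans = 0
  end

  def <<(chunk)
    chunk = chunk.b
    if @seven_bit
      # Scanning is only useful while the buffer is still pure ASCII.
      @scans += 1
      @seven_bit = chunk.ascii_only?
    end
    @buf << chunk
    self
  end

  def seven_bit?
    @seven_bit
  end

  def to_s
    @buf
  end
end

buf = BinaryBuffer.new
buf << "abc" << "\xB9" << "def" << "ghi"
puts buf.scans   # only the chunks up to and including "\xB9" were scanned
```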

#2 Updated by Yui NARUSE almost 3 years ago

  • Status changed from Assigned to Closed
  • % Done changed from 0 to 100

This issue was solved with changeset r35863.
okkez, thank you for reporting this issue.
Your contribution to Ruby is greatly appreciated.
May Ruby be with you.


  • string.c (rb_enc_cr_str_buf_cat): don't reset coderange as unknown. The condition 'ptr_a8 && str_cr != ENC_CODERANGE_7BIT' means the coderange is not unknown: str is also ASCII-8BIT because str_encindex == ptr_encindex, and not (str_cr == ENC_CODERANGE_UNKNOWN) together with str_cr != ENC_CODERANGE_7BIT means str_cr is valid, because ASCII-8BIT can't be broken. [Bug #6509]
