Bug #8585

Time for CSV.generate grows quadratically with number of rows

Added by Peter Vandenabeele 10 months ago. Updated 10 months ago.

[ruby-core:55709]
Status:Closed
Priority:Normal
Assignee:-
Category:-
Target version:-
ruby -v:2.1.0dev and 2.0.0
Backport:1.9.3: UNKNOWN, 2.0.0: UNKNOWN

Description

Hi,

I want to generate a CSV string from millions of rows.
The time to create the string grows quadratically
with the number of rows. Because of this, I cannot use
Ruby 2.0.0 to create the CSV file.

The problem is not present in Ruby 1.9.3.

The problem is present in Ruby 2.0.0 and ruby-head.

Using ruby-head

Installed with rvm reinstall ruby-head (built from version 3a01b9e)

peter_v@peter64:~/p/dbd$ rvm use ruby-head
Using /home/peter_v/.rvm/gems/ruby-head

peter_v@peter64:~/p/dbd$ ruby -v
ruby 2.1.0dev (2013-06-30) [x86_64-linux]

peter_v@peter64:~/p/dbd$ uname -a
Linux peter64 3.5.0-34-generic #55~precise1-Ubuntu SMP Fri Jun 7 16:25:50 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

peter_v@peter64:~/p/dbd$ rvm current
ruby-head

peter_v@peter64:~/p/dbd$ cat bin/test4.rb
#!/usr/bin/env ruby

count = ARGV[0].to_i
unless count > 0
  puts "Give a 'count' as first argument."
  exit(1)
end

require 'csv'

row_data = [
  "59ffbb3b-1e48-4c1f-81d8-d93afc84c966",
  "2013-06-28 19:14:55.975000806 UTC",
  "a11f290e-c441-41bc-8b8c-4e6c27b1b6fc",
  "c73e6241-d46f-4952-8377-c11372346d15",
  "test",
  "BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB 0"]

puts "starting CSV.generate"
start_time = Time.now

csv_string = CSV.generate(force_quotes: true) do |csv|
  count.times do
    csv << row_data
  end
end

puts "CSV.generate took #{Time.now - start_time} seconds"

peter_v@peter64:~/p/dbd$ time bin/test4.rb 10_000
starting CSV.generate
CSV.generate took 1.01238478 seconds

real 0m1.045s
user 0m1.044s
sys 0m0.004s

peter_v@peter64:~/p/dbd$ time bin/test4.rb 20_000
starting CSV.generate
CSV.generate took 3.815373614 seconds

real 0m3.847s
user 0m3.844s
sys 0m0.000s

peter_v@peter64:~/p/dbd$ time bin/test4.rb 40_000
starting CSV.generate
CSV.generate took 17.176208859 seconds

real 0m17.212s
user 0m17.177s
sys 0m0.020s

peter_v@peter64:~/p/dbd$ time bin/test4.rb 80_000
starting CSV.generate
CSV.generate took 71.400916725 seconds

real 1m11.436s
user 1m11.320s
sys 0m0.036s
peter_v@peter64:~/p/dbd$

Using ruby-1.9.3-p448

As expected, the time grows LINEARLY with the number of rows.

peter_v@peter64:~/p/dbd$ rvm use ruby-1.9.3
Using /home/peter_v/.rvm/gems/ruby-1.9.3-p448

peter_v@peter64:~/p/dbd$ ruby -v
ruby 1.9.3p448 (2013-06-27 revision 41675) [x86_64-linux]

peter_v@peter64:~/p/dbd$ rvm current
ruby-1.9.3-p448

peter_v@peter64:~/p/dbd$ time bin/test4.rb 10_000
starting CSV.generate
CSV.generate took 0.125396387 seconds

real 0m0.150s
user 0m0.140s
sys 0m0.008s

peter_v@peter64:~/p/dbd$ time bin/test4.rb 20_000
starting CSV.generate
CSV.generate took 0.249746069 seconds

real 0m0.274s
user 0m0.268s
sys 0m0.004s

peter_v@peter64:~/p/dbd$ time bin/test4.rb 40_000
starting CSV.generate
CSV.generate took 0.498180989 seconds

real 0m0.522s
user 0m0.504s
sys 0m0.016s

peter_v@peter64:~/p/dbd$ time bin/test4.rb 80_000
starting CSV.generate
CSV.generate took 0.991481147 seconds

real 0m1.015s
user 0m1.000s
sys 0m0.016s

peter_v@peter64:~/p/dbd$ time bin/test4.rb 100_000
starting CSV.generate
CSV.generate took 1.243347153 seconds

real 0m1.265s
user 0m1.240s
sys 0m0.020s

peter_v@peter64:~/p/dbd$ time bin/test4.rb 1000000
starting CSV.generate
CSV.generate took 12.461711974 seconds

real 0m12.492s
user 0m12.405s
sys 0m0.080s
peter_v@peter64:~/p/dbd$
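
The growth order can be read off the timings above by dividing each run by the previous one, since the row count doubles at every step: a ratio near 2 suggests linear growth, a ratio near 4 suggests quadratic growth. The following is only a small sketch of that arithmetic; the timings are copied from the transcripts above, and the names `head_timings`, `ruby193_timings` and `doubling_ratios` are made up for illustration.

```ruby
# Timings copied from the runs above: rows => seconds reported by CSV.generate.
head_timings = {
  10_000 => 1.01238478,
  20_000 => 3.815373614,
  40_000 => 17.176208859,
  80_000 => 71.400916725
}
ruby193_timings = {
  10_000 => 0.125396387,
  20_000 => 0.249746069,
  40_000 => 0.498180989,
  80_000 => 0.991481147
}

# Ratio of each timing to the previous one (the row count doubles each step).
# Hypothetical helper, not part of any library.
def doubling_ratios(timings)
  timings.values.each_cons(2).map { |prev, cur| (cur / prev).round(2) }
end

head_ratios = doubling_ratios(head_timings)
ruby193_ratios = doubling_ratios(ruby193_timings)

puts "ruby-head:  #{head_ratios.inspect}"   # ratios close to 4
puts "ruby-1.9.3: #{ruby193_ratios.inspect}" # ratios close to 2
```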

bug-8585.diff (845 Bytes) Nobuyoshi Nakada, 06/30/2013 11:38 PM

Associated revisions

Revision 41722
Added by Nobuyoshi Nakada 10 months ago

csv.rb: get rid of discarding coderange

  • lib/csv.rb (CSV#<<): use StringIO#set_encoding instead of creating new StringIO instance with String#force_encoding, forcing encoding discards the cached coderange bits and can make further operations very slow. [Bug #8585]
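
The pattern named in the commit message can be sketched roughly as follows. This is not the actual change to lib/csv.rb, only an illustration under the assumption that the output buffer is a single StringIO whose encoding is set once with StringIO#set_encoding, instead of calling String#force_encoding on the growing string and wrapping it in a new StringIO for every row. The variable names `row`, `io` and `output` are hypothetical.

```ruby
require 'stringio'

row = "\"a\",\"b\"\n" # hypothetical, already formatted CSV line

# Keep one StringIO for the whole output and set its encoding once ...
io = StringIO.new(String.new)
io.set_encoding(row.encoding)

# ... then append rows without touching the encoding of the growing buffer.
3.times { io << row }

output = io.string
puts output
```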

History

#1 Updated by Peter Vandenabeele 10 months ago

Using

CSV.open(filename, 'w')

I can write large CSV files to disk in Ruby 2.0.0
(e.g. 10 M rows in 132 seconds).

Only writing to a string is a problem in
Ruby 2.0.0 and ruby-head.
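
For reference, writing to a file with CSV.open instead of building a string could look like the sketch below. The row data, the temporary directory and the file name `out.csv` are assumptions for illustration, not taken from the report.

```ruby
require 'csv'
require 'tmpdir'

# Hypothetical sample data; the report uses millions of rows.
rows = Array.new(1_000) { |i| ["id-#{i}", "test", "value #{i}"] }

line_count = Dir.mktmpdir do |dir|
  filename = File.join(dir, "out.csv")

  # Rows are written straight to the file, no large string is built up.
  CSV.open(filename, 'w', force_quotes: true) do |csv|
    rows.each { |row| csv << row }
  end

  # Count the lines that ended up on disk.
  File.foreach(filename).count
end

puts "wrote #{line_count} lines"
```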

#2 Updated by Benoit Daloze 10 months ago

Good find!

A git bisect led to r37485 aka 58ef0f06:

Author: naruse
Date: Tue Nov 6 00:49:57 2012 +0000

* ruby.c (load_file_internal): set default source encoding as
  UTF-8 instead of US-ASCII.  [Feature #6679]

* parse.y (parser_initialize): set default parser encoding as
  UTF-8 instead of US-ASCII.

So it definitely looks encoding-related.
It is worrying that this causes such a performance regression.

#3 Updated by Benoit Daloze 10 months ago

Adding "# encoding: US-ASCII" at the top of the script makes it identical to the previous behavior, therefore taking the same time. I would certainly not call this a solution though.

#4 Updated by Charlie Somerville 10 months ago

This is most likely due to character indexing in UTF-8 being O(n).

I'd suggest reworking CSV.generate to not use character indexing, or converting input strings to UTF-32 first.

#5 Updated by Nobuyoshi Nakada 10 months ago

Eregon (Benoit Daloze) wrote:

Adding "# encoding: US-ASCII" at the top of the script makes it identical to the previous behavior, therefore taking the same time. I would certainly not call this a solution though.

The file already has that line.

The slowdown seems to be because String#encode in the do_quote lambda in init_separators is called for each field.

#6 Updated by Benoit Daloze 10 months ago

nobu (Nobuyoshi Nakada) wrote:

The file already has that line.

I meant at the top of the test script provided in the description.

The slowdown seems to be because String#encode in the do_quote lambda in init_separators is called for each field.

Any idea why this makes the whole process quadratic?

#7 Updated by Nobuyoshi Nakada 10 months ago

  • Status changed from Open to Closed
  • % Done changed from 0 to 100

This issue was solved with changeset r41722.
Peter, thank you for reporting this issue.
Your contribution to Ruby is greatly appreciated.
May Ruby be with you.


csv.rb: get rid of discarding coderange

  • lib/csv.rb (CSV#<<): use StringIO#set_encoding instead of creating new StringIO instance with String#force_encoding, forcing encoding discards the cached coderange bits and can make further operations very slow. [Bug #8585]
