Feature #4017: [PATCH] CSV parsing speedup - Ruby - Ruby Issue Tracking System

Actions

Copy link

Feature #4017

closed

[PATCH] CSV parsing speedup

Added by ender672 (Timothy Elliott) over 14 years ago. Updated over 7 years ago.

Status:

Rejected

Assignee:

kou (Kouhei Sutou)

Target version:

[ruby-core:33026]

Description

=begin
ruby_19_csv_parser_split_methods.patch
This patch breaks the CSV parser into multiple methods that are easier to understand and it allows for the performance optimizations in the second patch. It removes all regular expressions from the parser, resulting in a ~25% speed improvement in the CSV test suite. It adds a new CSV parser option, :io_read_limit, which determines the max size for IO reads. This option defaults to 2048 which to was the fastest in my benchmarks.

ruby_19_csv_parser_split_methods.patch
This patch adds two shortcuts to the patch above that significantly improve parsing of CSV files that have many quoted columns. It has to be applied on top of the first patch.

On large CSV files I observed that these patches resulted in a 20% - 60% reduction of time it takes to parse. If this patchset looks good, I would like to experiment with further improvements that take advantage of io_read_limit to always read from IO in large chunks (right now it only does so with CSV files that have no quote characters).

These patches maintain m17n support and multi-character separator support (and boy, it's tough to make those tests happy :)
=end

Files

Download all files

ruby_19_csv_parser_split_methods.patch (11.9 KB) ruby_19_csv_parser_split_methods.patch	Patch 1/2	ender672 (Timothy Elliott), 11/03/2010 09:43 AM
ruby_19_csv_parser_speedup.patch (1.82 KB) ruby_19_csv_parser_speedup.patch	Patch 2/2	ender672 (Timothy Elliott), 11/03/2010 09:43 AM

Actions

Copy link

Updated by JEG2 (James Gray) over 14 years ago

=begin
I'm for adding the read_limit option (I prefer that name to io_read_limit).

However, I'm not sure how I feel about defaulting it to some number. That's going to mean that some documents CSV used to correctly parse could now fail. While a limit is a good idea in practice, I don't feel comfortable making that choice for the programmer. I think you need to opt-in to a sensible limit for your purposes and, until you do, we assume you want the best parsing we can provide.

So, my main question is, can we drop the default limit and keep the gains?

I would definitely before applying a patch like this if we:

Rename the limit to just read_limit
Default to using no limit

I hope that makes sense.
=end

Actions

Copy link

Updated by ender672 (Timothy Elliott) over 14 years ago

=begin

Rename the limit to just read_limit

That name sounds good. I'll submit an updated patch with the better name.

Default to using no limit

I didn't do a good job explaining how this patch uses the read_limit option. Trunk currently uses IO#gets(newline_character) to read from @io. This patch gets a performance boost by using IO#gets(quote_character, read_limit) instead. The performance advantage is that we don't always care about newline characters (for example when reading data in a quoted column) but we do always care about quote characters.

One side effect of reading upto quote characters instead of newline characters is that it exposes an issue with using IO#gets without a limit:

If you parse a huge file without newline characters (on the trunk parser) or without quote characters (in the parser from this patch) you can run out of memory.

A huge file without newlines probably only would happen if it was done intentionally, but it is quite common for CSV files to have no quote characters in them. If the default for the parser in this patch was no read_limit, these files would consume a lot of memory and possibly cause an OOM situation.

Sorry for providing the patch with so little explanation. That change from IO#gets(newline_char) to IO#gets(quote_char, read_limit) is significant and should have been mentioned earlier.

=end

Actions

Copy link

Updated by JEG2 (James Gray) over 14 years ago

=begin
Ah, sorry. I misunderstood. I think I get it now.

Basically, read_limit isn't a limit so much as a read-at-a-time control dial that can be used to tune parsing speed. It's set to a reasonable default and users can tune it the their specific needs. Am I understanding better now?

I was thinking it was bubbling the limit used by methods like IO#gets() up to the user. I've long wanted to do that, to account for data like:

"...gigs of data containing anything but a quote

Beyond that, the refactoring sounds neat to me. Under what circumstances does it turn out to be slower than the old parser? It seems like there would have to be some. I don't say that because the old parser was well tuned either. It's just that we've now shifted the focus of the parser (from newlines to quotes, if I'm understanding correctly). For example, if each line is 20 simple fields, all unnecessarily but legally quoted, do the increased IO operations slow things down?

I guess if there are such cases, we just need to decide if the ones we are optimizing for are common. My thoughts there are probably that the test sweet isn't an ideal speed measuring tool. We might want to grab several realistic CSV documents to use for that purpose.

Any thoughts on my babbling here Tim? Am I making sense.
=end

Actions

Copy link

Updated by shyouhei (Shyouhei Urabe) over 14 years ago

Status changed from Open to Assigned

=begin

=end

Actions

Copy link

#5 [ruby-core:49146]

Updated by mame (Yusuke Endoh) over 12 years ago

Description updated (diff)
Target version set to 2.6

Actions

Copy link

Updated by naruse (Yui NARUSE) over 7 years ago

Target version deleted (~~2.6~~)

Actions

Copy link

#7 [ruby-core:86025]

Updated by mame (Yusuke Endoh) over 7 years ago

Assignee changed from JEG2 (James Gray) to kou (Kouhei Sutou)

@kou (Kouhei Sutou), could you check this ticket?

Actions

Copy link

#8 [ruby-core:86073]

Updated by kou (Kouhei Sutou) over 7 years ago

Status changed from Assigned to Feedback

We need some benchmark scripts based on benchmark library (or similar library) to work on speedup.

Can someone work on creating benchmark scripts?

Actions

Copy link

#9 [ruby-core:86107]

Updated by tomog105 (Tomohiro Ogoke) over 7 years ago

I tried to create a benchmark script for CSV.parse
(Using benchmark-ips gem)

Script¶

# benchmark script for CSV.parse
# Usage: `ruby $0 [rows count(default: 1000)]`
require 'csv'
require 'benchmark/ips'

Benchmark.ips do |x|
  rows = ARGV.fetch(0, "1000").to_i

  alphas = ['AAAAA'] * 50
  unquoted = (alphas.join(',') + "\r\n") * rows
  quoted = (alphas.map { |s| %("#{s}") }.join(',') + "\r\n") * rows
  inc_col_sep = (alphas.map { |s| %(",#{s}") }.join(',') + "\r\n") * rows
  inc_row_sep = (alphas.map { |s| %("#{s}\r\n") }.join(',') + "\r\n") * rows

  hiraganas = ['あああああ'] * 50
  enc_utf8 = (hiraganas.join(',') + "\r\n") * rows
  enc_sjis = enc_utf8.encode('Windows-31J')

  x.report("unquoted") { CSV.parse(unquoted) }
  x.report("quoted") { CSV.parse(quoted) }
  x.report("include col_sep") { CSV.parse(inc_col_sep) }
  x.report("include row_sep") { CSV.parse(inc_row_sep) }
  x.report("encode utf-8") { CSV.parse(enc_utf8) }
  x.report("encode sjis") { CSV.parse(enc_sjis) }
end

Result¶

Ruby version: ruby 2.5.0p0 (2017-12-25 revision 61468) [x86_64-darwin16]
Processor: Intel Core i7 6700K @ 4 GHz
Memory: 16 GB
Target revision: https://github.com/ruby/csv/commit/583cee457773c0ac9e7234b885e1a82df35a47cb

Rows: 1000¶

            unquoted     41.142  (± 0.0%) i/s -    208.000  in   5.055839s
              quoted     23.093  (± 0.0%) i/s -    116.000  in   5.024081s
     include col_sep     14.826  (± 0.0%) i/s -     75.000  in   5.059138s
     include row_sep      7.136  (± 0.0%) i/s -     36.000  in   5.045395s
        encode utf-8     34.350  (± 0.0%) i/s -    174.000  in   5.066178s
         encode sjis     34.230  (± 0.0%) i/s -    174.000  in   5.083444s

Rows: 10000¶

            unquoted      4.021  (± 0.0%) i/s -     21.000  in   5.230854s
              quoted      2.266  (± 0.0%) i/s -     12.000  in   5.327204s
     include col_sep      1.527  (± 0.0%) i/s -      8.000  in   5.242055s
     include row_sep      0.692  (± 0.0%) i/s -      4.000  in   5.780656s
        encode utf-8      3.215  (± 0.0%) i/s -     16.000  in   5.010681s
         encode sjis      3.400  (± 0.0%) i/s -     17.000  in   5.012165s

Actions

Copy link

#10 [ruby-core:86108]

Updated by tomog105 (Tomohiro Ogoke) over 7 years ago

In addition, I'm porting patches in this issue and got benchmark results.

Result¶

Ruby version: ruby 2.5.0p0 (2017-12-25 revision 61468) [x86_64-darwin16]
Processor: Intel Core i7 6700K @ 4 GHz
Memory: 16 GB
Target revision: https://github.com/tomog105/csv/commit/c56ed71097fc7e61ce2a896f013095a5ef36b548
- Since many changes were added to the code base, I'm fixed additional features from original patches.
- But, CSV#line does not work properly in this revision.

Rows 1000¶

            unquoted     91.088  (± 2.2%) i/s -    459.000  in   5.041720s
              quoted     22.547  (± 0.0%) i/s -    114.000  in   5.056618s
     include col_sep     22.167  (± 0.0%) i/s -    112.000  in   5.053609s
     include row_sep     22.369  (± 0.0%) i/s -    112.000  in   5.007715s
        encode utf-8     43.500  (± 0.0%) i/s -    220.000  in   5.058088s
         encode sjis     34.559  (± 0.0%) i/s -    174.000  in   5.035380s

Rows 10000¶

            unquoted      9.395  (± 0.0%) i/s -     47.000  in   5.014335s
              quoted      2.116  (± 0.0%) i/s -     11.000  in   5.208314s
     include col_sep      2.103  (± 0.0%) i/s -     11.000  in   5.241502s
     include row_sep      2.146  (± 0.0%) i/s -     11.000  in   5.128535s
        encode utf-8      4.217  (± 0.0%) i/s -     22.000  in   5.226532s
         encode sjis      3.370  (± 0.0%) i/s -     17.000  in   5.047115s

Compare¶

Rows 1000¶

test	trunk	patched	ratio
unquoted	41.142	91.465	221%
quoted	23.093	22.547	97.6%
include col_sep	14.826	22.167	150%
include row_sep	7.136	22.369	313%
encode utf-8	34.350	43.500	127%
encode sjis	3.400	3.370	99.1%

Rows 10000¶

test	trunk	patched	ratio
unquoted	4.021	9.395	234%
quoted	2.266	2.116	93.4%
include col_sep	1.527	2.103	138%
include row_sep	0.692	2.146	310%
encode utf-8	3.215	4.217	131%
encode sjis	3.400	3.370	99.1%

Actions

Copy link

#11 [ruby-core:86112]

Updated by kou (Kouhei Sutou) over 7 years ago

Thanks. I've added the benchmark script to the repository.

Does the comparison result say "the patch will improve performance"?
If so, can you complete your work to adapt the patch to the master and create a pull request?

Actions

Copy link

#12 [ruby-core:86127]

Updated by tomog105 (Tomohiro Ogoke) over 7 years ago

Does the comparison result say "the patch will improve performance"?
If so, can you complete your work to adapt the patch to the master and create a pull request?

I attempted to solve the problem that CSV#line does not work, but this fix seems to obviously degrade the performance from the original patch.
(After fix commit: https://github.com/tomog105/csv/commit/be3261baa0ebd8246f75408c503572e0d111a139)
For example, in benchmark of the "quoted" case, it seems to be about 20% slower than trunk.

Therefore, I can not say "the patch improves performance" now.

Actions

Copy link

#13 [ruby-core:86414]

Updated by kou (Kouhei Sutou) over 7 years ago

Status changed from Feedback to Rejected

OK. I close this.

Actions

Copy link

Also available in: Atom PDF

Like0

Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0

Project

General

Profile

Ruby

Tags

Custom queries

Feature #4017

[PATCH] CSV parsing speedup

Updated by JEG2 (James Gray) over 14 years ago

Updated by ender672 (Timothy Elliott) over 14 years ago

Updated by JEG2 (James Gray) over 14 years ago

Updated by shyouhei (Shyouhei Urabe) over 14 years ago

Updated by mame (Yusuke Endoh) over 12 years ago

Updated by naruse (Yui NARUSE) over 7 years ago

Updated by mame (Yusuke Endoh) over 7 years ago

Updated by kou (Kouhei Sutou) over 7 years ago

Updated by tomog105 (Tomohiro Ogoke) over 7 years ago

Script¶

Result¶

Rows: 1000¶

Rows: 10000¶

Updated by tomog105 (Tomohiro Ogoke) over 7 years ago

Result¶

Rows 1000¶

Rows 10000¶

Compare¶

Rows 1000¶

Rows 10000¶

Updated by kou (Kouhei Sutou) over 7 years ago

Updated by tomog105 (Tomohiro Ogoke) over 7 years ago

Updated by kou (Kouhei Sutou) over 7 years ago