Feature #4017


[PATCH] CSV parsing speedup

Added by ender672 (Timothy Elliott) almost 13 years ago. Updated over 5 years ago.

Target version:


This patch breaks the CSV parser into multiple methods that are easier to understand and it allows for the performance optimizations in the second patch. It removes all regular expressions from the parser, resulting in a ~25% speed improvement in the CSV test suite. It adds a new CSV parser option, :io_read_limit, which determines the max size for IO reads. This option defaults to 2048 which to was the fastest in my benchmarks.

This patch adds two shortcuts to the patch above that significantly improve parsing of CSV files that have many quoted columns. It has to be applied on top of the first patch.

On large CSV files I observed that these patches resulted in a 20% - 60% reduction of time it takes to parse. If this patchset looks good, I would like to experiment with further improvements that take advantage of io_read_limit to always read from IO in large chunks (right now it only does so with CSV files that have no quote characters).

These patches maintain m17n support and multi-character separator support (and boy, it's tough to make those tests happy :)


ruby_19_csv_parser_split_methods.patch (11.9 KB) ruby_19_csv_parser_split_methods.patch Patch 1/2 ender672 (Timothy Elliott), 11/03/2010 09:43 AM
ruby_19_csv_parser_speedup.patch (1.82 KB) ruby_19_csv_parser_speedup.patch Patch 2/2 ender672 (Timothy Elliott), 11/03/2010 09:43 AM
Actions #1

Updated by JEG2 (James Gray) almost 13 years ago

I'm for adding the read_limit option (I prefer that name to io_read_limit).

However, I'm not sure how I feel about defaulting it to some number. That's going to mean that some documents CSV used to correctly parse could now fail. While a limit is a good idea in practice, I don't feel comfortable making that choice for the programmer. I think you need to opt-in to a sensible limit for your purposes and, until you do, we assume you want the best parsing we can provide.

So, my main question is, can we drop the default limit and keep the gains?

I would definitely before applying a patch like this if we:

  • Rename the limit to just read_limit
  • Default to using no limit

I hope that makes sense.

Actions #2

Updated by ender672 (Timothy Elliott) almost 13 years ago


  • Rename the limit to just read_limit

That name sounds good. I'll submit an updated patch with the better name.

  • Default to using no limit

I didn't do a good job explaining how this patch uses the read_limit option. Trunk currently uses IO#gets(newline_character) to read from @io. This patch gets a performance boost by using IO#gets(quote_character, read_limit) instead. The performance advantage is that we don't always care about newline characters (for example when reading data in a quoted column) but we do always care about quote characters.

One side effect of reading upto quote characters instead of newline characters is that it exposes an issue with using IO#gets without a limit:

If you parse a huge file without newline characters (on the trunk parser) or without quote characters (in the parser from this patch) you can run out of memory.

A huge file without newlines probably only would happen if it was done intentionally, but it is quite common for CSV files to have no quote characters in them. If the default for the parser in this patch was no read_limit, these files would consume a lot of memory and possibly cause an OOM situation.

Sorry for providing the patch with so little explanation. That change from IO#gets(newline_char) to IO#gets(quote_char, read_limit) is significant and should have been mentioned earlier.


Actions #3

Updated by JEG2 (James Gray) almost 13 years ago

Ah, sorry. I misunderstood. I think I get it now.

Basically, read_limit isn't a limit so much as a read-at-a-time control dial that can be used to tune parsing speed. It's set to a reasonable default and users can tune it the their specific needs. Am I understanding better now?

I was thinking it was bubbling the limit used by methods like IO#gets() up to the user. I've long wanted to do that, to account for data like:

"...gigs of data containing anything but a quote

Beyond that, the refactoring sounds neat to me. Under what circumstances does it turn out to be slower than the old parser? It seems like there would have to be some. I don't say that because the old parser was well tuned either. It's just that we've now shifted the focus of the parser (from newlines to quotes, if I'm understanding correctly). For example, if each line is 20 simple fields, all unnecessarily but legally quoted, do the increased IO operations slow things down?

I guess if there are such cases, we just need to decide if the ones we are optimizing for are common. My thoughts there are probably that the test sweet isn't an ideal speed measuring tool. We might want to grab several realistic CSV documents to use for that purpose.

Any thoughts on my babbling here Tim? Am I making sense.

Actions #4

Updated by shyouhei (Shyouhei Urabe) almost 13 years ago

  • Status changed from Open to Assigned



Updated by mame (Yusuke Endoh) almost 11 years ago

  • Description updated (diff)
  • Target version set to 2.6
Actions #6

Updated by naruse (Yui NARUSE) almost 6 years ago

  • Target version deleted (2.6)

Updated by mame (Yusuke Endoh) over 5 years ago

  • Assignee changed from JEG2 (James Gray) to kou (Kouhei Sutou)

@kou (Kouhei Sutou), could you check this ticket?

Updated by kou (Kouhei Sutou) over 5 years ago

  • Status changed from Assigned to Feedback

We need some benchmark scripts based on benchmark library (or similar library) to work on speedup.

Can someone work on creating benchmark scripts?

Updated by tomog105 (Tomohiro Ogoke) over 5 years ago

I tried to create a benchmark script for CSV.parse
(Using benchmark-ips gem)


# benchmark script for CSV.parse
# Usage: `ruby $0 [rows count(default: 1000)]`
require 'csv'
require 'benchmark/ips'

Benchmark.ips do |x|
  rows = ARGV.fetch(0, "1000").to_i

  alphas = ['AAAAA'] * 50
  unquoted = (alphas.join(',') + "\r\n") * rows
  quoted = ( { |s| %("#{s}") }.join(',') + "\r\n") * rows
  inc_col_sep = ( { |s| %(",#{s}") }.join(',') + "\r\n") * rows
  inc_row_sep = ( { |s| %("#{s}\r\n") }.join(',') + "\r\n") * rows

  hiraganas = ['あああああ'] * 50
  enc_utf8 = (hiraganas.join(',') + "\r\n") * rows
  enc_sjis = enc_utf8.encode('Windows-31J')"unquoted") { CSV.parse(unquoted) }"quoted") { CSV.parse(quoted) }"include col_sep") { CSV.parse(inc_col_sep) }"include row_sep") { CSV.parse(inc_row_sep) }"encode utf-8") { CSV.parse(enc_utf8) }"encode sjis") { CSV.parse(enc_sjis) }


Rows: 1000

            unquoted     41.142  (± 0.0%) i/s -    208.000  in   5.055839s
              quoted     23.093  (± 0.0%) i/s -    116.000  in   5.024081s
     include col_sep     14.826  (± 0.0%) i/s -     75.000  in   5.059138s
     include row_sep      7.136  (± 0.0%) i/s -     36.000  in   5.045395s
        encode utf-8     34.350  (± 0.0%) i/s -    174.000  in   5.066178s
         encode sjis     34.230  (± 0.0%) i/s -    174.000  in   5.083444s

Rows: 10000

            unquoted      4.021  (± 0.0%) i/s -     21.000  in   5.230854s
              quoted      2.266  (± 0.0%) i/s -     12.000  in   5.327204s
     include col_sep      1.527  (± 0.0%) i/s -      8.000  in   5.242055s
     include row_sep      0.692  (± 0.0%) i/s -      4.000  in   5.780656s
        encode utf-8      3.215  (± 0.0%) i/s -     16.000  in   5.010681s
         encode sjis      3.400  (± 0.0%) i/s -     17.000  in   5.012165s

Updated by tomog105 (Tomohiro Ogoke) over 5 years ago

In addition, I'm porting patches in this issue and got benchmark results.


Rows 1000

            unquoted     91.088  (± 2.2%) i/s -    459.000  in   5.041720s
              quoted     22.547  (± 0.0%) i/s -    114.000  in   5.056618s
     include col_sep     22.167  (± 0.0%) i/s -    112.000  in   5.053609s
     include row_sep     22.369  (± 0.0%) i/s -    112.000  in   5.007715s
        encode utf-8     43.500  (± 0.0%) i/s -    220.000  in   5.058088s
         encode sjis     34.559  (± 0.0%) i/s -    174.000  in   5.035380s

Rows 10000

            unquoted      9.395  (± 0.0%) i/s -     47.000  in   5.014335s
              quoted      2.116  (± 0.0%) i/s -     11.000  in   5.208314s
     include col_sep      2.103  (± 0.0%) i/s -     11.000  in   5.241502s
     include row_sep      2.146  (± 0.0%) i/s -     11.000  in   5.128535s
        encode utf-8      4.217  (± 0.0%) i/s -     22.000  in   5.226532s
         encode sjis      3.370  (± 0.0%) i/s -     17.000  in   5.047115s


Rows 1000

test trunk patched ratio
unquoted 41.142 91.465 221%
quoted 23.093 22.547 97.6%
include col_sep 14.826 22.167 150%
include row_sep 7.136 22.369 313%
encode utf-8 34.350 43.500 127%
encode sjis 3.400 3.370 99.1%

Rows 10000

test trunk patched ratio
unquoted 4.021 9.395 234%
quoted 2.266 2.116 93.4%
include col_sep 1.527 2.103 138%
include row_sep 0.692 2.146 310%
encode utf-8 3.215 4.217 131%
encode sjis 3.400 3.370 99.1%

Updated by kou (Kouhei Sutou) over 5 years ago

Thanks. I've added the benchmark script to the repository.

Does the comparison result say "the patch will improve performance"?
If so, can you complete your work to adapt the patch to the master and create a pull request?

Updated by tomog105 (Tomohiro Ogoke) over 5 years ago

Does the comparison result say "the patch will improve performance"?
If so, can you complete your work to adapt the patch to the master and create a pull request?

I attempted to solve the problem that CSV#line does not work, but this fix seems to obviously degrade the performance from the original patch.
(After fix commit:
For example, in benchmark of the "quoted" case, it seems to be about 20% slower than trunk.

Therefore, I can not say "the patch improves performance" now.

Updated by kou (Kouhei Sutou) over 5 years ago

  • Status changed from Feedback to Rejected

OK. I close this.


Also available in: Atom PDF