Feature #4017

closed

[PATCH] CSV parsing speedup

Added by ender672 (Timothy Elliott) over 10 years ago. Updated about 3 years ago.

Status: Rejected
Priority: Normal
Target version: -
[ruby-core:33026]

Description

=begin
ruby_19_csv_parser_split_methods.patch
This patch breaks the CSV parser into multiple methods that are easier to understand, and it enables the performance optimizations in the second patch. It removes all regular expressions from the parser, resulting in a ~25% speed improvement in the CSV test suite. It also adds a new CSV parser option, :io_read_limit, which sets the maximum size of IO reads. This option defaults to 2048, which was the fastest value in my benchmarks.
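To illustrate what regex-free splitting can look like, here is a minimal sketch. It is not code from the patch: `split_unquoted` is a hypothetical helper that handles only rows without quote characters, and it locates the (possibly multi-character) separator with String#index rather than a Regexp.

```ruby
# Hypothetical helper, not taken from the patch: split one row that contains
# no quote characters on a separator string, without using a Regexp.
def split_unquoted(line, sep = ",")
  fields = []
  start = 0
  # String#index works on character offsets, so multi-byte data and
  # multi-character separators are handled the same way.
  while (pos = line.index(sep, start))
    fields << line[start...pos]
    start = pos + sep.length
  end
  fields << line[start..-1] # trailing field (may be empty)
  fields
end

p split_unquoted("a,b,,c")
p split_unquoted("a::b::", "::")
```

Quoted fields, escaped quotes, and row separators are deliberately left out; the real parser has to deal with all of them.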

ruby_19_csv_parser_speedup.patch
This patch adds two shortcuts to the patch above that significantly improve parsing of CSV files that have many quoted columns. It has to be applied on top of the first patch.

On large CSV files, I observed that these patches reduced parse time by 20% - 60%. If this patchset looks good, I would like to experiment with further improvements that take advantage of io_read_limit to always read from IO in large chunks (right now it only does so with CSV files that have no quote characters).
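The idea of reading from IO in bounded chunks can be sketched roughly as follows. This is an assumption-laden illustration, not the patch itself: `each_line_chunked` is a hypothetical name, the row separator is assumed to be "\n", the input is assumed to contain no quote characters, and encoding handling is ignored (IO#read with a length returns binary strings).

```ruby
require "stringio"

# Hypothetical sketch, not the patch's implementation: pull at most
# read_limit bytes from the IO per read and yield each complete line.
# Assumes "\n" row separators and no quoted fields.
def each_line_chunked(io, read_limit = 2048)
  buffer = String.new # binary buffer; encodings are ignored in this sketch
  while (chunk = io.read(read_limit)) # returns nil at EOF
    buffer << chunk
    while (nl = buffer.index("\n"))
      yield buffer.slice!(0..nl).chomp
    end
  end
  yield buffer unless buffer.empty? # last line without a trailing newline
end

each_line_chunked(StringIO.new("a,b\nc,d\ne"), 3) do |line|
  p line.split(",", -1)
end
```

A small limit such as 3 is used here only to exercise lines that span several reads; the 2048 default mirrors the value reported above.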

These patches maintain m17n support and multi-character separator support (and boy, it's tough to make those tests happy :)
=end


Files

ruby_19_csv_parser_split_methods.patch (11.9 KB) - Patch 1/2, ender672 (Timothy Elliott), 11/03/2010 09:43 AM
ruby_19_csv_parser_speedup.patch (1.82 KB) - Patch 2/2, ender672 (Timothy Elliott), 11/03/2010 09:43 AM