Feature #1981

[PATCH] CSV Parsing Speedup

Added by ender672 (Timothy Elliott) almost 8 years ago. Updated about 6 years ago.

Target version:


This patch replaces the regex used in the Ruby 1.9 CSV parser with ruby code.

Running all CSV tests (ts_all.rb) is much faster (36% on my machine). Probably because they don't have to rebuild the regex over and over again. I tested on large CSV files and got an average speedup of 23% on my machine.

(James, this patch is improved & faster from the one I emailed to you on 8/21. Putting into this ticket to make it easier to track any changes.)

ruby_19_csv_speedup_01.patch (7.71 KB) ruby_19_csv_speedup_01.patch Patch against current svn (24629) ender672 (Timothy Elliott), 08/23/2009 04:44 AM
csv.rb (80.6 KB) csv.rb Patched csv.rb ender672 (Timothy Elliott), 08/23/2009 04:44 AM
ruby_19_csv_speedup_02.patch (7.71 KB) ruby_19_csv_speedup_02.patch Same patch as before, against r26974 ender672 (Timothy Elliott), 03/20/2010 03:36 AM
csv.rb (80.8 KB) csv.rb Patched lib/csv.rb, r26974 ender672 (Timothy Elliott), 03/20/2010 03:36 AM
check_for_stray_quotes_19.patch (1.1 KB) check_for_stray_quotes_19.patch ender672 (Timothy Elliott), 03/23/2010 04:16 PM


#1 Updated by JEG2 (James Gray) almost 8 years ago

  • Due date set to 12/31/2009
  • Target version set to 2.0.0

I just wanted to state that I am aware of this patch and do plan to apply something like it. I'm purposefully waiting a bit to see how the new FasterCSV release, which uses a similar non-regex approach, fairs in the wild. When I fill confident we've made the right choice to switch parsing strategies, I will commit some version of this patch.

#2 Updated by JEG2 (James Gray) over 7 years ago

Timothy, is it possible for you to redo this patch against the current trunk, so I can look at what it would take to get it applied?

You mention that it is faster which is great. Are you also pretty confident that it maintains CSV's m17n savvy implementation? I don't want to lose the ability to parse in any encoding. I have some tests for this and I'll try to look over the code as we change it, but I know I'm not perfect and could have missed something.

#3 Updated by ender672 (Timothy Elliott) over 7 years ago

I updated the patch to against trunk. Attaching to this ticket update.

I took care to make it pass all tests, including the m17n tests in test/csv/test_encodings.rb . I have not tested it against m17n CSV files outside of the tests. It would be a good idea to test against a few large m17n CSV files to make sure that the new operations used by this parser (String.gsub!, String#count & String#last) perform well in that scenario.


#4 Updated by JEG2 (James Gray) over 7 years ago

Thanks Timothy, but this patch doesn't include your latest fix from the FasterCSV side, to restore the strict parser behavior. This tests is currently failing:

def test_non_regex_edge_cases
# An early version of the non-regex parser fails this test
[ [ "foo,\"foo,bar,baz,foo\",\"foo\"",
["foo", "foo,bar,baz,foo", "foo"] ] ].each do |edge_case|
assert_equal(edge_case.last, CSV.parse_line(edge_case.first))

 assert_raise(CSV::MalformedCSVError) do
   CSV.parse_line("1,\"23\"4\"5\", 6")


Can I bother you for one more patch that includes this fix as well? Thanks!

(I'm applying all of the new FasterCSV fixes and am now missing only this in my local checkout, so a patch adding this after the latest patch you gave me would be best. You also don't need to add the test, since I already have.)

James Edward Gray II

P.S. It is quite a bit faster and does seem to play well with m17n, from what I can see. Great work!


#5 Updated by ender672 (Timothy Elliott) over 7 years ago

Patch attached.

Tim Elliott

#6 Updated by JEG2 (James Gray) over 7 years ago

  • Status changed from Open to Closed

Thanks for all the help Tim. We're all caught up now with all of your patches applied.

Also available in: Atom PDF