Feature #1981

[PATCH] CSV Parsing Speedup

Added by Timothy Elliott almost 7 years ago. Updated about 5 years ago.



This patch replaces the regex used in the Ruby 1.9 CSV parser with ruby code.

Running all CSV tests (ts_all.rb) is much faster (36% on my machine). Probably because they don't have to rebuild the regex over and over again. I tested on large CSV files and got an average speedup of 23% on my machine.

(James, this patch is improved & faster from the one I emailed to you on 8/21. Putting into this ticket to make it easier to track any changes.)

ruby_19_csv_speedup_01.patch Magnifier - Patch against current svn (24629) (7.71 KB) Timothy Elliott, 08/23/2009 04:44 AM

csv.rb Magnifier - Patched csv.rb (80.6 KB) Timothy Elliott, 08/23/2009 04:44 AM

ruby_19_csv_speedup_02.patch Magnifier - Same patch as before, against r26974 (7.71 KB) Timothy Elliott, 03/20/2010 03:36 AM

csv.rb Magnifier - Patched lib/csv.rb, r26974 (80.8 KB) Timothy Elliott, 03/20/2010 03:36 AM

check_for_stray_quotes_19.patch Magnifier (1.1 KB) Timothy Elliott, 03/23/2010 04:16 PM


#1 Updated by James Gray almost 7 years ago

  • Due date set to 12/31/2009
  • Target version set to 2.0.0

I just wanted to state that I am aware of this patch and do plan to apply something like it. I'm purposefully waiting a bit to see how the new FasterCSV release, which uses a similar non-regex approach, fairs in the wild. When I fill confident we've made the right choice to switch parsing strategies, I will commit some version of this patch.

#2 Updated by James Gray about 6 years ago

Timothy, is it possible for you to redo this patch against the current trunk, so I can look at what it would take to get it applied?

You mention that it is faster which is great. Are you also pretty confident that it maintains CSV's m17n savvy implementation? I don't want to lose the ability to parse in any encoding. I have some tests for this and I'll try to look over the code as we change it, but I know I'm not perfect and could have missed something.

#3 Updated by Timothy Elliott about 6 years ago

I updated the patch to against trunk. Attaching to this ticket update.

I took care to make it pass all tests, including the m17n tests in test/csv/test_encodings.rb . I have not tested it against m17n CSV files outside of the tests. It would be a good idea to test against a few large m17n CSV files to make sure that the new operations used by this parser (String.gsub!, String#count & String#last) perform well in that scenario.


#4 Updated by James Gray about 6 years ago

Thanks Timothy, but this patch doesn't include your latest fix from the FasterCSV side, to restore the strict parser behavior. This tests is currently failing:

def test_non_regex_edge_cases
# An early version of the non-regex parser fails this test
[ [ "foo,\"foo,bar,baz,foo\",\"foo\"",
["foo", "foo,bar,baz,foo", "foo"] ] ].each do |edge_case|
assert_equal(edge_case.last, CSV.parse_line(edge_case.first))

 assert_raise(CSV::MalformedCSVError) do
   CSV.parse_line("1,\"23\"4\"5\", 6")


Can I bother you for one more patch that includes this fix as well? Thanks!

(I'm applying all of the new FasterCSV fixes and am now missing only this in my local checkout, so a patch adding this after the latest patch you gave me would be best. You also don't need to add the test, since I already have.)

James Edward Gray II

P.S. It is quite a bit faster and does seem to play well with m17n, from what I can see. Great work!


#5 Updated by Timothy Elliott about 6 years ago

Patch attached.

Tim Elliott

#6 Updated by James Gray about 6 years ago

  • Status changed from Open to Closed

Thanks for all the help Tim. We're all caught up now with all of your patches applied.

Also available in: Atom PDF