[PATCH] CSV Parsing Speedup
This patch replaces the regex used in the Ruby 1.9 CSV parser with ruby code.
Running all CSV tests (ts_all.rb) is much faster (36% on my machine). Probably because they don't have to rebuild the regex over and over again. I tested on large CSV files and got an average speedup of 23% on my machine.
(James, this patch is improved & faster from the one I emailed to you on 8/21. Putting into this ticket to make it easier to track any changes.)
#1 Updated by James Gray almost 6 years ago
- Due date set to 12/31/2009
- Target version set to 2.0.0
I just wanted to state that I am aware of this patch and do plan to apply something like it. I'm purposefully waiting a bit to see how the new FasterCSV release, which uses a similar non-regex approach, fairs in the wild. When I fill confident we've made the right choice to switch parsing strategies, I will commit some version of this patch.
#2 Updated by James Gray about 5 years ago
Timothy, is it possible for you to redo this patch against the current trunk, so I can look at what it would take to get it applied?
You mention that it is faster which is great. Are you also pretty confident that it maintains CSV's m17n savvy implementation? I don't want to lose the ability to parse in any encoding. I have some tests for this and I'll try to look over the code as we change it, but I know I'm not perfect and could have missed something.
#3 Updated by Timothy Elliott about 5 years ago
I updated the patch to against trunk. Attaching to this ticket update.
I took care to make it pass all tests, including the m17n tests in test/csv/test_encodings.rb . I have not tested it against m17n CSV files outside of the tests. It would be a good idea to test against a few large m17n CSV files to make sure that the new operations used by this parser (String.gsub!, String#count & String#last) perform well in that scenario.
#4 Updated by James Gray about 5 years ago
Thanks Timothy, but this patch doesn't include your latest fix from the FasterCSV side, to restore the strict parser behavior. This tests is currently failing:
# An early version of the non-regex parser fails this test
[ [ "foo,\"foo,bar,baz,foo\",\"foo\"",
["foo", "foo,bar,baz,foo", "foo"] ] ].each do |edge_case|
assert_raise(CSV::MalformedCSVError) do CSV.parse_line("1,\"23\"4\"5\", 6") end
Can I bother you for one more patch that includes this fix as well? Thanks!
(I'm applying all of the new FasterCSV fixes and am now missing only this in my local checkout, so a patch adding this after the latest patch you gave me would be best. You also don't need to add the test, since I already have.)
James Edward Gray II
P.S. It is quite a bit faster and does seem to play well with m17n, from what I can see. Great work!