Project

General

Profile

Actions

Feature #1981

closed

[PATCH] CSV Parsing Speedup

Added by ender672 (Timothy Elliott) about 15 years ago. Updated over 13 years ago.

Status:
Closed
Target version:
[ruby-core:25038]

Description

=begin
This patch replaces the regex used in the Ruby 1.9 CSV parser with ruby code.

Running all CSV tests (ts_all.rb) is much faster (36% on my machine). Probably because they don't have to rebuild the regex over and over again. I tested on large CSV files and got an average speedup of 23% on my machine.

(James, this patch is improved & faster from the one I emailed to you on 8/21. Putting into this ticket to make it easier to track any changes.)
=end


Files

ruby_19_csv_speedup_01.patch (7.71 KB) ruby_19_csv_speedup_01.patch Patch against current svn (24629) ender672 (Timothy Elliott), 08/23/2009 04:44 AM
csv.rb (80.6 KB) csv.rb Patched csv.rb ender672 (Timothy Elliott), 08/23/2009 04:44 AM
ruby_19_csv_speedup_02.patch (7.71 KB) ruby_19_csv_speedup_02.patch Same patch as before, against r26974 ender672 (Timothy Elliott), 03/20/2010 03:36 AM
csv.rb (80.8 KB) csv.rb Patched lib/csv.rb, r26974 ender672 (Timothy Elliott), 03/20/2010 03:36 AM
check_for_stray_quotes_19.patch (1.1 KB) check_for_stray_quotes_19.patch ender672 (Timothy Elliott), 03/23/2010 04:16 PM
Actions #1

Updated by JEG2 (James Gray) about 15 years ago

  • Due date set to 12/31/2009
  • Target version set to 2.0.0

=begin
I just wanted to state that I am aware of this patch and do plan to apply something like it. I'm purposefully waiting a bit to see how the new FasterCSV release, which uses a similar non-regex approach, fairs in the wild. When I fill confident we've made the right choice to switch parsing strategies, I will commit some version of this patch.
=end

Actions #2

Updated by JEG2 (James Gray) over 14 years ago

=begin
Timothy, is it possible for you to redo this patch against the current trunk, so I can look at what it would take to get it applied?

You mention that it is faster which is great. Are you also pretty confident that it maintains CSV's m17n savvy implementation? I don't want to lose the ability to parse in any encoding. I have some tests for this and I'll try to look over the code as we change it, but I know I'm not perfect and could have missed something.
=end

Updated by ender672 (Timothy Elliott) over 14 years ago

=begin
I updated the patch to against trunk. Attaching to this ticket update.

I took care to make it pass all tests, including the m17n tests in test/csv/test_encodings.rb . I have not tested it against m17n CSV files outside of the tests. It would be a good idea to test against a few large m17n CSV files to make sure that the new operations used by this parser (String.gsub!, String#count & String#last) perform well in that scenario.

=end

Actions #4

Updated by JEG2 (James Gray) over 14 years ago

=begin
Thanks Timothy, but this patch doesn't include your latest fix from the FasterCSV side, to restore the strict parser behavior. This tests is currently failing:

def test_non_regex_edge_cases
# An early version of the non-regex parser fails this test
[ [ "foo,"foo,bar,baz,foo","foo"",
["foo", "foo,bar,baz,foo", "foo"] ] ].each do |edge_case|
assert_equal(edge_case.last, CSV.parse_line(edge_case.first))
end

 assert_raise(CSV::MalformedCSVError) do
   CSV.parse_line("1,\"23\"4\"5\", 6")
 end

end

Can I bother you for one more patch that includes this fix as well? Thanks!

(I'm applying all of the new FasterCSV fixes and am now missing only this in my local checkout, so a patch adding this after the latest patch you gave me would be best. You also don't need to add the test, since I already have.)

James Edward Gray II

P.S. It is quite a bit faster and does seem to play well with m17n, from what I can see. Great work!

=end

Actions #5

Updated by ender672 (Timothy Elliott) over 14 years ago

=begin
Patch attached.

Thanks,
Tim Elliott
=end

Actions #6

Updated by JEG2 (James Gray) over 14 years ago

  • Status changed from Open to Closed

=begin
Thanks for all the help Tim. We're all caught up now with all of your patches applied.
=end

Actions

Also available in: Atom PDF

Like0
Like0Like0Like0Like0Like0Like0