[PATCH] CSV parsing speedup
This patch breaks the CSV parser into multiple methods that are easier to understand and it allows for the performance optimizations in the second patch. It removes all regular expressions from the parser, resulting in a ~25% speed improvement in the CSV test suite. It adds a new CSV parser option, :io_read_limit, which determines the max size for IO reads. This option defaults to 2048 which to was the fastest in my benchmarks.
This patch adds two shortcuts to the patch above that significantly improve parsing of CSV files that have many quoted columns. It has to be applied on top of the first patch.
On large CSV files I observed that these patches resulted in a 20% - 60% reduction of time it takes to parse. If this patchset looks good, I would like to experiment with further improvements that take advantage of io_read_limit to always read from IO in large chunks (right now it only does so with CSV files that have no quote characters).
These patches maintain m17n support and multi-character separator support (and boy, it's tough to make those tests happy :)
#1 Updated by James Gray over 6 years ago
I'm for adding the read_limit option (I prefer that name to io_read_limit).
However, I'm not sure how I feel about defaulting it to some number. That's going to mean that some documents CSV used to correctly parse could now fail. While a limit is a good idea in practice, I don't feel comfortable making that choice for the programmer. I think you need to opt-in to a sensible limit for your purposes and, until you do, we assume you want the best parsing we can provide.
So, my main question is, can we drop the default limit and keep the gains?
I would definitely before applying a patch like this if we:
- Rename the limit to just read_limit
- Default to using no limit
I hope that makes sense.
#2 Updated by Timothy Elliott over 6 years ago
* Rename the limit to just read_limit
That name sounds good. I'll submit an updated patch with the better name.
- Default to using no limit
I didn't do a good job explaining how this patch uses the read_limit option. Trunk currently uses IO#gets(newline_character) to read from @io. This patch gets a performance boost by using IO#gets(quote_character, read_limit) instead. The performance advantage is that we don't always care about newline characters (for example when reading data in a quoted column) but we do always care about quote characters.
One side effect of reading upto quote characters instead of newline characters is that it exposes an issue with using IO#gets without a limit:
If you parse a huge file without newline characters (on the trunk parser) or without quote characters (in the parser from this patch) you can run out of memory.
A huge file without newlines probably only would happen if it was done intentionally, but it is quite common for CSV files to have no quote characters in them. If the default for the parser in this patch was no read_limit, these files would consume a lot of memory and possibly cause an OOM situation.
Sorry for providing the patch with so little explanation. That change from IO#gets(newline_char) to IO#gets(quote_char, read_limit) is significant and should have been mentioned earlier.
#3 Updated by James Gray over 6 years ago
Ah, sorry. I misunderstood. I think I get it now.
Basically, read_limit isn't a limit so much as a read-at-a-time control dial that can be used to tune parsing speed. It's set to a reasonable default and users can tune it the their specific needs. Am I understanding better now?
I was thinking it was bubbling the limit used by methods like IO#gets() up to the user. I've long wanted to do that, to account for data like:
"...gigs of data containing anything but a quote
Beyond that, the refactoring sounds neat to me. Under what circumstances does it turn out to be slower than the old parser? It seems like there would have to be some. I don't say that because the old parser was well tuned either. It's just that we've now shifted the focus of the parser (from newlines to quotes, if I'm understanding correctly). For example, if each line is 20 simple fields, all unnecessarily but legally quoted, do the increased IO operations slow things down?
I guess if there are such cases, we just need to decide if the ones we are optimizing for are common. My thoughts there are probably that the test sweet isn't an ideal speed measuring tool. We might want to grab several realistic CSV documents to use for that purpose.
Any thoughts on my babbling here Tim? Am I making sense.