FastCSV
A fast Ragel-based CSV parser.
It only reads CSVs that use " as the quote character, , as the delimiter, and \r, \n or \r\n as the line terminator.
Usage
FastCSV.raw_parse is implemented in C and is the fastest way to read CSVs with FastCSV.
require 'fastcsv'
# Read from file.
File.open(filename) do |f|
  FastCSV.raw_parse(f) do |row|
    # do stuff
  end
end

# Read from an IO object.
FastCSV.raw_parse(StringIO.new("foo,bar\n")) do |row|
  # do stuff
end

# Read from a string.
FastCSV.raw_parse("foo,bar\n") do |row|
  # do stuff
end

# Transcode like with the CSV module.
FastCSV.raw_parse("\xF1\n", encoding: 'iso-8859-1:utf-8') do |row|
  # ["ñ"]
end
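The `'iso-8859-1:utf-8'` value appears to follow the `external:internal` convention that Ruby's IO and CSV use: input bytes are read as ISO-8859-1 and rows come back as UTF-8. As a rough illustration using only core String methods (this is not FastCSV's internal code), the transcoding step amounts to something like:

```ruby
# Illustration only: core Ruby string transcoding, not FastCSV internals.
raw = "\xF1\n".b # one ISO-8859-1 byte (0xF1) plus a newline, as raw bytes

# Tag the bytes with the external encoding, then convert to the internal one.
external, internal = 'iso-8859-1:utf-8'.split(':')
text = raw.force_encoding(external).encode(internal)

puts text.encoding # UTF-8
puts text.chomp    # ñ
```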
FastCSV can be used as a drop-in replacement for CSV (replace CSV with FastCSV) except:
- The :quote_char ("), :col_sep (,) and :row_sep (:auto) options are ignored. #2
- If FastCSV raises an error, you can't continue reading. #3 Its error messages don't perfectly match those of CSV.
A few minor caveats:
- Use FastCSV.parse_line(string, options) instead of string.parse_csv(options).
- If you were passing CSV an IO object on which you had wrapped #gets (for example, as described in this article), #gets will not be called.
- The :field_size_limit option is ignored. If you need to prevent DoS attacks (the ostensible reason for this option), limit the size of the input, not the size of quoted fields.
- FastCSV doesn't support UTF-16 or UTF-32. See UTF-8 Everywhere.
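One way to limit the size of the input, as the :field_size_limit caveat suggests, is to cap how many bytes are read before parsing. The sketch below is a minimal example under that assumption; `read_limited` and `InputTooLarge` are hypothetical names, not part of FastCSV.

```ruby
require 'stringio'

# Hypothetical error class and helper, not part of FastCSV.
class InputTooLarge < StandardError; end

# Read at most max_bytes from io; raise if more input remains.
def read_limited(io, max_bytes)
  data = io.read(max_bytes + 1) || '' # one extra byte reveals an overflow
  raise InputTooLarge, "input exceeds #{max_bytes} bytes" if data.bytesize > max_bytes

  data
end

puts read_limited(StringIO.new("foo,bar\n"), 1024).inspect # "foo,bar\n"

begin
  read_limited(StringIO.new('x' * 2048), 1024)
rescue InputTooLarge => e
  puts e.message # input exceeds 1024 bytes
end
```

The returned string could then be handed to FastCSV.raw_parse, which accepts a string as well as an IO object.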
Development
ragel -G2 ext/fastcsv/fastcsv.rl
ragel -Vp ext/fastcsv/fastcsv.rl | dot -Tpng -o machine.png
rake compile
gem uninstall fastcsv
rake install
rake
rspec test/runner.rb test/csv
Implementation
FastCSV implements its Ragel-based CSV parser in C at FastCSV::Parser.
FastCSV is a subclass of CSV. It overrides #shift, replacing the parsing code, in order to act as a drop-in replacement.
FastCSV's raw_parse requires a block to which it yields one row at a time. FastCSV uses Fibers to pass control back to #shift while parsing.
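The general technique of turning a block-yielding parser into a pull-style reader with a Fiber can be sketched as follows. This is an illustration of the idea only: `toy_raw_parse` is a naive stand-in that splits on newlines and commas, not the real C parser, and the `shift` lambda is a hypothetical simplification of what #shift does.

```ruby
# Stand-in for a block-based parser; NOT the real FastCSV::Parser.
# It naively splits on newlines and commas and yields one row at a time.
def toy_raw_parse(text)
  text.each_line { |line| yield line.chomp.split(',') }
end

# Run the block-based parser inside a Fiber so each row suspends parsing.
fiber = Fiber.new do
  toy_raw_parse("a,b\nc,d\n") { |row| Fiber.yield row }
  nil # the Fiber's final value signals the end of input
end

# Pull-style reader: resume the Fiber until it reports the end of input.
done = false
shift = lambda do
  return nil if done

  row = fiber.resume
  done = true if row.nil?
  row
end

p shift.call # ["a", "b"]
p shift.call # ["c", "d"]
p shift.call # nil
p shift.call # nil
```

Each call to `fiber.resume` runs the parser just far enough to produce one row, which is what lets a caller ask for rows one at a time even though the parser itself only knows how to yield to a block.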
CSV delegates IO methods to the IO object it's reading. IO methods that move the pointer within the file, like rewind, change the behavior of CSV's #shift. However, FastCSV's C code won't take notice. We therefore null the Fiber whenever the pointer is moved, so that #shift uses a new Fiber.
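A minimal sketch of discarding the Fiber when the pointer moves, assuming a toy line-splitting reader in place of the C parser. `RowPuller` is a hypothetical class written for this illustration and does not mirror FastCSV's actual code.

```ruby
require 'stringio'

# Hypothetical class for illustration; it mimics the idea, not FastCSV's code.
class RowPuller
  def initialize(io)
    @io = io
    @fiber = nil
  end

  # Pull the next row, lazily creating a Fiber that reads from the IO's
  # current position. Returns nil at the end of input.
  def shift
    @fiber ||= Fiber.new do
      @io.each_line { |line| Fiber.yield line.chomp.split(',') }
      nil
    end
    row = @fiber.resume
    @fiber = nil if row.nil? # a finished Fiber cannot be resumed again
    row
  end

  # Moving the pointer invalidates the suspended Fiber, so drop it.
  def rewind
    @io.rewind
    @fiber = nil
  end
end

reader = RowPuller.new(StringIO.new("a,b\nc,d\n"))
p reader.shift # ["a", "b"]
reader.rewind
p reader.shift # ["a", "b"] again, from a fresh Fiber
p reader.shift # ["c", "d"]
p reader.shift # nil
```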
CSV's #shift runs the regular expression in the :skip_lines option against a row's raw text. FastCSV::Parser implements a row method, which returns the most recently parsed row's raw text.
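To show why access to the raw text matters, here is a small sketch of skipping rows by matching a pattern against each row's raw text. `ToyParser`, `next_row` and `shift_skipping` are illustrative names and a naive comma splitter stands in for the real parser; only the `row` accessor corresponds to the method described above.

```ruby
# Toy stand-in for a parser that remembers the raw text of the last row.
# Class and method names are illustrative, not FastCSV's API.
class ToyParser
  attr_reader :row # raw text of the most recently parsed row

  def initialize(text)
    @lines = text.each_line
  end

  # Return the next row as an array of fields, or nil at the end of input.
  def next_row
    @row = @lines.next.chomp
    @row.split(',')
  rescue StopIteration
    nil
  end
end

# Keep pulling rows, skipping those whose raw text matches the pattern.
def shift_skipping(parser, skip_lines)
  while (fields = parser.next_row)
    return fields unless parser.row.match?(skip_lines)
  end
  nil
end

parser = ToyParser.new("# comment\na,b\n# another\nc,d\n")
p shift_skipping(parser, /\A#/) # ["a", "b"]
p shift_skipping(parser, /\A#/) # ["c", "d"]
p shift_skipping(parser, /\A#/) # nil
```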
FastCSV is tested against the same tests as CSV. See TESTS.md for details.
Why?
We evaluated many Ruby CSV gems, and they were either too slow or had implementation errors. rcsv is fast and libcsv-based, but it skips blank rows (Ruby's CSV module returns an empty array for them) and silently fails on input with an unclosed quote. bamfcsv is well implemented, but it's considerably slower on large files. We looked for Ragel-based CSV parsers to copy, but they either had implementation errors or could not handle large files. commas looks good, but it performs a memory check on each character, which is overkill.
Bugs? Questions?
This project's main repository is on GitHub: http://github.com/opennorth/fastcsv, where your contributions, forks, bug reports, feature requests, and feedback are greatly welcomed.
Acknowledgements
Started as a Ruby 2.1 fork of MoonWolf's CSVScan, found in this commit. CSVScan uses Ragel code from HPricot from this commit. Most of the Ruby (i.e. non-C, non-Ragel) methods are copied from CSV.
Copyright (c) 2014 Open North Inc., released under the MIT license



