Creeker - Stream parser for large Excel (xlsx and xlsm) files.

Based on the Creek gem, Creeker is a Ruby gem that provides a fast, simple, and efficient way to parse large Excel (xlsx and xlsm) files.

Creeker addresses the heavy memory usage of parsing large files by using multi-threading and garbage collection.

BENCHMARKS:

Parsing an Excel file with 150,000 rows:

Creek gem: 98~99.5% CPU usage
Creeker gem: 23~24.5% CPU usage

Installation

Creeker can be used from the command line or as part of a Ruby web framework. To install the gem from a terminal, run the following command:

gem install creeker

To use it in Rails, add this line to your Gemfile:

gem 'creeker'

Basic Usage

Creeker parses an Excel file by iterating over a sheet's rows enumerator.

To get the headers:

require 'creeker'
creeker = Creeker::Book.new 'spec/fixtures/sample.xlsx'
sheet = creeker.sheets[0]
headers = sheet.rows.first.values
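
Once the headers are known, it is often convenient to key each row by header text instead of by cell reference. The following is a minimal sketch, not part of Creeker: records_from and column_of are hypothetical helper names, and the code assumes only that each row is a hash keyed by cell references such as "A1", as in the examples in this README.

```ruby
# Hypothetical helper: strip the row number from a cell reference ("B12" -> "B").
def column_of(cell_ref)
  cell_ref[/\A[A-Z]+/]
end

# Hypothetical helper: lazily convert rows keyed by cell reference into
# hashes keyed by header text. The first row is treated as the header row.
# `rows` can be any enumerable of hashes shaped like the rows shown above.
def records_from(rows)
  Enumerator.new do |out|
    headers = nil
    rows.each do |row|
      by_column = row.to_h { |ref, value| [column_of(ref), value] }
      if headers.nil?
        # Drop columns without a header so they do not collapse into a nil key.
        headers = by_column.compact
      else
        out << headers.to_h { |column, name| [name, by_column[column]] }
      end
    end
  end
end

sample_rows = [
  { "A1" => "Name",   "B1" => "Age" },
  { "A2" => "Fluffy", "B2" => "3" }
]
records_from(sample_rows).each { |record| puts record.inspect }
```

Because the helper returns an Enumerator, passing sheet.rows to it should keep the row-by-row processing instead of loading the whole sheet into an array first.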

To parse the whole file, we recommend enabling multi-threading and garbage collection as follows:

require 'creeker'
creeker = Creeker::Book.new 'spec/fixtures/sample.xlsx', multi_thread: true
sheet = creeker.sheets[0]

sheet.rows.each do |row|
  puts row # => {"A1"=>"Content 1", "B1"=>nil, "C1"=>nil, "D1"=>"Content 3"}
end

# Method name assumed from the Creek gem, on which Creeker is based
sheet.rows_with_meta_data.each do |row|
  puts row # => {"collapsed"=>"false", "customFormat"=>"false", "customHeight"=>"true", "hidden"=>"false", "ht"=>"12.1", "outlineLevel"=>"0", "r"=>"1", "cells"=>{"A1"=>"Content 1", "B1"=>nil, "C1"=>nil, "D1"=>"Content 3"}}
end

sheet.state   # => 'visible'
sheet.name    # => 'Sheet1'
sheet.rid     # => 'rId2'

Filename considerations

By default, Creeker will ensure that the file extension is either *.xlsx or *.xlsm, but this check can be circumvented as needed:

path = 'sample-as-zip.zip'
Creeker::Book.new path, :check_file_extension => false

By default, the Rails file_field_tag uploads to a temporary location and stores the original filename with the StringIO object. (See the Rails Guides for more information.)

Creeker can parse this directly, without file upload gems such as CarrierWave or Paperclip, by passing the temporary file's path and disabling the extension check:

# Import endpoint in Rails controller
def import
  file = params[:file]
  Creeker::Book.new file.path, check_file_extension: false
end
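
Disabling the extension check means Creeker no longer verifies that the upload looks like an Excel file, so it may be worth checking the original filename yourself before parsing. The following is a minimal sketch using only the Ruby standard library; excel_filename? and ALLOWED_EXTENSIONS are hypothetical names, not part of Creeker.

```ruby
# Extensions that Creeker accepts by default, according to this README.
ALLOWED_EXTENSIONS = %w[.xlsx .xlsm].freeze

# Hypothetical helper: true when the filename ends in .xlsx or .xlsm,
# ignoring case. A nil or empty filename is rejected.
def excel_filename?(original_filename)
  extension = File.extname(original_filename.to_s).downcase
  ALLOWED_EXTENSIONS.include?(extension)
end

puts excel_filename?("report.XLSX")
puts excel_filename?("report.csv")
```

In a Rails controller, the name to check would typically come from file.original_filename on the uploaded file object, before file.path is handed to Creeker::Book.new.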

Parsing images

Creeker does not parse images by default. To parse them, call the with_images method before iterating over rows to preload image information. If you don't call this method, Creeker will not return images anywhere.

Cells with images are returned as an array of Pathname objects. If an image spans multiple cells, the same Pathname object is returned for each cell.

sheet.with_images.rows.each do |row|
  puts row # => {"A1"=>[#<Pathname:/var/folders/ck/l64nmm3d4k75pvxr03ndk1tm0000gn/T/creeker__drawing20161101-53599-274q0vimage1.jpeg>], "B2"=>"Fluffy"}
end

Images for a specific cell can be obtained with the images_at method:

puts sheet.images_at('A1') # => [#<Pathname:/var/folders/ck/l64nmm3d4k75pvxr03ndk1tm0000gn/T/creeker__drawing20161101-53599-274q0vimage1.jpeg>]

# no images in a cell
puts sheet.images_at('C1') # => nil

Creeker will most likely return nil for a cell with images if there is no other text cell in that row. In that case, use the images_at method to retrieve the images in that cell.
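
Since image cells arrive as arrays of Pathname objects mixed in with ordinary text cells, a small helper can separate them and deduplicate images that span several cells. This is a sketch under the assumptions stated in this section (image cells are non-empty arrays of Pathname objects); image_cells and unique_image_paths are hypothetical names, not Creeker methods.

```ruby
require 'pathname'

# Hypothetical helper: keep only the cells of a row whose value is a
# non-empty array of Pathname objects, i.e. the cells that hold images.
def image_cells(row)
  row.select do |_ref, value|
    value.is_a?(Array) && value.any? && value.all? { |item| item.is_a?(Pathname) }
  end
end

# Hypothetical helper: collect every distinct image path across rows.
# An image spanning several cells is reported once.
def unique_image_paths(rows)
  rows.flat_map { |row| image_cells(row).values.flatten }.uniq
end

image = Pathname.new("/tmp/image1.jpeg")
sample_rows = [
  { "A1" => [image], "B1" => [image], "C1" => "Fluffy" },
  { "A2" => nil, "B2" => "Rex" }
]
puts unique_image_paths(sample_rows).inspect
```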

Remote files

Creeker can also parse a file from a URL when the remote option is set:

remote_url = 'http://dev-builds.libreoffice.org/tmp/test.xlsx'
Creeker::Book.new remote_url, remote: true

Contributing

Contributions are welcome. Fork the repository, add your changes to a branch on your fork, make sure all existing unit tests pass, add unit tests that cover your changes, and then open a pull request.

After forking and cloning the repository locally, install Bundler and use it to install the development dependencies:

gem install bundler
bundle install

Once this is complete, you should be able to run the test suite:

rake

Bug Reporting

Please use the Issues page to report bugs or suggest enhancements.

License

Creeker is published under the MIT License.