Class: RemoteTable

Inherits:
Object
Includes:
Enumerable
Defined in:
lib/remote_table.rb,
lib/remote_table/ods.rb,
lib/remote_table/xls.rb,
lib/remote_table/xml.rb,
lib/remote_table/html.rb,
lib/remote_table/xlsx.rb,
lib/remote_table/yaml.rb,
lib/remote_table/version.rb,
lib/remote_table/delimited.rb,
lib/remote_table/plaintext.rb,
lib/remote_table/local_copy.rb,
lib/remote_table/fixed_width.rb,
lib/remote_table/processed_by_roo.rb,
lib/remote_table/processed_by_nokogiri.rb

Overview

Opens Google Docs spreadsheets and local or remote XLSX, XLS, ODS, CSV (comma-separated), TSV (tab-separated), other delimited, and fixed-width files. HTML, XML, and YAML sources are also accepted (see VALID[:format]).

Defined Under Namespace

Modules: Delimited, FixedWidth, Html, Ods, Plaintext, ProcessedByNokogiri, ProcessedByRoo, Xls, Xlsx, Xml, Yaml

Constant Summary

WHITESPACE =
/\s+/
SINGLE_SPACE =
' '
EXTERNAL_ENCODING =
'UTF-8'
EXTERNAL_ENCODING_ICONV =
'UTF-8//TRANSLIT'
GOOGLE_DOCS_SPREADSHEET =
[
  /docs.google.com/i,
  /spreadsheets.google.com/i
]
VALID =
{
  :compression => [:gz, :zip, :bz2, :exe],
  :packing => [:tar],
  :format => [:xlsx, :xls, :delimited, :ods, :fixed_width, :html, :xml, :yaml, :csv],
}
DEFAULT =
{
  :streaming => false,
  :warn_on_multiple_downloads => true,
  :headers => :first_row,
  :keep_blank_rows => false,
  :skip => 0,
  :encoding => 'UTF-8',
}
OLD_SETTING_NAMES =
{
  :pre_select => [:select],
  :pre_reject => [:reject],
  :delimiter  => [:col_sep],
}
VERSION =
'3.2.0'
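
The constants above suggest how a setting might be resolved: a deprecated name from OLD_SETTING_NAMES is accepted in place of the current one, and DEFAULT supplies a value when nothing was passed. The following is a minimal sketch of that lookup. Note that resolve_setting is a hypothetical helper written for this illustration; it is not the library's private grab method, whose implementation is not shown on this page.

```ruby
# Hypothetical illustration only: the constants are copied from the summary
# above, but resolve_setting is not part of RemoteTable.
DEFAULT = {
  :streaming => false,
  :warn_on_multiple_downloads => true,
  :headers => :first_row,
  :keep_blank_rows => false,
  :skip => 0,
  :encoding => 'UTF-8',
}
OLD_SETTING_NAMES = {
  :pre_select => [:select],
  :pre_reject => [:reject],
  :delimiter  => [:col_sep],
}

# Look up +name+ in +settings+, trying the current name first, then any old
# names, then DEFAULT. The matched key is removed from +settings+ so that
# whatever is left over can be treated as pass-through options.
def resolve_setting(settings, name)
  candidates = [name] + OLD_SETTING_NAMES.fetch(name, [])
  candidates.each do |key|
    return settings.delete(key) if settings.key?(key)
  end
  DEFAULT[name]
end

settings  = { :col_sep => "\t", :skip => 2, :row_sep => "\r\n" }
delimiter = resolve_setting(settings, :delimiter) # found under the old name
skip      = resolve_setting(settings, :skip)      # found under the current name
headers   = resolve_setting(settings, :headers)   # falls back to DEFAULT
```

After these calls, settings holds only the unrecognized :row_sep entry, which is the kind of leftover that #other_options describes.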

Constructor Details

#initialize(settings) ⇒ RemoteTable
#initialize(url, settings) ⇒ RemoteTable

Create a new RemoteTable, which is an Enumerable.

Options are set at creation using any of the attributes listed under Instance Attribute Details. The documentation marks them “read-only” because they cannot be changed after creation.

Nothing is downloaded or parsed at creation; loading is lazy and happens the first time rows are enumerated.

Examples:

Open an XLSX

RemoteTable.new('http://www.customerreferenceprogram.org/uploads/CRP_RFP_template.xlsx')

Open a CSV inside a ZIP file

RemoteTable.new 'http://www.epa.gov/climatechange/emissions/downloads10/2010-Inventory-Annex-Tables.zip',
                :filename => 'Annex Tables/Annex 3/Table A-93.csv',
                :skip => 1,
                :pre_select => proc { |row| row['Vehicle Age'].strip =~ /^\d+$/ }

Overloads:

  • #initialize(settings) ⇒ RemoteTable

    Parameters:

    • settings (Hash)

      Settings including :url.

  • #initialize(url, settings) ⇒ RemoteTable

    Parameters:

    • url (String)

      The URL to the local or remote file.

    • settings (Hash)

      Settings.



# File 'lib/remote_table.rb', line 390

def initialize(*args)
  @download_count_mutex = ::Mutex.new
  @extend_bang_mutex = ::Mutex.new

  @cache = []
  @download_count = 0

  settings = args.last.is_a?(::Hash) ? args.last.symbolize_keys : {}

  @url = if args.first.is_a? ::String
    args.first
  else
    grab settings, :url
  end
  @format = RemoteTable.guess_format grab(settings, :format)
  if GOOGLE_DOCS_SPREADSHEET.any? { |regex| regex =~ url }
    @url = RemoteTable.google_spreadsheet_csv_url url
    @format = :delimited
  end

  @headers = grab settings, :headers
  if headers.is_a?(::Array) and headers.any?(&:blank?)
    raise ::ArgumentError, "[remote_table] If you specify headers, none of them can be blank"
  end
  @quote_char = grab settings, :quote_char

  @compression = grab(settings, :compression) || RemoteTable.guess_compression(url)
  @packing = grab(settings, :packing) || RemoteTable.guess_packing(url)

  @streaming = grab settings, :streaming
  @warn_on_multiple_downloads = grab settings, :warn_on_multiple_downloads
  @delimiter = grab settings, :delimiter
  @sheet = grab settings, :sheet
  @keep_blank_rows = grab settings, :keep_blank_rows
  @form_data = grab settings, :form_data
  @skip = grab settings, :skip
  @encoding = grab settings, :encoding
  @row_xpath = grab settings, :row_xpath
  @column_xpath = grab settings, :column_xpath
  @row_css = grab settings, :row_css
  @column_css = grab settings, :column_css
  @glob = grab settings, :glob
  @filename = grab settings, :filename
  @cut = grab settings, :cut
  @crop = grab settings, :crop
  @schema = grab settings, :schema
  @schema_name = grab settings, :schema_name
  @pre_select = grab settings, :pre_select
  @pre_reject = grab settings, :pre_reject
  @errata = grab settings, :errata
  @parser = grab settings, :parser

  @other_options = settings

  @local_copy = LocalCopy.new self
  extend!
end
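
The two overloads are told apart by inspecting the argument list, as the first lines of the listing show. Here is a small standalone sketch of that handling. The name split_args is hypothetical, and plain Ruby's transform_keys stands in for ActiveSupport's symbolize_keys so the example runs without extra gems.

```ruby
# Hypothetical helper mirroring the argument handling at the top of
# #initialize. It is an illustration, not the library's code.
def split_args(*args)
  # A trailing Hash is the settings; string keys are converted to symbols.
  settings = args.last.is_a?(Hash) ? args.last.transform_keys(&:to_sym) : {}
  # A leading String is the URL; otherwise the URL comes from settings[:url].
  url = args.first.is_a?(String) ? args.first : settings.delete(:url)
  # Explicit headers may not contain blank entries.
  headers = settings[:headers]
  if headers.is_a?(Array) && headers.any? { |h| h.nil? || h.to_s.strip.empty? }
    raise ArgumentError, 'If you specify headers, none of them can be blank'
  end
  [url, settings]
end

url_a, settings_a = split_args('file:///tmp/a.csv', 'skip' => 1)
url_b, settings_b = split_args(:url => 'file:///tmp/b.csv', :skip => 1)
```

Both calls end up with the same shape: a URL string plus a settings Hash with symbol keys.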

Instance Attribute Details

#column_css ⇒ String (readonly)

The CSS selector used to find columns in HTML or XML.

Returns:

  • (String)


# File 'lib/remote_table.rb', line 254

def column_css
  @column_css
end

#column_xpath ⇒ String (readonly)

The XPath used to find columns in HTML or XML.

Returns:

  • (String)


# File 'lib/remote_table.rb', line 246

def column_xpath
  @column_xpath
end

#compression ⇒ Symbol (readonly)

The compression type. Guessed from URL if not provided. :gz, :zip, :bz2, and :exe (treated as :zip) are supported.

Returns:

  • (Symbol)


# File 'lib/remote_table.rb', line 262

def compression
  @compression
end

#crop ⇒ Range (readonly)

Use a range of rows in a plaintext file.

Examples:

Only take rows 21 through 37

RemoteTable.new("http://www.eia.gov/emeu/cbecs/cbecs2003/detailed_tables_2003/2003set10/2003excel/C17.xls",
                :headers => false,
                :pre_select => proc { |row| CbecsEnergyIntensity::NAICS_CODE_SYNTHESIZER.call(row) },
                :crop => (21..37))

Returns:

  • (Range)


# File 'lib/remote_table.rb', line 303

def crop
  @crop
end

#cut ⇒ String (readonly)

Pick specific columns out of a plaintext file using an argument to the UNIX cut utility (en.wikipedia.org/wiki/Cut_%28Unix%29).

Examples:

Pick ALMOST out of ABCDEFGHIJKLMNOPQRSTUVWXYZ

# $ echo ABCDEFGHIJKLMNOPQRSTUVWXYZ | cut -c '1,12,13,15,19,20'
# ALMOST
RemoteTable.new 'file:///atoz.txt', :cut => '1,12,13,15,19,20'

Returns:

  • (String)


# File 'lib/remote_table.rb', line 292

def cut
  @cut
end
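
To make the character positions in the example concrete, here is a rough Ruby equivalent of cut -c for a single line. It is only a sketch: cut_characters is a hypothetical name, and it assumes the list contains single 1-based positions (no ranges such as 3-5). RemoteTable itself passes the argument on to the external utility rather than doing this in Ruby.

```ruby
# Rough Ruby equivalent of `cut -c LIST` for one line, assuming LIST holds
# only single 1-based positions separated by commas.
def cut_characters(line, list)
  positions = list.split(',').map { |n| Integer(n) }
  # cut emits characters in input order, whatever order the list is given in.
  positions.sort.uniq.map { |pos| line[pos - 1] }.compact.join
end

picked = cut_characters('ABCDEFGHIJKLMNOPQRSTUVWXYZ', '1,12,13,15,19,20')
```

Here picked is the string ALMOST, matching the shell example above.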

#delimiter ⇒ String (readonly)

The delimiter, a.k.a. column separator. Passed to Ruby CSV as :col_sep. Default is ‘,’.

Returns:

  • (String)


# File 'lib/remote_table.rb', line 238

def delimiter
  @delimiter
end

#encoding ⇒ String (readonly)

The original encoding of the source file. Default is UTF-8.

Returns:

  • (String)


# File 'lib/remote_table.rb', line 234

def encoding
  @encoding
end

#errata ⇒ Hash (readonly)

An object that responds to #rejects?(row) and #correct!(row). Applied after creating row_hash.

  • #rejects?(row) - if row should be treated like it doesn’t exist

  • #correct!(row) - destructively update a row to fix something

See the Errata library at github.com/seamusabshere/errata for an example implementation.

Returns:

  • (Hash)

# File 'lib/remote_table.rb', line 340

def errata
  @errata
end
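
As an illustration of the two-method contract described above, here is a minimal errata object. The class name, column names, and correction are all invented for the example; the only parts taken from this page are the method names #rejects? and #correct! and the order in which #each calls them.

```ruby
# A hypothetical errata object; only the two methods named above are required.
class BlankNameErrata
  # Treat rows without a name as if they did not exist.
  def rejects?(row)
    row['name'].to_s.strip.empty?
  end

  # Destructively fix a known misspelling in place.
  def correct!(row)
    row['country'] = 'Germany' if row['country'] == 'Germnay'
    row
  end
end

errata = BlankNameErrata.new
rows = [
  { 'name' => 'Ada',  'country' => 'Germnay' },
  { 'name' => '   ',  'country' => 'France' },
  { 'name' => 'Lise', 'country' => 'Austria' },
]

# Same order as in #each: reject first, then correct the survivors.
kept = rows.reject { |row| errata.rejects?(row) }
kept.each { |row| errata.correct!(row) }
```

An object like this would be passed as :errata => BlankNameErrata.new when creating the table.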

#filename ⇒ String (readonly)

The filename, which can be used to pick a file out of an archive.

Examples:

Specify the filename to get out of a ZIP file

RemoteTable.new 'http://www.fueleconomy.gov/FEG/epadata/08data.zip', :filename => '2008_FE_guide_ALL_rel_dates_-no sales-for DOE-5-1-08.csv'

Returns:

  • (String)


# File 'lib/remote_table.rb', line 282

def filename
  @filename
end

#form_data ⇒ String (readonly)

Form data to POST in the download request. It should normally be encoded as application/x-www-form-urlencoded.

Returns:

  • (String)


# File 'lib/remote_table.rb', line 226

def form_data
  @form_data
end

#format ⇒ Symbol (readonly)

The format of the source file. Can be specified as :xlsx, :xls, :delimited (aka :csv), :ods, :fixed_width, :html, :xml, or :yaml.

Note: all docs.google.com and spreadsheets.google.com URLs are treated as :delimited.

Default: guessed from the file extension (which is usually the extension of the URL, but not always, for example when a specific file is picked out of an archive).

Returns:

  • (Symbol)

# File 'lib/remote_table.rb', line 258

def format
  @format
end

#glob ⇒ String (readonly)

The glob used to pick a file out of an archive.

Examples:

Pick out the only CSV in a ZIP file

RemoteTable.new 'http://www.fueleconomy.gov/FEG/epadata/08data.zip', :glob => '/*.csv'

Returns:

  • (String)


# File 'lib/remote_table.rb', line 274

def glob
  @glob
end

#headers ⇒ :first_row, ... (readonly)

Headers specified by the user: :first_row (the default), false, or a list of headers.

Returns:

  • (:first_row, false, Array<String>)


# File 'lib/remote_table.rb', line 207

def headers
  @headers
end

#keep_blank_rows ⇒ true, false (readonly)

Whether to keep blank rows. Default is false.

Returns:

  • (true, false)


# File 'lib/remote_table.rb', line 222

def keep_blank_rows
  @keep_blank_rows
end

#other_options ⇒ Hash (readonly)

Options passed by the user that may be passed through to the underlying parsing library.

Returns:

  • (Hash)

# File 'lib/remote_table.rb', line 367

def other_options
  @other_options
end

#packing ⇒ Symbol (readonly)

The packing type. Guessed from URL if not provided. Only :tar is supported.

Returns:

  • (Symbol)


# File 'lib/remote_table.rb', line 266

def packing
  @packing
end

#pre_reject ⇒ Proc (readonly)

A proc that decides whether to exclude a row: rows for which it returns a truthy value are skipped. Previously passed as :reject.

Returns:

  • (Proc)


# File 'lib/remote_table.rb', line 330

def pre_reject
  @pre_reject
end

#pre_select ⇒ Proc (readonly)

A proc that decides whether to include a row: rows for which it returns a falsy value are skipped. Previously passed as :select.

Returns:

  • (Proc)


# File 'lib/remote_table.rb', line 326

def pre_select
  @pre_select
end
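
The interaction between pre_select and pre_reject can be sketched with plain hashes. The pre_select proc below is the one from the constructor example; the pre_reject proc and the sample rows are made up for illustration. The two guards follow the order used in #each.

```ruby
# Sample rows standing in for what a table would yield (invented data).
rows = [
  { 'Vehicle Age' => ' 0 ' },
  { 'Vehicle Age' => 'Total' },
  { 'Vehicle Age' => '12' },
  { 'Vehicle Age' => '30' },
]

pre_select = proc { |row| row['Vehicle Age'].strip =~ /^\d+$/ }
pre_reject = proc { |row| row['Vehicle Age'].to_i > 25 } # hypothetical

# Mirrors the two guards in #each: a row is kept only if pre_select (when
# given) returns truthy and pre_reject (when given) returns falsy.
kept = rows.select do |row|
  next false if pre_select and !pre_select.call(row)
  next false if pre_reject and pre_reject.call(row)
  true
end
```

The 'Total' row fails pre_select and the '30' row is caught by pre_reject, so two rows remain.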

#quote_char ⇒ String (readonly)

Quote character for delimited files.

Defaults to double quotes.

Returns:

  • (String)


# File 'lib/remote_table.rb', line 214

def quote_char
  @quote_char
end

#row_css ⇒ String (readonly)

The CSS selector used to find rows in HTML or XML.

Returns:

  • (String)


# File 'lib/remote_table.rb', line 250

def row_css
  @row_css
end

#row_xpath ⇒ String (readonly)

The XPath used to find rows in HTML or XML.

Returns:

  • (String)


# File 'lib/remote_table.rb', line 242

def row_xpath
  @row_xpath
end

#schema ⇒ Array<Array{String,Integer,Hash}> (readonly)

The fixed-width schema, given as a multi-dimensional array.

Examples:

From the tests

RemoteTable.new('http://cloud.github.com/downloads/seamusabshere/remote_table/test2.fixed_width.txt',
                 :format => :fixed_width,
                 :skip => 1,
                 :schema => [[ 'header4', 10, { :type => :string }  ],
                             [  'spacer',  1 ],
                             [  'header5', 10, { :type => :string } ],
                             [  'spacer',  12 ],
                             [  'header6', 10, { :type => :string } ]])

Returns:

  • (Array<Array{String,Integer,Hash}>)

# File 'lib/remote_table.rb', line 318

def schema
  @schema
end
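
To show what a schema of this shape means, here is a sketch that slices one fixed-width line by walking the schema from left to right. It rests on assumptions that are not confirmed by this page: that entries named 'spacer' are discarded and that values are stripped of padding. The method slice_fixed_width is hypothetical; the library delegates the real parsing to its FixedWidth module.

```ruby
# Hypothetical slicer: consumes +width+ characters per schema entry and
# discards entries named 'spacer'. The options hash is ignored here.
def slice_fixed_width(line, schema)
  offset = 0
  schema.each_with_object({}) do |(name, width, _options), row|
    value = line[offset, width].to_s
    offset += width
    row[name] = value.strip unless name == 'spacer'
  end
end

schema = [['header4', 10, { :type => :string }],
          ['spacer',   1],
          ['header5', 10, { :type => :string }],
          ['spacer',  12],
          ['header6', 10, { :type => :string }]]

# Build a 43-character line that matches the schema widths.
line = 'alpha'.ljust(10) + ' ' + 'beta'.ljust(10) + ' ' * 12 + 'gamma'.ljust(10)
row = slice_fixed_width(line, schema)
```

The resulting row has the three named columns and no spacer keys.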

#schema_name ⇒ String, Symbol (readonly)

The name of a fixed-width schema that has already been defined elsewhere, so that it can be reused.

Returns:

  • (String, Symbol)


# File 'lib/remote_table.rb', line 322

def schema_name
  @schema_name
end

#sheet ⇒ Object (readonly)

The sheet specified by the user as a number or a string.

# File 'lib/remote_table.rb', line 218

def sheet
  @sheet
end

#skip ⇒ Integer (readonly)

How many rows to skip at the beginning of the file or table. Default is 0.

Returns:

  • (Integer)


# File 'lib/remote_table.rb', line 230

def skip
  @skip
end

#streaming ⇒ true, false (readonly)

Whether to stream the rows without caching them. Saves memory, but you have to re-download the file every time you enumerate its rows. Defaults to false.

Returns:

  • (true, false)


# File 'lib/remote_table.rb', line 199

def streaming
  @streaming
end

#url ⇒ String (readonly)

The URL of the local or remote file.

Examples:

Local

file:///Users/myuser/Desktop/holidays.csv

Local using an absolute path

/Users/myuser/Desktop/holidays.csv

Remote

http://data.brighterplanet.com/countries.csv

Returns:

  • (String)


# File 'lib/remote_table.rb', line 180

def url
  @url
end

#warn_on_multiple_downloads ⇒ true, false (readonly)

Whether to warn the user on multiple downloads. Defaults to true.

Returns:

  • (true, false)


# File 'lib/remote_table.rb', line 203

def warn_on_multiple_downloads
  @warn_on_multiple_downloads
end

Class Method Details

.google_spreadsheet_csv_url(url) ⇒ String

Given a Google Docs spreadsheet URL, make sure it uses CSV output.

Returns:

  • (String)


# File 'lib/remote_table.rb', line 104

def google_spreadsheet_csv_url(url)
  uri = ::URI.parse url
  params = uri.query.split('&')
  params.delete_if { |param| param.start_with?('output=') }
  params << 'output=csv'
  uri.query = params.join('&')
  uri.to_s
end
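
The effect of the listing is easiest to see on sample URLs. The sketch below is a variation written for illustration, not the library's code: the name csv_output_url and the sample URLs are invented, and the added to_s guards against URLs with no query string, where URI#query returns nil and the listing as shown would raise NoMethodError.

```ruby
require 'uri'

# Variation on the listing above: force output=csv in the query string.
# +to_s+ turns a missing query (nil) into an empty string before splitting.
def csv_output_url(url)
  uri = URI.parse(url)
  params = uri.query.to_s.split('&')
  params.delete_if { |param| param.start_with?('output=') }
  params << 'output=csv'
  uri.query = params.join('&')
  uri.to_s
end

with_output   = csv_output_url('https://spreadsheets.google.com/pub?key=abc123&output=html')
without_query = csv_output_url('https://docs.google.com/spreadsheet/pub')
```

An existing output parameter is replaced, and a URL without any query string simply gains ?output=csv.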

.guess_compression(url) ⇒ Symbol?

Guess compression based on URL. Used internally.

Returns:

  • (Symbol, nil)


# File 'lib/remote_table.rb', line 55

def guess_compression(url)
  extname = extname(url).downcase
  case extname
  when /gz/, /gunzip/
    :gz
  when /zip/
    :zip
  when /bz2/, /bunzip2/
    :bz2
  when /exe/
    :exe
  end
end

.guess_format(basename) ⇒ Symbol?

Guess file format from the basename. Since a file might be decompressed and/or pulled out of an archive with a glob, this usually can’t be called until a file is downloaded.

Returns:

  • (Symbol, nil)


# File 'lib/remote_table.rb', line 80

def guess_format(basename)
  case basename.to_s.downcase.strip
  when /ods\z/, /open_?office\z/
    :ods
  when /xlsx\z/, /excelx\z/
    :xlsx
  when /xls\z/, /excel\z/
    :xls
  when /csv\z/, /tsv\z/, /delimited\z/
    # note that there is no RemoteTable::Csv class - it's normalized to :delimited
    :delimited
  when /fixed_?width\z/
    :fixed_width
  when /html?\z/
    :html
  when /xml\z/
    :xml
  when /yaml\z/, /yml\z/
    :yaml
  end
end
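
The matching rules in the listing can be restated as a lookup table, which makes a few consequences easier to test: patterns are anchored only at the end of the string, so a bare symbol such as :csv matches just like a filename, and anything unrecognized yields nil. The names FORMAT_PATTERNS and guess_format_sketch are hypothetical; this is an equivalent restatement for illustration, not the library's implementation.

```ruby
# Table-driven restatement of the case statement above. Order is preserved,
# and each pattern is anchored at the end of the string with \z.
FORMAT_PATTERNS = {
  :ods         => [/ods\z/, /open_?office\z/],
  :xlsx        => [/xlsx\z/, /excelx\z/],
  :xls         => [/xls\z/, /excel\z/],
  :delimited   => [/csv\z/, /tsv\z/, /delimited\z/],
  :fixed_width => [/fixed_?width\z/],
  :html        => [/html?\z/],
  :xml         => [/xml\z/],
  :yaml        => [/yaml\z/, /yml\z/],
}

def guess_format_sketch(basename)
  name = basename.to_s.downcase.strip
  # Hash#find returns the first [format, patterns] pair that matches, or nil.
  match = FORMAT_PATTERNS.find { |_format, patterns| patterns.any? { |re| re =~ name } }
  match && match.first
end

from_filename = guess_format_sketch('Table A-93.CSV')
from_symbol   = guess_format_sketch(:csv)
unknown       = guess_format_sketch('archive.tar')
```

Both from_filename and from_symbol come out as :delimited, which is why :csv appears in VALID[:format] even though there is no Csv module.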

.guess_packing(url) ⇒ Symbol?

Guess packing from URL. Used internally.

Returns:

  • (Symbol, nil)


# File 'lib/remote_table.rb', line 71

def guess_packing(url)
  basename = basename(url).downcase
  if basename.include?('.tar') or basename.include?('.tgz')
    :tar
  end
end

.normalize_whitespace(v) ⇒ Object
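
Collapse each run of whitespace into a single space and strip leading and trailing whitespace. Returns a new String; the argument is not modified.
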
# File 'lib/remote_table.rb', line 113

def normalize_whitespace(v)
  v = v.to_s.dup
  v.gsub! WHITESPACE, SINGLE_SPACE
  v.strip!
  v
end
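
A standalone copy of the method shows what it does to typical header text. The constants are repeated from the Constant Summary so the snippet runs on its own; the sample strings are invented.

```ruby
WHITESPACE = /\s+/
SINGLE_SPACE = ' '

# Same steps as the listing above, as a standalone method. The dup means the
# caller's string is never mutated and frozen strings are handled safely.
def normalize_whitespace(v)
  v = v.to_s.dup
  v.gsub! WHITESPACE, SINGLE_SPACE
  v.strip!
  v
end

original = "  Vehicle\tAge \n (years) "
cleaned  = normalize_whitespace(original)
from_nil = normalize_whitespace(nil)
```

Tabs and newlines collapse to single spaces, and nil becomes an empty string because of the to_s call.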

.transpose(url, key_key, value_key, options = {}) ⇒ Object

Build a Hash that maps each row's key_key column to its value_key column. If a key occurs in more than one row, the last row wins.



# File 'lib/remote_table.rb', line 46

def transpose(url, key_key, value_key, options = {})
  new(url, options).inject({}) do |memo, row|
    memo[row[key_key]] = row[value_key]
    memo
  end
end
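
The fold in the listing can be tried on an in-memory array. The rows and column names below are invented stand-ins; with RemoteTable they would come from new(url, options).

```ruby
# Stand-in rows (invented data) in place of a downloaded table.
rows = [
  { 'iso' => 'DE', 'name' => 'Germany' },
  { 'iso' => 'FR', 'name' => 'France' },
  { 'iso' => 'DE', 'name' => 'Deutschland' },
]

# Same fold as .transpose: later rows overwrite earlier ones with the same key.
mapping = rows.inject({}) do |memo, row|
  memo[row['iso']] = row['name']
  memo
end
```

With the library, the equivalent call would take the form RemoteTable.transpose(url, 'iso', 'name').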

Instance Method Details

#[](row_number) ⇒ Hash, Array

Get a row by row number. Zero-based.

Returns:

  • (Hash, Array)

# File 'lib/remote_table.rb', line 502

def [](row_number)
  if fully_cached?
    cache[row_number]
  else
    to_a[row_number]
  end
end

#each {|Hash, Array| ... } ⇒ nil Also known as: each_row

Yield each row.

Yields:

  • (Hash, Array)

    A hash or an array depending on whether the RemoteTable has named headers (column names).

Returns:

  • (nil)


# File 'lib/remote_table.rb', line 453

def each
  if fully_cached?
    cache.each do |row|
      yield row
    end
  else
    mark_download!
    preprocess!
    memo = _each do |row|
      parser.call(row).each do |virtual_row|
        virtual_row.row_hash = ::HashDigest.digest2 row
        if errata
          next if errata.rejects? virtual_row
          errata.correct! virtual_row
        end
        next if pre_select and !pre_select.call(virtual_row)
        next if pre_reject and pre_reject.call(virtual_row)
        unless streaming
          cache.push virtual_row
        end
        yield virtual_row
      end
    end
    unless streaming
      fully_cached!
    end
    memo
  end
  nil
end
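
The caching behaviour of #each, and the cost of the streaming option, can be seen in a miniature Enumerable. This is a hypothetical model, not the library's code: TinyTable is invented, the download is simulated by a block so fetches can be counted, and the parser, errata, and filter steps are left out.

```ruby
# Hypothetical miniature of the caching logic in #each.
class TinyTable
  include Enumerable

  attr_reader :download_count

  def initialize(streaming: false, &source)
    @streaming = streaming
    @source = source
    @cache = []
    @fully_cached = false
    @download_count = 0
  end

  def each
    if @fully_cached
      # Serve rows from memory; no new fetch.
      @cache.each { |row| yield row }
    else
      @download_count += 1
      @source.call.each do |row|
        @cache.push row unless @streaming
        yield row
      end
      @fully_cached = true unless @streaming
    end
    nil
  end
end

cached = TinyTable.new { [{ 'a' => 1 }, { 'a' => 2 }] }
2.times { cached.to_a }

streamed = TinyTable.new(streaming: true) { [{ 'a' => 1 }, { 'a' => 2 }] }
2.times { streamed.to_a }
```

The cached table fetches once for both enumerations, while the streaming table fetches once per enumeration, which is the trade-off described under #streaming.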

#free ⇒ nil

Clear the row cache so that the garbage collector can reclaim the cached rows. The next enumeration reloads the data.

Returns:

  • (nil)


# File 'lib/remote_table.rb', line 513

def free
  @fully_cached = false
  cache.clear
  nil
end

#parser ⇒ #call

An object that responds to #call(row) and returns an array of one or more rows.

Returns:

  • (#call)


# File 'lib/remote_table.rb', line 361

def parser
  @final_parser ||= (@parser || NullParser.new)
end
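
Because a parser returns an array of rows, it can expand one source row into several. The sketch below shows only the #call contract; YearUnpivoter and its column names are invented. Note that #each also assigns row_hash on every returned row, and how plain hashes acquire that writer is not shown on this page, so the sketch does not cover it.

```ruby
# Hypothetical parser: turns one wide row into one row per year column.
class YearUnpivoter
  def call(row)
    # Keys that look like four-digit years become separate rows.
    row.keys.grep(/\A\d{4}\z/).map do |year|
      { 'country' => row['country'], 'year' => year.to_i, 'value' => row[year] }
    end
  end
end

parser = YearUnpivoter.new
virtual_rows = parser.call('country' => 'FR', '2009' => 10, '2010' => 12)
```

One input row with two year columns yields two virtual rows, one per year.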

#to_a ⇒ Array<Hash,Array> Also known as: rows

Returns all rows.

Returns:

  • (Array<Hash,Array>)

# File 'lib/remote_table.rb', line 488

def to_a
  if fully_cached?
    cache.dup
  else
    map { |row| row }
  end
end