Class: RemoteTable

Inherits:
Object
Includes:
Enumerable
Defined in:
lib/remote_table.rb,
lib/remote_table/ods.rb,
lib/remote_table/xls.rb,
lib/remote_table/xml.rb,
lib/remote_table/html.rb,
lib/remote_table/json.rb,
lib/remote_table/xlsx.rb,
lib/remote_table/yaml.rb,
lib/remote_table/version.rb,
lib/remote_table/delimited.rb,
lib/remote_table/plaintext.rb,
lib/remote_table/local_copy.rb,
lib/remote_table/fixed_width.rb,
lib/remote_table/processed_by_roo.rb,
lib/remote_table/processed_by_nokogiri.rb

Overview

Opens Google Docs spreadsheets and local or remote XLSX, XLS, ODS, CSV (comma-separated), TSV (tab-separated), other delimited, and fixed-width files.

Defined Under Namespace

Modules: Delimited, FixedWidth, Html, Json, Ods, Plaintext, ProcessedByNokogiri, ProcessedByRoo, Xls, Xlsx, Xml, Yaml

Constant Summary

WHITESPACE =
/\s+/
SINGLE_SPACE =
' '
EXTERNAL_ENCODING =
'UTF-8'
EXTERNAL_ENCODING_ICONV =
'UTF-8//TRANSLIT'
GOOGLE_DOCS_SPREADSHEET =
[
  /docs.google.com/i,
  /spreadsheets.google.com/i
]
VALID =
{
  :compression => [:gz, :zip, :bz2, :exe],
  :packing => [:tar],
  :format => [:xlsx, :xls, :delimited, :ods, :fixed_width, :html, :xml, :yaml, :csv, :json],
}
DEFAULT =
{
  :streaming => false,
  :warn_on_multiple_downloads => true,
  :headers => :first_row,
  :keep_blank_rows => false,
  :skip => 0,
  :encoding => 'UTF-8',
  :stop_after_untitled_headers => false,
}
OLD_SETTING_NAMES =
{
  :pre_select => [:select],
  :pre_reject => [:reject],
  :delimiter  => [:col_sep],
}
VERSION =
'3.3.3'

Constructor Details

#initialize(settings) ⇒ RemoteTable
#initialize(url, settings) ⇒ RemoteTable

Create a new RemoteTable, which is an Enumerable.

Options are set at creation using any of the attributes listed below. The documentation marks them “read-only” because they cannot be changed after creation.

Nothing is downloaded or parsed at creation; loading is lazy and happens when the rows are first enumerated.

Examples:

Open an XLSX

RemoteTable.new('http://www.customerreferenceprogram.org/uploads/CRP_RFP_template.xlsx')

Open a CSV inside a ZIP file

RemoteTable.new 'http://www.epa.gov/climatechange/emissions/downloads10/2010-Inventory-Annex-Tables.zip',
                :filename => 'Annex Tables/Annex 3/Table A-93.csv',
                :skip => 1,
                :pre_select => proc { |row| row['Vehicle Age'].strip =~ /^\d+$/ }

Overloads:

  • #initialize(settings) ⇒ RemoteTable

    Parameters:

    • settings (Hash)

      Settings including :url.

  • #initialize(url, settings) ⇒ RemoteTable

    Parameters:

    • url (String)

      The URL to the local or remote file.

    • settings (Hash)

      Settings.



# File 'lib/remote_table.rb', line 405

def initialize(*args)
  @download_count_mutex = ::Mutex.new
  @extend_bang_mutex = ::Mutex.new

  @cache = []
  @download_count = 0

  settings = args.last.is_a?(::Hash) ? args.last.symbolize_keys : {}

  @url = if args.first.is_a? ::String
    args.first
  else
    grab settings, :url
  end
  @format = RemoteTable.guess_format grab(settings, :format)
  if GOOGLE_DOCS_SPREADSHEET.any? { |regex| regex =~ url }
    @url = RemoteTable.google_spreadsheet_csv_url url
    @format = :delimited
  end

  @headers = grab settings, :headers
  if headers.is_a?(::Array) and headers.any?(&:blank?)
    raise ::ArgumentError, "[remote_table] If you specify headers, none of them can be blank"
  end
  @quote_char = grab settings, :quote_char

  @compression = grab(settings, :compression) || RemoteTable.guess_compression(url)
  @packing = grab(settings, :packing) || RemoteTable.guess_packing(url)

  @streaming = grab settings, :streaming
  @warn_on_multiple_downloads = grab settings, :warn_on_multiple_downloads
  @delimiter = grab settings, :delimiter
  @sheet = grab settings, :sheet
  @keep_blank_rows = grab settings, :keep_blank_rows
  @form_data = grab settings, :form_data
  @skip = grab settings, :skip
  @encoding = grab settings, :encoding
  @row_xpath = grab settings, :row_xpath
  @column_xpath = grab settings, :column_xpath
  @row_css = grab settings, :row_css
  @column_css = grab settings, :column_css
  @glob = grab settings, :glob
  @filename = grab settings, :filename
  @cut = grab settings, :cut
  @crop = grab settings, :crop
  @schema = grab settings, :schema
  @schema_name = grab settings, :schema_name
  @pre_select = grab settings, :pre_select
  @pre_reject = grab settings, :pre_reject
  @errata = grab settings, :errata
  @root_node = grab settings, :root_node
  @parser = grab settings, :parser
  @stop_after_untitled_headers = grab settings, :stop_after_untitled_headers

  @other_options = settings

  @local_copy = LocalCopy.new self
  extend!
end
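
The argument handling in the constructor can be sketched in isolation. This is a minimal illustration only: `grab` is a private helper whose source is not shown on this page, so the version below is a hypothetical stand-in that assumes it removes a key from the settings, honors the old setting names, and falls back to a default. `DEFAULTS`, `OLD_NAMES`, and `split_arguments` are names invented for this example, and `transform_keys` is used in place of `symbolize_keys`.

```ruby
# Abbreviated copies of DEFAULT and OLD_SETTING_NAMES, for illustration only.
DEFAULTS = { :headers => :first_row, :skip => 0 }.freeze
OLD_NAMES = { :pre_select => [:select], :delimiter => [:col_sep] }.freeze

# Hypothetical stand-in for the private `grab` helper (an assumption, not the
# library's code): take the setting under its current or old name, removing it
# from the hash, or fall back to the default.
def grab(settings, name)
  names = [name] + OLD_NAMES.fetch(name, [])
  key = names.find { |candidate| settings.key?(candidate) }
  key ? settings.delete(key) : DEFAULTS[name]
end

# Mirrors the two overloads: (settings) or (url, settings).
def split_arguments(*args)
  settings = args.last.is_a?(Hash) ? args.last.transform_keys(&:to_sym) : {}
  url = args.first.is_a?(String) ? args.first : grab(settings, :url)
  [url, settings]
end

url, settings = split_arguments('http://example.com/a.csv',
                                { 'skip' => 1, :col_sep => ';', :foo => 'bar' })
skip = grab(settings, :skip)           # taken from the user's settings
delimiter = grab(settings, :delimiter) # found under the old name :col_sep
headers = grab(settings, :headers)     # falls back to the default
puts url, skip, delimiter, headers.inspect
```

Under these assumptions, whatever is left in `settings` after all known names have been grabbed is what would end up in `other_options`.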

Instance Attribute Details

#column_css ⇒ String (readonly)

The CSS selector used to find columns in HTML or XML.

Returns:

  • (String)


# File 'lib/remote_table.rb', line 253

def column_css
  @column_css
end

#column_xpath ⇒ String (readonly)

The XPath used to find columns in HTML or XML.

Returns:

  • (String)


# File 'lib/remote_table.rb', line 245

def column_xpath
  @column_xpath
end

#compression ⇒ Symbol (readonly)

The compression type. Guessed from URL if not provided. :gz, :zip, :bz2, and :exe (treated as :zip) are supported.

Returns:

  • (Symbol)


# File 'lib/remote_table.rb', line 261

def compression
  @compression
end

#crop ⇒ Range (readonly)

Use a range of rows in a plaintext file.

Examples:

Only take rows 21 through 37

RemoteTable.new("http://www.eia.gov/emeu/cbecs/cbecs2003/detailed_tables_2003/2003set10/2003excel/C17.xls",
                :headers => false,
                :select => proc { |row| CbecsEnergyIntensity::NAICS_CODE_SYNTHESIZER.call(row) },
                :crop => (21..37))

Returns:

  • (Range)


# File 'lib/remote_table.rb', line 302

def crop
  @crop
end

#cut ⇒ String (readonly)

Pick specific columns out of a plaintext file using an argument to the UNIX cut utility (en.wikipedia.org/wiki/Cut_%28Unix%29).

Examples:

Pick ALMOST out of ABCDEFGHIJKLMNOPQRSTUVWXYZ

# $ echo ABCDEFGHIJKLMNOPQRSTUVWXYZ | cut -c '1,12,13,15,19,20'
# ALMOST
RemoteTable.new 'file:///atoz.txt', :cut => '1,12,13,15,19,20'

Returns:

  • (String)


# File 'lib/remote_table.rb', line 291

def cut
  @cut
end
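
To make the character selection in the example above concrete, here is a small pure-Ruby sketch of what `cut -c` does with a comma-separated list of positions. It is an illustration, not how the library invokes cut: `cut_characters` is a hypothetical helper, and it assumes single 1-based positions only (the real utility also accepts ranges such as `1-3`).

```ruby
# Hypothetical helper: emulate `cut -c LIST` for a comma-separated list of
# single, 1-based character positions.
def cut_characters(line, list)
  # cut emits characters in input order regardless of the order in LIST,
  # so the positions are de-duplicated and sorted first.
  positions = list.split(',').map { |n| Integer(n) }.uniq.sort
  # Positions past the end of the line yield nil, which join treats as ''.
  positions.map { |pos| line[pos - 1] }.join
end

puts cut_characters('ABCDEFGHIJKLMNOPQRSTUVWXYZ', '1,12,13,15,19,20') # ALMOST
```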

#delimiter ⇒ String (readonly)

The delimiter, a.k.a. column separator. Passed to Ruby CSV as :col_sep. Default is ‘,’.

Returns:

  • (String)


# File 'lib/remote_table.rb', line 237

def delimiter
  @delimiter
end

#encoding ⇒ String (readonly)

The original encoding of the source file. Default is UTF-8.

Returns:

  • (String)


# File 'lib/remote_table.rb', line 233

def encoding
  @encoding
end

#errata ⇒ Hash (readonly)

An object that responds to #rejects?(row) and #correct!(row). Applied after creating row_hash.

  • #rejects?(row) - if row should be treated like it doesn’t exist

  • #correct!(row) - destructively update a row to fix something

See the Errata library at github.com/seamusabshere/errata for an example implementation.

Returns:

  • (Hash)

# File 'lib/remote_table.rb', line 339

def errata
  @errata
end
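
As a rough sketch of the two-method interface described above, the class below rejects rows with a blank value and strips whitespace from the rest. `VehicleAgeErrata` and the sample rows are invented for illustration; this is not the Errata library's implementation. The reject-then-correct order follows the #each source listed on this page.

```ruby
# Hypothetical errata object: anything responding to rejects? and correct!.
class VehicleAgeErrata
  # Treat rows with a missing or blank 'Vehicle Age' as if they did not exist.
  def rejects?(row)
    row['Vehicle Age'].to_s.strip.empty?
  end

  # Destructively tidy the row: strip surrounding whitespace from the value.
  def correct!(row)
    row['Vehicle Age'] = row['Vehicle Age'].strip
    row
  end
end

errata = VehicleAgeErrata.new
rows = [{ 'Vehicle Age' => ' 3 ' }, { 'Vehicle Age' => '   ' }, { 'Vehicle Age' => nil }]

# Rejected rows are dropped first; surviving rows are then corrected in place.
kept = rows.reject { |row| errata.rejects?(row) }
kept.each { |row| errata.correct!(row) }
puts kept.length
```

An object like this would be passed at creation as `:errata => VehicleAgeErrata.new`.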

#filename ⇒ String (readonly)

The filename, which can be used to pick a file out of an archive.

Examples:

Specify the filename to get out of a ZIP file

RemoteTable.new 'http://www.fueleconomy.gov/FEG/epadata/08data.zip', :filename => '2008_FE_guide_ALL_rel_dates_-no sales-for DOE-5-1-08.csv'

Returns:

  • (String)


# File 'lib/remote_table.rb', line 281

def filename
  @filename
end

#form_data ⇒ String (readonly)

Form data to POST in the download request. It should probably be encoded as application/x-www-form-urlencoded.

Returns:

  • (String)


# File 'lib/remote_table.rb', line 225

def form_data
  @form_data
end

#format ⇒ Symbol (readonly)

The format of the source file. Can be specified as :xlsx, :xls, :delimited (aka :csv), :ods, :fixed_width, :html, :xml, :yaml, or :json.

Note: treats all docs.google.com and spreadsheets.google.com URLs as :delimited.

Default: guessed from the file extension, which usually matches the extension in the URL but may differ when a specific file is picked out of an archive.

Returns:

  • (Symbol)

# File 'lib/remote_table.rb', line 257

def format
  @format
end

#glob ⇒ String (readonly)

The glob used to pick a file out of an archive.

Examples:

Pick out the only CSV in a ZIP file

RemoteTable.new 'http://www.fueleconomy.gov/FEG/epadata/08data.zip', :glob => '/*.csv'

Returns:

  • (String)


# File 'lib/remote_table.rb', line 273

def glob
  @glob
end

#headers ⇒ :first_row, ... (readonly)

Headers specified by the user: :first_row (the default), false, or a list of headers.

Returns:

  • (:first_row, false, Array<String>)


# File 'lib/remote_table.rb', line 206

def headers
  @headers
end

#keep_blank_rows ⇒ true, false (readonly)

Whether to keep blank rows. Default is false.

Returns:

  • (true, false)


# File 'lib/remote_table.rb', line 221

def keep_blank_rows
  @keep_blank_rows
end

#other_options ⇒ Hash (readonly)

Options passed by the user that may be passed through to the underlying parsing library.

Returns:

  • (Hash)

# File 'lib/remote_table.rb', line 382

def other_options
  @other_options
end

#packing ⇒ Symbol (readonly)

The packing type. Guessed from URL if not provided. Only :tar is supported.

Returns:

  • (Symbol)


# File 'lib/remote_table.rb', line 265

def packing
  @packing
end

#pre_reject ⇒ Proc (readonly)

A proc that decides whether to exclude a row: rows for which it returns a truthy value are skipped. Previously passed as :reject.

Returns:

  • (Proc)


# File 'lib/remote_table.rb', line 329

def pre_reject
  @pre_reject
end

#pre_select ⇒ Proc (readonly)

A proc that decides whether to include a row: rows for which it returns a falsy value are skipped. Previously passed as :select.

Returns:

  • (Proc)


# File 'lib/remote_table.rb', line 325

def pre_select
  @pre_select
end

#quote_char ⇒ String (readonly)

Quote character for delimited files.

Defaults to double quotes.

Returns:

  • (String)


# File 'lib/remote_table.rb', line 213

def quote_char
  @quote_char
end

#root_node ⇒ String (readonly)

The root node of the JSON document, specified as a string.

Default: nil; no root node.

Returns:

  • (String)


# File 'lib/remote_table.rb', line 355

def root_node
  @root_node
end

#row_css ⇒ String (readonly)

The CSS selector used to find rows in HTML or XML.

Returns:

  • (String)


# File 'lib/remote_table.rb', line 249

def row_css
  @row_css
end

#row_xpath ⇒ String (readonly)

The XPath used to find rows in HTML or XML.

Returns:

  • (String)


# File 'lib/remote_table.rb', line 241

def row_xpath
  @row_xpath
end

#schema ⇒ Array<Array{String,Integer,Hash}> (readonly)

The fixed-width schema, given as a multi-dimensional array.

Examples:

From the tests

RemoteTable.new('http://cloud.github.com/downloads/seamusabshere/remote_table/test2.fixed_width.txt',
                 :format => :fixed_width,
                 :skip => 1,
                 :schema => [[ 'header4', 10, { :type => :string }  ],
                             [  'spacer',  1 ],
                             [  'header5', 10, { :type => :string } ],
                             [  'spacer',  12 ],
                             [  'header6', 10, { :type => :string } ]])

Returns:

  • (Array<Array{String,Integer,Hash}>)


# File 'lib/remote_table.rb', line 317

def schema
  @schema
end
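
The schema layout above (name, width, optional options) can be read as a recipe for slicing each line. The sketch below shows that idea in plain Ruby. It is not the library's parser: `parse_fixed_width` is a hypothetical helper, and it assumes that columns named 'spacer' are discarded and that values are stripped of padding.

```ruby
SCHEMA = [
  ['header4', 10, { :type => :string }],
  ['spacer', 1],
  ['header5', 10, { :type => :string }],
  ['spacer', 12],
  ['header6', 10, { :type => :string }]
].freeze

# Hypothetical helper: walk the schema left to right, taking `width`
# characters per column from the line.
def parse_fixed_width(line, schema)
  offset = 0
  schema.each_with_object({}) do |(name, width, _options), row|
    field = line[offset, width].to_s
    offset += width
    next if name == 'spacer' # assumption: spacer columns carry no data
    row[name] = field.strip  # assumption: padding is stripped
  end
end

line = 'abc'.ljust(10) + ' ' + 'def'.ljust(10) + (' ' * 12) + 'ghi'.ljust(10)
row = parse_fixed_width(line, SCHEMA)
puts row.keys.join(',')
```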

#schema_name ⇒ String, Symbol (readonly)

The name of a fixed-width schema that has already been defined, so that it can be reused.

Returns:

  • (String, Symbol)


# File 'lib/remote_table.rb', line 321

def schema_name
  @schema_name
end

#sheet ⇒ Object (readonly)

The sheet specified by the user as a number or a string.



# File 'lib/remote_table.rb', line 217

def sheet
  @sheet
end

#skip ⇒ Integer (readonly)

How many rows to skip at the beginning of the file or table. Default is 0.

Returns:

  • (Integer)


# File 'lib/remote_table.rb', line 229

def skip
  @skip
end

#stop_after_untitled_headers ⇒ Integer, false (readonly)

The maximum number of untitled headers to create. Set this to 100 to prevent more than 100 untitled headers from being created; the rest will be silently discarded.

Note: This is effectively a right trim, because the counting starts from the left.

Default: false (do not trim).

Returns:

  • (Integer, false)


# File 'lib/remote_table.rb', line 364

def stop_after_untitled_headers
  @stop_after_untitled_headers
end

#streaming ⇒ true, false (readonly)

Whether to stream the rows without caching them. Saves memory, but you have to re-download the file every time you enumerate its rows. Defaults to false.

Returns:

  • (true, false)


# File 'lib/remote_table.rb', line 198

def streaming
  @streaming
end

#url ⇒ String (readonly)

The URL of the local or remote file.

Examples:

Local

file:///Users/myuser/Desktop/holidays.csv

Local using an absolute path

/Users/myuser/Desktop/holidays.csv

Remote

http://data.brighterplanet.com/countries.csv

Returns:

  • (String)


# File 'lib/remote_table.rb', line 179

def url
  @url
end

#warn_on_multiple_downloads ⇒ true, false (readonly)

Whether to warn the user on multiple downloads. Defaults to true.

Returns:

  • (true, false)


# File 'lib/remote_table.rb', line 202

def warn_on_multiple_downloads
  @warn_on_multiple_downloads
end

Class Method Details

.google_spreadsheet_csv_url(url) ⇒ String

Given a Google Docs spreadsheet URL, make sure it uses CSV output.

Returns:

  • (String)


# File 'lib/remote_table.rb', line 102

def google_spreadsheet_csv_url(url)
  uri = ::URI.parse url
  params = uri.query.split('&')
  params.delete_if { |param| param.start_with?('output=') }
  params << 'output=csv'
  uri.query = params.join('&')
  uri.to_s
end
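
The query rewrite can be tried outside the library with a standalone copy. The sketch below follows the same steps; `csv_output_url` is a name chosen for this example, the sample URLs are made up, and the `to_s` guard for URLs without a query string is an addition that is not in the listing above.

```ruby
require 'uri'

# Standalone sketch of the same rewrite: drop any existing output= parameter
# and append output=csv.
def csv_output_url(url)
  uri = URI.parse(url)
  # uri.query is nil when the URL has no query string, so fall back to ''.
  params = uri.query.to_s.split('&')
  params.delete_if { |param| param.start_with?('output=') }
  params << 'output=csv'
  uri.query = params.join('&')
  uri.to_s
end

puts csv_output_url('https://spreadsheets.google.com/pub?key=abc123&output=html')
# https://spreadsheets.google.com/pub?key=abc123&output=csv
puts csv_output_url('https://docs.google.com/spreadsheet/pub')
# https://docs.google.com/spreadsheet/pub?output=csv
```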

.guess_compression(url) ⇒ Symbol?

Guess compression based on URL. Used internally.

Returns:

  • (Symbol, nil)


# File 'lib/remote_table.rb', line 51

def guess_compression(url)
  extname = extname(url).downcase
  case extname
  when /gz/, /gunzip/
    :gz
  when /zip/
    :zip
  when /bz2/, /bunzip2/
    :bz2
  when /exe/
    :exe
  end
end

.guess_format(basename) ⇒ Symbol?

Guess file format from the basename. Since a file might be decompressed and/or pulled out of an archive with a glob, this usually can’t be called until a file is downloaded.

Returns:

  • (Symbol, nil)


# File 'lib/remote_table.rb', line 76

def guess_format(basename)
  case basename.to_s.downcase.strip
  when /ods\z/, /open_?office\z/
    :ods
  when /xlsx\z/, /excelx\z/
    :xlsx
  when /xls\z/, /excel\z/
    :xls
  when /csv\z/, /tsv\z/, /delimited\z/
    # note that there is no RemoteTable::Csv class - it's normalized to :delimited
    :delimited
  when /fixed_?width\z/
    :fixed_width
  when /html?\z/
    :html
  when /xml\z/
    :xml
  when /yaml\z/, /yml\z/
    :yaml
  when /json\z/
    :json
  end
end
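
To see which inputs map to which format, the same suffix matching can be written as a lookup table. This is an equivalent sketch for experimentation, not the library's code: `FORMAT_PATTERNS` and `sketch_guess_format` are names invented here, with the patterns copied from the case statement above.

```ruby
# Hypothetical lookup table; patterns copied from the case statement above.
FORMAT_PATTERNS = {
  :ods => [/ods\z/, /open_?office\z/],
  :xlsx => [/xlsx\z/, /excelx\z/],
  :xls => [/xls\z/, /excel\z/],
  :delimited => [/csv\z/, /tsv\z/, /delimited\z/],
  :fixed_width => [/fixed_?width\z/],
  :html => [/html?\z/],
  :xml => [/xml\z/],
  :yaml => [/yaml\z/, /yml\z/],
  :json => [/json\z/]
}.freeze

def sketch_guess_format(basename)
  # Symbols such as :csv are accepted too, because of to_s.
  name = basename.to_s.downcase.strip
  match = FORMAT_PATTERNS.find { |_format, patterns| patterns.any? { |re| re =~ name } }
  match && match.first
end

puts sketch_guess_format('Table A-93.CSV').inspect # :delimited
puts sketch_guess_format(:csv).inspect             # :delimited
puts sketch_guess_format('notes.txt').inspect      # nil
```

Because every pattern is anchored with \z, only the end of the name matters, which is why a user-supplied :format symbol and a downloaded file's basename can go through the same method.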

.guess_packing(url) ⇒ Symbol?

Guess packing from URL. Used internally.

Returns:

  • (Symbol, nil)


# File 'lib/remote_table.rb', line 67

def guess_packing(url)
  basename = basename(url).downcase
  if basename.include?('.tar') or basename.include?('.tgz')
    :tar
  end
end

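.normalize_whitespace(v) ⇒ Object

Collapse each run of whitespace into a single space and strip leading and trailing whitespace. The argument is converted with to_s and copied, so the original value is not modified.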


# File 'lib/remote_table.rb', line 111

def normalize_whitespace(v)
  v = v.to_s.dup
  v.gsub! WHITESPACE, SINGLE_SPACE
  v.strip!
  v
end
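
The effect of the two steps is easy to check with a few inputs. The sketch below inlines the same regular expression and replacement; `squish` is a name chosen for this example only.

```ruby
# Same two steps as the method above: collapse whitespace runs, then strip.
def squish(value)
  value.to_s.gsub(/\s+/, ' ').strip
end

puts squish("  Vehicle \t Age\n").inspect # "Vehicle Age"
puts squish(nil).inspect                  # ""
puts squish(42).inspect                   # "42"
```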

.transpose(url, key_key, value_key, options = {}) ⇒ Object

Transpose two columns into a mapping from one to the other.



# File 'lib/remote_table.rb', line 42

def transpose(url, key_key, value_key, options = {})
  new(url, options).inject({}) do |memo, row|
    memo[row[key_key]] = row[value_key]
    memo
  end
end
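
The fold performed by the method can be illustrated on an in-memory list of rows, without downloading anything. In this sketch `transpose_rows` and the sample data are invented for the example; the inject body is the same as above.

```ruby
# Hypothetical variant that takes rows directly instead of a URL.
def transpose_rows(rows, key_key, value_key)
  rows.inject({}) do |memo, row|
    memo[row[key_key]] = row[value_key]
    memo
  end
end

rows = [
  { 'iso' => 'US', 'name' => 'United States' },
  { 'iso' => 'FR', 'name' => 'France' },
  { 'iso' => 'US', 'name' => 'USA' } # a repeated key overwrites the earlier value
]

mapping = transpose_rows(rows, 'iso', 'name')
puts mapping['US'] # USA
puts mapping['FR'] # France
```

Note that when the key column contains duplicates, the last row wins.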

Instance Method Details

#[](row_number) ⇒ Hash, Array

Get a row by its zero-based row number.

Returns:

  • (Hash, Array)


# File 'lib/remote_table.rb', line 519

def [](row_number)
  if fully_cached?
    cache[row_number]
  else
    to_a[row_number]
  end
end

#each {|Hash, Array| ... } ⇒ nil
Also known as: each_row

Yield each row.

Yields:

  • (Hash, Array)

    A hash or an array depending on whether the RemoteTable has named headers (column names).

Returns:

  • (nil)


# File 'lib/remote_table.rb', line 470

def each
  if fully_cached?
    cache.each do |row|
      yield row
    end
  else
    mark_download!
    preprocess!
    memo = _each do |row|
      parser.call(row).each do |virtual_row|
        virtual_row.row_hash = ::HashDigest.digest3 row
        if errata
          next if errata.rejects? virtual_row
          errata.correct! virtual_row
        end
        next if pre_select and !pre_select.call(virtual_row)
        next if pre_reject and pre_reject.call(virtual_row)
        unless streaming
          cache.push virtual_row
        end
        yield virtual_row
      end
    end
    unless streaming
      fully_cached!
    end
    memo
  end
  nil
end
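
The interaction between filtering, caching, and streaming in the method above can be modeled with a small stand-alone class. This is a simplified sketch, not the library's implementation: `SketchTable` is hypothetical, an in-memory array stands in for the downloaded file, and the parser, errata, and row_hash steps are left out.

```ruby
# Simplified, hypothetical stand-in for the row pipeline shown above.
class SketchTable
  include Enumerable

  attr_reader :download_count

  def initialize(source_rows, pre_select: nil, pre_reject: nil, streaming: false)
    @source_rows = source_rows
    @pre_select = pre_select
    @pre_reject = pre_reject
    @streaming = streaming
    @cache = []
    @fully_cached = false
    @download_count = 0
  end

  def each
    if @fully_cached
      @cache.each { |row| yield row }
    else
      @download_count += 1 # stands in for mark_download!
      @source_rows.each do |row|
        next if @pre_select && !@pre_select.call(row) # keep only selected rows
        next if @pre_reject && @pre_reject.call(row)  # drop rejected rows
        @cache.push(row) unless @streaming
        yield row
      end
      @fully_cached = true unless @streaming
    end
    nil
  end
end

rows = [{ 'Vehicle Age' => '1' }, { 'Vehicle Age' => 'Total' }, { 'Vehicle Age' => '0' }]
numeric = proc { |row| row['Vehicle Age'] =~ /\A\d+\z/ }
zero = proc { |row| row['Vehicle Age'] == '0' }

cached = SketchTable.new(rows, pre_select: numeric, pre_reject: zero)
2.times { cached.to_a }
streamed = SketchTable.new(rows, pre_select: numeric, pre_reject: zero, streaming: true)
2.times { streamed.to_a }

puts cached.download_count   # 1
puts streamed.download_count # 2
```

In this model, the cached table reads its source once and replays the cache afterwards, while the streaming table reads the source on every enumeration, which is the trade-off described for the streaming attribute.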

#free ⇒ nil

Clear the row cache in case it helps your GC.

Returns:

  • (nil)


# File 'lib/remote_table.rb', line 530

def free
  @fully_cached = false
  cache.clear
  nil
end

#parser ⇒ #call

An object that responds to #call(row) and returns an array of one or more rows.

Returns:

  • (#call)


# File 'lib/remote_table.rb', line 376

def parser
  @final_parser ||= (@parser || NullParser.new)
end

#to_a ⇒ Array<Hash,Array>
Also known as: rows

Returns all rows.

Returns:

  • (Array<Hash,Array>)

    All rows.

# File 'lib/remote_table.rb', line 505

def to_a
  if fully_cached?
    cache.dup
  else
    map { |row| row }
  end
end