Class: RemoteTable

Inherits:
Object
Includes:
Enumerable
Defined in:
lib/remote_table.rb,
lib/remote_table/ods.rb,
lib/remote_table/xls.rb,
lib/remote_table/xml.rb,
lib/remote_table/html.rb,
lib/remote_table/json.rb,
lib/remote_table/xlsx.rb,
lib/remote_table/yaml.rb,
lib/remote_table/version.rb,
lib/remote_table/delimited.rb,
lib/remote_table/plaintext.rb,
lib/remote_table/local_copy.rb,
lib/remote_table/fixed_width.rb,
lib/remote_table/processed_by_roo.rb,
lib/remote_table/processed_by_nokogiri.rb

Overview

Opens Google Docs spreadsheets and local or remote XLSX, XLS, ODS, CSV (comma-separated), TSV (tab-separated), other delimited, and fixed-width files. HTML, XML, YAML, and JSON sources are also accepted (see the :format entry of VALID).

Defined Under Namespace

Modules: Delimited, FixedWidth, Html, Json, Ods, Plaintext, ProcessedByNokogiri, ProcessedByRoo, Xls, Xlsx, Xml, Yaml

Constant Summary

WHITESPACE =
/\s+/
SINGLE_SPACE =
' '
EXTERNAL_ENCODING =
'UTF-8'
EXTERNAL_ENCODING_ICONV =
'UTF-8//TRANSLIT'
GOOGLE_DOCS_SPREADSHEET =
[
  /docs.google.com/i,
  /spreadsheets.google.com/i
]
VALID =
{
  :compression => [:gz, :zip, :bz2, :exe],
  :packing => [:tar],
  :format => [:xlsx, :xls, :delimited, :ods, :fixed_width, :html, :xml, :yaml, :csv, :json],
}
DEFAULT =
{
  :streaming => false,
  :warn_on_multiple_downloads => true,
  :headers => :first_row,
  :keep_blank_rows => false,
  :skip => 0,
  :encoding => 'UTF-8',
}
OLD_SETTING_NAMES =
{
  :pre_select => [:select],
  :pre_reject => [:reject],
  :delimiter  => [:col_sep],
}
VERSION =
'3.3.2'

Constructor Details

#initialize(settings) ⇒ RemoteTable
#initialize(url, settings) ⇒ RemoteTable

Create a new RemoteTable, which is an Enumerable.

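Options are set at creation using any of the attributes listed under Instance Attribute Details. The documentation marks them “read-only” because you can’t set or change them after creation.

Creating a RemoteTable does not download or parse anything; loading is lazy and happens when the rows are first enumerated.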

Examples:

Open an XLSX

RemoteTable.new('http://www.customerreferenceprogram.org/uploads/CRP_RFP_template.xlsx')

Open a CSV inside a ZIP file

RemoteTable.new 'http://www.epa.gov/climatechange/emissions/downloads10/2010-Inventory-Annex-Tables.zip',
                :filename => 'Annex Tables/Annex 3/Table A-93.csv',
                :skip => 1,
                :pre_select => proc { |row| row['Vehicle Age'].strip =~ /^\d+$/ }

Overloads:

  • #initialize(settings) ⇒ RemoteTable

    Parameters:

    • settings (Hash)

      Settings including :url.

  • #initialize(url, settings) ⇒ RemoteTable

    Parameters:

    • url (String)

      The URL to the local or remote file.

    • settings (Hash)

      Settings.



# File 'lib/remote_table.rb', line 395

def initialize(*args)
  @download_count_mutex = ::Mutex.new
  @extend_bang_mutex = ::Mutex.new

  @cache = []
  @download_count = 0

  settings = args.last.is_a?(::Hash) ? args.last.symbolize_keys : {}

  @url = if args.first.is_a? ::String
    args.first
  else
    grab settings, :url
  end
  @format = RemoteTable.guess_format grab(settings, :format)
  if GOOGLE_DOCS_SPREADSHEET.any? { |regex| regex =~ url }
    @url = RemoteTable.google_spreadsheet_csv_url url
    @format = :delimited
  end

  @headers = grab settings, :headers
  if headers.is_a?(::Array) and headers.any?(&:blank?)
    raise ::ArgumentError, "[remote_table] If you specify headers, none of them can be blank"
  end
  @quote_char = grab settings, :quote_char

  @compression = grab(settings, :compression) || RemoteTable.guess_compression(url)
  @packing = grab(settings, :packing) || RemoteTable.guess_packing(url)

  @streaming = grab settings, :streaming
  @warn_on_multiple_downloads = grab settings, :warn_on_multiple_downloads
  @delimiter = grab settings, :delimiter
  @sheet = grab settings, :sheet
  @keep_blank_rows = grab settings, :keep_blank_rows
  @form_data = grab settings, :form_data
  @skip = grab settings, :skip
  @encoding = grab settings, :encoding
  @row_xpath = grab settings, :row_xpath
  @column_xpath = grab settings, :column_xpath
  @row_css = grab settings, :row_css
  @column_css = grab settings, :column_css
  @glob = grab settings, :glob
  @filename = grab settings, :filename
  @cut = grab settings, :cut
  @crop = grab settings, :crop
  @schema = grab settings, :schema
  @schema_name = grab settings, :schema_name
  @pre_select = grab settings, :pre_select
  @pre_reject = grab settings, :pre_reject
  @errata = grab settings, :errata
  @root_node = grab settings, :root_node
  @parser = grab settings, :parser

  @other_options = settings

  @local_copy = LocalCopy.new self
  extend!
end
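
The two overloads differ only in where the URL comes from. The following is a minimal, self-contained sketch of that argument handling. `resolve_url_and_settings` is a hypothetical helper, not part of RemoteTable; it uses `Hash#transform_keys` in place of `symbolize_keys` and `Hash#delete` in place of the constructor's own `grab` helper, whose exact behaviour is assumed here.

```ruby
# Hypothetical stand-in for the argument handling in RemoteTable#initialize.
def resolve_url_and_settings(*args)
  # A trailing Hash is treated as settings; its keys are symbolized.
  settings = args.last.is_a?(Hash) ? args.last.transform_keys(&:to_sym) : {}
  # A leading String is the URL; otherwise the URL is taken from settings.
  # (Assumption: taking a setting removes it from the hash.)
  url = args.first.is_a?(String) ? args.first : settings.delete(:url)
  [url, settings]
end

p resolve_url_and_settings('http://example.com/a.csv', 'skip' => 1)
p resolve_url_and_settings(:url => 'http://example.com/a.csv', :skip => 1)
```

Both calls resolve to the same URL and the same remaining settings, which is why the two constructor forms are interchangeable.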

Instance Attribute Details

#column_css ⇒ String (readonly)

The CSS selector used to find columns in HTML or XML.

Returns:

  • (String)


# File 'lib/remote_table.rb', line 252

def column_css
  @column_css
end

#column_xpath ⇒ String (readonly)

The XPath used to find columns in HTML or XML.

Returns:

  • (String)


# File 'lib/remote_table.rb', line 244

def column_xpath
  @column_xpath
end

#compression ⇒ Symbol (readonly)

The compression type. Guessed from URL if not provided. :gz, :zip, :bz2, and :exe (treated as :zip) are supported.

Returns:

  • (Symbol)


# File 'lib/remote_table.rb', line 260

def compression
  @compression
end

#crop ⇒ Range (readonly)

Use a range of rows in a plaintext file.

Examples:

Only take rows 21 through 37

RemoteTable.new("http://www.eia.gov/emeu/cbecs/cbecs2003/detailed_tables_2003/2003set10/2003excel/C17.xls",
                :headers => false,
                :pre_select => proc { |row| CbecsEnergyIntensity::NAICS_CODE_SYNTHESIZER.call(row) },
                :crop => (21..37))

Returns:

  • (Range)


# File 'lib/remote_table.rb', line 301

def crop
  @crop
end

#cut ⇒ String (readonly)

Pick specific columns out of a plaintext file using an argument to the UNIX cut utility (en.wikipedia.org/wiki/Cut_%28Unix%29).

Examples:

Pick ALMOST out of ABCDEFGHIJKLMNOPQRSTUVWXYZ

# $ echo ABCDEFGHIJKLMNOPQRSTUVWXYZ | cut -c '1,12,13,15,19,20'
# ALMOST
RemoteTable.new 'file:///atoz.txt', :cut => '1,12,13,15,19,20'

Returns:

  • (String)


# File 'lib/remote_table.rb', line 290

def cut
  @cut
end
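
To make the ALMOST example concrete, here is a small sketch of what a `cut -c` character list selects. `cut_characters` is a hypothetical helper written for illustration; how RemoteTable applies the :cut setting internally is not shown in this page, and ranges such as `1-5` are deliberately not handled.

```ruby
# Pick characters at 1-based positions, like `cut -c LIST` with a
# comma-separated list of single positions (ranges are not handled here).
def cut_characters(line, list)
  positions = list.split(',').map { |n| Integer(n) }
  # cut emits characters in input order, whatever the order of the list.
  positions.sort.uniq.map { |pos| line[pos - 1] }.join
end

puts cut_characters('ABCDEFGHIJKLMNOPQRSTUVWXYZ', '1,12,13,15,19,20')
```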

#delimiter ⇒ String (readonly)

The delimiter, a.k.a. column separator. Passed to Ruby CSV as :col_sep. Default is ‘,’.

Returns:

  • (String)


# File 'lib/remote_table.rb', line 236

def delimiter
  @delimiter
end

#encoding ⇒ String (readonly)

The original encoding of the source file. Default is UTF-8.

Returns:

  • (String)


# File 'lib/remote_table.rb', line 232

def encoding
  @encoding
end

#errata ⇒ Hash (readonly)

An object that responds to #rejects?(row) and #correct!(row). Applied after creating row_hash.

  • #rejects?(row) - if row should be treated like it doesn’t exist

  • #correct!(row) - destructively update a row to fix something

See the Errata library at github.com/seamusabshere/errata for an example implementation.

Returns:



# File 'lib/remote_table.rb', line 338

def errata
  @errata
end
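
As an illustration of the interface described above, the following sketch defines a tiny errata object and applies it to plain hashes. `SimpleErrata`, the column name, and the sample rows are all hypothetical; the Errata library itself works differently and is not used here.

```ruby
# Hypothetical errata object: anything responding to #rejects?(row) and
# #correct!(row) fits the interface described above.
class SimpleErrata
  # Treat rows without a name as if they did not exist.
  def rejects?(row)
    row['name'].to_s.strip.empty?
  end

  # Destructively fix a known misspelling.
  def correct!(row)
    row['name'] = 'Colombia' if row['name'] == 'Columbia'
    row
  end
end

errata = SimpleErrata.new
rows = [{ 'name' => 'Columbia' }, { 'name' => '' }, { 'name' => 'Peru' }]
# Same order as in #each: reject first, then correct the survivors.
kept = rows.reject { |row| errata.rejects?(row) }.each { |row| errata.correct!(row) }
p kept.map { |row| row['name'] }
```

An object like this would be passed at creation through the :errata setting.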

#filename ⇒ String (readonly)

The filename, which can be used to pick a file out of an archive.

Examples:

Specify the filename to get out of a ZIP file

RemoteTable.new 'http://www.fueleconomy.gov/FEG/epadata/08data.zip', :filename => '2008_FE_guide_ALL_rel_dates_-no sales-for DOE-5-1-08.csv'

Returns:

  • (String)


# File 'lib/remote_table.rb', line 280

def filename
  @filename
end

#form_data ⇒ String (readonly)

Form data to POST in the download request. It should probably be encoded as application/x-www-form-urlencoded.

Returns:

  • (String)


# File 'lib/remote_table.rb', line 224

def form_data
  @form_data
end

#format ⇒ Symbol (readonly)

The format of the source file. Can be specified as :xlsx, :xls, :delimited (aka :csv), :ods, :fixed_width, :html, :xml, :yaml, or :json.

Note: all docs.google.com and spreadsheets.google.com URLs are treated as :delimited.

Default: guessed from the file extension (usually the extension of the URL, but not when you pick a specific file out of an archive).

Returns:

  • (Symbol)



# File 'lib/remote_table.rb', line 256

def format
  @format
end

#glob ⇒ String (readonly)

The glob used to pick a file out of an archive.

Examples:

Pick out the only CSV in a ZIP file

RemoteTable.new 'http://www.fueleconomy.gov/FEG/epadata/08data.zip', :glob => '/*.csv'

Returns:

  • (String)


# File 'lib/remote_table.rb', line 272

def glob
  @glob
end

#headers ⇒ :first_row, ... (readonly)

Headers specified by the user: :first_row (the default), false, or a list of headers.

Returns:

  • (:first_row, false, Array<String>)


# File 'lib/remote_table.rb', line 205

def headers
  @headers
end
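
The three kinds of :headers value change the shape of the rows you get back. The sketch below shows the idea on raw arrays; `shape_rows` is a hypothetical helper, and treating every raw row as data when an explicit list of headers is given is an assumption, not something this page confirms.

```ruby
# Hypothetical helper showing the row shape for each :headers setting.
def shape_rows(raw_rows, headers)
  case headers
  when :first_row
    # The first row supplies the column names for all following rows.
    names, *data = raw_rows
    data.map { |cells| names.zip(cells).to_h }
  when Array
    # Assumption: with explicit headers, every raw row is treated as data.
    raw_rows.map { |cells| headers.zip(cells).to_h }
  else
    # false: no named headers, so rows stay arrays.
    raw_rows
  end
end

raw = [%w[name iso], %w[Peru PE], %w[Chile CL]]
p shape_rows(raw, :first_row)
p shape_rows(raw, false)
p shape_rows(raw, %w[country code])
```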

#keep_blank_rows ⇒ true, false (readonly)

Whether to keep blank rows. Default is false.

Returns:

  • (true, false)


# File 'lib/remote_table.rb', line 220

def keep_blank_rows
  @keep_blank_rows
end

#other_options ⇒ Hash (readonly)

Options passed by the user that may be passed through to the underlying parsing library.

Returns:

  • (Hash)



# File 'lib/remote_table.rb', line 372

def other_options
  @other_options
end

#packing ⇒ Symbol (readonly)

The packing type. Guessed from URL if not provided. Only :tar is supported.

Returns:

  • (Symbol)


# File 'lib/remote_table.rb', line 264

def packing
  @packing
end

#pre_reject ⇒ Proc (readonly)

A proc that decides whether to exclude a row: rows for which it returns a truthy value are skipped. Previously passed as :reject.

Returns:

  • (Proc)


# File 'lib/remote_table.rb', line 328

def pre_reject
  @pre_reject
end

#pre_select ⇒ Proc (readonly)

A proc that decides whether to include a row. Previously passed as :select.

Returns:

  • (Proc)


# File 'lib/remote_table.rb', line 324

def pre_select
  @pre_select
end
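
In #each, a row is kept only if pre_select (when given) returns a truthy value and pre_reject (when given) does not. The following sketch applies two such procs to plain hashes using Array#select and Array#reject. The sample rows and the "older than 10 years" rule are hypothetical; the pre_select test is the one from the constructor example.

```ruby
rows = [
  { 'Vehicle Age' => ' 3 ' },
  { 'Vehicle Age' => 'Total' },
  { 'Vehicle Age' => '12' }
]

# Keep only rows whose age is purely numeric.
pre_select = proc { |row| row['Vehicle Age'].strip =~ /^\d+$/ }
# Drop rows for vehicles older than 10 years (hypothetical rule).
pre_reject = proc { |row| row['Vehicle Age'].to_i > 10 }

kept = rows.select { |row| pre_select.call(row) }
           .reject { |row| pre_reject.call(row) }
p kept
```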

#quote_char ⇒ String (readonly)

Quote character for delimited files.

Defaults to double quotes.

Returns:

  • (String)


# File 'lib/remote_table.rb', line 212

def quote_char
  @quote_char
end

#root_node ⇒ String (readonly)

The root node of the JSON document. Specified as a string.

Default: nil; no root node.

Returns:

  • (String)


# File 'lib/remote_table.rb', line 354

def root_node
  @root_node
end

#row_css ⇒ String (readonly)

The CSS selector used to find rows in HTML or XML.

Returns:

  • (String)


# File 'lib/remote_table.rb', line 248

def row_css
  @row_css
end

#row_xpath ⇒ String (readonly)

The XPath used to find rows in HTML or XML.

Returns:

  • (String)


# File 'lib/remote_table.rb', line 240

def row_xpath
  @row_xpath
end

#schema ⇒ Array<Array{String,Integer,Hash}> (readonly)

The fixed-width schema, given as a multi-dimensional array.

Examples:

From the tests

RemoteTable.new('http://cloud.github.com/downloads/seamusabshere/remote_table/test2.fixed_width.txt',
                 :format => :fixed_width,
                 :skip => 1,
                 :schema => [[ 'header4', 10, { :type => :string }  ],
                             [  'spacer',  1 ],
                             [  'header5', 10, { :type => :string } ],
                             [  'spacer',  12 ],
                             [  'header6', 10, { :type => :string } ]])

Returns:

  • (Array<Array{String,Integer,Hash}>)



# File 'lib/remote_table.rb', line 316

def schema
  @schema
end
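
Each schema entry is a column name, a width, and optional settings. The sketch below shows how such a schema could slice one line of text. `parse_fixed_width` is a hypothetical helper, and discarding columns named 'spacer' is an assumption made for this illustration; the actual fixed-width parsing used by RemoteTable is not shown in this page.

```ruby
SCHEMA = [
  ['header4', 10, { :type => :string }],
  ['spacer',   1],
  ['header5', 10, { :type => :string }]
].freeze

# Hypothetical parser: walk the line left to right, taking `width`
# characters per column and discarding columns named 'spacer'.
def parse_fixed_width(line, schema)
  offset = 0
  schema.each_with_object({}) do |(name, width, _options), row|
    cell = line[offset, width].to_s
    offset += width
    row[name] = cell.strip unless name == 'spacer'
  end
end

# Build a sample line with the exact widths declared in SCHEMA.
line = 'alpha'.ljust(10) + ' ' + 'beta'.ljust(10)
p parse_fixed_width(line, SCHEMA)
```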

#schema_name ⇒ String, Symbol (readonly)

The name of a fixed-width schema that you have already defined elsewhere (for example, so that you can reuse it).

Returns:

  • (String, Symbol)


# File 'lib/remote_table.rb', line 320

def schema_name
  @schema_name
end

#sheet ⇒ Object (readonly)

The sheet specified by the user as a number or a string.



# File 'lib/remote_table.rb', line 216

def sheet
  @sheet
end

#skip ⇒ Integer (readonly)

How many rows to skip at the beginning of the file or table. Default is 0.

Returns:

  • (Integer)


# File 'lib/remote_table.rb', line 228

def skip
  @skip
end

#streaming ⇒ true, false (readonly)

Whether to stream the rows without caching them. Saves memory, but you have to re-download the file every time you enumerate its rows. Defaults to false.

Returns:

  • (true, false)


# File 'lib/remote_table.rb', line 197

def streaming
  @streaming
end

#url ⇒ String (readonly)

The URL of the local or remote file.

Examples:

Local

file:///Users/myuser/Desktop/holidays.csv

Local using an absolute path

/Users/myuser/Desktop/holidays.csv

Remote

http://data.brighterplanet.com/countries.csv

Returns:

  • (String)


# File 'lib/remote_table.rb', line 178

def url
  @url
end

#warn_on_multiple_downloads ⇒ true, false (readonly)

Whether to warn the user on multiple downloads. Defaults to true.

Returns:

  • (true, false)


# File 'lib/remote_table.rb', line 201

def warn_on_multiple_downloads
  @warn_on_multiple_downloads
end

Class Method Details

.google_spreadsheet_csv_url(url) ⇒ String

Given a Google Docs spreadsheet URL, make sure it uses CSV output.

Returns:

  • (String)


# File 'lib/remote_table.rb', line 102

def google_spreadsheet_csv_url(url)
  uri = ::URI.parse url
  params = uri.query.split('&')
  params.delete_if { |param| param.start_with?('output=') }
  params << 'output=csv'
  uri.query = params.join('&')
  uri.to_s
end
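
The query rewriting above can be tried in isolation. This sketch repeats the same steps in a standalone method and adds one guard: `URI#query` is nil when a URL has no query string, so it is converted to an empty string before splitting. `csv_output_url`, the spreadsheet key, and the paths are made up for the example.

```ruby
require 'uri'

# Same steps as above, plus a guard for URLs that have no query string
# (URI#query returns nil in that case).
def csv_output_url(url)
  uri = URI.parse(url)
  params = uri.query.to_s.split('&')
  # Drop any existing output=... parameter, then force CSV output.
  params.delete_if { |param| param.start_with?('output=') }
  params << 'output=csv'
  uri.query = params.join('&')
  uri.to_s
end

puts csv_output_url('https://spreadsheets.google.com/pub?key=abc123&output=html')
puts csv_output_url('https://docs.google.com/spreadsheet/pub')
```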

.guess_compression(url) ⇒ Symbol?

Guess compression based on URL. Used internally.

Returns:

  • (Symbol, nil)


# File 'lib/remote_table.rb', line 51

def guess_compression(url)
  extname = extname(url).downcase
  case extname
  when /gz/, /gunzip/
    :gz
  when /zip/
    :zip
  when /bz2/, /bunzip2/
    :bz2
  when /exe/
    :exe
  end
end
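
The method above depends on an `extname` helper that is not shown in this page. As a rough, standalone approximation, the sketch below assumes that helper behaves like `File.extname` applied to the path of the URL; `sketch_guess_compression` and the example URLs are hypothetical.

```ruby
require 'uri'

# Hypothetical stand-in: assumes the extension is File.extname of the URL path.
def sketch_guess_compression(url)
  case File.extname(URI.parse(url).path.to_s).downcase
  when /gz/, /gunzip/ then :gz
  when /zip/ then :zip
  when /bz2/, /bunzip2/ then :bz2
  when /exe/ then :exe
  end
end

p sketch_guess_compression('http://example.com/data.csv.gz')
p sketch_guess_compression('http://example.com/08data.zip')
p sketch_guess_compression('http://example.com/countries.csv')
```

Because the patterns are unanchored, any extension containing "gz" (such as ".tgz" under this assumption) is reported as :gz, and a URL with no recognized extension yields nil.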

.guess_format(basename) ⇒ Symbol?

Guess file format from the basename. Since a file might be decompressed and/or pulled out of an archive with a glob, this usually can’t be called until a file is downloaded.

Returns:

  • (Symbol, nil)


# File 'lib/remote_table.rb', line 76

def guess_format(basename)
  case basename.to_s.downcase.strip
  when /ods\z/, /open_?office\z/
    :ods
  when /xlsx\z/, /excelx\z/
    :xlsx
  when /xls\z/, /excel\z/
    :xls
  when /csv\z/, /tsv\z/, /delimited\z/
    # note that there is no RemoteTable::Csv class - it's normalized to :delimited
    :delimited
  when /fixed_?width\z/
    :fixed_width
  when /html?\z/
    :html
  when /xml\z/
    :xml
  when /yaml\z/, /yml\z/
    :yaml
  when /json\z/
    :json
  end
end
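
Because the argument is converted with `to_s`, lowercased, and stripped before matching, the same method handles file names and the Symbol passed as the :format setting. The sketch below is a trimmed copy with only three of the branches, named `sketch_guess_format` to make clear it is not the library method; the file names are made up.

```ruby
# Trimmed copy of the matching logic above (three branches only).
def sketch_guess_format(basename)
  case basename.to_s.downcase.strip
  when /xlsx\z/, /excelx\z/ then :xlsx
  when /xls\z/, /excel\z/ then :xls
  when /csv\z/, /tsv\z/, /delimited\z/ then :delimited
  end
end

p sketch_guess_format('Table A-93.CSV ') # odd case and a trailing space
p sketch_guess_format(:csv)              # a Symbol, as passed via :format
p sketch_guess_format('report.xlsx')
p sketch_guess_format('notes.txt')       # no branch matches
```

This is why :csv is accepted as a format even though it is normalized to :delimited.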

.guess_packing(url) ⇒ Symbol?

Guess packing from URL. Used internally.

Returns:

  • (Symbol, nil)


# File 'lib/remote_table.rb', line 67

def guess_packing(url)
  basename = basename(url).downcase
  if basename.include?('.tar') or basename.include?('.tgz')
    :tar
  end
end
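
The `basename` helper used above is not shown in this page. The sketch below assumes it behaves like `File.basename` applied to the path of the URL; `sketch_guess_packing` and the example URLs are hypothetical.

```ruby
require 'uri'

# Hypothetical stand-in: assumes the basename is File.basename of the URL path.
def sketch_guess_packing(url)
  basename = File.basename(URI.parse(url).path.to_s).downcase
  :tar if basename.include?('.tar') || basename.include?('.tgz')
end

p sketch_guess_packing('http://example.com/data.tar.gz')
p sketch_guess_packing('http://example.com/data.tgz')
p sketch_guess_packing('http://example.com/data.zip')
```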

.normalize_whitespace(v) ⇒ Object



# File 'lib/remote_table.rb', line 111

def normalize_whitespace(v)
  v = v.to_s.dup
  v.gsub! WHITESPACE, SINGLE_SPACE
  v.strip!
  v
end
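
This method has no description, so a short demonstration may help. The sketch below is a standalone copy (renamed `sketch_normalize_whitespace`, with the two constants redefined locally) showing that runs of whitespace collapse to a single space, leading and trailing whitespace is removed, non-strings are converted with `to_s`, and the argument itself is left unmodified because the method works on a copy.

```ruby
WHITESPACE = /\s+/
SINGLE_SPACE = ' '

# Standalone copy of the method above, for experimentation.
def sketch_normalize_whitespace(v)
  v = v.to_s.dup                     # work on a copy; accepts any object
  v.gsub! WHITESPACE, SINGLE_SPACE   # collapse runs of whitespace
  v.strip!                           # trim both ends
  v
end

original = "  Annex \t 3\n Table "
p sketch_normalize_whitespace(original)
p original
p sketch_normalize_whitespace(nil)
```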

.transpose(url, key_key, value_key, options = {}) ⇒ Object

Transpose two columns into a mapping from one to the other.



# File 'lib/remote_table.rb', line 42

def transpose(url, key_key, value_key, options = {})
  new(url, options).inject({}) do |memo, row|
    memo[row[key_key]] = row[value_key]
    memo
  end
end
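
The fold used by transpose can be seen on plain data. In this sketch an array of hashes stands in for the rows a RemoteTable would yield; the column names 'iso' and 'name' and the values are made up.

```ruby
# Stand-in rows; a real RemoteTable would yield these from the file at `url`.
rows = [
  { 'iso' => 'PE', 'name' => 'Peru' },
  { 'iso' => 'CL', 'name' => 'Chile' }
]

# Same fold as in transpose: key column => value column.
mapping = rows.inject({}) do |memo, row|
  memo[row['iso']] = row['name']
  memo
end
p mapping
```

With these column names, the equivalent library call would be `RemoteTable.transpose(url, 'iso', 'name')`. If two rows share a key, the later row overwrites the earlier one.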

Instance Method Details

#[](row_number) ⇒ Hash, Array

Get a row by row number. Zero-based.

Returns:

  • (Hash, Array)



# File 'lib/remote_table.rb', line 508

def [](row_number)
  if fully_cached?
    cache[row_number]
  else
    to_a[row_number]
  end
end

#each {|Hash, Array| ... } ⇒ nil
Also known as: each_row

Yield each row.

Yields:

  • (Hash, Array)

    A hash or an array depending on whether the RemoteTable has named headers (column names).

Returns:

  • (nil)


# File 'lib/remote_table.rb', line 459

def each
  if fully_cached?
    cache.each do |row|
      yield row
    end
  else
    mark_download!
    preprocess!
    memo = _each do |row|
      parser.call(row).each do |virtual_row|
        virtual_row.row_hash = ::HashDigest.digest3 row
        if errata
          next if errata.rejects? virtual_row
          errata.correct! virtual_row
        end
        next if pre_select and !pre_select.call(virtual_row)
        next if pre_reject and pre_reject.call(virtual_row)
        unless streaming
          cache.push virtual_row
        end
        yield virtual_row
      end
    end
    unless streaming
      fully_cached!
    end
    memo
  end
  nil
end
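
The interaction between the row cache and the :streaming setting is easier to see in miniature. `MiniTable` below is a hypothetical class that keeps only the caching logic of #each; `source` is any callable returning rows and stands in for the download and parse steps, and `download_count` simply counts how often it is invoked.

```ruby
# Hypothetical miniature of the caching behaviour in #each.
class MiniTable
  include Enumerable

  attr_reader :download_count

  def initialize(source, streaming: false)
    @source = source
    @streaming = streaming
    @cache = []
    @fully_cached = false
    @download_count = 0
  end

  def each
    if @fully_cached
      # Serve rows from memory without touching the source again.
      @cache.each { |row| yield row }
    else
      @download_count += 1
      @source.call.each do |row|
        @cache.push row unless @streaming
        yield row
      end
      @fully_cached = true unless @streaming
    end
    nil
  end
end

source = proc { [{ 'n' => 1 }, { 'n' => 2 }] }

cached = MiniTable.new(source)
2.times { cached.to_a }
puts cached.download_count

streamed = MiniTable.new(source, streaming: true)
2.times { streamed.to_a }
puts streamed.download_count
```

The cached table reads its source once, while the streaming table reads it on every enumeration, which is the trade-off described for the :streaming setting.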

#free ⇒ nil

Clear the row cache in case it helps your GC.

Returns:

  • (nil)


# File 'lib/remote_table.rb', line 519

def free
  @fully_cached = false
  cache.clear
  nil
end

#parser ⇒ #call

An object that responds to #call(row) and returns an array of one or more rows.

Returns:

  • (#call)


# File 'lib/remote_table.rb', line 366

def parser
  @final_parser ||= (@parser || NullParser.new)
end
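
Since a parser only has to respond to #call(row) and return an array of rows, it can turn one source row into several. The sketch below is a hypothetical example that "unpivots" year columns; the class name, column names, and values are made up. Note that #each also assigns `row_hash=` on every row the parser returns; how RemoteTable makes that possible for plain hashes is not shown in this page, so this sketch only exercises the parser on its own.

```ruby
# Hypothetical parser: turns one wide row into one row per year column.
class YearUnpivotParser
  YEARS = %w[2009 2010].freeze

  def call(row)
    YEARS.map do |year|
      { 'country' => row['country'], 'year' => year, 'value' => row[year] }
    end
  end
end

wide_row = { 'country' => 'Peru', '2009' => 10, '2010' => 12 }
narrow_rows = YearUnpivotParser.new.call(wide_row)
p narrow_rows
```

An object like this would be supplied at creation through the :parser setting.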

#to_a ⇒ Array<Hash,Array>
Also known as: rows

Returns all rows.

Returns:

  • (Array<Hash,Array>)



# File 'lib/remote_table.rb', line 494

def to_a
  if fully_cached?
    cache.dup
  else
    map { |row| row }
  end
end