Class: RemoteTable

Inherits:
Object
Includes:
Enumerable
Defined in:
lib/remote_table.rb,
lib/remote_table/ods.rb,
lib/remote_table/xls.rb,
lib/remote_table/xml.rb,
lib/remote_table/html.rb,
lib/remote_table/xlsx.rb,
lib/remote_table/yaml.rb,
lib/remote_table/version.rb,
lib/remote_table/delimited.rb,
lib/remote_table/plaintext.rb,
lib/remote_table/local_copy.rb,
lib/remote_table/fixed_width.rb,
lib/remote_table/processed_by_roo.rb,
lib/remote_table/processed_by_nokogiri.rb

Overview

Opens Google Docs spreadsheets and local or remote XLSX, XLS, ODS, CSV (comma-separated), TSV (tab-separated), other delimited, fixed-width, HTML, XML, and YAML files.

Defined Under Namespace

Modules: Delimited, FixedWidth, Html, Ods, Plaintext, ProcessedByNokogiri, ProcessedByRoo, Xls, Xlsx, Xml, Yaml

Constant Summary

WHITESPACE =
/\s+/
SINGLE_SPACE =
' '
EXTERNAL_ENCODING =
'UTF-8'
EXTERNAL_ENCODING_ICONV =
'UTF-8//TRANSLIT'
GOOGLE_DOCS_SPREADSHEET =
[
  /docs.google.com/i,
  /spreadsheets.google.com/i
]
VALID =
{
  :compression => [:gz, :zip, :bz2, :exe],
  :packing => [:tar],
  :format => [:xlsx, :xls, :delimited, :ods, :fixed_width, :html, :xml, :yaml, :csv],
}
DEFAULT =
{
  :streaming => false,
  :warn_on_multiple_downloads => true,
  :headers => :first_row,
  :keep_blank_rows => false,
  :skip => 0,
  :encoding => 'UTF-8',
}
OLD_SETTING_NAMES =
{
  :pre_select => [:select],
  :pre_reject => [:reject],
  :delimiter  => [:col_sep],
}
VERSION =
'3.2.1'

Constructor Details

#initialize(settings) ⇒ RemoteTable
#initialize(url, settings) ⇒ RemoteTable

Create a new RemoteTable, which is an Enumerable.

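Options are set at creation using any of the attributes listed under Instance Attribute Details. The generated documentation marks them “read-only” because they cannot be set or changed after creation.

Nothing is downloaded or parsed at creation; the table is loaded lazily, the first time its rows are enumerated.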

Examples:

Open an XLSX

RemoteTable.new('http://www.customerreferenceprogram.org/uploads/CRP_RFP_template.xlsx')

Open a CSV inside a ZIP file

RemoteTable.new 'http://www.epa.gov/climatechange/emissions/downloads10/2010-Inventory-Annex-Tables.zip',
                :filename => 'Annex Tables/Annex 3/Table A-93.csv',
                :skip => 1,
                :pre_select => proc { |row| row['Vehicle Age'].strip =~ /^\d+$/ }

Overloads:

  • #initialize(settings) ⇒ RemoteTable

    Parameters:

    • settings (Hash)

      Settings including :url.

  • #initialize(url, settings) ⇒ RemoteTable

    Parameters:

    • url (String)

      The URL to the local or remote file.

    • settings (Hash)

      Settings.



# File 'lib/remote_table.rb', line 384

def initialize(*args)
  @download_count_mutex = ::Mutex.new
  @extend_bang_mutex = ::Mutex.new

  @cache = []
  @download_count = 0

  settings = args.last.is_a?(::Hash) ? args.last.symbolize_keys : {}

  @url = if args.first.is_a? ::String
    args.first
  else
    grab settings, :url
  end
  @format = RemoteTable.guess_format grab(settings, :format)
  if GOOGLE_DOCS_SPREADSHEET.any? { |regex| regex =~ url }
    @url = RemoteTable.google_spreadsheet_csv_url url
    @format = :delimited
  end

  @headers = grab settings, :headers
  if headers.is_a?(::Array) and headers.any?(&:blank?)
    raise ::ArgumentError, "[remote_table] If you specify headers, none of them can be blank"
  end
  @quote_char = grab settings, :quote_char

  @compression = grab(settings, :compression) || RemoteTable.guess_compression(url)
  @packing = grab(settings, :packing) || RemoteTable.guess_packing(url)

  @streaming = grab settings, :streaming
  @warn_on_multiple_downloads = grab settings, :warn_on_multiple_downloads
  @delimiter = grab settings, :delimiter
  @sheet = grab settings, :sheet
  @keep_blank_rows = grab settings, :keep_blank_rows
  @form_data = grab settings, :form_data
  @skip = grab settings, :skip
  @encoding = grab settings, :encoding
  @row_xpath = grab settings, :row_xpath
  @column_xpath = grab settings, :column_xpath
  @row_css = grab settings, :row_css
  @column_css = grab settings, :column_css
  @glob = grab settings, :glob
  @filename = grab settings, :filename
  @cut = grab settings, :cut
  @crop = grab settings, :crop
  @schema = grab settings, :schema
  @schema_name = grab settings, :schema_name
  @pre_select = grab settings, :pre_select
  @pre_reject = grab settings, :pre_reject
  @errata = grab settings, :errata
  @parser = grab settings, :parser

  @other_options = settings

  @local_copy = LocalCopy.new self
  extend!
end
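
The constructor resolves every setting through a `grab` helper whose source is not shown on this page. The following is a minimal sketch of how such a helper could behave, assuming it accepts the legacy names from OLD_SETTING_NAMES, falls back to DEFAULT, and removes what it consumes so that the leftovers become other_options. The method body, the trimmed constants, and the sample settings are illustrative assumptions, not the library's code.

```ruby
# Trimmed stand-ins for the constants documented above.
DEFAULT = { :streaming => false, :headers => :first_row, :skip => 0, :encoding => 'UTF-8' }
OLD_SETTING_NAMES = { :pre_select => [:select], :pre_reject => [:reject], :delimiter => [:col_sep] }

# Assumed behavior: remove and return the value stored under +key+ or one of
# its legacy names; fall back to the default when the caller gave neither.
def grab(settings, key)
  names = [key] + OLD_SETTING_NAMES.fetch(key, [])
  found = names.select { |name| settings.key?(name) }
  return DEFAULT[key] if found.empty?
  values = found.map { |name| settings.delete(name) }
  values.first
end

even_rows = proc { |row| row['n'].to_i.even? }
settings = { :select => even_rows, :col_sep => "\t", :liberal_parsing => true }

pre_select    = grab(settings, :pre_select) # found under the legacy name :select
delimiter     = grab(settings, :delimiter)  # found under the legacy name :col_sep
skip          = grab(settings, :skip)       # not given, so DEFAULT[:skip]
other_options = settings                    # whatever was not consumed
```

In this sketch the unrecognized `:liberal_parsing` key is the only entry left in `other_options`, which matches the description of that attribute as options passed through to the underlying parser.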

Instance Attribute Details

#column_css ⇒ String (readonly)

The CSS selector used to find columns in HTML or XML.

Returns:

  • (String)


# File 'lib/remote_table.rb', line 248

def column_css
  @column_css
end

#column_xpath ⇒ String (readonly)

The XPath used to find columns in HTML or XML.

Returns:

  • (String)


# File 'lib/remote_table.rb', line 240

def column_xpath
  @column_xpath
end

#compression ⇒ Symbol (readonly)

The compression type. Guessed from URL if not provided. :gz, :zip, :bz2, and :exe (treated as :zip) are supported.

Returns:

  • (Symbol)


# File 'lib/remote_table.rb', line 256

def compression
  @compression
end

#crop ⇒ Range (readonly)

Use a range of rows in a plaintext file.

Examples:

Only take rows 21 through 37

RemoteTable.new("http://www.eia.gov/emeu/cbecs/cbecs2003/detailed_tables_2003/2003set10/2003excel/C17.xls",
                :headers => false,
                :pre_select => proc { |row| CbecsEnergyIntensity::NAICS_CODE_SYNTHESIZER.call(row) },
                :crop => (21..37))

Returns:

  • (Range)


# File 'lib/remote_table.rb', line 297

def crop
  @crop
end

#cut ⇒ String (readonly)

Pick specific columns out of a plaintext file using an argument to the UNIX cut utility (en.wikipedia.org/wiki/Cut_%28Unix%29).

Examples:

Pick ALMOST out of ABCDEFGHIJKLMNOPQRSTUVWXYZ

# $ echo ABCDEFGHIJKLMNOPQRSTUVWXYZ | cut -c '1,12,13,15,19,20'
# ALMOST
RemoteTable.new 'file:///atoz.txt', :cut => '1,12,13,15,19,20'

Returns:

  • (String)


# File 'lib/remote_table.rb', line 286

def cut
  @cut
end
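
To make the `:cut` argument concrete, here is a rough pure-Ruby emulation of `cut -c LIST` for a single line. It is only a sketch: the helper name `cut_characters` is hypothetical, it handles just comma-separated positions and closed `N-M` ranges, and it says nothing about how the library itself applies the setting.

```ruby
# Keeps the 1-based character positions named in +list+ (comma-separated
# numbers or N-M ranges), in their original order, the way cut -c does.
def cut_characters(line, list)
  wanted = list.split(',').flat_map do |part|
    first, last = part.split('-', 2).map { |n| Integer(n) }
    (first..(last || first)).to_a
  end
  line.each_char.with_index(1)
      .select { |_char, position| wanted.include?(position) }
      .map(&:first)
      .join
end

almost = cut_characters('ABCDEFGHIJKLMNOPQRSTUVWXYZ', '1,12,13,15,19,20')
abce   = cut_characters('ABCDEF', '1-3,5')
```

The first call picks the same six characters as the shell example in the comment above.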

#delimiter ⇒ String (readonly)

The delimiter, a.k.a. column separator. Passed to Ruby CSV as :col_sep. Default is ‘,’.

Returns:

  • (String)


# File 'lib/remote_table.rb', line 232

def delimiter
  @delimiter
end

#encoding ⇒ String (readonly)

The original encoding of the source file. Default is UTF-8.

Returns:

  • (String)


# File 'lib/remote_table.rb', line 228

def encoding
  @encoding
end

#errata ⇒ Hash (readonly)

An object that responds to #rejects?(row) and #correct!(row). Applied after creating row_hash.

  • #rejects?(row) - if row should be treated like it doesn’t exist

  • #correct!(row) - destructively update a row to fix something

See the Errata library at github.com/seamusabshere/errata for an example implementation.

Returns:

  • (Hash)

# File 'lib/remote_table.rb', line 334

def errata
  @errata
end
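
As an illustration of the two-method contract described above, here is a minimal hand-written errata object. The class name, the column name, and the sample rows are hypothetical; the Errata library linked above is the reference implementation.

```ruby
# A hypothetical errata object. It only needs to respond to
# #rejects?(row) and #correct!(row).
class SimpleErrata
  # Treat rows without a name as if they did not exist.
  def rejects?(row)
    row['name'].to_s.strip.empty?
  end

  # Destructively fix a known misspelling.
  def correct!(row)
    row['name'] = 'Toyota' if row['name'] == 'Toyotta'
    row
  end
end

errata = SimpleErrata.new
rows = [{ 'name' => 'Toyotta' }, { 'name' => ' ' }, { 'name' => 'Honda' }]

# Mirrors the order used in #each: reject first, then correct.
cleaned = rows.reject { |row| errata.rejects?(row) }
              .each { |row| errata.correct!(row) }
```

An object like this would be passed at creation as `:errata => SimpleErrata.new`.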

#filename ⇒ String (readonly)

The filename, which can be used to pick a file out of an archive.

Examples:

Specify the filename to get out of a ZIP file

RemoteTable.new 'http://www.fueleconomy.gov/FEG/epadata/08data.zip', :filename => '2008_FE_guide_ALL_rel_dates_-no sales-for DOE-5-1-08.csv'

Returns:

  • (String)


# File 'lib/remote_table.rb', line 276

def filename
  @filename
end

#form_data ⇒ String (readonly)

Form data to POST in the download request. It should probably be encoded as application/x-www-form-urlencoded.

Returns:

  • (String)


# File 'lib/remote_table.rb', line 220

def form_data
  @form_data
end

#format ⇒ Symbol (readonly)

The format of the source file. Can be specified as :xlsx, :xls, :delimited (aka :csv), :ods, :fixed_width, :html, :xml, or :yaml.

Note: treats all docs.google.com and spreadsheets.google.com URLs as :delimited.

Default: guessed from the file extension, which usually matches the URL's extension but may differ when you pick a specific file out of an archive.

Returns:

  • (Symbol)

# File 'lib/remote_table.rb', line 252

def format
  @format
end

#glob ⇒ String (readonly)

The glob used to pick a file out of an archive.

Examples:

Pick out the only CSV in a ZIP file

RemoteTable.new 'http://www.fueleconomy.gov/FEG/epadata/08data.zip', :glob => '/*.csv'

Returns:

  • (String)


# File 'lib/remote_table.rb', line 268

def glob
  @glob
end

#headers ⇒ :first_row, ... (readonly)

Headers specified by the user: :first_row (the default), false, or a list of headers.

Returns:

  • (:first_row, false, Array<String>)


# File 'lib/remote_table.rb', line 201

def headers
  @headers
end

#keep_blank_rows ⇒ true, false (readonly)

Whether to keep blank rows. Default is false.

Returns:

  • (true, false)


# File 'lib/remote_table.rb', line 216

def keep_blank_rows
  @keep_blank_rows
end

#other_options ⇒ Hash (readonly)

Options passed by the user that may be passed through to the underlying parsing library.

Returns:

  • (Hash)

# File 'lib/remote_table.rb', line 361

def other_options
  @other_options
end

#packing ⇒ Symbol (readonly)

The packing type. Guessed from URL if not provided. Only :tar is supported.

Returns:

  • (Symbol)


# File 'lib/remote_table.rb', line 260

def packing
  @packing
end

#pre_reject ⇒ Proc (readonly)

A proc that decides whether to exclude a row: rows for which it returns a truthy value are skipped. Previously passed as :reject.

Returns:

  • (Proc)


# File 'lib/remote_table.rb', line 324

def pre_reject
  @pre_reject
end

#pre_select ⇒ Proc (readonly)

A proc that decides whether to include a row: rows for which it returns a falsy value are skipped. Previously passed as :select.

Returns:

  • (Proc)


# File 'lib/remote_table.rb', line 320

def pre_select
  @pre_select
end

#quote_char ⇒ String (readonly)

Quote character for delimited files.

Defaults to double quotes.

Returns:

  • (String)


# File 'lib/remote_table.rb', line 208

def quote_char
  @quote_char
end

#row_css ⇒ String (readonly)

The CSS selector used to find rows in HTML or XML.

Returns:

  • (String)


# File 'lib/remote_table.rb', line 244

def row_css
  @row_css
end

#row_xpath ⇒ String (readonly)

The XPath used to find rows in HTML or XML.

Returns:

  • (String)


# File 'lib/remote_table.rb', line 236

def row_xpath
  @row_xpath
end

#schema ⇒ Array<Array{String,Integer,Hash}> (readonly)

The fixed-width schema, given as a multi-dimensional array.

Examples:

From the tests

RemoteTable.new('http://cloud.github.com/downloads/seamusabshere/remote_table/test2.fixed_width.txt',
                 :format => :fixed_width,
                 :skip => 1,
                 :schema => [[ 'header4', 10, { :type => :string }  ],
                             [  'spacer',  1 ],
                             [  'header5', 10, { :type => :string } ],
                             [  'spacer',  12 ],
                             [  'header6', 10, { :type => :string } ]])

Returns:

  • (Array<Array{String,Integer,Hash}>)

# File 'lib/remote_table.rb', line 312

def schema
  @schema
end
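
To show how a schema of this shape lines up with the text, the following sketch slices one fixed-width line into named fields. It assumes each entry is `[name, width, options]`, that columns named 'spacer' are discarded, and that values are stripped of padding; the helper `parse_fixed_width` and the sample line are hypothetical and do not describe the library's actual parser.

```ruby
# Slices one fixed-width line into named fields, walking the schema from
# left to right and advancing by each column's width.
def parse_fixed_width(line, schema)
  offset = 0
  schema.each_with_object({}) do |(name, width, _options), row|
    row[name] = line[offset, width].to_s.strip unless name == 'spacer'
    offset += width
  end
end

schema = [['header4', 10, { :type => :string }],
          ['spacer',   1],
          ['header5', 10, { :type => :string }],
          ['spacer',  12],
          ['header6', 10, { :type => :string }]]

# 10 + 1 + 10 + 12 + 10 characters, matching the widths in the schema.
line = 'abc'.ljust(10) + ' ' + 'def'.ljust(10) + (' ' * 12) + 'ghi'.ljust(10)
row = parse_fixed_width(line, schema)
```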

#schema_name ⇒ String, Symbol (readonly)

The name of a fixed-width schema that has already been defined elsewhere, so that it can be reused.

Returns:

  • (String, Symbol)


# File 'lib/remote_table.rb', line 316

def schema_name
  @schema_name
end

#sheet ⇒ Object (readonly)

The sheet specified by the user, as a number or a string.

# File 'lib/remote_table.rb', line 212

def sheet
  @sheet
end

#skip ⇒ Integer (readonly)

How many rows to skip at the beginning of the file or table. Default is 0.

Returns:

  • (Integer)


# File 'lib/remote_table.rb', line 224

def skip
  @skip
end

#streaming ⇒ true, false (readonly)

Whether to stream the rows without caching them. Saves memory, but you have to re-download the file every time you enumerate its rows. Defaults to false.

Returns:

  • (true, false)


# File 'lib/remote_table.rb', line 193

def streaming
  @streaming
end

#url ⇒ String (readonly)

The URL of the local or remote file.

Examples:

Local

file:///Users/myuser/Desktop/holidays.csv

Local using an absolute path

/Users/myuser/Desktop/holidays.csv

Remote

http://data.brighterplanet.com/countries.csv

Returns:

  • (String)


# File 'lib/remote_table.rb', line 174

def url
  @url
end

#warn_on_multiple_downloads ⇒ true, false (readonly)

Whether to warn the user on multiple downloads. Defaults to true.

Returns:

  • (true, false)


# File 'lib/remote_table.rb', line 197

def warn_on_multiple_downloads
  @warn_on_multiple_downloads
end

Class Method Details

.google_spreadsheet_csv_url(url) ⇒ String

Given a Google Docs spreadsheet URL, make sure it uses CSV output.

Returns:

  • (String)


# File 'lib/remote_table.rb', line 98

def google_spreadsheet_csv_url(url)
  uri = ::URI.parse url
  params = uri.query.split('&')
  params.delete_if { |param| param.start_with?('output=') }
  params << 'output=csv'
  uri.query = params.join('&')
  uri.to_s
end
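
The query rewriting above can be tried on its own with only the standard library. The sketch below repeats the same steps with one assumed addition: a URL without a query string is treated as having no parameters rather than failing on nil. The sample URLs are made up for illustration.

```ruby
require 'uri'

# Drops any existing output= parameter and appends output=csv.
def google_spreadsheet_csv_url(url)
  uri = URI.parse(url)
  params = (uri.query || '').split('&') # assumed nil guard, not in the listing above
  params.delete_if { |param| param.start_with?('output=') }
  params << 'output=csv'
  uri.query = params.join('&')
  uri.to_s
end

with_output   = google_spreadsheet_csv_url('https://spreadsheets.google.com/pub?key=abc&output=html&gid=0')
without_query = google_spreadsheet_csv_url('https://docs.google.com/spreadsheet/pub')
```

Other parameters keep their relative order, and `output=csv` always ends up last.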

.guess_compression(url) ⇒ Symbol?

Guess compression based on URL. Used internally.

Returns:

  • (Symbol, nil)


# File 'lib/remote_table.rb', line 49

def guess_compression(url)
  extname = extname(url).downcase
  case extname
  when /gz/, /gunzip/
    :gz
  when /zip/
    :zip
  when /bz2/, /bunzip2/
    :bz2
  when /exe/
    :exe
  end
end
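
The `extname` helper called above is not shown on this page. As a sketch, assuming it returns the extension of the URL's path, the guess can be reproduced with the standard library as follows; the example.com URLs are placeholders.

```ruby
require 'uri'

# Assumed stand-in for the extname helper: the extension of the URL's path.
def extname(url)
  File.extname(URI.parse(url).path.to_s)
end

# Same branches as the method above, written in one-line form.
def guess_compression(url)
  case extname(url).downcase
  when /gz/, /gunzip/   then :gz
  when /zip/            then :zip
  when /bz2/, /bunzip2/ then :bz2
  when /exe/            then :exe
  end
end

urls = %w[
  http://example.com/data.csv.gz
  http://example.com/tables.ZIP
  http://example.com/dump.bz2
  http://example.com/plain.csv
]
guesses = urls.map { |url| guess_compression(url) }
```

Only the final extension is inspected, and a URL with no recognized extension yields nil, which leaves the file treated as uncompressed unless `:compression` is given explicitly.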

.guess_format(basename) ⇒ Symbol?

Guess file format from the basename. Since a file might be decompressed and/or pulled out of an archive with a glob, this usually can’t be called until a file is downloaded.

Returns:

  • (Symbol, nil)


# File 'lib/remote_table.rb', line 74

def guess_format(basename)
  case basename.to_s.downcase.strip
  when /ods\z/, /open_?office\z/
    :ods
  when /xlsx\z/, /excelx\z/
    :xlsx
  when /xls\z/, /excel\z/
    :xls
  when /csv\z/, /tsv\z/, /delimited\z/
    # note that there is no RemoteTable::Csv class - it's normalized to :delimited
    :delimited
  when /fixed_?width\z/
    :fixed_width
  when /html?\z/
    :html
  when /xml\z/
    :xml
  when /yaml\z/, /yml\z/
    :yaml
  end
end
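
Because the constructor passes the `:format` setting through this method, both file names and symbols such as `:csv` are normalized the same way. The following table-driven restatement is a sketch for experimenting with inputs; the constant `FORMAT_PATTERNS` is a hypothetical name, and the patterns are copied from the case statement above.

```ruby
# Each format with the patterns that select it; the first match wins.
FORMAT_PATTERNS = [
  [:ods,         [/ods\z/, /open_?office\z/]],
  [:xlsx,        [/xlsx\z/, /excelx\z/]],
  [:xls,         [/xls\z/, /excel\z/]],
  [:delimited,   [/csv\z/, /tsv\z/, /delimited\z/]],
  [:fixed_width, [/fixed_?width\z/]],
  [:html,        [/html?\z/]],
  [:xml,         [/xml\z/]],
  [:yaml,        [/yaml\z/, /yml\z/]]
].freeze

def guess_format(basename)
  name = basename.to_s.downcase.strip
  match = FORMAT_PATTERNS.find do |_format, patterns|
    patterns.any? { |pattern| pattern =~ name }
  end
  match && match.first
end

inputs = ['Report.XLSX', 'data.tsv', 'index.htm', :csv, 'notes.txt', nil]
guesses = inputs.map { |input| guess_format(input) }
```

Unrecognized names and nil both produce nil, so the format stays undecided until something more specific is known.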

.guess_packing(url) ⇒ Symbol?

Guess packing from URL. Used internally.

Returns:

  • (Symbol, nil)


# File 'lib/remote_table.rb', line 65

def guess_packing(url)
  basename = basename(url).downcase
  if basename.include?('.tar') or basename.include?('.tgz')
    :tar
  end
end
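
A short sketch of the same check, assuming the unshown `basename` helper returns the last segment of the URL's path. The URLs are placeholders.

```ruby
require 'uri'

# Assumed stand-in for the basename helper: the last segment of the URL's path.
def basename(url)
  File.basename(URI.parse(url).path.to_s)
end

def guess_packing(url)
  name = basename(url).downcase
  :tar if name.include?('.tar') || name.include?('.tgz')
end

urls = %w[
  http://example.com/archive.tar.gz
  http://example.com/archive.TGZ
  http://example.com/archive.zip
]
packings = urls.map { |url| guess_packing(url) }
```

Packing and compression are guessed independently, so a name such as archive.tar.gz is reported as :tar here while the compression guess looks only at its final extension.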

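.normalize_whitespace(v) ⇒ String

Convert the value to a string, collapse every run of whitespace into a single space, and strip leading and trailing whitespace.
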
# File 'lib/remote_table.rb', line 107

def normalize_whitespace(v)
  v = v.to_s.dup
  v.gsub! WHITESPACE, SINGLE_SPACE
  v.strip!
  v
end
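
The method depends only on the WHITESPACE and SINGLE_SPACE constants listed in the Constant Summary, so it can be exercised standalone. The sample inputs below are made up.

```ruby
WHITESPACE = /\s+/
SINGLE_SPACE = ' '

def normalize_whitespace(v)
  v = v.to_s.dup                       # never mutate the caller's object
  v.gsub! WHITESPACE, SINGLE_SPACE     # tabs, newlines, and runs become one space
  v.strip!
  v
end

tidy   = normalize_whitespace("  Vehicle \t Age\n (years) ")
number = normalize_whitespace(42)
blank  = normalize_whitespace(nil)
```

Because of the `to_s` call, non-string values such as numbers and nil are accepted and always come back as strings.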

.transpose(url, key_key, value_key, options = {}) ⇒ Object

Transpose two columns into a mapping from one to the other.



# File 'lib/remote_table.rb', line 40

def transpose(url, key_key, value_key, options = {})
  new(url, options).inject({}) do |memo, row|
    memo[row[key_key]] = row[value_key]
    memo
  end
end
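
The fold itself does not depend on where the rows come from. This sketch applies it to an in-memory array standing in for the rows a RemoteTable would yield; the helper name `transpose_rows` and the data are hypothetical.

```ruby
# Stand-in for the rows of a table with 'iso' and 'name' columns.
rows = [
  { 'iso' => 'US', 'name' => 'United States' },
  { 'iso' => 'FR', 'name' => 'France' }
]

# Same fold as .transpose, applied to any enumerable of row hashes.
def transpose_rows(rows, key_key, value_key)
  rows.inject({}) do |memo, row|
    memo[row[key_key]] = row[value_key]
    memo
  end
end

names_by_iso = transpose_rows(rows, 'iso', 'name')
```

If two rows share the same key, the later row overwrites the earlier one in the resulting hash.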

Instance Method Details

#[](row_number) ⇒ Hash, Array

Get a row by row number. Zero-based.

Returns:

  • (Hash, Array)

# File 'lib/remote_table.rb', line 496

def [](row_number)
  if fully_cached?
    cache[row_number]
  else
    to_a[row_number]
  end
end

#each {|Hash, Array| ... } ⇒ nil Also known as: each_row

Yield each row.

Yields:

  • (Hash, Array)

    A hash or an array depending on whether the RemoteTable has named headers (column names).

Returns:

  • (nil)


# File 'lib/remote_table.rb', line 447

def each
  if fully_cached?
    cache.each do |row|
      yield row
    end
  else
    mark_download!
    preprocess!
    memo = _each do |row|
      parser.call(row).each do |virtual_row|
        virtual_row.row_hash = ::HashDigest.digest2 row
        if errata
          next if errata.rejects? virtual_row
          errata.correct! virtual_row
        end
        next if pre_select and !pre_select.call(virtual_row)
        next if pre_reject and pre_reject.call(virtual_row)
        unless streaming
          cache.push virtual_row
        end
        yield virtual_row
      end
    end
    unless streaming
      fully_cached!
    end
    memo
  end
  nil
end
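
The interplay of filtering, caching, and streaming in the method above can be seen in a much smaller model. The class below is a toy sketch with hypothetical names: `source` stands in for the download-and-parse step, errata and the parser are left out, and the cache is cleared before each uncached pass, which is an addition made for this sketch.

```ruby
# A toy Enumerable that filters rows, caches them unless streaming, and
# replays the cache on later passes.
class MiniTable
  include Enumerable

  attr_reader :reads # how many times the source was read

  def initialize(source, pre_select: nil, pre_reject: nil, streaming: false)
    @source = source
    @pre_select = pre_select
    @pre_reject = pre_reject
    @streaming = streaming
    @cache = []
    @fully_cached = false
    @reads = 0
  end

  def each
    if @fully_cached
      @cache.each { |row| yield row }
    else
      @reads += 1
      @cache.clear # drop rows left behind by an interrupted pass
      @source.each do |row|
        next if @pre_select && !@pre_select.call(row)
        next if @pre_reject && @pre_reject.call(row)
        @cache.push(row) unless @streaming
        yield row
      end
      @fully_cached = true unless @streaming
    end
    nil
  end
end

source = [{ 'age' => '1' }, { 'age' => 'n/a' }, { 'age' => '2' }]
numeric = proc { |row| row['age'] =~ /\A\d+\z/ }

cached = MiniTable.new(source, pre_select: numeric)
first_pass  = cached.to_a
second_pass = cached.to_a

streamed = MiniTable.new(source, pre_select: numeric, streaming: true)
2.times { streamed.to_a }
```

In this model the cached table reads its source once for both passes, while the streaming table reads it on every pass, which is the trade-off described for the streaming attribute.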

#free ⇒ nil

Clear the row cache in case it helps your GC.

Returns:

  • (nil)


# File 'lib/remote_table.rb', line 507

def free
  @fully_cached = false
  cache.clear
  nil
end

#parser ⇒ #call

An object that responds to #call(row) and returns an array of one or more rows.

Returns:

  • (#call)


# File 'lib/remote_table.rb', line 355

def parser
  @final_parser ||= (@parser || NullParser.new)
end
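
As an illustration of the `#call(row)` contract, here is a hypothetical parser that expands one wide row into several rows, next to a pass-through proc that shows the assumed shape of a do-nothing parser. Note that #each also assigns `row_hash` on every row the parser returns, so rows used with the real library must support that; plain hashes are used here only to keep the sketch short.

```ruby
# Turns one wide row (one column per year) into one row per year.
class YearUnpivotParser
  def call(row)
    row.reject { |column, _value| column == 'country' }.map do |year, value|
      { 'country' => row['country'], 'year' => year.to_i, 'value' => value }
    end
  end
end

# Assumed shape of a pass-through parser: one row in, the same row out.
pass_through = proc { |row| [row] }

wide = { 'country' => 'FR', '2009' => 10, '2010' => 12 }
unpivoted = YearUnpivotParser.new.call(wide)
unchanged = pass_through.call(wide)
```

A parser like this would be passed at creation as `:parser => YearUnpivotParser.new`.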

#to_a ⇒ Array<Hash,Array> Also known as: rows

Returns all rows.

Returns:

  • (Array<Hash,Array>)

# File 'lib/remote_table.rb', line 482

def to_a
  if fully_cached?
    cache.dup
  else
    map { |row| row }
  end
end