Class: RemoteTable
- Inherits:
- Object
  - Object
  - RemoteTable
- Includes:
- Enumerable
- Defined in:
- lib/remote_table.rb,
lib/remote_table/ods.rb,
lib/remote_table/xls.rb,
lib/remote_table/xml.rb,
lib/remote_table/html.rb,
lib/remote_table/json.rb,
lib/remote_table/xlsx.rb,
lib/remote_table/yaml.rb,
lib/remote_table/version.rb,
lib/remote_table/delimited.rb,
lib/remote_table/plaintext.rb,
lib/remote_table/local_copy.rb,
lib/remote_table/fixed_width.rb,
lib/remote_table/processed_by_roo.rb,
lib/remote_table/processed_by_nokogiri.rb
Overview
Open Google Docs spreadsheets, local or remote XLSX, XLS, ODS, CSV (comma separated), TSV (tab separated), other delimited, fixed-width files.
Defined Under Namespace
Modules: Delimited, FixedWidth, Html, Json, Ods, Plaintext, ProcessedByNokogiri, ProcessedByRoo, Xls, Xlsx, Xml, Yaml
Constant Summary
- WHITESPACE = /\s+/
- SINGLE_SPACE = ' '
- EXTERNAL_ENCODING = 'UTF-8'
- EXTERNAL_ENCODING_ICONV = 'UTF-8//TRANSLIT'
- GOOGLE_DOCS_SPREADSHEET = [ /docs.google.com/i, /spreadsheets.google.com/i ]
- VALID = { :compression => [:gz, :zip, :bz2, :exe], :packing => [:tar], :format => [:xlsx, :xls, :delimited, :ods, :fixed_width, :html, :xml, :yaml, :csv, :json], }
- DEFAULT = { :streaming => false, :warn_on_multiple_downloads => true, :headers => :first_row, :keep_blank_rows => false, :skip => 0, :encoding => 'UTF-8', }
- OLD_SETTING_NAMES = { :pre_select => [:select], :pre_reject => [:reject], :delimiter => [:col_sep], }
- VERSION = '3.3.2'
Instance Attribute Summary
-
#column_css ⇒ String
readonly
The CSS selector used to find columns in HTML or XML.
-
#column_xpath ⇒ String
readonly
The XPath used to find columns in HTML or XML.
-
#compression ⇒ Symbol
readonly
The compression type.
-
#crop ⇒ Range
readonly
Use a range of rows in a plaintext file.
-
#cut ⇒ String
readonly
Pick specific columns out of a plaintext file using an argument to the UNIX [cut utility](en.wikipedia.org/wiki/Cut_%28Unix%29).
-
#delimiter ⇒ String
readonly
The delimiter, a.k.a. column separator.
-
#encoding ⇒ String
readonly
The original encoding of the source file.
-
#errata ⇒ Hash
readonly
An object that responds to #rejects?(row) and #correct!(row).
-
#filename ⇒ String
readonly
The filename, which can be used to pick a file out of an archive.
-
#form_data ⇒ String
readonly
Form data to POST in the download request.
-
#format ⇒ Hash
readonly
The format of the source file.
-
#glob ⇒ String
readonly
The glob used to pick a file out of an archive.
-
#headers ⇒ :first_row, ...
readonly
Headers specified by the user: :first_row (the default), false, or a list of headers.
-
#keep_blank_rows ⇒ true, false
readonly
Whether to keep blank rows.
-
#other_options ⇒ Hash
readonly
Options passed by the user that may be passed through to the underlying parsing library.
-
#packing ⇒ Symbol
readonly
The packing type.
-
#pre_reject ⇒ Proc
readonly
A proc that decides whether to exclude a row.
-
#pre_select ⇒ Proc
readonly
A proc that decides whether to include a row.
-
#quote_char ⇒ String
readonly
Quote character for delimited files.
-
#root_node ⇒ String
readonly
The root node of the json document.
-
#row_css ⇒ String
readonly
The CSS selector used to find rows in HTML or XML.
-
#row_xpath ⇒ String
readonly
The XPath used to find rows in HTML or XML.
-
#schema ⇒ Array<Array{String,Integer,Hash}>
readonly
The fixed-width schema, given as a multi-dimensional array.
-
#schema_name ⇒ String, Symbol
readonly
If you somehow already defined a fixed-width schema (so you can re-use it?), specify it here.
-
#sheet ⇒ Object
readonly
The sheet specified by the user as a number or a string.
-
#skip ⇒ Integer
readonly
How many rows to skip at the beginning of the file or table.
-
#streaming ⇒ true, false
readonly
Whether to stream the rows without caching them.
-
#url ⇒ String
readonly
The URL of the local or remote file.
-
#warn_on_multiple_downloads ⇒ true, false
readonly
Whether to warn the user on multiple downloads.
Class Method Summary
-
.google_spreadsheet_csv_url(url) ⇒ String
Given a Google Docs spreadsheet URL, make sure it uses CSV output.
-
.guess_compression(url) ⇒ Symbol?
Guess compression based on URL.
-
.guess_format(basename) ⇒ Symbol?
Guess file format from the basename.
-
.guess_packing(url) ⇒ Symbol?
Guess packing from URL.
-
.normalize_whitespace(v) ⇒ Object
Collapse each run of whitespace into a single space and strip leading and trailing whitespace.
-
.transpose(url, key_key, value_key, options = {}) ⇒ Object
Transpose two columns into a mapping from one to the other.
Instance Method Summary
-
#[](row_number) ⇒ Hash, Array
Get a row by row number.
-
#each {|Hash, Array| ... } ⇒ nil
(also: #each_row)
Yield each row.
-
#free ⇒ nil
Clear the row cache in case it helps your GC.
-
#initialize(*args) ⇒ RemoteTable
constructor
Create a new RemoteTable, which is an Enumerable.
-
#parser ⇒ #call
An object that responds to #call(row) and returns an array of one or more rows.
-
#to_a ⇒ Array<Hash,Array>
(also: #rows)
All rows.
Constructor Details
#initialize(settings) ⇒ RemoteTable
#initialize(url, settings) ⇒ RemoteTable
Create a new RemoteTable, which is an Enumerable.
Options are set at creation using any of the attributes listed. RDoc marks them “read-only” because they can’t be changed after creation.
Does not download or parse immediately; loading is lazy.
# File 'lib/remote_table.rb', line 395
def initialize(*args)
  @download_count_mutex = ::Mutex.new
  @extend_bang_mutex = ::Mutex.new
  @cache = []
  @download_count = 0

  settings = args.last.is_a?(::Hash) ? args.last.symbolize_keys : {}

  @url = if args.first.is_a? ::String
    args.first
  else
    grab settings, :url
  end
  @format = RemoteTable.guess_format grab(settings, :format)
  if GOOGLE_DOCS_SPREADSHEET.any? { |regex| regex =~ url }
    @url = RemoteTable.google_spreadsheet_csv_url url
    @format = :delimited
  end

  @headers = grab settings, :headers
  if headers.is_a?(::Array) and headers.any?(&:blank?)
    raise ::ArgumentError, "[remote_table] If you specify headers, none of them can be blank"
  end

  @quote_char = grab settings, :quote_char
  @compression = grab(settings, :compression) || RemoteTable.guess_compression(url)
  @packing = grab(settings, :packing) || RemoteTable.guess_packing(url)
  @streaming = grab settings, :streaming
  @warn_on_multiple_downloads = grab settings, :warn_on_multiple_downloads
  @delimiter = grab settings, :delimiter
  @sheet = grab settings, :sheet
  @keep_blank_rows = grab settings, :keep_blank_rows
  @form_data = grab settings, :form_data
  @skip = grab settings, :skip
  @encoding = grab settings, :encoding
  @row_xpath = grab settings, :row_xpath
  @column_xpath = grab settings, :column_xpath
  @row_css = grab settings, :row_css
  @column_css = grab settings, :column_css
  @glob = grab settings, :glob
  @filename = grab settings, :filename
  @cut = grab settings, :cut
  @crop = grab settings, :crop
  @schema = grab settings, :schema
  @schema_name = grab settings, :schema_name
  @pre_select = grab settings, :pre_select
  @pre_reject = grab settings, :pre_reject
  @errata = grab settings, :errata
  @root_node = grab settings, :root_node
  @parser = grab settings, :parser

  @other_options = settings

  @local_copy = LocalCopy.new self
  extend!
end
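The two call forms can be tried out without the gem. The sketch below is only an illustration: `split_args` is a hypothetical helper, `transform_keys(&:to_sym)` stands in for ActiveSupport's `symbolize_keys`, and it assumes that the private `grab` helper removes the key it reads from the settings hash.

```ruby
# Hypothetical helper, not part of RemoteTable: imitates how #initialize
# separates the URL from the settings hash.
def split_args(*args)
  # A trailing Hash is treated as settings; string keys become symbols.
  settings = args.last.is_a?(Hash) ? args.last.transform_keys(&:to_sym) : {}
  # A leading String is the URL; otherwise it is taken from settings[:url].
  url = args.first.is_a?(String) ? args.first : settings.delete(:url)
  [url, settings]
end

p split_args('http://example.com/data.csv', { 'skip' => 1 })
p split_args({ url: 'http://example.com/data.csv', headers: false })
```

Either form ends up with the same pair of values: a URL string and a hash of the remaining options.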
Instance Attribute Details
#column_css ⇒ String (readonly)
The CSS selector used to find columns in HTML or XML.
# File 'lib/remote_table.rb', line 252
def column_css
  @column_css
end
#column_xpath ⇒ String (readonly)
The XPath used to find columns in HTML or XML.
# File 'lib/remote_table.rb', line 244
def column_xpath
  @column_xpath
end
#compression ⇒ Symbol (readonly)
The compression type. Guessed from URL if not provided. :gz, :zip, :bz2, and :exe (treated as :zip) are supported.
# File 'lib/remote_table.rb', line 260
def compression
  @compression
end
#crop ⇒ Range (readonly)
Use a range of rows in a plaintext file.
# File 'lib/remote_table.rb', line 301
def crop
  @crop
end
#cut ⇒ String (readonly)
Pick specific columns out of a plaintext file using an argument to the UNIX [cut utility](en.wikipedia.org/wiki/Cut_%28Unix%29).
# File 'lib/remote_table.rb', line 290
def cut
  @cut
end
#delimiter ⇒ String (readonly)
The delimiter, a.k.a. column separator. Passed to Ruby CSV as :col_sep. Default is ‘,’.
# File 'lib/remote_table.rb', line 236
def delimiter
  @delimiter
end
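Because the delimiter is handed to Ruby's CSV library as :col_sep, its effect can be previewed with CSV directly. A minimal sketch on a made-up tab-separated string; it does not involve RemoteTable itself:

```ruby
require 'csv'

# Made-up tab-separated input; col_sep is the CSV option the delimiter maps to.
tsv = "name\tyear\nAda\t1843\n"

# headers: true uses the first line as column names.
rows = CSV.parse(tsv, col_sep: "\t", headers: true).map(&:to_h)
p rows
```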
#encoding ⇒ String (readonly)
The original encoding of the source file. Default is UTF-8.
# File 'lib/remote_table.rb', line 232
def encoding
  @encoding
end
#errata ⇒ Hash (readonly)
An object that responds to #rejects?(row) and #correct!(row). Applied after creating row_hash.
- #rejects?(row) - if row should be treated like it doesn’t exist
- #correct!(row) - destructively update a row to fix something
See the Errata library at github.com/seamusabshere/errata for an example implementation.
# File 'lib/remote_table.rb', line 338
def errata
  @errata
end
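An errata object only needs the two methods described above. A minimal sketch, assuming rows are hashes with a made-up 'name' column; the class name BlankNameErrata is hypothetical, and the small loop at the end only imitates the reject-then-correct order, it is not the library's code:

```ruby
# Hypothetical errata object: any object with these two methods would do.
class BlankNameErrata
  # Rows without a name are treated as if they did not exist.
  def rejects?(row)
    row['name'].to_s.strip.empty?
  end

  # Destructively tidy the rows that survive.
  def correct!(row)
    row['name'] = row['name'].strip
  end
end

errata = BlankNameErrata.new
rows = [{ 'name' => ' Ada ' }, { 'name' => '' }]

# Reject first, then correct what is left.
kept = rows.reject { |row| errata.rejects?(row) }
kept.each { |row| errata.correct!(row) }
p kept
```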
#filename ⇒ String (readonly)
The filename, which can be used to pick a file out of an archive.
# File 'lib/remote_table.rb', line 280
def filename
  @filename
end
#form_data ⇒ String (readonly)
Form data to POST in the download request. It should probably be in application/x-www-form-urlencoded.
# File 'lib/remote_table.rb', line 224
def form_data
  @form_data
end
#format ⇒ Hash (readonly)
The format of the source file. Can be specified as :xlsx, :xls, :delimited (aka :csv), :ods, :fixed_width, :html, :xml, :yaml, or :json.
Note: treats all docs.google.com and spreadsheets.google.com URLs as :delimited.
Default: guessed from file extension (which is usually the same as the URL, but sometimes not if you pick out a specific file from an archive)
# File 'lib/remote_table.rb', line 256
def format
  @format
end
#glob ⇒ String (readonly)
The glob used to pick a file out of an archive.
# File 'lib/remote_table.rb', line 272
def glob
  @glob
end
#headers ⇒ :first_row, ... (readonly)
Headers specified by the user: :first_row (the default), false, or a list of headers.
# File 'lib/remote_table.rb', line 205
def headers
  @headers
end
#keep_blank_rows ⇒ true, false (readonly)
Whether to keep blank rows. Default is false.
# File 'lib/remote_table.rb', line 220
def keep_blank_rows
  @keep_blank_rows
end
#other_options ⇒ Hash (readonly)
Options passed by the user that may be passed through to the underlying parsing library.
# File 'lib/remote_table.rb', line 372
def other_options
  @other_options
end
#packing ⇒ Symbol (readonly)
The packing type. Guessed from URL if not provided. Only :tar is supported.
# File 'lib/remote_table.rb', line 264
def packing
  @packing
end
#pre_reject ⇒ Proc (readonly)
A proc that decides whether to exclude a row: rows for which it returns a truthy value are skipped. Previously passed as :reject.
# File 'lib/remote_table.rb', line 328
def pre_reject
  @pre_reject
end
#pre_select ⇒ Proc (readonly)
A proc that decides whether to include a row: rows are kept only if it returns a truthy value. Previously passed as :select.
# File 'lib/remote_table.rb', line 324
def pre_select
  @pre_select
end
#quote_char ⇒ String (readonly)
Quote character for delimited files.
Defaults to double quotes.
# File 'lib/remote_table.rb', line 212
def quote_char
  @quote_char
end
#root_node ⇒ String (readonly)
The root node of the json document. Specified as a string.
Default: nil; no root node.
# File 'lib/remote_table.rb', line 354
def root_node
  @root_node
end
#row_css ⇒ String (readonly)
The CSS selector used to find rows in HTML or XML.
# File 'lib/remote_table.rb', line 248
def row_css
  @row_css
end
#row_xpath ⇒ String (readonly)
The XPath used to find rows in HTML or XML.
# File 'lib/remote_table.rb', line 240
def row_xpath
  @row_xpath
end
#schema ⇒ Array<Array{String,Integer,Hash}> (readonly)
The fixed-width schema, given as a multi-dimensional array.
# File 'lib/remote_table.rb', line 316
def schema
  @schema
end
#schema_name ⇒ String, Symbol (readonly)
If you somehow already defined a fixed-width schema (so you can re-use it?), specify it here.
# File 'lib/remote_table.rb', line 320
def schema_name
  @schema_name
end
#sheet ⇒ Object (readonly)
The sheet specified by the user as a number or a string.
# File 'lib/remote_table.rb', line 216
def sheet
  @sheet
end
#skip ⇒ Integer (readonly)
How many rows to skip at the beginning of the file or table. Default is 0.
# File 'lib/remote_table.rb', line 228
def skip
  @skip
end
#streaming ⇒ true, false (readonly)
Whether to stream the rows without caching them. Saves memory, but you have to re-download the file every time you enumerate its rows. Defaults to false.
# File 'lib/remote_table.rb', line 197
def streaming
  @streaming
end
#url ⇒ String (readonly)
The URL of the local or remote file.
# File 'lib/remote_table.rb', line 178
def url
  @url
end
#warn_on_multiple_downloads ⇒ true, false (readonly)
Whether to warn the user on multiple downloads. Defaults to true.
# File 'lib/remote_table.rb', line 201
def warn_on_multiple_downloads
  @warn_on_multiple_downloads
end
Class Method Details
.google_spreadsheet_csv_url(url) ⇒ String
Given a Google Docs spreadsheet URL, make sure it uses CSV output.
# File 'lib/remote_table.rb', line 102
def google_spreadsheet_csv_url(url)
  uri = ::URI.parse url
  params = uri.query.split('&')
  params.delete_if { |param| param.start_with?('output=') }
  params << 'output=csv'
  uri.query = params.join('&')
  uri.to_s
end
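To see the effect of the query rewrite, the same logic can be run on its own with only the standard library. The method is copied into a top-level definition purely for illustration, and the spreadsheet key in the URL is made up.

```ruby
require 'uri'

# Same steps as the class method, as a standalone top-level method.
def google_spreadsheet_csv_url(url)
  uri = URI.parse(url)
  params = uri.query.split('&')
  # Drop any existing output= parameter, then force CSV output.
  params.delete_if { |param| param.start_with?('output=') }
  params << 'output=csv'
  uri.query = params.join('&')
  uri.to_s
end

puts google_spreadsheet_csv_url('https://spreadsheets.google.com/pub?key=abc123&output=html')
# prints https://spreadsheets.google.com/pub?key=abc123&output=csv
```

Note that a URL with no query string leaves `uri.query` as nil, so the `split` call would raise in that case.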
.guess_compression(url) ⇒ Symbol?
Guess compression based on URL. Used internally.
# File 'lib/remote_table.rb', line 51
def guess_compression(url)
  extname = extname(url).downcase
  case extname
  when /gz/, /gunzip/
    :gz
  when /zip/
    :zip
  when /bz2/, /bunzip2/
    :bz2
  when /exe/
    :exe
  end
end
.guess_format(basename) ⇒ Symbol?
Guess file format from the basename. Since a file might be decompressed and/or pulled out of an archive with a glob, this usually can’t be called until a file is downloaded.
# File 'lib/remote_table.rb', line 76
def guess_format(basename)
  case basename.to_s.downcase.strip
  when /ods\z/, /open_?office\z/
    :ods
  when /xlsx\z/, /excelx\z/
    :xlsx
  when /xls\z/, /excel\z/
    :xls
  when /csv\z/, /tsv\z/, /delimited\z/
    # note that there is no RemoteTable::Csv class - it's normalized to :delimited
    :delimited
  when /fixed_?width\z/
    :fixed_width
  when /html?\z/
    :html
  when /xml\z/
    :xml
  when /yaml\z/, /yml\z/
    :yaml
  when /json\z/
    :json
  end
end
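The matching rules are easy to check in isolation. The sketch below copies them into a top-level method for illustration only; the file names are made up.

```ruby
# Copy of the matching rules, as a standalone top-level method.
def guess_format(basename)
  case basename.to_s.downcase.strip
  when /ods\z/, /open_?office\z/ then :ods
  when /xlsx\z/, /excelx\z/ then :xlsx
  when /xls\z/, /excel\z/ then :xls
  when /csv\z/, /tsv\z/, /delimited\z/ then :delimited
  when /fixed_?width\z/ then :fixed_width
  when /html?\z/ then :html
  when /xml\z/ then :xml
  when /yaml\z/, /yml\z/ then :yaml
  when /json\z/ then :json
  end
end

p guess_format('report.XLSX')  # => :xlsx
p guess_format(:csv)           # => :delimited
p guess_format('notes.txt')    # => nil
```

Because the argument goes through `to_s`, a symbol such as :csv is accepted as well, which is how the constructor normalizes an explicit :format setting to :delimited.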
.guess_packing(url) ⇒ Symbol?
Guess packing from URL. Used internally.
# File 'lib/remote_table.rb', line 67
def guess_packing(url)
  basename = basename(url).downcase
  if basename.include?('.tar') or basename.include?('.tgz')
    :tar
  end
end
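Compression and packing are guessed independently from the same URL, so an archive such as a .tar.gz file yields both. The sketch below is an approximation: the private `extname` and `basename` helpers are not shown in this page, so it assumes they behave like File.extname and File.basename applied to the path of the URL. The URLs are made up.

```ruby
require 'uri'

# Assumption: the private extname helper acts like File.extname on the URL path.
def guess_compression(url)
  case File.extname(URI.parse(url).path).downcase
  when /gz/, /gunzip/ then :gz
  when /zip/ then :zip
  when /bz2/, /bunzip2/ then :bz2
  when /exe/ then :exe
  end
end

# Assumption: the private basename helper acts like File.basename on the URL path.
def guess_packing(url)
  basename = File.basename(URI.parse(url).path).downcase
  :tar if basename.include?('.tar') || basename.include?('.tgz')
end

url = 'http://example.com/archive.tar.gz'
p [guess_compression(url), guess_packing(url)]  # => [:gz, :tar]
```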
.normalize_whitespace(v) ⇒ Object
# File 'lib/remote_table.rb', line 111
def normalize_whitespace(v)
  v = v.to_s.dup
  v.gsub! WHITESPACE, SINGLE_SPACE
  v.strip!
  v
end
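A quick standalone check of what this does to a value. The method is copied to the top level for illustration, with the WHITESPACE and SINGLE_SPACE constants inlined as literals.

```ruby
# Standalone copy; /\s+/ and ' ' are the values of WHITESPACE and SINGLE_SPACE.
def normalize_whitespace(v)
  v = v.to_s.dup        # dup so the caller's string is never mutated
  v.gsub!(/\s+/, ' ')   # collapse every run of whitespace into one space
  v.strip!              # remove leading and trailing whitespace
  v
end

p normalize_whitespace("  New\tYork \n City ")  # => "New York City"
p normalize_whitespace(nil)                      # => ""
```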
.transpose(url, key_key, value_key, options = {}) ⇒ Object
Transpose two columns into a mapping from one to the other.
# File 'lib/remote_table.rb', line 42
def transpose(url, key_key, value_key, options = {})
  new(url, options).inject({}) do |memo, row|
    memo[row[key_key]] = row[value_key]
    memo
  end
end
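The fold itself can be illustrated on plain data. In this sketch an array of hashes stands in for the rows a RemoteTable would yield; the column names 'code' and 'name' are made up.

```ruby
# Stand-in for the rows a table would yield.
rows = [
  { 'code' => 'US', 'name' => 'United States' },
  { 'code' => 'FR', 'name' => 'France' }
]

# Same fold as in .transpose, with 'code' as key_key and 'name' as value_key.
mapping = rows.inject({}) do |memo, row|
  memo[row['code']] = row['name']
  memo
end

p mapping
```

If two rows share the same key, the later row overwrites the earlier one.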
Instance Method Details
#[](row_number) ⇒ Hash, Array
Get a row by row number. Zero-based.
# File 'lib/remote_table.rb', line 508
def [](row_number)
  if fully_cached?
    cache[row_number]
  else
    to_a[row_number]
  end
end
#each {|Hash, Array| ... } ⇒ nil Also known as: each_row
Yield each row.
# File 'lib/remote_table.rb', line 459
def each
  if fully_cached?
    cache.each do |row|
      yield row
    end
  else
    mark_download!
    preprocess!
    memo = _each do |row|
      parser.call(row).each do |virtual_row|
        virtual_row.row_hash = ::HashDigest.digest3 row
        if errata
          next if errata.rejects? virtual_row
          errata.correct! virtual_row
        end
        next if pre_select and !pre_select.call(virtual_row)
        next if pre_reject and pre_reject.call(virtual_row)
        unless streaming
          cache.push virtual_row
        end
        yield virtual_row
      end
    end
    unless streaming
      fully_cached!
    end
    memo
  end
  nil
end
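The difference between cached and streaming enumeration can be imitated in a few lines. This is a hypothetical stand-in, not the library's code: CachedSource is an invented class, and a counter takes the place of the real download-and-parse step.

```ruby
# Hypothetical stand-in for the caching behaviour of #each.
class CachedSource
  attr_reader :download_count

  def initialize(rows, streaming: false)
    @rows = rows
    @streaming = streaming
    @cache = []
    @fully_cached = false
    @download_count = 0
  end

  def each
    if @fully_cached
      # Later passes are served from memory.
      @cache.each { |row| yield row }
    else
      @download_count += 1 # stands in for downloading and parsing the source
      @rows.each do |row|
        @cache.push(row) unless @streaming
        yield row
      end
      @fully_cached = true unless @streaming
    end
    nil
  end
end

cached = CachedSource.new([1, 2, 3])
2.times { cached.each { |row| row } }

streamed = CachedSource.new([1, 2, 3], streaming: true)
2.times { streamed.each { |row| row } }

p [cached.download_count, streamed.download_count]  # => [1, 2]
```

With caching, the second pass costs no download; with streaming, nothing is kept in memory and each pass fetches the source again.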
#free ⇒ nil
Clear the row cache in case it helps your GC.
# File 'lib/remote_table.rb', line 519
def free
  @fully_cached = false
  cache.clear
  nil
end
#parser ⇒ #call
An object that responds to #call(row) and returns an array of one or more rows.
# File 'lib/remote_table.rb', line 366
def parser
  @final_parser ||= (@parser || NullParser.new)
end
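Anything that responds to #call(row) and returns an array of rows fits this contract, so a lambda is enough to show the shape. The sketch below is hypothetical: the 'name' and 'years' columns are made up, and it uses plain hashes even though #each also assigns a row_hash to every returned row, which rows used with the real library would need to support.

```ruby
# Hypothetical parser: turns one source row into one row per listed year.
splitter = lambda do |row|
  row['years'].split(',').map do |year|
    { 'name' => row['name'], 'year' => year.strip }
  end
end

virtual_rows = splitter.call({ 'name' => 'Ada', 'years' => '1840, 1843' })
p virtual_rows
```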