Class: RemoteTable
- Inherits:
- Object
- Includes:
- Enumerable
- Defined in:
- lib/remote_table.rb,
lib/remote_table/ods.rb,
lib/remote_table/xls.rb,
lib/remote_table/xml.rb,
lib/remote_table/html.rb,
lib/remote_table/xlsx.rb,
lib/remote_table/yaml.rb,
lib/remote_table/version.rb,
lib/remote_table/delimited.rb,
lib/remote_table/plaintext.rb,
lib/remote_table/local_copy.rb,
lib/remote_table/fixed_width.rb,
lib/remote_table/processed_by_roo.rb,
lib/remote_table/processed_by_nokogiri.rb
Overview
Opens Google Docs spreadsheets and local or remote XLSX, XLS, ODS, CSV (comma-separated), TSV (tab-separated), other delimited, and fixed-width files.
Defined Under Namespace
Modules: Delimited, FixedWidth, Html, Ods, Plaintext, ProcessedByNokogiri, ProcessedByRoo, Xls, Xlsx, Xml, Yaml
Constant Summary
- WHITESPACE = /\s+/
- SINGLE_SPACE = ' '
- EXTERNAL_ENCODING = 'UTF-8'
- EXTERNAL_ENCODING_ICONV = 'UTF-8//TRANSLIT'
- GOOGLE_DOCS_SPREADSHEET = [ /docs.google.com/i, /spreadsheets.google.com/i ]
- VALID = { :compression => [:gz, :zip, :bz2, :exe], :packing => [:tar], :format => [:xlsx, :xls, :delimited, :ods, :fixed_width, :html, :xml, :yaml, :csv], }
- DEFAULT = { :streaming => false, :warn_on_multiple_downloads => true, :headers => :first_row, :keep_blank_rows => false, :skip => 0, :encoding => 'UTF-8', }
- OLD_SETTING_NAMES = { :pre_select => [:select], :pre_reject => [:reject], :delimiter => [:col_sep], }
- VERSION = '3.2.0'
Instance Attribute Summary
-
#column_css ⇒ String
readonly
The CSS selector used to find columns in HTML or XML.
-
#column_xpath ⇒ String
readonly
The XPath used to find columns in HTML or XML.
-
#compression ⇒ Symbol
readonly
The compression type.
-
#crop ⇒ Range
readonly
Use a range of rows in a plaintext file.
-
#cut ⇒ String
readonly
Pick specific columns out of a plaintext file using an argument to the UNIX [cut utility](en.wikipedia.org/wiki/Cut_%28Unix%29).
-
#delimiter ⇒ String
readonly
The delimiter, a.k.a. column separator.
-
#encoding ⇒ String
readonly
The original encoding of the source file.
-
#errata ⇒ Hash
readonly
An object that responds to #rejects?(row) and #correct!(row).
-
#filename ⇒ String
readonly
The filename, which can be used to pick a file out of an archive.
-
#form_data ⇒ String
readonly
Form data to POST in the download request.
-
#format ⇒ Symbol
readonly
The format of the source file.
-
#glob ⇒ String
readonly
The glob used to pick a file out of an archive.
-
#headers ⇒ :first_row, ...
readonly
Headers specified by the user: :first_row (the default), false, or a list of headers.
-
#keep_blank_rows ⇒ true, false
readonly
Whether to keep blank rows.
-
#other_options ⇒ Hash
readonly
Options passed by the user that may be passed through to the underlying parsing library.
-
#packing ⇒ Symbol
readonly
The packing type.
-
#pre_reject ⇒ Proc
readonly
A proc that decides whether to exclude a row.
-
#pre_select ⇒ Proc
readonly
A proc that decides whether to include a row.
-
#quote_char ⇒ String
readonly
Quote character for delimited files.
-
#row_css ⇒ String
readonly
The CSS selector used to find rows in HTML or XML.
-
#row_xpath ⇒ String
readonly
The XPath used to find rows in HTML or XML.
-
#schema ⇒ Array<Array{String,Integer,Hash}>
readonly
The fixed-width schema, given as a multi-dimensional array.
-
#schema_name ⇒ String, Symbol
readonly
The name of a fixed-width schema that has already been defined, so that it can be reused.
-
#sheet ⇒ Object
readonly
The sheet specified by the user as a number or a string.
-
#skip ⇒ Integer
readonly
How many rows to skip at the beginning of the file or table.
-
#streaming ⇒ true, false
readonly
Whether to stream the rows without caching them.
-
#url ⇒ String
readonly
The URL of the local or remote file.
-
#warn_on_multiple_downloads ⇒ true, false
readonly
Whether to warn the user on multiple downloads.
Class Method Summary
-
.google_spreadsheet_csv_url(url) ⇒ String
Given a Google Docs spreadsheet URL, make sure it uses CSV output.
-
.guess_compression(url) ⇒ Symbol?
Guess compression based on URL.
-
.guess_format(basename) ⇒ Symbol?
Guess file format from the basename.
-
.guess_packing(url) ⇒ Symbol?
Guess packing from URL.
- .normalize_whitespace(v) ⇒ Object
-
.transpose(url, key_key, value_key, options = {}) ⇒ Object
Transpose two columns into a mapping from one to the other.
Instance Method Summary
-
#[](row_number) ⇒ Hash, Array
Get a row by row number.
-
#each {|Hash, Array| ... } ⇒ nil
(also: #each_row)
Yield each row.
-
#free ⇒ nil
Clear the row cache in case it helps your GC.
-
#initialize(*args) ⇒ RemoteTable
constructor
Create a new RemoteTable, which is an Enumerable.
-
#parser ⇒ #call
An object that responds to #call(row) and returns an array of one or more rows.
-
#to_a ⇒ Array<Hash,Array>
(also: #rows)
All rows.
Constructor Details
#initialize(settings) ⇒ RemoteTable
#initialize(url, settings) ⇒ RemoteTable
Create a new RemoteTable, which is an Enumerable.
Options are set at creation using any of the attributes listed under Instance Attribute Details. The documentation marks them as read-only because they cannot be set or changed after creation.
The file is not downloaded or parsed immediately; loading is lazy.
    # File 'lib/remote_table.rb', line 390
    def initialize(*args)
      @download_count_mutex = ::Mutex.new
      @extend_bang_mutex = ::Mutex.new
      @cache = []
      @download_count = 0

      settings = args.last.is_a?(::Hash) ? args.last.symbolize_keys : {}

      @url = if args.first.is_a? ::String
        args.first
      else
        grab settings, :url
      end
      @format = RemoteTable.guess_format grab(settings, :format)
      if GOOGLE_DOCS_SPREADSHEET.any? { |regex| regex =~ url }
        @url = RemoteTable.google_spreadsheet_csv_url url
        @format = :delimited
      end

      @headers = grab settings, :headers
      if headers.is_a?(::Array) and headers.any?(&:blank?)
        raise ::ArgumentError, "[remote_table] If you specify headers, none of them can be blank"
      end

      @quote_char = grab settings, :quote_char
      @compression = grab(settings, :compression) || RemoteTable.guess_compression(url)
      @packing = grab(settings, :packing) || RemoteTable.guess_packing(url)
      @streaming = grab settings, :streaming
      @warn_on_multiple_downloads = grab settings, :warn_on_multiple_downloads
      @delimiter = grab settings, :delimiter
      @sheet = grab settings, :sheet
      @keep_blank_rows = grab settings, :keep_blank_rows
      @form_data = grab settings, :form_data
      @skip = grab settings, :skip
      @encoding = grab settings, :encoding
      @row_xpath = grab settings, :row_xpath
      @column_xpath = grab settings, :column_xpath
      @row_css = grab settings, :row_css
      @column_css = grab settings, :column_css
      @glob = grab settings, :glob
      @filename = grab settings, :filename
      @cut = grab settings, :cut
      @crop = grab settings, :crop
      @schema = grab settings, :schema
      @schema_name = grab settings, :schema_name
      @pre_select = grab settings, :pre_select
      @pre_reject = grab settings, :pre_reject
      @errata = grab settings, :errata
      @parser = grab settings, :parser

      @other_options = settings

      @local_copy = LocalCopy.new self
      extend!
    end
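The constructor accepts either a settings hash alone or a URL followed by settings, and pulls each known setting out of the hash before storing the remainder as other_options. The sketch below imitates that flow without the gem. The source of the private grab helper is not shown on this page, so the version here is an assumption: it is guessed to check the current setting name, then any old names from OLD_SETTING_NAMES, then DEFAULT, and to remove the key it finds. transform_keys stands in for ActiveSupport's symbolize_keys.

```ruby
# Standalone sketch; does not load the remote_table gem.
DEFAULT = { :streaming => false, :headers => :first_row, :skip => 0, :encoding => 'UTF-8' }
OLD_SETTING_NAMES = { :pre_select => [:select], :pre_reject => [:reject], :delimiter => [:col_sep] }

# Hypothetical stand-in for the private `grab` helper (assumed behavior):
# look up the setting under its current or old names, remove it from the
# hash, and fall back to DEFAULT when it is absent.
def grab(settings, name)
  keys = [name] + OLD_SETTING_NAMES.fetch(name, [])
  found = keys.find { |key| settings.key?(key) }
  found ? settings.delete(found) : DEFAULT[name]
end

# Mirrors the first lines of the constructor: the URL is the first argument
# if it is a String, otherwise it comes from the :url setting.
def split_args(*args)
  settings = args.last.is_a?(Hash) ? args.last.transform_keys(&:to_sym) : {}
  url = args.first.is_a?(String) ? args.first : grab(settings, :url)
  [url, settings]
end

url, settings = split_args('http://example.com/data.csv', 'col_sep' => ';', :foo => 1)
delimiter = grab(settings, :delimiter) # found under the old name :col_sep
skip = grab(settings, :skip)           # not given, so the default applies
other_options = settings               # whatever was not grabbed
```

Under these assumptions, unknown keys such as :foo end up in other_options, which matches the attribute's description as options that may be passed through to the underlying parsing library.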
Instance Attribute Details
#column_css ⇒ String (readonly)
The CSS selector used to find columns in HTML or XML.
    # File 'lib/remote_table.rb', line 254
    def column_css
      @column_css
    end
#column_xpath ⇒ String (readonly)
The XPath used to find columns in HTML or XML.
    # File 'lib/remote_table.rb', line 246
    def column_xpath
      @column_xpath
    end
#compression ⇒ Symbol (readonly)
The compression type. Guessed from URL if not provided. :gz, :zip, :bz2, and :exe (treated as :zip) are supported.

    # File 'lib/remote_table.rb', line 262
    def compression
      @compression
    end
#crop ⇒ Range (readonly)
Use a range of rows in a plaintext file.
    # File 'lib/remote_table.rb', line 303
    def crop
      @crop
    end
#cut ⇒ String (readonly)
Pick specific columns out of a plaintext file using an argument to the UNIX [cut utility](en.wikipedia.org/wiki/Cut_%28Unix%29).

    # File 'lib/remote_table.rb', line 292
    def cut
      @cut
    end
#delimiter ⇒ String (readonly)
The delimiter, a.k.a. column separator. Passed to Ruby CSV as :col_sep. Default is ‘,’.

    # File 'lib/remote_table.rb', line 238
    def delimiter
      @delimiter
    end
#encoding ⇒ String (readonly)
The original encoding of the source file. Default is UTF-8.
    # File 'lib/remote_table.rb', line 234
    def encoding
      @encoding
    end
#errata ⇒ Hash (readonly)
An object that responds to #rejects?(row) and #correct!(row). Applied after creating row_hash.
- #rejects?(row) - if row should be treated like it doesn’t exist
- #correct!(row) - destructively update a row to fix something
See the Errata library at github.com/seamusabshere/errata for an example implementation.

    # File 'lib/remote_table.rb', line 340
    def errata
      @errata
    end
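Because errata is duck-typed, any object with the two methods described above should be usable. The following is a minimal sketch of such an object, exercised on plain hashes rather than through RemoteTable; the class name BlankNameErrata and the column names are made up for illustration.

```ruby
# Hypothetical errata object: it only needs #rejects? and #correct!.
class BlankNameErrata
  # Treat rows with a blank 'name' column as if they did not exist.
  def rejects?(row)
    row['name'].to_s.strip.empty?
  end

  # Destructively normalize one known spelling variant.
  def correct!(row)
    row['country'] = 'United States' if row['country'] == 'USA'
    row
  end
end

errata = BlankNameErrata.new
rows = [
  { 'name' => 'Ann', 'country' => 'USA' },
  { 'name' => '  ',  'country' => 'France' }
]

# Same order as in #each: reject first, then correct the surviving rows.
cleaned = rows.reject { |row| errata.rejects?(row) }
cleaned.each { |row| errata.correct!(row) }
```

With the real class, an object like this would presumably be passed as the :errata setting at construction time.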
#filename ⇒ String (readonly)
The filename, which can be used to pick a file out of an archive.
    # File 'lib/remote_table.rb', line 282
    def filename
      @filename
    end
#form_data ⇒ String (readonly)
Form data to POST in the download request. It should probably be in application/x-www-form-urlencoded.

    # File 'lib/remote_table.rb', line 226
    def form_data
      @form_data
    end
#format ⇒ Symbol (readonly)
The format of the source file. Can be specified as: :xlsx, :xls, :delimited (aka :csv), :ods, :fixed_width, :html, :xml, :yaml
Note: treats all docs.google.com and spreadsheets.google.com URLs as :delimited.
Default: guessed from the file extension (usually the one in the URL, but not always if you pick out a specific file from an archive).

    # File 'lib/remote_table.rb', line 258
    def format
      @format
    end
#glob ⇒ String (readonly)
The glob used to pick a file out of an archive.
    # File 'lib/remote_table.rb', line 274
    def glob
      @glob
    end
#headers ⇒ :first_row, ... (readonly)
Headers specified by the user: :first_row (the default), false, or a list of headers.

    # File 'lib/remote_table.rb', line 207
    def headers
      @headers
    end
#keep_blank_rows ⇒ true, false (readonly)
Whether to keep blank rows. Default is false.
    # File 'lib/remote_table.rb', line 222
    def keep_blank_rows
      @keep_blank_rows
    end
#other_options ⇒ Hash (readonly)
Options passed by the user that may be passed through to the underlying parsing library.
    # File 'lib/remote_table.rb', line 367
    def other_options
      @other_options
    end
#packing ⇒ Symbol (readonly)
The packing type. Guessed from URL if not provided. Only :tar is supported.

    # File 'lib/remote_table.rb', line 266
    def packing
      @packing
    end
#pre_reject ⇒ Proc (readonly)
A proc that decides whether to exclude a row: rows for which it returns true are skipped. Previously passed as :reject.

    # File 'lib/remote_table.rb', line 330
    def pre_reject
      @pre_reject
    end
#pre_select ⇒ Proc (readonly)
A proc that decides whether to include a row. Previously passed as :select.

    # File 'lib/remote_table.rb', line 326
    def pre_select
      @pre_select
    end
#quote_char ⇒ String (readonly)
Quote character for delimited files.
Defaults to double quotes.
    # File 'lib/remote_table.rb', line 214
    def quote_char
      @quote_char
    end
#row_css ⇒ String (readonly)
The CSS selector used to find rows in HTML or XML.
    # File 'lib/remote_table.rb', line 250
    def row_css
      @row_css
    end
#row_xpath ⇒ String (readonly)
The XPath used to find rows in HTML or XML.
    # File 'lib/remote_table.rb', line 242
    def row_xpath
      @row_xpath
    end
#schema ⇒ Array<Array{String,Integer,Hash}> (readonly)
The fixed-width schema, given as a multi-dimensional array.
    # File 'lib/remote_table.rb', line 318
    def schema
      @schema
    end
#schema_name ⇒ String, Symbol (readonly)
The name of a fixed-width schema that has already been defined, so that it can be reused.

    # File 'lib/remote_table.rb', line 322
    def schema_name
      @schema_name
    end
#sheet ⇒ Object (readonly)
The sheet specified by the user as a number or a string.

    # File 'lib/remote_table.rb', line 218
    def sheet
      @sheet
    end
#skip ⇒ Integer (readonly)
How many rows to skip at the beginning of the file or table. Default is 0.
    # File 'lib/remote_table.rb', line 230
    def skip
      @skip
    end
#streaming ⇒ true, false (readonly)
Whether to stream the rows without caching them. Saves memory, but you have to re-download the file every time you enumerate its rows. Defaults to false.
    # File 'lib/remote_table.rb', line 199
    def streaming
      @streaming
    end
#url ⇒ String (readonly)
The URL of the local or remote file.
    # File 'lib/remote_table.rb', line 180
    def url
      @url
    end
#warn_on_multiple_downloads ⇒ true, false (readonly)
Whether to warn the user on multiple downloads. Defaults to true.
    # File 'lib/remote_table.rb', line 203
    def warn_on_multiple_downloads
      @warn_on_multiple_downloads
    end
Class Method Details
.google_spreadsheet_csv_url(url) ⇒ String
Given a Google Docs spreadsheet URL, make sure it uses CSV output.
    # File 'lib/remote_table.rb', line 104
    def google_spreadsheet_csv_url(url)
      uri = ::URI.parse url
      params = uri.query.split('&')
      params.delete_if { |param| param.start_with?('output=') }
      params << 'output=csv'
      uri.query = params.join('&')
      uri.to_s
    end
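To see what the rewrite does to a concrete URL, here is a standalone copy of the logic above using only Ruby's URI library. The function name csv_output_url and the example URLs are made up. One deliberate difference: URI#query returns nil when a URL has no query string, so this sketch calls to_s before split, whereas the listed source would raise on such a URL.

```ruby
require 'uri'

# Standalone sketch of the query rewrite; not the gem's own method.
def csv_output_url(url)
  uri = URI.parse(url)
  params = uri.query.to_s.split('&')              # to_s guards against a missing query
  params.delete_if { |param| param.start_with?('output=') }
  params << 'output=csv'                          # force CSV output
  uri.query = params.join('&')
  uri.to_s
end

with_output = csv_output_url('https://docs.google.com/spreadsheet/pub?key=abc&output=html')
without_query = csv_output_url('https://docs.google.com/spreadsheet/pub')
```

Any existing output parameter is dropped and output=csv is appended, while other parameters such as key are kept in their original order.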
.guess_compression(url) ⇒ Symbol?
Guess compression based on URL. Used internally.
    # File 'lib/remote_table.rb', line 55
    def guess_compression(url)
      extname = extname(url).downcase
      case extname
      when /gz/, /gunzip/
        :gz
      when /zip/
        :zip
      when /bz2/, /bunzip2/
        :bz2
      when /exe/
        :exe
      end
    end
.guess_format(basename) ⇒ Symbol?
Guess file format from the basename. Since a file might be decompressed and/or pulled out of an archive with a glob, this usually can’t be called until a file is downloaded.
    # File 'lib/remote_table.rb', line 80
    def guess_format(basename)
      case basename.to_s.downcase.strip
      when /ods\z/, /open_?office\z/
        :ods
      when /xlsx\z/, /excelx\z/
        :xlsx
      when /xls\z/, /excel\z/
        :xls
      when /csv\z/, /tsv\z/, /delimited\z/
        # note that there is no RemoteTable::Csv class - it's normalized to :delimited
        :delimited
      when /fixed_?width\z/
        :fixed_width
      when /html?\z/
        :html
      when /xml\z/
        :xml
      when /yaml\z/, /yml\z/
        :yaml
      end
    end
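The matching rules above can be tried in isolation. This is a standalone copy of the same case expression (condensed with then), applied to a few made-up basenames.

```ruby
# Standalone copy of the suffix matching shown above.
def guess_format(basename)
  case basename.to_s.downcase.strip
  when /ods\z/, /open_?office\z/          then :ods
  when /xlsx\z/, /excelx\z/               then :xlsx
  when /xls\z/, /excel\z/                 then :xls
  when /csv\z/, /tsv\z/, /delimited\z/    then :delimited
  when /fixed_?width\z/                   then :fixed_width
  when /html?\z/                          then :html
  when /xml\z/                            then :xml
  when /yaml\z/, /yml\z/                  then :yaml
  end
end

guesses = ['data.xlsx', 'data.xls', 'Report.TSV ', 'page.htm', :csv, 'notes.txt'].map do |name|
  guess_format(name)
end
```

Because the patterns are anchored only at the end of the string and the argument goes through to_s, a bare format name such as :csv is normalized the same way as a filename. That is consistent with the constructor passing the user's :format setting through guess_format. Unrecognized suffixes yield nil.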
.guess_packing(url) ⇒ Symbol?
Guess packing from URL. Used internally.
    # File 'lib/remote_table.rb', line 71
    def guess_packing(url)
      basename = basename(url).downcase
      if basename.include?('.tar') or basename.include?('.tgz')
        :tar
      end
    end
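Compression and packing are guessed independently, so a single URL can yield both. The sketch below reproduces the two guesses outside the gem. The private extname and basename helpers are not shown on this page, so it assumes they operate on the path component of the URL; url_path is a made-up helper name.

```ruby
require 'uri'

# Assumption: the gem's private extname/basename helpers look at the URL path.
def url_path(url)
  URI.parse(url).path
end

def guess_compression(url)
  case File.extname(url_path(url)).downcase
  when /gz/, /gunzip/   then :gz
  when /zip/            then :zip
  when /bz2/, /bunzip2/ then :bz2
  when /exe/            then :exe
  end
end

def guess_packing(url)
  basename = File.basename(url_path(url)).downcase
  :tar if basename.include?('.tar') || basename.include?('.tgz')
end

urls = %w[
  http://example.com/archive.tar.gz
  http://example.com/data.tgz
  http://example.com/data.zip
  http://example.com/data.csv
]
guessed = urls.map { |url| [guess_compression(url), guess_packing(url)] }
```

Note that a .tgz name matches the /gz/ pattern as well as the packing check, so it is treated as a gzipped tar archive.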
.normalize_whitespace(v) ⇒ Object
    # File 'lib/remote_table.rb', line 113
    def normalize_whitespace(v)
      v = v.to_s.dup
      v.gsub! WHITESPACE, SINGLE_SPACE
      v.strip!
      v
    end
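This method has no description, so a quick standalone run may help: it collapses every run of whitespace into a single space and trims the ends. The copy below uses the WHITESPACE and SINGLE_SPACE values from the constant summary.

```ruby
WHITESPACE = /\s+/
SINGLE_SPACE = ' '

# Standalone copy of the method above.
def normalize_whitespace(v)
  v = v.to_s.dup                      # work on a copy; nil becomes ''
  v.gsub! WHITESPACE, SINGLE_SPACE    # collapse runs of spaces, tabs, newlines
  v.strip!                            # trim leading and trailing space
  v
end

original = "  foo \n\t bar  "
normalized = normalize_whitespace(original)
from_nil = normalize_whitespace(nil)
```

The dup means the caller's string is left untouched, and returning v explicitly matters because strip! returns nil when there is nothing to strip.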
.transpose(url, key_key, value_key, options = {}) ⇒ Object
Transpose two columns into a mapping from one to the other.
    # File 'lib/remote_table.rb', line 46
    def transpose(url, key_key, value_key, options = {})
      new(url, options).inject({}) do |memo, row|
        memo[row[key_key]] = row[value_key]
        memo
      end
    end
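The reduction inside transpose works on any enumerable of row hashes. The sketch below applies the same inject to an in-memory array instead of a downloaded table; transpose_rows and the sample data are made up for illustration.

```ruby
# Standalone sketch: same inject as above, over an array of row hashes.
def transpose_rows(rows, key_key, value_key)
  rows.inject({}) do |memo, row|
    memo[row[key_key]] = row[value_key]
    memo
  end
end

rows = [
  { 'code' => 'FR', 'name' => 'France',  'pop' => 1 },
  { 'code' => 'DE', 'name' => 'Germany', 'pop' => 2 },
  { 'code' => 'FR', 'name' => 'FRANCE',  'pop' => 3 }
]
mapping = transpose_rows(rows, 'code', 'name')
```

When a key occurs more than once, the value from the last row wins, since each assignment overwrites the previous one.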
Instance Method Details
#[](row_number) ⇒ Hash, Array
Get a row by row number. Zero-based.
    # File 'lib/remote_table.rb', line 502
    def [](row_number)
      if fully_cached?
        cache[row_number]
      else
        to_a[row_number]
      end
    end
#each {|Hash, Array| ... } ⇒ nil Also known as: each_row
Yield each row.
    # File 'lib/remote_table.rb', line 453
    def each
      if fully_cached?
        cache.each do |row|
          yield row
        end
      else
        mark_download!
        preprocess!
        memo = _each do |row|
          parser.call(row).each do |virtual_row|
            virtual_row.row_hash = ::HashDigest.digest2 row
            if errata
              next if errata.rejects? virtual_row
              errata.correct! virtual_row
            end
            next if pre_select and !pre_select.call(virtual_row)
            next if pre_reject and pre_reject.call(virtual_row)
            unless streaming
              cache.push virtual_row
            end
            yield virtual_row
          end
        end
        unless streaming
          fully_cached!
        end
        memo
      end
      nil
    end
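The order of the per-row checks in the listing is errata first, then pre_select, then pre_reject. A standalone sketch of just that filtering step, with made-up rows and procs, shows how the two procs combine; filter_rows is a hypothetical name and the caching and parsing steps are left out.

```ruby
# Standalone sketch of the per-row filtering order used in #each.
def filter_rows(rows, errata: nil, pre_select: nil, pre_reject: nil)
  kept = []
  rows.each do |row|
    if errata
      next if errata.rejects?(row)   # errata can drop the row...
      errata.correct!(row)           # ...or fix it in place
    end
    next if pre_select && !pre_select.call(row)  # keep only selected rows
    next if pre_reject && pre_reject.call(row)   # then drop rejected rows
    kept << row
  end
  kept
end

rows = [{ 'n' => 1 }, { 'n' => 2 }, { 'n' => 3 }, { 'n' => 4 }]
kept = filter_rows(
  rows,
  pre_select: ->(row) { row['n'] > 1 },
  pre_reject: ->(row) { row['n'].even? }
)
```

A row survives only if pre_select returns a truthy value and pre_reject returns a falsy one, so the two settings can be used together.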
#free ⇒ nil
Clear the row cache in case it helps your GC.
    # File 'lib/remote_table.rb', line 513
    def free
      @fully_cached = false
      cache.clear
      nil
    end
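The interaction between caching, streaming, #[] and #free can be modeled with a small class. This is a simplified sketch rather than the gem's implementation: CachedRows is a made-up name, the source argument is a hypothetical callable standing in for the download, and download_count is exposed only so the effect is visible.

```ruby
# Simplified model of the row cache; not the gem's implementation.
class CachedRows
  include Enumerable
  attr_reader :download_count

  def initialize(source, streaming: false)
    @source = source          # hypothetical: any callable returning an array of rows
    @streaming = streaming
    @cache = []
    @fully_cached = false
    @download_count = 0
  end

  def each
    if @fully_cached
      @cache.each { |row| yield row }
    else
      @download_count += 1    # stands in for mark_download!
      @source.call.each do |row|
        @cache.push(row) unless @streaming
        yield row
      end
      @fully_cached = true unless @streaming
    end
    nil
  end

  def [](row_number)
    @fully_cached ? @cache[row_number] : to_a[row_number]
  end

  def free
    @fully_cached = false
    @cache.clear
    nil
  end
end

source = -> { [%w[a 1], %w[b 2]] }

cached = CachedRows.new(source)
2.times { cached.to_a }
count_cached = cached.download_count      # second pass is served from the cache
cached.free
first_row = cached[0]                     # cache is empty again, so this re-reads
count_after_free = cached.download_count

streamed = CachedRows.new(source, streaming: true)
2.times { streamed.to_a }
count_streamed = streamed.download_count  # every pass re-reads the source
```

In this model, freeing the cache trades memory for another read of the source on the next access, which is the same trade-off the streaming setting makes on every enumeration.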
#parser ⇒ #call
An object that responds to #call(row) and returns an array of one or more rows.
    # File 'lib/remote_table.rb', line 361
    def parser
      @final_parser ||= (@parser || NullParser.new)
    end
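Since a parser only has to respond to #call(row) and return an array of rows, a lambda is enough. The sketch below is a hypothetical parser that turns one wide row into one row per year column; the column names are made up, and it is called directly here rather than through RemoteTable.

```ruby
# Hypothetical parser: one wide input row becomes several "virtual" rows.
year_splitter = lambda do |row|
  %w[2010 2011].map do |year|
    { 'country' => row['country'], 'year' => year, 'value' => row[year] }
  end
end

wide_row = { 'country' => 'FR', '2010' => 10, '2011' => 12 }
virtual_rows = year_splitter.call(wide_row)
```

In #each, every element of the returned array is then filtered and yielded as its own row. The NullParser default is not shown on this page; presumably it passes each row through unchanged.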