Class: RemoteTable
- Inherits:
- Object (hierarchy: Object, RemoteTable)
- Includes:
- Enumerable
- Defined in:
- lib/remote_table.rb,
lib/remote_table/ods.rb,
lib/remote_table/xls.rb,
lib/remote_table/xml.rb,
lib/remote_table/html.rb,
lib/remote_table/json.rb,
lib/remote_table/xlsx.rb,
lib/remote_table/yaml.rb,
lib/remote_table/version.rb,
lib/remote_table/delimited.rb,
lib/remote_table/plaintext.rb,
lib/remote_table/local_copy.rb,
lib/remote_table/fixed_width.rb,
lib/remote_table/processed_by_roo.rb,
lib/remote_table/processed_by_nokogiri.rb
Overview
Opens Google Docs spreadsheets and local or remote XLSX, XLS, ODS, CSV (comma-separated), TSV (tab-separated), other delimited, and fixed-width files.
Defined Under Namespace
Modules: Delimited, FixedWidth, Html, Json, Ods, Plaintext, ProcessedByNokogiri, ProcessedByRoo, Xls, Xlsx, Xml, Yaml
Constant Summary
- WHITESPACE = /\s+/
- SINGLE_SPACE = ' '
- EXTERNAL_ENCODING = 'UTF-8'
- EXTERNAL_ENCODING_ICONV = 'UTF-8//TRANSLIT'
- GOOGLE_DOCS_SPREADSHEET = [ /docs.google.com/i, /spreadsheets.google.com/i ]
- VALID = { :compression => [:gz, :zip, :bz2, :exe], :packing => [:tar], :format => [:xlsx, :xls, :delimited, :ods, :fixed_width, :html, :xml, :yaml, :csv, :json], }
- DEFAULT = { :streaming => false, :warn_on_multiple_downloads => true, :headers => :first_row, :keep_blank_rows => false, :skip => 0, :encoding => 'UTF-8', :stop_after_untitled_headers => false, }
- OLD_SETTING_NAMES = { :pre_select => [:select], :pre_reject => [:reject], :delimiter => [:col_sep], }
- VERSION = '3.3.3'
Instance Attribute Summary
-
#column_css ⇒ String
readonly
The CSS selector used to find columns in HTML or XML.
-
#column_xpath ⇒ String
readonly
The XPath used to find columns in HTML or XML.
-
#compression ⇒ Symbol
readonly
The compression type.
-
#crop ⇒ Range
readonly
Use a range of rows in a plaintext file.
-
#cut ⇒ String
readonly
Pick specific columns out of a plaintext file using an argument to the UNIX cut utility (en.wikipedia.org/wiki/Cut_%28Unix%29).
-
#delimiter ⇒ String
readonly
The delimiter, a.k.a. column separator.
-
#encoding ⇒ String
readonly
The original encoding of the source file.
-
#errata ⇒ Hash
readonly
An object that responds to #rejects?(row) and #correct!(row).
-
#filename ⇒ String
readonly
The filename, which can be used to pick a file out of an archive.
-
#form_data ⇒ String
readonly
Form data to POST in the download request.
-
#format ⇒ Hash
readonly
The format of the source file.
-
#glob ⇒ String
readonly
The glob used to pick a file out of an archive.
-
#headers ⇒ :first_row, ...
readonly
Headers specified by the user: :first_row (the default), false, or a list of headers.
-
#keep_blank_rows ⇒ true, false
readonly
Whether to keep blank rows.
-
#other_options ⇒ Hash
readonly
Options passed by the user that may be passed through to the underlying parsing library.
-
#packing ⇒ Symbol
readonly
The packing type.
-
#pre_reject ⇒ Proc
readonly
A proc that decides whether to exclude a row.
-
#pre_select ⇒ Proc
readonly
A proc that decides whether to include a row.
-
#quote_char ⇒ String
readonly
Quote character for delimited files.
-
#root_node ⇒ String
readonly
The root node of the json document.
-
#row_css ⇒ String
readonly
The CSS selector used to find rows in HTML or XML.
-
#row_xpath ⇒ String
readonly
The XPath used to find rows in HTML or XML.
-
#schema ⇒ Array<Array{String,Integer,Hash}>
readonly
The fixed-width schema, given as a multi-dimensional array.
-
#schema_name ⇒ String, Symbol
readonly
The name of a fixed-width schema that was already defined elsewhere, so that it can be reused.
-
#sheet ⇒ Object
readonly
The sheet specified by the user as a number or a string.
-
#skip ⇒ Integer
readonly
How many rows to skip at the beginning of the file or table.
-
#stop_after_untitled_headers ⇒ Integer
readonly
When to trim untitled headers.
-
#streaming ⇒ true, false
readonly
Whether to stream the rows without caching them.
-
#url ⇒ String
readonly
The URL of the local or remote file.
-
#warn_on_multiple_downloads ⇒ true, false
readonly
Whether to warn the user on multiple downloads.
Class Method Summary
-
.google_spreadsheet_csv_url(url) ⇒ String
Given a Google Docs spreadsheet URL, make sure it uses CSV output.
-
.guess_compression(url) ⇒ Symbol?
Guess compression based on URL.
-
.guess_format(basename) ⇒ Symbol?
Guess file format from the basename.
-
.guess_packing(url) ⇒ Symbol?
Guess packing from URL.
- .normalize_whitespace(v) ⇒ Object
-
.transpose(url, key_key, value_key, options = {}) ⇒ Object
Transpose two columns into a mapping from one to the other.
Instance Method Summary
-
#[](row_number) ⇒ Hash, Array
Get a row by row number.
-
#each {|Hash, Array| ... } ⇒ nil
(also: #each_row)
Yield each row.
-
#free ⇒ nil
Clear the row cache in case it helps your GC.
-
#initialize(*args) ⇒ RemoteTable
constructor
Create a new RemoteTable, which is an Enumerable.
-
#parser ⇒ #call
An object that responds to #call(row) and returns an array of one or more rows.
-
#to_a ⇒ Array<Hash,Array>
(also: #rows)
All rows.
Constructor Details
#initialize(settings) ⇒ RemoteTable
#initialize(url, settings) ⇒ RemoteTable
Create a new RemoteTable, which is an Enumerable.
Options are set at creation using any of the listed attributes. RDoc marks them as “read-only” because they can’t be set or changed after creation.
Nothing is downloaded or parsed at creation; loading is lazy.
# File 'lib/remote_table.rb', line 405
def initialize(*args)
  @download_count_mutex = ::Mutex.new
  @extend_bang_mutex = ::Mutex.new
  @cache = []
  @download_count = 0
  settings = args.last.is_a?(::Hash) ? args.last.symbolize_keys : {}
  @url = if args.first.is_a? ::String
    args.first
  else
    grab settings, :url
  end
  @format = RemoteTable.guess_format grab(settings, :format)
  if GOOGLE_DOCS_SPREADSHEET.any? { |regex| regex =~ url }
    @url = RemoteTable.google_spreadsheet_csv_url url
    @format = :delimited
  end
  @headers = grab settings, :headers
  if headers.is_a?(::Array) and headers.any?(&:blank?)
    raise ::ArgumentError, "[remote_table] If you specify headers, none of them can be blank"
  end
  @quote_char = grab settings, :quote_char
  @compression = grab(settings, :compression) || RemoteTable.guess_compression(url)
  @packing = grab(settings, :packing) || RemoteTable.guess_packing(url)
  @streaming = grab settings, :streaming
  @warn_on_multiple_downloads = grab settings, :warn_on_multiple_downloads
  @delimiter = grab settings, :delimiter
  @sheet = grab settings, :sheet
  @keep_blank_rows = grab settings, :keep_blank_rows
  @form_data = grab settings, :form_data
  @skip = grab settings, :skip
  @encoding = grab settings, :encoding
  @row_xpath = grab settings, :row_xpath
  @column_xpath = grab settings, :column_xpath
  @row_css = grab settings, :row_css
  @column_css = grab settings, :column_css
  @glob = grab settings, :glob
  @filename = grab settings, :filename
  @cut = grab settings, :cut
  @crop = grab settings, :crop
  @schema = grab settings, :schema
  @schema_name = grab settings, :schema_name
  @pre_select = grab settings, :pre_select
  @pre_reject = grab settings, :pre_reject
  @errata = grab settings, :errata
  @root_node = grab settings, :root_node
  @parser = grab settings, :parser
  @stop_after_untitled_headers = grab settings, :stop_after_untitled_headers
  @other_options = settings
  @local_copy = LocalCopy.new self
  extend!
end
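The way the constructor chooses between a leading URL string and a `:url` entry in the settings hash can be tried in isolation. What follows is a minimal sketch, not the library's code: `resolve_url` is a hypothetical helper, `Hash#transform_keys` stands in for `symbolize_keys`, and `Hash#delete` stands in for the private `grab` helper, whose implementation is not shown on this page.

```ruby
# Hypothetical helper mirroring the constructor's URL selection:
# a leading String wins; otherwise :url is taken from the settings hash.
def resolve_url(*args)
  # Assumption: symbolize_keys behaves like transform_keys(&:to_sym).
  settings = args.last.is_a?(Hash) ? args.last.transform_keys(&:to_sym) : {}
  # Assumption: grab removes the key from settings and returns its value.
  url = args.first.is_a?(String) ? args.first : settings.delete(:url)
  [url, settings]
end

p resolve_url('http://example.com/data.csv', { 'headers' => false })
p resolve_url({ :url => 'http://example.com/data.csv', :skip => 1 })
```

Both calls resolve to the same URL; whatever is left in the settings hash corresponds to what the constructor stores in `@other_options`.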
Instance Attribute Details
#column_css ⇒ String (readonly)
The CSS selector used to find columns in HTML or XML.
# File 'lib/remote_table.rb', line 253
def column_css
  @column_css
end
#column_xpath ⇒ String (readonly)
The XPath used to find columns in HTML or XML.
# File 'lib/remote_table.rb', line 245
def column_xpath
  @column_xpath
end
#compression ⇒ Symbol (readonly)
The compression type. Guessed from URL if not provided. :gz, :zip, :bz2, and :exe (treated as :zip) are supported.
# File 'lib/remote_table.rb', line 261
def compression
  @compression
end
#crop ⇒ Range (readonly)
Use a range of rows in a plaintext file.
# File 'lib/remote_table.rb', line 302
def crop
  @crop
end
#cut ⇒ String (readonly)
Pick specific columns out of a plaintext file using an argument to the UNIX cut utility (en.wikipedia.org/wiki/Cut_%28Unix%29).
# File 'lib/remote_table.rb', line 291
def cut
  @cut
end
#delimiter ⇒ String (readonly)
The delimiter, a.k.a. column separator. Passed to Ruby CSV as :col_sep. Default is ‘,’.
# File 'lib/remote_table.rb', line 237
def delimiter
  @delimiter
end
#encoding ⇒ String (readonly)
The original encoding of the source file. Default is UTF-8.
# File 'lib/remote_table.rb', line 233
def encoding
  @encoding
end
#errata ⇒ Hash (readonly)
An object that responds to #rejects?(row) and #correct!(row). Applied after creating row_hash.
- #rejects?(row) - if row should be treated like it doesn’t exist
- #correct!(row) - destructively update a row to fix something
See the Errata library at github.com/seamusabshere/errata for an example implementation.
# File 'lib/remote_table.rb', line 339
def errata
  @errata
end
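Any object with these two methods should satisfy the contract described above. The following is a minimal sketch under assumptions: `BlankNameErrata` is a hypothetical class, and rows are assumed to be plain Hashes with string keys.

```ruby
# Hypothetical errata object; the class name and row shape are illustrative.
class BlankNameErrata
  # Treat rows without a name as if they did not exist.
  def rejects?(row)
    row['name'].to_s.strip.empty?
  end

  # Destructively fix a known typo in place.
  def correct!(row)
    row['name'] = 'Toyota' if row['name'] == 'Toyoya'
    row
  end
end

errata = BlankNameErrata.new
rows = [{ 'name' => 'Toyoya' }, { 'name' => '' }, { 'name' => 'Honda' }]
kept = rows.reject { |row| errata.rejects?(row) }
kept.each { |row| errata.correct!(row) }
p kept.map { |row| row['name'] }
```

Such an object would be passed at creation as the `:errata` setting.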
#filename ⇒ String (readonly)
The filename, which can be used to pick a file out of an archive.
# File 'lib/remote_table.rb', line 281
def filename
  @filename
end
#form_data ⇒ String (readonly)
Form data to POST in the download request. It should probably be in application/x-www-form-urlencoded.
# File 'lib/remote_table.rb', line 225
def form_data
  @form_data
end
#format ⇒ Hash (readonly)
The format of the source file. Can be specified as: :xlsx, :xls, :delimited (aka :csv), :ods, :fixed_width, :html, :xml, :yaml, :json
Note: treats all docs.google.com and spreadsheets.google.com URLs as :delimited.
Default: guessed from file extension (which is usually the same as the URL, but sometimes not if you pick out a specific file from an archive)
# File 'lib/remote_table.rb', line 257
def format
  @format
end
#glob ⇒ String (readonly)
The glob used to pick a file out of an archive.
# File 'lib/remote_table.rb', line 273
def glob
  @glob
end
#headers ⇒ :first_row, ... (readonly)
Headers specified by the user: :first_row (the default), false, or a list of headers.
# File 'lib/remote_table.rb', line 206
def headers
  @headers
end
#keep_blank_rows ⇒ true, false (readonly)
Whether to keep blank rows. Default is false.
# File 'lib/remote_table.rb', line 221
def keep_blank_rows
  @keep_blank_rows
end
#other_options ⇒ Hash (readonly)
Options passed by the user that may be passed through to the underlying parsing library.
# File 'lib/remote_table.rb', line 382
def other_options
  @other_options
end
#packing ⇒ Symbol (readonly)
The packing type. Guessed from URL if not provided. Only :tar is supported.
# File 'lib/remote_table.rb', line 265
def packing
  @packing
end
#pre_reject ⇒ Proc (readonly)
A proc that decides whether to exclude a row: rows for which it returns a truthy value are skipped. Previously passed as :reject.
# File 'lib/remote_table.rb', line 329
def pre_reject
  @pre_reject
end
#pre_select ⇒ Proc (readonly)
A proc that decides whether to include a row: rows for which it returns a falsy value are skipped. Previously passed as :select.
# File 'lib/remote_table.rb', line 325
def pre_select
  @pre_select
end
#quote_char ⇒ String (readonly)
Quote character for delimited files.
Defaults to double quotes.
# File 'lib/remote_table.rb', line 213
def quote_char
  @quote_char
end
#root_node ⇒ String (readonly)
The root node of the json document. Specified as a string.
Default: nil; no root node.
# File 'lib/remote_table.rb', line 355
def root_node
  @root_node
end
#row_css ⇒ String (readonly)
The CSS selector used to find rows in HTML or XML.
# File 'lib/remote_table.rb', line 249
def row_css
  @row_css
end
#row_xpath ⇒ String (readonly)
The XPath used to find rows in HTML or XML.
# File 'lib/remote_table.rb', line 241
def row_xpath
  @row_xpath
end
#schema ⇒ Array<Array{String,Integer,Hash}> (readonly)
The fixed-width schema, given as a multi-dimensional array.
# File 'lib/remote_table.rb', line 317
def schema
  @schema
end
#schema_name ⇒ String, Symbol (readonly)
The name of a fixed-width schema that was already defined elsewhere, so that it can be reused.
# File 'lib/remote_table.rb', line 321
def schema_name
  @schema_name
end
#sheet ⇒ Object (readonly)
The sheet specified by the user as a number or a string.
# File 'lib/remote_table.rb', line 217
def sheet
  @sheet
end
#skip ⇒ Integer (readonly)
How many rows to skip at the beginning of the file or table. Default is 0.
# File 'lib/remote_table.rb', line 229
def skip
  @skip
end
#stop_after_untitled_headers ⇒ Integer (readonly)
When to trim untitled headers. Set this to 100 to prevent more than 100 untitled headers being created; the rest will be silently discarded.
Note: This is effectively a right trim… the counting starts from the left.
Default: false (no trimming).
# File 'lib/remote_table.rb', line 364
def stop_after_untitled_headers
  @stop_after_untitled_headers
end
#streaming ⇒ true, false (readonly)
Whether to stream the rows without caching them. Saves memory, but you have to re-download the file every time you enumerate its rows. Defaults to false.
# File 'lib/remote_table.rb', line 198
def streaming
  @streaming
end
#url ⇒ String (readonly)
The URL of the local or remote file.
# File 'lib/remote_table.rb', line 179
def url
  @url
end
#warn_on_multiple_downloads ⇒ true, false (readonly)
Whether to warn the user on multiple downloads. Defaults to true.
# File 'lib/remote_table.rb', line 202
def warn_on_multiple_downloads
  @warn_on_multiple_downloads
end
Class Method Details
.google_spreadsheet_csv_url(url) ⇒ String
Given a Google Docs spreadsheet URL, make sure it uses CSV output.
# File 'lib/remote_table.rb', line 102
def google_spreadsheet_csv_url(url)
  uri = ::URI.parse url
  params = uri.query.split('&')
  params.delete_if { |param| param.start_with?('output=') }
  params << 'output=csv'
  uri.query = params.join('&')
  uri.to_s
end
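To see the effect of the query rewriting, the method body can be run on its own with only the standard library. This is a stand-alone copy for experimentation, and the sample URL and its `key` parameter are made up for illustration.

```ruby
require 'uri'

# Stand-alone copy of the method body shown above.
def google_spreadsheet_csv_url(url)
  uri = URI.parse(url)
  params = uri.query.split('&')
  # Drop any existing output= parameter, then force CSV output.
  params.delete_if { |param| param.start_with?('output=') }
  params << 'output=csv'
  uri.query = params.join('&')
  uri.to_s
end

csv_url = google_spreadsheet_csv_url('https://docs.google.com/spreadsheet/pub?key=abc123&output=html')
puts csv_url
```

One consequence of the code as written: for a URL without any query string, `uri.query` is nil, so the call to `split` raises NoMethodError.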
.guess_compression(url) ⇒ Symbol?
Guess compression based on URL. Used internally.
# File 'lib/remote_table.rb', line 51
def guess_compression(url)
  extname = extname(url).downcase
  case extname
  when /gz/, /gunzip/
    :gz
  when /zip/
    :zip
  when /bz2/, /bunzip2/
    :bz2
  when /exe/
    :exe
  end
end
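The matching rules can be exercised without the library. In this sketch the private `extname` helper, which is not shown on this page, is assumed to behave like `File.extname` applied to the path component of the URL; the sample URLs are illustrative.

```ruby
require 'uri'

# Sketch of guess_compression with an assumed extname implementation.
def guess_compression(url)
  # Assumption: extname(url) is equivalent to File.extname on the URL path.
  extname = File.extname(URI.parse(url).path).downcase
  case extname
  when /gz/, /gunzip/ then :gz
  when /zip/ then :zip
  when /bz2/, /bunzip2/ then :bz2
  when /exe/ then :exe
  end
end

p guess_compression('http://example.com/data.csv.gz')
p guess_compression('http://example.com/archive.ZIP')
p guess_compression('http://example.com/data.csv')
```

Because the patterns are unanchored, any extension that merely contains one of the fragments matches; an extension with no match yields nil.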
.guess_format(basename) ⇒ Symbol?
Guess file format from the basename. Since a file might be decompressed and/or pulled out of an archive with a glob, this usually can’t be called until a file is downloaded.
# File 'lib/remote_table.rb', line 76
def guess_format(basename)
  case basename.to_s.downcase.strip
  when /ods\z/, /open_?office\z/
    :ods
  when /xlsx\z/, /excelx\z/
    :xlsx
  when /xls\z/, /excel\z/
    :xls
  when /csv\z/, /tsv\z/, /delimited\z/
    # note that there is no RemoteTable::Csv class - it's normalized to :delimited
    :delimited
  when /fixed_?width\z/
    :fixed_width
  when /html?\z/
    :html
  when /xml\z/
    :xml
  when /yaml\z/, /yml\z/
    :yaml
  when /json\z/
    :json
  end
end
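The same rules can be restated as a lookup table, which makes the end-of-string anchoring easy to test. This is a condensed sketch rather than the library's implementation; `FORMAT_RULES` is a hypothetical constant.

```ruby
# Table-driven restatement of the rules above; first matching pattern wins.
FORMAT_RULES = {
  /ods\z|open_?office\z/ => :ods,
  /xlsx\z|excelx\z/ => :xlsx,
  /xls\z|excel\z/ => :xls,
  /csv\z|tsv\z|delimited\z/ => :delimited,
  /fixed_?width\z/ => :fixed_width,
  /html?\z/ => :html,
  /xml\z/ => :xml,
  /yaml\z|yml\z/ => :yaml,
  /json\z/ => :json
}.freeze

def guess_format(basename)
  name = basename.to_s.downcase.strip
  rule = FORMAT_RULES.find { |pattern, _format| pattern =~ name }
  rule && rule.last
end

p guess_format('Report.XLSX')
p guess_format(:csv)
p guess_format('data.csv.bak')
```

Since every pattern is anchored with `\z`, only the end of the name counts, so a trailing `.bak` defeats the guess. Symbols such as `:csv` work because the argument is converted with `to_s` first.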
.guess_packing(url) ⇒ Symbol?
Guess packing from URL. Used internally.
# File 'lib/remote_table.rb', line 67
def guess_packing(url)
  basename = basename(url).downcase
  if basename.include?('.tar') or basename.include?('.tgz')
    :tar
  end
end
.normalize_whitespace(v) ⇒ Object
# File 'lib/remote_table.rb', line 111
def normalize_whitespace(v)
  v = v.to_s.dup
  v.gsub! WHITESPACE, SINGLE_SPACE
  v.strip!
  v
end
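A stand-alone copy of this method, together with the two constants it relies on from the Constant Summary, shows what the normalization does; the sample inputs are illustrative.

```ruby
WHITESPACE = /\s+/
SINGLE_SPACE = ' '

# Stand-alone copy of the method body shown above.
def normalize_whitespace(v)
  v = v.to_s.dup                   # work on a copy so the caller's value is untouched
  v.gsub! WHITESPACE, SINGLE_SPACE # collapse each run of whitespace to one space
  v.strip!                         # then trim both ends
  v
end

p normalize_whitespace("  Honda \t Civic\n")
p normalize_whitespace(nil)
```

Runs of spaces, tabs, and newlines collapse to a single space, and nil becomes an empty string because of the `to_s` call.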
.transpose(url, key_key, value_key, options = {}) ⇒ Object
Transpose two columns into a mapping from one to the other.
# File 'lib/remote_table.rb', line 42
def transpose(url, key_key, value_key, options = {})
  new(url, options).inject({}) do |memo, row|
    memo[row[key_key]] = row[value_key]
    memo
  end
end
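The folding step can be illustrated without downloading anything. In this sketch an Array of Hashes stands in for the rows a table would yield; the column names `code` and `name` are made up for illustration.

```ruby
# Hypothetical stand-in for the rows yielded by a table.
rows = [
  { 'code' => 'US', 'name' => 'United States' },
  { 'code' => 'FR', 'name' => 'France' }
]

# Same fold as in transpose: key column => value column.
mapping = rows.inject({}) do |memo, row|
  memo[row['code']] = row['name']
  memo
end

p mapping
```

If the key column contains duplicates, later rows overwrite earlier ones, since each assignment replaces the existing entry.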
Instance Method Details
#[](row_number) ⇒ Hash, Array
Get a row by row number. Zero-based.
# File 'lib/remote_table.rb', line 519
def [](row_number)
  if fully_cached?
    cache[row_number]
  else
    to_a[row_number]
  end
end
#each {|Hash, Array| ... } ⇒ nil Also known as: each_row
Yield each row.
# File 'lib/remote_table.rb', line 470
def each
  if fully_cached?
    cache.each do |row|
      yield row
    end
  else
    mark_download!
    preprocess!
    memo = _each do |row|
      parser.call(row).each do |virtual_row|
        virtual_row.row_hash = ::HashDigest.digest3 row
        if errata
          next if errata.rejects? virtual_row
          errata.correct! virtual_row
        end
        next if pre_select and !pre_select.call(virtual_row)
        next if pre_reject and pre_reject.call(virtual_row)
        unless streaming
          cache.push virtual_row
        end
        yield virtual_row
      end
    end
    unless streaming
      fully_cached!
    end
    memo
  end
  nil
end
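The order in which rows are filtered (errata first, then pre_select, then pre_reject) can be reproduced on in-memory data. This is a simplified sketch: `filter_rows` is a hypothetical helper, rows are plain Hashes, and downloading, parsing, caching, and the row_hash assignment are left out.

```ruby
# Sketch of the per-row filtering order used by #each.
def filter_rows(rows, errata: nil, pre_select: nil, pre_reject: nil)
  rows.each_with_object([]) do |row, kept|
    if errata
      next if errata.rejects?(row)
      errata.correct!(row)
    end
    # pre_select keeps a row only when the proc returns a truthy value.
    next if pre_select && !pre_select.call(row)
    # pre_reject drops a row when the proc returns a truthy value.
    next if pre_reject && pre_reject.call(row)
    kept << row
  end
end

rows = [{ 'year' => 1999 }, { 'year' => 2005 }, { 'year' => 2012 }]
kept = filter_rows(
  rows,
  pre_select: ->(row) { row['year'] >= 2000 },
  pre_reject: ->(row) { row['year'] > 2010 }
)
p kept
```

Only the 2005 row passes both procs here. In the method itself, rows that survive are also pushed onto the cache unless streaming is enabled.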
#free ⇒ nil
Clear the row cache in case it helps your GC.
# File 'lib/remote_table.rb', line 530
def free
  @fully_cached = false
  cache.clear
  nil
end
#parser ⇒ #call
An object that responds to #call(row) and returns an array of one or more rows.
# File 'lib/remote_table.rb', line 376
def parser
  @final_parser ||= (@parser || NullParser.new)
end
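A custom parser only needs a `call(row)` method that returns an array of rows. The sketch below is illustrative: `YearSplitter` and the column names are hypothetical, and rows are plain Hashes. Note that `#each` also assigns `row_hash` on every returned row, which this sketch ignores.

```ruby
# Hypothetical parser: turns one wide row into one row per year column.
class YearSplitter
  YEARS = %w[2010 2011].freeze

  def call(row)
    YEARS.map do |year|
      { 'name' => row['name'], 'year' => year, 'value' => row[year] }
    end
  end
end

parser = YearSplitter.new
virtual_rows = parser.call({ 'name' => 'Honda', '2010' => 5, '2011' => 7 })
p virtual_rows
```

When no `:parser` setting is given, the method falls back to `NullParser.new`, whose implementation is not shown on this page.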