Class: RemoteTable
- Inherits:
- Object
  - Object
    - RemoteTable
- Includes:
- Enumerable
- Defined in:
- lib/remote_table.rb,
lib/remote_table/ods.rb,
lib/remote_table/xls.rb,
lib/remote_table/xml.rb,
lib/remote_table/html.rb,
lib/remote_table/xlsx.rb,
lib/remote_table/yaml.rb,
lib/remote_table/version.rb,
lib/remote_table/delimited.rb,
lib/remote_table/plaintext.rb,
lib/remote_table/local_copy.rb,
lib/remote_table/fixed_width.rb,
lib/remote_table/processed_by_roo.rb,
lib/remote_table/processed_by_nokogiri.rb
Overview
Open Google Docs spreadsheets and local or remote XLSX, XLS, ODS, CSV (comma-separated), TSV (tab-separated), other delimited, and fixed-width files.
Defined Under Namespace
Modules: Delimited, FixedWidth, Html, Ods, Plaintext, ProcessedByNokogiri, ProcessedByRoo, Xls, Xlsx, Xml, Yaml
Constant Summary
- WHITESPACE =
/\s+/
- SINGLE_SPACE =
' '
- EXTERNAL_ENCODING =
'UTF-8'
- EXTERNAL_ENCODING_ICONV =
'UTF-8//TRANSLIT'
- GOOGLE_DOCS_SPREADSHEET =
[ /docs.google.com/i, /spreadsheets.google.com/i ]
- VALID =
{ :compression => [:gz, :zip, :bz2, :exe], :packing => [:tar], :format => [:xlsx, :xls, :delimited, :ods, :fixed_width, :html, :xml, :yaml, :csv], }
- DEFAULT =
{ :streaming => false, :warn_on_multiple_downloads => true, :headers => :first_row, :keep_blank_rows => false, :skip => 0, :encoding => 'UTF-8', }
- OLD_SETTING_NAMES =
{ :pre_select => [:select], :pre_reject => [:reject], :delimiter => [:col_sep], }
- VERSION =
'3.2.1'
Instance Attribute Summary
-
#column_css ⇒ String
readonly
The CSS selector used to find columns in HTML or XML.
-
#column_xpath ⇒ String
readonly
The XPath used to find columns in HTML or XML.
-
#compression ⇒ Symbol
readonly
The compression type.
-
#crop ⇒ Range
readonly
Use a range of rows in a plaintext file.
-
#cut ⇒ String
readonly
Pick specific columns out of a plaintext file using an argument to the UNIX [cut utility](en.wikipedia.org/wiki/Cut_%28Unix%29).
-
#delimiter ⇒ String
readonly
The delimiter, a.k.a. column separator.
-
#encoding ⇒ String
readonly
The original encoding of the source file.
-
#errata ⇒ Hash
readonly
An object that responds to #rejects?(row) and #correct!(row).
-
#filename ⇒ String
readonly
The filename, which can be used to pick a file out of an archive.
-
#form_data ⇒ String
readonly
Form data to POST in the download request.
-
#format ⇒ Symbol
readonly
The format of the source file.
-
#glob ⇒ String
readonly
The glob used to pick a file out of an archive.
-
#headers ⇒ :first_row, ...
readonly
Headers specified by the user: :first_row (the default), false, or a list of headers.
-
#keep_blank_rows ⇒ true, false
readonly
Whether to keep blank rows.
-
#other_options ⇒ Hash
readonly
Options passed by the user that may be passed through to the underlying parsing library.
-
#packing ⇒ Symbol
readonly
The packing type.
-
#pre_reject ⇒ Proc
readonly
A proc that decides whether to exclude a row.
-
#pre_select ⇒ Proc
readonly
A proc that decides whether to include a row.
-
#quote_char ⇒ String
readonly
Quote character for delimited files.
-
#row_css ⇒ String
readonly
The CSS selector used to find rows in HTML or XML.
-
#row_xpath ⇒ String
readonly
The XPath used to find rows in HTML or XML.
-
#schema ⇒ Array<Array{String,Integer,Hash}>
readonly
The fixed-width schema, given as a multi-dimensional array.
-
#schema_name ⇒ String, Symbol
readonly
The name of a fixed-width schema that has already been defined elsewhere, so that it can be reused.
-
#sheet ⇒ Object
readonly
The sheet specified by the user as a number or a string.
-
#skip ⇒ Integer
readonly
How many rows to skip at the beginning of the file or table.
-
#streaming ⇒ true, false
readonly
Whether to stream the rows without caching them.
-
#url ⇒ String
readonly
The URL of the local or remote file.
-
#warn_on_multiple_downloads ⇒ true, false
readonly
Whether to warn the user on multiple downloads.
Class Method Summary
-
.google_spreadsheet_csv_url(url) ⇒ String
Given a Google Docs spreadsheet URL, make sure it uses CSV output.
-
.guess_compression(url) ⇒ Symbol?
Guess compression based on URL.
-
.guess_format(basename) ⇒ Symbol?
Guess file format from the basename.
-
.guess_packing(url) ⇒ Symbol?
Guess packing from URL.
- .normalize_whitespace(v) ⇒ Object
-
.transpose(url, key_key, value_key, options = {}) ⇒ Object
Transpose two columns into a mapping from one to the other.
Instance Method Summary
-
#[](row_number) ⇒ Hash, Array
Get a row by row number.
-
#each {|Hash, Array| ... } ⇒ nil
(also: #each_row)
Yield each row.
-
#free ⇒ nil
Clear the row cache in case it helps your GC.
-
#initialize(*args) ⇒ RemoteTable
constructor
Create a new RemoteTable, which is an Enumerable.
-
#parser ⇒ #call
An object that responds to #call(row) and returns an array of one or more rows.
-
#to_a ⇒ Array<Hash,Array>
(also: #rows)
All rows.
Constructor Details
#initialize(settings) ⇒ RemoteTable
#initialize(url, settings) ⇒ RemoteTable
Create a new RemoteTable, which is an Enumerable.
Options are set at creation using any of the attributes listed. The documentation marks them as read-only because they cannot be set or changed after creation.
Nothing is downloaded or parsed at creation; loading is lazy.
# File 'lib/remote_table.rb', line 384
def initialize(*args)
  @download_count_mutex = ::Mutex.new
  @extend_bang_mutex = ::Mutex.new
  @cache = []
  @download_count = 0
  settings = args.last.is_a?(::Hash) ? args.last.symbolize_keys : {}
  @url = if args.first.is_a? ::String
    args.first
  else
    grab settings, :url
  end
  @format = RemoteTable.guess_format grab(settings, :format)
  if GOOGLE_DOCS_SPREADSHEET.any? { |regex| regex =~ url }
    @url = RemoteTable.google_spreadsheet_csv_url url
    @format = :delimited
  end
  @headers = grab settings, :headers
  if headers.is_a?(::Array) and headers.any?(&:blank?)
    raise ::ArgumentError, "[remote_table] If you specify headers, none of them can be blank"
  end
  @quote_char = grab settings, :quote_char
  @compression = grab(settings, :compression) || RemoteTable.guess_compression(url)
  @packing = grab(settings, :packing) || RemoteTable.guess_packing(url)
  @streaming = grab settings, :streaming
  @warn_on_multiple_downloads = grab settings, :warn_on_multiple_downloads
  @delimiter = grab settings, :delimiter
  @sheet = grab settings, :sheet
  @keep_blank_rows = grab settings, :keep_blank_rows
  @form_data = grab settings, :form_data
  @skip = grab settings, :skip
  @encoding = grab settings, :encoding
  @row_xpath = grab settings, :row_xpath
  @column_xpath = grab settings, :column_xpath
  @row_css = grab settings, :row_css
  @column_css = grab settings, :column_css
  @glob = grab settings, :glob
  @filename = grab settings, :filename
  @cut = grab settings, :cut
  @crop = grab settings, :crop
  @schema = grab settings, :schema
  @schema_name = grab settings, :schema_name
  @pre_select = grab settings, :pre_select
  @pre_reject = grab settings, :pre_reject
  @errata = grab settings, :errata
  @parser = grab settings, :parser
  @other_options = settings
  @local_copy = LocalCopy.new self
  extend!
end
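The two constructor forms differ only in where the URL comes from. The following is a minimal stand-alone sketch of that argument handling, not the library's code: `split_args` is a hypothetical helper, `transform_keys` stands in for ActiveSupport's `symbolize_keys`, and the example URL is made up.

```ruby
# Hypothetical helper mirroring how the constructor reads its arguments.
# transform_keys(&:to_sym) stands in for ActiveSupport's symbolize_keys.
def split_args(*args)
  settings = args.last.is_a?(Hash) ? args.last.transform_keys(&:to_sym) : {}
  url = args.first.is_a?(String) ? args.first : settings[:url]
  [url, settings]
end

# Form 1: everything, including the URL, in one settings hash.
url_a, settings_a = split_args({ 'url' => 'http://example.com/data.csv', 'skip' => 1 })

# Form 2: the URL first, then the settings hash.
url_b, settings_b = split_args('http://example.com/data.csv', { :skip => 1 })

puts url_a == url_b
```

Either way the settings end up keyed by symbols, which is why string and symbol keys can be used interchangeably.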
Instance Attribute Details
#column_css ⇒ String (readonly)
The CSS selector used to find columns in HTML or XML.
# File 'lib/remote_table.rb', line 248
def column_css
  @column_css
end
#column_xpath ⇒ String (readonly)
The XPath used to find columns in HTML or XML.
# File 'lib/remote_table.rb', line 240
def column_xpath
  @column_xpath
end
#compression ⇒ Symbol (readonly)
The compression type. Guessed from URL if not provided. :gz, :zip, :bz2, and :exe (treated as :zip) are supported.
# File 'lib/remote_table.rb', line 256
def compression
  @compression
end
#crop ⇒ Range (readonly)
Use a range of rows in a plaintext file.
# File 'lib/remote_table.rb', line 297
def crop
  @crop
end
#cut ⇒ String (readonly)
Pick specific columns out of a plaintext file using an argument to the UNIX [cut utility](en.wikipedia.org/wiki/Cut_%28Unix%29).
# File 'lib/remote_table.rb', line 286
def cut
  @cut
end
#delimiter ⇒ String (readonly)
The delimiter, a.k.a. column separator. Passed to Ruby CSV as :col_sep. Default is ‘,’.
# File 'lib/remote_table.rb', line 232
def delimiter
  @delimiter
end
#encoding ⇒ String (readonly)
The original encoding of the source file. Default is UTF-8.
# File 'lib/remote_table.rb', line 228
def encoding
  @encoding
end
#errata ⇒ Hash (readonly)
An object that responds to #rejects?(row) and #correct!(row). Applied after creating row_hash.
-
#rejects?(row) - if row should be treated like it doesn’t exist
-
#correct!(row) - destructively update a row to fix something
See the Errata library at github.com/seamusabshere/errata for an example implementation.
# File 'lib/remote_table.rb', line 334
def errata
  @errata
end
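Any object with these two methods can serve as errata. As a rough illustration, the class below is a hypothetical stand-in (its name, the 'name' column, and the sample rows are all assumptions) that drops rows with a blank name and trims the rest.

```ruby
# A hypothetical errata object; not part of RemoteTable or the Errata gem.
class SimpleErrata
  # Treat the row as if it did not exist when its name is missing.
  def rejects?(row)
    row['name'].to_s.strip.empty?
  end

  # Destructively fix the row in place.
  def correct!(row)
    row['name'] = row['name'].strip
    row
  end
end

errata = SimpleErrata.new
rows = [{ 'name' => ' Alice ' }, { 'name' => '' }, { 'name' => nil }]

# Same order as in #each: reject first, then correct the survivors.
cleaned = rows.reject { |row| errata.rejects?(row) }
              .each { |row| errata.correct!(row) }
p cleaned
```

Checking rejects? before calling correct! means the correction code never has to handle rows that are about to be discarded.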
#filename ⇒ String (readonly)
The filename, which can be used to pick a file out of an archive.
# File 'lib/remote_table.rb', line 276
def filename
  @filename
end
#form_data ⇒ String (readonly)
Form data to POST in the download request. It should probably be in application/x-www-form-urlencoded.
# File 'lib/remote_table.rb', line 220
def form_data
  @form_data
end
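One way to build such a string is Ruby's standard `URI.encode_www_form`. The field names and values below are hypothetical; the real ones depend on the server receiving the POST.

```ruby
require 'uri'

# Hypothetical form fields for illustration only.
fields = { 'year' => '2011', 'state' => 'New York' }

# URI.encode_www_form produces application/x-www-form-urlencoded text.
form_data = URI.encode_www_form(fields)
puts form_data
```

The resulting string is what would be passed as the :form_data setting.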
#format ⇒ Symbol (readonly)
The format of the source file. Can be specified as: :xlsx, :xls, :delimited (aka :csv), :ods, :fixed_width, :html, :xml, :yaml
Note: treats all docs.google.com and spreadsheets.google.com URLs as :delimited.
Default: guessed from the file extension (usually the extension of the URL, but not when you pick a specific file out of an archive).
# File 'lib/remote_table.rb', line 252
def format
  @format
end
#glob ⇒ String (readonly)
The glob used to pick a file out of an archive.
# File 'lib/remote_table.rb', line 268
def glob
  @glob
end
#headers ⇒ :first_row, ... (readonly)
Headers specified by the user: :first_row (the default), false, or a list of headers.
# File 'lib/remote_table.rb', line 201
def headers
  @headers
end
#keep_blank_rows ⇒ true, false (readonly)
Whether to keep blank rows. Default is false.
# File 'lib/remote_table.rb', line 216
def keep_blank_rows
  @keep_blank_rows
end
#other_options ⇒ Hash (readonly)
Options passed by the user that may be passed through to the underlying parsing library.
# File 'lib/remote_table.rb', line 361
def other_options
  @other_options
end
#packing ⇒ Symbol (readonly)
The packing type. Guessed from URL if not provided. Only :tar is supported.
# File 'lib/remote_table.rb', line 260
def packing
  @packing
end
#pre_reject ⇒ Proc (readonly)
A proc that decides whether to exclude a row: rows for which it returns true are skipped. Previously passed as :reject.
# File 'lib/remote_table.rb', line 324
def pre_reject
  @pre_reject
end
#pre_select ⇒ Proc (readonly)
A proc that decides whether to include a row. Previously passed as :select.
# File 'lib/remote_table.rb', line 320
def pre_select
  @pre_select
end
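The two filters combine: a row is kept only if pre_select accepts it and pre_reject does not reject it. A small sketch of that rule on plain hashes follows; the column names and sample rows are assumptions made for the example.

```ruby
# Hypothetical rows standing in for what a table would yield.
rows = [
  { 'country' => 'US', 'population' => '100' },
  { 'country' => 'US', 'population' => '0' },
  { 'country' => 'FR', 'population' => '50' }
]

pre_select = proc { |row| row['country'] == 'US' }        # keep when true
pre_reject = proc { |row| row['population'].to_i.zero? }  # drop when true

# Apply the filters in the same order as #each does.
kept = rows.select do |row|
  next false if pre_select && !pre_select.call(row)
  next false if pre_reject && pre_reject.call(row)
  true
end
p kept
```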
#quote_char ⇒ String (readonly)
Quote character for delimited files.
Defaults to double quotes.
# File 'lib/remote_table.rb', line 208
def quote_char
  @quote_char
end
#row_css ⇒ String (readonly)
The CSS selector used to find rows in HTML or XML.
# File 'lib/remote_table.rb', line 244
def row_css
  @row_css
end
#row_xpath ⇒ String (readonly)
The XPath used to find rows in HTML or XML.
# File 'lib/remote_table.rb', line 236
def row_xpath
  @row_xpath
end
#schema ⇒ Array<Array{String,Integer,Hash}> (readonly)
The fixed-width schema, given as a multi-dimensional array.
# File 'lib/remote_table.rb', line 312
def schema
  @schema
end
#schema_name ⇒ String, Symbol (readonly)
The name of a fixed-width schema that has already been defined elsewhere, so that it can be reused.
# File 'lib/remote_table.rb', line 316
def schema_name
  @schema_name
end
#sheet ⇒ Object (readonly)
The sheet specified by the user as a number or a string.
# File 'lib/remote_table.rb', line 212
def sheet
  @sheet
end
#skip ⇒ Integer (readonly)
How many rows to skip at the beginning of the file or table. Default is 0.
# File 'lib/remote_table.rb', line 224
def skip
  @skip
end
#streaming ⇒ true, false (readonly)
Whether to stream the rows without caching them. Saves memory, but you have to re-download the file every time you enumerate its rows. Defaults to false.
# File 'lib/remote_table.rb', line 193
def streaming
  @streaming
end
#url ⇒ String (readonly)
The URL of the local or remote file.
# File 'lib/remote_table.rb', line 174
def url
  @url
end
#warn_on_multiple_downloads ⇒ true, false (readonly)
Whether to warn the user on multiple downloads. Defaults to true.
# File 'lib/remote_table.rb', line 197
def warn_on_multiple_downloads
  @warn_on_multiple_downloads
end
Class Method Details
.google_spreadsheet_csv_url(url) ⇒ String
Given a Google Docs spreadsheet URL, make sure it uses CSV output.
# File 'lib/remote_table.rb', line 98
def google_spreadsheet_csv_url(url)
  uri = ::URI.parse url
  params = uri.query.split('&')
  params.delete_if { |param| param.start_with?('output=') }
  params << 'output=csv'
  uri.query = params.join('&')
  uri.to_s
end
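To see the rewrite on a concrete URL, here is a stand-alone sketch that uses only Ruby's standard URI library. The name `csv_output_url` and the spreadsheet key are hypothetical, and the `.to_s` guard for URLs without a query string is an addition of this sketch rather than something the listing above does.

```ruby
require 'uri'

# Stand-alone variation of the logic shown above, for illustration only.
def csv_output_url(url)
  uri = URI.parse(url)
  params = uri.query.to_s.split('&') # .to_s guards against a nil query
  params.delete_if { |param| param.start_with?('output=') }
  params << 'output=csv'
  uri.query = params.join('&')
  uri.to_s
end

converted = csv_output_url('https://spreadsheets.google.com/pub?key=abc123&output=html')
puts converted
```

Any existing output= parameter is dropped before output=csv is appended, so the result never carries two conflicting output settings.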
.guess_compression(url) ⇒ Symbol?
Guess compression based on URL. Used internally.
# File 'lib/remote_table.rb', line 49
def guess_compression(url)
  extname = extname(url).downcase
  case extname
  when /gz/, /gunzip/
    :gz
  when /zip/
    :zip
  when /bz2/, /bunzip2/
    :bz2
  when /exe/
    :exe
  end
end
.guess_format(basename) ⇒ Symbol?
Guess file format from the basename. Since a file might be decompressed and/or pulled out of an archive with a glob, this usually can’t be called until a file is downloaded.
# File 'lib/remote_table.rb', line 74
def guess_format(basename)
  case basename.to_s.downcase.strip
  when /ods\z/, /open_?office\z/
    :ods
  when /xlsx\z/, /excelx\z/
    :xlsx
  when /xls\z/, /excel\z/
    :xls
  when /csv\z/, /tsv\z/, /delimited\z/
    # note that there is no RemoteTable::Csv class - it's normalized to :delimited
    :delimited
  when /fixed_?width\z/
    :fixed_width
  when /html?\z/
    :html
  when /xml\z/
    :xml
  when /yaml\z/, /yml\z/
    :yaml
  end
end
.guess_packing(url) ⇒ Symbol?
Guess packing from URL. Used internally.
# File 'lib/remote_table.rb', line 65
def guess_packing(url)
  basename = basename(url).downcase
  if basename.include?('.tar') or basename.include?('.tgz')
    :tar
  end
end
.normalize_whitespace(v) ⇒ Object
Collapse runs of whitespace into single spaces and strip both ends.
# File 'lib/remote_table.rb', line 107
def normalize_whitespace(v)
  v = v.to_s.dup
  v.gsub! WHITESPACE, SINGLE_SPACE
  v.strip!
  v
end
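To see the effect on a messy value, the sketch below repeats the same steps outside the class. The method name `squish_whitespace` and the sample string are hypothetical; the two constants match the values in the constant summary.

```ruby
WHITESPACE = /\s+/
SINGLE_SPACE = ' '

# Stand-alone copy of the steps above, under a hypothetical name.
def squish_whitespace(v)
  v = v.to_s.dup                   # work on a copy, never the caller's object
  v.gsub! WHITESPACE, SINGLE_SPACE # tabs, newlines, runs of spaces => one space
  v.strip!                         # drop leading and trailing whitespace
  v
end

cleaned_value = squish_whitespace("  New \t York\n City ")
p cleaned_value
```

Because the value goes through to_s first, nil and numbers are accepted as well as strings.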
.transpose(url, key_key, value_key, options = {}) ⇒ Object
Transpose two columns into a mapping from one to the other.
# File 'lib/remote_table.rb', line 40
def transpose(url, key_key, value_key, options = {})
  new(url, options).inject({}) do |memo, row|
    memo[row[key_key]] = row[value_key]
    memo
  end
end
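The fold itself is easy to try on plain data. In this sketch an array of hashes stands in for the rows a table would yield; the 'code' and 'name' columns are hypothetical.

```ruby
# Hypothetical rows standing in for what a RemoteTable would yield.
rows = [
  { 'code' => 'US', 'name' => 'United States' },
  { 'code' => 'CA', 'name' => 'Canada' }
]

# Same fold as .transpose: key column => value column.
mapping = rows.inject({}) do |memo, row|
  memo[row['code']] = row['name']
  memo
end
p mapping
```

If two rows share a key, the later row overwrites the earlier one, since each assignment replaces the previous value for that key.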
Instance Method Details
#[](row_number) ⇒ Hash, Array
Get a row by row number. Zero-based.
# File 'lib/remote_table.rb', line 496
def [](row_number)
  if fully_cached?
    cache[row_number]
  else
    to_a[row_number]
  end
end
#each {|Hash, Array| ... } ⇒ nil Also known as: each_row
Yield each row.
# File 'lib/remote_table.rb', line 447
def each
  if fully_cached?
    cache.each do |row|
      yield row
    end
  else
    mark_download!
    preprocess!
    memo = _each do |row|
      parser.call(row).each do |virtual_row|
        virtual_row.row_hash = ::HashDigest.digest2 row
        if errata
          next if errata.rejects? virtual_row
          errata.correct! virtual_row
        end
        next if pre_select and !pre_select.call(virtual_row)
        next if pre_reject and pre_reject.call(virtual_row)
        unless streaming
          cache.push virtual_row
        end
        yield virtual_row
      end
    end
    unless streaming
      fully_cached!
    end
    memo
  end
  nil
end
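The difference between cached and streaming enumeration can be sketched without any downloads. `LazyRows`, its source block, and `load_count` are hypothetical names invented for this illustration; the block plays the role of the download-and-parse step, and clearing the cache before a reload is a choice made in this sketch.

```ruby
# A stand-in for the caching behavior; not part of RemoteTable.
class LazyRows
  include Enumerable

  attr_reader :load_count

  def initialize(streaming: false, &source)
    @streaming = streaming
    @source = source # called each time rows must be (re)loaded
    @cache = []
    @fully_cached = false
    @load_count = 0
  end

  def each
    if @fully_cached
      @cache.each { |row| yield row }
    else
      @load_count += 1
      @cache.clear # discard rows left by an interrupted pass
      @source.call.each do |row|
        @cache.push row unless @streaming
        yield row
      end
      @fully_cached = true unless @streaming
    end
    nil
  end
end

cached = LazyRows.new { [{ 'a' => 1 }, { 'a' => 2 }] }
2.times { cached.to_a }

streamed = LazyRows.new(streaming: true) { [{ 'a' => 1 }, { 'a' => 2 }] }
2.times { streamed.to_a }

puts cached.load_count
puts streamed.load_count
```

With caching, the second pass is served from memory; with streaming, every pass reloads the source, trading repeated loads for lower memory use.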
#free ⇒ nil
Clear the row cache in case it helps your GC.
# File 'lib/remote_table.rb', line 507
def free
  @fully_cached = false
  cache.clear
  nil
end
#parser ⇒ #call
An object that responds to #call(row) and returns an array of one or more rows.
# File 'lib/remote_table.rb', line 355
def parser
  @final_parser ||= (@parser || NullParser.new)
end
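Because a parser returns an array, it can turn one source row into several. The class below is a hypothetical example of that contract only (the class name, the column names, and the sample row are assumptions); it shows the #call shape, not how the library wraps the resulting rows.

```ruby
# Hypothetical parser: expands a row listing several years into one row per year.
class YearSplitter
  def call(row)
    row['years'].split(',').map do |year|
      { 'name' => row['name'], 'year' => year.strip }
    end
  end
end

parser = YearSplitter.new
virtual_rows = parser.call({ 'name' => 'Civic', 'years' => '2010, 2011' })
p virtual_rows
```

A parser that has nothing to split can simply return the row wrapped in a one-element array.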