Class: BioDSL::ReadTable
- Inherits: Object
  - Object
  - BioDSL::ReadTable
- Defined in: lib/BioDSL/commands/read_table.rb
Overview
Read tabular data from one or more files.
Tabular input can be read with read_table, which reads chosen rows and chosen columns (separated by a given delimiter) from a table in ASCII text format.
If no keys option is given and there is a comment line beginning with #, the fields on that line are used as keys. Subsequent lines beginning with # are ignored.
If a comment line beginning with # is present, the select and reject options can be used to choose which columns to read.
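The header rule described above can be sketched in plain Ruby. This is a minimal illustration and not BioDSL's implementation: the helper name parse_table_lines is hypothetical, the input is assumed to be whitespace-delimited, and values are left as strings.

```ruby
# Hypothetical helper (not part of BioDSL): illustrates the header rule.
# The first line starting with '#' supplies the keys; later '#' lines are ignored.
def parse_table_lines(lines)
  keys = nil
  records = []

  lines.each do |line|
    line = line.chomp
    next if line.empty?

    if line.start_with?('#')
      # Only the first comment line is treated as a header.
      keys ||= line.sub(/\A#\s*/, '').split(/\s+/).map(&:to_sym)
      next
    end

    fields = line.split(/\s+/)
    # Fall back to V0, V1, ... when no header line was seen.
    row_keys = keys || fields.each_index.map { |i| :"V#{i}" }
    records << row_keys.zip(fields).to_h
  end

  records
end

lines = [
  '#Organism Sequence Count',
  'Human ATACGTCAG 23524',
  '# a later comment line, ignored',
  'Dog AGCATGAC 2442'
]
parse_table_lines(lines).each { |record| p record }
```

Without a header line, the same helper names the columns V0, V1, and so on.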
Usage
read_table(input: <glob>[, first: <uint>|last: <uint>]
           [, select: <list>|, reject: <list>][, keys: <list>]
           [, skip: <uint>][, delimiter: <string>])
Options
- input <glob> - Input file or file glob expression.
- first <uint> - Only read in the first <uint> entries.
- last <uint> - Only read in the last <uint> entries.
- select <list> - List of column indexes or header keys to read.
- reject <list> - List of column indexes or header keys to skip.
- keys <list> - List of key identifiers to use for each column.
- skip <uint> - Number of initial lines to skip (default=0).
- delimiter <string> - Delimiter to use for separating columns (default="\s+").
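To make the delimiter option concrete, here is a minimal plain-Ruby sketch of splitting a line into columns. It is not BioDSL's code: the helper name split_fields is hypothetical, and it assumes the delimiter string is compiled to a regular expression, which the default of "\s+" suggests but this page does not state.

```ruby
# Hypothetical helper (not part of BioDSL). Assumes the delimiter string is
# compiled to a regular expression before the line is split.
def split_fields(line, delimiter = '\s+')
  line.chomp.split(Regexp.new(delimiter))
end

# Default delimiter: any run of whitespace (tabs or spaces).
p split_fields("Human\tATACGTCAG   23524\n")

# Explicit delimiter.
p split_fields('Human;ATACGTCAG;23524', ';')
```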
Examples
To read all entries from a file:
read_table(input: "test.tab")
To read all entries from a gzipped file:
read_table(input: "test.tab.gz")
To read in only 10 records from a file:
read_table(input: "test.tab", first: 10)
To read in the last 10 records from a file:
read_table(input: "test.tab", last: 10)
To read all entries from multiple files:
read_table(input: "test1.tab,test2.tab")
To read entries from multiple files using a glob expression:
read_table(input: "*.tab")
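The two multi-file forms above (a comma-separated list and a glob expression) can be resolved to file names along these lines. This is only a sketch using Ruby's standard Dir.glob. The helper name expand_input is hypothetical, and how BioDSL resolves the input option internally is not shown on this page.

```ruby
require 'tmpdir'
require 'fileutils'

# Hypothetical helper (not part of BioDSL): expand an input string that may hold
# comma-separated entries, each of which may be a glob expression.
def expand_input(input)
  input.split(',').flat_map { |pattern| Dir.glob(pattern.strip).sort }
end

# Demonstrate against a scratch directory so the example is self-contained.
Dir.mktmpdir do |dir|
  %w[test1.tab test2.tab notes.txt].each do |name|
    FileUtils.touch(File.join(dir, name))
  end

  Dir.chdir(dir) do
    p expand_input('test1.tab,test2.tab')
    p expand_input('*.tab')
  end
end
```

Both calls resolve to the same two .tab files here; notes.txt is not matched.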
Consider the following table from the file test.tab:
#Organism Sequence Count
Human ATACGTCAG 23524
Dog AGCATGAC 2442
Mouse GACTG 234
Cat AAATGCA 2342
Reading the entire table will result in 4 records, one for each row, where the keys Organism, Sequence and Count are taken from the comment line prefixed with #:
BD.new.read_table(input: "test.tab").dump.run
{:Organism=>"Human", :Sequence=>"ATACGTCAG", :Count=>23524}
{:Organism=>"Dog", :Sequence=>"AGCATGAC", :Count=>2442}
{:Organism=>"Mouse", :Sequence=>"GACTG", :Count=>234}
{:Organism=>"Cat", :Sequence=>"AAATGCA", :Count=>2342}
However, if the first line is skipped using the skip option, the keys will default to V0, V1, V2 … Vn:
BD.new.read_table(input: "test.tab", skip: 1).dump.run
{:V0=>"Human", :V1=>"ATACGTCAG", :V2=>23524}
{:V0=>"Dog", :V1=>"AGCATGAC", :V2=>2442}
{:V0=>"Mouse", :V1=>"GACTG", :V2=>234}
{:V0=>"Cat", :V1=>"AAATGCA", :V2=>2342}
To explicitly name the columns (or the keys), use the keys option:
BD.new.
read_table(input: "test.tab", skip: 1, keys: [:ORGANISM, :SEQ, :COUNT]).
dump.
run
{:ORGANISM=>"Human", :SEQ=>"ATACGTCAG", :COUNT=>23524}
{:ORGANISM=>"Dog", :SEQ=>"AGCATGAC", :COUNT=>2442}
{:ORGANISM=>"Mouse", :SEQ=>"GACTG", :COUNT=>234}
{:ORGANISM=>"Cat", :SEQ=>"AAATGCA", :COUNT=>2342}
It is possible to select a subset of columns to read by using the select option, which takes a comma-separated list of column numbers (the first column is designated 0) or header keys (which requires a header) as its argument. So to read in only the sequence and the count, with the count coming before the sequence, do:
BD.new.read_table(input: "test.tab", skip: 1, select: [2, 1]).dump.run
{:V0=>23524, :V1=>"ATACGTCAG"}
{:V0=>2442, :V1=>"AGCATGAC"}
{:V0=>234, :V1=>"GACTG"}
{:V0=>2342, :V1=>"AAATGCA"}
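The output above shows two effects of select with column indexes: the columns come out in the order requested, and the keys are numbered from V0 again. A minimal plain-Ruby sketch of that behaviour follows. The helper name select_columns is hypothetical and this is not BioDSL's implementation.

```ruby
# Hypothetical helper (not part of BioDSL): pick columns by index, in the order
# given, then name the surviving columns V0, V1, ...
def select_columns(fields, indexes)
  picked = fields.values_at(*indexes)
  picked.each_with_index.map { |value, i| [:"V#{i}", value] }.to_h
end

row = ['Human', 'ATACGTCAG', 23524]
p select_columns(row, [2, 1])
```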
Alternatively, if a header line was present in the file:
#Organism Sequence Count
Then the header keys can be used:
BD.new.
read_table(input: "test.tab", skip: 1, select: [:Count, :Sequence]).
dump.
run
{:Count=>23524, :Sequence=>"ATACGTCAG"}
{:Count=>2442, :Sequence=>"AGCATGAC"}
{:Count=>234, :Sequence=>"GACTG"}
{:Count=>2342, :Sequence=>"AAATGCA"}
Likewise, it is possible to reject specified columns from being read using the reject option:
BD.new.read_table(input: "test.tab", skip: 1, reject: [2, 1]).dump.run
{:V0=>"Human"}
{:V0=>"Dog"}
{:V0=>"Mouse"}
{:V0=>"Cat"}
And again, the header keys can be used if a header is present:
BD.new.
read_table(input: "test.tab", skip: 1, reject: [:Count, :Sequence]).
dump.
run
{:Organism=>"Human"}
{:Organism=>"Dog"}
{:Organism=>"Mouse"}
{:Organism=>"Cat"}
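Both reject forms shown above (column indexes and header keys) can be sketched with one small plain-Ruby helper. The name reject_columns is hypothetical and this is an illustration under that assumption, not BioDSL's code.

```ruby
# Hypothetical helper (not part of BioDSL): drop columns listed either by
# zero-based index or by key, keeping the remaining key/value pairs in order.
def reject_columns(keys, fields, rejected)
  kept = keys.each_index.reject do |i|
    rejected.include?(i) || rejected.include?(keys[i])
  end
  kept.map { |i| [keys[i], fields[i]] }.to_h
end

row = ['Human', 'ATACGTCAG', 23524]

# Rejecting by index, with default V-keys.
p reject_columns(%i[V0 V1 V2], row, [2, 1])

# Rejecting by header key.
p reject_columns(%i[Organism Sequence Count], row, %i[Count Sequence])
```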
Constant Summary
- STATS = %i(records_in records_out)
Instance Method Summary
- #initialize(options) ⇒ ReadTable (constructor)
  Constructor for ReadTable.
- #lmb ⇒ Proc
  Return command lambda for ReadTable.
Constructor Details
#initialize(options) ⇒ ReadTable
Constructor for ReadTable.
# File 'lib/BioDSL/commands/read_table.rb', line 192
def initialize(options)
  @options = options
  @keys = options[:keys] ? options[:keys].map(&:to_sym) : nil
  @skip = options[:skip] || 0
  @buffer = []
end
Instance Method Details
#lmb ⇒ Proc
Return command lambda for ReadTable.
# File 'lib/BioDSL/commands/read_table.rb', line 204
def lmb
  lambda do |input, output, status|
    status_init(status, STATS)
    process_input(input, output)

    case
    when @options[:first] then read_first(output)
    when @options[:last] then read_last(output)
    else read_all(output)
    end
  end
end
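The case expression in lmb checks first before last, so first would take precedence if both options were given. The sketch below mirrors that dispatch on an in-memory array. The helper pick_rows is a hypothetical stand-in for read_first, read_last and read_all, whose bodies are not shown on this page.

```ruby
# Hypothetical stand-in for the read_first / read_last / read_all dispatch.
# It works on an in-memory array instead of reading files.
def pick_rows(rows, options = {})
  case
  when options[:first] then rows.first(options[:first])
  when options[:last]  then rows.last(options[:last])
  else rows
  end
end

rows = %w[Human Dog Mouse Cat]
p pick_rows(rows, first: 2)          # ["Human", "Dog"]
p pick_rows(rows, last: 2)           # ["Mouse", "Cat"]
p pick_rows(rows)                    # all four rows
p pick_rows(rows, first: 1, last: 3) # ["Human"], first is checked before last
```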