Class: BioDSL::ReadTable

Inherits:
Object
Defined in:
lib/BioDSL/commands/read_table.rb

Overview

Read tabular data from one or more files.

read_table reads chosen rows and chosen columns (separated by a given delimiter) from a table in ASCII text format.

If no keys option is given and there is a comment line beginning with #, the fields in that line are used as keys. Subsequent lines beginning with # are ignored.

If a comment line beginning with # is present, the select and reject options can be used to choose which columns to read.
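
The header rule above can be illustrated in a few lines of plain Ruby. This is a sketch only: header_keys is a hypothetical helper, not part of BioDSL, and it assumes the default whitespace delimiter.

```ruby
# Hypothetical helper (not BioDSL's implementation): take the column keys
# from the first line that begins with '#'.
def header_keys(lines, delimiter = /\s+/)
  header = lines.find { |line| line.start_with?('#') }
  return nil unless header # no comment line, so no header keys

  # Drop the leading '#', then split the remainder into fields.
  header.sub(/\A#\s*/, '').strip.split(delimiter).map(&:to_sym)
end

lines = [
  '#Organism   Sequence    Count',
  'Human      ATACGTCAG   23524'
]

p header_keys(lines) # => [:Organism, :Sequence, :Count]
```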

Usage

read_table(input: <glob>[, first: <uint>|last: <uint>]
           [, select: <list>|reject: <list>][, keys: <list>]
           [, skip: <uint>][, delimiter: <string>])

Options

  • input <glob> - Input file or file glob expression.

  • first <uint> - Only read the first <uint> entries.

  • last <uint> - Only read the last <uint> entries.

  • select <list> - List of column indexes or header keys to read.

  • reject <list> - List of column indexes or header keys to skip.

  • keys <list> - List of key identifiers to use for each column.

  • skip <uint> - Number of initial lines to skip (default=0).

  • delimiter <string> - Delimiter to use for separating columns (default="\s+").

Examples

To read all entries from a file:

read_table(input: "test.tab")

To read all entries from a gzipped file:

read_table(input: "test.tab.gz")

To read only the first 10 records from a file:

read_table(input: "test.tab", first: 10)

To read in the last 10 records from a file:

read_table(input: "test.tab", last: 10)

To read all entries from multiple files:

read_table(input: "test1.tab,test2.tab")

To read entries from multiple files using a glob expression:

read_table(input: "*.tab")
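
How a comma-separated list or a glob expression might resolve to concrete files can be sketched with Ruby's standard Dir.glob. The helper name expand_input and the split-on-comma behaviour are assumptions made for illustration; the source does not show how BioDSL expands the input option.

```ruby
require 'tmpdir'
require 'fileutils'

# Hypothetical helper: split the input string on commas and expand each
# part as a glob pattern, keeping the order of the parts.
def expand_input(input)
  input.split(',').flat_map { |pattern| Dir.glob(pattern.strip).sort }
end

# Demonstrate against a throwaway directory.
Dir.mktmpdir do |dir|
  %w[test1.tab test2.tab notes.txt].each do |name|
    FileUtils.touch(File.join(dir, name))
  end

  files = expand_input(File.join(dir, '*.tab'))
  puts files.map { |file| File.basename(file) } # test1.tab, then test2.tab
end
```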

Consider the following table from the file test.tab:

#Organism   Sequence    Count
Human      ATACGTCAG   23524
Dog        AGCATGAC    2442
Mouse      GACTG       234
Cat        AAATGCA     2342

Reading the entire table will result in 4 records, one for each row, where the keys Organism, Sequence and Count are taken from the comment line prefixed with #:

BD.new.read_table(input: "test.tab").dump.run

{:Organism=>"Human", :Sequence=>"ATACGTCAG", :Count=>23524}
{:Organism=>"Dog", :Sequence=>"AGCATGAC", :Count=>2442}
{:Organism=>"Mouse", :Sequence=>"GACTG", :Count=>234}
{:Organism=>"Cat", :Sequence=>"AAATGCA", :Count=>2342}

However, if the first line is skipped using the skip option the keys will default to V0, V1, V2 … Vn:

BD.new.read_table(input: "test.tab", skip: 1).dump.run

{:V0=>"Human", :V1=>"ATACGTCAG", :V2=>23524}
{:V0=>"Dog", :V1=>"AGCATGAC", :V2=>2442}
{:V0=>"Mouse", :V1=>"GACTG", :V2=>234}
{:V0=>"Cat", :V1=>"AAATGCA", :V2=>2342}

To explicitly name the columns (or the keys) use the keys option:

BD.new.
read_table(input: "test.tab", skip: 1, keys: [:ORGANISM, :SEQ, :COUNT]).
dump.
run

{:ORGANISM=>"Human", :SEQ=>"ATACGTCAG", :COUNT=>23524}
{:ORGANISM=>"Dog", :SEQ=>"AGCATGAC", :COUNT=>2442}
{:ORGANISM=>"Mouse", :SEQ=>"GACTG", :COUNT=>234}
{:ORGANISM=>"Cat", :SEQ=>"AAATGCA", :COUNT=>2342}
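
The key naming shown in the last two examples can be mimicked in plain Ruby. This is a sketch under assumptions: to_record is a hypothetical helper, and the conversion of all-digit fields to Integer is inferred from the sample output above rather than taken from the BioDSL source.

```ruby
# Hypothetical helper (not BioDSL's implementation): turn one table line
# into a record, using the given keys or falling back to V0, V1, ... Vn.
def to_record(line, keys = nil, delimiter = /\s+/)
  fields = line.strip.split(delimiter)
  keys ||= fields.each_index.map { |i| :"V#{i}" }

  # Assumption: fields made up only of digits become Integers.
  values = fields.map { |f| f.match?(/\A-?\d+\z/) ? f.to_i : f }

  keys.zip(values).to_h
end

line = 'Human      ATACGTCAG   23524'

p to_record(line)                         # keys :V0, :V1, :V2
p to_record(line, %i[ORGANISM SEQ COUNT]) # keys :ORGANISM, :SEQ, :COUNT
```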

It is possible to read only a subset of columns by using the select option, which takes a list of column numbers (the first column is designated 0) or header keys (which requires a header). So to read in only the sequence and the count, so that the count comes before the sequence, do:

BD.new.read_table(input: "test.tab", skip: 1, select: [2, 1]).dump.run

{:V0=>23524, :V1=>"ATACGTCAG"}
{:V0=>2442, :V1=>"AGCATGAC"}
{:V0=>234, :V1=>"GACTG"}
{:V0=>2342, :V1=>"AAATGCA"}

Alternatively, if a header line was present in the file:

#Organism  Sequence   Count

Then the header keys can be used:

BD.new.
read_table(input: "test.tab", select: [:Count, :Sequence]).
dump.
run

{:Count=>23524, :Sequence=>"ATACGTCAG"}
{:Count=>2442, :Sequence=>"AGCATGAC"}
{:Count=>234, :Sequence=>"GACTG"}
{:Count=>2342, :Sequence=>"AAATGCA"}

Likewise, it is possible to reject specified columns from being read using the reject option:

BD.new.read_table(input: "test.tab", skip: 1, reject: [2, 1]).dump.run

{:V0=>"Human"}
{:V0=>"Dog"}
{:V0=>"Mouse"}
{:V0=>"Cat"}

And again, the header keys can be used if a header is present:

BD.new.
read_table(input: "test.tab", reject: [:Count, :Sequence]).
dump.
run

{:Organism=>"Human"}
{:Organism=>"Dog"}
{:Organism=>"Mouse"}
{:Organism=>"Cat"}
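
The column choice made by select and reject can be sketched as a mapping from the option value to a list of column indexes. pick_columns is a hypothetical helper written for illustration; it is not how BioDSL is known to implement these options.

```ruby
# Hypothetical helper (not BioDSL's code): resolve select/reject into the
# list of column indexes to keep, in output order.
def pick_columns(header, select: nil, reject: nil)
  # Accept either a 0-based Integer index or a header key.
  to_index = lambda do |column|
    column.is_a?(Integer) ? column : header.index(column.to_sym)
  end

  if select
    select.map(&to_index) # order follows the select list
  elsif reject
    (0...header.size).to_a - reject.map(&to_index) # original order, minus rejected
  else
    (0...header.size).to_a
  end
end

header = %i[Organism Sequence Count]
row    = ['Human', 'ATACGTCAG', 23524]

p row.values_at(*pick_columns(header, select: %i[Count Sequence])) # => [23524, "ATACGTCAG"]
p row.values_at(*pick_columns(header, reject: [2, 1]))             # => ["Human"]
```

Because select preserves the order of its list, it can also reorder columns, as in the count-before-sequence example above.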

Constant Summary

STATS =
%i(records_in records_out)

Instance Method Summary

Constructor Details

#initialize(options) ⇒ ReadTable

Constructor for ReadTable.

Parameters:

  • options (Hash)

    Options hash.

Options Hash (options):

  • :input (String)
  • :first (Integer)
  • :last (Integer)
  • :keys (Array)
  • :skip (Integer)
  • :delimiter (String)
  • :select (Array)
  • :reject (Array)


# File 'lib/BioDSL/commands/read_table.rb', line 192

def initialize(options)
  @options = options
  @keys    = options[:keys] ? options[:keys].map(&:to_sym) : nil
  @skip    = options[:skip] || 0
  @buffer  = []

  check_options
end

Instance Method Details

#lmb ⇒ Proc

Return command lambda for ReadTable

Returns:

  • (Proc)

    Command lambda.



# File 'lib/BioDSL/commands/read_table.rb', line 204

def lmb
  lambda do |input, output, status|
    status_init(status, STATS)

    process_input(input, output)

    case
    when @options[:first] then read_first(output)
    when @options[:last]  then read_last(output)
    else read_all(output)
    end
  end
end
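
The three-way choice made by the lambda above (first, else last, else everything) can be sketched in plain Ruby. take_records is a hypothetical helper, and the bounded buffer used for last is an assumption about how such a reader might avoid holding every record; the source only shows that an @buffer array exists.

```ruby
# Hypothetical helper (not BioDSL's code): apply the first/last options
# to any enumerable of records. As in the case statement above, first
# takes precedence when both options are given.
def take_records(records, first: nil, last: nil)
  if first
    records.first(first) # stops enumerating after `first` records
  elsif last
    buffer = []
    records.each do |record|
      buffer << record
      buffer.shift if buffer.size > last # keep at most `last` records
    end
    buffer
  else
    records.to_a
  end
end

p take_records(1..10, first: 3) # => [1, 2, 3]
p take_records(1..10, last: 3)  # => [8, 9, 10]
```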