Class: BioDSL::MergeTable

Inherits:
Object
Defined in:
lib/BioDSL/commands/merge_table.rb

Overview

Merge records on a given key with tabular data from one or more files.

merge_table reads one or more tabular files and merges their rows into any records in the stream that have an identical value for a given key. The values for the given key must be unique in the tabular files, but not necessarily in the stream.

Consult read_table for details on how the tabular files are read.
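The merge semantics described above can be sketched in plain Ruby without BioDSL. This is a minimal illustration, not the library's implementation: build_lookup and merge_stream are hypothetical helper names, and raising on a duplicate table key is an assumption about how the uniqueness requirement might be enforced.

```ruby
# Hypothetical helpers illustrating the merge semantics; not BioDSL code.

# Index table rows by the value stored under +key+. Raising on a duplicate
# value is an assumption made for this sketch.
def build_lookup(rows, key)
  rows.each_with_object({}) do |row, table|
    value = row[key]
    raise ArgumentError, "duplicate table key: #{value.inspect}" if table.key?(value)

    table[value] = row
  end
end

# Merge each stream record with the table row that shares its key value.
# Records without a matching row pass through unchanged.
def merge_stream(records, table, key)
  records.map do |record|
    row = table[record[key]]
    row ? record.merge(row) : record
  end
end

table_rows = [{ID: 1, COUNT: 5423}, {ID: 2, COUNT: 34}]
stream     = [{ID: 1, ORGANISM: 'parrot'},
              {ID: 1, ORGANISM: 'macaw'}, # a repeated key in the stream is fine
              {ID: 9, ORGANISM: 'eel'}]   # no matching table row

merged = merge_stream(stream, build_lookup(table_rows, :ID), :ID)
merged.each { |record| p record }
```

Both stream records with ID 1 receive the same COUNT value, while the record with ID 9 is emitted as it came in.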

The stats for merge_table include the following values:

  • rows_total - total number of table rows.

  • rows_matched - number of table rows with the given key.

  • rows_unmatched - number of table rows without the given key.

  • merged - number of records that were merged.

  • non_merged - number of records that were not merged.

Usage

merge_table(<input: <glob>>, <key: <string>>[, columns: <list>
            [, keys: <list>[, skip: <uint>[, delimiter: <string>]]]])

Options

  • input <glob> - Input file or file glob expression.

  • key <string> - Key used to merge.

  • columns <list> - List of columns to read in that order.

  • keys <list> - List of key identifiers to use for each column.

  • skip <uint> - Number of initial lines to skip (default=0).

  • delimiter <string> - Delimiter to use for separating columns (default="\s+").
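How skip, delimiter and keys might interact when a table is parsed can be sketched in plain Ruby. This is not how BioDSL reads tables (see read_table for that); parse_rows is a hypothetical helper, treating the delimiter as a regular expression source is an assumption, and unlike the library output shown in the examples this sketch leaves every value as a string.

```ruby
# Hypothetical parser illustrating skip, delimiter and keys; not BioDSL code.
def parse_rows(text, keys:, skip: 0, delimiter: '\s+')
  splitter = Regexp.new(delimiter) # assumption: delimiter is a regex source
  names    = keys.map(&:to_sym)

  # Drop the first +skip+ lines, then pair each field with its key.
  text.lines.drop(skip).map do |line|
    names.zip(line.chomp.split(splitter)).to_h
  end
end

text = "#ID COUNT\n1   5423\n2   34\n"
rows = parse_rows(text, keys: %w[ID COUNT], skip: 1)
rows.each { |row| p row }
```

Here skip: 1 discards the header line, and the supplied keys name the two columns in order.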

Examples

Consider the following two files:

test1.tab:

#ID ORGANISM
1   parrot
2   eel
3   platypus
4   beetle

test2.tab:

#ID COUNT
1   5423
2   34
3   2423
4   234

We can merge the data with merge_table like this:

BD.new.
read_table(input: "test1.tab").
merge_table(input: "test2.tab", key: :ID).
dump.
run

{:ID=>1, :ORGANISM=>"parrot", :COUNT=>5423}
{:ID=>2, :ORGANISM=>"eel", :COUNT=>34}
{:ID=>3, :ORGANISM=>"platypus", :COUNT=>2423}
{:ID=>4, :ORGANISM=>"beetle", :COUNT=>234}

Constant Summary

STATS =
%i(records_in records_out rows_total rows_matched rows_unmatched
merged non_merged)

Instance Method Summary

Constructor Details

#initialize(options) ⇒ MergeTable

Constructor for MergeTable.

Options Hash (options):

  • :input (String)

    Input glob expression.

  • :key (String, Symbol)

    Key used to merge.

  • :keys (Array)

    List of key identifiers to use for each column.

  • :columns (Array)

    List of columns to read in that order.

  • :skip (Integer)

    Number of initial lines to skip.

  • :delimiter (String)

    Delimiter to use for separating columns.



# File 'lib/BioDSL/commands/merge_table.rb', line 117

def initialize(options)
  @options = options

  check_options
  defaults

  @table = {}
  @key   = @options[:key].to_sym
  @keys  = options[:keys] ? @options[:keys].map(&:to_sym) : nil
end
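The constructor converts :key and :keys to symbols, which is why the key can be passed as either a String or a Symbol. A stand-alone illustration of that conversion follows; normalize is a hypothetical name used only for this sketch.

```ruby
# Stand-alone illustration of the symbol conversion done in the constructor.
# +normalize+ is a hypothetical helper, not part of BioDSL.
def normalize(options)
  key  = options[:key].to_sym
  keys = options[:keys] ? options[:keys].map(&:to_sym) : nil
  [key, keys]
end

p normalize(key: 'ID')                    # keys stays nil when not given
p normalize(key: :ID, keys: %w[ID COUNT]) # strings become symbols
```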

Instance Method Details

#lmb ⇒ Proc

Return command lambda for merge_table.



# File 'lib/BioDSL/commands/merge_table.rb', line 131

def lmb
  lambda do |input, output, status|
    status_init(status, STATS)

    parse_input_tables

    input.each do |record|
      @status[:records_in] += 1

      if record[@key] && @table[record[@key]]
        @status[:merged] += 1
        record = record.merge(@table[record[@key]])
      else
        @status[:non_merged] += 1
      end

      output << record
      @status[:records_out] += 1
    end

    @status[:rows_total] = @status[:rows_matched] + @status[:rows_unmatched]
  end
end
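The per-record loop above can be exercised outside BioDSL with arrays standing in for the input and output streams. This is a sketch under stated assumptions: the table shape (key value mapped to a row hash) is inferred from the record.merge call, and Hash.new(0) is a stand-in for whatever status_init sets up.

```ruby
# Stand-alone sketch of the per-record loop; arrays stand in for the streams.
table  = {1 => {ID: 1, COUNT: 5423}} # assumed shape: key value => table row
status = Hash.new(0)                 # stand-in for status_init; counters start at 0
key    = :ID

merge = lambda do |input, output|
  input.each do |record|
    status[:records_in] += 1

    # Merge only when the record has the key and the table has a matching row.
    if record[key] && table[record[key]]
      status[:merged] += 1
      record = record.merge(table[record[key]])
    else
      status[:non_merged] += 1
    end

    output << record
    status[:records_out] += 1
  end
end

output = []
merge.call([{ID: 1, ORGANISM: 'parrot'}, {ID: 7, ORGANISM: 'eel'}, {SEQ: 'ATCG'}], output)
p output
p status
```

Every record is emitted, so records_in equals records_out; a record counts as non_merged either because it lacks the key or because no table row matches its value.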