Class: BioDSL::Grab

Inherits:
Object
Defined in:
lib/BioDSL/commands/grab.rb

Overview

Grab records in stream.

grab selects records from the stream by matching patterns against keys or values. grab is BioDSL’s equivalent of Unix grep, but it is much more versatile.

NB! When chaining multiple grab commands, use the most restrictive grab first to get the best performance.

NB! Avoid using exact with long values because of memory use.
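
The note on chaining order can be illustrated with a rough plain-Ruby sketch. It does not use BioDSL: the sample records, the two filters and the count_inspected helper are made up for illustration, and each stage simply counts how many records it has to inspect.

```ruby
# 1000 hypothetical records; every 100th one is named "human_...".
records = (1..1000).map do |i|
  name = (i % 100).zero? ? "human_#{i}" : "other_#{i}"
  { SEQ_NAME: name, SEQ_LEN: i }
end

# Hypothetical helper: run two filter stages in order and count how many
# records the stages inspect in total.
def count_inspected(records, first, second)
  inspected = 0
  after_first = records.select do |record|
    inspected += 1
    first.call(record)
  end
  after_second = after_first.select do |record|
    inspected += 1
    second.call(record)
  end
  [after_second.size, inspected]
end

restrictive = ->(record) { record[:SEQ_NAME].start_with?('human') } # keeps 10
permissive  = ->(record) { record[:SEQ_LEN] > 30 }                  # keeps 970

restrictive_first = count_inspected(records, restrictive, permissive)
permissive_first  = count_inspected(records, permissive, restrictive)

p restrictive_first # → [10, 1010]
p permissive_first  # → [10, 1970]
```

Both orders return the same 10 records, but with the restrictive filter first the second stage only sees the few records that survive it.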

Usage

grab(<select: <pattern>|select_file: <file>|reject: <pattern>|
     reject_file: <file>|evaluate: <expression>|exact: <bool>>
     [, keys: <list>|keys_only: <bool>|values_only: <bool>|
     ignore_case: <bool>])

Options

  • select: <pattern> - Select records matching <pattern> which is a regex or an exact match if the exact option is set.

  • select_file: <file> - File with one <pattern> per line to select.

  • reject: <pattern> - Reject records matching <pattern> which is a regex or an exact match if the exact option is set.

  • reject_file: <file> - File with one <pattern> per line to reject.

  • evaluate: <expression> - Select records where <expression> is true.

  • exact: <bool> - Turn on exact matching for improved speed.

  • keys: <list> - Comma-separated list or array of keys whose values are searched.

  • keys_only: <bool> - Only grab for keys.

  • values_only: <bool> - Only grab for values.

  • ignore_case: <bool> - Ignore case when grabbing with regex (does not work with evaluate and exact).

Examples

To grab all records in the stream that mention the pattern ‘human’, pipe the data stream through grab like this:

grab(select: "human")

This will search for the pattern ‘human’ in all keys and all values. The select option also accepts an array of patterns, so in order to match any one of several patterns do:

grab(select: ["human", "mouse"])

Since patterns are regex (regular expressions) unless the exact option is set, more flexible matching is also possible. If you want to grab records with the sequence ATCG or GCTA you can do this:

grab(select: "ATCG|GCTA")

Or if you want to grab sequences beginning with ATCG:

grab(select: "^ATCG")
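
As a rough illustration of what matching a regex against all keys and all values means, here is a plain-Ruby sketch. It is not BioDSL's implementation; the record_matches? helper and the sample records are hypothetical.

```ruby
# Hypothetical helper (not BioDSL API): true if any pattern matches any
# key or any value of the record. Patterns are treated as regexes.
def record_matches?(record, patterns)
  regexes = Array(patterns).map { |pattern| Regexp.new(pattern) }

  record.any? do |key, value|
    regexes.any? { |re| re.match?(key.to_s) || re.match?(value.to_s) }
  end
end

records = [
  { SEQ_NAME: 'human_1', SEQ: 'ATCGAATT' },
  { SEQ_NAME: 'mouse_1', SEQ: 'TTGCTAAA' },
  { SEQ_NAME: 'yeast_1', SEQ: 'TTTTAAAA' }
]

# Alternation: ATCG or GCTA anywhere in a key or value.
either = records.select { |r| record_matches?(r, 'ATCG|GCTA') }

# Anchored: a key or value must begin with ATCG.
anchored = records.select { |r| record_matches?(r, '^ATCG') }

# An array of patterns behaves like an alternation.
named = records.select { |r| record_matches?(r, %w[human mouse]) }

p either.map { |r| r[:SEQ_NAME] }   # → ["human_1", "mouse_1"]
p anchored.map { |r| r[:SEQ_NAME] } # → ["human_1"]
p named.map { |r| r[:SEQ_NAME] }    # → ["human_1", "mouse_1"]
```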

It is also possible to use the select_file option to load patterns from a file with one pattern per line.

grab(select_file: "patterns.txt")

If you want the opposite result - to find all records that do not match a pattern - use the reject option:

grab(reject: "human")

Similar to select_file there is a reject_file option to load patterns from a file, and use any of these patterns to reject records:

grab(reject_file: "patterns.txt")
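
The relationship between the file-based options can be sketched in plain Ruby. This is only an illustration under stated assumptions: load_patterns and any_match? are hypothetical helpers, a temporary file stands in for patterns.txt, and skipping blank lines is a choice made for this sketch, not documented BioDSL behavior.

```ruby
require 'tempfile'

# Hypothetical helper (not BioDSL API): read one pattern per line.
# Skipping blank lines is an assumption of this sketch.
def load_patterns(path)
  File.readlines(path, chomp: true).reject(&:empty?)
end

# True if any regex matches any key or value of the record.
def any_match?(record, regexes)
  record.any? do |key, value|
    regexes.any? { |re| re.match?(key.to_s) || re.match?(value.to_s) }
  end
end

# Stand-in for patterns.txt so the sketch is self-contained.
file = Tempfile.new('patterns')
file.write("human\nmouse\n")
file.close

regexes = load_patterns(file.path).map { |pattern| Regexp.new(pattern) }

records = [
  { SEQ_NAME: 'human_1' },
  { SEQ_NAME: 'mouse_1' },
  { SEQ_NAME: 'yeast_1' }
]

# Selecting keeps the matches; rejecting keeps everything else.
selected          = records.select { |r| any_match?(r, regexes) }
kept_after_reject = records.reject { |r| any_match?(r, regexes) }

file.unlink

p selected.map { |r| r[:SEQ_NAME] }          # → ["human_1", "mouse_1"]
p kept_after_reject.map { |r| r[:SEQ_NAME] } # → ["yeast_1"]
```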

If you want to search the record keys only, e.g. to grab all records containing the key SEQ, you can use the keys_only option. This prevents SEQ from being matched in any record value; since SEQ is a not uncommon peptide sequence, you could otherwise get unwanted records. It also increases speed, since only the keys are searched:

grab(select: "SEQ", keys_only: true)

However, if you are interested in grabbing the peptide sequence SEQ and not the SEQ key, just use the values_only option:

grab(select: "SEQ", values_only: true)
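
The difference between searching keys and searching values can be sketched in plain Ruby. The grab_sketch method below is a hypothetical stand-in, not BioDSL code; it only borrows the option names keys_only and values_only from the Options section.

```ruby
# Hypothetical stand-in for grab (not BioDSL code): select records where
# the pattern matches a key and/or a value, depending on the flags.
def grab_sketch(records, pattern, keys_only: false, values_only: false)
  regex = Regexp.new(pattern)

  records.select do |record|
    record.any? do |key, value|
      (!values_only && regex.match?(key.to_s)) ||
        (!keys_only && regex.match?(value.to_s))
    end
  end
end

records = [
  { SEQ_NAME: 'pep1', SEQ: 'MKVLA' },    # SEQ occurs in keys only
  { NAME: 'pep2', PEPTIDE: 'AASEQKK' }   # SEQ occurs in a value only
]

both        = grab_sketch(records, 'SEQ')
keys_hits   = grab_sketch(records, 'SEQ', keys_only: true)
values_hits = grab_sketch(records, 'SEQ', values_only: true)

p both.size          # → 2
p keys_hits.first    # → {:SEQ_NAME=>"pep1", :SEQ=>"MKVLA"} (format varies by Ruby version)
p values_hits.size   # → 1
```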

Also, if you want to grab for certain key/value pairs, you can use the keys option to supply a comma-separated list or an array of keys whose values will then be searched. This is handy if your records contain large genomic sequences and you don’t want to search the entire sequence for e.g. the organism name - it is much faster to tell grab which keys to search the value for:

grab(select: "human", keys: :SEQ_NAME)
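
A minimal plain-Ruby sketch of restricting the search to named keys follows. Both helpers are hypothetical and only mirror the documented input forms (a single key, a comma-separated list, or an array); how BioDSL normalizes keys internally is not shown in this page.

```ruby
# Hypothetical helper: accept a Symbol, a comma-separated String or an
# Array and return an Array of Symbols.
def normalize_keys(keys)
  list = keys.is_a?(String) ? keys.split(',') : Array(keys)
  list.map { |key| key.to_s.strip.to_sym }
end

# Hypothetical stand-in for grab with the keys option: only the values
# stored under the listed keys are searched.
def grab_keys_sketch(records, pattern, keys)
  regex  = Regexp.new(pattern)
  wanted = normalize_keys(keys)

  records.select do |record|
    wanted.any? { |key| record.key?(key) && regex.match?(record[key].to_s) }
  end
end

records = [
  { SEQ_NAME: 'human_chr1', SEQ: 'ATCG' },
  { SEQ_NAME: 'mouse_chr1', SEQ: 'ATCG', NOTE: 'similar to human' }
]

name_hits = grab_keys_sketch(records, 'human', :SEQ_NAME)
list_hits = grab_keys_sketch(records, 'human', 'SEQ_NAME, NOTE')

p name_hits.map { |r| r[:SEQ_NAME] } # → ["human_chr1"]
p list_hits.map { |r| r[:SEQ_NAME] } # → ["human_chr1", "mouse_chr1"]
```

The large SEQ values are never touched in either call, which is where the speed-up described above would come from.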

You can also use the evaluate option to grab records that fulfill an expression. So to grab all records with a sequence length greater than 30:

grab(evaluate: 'SEQ_LEN > 30')

If you want to grab all records containing the pattern ‘human’ and where the sequence length is greater than 30, you do this by running the stream through grab twice:

grab(select: 'human').grab(evaluate: 'SEQ_LEN > 30')
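
To make the evaluate step and the two-pass chaining concrete, here is a plain-Ruby sketch. It is an assumption-laden illustration: evaluate_sketch is hypothetical and only understands expressions of the form "<KEY> <operator> <number>", which is much narrower than whatever BioDSL actually evaluates.

```ruby
# Hypothetical evaluator (not BioDSL code) for expressions of the form
# "<KEY> <op> <number>", e.g. 'SEQ_LEN > 30'.
def evaluate_sketch(record, expression)
  comparison = /\A\s*(\w+)\s*(<=|>=|==|!=|<|>)\s*(-?\d+(?:\.\d+)?)\s*\z/
  m = comparison.match(expression)
  raise ArgumentError, "unsupported expression: #{expression}" unless m

  key = m[1].to_sym
  return false unless record.key?(key)

  # Compare numerically by sending the operator to the record's value.
  record[key].to_f.public_send(m[2], m[3].to_f)
end

records = [
  { SEQ_NAME: 'human_a', SEQ_LEN: 25 },
  { SEQ_NAME: 'human_b', SEQ_LEN: 31 },
  { SEQ_NAME: 'mouse_a', SEQ_LEN: 40 }
]

# Single pass: sequence length greater than 30.
long = records.select { |r| evaluate_sketch(r, 'SEQ_LEN > 30') }

# Two passes, pattern first and expression second.
human      = records.select { |r| r.values.any? { |v| /human/.match?(v.to_s) } }
long_human = human.select { |r| evaluate_sketch(r, 'SEQ_LEN > 30') }

p long.map { |r| r[:SEQ_NAME] }       # → ["human_b", "mouse_a"]
p long_human.map { |r| r[:SEQ_NAME] } # → ["human_b"]
```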

Finally, it is possible to grab for exact patterns using the exact option. This is much faster than the default regex pattern grabbing because with exact the patterns are used to create a lookup hash for instant matching of keys or values. This is useful if you e.g. have a file with ID numbers and you want to grab matching records from the stream:

grab(select_file: "ids.txt", keys: :ID, exact: true)
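
The lookup idea behind exact matching can be sketched in plain Ruby. This is not BioDSL's code: a Set stands in for the lookup hash, and the ID list stands in for the lines of ids.txt. The sketch also shows that exact matching requires the whole value to equal a pattern, whereas a regex matches substrings.

```ruby
require 'set'

ids = %w[ID001 ID007] # stand-in for the lines of ids.txt

records = [
  { ID: 'ID001',  SEQ: 'ATCG' },
  { ID: 'ID0010', SEQ: 'GGCC' },
  { ID: 'ID007',  SEQ: 'TTAA' }
]

# Exact: one hash lookup per record, independent of the number of IDs.
lookup     = ids.to_set
exact_hits = records.select { |r| lookup.include?(r[:ID].to_s) }

# Regex: every pattern is tried against the value, and substrings match.
regexes    = ids.map { |id| Regexp.new(id) }
regex_hits = records.select { |r| regexes.any? { |re| re.match?(r[:ID].to_s) } }

p exact_hits.map { |r| r[:ID] } # → ["ID001", "ID007"]
p regex_hits.map { |r| r[:ID] } # → ["ID001", "ID0010", "ID007"]
```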


Constant Summary

STATS = %i(records_in records_out)

Instance Method Summary

Constructor Details

#initialize(options) ⇒ Grab

Constructor for the Grab class.

Parameters:

  • options (Hash)

    Options hash.

Options Hash (options):

  • :select (String, Array)

    Patterns or list of patterns to select records.

  • :select_file (String)

    File path with patterns, one per line, to select records.

  • :reject (String, Array)

    Patterns or list of patterns to reject records.

  • :reject_file (String)

    File path with patterns, one per line, to reject records.

  • :evaluate (String)

    Expression that is evaluated to select records.

  • :exact (Boolean)

    Flag indicating that a given pattern must match over its entire length.

  • :keys (Symbol, Array)

    Key or list of keys whose values are searched.

  • :keys_only (Boolean)

    Flag indicating to grab for key only - not values.

  • :values_only (Boolean)

    Flag indicating to grab for values only - not keys.

  • :ignore_case (Boolean)

    Flag indicating that pattern matching should be case insensitive.



# File 'lib/BioDSL/commands/grab.rb', line 183

def initialize(options)
  @options = options

  check_options

  @keys_only = @options[:keys_only]
  @vals_only = @options[:values_only]
  @invert    = @options[:reject] || @options[:reject_file]
  @eval      = @options[:evaluate]
  @exact     = nil
  @regex     = nil
  @keys      = nil
end

Instance Method Details

#lmb ⇒ Proc

Return a lambda for the grab command.

Returns:

  • (Proc)

    Returns the grab command lambda.



# File 'lib/BioDSL/commands/grab.rb', line 200

def lmb
  lambda do |input, output, status|
    status_init(status, STATS)
    compile_keys
    compile_exact
    compile_regexes

    input.each do |record|
      @status[:records_in] += 1

      match = case
              when @exact then exact_match? record
              when @regex then regex_match? record
              when @eval  then eval_match? record
              end

      emit_match(output, record, match)
    end
  end
end
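
The shape of the returned lambda can be tried out with a small plain-Ruby sketch. Everything here is a stand-in: build_grab_lambda is hypothetical, plain Arrays play the role of the input and output streams, a Hash plays the role of the status, and the inversion for reject is an assumption, since emit_match is not shown in this page.

```ruby
# Hypothetical stand-in for the lambda returned by #lmb (not BioDSL code).
def build_grab_lambda(matcher, invert: false)
  lambda do |input, output, status|
    status[:records_in]  = 0
    status[:records_out] = 0

    input.each do |record|
      status[:records_in] += 1

      match = matcher.call(record)
      keep  = invert ? !match : match # assumption: reject inverts the match
      next unless keep

      output << record
      status[:records_out] += 1
    end
  end
end

records = [
  { SEQ_NAME: 'human_1' },
  { SEQ_NAME: 'mouse_1' },
  { SEQ_NAME: 'human_2' }
]

matcher = ->(record) { record.values.any? { |v| /human/.match?(v.to_s) } }

selected        = []
selected_status = {}
build_grab_lambda(matcher).call(records, selected, selected_status)

rejected        = []
rejected_status = {}
build_grab_lambda(matcher, invert: true).call(records, rejected, rejected_status)

p selected_status # → {:records_in=>3, :records_out=>2} (format varies by Ruby version)
p rejected_status # → {:records_in=>3, :records_out=>1}
```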