Class: BioDSL::Grab
- Inherits: Object
- Defined in: lib/BioDSL/commands/grab.rb
Overview
Grab records in stream.
grab selects records from the stream by matching patterns to keys or values. grab is BioDSL’s equivalent of Unix’s grep, but it is much more versatile.
NB! If chaining multiple grab commands, use the most restrictive grab first to get the best performance.
NB! Avoid using exact with long values because of memory use.
Usage
grab(<select: <pattern>|select_file: <file>|reject: <pattern>|
reject_file: <file>|evaluate: <expression>|exact: <bool>>
[, keys: <list>|keys_only: <bool>|values_only: <bool>|
ignore_case: <bool>])
Options
- select: <pattern> - Select records matching <pattern>, which is treated as a regex, or as an exact string if the exact option is set.
- select_file: <file> - File with one <pattern> per line to select.
- reject: <pattern> - Reject records matching <pattern>, which is treated as a regex, or as an exact string if the exact option is set.
- reject_file: <file> - File with one <pattern> per line to reject.
- evaluate: <expression> - Select records where <expression> is true.
- exact: <bool> - Turn on exact matching for improved speed.
- keys: <list> - Comma-separated list or array of keys whose values are searched.
- keys_only: <bool> - Only grab for keys.
- values_only: <bool> - Only grab for values.
- ignore_case: <bool> - Ignore case when grabbing with regex (does not work with evaluate and exact).
Examples
To grab all records in the stream that contain any mention of the pattern ‘human’, just pipe the data stream through grab like this:
grab(select: "human")
This will search for the pattern ‘human’ in all keys and all values. The select option also accepts an array of patterns, so in order to match any one of multiple patterns do:
grab(select: ["human", "mouse"])
It is also possible to invoke flexible matching using regex (regular expressions) instead of simple pattern matching. If you want to grab records with the sequence ATCG or GCTA you can do this:
grab(select: "ATCG|GCTA")
Or if you want to grab sequences beginning with ATCG:
grab(select: "^ATCG")
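The matching behaviour described above can be sketched in plain Ruby. This is only an illustration under stated assumptions, not BioDSL's actual implementation: records are assumed to be hashes with symbol keys, and `record_matches?` and the sample records are hypothetical names invented for this example.

```ruby
# Hypothetical helper, not part of BioDSL: true if any pattern matches
# any key or any value of the record (a Hash with symbol keys).
def record_matches?(record, patterns)
  # Accept a single pattern or an array, and OR them into one regex.
  regex = Regexp.union(Array(patterns).map { |p| Regexp.new(p) })
  record.any? { |key, value| key.to_s =~ regex || value.to_s =~ regex }
end

# Made-up sample records for illustration only.
records = [
  { SEQ_NAME: 'human_1', SEQ: 'ATCGATCG' },
  { SEQ_NAME: 'mouse_1', SEQ: 'GGGGCTAG' },
  { SEQ_NAME: 'yeast_1', SEQ: 'TTTTTTTT' }
]

by_name   = records.select { |r| record_matches?(r, %w[human mouse]) }
by_either = records.select { |r| record_matches?(r, 'ATCG|GCTA') }
by_prefix = records.select { |r| record_matches?(r, '^ATCG') }

puts by_name.map { |r| r[:SEQ_NAME] }.inspect
puts by_either.map { |r| r[:SEQ_NAME] }.inspect
puts by_prefix.map { |r| r[:SEQ_NAME] }.inspect
```

In this sketch the array form and the `|` alternation both keep the human and mouse records, while the anchored pattern keeps only the record whose sequence starts with ATCG.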
It is also possible to use the select_file option to load patterns from a file with one pattern per line.
grab(select_file: "patterns.txt")
If you want the opposite result - to find all records that do not match a pattern - use the reject option:
grab(reject: "human")
Similar to select_file, there is a reject_file option to load patterns from a file, and use any of these patterns to reject records:
grab(reject_file: "patterns.txt")
If you want to search the record keys only, e.g. to grab all records containing the key SEQ, you can use the keys_only option. This prevents matching of SEQ in any record value; since SEQ is a not uncommon peptide sequence, you could otherwise get unwanted records. It also gives an increase in speed since only the keys are searched:
grab(select: "SEQ", keys_only: true)
However, if you are interested in grabbing the peptide sequence SEQ and not the SEQ key, just use the values_only option:
grab(select: "SEQ", values_only: true)
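The difference between searching keys only and searching values only can be sketched as follows. This is a hedged illustration, not BioDSL's code: the helper `matches_in_scope?`, its `scope:` parameter and the sample records are hypothetical and exist only for this example.

```ruby
# Hypothetical helper, not part of BioDSL. scope is :keys, :values or :both.
def matches_in_scope?(record, pattern, scope: :both)
  regex = Regexp.new(pattern)
  # Keys are searched unless the scope is restricted to values.
  in_keys = scope != :values && record.keys.any? { |k| k.to_s =~ regex }
  # Values are searched unless the scope is restricted to keys.
  in_values = scope != :keys && record.values.any? { |v| v.to_s =~ regex }
  in_keys || in_values
end

# Made-up records: one has a SEQ key, the other a peptide containing "SEQ".
with_key     = { SEQ: 'ATCG' }
with_peptide = { PEPTIDE: 'MKSEQLL' }

puts matches_in_scope?(with_key, 'SEQ', scope: :keys)
puts matches_in_scope?(with_peptide, 'SEQ', scope: :keys)
puts matches_in_scope?(with_peptide, 'SEQ', scope: :values)
```

Restricting the scope to keys rejects the peptide record, and restricting it to values rejects the record that merely has a SEQ key.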
Also, if you want to grab for certain key/value pairs, you can use the keys option to supply a comma-separated list or an array of keys whose values will then be searched. This is handy if your records contain large genomic sequences and you don’t want to search the entire sequence for e.g. the organism name; it is much faster to tell grab which keys to search the values of:
grab(select: "human", keys: :SEQ_NAME)
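As a rough sketch of what restricting the search to selected keys means, the following plain Ruby accepts a single key, a comma-separated string or an array. The helper name `value_matches_for_keys?` and the way the key list is normalised are assumptions made for this example, not BioDSL's own code.

```ruby
# Hypothetical helper, not part of BioDSL: search only the values that
# belong to the listed keys.
def value_matches_for_keys?(record, pattern, keys)
  regex = Regexp.new(pattern)
  # Normalise "A, B", :A or [:A, :B] into an array of symbols.
  key_list = if keys.is_a?(String)
               keys.split(',').map { |k| k.strip.to_sym }
             else
               Array(keys).map(&:to_sym)
             end
  key_list.any? { |k| record.key?(k) && record[k].to_s =~ regex }
end

# Made-up record for illustration only.
record = { SEQ_NAME: 'human_1', SEQ: 'ACGTACGT' }

puts value_matches_for_keys?(record, 'human', :SEQ_NAME)
puts value_matches_for_keys?(record, 'human', 'SEQ')
puts value_matches_for_keys?(record, 'human', 'SEQ, SEQ_NAME')
```

Only the values under the named keys are inspected, so a long SEQ value is never scanned when the key list is just SEQ_NAME.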
You can also use the evaluate option to grab records that fulfill an expression. So to grab all records with a sequence length greater than 30:
grab(evaluate: 'SEQ_LEN > 30')
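To illustrate the idea of filtering on an expression, here is a deliberately restricted sketch that understands only the form `KEY OPERATOR NUMBER`. How BioDSL actually evaluates expressions is not shown in this document, so the helper `simple_evaluate?` and its small grammar are assumptions made purely for illustration.

```ruby
# Hypothetical helper, not part of BioDSL: evaluates expressions of the
# form "KEY OP NUMBER" against a record (a Hash with symbol keys).
def simple_evaluate?(record, expression)
  comparison = /\A\s*(\w+)\s*(>=|<=|==|>|<)\s*(-?\d+(?:\.\d+)?)\s*\z/
  m = comparison.match(expression)
  raise ArgumentError, "unsupported expression: #{expression}" unless m

  key = m[1].to_sym
  operator = m[2]
  number = m[3].to_f
  # Records lacking the key never satisfy the expression.
  return false unless record.key?(key)

  record[key].to_f.public_send(operator, number)
end

# Made-up records for illustration only.
records = [{ SEQ_LEN: 45 }, { SEQ_LEN: 30 }, { SEQ_NAME: 'no_length' }]
puts records.select { |r| simple_evaluate?(r, 'SEQ_LEN > 30') }.inspect
```

A record without the SEQ_LEN key is treated as a non-match here; that is a design choice of this sketch rather than documented BioDSL behaviour.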
If you want to grab all records containing the pattern ‘human’ and where the sequence length is greater than 30, you do this by running the stream through grab twice:
grab(select: 'human').grab(evaluate: 'SEQ_LEN > 30')
Finally, it is possible to grab for exact patterns using the exact option. This is much faster than the default regex pattern grabbing because with exact the patterns are used to create a lookup hash for instant matching of keys or values. This is useful if you e.g. have a file with ID numbers and you want to grab matching records from the stream:
grab(select_file: "ids.txt", keys: :ID, exact: true)
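The lookup-hash idea can be sketched in a few lines of plain Ruby. This is an illustration under assumptions rather than BioDSL's implementation: `build_lookup` and `lookup_match?` are hypothetical names, and the ID values are made up.

```ruby
# Hypothetical helpers, not part of BioDSL.

# Build a hash whose keys are the exact patterns; membership tests are
# then a single hash lookup instead of a regex scan per pattern.
def build_lookup(patterns)
  patterns.each_with_object({}) { |pattern, hash| hash[pattern.to_s] = true }
end

# True if the value under any of the given keys is exactly one of the patterns.
def lookup_match?(record, lookup, keys)
  keys.any? { |key| lookup.key?(record[key].to_s) }
end

# Made-up IDs standing in for the lines of a pattern file.
lookup = build_lookup(%w[id1 id7])

puts lookup_match?({ ID: 'id7' }, lookup, [:ID])
puts lookup_match?({ ID: 'id70' }, lookup, [:ID])
```

Unlike a regex search for id7, the exact lookup does not match id70. Because every pattern is stored as a hash key, long patterns cost memory, which is consistent with the note above about avoiding exact with long values.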
Constant Summary
- STATS = %i(records_in records_out)
Instance Method Summary
- #initialize(options) ⇒ Grab (constructor)
  Constructor for the Grab class.
- #lmb ⇒ Proc
  Return a lambda for the grab command.
Constructor Details
#initialize(options) ⇒ Grab
Constructor for the Grab class.
# File 'lib/BioDSL/commands/grab.rb', line 183
def initialize(options)
  @options   = options
  @keys_only = @options[:keys_only]
  @vals_only = @options[:values_only]
  @invert    = @options[:reject] || @options[:reject_file]
  @eval      = @options[:evaluate]
  @exact     = nil
  @regex     = nil
  @keys      = nil
end
Instance Method Details
#lmb ⇒ Proc
Return a lambda for the grab command.
# File 'lib/BioDSL/commands/grab.rb', line 200
def lmb
  lambda do |input, output, status|
    status_init(status, STATS)
    compile_keys
    compile_exact
    compile_regexes

    input.each do |record|
      @status[:records_in] += 1

      match = case
              when @exact then exact_match? record
              when @regex then regex_match? record
              when @eval  then eval_match? record
              end

      emit_match(output, record, match)
    end
  end
end