Class: BioDSL::MaskSeq

Inherits:
Object
Defined in:
lib/BioDSL/commands/mask_seq.rb

Overview

Mask sequences in the stream based on quality scores.

mask_seq masks sequences in the stream using either soft masking (the default) or hard masking. Hard masking replaces each residue whose quality score is below quality_min with an N, while soft masking converts such residues to lower case. Sequences are the values of SEQ keys and quality scores are the values of SCORES keys. SCORES are encoded as ASCII characters from '!' to 'I', representing scores from 0 to 40.
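The score encoding described above can be illustrated with a minimal sketch. It assumes the offset of 33 implied by '!' (ASCII 33) standing for score 0; PHRED_OFFSET and decode_scores are hypothetical names used for illustration and are not part of BioDSL.

```ruby
# Illustrative only: PHRED_OFFSET and decode_scores are not BioDSL API.
# '!' is ASCII 33 and encodes score 0, so subtracting 33 recovers the score.
PHRED_OFFSET = 33

# Convert a SCORES string into an array of integer quality scores.
def decode_scores(scores)
  scores.each_char.map { |char| char.ord - PHRED_OFFSET }
end

p decode_scores('!5I') # the lowest, the default cutoff, and the highest score
```

With the default quality_min of 20, a residue whose score character sorts before '5' would be masked.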

Usage

mask_seq([quality_min: <uint>[, mask: <:soft|:hard>]])

Options

  • quality_min: <uint> - Minimum quality (default=20).

  • mask: <:soft|:hard> - Soft or hard masking (default=soft).

Examples

Consider the following FASTQ entry in the file test.fq:

@HWI-EAS157_20FFGAAXX:2:1:888:434
TTGGTCGCTCGCTCCGCGACCTCAGATCAGACGTGGGCGAT
+HWI-EAS157_20FFGAAXX:2:1:888:434
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHI

We can read in this sequence using read_fastq and then soft mask it with mask_seq like this:

BD.new.read_fastq(input: "test.fq").mask_seq.dump.run

{:SEQ_NAME=>"HWI-EAS157_20FFGAAXX:2:1:888:434",
 :SEQ=>"ttggtcgctcgctccgcgacCTCAGATCAGACGTGGGCGAT",
 :SEQ_LEN=>41,
 :SCORES=>"!\"\#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHI"}

Using the quality_min option we can change the cutoff:

BD.new.read_fastq(input: "test.fq").mask_seq(quality_min: 25).dump.run

{:SEQ_NAME=>"HWI-EAS157_20FFGAAXX:2:1:888:434",
 :SEQ=>"ttggtcgctcgctccgcgacctcagATCAGACGTGGGCGAT",
 :SEQ_LEN=>41,
 :SCORES=>"!\"\#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHI"}

Using the mask option for hard masking:

BD.new.read_fastq(input: "test.fq").mask_seq(mask: :hard).dump.run

{:SEQ_NAME=>"HWI-EAS157_20FFGAAXX:2:1:888:434",
 :SEQ=>"NNNNNNNNNNNNNNNNNNNNCTCAGATCAGACGTGGGCGAT",
 :SEQ_LEN=>41,
 :SCORES=>"!\"\#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHI"}
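To make the masking rule behind these three results concrete, here is a rough pure-Ruby sketch that does not use BioDSL. The method name mask_by_quality and the constant PHRED_BASE are assumptions made for illustration; the actual implementation in mask_seq.rb may differ.

```ruby
# Illustrative only: mask_by_quality and PHRED_BASE are not BioDSL API.
PHRED_BASE = 33 # ASCII offset implied by '!' encoding score 0

# Mask every residue whose quality score is below quality_min.
# Soft masking lowercases the residue, hard masking replaces it with 'N'.
def mask_by_quality(seq, scores, quality_min: 20, mask: :soft)
  seq.chars.zip(scores.chars).map do |residue, score_char|
    next residue if score_char.ord - PHRED_BASE >= quality_min

    mask == :hard ? 'N' : residue.downcase
  end.join
end

seq    = 'TTGGTCGCTCGCTCCGCGACCTCAGATCAGACGTGGGCGAT'
scores = (33..73).map(&:chr).join # the 41 characters '!' through 'I'

puts mask_by_quality(seq, scores)
puts mask_by_quality(seq, scores, quality_min: 25)
puts mask_by_quality(seq, scores, mask: :hard)
```

Run against the test.fq entry, the three calls should print the same SEQ values as the three examples above.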

Constant Summary

STATS =
%i(records_in records_out sequences_in sequences_out residues_in
residues_out masked)

Instance Method Summary

Constructor Details

#initialize(options) ⇒ MaskSeq

Constructor for MaskSeq.

Parameters:

  • options (Hash)

    Options hash.

Options Hash (options):

  • :quality_min (Integer)

    Minimum quality score.

  • :mask (Symbol, String)

    Mask scheme.



# File 'lib/BioDSL/commands/mask_seq.rb', line 95

def initialize(options)
  @options = options

  check_options
  defaults

  @mask = options[:mask].to_sym
end

Instance Method Details

#lmb ⇒ Proc

Return command lambda for mask_seq.

Returns:

  • (Proc)

    command lambda.



# File 'lib/BioDSL/commands/mask_seq.rb', line 107

def lmb
  lambda do |input, output, status|
    status_init(status, STATS)

    input.each do |record|
      @status[:records_in] += 1

      mask_seq(record) if record[:SEQ] && record[:SCORES]

      output << record

      @status[:records_out] += 1
    end

    @status[:masked_percent] =
      (100 * @status[:masked].to_f / @status[:residues_in]).round(2)
  end
end
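
The masked_percent arithmetic at the end of the lambda can be tried in isolation with the sketch below. The helper name masked_percent is hypothetical, and the guard for zero residues is an addition of this sketch that the method above does not include.

```ruby
# Illustrative only: masked_percent is not a BioDSL method.
# Percentage of residues that were masked, rounded to two decimals.
def masked_percent(masked, residues_in)
  return 0.0 if residues_in.zero? # assumption: report 0.0 for empty input

  (100 * masked.to_f / residues_in).round(2)
end

# In the soft-masking example, 20 of the 41 residues fall below quality_min.
puts masked_percent(20, 41)
```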