Class: BioDSL::TrimSeq

Inherits:
Object
Defined in:
lib/BioDSL/commands/trim_seq.rb

Overview

Trim sequence ends removing residues with a low quality score.

trim_seq removes low-quality residues from the ends of sequences in the stream, based on the quality scores in the FASTQ-type quality score string stored under the SCORES key. Trimming proceeds from an end until a stretch of good-quality residues, whose length is set with the length_min option, is found. This prevents a single good-quality residue at the end from terminating the trimming prematurely. The mode option selects whether the sequence is trimmed from the left end, the right end, or both (default=:both).
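The trimming rule described above can be sketched in plain Ruby. This is an illustration under stated assumptions, not the BioDSL implementation: trim_ends and first_good_stretch are hypothetical helpers, and the score offset of 64 is an assumption chosen because it makes the documented defaults agree with the example output on this page.

```ruby
# Sketch only: trim_ends and first_good_stretch are hypothetical helpers,
# not BioDSL methods.

# ASSUMPTION: a score is the character's ASCII code minus 64.
SCORE_OFFSET = 64

# Returns the index where the first run of `length_min` consecutive scores,
# all at or above `quality_min`, begins; nil if there is no such run.
def first_good_stretch(scores, quality_min, length_min)
  quals = scores.each_char.map { |char| char.ord - SCORE_OFFSET }

  (0..quals.size - length_min).find do |start|
    quals[start, length_min].all? { |qual| qual >= quality_min }
  end
end

# Returns [trimmed_seq, trimmed_scores] without modifying the arguments.
def trim_ends(seq, scores, quality_min: 20, length_min: 3, mode: :both)
  left  = 0
  right = seq.length

  if %i[left both].include?(mode)
    # No good stretch at all means everything is trimmed away.
    left = first_good_stretch(scores, quality_min, length_min) || seq.length
  end

  if %i[right both].include?(mode)
    # Scan the reversed string so the same helper serves the right end.
    pos   = first_good_stretch(scores.reverse, quality_min, length_min)
    right = pos ? seq.length - pos : 0
  end

  len = [right - left, 0].max
  [seq[left, len], scores[left, len]]
end

# 'h' decodes to 40 and '@' to 0 under the assumed offset.
p trim_ends('acgtacgt', '@@hhhh@@') # => ["gtac", "hhhh"]
```

Reversing the score string lets one scanning helper serve both ends; a production implementation would more likely scan in place to avoid the extra copy.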

Usage

trim_seq([quality_min: <uint>[, length_min: <uint>
         [, mode: <:left|:right|:both>]]])

Options

  • quality_min: <uint> - Minimum quality (default=20).

  • length_min: <uint> - Minimum stretch length (default=3).

  • mode: <symbol> - Trim mode :left|:right|:both (default=:both).

Examples

Consider the following FASTQ entry in the file test.fq:

@test
gatcgatcgtacgagcagcatctgacgtatcgatcgttgattagttgctagctatgcagtctacgacgagcatgctagctag
+
@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghhgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDChhh

To trim both ends simply do:

BD.new.read_fastq(input: "test.fq").trim_seq.dump.run

SEQ_NAME: test
SEQ: tctgacgtatcgatcgttgattagttgctagctatgcagtctacgacgagcatgctagctag
SEQ_LEN: 62
SCORES: TUVWXYZ[\]^_`abcdefghhgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDChhh
---

Use the quality_min option to change the minimum quality score a residue must have to be kept:

BD.new.
read_fastq(input: "test.fq").
trim_seq(quality_min: 25).
dump.
run

SEQ_NAME: test
SEQ: cgtatcgatcgttgattagttgctagctatgcagtctacgacgagcatgctagctag
SEQ_LEN: 57
SCORES: YZ[\]^_`abcdefghhgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDChhh
---

To trim the left end only (use :right for the right end only), do:

BD.new.read_fastq(input: "test.fq").trim_seq(mode: :left).dump.run

SEQ_NAME: test
SEQ: tctgacgtatcgatcgttgattagttgctagctatgcagtctacgacgagcatgctagctag
SEQ_LEN: 62
SCORES: TUVWXYZ[\]^_`abcdefghhgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDChhh
---

To require a longer stretch of good-quality residues before trimming stops, use the length_min option:

BD.new.read_fastq(input: "test.fq").trim_seq(length_min: 4).dump.run

SEQ_NAME: test
SEQ: tctgacgtatcgatcgttgattagttgctagctatgcagtct
SEQ_LEN: 42
SCORES: TUVWXYZ[\]^_`abcdefghhgfedcba`_^]\[ZYXWVUT
---

Constant Summary

STATS =
%i(records_in records_out sequences_in sequences_out residues_in
residues_out)

Instance Method Summary

Constructor Details

#initialize(options) ⇒ Proc, TrimSeq

Constructor for the TrimSeq class.

Parameters:

  • options (Hash)

    Options hash.

Options Hash (options):

  • :quality_min (Integer)

    TrimSeq minimum quality (default=20).

  • :mode (Symbol)

    TrimSeq mode (default=:both).

  • :length_min (Integer)

    TrimSeq minimum length of the good-quality stretch that stops trimming (default=3).



# File 'lib/BioDSL/commands/trim_seq.rb', line 123

def initialize(options)
  @options = options

  check_options
  defaults

  @mode = @options[:mode].to_sym
  @min  = @options[:quality_min]
  @len  = @options[:length_min]
end
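The check_options and defaults methods called by the constructor are not shown on this page. As a rough sketch of what such a step could look like, the snippet below merges the documented defaults into the caller's hash and normalizes the mode. DEFAULTS, VALID_MODES and normalize_options are hypothetical names introduced for illustration only.

```ruby
# Sketch only: DEFAULTS, VALID_MODES and normalize_options are hypothetical
# names; the real check_options and defaults methods are not shown here.
DEFAULTS    = { quality_min: 20, length_min: 3, mode: :both }.freeze
VALID_MODES = %i[left right both].freeze

# Returns a new options hash with defaults filled in and mode as a Symbol.
def normalize_options(options)
  merged = DEFAULTS.merge(options)      # caller-supplied values win
  merged[:mode] = merged[:mode].to_sym  # accept "left" as well as :left

  unless VALID_MODES.include?(merged[:mode])
    raise ArgumentError, "mode must be one of #{VALID_MODES.inspect}"
  end

  merged
end

p normalize_options(mode: 'left')[:mode]   # => :left
p normalize_options({})[:quality_min]      # => 20
```

Returning a new hash rather than mutating the argument keeps the caller's options untouched, which is a design choice of this sketch rather than something the page states about BioDSL.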

Instance Method Details

#lmb ⇒ Proc

Return a lambda for the trim_seq command.

Returns:

  • (Proc)

    Returns the trim_seq command lambda.



# File 'lib/BioDSL/commands/trim_seq.rb', line 137

def lmb
  lambda do |input, output, status|
    status_init(status, STATS)

    input.each do |record|
      @status[:records_in] += 1

      trim_seq(record) if record[:SEQ] && record[:SCORES]

      output << record

      @status[:records_out] += 1
    end
  end
end
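The shape of this lambda (count a record in, transform it only if it carries both SEQ and SCORES, forward it, count it out) can be tried in isolation. The following is a minimal sketch in which plain Arrays and a Hash stand in for BioDSL's streams and status object; SKETCH_STATS and build_command are hypothetical names, and the status initialization is a guess at what status_init might do.

```ruby
# Sketch only: SKETCH_STATS and build_command are hypothetical names, and
# plain Arrays and a Hash stand in for BioDSL's streams and status object.
SKETCH_STATS = %i[records_in records_out].freeze

# Builds a lambda that applies `transform` to sequence records and forwards
# every record, counting what goes in and out.
def build_command(&transform)
  lambda do |input, output, status|
    # ASSUMPTION: initializing the status means zeroing each counter.
    SKETCH_STATS.each { |key| status[key] = 0 }

    input.each do |record|
      status[:records_in] += 1

      # Records without both keys pass through untouched.
      transform.call(record) if record[:SEQ] && record[:SCORES]

      output << record
      status[:records_out] += 1
    end
  end
end

command = build_command { |record| record[:SEQ] = record[:SEQ].upcase }
output  = []
status  = {}
command.call([{ SEQ: 'acgt', SCORES: 'hhhh' }, { FOO: 'bar' }], output, status)

p output.first[:SEQ]    # => "ACGT"
p status[:records_out]  # => 2
```

Note that the guard means non-sequence records still count toward records_in and records_out, which matches the listing above.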