Class: BioDSL::AssemblePairs

Inherits:
Object
  • Object
show all
Defined in:
lib/BioDSL/commands/assemble_pairs.rb

Overview

Assemble ordered overlapping pair-end sequences in the stream.

assemble_pairs assembles overlapping pair-end sequences into single sequences that are output to the stream - the orginal sequences are no output. Assembly works by progressively considering all overlaps between the maximum considered overlap using the overlap_max option (default is the length of the shortest sequence) until the minimum required overlap supplied with the overlap_min option (default 1). For each overlap a percentage of mismatches can be allowed using the mismatch_percent option (default 20%).

Mismatches in the overlapping regions are resolved so that the residues with the highest quality score is used in the assembled sequence. The quality scores are averaged in the overlapping region. The sequence of the overlapping region is output in upper case and the remaining in lower case.

Futhermore, sequences must be in interleaved order in the stream - use read_fastq with input and input2 options for that.

The additional keys are added to records with assembled sequences:

  • OVERLAP_LEN - the length of the located overlap.

  • HAMMING_DIST - the number of mismatches in the assembly.

Using the merge_unassembled option will merge any unassembled sequences taking into account reverse complementation of read2 if the reverse_complement option is true. Note that you probably want to set overlap_min to 1 before using merge_unassembled to improve chances of making an assembly before falling back to a simple merge.

Usage

assemble_pairs([mismatch_percent: <uint>[, overlap_min: <uint>
               [, overlap_max: <uint>[, reverse_complement: <bool>
               [, merge_unassembled: <bool>]]]]])

Options

  • mismatch_percent: <uint> - Maximum allowed overlap mismatches in

    percent (default=20).
    
  • overlap_min: <uint> - Minimum overlap required (default=1).

  • overlap_max: <uint> - Maximum overlap considered

    (default=<length of shortest sequences>).
    
  • reverse_complement: <bool> - Reverse-complement read2 before assembly

    (default=false).
    
  • merge_unassembled: <bool> - Merge unassembled pairs (default=false).

Examples

If you have two pair-end sequence files with the Illumina data then you can assemble these using assemble_pairs like this:

BD.new.
read_fastq(input: "file1.fq", input2: "file2.fq).
assemble_pairs(reverse_complement: true).
run

rubocop:disable ClassLength

Constant Summary collapse

STATS =
%i(overlap_sum hamming_sum records_in records_out sequences_in
sequences_out residues_in residues_out assembled unassembled)

Instance Method Summary collapse

Constructor Details

#initialize(options) ⇒ ReadFasta

Constructor for the AssemblePairs class.

Parameters:

  • options (Hash)

    Options hash.

Options Hash (options):

  • :mismatch_percent (Integer)

    Maximum allowed overlap mismatches in percent.

  • :overlap_min (Integer)

    Minimum length of overlap.

  • :overlap_max (Integer)

    Maximum length of overlap.

  • :reverse_complement (Boolean)

    Reverse-complment read2.

  • :merge_unassembled (Boolean)

    Merge read pairs that couldn’t be assembled.

  • :allow_unassembled (Boolean)

    Output reads that couldn’t be assembled.



112
113
114
115
116
117
118
119
120
# File 'lib/BioDSL/commands/assemble_pairs.rb', line 112

def initialize(options)
  @options = options

  @overlap_sum = 0
  @hamming_sum = 0

  check_options
  defaults
end

Instance Method Details

#lmbProc

Return a lambda for the read_fasta command.

Returns:

  • (Proc)

    Returns the read_fasta command lambda.



125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
# File 'lib/BioDSL/commands/assemble_pairs.rb', line 125

def lmb
  lambda do |input, output, status|
    status_init(status, STATS)

    input.each_slice(2) do |record1, record2|
      @status[:records_in] += 2

      if record2 && record1[:SEQ] && record2[:SEQ]
        assemble_pairs(record1, record2, output)
      else
        output_record(record1, output)
        output_record(record2, output) if record2
      end
    end

    calc_status
  end
end