Class: BioDSL::AssemblePairs

Inherits:

Object

Object
BioDSL::AssemblePairs

Defined in:: lib/BioDSL/commands/assemble_pairs.rb

Overview

Assemble ordered overlapping pair-end sequences in the stream.

assemble_pairs assembles overlapping pair-end sequences into single sequences that are output to the stream - the orginal sequences are no output. Assembly works by progressively considering all overlaps between the maximum considered overlap using the overlap_max option (default is the length of the shortest sequence) until the minimum required overlap supplied with the overlap_min option (default 1). For each overlap a percentage of mismatches can be allowed using the mismatch_percent option (default 20%).

Mismatches in the overlapping regions are resolved so that the residues with the highest quality score is used in the assembled sequence. The quality scores are averaged in the overlapping region. The sequence of the overlapping region is output in upper case and the remaining in lower case.

Futhermore, sequences must be in interleaved order in the stream - use read_fastq with input and input2 options for that.

The additional keys are added to records with assembled sequences:

OVERLAP_LEN - the length of the located overlap.
HAMMING_DIST - the number of mismatches in the assembly.

Using the merge_unassembled option will merge any unassembled sequences taking into account reverse complementation of read2 if the reverse_complement option is true. Note that you probably want to set overlap_min to 1 before using merge_unassembled to improve chances of making an assembly before falling back to a simple merge.

Usage

assemble_pairs([mismatch_percent: <uint>[, overlap_min: <uint>
               [, overlap_max: <uint>[, reverse_complement: <bool>
               [, merge_unassembled: <bool>]]]]])

Options

mismatch_percent: <uint> - Maximum allowed overlap mismatches in
```
percent (default=20).
```
overlap_min: <uint> - Minimum overlap required (default=1).
overlap_max: <uint> - Maximum overlap considered
```
(default=<length of shortest sequences>).
```
reverse_complement: <bool> - Reverse-complement read2 before assembly
```
(default=false).
```
merge_unassembled: <bool> - Merge unassembled pairs (default=false).

Examples

If you have two pair-end sequence files with the Illumina data then you can assemble these using assemble_pairs like this:

BD.new.
read_fastq(input: "file1.fq", input2: "file2.fq).
assemble_pairs(reverse_complement: true).
run

rubocop:disable ClassLength

Constant Summary collapse

STATS =

%i(overlap_sum hamming_sum records_in records_out sequences_in
sequences_out residues_in residues_out assembled unassembled)

Instance Method Summary collapse

#initialize(options) ⇒ ReadFasta constructor

Constructor for the AssemblePairs class.
#lmb ⇒ Proc

Return a lambda for the read_fasta command.

Constructor Details

#initialize(options) ⇒ `ReadFasta`

Constructor for the AssemblePairs class.

Parameters:

options (Hash) —

Options hash.

Options Hash (options):

:mismatch_percent (Integer) —

Maximum allowed overlap mismatches in percent.
:overlap_min (Integer) —

Minimum length of overlap.
:overlap_max (Integer) —

Maximum length of overlap.
:reverse_complement (Boolean) —

Reverse-complment read2.
:merge_unassembled (Boolean) —

Merge read pairs that couldn’t be assembled.
:allow_unassembled (Boolean) —

Output reads that couldn’t be assembled.

# File 'lib/BioDSL/commands/assemble_pairs.rb', line 112

def initialize(options)
  @options = options

  @overlap_sum = 0
  @hamming_sum = 0

  check_options
  defaults
end

Instance Method Details

#lmb ⇒ `Proc`

Return a lambda for the read_fasta command.

Returns:

(Proc) —

Returns the read_fasta command lambda.

# File 'lib/BioDSL/commands/assemble_pairs.rb', line 125

def lmb
  lambda do |input, output, status|
    status_init(status, STATS)

    input.each_slice(2) do |record1, record2|
      @status[:records_in] += 2

      if record2 && record1[:SEQ] && record2[:SEQ]
        assemble_pairs(record1, record2, output)
      else
        output_record(record1, output)
        output_record(record2, output) if record2
      end
    end

    calc_status
  end
end