Class: BioDSL::AssemblePairs
- Inherits:
-
Object
- Object
- BioDSL::AssemblePairs
- Defined in:
- lib/BioDSL/commands/assemble_pairs.rb
Overview
Assemble ordered overlapping pair-end sequences in the stream.
assemble_pairs
assembles overlapping pair-end sequences into single sequences that are output to the stream - the orginal sequences are no output. Assembly works by progressively considering all overlaps between the maximum considered overlap using the overlap_max
option (default is the length of the shortest sequence) until the minimum required overlap supplied with the overlap_min
option (default 1). For each overlap a percentage of mismatches can be allowed using the mismatch_percent
option (default 20%).
Mismatches in the overlapping regions are resolved so that the residues with the highest quality score is used in the assembled sequence. The quality scores are averaged in the overlapping region. The sequence of the overlapping region is output in upper case and the remaining in lower case.
Futhermore, sequences must be in interleaved order in the stream - use read_fastq
with input
and input2
options for that.
The additional keys are added to records with assembled sequences:
-
OVERLAP_LEN - the length of the located overlap.
-
HAMMING_DIST - the number of mismatches in the assembly.
Using the merge_unassembled
option will merge any unassembled sequences taking into account reverse complementation of read2 if the reverse_complement
option is true. Note that you probably want to set overlap_min
to 1 before using merge_unassembled
to improve chances of making an assembly before falling back to a simple merge.
Usage
assemble_pairs([mismatch_percent: <uint>[, overlap_min: <uint>
[, overlap_max: <uint>[, reverse_complement: <bool>
[, merge_unassembled: <bool>]]]]])
Options
-
mismatch_percent: <uint> - Maximum allowed overlap mismatches in
percent (default=20).
-
overlap_min: <uint> - Minimum overlap required (default=1).
-
overlap_max: <uint> - Maximum overlap considered
(default=<length of shortest sequences>).
-
reverse_complement: <bool> - Reverse-complement read2 before assembly
(default=false).
-
merge_unassembled: <bool> - Merge unassembled pairs (default=false).
Examples
If you have two pair-end sequence files with the Illumina data then you can assemble these using assemble_pairs like this:
BD.new.
read_fastq(input: "file1.fq", input2: "file2.fq).
assemble_pairs(reverse_complement: true).
run
rubocop:disable ClassLength
Constant Summary collapse
- STATS =
%i(overlap_sum hamming_sum records_in records_out sequences_in sequences_out residues_in residues_out assembled unassembled)
Instance Method Summary collapse
-
#initialize(options) ⇒ ReadFasta
constructor
Constructor for the AssemblePairs class.
-
#lmb ⇒ Proc
Return a lambda for the read_fasta command.
Constructor Details
#initialize(options) ⇒ ReadFasta
Constructor for the AssemblePairs class.
112 113 114 115 116 117 118 119 120 |
# File 'lib/BioDSL/commands/assemble_pairs.rb', line 112 def initialize() @options = @overlap_sum = 0 @hamming_sum = 0 defaults end |
Instance Method Details
#lmb ⇒ Proc
Return a lambda for the read_fasta command.
125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 |
# File 'lib/BioDSL/commands/assemble_pairs.rb', line 125 def lmb lambda do |input, output, status| status_init(status, STATS) input.each_slice(2) do |record1, record2| @status[:records_in] += 2 if record2 && record1[:SEQ] && record2[:SEQ] assemble_pairs(record1, record2, output) else output_record(record1, output) output_record(record2, output) if record2 end end calc_status end end |