Class: BioDSL::SliceAlign
- Inherits:
-
Object
- Object
- BioDSL::SliceAlign
- Defined in:
- lib/BioDSL/commands/slice_align.rb
Overview
Slice aligned sequences in the stream to obtain subsequences.
slice_align
slices an alignment to extract subsequence from all sequences in the stream. This is done by either specifying a range or a set of primers that is then used to locate the range to be sliced from the sequences.
If a range is given with the slice
option the potitions (0-based) must be corresponding the aligned sequence, i.e with gaps.
If a set of primers are given with the forward
and reverse
options (or the forward_rc
and reverse_rc
options) these primers are used to locate the matching positions in the first entry and this range is used to slice this and any following sequences. It is possible to specify fuzzy primer matching by using the max_mismatches
, max_insertions
and max_deletions
options. Moreover, IUPAC ambigity codes are allowed.
It is also possible to specify a template file using the template_file
option. The template file should be a file with one FASTA formatted sequence from the alignment (with gaps). If a template file and a range is specified the nucleotide positions from the ungapped template will be used. If both template file and primers are specified the template sequence is used for the primer search and the positions will be used for slicing.
The sequences in the stream are replaced with the sliced subsequences.
Usage
slice_align(<slice: <index>|<range>> |
<forward: <string> | forward_rc: <string>>,
<revese: <string> | reverse_rc: <string>
[, max_mismatches: <uint>[, max_insertions: <uint>
[, max_deletions: <uint>[, template_file: <file>]]]])
Options
-
slice: <index> - Slice a one residue subsequence.
-
slice: <range> - Slice a range from the sequence.
-
forward: <string> - Forward primer (5’-3’).
-
forward_rc: <string> - Forward primer (3’-5’).
-
reverse: <string> - Reverse primer (3’-5’).
-
reverse_rc: <string> - Reverse primer (5’-3’).
-
max_mismatches: <uint> - Max number of mismatchs (default=2).
-
max_insertions: <uint> - Max number of insertions (default=1).
-
max_deletions: <uint> - Max number of deletions (default=1).
-
template_file: <file> - File with one aligned sequence in FASTA format.
Examples
Consider the following alignment in the file ‘test.fna`
>ID00000000
CCGCATACG-------CCCTGAGGGG----
>ID00000001
CCGCATGAT-------ACCTGAGGGT----
>ID00000002
CCGCATATACTCTTGACGCTAAAGCGTAGT
>ID00000003
CCGTATGTG-------CCCTTCGGGG----
>ID00000004
CCGGATAAG-------CCCTTACGGG----
>ID00000005
CCGGATAAG-------CCCTTACGGG----
We can slice the alignment with slice_align
using a range:
BD.new.
read_fasta(input: "test.fna").
slice_align(slice: 14 .. 27).
dump.
run
{:SEQ_NAME=>"ID00000000", :SEQ=>"--CCCTGAGGGG--", :SEQ_LEN=>14}
{:SEQ_NAME=>"ID00000001", :SEQ=>"--ACCTGAGGGT--", :SEQ_LEN=>14}
{:SEQ_NAME=>"ID00000002", :SEQ=>"GACGCTAAAGCGTA", :SEQ_LEN=>14}
{:SEQ_NAME=>"ID00000003", :SEQ=>"--CCCTTCGGGG--", :SEQ_LEN=>14}
{:SEQ_NAME=>"ID00000004", :SEQ=>"--CCCTTACGGG--", :SEQ_LEN=>14}
{:SEQ_NAME=>"ID00000005", :SEQ=>"--CCCTTACGGG--", :SEQ_LEN=>14}
Or we could slice the alignment using a set of primers:
BD.new.
read_fasta(input: "test.fna").
slice_align(forward: "CGCATACG", reverse: "GAGGGG", max_mismatches: 0,
max_insertions: 0, max_deletions: 0).
dump.run
{:SEQ_NAME=>"ID00000000", :SEQ=>"CGCATACG-------CCCTGAGGGG", :SEQ_LEN=>25}
{:SEQ_NAME=>"ID00000001", :SEQ=>"CGCATGAT-------ACCTGAGGGT", :SEQ_LEN=>25}
{:SEQ_NAME=>"ID00000002", :SEQ=>"CGCATATACTCTTGACGCTAAAGCG", :SEQ_LEN=>25}
{:SEQ_NAME=>"ID00000003", :SEQ=>"CGTATGTG-------CCCTTCGGGG", :SEQ_LEN=>25}
{:SEQ_NAME=>"ID00000004", :SEQ=>"CGGATAAG-------CCCTTACGGG", :SEQ_LEN=>25}
{:SEQ_NAME=>"ID00000005", :SEQ=>"CGGATAAG-------CCCTTACGGG", :SEQ_LEN=>25}
Now, if we have a template file with the following FASTA entry:
>template
CTGAATACG-------CCATTCGATGG---
and spefifying primers these will be matched to the template and the hit positions used for slicing:
BD.new.
read_fasta(input: "test.fna").
slice_align(template_file: "template.fna", forward: "GAATACG",
reverse: "ATTCGAT", max_mismatches: 0, max_insertions: 0,
max_deletions: 0).
dump.run
{:SEQ_NAME=>"ID00000000", :SEQ=>"GCATACG-------CCCTGAGGG", :SEQ_LEN=>23}
{:SEQ_NAME=>"ID00000001", :SEQ=>"GCATGAT-------ACCTGAGGG", :SEQ_LEN=>23}
{:SEQ_NAME=>"ID00000002", :SEQ=>"GCATATACTCTTGACGCTAAAGC", :SEQ_LEN=>23}
{:SEQ_NAME=>"ID00000003", :SEQ=>"GTATGTG-------CCCTTCGGG", :SEQ_LEN=>23}
{:SEQ_NAME=>"ID00000004", :SEQ=>"GGATAAG-------CCCTTACGG", :SEQ_LEN=>23}
{:SEQ_NAME=>"ID00000005", :SEQ=>"GGATAAG-------CCCTTACGG", :SEQ_LEN=>23}
Finally, specifying a template file and an interval the positions used for slicing will be the ungapped positions from the template sequence. This is useful if you are slicing 16S rRNA alignments and want the E.coli corresponding positions - simply use the E.coli sequence as template.
BD.new.
read_fasta(input: "test.fna").
slice_align(template_file: "template.fna", slice: 4 .. 14).
dump.run
{:SEQ_NAME=>"ID00000000", :SEQ=>"ATACG-------CCCTGA", :SEQ_LEN=>18}
{:SEQ_NAME=>"ID00000001", :SEQ=>"ATGAT-------ACCTGA", :SEQ_LEN=>18}
{:SEQ_NAME=>"ID00000002", :SEQ=>"ATATACTCTTGACGCTAA", :SEQ_LEN=>18}
{:SEQ_NAME=>"ID00000003", :SEQ=>"ATGTG-------CCCTTC", :SEQ_LEN=>18}
{:SEQ_NAME=>"ID00000004", :SEQ=>"ATAAG-------CCCTTA", :SEQ_LEN=>18}
{:SEQ_NAME=>"ID00000005", :SEQ=>"ATAAG-------CCCTTA", :SEQ_LEN=>18}
rubocop: enable LineLength rubocop: disable ClassLength
Defined Under Namespace
Classes: PosIndex
Constant Summary collapse
- STATS =
%i(records_in records_out sequences_in sequences_out residues_in residues_out)
Instance Method Summary collapse
-
#initialize(options) ⇒ SliceAlign
constructor
Constructor for SliceAlign.
-
#lmb ⇒ Proc
Return the comman lamba for slice_align.
Constructor Details
#initialize(options) ⇒ SliceAlign
Constructor for SliceAlign.
182 183 184 185 186 187 188 189 190 191 192 |
# File 'lib/BioDSL/commands/slice_align.rb', line 182 def initialize() @options = @forward = forward @reverse = reverse @indels = BioDSL::Seq::INDELS.sort.join @template = nil @slice = [:slice] defaults end |
Instance Method Details
#lmb ⇒ Proc
Return the comman lamba for slice_align.
197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 |
# File 'lib/BioDSL/commands/slice_align.rb', line 197 def lmb lambda do |input, output, status| status_init(status, STATS) parse_template_file setup_template_slice input.each do |record| @status[:records_in] += 1 slice_align(record) if record.key? :SEQ output << record @status[:records_out] += 1 end end end |