Class: BioDSL::SliceAlign

Inherits:
Object
  • Object
show all
Defined in:
lib/BioDSL/commands/slice_align.rb

Overview

Slice aligned sequences in the stream to obtain subsequences.

slice_align slices an alignment to extract subsequence from all sequences in the stream. This is done by either specifying a range or a set of primers that is then used to locate the range to be sliced from the sequences.

If a range is given with the slice option the potitions (0-based) must be corresponding the aligned sequence, i.e with gaps.

If a set of primers are given with the forward and reverse options (or the forward_rc and reverse_rc options) these primers are used to locate the matching positions in the first entry and this range is used to slice this and any following sequences. It is possible to specify fuzzy primer matching by using the max_mismatches, max_insertions and max_deletions options. Moreover, IUPAC ambigity codes are allowed.

It is also possible to specify a template file using the template_file option. The template file should be a file with one FASTA formatted sequence from the alignment (with gaps). If a template file and a range is specified the nucleotide positions from the ungapped template will be used. If both template file and primers are specified the template sequence is used for the primer search and the positions will be used for slicing.

The sequences in the stream are replaced with the sliced subsequences.

Usage

slice_align(<slice: <index>|<range>> |
            <forward: <string> | forward_rc: <string>>,
            <revese: <string> | reverse_rc: <string>
            [, max_mismatches: <uint>[, max_insertions: <uint>
            [, max_deletions: <uint>[, template_file: <file>]]]])

Options

  • slice: <index> - Slice a one residue subsequence.

  • slice: <range> - Slice a range from the sequence.

  • forward: <string> - Forward primer (5'-3').

  • forward_rc: <string> - Forward primer (3'-5').

  • reverse: <string> - Reverse primer (3'-5').

  • reverse_rc: <string> - Reverse primer (5'-3').

  • max_mismatches: <uint> - Max number of mismatchs (default=2).

  • max_insertions: <uint> - Max number of insertions (default=1).

  • max_deletions: <uint> - Max number of deletions (default=1).

  • template_file: <file> - File with one aligned sequence in FASTA format.

Examples

Consider the following alignment in the file `test.fna`

>ID00000000
CCGCATACG-------CCCTGAGGGG----
>ID00000001
CCGCATGAT-------ACCTGAGGGT----
>ID00000002
CCGCATATACTCTTGACGCTAAAGCGTAGT
>ID00000003
CCGTATGTG-------CCCTTCGGGG----
>ID00000004
CCGGATAAG-------CCCTTACGGG----
>ID00000005
CCGGATAAG-------CCCTTACGGG----

We can slice the alignment with slice_align using a range:

BD.new.
read_fasta(input: "test.fna").
slice_align(slice: 14 .. 27).
dump.
run

{:SEQ_NAME=>"ID00000000", :SEQ=>"--CCCTGAGGGG--", :SEQ_LEN=>14}
{:SEQ_NAME=>"ID00000001", :SEQ=>"--ACCTGAGGGT--", :SEQ_LEN=>14}
{:SEQ_NAME=>"ID00000002", :SEQ=>"GACGCTAAAGCGTA", :SEQ_LEN=>14}
{:SEQ_NAME=>"ID00000003", :SEQ=>"--CCCTTCGGGG--", :SEQ_LEN=>14}
{:SEQ_NAME=>"ID00000004", :SEQ=>"--CCCTTACGGG--", :SEQ_LEN=>14}
{:SEQ_NAME=>"ID00000005", :SEQ=>"--CCCTTACGGG--", :SEQ_LEN=>14}

Or we could slice the alignment using a set of primers:

BD.new.
read_fasta(input: "test.fna").
slice_align(forward: "CGCATACG", reverse: "GAGGGG", max_mismatches: 0,
            max_insertions: 0, max_deletions: 0).
dump.run

{:SEQ_NAME=>"ID00000000", :SEQ=>"CGCATACG-------CCCTGAGGGG", :SEQ_LEN=>25}
{:SEQ_NAME=>"ID00000001", :SEQ=>"CGCATGAT-------ACCTGAGGGT", :SEQ_LEN=>25}
{:SEQ_NAME=>"ID00000002", :SEQ=>"CGCATATACTCTTGACGCTAAAGCG", :SEQ_LEN=>25}
{:SEQ_NAME=>"ID00000003", :SEQ=>"CGTATGTG-------CCCTTCGGGG", :SEQ_LEN=>25}
{:SEQ_NAME=>"ID00000004", :SEQ=>"CGGATAAG-------CCCTTACGGG", :SEQ_LEN=>25}
{:SEQ_NAME=>"ID00000005", :SEQ=>"CGGATAAG-------CCCTTACGGG", :SEQ_LEN=>25}

Now, if we have a template file with the following FASTA entry:

>template
CTGAATACG-------CCATTCGATGG---

and spefifying primers these will be matched to the template and the hit positions used for slicing:

BD.new.
read_fasta(input: "test.fna").
slice_align(template_file: "template.fna", forward: "GAATACG",
            reverse: "ATTCGAT", max_mismatches: 0, max_insertions: 0,
            max_deletions: 0).
dump.run

{:SEQ_NAME=>"ID00000000", :SEQ=>"GCATACG-------CCCTGAGGG", :SEQ_LEN=>23}
{:SEQ_NAME=>"ID00000001", :SEQ=>"GCATGAT-------ACCTGAGGG", :SEQ_LEN=>23}
{:SEQ_NAME=>"ID00000002", :SEQ=>"GCATATACTCTTGACGCTAAAGC", :SEQ_LEN=>23}
{:SEQ_NAME=>"ID00000003", :SEQ=>"GTATGTG-------CCCTTCGGG", :SEQ_LEN=>23}
{:SEQ_NAME=>"ID00000004", :SEQ=>"GGATAAG-------CCCTTACGG", :SEQ_LEN=>23}
{:SEQ_NAME=>"ID00000005", :SEQ=>"GGATAAG-------CCCTTACGG", :SEQ_LEN=>23}

Finally, specifying a template file and an interval the positions used for slicing will be the ungapped positions from the template sequence. This is useful if you are slicing 16S rRNA alignments and want the E.coli corresponding positions - simply use the E.coli sequence as template.

BD.new.
read_fasta(input: "test.fna").
slice_align(template_file: "template.fna", slice: 4 .. 14).
dump.run

{:SEQ_NAME=>"ID00000000", :SEQ=>"ATACG-------CCCTGA", :SEQ_LEN=>18}
{:SEQ_NAME=>"ID00000001", :SEQ=>"ATGAT-------ACCTGA", :SEQ_LEN=>18}
{:SEQ_NAME=>"ID00000002", :SEQ=>"ATATACTCTTGACGCTAA", :SEQ_LEN=>18}
{:SEQ_NAME=>"ID00000003", :SEQ=>"ATGTG-------CCCTTC", :SEQ_LEN=>18}
{:SEQ_NAME=>"ID00000004", :SEQ=>"ATAAG-------CCCTTA", :SEQ_LEN=>18}
{:SEQ_NAME=>"ID00000005", :SEQ=>"ATAAG-------CCCTTA", :SEQ_LEN=>18}

rubocop: enable LineLength rubocop: disable ClassLength

Defined Under Namespace

Classes: PosIndex

Constant Summary

STATS =
%i(records_in records_out sequences_in sequences_out residues_in
residues_out)

Instance Method Summary collapse

Constructor Details

#initialize(options) ⇒ SliceAlign

Constructor for SliceAlign.

Parameters:

  • options (Hash)

    Options hash.

Options Hash (options):

  • :slice (Range, Integer)
  • :forward (String)
  • :forward_rc (String)
  • :reverse (String)
  • :reverse_rc (String)
  • :max_mismatches (Integer)
  • :max_insertions (Integer)
  • :max_deletions (Integer)
  • :template_file (String)


182
183
184
185
186
187
188
189
190
191
192
# File 'lib/BioDSL/commands/slice_align.rb', line 182

def initialize(options)
  @options  = options
  @forward  = forward
  @reverse  = reverse
  @indels   = BioDSL::Seq::INDELS.sort.join
  @template = nil
  @slice    = options[:slice]

  check_options
  defaults
end

Instance Method Details

#lmbProc

Return the comman lamba for slice_align.

Returns:

  • (Proc)

    Command lambda.



197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
# File 'lib/BioDSL/commands/slice_align.rb', line 197

def lmb
  lambda do |input, output, status|
    status_init(status, STATS)

    parse_template_file
    setup_template_slice

    input.each do |record|
      @status[:records_in] += 1
      slice_align(record) if record.key? :SEQ
      output << record
      @status[:records_out] += 1
    end
  end
end