Class: BioDSL::SplitPairSeq

Inherits:
Object
Defined in:
lib/BioDSL/commands/split_pair_seq.rb

Overview

Split paired-end sequences in the stream.

split_pair_seq splits sequences in the stream that were previously merged with merge_pair_seq. Sequence names must be in either Illumina 1.3/1.5 format, with a trailing /1 or /2, or Illumina 1.8 format, containing 1: or 2:. Each merged sequence is output as two records, the first named with 1 and the second with 2.
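
The naming rule above can be sketched in plain Ruby. This is only an illustration: pair_names is a hypothetical helper, not part of BioDSL, the regular expressions are one possible reading of the two formats, and raising on an unrecognised name is an assumption.

```ruby
# Hypothetical helper, not part of BioDSL: derive the names of the two
# reads from the name of a merged pair.
def pair_names(name)
  case name
  when %r{\A(.*)/[12]\z}
    # Illumina 1.3/1.5: the name ends in /1 or /2.
    ["#{$1}/1", "#{$1}/2"]
  when /\A(\S+) [12]:(.*)\z/
    # Illumina 1.8: "1:" or "2:" follows the first space.
    ["#{$1} 1:#{$2}", "#{$1} 2:#{$2}"]
  else
    # Assumption: an unrecognised name is treated as an error.
    raise ArgumentError, "unrecognised pair name: #{name}"
  end
end

p pair_names('HWUSI-EAS100R:6:73:941:1973#0/1')
p pair_names('M01168:16:000000000-A1R9L:1:1101:14862:1868 1:N:0:14')
```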

Usage

split_pair_seq

Options

Examples

Consider the following records created with merge_pair_seq:

{:SEQ_NAME=>"M01168:16:000000000-A1R9L:1:1101:14862:1868 1:N:0:14",
 :SEQ=>"TGGGGAATATTGGACAATGGCCTGTTTGCTACCCACGCTT",
 :SEQ_LEN=>40,
 :SCORES=>"<??????BDDDDDDDDGGGG?????BB<-<BDDDDDFEEF",
 :SEQ_LEN_LEFT=>20,
 :SEQ_LEN_RIGHT=>20}
{:SEQ_NAME=>"M01168:16:000000000-A1R9L:1:1101:13906:2139 1:N:0:14",
 :SEQ=>"TAGGGAATCTTGCACAATGGACTCTTCGCTACCCATGCTT",
 :SEQ_LEN=>40,
 :SCORES=>"<???9?BBBDBDDBDDFFFF,5<??BB?DDABDBDDFFFF",
 :SEQ_LEN_LEFT=>20,
 :SEQ_LEN_RIGHT=>20}
{:SEQ_NAME=>"M01168:16:000000000-A1R9L:1:1101:14865:2158 1:N:0:14",
 :SEQ=>"TAGGGAATCTTGCACAATGGCCTCTTCGCTACCCATGCTT",
 :SEQ_LEN=>40,
 :SCORES=>"?????BBBBBDDBDDBFFFF??,<??B?BB?BBBBBFF?F",
 :SEQ_LEN_LEFT=>20,
 :SEQ_LEN_RIGHT=>20}

These can be split using split_pair_seq:

BD.new.
read_fastq(input: "test.fq", encoding: :base_33).
merge_pair_seq.
split_pair_seq.
dump.
run

{:SEQ_NAME=>"M01168:16:000000000-A1R9L:1:1101:14862:1868 1:N:0:14",
 :SEQ=>"TGGGGAATATTGGACAATGG",
 :SEQ_LEN=>20,
 :SCORES=>"<??????BDDDDDDDDGGGG"}
{:SEQ_NAME=>"M01168:16:000000000-A1R9L:1:1101:14862:1868 2:N:0:14",
 :SEQ=>"CCTGTTTGCTACCCACGCTT",
 :SEQ_LEN=>20,
 :SCORES=>"?????BB<-<BDDDDDFEEF"}
{:SEQ_NAME=>"M01168:16:000000000-A1R9L:1:1101:13906:2139 1:N:0:14",
 :SEQ=>"TAGGGAATCTTGCACAATGG",
 :SEQ_LEN=>20,
 :SCORES=>"<???9?BBBDBDDBDDFFFF"}
{:SEQ_NAME=>"M01168:16:000000000-A1R9L:1:1101:13906:2139 2:N:0:14",
 :SEQ=>"ACTCTTCGCTACCCATGCTT",
 :SEQ_LEN=>20,
 :SCORES=>",5<??BB?DDABDBDDFFFF"}
{:SEQ_NAME=>"M01168:16:000000000-A1R9L:1:1101:14865:2158 1:N:0:14",
 :SEQ=>"TAGGGAATCTTGCACAATGG",
 :SEQ_LEN=>20,
 :SCORES=>"?????BBBBBDDBDDBFFFF"}
{:SEQ_NAME=>"M01168:16:000000000-A1R9L:1:1101:14865:2158 2:N:0:14",
 :SEQ=>"CCTCTTCGCTACCCATGCTT",
 :SEQ_LEN=>20,
 :SCORES=>"??,<??B?BB?BBBBBFF?F"}
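
The output above follows from slicing SEQ and SCORES at SEQ_LEN_LEFT. A minimal sketch of that step, assuming plain hashes as records: split_record is a hypothetical helper rather than the command's own code, and it only renames Illumina 1.8 style names as used in this example.

```ruby
# Hypothetical helper, not BioDSL's implementation: split one merged
# record into a left and a right record using the stored lengths.
def split_record(record)
  left  = record[:SEQ_LEN_LEFT]
  right = record[:SEQ_LEN_RIGHT]

  halves = [[0, left], [left, right]].map do |offset, length|
    half = { SEQ_NAME: record[:SEQ_NAME],
             SEQ: record[:SEQ][offset, length],
             SEQ_LEN: length }
    # SCORES is optional, so it is only sliced when present.
    half[:SCORES] = record[:SCORES][offset, length] if record[:SCORES]
    half
  end

  # Simplification: only the Illumina 1.8 naming from the example is handled.
  halves[1][:SEQ_NAME] = record[:SEQ_NAME].sub(/ 1:/, ' 2:')
  halves
end

merged = {
  SEQ_NAME: 'M01168:16:000000000-A1R9L:1:1101:14862:1868 1:N:0:14',
  SEQ: 'TGGGGAATATTGGACAATGGCCTGTTTGCTACCCACGCTT',
  SEQ_LEN: 40,
  SCORES: '<??????BDDDDDDDDGGGG?????BB<-<BDDDDDFEEF',
  SEQ_LEN_LEFT: 20,
  SEQ_LEN_RIGHT: 20
}

split_record(merged).each { |record| p record }
```

The SEQ_LEN_LEFT and SEQ_LEN_RIGHT keys are dropped from the halves, matching the records shown above.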

Constant Summary

STATS =
%i(records_in records_out sequences_in sequences_out residues_in
residues_out)

Instance Method Summary

Constructor Details

#initialize(options) ⇒ SplitPairSeq

Constructor for SplitPairSeq.

Parameters:

  • options (Hash)

    Options hash.



# File 'lib/BioDSL/commands/split_pair_seq.rb', line 108

def initialize(options)
  @options = options

  check_options
end

Instance Method Details

#lmb ⇒ Proc

Return command lambda for split_pair_seq.

Returns:

  • (Proc)

    Command lambda.



# File 'lib/BioDSL/commands/split_pair_seq.rb', line 117

def lmb
  lambda do |input, output, status|
    status_init(status, STATS)

    input.each do |record|
      @status[:records_in] += 1

      if record[:SEQ_NAME] && record[:SEQ] && record[:SEQ_LEN_LEFT] &&
         record[:SEQ_LEN_RIGHT]
        split_pair_seq(output, record)
      else
        output << record

        @status[:records_out] += 1
      end
    end
  end
end
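
The dispatch in the lambda (split records that carry all four keys, pass everything else through) can be tried outside BioDSL with plain arrays. The sketch below is a stand-in under stated assumptions: process and splitter are hypothetical names, streams are replaced by arrays, and counting one records_out per record produced by a split is an assumption, since that part of the command is not shown here.

```ruby
# Stand-in for the stream handling in #lmb, using plain arrays instead of
# BioDSL streams. `splitter` is a hypothetical callable returning the
# records that a merged record is split into.
def process(input, splitter)
  required = %i[SEQ_NAME SEQ SEQ_LEN_LEFT SEQ_LEN_RIGHT]
  status   = { records_in: 0, records_out: 0 }
  output   = []

  input.each do |record|
    status[:records_in] += 1

    parts = if required.all? { |key| record[key] }
              splitter.call(record) # merged pair: split it
            else
              [record]              # anything else passes through untouched
            end

    output.concat(parts)
    status[:records_out] += parts.size # assumption: one count per emitted record
  end

  [output, status]
end

# Trivial splitter for the demonstration: it cuts SEQ and keeps the name.
splitter = lambda do |rec|
  left = rec[:SEQ_LEN_LEFT]
  [{ SEQ_NAME: rec[:SEQ_NAME], SEQ: rec[:SEQ][0, left] },
   { SEQ_NAME: rec[:SEQ_NAME], SEQ: rec[:SEQ][left, rec[:SEQ_LEN_RIGHT]] }]
end

records = [
  { SEQ_NAME: 'pair/1', SEQ: 'ACGTTT', SEQ_LEN_LEFT: 3, SEQ_LEN_RIGHT: 3 },
  { FOO: 'not a sequence record' }
]

output, status = process(records, splitter)
p output
p status
```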