Class: BioDSL::DereplicateSeq

Inherits:
Object
Defined in:
lib/BioDSL/commands/dereplicate_seq.rb

Overview

Dereplicate sequences in the stream.

dereplicate_seq removes duplicate sequence records from the stream. Each unique sequence is output once, together with the number of records that shared it (SEQ_COUNT). Sequence matching is case-sensitive by default; set the ignore_case option to match sequences regardless of case.

Usage

dereplicate_seq([ignore_case: <bool>])

Options

  • ignore_case: <bool> - Ignore sequence case.

Examples

Consider the following FASTA file test.fna:

>test1
ATGC
>test2
ATGC
>test3
GCAT

To dereplicate all sequences we use read_fasta and dereplicate_seq:

BD.new.read_fasta(input: "test.fna").dereplicate_seq.dump.run

{:SEQ_NAME=>"test1", :SEQ=>"ATGC", :SEQ_LEN=>4, :SEQ_COUNT=>2}
{:SEQ_NAME=>"test3", :SEQ=>"GCAT", :SEQ_LEN=>4, :SEQ_COUNT=>1}
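
The counting behaviour shown above can be sketched in plain Ruby. This is only an illustration under stated assumptions, not the BioDSL implementation: the dereplicate helper is hypothetical, it keeps everything in memory, and it assumes the first record seen for each sequence is the one retained (which is consistent with the example output, where test1 is kept with a count of 2).

```ruby
# Hypothetical helper for illustration only; not part of BioDSL.
# Collapses records with identical :SEQ values and counts the replicates.
def dereplicate(records, ignore_case: false)
  lookup = {} # comparison key => first record seen, carrying a running count

  records.each do |record|
    # With ignore_case, compare on a case-folded copy of the sequence.
    key = ignore_case ? record[:SEQ].downcase : record[:SEQ]

    if lookup.key?(key)
      lookup[key][:SEQ_COUNT] += 1
    else
      lookup[key] = record.merge(SEQ_COUNT: 1)
    end
  end

  # Ruby hashes preserve insertion order, so output follows first occurrence.
  lookup.values
end

records = [
  { SEQ_NAME: 'test1', SEQ: 'ATGC', SEQ_LEN: 4 },
  { SEQ_NAME: 'test2', SEQ: 'ATGC', SEQ_LEN: 4 },
  { SEQ_NAME: 'test3', SEQ: 'GCAT', SEQ_LEN: 4 }
]

dereplicate(records).each { |record| p record }
```

With the three records from test.fna, the sketch yields two records: test1 with a SEQ_COUNT of 2 and test3 with a SEQ_COUNT of 1. Passing ignore_case: true would additionally collapse sequences such as atgc and ATGC into one record.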

Constant Summary

STATS =
%i(records_in records_out sequences_in sequences_out residues_in
residues_out)

Constructor Details

#initialize(options) ⇒ DereplicateSeq

Constructor for the DereplicateSeq class.

Parameters:

  • options (Hash)

    Options hash.

Options Hash (options):

  • :ignore_case (Boolean)

    Ignore sequence case.



# File 'lib/BioDSL/commands/dereplicate_seq.rb', line 70

def initialize(options)
  @options = options
  @lookup  = {}

  check_options
end

Instance Method Details

#lmb ⇒ Proc

Return the command lambda for DereplicateSeq.

Returns:

  • (Proc)

    Command lambda.



# File 'lib/BioDSL/commands/dereplicate_seq.rb', line 80

def lmb
  lambda do |input, output, status|
    status_init(status, STATS)

    TmpDir.create('dereplicate_seq') do |tmp_file, _|
      process_input(input, output, tmp_file)
      process_output(output, tmp_file)
    end
  end
end