Class: Bio::Sequence

Inherits:
Object
Includes:
Format, SequenceMasker
Defined in:
lib/bio/sequence.rb,
lib/bio/sequence/na.rb,
lib/bio/sequence/aa.rb,
lib/bio/sequence/compat.rb,
lib/bio/sequence/format.rb,
lib/bio/sequence/common.rb,
lib/bio/sequence/generic.rb,
lib/bio/sequence/quality_score.rb,
lib/bio/sequence/sequence_masker.rb

Overview

DESCRIPTION

Bio::Sequence objects represent annotated sequences in BioRuby. A Bio::Sequence object is a wrapper around the actual sequence, which is represented as either a Bio::Sequence::NA or a Bio::Sequence::AA object. For most users, this encapsulation is completely transparent: Bio::Sequence responds to all methods defined for Bio::Sequence::NA/AA objects, taking the same arguments and returning the same values (even though these methods are not documented specifically for Bio::Sequence).

USAGE

# Create a nucleic or amino acid sequence
dna = Bio::Sequence.auto('atgcatgcATGCATGCAAAA')
rna = Bio::Sequence.auto('augcaugcaugcaugcaaaa')
aa = Bio::Sequence.auto('ACDEFGHIKLMNPQRSTVWYU')

# Print it out
puts dna.to_s
puts aa.to_s

# Get a subsequence, bioinformatics style (first nucleotide is '1')
puts dna.subseq(2,6)

# Get a subsequence, informatics style (first nucleotide is '0')
puts dna[2,6]

# Print in FASTA format
puts dna.output(:fasta)

# Print all codons
dna.window_search(3,3) do |codon|
  puts codon
end

# Splice or otherwise mangle your sequence
puts dna.splicing("complement(join(1..5,16..20))")
puts rna.splicing("complement(join(1..5,16..20))")

# Convert a sequence containing ambiguity codes into a 
# regular expression you can use for subsequent searching
puts aa.to_re

# These should speak for themselves
puts dna.complement
puts dna.composition
puts dna.molecular_weight
puts dna.translate
puts dna.gc_percent
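
The two subsequence calls in the usage above differ in both origin and argument meaning. The following is a minimal plain-Ruby sketch, assuming that `subseq(s, e)` takes 1-based, inclusive start and end positions while `[start, length]` is ordinary 0-based String slicing. The helper `subseq_1based` is hypothetical and only mimics that convention on a plain String; it is not the BioRuby implementation.

```ruby
# Hypothetical helper mimicking 1-based, inclusive subsequence extraction
# on a plain String (an illustration, not BioRuby's own code).
def subseq_1based(str, first, last)
  str[(first - 1)..(last - 1)]
end

DNA = 'atgcatgcatgcatgcaaaa'

puts subseq_1based(DNA, 2, 6)  # positions 2 through 6: "tgcat"
puts DNA[2, 6]                 # six characters starting at index 2: "gcatgc"
```

Note that the second arguments mean different things: an end position in the first call, a length in the second.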

Defined Under Namespace

Modules: Adapter, Common, Format, QualityScore, SequenceMasker

Classes: AA, DBLink, Generic, NA

Instance Attribute Summary

Class Method Summary

Instance Method Summary

Methods included from SequenceMasker

#mask_with_enumerator, #mask_with_error_probability, #mask_with_quality_score

Methods included from Format

#list_output_formats, #output, #output_fasta

Constructor Details

#initialize(str) ⇒ Sequence

Create a new Bio::Sequence object

s = Bio::Sequence.new('atgc')
puts s                                  #=> 'atgc'

Note that this method does not initialize the contained sequence as any kind of BioRuby object, only as a simple String:

puts s.seq.class                        #=> String

See Bio::Sequence#na, Bio::Sequence#aa, and Bio::Sequence#auto for methods that transform the basic String of a newly created Bio::Sequence object into a proper BioRuby object.


Arguments:

  • (required) str: String or Bio::Sequence::NA/AA object

Returns

Bio::Sequence object



# File 'lib/bio/sequence.rb', line 97

def initialize(str)
  @seq = str
end

Dynamic Method Handling

This class handles dynamic methods through the method_missing method

#method_missing(sym, *args, &block) ⇒ Object

Pass any unknown method calls to the wrapped sequence object. See www.rubycentral.com/book/ref_c_object.html#Object.method_missing



# File 'lib/bio/sequence.rb', line 103

def method_missing(sym, *args, &block) #:nodoc:
  begin
    seq.__send__(sym, *args, &block)
  rescue NoMethodError => evar
    lineno = __LINE__ - 2
    file = __FILE__
    bt_here = [ "#{file}:#{lineno}:in \`__send__\'",
                "#{file}:#{lineno}:in \`method_missing\'"
              ]
    if bt_here == evar.backtrace[0, 2] then
      bt = evar.backtrace[2..-1]
      evar = evar.class.new("undefined method \`#{sym.to_s}\' for #{self.inspect}")
      evar.set_backtrace(bt)
    end
    #p lineno
    #p file
    #p bt_here
    #p evar.backtrace
    raise(evar)
  end
end
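
The forwarding idea can be sketched in plain Ruby without BioRuby installed. This is a simplified illustration, not the method above: the class `SeqWrapper` is hypothetical, and instead of rescuing NoMethodError and rewriting the backtrace it checks `respond_to?` first and falls back to `super`.

```ruby
# Hypothetical wrapper that forwards unknown calls to the wrapped object,
# in the spirit of the method_missing shown above.
class SeqWrapper
  attr_reader :seq

  def initialize(str)
    @seq = str
  end

  # Forward anything this class does not define to the wrapped object.
  def method_missing(sym, *args, &block)
    if seq.respond_to?(sym)
      seq.__send__(sym, *args, &block)
    else
      super
    end
  end

  # Keep respond_to? consistent with the forwarding above.
  def respond_to_missing?(sym, include_private = false)
    seq.respond_to?(sym, include_private) || super
  end
end

w = SeqWrapper.new('atgc')
puts w.upcase   # forwarded to String#upcase
puts w.length   # forwarded to String#length
```

Calling a method that neither the wrapper nor the wrapped String defines still raises NoMethodError, because `super` reaches the default `method_missing`.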

Instance Attribute Details

#classification ⇒ Object Also known as: taxonomy

Organism classification, taxonomic classification of the source organism. (Array of String)

# File 'lib/bio/sequence.rb', line 233

def classification
  @classification
end

#comments ⇒ Object

Comments (String or an Array of String)

# File 'lib/bio/sequence.rb', line 140

def comments
  @comments
end

#data_class ⇒ Object

Data class defined by EMBL (String). See www.ebi.ac.uk/embl/Documentation/User_manual/usrman.html#3_1

# File 'lib/bio/sequence.rb', line 195

def data_class
  @data_class
end

#date_created ⇒ Object

Created date of the sequence entry (Date, DateTime, Time, or String)

# File 'lib/bio/sequence.rb', line 208

def date_created
  @date_created
end

#date_modified ⇒ Object

Last modified date of the sequence entry (Date, DateTime, Time, or String)

# File 'lib/bio/sequence.rb', line 211

def date_modified
  @date_modified
end

#dblinks ⇒ Object

Links to other database entries. (An Array of Bio::Sequence::DBLink objects)

# File 'lib/bio/sequence.rb', line 147

def dblinks
  @dblinks
end

#definition ⇒ Object

A description of the sequence (String)

# File 'lib/bio/sequence.rb', line 131

def definition
  @definition
end

#division ⇒ Object

Taxonomic division defined by EMBL/GenBank/DDBJ (String). See www.ebi.ac.uk/embl/Documentation/User_manual/usrman.html#3_2

# File 'lib/bio/sequence.rb', line 199

def division
  @division
end

#entry_id ⇒ Object

The sequence identifier (String). For example, for a sequence of GenBank origin, this is the locus name. For a sequence of EMBL origin, this is the primary accession number.

# File 'lib/bio/sequence.rb', line 128

def entry_id
  @entry_id
end

#entry_version ⇒ Object

Version of the entry (String or Integer). Unlike sequence_version, entry_version is a database maintainer's internal version number. The version number changes when the database maintainer modifies the entry. The same entry in EMBL, GenBank, and DDBJ may have a different entry_version in each database.

# File 'lib/bio/sequence.rb', line 226

def entry_version
  @entry_version
end

#error_probabilities ⇒ Object

Error probabilities of the bases/residues in the sequence. (Array containing Float, or nil)

# File 'lib/bio/sequence.rb', line 170

def error_probabilities
  @error_probabilities
end

#features ⇒ Object

Features (An Array of Bio::Feature objects)

# File 'lib/bio/sequence.rb', line 134

def features
  @features
end

#id_namespace ⇒ Object

Namespace of the sequence IDs described in entry_id, primary_accession, and secondary_accessions methods (String). For example, 'EMBL', 'GenBank', 'DDBJ', 'RefSeq'.

# File 'lib/bio/sequence.rb', line 242

def id_namespace
  @id_namespace
end

#keywords ⇒ Object

Keywords (An Array of String)

# File 'lib/bio/sequence.rb', line 143

def keywords
  @keywords
end

#molecule_type ⇒ Object

Molecule type (String). “DNA” or “RNA” for nucleotide sequences.

# File 'lib/bio/sequence.rb', line 191

def molecule_type
  @molecule_type
end

#moltype ⇒ Object

The class of the wrapped sequence: Bio::Sequence::NA or Bio::Sequence::AA.

# File 'lib/bio/sequence.rb', line 150

def moltype
  @moltype
end

#organelle ⇒ Object

(not well supported) Organelle information (String).

# File 'lib/bio/sequence.rb', line 237

def organelle
  @organelle
end

#other_seqids ⇒ Object

Sequence identifiers which are not described in entry_id, primary_accession, and secondary_accessions methods (Array of Bio::Sequence::DBLink objects). For example, an NCBI GI number can be stored here. Note that only identifiers of the entry itself should be stored. For database cross references, dblinks should be used.

# File 'lib/bio/sequence.rb', line 250

def other_seqids
  @other_seqids
end

#primary_accession ⇒ Object

Primary accession number (String)

# File 'lib/bio/sequence.rb', line 202

def primary_accession
  @primary_accession
end

#quality_score_type ⇒ Object

The meaning (calculation method) of the quality scores stored in the quality_scores attribute. May be one of :phred, :solexa, or nil.

Note that if it is nil, and error_probabilities is empty, some methods implicitly assume that it is :phred (PHRED score).

# File 'lib/bio/sequence.rb', line 166

def quality_score_type
  @quality_score_type
end
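
How the two score types relate to error probabilities can be sketched as follows. This is an illustration only: the function names are hypothetical, and the formulas are the standard PHRED and Solexa score definitions rather than code taken from BioRuby. Treating nil as :phred follows the note above.

```ruby
# Hypothetical helpers; the formulas are the standard PHRED and Solexa
# score definitions, not code taken from BioRuby.

# PHRED: Q = -10 * log10(p)  =>  p = 10 ** (-Q / 10)
def phred_to_probability(q)
  10.0**(-q / 10.0)
end

# Solexa: Q = -10 * log10(p / (1 - p))  =>  p = 1 / (1 + 10 ** (Q / 10))
def solexa_to_probability(q)
  1.0 / (1.0 + 10.0**(q / 10.0))
end

# Pick the conversion by score type, treating nil as :phred.
def error_probabilities_for(scores, type = nil)
  case type || :phred
  when :phred  then scores.map { |q| phred_to_probability(q) }
  when :solexa then scores.map { |q| solexa_to_probability(q) }
  else raise ArgumentError, "unknown quality score type: #{type.inspect}"
  end
end

p error_probabilities_for([10, 20, 30])
p error_probabilities_for([10, 20, 30], :solexa)
```

The two scales nearly coincide for high scores and diverge for low ones, which is why the score type matters when interpreting quality_scores.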

#quality_scores ⇒ Object

Quality scores of the bases/residues in the sequence. (Array containing Integer, or nil)

# File 'lib/bio/sequence.rb', line 158

def quality_scores
  @quality_scores
end

#references ⇒ Object

References (An Array of Bio::Reference objects)

# File 'lib/bio/sequence.rb', line 137

def references
  @references
end

#release_created ⇒ Object

Release information when created (String)

# File 'lib/bio/sequence.rb', line 214

def release_created
  @release_created
end

#release_modified ⇒ Object

Release information when last-modified (String)

# File 'lib/bio/sequence.rb', line 217

def release_modified
  @release_modified
end

#secondary_accessions ⇒ Object

Secondary accession numbers (Array of String)

# File 'lib/bio/sequence.rb', line 205

def secondary_accessions
  @secondary_accessions
end

#seq ⇒ Object

The sequence object, usually Bio::Sequence::NA/AA, but could be a simple String

# File 'lib/bio/sequence.rb', line 154

def seq
  @seq
end

#sequence_version ⇒ Object

Version number of the sequence (String or Integer). Unlike entry_version, sequence_version will be changed when the submitter of the sequence updates the entry. Normally, the same entry taken from different databases (EMBL, GenBank, and DDBJ) may have the same sequence_version.

# File 'lib/bio/sequence.rb', line 181

def sequence_version
  @sequence_version
end

#species ⇒ Object

Organism species (String). For example, “Escherichia coli”.

# File 'lib/bio/sequence.rb', line 229

def species
  @species
end

#strandedness ⇒ Object

Strandedness (String). “single” (single-stranded), “double” (double-stranded), “mixed” (mixed-stranded), or nil.

# File 'lib/bio/sequence.rb', line 188

def strandedness
  @strandedness
end

#topology ⇒ Object

Topology (String). “circular”, “linear”, or nil.

# File 'lib/bio/sequence.rb', line 184

def topology
  @topology
end

Class Method Details

.adapter(source_data, adapter_module) ⇒ Object

Normally, users should not call this method directly. Use Bio::*#to_biosequence (e.g. Bio::GenBank#to_biosequence).

Creates a new Bio::Sequence object from database data with an adapter module.



# File 'lib/bio/sequence.rb', line 461

def self.adapter(source_data, adapter_module)
  biosequence = self.new(nil)
  biosequence.instance_eval {
    remove_instance_variable(:@seq)
    @source_data = source_data
  }
  biosequence.extend(adapter_module)
  biosequence
end
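
The adapter mechanism relies on extending a single object with a module whose methods read from the stored source data. A minimal plain-Ruby sketch of that pattern follows; `Record`, `HashAdapter`, and the sample Hash are hypothetical stand-ins, and unlike the method above the sketch does not remove an existing instance variable.

```ruby
# Hypothetical adapter module: attribute readers pull values out of
# @source_data on demand instead of from stored instance variables.
module HashAdapter
  def entry_id
    @source_data[:id]
  end

  def definition
    @source_data[:description]
  end
end

# Hypothetical stand-in for the wrapper class.
class Record
  attr_reader :entry_id, :definition

  def self.adapter(source_data, adapter_module)
    record = new
    record.instance_variable_set(:@source_data, source_data)
    record.extend(adapter_module) # module methods now shadow the readers
    record
  end
end

rec = Record.adapter({ id: 'X00001', description: 'example entry' }, HashAdapter)
puts rec.entry_id
puts rec.definition
```

Because `extend` affects only the one object, other instances of the class keep their ordinary attribute readers.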

.auto(str) ⇒ Object

Given a sequence String, guess its type, Amino Acid or Nucleic Acid, and return a new Bio::Sequence object wrapping a sequence of the guessed type (either Bio::Sequence::AA or Bio::Sequence::NA)

s = Bio::Sequence.auto('atgc')
puts s.seq.class                        #=> Bio::Sequence::NA

Arguments:

  • (required) str: String or Bio::Sequence::NA/AA object

Returns

Bio::Sequence object



# File 'lib/bio/sequence.rb', line 281

def self.auto(str)
  seq = self.new(str)
  seq.auto
  return seq
end

.guess(str, *args) ⇒ Object

Guess the class of a given sequence. Returns the class (Bio::Sequence::AA or Bio::Sequence::NA) guessed. In general, used by developers only, but if you know what you are doing, feel free.

puts Bio::Sequence.guess('atgc')        #=> Bio::Sequence::NA

There are three optional parameters: `threshold`, `length`, and `index`.

The `threshold` value (defaults to 0.9) is the frequency of nucleic acid bases [AGCTUagctu] that the sequence must exceed for this method to produce a Bio::Sequence::NA “guess”. In the default case, if 90% or fewer of the residues (after excluding [Nn]) are in the set [AGCTUagctu], then the guess is Bio::Sequence::AA.

puts Bio::Sequence.guess('atgcatgcqq')      #=> Bio::Sequence::AA
puts Bio::Sequence.guess('atgcatgcqq', 0.8) #=> Bio::Sequence::AA
puts Bio::Sequence.guess('atgcatgcqq', 0.7) #=> Bio::Sequence::NA

The `length` value is how much of the total sequence to use in the guess (default 10000). If your sequence is very long, you may want to use a smaller amount to reduce the computational burden.

# limit the guess to the first 1000 positions
puts Bio::Sequence.guess('A VERY LONG SEQUENCE', 0.9, 1000)

The `index` value is where to start the guess. Perhaps you know there are a lot of gaps at the start…

puts Bio::Sequence.guess('-----atgcc')             #=> Bio::Sequence::AA
puts Bio::Sequence.guess('-----atgcc',0.9,10000,5) #=> Bio::Sequence::NA

Arguments:

  • (required) str: String or Bio::Sequence::NA/AA object

  • (optional) threshold: Float in range 0,1 (default 0.9)

  • (optional) length: Fixnum (default 10000)

  • (optional) index: Fixnum (default 0)

Returns

Bio::Sequence::NA/AA



# File 'lib/bio/sequence.rb', line 379

def self.guess(str, *args)
  self.new(str).guess(*args)
end

.input(str, format = nil) ⇒ Object

Create a new Bio::Sequence object from a formatted string (GenBank, EMBL, fasta format, etc.)

s = Bio::Sequence.input(str)

Arguments:

  • (required) str: string

  • (optional) format: format specification (class or nil)

Returns

Bio::Sequence object



# File 'lib/bio/sequence.rb', line 434

def self.input(str, format = nil)
  if format then
    klass = format
  else
    klass = Bio::FlatFile::AutoDetect.default.autodetect(str)
  end
  obj = klass.new(str)
  obj.to_biosequence
end

.read(str, format = nil) ⇒ Object

Alias of Bio::Sequence.input.

# File 'lib/bio/sequence.rb', line 445

def self.read(str, format = nil)
  input(str, format)
end

Instance Method Details

#aa ⇒ Object

Transform the sequence wrapped in the current Bio::Sequence object into a Bio::Sequence::AA object. This method will change the current object. This method does not validate your choice, so be careful!

s = Bio::Sequence.new('atgc')
puts s.seq.class                        #=> String
s.aa
puts s.seq.class                        #=> Bio::Sequence::AA !!!

However, if you know your sequence type, this method may be constructively used after initialization:

s = Bio::Sequence.new('RRLE')
s.aa

Returns

Bio::Sequence::AA



# File 'lib/bio/sequence.rb', line 420

def aa
  @seq = AA.new(seq)
  @moltype = AA
end

#accessions ⇒ Object

Accession numbers of the sequence

Returns

Array of String



# File 'lib/bio/sequence.rb', line 452

def accessions
  [ primary_accession, secondary_accessions ].flatten.compact
end
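
The flatten/compact idiom above tolerates unset attributes. A small plain-Ruby illustration, with a hypothetical helper `merge_accessions` and made-up accession strings:

```ruby
# Plain-Ruby illustration of the flatten/compact idiom used above.
def merge_accessions(primary, secondaries)
  [primary, secondaries].flatten.compact
end

p merge_accessions('X00001', ['X00002', 'X00003']) # both present
p merge_accessions('X00001', nil)                  # no secondary accessions
p merge_accessions(nil, nil)                       # nothing set yet
```

`flatten` merges the nested Array of secondary accessions into one list, and `compact` drops any nil left by an attribute that was never assigned, so the result is always an Array.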

#auto ⇒ Object

Guess the type of sequence, Amino Acid or Nucleic Acid, and create a new sequence object (Bio::Sequence::AA or Bio::Sequence::NA) on the basis of this guess. This method will change the current Bio::Sequence object.

s = Bio::Sequence.new('atgc')
puts s.seq.class                        #=> String
s.auto
puts s.seq.class                        #=> Bio::Sequence::NA

Returns

Bio::Sequence::NA/AA object



# File 'lib/bio/sequence.rb', line 262

def auto
  @moltype = guess
  if @moltype == NA
    @seq = NA.new(seq)
  else
    @seq = AA.new(seq)
  end
end

#guess(threshold = 0.9, length = 10000, index = 0) ⇒ Object

Guess the class of the current sequence. Returns the class (Bio::Sequence::AA or Bio::Sequence::NA) guessed. In general, used by developers only, but if you know what you are doing, feel free.

s = Bio::Sequence.new('atgc')
puts s.guess                            #=> Bio::Sequence::NA

There are three parameters: `threshold`, `length`, and `index`.

The `threshold` value (defaults to 0.9) is the frequency of nucleic acid bases [AGCTUagctu] that the sequence must exceed for this method to produce a Bio::Sequence::NA “guess”. In the default case, if 90% or fewer of the residues (after excluding [Nn]) are in the set [AGCTUagctu], then the guess is Bio::Sequence::AA.

s = Bio::Sequence.new('atgcatgcqq')
puts s.guess                            #=> Bio::Sequence::AA
puts s.guess(0.8)                       #=> Bio::Sequence::AA
puts s.guess(0.7)                       #=> Bio::Sequence::NA

The `length` value is how much of the total sequence to use in the guess (default 10000). If your sequence is very long, you may want to use a smaller amount to reduce the computational burden.

s = Bio::Sequence.new('A VERY LONG SEQUENCE')
puts s.guess(0.9, 1000)  # limit the guess to the first 1000 positions

The `index` value is where to start the guess. Perhaps you know there are a lot of gaps at the start…

s = Bio::Sequence.new('-----atgcc')
puts s.guess                            #=> Bio::Sequence::AA
puts s.guess(0.9,10000,5)               #=> Bio::Sequence::NA

Arguments:

  • (optional) threshold: Float in range 0,1 (default 0.9)

  • (optional) length: Fixnum (default 10000)

  • (optional) index: Fixnum (default 0)

Returns

Bio::Sequence::NA/AA



# File 'lib/bio/sequence.rb', line 326

def guess(threshold = 0.9, length = 10000, index = 0)
  str = seq.to_s[index,length].to_s.extend Bio::Sequence::Common
  cmp = str.composition

  bases = cmp['A'] + cmp['T'] + cmp['G'] + cmp['C'] + cmp['U'] +
          cmp['a'] + cmp['t'] + cmp['g'] + cmp['c'] + cmp['u']

  total = str.length - cmp['N'] - cmp['n']

  if bases.to_f / total > threshold
    return NA
  else
    return AA
  end
end
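
The same decision rule can be tried without BioRuby installed. The sketch below is a hypothetical standalone version: `guess_type` is not part of the library, it returns the symbols :na and :aa in place of the Bio::Sequence::NA and Bio::Sequence::AA classes, and it uses String#count instead of the composition Hash.

```ruby
# Hypothetical standalone version of the guess logic, using symbols
# (:na / :aa) in place of the Bio::Sequence::NA / AA classes.
def guess_type(sequence, threshold = 0.9, length = 10_000, index = 0)
  window = sequence.to_s[index, length].to_s
  bases  = window.count('ACGTUacgtu')          # characters counted as nucleotides
  total  = window.length - window.count('Nn')  # N/n are left out of the total
  bases.to_f / total > threshold ? :na : :aa
end

p guess_type('atgcatgcqq')                  # 8 of 10 are bases; 0.8 is not > 0.9
p guess_type('atgcatgcqq', 0.7)             # 0.8 > 0.7
p guess_type('-----atgcc')                  # gaps count against the ratio
p guess_type('-----atgcc', 0.9, 10_000, 5)  # skip the five leading gaps
```

The comparison is strict, so a sequence whose base frequency equals the threshold exactly is still classified as amino acid.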

#na ⇒ Object

Transform the sequence wrapped in the current Bio::Sequence object into a Bio::Sequence::NA object. This method will change the current object. This method does not validate your choice, so be careful!

s = Bio::Sequence.new('RRLE')
puts s.seq.class                        #=> String
s.na
puts s.seq.class                        #=> Bio::Sequence::NA !!!

However, if you know your sequence type, this method may be constructively used after initialization:

s = Bio::Sequence.new('atgc')
s.na

Returns

Bio::Sequence::NA



# File 'lib/bio/sequence.rb', line 399

def na
  @seq = NA.new(seq)
  @moltype = NA
end

#to_s ⇒ Object Also known as: to_str

Return sequence as String. The original sequence is unchanged.

s = Bio::Sequence.new('atgc')
puts s.to_s                             #=> 'atgc'
puts s.to_s.class                       #=> String
puts s                                  #=> 'atgc'
puts s.class                            #=> Bio::Sequence

Returns

String object



# File 'lib/bio/sequence/compat.rb', line 32

def to_s
  String.new(self.seq)
end
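
Because `String.new` builds a fresh String, changes to the returned value do not reach the wrapped sequence. A minimal plain-Ruby sketch of that behaviour, using a hypothetical `PlainWrapper` class rather than Bio::Sequence itself:

```ruby
# Hypothetical minimal wrapper with the same to_s body as shown above.
class PlainWrapper
  attr_reader :seq

  def initialize(str)
    @seq = str
  end

  def to_s
    String.new(seq) # a fresh String, not a reference to @seq
  end
end

w = PlainWrapper.new('atgc')
copy = w.to_s
copy << 'nnnn'  # mutate only the copy
puts copy
puts w.seq      # the wrapped sequence is untouched
```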