Class: Bio::Sequence

Inherits:
Object
Includes:
Format, SequenceMasker
Defined in:
lib/bio/sequence.rb,
lib/bio/sequence/aa.rb,
lib/bio/sequence/na.rb,
lib/bio/sequence/common.rb,
lib/bio/sequence/compat.rb,
lib/bio/sequence/format.rb,
lib/bio/sequence/generic.rb,
lib/bio/sequence/quality_score.rb,
lib/bio/sequence/sequence_masker.rb

Overview

DESCRIPTION

Bio::Sequence objects represent annotated sequences in bioruby. A Bio::Sequence object is a wrapper around the actual sequence, represented as either a Bio::Sequence::NA or a Bio::Sequence::AA object. For most users, this encapsulation will be completely transparent. Bio::Sequence responds to all methods defined for Bio::Sequence::NA/AA objects using the same arguments and returning the same values (even though these methods are not documented specifically for Bio::Sequence).

USAGE

# Create a nucleic or amino acid sequence
dna = Bio::Sequence.auto('atgcatgcATGCATGCAAAA')
rna = Bio::Sequence.auto('augcaugcaugcaugcaaaa')
aa = Bio::Sequence.auto('ACDEFGHIKLMNPQRSTVWYU')

# Print it out
puts dna.to_s
puts aa.to_s

# Get a subsequence, bioinformatics style (first nucleotide is '1')
puts dna.subseq(2,6)

# Get a subsequence, informatics style (first nucleotide is '0')
puts dna[2,6]

# Print in FASTA format
puts dna.output(:fasta)

# Print all codons
dna.window_search(3,3) do |codon|
  puts codon
end

# Splice or otherwise mangle your sequence
puts dna.splicing("complement(join(1..5,16..20))")
puts rna.splicing("complement(join(1..5,16..20))")

# Convert a sequence containing ambiguity codes into a 
# regular expression you can use for subsequent searching
puts aa.to_re

# These should speak for themselves
puts dna.complement
puts dna.composition
puts dna.molecular_weight
puts dna.translate
puts dna.gc_percent
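
The two subsequence calls above differ in more than their starting offset: subseq(2,6) takes a 1-based, inclusive start and end, while dna[2,6] is the usual Ruby String#[] with a 0-based start and a length. A minimal plain-Ruby sketch of the difference follows. It does not load BioRuby, and `subseq_1based` is a hypothetical helper written only for this illustration.

```ruby
# Plain-Ruby illustration; BioRuby itself is not required here.
dna = 'atgcatgcATGCATGCAAAA'

# Hypothetical helper mimicking a 1-based, inclusive (first, last) lookup.
def subseq_1based(str, first, last)
  str[(first - 1)..(last - 1)]
end

one_based  = subseq_1based(dna, 2, 6)  # positions 2 through 6
zero_based = dna[2, 6]                 # 6 characters starting at index 2

puts one_based   # tgcat
puts zero_based  # gcatgc

# Non-overlapping windows of width 3, comparable in spirit to window_search(3,3);
# the trailing 'AA' is too short to form a full window and is dropped.
codons = dna.scan(/.{3}/)
puts codons.length
```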

Defined Under Namespace

Modules: Adapter, Common, Format, QualityScore, SequenceMasker

Classes: AA, DBLink, Generic, NA

Methods included from SequenceMasker

#mask_with_enumerator, #mask_with_error_probability, #mask_with_quality_score

Methods included from Format

#list_output_formats, #output, #output_fasta

Constructor Details

#initialize(str) ⇒ Sequence

Create a new Bio::Sequence object

s = Bio::Sequence.new('atgc')
puts s                                  #=> 'atgc'

Note that this method does not initialize the contained sequence as any kind of bioruby object, only as a simple string:

puts s.seq.class                        #=> String

See Bio::Sequence#na, Bio::Sequence#aa, and Bio::Sequence#auto for methods that transform the basic String of a newly created Bio::Sequence object into a proper bioruby object.


Arguments:

  • (required) str: String or Bio::Sequence::NA/AA object

Returns

Bio::Sequence object



# File 'lib/bio/sequence.rb', line 99

def initialize(str)
  @seq = str
end

Dynamic Method Handling

This class handles dynamic methods through the method_missing method

#method_missing(sym, *args, &block) ⇒ Object

Pass any unknown method calls to the wrapped sequence object. See www.rubycentral.com/book/ref_c_object.html#Object.method_missing



# File 'lib/bio/sequence.rb', line 105

def method_missing(sym, *args, &block) #:nodoc:
  begin
    seq.__send__(sym, *args, &block)
  rescue NoMethodError => evar
    lineno = __LINE__ - 2
    file = __FILE__
    bt_here = [ "#{file}:#{lineno}:in \`__send__\'",
                "#{file}:#{lineno}:in \`method_missing\'"
              ]
    if bt_here == evar.backtrace[0, 2] then
      bt = evar.backtrace[2..-1]
      evar = evar.class.new("undefined method \`#{sym.to_s}\' for #{self.inspect}")
      evar.set_backtrace(bt)
    end
    #p lineno
    #p file
    #p bt_here
    #p evar.backtrace
    raise(evar)
  end
end
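
The forwarding idea can be shown in isolation with a small plain-Ruby sketch. `SeqWrapper` is a hypothetical class, not part of BioRuby, and it checks respond_to? before forwarding rather than rescuing NoMethodError and rewriting the backtrace as the source above does.

```ruby
# Hypothetical wrapper illustrating delegation through method_missing.
class SeqWrapper
  attr_reader :seq

  def initialize(seq)
    @seq = seq
  end

  # Forward any message the wrapped object understands; otherwise fall
  # back to the default behaviour, which raises NoMethodError.
  def method_missing(name, *args, &block)
    if seq.respond_to?(name)
      seq.__send__(name, *args, &block)
    else
      super
    end
  end

  # Keep respond_to? consistent with the forwarding above.
  def respond_to_missing?(name, include_private = false)
    seq.respond_to?(name, include_private) || super
  end
end

w = SeqWrapper.new('atgc')
puts w.upcase   # String#upcase runs on the wrapped string
puts w.length   # String#length runs on the wrapped string
```

Defining respond_to_missing? alongside method_missing is a common Ruby convention so that respond_to? reports the forwarded methods as well.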

Instance Attribute Details

#classification ⇒ Object Also known as: taxonomy

Organism classification, taxonomic classification of the source organism. (Array of String)

# File 'lib/bio/sequence.rb', line 235

def classification
  @classification
end

#comments ⇒ Object

Comments (String or an Array of String)

# File 'lib/bio/sequence.rb', line 142

def comments
  @comments
end

#data_class ⇒ Object

Data class defined by EMBL (String). See www.ebi.ac.uk/embl/Documentation/User_manual/usrman.html#3_1

# File 'lib/bio/sequence.rb', line 197

def data_class
  @data_class
end

#date_created ⇒ Object

Created date of the sequence entry (Date, DateTime, Time, or String)

# File 'lib/bio/sequence.rb', line 210

def date_created
  @date_created
end

#date_modified ⇒ Object

Last modified date of the sequence entry (Date, DateTime, Time, or String)

# File 'lib/bio/sequence.rb', line 213

def date_modified
  @date_modified
end

#dblinks ⇒ Object

Links to other database entries. (An Array of Bio::Sequence::DBLink objects)

# File 'lib/bio/sequence.rb', line 149

def dblinks
  @dblinks
end

#definition ⇒ Object

A description of the sequence (String)

# File 'lib/bio/sequence.rb', line 133

def definition
  @definition
end

#division ⇒ Object

Taxonomic division defined by EMBL/GenBank/DDBJ (String). See www.ebi.ac.uk/embl/Documentation/User_manual/usrman.html#3_2

# File 'lib/bio/sequence.rb', line 201

def division
  @division
end

#entry_id ⇒ Object

The sequence identifier (String). For example, for a sequence of GenBank origin, this is the locus name. For a sequence of EMBL origin, this is the primary accession number.

# File 'lib/bio/sequence.rb', line 130

def entry_id
  @entry_id
end

#entry_version ⇒ Object

Version of the entry (String or Integer). Unlike sequence_version, entry_version is a database maintainer’s internal version number. The version number will be changed when the database maintainer modifies the entry. The same entry in EMBL, GenBank, and DDBJ may have different entry_version values.

# File 'lib/bio/sequence.rb', line 228

def entry_version
  @entry_version
end

#error_probabilities ⇒ Object

Error probabilities of the bases/residues in the sequence. (Array containing Float, or nil)

# File 'lib/bio/sequence.rb', line 172

def error_probabilities
  @error_probabilities
end

#features ⇒ Object

Features (An Array of Bio::Feature objects)

# File 'lib/bio/sequence.rb', line 136

def features
  @features
end

#id_namespace ⇒ Object

Namespace of the sequence IDs described in entry_id, primary_accession, and secondary_accessions methods (String). For example, ‘EMBL’, ‘GenBank’, ‘DDBJ’, ‘RefSeq’.

# File 'lib/bio/sequence.rb', line 244

def id_namespace
  @id_namespace
end

#keywords ⇒ Object

Keywords (An Array of String)

# File 'lib/bio/sequence.rb', line 145

def keywords
  @keywords
end

#molecule_type ⇒ Object

Molecule type (String). “DNA” or “RNA” for a nucleotide sequence.

# File 'lib/bio/sequence.rb', line 193

def molecule_type
  @molecule_type
end

#moltype ⇒ Object

The class of the wrapped sequence: Bio::Sequence::NA or Bio::Sequence::AA

# File 'lib/bio/sequence.rb', line 152

def moltype
  @moltype
end

#organelle ⇒ Object

(not well supported) Organelle information (String).

# File 'lib/bio/sequence.rb', line 239

def organelle
  @organelle
end

#other_seqids ⇒ Object

Sequence identifiers which are not described in entry_id, primary_accession, and secondary_accessions methods (Array of Bio::Sequence::DBLink objects). For example, NCBI GI number can be stored. Note that only identifiers of the entry itself should be stored. For database cross references, dblinks should be used.

# File 'lib/bio/sequence.rb', line 252

def other_seqids
  @other_seqids
end

#primary_accession ⇒ Object

Primary accession number (String)

# File 'lib/bio/sequence.rb', line 204

def primary_accession
  @primary_accession
end

#quality_score_type ⇒ Object

The meaning (calculation method) of the quality scores stored in the quality_scores attribute. May be one of :phred, :solexa, or nil.

Note that if it is nil, and error_probabilities is empty, some methods implicitly assume that it is :phred (PHRED score).

# File 'lib/bio/sequence.rb', line 168

def quality_score_type
  @quality_score_type
end
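
The two score types differ in how a score Q relates to the error probability P. By the standard definitions, a PHRED score is Q = -10 log10(P) and a Solexa score is Q = -10 log10(P / (1 - P)). A minimal sketch of the conversions follows. The helper names are hypothetical and are not BioRuby methods.

```ruby
# Convert a PHRED score Q to an error probability: P = 10^(-Q/10)
def phred_to_probability(q)
  10.0**(-q / 10.0)
end

# Convert a Solexa score Q to an error probability: P = 1 / (1 + 10^(Q/10))
def solexa_to_probability(q)
  1.0 / (1.0 + 10.0**(q / 10.0))
end

puts phred_to_probability(20)
puts solexa_to_probability(0)
```

The two scales nearly coincide for high scores and diverge for low ones, which is why the score type needs to be recorded alongside the scores.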

#quality_scores ⇒ Object

Quality scores of the bases/residues in the sequence. (Array containing Integer, or nil)

# File 'lib/bio/sequence.rb', line 160

def quality_scores
  @quality_scores
end

#references ⇒ Object

References (An Array of Bio::Reference objects)

# File 'lib/bio/sequence.rb', line 139

def references
  @references
end

#release_created ⇒ Object

Release information when created (String)

# File 'lib/bio/sequence.rb', line 216

def release_created
  @release_created
end

#release_modified ⇒ Object

Release information when last modified (String)

# File 'lib/bio/sequence.rb', line 219

def release_modified
  @release_modified
end

#secondary_accessions ⇒ Object

Secondary accession numbers (Array of String)

# File 'lib/bio/sequence.rb', line 207

def secondary_accessions
  @secondary_accessions
end

#seq ⇒ Object

The sequence object, usually Bio::Sequence::NA/AA, but could be a simple String

# File 'lib/bio/sequence.rb', line 156

def seq
  @seq
end

#sequence_version ⇒ Object

Version number of the sequence (String or Integer). Unlike entry_version, sequence_version will be changed when the submitter of the sequence updates the entry. Normally, the same entry taken from different databases (EMBL, GenBank, and DDBJ) may have the same sequence_version.

# File 'lib/bio/sequence.rb', line 183

def sequence_version
  @sequence_version
end

#species ⇒ Object

Organism species (String). For example, “Escherichia coli”.

# File 'lib/bio/sequence.rb', line 231

def species
  @species
end

#strandedness ⇒ Object

Strandedness (String). “single” (single-stranded), “double” (double-stranded), “mixed” (mixed-stranded), or nil.

# File 'lib/bio/sequence.rb', line 190

def strandedness
  @strandedness
end

#topology ⇒ Object

Topology (String). “circular”, “linear”, or nil.

# File 'lib/bio/sequence.rb', line 186

def topology
  @topology
end

Class Method Details

.adapter(source_data, adapter_module) ⇒ Object

Normally, users should not call this method directly. Use Bio::*#to_biosequence (e.g. Bio::GenBank#to_biosequence).

Creates a new Bio::Sequence object from database data with an adapter module.



# File 'lib/bio/sequence.rb', line 463

def self.adapter(source_data, adapter_module)
  biosequence = self.new(nil)
  biosequence.instance_eval {
    remove_instance_variable(:@seq)
    @source_data = source_data
  }
  biosequence.extend(adapter_module)
  biosequence
end
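
The pattern in this method (store the source data on the instance, then extend that single instance with a module) can be sketched in plain Ruby. `Record` and `HashAdapter` are hypothetical names. How the real adapter modules use @source_data is not shown on this page, so the sketch simply assumes they read from it on demand.

```ruby
# Hypothetical adapter module: its methods read fields from @source_data.
module HashAdapter
  def definition
    @source_data[:definition]
  end

  def entry_id
    @source_data[:id]
  end
end

class Record
  # Simplified analogue of the adapter class method shown above.
  def self.adapter(source_data, adapter_module)
    obj = new
    obj.instance_variable_set(:@source_data, source_data)
    obj.extend(adapter_module) # adds the module's methods to this one object only
    obj
  end
end

rec = Record.adapter({ id: 'X1', definition: 'example entry' }, HashAdapter)
puts rec.entry_id
puts rec.definition
```

Because extend affects only the receiving object, other instances of the same class are left without the adapter's methods.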

.auto(str) ⇒ Object

Given a sequence String, guess its type, Amino Acid or Nucleic Acid, and return a new Bio::Sequence object wrapping a sequence of the guessed type (either Bio::Sequence::AA or Bio::Sequence::NA)

s = Bio::Sequence.auto('atgc')
puts s.seq.class                        #=> Bio::Sequence::NA

Arguments:

  • (required) str: String or Bio::Sequence::NA/AA object

Returns

Bio::Sequence object



# File 'lib/bio/sequence.rb', line 283

def self.auto(str)
  seq = self.new(str)
  seq.auto
  return seq
end

.guess(str, *args) ⇒ Object

Guess the class of a given sequence. Returns the class (Bio::Sequence::AA or Bio::Sequence::NA) guessed. In general, used by developers only, but if you know what you are doing, feel free.

puts Bio::Sequence.guess('atgc')        #=> Bio::Sequence::NA

There are three optional parameters: `threshold`, `length`, and `index`.

The `threshold` value (defaults to 0.9) is the frequency of nucleic acid bases [AGCTUagctu] that the sequence must exceed for this method to produce a Bio::Sequence::NA “guess”. In the default case, if 90% or fewer of the bases (after excluding [Nn]) are in the set [AGCTUagctu], then the guess is Bio::Sequence::AA.

puts Bio::Sequence.guess('atgcatgcqq')      #=> Bio::Sequence::AA
puts Bio::Sequence.guess('atgcatgcqq', 0.8) #=> Bio::Sequence::AA
puts Bio::Sequence.guess('atgcatgcqq', 0.7) #=> Bio::Sequence::NA

The `length` value is how much of the total sequence to use in the guess (default 10000). If your sequence is very long, you may want to use a smaller amount to reduce the computational burden.

# limit the guess to the first 1000 positions
puts Bio::Sequence.guess('A VERY LONG SEQUENCE', 0.9, 1000)

The `index` value is where to start the guess. Perhaps you know there are a lot of gaps at the start…

puts Bio::Sequence.guess('-----atgcc')             #=> Bio::Sequence::AA
puts Bio::Sequence.guess('-----atgcc',0.9,10000,5) #=> Bio::Sequence::NA

Arguments:

  • (required) str: String or Bio::Sequence::NA/AA object

  • (optional) threshold: Float in range 0,1 (default 0.9)

  • (optional) length: Fixnum (default 10000)

  • (optional) index: Fixnum (default 0)

Returns

Bio::Sequence::NA/AA



# File 'lib/bio/sequence.rb', line 381

def self.guess(str, *args)
  self.new(str).guess(*args)
end

.input(str, format = nil) ⇒ Object

Create a new Bio::Sequence object from a formatted string (GenBank, EMBL, fasta format, etc.)

s = Bio::Sequence.input(str)

Arguments:

  • (required) str: string

  • (optional) format: format specification (class or nil)

Returns

Bio::Sequence object



# File 'lib/bio/sequence.rb', line 436

def self.input(str, format = nil)
  if format then
    klass = format
  else
    klass = Bio::FlatFile::AutoDetect.default.autodetect(str)
  end
  obj = klass.new(str)
  obj.to_biosequence
end
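
The control flow here is a small dispatch: use the given format class if there is one, otherwise detect a class from the text, then build a parser object and convert it. The plain-Ruby sketch below mirrors that flow. `FastaLike`, `RawLike`, `detect_format` and `read_sequence` are hypothetical stand-ins and say nothing about how Bio::FlatFile::AutoDetect works internally.

```ruby
# Hypothetical parser classes; each exposes to_biosequence, as the
# format classes in the source above are expected to.
class FastaLike
  def initialize(str)
    @str = str
  end

  def to_biosequence
    header, *body = @str.lines.map(&:chomp)
    { format: :fasta, id: header.sub(/\A>/, ''), seq: body.join }
  end
end

class RawLike
  def initialize(str)
    @str = str
  end

  def to_biosequence
    { format: :raw, id: nil, seq: @str.gsub(/\s+/, '') }
  end
end

# Hypothetical stand-in for format autodetection.
def detect_format(str)
  str.start_with?('>') ? FastaLike : RawLike
end

# Same shape as the method above: explicit class first, detection otherwise.
def read_sequence(str, format = nil)
  klass = format || detect_format(str)
  klass.new(str).to_biosequence
end

auto   = read_sequence(">seq1\natgc\natgc\n")
forced = read_sequence('atgc atgc', RawLike)
puts auto[:id]
puts forced[:seq]
```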

.read(str, format = nil) ⇒ Object

Alias of Bio::Sequence.input.

# File 'lib/bio/sequence.rb', line 447

def self.read(str, format = nil)
  input(str, format)
end

Instance Method Details

#aa ⇒ Object

Transform the sequence wrapped in the current Bio::Sequence object into a Bio::Sequence::AA object. This method will change the current object. This method does not validate your choice, so be careful!

s = Bio::Sequence.new('atgc')
puts s.seq.class                        #=> String
s.aa
puts s.seq.class                        #=> Bio::Sequence::AA !!!

However, if you know your sequence type, this method may be constructively used after initialization:

s = Bio::Sequence.new('RRLE')
s.aa

Returns

Bio::Sequence::AA



# File 'lib/bio/sequence.rb', line 422

def aa
  @seq = AA.new(seq)
  @moltype = AA
end

#accessions ⇒ Object

Accession numbers of the sequence

Returns

Array of String



# File 'lib/bio/sequence.rb', line 454

def accessions
  [ primary_accession, secondary_accessions ].flatten.compact
end
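
The flatten.compact chain is what lets this method cope with a missing primary accession or a missing list of secondary accessions. A minimal plain-Ruby sketch of the same expression, with a hypothetical helper name and made-up accession strings:

```ruby
# Same expression as in the method above, taking the two values as arguments.
def collect_accessions(primary, secondaries)
  [primary, secondaries].flatten.compact
end

p collect_accessions('X1', %w[A1 B2]) # primary first, then secondaries
p collect_accessions('X1', nil)       # nil secondaries are dropped
p collect_accessions(nil, nil)        # nothing set gives an empty Array
```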

#auto ⇒ Object

Guess the type of sequence, Amino Acid or Nucleic Acid, and create a new sequence object (Bio::Sequence::AA or Bio::Sequence::NA) on the basis of this guess. This method will change the current Bio::Sequence object.

s = Bio::Sequence.new('atgc')
puts s.seq.class                        #=> String
s.auto
puts s.seq.class                        #=> Bio::Sequence::NA

Returns

Bio::Sequence::NA/AA object



# File 'lib/bio/sequence.rb', line 264

def auto
  @moltype = guess
  if @moltype == NA
    @seq = NA.new(seq)
  else
    @seq = AA.new(seq)
  end
end

#guess(threshold = 0.9, length = 10000, index = 0) ⇒ Object

Guess the class of the current sequence. Returns the class (Bio::Sequence::AA or Bio::Sequence::NA) guessed. In general, used by developers only, but if you know what you are doing, feel free.

s = Bio::Sequence.new('atgc')
puts s.guess                            #=> Bio::Sequence::NA

There are three parameters: `threshold`, `length`, and `index`.

The `threshold` value (defaults to 0.9) is the frequency of nucleic acid bases [AGCTUagctu] that the sequence must exceed for this method to produce a Bio::Sequence::NA “guess”. In the default case, if 90% or fewer of the bases (after excluding [Nn]) are in the set [AGCTUagctu], then the guess is Bio::Sequence::AA.

s = Bio::Sequence.new('atgcatgcqq')
puts s.guess                            #=> Bio::Sequence::AA
puts s.guess(0.8)                       #=> Bio::Sequence::AA
puts s.guess(0.7)                       #=> Bio::Sequence::NA

The `length` value is how much of the total sequence to use in the guess (default 10000). If your sequence is very long, you may want to use a smaller amount to reduce the computational burden.

s = Bio::Sequence.new('A VERY LONG SEQUENCE')
puts s.guess(0.9, 1000)  # limit the guess to the first 1000 positions

The `index` value is where to start the guess. Perhaps you know there are a lot of gaps at the start…

s = Bio::Sequence.new('-----atgcc')
puts s.guess                            #=> Bio::Sequence::AA
puts s.guess(0.9,10000,5)               #=> Bio::Sequence::NA

Arguments:

  • (optional) threshold: Float in range 0,1 (default 0.9)

  • (optional) length: Fixnum (default 10000)

  • (optional) index: Fixnum (default 0)

Returns

Bio::Sequence::NA/AA



# File 'lib/bio/sequence.rb', line 328

def guess(threshold = 0.9, length = 10000, index = 0)
  str = seq.to_s[index,length].to_s.extend Bio::Sequence::Common
  cmp = str.composition

  bases = cmp['A'] + cmp['T'] + cmp['G'] + cmp['C'] + cmp['U'] +
          cmp['a'] + cmp['t'] + cmp['g'] + cmp['c'] + cmp['u']

  total = str.length - cmp['N'] - cmp['n']

  if bases.to_f / total > threshold
    return NA
  else
    return AA
  end
end
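
The heuristic can be restated without BioRuby to make the arithmetic easy to check. In the sketch below, `guess_type` is a hypothetical function that returns the symbols :na and :aa instead of the classes, and it counts characters with String#count rather than the composition method used above.

```ruby
# Plain-Ruby restatement of the heuristic shown above.
def guess_type(str, threshold = 0.9, length = 10000, index = 0)
  window = str[index, length].to_s          # the part of the sequence to inspect
  bases  = window.count('ACGTUacgtu')       # characters that look like nucleotides
  total  = window.length - window.count('Nn') # N/n are ignored in the denominator
  bases.to_f / total > threshold ? :na : :aa
end

p guess_type('atgcatgcqq')                # 8 of 10 is not above 0.9
p guess_type('atgcatgcqq', 0.8)           # 0.8 is not strictly above 0.8
p guess_type('atgcatgcqq', 0.7)           # 0.8 is above 0.7
p guess_type('-----atgcc')                # gap characters count against NA
p guess_type('-----atgcc', 0.9, 10000, 5) # skipping the gaps
```

The comparison is strict, so a sequence whose base frequency exactly equals the threshold is still guessed as amino acid.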

#na ⇒ Object

Transform the sequence wrapped in the current Bio::Sequence object into a Bio::Sequence::NA object. This method will change the current object. This method does not validate your choice, so be careful!

s = Bio::Sequence.new('RRLE')
puts s.seq.class                        #=> String
s.na
puts s.seq.class                        #=> Bio::Sequence::NA !!!

However, if you know your sequence type, this method may be constructively used after initialization:

s = Bio::Sequence.new('atgc')
s.na

Returns

Bio::Sequence::NA



# File 'lib/bio/sequence.rb', line 401

def na
  @seq = NA.new(seq)
  @moltype = NA
end

#to_s ⇒ Object Also known as: to_str

Return sequence as String. The original sequence is unchanged.

s = Bio::Sequence.new('atgc')
puts s.to_s                             #=> 'atgc'
puts s.to_s.class                       #=> String
puts s                                  #=> 'atgc'
puts s.class                            #=> Bio::Sequence

Returns

String object



# File 'lib/bio/sequence/compat.rb', line 27

def to_s
  String.new(self.seq)
end
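
Because String.new builds a new String from its argument, changes made to the returned value do not reach the original. A minimal plain-Ruby sketch of that behaviour, using an ordinary String in place of the wrapped sequence:

```ruby
original = 'atgc'
copy = String.new(original) # a distinct String with the same characters

copy << 'NNNN'              # mutate only the copy

puts original
puts copy
```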