Module: Bio::Sequence::Common
Overview
DESCRIPTION
Bio::Sequence::Common is a mixin implementing the methods common to Bio::Sequence::AA and Bio::Sequence::NA. All of these methods are available to both amino acid and nucleic acid sequences and, by encapsulation, to Bio::Sequence objects.
USAGE
# Create a sequence
dna = Bio::Sequence.auto('atgcatgcatgc')
# Splice out a subsequence using a GenBank-style location string
puts dna.splice('complement(1..4)')
# What is the base composition?
puts dna.composition
# Create a random sequence with the composition of the current sequence
puts dna.randomize
Instance Method Summary
- #+(*arg) ⇒ Object
  Create a new sequence by adding to an existing sequence.
- #<<(*arg) ⇒ Object
- #composition ⇒ Object
  Returns a hash of the occurrence counts for each residue or base.
- #concat(*arg) ⇒ Object
  Add new data to the end of the current sequence.
- #normalize! ⇒ Object
  (also: #seq!)
  Normalize the current sequence, removing all whitespace and transforming all positions to uppercase if the sequence is AA or to lowercase if the sequence is NA.
- #randomize(hash = nil) ⇒ Object
  Returns a randomized sequence.
- #seq ⇒ Object
  Create a new sequence based on the current sequence.
- #splice(position) ⇒ Object
  (also: #splicing)
  Return a new sequence extracted from the original using a GenBank-style position string.
- #split(*arg) ⇒ Object
  Acts almost the same as String#split.
- #subseq(s = 1, e = self.length) ⇒ Object
  Returns a new sequence containing the subsequence identified by the start and end numbers given as parameters.
- #to_fasta(header = '', width = nil) ⇒ Object
  Bio::Sequence#to_fasta is DEPRECATED. Do not use Bio::Sequence#to_fasta! Use Bio::Sequence#output instead.
- #to_s ⇒ Object
  (also: #to_str)
  Return sequence as String.
- #total(hash) ⇒ Object
  Returns a float total value for the sequence, given a hash of base or residue values.
- #window_search(window_size, step_size = 1) ⇒ Object
  This method steps through a sequence in steps of ‘step_size’ by subsequences of ‘window_size’.
Instance Method Details
#+(*arg) ⇒ Object
Create a new sequence by adding to an existing sequence. The existing sequence is not modified.
s = Bio::Sequence::NA.new('atgc')
s2 = s + 'atgc'
puts s2 #=> "atgcatgc"
puts s #=> "atgc"
The new sequence is of the same class as the existing sequence if the new data was added to an existing sequence,
puts s2.class == s.class #=> true
but if an existing sequence is added to a String, the result is a String
s3 = 'atgc' + s
puts s3.class #=> String
Returns:
- new Bio::Sequence::NA/AA or String object
# File 'lib/bio/sequence/common.rb', line 122
def +(*arg)
  self.class.new(super(*arg))
end
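The class-preserving behavior described above can be reproduced without BioRuby. The following is a minimal sketch, assuming a hypothetical String subclass named MiniSeq (not part of BioRuby) that overrides + in the same way as the source shown above.

```ruby
# MiniSeq is a hypothetical stand-in for Bio::Sequence::NA/AA.
# It is not part of BioRuby and exists only for this illustration.
class MiniSeq < String
  # String#+ returns a plain String, so wrap the result in the
  # receiver's own class.
  def +(*arg)
    self.class.new(super(*arg))
  end
end

s  = MiniSeq.new('atgc')
s2 = s + 'atgc'   # receiver is a MiniSeq, so the override runs
s3 = 'atgc' + s   # receiver is a String, so String#+ runs

puts s2.class     # MiniSeq
puts s3.class     # String
puts s            # atgc (the original is unchanged)
```

The result class depends on the receiver because Ruby dispatches + on the left-hand operand.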
#<<(*arg) ⇒ Object
# File 'lib/bio/sequence/common.rb', line 99
def <<(*arg)
  concat(*arg)
end
#composition ⇒ Object
# File 'lib/bio/sequence/common.rb', line 216
def composition
  count = Hash.new(0)
  self.scan(/./) do |x|
    count[x] += 1
  end
  return count
end
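The counting logic above can be tried on a plain String. This is a minimal sketch, assuming a hypothetical helper named composition_of (not a BioRuby method) that mirrors the source.

```ruby
# composition_of is a hypothetical helper, not part of BioRuby.
# It applies the same counting loop as the source to any String.
def composition_of(seq)
  count = Hash.new(0)                 # unseen residues start at zero
  seq.scan(/./) { |x| count[x] += 1 } # one match per character
  count
end

counts = composition_of('atgcatgcatgc')
counts.each { |residue, n| puts "#{residue}: #{n}" }
# Each of a, t, g and c is counted 3 times.
puts counts['n']                      # 0, from the Hash default
```

Because the hash is created with a default of 0, looking up a residue that never occurs returns 0 rather than nil.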
#concat(*arg) ⇒ Object
# File 'lib/bio/sequence/common.rb', line 95
def concat(*arg)
  super(self.class.new(*arg))
end
#normalize! ⇒ Object Also known as: seq!
Normalize the current sequence, removing all whitespace and transforming all positions to uppercase if the sequence is AA or transforming all positions to lowercase if the sequence is NA. The original sequence is modified.
s = Bio::Sequence::NA.new('atgc')
s.normalize!
Returns:
- current Bio::Sequence::NA/AA object (modified)
# File 'lib/bio/sequence/common.rb', line 79
def normalize!
  initialize(self)
  self
end
#randomize(hash = nil) ⇒ Object
Returns a randomized sequence. The default is to retain the same base/residue composition as the original. If a hash of base/residue counts is given, the new sequence will be based on that hash composition. If a block is given, each new randomly selected position will be passed into the block. In all cases, the original sequence is not modified.
s = Bio::Sequence::NA.new('atgc')
puts s.randomize #=> "tcag" (for example)
new_composition = {'a' => 2, 't' => 2}
puts s.randomize(new_composition) #=> "ttaa" (for example)
count = 0
s.randomize { |x| count += 1 }
puts count #=> 4
Arguments:
- (optional) hash: Hash object
Returns:
- new Bio::Sequence::NA/AA object
# File 'lib/bio/sequence/common.rb', line 244
def randomize(hash = nil)
  if hash
    tmp = ''
    hash.each {|k, v|
      tmp += k * v.to_i
    }
  else
    tmp = self
  end
  seq = self.class.new(tmp)
  # Reference: http://en.wikipedia.org/wiki/Fisher-Yates_shuffle
  seq.length.downto(2) do |n|
    k = rand(n)
    c = seq[n - 1]
    seq[n - 1] = seq[k]
    seq[k] = c
  end
  if block_given? then
    (0...seq.length).each do |i|
      yield seq[i, 1]
    end
    return self.class.new('')
  else
    return seq
  end
end
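The shuffle in the source is a Fisher-Yates loop. The sketch below applies the same loop to a plain String; shuffle_residues and expand_composition are hypothetical helper names introduced for illustration, not BioRuby methods.

```ruby
# Hypothetical helpers, not part of BioRuby.

# Build a string from a composition hash, e.g. {'a' => 2} gives "aa".
def expand_composition(hash)
  hash.map { |residue, n| residue * n.to_i }.join
end

# Fisher-Yates shuffle on a copy, so the argument is left untouched.
def shuffle_residues(seq)
  out = seq.dup
  out.length.downto(2) do |n|
    k = rand(n)                              # random index in 0...n
    out[n - 1], out[k] = out[k], out[n - 1]  # swap the two positions
  end
  out
end

original = 'atgcatgc'
shuffled = shuffle_residues(original)
puts shuffled                                     # order varies per run
puts shuffled.chars.sort == original.chars.sort   # true

puts shuffle_residues(expand_composition('a' => 2, 't' => 2))
```

Shuffling only reorders positions, so the residue counts of the result always match those of the input.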
#seq ⇒ Object
# File 'lib/bio/sequence/common.rb', line 66
def seq
  self.class.new(self)
end
#splice(position) ⇒ Object Also known as: splicing
Return a new sequence extracted from the original using a GenBank style position string. See also documentation for the Bio::Location class.
s = Bio::Sequence::NA.new('atgcatgcatgcatgc')
puts s.splice('1..3') #=> "atg"
puts s.splice('join(1..3,8..10)') #=> "atgcat"
puts s.splice('complement(1..3)') #=> "cat"
puts s.splice('complement(join(1..3,8..10))') #=> "atgcat"
Note that complement(...) in a GenBank position string has no effect on Bio::Sequence::AA objects.
Arguments:
- (required) position: String or Bio::Locations object
Returns:
- Bio::Sequence::NA/AA object
# File 'lib/bio/sequence/common.rb', line 286
def splice(position)
  unless position.is_a?(Locations) then
    position = Locations.new(position)
  end
  s = String.new
  position.each do |location|
    if location.sequence
      s << location.sequence
    else
      exon = self.subseq(location.from, location.to)
      begin
        exon.complement! if location.strand < 0
      rescue NameError
      end
      s << exon
    end
  end
  return self.class.new(s)
end
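The per-location loop above can be sketched on a plain String if the locations are supplied already parsed. This is only an illustration under that assumption: splice_segments and revcomp are hypothetical helpers, and the [from, to, strand] triples stand in for what BioRuby obtains by parsing the position string.

```ruby
# Hypothetical helpers, not part of BioRuby. BioRuby parses the
# GenBank position string itself; here segments are given directly.

# Reverse complement of a lowercase DNA string.
def revcomp(dna)
  dna.reverse.tr('atgc', 'tacg')
end

# Each segment is [from, to, strand] with one-based, inclusive
# coordinates; a negative strand means the complementary strand.
def splice_segments(seq, segments)
  segments.map do |from, to, strand|
    exon = seq[(from - 1)..(to - 1)]
    strand < 0 ? revcomp(exon) : exon
  end.join
end

s = 'atgcatgcatgcatgc'
puts splice_segments(s, [[1, 3, 1]])              # atg
puts splice_segments(s, [[1, 3, 1], [8, 10, 1]])  # atgcat
puts splice_segments(s, [[1, 3, -1]])             # cat
```

The three calls correspond to the position strings '1..3', 'join(1..3,8..10)' and 'complement(1..3)' in the examples above.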
#split(*arg) ⇒ Object
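Acts almost the same as String#split, except that, when no block is given, each element of the returned array is an instance of the receiver's class rather than a plain String.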
# File 'lib/bio/sequence/common.rb', line 312
def split(*arg)
  if block_given?
    super
  else
    ret = super(*arg)
    ret.collect! { |x| self.class.new('').replace(x) }
    ret
  end
end
#subseq(s = 1, e = self.length) ⇒ Object
Returns a new sequence containing the subsequence identified by the start and end numbers given as parameters. Important: biological sequence numbering conventions (one-based) are used rather than Ruby's (zero-based) numbering conventions.
s = Bio::Sequence::NA.new('atggaatga')
puts s.subseq(1,3) #=> "atg"
Start defaults to 1 and end defaults to the entire existing string, so subseq called without any parameters simply returns a new sequence identical to the existing sequence.
puts s.subseq #=> "atggaatga"
Arguments:
- (optional) s(start): Integer (default 1)
- (optional) e(end): Integer (default current sequence length)
Returns:
- new Bio::Sequence::NA/AA object
# File 'lib/bio/sequence/common.rb', line 144
def subseq(s = 1, e = self.length)
  raise "Error: start/end position must be a positive integer" unless s > 0 and e > 0
  s -= 1
  e -= 1
  self[s..e]
end
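The one-based to zero-based conversion can be sketched on a plain String. The helper name one_based_slice is hypothetical, and raising ArgumentError (instead of the RuntimeError raised by the source) is a choice made for this sketch only.

```ruby
# one_based_slice is a hypothetical helper, not part of BioRuby.
# It converts one-based, inclusive coordinates to Ruby's zero-based
# range before slicing. ArgumentError is this sketch's own choice.
def one_based_slice(seq, s = 1, e = seq.length)
  unless s > 0 && e > 0
    raise ArgumentError, 'start/end position must be a positive integer'
  end
  seq[(s - 1)..(e - 1)]
end

dna = 'atggaatga'
puts one_based_slice(dna, 1, 3)   # atg
puts one_based_slice(dna, 4, 6)   # gaa
puts one_based_slice(dna)         # atggaatga
```

Position 0 is rejected because it has no meaning in one-based numbering.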
#to_fasta(header = '', width = nil) ⇒ Object
Bio::Sequence#to_fasta is DEPRECATED. Do not use Bio::Sequence#to_fasta! Use Bio::Sequence#output instead. Note that Bio::Sequence::NA#to_fasta, Bio::Sequence::AA#to_fasta, and Bio::Sequence::Generic#to_fasta can still be used, because there are no alternative methods.
Output the FASTA format string of the sequence. The 1st argument is used as the comment string. If the 2nd option is given, the output sequence will be folded.
Arguments:
- (optional) header: String object
- (optional) width: Fixnum object (default nil)
Returns:
- String
# File 'lib/bio/sequence/compat.rb', line 49
def to_fasta(header = '', width = nil)
  warn "Bio::Sequence#to_fasta is obsolete. Use Bio::Sequence#output(:fasta) instead" if $DEBUG
  ">#{header}\n" +
  if width
    self.to_s.gsub(Regexp.new(".{1,#{width}}"), "\\0\n")
  else
    self.to_s + "\n"
  end
end
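The folding step relies on a regular expression that matches up to width characters at a time and appends a newline to each match. A minimal sketch on a plain String follows; fasta_record is a hypothetical helper name, not a BioRuby method.

```ruby
# fasta_record is a hypothetical helper, not part of BioRuby.
# It formats a plain String the same way as the source above.
def fasta_record(seq, header = '', width = nil)
  body = if width
           # "\\0" is the whole match; each chunk of up to `width`
           # characters gets its own line.
           seq.gsub(/.{1,#{width}}/, "\\0\n")
         else
           seq + "\n"
         end
  ">#{header}\n" + body
end

print fasta_record('atgcatgcat', 'seq1', 4)
# >seq1
# atgc
# atgc
# at
```

For Bio::Sequence objects, the deprecation notice above recommends output(:fasta) instead.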
#to_s ⇒ Object Also known as: to_str
# File 'lib/bio/sequence/common.rb', line 53
def to_s
  String.new(self)
end
#total(hash) ⇒ Object
# File 'lib/bio/sequence/common.rb', line 199
def total(hash)
  hash.default = 0.0 unless hash.default
  sum = 0.0
  self.each_byte do |x|
    begin
      sum += hash[x.chr]
    end
  end
  return sum
end
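The summation can be sketched on a plain String. The helper weighted_total is hypothetical, the per-base values are invented for illustration, and unlike the source it uses Hash#fetch with a fallback so the caller's hash is not given a new default.

```ruby
# weighted_total is a hypothetical helper, not part of BioRuby.
# Residues missing from `values` contribute 0.0. Unlike the source,
# this sketch does not set a default on the caller's hash.
def weighted_total(seq, values)
  seq.each_char.sum(0.0) { |x| values.fetch(x, 0.0) }
end

# Made-up per-base values, used only for this example.
values = { 'a' => 1.0, 't' => 2.0, 'g' => 3.0, 'c' => 4.0 }

puts weighted_total('atgc', values)    # 10.0
puts weighted_total('atgcn', values)   # 10.0 ('n' has no value)
```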
#window_search(window_size, step_size = 1) ⇒ Object
This method steps through a sequence in steps of ‘step_size’ by subsequences of ‘window_size’. Typically used with a block. Any remaining sequence at the terminal end will be returned.
Prints average GC% on each 100bp
s.window_search(100) do |subseq|
puts subseq.gc
end
Prints every translated peptide (length 5aa) in the same frame
s.window_search(15, 3) do |subseq|
puts subseq.translate
end
Split genome sequence by 10000bp with 1000bp overlap in fasta format
i = 1
remainder = s.window_search(10000, 9000) do |subseq|
puts subseq.to_fasta("segment #{i}", 60)
i += 1
end
puts remainder.to_fasta("segment #{i}", 60)
Arguments:
- (required) window_size: Fixnum
- (optional) step_size: Fixnum (default 1)
Returns:
- new Bio::Sequence::NA/AA object
# File 'lib/bio/sequence/common.rb', line 180
def window_search(window_size, step_size = 1)
  last_step = 0
  0.step(self.length - window_size, step_size) do |i|
    yield self[i, window_size]
    last_step = i
  end
  return self[last_step + window_size .. -1]
end
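How the windows and the returned remainder relate can be seen on a short plain String. This is a minimal sketch; each_window is a hypothetical helper that copies the loop from the source above.

```ruby
# each_window is a hypothetical helper, not part of BioRuby.
# It yields every full window and returns what is left after the
# last window that was yielded.
def each_window(seq, window_size, step_size = 1)
  last_step = 0
  0.step(seq.length - window_size, step_size) do |i|
    yield seq[i, window_size]
    last_step = i
  end
  seq[(last_step + window_size)..-1]
end

windows = []
rest = each_window('atgcatgcat', 4, 4) { |w| windows << w }
p windows   # ["atgc", "atgc"]
p rest      # "at"
```

With a 10-character input, a window of 4 and a step of 4, two full windows fit and the last two characters are returned as the remainder.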