Class: Bio::FastaFormat

Inherits:
DB
Defined in:
lib/bio/db/fasta.rb

Overview

Treats a FASTA formatted entry, such as:

>id and/or some comments                    <== definition line
ATGCATGCATGCATGCATGCATGCATGCATGCATGC        <== sequence lines
ATGCATGCATGCATGCATGCATGCATGCATGCATGC
ATGCATGCATGC

The leading ‘>’ can be omitted, and a trailing ‘>’ is removed automatically.

Examples

fasta_string = <<END_OF_STRING
>gi|398365175|ref|NP_009718.3| Cdc28p [Saccharomyces cerevisiae S288c]
MSGELANYKRLEKVGEGTYGVVYKALDLRPGQGQRVVALKKIRLESEDEGVPSTAIREISLLKELKDDNI
VRLYDIVHSDAHKLYLVFEFLDLDLKRYMEGIPKDQPLGADIVKKFMMQLCKGIAYCHSHRILHRDLKPQ
NLLINKDGNLKLGDFGLARAFGVPLRAYTHEIVTLWYRAPEVLLGGKQYSTGVDTWSIGCIFAEMCNRKP
IFSGDSEIDQIFKIFRVLGTPNEAIWPDIVYLPDFKPSFPQWRRKDLSQVVPSLDPRGIDLLDKLLAYDP
INRISARRAAIHPYFQES
END_OF_STRING

f = Bio::FastaFormat.new(fasta_string)

f.entry #=> ">gi|398365175|ref|NP_009718.3| Cdc28p [Saccharomyces cerevisiae S288c]\n"+
# "MSGELANYKRLEKVGEGTYGVVYKALDLRPGQGQRVVALKKIRLESEDEGVPSTAIREISLLKELKDDNI\n"+
# "VRLYDIVHSDAHKLYLVFEFLDLDLKRYMEGIPKDQPLGADIVKKFMMQLCKGIAYCHSHRILHRDLKPQ\n"+
# "NLLINKDGNLKLGDFGLARAFGVPLRAYTHEIVTLWYRAPEVLLGGKQYSTGVDTWSIGCIFAEMCNRKP\n"+
# "IFSGDSEIDQIFKIFRVLGTPNEAIWPDIVYLPDFKPSFPQWRRKDLSQVVPSLDPRGIDLLDKLLAYDP\n"+
# "INRISARRAAIHPYFQES\n"

Methods related to the name of the sequence

A larger range of methods for dealing with Fasta definition lines can be found in FastaDefline, accessed through the FastaFormat#identifiers method.

f.entry_id #=> "gi|398365175"
f.first_name #=> "gi|398365175|ref|NP_009718.3|"
f.definition #=> "gi|398365175|ref|NP_009718.3| Cdc28p [Saccharomyces cerevisiae S288c]"
f.identifiers #=> Bio::FastaDefline instance
f.accession #=> "NP_009718"
f.accessions #=> ["NP_009718"]
f.acc_version #=> "NP_009718.3"
f.comment #=> nil

Methods related to the actual sequence

f.seq #=> "MSGELANYKRLEKVGEGTYGVVYKALDLRPGQGQRVVALKKIRLESEDEGVPSTAIREISLLKELKDDNIVRLYDIVHSDAHKLYLVFEFLDLDLKRYMEGIPKDQPLGADIVKKFMMQLCKGIAYCHSHRILHRDLKPQNLLINKDGNLKLGDFGLARAFGVPLRAYTHEIVTLWYRAPEVLLGGKQYSTGVDTWSIGCIFAEMCNRKPIFSGDSEIDQIFKIFRVLGTPNEAIWPDIVYLPDFKPSFPQWRRKDLSQVVPSLDPRGIDLLDKLLAYDPINRISARRAAIHPYFQES"
f.data #=> "\nMSGELANYKRLEKVGEGTYGVVYKALDLRPGQGQRVVALKKIRLESEDEGVPSTAIREISLLKELKDDNI\nVRLYDIVHSDAHKLYLVFEFLDLDLKRYMEGIPKDQPLGADIVKKFMMQLCKGIAYCHSHRILHRDLKPQ\nNLLINKDGNLKLGDFGLARAFGVPLRAYTHEIVTLWYRAPEVLLGGKQYSTGVDTWSIGCIFAEMCNRKP\nIFSGDSEIDQIFKIFRVLGTPNEAIWPDIVYLPDFKPSFPQWRRKDLSQVVPSLDPRGIDLLDKLLAYDP\nINRISARRAAIHPYFQES\n"
f.length #=> 298
f.aaseq #=> "MSGELANYKRLEKVGEGTYGVVYKALDLRPGQGQRVVALKKIRLESEDEGVPSTAIREISLLKELKDDNIVRLYDIVHSDAHKLYLVFEFLDLDLKRYMEGIPKDQPLGADIVKKFMMQLCKGIAYCHSHRILHRDLKPQNLLINKDGNLKLGDFGLARAFGVPLRAYTHEIVTLWYRAPEVLLGGKQYSTGVDTWSIGCIFAEMCNRKPIFSGDSEIDQIFKIFRVLGTPNEAIWPDIVYLPDFKPSFPQWRRKDLSQVVPSLDPRGIDLLDKLLAYDPINRISARRAAIHPYFQES"
f.aaseq.composition #=> {"M"=>5, "S"=>15, "G"=>21, "E"=>16, "L"=>36, "A"=>17, "N"=>8, "Y"=>13, "K"=>22, "R"=>20, "V"=>18, "T"=>7, "D"=>23, "P"=>17, "Q"=>10, "I"=>23, "H"=>7, "F"=>12, "C"=>4, "W"=>4}
f.aalen #=> 298

A less structured fasta entry

f.entry #=> ">abc 123 456\nASDF\n"

f.entry_id #=> "abc"
f.first_name #=> "abc"
f.definition #=> "abc 123 456"
f.comment #=> nil
f.accession #=> nil
f.accessions #=> []
f.acc_version #=> nil

f.seq #=> "ASDF"
f.data #=> "\nASDF\n"
f.length #=> 4
f.aaseq #=> "ASDF"
f.aaseq.composition #=> {"A"=>1, "S"=>1, "D"=>1, "F"=>1}
f.aalen #=> 4
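
The composition hash shown above is a per-residue count. As a rough sketch of the idea only, assuming nothing about how Bio::Sequence computes it internally, the same counts can be produced with plain Ruby. `residue_counts` is a hypothetical helper, not part of BioRuby.

```ruby
# Hypothetical helper (not part of BioRuby): count how often each
# residue character occurs in a sequence string.
def residue_counts(seq)
  counts = Hash.new(0)                      # missing keys start at 0
  seq.each_char { |residue| counts[residue] += 1 }
  counts
end

p residue_counts("ASDF")
```

For "ASDF" every residue is counted once, matching the composition listed for the less structured entry.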

Direct Known Subclasses

FastaNumericFormat

Constant Summary

DELIMITER = RS = "\n>"

Entry delimiter in flatfile text.

DELIMITER_OVERRUN = 1

(Integer) excess read size included in DELIMITER.
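
To illustrate how this delimiter relates to the optional leading ‘>’ mentioned in the Overview, the sketch below splits a multi-entry string on "\n>" in plain Ruby. It is an assumption-laden illustration: `split_entries` and `DELIMITER_SKETCH` are hypothetical names, and the real flatfile reader may treat the delimiter and DELIMITER_OVERRUN differently.

```ruby
# Hypothetical helper (not part of BioRuby): split multi-FASTA text on the
# same "\n>" string that DELIMITER holds.
DELIMITER_SKETCH = "\n>"

def split_entries(text)
  text.split(DELIMITER_SKETCH)
end

chunks = split_entries(">a first\nATGC\n>b second\nGGCC\n")
p chunks[0]  # ">a first\nATGC"
p chunks[1]  # "b second\nGGCC\n"
```

Only the first chunk keeps its ‘>’. Later chunks start directly with the definition text, which is one situation where a single-entry parser has to tolerate a missing leading ‘>’.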

Methods inherited from DB

#exists?, #fetch, #get, open, #tags

Constructor Details

#initialize(str) ⇒ FastaFormat

Stores the comment and sequence information from one entry of the FASTA format string. If the argument contains more than one entry, only the first entry is used.



# File 'lib/bio/db/fasta.rb', line 133

def initialize(str)
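  @definition = str[/.*/].sub(/^>/, '').strip  # first line, without the leading '>'
  @data = str.sub(/.*/, '')                    # everything after the first line
  @data.sub!(/^>.*/m, '')                      # drop any following entries (or a trailing '>')
  @entry_overrun = $&                          # the text removed above, or nil if nothing matched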
end
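
The effect of these substitutions on a string holding more than one entry can be tried without BioRuby. The sketch below copies the regular expressions from the constructor into a standalone function. `split_first_entry` is a hypothetical name used only for illustration, and it uses non-destructive calls instead of `sub!` and `$&`.

```ruby
# Hypothetical standalone version of the parsing done in #initialize.
# Returns [definition, data, entry_overrun].
def split_first_entry(str)
  definition = str[/.*/].sub(/^>/, '').strip  # first line, leading '>' dropped
  data = str.sub(/.*/, '')                    # everything after the first line
  overrun = data[/^>.*/m]                     # a later '>' line and all that follows
  data = data.sub(/^>.*/m, '')                # keep only the first entry's lines
  [definition, data, overrun]
end

definition, data, overrun = split_first_entry(">a first\nATGC\nGG\n>b second\nTTTT\n")
p definition  # "a first"
p data        # "\nATGC\nGG\n"
p overrun     # ">b second\nTTTT\n"
```

The same function also shows the two cases from the Overview: input without a leading ‘>’ yields the same definition, and a lone trailing ‘>’ ends up in the overrun value rather than in the data.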

Instance Attribute Details

#data ⇒ Object

The sequence lines in text.

# File 'lib/bio/db/fasta.rb', line 126

def data
  @data
end

#definition ⇒ Object

The comment line of the FASTA formatted data.

# File 'lib/bio/db/fasta.rb', line 123

def definition
  @definition
end

#entry_overrun ⇒ Object (readonly)

Returns the value of attribute entry_overrun.

# File 'lib/bio/db/fasta.rb', line 128

def entry_overrun
  @entry_overrun
end

Instance Method Details

#aalen ⇒ Object

Returns the length of the sequence as a Bio::Sequence::AA.

# File 'lib/bio/db/fasta.rb', line 223

def aalen
  self.aaseq.length
end

#aaseq ⇒ Object

Returns the sequence as a Bio::Sequence::AA.

# File 'lib/bio/db/fasta.rb', line 218

def aaseq
  Sequence::AA.new(seq)
end

#acc_version ⇒ Object

Returns the accession number with version.

# File 'lib/bio/db/fasta.rb', line 279

def acc_version
  identifiers.acc_version
end

#accession ⇒ Object

Returns an accession number.

# File 'lib/bio/db/fasta.rb', line 267

def accession
  identifiers.accession
end

#accessions ⇒ Object

Parses the FASTA defline (using the #identifiers method) and returns the accession numbers as an array of strings.

# File 'lib/bio/db/fasta.rb', line 274

def accessions
  identifiers.accessions
end

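#comment ⇒ Object

Returns the comments found among the sequence lines (lines beginning with '#'), or nil if the data does not start with a comment line. Calls #seq first so that the comments are parsed.
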
# File 'lib/bio/db/fasta.rb', line 197

def comment
  seq
  @comment
end

#entry ⇒ Object Also known as: to_s

Returns the stored entry as a FASTA-formatted string (same as to_s).

# File 'lib/bio/db/fasta.rb', line 141

def entry
  @entry = ">#{@definition}\n#{@data.strip}\n"
end

#entry_id ⇒ Object

Parses the FASTA defline (using the #identifiers method) and returns a possibly unique identifier as a string.

# File 'lib/bio/db/fasta.rb', line 253

def entry_id
  identifiers.entry_id
end

#first_name ⇒ Object

Returns the first name (word) of the definition line - everything before the first whitespace.

>abc def #=> 'abc'
>gi|398365175|ref|NP_009718.3| Cdc28p [Saccharomyces cerevisiae S288c] #=> 'gi|398365175|ref|NP_009718.3|'
>abc #=> 'abc'

# File 'lib/bio/db/fasta.rb', line 294

def first_name
  index = definition.index(/\s/)
  if index.nil?
    return @definition
  else
    return @definition[0...index]
  end
end

#gi ⇒ Object

Parses the FASTA defline (using the #identifiers method) and returns the GI, locus, accession, or accession with version number. If an entry has two or more such IDs, only the first is returned. Returns a string or nil.

# File 'lib/bio/db/fasta.rb', line 262

def gi
  identifiers.gi
end

#identifiers ⇒ Object

Parses the FASTA defline and extracts the IDs. IDs are NSIDs (NCBI standard FASTA sequence identifiers) or “:”-separated IDs. Returns a Bio::FastaDefline instance.

# File 'lib/bio/db/fasta.rb', line 243

def identifiers
  unless defined?(@ids) then
    @ids = FastaDefline.new(@definition)
  end
  @ids
end

#length ⇒ Object

Returns sequence length.

# File 'lib/bio/db/fasta.rb', line 203

def length
  seq.length
end

#locus ⇒ Object

Returns locus.

# File 'lib/bio/db/fasta.rb', line 284

def locus
  identifiers.locus
end

#nalen ⇒ Object

Returns the length of the sequence as a Bio::Sequence::NA.

# File 'lib/bio/db/fasta.rb', line 213

def nalen
  self.naseq.length
end

#naseq ⇒ Object

Returns the sequence as a Bio::Sequence::NA.

# File 'lib/bio/db/fasta.rb', line 208

def naseq
  Sequence::NA.new(seq)
end

#query(factory) ⇒ Object Also known as: fasta, blast

Executes FASTA/BLAST search by using a Bio::Fasta or a Bio::Blast factory object.

#!/usr/bin/env ruby
require 'bio'

factory = Bio::Fasta.local('fasta34', 'db/swissprot.f')
flatfile = Bio::FlatFile.open(Bio::FastaFormat, 'queries.f')
flatfile.each do |entry|
  p entry.definition
  result = entry.fasta(factory)
  result.each do |hit|
    print "#{hit.query_id} : #{hit.evalue}\t#{hit.target_id} at "
    p hit.lap_at
  end
end


# File 'lib/bio/db/fasta.rb', line 164

def query(factory)
  factory.query(entry)
end

#seq ⇒ Object

Returns the sequence lines joined into a single String, with whitespace and digits removed.

# File 'lib/bio/db/fasta.rb', line 171

def seq
  unless defined?(@seq)
    unless /\A\s*^\#/ =~ @data then
      @seq = Sequence::Generic.new(@data.tr(" \t\r\n0-9", '')) # lazy clean up
    else
      a = @data.split(/(^\#.*$)/)
      i = 0
      cmnt = {}
      s = []
      a.each do |x|
        if /^# ?(.*)$/ =~ x then
          cmnt[i] ? cmnt[i] << "\n" << $1 : cmnt[i] = $1
        else
          x.tr!(" \t\r\n0-9", '') # lazy clean up
          i += x.length
          s << x
        end
      end
      @comment = cmnt
      @seq = Bio::Sequence::Generic.new(s.join(''))
    end
  end
  @seq
end
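
The comment handling in this method is easier to follow with a concrete input. The sketch below mirrors the logic in a standalone function that returns a plain String and the comment hash. `join_sequence_lines` is a hypothetical name, and the function does not build a Bio::Sequence::Generic, so treat it as an illustration of the listing rather than a replacement for it.

```ruby
# Hypothetical standalone helper mirroring the logic of #seq.
# Returns [sequence_string, comments], where comments is nil unless the
# data starts with a '#' line.
def join_sequence_lines(data)
  junk = " \t\r\n0-9"  # whitespace and digits are stripped from sequence text
  return [data.tr(junk, ''), nil] unless data =~ /\A\s*^#/

  comments = {}  # maps sequence offset => comment text
  offset = 0     # number of residues collected so far
  parts = []
  data.split(/(^#.*$)/).each do |chunk|
    if chunk =~ /^# ?(.*)$/
      # Comment line: attach it to the current offset, joining repeats with "\n".
      comments[offset] = comments[offset] ? "#{comments[offset]}\n#{$1}" : $1
    else
      cleaned = chunk.tr(junk, '')
      offset += cleaned.length
      parts << cleaned
    end
  end
  [parts.join, comments]
end

seq, comments = join_sequence_lines("\n# first comment\nATGC 10\nATGC\n# second\nGG\n")
p seq  # "ATGCATGCGG"
```

Here `comments` maps 0 to "first comment" and 8 to "second": each key is the number of residues that precede the comment line.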

#to_biosequence ⇒ Object Also known as: to_seq

Returns sequence as a Bio::Sequence object.

Note: If you modify the returned Bio::Sequence object, the sequence or definition in this FastaFormat object may also change (though not always), because data may be shared for efficiency.

# File 'lib/bio/db/fasta.rb', line 234

def to_biosequence
  Bio::Sequence.adapter(self, Bio::Sequence::Adapter::FastaFormat)
end