Class: Bio::FastaFormat

Inherits:
DB
Defined in:
lib/bio/db/fasta.rb

Overview

Treats a FASTA formatted entry, such as:

>id and/or some comments                    <== definition line
ATGCATGCATGCATGCATGCATGCATGCATGCATGC        <== sequence lines
ATGCATGCATGCATGCATGCATGCATGCATGCATGC
ATGCATGCATGC

The leading ‘>’ can be omitted, and a trailing ‘>’ is removed automatically.

Examples

fasta_string = <<END_OF_STRING
>gi|398365175|ref|NP_009718.3| Cdc28p [Saccharomyces cerevisiae S288c]
MSGELANYKRLEKVGEGTYGVVYKALDLRPGQGQRVVALKKIRLESEDEGVPSTAIREISLLKELKDDNI
VRLYDIVHSDAHKLYLVFEFLDLDLKRYMEGIPKDQPLGADIVKKFMMQLCKGIAYCHSHRILHRDLKPQ
NLLINKDGNLKLGDFGLARAFGVPLRAYTHEIVTLWYRAPEVLLGGKQYSTGVDTWSIGCIFAEMCNRKP
IFSGDSEIDQIFKIFRVLGTPNEAIWPDIVYLPDFKPSFPQWRRKDLSQVVPSLDPRGIDLLDKLLAYDP
INRISARRAAIHPYFQES
END_OF_STRING

f = Bio::FastaFormat.new(fasta_string)

f.entry #=> ">gi|398365175|ref|NP_009718.3| Cdc28p [Saccharomyces cerevisiae S288c]\n"+
# "MSGELANYKRLEKVGEGTYGVVYKALDLRPGQGQRVVALKKIRLESEDEGVPSTAIREISLLKELKDDNI\n"+
# "VRLYDIVHSDAHKLYLVFEFLDLDLKRYMEGIPKDQPLGADIVKKFMMQLCKGIAYCHSHRILHRDLKPQ\n"+
# "NLLINKDGNLKLGDFGLARAFGVPLRAYTHEIVTLWYRAPEVLLGGKQYSTGVDTWSIGCIFAEMCNRKP\n"+
# "IFSGDSEIDQIFKIFRVLGTPNEAIWPDIVYLPDFKPSFPQWRRKDLSQVVPSLDPRGIDLLDKLLAYDP\n"+
# "INRISARRAAIHPYFQES\n"

Methods related to the name of the sequence

A larger range of methods for dealing with Fasta definition lines can be found in FastaDefline, accessed through the FastaFormat#identifiers method.

f.entry_id #=> "gi|398365175"
f.definition #=> "gi|398365175|ref|NP_009718.3| Cdc28p [Saccharomyces cerevisiae S288c]"
f.identifiers #=> Bio::FastaDefline instance
f.accession #=> "NP_009718"
f.accessions #=> ["NP_009718"]
f.acc_version #=> "NP_009718.3"
f.comment #=> nil

Methods related to the actual sequence

f.seq #=> "MSGELANYKRLEKVGEGTYGVVYKALDLRPGQGQRVVALKKIRLESEDEGVPSTAIREISLLKELKDDNIVRLYDIVHSDAHKLYLVFEFLDLDLKRYMEGIPKDQPLGADIVKKFMMQLCKGIAYCHSHRILHRDLKPQNLLINKDGNLKLGDFGLARAFGVPLRAYTHEIVTLWYRAPEVLLGGKQYSTGVDTWSIGCIFAEMCNRKPIFSGDSEIDQIFKIFRVLGTPNEAIWPDIVYLPDFKPSFPQWRRKDLSQVVPSLDPRGIDLLDKLLAYDPINRISARRAAIHPYFQES"
f.data #=> "\nMSGELANYKRLEKVGEGTYGVVYKALDLRPGQGQRVVALKKIRLESEDEGVPSTAIREISLLKELKDDNI\nVRLYDIVHSDAHKLYLVFEFLDLDLKRYMEGIPKDQPLGADIVKKFMMQLCKGIAYCHSHRILHRDLKPQ\nNLLINKDGNLKLGDFGLARAFGVPLRAYTHEIVTLWYRAPEVLLGGKQYSTGVDTWSIGCIFAEMCNRKP\nIFSGDSEIDQIFKIFRVLGTPNEAIWPDIVYLPDFKPSFPQWRRKDLSQVVPSLDPRGIDLLDKLLAYDP\nINRISARRAAIHPYFQES\n"
f.length #=> 298
f.aaseq #=> "MSGELANYKRLEKVGEGTYGVVYKALDLRPGQGQRVVALKKIRLESEDEGVPSTAIREISLLKELKDDNIVRLYDIVHSDAHKLYLVFEFLDLDLKRYMEGIPKDQPLGADIVKKFMMQLCKGIAYCHSHRILHRDLKPQNLLINKDGNLKLGDFGLARAFGVPLRAYTHEIVTLWYRAPEVLLGGKQYSTGVDTWSIGCIFAEMCNRKPIFSGDSEIDQIFKIFRVLGTPNEAIWPDIVYLPDFKPSFPQWRRKDLSQVVPSLDPRGIDLLDKLLAYDPINRISARRAAIHPYFQES"
f.aaseq.composition #=> {"M"=>5, "S"=>15, "G"=>21, "E"=>16, "L"=>36, "A"=>17, "N"=>8, "Y"=>13, "K"=>22, "R"=>20, "V"=>18, "T"=>7, "D"=>23, "P"=>17, "Q"=>10, "I"=>23, "H"=>7, "F"=>12, "C"=>4, "W"=>4}
f.aalen #=> 298

A less structured fasta entry

f = Bio::FastaFormat.new(">abc 123 456\nASDF\n")

f.entry #=> ">abc 123 456\nASDF\n"

f.entry_id #=> "abc"
f.definition #=> "abc 123 456"
f.comment #=> nil
f.accession #=> nil
f.accessions #=> []
f.acc_version #=> nil

f.seq #=> "ASDF"
f.data #=> "\nASDF\n"
f.length #=> 4
f.aaseq #=> "ASDF"
f.aaseq.composition #=> {"A"=>1, "S"=>1, "D"=>1, "F"=>1}
f.aalen #=> 4


Direct Known Subclasses

FastaNumericFormat

Constant Summary

DELIMITER = RS = "\n>"

Entry delimiter in flatfile text.

DELIMITER_OVERRUN = 1

(Integer) excess read size included in DELIMITER.
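The two constants suggest how a multi-entry file can be cut into entries: read up to and including "\n>", then hand the one excess character (the '>') over to the next entry. The following is a minimal standalone sketch of that idea, not BioRuby's own reader. The names FASTA_DELIMITER, FASTA_DELIMITER_OVERRUN and read_fasta_entries are hypothetical and exist only for this illustration.

```ruby
require 'stringio'

# Hypothetical constants mirroring DELIMITER and DELIMITER_OVERRUN above.
FASTA_DELIMITER = "\n>"
FASTA_DELIMITER_OVERRUN = 1 # the '>' read past the end of an entry

# Hypothetical helper: splits an IO into FASTA entry strings.
def read_fasta_entries(io)
  entries = []
  carry = '' # overrun characters that belong to the next entry
  while (chunk = io.gets(FASTA_DELIMITER))
    chunk = carry + chunk
    if chunk.end_with?(FASTA_DELIMITER)
      # Cut the overrun off this entry and keep it for the next one.
      carry = chunk.slice!(-FASTA_DELIMITER_OVERRUN, FASTA_DELIMITER_OVERRUN)
    else
      carry = ''
    end
    entries << chunk
  end
  entries
end

entries = read_fasta_entries(StringIO.new(">a\nATGC\n>b\nGGCC\n"))
p entries # [">a\nATGC\n", ">b\nGGCC\n"]
```

Each resulting string starts with its own '>' and could be passed to Bio::FastaFormat.new one at a time.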


Methods inherited from DB

#exists?, #fetch, #get, open, #tags

Constructor Details

#initialize(str) ⇒ FastaFormat

Stores the comment and sequence information from one entry of the FASTA format string. If the argument contains more than one entry, only the first entry is used.



# File 'lib/bio/db/fasta.rb', line 131

def initialize(str)
  @definition = str[/.*/].sub(/^>/, '').strip	# 1st line
  @data = str.sub(/.*/, '')				# rests
  @data.sub!(/^>.*/m, '')	# remove trailing entries for sure
  @entry_overrun = $&
end
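To see what the three regular expressions in the constructor do, the same steps can be run on a plain string without loading the library. This is only a sketch: split_fasta_entry is a hypothetical helper, and it uses String#slice! to capture the removed text where the original relies on sub! and $&.

```ruby
# Hypothetical helper mirroring the splitting steps of #initialize.
def split_fasta_entry(str)
  definition = str[/.*/].sub(/^>/, '').strip # first line, leading '>' dropped
  data = str.sub(/.*/, '')                   # everything after the first line
  overrun = data.slice!(/^>.*/m)             # cut off any following entries
  [definition, data, overrun]
end

definition, data, overrun =
  split_fasta_entry(">abc 123 456\nASDF\n>next entry\nGHIK\n")

p definition # "abc 123 456"
p data       # "\nASDF\n"
p overrun    # ">next entry\nGHIK\n"
```

When the input holds a single entry, nothing is cut off and the third value is nil. An input without the leading '>' yields the same definition.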

Instance Attribute Details

#data ⇒ Object

The sequence lines in text.



# File 'lib/bio/db/fasta.rb', line 124

def data
  @data
end

#definition ⇒ Object

The comment line of the FASTA formatted data.



# File 'lib/bio/db/fasta.rb', line 121

def definition
  @definition
end

#entry_overrun ⇒ Object (readonly)

Returns the value of attribute entry_overrun.



# File 'lib/bio/db/fasta.rb', line 126

def entry_overrun
  @entry_overrun
end

Instance Method Details

#aalen ⇒ Object

Returns the length of the sequence as a Bio::Sequence::AA.



# File 'lib/bio/db/fasta.rb', line 221

def aalen
  self.aaseq.length
end

#aaseq ⇒ Object

Returns the sequence as a Bio::Sequence::AA object.



# File 'lib/bio/db/fasta.rb', line 216

def aaseq
  Sequence::AA.new(seq)
end

#acc_version ⇒ Object

Returns accession number with version.



# File 'lib/bio/db/fasta.rb', line 277

def acc_version
  identifiers.acc_version
end

#accession ⇒ Object

Returns an accession number.



# File 'lib/bio/db/fasta.rb', line 265

def accession
  identifiers.accession
end

#accessions ⇒ Object

Parses the FASTA defline (using the #identifiers method) and returns the accession numbers as an array of strings.



# File 'lib/bio/db/fasta.rb', line 272

def accessions
  identifiers.accessions
end

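#comment ⇒ Object

Returns the comments collected from ‘#’ lines in the sequence data (a Hash keyed by offset into the sequence), or nil if the data does not begin with a comment line.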



# File 'lib/bio/db/fasta.rb', line 195

def comment
  seq
  @comment
end

#entry ⇒ Object Also known as: to_s

Returns the stored entry as a FASTA-formatted string (same as to_s).



# File 'lib/bio/db/fasta.rb', line 139

def entry
  @entry = ">#{@definition}\n#{@data.strip}\n"
end

#entry_id ⇒ Object

Parses the FASTA defline (using the #identifiers method) and returns a possibly unique identifier as a string.



# File 'lib/bio/db/fasta.rb', line 251

def entry_id
  identifiers.entry_id
end

#gi ⇒ Object

Parses the FASTA defline (using the #identifiers method) and returns the GI, locus, accession, or accession with version number. If an entry has several such IDs, only the first is returned. Returns a string or nil.



# File 'lib/bio/db/fasta.rb', line 260

def gi
  identifiers.gi
end

#identifiers ⇒ Object

Parses the FASTA defline and extracts IDs. IDs are NSIDs (NCBI standard FASTA sequence identifiers) or “:”-separated IDs. Returns a Bio::FastaDefline instance.



# File 'lib/bio/db/fasta.rb', line 241

def identifiers
  unless defined?(@ids) then
    @ids = FastaDefline.new(@definition)
  end
  @ids
end
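The defined?(@ids) guard means the defline is parsed at most once, on first use, and the result is reused by every method that delegates to #identifiers. The pattern can be tried in isolation with the sketch below. LazyDefline and parse_count are hypothetical names, and a simple split stands in for FastaDefline.new.

```ruby
# Hypothetical stand-in class showing the lazy, parse-once pattern.
class LazyDefline
  attr_reader :parse_count

  def initialize(definition)
    @definition = definition
    @parse_count = 0
  end

  def identifiers
    unless defined?(@ids)
      @parse_count += 1
      @ids = @definition.split('|') # stand-in for FastaDefline.new(@definition)
    end
    @ids
  end
end

d = LazyDefline.new('gi|398365175|ref|NP_009718.3|')
d.identifiers
d.identifiers
p d.parse_count # 1
```

Unlike a `||=` guard, defined? also treats a nil or false result as cached, so the parse is never repeated.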

#length ⇒ Object

Returns sequence length.



# File 'lib/bio/db/fasta.rb', line 201

def length
  seq.length
end

#locus ⇒ Object

Returns locus.



# File 'lib/bio/db/fasta.rb', line 282

def locus
  identifiers.locus
end

#nalen ⇒ Object

Returns the length of the sequence as a Bio::Sequence::NA.



# File 'lib/bio/db/fasta.rb', line 211

def nalen
  self.naseq.length
end

#naseq ⇒ Object

Returns the sequence as a Bio::Sequence::NA object.



# File 'lib/bio/db/fasta.rb', line 206

def naseq
  Sequence::NA.new(seq)
end

#query(factory) ⇒ Object Also known as: fasta, blast

Executes a FASTA/BLAST search using a Bio::Fasta or Bio::Blast factory object.

#!/usr/bin/env ruby
require 'bio'

factory = Bio::Fasta.local('fasta34', 'db/swissprot.f')
flatfile = Bio::FlatFile.open(Bio::FastaFormat, 'queries.f')
flatfile.each do |entry|
  p entry.definition
  result = entry.fasta(factory)
  result.each do |hit|
    print "#{hit.query_id} : #{hit.evalue}\t#{hit.target_id} at "
    p hit.lap_at
  end
end


# File 'lib/bio/db/fasta.rb', line 162

def query(factory)
  factory.query(entry)
end

#seq ⇒ Object

Returns the sequence lines joined into a single string, with whitespace and digits removed.



# File 'lib/bio/db/fasta.rb', line 169

def seq
  unless defined?(@seq)
    unless /\A\s*^\#/ =~ @data then
      @seq = Sequence::Generic.new(@data.tr(" \t\r\n0-9", '')) # lazy clean up
    else
      a = @data.split(/(^\#.*$)/)
      i = 0
      cmnt = {}
      s = []
      a.each do |x|
        if /^# ?(.*)$/ =~ x then
          cmnt[i] ? cmnt[i] << "\n" << $1 : cmnt[i] = $1
        else
          x.tr!(" \t\r\n0-9", '') # lazy clean up
          i += x.length
          s << x
        end
      end
      @comment = cmnt
      @seq = Bio::Sequence::Generic.new(s.join(''))
    end
  end
  @seq
end
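The comment-handling branch above is easier to follow on a small input. The sketch below repeats the same steps on a plain string, so it runs without the library. clean_fasta_data is a hypothetical helper that returns a plain String and a Hash rather than setting @seq and @comment.

```ruby
# Hypothetical helper mirroring the logic of #seq on a raw data string.
def clean_fasta_data(data)
  # Fast path: no leading '#' line, so only strip whitespace and digits.
  return [data.tr(" \t\r\n0-9", ''), nil] unless data =~ /\A\s*^\#/

  comments = {}
  position = 0 # offset into the cleaned sequence
  sequence = +''
  data.split(/(^\#.*$)/).each do |part|
    if part =~ /^# ?(.*)$/
      # Comment line: file it under the current sequence offset.
      text = Regexp.last_match(1)
      comments[position] = comments[position] ? comments[position] + "\n" + text : text
    else
      cleaned = part.tr(" \t\r\n0-9", '') # drop whitespace and digits
      position += cleaned.length
      sequence << cleaned
    end
  end
  [sequence, comments]
end

seq, comments = clean_fasta_data("\n# signal peptide\nMKT 10\n# mature chain\nAYIAK\n")
p seq         # "MKTAYIAK"
p comments[0] # "signal peptide"
p comments[3] # "mature chain"
```

Each comment is stored under the number of sequence characters that precede it, which is why the second comment appears under key 3.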

#to_biosequence ⇒ Object Also known as: to_seq

Returns sequence as a Bio::Sequence object.

Note: For efficiency, modifying the returned Bio::Sequence object may (but does not always) also change the sequence or definition in this FastaFormat object.



# File 'lib/bio/db/fasta.rb', line 232

def to_biosequence
  Bio::Sequence.adapter(self, Bio::Sequence::Adapter::FastaFormat)
end