Class: Bio::FastaFormat
Overview
Treats a FASTA formatted entry, such as:
>id and/or some comments <== definition line
ATGCATGCATGCATGCATGCATGCATGCATGCATGC <== sequence lines
ATGCATGCATGCATGCATGCATGCATGCATGCATGC
ATGCATGCATGC
The leading ‘>’ can be omitted, and a trailing ‘>’ is removed automatically.
Examples
fasta_string = <<END_OF_STRING
>gi|398365175|ref|NP_009718.3| Cdc28p [Saccharomyces cerevisiae S288c]
MSGELANYKRLEKVGEGTYGVVYKALDLRPGQGQRVVALKKIRLESEDEGVPSTAIREISLLKELKDDNI
VRLYDIVHSDAHKLYLVFEFLDLDLKRYMEGIPKDQPLGADIVKKFMMQLCKGIAYCHSHRILHRDLKPQ
NLLINKDGNLKLGDFGLARAFGVPLRAYTHEIVTLWYRAPEVLLGGKQYSTGVDTWSIGCIFAEMCNRKP
IFSGDSEIDQIFKIFRVLGTPNEAIWPDIVYLPDFKPSFPQWRRKDLSQVVPSLDPRGIDLLDKLLAYDP
INRISARRAAIHPYFQES
END_OF_STRING
f = Bio::FastaFormat.new(fasta_string)
f.entry #=> ">gi|398365175|ref|NP_009718.3| Cdc28p [Saccharomyces cerevisiae S288c]\n"+
# "MSGELANYKRLEKVGEGTYGVVYKALDLRPGQGQRVVALKKIRLESEDEGVPSTAIREISLLKELKDDNI\n"+
# "VRLYDIVHSDAHKLYLVFEFLDLDLKRYMEGIPKDQPLGADIVKKFMMQLCKGIAYCHSHRILHRDLKPQ\n"+
# "NLLINKDGNLKLGDFGLARAFGVPLRAYTHEIVTLWYRAPEVLLGGKQYSTGVDTWSIGCIFAEMCNRKP\n"+
# "IFSGDSEIDQIFKIFRVLGTPNEAIWPDIVYLPDFKPSFPQWRRKDLSQVVPSLDPRGIDLLDKLLAYDP\n"+
# "INRISARRAAIHPYFQES\n"
Methods related to the name of the sequence
A larger range of methods for dealing with Fasta definition lines can be found in FastaDefline, accessed through the FastaFormat#identifiers method.
f.entry_id #=> "gi|398365175"
f.first_name #=> "gi|398365175|ref|NP_009718.3|"
f.definition #=> "gi|398365175|ref|NP_009718.3| Cdc28p [Saccharomyces cerevisiae S288c]"
f.identifiers #=> Bio::FastaDefline instance
f.accession #=> "NP_009718"
f.accessions #=> ["NP_009718"]
f.acc_version #=> "NP_009718.3"
f.comment #=> nil
Methods related to the actual sequence
f.seq #=> "MSGELANYKRLEKVGEGTYGVVYKALDLRPGQGQRVVALKKIRLESEDEGVPSTAIREISLLKELKDDNIVRLYDIVHSDAHKLYLVFEFLDLDLKRYMEGIPKDQPLGADIVKKFMMQLCKGIAYCHSHRILHRDLKPQNLLINKDGNLKLGDFGLARAFGVPLRAYTHEIVTLWYRAPEVLLGGKQYSTGVDTWSIGCIFAEMCNRKPIFSGDSEIDQIFKIFRVLGTPNEAIWPDIVYLPDFKPSFPQWRRKDLSQVVPSLDPRGIDLLDKLLAYDPINRISARRAAIHPYFQES"
f.data #=> "\nMSGELANYKRLEKVGEGTYGVVYKALDLRPGQGQRVVALKKIRLESEDEGVPSTAIREISLLKELKDDNI\nVRLYDIVHSDAHKLYLVFEFLDLDLKRYMEGIPKDQPLGADIVKKFMMQLCKGIAYCHSHRILHRDLKPQ\nNLLINKDGNLKLGDFGLARAFGVPLRAYTHEIVTLWYRAPEVLLGGKQYSTGVDTWSIGCIFAEMCNRKP\nIFSGDSEIDQIFKIFRVLGTPNEAIWPDIVYLPDFKPSFPQWRRKDLSQVVPSLDPRGIDLLDKLLAYDP\nINRISARRAAIHPYFQES\n"
f.length #=> 298
f.aaseq #=> "MSGELANYKRLEKVGEGTYGVVYKALDLRPGQGQRVVALKKIRLESEDEGVPSTAIREISLLKELKDDNIVRLYDIVHSDAHKLYLVFEFLDLDLKRYMEGIPKDQPLGADIVKKFMMQLCKGIAYCHSHRILHRDLKPQNLLINKDGNLKLGDFGLARAFGVPLRAYTHEIVTLWYRAPEVLLGGKQYSTGVDTWSIGCIFAEMCNRKPIFSGDSEIDQIFKIFRVLGTPNEAIWPDIVYLPDFKPSFPQWRRKDLSQVVPSLDPRGIDLLDKLLAYDPINRISARRAAIHPYFQES"
f.aaseq.composition #=> {"M"=>5, "S"=>15, "G"=>21, "E"=>16, "L"=>36, "A"=>17, "N"=>8, "Y"=>13, "K"=>22, "R"=>20, "V"=>18, "T"=>7, "D"=>23, "P"=>17, "Q"=>10, "I"=>23, "H"=>7, "F"=>12, "C"=>4, "W"=>4}
f.aalen #=> 298
A less structured fasta entry
f.entry #=> ">abc 123 456\nASDF\n"
f.entry_id #=> "abc"
f.first_name #=> "abc"
f.definition #=> "abc 123 456"
f.comment #=> nil
f.accession #=> nil
f.accessions #=> []
f.acc_version #=> nil
f.seq #=> "ASDF"
f.data #=> "\nASDF\n"
f.length #=> 4
f.aaseq #=> "ASDF"
f.aaseq.composition #=> {"A"=>1, "S"=>1, "D"=>1, "F"=>1}
f.aalen #=> 4
References
- FASTA format (Wikipedia): en.wikipedia.org/wiki/FASTA_format
Direct Known Subclasses
Constant Summary
- DELIMITER = RS = "\n>"
  Entry delimiter in flatfile text.
- DELIMITER_OVERRUN = 1
  (Integer) excess read size included in DELIMITER.
Instance Attribute Summary
- #data ⇒ Object
  The sequence lines in text.
- #definition ⇒ Object
  The comment line of the FASTA formatted data.
- #entry_overrun ⇒ Object (readonly)
  Returns the value of attribute entry_overrun.
Instance Method Summary
- #aalen ⇒ Object
  Returns the length of the Bio::Sequence::AA sequence.
- #aaseq ⇒ Object
  Returns the sequence as a Bio::Sequence::AA.
- #acc_version ⇒ Object
  Returns accession number with version.
- #accession ⇒ Object
  Returns an accession number.
- #accessions ⇒ Object
  Parses the FASTA defline (using the #identifiers method) and returns the accession numbers.
- #comment ⇒ Object
  Returns comments.
- #entry ⇒ Object (also: #to_s)
  Returns the stored entry as a FASTA-formatted string.
- #entry_id ⇒ Object
  Parses the FASTA defline (using the #identifiers method) and returns a possibly unique identifier.
- #first_name ⇒ Object
  Returns the first name (word) of the definition line - everything before the first whitespace.
- #gi ⇒ Object
  Parses the FASTA defline (using the #identifiers method) and returns the GI/locus/accession/accession with version number.
- #identifiers ⇒ Object
  Parses the FASTA defline and extracts IDs.
- #initialize(str) ⇒ FastaFormat (constructor)
  Stores the comment and sequence information from one entry of the FASTA format string.
- #length ⇒ Object
  Returns sequence length.
- #locus ⇒ Object
  Returns locus.
- #nalen ⇒ Object
  Returns the length of the Bio::Sequence::NA sequence.
- #naseq ⇒ Object
  Returns the sequence as a Bio::Sequence::NA.
- #query(factory) ⇒ Object (also: #fasta, #blast)
  Executes FASTA/BLAST search by using a Bio::Fasta or a Bio::Blast factory object.
- #seq ⇒ Object
  Returns a joined sequence line as a String.
- #to_biosequence ⇒ Object (also: #to_seq)
  Returns sequence as a Bio::Sequence object.
Methods inherited from DB
#exists?, #fetch, #get, open, #tags
Constructor Details
#initialize(str) ⇒ FastaFormat
Stores the comment and sequence information from one entry of the FASTA format string. If the argument contains more than one entry, only the first entry is used.
# File 'lib/bio/db/fasta.rb', line 133
def initialize(str)
  @definition = str[/.*/].sub(/^>/, '').strip  # 1st line
  @data = str.sub(/.*/, '')                    # rests
  @data.sub!(/^>.*/m, '')                      # remove trailing entries for sure
  @entry_overrun = $&
end
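The splitting done by the constructor can be tried without the library. The following is a minimal stand-alone sketch that mirrors the three substitutions in the source above; split_first_entry is a hypothetical helper name used only for illustration and is not part of BioRuby.

```ruby
# Stand-alone sketch (no BioRuby required). split_first_entry is a
# hypothetical helper that mirrors the regular expressions used by
# Bio::FastaFormat#initialize.
def split_first_entry(str)
  definition = str[/.*/].sub(/^>/, '').strip  # first line, leading '>' dropped
  data = str.sub(/.*/, '')                    # everything after the first line
  data.sub!(/^>.*/m, '')                      # cut off any following entries
  overrun = $&                                # the text that was cut off, or nil
  { definition: definition, data: data, overrun: overrun }
end

# Two entries in one string: only the first one is kept.
first = split_first_entry(">abc 123 456\nASDF\n>next entry\nGGCC\n")

# A single entry whose leading '>' was omitted.
single = split_first_entry("abc\nASDF")
```

In this sketch, first[:overrun] holds the text of the second entry, which corresponds to what the class exposes as entry_overrun, while single[:overrun] is nil because nothing had to be cut off.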
Instance Attribute Details
#data ⇒ Object
The sequence lines in text.
# File 'lib/bio/db/fasta.rb', line 126
def data
  @data
end
#definition ⇒ Object
The comment line of the FASTA formatted data.
# File 'lib/bio/db/fasta.rb', line 123
def definition
  @definition
end
#entry_overrun ⇒ Object (readonly)
Returns the value of attribute entry_overrun.
# File 'lib/bio/db/fasta.rb', line 128
def entry_overrun
  @entry_overrun
end
Instance Method Details
#aalen ⇒ Object
Returns the length of the Bio::Sequence::AA sequence.
# File 'lib/bio/db/fasta.rb', line 223
def aalen
  self.aaseq.length
end
#aaseq ⇒ Object
Returns the sequence as a Bio::Sequence::AA.
# File 'lib/bio/db/fasta.rb', line 218
def aaseq
  Sequence::AA.new(seq)
end
#acc_version ⇒ Object
Returns accession number with version.
# File 'lib/bio/db/fasta.rb', line 279
def acc_version
  identifiers.acc_version
end
#accession ⇒ Object
Returns an accession number.
# File 'lib/bio/db/fasta.rb', line 267
def accession
  identifiers.accession
end
#accessions ⇒ Object
Parses the FASTA defline (using the #identifiers method) and returns the accession numbers as an array of strings.
# File 'lib/bio/db/fasta.rb', line 274
def accessions
  identifiers.accessions
end
#comment ⇒ Object
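Returns the comments embedded in the sequence data (lines beginning with ‘#’) as a Hash that maps the sequence position at which each comment occurs to its text. Returns nil when the data does not begin with a comment line, because comments are parsed only in that case.
# File 'lib/bio/db/fasta.rb', line 197
def comment
  seq
  @comment
end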
#entry ⇒ Object Also known as: to_s
Returns the stored entry as a FASTA-formatted string (same as to_s).
# File 'lib/bio/db/fasta.rb', line 141
def entry
  @entry = ">#{@definition}\n#{@data.strip}\n"
end
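As a small illustration of the string interpolation above, the sketch below rebuilds an entry from its two parts. rebuild_entry is a hypothetical stand-alone helper, not a BioRuby method; it only assumes the definition and data strings have the shape the constructor stores.

```ruby
# Hypothetical helper mirroring the interpolation in Bio::FastaFormat#entry.
def rebuild_entry(definition, data)
  # The leading '>' is always added back, the surrounding whitespace of the
  # sequence block is stripped, and exactly one trailing newline is appended.
  ">#{definition}\n#{data.strip}\n"
end

one_line = rebuild_entry("abc 123 456", "\nASDF\n")
two_lines = rebuild_entry("abc", "\n\nASDF\nGHIK\n\n")
```

Only the whitespace around the sequence block is removed; line breaks inside the block are preserved.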
#entry_id ⇒ Object
Parses the FASTA defline (using the #identifiers method) and returns a possibly unique identifier as a string.
# File 'lib/bio/db/fasta.rb', line 253
def entry_id
  identifiers.entry_id
end
#first_name ⇒ Object
Returns the first name (word) of the definition line - everything before the first whitespace.
>abc def #=> 'abc'
>gi|398365175|ref|NP_009718.3| Cdc28p [Saccharomyces cerevisiae S288c] #=> 'gi|398365175|ref|NP_009718.3|'
>abc #=> 'abc'
# File 'lib/bio/db/fasta.rb', line 294
def first_name
  index = definition.index(/\s/)
  if index.nil?
    return @definition
  else
    return @definition[0...index]
  end
end
#gi ⇒ Object
Parses the FASTA defline (using the #identifiers method) and returns the GI/locus/accession/accession with version number. If an entry has more than one such ID, only the first ID is returned. It returns a string or nil.
# File 'lib/bio/db/fasta.rb', line 262
def gi
  identifiers.gi
end
#identifiers ⇒ Object
Parses the FASTA defline and extracts IDs. IDs are NSIDs (NCBI standard FASTA sequence identifiers) or “:”-separated IDs. It returns a Bio::FastaDefline instance.
# File 'lib/bio/db/fasta.rb', line 243
def identifiers
  unless defined?(@ids) then
    @ids = FastaDefline.new(@definition)
  end
  @ids
end
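The defined?(@ids) guard above means the defline is parsed once, on first use, and the result is reused afterwards. The sketch below shows the same lazy-caching pattern in isolation. LazyDefline and its crude parse_ids method are hypothetical stand-ins so that the example runs without BioRuby; they do not reproduce what Bio::FastaDefline actually extracts.

```ruby
# Hypothetical stand-in for the caching behaviour of
# Bio::FastaFormat#identifiers. parse_ids replaces FastaDefline.new.
class LazyDefline
  attr_reader :parse_count

  def initialize(definition)
    @definition = definition
    @parse_count = 0
  end

  def identifiers
    # defined? is true once @ids has been assigned, even if the value is nil,
    # so the parse runs at most once per object.
    unless defined?(@ids)
      @ids = parse_ids(@definition)
    end
    @ids
  end

  private

  # Crude, illustrative parsing only: split the first word on '|'.
  def parse_ids(definition)
    @parse_count += 1
    definition.split(/\s/, 2).first.to_s.split('|')
  end
end

d = LazyDefline.new("gi|398365175|ref|NP_009718.3| Cdc28p")
ids_first = d.identifiers
ids_again = d.identifiers
```

Both calls return the same object, and parse_count stays at 1.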
#length ⇒ Object
Returns sequence length.
# File 'lib/bio/db/fasta.rb', line 203
def length
  seq.length
end
#locus ⇒ Object
Returns locus.
# File 'lib/bio/db/fasta.rb', line 284
def locus
  identifiers.locus
end
#nalen ⇒ Object
Returns the length of the Bio::Sequence::NA sequence.
# File 'lib/bio/db/fasta.rb', line 213
def nalen
  self.naseq.length
end
#naseq ⇒ Object
Returns the sequence as a Bio::Sequence::NA.
# File 'lib/bio/db/fasta.rb', line 208
def naseq
  Sequence::NA.new(seq)
end
#query(factory) ⇒ Object Also known as: fasta, blast
Executes FASTA/BLAST search by using a Bio::Fasta or a Bio::Blast factory object.
#!/usr/bin/env ruby
require 'bio'

factory = Bio::Fasta.local('fasta34', 'db/swissprot.f')
flatfile = Bio::FlatFile.open(Bio::FastaFormat, 'queries.f')
flatfile.each do |entry|
  p entry.definition
  result = entry.fasta(factory)
  result.each do |hit|
    print "#{hit.query_id} : #{hit.evalue}\t#{hit.target_id} at "
    p hit.lap_at
  end
end
# File 'lib/bio/db/fasta.rb', line 164
def query(factory)
  factory.query(entry)
end
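As the one-line body shows, query only hands the FASTA text of the entry to the factory's query method and returns whatever the factory returns. The sketch below illustrates that delegation with stubs. EchoFactory and QueryableEntry are hypothetical classes written for this example; a real search would use a Bio::Fasta or Bio::Blast factory as in the script above.

```ruby
# Hypothetical stub factory: it does no search and only reports what it
# was asked to look up.
class EchoFactory
  def query(fasta_text)
    "searched #{fasta_text.lines.first.chomp}"
  end
end

# Hypothetical stand-in for the relevant part of Bio::FastaFormat.
class QueryableEntry
  def initialize(definition, data)
    @definition = definition
    @data = data
  end

  def entry
    ">#{@definition}\n#{@data.strip}\n"
  end

  # Delegates to the factory, passing the FASTA-formatted entry text.
  def query(factory)
    factory.query(entry)
  end
  alias fasta query
  alias blast query
end

report = QueryableEntry.new("abc 123", "\nASDF\n").fasta(EchoFactory.new)
```

Because fasta and blast are aliases of query, the kind of search that runs depends only on the factory object that is passed in.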
#seq ⇒ Object
Returns the sequence lines joined into a single String, with whitespace and digits removed. If the data begins with a ‘#’ comment line, the comment lines are stripped out and stored for #comment.
# File 'lib/bio/db/fasta.rb', line 171
def seq
  unless defined?(@seq)
    unless /\A\s*^\#/ =~ @data then
      @seq = Sequence::Generic.new(@data.tr(" \t\r\n0-9", '')) # lazy clean up
    else
      a = @data.split(/(^\#.*$)/)
      i = 0
      cmnt = {}
      s = []
      a.each do |x|
        if /^# ?(.*)$/ =~ x then
          cmnt[i] ? cmnt[i] << "\n" << $1 : cmnt[i] = $1
        else
          x.tr!(" \t\r\n0-9", '') # lazy clean up
          i += x.length
          s << x
        end
      end
      @comment = cmnt
      @seq = Bio::Sequence::Generic.new(s.join(''))
    end
  end
  @seq
end
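The clean-up and comment handling in the source above can be followed more easily in a stand-alone form. The sketch below is an approximation under stated assumptions: clean_sequence is a hypothetical helper, and it returns a plain String and Hash rather than a Bio::Sequence::Generic and the @comment instance variable.

```ruby
# Stand-alone sketch of the logic in Bio::FastaFormat#seq.
# Returns [sequence, comments]; comments is nil when the data does not
# start with a '#' line, matching the branch structure of the original.
def clean_sequence(data)
  strip_chars = " \t\r\n0-9" # whitespace and position numbers

  # Fast path: no leading comment line, so just delete unwanted characters.
  return [data.tr(strip_chars, ''), nil] unless /\A\s*^\#/ =~ data

  comments = {} # sequence offset => comment text
  pieces = []
  offset = 0
  # The capture group keeps the comment lines in the split result.
  data.split(/(^\#.*$)/).each do |chunk|
    if /^# ?(.*)$/ =~ chunk
      text = $1
      if comments[offset]
        comments[offset] << "\n" << text # consecutive comments are joined
      else
        comments[offset] = text
      end
    else
      cleaned = chunk.tr(strip_chars, '')
      offset += cleaned.length
      pieces << cleaned
    end
  end
  [pieces.join, comments]
end

annotated, notes = clean_sequence("\n# signal peptide\nMSGE LANY 8\n# kinase domain\nKRLE\n")
plain, no_notes = clean_sequence("\nAS DF\n10\n")
```

Here notes maps offset 0 to "signal peptide" and offset 8 to "kinase domain", meaning each comment is keyed by the number of residues that precede it in the cleaned sequence.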
#to_biosequence ⇒ Object Also known as: to_seq
Returns sequence as a Bio::Sequence object.
Note: For efficiency, the returned Bio::Sequence object may share data with this FastaFormat object, so modifying it might (but will not always) change the sequence or definition stored here.
# File 'lib/bio/db/fasta.rb', line 234
def to_biosequence
  Bio::Sequence.adapter(self, Bio::Sequence::Adapter::FastaFormat)
end