Module: Ms::Fasta

Defined in:
lib/ms/fasta.rb

Overview

A convenience module for working with FASTA-formatted sequence databases. The file that defines this module also mixes Enumerable into Bio::FlatFile, so you can do things like this:

accessions = Ms::Fasta.open("file.fasta") do |fasta| 
  fasta.map(&:accession)
end

A few aliases are added to Bio::FastaFormat:

entry.header == entry.definition
entry.sequence == entry.seq

Ms::Fasta.new accepts either an IO object or a String (the FASTA-formatted data itself):

# taking an io object:
File.open("file.fasta") do |io| 
  fasta = Ms::Fasta.new(io)
  # ... do something with it
end
# taking a string
string = ">id1 a simple header\nAAASDDEEEDDD\n>id2 header again\nPPPPPPWWWWWWTTTTYY\n"
fasta = Ms::Fasta.new(string)
(simple, not_simple) = fasta.partition {|entry| entry.header =~ /simple/ }
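
The String-or-IO handling can be illustrated without BioRuby. The following is a minimal, stdlib-only sketch; `to_fasta_io` is a hypothetical helper name used here for illustration and is not part of Ms::Fasta. It assumes only that a String is wrapped in a StringIO and anything else is passed through unchanged:

```ruby
require 'stringio'

# Hypothetical helper (not part of Ms::Fasta): wrap a String in a StringIO,
# pass any other IO-like object through unchanged.
def to_fasta_io(source)
  source.is_a?(String) ? StringIO.new(source) : source
end

string = ">id1 a simple header\nAAASDDEEEDDD\n>id2 header again\nPPPPPPWWWWWWTTTTYY\n"

# Both call styles end up with an object that can be read line by line.
from_string = to_fasta_io(string)
from_io     = to_fasta_io(StringIO.new(string))

# Collect the header lines from the String-backed IO.
headers = from_string.each_line.select { |line| line.start_with?(">") }.map(&:chomp)
# headers == [">id1 a simple header", ">id2 header again"]
```

Because the caller only ever sees an IO-like object, the rest of the parsing code does not need to know which form the input took.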

Class Method Details

.foreach(file, &block) ⇒ Object

Yields each Bio::FastaFormat object in turn.

# File 'lib/ms/fasta.rb', line 46

def self.foreach(file, &block)
  Bio::FlatFile.open(Bio::FastaFormat, file) do |fasta|
    fasta.each(&block)
  end
end

.new(io) ⇒ Object

Takes an IO object or a String that is the FASTA data itself.

# File 'lib/ms/fasta.rb', line 53

def self.new(io)
  io = StringIO.new(io) if io.is_a?(String)
  Bio::FlatFile.new(Bio::FastaFormat, io)
end

.open(file, &block) ⇒ Object

Opens the flat file and yields a Bio::FlatFile object.

# File 'lib/ms/fasta.rb', line 41

def self.open(file, &block)
  Bio::FlatFile.open(Bio::FastaFormat, file, &block)
end

.protein_lengths_and_descriptions(file) ⇒ Object

Returns two hashes, [id_to_length, id_to_description]. This is about 4x faster than the official route.

# File 'lib/ms/fasta.rb', line 60

def self.protein_lengths_and_descriptions(file)
  protid_to_description = {}
  # header lines look like ">id description" (the id must be followed by a space)
  re = /^>([^\s]+) (.*)/
  ids = []
  lengths = []
  current_length = nil
  IO.foreach(file) do |line|
    line.chomp!
    if md = re.match(line)
      # new entry: store the previous entry's length (nil before the first entry)
      lengths << current_length
      current_id = md[1]
      ids << current_id
      current_length = 0
      protid_to_description[current_id] = md[2]
    else
      # sequence line: add its residues to the running length
      current_length += line.size
    end
  end
  lengths << current_length # length of the last entry
  lengths.shift # remove the first nil entry
  [Hash[ids.zip(lengths)], protid_to_description]
end
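
To illustrate the shape of the return value, here is a minimal stdlib-only sketch run against a temporary file. `lengths_and_descriptions` is a hypothetical stand-in written for this example, not the Ms::Fasta implementation. Unlike the pattern above, it also accepts headers that carry no description, and it sums sequence lengths across wrapped lines:

```ruby
require 'tempfile'

# Hypothetical stand-in for illustration; not the Ms::Fasta implementation.
def lengths_and_descriptions(file)
  id_to_length = {}
  id_to_description = {}
  current_id = nil
  File.foreach(file) do |line|
    line = line.chomp
    if line.start_with?(">")
      # Split ">id description" into the id and the (optional) description.
      current_id, description = line[1..-1].split(/\s+/, 2)
      id_to_length[current_id] = 0
      id_to_description[current_id] = description.to_s
    elsif current_id
      # Sequence line: add its residues to the current entry's length.
      id_to_length[current_id] += line.size
    end
  end
  [id_to_length, id_to_description]
end

lengths, descriptions = Tempfile.create(["example", ".fasta"]) do |f|
  f.write(">id1 a simple header\nAAASDD\nEEEDDD\n>id2\nPPPPPPWWWWWWTTTTYY\n")
  f.flush
  lengths_and_descriptions(f.path)
end

# lengths["id1"] == 12 and lengths["id2"] == 18
# descriptions["id1"] == "a simple header" and descriptions["id2"] == ""
```

Both hashes are keyed by the protein id, so the length and the description for a given entry can be looked up independently.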