Class: FastaFile

Inherits:
File
  • Object
Defined in:
lib/parse_fasta/fasta_file.rb

Overview

Provides a simple interface for parsing fasta format files. Gzipped files are detected and read transparently.

Class Method Details

.open(fname, *args) ⇒ FastaFile

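Use it like IO::open. Before the file is opened, its first character is checked (reading through gzip if necessary), and ParseFasta::DataFormatError is raised unless that character is ‘>’.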

Parameters:

  • fname (String)

    the name of the file to open

Returns:

  • (FastaFile)

# File 'lib/parse_fasta/fasta_file.rb', line 30

def self.open(fname, *args)
  begin
    handle = Zlib::GzipReader.open(fname)
  rescue Zlib::GzipFile::Error => e
    handle = File.open(fname)
  end

  unless handle.each_char.peek[0] == '>'
    raise ParseFasta::DataFormatError
  end

  handle.close

  super
end
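
The gzip fallback and the leading ‘>’ check used above can be tried outside the library with the standard library alone. The following is a minimal sketch, not ParseFasta code: the helper names first_char and looks_like_fasta? are hypothetical, and it returns a boolean where .open raises ParseFasta::DataFormatError.

```ruby
require 'zlib'

# Hypothetical helper (not part of ParseFasta): returns the first character
# of a file, reading through gzip when the file is compressed.
def first_char(path)
  begin
    handle = Zlib::GzipReader.open(path)
  rescue Zlib::GzipFile::Error
    # Not a gzip stream, so fall back to reading the file as plain text.
    handle = File.open(path)
  end
  handle.getc # nil for an empty file
ensure
  handle.close if handle && !handle.closed?
end

# Hypothetical check mirroring the '>' test in .open.
def looks_like_fasta?(path)
  first_char(path) == '>'
end
```

Using getc instead of each_char.peek means an empty file yields nil rather than raising StopIteration, so it is simply reported as not being fasta.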

Instance Method Details

#each_record(separate_lines = nil) {|header, sequence| ... } ⇒ Object

Analogous to IO#each_line, #each_record is used to go through a fasta file record by record. It accepts gzipped files as well.

Examples:

Parsing a fasta file (default behavior, gzip files are fine)

FastaFile.open('reads.fna.gz').each_record do |header, sequence|
  puts [header, sequence.gc].join("\t")
end

Parsing a fasta file (with truthy value param)

FastaFile.open('reads.fna').each_record(1) do |header, sequence|
  # header => 'sequence_1'
  # sequence => ['AACTG', 'AGTCGT', ... ]
end

Parameters:

  • separate_lines (Object) (defaults to: nil)

    If truthy, yield the lines of each record as an array of Sequences; if falsy, yield a single Sequence object for the whole sequence instead.

Yields:

  • The header and sequence for each record in the fasta file to the block

Yield Parameters:

  • header (String)

    The header of the fasta record without the leading ‘>’

  • sequence (Sequence, Array<Sequence>)

    The sequence of the fasta record. If ‘separate_lines’ is falsy (the default behavior), will be Sequence, but if truthy will be Array<Sequence>.

Raises:



# File 'lib/parse_fasta/fasta_file.rb', line 95

def each_record(separate_lines=nil)
  begin
    f = Zlib::GzipReader.open(self)
  rescue Zlib::GzipFile::Error => e
    f = self
  end

  if separate_lines
    f.each("\n>") do |line|
      header, sequence = parse_line_separately(line)
      yield(header.strip, sequence)
    end

    # f.each_with_index(">") do |line, idx|
    #   if idx.zero?
    #     if line != ">"
    #       raise ParseFasta::DataFormatError
    #     end
    #   else
    #     header, sequence = parse_line_separately(line)
    #     yield(header.strip, sequence)
    #   end
    # end
  else
    f.each("\n>") do |line|
      header, sequence = parse_line(line)
      yield(header.strip, Sequence.new(sequence || ""))
    end

    # f.each_with_index(sep=/^>/) do |line, idx|
    #   if idx.zero?
    #     if line != ">"
    #       raise ParseFasta::DataFormatError
    #     end
    #   else
    #     header, sequence = parse_line(line)
    #     yield(header.strip, Sequence.new(sequence || ""))
    #   end
    # end
  end

  f.close if f.instance_of?(Zlib::GzipReader)
  return f
end
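
The record splitting above relies on passing "\n>" as the separator to IO#each, so each chunk holds exactly one record. The sketch below shows that idea on an in-memory string. It is an illustration under assumptions: fasta_records is a hypothetical helper, it returns plain Strings rather than Sequence objects, and it does not reproduce the library's parse_line or parse_line_separately.

```ruby
require 'stringio'

# Hypothetical helper (not the library's parser): collects
# [header, sequence] pairs from an IO, using "\n>" as the record separator.
def fasta_records(io, separate_lines = nil)
  records = []
  io.each("\n>") do |chunk|
    # Drop the trailing separator, then the '>' that starts the first record.
    header, *lines = chunk.chomp("\n>").sub(/\A>/, '').split("\n")
    sequence = separate_lines ? lines : lines.join
    records << [header.to_s.strip, sequence]
  end
  records
end

data = ">seq1 first\nAACTG\nAGTCGT\n>seq2\nGGCC\n"

fasta_records(StringIO.new(data))
# => [["seq1 first", "AACTGAGTCGT"], ["seq2", "GGCC"]]

fasta_records(StringIO.new(data), 1)
# => [["seq1 first", ["AACTG", "AGTCGT"]], ["seq2", ["GGCC"]]]
```

Because the separator consumes the ‘>’ of every record after the first, only the first chunk still carries a leading ‘>’, which is why it is removed with an anchored sub.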

#each_record_fast {|header, sequence| ... } ⇒ Object

Note:

If the fasta file has spaces in the sequence, they will be retained. If this is a problem, use #each_record instead.

Fast version of #each_record

Yields the sequence as a String, not Sequence. No separate lines option.

Yields:

  • The header and sequence for each record in the fasta file to the block

Yield Parameters:

  • header (String)

    The header of the fasta record without the leading ‘>’

  • sequence (String)

    The sequence of the fasta record

Raises:

  • (ParseFasta::SequenceFormatError)

    if a sequence contains ‘>’

# File 'lib/parse_fasta/fasta_file.rb', line 157

def each_record_fast
  begin
    f = Zlib::GzipReader.open(self)
  rescue Zlib::GzipFile::Error => e
    f = self
  end

  f.each("\n>") do |line|
    header, sequence = parse_line(line)

    raise ParseFasta::SequenceFormatError if sequence.include? ">"

    yield(header.strip, sequence)
  end

  f.close if f.instance_of?(Zlib::GzipReader)
  return f
end
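
Two behaviours of the fast path are worth seeing on a single chunk: spaces inside the sequence survive, and a stray ‘>’ inside the sequence is an error. The sketch below is an assumption-laden stand-in, not the library's code: fast_record is a hypothetical helper, and the local SequenceFormatError class only imitates ParseFasta::SequenceFormatError.

```ruby
# Hypothetical stand-in for ParseFasta::SequenceFormatError.
class SequenceFormatError < StandardError; end

# Hypothetical helper: parses one "\n>"-terminated chunk into
# [header, sequence], joining lines but leaving spaces untouched.
def fast_record(chunk)
  header, sequence = chunk.chomp("\n>").sub(/\A>/, '').split("\n", 2)
  sequence = sequence.to_s.delete("\n")
  raise SequenceFormatError, "'>' found in sequence" if sequence.include?('>')
  [header.to_s.strip, sequence]
end

fast_record(">s1\nAC GT\nTT\n>")
# => ["s1", "AC GTTT"]
```

The space between "AC" and "GT" is kept, which matches the note above; a caller who needs it removed could call delete(' ') on the yielded String.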

#to_hash ⇒ Hash

Returns the records in the fasta file as a hash map with the headers as keys and the Sequences as values.

Examples:

Read a fasta file into a hash table.

seqs = FastaFile.open('reads.fa').to_hash

Returns:

  • (Hash)

    A hash with headers as keys, sequences as the values (Sequence objects)

Raises:



# File 'lib/parse_fasta/fasta_file.rb', line 56

def to_hash
  hash = {}
  self.each_record do |head, seq|
    hash[head] = seq
  end

  hash
end
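
One consequence of assigning hash[head] = seq for every record is that a repeated header keeps only its last sequence. The sketch below shows this with plain [header, sequence] pairs; records_to_hash is a hypothetical helper and uses Strings where the library stores Sequence objects.

```ruby
# Hypothetical helper: builds a header => sequence hash from
# [header, sequence] pairs, the same way #to_hash fills its hash.
def records_to_hash(records)
  records.each_with_object({}) do |(header, sequence), hash|
    hash[header] = sequence # a later duplicate header overwrites an earlier one
  end
end

records_to_hash([["seq1", "AACTG"], ["seq2", "GGCC"], ["seq1", "TTTT"]])
# => {"seq1"=>"TTTT", "seq2"=>"GGCC"}
```

If headers in a file may repeat, iterating with #each_record and handling duplicates explicitly avoids silently dropping records.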