Class: Bioworks

Inherits: Object

Includes: SpecID

Defined in: lib/ms/sequest/bioworks.rb

Overview

For working with the Bioworks .xml export format.

Defined Under Namespace

Modules: XML

Classes: Pep, Prot, XMLParser

Constant Summary

Regular expressions:

@@bioworksinfo_re =

/<bioworksinfo>(.*)<\/bioworksinfo>/o

@@modifications_re =

/<modifications>(.*)<\/modifications>/o

@@protein_re =

/<protein>/o

@@origfilename_re =

/<origfilename>(.*)<\/origfilename>/o

@@origfilepath_re =

/<origfilepath>(.*)<\/origfilepath>/o
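
To illustrate how patterns like these pull values out of single lines, here is a minimal sketch that applies local copies of three of them. The sample lines, the variable names, and the `first_capture` helper are assumptions made for illustration; they are not taken from a real Bioworks export or from the library.

```ruby
# Local copies of three of the patterns listed above.
origfilename_re  = /<origfilename>(.*)<\/origfilename>/
modifications_re = /<modifications>(.*)<\/modifications>/
protein_re       = /<protein>/

# Hypothetical lines, shaped the way the patterns expect them.
sample_lines = [
  "<origfilename>run1.RAW</origfilename>",
  "<modifications>(M* 15.99491) (S@ 14.9322) </modifications>",
  "<protein>"
]

# Hypothetical helper: first capture group of +regex+ in +line+, or nil.
def first_capture(line, regex)
  match = regex.match(line)
  match && match[1]
end

orig_name        = first_capture(sample_lines[0], origfilename_re)
mods             = first_capture(sample_lines[1], modifications_re)
is_protein_start = !(sample_lines[2] =~ protein_re).nil?
no_match         = first_capture(sample_lines[2], origfilename_re)
```

Each pattern expects the opening and closing tag on the same line, which is one reason the fast parser depends on the exact layout of the export.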

Instance Attribute Summary

#global_filename ⇒ Object

Returns the value of attribute global_filename.
#modifications ⇒ Object

A string of modifications, e.g., “(M* 15.99491) (S@ 14.9322) ”.
#origfilename ⇒ Object

Returns the value of attribute origfilename.
#origfilepath ⇒ Object

Returns the value of attribute origfilepath.
#peps ⇒ Object

Returns the value of attribute peps.
#prots ⇒ Object

Returns the value of attribute prots.
#version ⇒ Object

Returns the value of attribute version.

Instance Method Summary

#_uniq_peps_by_sequence_charge(peps) ⇒ Object

Returns (peptides, proteins), where peptides is the unique list of peps and proteins is a parallel array of arrays holding the proteins each unique peptide represents. Note that each pep still refers to the original prot it belongs to, even though the parallel protein array holds all the proteins it represents.
#get_prots_from_xml_stream(fh) ⇒ Object

Returns proteins and peptides.
#get_regex_val(fh, regex) ⇒ Object

Scans the filehandle for the first line matching the regex and returns its captured value, stopping at the first protein. If no match is found, returns nil and rewinds the filehandle.
#hi_prob_best ⇒ Object
#initialize(file = nil) ⇒ Bioworks constructor

A new instance of Bioworks.
#num_prots(file) ⇒ Object

Returns the number of prots.
#parse_xml(file) ⇒ Object

This is highly specific to Bioworks 3.2 xml export.
#parse_xml_by_xmlparser(file) ⇒ Object
#to_excel(file) ⇒ Object

Outputs the bioworks browser excel format (tab delimited) to file.
#to_pepxml ⇒ Object

Outputs sequest xml files (pepxml) for the trans-proteomics pipeline.
#to_sqt(params_file) ⇒ Object

Prints to files filename1.sqt, filename2.sqt, and so on. TODO: sqt file output.
#update_peptide_hit_counts ⇒ Object

For output to Excel format or other uses: updates each protein with a peptide hit count array based on the ranking of xcorr per dta file, where each array holds the total number of peptide hits followed by the counts for ranks 1, 2, 3, 4, and 5. TODO: the results do not check out yet.

Constructor Details

#initialize(file = nil) ⇒ `Bioworks`

Returns a new instance of Bioworks.

# File 'lib/ms/sequest/bioworks.rb', line 155

def initialize(file=nil)
  @peps = nil
  if file
    @filename = file
    parse_xml(file)
    #parse_xml_by_xmlparser(file)
  end
end

Instance Attribute Details

#global_filename ⇒ `Object`

Returns the value of attribute global_filename.




# File 'lib/ms/sequest/bioworks.rb', line 30

def global_filename
  @global_filename
end

#modifications ⇒ `Object`

A string of modifications, e.g., “(M* 15.99491) (S@ 14.9322) ”.




# File 'lib/ms/sequest/bioworks.rb', line 32

def modifications
  @modifications
end

#origfilename ⇒ `Object`

Returns the value of attribute origfilename.




# File 'lib/ms/sequest/bioworks.rb', line 30

def origfilename
  @origfilename
end

#origfilepath ⇒ `Object`

Returns the value of attribute origfilepath.




# File 'lib/ms/sequest/bioworks.rb', line 30

def origfilepath
  @origfilepath
end

#peps ⇒ `Object`

Returns the value of attribute peps.




# File 'lib/ms/sequest/bioworks.rb', line 30

def peps
  @peps
end

#prots ⇒ `Object`

Returns the value of attribute prots.




# File 'lib/ms/sequest/bioworks.rb', line 30

def prots
  @prots
end

#version ⇒ `Object`

Returns the value of attribute version.




# File 'lib/ms/sequest/bioworks.rb', line 30

def version
  @version
end

Instance Method Details

#_uniq_peps_by_sequence_charge(peps) ⇒ `Object`

Returns (peptides, proteins), where peptides is the unique list of peps and proteins is a parallel array of arrays holding the proteins each unique peptide represents. Note that each pep still refers to the original prot it belongs to, even though the parallel protein array holds all the proteins it represents. Assumes that each peptide points to all of its proteins in pep.prots.

# File 'lib/ms/sequest/bioworks.rb', line 136

def _uniq_peps_by_sequence_charge(peps)
  new_arr = []
  prot_arr = []
  index_accounted_for = []
  (0...peps.size).each do |i|
    next if index_accounted_for.include?(i)
    new_arr << peps[i]
    prot_arr.push( peps[i].prots )
    ((i+1)...peps.size).each do |j|
      pep1, pep2 = peps[i], peps[j]
      if pep1.sequence == pep2.sequence && pep1.charge == pep2.charge
        prot_arr.last.push( *(pep2.prots) )
        index_accounted_for << j
      end
    end
  end
  return new_arr, prot_arr
end
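
The same grouping can be expressed with a hash keyed on [sequence, charge]. The sketch below is an alternative written for illustration, not the library's implementation: it replaces the pairwise comparison with a single pass, and it collects proteins into fresh arrays instead of appending to the first peptide's own prots array. The struct and the method name are hypothetical stand-ins.

```ruby
# Group peptides by [sequence, charge]. Returns [unique_peps, prot_groups],
# two parallel arrays in first-occurrence order.
def group_peps_by_sequence_charge(peps)
  groups = {}
  peps.each do |pep|
    key = [pep.sequence, pep.charge]
    # The first peptide seen for a key becomes the representative.
    groups[key] ||= [pep, []]
    # concat copies the elements, so pep.prots itself is never modified.
    groups[key][1].concat(pep.prots)
  end
  uniq_peps   = groups.values.map { |pep, _prots| pep }
  prot_groups = groups.values.map { |_pep, prots| prots }
  [uniq_peps, prot_groups]
end

# Hypothetical stand-in for Bioworks::Pep with only the fields used here.
pep_class = Struct.new(:sequence, :charge, :prots)
peps = [
  pep_class.new("PEPTIDER", 2, ["protA"]),
  pep_class.new("PEPTIDER", 2, ["protB"]),
  pep_class.new("PEPTIDER", 3, ["protA"])
]
uniq_peps, prot_groups = group_peps_by_sequence_charge(peps)
```

Ruby hashes keep insertion order, so the unique peptides come back in the order they were first seen, as they do with the nested loops.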

#get_prots_from_xml_stream(fh) ⇒ `Object`

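Returns [proteins, peptides]: the proteins in the order they were read, and the unique peptide hits collected while reading them.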

# File 'lib/ms/sequest/bioworks.rb', line 198

def get_prots_from_xml_stream(fh)
  uniq_pephit_hash = {}
  prots = []
  while line = fh.gets
    if line =~ @@protein_re
      prot =  Bioworks::Prot.new
      prot.bioworks = self
      prot.set_from_xml_stream(fh, uniq_pephit_hash)
      prots << prot
    end
  end
  [prots, uniq_pephit_hash.values] 
end

#get_regex_val(fh, regex) ⇒ `Object`

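Reads lines from the filehandle until one matches the regex, then returns the first captured value and leaves the filehandle just after that line. Scanning stops at the first protein line. If no match is found, returns nil and rewinds the filehandle to the beginning.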

# File 'lib/ms/sequest/bioworks.rb', line 214

def get_regex_val(fh, regex)
  ver = nil
  last_pos = fh.pos
  while line = fh.gets
    if line =~ regex
      ver = $1.dup     
      break
    elsif line =~ @@protein_re
      fh.seek last_pos
      break
    end
    last_pos = fh.pos
  end
  unless ver then fh.rewind end
  ver
end
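
The scan-and-rewind behavior can be tried on an in-memory stream. The following is a standalone sketch under stated assumptions: `scan_for_value` is a hypothetical free function that mirrors the listing above, the stop pattern is passed as a parameter instead of being read from a class variable, and the XML content is invented for the example.

```ruby
require 'stringio'

# Return the first capture of +regex+ found before a line matching +stop_re+.
# On a miss, rewind +io+ to the beginning and return nil.
def scan_for_value(io, regex, stop_re = /<protein>/)
  value = nil
  while (line = io.gets)
    if (match = regex.match(line))
      value = match[1]
      break
    elsif line =~ stop_re
      break
    end
  end
  io.rewind unless value # nothing found: go back to the start
  value
end

# Hypothetical header lines followed by the first protein.
xml = "<origfilename>run1.RAW</origfilename>\n" \
      "<bioworksinfo>3.2</bioworksinfo>\n" \
      "<protein>\n"
io = StringIO.new(xml)

name           = scan_for_value(io, /<origfilename>(.*)<\/origfilename>/)
next_line      = io.gets # reading continues after the matched line
missing        = scan_for_value(io, /<modifications>(.*)<\/modifications>/)
pos_after_miss = io.pos
```

Because a miss rewinds to the beginning, header fields can be looked up in any order, at the cost of rescanning the header lines.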

#hi_prob_best ⇒ `Object`

# File 'lib/ms/sequest/bioworks.rb', line 34

def hi_prob_best ; false end

#num_prots(file) ⇒ `Object`

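Returns the number of prots, counted as opening <protein> tags. If the numbers of opening and closing tags don’t agree, prints a warning (it does not raise) and still returns the number of opening tags.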

# File 'lib/ms/sequest/bioworks.rb', line 45

def num_prots(file)
  re = /(<protein>)|(<\/protein>)/mo
  begin_tags = 0
  end_tags = 0
  IO.read(file).scan(re) do |match| 
    if match.first
      begin_tags += 1
    else
      end_tags += 1
    end
  end
  if begin_tags != end_tags 
    puts "WARNING: #{file} doesn't have matching closing tags"
    puts "for the <protein> tag.  Returning # of beginning tags."
  end
  begin_tags
end
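
The tag counting relies on String#scan yielding one entry per capture group, with nil for the group that did not match. A minimal sketch on in-memory strings follows; the helper name and the sample inputs are assumptions for illustration.

```ruby
# Count opening and closing <protein> tags in a string.
# Returns [begin_tags, end_tags].
def count_protein_tags(xml)
  begin_tags = 0
  end_tags = 0
  xml.scan(/(<protein>)|(<\/protein>)/) do |open_tag, _close_tag|
    # Exactly one of the two groups is non-nil for each match.
    if open_tag
      begin_tags += 1
    else
      end_tags += 1
    end
  end
  [begin_tags, end_tags]
end

balanced   = count_protein_tags("<protein>a</protein><protein>b</protein>")
unbalanced = count_protein_tags("<protein>a</protein><protein>b")
```

A caller can compare the two counts to decide whether to warn, as the method above does.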

#parse_xml(file) ⇒ `Object`

This is highly specific to the Bioworks 3.2 xml export. The parser depends on the exact line layout of that export, so input whose newlines, etc. differ will fail to parse. Not robust, but it is faster than xmlparser (which is based on the speedy expat).

# File 'lib/ms/sequest/bioworks.rb', line 179

def parse_xml(file)
  fh = nil
  if file =~ /\.gz$/
    fh = Zlib::GzipReader.open(file)  
  else
    fh = File.open(file)
  end
  @origfilename = get_regex_val(fh, @@origfilename_re)
  @origfilepath = get_regex_val(fh, @@origfilepath_re)
  if @origfilename
    @global_filename = @origfilename.gsub(File.extname(@origfilename), "")
  end
  @version = get_regex_val(fh, @@bioworksinfo_re)
  @modifications = get_regex_val(fh, @@modifications_re)
  @prots, @peps = get_prots_from_xml_stream(fh)
  fh.close
end
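
The choice between a plain and a gzipped file can be isolated in a small helper that also closes the handle if parsing raises. This is a sketch, not part of the library: `with_maybe_gzip` is a hypothetical name, and the file contents are invented for the example.

```ruby
require 'zlib'
require 'tmpdir'

# Hypothetical helper: yields an IO for +path+, using a gzip reader when the
# name ends in .gz, and closes the handle even if the block raises.
def with_maybe_gzip(path)
  io = path =~ /\.gz$/ ? Zlib::GzipReader.open(path) : File.open(path)
  begin
    yield io
  ensure
    io.close
  end
end

first_lines = {}
Dir.mktmpdir do |dir|
  plain   = File.join(dir, "bioworks.xml")
  gzipped = File.join(dir, "bioworks.xml.gz")
  content = "<bioworksinfo>3.2</bioworksinfo>\n"
  File.write(plain, content)
  Zlib::GzipWriter.open(gzipped) { |gz| gz.write(content) }

  # Both files are read through the same line-oriented interface.
  [plain, gzipped].each do |path|
    first_lines[File.basename(path)] = with_maybe_gzip(path) { |io| io.gets }
  end
end
```

Zlib::GzipReader is part of Ruby's standard library but must be loaded with require 'zlib' before use.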

#parse_xml_by_xmlparser(file) ⇒ `Object`

# File 'lib/ms/sequest/bioworks.rb', line 164

def parse_xml_by_xmlparser(file)
  parser = Bioworks::XMLParser.new
  File.open(file) do |fh|
    #3.times do fh.gets end  ## TEMPFIX
    parser.parse(fh)
  end
  #puts "ETETWSST"
  #p parser.prots
  @prots = parser.prots
end

#to_excel(file) ⇒ `Object`

Outputs the bioworks browser excel format (tab delimited) to file. Useful if you have more than ~65,000 lines (you can export bioworks.xml and then convert it to excel format). Currently, the only things not precisely identical are:

1. The peptide hit counts (although the first number [total # peptides] is accurate)
2. The precise ordering of peptides within each protein. When dealing with output from multiple runs, peptides from different runs that have exactly the same scan numbers are not guaranteed to be in the same order.

# File 'lib/ms/sequest/bioworks.rb', line 71

def to_excel(file)
  update_peptide_hit_counts
  arr = []
  arr << ['', 'Reference', '', '', '', 'Score', 'Coverage', 'MW', 'Accession', 'Peptide (Hits)', '', ' ']
  arr << ['', '"File, Scan(s)"', 'Peptide', 'MH+', 'z', 'XC', 'DeltaCn', 'Sp', 'RSp', 'Ions', 'Count', ' ']
  @prots.each_with_index do |prot,index|
    line_arr = prot.get(:consensus_score, :coverage, :weight, :accession)
    if line_arr[1] == "0.0" then line_arr[1] = "" end
    line_arr.unshift('', '', '')
    line_arr.unshift('"' + prot.reference.split('|')[-1] + '"')
    line_arr.unshift(index+1)
    pep_hit_counts = prot.peptide_hit_counts
    pep_hit_counts_string = pep_hit_counts[0].to_s + ' (' + pep_hit_counts[1..-1].join(" ") + ')' 
    line_arr.push( pep_hit_counts_string )
    line_arr.push("")
    line_arr.push(" ")
    arr.push( line_arr )
    prot.peps.sort_by{|obj| [obj.first_scan.to_i, obj.last_scan.to_i] }.each do |pep|

      pep_arr = pep.get(:sequence, :mass, :charge, :xcorr, :deltacn, :sp, :rsp, :ions)
      count = pep.count
      if count == '0' then count = "" end
      pep_arr.push(count)
      pep_arr.push(' ')
      pep_arr.unshift('"' + pep.file + '"')
      pep_arr.unshift( '' )
      arr.push( pep_arr )
    end
  end
  File.open(file, "w") do |out|
    arr.each do |line|
      out.print(line.join("\t"), "\n")
    end
  end

end
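
Two pieces of the output format are easy to check in isolation: the "total (rank counts)" string built for each protein row, and the tab-delimited joining of rows. The helpers and sample values below are hypothetical and only mirror those two steps.

```ruby
# Format a hit-count array [total, rank1, rank2, ...] as "total (rank1 rank2 ...)".
def hit_counts_string(counts)
  "#{counts[0]} (#{counts[1..-1].join(' ')})"
end

# Join rows of cells into tab-delimited text, one row per line.
def rows_to_tsv(rows)
  rows.map { |row| row.join("\t") + "\n" }.join
end

# Hypothetical header row and protein row with a reduced set of columns.
rows = [
  ['', 'Reference', 'Peptide (Hits)'],
  [1, '"protA"', hit_counts_string([7, 3, 2, 1, 1, 0])]
]
tsv = rows_to_tsv(rows)
```

Array#join converts non-string cells such as the integer index with to_s, which is why the listing above can mix numbers and strings in one row.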

#to_pepxml ⇒ `Object`

Outputs sequest xml files (pepxml) for the trans-proteomics pipeline. As listed, the method only returns the value of xml_version.

# File 'lib/ms/sequest/bioworks.rb', line 232

def to_pepxml
  string = xml_version
  string 
end

#to_sqt(params_file) ⇒ `Object`

Intended to print to files filename1.sqt, filename2.sqt, and so on. TODO: sqt file output is not implemented; the method body contains only comments.

# File 'lib/ms/sequest/bioworks.rb', line 38

def to_sqt(params_file)
  ## hash peps by filename 
  ## hash prots by peptide
end

#update_peptide_hit_counts ⇒ `Object`

For output to Excel format or other uses: updates each protein with a peptide hit count array based on the ranking of xcorr per dta file, where each array holds the total number of peptide hits followed by the counts for ranks 1, 2, 3, 4, and 5. TODO: the results do not check out yet. Perhaps Bioworks uses a normalized Xcorr?

# File 'lib/ms/sequest/bioworks.rb', line 113

def update_peptide_hit_counts
  @prots.each do |prot|
    prot.peptide_hit_counts[0] = prot.peps.size
  end
  hash = peps.hash_by(:file)
  hash.sort.each do |k,v|
    sorted = v.sort_by {|obj| obj.xcorr.to_f }
    peps, prot_groups = _uniq_peps_by_sequence_charge(sorted) ## but not on prot!!!!!uniq_peps_by_sequence_charge!

    prot_groups.each_with_index do |prot_group, i|
      prot_group.each do |prot|
        prot.peptide_hit_counts[i+1] += 1 if prot.peptide_hit_counts[i+1]
      end
    end
  end
end
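
For comparison, here is a small sketch of one possible reading of the description: group peptides by file, rank them within each file by xcorr, and count how often each protein appears at each rank. It assumes that rank 1 means the highest xcorr and that each peptide has a single protein; neither is confirmed by the source, and this sketch is not known to reproduce the Bioworks counts either. The struct, method name, and sample values are hypothetical.

```ruby
# Returns a hash: protein => [total, rank1, rank2, ..., rank<max_rank>].
def peptide_hit_counts(peps, max_rank = 5)
  counts = Hash.new { |hash, prot| hash[prot] = [0] + [0] * max_rank }
  # Slot 0 holds the total number of peptide hits per protein.
  peps.each { |pep| counts[pep.prot][0] += 1 }
  peps.group_by(&:file).each_value do |file_peps|
    # Assumption: rank 1 is the highest xcorr within a file.
    ranked = file_peps.sort_by { |pep| -pep.xcorr.to_f }
    ranked.first(max_rank).each_with_index do |pep, i|
      counts[pep.prot][i + 1] += 1
    end
  end
  counts
end

# Hypothetical peptides: file name, xcorr (as a string), protein.
pep_class = Struct.new(:file, :xcorr, :prot)
peps = [
  pep_class.new("run1.100.100.2", "3.5", "protA"),
  pep_class.new("run1.100.100.2", "2.1", "protB"),
  pep_class.new("run1.200.200.2", "4.0", "protA")
]
counts = peptide_hit_counts(peps)
```

Array#group_by from the standard library plays the role that hash_by(:file) plays in the listing above.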
