Module: Entrez

Defined in: lib/rbbt/sources/entrez.rb

Overview

This module parses and extracts information from the gene_info and gene2pubmed files at Entrez Gene. Both files need to be downloaded and accessible to Rbbt, which happens as part of a normal installation.

Defined Under Namespace

Classes: Gene, NoFileError

Class Method Summary

.entrez2native(taxs, native = nil, fix = nil, check = nil) ⇒ Object

Given a taxonomy, or set of taxonomies, it returns an inverse hash, where each key is the entrez id of a gene, and the value is an array of possible synonyms in other databases.
.entrez2pubmed(taxs) ⇒ Object

For a given taxonomy, or set of taxonomies, it returns a hash with genes as keys and arrays of related PubMed ids as values, as extracted from the gene2pubmed file from Entrez Gene.
.gene_filename(id) ⇒ Object

Build a file name for a gene based on the id.
.gene_text_similarity(gene, text) ⇒ Object

Counts the words in common between a chunk of text and the text found in Entrez Gene for that particular gene.
.get_gene(geneid) ⇒ Object

Returns a Gene object for the given Entrez Gene id.

Class Method Details

.entrez2native(taxs, native = nil, fix = nil, check = nil) ⇒ `Object`

Given a taxonomy, or set of taxonomies, it returns an inverse hash, where each key is the entrez id of a gene, and the value is an array of possible synonyms in other databases. It is mostly used to translate Entrez ids to the native database id of the organism. The parameter native is the zero-based index of the column containing the synonyms (5 by default). fix and check are optional Procs: fix, if present, pre-processes each id, and check decides whether an id should be kept.

Raises:

(NoFileError)

# File 'lib/rbbt/sources/entrez.rb', line 24

def self.entrez2native(taxs, native = nil, fix = nil, check = nil)

  raise NoFileError, "Install the Entrez gene_info file" unless File.exists? File.join(Rbbt.datadir, 'dbs/entrez/gene_info')

  native ||= 5

  taxs = [taxs] unless taxs.is_a?(Array)
  taxs = taxs.collect{|t| t.to_s}

  lexicon = {}
  tmp = TmpFile.tmp_file("entrez-")
  system "cat '#{File.join(Rbbt.datadir, 'dbs/entrez/gene_info')}' |grep '^\\(#{taxs.join('\\|')}\\)[[:space:]]' > #{tmp}"
  File.open(tmp).each{|l| 
    parts = l.chomp.split(/\t/)
    next if parts[native] == '-'
    entrez = parts[1]
    parts[native].split(/\|/).each{|id|
      id = fix.call(id) if fix
      next if check && !check.call(id)

      lexicon[entrez] ||= []
      lexicon[entrez] << id
    }
  }
  FileUtils.rm tmp

  lexicon
end
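
The role of native, fix and check is easier to see on a small in-memory example. The following is a minimal sketch, not the library code: lexicon_from_lines is a hypothetical helper that mirrors the loop above without touching the gene_info file, and the sample rows are invented for illustration.

```ruby
# Invented rows in a gene_info-like, tab-separated layout (not real records).
SAMPLE_GENE_INFO = [
  "9999\t101\tgeneA\t-\tsynA\tDB1:X001|DB2:Y001",
  "9999\t102\tgeneB\t-\tsynB\t-",
  "8888\t201\tgeneC\t-\tsynC\tDB1:X002",
].join("\n")

# Hypothetical helper: same logic as the loop in entrez2native, but it reads
# from a string and filters taxonomies in Ruby instead of calling grep.
def lexicon_from_lines(text, taxs, native = 5, fix = nil, check = nil)
  taxs = Array(taxs).map(&:to_s)
  lexicon = {}
  text.each_line do |line|
    parts = line.chomp.split("\t")
    next unless taxs.include?(parts[0])                 # requested taxonomies only
    next if parts[native].nil? || parts[native] == '-'  # no synonyms in this column
    parts[native].split('|').each do |id|
      id = fix.call(id) if fix                          # pre-process each id
      next if check && !check.call(id)                  # drop ids that fail the check
      (lexicon[parts[1]] ||= []) << id
    end
  end
  lexicon
end

fix    = proc { |id| id.sub(/\ADB1:/, '') }  # strip an (invented) database prefix
check  = proc { |id| id =~ /\AX\d+\z/ }      # keep only ids shaped like X001
result = lexicon_from_lines(SAMPLE_GENE_INFO, 9999, 5, fix, check)
# result is {"101" => ["X001"]}
```

Gene 102 is skipped because its synonym column is '-', and DB2:Y001 is dropped because it fails check after fix has been applied.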

.entrez2pubmed(taxs) ⇒ `Object`

For a given taxonomy, or set of taxonomies, it returns a hash with genes as keys and arrays of related PubMed ids as values, as extracted from the gene2pubmed file from Entrez Gene.

Raises:

(NoFileError)

# File 'lib/rbbt/sources/entrez.rb', line 56

def self.entrez2pubmed(taxs)
  raise NoFileError, "Install the Entrez gene2pubmed file" unless File.exists? File.join(Rbbt.datadir, 'dbs/entrez/gene2pubmed')

  taxs = [taxs] unless taxs.is_a?(Array)
  taxs = taxs.collect{|t| t.to_s}

  data = {}
  tmp = TmpFile.tmp_file("entrez-")
  system "cat '#{File.join(Rbbt.datadir, 'dbs/entrez/gene2pubmed')}' |grep '^\\(#{taxs.join('\\|')}\\)[[:space:]]' > #{tmp}"
 
  data = Open.to_hash(tmp, :native => 1, :extra => 2).each{|code, value_lists| value_lists.flatten!}

  FileUtils.rm tmp

  data
end
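
The shape of the returned hash can be illustrated without Rbbt. This is a minimal sketch under stated assumptions: pubmed_by_gene is a hypothetical stand-in for the grep plus Open.to_hash steps, the rows are invented, and the column positions follow the :native => 1, :extra => 2 options above (column 0 taxonomy, column 1 gene, column 2 PubMed id).

```ruby
# Invented rows: taxonomy, gene id, PubMed id (tab-separated).
SAMPLE_GENE2PUBMED = [
  "9999\t101\t5001",
  "9999\t101\t5002",
  "9999\t102\t5003",
  "8888\t201\t5004",
].join("\n")

# Hypothetical helper: group the PubMed ids of the requested taxonomies by gene.
def pubmed_by_gene(text, taxs)
  taxs = Array(taxs).map(&:to_s)
  data = Hash.new { |hash, gene| hash[gene] = [] }
  text.each_line do |line|
    tax, gene, pmid = line.chomp.split("\t")
    next unless taxs.include?(tax)
    data[gene] << pmid
  end
  data
end

pubmed = pubmed_by_gene(SAMPLE_GENE2PUBMED, 9999)
# pubmed is {"101" => ["5001", "5002"], "102" => ["5003"]}
```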

.gene_filename(id) ⇒ `Object`

Build a file name for a gene based on the id. Prefix the id with ‘gene-’, substitute slashes with ‘SLASH’, and add a ‘.xml’ extension.



# File 'lib/rbbt/sources/entrez.rb', line 138

def self.gene_filename(id)
  FileCache.clean_path('gene-' + id.to_s + '.xml')
end
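
As a rough illustration of the naming rule described above, the sketch below reproduces only the documented behaviour. sketch_gene_filename is a hypothetical name, and FileCache.clean_path may perform additional clean-up that is not shown here.

```ruby
# Hypothetical re-implementation of the documented rule only:
# 'gene-' prefix, slashes replaced by 'SLASH', '.xml' extension.
def sketch_gene_filename(id)
  ('gene-' + id.to_s + '.xml').gsub('/', 'SLASH')
end

plain   = sketch_gene_filename(1234)   # "gene-1234.xml"
slashed = sketch_gene_filename('a/b')  # "gene-aSLASHb.xml"
```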

.gene_text_similarity(gene, text) ⇒ `Object`

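Counts the words in common between a chunk of text and the text found in Entrez Gene for that particular gene. The count is normalized by the sum of the number of distinct words in each text. The gene may be a gene identifier or a Gene class instance; any other value yields 0.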

# File 'lib/rbbt/sources/entrez.rb', line 191

def self.gene_text_similarity(gene, text)

  case
  when Entrez::Gene === gene
    gene_text = gene.text
  when String === gene || Fixnum === gene
    gene_text =  get_gene(gene).text
  else
    return 0
  end


  gene_words = gene_text.words.to_set
  text_words = text.words.to_set

  return 0 if gene_words.empty? || text_words.empty?

  common = gene_words.intersection(text_words)
  common.length / (gene_words.length + text_words.length).to_f
end
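
The score itself can be tried out on two plain strings. This is a minimal sketch: String#words is not part of core Ruby, so sketch_words is a hypothetical tokenizer (lowercase alphanumeric runs) that may split text differently from the one Rbbt uses.

```ruby
require 'set'

# Hypothetical tokenizer standing in for String#words.
def sketch_words(text)
  text.downcase.scan(/[a-z0-9]+/)
end

# Same formula as gene_text_similarity: shared distinct words divided by the
# sum of the distinct word counts of both texts.
def sketch_similarity(gene_text, text)
  gene_words = sketch_words(gene_text).to_set
  text_words = sketch_words(text).to_set
  return 0 if gene_words.empty? || text_words.empty?

  common = gene_words & text_words
  common.length / (gene_words.length + text_words.length).to_f
end

score = sketch_similarity('tumor protein p53', 'the p53 tumor suppressor')
# 2 shared words, 3 + 4 distinct words in total: score is 2 / 7.0
```

Because the denominator adds both word counts, two texts with identical vocabularies score 0.5, which is the maximum.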

.get_gene(geneid) ⇒ `Object`

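Returns a Gene object for the given Entrez Gene id. If an array of ids is given instead, a hash mapping each id to its Gene object is returned. This method uses the caching facilities from Rbbt: genes missing from the cache are fetched online and then cached.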

# File 'lib/rbbt/sources/entrez.rb', line 145

def self.get_gene(geneid)

  return nil if geneid.nil?

  if Array === geneid
    missing = []
    list = {}

    geneid.each{|p|
      next if p.nil?
      filename = gene_filename p    
      if File.exists? FileCache.path(filename)
        list[p] = Gene.new(Open.read(FileCache.path(filename)))
      else
        missing << p
      end
    }

    return list unless missing.any?
    genes = get_online(missing)

    genes.each{|p, xml|
      filename = gene_filename p    
      FileCache.add_file(filename,xml) unless File.exist? FileCache.path(filename)
      list[p] =  Gene.new(xml)
    }

    return list

  else
    filename = gene_filename geneid    

    if File.exists? FileCache.path(filename)
      return Gene.new(Open.read(FileCache.path(filename)))
    else
      xml = get_online(geneid)
      FileCache.add_file(filename,xml)

      return Gene.new(xml)
    end
  end
end
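
The array branch follows a cache-then-fetch pattern: serve what is cached, request all missing ids in one batch, then store the results. The sketch below shows that pattern in isolation, assuming a plain Hash in place of FileCache and a block in place of get_online; fetch_with_cache is a hypothetical name.

```ruby
# Hypothetical helper: cache is a Hash, and the block receives the list of
# missing ids and must return a hash of id => value for them.
def fetch_with_cache(ids, cache)
  found = {}
  missing = []
  ids.compact.each do |id|            # nil ids are skipped, as in get_gene
    if cache.key?(id)
      found[id] = cache[id]
    else
      missing << id
    end
  end
  return found if missing.empty?

  yield(missing).each do |id, value|  # one batched request for all missing ids
    cache[id] ||= value
    found[id] = value
  end
  found
end

cache    = { 1 => '<gene 1>' }
requests = []
genes = fetch_with_cache([1, nil, 2, 3], cache) do |missing|
  requests << missing
  missing.to_h { |id| [id, "<gene #{id}>"] }
end
# requests is [[2, 3]]: id 1 came from the cache, ids 2 and 3 were fetched together
```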
