Module: Entrez

Defined in:
lib/rbbt/sources/entrez.rb

Overview

This module parses and extracts information from the gene_info file at Entrez Gene, as well as from the gene2pubmed file. Both files need to be downloaded and accessible to Rbbt, which is done as part of a normal installation.

Defined Under Namespace

Classes: Gene, NoFileError

Class Method Summary

Class Method Details

.entrez2native(taxs, native = nil, fix = nil, check = nil) ⇒ Object

Given a taxonomy, or a set of taxonomies, returns an inverse hash where each key is the Entrez id of a gene and the value is an array of its synonyms in other databases. It is mostly used to translate Entrez ids into the native database ids of the organism. The native parameter is the zero-based index of the column that holds the synonyms (5 by default). fix and check, if present, are Procs: fix pre-processes each synonym, and check decides whether the resulting synonym is kept.

Raises:

NoFileError, when the gene_info file is not found under Rbbt.datadir

# File 'lib/rbbt/sources/entrez.rb', line 24

def self.entrez2native(taxs, native = nil, fix = nil, check = nil)

  raise NoFileError, "Install the Entrez gene_info file" unless File.exist? File.join(Rbbt.datadir, 'dbs/entrez/gene_info')

  native ||= 5

  taxs = [taxs] unless taxs.is_a?(Array)
  taxs = taxs.collect{|t| t.to_s}

  lexicon = {}
  tmp = TmpFile.tmp_file("entrez-")
  system "cat '#{File.join(Rbbt.datadir, 'dbs/entrez/gene_info')}' |grep '^\\(#{taxs.join('\\|')}\\)[[:space:]]' > #{tmp}"
  File.open(tmp).each{|l| 
    parts = l.chomp.split(/\t/)
    next if parts[native] == '-'
    entrez = parts[1]
    parts[native].split(/\|/).each{|id|
      id = fix.call(id) if fix
      next if check && !check.call(id)

      lexicon[entrez] ||= []
      lexicon[entrez] << id
    }
  }
  FileUtils.rm tmp

  lexicon
end
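
The filtering and inversion performed above can be sketched without Rbbt or the shell pipeline. This is only an illustration under stated assumptions: entrez2native_sketch is a hypothetical name, the rows are made-up tab-separated lines rather than real gene_info content, and the taxonomy filter is done in Ruby instead of with grep.

```ruby
# Hypothetical stand-in for Entrez.entrez2native that reads rows from a string.
# Column 0 is the taxonomy, column 1 the Entrez id, column `native` the synonyms.
def entrez2native_sketch(text, taxs, native = 5, fix = nil, check = nil)
  taxs = Array(taxs).map(&:to_s)
  lexicon = {}
  text.each_line do |line|
    parts = line.chomp.split("\t")
    next unless taxs.include?(parts[0])                    # taxonomy filter
    next if parts[native].nil? || parts[native] == '-'     # no synonyms listed
    parts[native].split('|').each do |id|
      id = fix.call(id) if fix                             # pre-process each synonym
      next if check && !check.call(id)                     # drop rejected synonyms
      (lexicon[parts[1]] ||= []) << id
    end
  end
  lexicon
end

# Made-up rows, for illustration only.
sample = [
  "4932\t850000\tABC1\t-\tAlias1\tSGD:S000001|Other:X1",
  "4932\t850001\tABC2\t-\t-\t-",
  "9606\t1000\tXYZ1\t-\t-\tHGNC:HGNC:1"
].join("\n")

fix   = proc { |id| id.sub(/\ASGD:/, '') }   # strip a database prefix
check = proc { |id| id.start_with?('S0') }   # keep only ids of one shape

p entrez2native_sketch(sample, 4932, 5, fix, check)
# prints {"850000"=>["S000001"]}
```

Note that fix runs before check, so check sees the already cleaned synonym.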

.entrez2pubmed(taxs) ⇒ Object

For a given taxonomy, or set of taxonomies, it returns a hash with genes as keys and arrays of related PubMed ids as values, as extracted from the gene2pubmed file from Entrez Gene.

Raises:

NoFileError, when the gene2pubmed file is not found under Rbbt.datadir

# File 'lib/rbbt/sources/entrez.rb', line 56

def self.entrez2pubmed(taxs)
  raise NoFileError, "Install the Entrez gene2pubmed file" unless File.exist? File.join(Rbbt.datadir, 'dbs/entrez/gene2pubmed')

  taxs = [taxs] unless taxs.is_a?(Array)
  taxs = taxs.collect{|t| t.to_s}

  data = {}
  tmp = TmpFile.tmp_file("entrez-")
  system "cat '#{File.join(Rbbt.datadir, 'dbs/entrez/gene2pubmed')}' |grep '^\\(#{taxs.join('\\|')}\\)[[:space:]]' > #{tmp}"
 
  data = Open.to_hash(tmp, :native => 1, :extra => 2).each{|code, value_lists| value_lists.flatten!}

  FileUtils.rm tmp

  data
end
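
As a rough illustration of the shape of the returned hash, the grouping can be sketched in plain Ruby. The name entrez2pubmed_sketch and the sample rows are hypothetical; the sketch assumes three tab-separated columns (taxonomy, gene id, PubMed id) and does not use Open.to_hash.

```ruby
# Hypothetical stand-in that groups PubMed ids by gene id for the given taxonomies.
def entrez2pubmed_sketch(text, taxs)
  taxs = Array(taxs).map(&:to_s)
  data = {}
  text.each_line do |line|
    tax, gene, pmid = line.chomp.split("\t")
    next unless taxs.include?(tax)     # keep only the requested taxonomies
    (data[gene] ||= []) << pmid        # one flat array of PubMed ids per gene
  end
  data
end

# Made-up rows, for illustration only.
sample = "4932\t850000\t111\n4932\t850000\t222\n9606\t1000\t333"

p entrez2pubmed_sketch(sample, 4932)
# prints {"850000"=>["111", "222"]}
```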

.gene_filename(id) ⇒ Object

Builds a file name for a gene from its id: prefixes the id with ‘gene-’, replaces slashes with ‘SLASH’, and adds a ‘.xml’ extension.



# File 'lib/rbbt/sources/entrez.rb', line 138

def self.gene_filename(id)
  FileCache.clean_path('gene-' + id.to_s + '.xml')
end
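
A minimal sketch of the naming rule described above, written without FileCache. It assumes that the only cleaning needed is the slash substitution mentioned in the description; FileCache.clean_path may do more than that, and gene_filename_sketch is a hypothetical name.

```ruby
# Hypothetical stand-in for Entrez.gene_filename: prefix, extension, and
# slash substitution as stated in the description.
def gene_filename_sketch(id)
  ('gene-' + id.to_s + '.xml').gsub('/', 'SLASH')
end

puts gene_filename_sketch(1234)    # prints gene-1234.xml
puts gene_filename_sketch('a/b')   # prints gene-aSLASHb.xml
```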

.gene_text_similarity(gene, text) ⇒ Object

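Scores the overlap between a chunk of text and the text found in Entrez Gene for a particular gene: the number of distinct words they share, divided by the sum of the sizes of the two word sets. The gene may be a gene identifier or a Gene instance; any other value yields 0.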



# File 'lib/rbbt/sources/entrez.rb', line 191

def self.gene_text_similarity(gene, text)

  case
  when Entrez::Gene === gene
    gene_text = gene.text
  when String === gene || Integer === gene
    gene_text =  get_gene(gene).text
  else
    return 0
  end


  gene_words = gene_text.words.to_set
  text_words = text.words.to_set

  return 0 if gene_words.empty? || text_words.empty?

  common = gene_words.intersection(text_words)
  common.length / (gene_words.length + text_words.length).to_f
end
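
The score computed above can be reproduced on two plain strings. In this sketch words_sketch is a hypothetical tokenizer (lowercase alphanumeric runs) standing in for the words method used by the source, whose exact behaviour is not shown here; the input strings are made up.

```ruby
require 'set'

# Hypothetical tokenizer: lowercase runs of letters and digits.
def words_sketch(text)
  text.downcase.scan(/[a-z0-9]+/)
end

# Shared distinct words divided by the sum of the two set sizes.
def text_similarity_sketch(gene_text, text)
  gene_words = words_sketch(gene_text).to_set
  text_words = words_sketch(text).to_set
  return 0 if gene_words.empty? || text_words.empty?

  common = gene_words & text_words
  common.length / (gene_words.length + text_words.length).to_f
end

# 2 shared words ("tumor", "p53") over 3 + 5 distinct words.
puts text_similarity_sketch("tumor protein p53", "p53 is a tumor suppressor")
# prints 0.25
```

Because the denominator adds both set sizes, two texts with identical word sets score 0.5 rather than 1.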

.get_gene(geneid) ⇒ Object

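Returns a Gene object for the given Entrez Gene id, or nil if the id is nil. If an array of ids is given instead, returns a hash that maps each id to its Gene object. The method uses the caching facilities of Rbbt: cached XML is read from FileCache, and anything missing is fetched with get_online and added to the cache.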



# File 'lib/rbbt/sources/entrez.rb', line 145

def self.get_gene(geneid)

  return nil if geneid.nil?

  if Array === geneid
    missing = []
    list = {}

    geneid.each{|p|
      next if p.nil?
      filename = gene_filename p    
      if File.exist? FileCache.path(filename)
        list[p] = Gene.new(Open.read(FileCache.path(filename)))
      else
        missing << p
      end
    }

    return list unless missing.any?
    genes = get_online(missing)

    genes.each{|p, xml|
      filename = gene_filename p    
      FileCache.add_file(filename,xml) unless File.exist? FileCache.path(filename)
      list[p] =  Gene.new(xml)
    }

    return list

  else
    filename = gene_filename geneid    

    if File.exist? FileCache.path(filename)
      return Gene.new(Open.read(FileCache.path(filename)))
    else
      xml = get_online(geneid)
      FileCache.add_file(filename,xml)

      return Gene.new(xml)
    end
  end
end
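
The single-id branch above follows a read-from-cache-or-fetch pattern, which can be sketched with the standard library alone. Everything here is an assumption made for illustration: cached_fetch_sketch is a hypothetical name, a temporary directory stands in for FileCache, and a Proc that returns a made-up XML string stands in for get_online.

```ruby
require 'tmpdir'

# Returns the cached XML for id if present; otherwise calls fetcher,
# stores the result in cache_dir, and returns it.
def cached_fetch_sketch(cache_dir, id, fetcher)
  path = File.join(cache_dir, "gene-#{id}.xml")
  return File.read(path) if File.exist?(path)

  xml = fetcher.call(id)
  File.write(path, xml)
  xml
end

calls = 0
fetcher = proc { |id| calls += 1; "<gene id='#{id}'/>" }   # stand-in for a remote call

Dir.mktmpdir do |dir|
  first  = cached_fetch_sketch(dir, 1234, fetcher)   # fetched and stored
  second = cached_fetch_sketch(dir, 1234, fetcher)   # read back from the cache
  puts first == second   # prints true
  puts calls             # prints 1
end
```

The array branch of get_gene applies the same idea in bulk: it collects the ids that are not cached and requests them in one call.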