Module: Entrez
Defined in: lib/rbbt/sources/entrez.rb
Overview
This module parses and extracts information from the gene_info file at Entrez Gene, as well as from the gene2pubmed file. Both files need to be downloaded and accessible to Rbbt, which is done as part of a normal installation.
Defined Under Namespace
Classes: Gene, NoFileError
Class Method Summary
- .entrez2native(taxs, native = nil, fix = nil, check = nil) ⇒ Object
  Given a taxonomy, or set of taxonomies, it returns an inverse hash, where each key is the entrez id of a gene, and the value is an array of possible synonyms in other databases.
- .entrez2pubmed(taxs) ⇒ Object
  For a given taxonomy, or set of taxonomies, it returns a hash with genes as keys and arrays of related PubMed ids as values, as extracted from the gene2pubmed file from Entrez Gene.
- .gene_filename(id) ⇒ Object
  Build a file name for a gene based on the id.
- .gene_text_similarity(gene, text) ⇒ Object
  Counts the words in common between a chunk of text and the text found in Entrez Gene for that particular gene.
- .get_gene(geneid) ⇒ Object
  Returns a Gene object for the given Entrez Gene id.
Class Method Details
.entrez2native(taxs, native = nil, fix = nil, check = nil) ⇒ Object
Given a taxonomy, or set of taxonomies, it returns an inverse hash, where each key is the Entrez id of a gene and the value is an array of possible synonyms in other databases. It is mostly used to translate Entrez ids to the native database id of the organism. The parameter native gives the zero-based index of the column that holds the synonyms (5 by default). fix and check are optional Procs: fix pre-processes each id, and check decides whether that id should be kept.
    # File 'lib/rbbt/sources/entrez.rb', line 24
    def self.entrez2native(taxs, native = nil, fix = nil, check = nil)

      raise NoFileError, "Install the Entrez gene_info file" unless File.exists? File.join(Rbbt.datadir, 'dbs/entrez/gene_info')

      native ||= 5

      taxs = [taxs] unless taxs.is_a?(Array)
      taxs = taxs.collect{|t| t.to_s}

      lexicon = {}
      tmp = TmpFile.tmp_file("entrez-")
      system "cat '#{File.join(Rbbt.datadir, 'dbs/entrez/gene_info')}' |grep '^\\(#{taxs.join('\\|')}\\)[[:space:]]' > #{tmp}"
      File.open(tmp).each{|l|
        parts = l.chomp.split(/\t/)
        next if parts[native] == '-'
        entrez = parts[1]
        parts[native].split(/\|/).each{|id|
          id = fix.call(id) if fix
          next if check && !check.call(id)

          lexicon[entrez] ||= []
          lexicon[entrez] << id
        }
      }
      FileUtils.rm tmp

      lexicon
    end
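The lexicon-building loop can be tried without an Rbbt installation. The following is a minimal sketch, not the library's code: build_lexicon and SAMPLE_GENE_INFO are hypothetical names, the rows are illustrative data laid out like gene_info (tax id, Entrez id, symbol, locus tag, synonyms, cross-references), and an in-memory filter stands in for the grep call.

```ruby
# Illustrative rows in a tab-separated, gene_info-like layout (assumed data).
SAMPLE_GENE_INFO = [
  "9606\t1\tA1BG\t-\tA1B|ABG\tMIM:138670|HGNC:HGNC:5",
  "9606\t2\tA2M\t-\tA2MD\t-",
  "10090\t11287\tPzp\t-\tA1m\tMGI:MGI:87854"
].freeze

# Hypothetical helper mirroring the parameters of entrez2native.
def build_lexicon(lines, taxs, native = 5, fix = nil, check = nil)
  taxs = Array(taxs).map(&:to_s)
  lexicon = {}
  lines.each do |line|
    parts = line.chomp.split("\t")
    next unless taxs.include?(parts[0])        # stands in for the grep on tax id
    next if parts[native].nil? || parts[native] == '-'
    entrez = parts[1]
    parts[native].split('|').each do |id|
      id = fix.call(id) if fix                 # fix runs on each id
      next if check && !check.call(id)         # check may reject the fixed id
      (lexicon[entrez] ||= []) << id
    end
  end
  lexicon
end

# Keep only MIM cross-references and strip their prefix.
fix   = proc { |id| id.sub(/^MIM:/, '') }
check = proc { |id| id =~ /^\d+$/ }
mim = build_lexicon(SAMPLE_GENE_INFO, 9606, 5, fix, check)

# Column 4 holds the synonyms in this sample layout.
synonyms = build_lexicon(SAMPLE_GENE_INFO, [9606, 10090], 4)
```

With this sample, mim maps "1" to ["138670"], while synonyms contains one entry per gene for both taxonomies.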
.entrez2pubmed(taxs) ⇒ Object
For a given taxonomy, or set of taxonomies, it returns a hash with genes as keys and arrays of related PubMed ids as values, as extracted from the gene2pubmed file from Entrez Gene.
    # File 'lib/rbbt/sources/entrez.rb', line 56
    def self.entrez2pubmed(taxs)
      raise NoFileError, "Install the Entrez gene2pubmed file" unless File.exists? File.join(Rbbt.datadir, 'dbs/entrez/gene2pubmed')

      taxs = [taxs] unless taxs.is_a?(Array)
      taxs = taxs.collect{|t| t.to_s}

      data = {}
      tmp = TmpFile.tmp_file("entrez-")
      system "cat '#{File.join(Rbbt.datadir, 'dbs/entrez/gene2pubmed')}' |grep '^\\(#{taxs.join('\\|')}\\)[[:space:]]' > #{tmp}"

      data = Open.to_hash(tmp, :native => 1, :extra => 2).each{|code, value_lists| value_lists.flatten!}

      FileUtils.rm tmp

      data
    end
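The shape of the returned hash is easier to see on a small input. This is a sketch under assumptions: group_pubmed and SAMPLE_GENE2PUBMED are hypothetical names, the rows are made-up data in a three-column layout (tax id, Entrez gene id, PubMed id), and plain Ruby grouping replaces the Open.to_hash call.

```ruby
# Made-up rows: tax id, Entrez gene id, PubMed id (tab separated).
SAMPLE_GENE2PUBMED = [
  "9606\t1\t2591067",
  "9606\t1\t8889549",
  "9606\t2\t1281457",
  "10090\t11287\t7606780"
].freeze

# Hypothetical helper, not part of Rbbt.
def group_pubmed(lines, taxs)
  taxs = Array(taxs).map(&:to_s)
  data = Hash.new { |hash, gene| hash[gene] = [] }
  lines.each do |line|
    tax, gene, pmid = line.chomp.split("\t")
    next unless taxs.include?(tax)   # keep only the requested taxonomies
    data[gene] << pmid               # gene id => flat array of PubMed ids
  end
  data
end

human = group_pubmed(SAMPLE_GENE2PUBMED, 9606)
```

Here human has two keys, "1" and "2", each pointing to a flat array of PubMed ids as strings.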
.gene_filename(id) ⇒ Object
Build a file name for a gene based on the id. Prefix the id by ‘gene-’, substitute the slashes with ‘SLASH’, and add a ‘.xml’ extension.
    # File 'lib/rbbt/sources/entrez.rb', line 138
    def self.gene_filename(id)
      FileCache.clean_path('gene-' + id.to_s + '.xml')
    end
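As a rough illustration of the naming rule described above, the sketch below uses a hypothetical sketch_gene_filename method. It assumes that replacing slashes is all FileCache.clean_path needs to do here; the real method may apply further clean-up.

```ruby
# Hypothetical stand-in for gene_filename; only the slash substitution
# described in the text is modeled.
def sketch_gene_filename(id)
  ('gene-' + id.to_s + '.xml').gsub('/', 'SLASH')
end

plain   = sketch_gene_filename(7157)
slashed = sketch_gene_filename('a/b')
```

In this sketch plain is "gene-7157.xml" and slashed is "gene-aSLASHb.xml".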
.gene_text_similarity(gene, text) ⇒ Object
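Counts the words in common between a chunk of text and the text found in Entrez Gene for that particular gene. The gene may be a gene identifier or a Gene class instance. The result is the number of shared words divided by the sum of the sizes of the two word sets; it is 0 when either word set is empty or when the gene argument is of any other type.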
    # File 'lib/rbbt/sources/entrez.rb', line 191
    def self.gene_text_similarity(gene, text)

      case
      when Entrez::Gene === gene
        gene_text = gene.text
      when String === gene || Fixnum === gene
        gene_text = get_gene(gene).text
      else
        return 0
      end

      gene_words = gene_text.words.to_set
      text_words = text.words.to_set

      return 0 if gene_words.empty? || text_words.empty?

      common = gene_words.intersection(text_words)
      common.length / (gene_words.length + text_words.length).to_f
    end
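The score computed at the end of the method can be reproduced on plain strings. The sketch below is an approximation: sketch_words is a hypothetical tokenizer (lowercased runs of letters and digits), and Rbbt's own String#words may tokenize differently, so real scores can differ.

```ruby
require 'set'

# Hypothetical tokenizer; an assumption, not Rbbt's String#words.
def sketch_words(text)
  text.downcase.scan(/[a-z0-9]+/)
end

# Shared words divided by the sum of both word-set sizes.
def sketch_similarity(gene_text, text)
  gene_words = sketch_words(gene_text).to_set
  text_words = sketch_words(text).to_set
  return 0 if gene_words.empty? || text_words.empty?

  common = gene_words & text_words
  common.length / (gene_words.length + text_words.length).to_f
end

score = sketch_similarity('tumor protein p53', 'p53 is a tumor suppressor')
```

The two texts share 2 words out of set sizes 3 and 5, so score is 0.25. Because the denominator adds both set sizes, two texts with identical word sets score 0.5 rather than 1.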
.get_gene(geneid) ⇒ Object
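Returns a Gene object for the given Entrez Gene id, or nil if the id is nil. If an array of ids is given instead, a hash mapping each id to its Gene object is returned; nil entries in the array are skipped. This method uses the caching facilities from Rbbt: ids already in the file cache are read from disk, and only the missing ones are fetched online and then cached.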
    # File 'lib/rbbt/sources/entrez.rb', line 145
    def self.get_gene(geneid)

      return nil if geneid.nil?

      if Array === geneid
        missing = []
        list = {}

        geneid.each{|p|
          next if p.nil?
          filename = gene_filename p
          if File.exists? FileCache.path(filename)
            list[p] = Gene.new(Open.read(FileCache.path(filename)))
          else
            missing << p
          end
        }

        return list unless missing.any?
        genes = get_online(missing)

        genes.each{|p, xml|
          filename = gene_filename p
          FileCache.add_file(filename,xml) unless File.exist? FileCache.path(filename)
          list[p] = Gene.new(xml)
        }

        return list

      else
        filename = gene_filename geneid

        if File.exists? FileCache.path(filename)
          return Gene.new(Open.read(FileCache.path(filename)))
        else
          xml = get_online(geneid)
          FileCache.add_file(filename,xml)

          return Gene.new(xml)
        end
      end
    end
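The array branch follows a cache-or-fetch pattern: read what is on disk, fetch the rest in one batch, then store it. The sketch below shows that pattern with the standard library only. cached_fetch and fake_online are hypothetical names, a temporary directory stands in for FileCache, the block stands in for get_online, and raw XML strings are returned instead of Gene objects.

```ruby
require 'tmpdir'

# Hypothetical cache-or-fetch helper. The block receives the ids missing
# from the cache and must return a hash of id => xml.
def cached_fetch(ids, cache_dir)
  found = {}
  missing = []
  ids.compact.each do |id|                       # nil ids are skipped
    path = File.join(cache_dir, "gene-#{id}.xml")
    if File.exist?(path)
      found[id] = File.read(path)                # cache hit
    else
      missing << id
    end
  end
  return found if missing.empty?

  fetched = yield(missing)                       # one batch for all misses
  fetched.each do |id, xml|
    File.write(File.join(cache_dir, "gene-#{id}.xml"), xml)
    found[id] = xml
  end
  found
end

# Fake fetcher that records which ids were requested.
requests = []
fake_online = proc do |ids|
  requests << ids
  ids.map { |id| [id, "<gene id='#{id}'/>"] }.to_h
end

first = second = nil
Dir.mktmpdir do |dir|
  first  = cached_fetch([1, 2], dir, &fake_online)
  second = cached_fetch([2, 3, nil], dir, &fake_online)
end
```

After both calls, requests is [[1, 2], [3]]: the second call reads id 2 from the temporary cache and only asks the fetcher for id 3.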