Class: Treat::Workers::Extractors::Topics::Reuters
- Inherits: Object
- Defined in: lib/treat/workers/extractors/topics/reuters.rb
Overview
A Ruby text categorizer trained on the Reuters news story corpus. It works well on news articles and less well on text from other sources.
Authors: Mark Watson, 2005; Louis Mullie, 2011.
Constant Summary
- @@industry = {}
  Hashes to hold the topics.
- @@region = {}
- @@topics = {}
Class Method Summary
- .best_of_hash(hash, cutoff = 0.0, scale = 1.0) ⇒ Object
  Return the entries of the scored hash whose score is above cutoff, with each kept score multiplied by scale and rounded to two decimal places.
- .get_topics ⇒ Object
  Read the topics from the XML files.
- .read_xml(file_name) ⇒ Object
  Read an XML file and populate a hash of topics.
- .score_words(hash, word_list) ⇒ Object
  Score each category by adding the scores of each word occurrence, then return the names of the categories whose total is above zero.
- .topics(text, options = {}) ⇒ Object
  Get the general topic of the text using a Reuters-trained model.
Class Method Details
.best_of_hash(hash, cutoff = 0.0, scale = 1.0) ⇒ Object
Return the entries of the scored hash whose score is above cutoff, with each kept score multiplied by scale and rounded to two decimal places.
# File 'lib/treat/workers/extractors/topics/reuters.rb', line 90
def self.best_of_hash(hash, cutoff = 0.0, scale = 1.0)
  ret = {}
  hash.keys.each do |key|
    if hash[key] > cutoff
      ret[key] = hash[key] * scale
      ret[key] = ret[key].round(2)
    end
  end
  ret
end
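The filtering behaviour can be tried outside the library with a minimal sketch. The method below is a standalone rewrite of the same logic, not the library's own code, and the category names and scores are made up for illustration.

```ruby
# Standalone rewrite of the cutoff/scale logic, for illustration only.
def best_of_hash(hash, cutoff = 0.0, scale = 1.0)
  hash.each_with_object({}) do |(key, score), kept|
    # Keep only entries strictly above the cutoff, then scale and round.
    kept[key] = (score * scale).round(2) if score > cutoff
  end
end

# Hypothetical scores; not taken from the Reuters model files.
scores = { 'economy' => 12.5, 'sports' => 0.0, 'politics' => 3.25 }

# Default arguments drop entries that are not strictly above 0.0.
default_result = best_of_hash(scores)

# A higher cutoff keeps fewer entries; the scale multiplies what remains.
scaled_result = best_of_hash(scores, 5.0, 2.0)

p default_result
p scaled_result
```

Because the comparison is strict, an entry whose score equals the cutoff is dropped.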
.get_topics ⇒ Object
Read the topics from the XML files.
# File 'lib/treat/workers/extractors/topics/reuters.rb', line 44
def self.get_topics
  return unless @@industry.size == 0
  path = (Treat.libraries.reuters.model_path ||
    (Treat.paths.models + 'reuters/'))
  @@industry = read_xml(path + 'industry.xml')
  @@region = read_xml(path + 'region.xml')
  @@topics = read_xml(path + 'topics.xml')
end
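The method combines two ideas: a guard that loads the model only once, and a fallback from a configured path to a default one. The sketch below illustrates both under stated assumptions: LazyTopics, load_count, and the path arguments are hypothetical stand-ins, the loader is a stub rather than read_xml, and the paths are assumed to be plain strings.

```ruby
# Hypothetical stand-in class; it does not touch Treat or any XML file.
class LazyTopics
  @@industry = {}
  @@loads = 0

  def self.load_count
    @@loads
  end

  def self.industry
    @@industry
  end

  # configured_path and default_root are assumptions standing in for
  # the library's configuration values.
  def self.get_topics(configured_path = nil, default_root = 'models/')
    # Guard: skip the work when the hash is already populated.
    return unless @@industry.size == 0
    # Fallback: use the configured path if set, else the default location.
    path = configured_path || (default_root + 'reuters/')
    @@loads += 1
    # Stub loader: record which file would have been read.
    @@industry = { 'loaded_from' => path + 'industry.xml' }
  end
end

LazyTopics.get_topics            # performs the (stubbed) load
LazyTopics.get_topics('/other/') # no-op: the hash is already populated

puts LazyTopics.load_count
puts LazyTopics.industry['loaded_from']
```

One consequence of this guard is that a path configured after the first call has no effect, since the populated hash short-circuits every later call.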
.read_xml(file_name) ⇒ Object
Read an XML file and populate a hash of topics.
# File 'lib/treat/workers/extractors/topics/reuters.rb', line 55
def self.read_xml(file_name)
  hash = {}
  doc = Nokogiri::XML(File.read(file_name))
  doc.root.children.each do |category|
    cat = category["cat"]
    next if cat.nil?
    cat = cat.downcase
    hash[cat] ||= {}
    hash[cat][category["name"]] = category["score"].to_f
  end
  hash
end
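The shape of the resulting hash is easier to see on a small input. The sketch below is not the library's code: it swaps Nokogiri for REXML so that it runs without extra gems, it parses a string instead of a file, and the sample document (including the element name term and the root name topics) is an assumption. Only the cat, name, and score attributes come from the method above.

```ruby
require 'rexml/document'

# Hypothetical sample data; the real model files may be laid out differently.
SAMPLE_XML = <<~DOC
  <topics>
    <term cat="Economy" name="inflation" score="2.5"/>
    <term cat="economy" name="bank" score="1.0"/>
    <term cat="Sports" name="goal" score="3.0"/>
    <term name="orphan" score="9.9"/>
  </topics>
DOC

# Build { category => { word => score } } from the child elements of the root.
def read_xml_string(xml)
  hash = {}
  doc = REXML::Document.new(xml)
  doc.root.elements.each do |node|
    cat = node.attributes['cat']
    next if cat.nil?            # skip entries without a category
    cat = cat.downcase          # "Economy" and "economy" share one bucket
    hash[cat] ||= {}
    hash[cat][node.attributes['name']] = node.attributes['score'].to_f
  end
  hash
end

parsed = read_xml_string(SAMPLE_XML)
p parsed
```

Category names are lowercased before use as keys, so differently cased spellings of one category are merged, while word names are stored unchanged.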
.score_words(hash, word_list) ⇒ Object
Score each category by adding the scores of each word occurrence, then return the names of the categories whose total is above zero.
# File 'lib/treat/workers/extractors/topics/reuters.rb', line 71
def self.score_words(hash, word_list)
  category_names = hash.keys
  count_hash = {}
  category_names.each do |cat_name|
    cat_name = cat_name.downcase
    count_hash[cat_name] ||= 0
    word_list.each do |word|
      unless hash[cat_name][word].nil?
        count_hash[cat_name] += hash[cat_name][word]
      end
    end
  end
  count_hash = best_of_hash(count_hash)
  count_hash.keys
end
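A small worked example may help show how the totals accumulate. The sketch below is a standalone rewrite of the same idea rather than the library's code, and the model hash and word list are invented for illustration.

```ruby
# Standalone rewrite: sum per-category scores, keep categories above zero.
def score_words(hash, word_list)
  totals = {}
  hash.each do |cat_name, word_scores|
    totals[cat_name] = 0
    word_list.each do |word|
      score = word_scores[word]
      # Every occurrence counts, so repeated words add their score again.
      totals[cat_name] += score unless score.nil?
    end
  end
  totals.select { |_, total| total > 0.0 }.keys
end

# Hypothetical model: category => { word => score }.
model = {
  'economy' => { 'bank' => 1.5, 'inflation' => 2.0 },
  'sports'  => { 'goal' => 3.0 },
  'weather' => { 'rain' => 1.0 }
}

# 'bank' appears twice, so 'economy' totals 3.0; 'weather' matches nothing.
matched = score_words(model, %w[bank bank goal])
p matched
```

The return value contains only category names. The totals themselves are discarded, so callers cannot rank the matched categories by score.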
.topics(text, options = {}) ⇒ Object
Get the general topic of the text using a Reuters-trained model.
Options: none.
# File 'lib/treat/workers/extractors/topics/reuters.rb', line 20
def self.topics(text, options = {})
  stems = []
  @@reduce = 0
  unless text.words.size > 0
    raise Treat::Exception,
      "Annotator 'topics' requires " +
      "processor 'tokenize'."
  end
  text.words.collect! do |tok|
    stem = tok.stem.downcase
    val = tok.value.downcase
    stems << stem
    unless stem == val
      stems << val
    end
  end
  get_topics
  score_words(@@industry, stems) +
  score_words(@@region, stems) +
  score_words(@@topics, stems)
  #Treat::Feature.new(topics)
end
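Before scoring, the method builds a word list holding each token's stem and, when it differs from the stem, the token's lowercased value as well. The sketch below isolates that step. Token and candidate_words are hypothetical names, and the stems are written by hand rather than produced by a stemmer.

```ruby
# Hypothetical token type with a surface value and a hand-written stem.
Token = Struct.new(:value, :stem)

# Collect the lowercased stem of each token, plus the lowercased value
# when it differs from the stem.
def candidate_words(tokens)
  tokens.each_with_object([]) do |tok, words|
    stem = tok.stem.downcase
    val = tok.value.downcase
    words << stem
    words << val unless stem == val
  end
end

tokens = [Token.new('Banks', 'bank'), Token.new('rose', 'rose')]
words = candidate_words(tokens)
p words
```

A token whose stem differs from its value contributes two entries, so it can match the model under either form and may be scored twice if both forms appear in a category.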