Class: Treat::Workers::Extractors::Topics::Reuters

Inherits:
Object
  • Object
show all
Defined in:
lib/treat/workers/extractors/topics/reuters.rb

Overview

A Ruby text categorizer that was trained using the Reuters news story corpus. Works well for news articles, not so well for other sources.

Authors: Mark Watson, 2005; Louis Mullie, 2011.

Constant Summary collapse

@@industry =

Hashes to hold the topics.

{}
@@region =
{}
@@topics =
{}

Class Method Summary collapse

Class Method Details

.best_of_hash(hash, cutoff = 0.0, scale = 1.0) ⇒ Object

Retrieve the words with the scores above cutoff inside the hash of scored words.



90
91
92
93
94
95
96
97
98
99
# File 'lib/treat/workers/extractors/topics/reuters.rb', line 90

def self.best_of_hash(hash, cutoff = 0.0, scale = 1.0)
  ret = {}
  hash.keys.each do |key|
    if hash[key] > cutoff
      ret[key] = hash[key] * scale
      ret[key] = ret[key].round(2)
    end
  end
  ret
end

.get_topicsObject

Read the topics from the XML files.



44
45
46
47
48
49
50
51
# File 'lib/treat/workers/extractors/topics/reuters.rb', line 44

def self.get_topics
  return unless @@industry.size == 0
  path = (Treat.libraries.reuters.model_path || 
  (Treat.paths.models + 'reuters/'))
  @@industry = read_xml(path + 'industry.xml')
  @@region = read_xml(path + 'region.xml')
  @@topics = read_xml(path + 'topics.xml')
end

.read_xml(file_name) ⇒ Object

Read an XML file and populate a hash of topics.



55
56
57
58
59
60
61
62
63
64
65
66
67
# File 'lib/treat/workers/extractors/topics/reuters.rb', line 55

def self.read_xml(file_name)
  hash = {}
  doc = Nokogiri::XML(File.read(file_name))
  doc.root.children.each do |category|
    cat = category["cat"]
    next if cat.nil?
    cat = cat.downcase
    hash[cat] ||= {}
    hash[cat][category["name"]] =
    category["score"].to_f
  end
  hash
end

.score_words(hash, word_list) ⇒ Object

Score the words by adding the scores of each word occurence.



71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
# File 'lib/treat/workers/extractors/topics/reuters.rb', line 71

def self.score_words(hash, word_list)
  category_names = hash.keys
  count_hash = {}
  category_names.each do |cat_name|
    cat_name = cat_name.downcase
    count_hash[cat_name] ||= 0
    word_list.each do |word|
      unless hash[cat_name][word].nil?
        count_hash[cat_name] +=
        hash[cat_name][word]
      end
    end
  end
  count_hash = best_of_hash(count_hash)
  count_hash.keys
end

.topics(text, options = {}) ⇒ Object

Get the general topic of the text using a Reuters-trained model.

Options: none.



20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
# File 'lib/treat/workers/extractors/topics/reuters.rb', line 20

def self.topics(text, options = {})
  stems = []
  @@reduce = 0
  unless text.words.size > 0
    raise Treat::Exception,
    "Annotator 'topics' requires " +
    "processor 'tokenize'."
  end
  text.words.collect! do |tok|
    stem = tok.stem.downcase
    val = tok.value.downcase
    stems << stem
    unless stem == val
      stems << val
    end
  end
  get_topics
  score_words(@@industry, stems) +
  score_words(@@region, stems) +
  score_words(@@topics, stems)
  #Treat::Feature.new(topics)
end