Class: Treat::Workers::Lexicalizers::Taggers::Lingua

Inherits:
Object
Defined in:
lib/treat/workers/lexicalizers/taggers/lingua.rb

Overview

Part-of-speech (POS) tagger for English text, based on tag statistics derived from the Penn Treebank. The tagger applies a bigram (two-word) Hidden Markov Model to guess the most likely POS tag for each word.
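As a rough illustration of the bigram idea, here is a minimal sketch with toy numbers. The `TRANSITION` and `LEXICON` tables, their values, and the `guess_tag` helper are all hypothetical and are not taken from the tagger's real data or API; the sketch only shows how the previous word's tag and a word's known tags can be combined into a greedy guess.

```ruby
# Toy scores for P(tag | previous tag). Hypothetical values.
TRANSITION = {
  'dt' => { 'nn' => 0.6, 'vb' => 0.01 },
  'nn' => { 'nn' => 0.2, 'vb' => 0.4 }
}.freeze

# Toy lexicon: for each known word, a score per candidate tag.
# Hypothetical values.
LEXICON = {
  'run' => { 'nn' => 0.3, 'vb' => 0.7 },
  'dog' => { 'nn' => 1.0 }
}.freeze

# Pick the tag that maximizes (lexicon score) * (transition score),
# given the tag assigned to the previous word.
# Returns nil for words that are not in the toy lexicon.
def guess_tag(previous_tag, word)
  candidates = LEXICON[word]
  return nil if candidates.nil?

  transitions = TRANSITION.fetch(previous_tag, {})
  best = candidates.max_by do |tag, word_score|
    word_score * transitions.fetch(tag, 0.0)
  end
  best&.first
end

after_determiner = guess_tag('dt', 'run') # "the run": noun reading wins
after_noun       = guess_tag('nn', 'run') # "dogs run": verb reading wins
unknown          = guess_tag('dt', 'zzz') # not in the toy lexicon
```

With these toy numbers, the same word receives a different tag depending on the tag to its left, which is the behavior a bigram model is meant to capture.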

Constant Summary

DefaultOptions =

Hold the default options.

{ :relax => false }
Punctuation =

Maps the punctuation tags used by this gem to the standard PTB tags.

{
  'pp' => '.',
  'pps' => ';',
  'ppc' => ',',
  'ppd' => '$',
  'ppl' => 'lrb',
  'ppr' => 'rrb'
}
@@tagger =

Hold one instance of the tagger.

nil
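The way the `Punctuation` table is applied can be sketched in isolation. The snippet below mirrors the normalization steps visible in the source of `.tag`; `PUNCTUATION` is a local copy of the table and `normalize_tag` is a hypothetical helper name used only for illustration, not part of the class.

```ruby
# Local copy of the Punctuation table, for illustration only.
PUNCTUATION = {
  'pp'  => '.',
  'pps' => ';',
  'ppc' => ',',
  'ppd' => '$',
  'ppl' => 'lrb',
  'ppr' => 'rrb'
}.freeze

# Convert a raw tag from the tagger into its upper-cased PTB-style form,
# following the same order of steps as the source of .tag.
def normalize_tag(raw)
  t = raw
  t = 'prp$' if t == 'prps'          # possessive pronoun
  t = 'dt' if t == 'det'             # determiner
  t = PUNCTUATION[t] if PUNCTUATION[t]
  t.upcase
end

possessive = normalize_tag('prps') # "PRP$"
comma      = normalize_tag('ppc')  # ","
left_paren = normalize_tag('ppl')  # "LRB"
noun       = normalize_tag('nn')   # "NN"
```

Note that, as written, brackets come out as `LRB` and `RRB`, without the surrounding hyphens used in some Penn Treebank resources.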

Class Method Summary

Class Method Details

.tag(entity, options = {}) ⇒ Object

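Tags a single token, or every token in a larger entity, using a probabilistic model that takes into account known words found in a lexicon and the tag of the previous word. For a token, the upper-cased Penn Treebank tag is returned; for a sentence, phrase, fragment or other group, 'S', 'P', 'F' or 'G' is returned respectively.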

Options:

  • (Boolean) :relax => Relax the HMM; this may improve accuracy for uncommon words, particularly words used polysemously.
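One consequence of the source of this method is worth spelling out: the tagger is stored in a class variable with `||=`, so the options appear to take effect only on the call that first creates it. The sketch below reproduces that pattern with a hypothetical stand-in class (`FakeTagger`) and a hypothetical module (`Demo`); it does not use the real tagger and assumes nothing else resets the cached instance.

```ruby
# Hypothetical stand-in for the real tagger; it only records its options.
class FakeTagger
  attr_reader :conf

  def initialize(options)
    @conf = options
  end
end

# Hypothetical module that caches one tagger the same way .tag does.
module Demo
  DEFAULTS = { relax: false }.freeze
  @@tagger = nil

  def self.tagger_for(options = {})
    # Only the first call builds a tagger; later options are ignored.
    @@tagger ||= FakeTagger.new(DEFAULTS.merge(options))
  end
end

first  = Demo.tagger_for              # created with relax: false
second = Demo.tagger_for(relax: true) # returns the cached instance

same_instance = first.equal?(second)
relax_setting = second.conf[:relax]
```

Under this pattern, passing `:relax => true` after the tagger has already been created would have no effect, so the option would need to be supplied on the first call.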



# File 'lib/treat/workers/lexicalizers/taggers/lingua.rb', line 39

def self.tag(entity, options = {})
  options = DefaultOptions.merge(options)

  # The tagger is created once and reused, so the options only take
  # effect on the call that first instantiates it.
  @@tagger ||= ::EngTagger.new(options)
  # Start from the sentence-final punctuation tag, so that the first
  # token is tagged as if it began a sentence.
  left_tag = @@tagger.conf[:current_tag] = 'pp'
  isolated_token = entity.is_a?(Treat::Entities::Token)
  tokens = isolated_token ? [entity] : entity.tokens

  tokens.each do |token|
    next if token.to_s == ''
    w = @@tagger.clean_word(token.to_s)
    t = @@tagger.assign_tag(left_tag, w)
    # Fall back to the foreign-word tag when no tag is assigned.
    t = 'fw' if t.nil? || t == ''
    @@tagger.conf[:current_tag] = left_tag = t
    # Normalize the tagger's own tag names to Penn Treebank names.
    t = 'prp$' if t == 'prps'
    t = 'dt' if t == 'det'
    t = Punctuation[t] if Punctuation[t]
    token.set :tag, t.upcase
    token.set :tag_set, :penn if isolated_token
    return t.upcase if isolated_token
  end

  # Record the tag set on top-level groups only.
  if entity.is_a?(Treat::Entities::Group) &&
     !entity.parent_sentence
    entity.set :tag_set, :penn
  end

  # Non-token entities get a one-letter tag describing their type.
  return 'S' if entity.is_a?(Treat::Entities::Sentence)
  return 'P' if entity.is_a?(Treat::Entities::Phrase)
  return 'F' if entity.is_a?(Treat::Entities::Fragment)
  return 'G' if entity.is_a?(Treat::Entities::Group)
end