Class: EngTagger

Inherits:
Object
  • Object
Extended by:
BoundedSpaceMemoizable
Defined in:
lib/engtagger.rb,
lib/engtagger/version.rb

Overview

English part-of-speech tagger class

Constant Summary

File paths

DEFAULT_LEXPATH =
File.join(File.dirname(__FILE__), 'engtagger')
DEFAULT_WORDPATH =
File.join(DEFAULT_LEXPATH, "pos_words.hash")
DEFAULT_TAGPATH =
File.join(DEFAULT_LEXPATH, "pos_tags.hash")
Regexps to match XML-style part-of-speech tags

NUM =
get_ext('cd')
GER =
get_ext('vbg')
ADJ =
get_ext('jj[rs]*')
NN =
get_ext('nn[sp]*')
NNP =
get_ext('nnp')
PREP =
get_ext('in')
DET =
get_ext('det')
PAREN =
get_ext('[lr]rb')
QUOT =
get_ext('ppr')
SEN =
get_ext('pp')
WORD =
get_ext('\w+')
VB =
get_ext('vb')
VBG =
get_ext('vbg')
VBD =
get_ext('vbd')
PART =
get_ext('vbn')
VBP =
get_ext('vbp')
VBZ =
get_ext('vbz')
JJ =
get_ext('jj')
JJR =
get_ext('jjr')
JJS =
get_ext('jjs')
RB =
get_ext('rb')
RBR =
get_ext('rbr')
RBS =
get_ext('rbs')
RP =
get_ext('rp')
WRB =
get_ext('wrb')
WDT =
get_ext('wdt')
WP =
get_ext('wp')
WPS =
get_ext('wps')
CC =
get_ext('cc')
IN =
get_ext('in')
TAGS =
VERSION =
"0.3.2"

Methods included from BoundedSpaceMemoizable

memoize

Constructor Details

#initialize(params = {}) ⇒ EngTagger

Takes a hash of parameters that override the default values. See #conf for the available keys.



# File 'lib/engtagger.rb', line 192

def initialize(params = {})
  @conf = Hash.new
  @conf[:unknown_word_tag] = ''
  @conf[:stem] = false
  @conf[:weight_noun_phrases] = false
  @conf[:longest_noun_phrase] = 5
  @conf[:relax] = false
  @conf[:tag_lex] = 'tags.yml'
  @conf[:word_lex] = 'words.yml'
  @conf[:unknown_lex] = 'unknown.yml'
  @conf[:word_path] = DEFAULT_WORDPATH
  @conf[:tag_path] = DEFAULT_TAGPATH
  @conf[:debug] = false
  # assuming that we start analyzing from the beginning of a new sentence...
  @conf[:current_tag] = 'pp'
  @conf.merge!(params) if params
  unless File.exist?(@conf[:word_path]) and File.exist?(@conf[:tag_path])
    print "Couldn't locate POS lexicon, creating a new one" if @conf[:debug]
    @@hmm = Hash.new
    @@lexicon = Hash.new
  else
    lexf = File.open(@conf[:word_path], 'r')
    @@lexicon = Marshal.load(lexf)
    lexf.close
    hmmf = File.open(@conf[:tag_path], 'r')
    @@hmm = Marshal.load(hmmf)
    hmmf.close
  end
  @@mnp = get_max_noun_regex
end

Instance Attribute Details

#conf ⇒ Object

Hash storing config values:

  • :unknown_word_tag => (String) Tag to assign to unknown words
  • :stem => (Boolean) Stem single words using Porter module
  • :weight_noun_phrases => (Boolean) When returning occurrence counts for a noun phrase, multiply the value by the number of words in the NP.
  • :longest_noun_phrase => (Integer) Will ignore noun phrases longer than this threshold. This affects only the get_words() and get_nouns() methods.
  • :relax => (Boolean) Relax the Hidden Markov Model: this may improve accuracy for uncommon words, particularly words used polysemously
  • :tag_lex => (String) Name of the YAML file containing a hash of adjacent part of speech tags and the probability of each
  • :word_lex => (String) Name of the YAML file containing a hash of words and corresponding parts of speech
  • :unknown_lex => (String) Name of the YAML file containing a hash of tags for unknown words and corresponding parts of speech
  • :tag_path => (String) Directory path of tag_lex
  • :word_path => (String) Directory path of word_lex and unknown_lex
  • :debug => (Boolean) Print debug messages


# File 'lib/engtagger.rb', line 184

def conf
  @conf
end
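
The way the constructor layers caller-supplied parameters over the defaults can be sketched without loading the tagger. In the sketch below, `DEMO_DEFAULTS` and `demo_conf` are illustrative names that are not part of EngTagger, and only a subset of the keys listed above is reproduced.

```ruby
# Illustrative subset of the defaults set in #initialize (not the full list).
DEMO_DEFAULTS = {
  unknown_word_tag: '',
  stem: false,
  weight_noun_phrases: false,
  longest_noun_phrase: 5,
  relax: false,
  debug: false
}.freeze

# Hypothetical helper mirroring `@conf.merge!(params) if params`.
def demo_conf(params = {})
  conf = DEMO_DEFAULTS.dup   # dup returns an unfrozen copy
  conf.merge!(params) if params
  conf
end

conf = demo_conf({ longest_noun_phrase: 3, stem: true })
puts conf[:longest_noun_phrase]  # overridden by the caller
puts conf[:relax]                # untouched default
```

Keys that are not passed keep their defaults, and passing `nil` leaves every default in place, which matches the `if params` guard in the listing.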

Class Method Details

.explain_tag(tag) ⇒ String

Convert a Treebank-style, abbreviated tag into verbose definitions



# File 'lib/engtagger.rb', line 93

def self.explain_tag(tag)
  tag = tag.to_s.downcase
  if TAGS[tag]
    return TAGS[tag]
  else
    return tag
  end
end
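
A minimal sketch of the lookup-with-fallback behaviour follows. The contents of `TAGS` are not shown on this page, so `DEMO_TAGS` and its two entries are hypothetical stand-ins, and `demo_explain_tag` is an illustrative name.

```ruby
# Hypothetical entries; the real TAGS table is not reproduced on this page.
DEMO_TAGS = {
  'nn'  => 'noun',
  'vbd' => 'verb_past_tense'
}.freeze

# Same shape as explain_tag: normalize to a lowercase string,
# return the verbose form if known, otherwise return the tag itself.
def demo_explain_tag(tag)
  key = tag.to_s.downcase
  DEMO_TAGS.fetch(key, key)
end

puts demo_explain_tag(:NN)    # known tag: verbose form
puts demo_explain_tag('XYZ')  # unknown tag: returned downcased
```

Because of the `to_s.downcase` step, symbols and uppercase strings are accepted, and an unknown tag comes back in lowercase rather than as `nil`.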

.get_ext(tag = nil) ⇒ Object

Return a regexp from a string argument that matches an XML-style pos tag



# File 'lib/engtagger.rb', line 51

def self.get_ext(tag = nil)
  return nil unless tag
  return Regexp.new("<#{tag}>[^<]+</#{tag}>\s*")
end
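
To see what these patterns match, the sketch below copies the one-line body of `get_ext` into a standalone helper (`ext_regex` is an illustrative name) and applies it to a hand-written tagged string. One detail worth knowing: inside a double-quoted Ruby string, `\s` is the escape for a single space, so the compiled pattern ends in a literal space followed by `*` rather than a general whitespace class.

```ruby
# Standalone copy of the pattern construction shown in get_ext.
def ext_regex(tag)
  Regexp.new("<#{tag}>[^<]+</#{tag}>\s*")
end

# Hand-written sample in the XML style produced by add_tags.
tagged = '<nnp>Alice</nnp> <vbz>likes</vbz> <jj>red</jj> <nns>apples</nns>'

noun_re = ext_regex('nn[sp]*')      # same argument as the NN constant
puts noun_re.source                 # shows the literal space before '*'
p tagged.scan(noun_re)              # => ["<nnp>Alice</nnp> ", "<nns>apples</nns>"]
```

Each match includes the surrounding tags and any trailing spaces, which is why the extraction methods strip tags before counting.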

.hmm ⇒ Hash

Return a class variable that holds probability data.



# File 'lib/engtagger.rb', line 38

def self.hmm
  return @@hmm
end

.lexicon ⇒ Hash

Return a class variable that holds lexical data.



# File 'lib/engtagger.rb', line 46

def self.lexicon
  return @@lexicon
end

Instance Method Details

#add_tags(text, verbose = false) ⇒ String

Examine the string provided and return it fully tagged in XML style.



# File 'lib/engtagger.rb', line 255

def add_tags(text, verbose = false)
  return nil unless valid_text(text)
  tagged = []
  words = clean_text(text)
  tags = Array.new
  words.each do |word|
    cleaned_word = clean_word(word)
    tag = assign_tag(@conf[:current_tag], cleaned_word)
    @conf[:current_tag] = tag = (tag and tag != "") ? tag : 'nn'
    tag = EngTagger.explain_tag(tag) if verbose
    tagged << '<' + tag + '>' + word + '</' + tag + '>'
  end
  reset
  return tagged.join(' ')
end
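
The wrapping step and the `'nn'` fallback can be illustrated in isolation. The sketch below replaces the tagger's `assign_tag` (which consults the lexicon and HMM) with a tiny hypothetical lookup table, so `DEMO_LEXICON` and `demo_add_tags` are assumptions made for illustration and say nothing about how tags are actually chosen.

```ruby
# Hypothetical stand-in for the lexicon/HMM lookup done by assign_tag.
DEMO_LEXICON = { 'the' => 'det', 'dog' => 'nn', 'barks' => 'vbz' }.freeze

def demo_add_tags(text)
  text.split.map do |word|
    tag = DEMO_LEXICON[word.downcase]
    tag = 'nn' if tag.nil? || tag.empty?  # same fallback as in the listing
    "<#{tag}>#{word}</#{tag}>"            # wrap the original word, not the cleaned one
  end.join(' ')
end

puts demo_add_tags('The dog barks')  # => <det>The</det> <nn>dog</nn> <vbz>barks</vbz>
puts demo_add_tags('The wug barks')  # unknown word falls back to nn
```

The output of this step is the tagged string that the `get_*` extraction methods expect as input.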

#get_adjectives(tagged) ⇒ Hash



# File 'lib/engtagger.rb', line 440

def get_adjectives(tagged)
  return nil unless valid_text(tagged)
  tags = [JJ]
  build_matches_hash(build_trimmed(tagged, tags))
end

#get_adverbs(tagged) ⇒ Hash



# File 'lib/engtagger.rb', line 470

def get_adverbs(tagged)
  return nil unless valid_text(tagged)
  tags = [RB, RBR, RBS, RP]
  build_matches_hash(build_trimmed(tagged, tags))
end

#get_base_present_verbs(tagged) ⇒ Hash



# File 'lib/engtagger.rb', line 420

def get_base_present_verbs(tagged)
  return nil unless valid_text(tagged)
  tags = [VBP]
  build_matches_hash(build_trimmed(tagged, tags))
end

#get_comparative_adjectives(tagged) ⇒ Hash



# File 'lib/engtagger.rb', line 450

def get_comparative_adjectives(tagged)
  return nil unless valid_text(tagged)
  tags = [JJR]
  build_matches_hash(build_trimmed(tagged, tags))
end

#get_conjunctions(tagged) ⇒ Hash

Returns conjunctions of every kind (coordinating, subordinating, correlative, and so on) without distinguishing between them.



# File 'lib/engtagger.rb', line 497

def get_conjunctions(tagged)
  return nil unless valid_text(tagged)
  tags = [CC, IN]
  build_matches_hash(build_trimmed(tagged, tags))
end

#get_gerund_verbs(tagged) ⇒ Hash



# File 'lib/engtagger.rb', line 400

def get_gerund_verbs(tagged)
  return nil unless valid_text(tagged)
  tags = [VBG]
  build_matches_hash(build_trimmed(tagged, tags))
end

#get_infinitive_verbs(tagged) ⇒ Hash



# File 'lib/engtagger.rb', line 380

def get_infinitive_verbs(tagged)
  return nil unless valid_text(tagged)
  tags = [VB]
  build_matches_hash(build_trimmed(tagged, tags))
end

#get_interrogatives(tagged) ⇒ Hash Also known as: get_question_parts



# File 'lib/engtagger.rb', line 480

def get_interrogatives(tagged)
  return nil unless valid_text(tagged)
  tags = [WRB, WDT, WP, WPS]
  build_matches_hash(build_trimmed(tagged, tags))
end

#get_max_noun_phrases(tagged) ⇒ Hash

Given a POS-tagged text, this method returns only the maximal noun phrases. May be called directly, but is also used by get_noun_phrases.



# File 'lib/engtagger.rb', line 509

def get_max_noun_phrases(tagged)
  return nil unless valid_text(tagged)
  tags = [@@mnp]
  mn_phrases = build_trimmed(tagged, tags)
  ret = Hash.new(0)
  mn_phrases.each do |p|
    p = stem(p) unless p =~ /\s/  # stem single words
    ret[p] += 1 unless p =~ /\A\s*\z/
  end
  return ret
end

#get_noun_phrases(tagged) ⇒ Hash

Similar to get_words, but requires a POS-tagged text as an argument.



# File 'lib/engtagger.rb', line 526

def get_noun_phrases(tagged)
  return nil unless valid_text(tagged)
  found = Hash.new(0)
  phrase_ext = /(?:#{PREP}|#{DET}|#{NUM})+/xo
  scanned = tagged.scan(@@mnp)
  # Find MNPs in the text, one sentence at a time
  # Record and split if the phrase is extended by a (?:PREP|DET|NUM)
  mn_phrases = []
  scanned.each do |m|
    found[m] += 1 if phrase_ext =~ m
    mn_phrases += m.split(phrase_ext)
  end
  mn_phrases.each do |mnp|
    # Split the phrase into an array of words, and create a loop for each word,
    # shortening the phrase by removing the word in the first position.
    # Record the phrase and any single nouns that are found
    words = mnp.split
    words.length.times do |i|
      found[words.join(' ')] += 1 if words.length > 1
      w = words.shift
      found[w] += 1 if w =~ /#{NN}/
    end
  end
  ret = Hash.new(0)
  found.keys.each do |f|
    k = strip_tags(f)
    v = found[f]
    # We weight by the word count to favor long noun phrases
    space_count = k.scan(/\s+/)
    word_count = space_count.length + 1
    # Throttle MNPs if necessary
    next if word_count > @conf[:longest_noun_phrase]
    k = stem(k) unless word_count > 1  # stem single words
    multiplier = 1
    multiplier = word_count if @conf[:weight_noun_phrases]
    ret[k] += multiplier * v
  end
  return ret
end
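
Two steps of this method lend themselves to a small standalone sketch: enumerating the shorter phrases obtained by dropping leading words, and applying the length limit and optional weighting. The sketch works on plain, untagged strings and leaves out the single-noun check against `NN`; `suffix_phrases` and `weigh` are illustrative names, not EngTagger methods.

```ruby
# Record a phrase and every shorter phrase made by removing leading words.
# Single words are skipped here (the real method records them only if
# they match the NN pattern).
def suffix_phrases(phrase)
  words = phrase.split
  found = Hash.new(0)
  words.length.times do
    found[words.join(' ')] += 1 if words.length > 1
    words.shift
  end
  found
end

# Drop phrases longer than `longest` words and optionally multiply each
# count by the phrase's word count, as :weight_noun_phrases does.
def weigh(found, longest: 5, weight: false)
  ret = Hash.new(0)
  found.each do |phrase, count|
    word_count = phrase.scan(/\s+/).length + 1
    next if word_count > longest
    ret[phrase] += (weight ? word_count : 1) * count
  end
  ret
end

found = suffix_phrases('linguistic data consortium')
p weigh(found, weight: true)  # => {"linguistic data consortium"=>3, "data consortium"=>2}
p weigh(found, longest: 2)    # => {"data consortium"=>1}
```

Weighting by word count favors longer phrases, while the length limit discards phrases that exceed `:longest_noun_phrase`.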

#get_nouns(tagged) ⇒ Hash

Given a POS-tagged text, this method returns all nouns and their occurrence frequencies.



# File 'lib/engtagger.rb', line 356

def get_nouns(tagged)
  return nil unless valid_text(tagged)
  tags = [NN]
  build_matches_hash(build_trimmed(tagged, tags))
end
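
The helpers `build_trimmed` and `build_matches_hash` are not shown on this page. The sketch below is only a guess at the overall effect of the `get_*` extraction methods (scan for tagged words, strip the tags, count occurrences); `ext_regex` and `count_matches` are illustrative names, and details such as stemming are left out.

```ruby
# Standalone copy of the pattern construction shown in get_ext.
def ext_regex(tag)
  Regexp.new("<#{tag}>[^<]+</#{tag}>\s*")
end

# Assumed behaviour: find every span matching one of the patterns,
# remove the XML-style tags, and tally the remaining words.
def count_matches(tagged, patterns)
  counts = Hash.new(0)
  tagged.scan(Regexp.union(patterns)) do |match|
    word = match.gsub(/<[^>]+>/, '').strip
    counts[word] += 1 unless word.empty?
  end
  counts
end

tagged = '<det>The</det> <nn>dog</nn> <vbd>chased</vbd> <det>the</det> ' \
         '<nn>dog</nn> <nns>toys</nns>'

p count_matches(tagged, [ext_regex('nn[sp]*')])              # => {"dog"=>2, "toys"=>1}
p count_matches(tagged, [ext_regex('vbd'), ext_regex('det')])
```

Passing several patterns at once corresponds to methods such as get_verbs or get_adverbs, which list more than one tag constant.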

#get_passive_verbs(tagged) ⇒ Hash



# File 'lib/engtagger.rb', line 410

def get_passive_verbs(tagged)
  return nil unless valid_text(tagged)
  tags = [PART]
  build_matches_hash(build_trimmed(tagged, tags))
end

#get_past_tense_verbs(tagged) ⇒ Hash



# File 'lib/engtagger.rb', line 390

def get_past_tense_verbs(tagged)
  return nil unless valid_text(tagged)
  tags = [VBD]
  build_matches_hash(build_trimmed(tagged, tags))
end

#get_present_verbs(tagged) ⇒ Hash



# File 'lib/engtagger.rb', line 430

def get_present_verbs(tagged)
  return nil unless valid_text(tagged)
  tags = [VBZ]
  build_matches_hash(build_trimmed(tagged, tags))
end

#get_proper_nouns(tagged) ⇒ Object

Given a POS-tagged text, this method returns a hash of all proper nouns and their occurrence frequencies. The method is greedy and will return multi-word phrases, if possible, so it would find "Linguistic Data Consortium" as a single unit, rather than as three individual proper nouns. This method does not stem the found words.



# File 'lib/engtagger.rb', line 322

def get_proper_nouns(tagged)
  return nil unless valid_text(tagged)
  tags = [NNP]
  nnp = build_matches_hash(build_trimmed(tagged, tags))
  # Now for some fancy resolution stuff...
  nnp.keys.each do |key|
    words = key.split(/\s/)
    # Let's say this is an organization's name --
    # (and it's got at least three words)
    # is there a corresponding acronym in this hash?
    if words.length > 2
      # Make a (naive) acronym out of this name
      acronym = words.map do |word|
        /\A([a-z])[a-z]*\z/ =~ word
        $1
      end.join ''
      # If that acronym has been seen,
      # remove it and add the values to
      # the full name
      if nnp[acronym]
        nnp[key] += nnp[acronym]
        nnp.delete(acronym)
      end
    end
  end
  return nnp
end
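
The acronym-folding step can be tried on a plain hash. The sketch below assumes lowercase keys, since the pattern in the listing only matches lowercase letters; whether the real hash keys are lowercase depends on helpers not shown on this page. `fold_acronyms` is an illustrative name.

```ruby
# For every name of three or more words, build a naive acronym from the
# first letters and, if that acronym is also a key, merge its count into
# the full name and remove the acronym entry.
def fold_acronyms(counts)
  counts.keys.each do |key|
    words = key.split(/\s/)
    next unless words.length > 2
    # Words that do not match contribute nothing to the acronym.
    acronym = words.map { |w| w[/\A([a-z])[a-z]*\z/, 1] }.join
    next unless counts.key?(acronym) && counts.key?(key)
    counts[key] += counts.delete(acronym)
  end
  counts
end

counts = { 'linguistic data consortium' => 2, 'ldc' => 3, 'alice' => 1 }
p fold_acronyms(counts)  # => {"linguistic data consortium"=>5, "alice"=>1}
```

Iterating over `counts.keys` (a separate array) makes it safe to delete entries from the hash inside the loop, which is the same approach the listing takes.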

#get_readable(text, verbose = false) ⇒ Object

Return an easy-on-the-eyes tagged version of a text string. Applies add_tags and reformats to be easier to read.



# File 'lib/engtagger.rb', line 290

def get_readable(text, verbose = false)
  return nil unless valid_text(text)
  tagged = add_tags(text, verbose)
  tagged = tagged.gsub(/<\w+>([^<]+|[<\w>]+)<\/(\w+)>/o) do
  #!!!# tagged = tagged.gsub(/<\w+>([^<]+)<\/(\w+)>/o) do
    $1 + '/' + $2.upcase
  end
end
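
The reformatting step is a single substitution and can be run on any tagged string. The sketch below applies the same pattern as the listing to a hand-written sample; `demo_readable` is an illustrative name.

```ruby
# Rewrite <tag>word</tag> as word/TAG using the pattern from get_readable.
def demo_readable(tagged)
  tagged.gsub(/<\w+>([^<]+|[<\w>]+)<\/(\w+)>/) do
    "#{$1}/#{$2.upcase}"
  end
end

tagged = '<det>The</det> <nn>dog</nn> <vbz>barks</vbz>'
puts demo_readable(tagged)  # => The/DET dog/NN barks/VBZ
```

The tag name is taken from the closing tag (the second capture group), and the spacing between words is left as it was in the tagged input.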

#get_sentences(text) ⇒ Object

Return an array of sentences (without POS tags) from a text.



# File 'lib/engtagger.rb', line 300

def get_sentences(text)
  return nil unless valid_text(text)
  tagged = add_tags(text)
  sentences = Array.new
  tagged.split(/<\/pp>/).each do |line|
    sentences << strip_tags(line)
  end
  sentences = sentences.map do |sentence|
    sentence.gsub(Regexp.new(" ('s?) ")){$1 + ' '}
    sentence.gsub(Regexp.new(" (\W+) ")){$1 + ' '}
    sentence.gsub(Regexp.new(" (`+) ")){' ' + $1}
    sentence.gsub(Regexp.new(" (\W+)$")){$1}
    sentence.gsub(Regexp.new("^(`+) ")){$1}
  end
  return sentences
end
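
The core idea, splitting the tagged text at each closing `</pp>` tag and then removing the markup, can be sketched as follows. `strip_tags` is not shown on this page, so the tag removal here is an assumed equivalent, only the trailing-punctuation cleanup is reproduced, and `demo_sentences` is an illustrative name.

```ruby
# Split on the sentence-final punctuation tag, strip markup, tidy spacing.
def demo_sentences(tagged)
  tagged.split(/<\/pp>/).map do |chunk|
    chunk.gsub(/<[^>]+>/, '')        # assumed stand-in for strip_tags
         .gsub(/\s+/, ' ')
         .strip
         .gsub(/ (\W+)\z/) { $1 }    # reattach trailing punctuation
  end
end

tagged = '<det>The</det> <nn>dog</nn> <vbz>barks</vbz> <pp>.</pp> ' \
         '<prp>It</prp> <vbz>runs</vbz> <pp>!</pp>'
p demo_sentences(tagged)  # => ["The dog barks.", "It runs!"]
```

Here the substitutions are chained, so each one operates on the result of the previous one.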

#get_superlative_adjectives(tagged) ⇒ Hash



# File 'lib/engtagger.rb', line 460

def get_superlative_adjectives(tagged)
  return nil unless valid_text(tagged)
  tags = [JJS]
  build_matches_hash(build_trimmed(tagged, tags))
end

#get_verbs(tagged) ⇒ Hash

Returns all types of verbs and does not discriminate between the various kinds. Combines all other verb methods listed in this class.



# File 'lib/engtagger.rb', line 369

def get_verbs(tagged)
  return nil unless valid_text(tagged)
  tags = [VB, VBD, VBG, PART, VBP, VBZ]
  build_matches_hash(build_trimmed(tagged, tags))
end

#get_words(text) ⇒ Object

Given a text string, return as many nouns and noun phrases as possible. Applies add_tags and involves three stages:

  • Tag the text
  • Extract all the maximal noun phrases
  • Recursively extract all noun phrases from the MNPs


# File 'lib/engtagger.rb', line 278

def get_words(text)
  return false unless valid_text(text)
  tagged = add_tags(text)
  if(@conf[:longest_noun_phrase] <= 1)
    return get_nouns(tagged)
  else
    return get_noun_phrases(tagged)
  end
end

#install ⇒ Object

Reads some included corpus data and saves it in a stored hash on the local file system. This is called automatically if the tagger can't find the stored lexicon.



# File 'lib/engtagger.rb', line 569

def install
  puts "Creating part-of-speech lexicon" if @conf[:debug]
  load_tags(@conf[:tag_lex])
  load_words(@conf[:word_lex])
  load_words(@conf[:unknown_lex])
  File.open(@conf[:word_path], 'w') do |f|
    Marshal.dump(@@lexicon, f)
  end
  File.open(@conf[:tag_path], 'w') do |f|
    Marshal.dump(@@hmm, f)
  end
end
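
The storage format is Ruby's built-in `Marshal` serialization: install writes the hashes with `Marshal.dump`, and the constructor reads them back with `Marshal.load`. The round trip can be sketched with a temporary file; the shape of `demo_lexicon` below is purely illustrative and is not the real lexicon layout.

```ruby
require 'tempfile'

# Illustrative data only; the real lexicon structure is not shown here.
demo_lexicon = { 'dog' => { 'nn' => 10 }, 'barks' => { 'vbz' => 3 } }

# Write the hash to a file and read it back, as install and the
# constructor do with :word_path and :tag_path.
restored = Tempfile.create('pos_words') do |f|
  f.binmode                     # Marshal data is binary
  Marshal.dump(demo_lexicon, f)
  f.rewind
  Marshal.load(f)
end

puts restored == demo_lexicon   # the loaded copy equals the original
```

Marshal data is specific to Ruby and should only be loaded from trusted files, since `Marshal.load` can instantiate arbitrary objects.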

#tag_pairs(text) ⇒ Array

Return an array of pairs of the form ["word", :tag].



# File 'lib/engtagger.rb', line 232

def tag_pairs(text)
  return [] unless valid_text(text)

  out = clean_text(text).map do |word|
    cleaned_word = clean_word word
    tag = assign_tag(@conf[:current_tag], cleaned_word)
    @conf[:current_tag] = tag = (tag and !tag.empty?) ? tag : 'nn'
    [word, tag.to_sym]
  end

  # reset the tagger state
  reset

  out
end
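
Because the result is a plain array of `[word, tag]` pairs, it works directly with Ruby's Enumerable methods. The sketch below uses hand-written sample pairs in the documented shape (not actual tagger output) and groups the words by tag; `words_by_tag` is an illustrative name.

```ruby
# Group the words of [word, :tag] pairs under their tag.
def words_by_tag(pairs)
  pairs.group_by { |_word, tag| tag }
       .transform_values { |group| group.map(&:first) }
end

# Hand-written sample in the shape returned by tag_pairs.
pairs = [['The', :det], ['dog', :nn], ['chased', :vbd], ['the', :det], ['cat', :nn]]

p words_by_tag(pairs)
# => {:det=>["The", "the"], :nn=>["dog", "cat"], :vbd=>["chased"]}
```

Unlike add_tags, this form needs no regular expressions to pull words and tags apart, which can be convenient when the XML-style string is not required.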