Class: EngTagger
- Inherits:
-
Object
- Object
- EngTagger
- Extended by:
- BoundedSpaceMemoizable
- Defined in:
- lib/engtagger.rb,
lib/engtagger/version.rb
Overview
English part-of-speech tagger class
Constant Summary
- DEFAULT_LEXPATH =
File paths
File.join(File.dirname(__FILE__), 'engtagger')
- DEFAULT_WORDPATH =
File.join(DEFAULT_LEXPATH, "pos_words.hash")
- DEFAULT_TAGPATH =
File.join(DEFAULT_LEXPATH, "pos_tags.hash")
- NUM =
Regexps to match XML-style part-of-speech tags
get_ext('cd')
- GER =
get_ext('vbg')
- ADJ =
get_ext('jj[rs]*')
- NN =
get_ext('nn[sp]*')
- NNP =
get_ext('nnp')
- PREP =
get_ext('in')
- DET =
get_ext('det')
- PAREN =
get_ext('[lr]rb')
- QUOT =
get_ext('ppr')
- SEN =
get_ext('pp')
- WORD =
get_ext('\w+')
- VB =
get_ext('vb')
- VBG =
get_ext('vbg')
- VBD =
get_ext('vbd')
- PART =
get_ext('vbn')
- VBP =
get_ext('vbp')
- VBZ =
get_ext('vbz')
- JJ =
get_ext('jj')
- JJR =
get_ext('jjr')
- JJS =
get_ext('jjs')
- RB =
get_ext('rb')
- RBR =
get_ext('rbr')
- RBS =
get_ext('rbs')
- RP =
get_ext('rp')
- WRB =
get_ext('wrb')
- WDT =
get_ext('wdt')
- WP =
get_ext('wp')
- WPS =
get_ext('wps')
- CC =
get_ext('cc')
- IN =
get_ext('in')
- TAGS =
- VERSION =
"0.3.2"
Instance Attribute Summary
- #conf ⇒ Object
Hash storing config values.
Class Method Summary
- .explain_tag(tag) ⇒ String
Convert a Treebank-style, abbreviated tag into verbose definitions.
- .get_ext(tag = nil) ⇒ Object
Return a regexp from a string argument that matches an XML-style pos tag.
- .hmm ⇒ Hash
Return a class variable that holds probability data.
- .lexicon ⇒ Hash
Return a class variable that holds lexical data.
Instance Method Summary
- #add_tags(text, verbose = false) ⇒ String
Examine the string provided and return it fully tagged in XML style.
- #get_adjectives(tagged) ⇒ Hash
The hash of matches.
- #get_adverbs(tagged) ⇒ Hash
The hash of matches.
- #get_base_present_verbs(tagged) ⇒ Hash
The hash of matches.
- #get_comparative_adjectives(tagged) ⇒ Hash
The hash of matches.
- #get_conjunctions(tagged) ⇒ Hash
Returns all types of conjunctions and does not discriminate between the various kinds.
- #get_gerund_verbs(tagged) ⇒ Hash
The hash of matches.
- #get_infinitive_verbs(tagged) ⇒ Hash
The hash of matches.
- #get_interrogatives(tagged) ⇒ Hash (also: #get_question_parts)
The hash of matches.
- #get_max_noun_phrases(tagged) ⇒ Hash
Given a POS-tagged text, this method returns only the maximal noun phrases.
- #get_noun_phrases(tagged) ⇒ Hash
Similar to get_words, but requires a POS-tagged text as an argument.
- #get_nouns(tagged) ⇒ Hash
Given a POS-tagged text, this method returns all nouns and their occurrence frequencies.
- #get_passive_verbs(tagged) ⇒ Hash
The hash of matches.
- #get_past_tense_verbs(tagged) ⇒ Hash
The hash of matches.
- #get_present_verbs(tagged) ⇒ Hash
The hash of matches.
- #get_proper_nouns(tagged) ⇒ Object
Given a POS-tagged text, this method returns a hash of all proper nouns and their occurrence frequencies.
- #get_readable(text, verbose = false) ⇒ Object
Return an easy-on-the-eyes tagged version of a text string.
- #get_sentences(text) ⇒ Object
Return an array of sentences (without POS tags) from a text.
- #get_superlative_adjectives(tagged) ⇒ Hash
The hash of matches.
- #get_verbs(tagged) ⇒ Hash
Returns all types of verbs and does not discriminate between the various kinds.
- #get_words(text) ⇒ Object
Given a text string, return as many nouns and noun phrases as possible.
- #initialize(params = {}) ⇒ EngTagger (constructor)
Take a hash of parameters that override default values.
- #install ⇒ Object
Reads some included corpus data and saves it in a stored hash on the local file system.
- #tag_pairs(text) ⇒ Array
Return an array of pairs of the form ["word", :tag].
Methods included from BoundedSpaceMemoizable
Constructor Details
#initialize(params = {}) ⇒ EngTagger
Take a hash of parameters that override default values. See above for details.
# File 'lib/engtagger.rb', line 192
def initialize(params = {})
  @conf = Hash.new
  @conf[:unknown_word_tag] = ''
  @conf[:stem] = false
  @conf[:weight_noun_phrases] = false
  @conf[:longest_noun_phrase] = 5
  @conf[:relax] = false
  @conf[:tag_lex] = 'tags.yml'
  @conf[:word_lex] = 'words.yml'
  @conf[:unknown_lex] = 'unknown.yml'
  @conf[:word_path] = DEFAULT_WORDPATH
  @conf[:tag_path] = DEFAULT_TAGPATH
  @conf[:debug] = false
  # assuming that we start analyzing from the beginning of a new sentence...
  @conf[:current_tag] = 'pp'
  @conf.merge!(params) if params
  unless File.exist?(@conf[:word_path]) and File.exist?(@conf[:tag_path])
    print "Couldn't locate POS lexicon, creating a new one" if @conf[:debug]
    @@hmm = Hash.new
    @@lexicon = Hash.new
  else
    lexf = File.open(@conf[:word_path], 'r')
    @@lexicon = Marshal.load(lexf)
    lexf.close
    hmmf = File.open(@conf[:tag_path], 'r')
    @@hmm = Marshal.load(hmmf)
    hmmf.close
  end
  @@mnp = get_max_noun_regex
end
Instance Attribute Details
#conf ⇒ Object
Hash storing config values:
- :unknown_word_tag => (String) Tag to assign to unknown words
- :stem => (Boolean) Stem single words using Porter module
- :weight_noun_phrases => (Boolean) When returning occurrence counts for a noun phrase, multiply the value by the number of words in the NP.
- :longest_noun_phrase => (Integer) Will ignore noun phrases longer than this threshold. This affects only the get_words() and get_nouns() methods.
- :relax => (Boolean) Relax the Hidden Markov Model: this may improve accuracy for uncommon words, particularly words used polysemously
- :tag_lex => (String) Name of the YAML file containing a hash of adjacent part of speech tags and the probability of each
- :word_lex => (String) Name of the YAML file containing a hash of words and corresponding parts of speech
- :unknown_lex => (String) Name of the YAML file containing a hash of tags for unknown words and corresponding parts of speech
- :tag_path => (String) Directory path of tag_lex
- :word_path => (String) Directory path of word_lex and unknown_lex
- :debug => (Boolean) Print debug messages
# File 'lib/engtagger.rb', line 184
def conf
  @conf
end
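The way caller-supplied parameters override these defaults can be sketched without the gem. The snippet below is a minimal illustration, assuming only a subset of the defaults set in the constructor; `SKETCH_DEFAULTS` and `build_conf` are hypothetical names and are not part of EngTagger.

```ruby
# Hypothetical subset of the defaults assigned in EngTagger#initialize.
SKETCH_DEFAULTS = {
  unknown_word_tag: '',
  stem: false,
  weight_noun_phrases: false,
  longest_noun_phrase: 5,
  relax: false,
  debug: false
}.freeze

# Merge caller-supplied params over the defaults, in the spirit of
# @conf.merge!(params) in the constructor. A nil argument leaves the
# defaults untouched.
def build_conf(params = {})
  SKETCH_DEFAULTS.merge(params || {})
end

conf = build_conf(longest_noun_phrase: 3, weight_noun_phrases: true)
```

Keys that are not overridden keep their default values, so `conf[:stem]` is still `false` here.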
Class Method Details
.explain_tag(tag) ⇒ String
Convert a Treebank-style, abbreviated tag into verbose definitions
# File 'lib/engtagger.rb', line 93
def self.explain_tag(tag)
  tag = tag.to_s.downcase
  if TAGS[tag]
    return TAGS[tag]
  else
    return tag
  end
end
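The lookup-with-fallback behaviour can be illustrated in isolation. This is only a sketch: `SAMPLE_TAGS` and `explain` are hypothetical, and the two definitions in the table are made up for the example, since the contents of the real TAGS constant are not shown in this document.

```ruby
# Hypothetical two-entry table; the real TAGS constant is larger and its
# values are not listed here.
SAMPLE_TAGS = { 'nn' => 'noun', 'jj' => 'adjective' }.freeze

# Normalise the tag to a lowercase string, then fall back to the
# normalised tag itself when no verbose definition is known.
def explain(tag)
  key = tag.to_s.downcase
  SAMPLE_TAGS.fetch(key, key)
end

known = explain(:NN)     # symbol and uppercase input are both accepted
unknown = explain('XYZ') # no entry, so the downcased tag comes back
```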
.get_ext(tag = nil) ⇒ Object
Return a regexp from a string argument that matches an XML-style pos tag
# File 'lib/engtagger.rb', line 51
def self.get_ext(tag = nil)
  return nil unless tag
  return Regexp.new("<#{tag}>[^<]+</#{tag}>\s*")
end
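To show what such a pattern matches, here is a standalone sketch. `tag_regexp` is a hypothetical stand-in written with a regexp literal rather than a call into the gem, and the tagged string is invented sample input.

```ruby
# Hypothetical stand-in for EngTagger.get_ext, using a regexp literal so
# that \s* means "optional trailing whitespace".
def tag_regexp(tag)
  return nil unless tag
  /<#{tag}>[^<]+<\/#{tag}>\s*/
end

tagged = '<det>The</det> <jjr>bigger</jjr> <nn>dog</nn> <vbz>barks</vbz>'

# The tag argument is itself a regexp fragment, so one pattern can cover
# a family of tags (nn, nns, nnp, ... or jj, jjr, jjs).
nouns = tagged.scan(tag_regexp('nn[sp]*'))
adjectives = tagged.scan(tag_regexp('jj[rs]*'))
```

Each match includes the surrounding tags and any trailing whitespace, which is why the extraction methods pass their matches through further trimming.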
.hmm ⇒ Hash
Return a class variable that holds probability data.
# File 'lib/engtagger.rb', line 38
def self.hmm
  return @@hmm
end
.lexicon ⇒ Hash
Return a class variable that holds lexical data.
# File 'lib/engtagger.rb', line 46
def self.lexicon
  return @@lexicon
end
Instance Method Details
#add_tags(text, verbose = false) ⇒ String
Examine the string provided and return it fully tagged in XML style.
# File 'lib/engtagger.rb', line 255
def add_tags(text, verbose = false)
  return nil unless valid_text(text)
  tagged = []
  words = clean_text(text)
  tags = Array.new
  words.each do |word|
    cleaned_word = clean_word(word)
    tag = assign_tag(@conf[:current_tag], cleaned_word)
    @conf[:current_tag] = tag = (tag and tag != "") ? tag : 'nn'
    tag = EngTagger.explain_tag(tag) if verbose
    tagged << '<' + tag + '>' + word + '</' + tag + '>'
  end
  reset
  return tagged.join(' ')
end
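The shape of the returned string can be sketched without running the tagger. The word/tag pairs below are hypothetical stand-ins for the tagger's decisions, and `wrap_pairs` is an illustrative helper, not a method of the class.

```ruby
# Hypothetical word/tag pairs standing in for the tagger's decisions.
pairs = [['The', 'det'], ['dog', 'nn'], ['barks', 'vbz'], ['.', 'pp']]

# Wrap each word in an opening and closing tag and join with single
# spaces, mirroring how add_tags assembles its result.
def wrap_pairs(pairs)
  pairs.map { |word, tag| "<#{tag}>#{word}</#{tag}>" }.join(' ')
end

xml = wrap_pairs(pairs)
```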
#get_adjectives(tagged) ⇒ Hash
# File 'lib/engtagger.rb', line 440
def get_adjectives(tagged)
  return nil unless valid_text(tagged)
  tags = [JJ]
  build_matches_hash(build_trimmed(tagged, tags))
end
#get_adverbs(tagged) ⇒ Hash
# File 'lib/engtagger.rb', line 470
def get_adverbs(tagged)
  return nil unless valid_text(tagged)
  tags = [RB, RBR, RBS, RP]
  build_matches_hash(build_trimmed(tagged, tags))
end
#get_base_present_verbs(tagged) ⇒ Hash
# File 'lib/engtagger.rb', line 420
def get_base_present_verbs(tagged)
  return nil unless valid_text(tagged)
  tags = [VBP]
  build_matches_hash(build_trimmed(tagged, tags))
end
#get_comparative_adjectives(tagged) ⇒ Hash
# File 'lib/engtagger.rb', line 450
def get_comparative_adjectives(tagged)
  return nil unless valid_text(tagged)
  tags = [JJR]
  build_matches_hash(build_trimmed(tagged, tags))
end
#get_conjunctions(tagged) ⇒ Hash
Returns all types of conjunctions and does not discriminate between the various kinds. E.g. coordinating, subordinating, correlative...
# File 'lib/engtagger.rb', line 497
def get_conjunctions(tagged)
  return nil unless valid_text(tagged)
  tags = [CC, IN]
  build_matches_hash(build_trimmed(tagged, tags))
end
#get_gerund_verbs(tagged) ⇒ Hash
# File 'lib/engtagger.rb', line 400
def get_gerund_verbs(tagged)
  return nil unless valid_text(tagged)
  tags = [VBG]
  build_matches_hash(build_trimmed(tagged, tags))
end
#get_infinitive_verbs(tagged) ⇒ Hash
# File 'lib/engtagger.rb', line 380
def get_infinitive_verbs(tagged)
  return nil unless valid_text(tagged)
  tags = [VB]
  build_matches_hash(build_trimmed(tagged, tags))
end
#get_interrogatives(tagged) ⇒ Hash Also known as: get_question_parts
# File 'lib/engtagger.rb', line 480
def get_interrogatives(tagged)
  return nil unless valid_text(tagged)
  tags = [WRB, WDT, WP, WPS]
  build_matches_hash(build_trimmed(tagged, tags))
end
#get_max_noun_phrases(tagged) ⇒ Hash
Given a POS-tagged text, this method returns only the maximal noun phrases.
May be called directly, but is also used by get_noun_phrases.
# File 'lib/engtagger.rb', line 509
def get_max_noun_phrases(tagged)
  return nil unless valid_text(tagged)
  tags = [@@mnp]
  mn_phrases = build_trimmed(tagged, tags)
  ret = Hash.new(0)
  mn_phrases.each do |p|
    p = stem(p) unless p =~ /\s/ # stem single words
    ret[p] += 1 unless p =~ /\A\s*\z/
  end
  return ret
end
#get_noun_phrases(tagged) ⇒ Hash
Similar to get_words, but requires a POS-tagged text as an argument.
# File 'lib/engtagger.rb', line 526
def get_noun_phrases(tagged)
  return nil unless valid_text(tagged)
  found = Hash.new(0)
  phrase_ext = /(?:#{PREP}|#{DET}|#{NUM})+/xo
  scanned = tagged.scan(@@mnp)
  # Find MNPs in the text, one sentence at a time
  # Record and split if the phrase is extended by a (?:PREP|DET|NUM)
  mn_phrases = []
  scanned.each do |m|
    found[m] += 1 if phrase_ext =~ m
    mn_phrases += m.split(phrase_ext)
  end
  mn_phrases.each do |mnp|
    # Split the phrase into an array of words, and create a loop for each word,
    # shortening the phrase by removing the word in the first position.
    # Record the phrase and any single nouns that are found
    words = mnp.split
    words.length.times do |i|
      found[words.join(' ')] += 1 if words.length > 1
      w = words.shift
      found[w] += 1 if w =~ /#{NN}/
    end
  end
  ret = Hash.new(0)
  found.keys.each do |f|
    k = strip_tags(f)
    v = found[f]
    # We weight by the word count to favor long noun phrases
    space_count = k.scan(/\s+/)
    word_count = space_count.length + 1
    # Throttle MNPs if necessary
    next if word_count > @conf[:longest_noun_phrase]
    k = stem(k) unless word_count > 1 # stem single words
    multiplier = 1
    multiplier = word_count if @conf[:weight_noun_phrases]
    ret[k] += multiplier * v
  end
  return ret
end
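The final loop applies two settings, :longest_noun_phrase and :weight_noun_phrases, to the raw counts. The sketch below isolates that step under simplifying assumptions: the input is already stripped of tags, stemming is omitted, and `weight_phrases` with its keyword arguments is a hypothetical helper rather than part of the class.

```ruby
# Apply the length threshold and the optional weighting to raw phrase
# counts. `counts` maps untagged phrases to occurrence counts
# (hypothetical input); stemming of single words is left out.
def weight_phrases(counts, longest: 5, weight: false)
  result = Hash.new(0)
  counts.each do |phrase, count|
    word_count = phrase.scan(/\s+/).length + 1
    next if word_count > longest          # throttle over-long phrases
    multiplier = weight ? word_count : 1  # favour longer phrases if asked
    result[phrase] += multiplier * count
  end
  result
end

counts = { 'dog' => 2, 'big dog' => 1, 'very big brown shaggy old dog' => 1 }
plain = weight_phrases(counts)
weighted = weight_phrases(counts, weight: true)
```

With the default threshold of 5 the six-word phrase is dropped in both results, and with weighting enabled the count for 'big dog' is doubled because the phrase has two words.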
#get_nouns(tagged) ⇒ Hash
Given a POS-tagged text, this method returns all nouns and their occurrence frequencies.
# File 'lib/engtagger.rb', line 356
def get_nouns(tagged)
  return nil unless valid_text(tagged)
  tags = [NN]
  build_matches_hash(build_trimmed(tagged, tags))
end
#get_passive_verbs(tagged) ⇒ Hash
# File 'lib/engtagger.rb', line 410
def get_passive_verbs(tagged)
  return nil unless valid_text(tagged)
  tags = [PART]
  build_matches_hash(build_trimmed(tagged, tags))
end
#get_past_tense_verbs(tagged) ⇒ Hash
# File 'lib/engtagger.rb', line 390
def get_past_tense_verbs(tagged)
  return nil unless valid_text(tagged)
  tags = [VBD]
  build_matches_hash(build_trimmed(tagged, tags))
end
#get_present_verbs(tagged) ⇒ Hash
# File 'lib/engtagger.rb', line 430
def get_present_verbs(tagged)
  return nil unless valid_text(tagged)
  tags = [VBZ]
  build_matches_hash(build_trimmed(tagged, tags))
end
#get_proper_nouns(tagged) ⇒ Object
Given a POS-tagged text, this method returns a hash of all proper nouns and their occurrence frequencies. The method is greedy and will return multi-word phrases, if possible, so it would find "Linguistic Data Consortium" as a single unit, rather than as three individual proper nouns. This method does not stem the found words.
# File 'lib/engtagger.rb', line 322
def get_proper_nouns(tagged)
  return nil unless valid_text(tagged)
  tags = [NNP]
  nnp = build_matches_hash(build_trimmed(tagged, tags))
  # Now for some fancy resolution stuff...
  nnp.keys.each do |key|
    words = key.split(/\s/)
    # Let's say this is an organization's name --
    # (and it's got at least three words)
    # is there a corresponding acronym in this hash?
    if words.length > 2
      # Make a (naive) acronym out of this name
      acronym = words.map do |word|
        /\A([a-z])[a-z]*\z/ =~ word
        $1
      end.join ''
      # If that acronym has been seen,
      # remove it and add the values to
      # the full name
      if nnp[acronym]
        nnp[key] += nnp[acronym]
        nnp.delete(acronym)
      end
    end
  end
  return nnp
end
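The acronym resolution step can be tried on its own. The sketch below assumes the keys are lowercase, which is what the regexp in the source appears to expect; `merge_acronyms` and the sample counts are hypothetical.

```ruby
# Fold an acronym's count into the matching multi-word name when the name
# has three or more words. Keys are assumed to be lowercase.
def merge_acronyms(counts)
  merged = counts.dup
  merged.keys.each do |name|
    words = name.split(/\s/)
    next unless words.length > 2
    # Naive acronym: first letter of each all-lowercase word.
    acronym = words.map { |w| w[/\A([a-z])[a-z]*\z/, 1] }.join
    next unless merged.key?(acronym)
    merged[name] += merged.delete(acronym)
  end
  merged
end

counts = { 'linguistic data consortium' => 2, 'ldc' => 3, 'london' => 1 }
resolved = merge_acronyms(counts)
```

Here the three occurrences of 'ldc' are added to the full name and the acronym's own entry disappears, while the unrelated single-word entry is left alone.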
#get_readable(text, verbose = false) ⇒ Object
Return an easy-on-the-eyes tagged version of a text string. Applies add_tags and reformats to be easier to read.
# File 'lib/engtagger.rb', line 290
def get_readable(text, verbose = false)
  return nil unless valid_text(text)
  tagged = add_tags(text, verbose)
  tagged = tagged.gsub(/<\w+>([^<]+|[<\w>]+)<\/(\w+)>/o) do
  #!!!# tagged = tagged.gsub(/<\w+>([^<]+)<\/(\w+)>/o) do
    $1 + '/' + $2.upcase
  end
end
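The reformatting amounts to rewriting each XML-style element as word/TAG. A minimal sketch, assuming already tagged input and using the simpler of the two patterns visible in the source; `readable` is a hypothetical helper name.

```ruby
# Rewrite each <tag>word</tag> element as word/TAG. The closing tag name
# is captured and upcased; the opening tag is only matched, not reused.
def readable(tagged)
  tagged.gsub(/<\w+>([^<]+)<\/(\w+)>/) { "#{$1}/#{$2.upcase}" }
end

out = readable('<det>The</det> <nn>dog</nn> <vbz>barks</vbz>')
```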
#get_sentences(text) ⇒ Object
Return an array of sentences (without POS tags) from a text.
# File 'lib/engtagger.rb', line 300
def get_sentences(text)
  return nil unless valid_text(text)
  tagged = add_tags(text)
  sentences = Array.new
  tagged.split(/<\/pp>/).each do |line|
    sentences << strip_tags(line)
  end
  sentences = sentences.map do |sentence|
    sentence.gsub(Regexp.new(" ('s?) ")){$1 + ' '}
    sentence.gsub(Regexp.new(" (\W+) ")){$1 + ' '}
    sentence.gsub(Regexp.new(" (`+) ")){' ' + $1}
    sentence.gsub(Regexp.new(" (\W+)$")){$1}
    sentence.gsub(Regexp.new("^(`+) ")){$1}
  end
  return sentences
end
#get_superlative_adjectives(tagged) ⇒ Hash
# File 'lib/engtagger.rb', line 460
def get_superlative_adjectives(tagged)
  return nil unless valid_text(tagged)
  tags = [JJS]
  build_matches_hash(build_trimmed(tagged, tags))
end
#get_verbs(tagged) ⇒ Hash
Returns all types of verbs and does not discriminate between the various kinds. Combines all other verb methods listed in this class.
# File 'lib/engtagger.rb', line 369
def get_verbs(tagged)
  return nil unless valid_text(tagged)
  tags = [VB, VBD, VBG, PART, VBP, VBZ]
  build_matches_hash(build_trimmed(tagged, tags))
end
#get_words(text) ⇒ Object
Given a text string, return as many nouns and noun phrases as possible. Applies add_tags and involves three stages:
- Tag the text
- Extract all the maximal noun phrases
- Recursively extract all noun phrases from the MNPs
# File 'lib/engtagger.rb', line 278
def get_words(text)
  return false unless valid_text(text)
  tagged = add_tags(text)
  if(@conf[:longest_noun_phrase] <= 1)
    return get_nouns(tagged)
  else
    return get_noun_phrases(tagged)
  end
end
#install ⇒ Object
Reads some included corpus data and saves it in a stored hash on the local file system. This is called automatically if the tagger can't find the stored lexicon.
# File 'lib/engtagger.rb', line 569
def install
  puts "Creating part-of-speech lexicon" if @conf[:debug]
  load_tags(@conf[:tag_lex])
  load_words(@conf[:word_lex])
  load_words(@conf[:unknown_lex])
  File.open(@conf[:word_path], 'w') do |f|
    Marshal.dump(@@lexicon, f)
  end
  File.open(@conf[:tag_path], 'w') do |f|
    Marshal.dump(@@hmm, f)
  end
end
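The stored-hash mechanism is a plain Marshal round trip. The sketch below shows the idea with a temporary directory and invented sample data; `save_hash` and `load_hash` are hypothetical helpers, and the files are opened in binary mode, which is an assumption of this sketch rather than what the source does.

```ruby
require 'tmpdir'

# Write a hash to disk with Marshal, opening the file in binary mode.
def save_hash(path, hash)
  File.open(path, 'wb') { |f| Marshal.dump(hash, f) }
end

# Read a marshalled hash back from disk.
def load_hash(path)
  File.open(path, 'rb') { |f| Marshal.load(f) }
end

lexicon = { 'dog' => { 'nn' => 10, 'vb' => 1 } } # hypothetical sample data
restored = nil
Dir.mktmpdir do |dir|
  path = File.join(dir, 'pos_words.hash')
  save_hash(path, lexicon)
  restored = load_hash(path)
end
```

Marshal.load can instantiate arbitrary objects, so it should only be used on files from a trusted source, such as a lexicon the tagger wrote itself.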
#tag_pairs(text) ⇒ Array
Return an array of pairs of the form ["word", :tag].
# File 'lib/engtagger.rb', line 232
def tag_pairs(text)
  return [] unless valid_text(text)
  out = clean_text(text).map do |word|
    cleaned_word = clean_word word
    tag = assign_tag(@conf[:current_tag], cleaned_word)
    @conf[:current_tag] = tag = (tag and !tag.empty?) ? tag : 'nn'
    [word, tag.to_sym]
  end
  # reset the tagger state
  reset
  out
end
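Because the result is an ordinary array of [word, symbol] pairs, it can be processed with standard Enumerable methods. The pairs below are hypothetical sample output, not the result of an actual tagger run.

```ruby
# Hypothetical output of tag_pairs for a short sentence.
pairs = [['The', :det], ['dog', :nn], ['chased', :vbd],
         ['the', :det], ['cat', :nn]]

# Group the words by their tag symbol, preserving order of appearance.
by_tag = pairs.each_with_object(Hash.new { |h, k| h[k] = [] }) do |(word, tag), acc|
  acc[tag] << word
end

nouns = by_tag[:nn]
```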