Module: Ebooks::NLP
- Defined in:
- lib/twitter_ebooks/nlp.rb
Constant Summary
- PUNCTUATION =
We deliberately limit our punctuation handling to stuff we can do consistently. It’ll just be a part of another token if we don’t split it out, and that’s fine.
".?!,"
Class Method Summary
-
.adjectives ⇒ Array<String>
Lazily loads an array of known English adjectives.
-
.htmlentities ⇒ HTMLEntities
Lazily load HTML entity decoder.
-
.keywords(text) ⇒ Highscore::Keywords
Use highscore gem to find interesting keywords in a corpus.
-
.normalize(text) ⇒ String
Normalize some strange unicode punctuation variants.
-
.nouns ⇒ Array<String>
Lazily loads an array of known English nouns.
-
.punctuation?(token) ⇒ Boolean
Is this token composed entirely of punctuation?
-
.reconstruct(tikis, tokens) ⇒ String
Builds a proper sentence from a list of tikis.
-
.sentences(text) ⇒ Array<String>
Split text into sentences. We use an ad hoc approach because fancy libraries do not deal especially well with tweet formatting, and we can fake solving the quote problem during generation.
-
.space_between?(token1, token2) ⇒ Boolean
Determine if we need to insert a space between two tokens.
-
.stem(word) ⇒ String
Get the ‘stem’ form of a word, e.g. ‘cats’ -> ‘cat’.
-
.stopword?(token) ⇒ Boolean
Is this token a stopword?
-
.stopwords ⇒ Array<String>
Lazily loads an array of stopwords. Stopwords are common words that should often be ignored.
-
.subseq?(a1, a2) ⇒ Boolean
Determine if a2 is a subsequence of a1.
-
.tagger ⇒ EngTagger
Lazily load part-of-speech tagging library. This can determine whether a word is being used as a noun/adjective/verb.
-
.tokenize(sentence) ⇒ Array<String>
Split a sentence into word-level tokens. As above, this is ad hoc because tokenization libraries do not behave well with respect to things like emoticons and timestamps.
-
.unmatched_enclosers?(text) ⇒ Boolean
Determine if a sample of text contains unmatched brackets or quotes. This is one of the more frequent and noticeable failure modes for the generator; we can just tell it to retry.
Class Method Details
.adjectives ⇒ Array<String>
Lazily loads an array of known English adjectives
# File 'lib/twitter_ebooks/nlp.rb', line 31

def self.adjectives
  @adjectives ||= File.read(File.join(DATA_PATH, 'adjectives.txt')).split
end
.htmlentities ⇒ HTMLEntities
Lazily load HTML entity decoder
# File 'lib/twitter_ebooks/nlp.rb', line 45

def self.htmlentities
  @htmlentities ||= HTMLEntities.new
end
.keywords(text) ⇒ Highscore::Keywords
Use highscore gem to find interesting keywords in a corpus
# File 'lib/twitter_ebooks/nlp.rb', line 88

def self.keywords(text)
  # Preprocess to remove stopwords (highscore's blacklist is v. slow)
  text = NLP.tokenize(text).reject { |t| stopword?(t) }.join(' ')
  text = Highscore::Content.new(text)

  text.configure do
    #set :multiplier, 2
    #set :upper_case, 3
    #set :long_words, 2
    #set :long_words_threshold, 15
    #set :vowels, 1 # => default: 0 = not considered
    #set :consonants, 5 # => default: 0 = not considered
    #set :ignore_case, true # => default: false
    set :word_pattern, /(?<!@)(?<=\s)[\p{Word}']+/ # => default: /\w+/
    #set :stemming, true # => default: false
  end

  text.keywords
end
.normalize(text) ⇒ String
Normalize some strange unicode punctuation variants
# File 'lib/twitter_ebooks/nlp.rb', line 54

def self.normalize(text)
  htmlentities.decode text.gsub('“', '"').gsub('”', '"').gsub('’', "'").gsub('…', '...')
end
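A standalone sketch of just the punctuation-smoothing step (the HTML-entity decoding is left out, since it depends on the htmlentities gem; normalize_punctuation is a hypothetical name):

```ruby
# Collapse curly quotes and Unicode ellipses to their ASCII forms,
# as normalize does before handing off to the entity decoder.
def normalize_punctuation(text)
  text.gsub('“', '"').gsub('”', '"').gsub('’', "'").gsub('…', '...')
end

normalize_punctuation("“Don’t stop…”") # => "\"Don't stop...\""
```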
.nouns ⇒ Array<String>
Lazily loads an array of known English nouns
# File 'lib/twitter_ebooks/nlp.rb', line 25

def self.nouns
  @nouns ||= File.read(File.join(DATA_PATH, 'nouns.txt')).split
end
.punctuation?(token) ⇒ Boolean
Is this token composed entirely of punctuation?
# File 'lib/twitter_ebooks/nlp.rb', line 147

def self.punctuation?(token)
  (token.chars.to_set - PUNCTUATION.chars.to_set).empty?
end
.reconstruct(tikis, tokens) ⇒ String
Builds a proper sentence from a list of tikis
# File 'lib/twitter_ebooks/nlp.rb', line 113

def self.reconstruct(tikis, tokens)
  text = ""
  last_token = nil
  tikis.each do |tiki|
    next if tiki == INTERIM
    token = tokens[tiki]
    text += ' ' if last_token && space_between?(last_token, token)
    text += token
    last_token = token
  end
  text
end
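The spacing behaviour can be seen end-to-end in a standalone sketch that drops the tiki indirection and joins plain token strings directly (join_tokens and punctuation_only? are hypothetical stand-ins for reconstruct and NLP.punctuation?):

```ruby
require 'set'

PUNCT = ".?!,".chars.to_set

# True if the token is made up solely of sentence punctuation.
def punctuation_only?(token)
  (token.chars.to_set - PUNCT).empty?
end

# Join tokens, inserting a space except before punctuation — the same
# effect reconstruct achieves via space_between?.
def join_tokens(tokens)
  text = ""
  last = nil
  tokens.each do |token|
    text += ' ' if last && !punctuation_only?(token)
    text += token
    last = token
  end
  text
end

join_tokens(["hello", ",", "world", "!"]) # => "hello, world!"
```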
.sentences(text) ⇒ Array<String>
Split text into sentences. We use an ad hoc approach because fancy libraries do not deal especially well with tweet formatting, and we can fake solving the quote problem during generation.
# File 'lib/twitter_ebooks/nlp.rb', line 64

def self.sentences(text)
  text.split(/\n+|(?<=[.?!])\s+/)
end
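The split rule in isolation: break on runs of newlines, or on whitespace that follows a sentence-ending mark (split_sentences is a hypothetical name for the sketch):

```ruby
# Split on newlines, or on whitespace preceded by '.', '?', or '!'.
def split_sentences(text)
  text.split(/\n+|(?<=[.?!])\s+/)
end

split_sentences("Hi there! How are you?\nFine.")
# => ["Hi there!", "How are you?", "Fine."]
```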
.space_between?(token1, token2) ⇒ Boolean
Determine if we need to insert a space between two tokens
# File 'lib/twitter_ebooks/nlp.rb', line 130

def self.space_between?(token1, token2)
  p1 = self.punctuation?(token1)
  p2 = self.punctuation?(token2)

  if p1 && p2 # "foo?!"
    false
  elsif !p1 && p2 # "foo."
    false
  elsif p1 && !p2 # "foo. rah"
    true
  else # "foo rah"
    true
  end
end
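Note that the four branches collapse to a single rule: insert a space unless the second token is pure punctuation. A standalone sketch of that reduced rule (punctuation_only? is a hypothetical stand-in for NLP.punctuation?):

```ruby
require 'set'

PUNCT = ".?!,".chars.to_set

# True if the token is made up solely of sentence punctuation.
def punctuation_only?(token)
  (token.chars.to_set - PUNCT).empty?
end

# Equivalent to the truth table above: only the second token matters.
def space_between?(token1, token2)
  !punctuation_only?(token2)
end

space_between?("foo", "rah") # => true
space_between?("foo", "!")   # => false
```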
.stem(word) ⇒ String
Get the ‘stem’ form of a word, e.g. ‘cats’ -> ‘cat’.
# File 'lib/twitter_ebooks/nlp.rb', line 81

def self.stem(word)
  Stemmer::stem_word(word.downcase)
end
.stopword?(token) ⇒ Boolean
Is this token a stopword?
# File 'lib/twitter_ebooks/nlp.rb', line 154

def self.stopword?(token)
  @stopword_set ||= stopwords.map(&:downcase).to_set
  @stopword_set.include?(token.downcase)
end
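A standalone sketch of the memoized, case-insensitive lookup, with an inline word list standing in for the stopwords.txt file the module actually reads:

```ruby
require 'set'

# Hypothetical inline stand-in for the contents of stopwords.txt.
STOPWORDS = %w[the a an of and]

# Memoize a downcased Set so repeated lookups are O(1).
def stopword?(token)
  @stopword_set ||= STOPWORDS.map(&:downcase).to_set
  @stopword_set.include?(token.downcase)
end

stopword?("The") # => true
stopword?("cat") # => false
```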
.stopwords ⇒ Array<String>
Lazily loads an array of stopwords. Stopwords are common words that should often be ignored.
# File 'lib/twitter_ebooks/nlp.rb', line 19

def self.stopwords
  @stopwords ||= File.exists?('stopwords.txt') ? File.read('stopwords.txt').split : []
end
.subseq?(a1, a2) ⇒ Boolean
Determine if a2 is a subsequence of a1
# File 'lib/twitter_ebooks/nlp.rb', line 189

def self.subseq?(a1, a2)
  !a1.each_index.find do |i|
    a1[i...i+a2.length] == a2
  end.nil?
end
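The double negative (!find…nil?) reads more directly as an any? over every window of a1; a standalone equivalent:

```ruby
# True if a2 appears as a contiguous run inside a1.
def subseq?(a1, a2)
  a1.each_index.any? { |i| a1[i, a2.length] == a2 }
end

subseq?([1, 2, 3, 4], [2, 3]) # => true
subseq?([1, 2, 3, 4], [3, 2]) # => false
```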
.tagger ⇒ EngTagger
Lazily load part-of-speech tagging library. This can determine whether a word is being used as a noun/adjective/verb.
# File 'lib/twitter_ebooks/nlp.rb', line 38

def self.tagger
  require 'engtagger'
  @tagger ||= EngTagger.new
end
.tokenize(sentence) ⇒ Array<String>
Split a sentence into word-level tokens. As above, this is ad hoc because tokenization libraries do not behave well with respect to things like emoticons and timestamps.
# File 'lib/twitter_ebooks/nlp.rb', line 73

def self.tokenize(sentence)
  regex = /\s+|(?<=[#{PUNCTUATION}]\s)(?=[a-zA-Z])|(?<=[a-zA-Z])(?=[#{PUNCTUATION}]+\s)/
  sentence.split(regex)
end
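The regex in isolation: split on whitespace, and additionally at the zero-width seam between a word and punctuation that is followed by a space, so trailing marks become their own tokens:

```ruby
PUNCT = ".?!,"

# Whitespace, or the zero-width boundary between a letter and
# punctuation-then-space (and vice versa), becomes a token boundary.
TOKEN_SPLIT = /\s+|(?<=[#{PUNCT}]\s)(?=[a-zA-Z])|(?<=[a-zA-Z])(?=[#{PUNCT}]+\s)/

"nice, huh? yeah".split(TOKEN_SPLIT)
# => ["nice", ",", "huh", "?", "yeah"]
```

Punctuation with no following space (e.g. a sentence-final "!") stays attached to its word, which is exactly the "part of another token" behaviour the PUNCTUATION comment describes.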
.unmatched_enclosers?(text) ⇒ Boolean
Determine if a sample of text contains unmatched brackets or quotes. This is one of the more frequent and noticeable failure modes for the generator; we can just tell it to retry.
# File 'lib/twitter_ebooks/nlp.rb', line 164

def self.unmatched_enclosers?(text)
  enclosers = ['**', '""', '()', '[]', '``', "''"]
  enclosers.each do |pair|
    starter = Regexp.new('(\W|^)' + Regexp.escape(pair[0]) + '\S')
    ender = Regexp.new('\S' + Regexp.escape(pair[1]) + '(\W|$)')

    opened = 0

    tokenize(text).each do |token|
      opened += 1 if token.match(starter)
      opened -= 1 if token.match(ender)

      return true if opened < 0 # Too many ends!
    end

    return true if opened != 0 # Mismatch somewhere.
  end

  false
end
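A standalone sketch of the counting loop, with a plain whitespace split standing in for the module's tokenize and a reduced encloser list (unbalanced? is a hypothetical name):

```ruby
# Count openers and closers per pair; any pair that ends unbalanced,
# or closes more than it has opened, flags the text.
def unbalanced?(text)
  ['()', '[]', '""'].each do |pair|
    starter = Regexp.new('(\W|^)' + Regexp.escape(pair[0]) + '\S')
    ender   = Regexp.new('\S' + Regexp.escape(pair[1]) + '(\W|$)')

    opened = 0
    text.split(/\s+/).each do |token|
      opened += 1 if token.match(starter)
      opened -= 1 if token.match(ender)
      return true if opened < 0 # A closer before any opener
    end
    return true if opened != 0  # An opener never closed
  end
  false
end

unbalanced?("a (broken sentence") # => true
unbalanced?("a (fine) sentence")  # => false
```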