Module: Ebooks::NLP

Defined in:
lib/bot_twitter_ebooks/nlp.rb

Constant Summary

PUNCTUATION =

We deliberately limit our punctuation handling to what we can do consistently. It’ll just be part of another token if we don’t split it out, and that’s fine.

".¿?¡!,"

Class Method Summary

Class Method Details

.adjectives ⇒ Array<String>

Lazily loads an array of known English adjectives

Returns:

  • (Array<String>)


# File 'lib/bot_twitter_ebooks/nlp.rb', line 31

def self.adjectives
  @adjectives ||= File.read(File.join(DATA_PATH, 'adjectives.txt')).split
end
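
A quick usage sketch; the word list loads from adjectives.txt under DATA_PATH on first access and is memoized, and the sampled word below is purely illustrative:

# Reads the file once, then reuses the cached array
Ebooks::NLP.adjectives.sample # => e.g. "happy"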

.htmlentities ⇒ HTMLEntities

Lazily loads the HTML entity decoder

Returns:

  • (HTMLEntities)


# File 'lib/bot_twitter_ebooks/nlp.rb', line 45

def self.htmlentities
  @htmlentities ||= HTMLEntities.new
end

.keywords(text) ⇒ Highscore::Keywords

Use the highscore gem to find interesting keywords in a corpus

Parameters:

  • text (String)

Returns:

  • (Highscore::Keywords)


# File 'lib/bot_twitter_ebooks/nlp.rb', line 88

def self.keywords(text)
  # Preprocess to remove stopwords and urls (highscore's blacklist is v. slow)
  text = NLP.tokenize(text).reject do |t|
    t.downcase.start_with?('http') || stopword?(t)
  end

  text = Highscore::Content.new(text.join(' '))

  text.configure do
    #set :multiplier, 2
    #set :upper_case, 3
    #set :long_words, 2
    #set :long_words_threshold, 15
    #set :vowels, 1                     # => default: 0 = not considered
    #set :consonants, 5                 # => default: 0 = not considered
    set :ignore_case, true             # => default: false
    set :word_pattern, /(?<!@)(?<=\s)[\p{Word}']+/           # => default: /\w+/
    #set :stemming, true                # => default: false
  end

  text.keywords
end
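
A usage sketch; the corpus string is illustrative, and top/text/weight are accessors on the returned Highscore objects:

content = "ruby bots write strange ruby tweets about ruby"
Ebooks::NLP.keywords(content).top(2).each do |keyword|
  puts "#{keyword.text}: #{keyword.weight}"
end
# "ruby" should rank highly: it repeats and is not a stopword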

.normalize(text) ⇒ String

Normalize some strange unicode punctuation variants

Parameters:

  • text (String)

Returns:

  • (String)


# File 'lib/bot_twitter_ebooks/nlp.rb', line 54

def self.normalize(text)
  htmlentities.decode text.gsub('“', '"').gsub('”', '"').gsub('’', "'").gsub('…', '...')
end
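
For instance, curly quotes, ellipses, and HTML entities all collapse to plain ASCII forms (a minimal sketch):

Ebooks::NLP.normalize("“hello” &amp; goodbye…")
# => "\"hello\" & goodbye..."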

.nouns ⇒ Array<String>

Lazily loads an array of known English nouns

Returns:

  • (Array<String>)


# File 'lib/bot_twitter_ebooks/nlp.rb', line 25

def self.nouns
  @nouns ||= File.read(File.join(DATA_PATH, 'nouns.txt')).split
end

.punctuation?(token) ⇒ Boolean

Is this token composed entirely of punctuation?

Parameters:

  • token (String)

Returns:

  • (Boolean)


# File 'lib/bot_twitter_ebooks/nlp.rb', line 149

def self.punctuation?(token)
  (token.chars.to_set - PUNCTUATION.chars.to_set).empty?
end
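
A couple of illustrative calls:

Ebooks::NLP.punctuation?("?!")   # => true  (every character is in PUNCTUATION)
Ebooks::NLP.punctuation?("foo.") # => false (contains letters)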

.reconstruct(tikis, tokens) ⇒ String

Builds a proper sentence from a list of tikis (token indices)

Parameters:

  • tikis (Array<Integer>)
  • tokens (Array<String>)

Returns:

  • (String)


# File 'lib/bot_twitter_ebooks/nlp.rb', line 115

def self.reconstruct(tikis, tokens)
  text = ""
  last_token = nil
  tikis.each do |tiki|
    next if tiki == INTERIM
    token = tokens[tiki]
    text += ' ' if last_token && space_between?(last_token, token)
    text += token
    last_token = token
  end
  text
end
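
A minimal sketch of mapping tikis back to text; each tiki indexes into the tokens array, and any INTERIM sentinel entries are skipped:

tokens = ["hi", ",", "how", "are", "you?"]
Ebooks::NLP.reconstruct([0, 1, 2, 3, 4], tokens)
# => "hi, how are you?" (no space inserted before the comma)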

.sentences(text) ⇒ Array<String>

Split text into sentences. We use an ad hoc approach because fancy libraries do not deal especially well with tweet formatting, and we can fake solving the quote problem during generation.

Parameters:

  • text (String)

Returns:

  • (Array<String>)


# File 'lib/bot_twitter_ebooks/nlp.rb', line 64

def self.sentences(text)
  text.split(/\n+|(?<=[.?!])\s+/)
end
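
An illustrative split on newlines and terminal punctuation:

Ebooks::NLP.sentences("First line.\nIs this real? Yes! Okay.")
# => ["First line.", "Is this real?", "Yes!", "Okay."]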

.space_between?(token1, token2) ⇒ Boolean

Determine if we need to insert a space between two tokens

Parameters:

  • token1 (String)
  • token2 (String)

Returns:

  • (Boolean)


# File 'lib/bot_twitter_ebooks/nlp.rb', line 132

def self.space_between?(token1, token2)
  p1 = self.punctuation?(token1)
  p2 = self.punctuation?(token2)
  if p1 && p2 # "foo?!"
    false
  elsif !p1 && p2 # "foo."
    false
  elsif p1 && !p2 # "foo. rah"
    true
  else # "foo rah"
    true
  end
end
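
In practice this means punctuation attaches leftward, as in these illustrative calls:

Ebooks::NLP.space_between?("foo", "bar") # => true  ("foo bar")
Ebooks::NLP.space_between?("foo", ".")   # => false ("foo.")
Ebooks::NLP.space_between?("?", "!")     # => false ("?!")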

.stem(word) ⇒ String

Get the ‘stem’ form of a word, e.g. ‘cats’ -> ‘cat’

Parameters:

  • word (String)

Returns:

  • (String)


# File 'lib/bot_twitter_ebooks/nlp.rb', line 81

def self.stem(word)
  Stemmer::stem_word(word.downcase)
end
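
Illustrative calls; note that input is downcased before stemming:

Ebooks::NLP.stem("Cats")    # => "cat"
Ebooks::NLP.stem("running") # => "run"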

.stopword?(token) ⇒ Boolean

Is this token a stopword?

Parameters:

  • token (String)

Returns:

  • (Boolean)


# File 'lib/bot_twitter_ebooks/nlp.rb', line 156

def self.stopword?(token)
  @stopword_set ||= stopwords.map(&:downcase).to_set
  @stopword_set.include?(token.downcase)
end
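
A sketch assuming a stopwords.txt in the working directory that contains the word "the" (see .stopwords below); matching is case-insensitive:

Ebooks::NLP.stopword?("The")   # => true
Ebooks::NLP.stopword?("zebra") # => false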

.stopwords ⇒ Array<String>

Lazily loads an array of stopwords. Stopwords are common words that should often be ignored.

Returns:

  • (Array<String>)


# File 'lib/bot_twitter_ebooks/nlp.rb', line 19

def self.stopwords
  @stopwords ||= File.exist?('stopwords.txt') ? File.read('stopwords.txt').split : []
end

.subseq?(a1, a2) ⇒ Boolean

Determine if a2 is a contiguous subsequence of a1

Parameters:

  • a1 (Array)
  • a2 (Array)

Returns:

  • (Boolean)


# File 'lib/bot_twitter_ebooks/nlp.rb', line 191

def self.subseq?(a1, a2)
  !a1.each_index.find do |i|
    a1[i...i+a2.length] == a2
  end.nil?
end
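
Illustrative calls; the match must be a contiguous run, not a scattered subsequence:

Ebooks::NLP.subseq?([1, 2, 3, 4], [2, 3]) # => true
Ebooks::NLP.subseq?([1, 2, 3, 4], [2, 4]) # => false (not contiguous)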

.tagger ⇒ EngTagger

Lazily loads the part-of-speech tagging library. This can determine whether a word is being used as a noun, adjective, or verb.

Returns:

  • (EngTagger)


# File 'lib/bot_twitter_ebooks/nlp.rb', line 38

def self.tagger
  require 'engtagger'
  @tagger ||= EngTagger.new
end
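
A usage sketch with EngTagger's add_tags; the exact tag vocabulary comes from engtagger, so the output shown is approximate:

tagger = Ebooks::NLP.tagger
tagger.add_tags("the dog runs")
# => e.g. "<det>the</det> <nn>dog</nn> <vbz>runs</vbz>"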

.tokenize(sentence) ⇒ Array<String>

Split a sentence into word-level tokens. As above, this is ad hoc because tokenization libraries do not behave well with things like emoticons and timestamps.

Parameters:

  • sentence (String)

Returns:

  • (Array<String>)


# File 'lib/bot_twitter_ebooks/nlp.rb', line 73

def self.tokenize(sentence)
  regex = /\s+|(?<=[#{PUNCTUATION}]\s)(?=[a-zA-Z])|(?<=[a-zA-Z])(?=[#{PUNCTUATION}]+\s)/
  sentence.split(regex)
end
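
An illustrative call; note that sentence-final punctuation stays attached to its word, since the split pattern requires trailing whitespace after the punctuation:

Ebooks::NLP.tokenize("hi, how are you?")
# => ["hi", ",", "how", "are", "you?"]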

.unmatched_enclosers?(text) ⇒ Boolean

Determine if a sample of text contains unmatched brackets or quotes. This is one of the more frequent and noticeable failure modes for the generator; we can just tell it to retry.

Parameters:

  • text (String)

Returns:

  • (Boolean)


# File 'lib/bot_twitter_ebooks/nlp.rb', line 166

def self.unmatched_enclosers?(text)
  enclosers = ['**', '""', '()', '[]', '``', "''"]
  enclosers.each do |pair|
    starter = Regexp.new('(\W|^)' + Regexp.escape(pair[0]) + '\S')
    ender = Regexp.new('\S' + Regexp.escape(pair[1]) + '(\W|$)')

    opened = 0

    tokenize(text).each do |token|
      opened += 1 if token.match(starter)
      opened -= 1 if token.match(ender)

      return true if opened < 0 # Too many ends!
    end

    return true if opened != 0 # Mismatch somewhere.
  end

  false
end
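
Two illustrative calls:

Ebooks::NLP.unmatched_enclosers?('he said "hi')     # => true  (dangling open quote)
Ebooks::NLP.unmatched_enclosers?('he said "hi" ok') # => false (quotes balance)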