Module: Ebooks::NLP

Defined in:
lib/twitter_ebooks/nlp.rb

Constant Summary

PUNCTUATION =

We deliberately limit our punctuation handling to stuff we can do consistently. It’ll just be a part of another token if we don’t split it out, and that’s fine.

".?!,"


Class Method Details

.adjectives ⇒ Array<String>

Lazily loads an array of known English adjectives

Returns:

  • (Array<String>)


# File 'lib/twitter_ebooks/nlp.rb', line 30

def self.adjectives
  @adjectives ||= File.read(File.join(DATA_PATH, 'adjectives.txt')).split
end

.htmlentities ⇒ HTMLEntities

Lazily load HTML entity decoder

Returns:

  • (HTMLEntities)


# File 'lib/twitter_ebooks/nlp.rb', line 44

def self.htmlentities
  require 'htmlentities'
  @htmlentities ||= HTMLEntities.new
end

.keywords(text) ⇒ Highscore::Keywords

Use the highscore gem to find interesting keywords in a corpus

Parameters:

  • text (String)

Returns:

  • (Highscore::Keywords)


# File 'lib/twitter_ebooks/nlp.rb', line 88

def self.keywords(text)
  # Preprocess to remove stopwords (highscore's blacklist is v. slow)
  text = NLP.tokenize(text).reject { |t| stopword?(t) }.join(' ')

  text = Highscore::Content.new(text)

  text.configure do
    #set :multiplier, 2
    #set :upper_case, 3
    #set :long_words, 2
    #set :long_words_threshold, 15
    #set :vowels, 1                     # => default: 0 = not considered
    #set :consonants, 5                 # => default: 0 = not considered
    #set :ignore_case, true             # => default: false
    set :word_pattern, /(?<!@)(?<=\s)[\w']+/           # => default: /\w+/
    #set :stemming, true                # => default: false
  end

  text.keywords
end
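
A minimal usage sketch (the corpus string is made up; top, text, and weight are part of the highscore gem's Keywords API):

keywords = Ebooks::NLP.keywords("some bots tweet nonsense, other bots tweet art")
keywords.top(3).each do |keyword|
  puts "#{keyword.text}: #{keyword.weight}"
end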

.normalize(text) ⇒ String

Normalize some strange unicode punctuation variants

Parameters:

  • text (String)

Returns:

  • (String)


# File 'lib/twitter_ebooks/nlp.rb', line 54

def self.normalize(text)
  htmlentities.decode text.gsub('“', '"').gsub('”', '"').gsub('’', "'").gsub('…', '...')
end
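
For example, curly punctuation becomes ASCII and HTML entities are decoded:

Ebooks::NLP.normalize("“Bots &amp; humans…”")
# => "\"Bots & humans...\""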

.nouns ⇒ Array<String>

Lazily loads an array of known English nouns

Returns:

  • (Array<String>)


# File 'lib/twitter_ebooks/nlp.rb', line 24

def self.nouns
  @nouns ||= File.read(File.join(DATA_PATH, 'nouns.txt')).split
end

.punctuation?(token) ⇒ Boolean

Is this token composed entirely of punctuation?

Parameters:

  • token (String)

Returns:

  • (Boolean)


# File 'lib/twitter_ebooks/nlp.rb', line 147

def self.punctuation?(token)
  (token.chars.to_set - PUNCTUATION.chars.to_set).empty?
end
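
A token qualifies only if every character is in PUNCTUATION:

Ebooks::NLP.punctuation?("?!")   # => true
Ebooks::NLP.punctuation?("foo.") # => false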

.reconstruct(tikis, tokens) ⇒ String

Builds a proper sentence from a list of tikis (token indices)

Parameters:

  • tikis (Array<Integer>)
  • tokens (Array<String>)

Returns:

  • (String)


# File 'lib/twitter_ebooks/nlp.rb', line 113

def self.reconstruct(tikis, tokens)
  text = ""
  last_token = nil
  tikis.each do |tiki|
    next if tiki == INTERIM
    token = tokens[tiki]
    text += ' ' if last_token && space_between?(last_token, token)
    text += token
    last_token = token
  end
  text
end
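
An illustrative example (the token list is made up; INTERIM entries, the sentence-boundary sentinel, are simply skipped):

tokens = ["Hello", ",", "world", "!"]
Ebooks::NLP.reconstruct([0, 1, 2, 3], tokens)
# => "Hello, world!"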

.sentences(text) ⇒ Array<String>

Split text into sentences. We use an ad hoc approach because fancy libraries do not deal especially well with tweet formatting, and we can fake solving the quote problem during generation.

Parameters:

  • text (String)

Returns:

  • (Array<String>)


# File 'lib/twitter_ebooks/nlp.rb', line 64

def self.sentences(text)
  text.split(/\n+|(?<=[.?!])\s+/)
end
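
For example:

Ebooks::NLP.sentences("Nice bot. Does it learn?\nSometimes!")
# => ["Nice bot.", "Does it learn?", "Sometimes!"]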

.space_between?(token1, token2) ⇒ Boolean

Determine if we need to insert a space between two tokens

Parameters:

  • token1 (String)
  • token2 (String)

Returns:

  • (Boolean)


# File 'lib/twitter_ebooks/nlp.rb', line 130

def self.space_between?(token1, token2)
  p1 = self.punctuation?(token1)
  p2 = self.punctuation?(token2)
  if p1 && p2 # "foo?!"
    false
  elsif !p1 && p2 # "foo."
    false
  elsif p1 && !p2 # "foo. rah"
    true
  else # "foo rah"
    true
  end
end
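
Concretely:

Ebooks::NLP.space_between?("foo", ".")   # => false ("foo.")
Ebooks::NLP.space_between?(".", "rah")   # => true  (". rah")
Ebooks::NLP.space_between?("foo", "rah") # => true  ("foo rah")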

.stem(word) ⇒ String

Get the ‘stem’ form of a word, e.g. ‘cats’ -> ‘cat’

Parameters:

  • word (String)

Returns:

  • (String)


# File 'lib/twitter_ebooks/nlp.rb', line 81

def self.stem(word)
  Stemmer::stem_word(word.downcase)
end
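
For example (assuming a Porter stemmer backend such as the fast-stemmer gem, which provides Stemmer::stem_word):

Ebooks::NLP.stem("Cats")    # => "cat"
Ebooks::NLP.stem("running") # => "run"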

.stopword?(token) ⇒ Boolean

Is this token a stopword?

Parameters:

  • token (String)

Returns:

  • (Boolean)


# File 'lib/twitter_ebooks/nlp.rb', line 154

def self.stopword?(token)
  @stopword_set ||= stopwords.map(&:downcase).to_set
  @stopword_set.include?(token.downcase)
end
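
The comparison is case-insensitive. Assuming 'the' appears in the bundled stopwords.txt, as in most stopword lists:

Ebooks::NLP.stopword?("The")   # => true
Ebooks::NLP.stopword?("robot") # => false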

.stopwords ⇒ Array<String>

Lazily loads an array of stopwords. Stopwords are common English words that should often be ignored.

Returns:

  • (Array<String>)


# File 'lib/twitter_ebooks/nlp.rb', line 18

def self.stopwords
  @stopwords ||= File.read(File.join(DATA_PATH, 'stopwords.txt')).split
end

.subseq?(a1, a2) ⇒ Boolean

Determine if a2 is a contiguous subsequence of a1

Parameters:

  • a1 (Array)
  • a2 (Array)

Returns:

  • (Boolean)


# File 'lib/twitter_ebooks/nlp.rb', line 189

def self.subseq?(a1, a2)
  !a1.each_index.find do |i|
    a1[i...i+a2.length] == a2
  end.nil?
end
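
Note that the match must be contiguous:

Ebooks::NLP.subseq?([1, 2, 3, 4], [2, 3]) # => true
Ebooks::NLP.subseq?([1, 2, 3, 4], [2, 4]) # => false (not contiguous)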

.tagger ⇒ EngTagger

Lazily load the part-of-speech tagging library. This can determine whether a word is being used as a noun/adjective/verb.

Returns:

  • (EngTagger)


# File 'lib/twitter_ebooks/nlp.rb', line 37

def self.tagger
  require 'engtagger'
  @tagger ||= EngTagger.new
end
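
A usage sketch (add_tags comes from the engtagger gem; the exact tag output shown is illustrative):

Ebooks::NLP.tagger.add_tags("the quick brown fox")
# => "<det>the</det> <jj>quick</jj> <jj>brown</jj> <nn>fox</nn>"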

.tokenize(sentence) ⇒ Array<String>

Split a sentence into word-level tokens. As above, this is ad hoc because tokenization libraries do not behave well with respect to things like emoticons and timestamps.

Parameters:

  • sentence (String)

Returns:

  • (Array<String>)


# File 'lib/twitter_ebooks/nlp.rb', line 73

def self.tokenize(sentence)
  regex = /\s+|(?<=[#{PUNCTUATION}]\s)(?=[a-zA-Z])|(?<=[a-zA-Z])(?=[#{PUNCTUATION}]+\s)/
  sentence.split(regex)
end
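
For example, sentence punctuation is split off while emoticons survive intact:

Ebooks::NLP.tokenize("i am a bot. hi :)")
# => ["i", "am", "a", "bot", ".", "hi", ":)"]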

.unmatched_enclosers?(text) ⇒ Boolean

Determine if a sample of text contains unmatched brackets or quotes. This is one of the more frequent and noticeable failure modes for the generator; we can just tell it to retry.

Parameters:

  • text (String)

Returns:

  • (Boolean)


# File 'lib/twitter_ebooks/nlp.rb', line 164

def self.unmatched_enclosers?(text)
  enclosers = ['**', '""', '()', '[]', '``', "''"]
  enclosers.each do |pair|
    starter = Regexp.new('(\W|^)' + Regexp.escape(pair[0]) + '\S')
    ender = Regexp.new('\S' + Regexp.escape(pair[1]) + '(\W|$)')

    opened = 0

    tokenize(text).each do |token|
      opened += 1 if token.match(starter)
      opened -= 1 if token.match(ender)

      return true if opened < 0 # Too many ends!
    end

    return true if opened != 0 # Mismatch somewhere.
  end

  false
end
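
For example, a dangling quote trips the check:

Ebooks::NLP.unmatched_enclosers?('I said "hi')  # => true
Ebooks::NLP.unmatched_enclosers?('I said "hi"') # => false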