Module: Ebooks::NLP
- Defined in:
- lib/twitter_ebooks/nlp.rb
Constant Summary
- PUNCTUATION =
We deliberately limit our punctuation handling to stuff we can do consistently. It’ll just be a part of another token if we don’t split it out, and that’s fine.
".?!,"
Class Method Summary
-
.adjectives ⇒ Array<String>
Lazily loads an array of known English adjectives.
-
.htmlentities ⇒ HTMLEntities
Lazily load HTML entity decoder.
-
.keywords(text) ⇒ Highscore::Keywords
Use highscore gem to find interesting keywords in a corpus.
-
.normalize(text) ⇒ String
Normalize some strange unicode punctuation variants.
-
.nouns ⇒ Array<String>
Lazily loads an array of known English nouns.
-
.punctuation?(token) ⇒ Boolean
Is this token composed entirely of punctuation?
-
.reconstruct(tikis, tokens) ⇒ String
Builds a proper sentence from a list of tikis.
-
.sentences(text) ⇒ Array<String>
Split text into sentences. We use an ad hoc approach because fancy libraries do not deal especially well with tweet formatting, and we can fake solving the quote problem during generation.
-
.space_between?(token1, token2) ⇒ Boolean
Determine if we need to insert a space between two tokens.
-
.stem(word) ⇒ String
Get the ‘stem’ form of a word, e.g. ‘cats’ -> ‘cat’.
-
.stopword?(token) ⇒ Boolean
Is this token a stopword?
-
.stopwords ⇒ Array<String>
Lazily loads an array of stopwords. Stopwords are common words that should often be ignored.
-
.subseq?(a1, a2) ⇒ Boolean
Determine if a2 is a subsequence of a1.
-
.tagger ⇒ EngTagger
Lazily load part-of-speech tagging library. This can determine whether a word is being used as a noun/adjective/verb.
-
.tokenize(sentence) ⇒ Array<String>
Split a sentence into word-level tokens. As above, this is ad hoc because tokenization libraries do not behave well with respect to things like emoticons and timestamps.
-
.unmatched_enclosers?(text) ⇒ Boolean
Determine if a sample of text contains unmatched brackets or quotes. This is one of the more frequent and noticeable failure modes for the generator; we can just tell it to retry.
Class Method Details
.adjectives ⇒ Array<String>
Lazily loads an array of known English adjectives
# File 'lib/twitter_ebooks/nlp.rb', line 31

def self.adjectives
  @adjectives ||= File.read(File.join(DATA_PATH, 'adjectives.txt')).split
end
.htmlentities ⇒ HTMLEntities
Lazily load HTML entity decoder
# File 'lib/twitter_ebooks/nlp.rb', line 45

def self.htmlentities
  @htmlentities ||= HTMLEntities.new
end
.keywords(text) ⇒ Highscore::Keywords
Use highscore gem to find interesting keywords in a corpus
# File 'lib/twitter_ebooks/nlp.rb', line 88

def self.keywords(text)
  # Preprocess to remove stopwords (highscore's blacklist is v. slow)
  text = NLP.tokenize(text).reject { |t| stopword?(t) }.join(' ')
  text = Highscore::Content.new(text)

  text.configure do
    #set :multiplier, 2
    #set :upper_case, 3
    #set :long_words, 2
    #set :long_words_threshold, 15
    #set :vowels, 1 # => default: 0 = not considered
    #set :consonants, 5 # => default: 0 = not considered
    #set :ignore_case, true # => default: false
    set :word_pattern, /(?<!@)(?<=\s)[\p{Word}']+/ # => default: /\w+/
    #set :stemming, true # => default: false
  end

  text.keywords
end
.normalize(text) ⇒ String
Normalize some strange unicode punctuation variants
# File 'lib/twitter_ebooks/nlp.rb', line 54

def self.normalize(text)
  htmlentities.decode text.gsub('“', '"').gsub('”', '"').gsub('’', "'").gsub('…', '...')
end
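A standalone sketch of just the punctuation-smoothing step (the HTML-entity decoding is left out, since it depends on the htmlentities gem; normalize_punctuation is a hypothetical name):

```ruby
# Collapse curly quotes and Unicode ellipses to their ASCII forms,
# as normalize does before handing off to the entity decoder.
def normalize_punctuation(text)
  text.gsub('“', '"').gsub('”', '"').gsub('’', "'").gsub('…', '...')
end

normalize_punctuation("“Don’t stop…”") # => "\"Don't stop...\""
```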
.nouns ⇒ Array<String>
Lazily loads an array of known English nouns
# File 'lib/twitter_ebooks/nlp.rb', line 25

def self.nouns
  @nouns ||= File.read(File.join(DATA_PATH, 'nouns.txt')).split
end
.punctuation?(token) ⇒ Boolean
Is this token composed entirely of punctuation?
# File 'lib/twitter_ebooks/nlp.rb', line 147

def self.punctuation?(token)
  (token.chars.to_set - PUNCTUATION.chars.to_set).empty?
end
.reconstruct(tikis, tokens) ⇒ String
Builds a proper sentence from a list of tikis
# File 'lib/twitter_ebooks/nlp.rb', line 113

def self.reconstruct(tikis, tokens)
  text = ""
  last_token = nil
  tikis.each do |tiki|
    next if tiki == INTERIM
    token = tokens[tiki]
    text += ' ' if last_token && space_between?(last_token, token)
    text += token
    last_token = token
  end
  text
end
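The spacing behaviour can be seen end-to-end in a standalone sketch that drops the tiki indirection and joins plain token strings directly (join_tokens and punctuation_only? are hypothetical stand-ins for reconstruct and NLP.punctuation?):

```ruby
require 'set'

PUNCT = ".?!,".chars.to_set

# True if the token is made up solely of sentence punctuation.
def punctuation_only?(token)
  (token.chars.to_set - PUNCT).empty?
end

# Join tokens, inserting a space except before punctuation — the same
# effect reconstruct achieves via space_between?.
def join_tokens(tokens)
  text = ""
  last = nil
  tokens.each do |token|
    text += ' ' if last && !punctuation_only?(token)
    text += token
    last = token
  end
  text
end

join_tokens(["hello", ",", "world", "!"]) # => "hello, world!"
```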
.sentences(text) ⇒ Array<String>
Split text into sentences. We use an ad hoc approach because fancy libraries do not deal especially well with tweet formatting, and we can fake solving the quote problem during generation.
# File 'lib/twitter_ebooks/nlp.rb', line 64

def self.sentences(text)
  text.split(/\n+|(?<=[.?!])\s+/)
end
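The split rule in isolation: break on runs of newlines, or on whitespace that follows a sentence-ending mark (split_sentences is a hypothetical name for the sketch):

```ruby
# Split on newlines, or on whitespace preceded by '.', '?', or '!'.
def split_sentences(text)
  text.split(/\n+|(?<=[.?!])\s+/)
end

split_sentences("Hi there! How are you?\nFine.")
# => ["Hi there!", "How are you?", "Fine."]
```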
.space_between?(token1, token2) ⇒ Boolean
Determine if we need to insert a space between two tokens
# File 'lib/twitter_ebooks/nlp.rb', line 130

def self.space_between?(token1, token2)
  p1 = self.punctuation?(token1)
  p2 = self.punctuation?(token2)

  if p1 && p2 # "foo?!"
    false
  elsif !p1 && p2 # "foo."
    false
  elsif p1 && !p2 # "foo. rah"
    true
  else # "foo rah"
    true
  end
end
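Note that the four branches collapse to a single rule: insert a space unless the second token is pure punctuation. A standalone sketch of that reduced rule (punctuation_only? is a hypothetical stand-in for NLP.punctuation?):

```ruby
require 'set'

PUNCT = ".?!,".chars.to_set

# True if the token is made up solely of sentence punctuation.
def punctuation_only?(token)
  (token.chars.to_set - PUNCT).empty?
end

# Equivalent to the truth table above: only the second token matters.
def space_between?(token1, token2)
  !punctuation_only?(token2)
end

space_between?("foo", "rah") # => true
space_between?("foo", "!")   # => false
```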
.stem(word) ⇒ String
Get the ‘stem’ form of a word, e.g. ‘cats’ -> ‘cat’.
# File 'lib/twitter_ebooks/nlp.rb', line 81

def self.stem(word)
  Stemmer::stem_word(word.downcase)
end
.stopword?(token) ⇒ Boolean
Is this token a stopword?
# File 'lib/twitter_ebooks/nlp.rb', line 154

def self.stopword?(token)
  @stopword_set ||= stopwords.map(&:downcase).to_set
  @stopword_set.include?(token.downcase)
end
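A standalone sketch of the memoized, case-insensitive lookup, with an inline word list standing in for the stopwords.txt file the module actually reads:

```ruby
require 'set'

# Hypothetical inline stand-in for the contents of stopwords.txt.
STOPWORDS = %w[the a an of and]

# Memoize a downcased Set so repeated lookups are O(1).
def stopword?(token)
  @stopword_set ||= STOPWORDS.map(&:downcase).to_set
  @stopword_set.include?(token.downcase)
end

stopword?("The") # => true
stopword?("cat") # => false
```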
.stopwords ⇒ Array<String>
Lazily loads an array of stopwords. Stopwords are common words that should often be ignored.
# File 'lib/twitter_ebooks/nlp.rb', line 19

def self.stopwords
  @stopwords ||= File.exists?('stopwords.txt') ? File.read('stopwords.txt').split : []
end
.subseq?(a1, a2) ⇒ Boolean
Determine if a2 is a subsequence of a1
# File 'lib/twitter_ebooks/nlp.rb', line 189

def self.subseq?(a1, a2)
  !a1.each_index.find do |i|
    a1[i...i+a2.length] == a2
  end.nil?
end
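The double negative (!find…nil?) reads more directly as an any? over every window of a1; a standalone equivalent:

```ruby
# True if a2 appears as a contiguous run inside a1.
def subseq?(a1, a2)
  a1.each_index.any? { |i| a1[i, a2.length] == a2 }
end

subseq?([1, 2, 3, 4], [2, 3]) # => true
subseq?([1, 2, 3, 4], [3, 2]) # => false
```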
.tagger ⇒ EngTagger
Lazily load part-of-speech tagging library. This can determine whether a word is being used as a noun/adjective/verb.
# File 'lib/twitter_ebooks/nlp.rb', line 38

def self.tagger
  require 'engtagger'
  @tagger ||= EngTagger.new
end
.tokenize(sentence) ⇒ Array<String>
Split a sentence into word-level tokens. As above, this is ad hoc because tokenization libraries do not behave well with respect to things like emoticons and timestamps.
# File 'lib/twitter_ebooks/nlp.rb', line 73

def self.tokenize(sentence)
  regex = /\s+|(?<=[#{PUNCTUATION}]\s)(?=[a-zA-Z])|(?<=[a-zA-Z])(?=[#{PUNCTUATION}]+\s)/
  sentence.split(regex)
end
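The regex in isolation: split on whitespace, and additionally at the zero-width seam between a word and punctuation that is followed by a space, so trailing marks become their own tokens:

```ruby
PUNCT = ".?!,"

# Whitespace, or the zero-width boundary between a letter and
# punctuation-then-space (and vice versa), becomes a token boundary.
TOKEN_SPLIT = /\s+|(?<=[#{PUNCT}]\s)(?=[a-zA-Z])|(?<=[a-zA-Z])(?=[#{PUNCT}]+\s)/

"nice, huh? yeah".split(TOKEN_SPLIT)
# => ["nice", ",", "huh", "?", "yeah"]
```

Punctuation with no following space (e.g. a sentence-final "!") stays attached to its word, which is exactly the "part of another token" behaviour the PUNCTUATION comment describes.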
.unmatched_enclosers?(text) ⇒ Boolean
Determine if a sample of text contains unmatched brackets or quotes. This is one of the more frequent and noticeable failure modes for the generator; we can just tell it to retry.
# File 'lib/twitter_ebooks/nlp.rb', line 164

def self.unmatched_enclosers?(text)
  enclosers = ['**', '""', '()', '[]', '``', "''"]
  enclosers.each do |pair|
    starter = Regexp.new('(\W|^)' + Regexp.escape(pair[0]) + '\S')
    ender = Regexp.new('\S' + Regexp.escape(pair[1]) + '(\W|$)')

    opened = 0

    tokenize(text).each do |token|
      opened += 1 if token.match(starter)
      opened -= 1 if token.match(ender)

      return true if opened < 0 # Too many ends!
    end

    return true if opened != 0 # Mismatch somewhere.
  end

  false
end
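A standalone sketch of the counting loop, with a plain whitespace split standing in for the module's tokenize and a reduced encloser list (unbalanced? is a hypothetical name):

```ruby
# Count openers and closers per pair; any pair that ends unbalanced,
# or closes more than it has opened, flags the text.
def unbalanced?(text)
  ['()', '[]', '""'].each do |pair|
    starter = Regexp.new('(\W|^)' + Regexp.escape(pair[0]) + '\S')
    ender   = Regexp.new('\S' + Regexp.escape(pair[1]) + '(\W|$)')

    opened = 0
    text.split(/\s+/).each do |token|
      opened += 1 if token.match(starter)
      opened -= 1 if token.match(ender)
      return true if opened < 0 # A closer before any opener
    end
    return true if opened != 0  # An opener never closed
  end
  false
end

unbalanced?("a (broken sentence") # => true
unbalanced?("a (fine) sentence")  # => false
```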