Class: TactfulTokenizer::Doc

Inherits:
Object
Defined in:
lib/tactful_tokenizer.rb

Overview

A document represents the input text. It holds a list of fragments generated from the text.

Constructor Details

#initialize(text) ⇒ Doc

Receives a text, which is then broken into fragments. A fragment ends with a period, question mark, or exclamation mark, optionally followed by right-handed punctuation such as quotation marks or closing braces, and then by trailing whitespace. Failing that, it will accept something like “I hate cheese\n”: it has no period, but it is the end of a paragraph.

Input assumption: Paragraphs delimited by line breaks.


# File 'lib/tactful_tokenizer.rb', line 134

def initialize(text)
  @frags = []
  res = nil
  text.each_line do |line|
    unless line.strip.empty?
      line.split(/(.*?[.!?](?:[”"')\]}]|(?:<.*>))*[[:space:]])/u).each do |res|
        unless res.strip.empty?
          frag = Frag.new(res)
          @frags.last.next = frag.cleaned.first unless @frags.empty?
          @frags.push frag
        end
      end
    end
  end
end
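
To make the splitting rule concrete, here is a minimal sketch that applies the same pattern to a made-up input, outside the Doc class. The constant FRAG_PATTERN, the helper split_fragments, and the sample text are assumptions introduced for illustration; they are not part of the library, and the sketch keeps raw strings instead of building Frag objects.

```ruby
# Same pattern as in Doc#initialize, copied here so the sketch is self-contained.
FRAG_PATTERN = /(.*?[.!?](?:[”"')\]}]|(?:<.*>))*[[:space:]])/u

# Hypothetical helper: returns the non-blank pieces that the constructor
# would wrap in Frag objects.
def split_fragments(text)
  pieces = []
  text.each_line do |line|
    next if line.strip.empty?            # skip blank lines between paragraphs
    line.split(FRAG_PATTERN).each do |piece|
      # split also yields empty strings between adjacent matches; drop them
      pieces << piece unless piece.strip.empty?
    end
  end
  pieces
end

# Made-up sample text with an abbreviation and an unterminated paragraph.
fragments = split_fragments("Dr. Smith left. \"Why?\" she asked\n\nI hate cheese\n")
fragments.each { |f| puts f.inspect }
```

Note that "Dr. " comes out as its own fragment: the pattern splits on every candidate boundary, and deciding which candidates are real sentence ends is left to #segment.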

Instance Attribute Details

#frags ⇒ Object

List of fragments.


# File 'lib/tactful_tokenizer.rb', line 125

def frags
  @frags
end

Instance Method Details

#segment ⇒ Object

Segments the text. More precisely, it reassembles the fragments into sentences. A fragment closes a sentence whenever it is more likely to be a sentence end than not, that is, whenever its pred score exceeds the threshold of 0.5.


# File 'lib/tactful_tokenizer.rb', line 152

def segment
  sents, sent = [], []
  thresh = 0.5

  frag = nil
  @frags.each do |frag|
    sent.push(frag.orig)
    if frag.pred && frag.pred > thresh
      break if frag.orig.nil?
      sents.push(sent.join('').strip)
      sent = []
    end
  end
  sents
end
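
The reassembly loop can be tried in isolation with stand-in fragments. This is a rough sketch, not the library's code path: StubFrag and reassemble are hypothetical names, the pred scores are invented rather than produced by the real classifier, and the early break on a nil orig is left out for brevity.

```ruby
# Hypothetical stand-in for TactfulTokenizer::Frag with only the two
# attributes that the loop reads.
StubFrag = Struct.new(:orig, :pred)

# Joins consecutive fragments until one scores above the threshold,
# then emits the accumulated text as one sentence.
def reassemble(frags, thresh = 0.5)
  sents = []
  sent = []
  frags.each do |frag|
    sent.push(frag.orig)
    if frag.pred && frag.pred > thresh
      sents.push(sent.join.strip)
      sent = []
    end
  end
  sents
end

# Invented scores: low after the abbreviation and the quoted question,
# high at the two real sentence ends.
frags = [
  StubFrag.new("Dr. ", 0.1),
  StubFrag.new("Smith left. ", 0.9),
  StubFrag.new("\"Why?\" ", 0.2),
  StubFrag.new("she asked\n", 0.8)
]
sentences = reassemble(frags)
sentences.each { |s| puts s }
```

With these scores the four fragments collapse into two sentences. In this loop, any fragments that follow the last above-threshold fragment are accumulated but never emitted.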