Class: TactfulTokenizer::Doc
- Inherits: Object
  - Object
  - TactfulTokenizer::Doc
- Defined in: lib/tactful_tokenizer.rb
Overview
A document represents the input text. It holds a list of fragments generated from the text.
Instance Attribute Summary
- #frags ⇒ Object
  List of fragments.
Instance Method Summary
- #initialize(text) ⇒ Doc (constructor)
  Receives a text, which is then broken into fragments.
- #segment ⇒ Object
  Segments the text.
Constructor Details
#initialize(text) ⇒ Doc
Receives a text, which is then broken into fragments. A fragment ends with a period, question mark, or exclamation mark, possibly followed by right-hand punctuation such as quotation marks or closing brackets, and then trailing whitespace. Failing that, it will accept something like “I hate cheese\n”: there is no period, but the line break marks the end of the paragraph.
Input assumption: Paragraphs delimited by line breaks.
# File 'lib/tactful_tokenizer.rb', line 143
def initialize(text)
  @frags = []
  res = nil
  puts "Hey!"
  puts text.inspect
  text.each_line do |line|
    unless line.strip.empty?
      line.split(/(.*?[.!?](?:["')\]}]|(?:<.*>))*[\s])/).each do |res|
        unless res.strip.empty?
          frag = Frag.new(res)
          @frags.last.next = frag.cleaned.first unless @frags.empty?
          @frags.push frag
        end
      end
    end
  end
end
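To make the splitting rule concrete, here is a minimal sketch that applies the same regular expression to a short text. The constant `FRAGMENT_BOUNDARY` and the helper `split_into_fragments` are hypothetical names introduced for illustration; they are not part of TactfulTokenizer, and the sketch returns plain strings rather than `Frag` objects.

```ruby
# Regular expression copied from the constructor source.
# FRAGMENT_BOUNDARY is a hypothetical name used only in this sketch.
FRAGMENT_BOUNDARY = /(.*?[.!?](?:["')\]}]|(?:<.*>))*[\s])/

# Hypothetical helper (not part of TactfulTokenizer): returns the raw
# fragment strings for a text whose paragraphs are separated by line breaks.
def split_into_fragments(text)
  fragments = []
  text.each_line do |line|
    next if line.strip.empty?
    # Because the pattern is a capture group, String#split keeps each
    # matched fragment in the result; the empty strings between matches
    # are filtered out below.
    line.split(FRAGMENT_BOUNDARY).each do |piece|
      fragments << piece unless piece.strip.empty?
    end
  end
  fragments
end

fragments = split_into_fragments("Hello there. \"Is it you?\" I hate cheese\n")
fragments.each { |f| puts f.inspect }
```

With this input the sketch should yield three fragments: one ending in a period, one ending in a question mark followed by a closing quotation mark, and a final one that has no terminal punctuation but ends the paragraph.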
Instance Attribute Details
#frags ⇒ Object
List of fragments.
# File 'lib/tactful_tokenizer.rb', line 134
def frags
  @frags
end
Instance Method Details
#segment ⇒ Object
Segments the text. More precisely, it reassembles the fragments into sentences. A fragment closes a sentence whenever it is more likely than not to end one, that is, whenever its prediction exceeds the threshold of 0.5.
# File 'lib/tactful_tokenizer.rb', line 163
def segment
  sents, sent = [], []
  thresh = 0.5

  frag = nil
  @frags.each do |frag|
    sent.push(frag.orig)
    if frag.pred > thresh
      break if frag.orig.nil?
      sents.push(sent.join('').strip)
      sent = []
    end
  end
  sents
end
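The reassembly step in #segment can be tried in isolation with a small sketch. `ScoredFragment` is a hypothetical stand-in for `Frag` that models only the two attributes #segment reads (`orig` and `pred`), `join_fragments` is a hypothetical free-standing version of the method, and the prediction values are invented for illustration rather than produced by the real classifier.

```ruby
# Hypothetical stand-in for TactfulTokenizer::Frag: only the original
# text and the boundary prediction are modeled.
ScoredFragment = Struct.new(:orig, :pred)

# Hypothetical free-standing version of #segment. Fragments accumulate
# until one has a prediction above the threshold, which closes the sentence.
def join_fragments(frags, thresh = 0.5)
  sents = []
  sent = []
  frags.each do |frag|
    sent.push(frag.orig)
    if frag.pred > thresh
      break if frag.orig.nil?
      sents.push(sent.join('').strip)
      sent = []
    end
  end
  sents
end

# Invented predictions: "Dr. " is unlikely to end a sentence,
# the other two fragments are likely to.
frags = [
  ScoredFragment.new("Dr. ", 0.1),
  ScoredFragment.new("Smith arrived. ", 0.9),
  ScoredFragment.new("He was late.\n", 0.95)
]
sentences = join_fragments(frags)
sentences.each { |s| puts s }
```

Under these assumed predictions the first two fragments are joined into one sentence and the third becomes its own. Note that, with this logic, fragments that follow the last above-threshold fragment are accumulated but never appended to the result.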