Class: PragmaticTokenizer::PostProcessor

Inherits:
Object
Defined in:
lib/pragmatic_tokenizer/post_processor.rb

Constant Summary

DOT = '.'.freeze

Instance Attribute Summary

Instance Method Summary

Constructor Details

#initialize(text:, abbreviations:, downcase:) ⇒ PostProcessor

Returns a new instance of PostProcessor.



# File 'lib/pragmatic_tokenizer/post_processor.rb', line 8

def initialize(text:, abbreviations:, downcase:)
  @text            = text
  @abbreviations   = abbreviations
  @downcase        = downcase
end
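
The constructor takes three required keyword arguments, so omitting any of them raises an ArgumentError. The sketch below illustrates this with a hypothetical stand-in class, SketchPostProcessor, that only mirrors the signature shown above; it is not the library class. The argument values (a string, an array of abbreviations, a boolean) are assumptions, and the single attr_reader line is inferred from the three readers all being reported at line 6 of the file.

```ruby
# Hypothetical stand-in mirroring the documented constructor signature.
# It is NOT PragmaticTokenizer::PostProcessor and does no tokenizing.
class SketchPostProcessor
  # Assumption: the readers are declared together with attr_reader.
  attr_reader :text, :abbreviations, :downcase

  def initialize(text:, abbreviations:, downcase:)
    @text          = text
    @abbreviations = abbreviations
    @downcase      = downcase
  end
end

# All three keywords supplied: the readers return what was passed in.
processor = SketchPostProcessor.new(
  text:          "Hello world.",
  abbreviations: ["mr"],   # assumed shape; the real type is not documented here
  downcase:      true
)

# A required keyword left out: Ruby raises ArgumentError.
missing_error = nil
begin
  SketchPostProcessor.new(text: "x")
rescue ArgumentError => e
  missing_error = e.message
end
```

Because the attributes are read-only, any change to the input means building a new instance rather than mutating an existing one.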

Instance Attribute Details

#abbreviations ⇒ Object (readonly)

Returns the value of attribute abbreviations.



# File 'lib/pragmatic_tokenizer/post_processor.rb', line 6

def abbreviations
  @abbreviations
end

#downcase ⇒ Object (readonly)

Returns the value of attribute downcase.



# File 'lib/pragmatic_tokenizer/post_processor.rb', line 6

def downcase
  @downcase
end

#text ⇒ Object (readonly)

Returns the value of attribute text.



# File 'lib/pragmatic_tokenizer/post_processor.rb', line 6

def text
  @text
end

Instance Method Details

#call ⇒ Object

Each #flat_map call allocates a new intermediate array, which increases memory usage, so we should try to merge whatever steps can be merged. We need to run #split(Regex::ENDS_WITH_PUNCTUATION2) before AND after #split(Regex::VARIOUS); can this be fixed?



# File 'lib/pragmatic_tokenizer/post_processor.rb', line 16

def call
  text
      .split
      .map      { |token| convert_sym_to_punct(token) }
      .flat_map { |token| token.split(Regex::COMMAS_OR_PUNCTUATION) }
      .flat_map { |token| token.split(Regex::VARIOUS) }
      .flat_map { |token| token.split(Regex::ENDS_WITH_PUNCTUATION2) }
      .flat_map { |token| split_dotted_email_or_digit(token) }
      .flat_map { |token| split_abbreviations(token) }
      .flat_map { |token| split_period_after_last_word(token) }
end
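
One way the merging mentioned in the note might be approached is to run every split stage on a token inside a single #flat_map block, so that no full-length intermediate array is built between stages. The following is a minimal sketch under stated assumptions: the STAGES patterns are hypothetical placeholders, not the library's Regex constants, and the methods chained and merged are illustrative names only. Whether this actually lowers memory use would need to be measured, since the merged form still allocates small per-token arrays.

```ruby
# Hypothetical split patterns for illustration only; these are NOT the
# library's Regex::COMMAS_OR_PUNCTUATION, Regex::VARIOUS, etc.
# A capture group makes String#split keep the delimiter as its own token.
STAGES = [/(,)/, /(!)/].freeze

# Chained form, shaped like #call: each flat_map builds a new array
# covering every token before the next stage starts.
def chained(text)
  text.split
      .flat_map { |token| token.split(STAGES[0]) }
      .flat_map { |token| token.split(STAGES[1]) }
end

# Merged form: one outer flat_map; all stages are applied to a single
# token before moving on, so intermediate arrays stay token-sized.
def merged(text)
  text.split.flat_map do |token|
    STAGES.reduce([token]) do |parts, pattern|
      parts.flat_map { |part| part.split(pattern) }
    end
  end
end

sample = "Hi, there! ok"
chained_tokens = chained(sample)
merged_tokens  = merged(sample)
# Both forms yield ["Hi", ",", "there", "!", "ok"] for this sample.
```

The two forms produce the same tokens in the same order because each stage only ever looks at one token at a time; the merged form would stop being equivalent if a stage needed to see neighbouring tokens.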