Class: PragmaticTokenizer::PostProcessor

Inherits: Object

Object
PragmaticTokenizer::PostProcessor

Defined in: lib/pragmatic_tokenizer/post_processor.rb

Constant Summary

DOT = '.'.freeze

Instance Attribute Summary

#abbreviations ⇒ Object readonly

Returns the value of attribute abbreviations.
#downcase ⇒ Object readonly

Returns the value of attribute downcase.
#text ⇒ Object readonly

Returns the value of attribute text.

Instance Method Summary

#call ⇒ Object

Every #flat_map increases memory usage, so we should try to merge whatever can be merged. We need to run #split(Regex::ENDS_WITH_PUNCTUATION2) before AND after #split(Regex::VARIOUS); can this be fixed?
#initialize(text:, abbreviations:, downcase:) ⇒ PostProcessor constructor

A new instance of PostProcessor.

Constructor Details

#initialize(text:, abbreviations:, downcase:) ⇒ `PostProcessor`

Returns a new instance of PostProcessor.

# File 'lib/pragmatic_tokenizer/post_processor.rb', line 8

def initialize(text:, abbreviations:, downcase:)
  @text            = text
  @abbreviations   = abbreviations
  @downcase        = downcase
end
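
The constructor takes three required keyword arguments and exposes each through a read-only attribute. The sketch below uses a hypothetical stand-in class, `PostProcessorSketch`, that mirrors only the documented interface; it is not the gem's source. The sample text and the use of a `Set` for `abbreviations` are assumptions made for illustration.

```ruby
require 'set'

# Hypothetical stand-in that mirrors the documented constructor and readers.
class PostProcessorSketch
  # Readers only: no writer methods are defined, matching "readonly" above.
  attr_reader :text, :abbreviations, :downcase

  def initialize(text:, abbreviations:, downcase:)
    @text          = text
    @abbreviations = abbreviations
    @downcase      = downcase
  end
end

processor = PostProcessorSketch.new(
  text:          'Hello Dr. Smith.',
  abbreviations: Set.new(%w[dr mr mrs]), # assumed shape, not confirmed by the docs
  downcase:      true
)

processor.text      # reads back the string passed in
processor.downcase  # reads back the flag passed in
```

Because all three keywords are required, omitting any of them raises an `ArgumentError` at construction time.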

Instance Attribute Details

#abbreviations ⇒ `Object` (readonly)

Returns the value of attribute abbreviations.



# File 'lib/pragmatic_tokenizer/post_processor.rb', line 6

def abbreviations
  @abbreviations
end

#downcase ⇒ `Object` (readonly)

Returns the value of attribute downcase.



# File 'lib/pragmatic_tokenizer/post_processor.rb', line 6

def downcase
  @downcase
end

#text ⇒ `Object` (readonly)

Returns the value of attribute text.



# File 'lib/pragmatic_tokenizer/post_processor.rb', line 6

def text
  @text
end

Instance Method Details

#call ⇒ `Object`

Every #flat_map increases memory usage, so we should try to merge whatever can be merged. We need to run #split(Regex::ENDS_WITH_PUNCTUATION2) before AND after #split(Regex::VARIOUS); can this be fixed?

# File 'lib/pragmatic_tokenizer/post_processor.rb', line 16

def call
  text
      .split
      .map      { |token| convert_sym_to_punct(token) }
      .flat_map { |token| token.split(Regex::COMMAS_OR_PUNCTUATION) }
      .flat_map { |token| token.split(Regex::VARIOUS) }
      .flat_map { |token| token.split(Regex::ENDS_WITH_PUNCTUATION2) }
      .flat_map { |token| split_dotted_email_or_digit(token) }
      .flat_map { |token| split_abbreviations(token) }
      .flat_map { |token| split_period_after_last_word(token) }
end
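
To illustrate the memory concern raised in the note above, here is a minimal sketch that compares a chained `#flat_map` pipeline with a merged single pass. The two patterns and both method names are hypothetical stand-ins, not the gem's `Regex` constants or methods, and the real pipeline has more stages.

```ruby
# Hypothetical patterns for illustration only; the gem's Regex constants differ.
COMMA_SKETCH          = /(,)/          # capture group keeps the comma as a token
TRAILING_PUNCT_SKETCH = /(?=[!?]\z)/   # zero-width split before a final ! or ?

# Chained form: each #flat_map builds a new intermediate array of all tokens.
def tokenize_chained(text)
  text
    .split
    .flat_map { |token| token.split(COMMA_SKETCH) }
    .flat_map { |token| token.split(TRAILING_PUNCT_SKETCH) }
end

# Merged form: both splits run per token and append to one result array,
# so no full-size intermediate array is built between the stages.
def tokenize_merged(text)
  text.split.each_with_object([]) do |token, out|
    token.split(COMMA_SKETCH).each do |part|
      out.concat(part.split(TRAILING_PUNCT_SKETCH))
    end
  end
end

tokenize_chained('hello,world foo!') # => ["hello", ",", "world", "foo", "!"]
tokenize_merged('hello,world foo!')  # => ["hello", ",", "world", "foo", "!"]
```

Merging this way preserves the order in which the splits are applied to each token. It only works when a later stage does not need to see tokens produced from neighbouring input tokens, which is the case for per-token splits like these.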