Class: PragmaticTokenizer::PostProcessor
- Inherits: Object
  - Object
  - PragmaticTokenizer::PostProcessor
- Defined in: lib/pragmatic_tokenizer/post_processor.rb
Constant Summary
- DOT = '.'.freeze
Instance Attribute Summary
- #abbreviations ⇒ Object (readonly)
  Returns the value of attribute abbreviations.
- #downcase ⇒ Object (readonly)
  Returns the value of attribute downcase.
- #text ⇒ Object (readonly)
  Returns the value of attribute text.
Instance Method Summary
- #call ⇒ Object
  Every #flat_map pass increases memory usage, so we should try to merge whatever can be merged. We need to run #split(Regex::ENDS_WITH_PUNCTUATION2) before AND after #split(Regex::VARIOUS); can this be fixed?
- #initialize(text:, abbreviations:, downcase:) ⇒ PostProcessor (constructor)
  Returns a new instance of PostProcessor.
Constructor Details
#initialize(text:, abbreviations:, downcase:) ⇒ PostProcessor
Returns a new instance of PostProcessor.
    # File 'lib/pragmatic_tokenizer/post_processor.rb', line 8
    def initialize(text:, abbreviations:, downcase:)
      @text = text
      @abbreviations = abbreviations
      @downcase = downcase
    end
Instance Attribute Details
#abbreviations ⇒ Object (readonly)
Returns the value of attribute abbreviations.
    # File 'lib/pragmatic_tokenizer/post_processor.rb', line 6
    def abbreviations
      @abbreviations
    end
#downcase ⇒ Object (readonly)
Returns the value of attribute downcase.
    # File 'lib/pragmatic_tokenizer/post_processor.rb', line 6
    def downcase
      @downcase
    end
#text ⇒ Object (readonly)
Returns the value of attribute text.
    # File 'lib/pragmatic_tokenizer/post_processor.rb', line 6
    def text
      @text
    end
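The constructor and readers documented above follow a common Ruby pattern: required keyword arguments stored in instance variables and exposed through read-only accessors. The sketch below uses a hypothetical stand-in class, SketchPostProcessor, rather than the gem's own class, and the sample values (including passing abbreviations as an Array) are assumptions made only for illustration.

```ruby
# Hypothetical stand-in that mirrors the documented constructor signature.
# It is NOT the gem's PragmaticTokenizer::PostProcessor.
class SketchPostProcessor
  # attr_reader generates the same read-only methods shown in the
  # attribute details (no setters are defined).
  attr_reader :text, :abbreviations, :downcase

  def initialize(text:, abbreviations:, downcase:)
    @text          = text
    @abbreviations = abbreviations
    @downcase      = downcase
  end
end

# Sample values are assumptions; the real abbreviations collection may differ.
processor = SketchPostProcessor.new(
  text:          "Mr. Smith arrived.",
  abbreviations: ["mr"],
  downcase:      false
)
puts processor.text

# All three keywords are required: omitting any raises ArgumentError.
begin
  SketchPostProcessor.new(text: "hi")
rescue ArgumentError => e
  puts "rejected: #{e.message}"
end
```

Because the keywords have no defaults, a caller cannot accidentally construct a processor without stating how abbreviations and downcasing should be handled.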
Instance Method Details
#call ⇒ Object
Every #flat_map pass increases memory usage, so we should try to merge whatever can be merged. We need to run #split(Regex::ENDS_WITH_PUNCTUATION2) before AND after #split(Regex::VARIOUS); can this be fixed?
    # File 'lib/pragmatic_tokenizer/post_processor.rb', line 16
    def call
      text
        .split
        .map { |token| convert_sym_to_punct(token) }
        .flat_map { |token| token.split(Regex::COMMAS_OR_PUNCTUATION) }
        .flat_map { |token| token.split(Regex::VARIOUS) }
        .flat_map { |token| token.split(Regex::ENDS_WITH_PUNCTUATION2) }
        .flat_map { |token| split_dotted_email_or_digit(token) }
        .flat_map { |token| split_abbreviations(token) }
        .flat_map { |token| split_period_after_last_word(token) }
    end
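The memory concern in the note on #call comes from each eager #flat_map building a complete intermediate Array before the next pass starts. One possible mitigation, shown here only as a sketch, is to route the same passes through Enumerator::Lazy so that each token streams through every pass and only the final Array is materialized. The two patterns below (COMMA and TRAILING_PUNCT) are hypothetical stand-ins, not the gem's Regex constants, and this is not how the gem is implemented.

```ruby
# Hypothetical patterns for illustration only; NOT the gem's Regex constants.
COMMA          = /(?=,)|(?<=,)/ # zero-width split on either side of a comma
TRAILING_PUNCT = /(?=[!?]\z)/   # zero-width split before a final ! or ?

# Eager chain: every #flat_map allocates a full intermediate Array.
def eager_tokens(text)
  text
    .split
    .flat_map { |token| token.split(COMMA) }
    .flat_map { |token| token.split(TRAILING_PUNCT) }
end

# Lazy chain: each token flows through all passes before the next token
# is read; only the Array produced by #to_a is materialized.
def lazy_tokens(text)
  text
    .split
    .lazy
    .flat_map { |token| token.split(COMMA) }
    .flat_map { |token| token.split(TRAILING_PUNCT) }
    .to_a
end

p eager_tokens("Hello, world!") # => ["Hello", ",", "world", "!"]
p lazy_tokens("Hello, world!") == eager_tokens("Hello, world!") # => true
```

Lazy enumerators carry their own per-element overhead, so whether this helps in practice would need to be benchmarked on representative input.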