Class: ConfidentialInfoRedactorLite::Extractor
- Inherits: Object
  - Object
  - ConfidentialInfoRedactorLite::Extractor
- Defined in: lib/confidential_info_redactor_lite/extractor.rb
Overview
This class extracts proper nouns from a text.
Constant Summary
- EXTRACT_REGEX =
Rubular: rubular.com/r/qE0g4r9zR7
/(?<=\s|^|\s\"|\s\“|\s\«|\s\‹|\s\”|\s\»|\s\›)(\p{Lu}\S*\s)*\p{Lu}\S*(?=(\s|\.|\z))|(?<=\s|^|\s\"|\s\”|\s\»|\s\›|\s\“|\s\«|\s\‹)[i][A-Z][a-z]+/
- PUNCTUATION_REGEX =
/[\?\)\(\!\\\/\"\:\;\,\”\“\«\»\‹\›]/
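As a rough illustration of what the two constants above match, the sketch below applies them directly to strings. The sample sentences are made up for this example, and using PUNCTUATION_REGEX with `gsub` to strip characters is an assumption, since this page does not show where the library applies it. The raw EXTRACT_REGEX matches are only preliminary candidates; `#extract` processes them further.

```ruby
# Constants copied verbatim from the constant summary above.
EXTRACT_REGEX = /(?<=\s|^|\s\"|\s\“|\s\«|\s\‹|\s\”|\s\»|\s\›)(\p{Lu}\S*\s)*\p{Lu}\S*(?=(\s|\.|\z))|(?<=\s|^|\s\"|\s\”|\s\»|\s\›|\s\“|\s\«|\s\‹)[i][A-Z][a-z]+/
PUNCTUATION_REGEX = /[\?\)\(\!\\\/\"\:\;\,\”\“\«\»\‹\›]/

# Hypothetical sample sentence (not taken from the library's tests).
sample = 'we met John Smith in Paris with an iPhone today'

# String#gsub with only a pattern returns an Enumerator over the full matches.
# String#scan would return the capture groups instead, because the regex
# contains parenthesized groups.
raw_matches = sample.gsub(EXTRACT_REGEX).to_a

# Assumed usage: remove the listed punctuation characters from a string.
# Note that the period is not part of the character class.
stripped = 'Hello, "World"! (test)'.gsub(PUNCTUATION_REGEX, '')

puts raw_matches.inspect # ["John Smith", "Paris", "iPhone"]
puts stripped            # Hello World test
```

The first alternative of EXTRACT_REGEX picks up runs of capitalized words, while the second catches lowercase-i brand-style names such as "iPhone". A capitalized word at the start of a sentence would also match, which is one reason the raw matches need further filtering.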
Instance Attribute Summary
- #corpus ⇒ Object (readonly)
  Returns the value of attribute corpus.
- #language ⇒ Object (readonly)
  Returns the value of attribute language.
Instance Method Summary
- #extract(text) ⇒ Object
- #initialize(corpus:, **args) ⇒ Extractor (constructor)
  A new instance of Extractor.
Constructor Details
#initialize(corpus:, **args) ⇒ Extractor
Returns a new instance of Extractor.
# File 'lib/confidential_info_redactor_lite/extractor.rb', line 9
def initialize(corpus:, **args)
  @corpus = Set.new(corpus).freeze
  @language = args[:language] || 'en'
end
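The constructor's behavior can be sketched without loading the gem. `ExtractorSketch` below is a hypothetical stand-in that only copies the two assignments shown above; the sample corpus words are invented. It illustrates that the corpus is deduplicated into a frozen Set and that the language falls back to 'en' when none is passed.

```ruby
require 'set'

# Hypothetical stand-in that mirrors only the constructor shown above;
# it is not the library class itself.
class ExtractorSketch
  attr_reader :corpus, :language

  def initialize(corpus:, **args)
    @corpus = Set.new(corpus).freeze     # duplicates collapse, set is immutable
    @language = args[:language] || 'en'  # default language when none is given
  end
end

english = ExtractorSketch.new(corpus: %w[the and the])
german  = ExtractorSketch.new(corpus: [], language: 'de')

puts english.corpus.size     # 2
puts english.corpus.frozen?  # true
puts english.language        # en
puts german.language         # de
```

Because the Set is frozen, code that tries to add words to `corpus` after construction raises an error, so the word list has to be complete when the extractor is created.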
Instance Attribute Details
#corpus ⇒ Object (readonly)
Returns the value of attribute corpus.
# File 'lib/confidential_info_redactor_lite/extractor.rb', line 8
def corpus
  @corpus
end
#language ⇒ Object (readonly)
Returns the value of attribute language.
# File 'lib/confidential_info_redactor_lite/extractor.rb', line 8
def language
  @language
end
Instance Method Details
#extract(text) ⇒ Object
# File 'lib/confidential_info_redactor_lite/extractor.rb', line 14
def extract(text)
  extracted_terms = []
  PragmaticSegmenter::Segmenter.new(text: text.gsub(/[’‘]/, "'"), language: language).segment.each do |segment|
    initial_extracted_terms = extract_preliminary_terms(segment)
    search_ngrams(initial_extracted_terms, extracted_terms)
  end
  extracted_terms.map { |t| t.gsub(/\{\}/, '') }.delete_if { |t| t.length == 1 }.uniq.reject(&:empty?)
end
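The helpers `extract_preliminary_terms` and `search_ngrams` are not shown on this page, so the sketch below covers only the two steps that are visible in `#extract`: the quote normalization applied to the input and the cleanup chain applied to the collected terms. The input string and the `raw_terms` array are invented sample data.

```ruby
# Step 1: curly single quotes are normalized to straight apostrophes
# before the text is segmented.
normalized = "O’Neil’s ‘plan’".gsub(/[’‘]/, "'")

# Step 2: the cleanup chain from the last line of #extract, applied to a
# hypothetical list of collected terms.
raw_terms = ['John Smith', 'A', 'Paris{}', 'John Smith', '{}', 'Paris']

cleaned = raw_terms
          .map { |t| t.gsub(/\{\}/, '') }   # drop literal "{}" markers
          .delete_if { |t| t.length == 1 }  # discard single-character terms
          .uniq                             # remove duplicates
          .reject(&:empty?)                 # remove strings left empty

puts normalized       # O'Neil's 'plan'
puts cleaned.inspect  # ["John Smith", "Paris"]
```

The order of the chain matters: "Paris{}" only becomes a duplicate of "Paris" after the `{}` marker is removed, so `uniq` has to run after the `map`.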