Class: ConfidentialInfoRedactorLite::Extractor

Inherits:
Object
Defined in:
lib/confidential_info_redactor_lite/extractor.rb

Overview

This class extracts proper nouns from a text.

Constant Summary

EXTRACT_REGEX =
/(?<=\s|^|\s\"|\s\“|\s\«|\s\‹|\s\”|\s\»|\s\›)(\p{Lu}\S*\s)*\p{Lu}\S*(?=(\s|\.|\z))|(?<=\s|^|\s\"|\s\”|\s\»|\s\›|\s\“|\s\«|\s\‹)[i][A-Z][a-z]+/
PUNCTUATION_REGEX =
/[\?\)\(\!\\\/\"\:\;\,\”\“\«\»\‹\›]/
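
To make the two constants easier to read, here is a minimal sketch of the kind of text they match. It uses simplified stand-ins (`SIMPLE_EXTRACT_REGEX` and `SIMPLE_PUNCTUATION_REGEX`, both hypothetical names): the lookbehind is reduced to whitespace or start of line, and the groups are non-capturing so that `String#scan` returns whole matches. How the library itself applies `PUNCTUATION_REGEX` is not shown on this page, so stripping with `gsub` is only an assumption for illustration.

```ruby
# Simplified stand-in for EXTRACT_REGEX (hypothetical name, not part of the gem).
# First alternative: a run of consecutive words that each start with an
# uppercase letter. Second alternative: words shaped like "iPhone".
SIMPLE_EXTRACT_REGEX = /(?<=\s|^)(?:\p{Lu}\S*\s)*\p{Lu}\S*(?=\s|\.|\z)|(?<=\s|^)i[A-Z][a-z]+/

# Stand-in for PUNCTUATION_REGEX: same characters, fewer escapes.
SIMPLE_PUNCTUATION_REGEX = /[?)(!\\\/":;,”“«»‹›]/

# Sample sentences are made up for this sketch.
names   = 'we met John Smith in Tokyo last week'.scan(SIMPLE_EXTRACT_REGEX)
devices = 'she bought an iPhone in Berlin yesterday'.scan(SIMPLE_EXTRACT_REGEX)
cleaned = '(Hello, World!)'.gsub(SIMPLE_PUNCTUATION_REGEX, '')

p names    # => ["John Smith", "Tokyo"]
p devices  # => ["iPhone", "Berlin"]
p cleaned  # => "Hello World"
```

Note that adjacent capitalized words are captured as a single term ("John Smith"), which is why a later n-gram step is needed to split or filter such runs.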

Constructor Details

#initialize(corpus:, **args) ⇒ Extractor

Returns a new instance of Extractor.



# File 'lib/confidential_info_redactor_lite/extractor.rb', line 9

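def initialize(corpus:, **args)
  # Store the corpus as a frozen Set: duplicates collapse and it cannot be modified later.
  @corpus = Set.new(corpus).freeze
  # Fall back to English when no :language option is given.
  @language = args[:language] || 'en'
end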
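
The constructor's behaviour can be tried in isolation with a small stand-in class. This is a sketch only: `SketchExtractor` is a hypothetical name that copies the two assignments shown above and nothing else, and the corpus values are made up.

```ruby
require 'set'

# Hypothetical stand-in that mirrors only the constructor shown above;
# it is not the gem's class and has no extract method.
class SketchExtractor
  attr_reader :corpus, :language

  def initialize(corpus:, **args)
    @corpus = Set.new(corpus).freeze
    @language = args[:language] || 'en'
  end
end

default = SketchExtractor.new(corpus: %w[the and the])
german  = SketchExtractor.new(corpus: [], language: 'de')

p default.corpus.size     # => 2, the duplicate "the" collapses in the Set
p default.corpus.frozen?  # => true
p default.language        # => "en"
p german.language         # => "de"
```

Because `corpus:` is a required keyword argument, calling `SketchExtractor.new` without it raises an `ArgumentError`, while `language:` is optional and travels through `**args`.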

Instance Attribute Details

#corpus ⇒ Object (readonly)

Returns the value of attribute corpus.



# File 'lib/confidential_info_redactor_lite/extractor.rb', line 8

def corpus
  @corpus
end

#language ⇒ Object (readonly)

Returns the value of attribute language.



# File 'lib/confidential_info_redactor_lite/extractor.rb', line 8

def language
  @language
end

Instance Method Details

#extract(text) ⇒ Object

Splits the text into segments with PragmaticSegmenter, collects candidate terms from each segment, and returns the unique, non-empty terms longer than one character, with any "{}" markers removed.



# File 'lib/confidential_info_redactor_lite/extractor.rb', line 14

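def extract(text)
  extracted_terms = []
  # Normalize curly single quotes to straight apostrophes, then split the text into segments.
  PragmaticSegmenter::Segmenter.new(text: text.gsub(/[’‘]/, "'"), language: language).segment.each do |segment|
    initial_extracted_terms = extract_preliminary_terms(segment)
    # extracted_terms is passed in so results can accumulate across segments.
    search_ngrams(initial_extracted_terms, extracted_terms)
  end
  # Strip "{}" markers, drop single-character terms, de-duplicate, and discard empty strings.
  extracted_terms.map { |t| t.gsub(/\{\}/, '') }.delete_if { |t| t.length == 1 }.uniq.reject(&:empty?)
end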
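
The first and last steps of `#extract` do not depend on PragmaticSegmenter and can be sketched on their own. The helper names `normalize_quotes` and `clean_terms` are hypothetical, and the sample terms are made up; what produces the "{}" markers inside the library is not shown on this page.

```ruby
# Hypothetical helpers that isolate two steps of #extract; the names are
# illustrative and do not exist in the gem.

# Replace curly single quotes with a straight apostrophe, as done before segmenting.
def normalize_quotes(text)
  text.gsub(/[’‘]/, "'")
end

# Same chain as the last line of #extract.
def clean_terms(terms)
  terms.map { |t| t.gsub(/\{\}/, '') } # remove "{}" markers
       .delete_if { |t| t.length == 1 } # drop single-character terms
       .uniq                            # remove duplicates
       .reject(&:empty?)                # discard empty strings
end

normalized = normalize_quotes('O’Neil said ‘hi’')
cleaned    = clean_terms(['John{} Smith', 'A', 'Tokyo', 'Tokyo', '{}', ''])

p normalized  # => "O'Neil said 'hi'"
p cleaned     # => ["John Smith", "Tokyo"]
```

The order of the chain matters: a term that consists only of "{}" becomes an empty string after the `gsub`, so it survives the length check and is removed only by the final `reject(&:empty?)`.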