Class: TextRank::KeywordExtractor

Inherits:
  Object
Defined in:
lib/text_rank/keyword_extractor.rb

Overview

Primary class for keyword extraction, and a hub for the filters, tokenizers, and graph strategies that customize how the text is processed and how the TextRank algorithm is applied.

See Also:

  • README

Class Method Summary

Instance Method Summary

Constructor Details

#initialize(**options) ⇒ KeywordExtractor

Returns a new instance of KeywordExtractor.

Parameters:

  • options (Hash)

    a customizable set of options

Options Hash (**options):

  • :char_filters (Array<Class, Symbol, #filter!>)

    A list of filters to be applied prior to tokenization

  • :tokenizers (Array<Symbol, Regexp, String>)

    A list of tokenizer regular expressions to perform tokenization

  • :token_filters (Array<Class, Symbol, #filter!>)

    A list of filters to be applied to each token after tokenization

  • :graph_strategy (Class, Symbol, #build_graph)

    A class or strategy instance for producing a graph from tokens

  • :rank_filters (Array<Class, Symbol, #filter!>)

    A list of filters to be applied to the keyword ranks after keyword extraction

  • :strategy (Symbol)

    PageRank strategy to use (either :sparse or :dense)

  • :damping (Float)

    The probability of following the graph vs. randomly choosing a new node

  • :tolerance (Float)

    The desired accuracy of the results



# File 'lib/text_rank/keyword_extractor.rb', line 42

def initialize(**options)
  @page_rank_options = {
    strategy: options[:strategy] || :sparse,
    damping: options[:damping],
    tolerance: options[:tolerance],
  }
  @char_filters   = options[:char_filters] || []
  @tokenizers     = options[:tokenizers] || [Tokenizer::Word]
  @token_filters  = options[:token_filters] || []
  @rank_filters   = options[:rank_filters] || []
  @graph_strategy = options[:graph_strategy] || GraphStrategy::Coocurrence
end

Class Method Details

.advanced(**options) ⇒ KeywordExtractor

Creates an “advanced” keyword extractor with a larger set of default filters

Options Hash (**options):

  • :char_filters (Array<Class, Symbol, #filter!>)

    A list of filters to be applied prior to tokenization

  • :tokenizers (Array<Symbol, Regexp, String>)

    A list of tokenizer regular expressions to perform tokenization

  • :token_filters (Array<Class, Symbol, #filter!>)

    A list of filters to be applied to each token after tokenization

  • :graph_strategy (Class, Symbol, #build_graph)

    A class or strategy instance for producing a graph from tokens

  • :rank_filters (Array<Class, Symbol, #filter!>)

    A list of filters to be applied to the keyword ranks after keyword extraction

  • :strategy (Symbol)

    PageRank strategy to use (either :sparse or :dense)

  • :damping (Float)

    The probability of following the graph vs. randomly choosing a new node

  • :tolerance (Float)

    The desired accuracy of the results

Returns:

  • (KeywordExtractor)

# File 'lib/text_rank/keyword_extractor.rb', line 26

def self.advanced(**options)
  new(**{
    char_filters:   [:AsciiFolding, :Lowercase, :StripHtml, :StripEmail, :UndoContractions, :StripPossessive],
    tokenizers:     [:Url, :Money, :Number, :Word, :Punctuation],
    token_filters:  [:PartOfSpeech, :Stopwords, :MinLength],
    graph_strategy: :Coocurrence,
    rank_filters:   [:CollapseAdjacent, :NormalizeUnitVector, :SortByValue],
  }.merge(options))
end

.basic(**options) ⇒ KeywordExtractor

Creates a “basic” keyword extractor with default options

Options Hash (**options):

  • :char_filters (Array<Class, Symbol, #filter!>)

    A list of filters to be applied prior to tokenization

  • :tokenizers (Array<Symbol, Regexp, String>)

    A list of tokenizer regular expressions to perform tokenization

  • :token_filters (Array<Class, Symbol, #filter!>)

    A list of filters to be applied to each token after tokenization

  • :graph_strategy (Class, Symbol, #build_graph)

    A class or strategy instance for producing a graph from tokens

  • :rank_filters (Array<Class, Symbol, #filter!>)

    A list of filters to be applied to the keyword ranks after keyword extraction

  • :strategy (Symbol)

    PageRank strategy to use (either :sparse or :dense)

  • :damping (Float)

    The probability of following the graph vs. randomly choosing a new node

  • :tolerance (Float)

    The desired accuracy of the results

Returns:

  • (KeywordExtractor)

# File 'lib/text_rank/keyword_extractor.rb', line 14

def self.basic(**options)
  new(**{
    char_filters:   [:AsciiFolding, :Lowercase],
    tokenizers:     [:Word],
    token_filters:  [:Stopwords, :MinLength],
    graph_strategy: :Coocurrence,
  }.merge(options))
end

Instance Method Details

#add_char_filter(filter, **options) ⇒ nil

Add a new CharFilter for processing text before tokenization

Parameters:

  • filter (Class, Symbol, #filter!)

    A filter to process text before tokenization

  • before (Class, Symbol, Object)

    existing item before which to insert the new filter

  • at (Fixnum)

    index at which to insert the new filter

Returns:

  • (nil)


# File 'lib/text_rank/keyword_extractor.rb', line 59

def add_char_filter(filter, **options)
  add_into(@char_filters, filter, **options)
  nil
end

#add_rank_filter(filter, **options) ⇒ nil

Add a new RankFilter for processing ranks after calculating

Parameters:

  • filter (Class, Symbol, #filter!)

    A filter to process ranks

  • before (Class, Symbol, Object)

    existing item before which to insert the new filter

  • at (Fixnum)

    index at which to insert the new filter

Returns:

  • (nil)


# File 'lib/text_rank/keyword_extractor.rb', line 93

def add_rank_filter(filter, **options)
  add_into(@rank_filters, filter, **options)
  nil
end

#add_token_filter(filter, **options) ⇒ nil

Add a new TokenFilter for processing tokens after tokenization

Parameters:

  • filter (Class, Symbol, #filter!)

    A filter to process tokens after tokenization

  • before (Class, Symbol, Object)

    existing item before which to insert the new filter

  • at (Fixnum)

    index at which to insert the new filter

Returns:

  • (nil)


# File 'lib/text_rank/keyword_extractor.rb', line 84

def add_token_filter(filter, **options)
  add_into(@token_filters, filter, **options)
  nil
end

#add_tokenizer(tokenizer, **options) ⇒ nil

Add a tokenizer regular expression for producing tokens from filtered text

Parameters:

  • tokenizer (Symbol, Regexp, String)

    Tokenizer regular expression

  • before (Class, Symbol, Object)

    existing item before which to insert the new tokenizer

  • at (Fixnum)

    index at which to insert the new tokenizer

Returns:

  • (nil)


# File 'lib/text_rank/keyword_extractor.rb', line 68

def add_tokenizer(tokenizer, **options)
  add_into(@tokenizers, tokenizer, **options)
  nil
end

#extract(text, **options) ⇒ Hash<String, Float>

Filters and tokenizes text, then returns the PageRank score of each token

Parameters:

  • text (String)

    unfiltered text to be processed

Returns:

  • (Hash<String, Float>)

    tokens and page ranks (in descending order)



# File 'lib/text_rank/keyword_extractor.rb', line 110

def extract(text, **options)
  tokens = tokenize(text)
  graph = PageRank.new(**@page_rank_options)
  classify(@graph_strategy, context: GraphStrategy).build_graph(tokens, graph)
  ranks = graph.calculate(**options)
  apply_rank_filters(ranks, tokens: tokens, original_text: text)
end

#graph_strategy=(strategy) ⇒ Class, ...

Sets the graph strategy for producing a graph from tokens

Parameters:

  • strategy (Class, Symbol, #build_graph)

    Strategy for producing a graph from tokens

Returns:

  • (Class, Symbol, #build_graph)


# File 'lib/text_rank/keyword_extractor.rb', line 76

def graph_strategy=(strategy)
  @graph_strategy = strategy
end

#tokenize(text) ⇒ Array<String>

Filters and tokenizes text

Parameters:

  • text (String)

    unfiltered text to be tokenized

Returns:

  • (Array<String>)

    tokens



# File 'lib/text_rank/keyword_extractor.rb', line 101

def tokenize(text)
  filtered_text = apply_char_filters(text)
  tokens = Tokenizer.tokenize(filtered_text, *tokenizer_regular_expressions)
  apply_token_filters(tokens)
end