Class: TextRank::KeywordExtractor

Inherits:
  Object
Defined in:
lib/text_rank/keyword_extractor.rb

Overview

Primary class for keyword extraction, and a hub for the filters, tokenizers, and graph strategies that customize how the text is processed and how the TextRank algorithm is applied.

See Also:

  • README

Class Method Summary

Instance Method Summary

Constructor Details

#initialize(**options) ⇒ KeywordExtractor

Returns a new instance of KeywordExtractor.

Parameters:

  • options (Hash)

    a customizable set of options

Options Hash (**options):

  • :char_filters (Array<Class, Symbol, #filter!>)

    A list of filters to be applied prior to tokenization

  • :tokenizers (Array<Symbol, Regexp, String>)

    A list of tokenizer regular expressions to perform tokenization

  • :token_filters (Array<Class, Symbol, #filter!>)

    A list of filters to be applied to each token after tokenization

  • :graph_strategy (Class, Symbol, #build_graph)

    A class or strategy instance for producing a graph from tokens

  • :rank_filters (Array<Class, Symbol, #filter!>)

    A list of filters to be applied to the keyword ranks after keyword extraction

  • :strategy (Symbol)

    PageRank strategy to use (either :sparse or :dense)

  • :damping (Float)

    The probability of following the graph vs. randomly choosing a new node

  • :tolerance (Float)

    The desired accuracy of the results



# File 'lib/text_rank/keyword_extractor.rb', line 42

def initialize(**options)
  @page_rank_options = {
    strategy: options[:strategy] || :sparse,
    damping: options[:damping],
    tolerance: options[:tolerance],
  }
  @char_filters   = options[:char_filters] || []
  @tokenizers     = options[:tokenizers] || [Tokenizer::Word]
  @token_filters  = options[:token_filters] || []
  @rank_filters   = options[:rank_filters] || []
  @graph_strategy = options[:graph_strategy] || GraphStrategy::Coocurrence
end

Class Method Details

.advanced(**options) ⇒ KeywordExtractor

Creates an “advanced” keyword extractor with a larger set of default filters

Options Hash (**options):

  • :char_filters (Array<Class, Symbol, #filter!>)

    A list of filters to be applied prior to tokenization

  • :tokenizers (Array<Symbol, Regexp, String>)

    A list of tokenizer regular expressions to perform tokenization

  • :token_filters (Array<Class, Symbol, #filter!>)

    A list of filters to be applied to each token after tokenization

  • :graph_strategy (Class, Symbol, #build_graph)

    A class or strategy instance for producing a graph from tokens

  • :rank_filters (Array<Class, Symbol, #filter!>)

    A list of filters to be applied to the keyword ranks after keyword extraction

  • :strategy (Symbol)

    PageRank strategy to use (either :sparse or :dense)

  • :damping (Float)

    The probability of following the graph vs. randomly choosing a new node

  • :tolerance (Float)

    The desired accuracy of the results

Returns:

  • (KeywordExtractor)

# File 'lib/text_rank/keyword_extractor.rb', line 26

def self.advanced(**options)
  new(**{
    char_filters:   [:AsciiFolding, :Lowercase, :StripHtml, :StripEmail, :UndoContractions, :StripPossessive],
    tokenizers:     [:Url, :Money, :Number, :Word, :Punctuation],
    token_filters:  [:PartOfSpeech, :Stopwords, :MinLength],
    graph_strategy: :Coocurrence,
    rank_filters:   [:CollapseAdjacent, :NormalizeUnitVector, :SortByValue],
  }.merge(options))
end

.basic(**options) ⇒ KeywordExtractor

Creates a “basic” keyword extractor with default options

Options Hash (**options):

  • :char_filters (Array<Class, Symbol, #filter!>)

    A list of filters to be applied prior to tokenization

  • :tokenizers (Array<Symbol, Regexp, String>)

    A list of tokenizer regular expressions to perform tokenization

  • :token_filters (Array<Class, Symbol, #filter!>)

    A list of filters to be applied to each token after tokenization

  • :graph_strategy (Class, Symbol, #build_graph)

    A class or strategy instance for producing a graph from tokens

  • :rank_filters (Array<Class, Symbol, #filter!>)

    A list of filters to be applied to the keyword ranks after keyword extraction

  • :strategy (Symbol)

    PageRank strategy to use (either :sparse or :dense)

  • :damping (Float)

    The probability of following the graph vs. randomly choosing a new node

  • :tolerance (Float)

    The desired accuracy of the results

Returns:

  • (KeywordExtractor)

# File 'lib/text_rank/keyword_extractor.rb', line 14

def self.basic(**options)
  new(**{
    char_filters:   [:AsciiFolding, :Lowercase],
    tokenizers:     [:Word],
    token_filters:  [:Stopwords, :MinLength],
    graph_strategy: :Coocurrence,
  }.merge(options))
end

Instance Method Details

#add_char_filter(filter, **options) ⇒ nil

Add a new CharFilter for processing text before tokenization

Parameters:

  • filter (Class, Symbol, #filter!)

    A filter to process text before tokenization

  • before (Class, Symbol, Object)

    existing item before which to insert the new filter

  • at (Fixnum)

    index at which to insert the new filter

Returns:

  • (nil)


# File 'lib/text_rank/keyword_extractor.rb', line 59

def add_char_filter(filter, **options)
  add_into(@char_filters, filter, **options)
  nil
end

#add_rank_filter(filter, **options) ⇒ nil

Add a new RankFilter for processing ranks after calculating

Parameters:

  • filter (Class, Symbol, #filter!)

    A filter to process ranks

  • before (Class, Symbol, Object)

    existing item before which to insert the new filter

  • at (Fixnum)

    index at which to insert the new filter

Returns:

  • (nil)


# File 'lib/text_rank/keyword_extractor.rb', line 93

def add_rank_filter(filter, **options)
  add_into(@rank_filters, filter, **options)
  nil
end

#add_token_filter(filter, **options) ⇒ nil

Add a new TokenFilter for processing tokens after tokenization

Parameters:

  • filter (Class, Symbol, #filter!)

    A filter to process tokens after tokenization

  • before (Class, Symbol, Object)

    existing item before which to insert the new filter

  • at (Fixnum)

    index at which to insert the new filter

Returns:

  • (nil)


# File 'lib/text_rank/keyword_extractor.rb', line 84

def add_token_filter(filter, **options)
  add_into(@token_filters, filter, **options)
  nil
end

#add_tokenizer(tokenizer, **options) ⇒ nil

Add a tokenizer regular expression for producing tokens from filtered text

Parameters:

  • tokenizer (Symbol, Regexp, String)

    Tokenizer regular expression

  • before (Class, Symbol, Object)

    existing item before which to insert the new tokenizer

  • at (Fixnum)

    index at which to insert the new tokenizer

Returns:

  • (nil)


# File 'lib/text_rank/keyword_extractor.rb', line 68

def add_tokenizer(tokenizer, **options)
  add_into(@tokenizers, tokenizer, **options)
  nil
end

#extract(text, **options) ⇒ Hash<String, Float>

Filters and tokenizes text, then returns the PageRank score of each token

Parameters:

  • text (String)

    unfiltered text to be processed

Returns:

  • (Hash<String, Float>)

    tokens and page ranks (in descending order)



# File 'lib/text_rank/keyword_extractor.rb', line 110

def extract(text, **options)
  tokens = tokenize(text)
  graph = PageRank.new(**@page_rank_options)
  classify(@graph_strategy, context: GraphStrategy).build_graph(tokens, graph)
  ranks = graph.calculate(**options)
  apply_rank_filters(ranks, tokens: tokens, original_text: text)
end

#graph_strategy=(strategy) ⇒ Class, ...

Sets the graph strategy for producing a graph from tokens

Parameters:

  • strategy (Class, Symbol, #build_graph)

    Strategy for producing a graph from tokens

Returns:

  • (Class, Symbol, #build_graph)


# File 'lib/text_rank/keyword_extractor.rb', line 76

def graph_strategy=(strategy)
  @graph_strategy = strategy
end

#tokenize(text) ⇒ Array<String>

Filters and tokenizes text

Parameters:

  • text (String)

    unfiltered text to be tokenized

Returns:

  • (Array<String>)

    tokens



# File 'lib/text_rank/keyword_extractor.rb', line 101

def tokenize(text)
  filtered_text = apply_char_filters(text)
  tokens = Tokenizer.tokenize(filtered_text, *tokenizer_regular_expressions)
  apply_token_filters(tokens)
end