Module: ConnectorsShared::ExtractionUtils

Defined in: lib/connectors_shared/extraction_utils.rb

Constant Summary

NON_CONTENT_TAGS = A list of tags we want to remove before extracting content

Set.new(%w[
  comment
  object
  script
  style
  svg
  video
]).freeze

BREAK_ELEMENTS = Tags that generate a word/line break when rendered

Set.new(%w[
  br
  hr
]).freeze

OMISSION = The character used to signal that a string has been truncated

'…'

Class Method Summary

.limit_bytesize(string, limit) ⇒ Object

Limits the size of a given string value down to a given limit (in bytes). This is heavily inspired by github.com/rails/rails/pull/27319/files.
.node_descendant_text(node) ⇒ Object

Expects a Nokogiri HTML node and returns the textual content of the node and all of its descendants.
.replace_with_whitespace?(node) ⇒ Boolean

Returns true if the node should be replaced with a space when extracting text from a document.

Class Method Details

.limit_bytesize(string, limit) ⇒ `Object`

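Limits the size of a given string value down to a given limit (in bytes). If the string has to be cut, it is cut on grapheme boundaries and OMISSION is appended; the bytes reserved for OMISSION count toward the limit. This is heavily inspired by github.com/rails/rails/pull/27319/files.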

# File 'lib/connectors_shared/extraction_utils.rb', line 94

def self.limit_bytesize(string, limit)
  return string if string.nil? || string.bytesize <= limit
  real_limit = limit - OMISSION.bytesize
  (+'').tap do |cut|
    string.scan(/\X/) do |grapheme|
      if cut.bytesize + grapheme.bytesize <= real_limit
        cut << grapheme
      else
        cut << OMISSION
        break
      end
    end
  end
end
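The behavior described above can be tried out in isolation. The following is a minimal sketch that copies the method body shown above into a standalone module; the module name `ExtractionDemo` is hypothetical and exists only so the example runs without loading the connectors code.

```ruby
# Hypothetical standalone copy of the method above, for experimentation only.
module ExtractionDemo
  OMISSION = '…' # 3 bytes in UTF-8

  def self.limit_bytesize(string, limit)
    return string if string.nil? || string.bytesize <= limit

    # Reserve room for the omission marker inside the limit
    real_limit = limit - OMISSION.bytesize
    (+'').tap do |cut|
      # /\X/ matches one extended grapheme cluster, so multi-byte
      # characters are never split in the middle
      string.scan(/\X/) do |grapheme|
        if cut.bytesize + grapheme.bytesize <= real_limit
          cut << grapheme
        else
          cut << OMISSION
          break
        end
      end
    end
  end
end

puts ExtractionDemo.limit_bytesize('hello world', 8)    # prints "hello…" (5 + 3 bytes)
puts ExtractionDemo.limit_bytesize('日本語テキスト', 10) # prints "日本…" (6 + 3 bytes)
puts ExtractionDemo.limit_bytesize('short', 10)         # prints "short" (already fits)
```

One consequence of reserving space for OMISSION: if `limit` is smaller than `OMISSION.bytesize`, the first grapheme never fits and the result is OMISSION alone, which is longer than the requested limit. Callers that need a hard guarantee may want to check for that case themselves.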

.node_descendant_text(node) ⇒ `Object`

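Expects a Nokogiri HTML node and returns the textual content of the node and all of its descendants. Tags in NON_CONTENT_TAGS are skipped, tags in BREAK_ELEMENTS become a single space, and runs of whitespace are collapsed.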

# File 'lib/connectors_shared/extraction_utils.rb', line 34

def self.node_descendant_text(node)
  return '' unless node&.present?

  unless node.respond_to?(:children) && node.respond_to?(:name) && node.respond_to?(:text?)
    raise ArgumentError, "Expecting something node-like but got a #{node.class}"
  end

  to_process_stack = [node]
  text = []

  loop do
    # Get the next node to process
    node = to_process_stack.pop
    break unless node

    # Base cases where we append content to the text buffer
    if node.kind_of?(String)
      text << node unless node == ' ' && text.last == ' '
      next
    end

    # Remove tags that do not contain any text (and which sometimes are treated as CDATA, generating garbage text in jruby)
    next if NON_CONTENT_TAGS.include?(node.name)

    # Tags that need to be replaced by spaces according to the standards
    if replace_with_whitespace?(node)
      text << ' ' unless text.last == ' '
      next
    end

    # Extract the text from all text nodes
    if node.text?
      content = node.content
      text << content.squish if content
      next
    end

    # Add spaces before all tags
    to_process_stack << ' '

    # Recursion by adding the node's children to the stack and looping
    node.children.reverse_each { |child| to_process_stack << child }

    # Add spaces after all tags
    to_process_stack << ' '
  end

  # Remove any duplicate spaces and return the content
  text.join.squish!
end
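The traversal above avoids recursion by keeping an explicit stack and pushing children in reverse so they are popped in document order. The sketch below restates that technique using only the standard library. It is an illustration under assumptions, not the library's code: `FakeNode`, `text_node`, `element`, and `descendant_text` are hypothetical names, `FakeNode` stands in for a Nokogiri node, and plain `gsub`/`strip` replaces ActiveSupport's `squish`.

```ruby
require 'set'

# Hypothetical stand-in for a Nokogiri node: models only name, content,
# children and text?, which is all the traversal relies on.
FakeNode = Struct.new(:name, :content, :children) do
  def text?
    name == 'text'
  end
end

NON_CONTENT = Set.new(%w[comment object script style svg video]).freeze
BREAKS = Set.new(%w[br hr]).freeze

def text_node(content)
  FakeNode.new('text', content, [])
end

def element(name, *children)
  FakeNode.new(name, nil, children)
end

# Simplified restatement of the stack-based walk shown above
def descendant_text(root)
  stack = [root]
  text = []
  until stack.empty?
    node = stack.pop
    if node.is_a?(String)
      # Space markers: never emit two in a row
      text << node unless node == ' ' && text.last == ' '
      next
    end
    next if NON_CONTENT.include?(node.name)
    if BREAKS.include?(node.name)
      text << ' ' unless text.last == ' '
      next
    end
    if node.text?
      text << node.content.gsub(/\s+/, ' ').strip
      next
    end
    # Pushed in reverse so pops yield: space, children in order, space
    stack << ' '
    node.children.reverse_each { |child| stack << child }
    stack << ' '
  end
  text.join.gsub(/\s+/, ' ').strip
end

doc = element('div',
              element('p', text_node('Hello'), element('br'), text_node('world')),
              element('script', text_node('alert(1)')),
              element('p', text_node("  spaced \n out  ")))

puts descendant_text(doc) # prints "Hello world spaced out"
```

The `br` element is what keeps "Hello" and "world" from running together, and the `script` subtree contributes nothing because it is dropped before its children are ever pushed.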

.replace_with_whitespace?(node) ⇒ `Boolean`

Returns true if the node should be replaced with a space when extracting text from a document, that is, if its tag name is listed in BREAK_ELEMENTS.

Returns:

(Boolean)



# File 'lib/connectors_shared/extraction_utils.rb', line 87

def self.replace_with_whitespace?(node)
  BREAK_ELEMENTS.include?(node.name)
end