Module: ConnectorsShared::ExtractionUtils

Defined in:
lib/connectors_shared/extraction_utils.rb

Constant Summary

NON_CONTENT_TAGS =

A list of tags we want to remove before extracting content

Set.new(%w[
  comment
  object
  script
  style
  svg
  video
]).freeze
BREAK_ELEMENTS =

Tags that generate a word/line break when rendered

Set.new(%w[
  br
  hr
]).freeze
OMISSION =

The character used to signal that a string has been truncated

''

Class Method Summary

Class Method Details

.limit_bytesize(string, limit) ⇒ Object


Limits the size of a given string value to a given limit (in bytes). The string is cut on grapheme-cluster boundaries, and OMISSION is appended when truncation occurs. This is heavily inspired by github.com/rails/rails/pull/27319/files.



# File 'lib/connectors_shared/extraction_utils.rb', line 94

def self.limit_bytesize(string, limit)
  return string if string.nil? || string.bytesize <= limit
  real_limit = limit - OMISSION.bytesize
  (+'').tap do |cut|
    string.scan(/\X/) do |grapheme|
      if cut.bytesize + grapheme.bytesize <= real_limit
        cut << grapheme
      else
        cut << OMISSION
        break
      end
    end
  end
end
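
The truncation logic above can be tried in isolation with the following minimal sketch. It is a standalone re-statement, not the library's code path: the method name `limit_bytesize_sketch`, the `omission:` keyword, and the ellipsis value used for the omission marker are all assumptions made for illustration.

```ruby
# Standalone sketch of grapheme-aware byte truncation.
# OMISSION_SKETCH is an assumed marker (a single ellipsis, 3 bytes in UTF-8);
# it is not taken from the library.
OMISSION_SKETCH = "\u2026"

# Hypothetical helper mirroring the documented method's logic.
def limit_bytesize_sketch(string, limit, omission: OMISSION_SKETCH)
  return string if string.nil? || string.bytesize <= limit

  # Reserve room for the omission marker inside the byte budget.
  real_limit = limit - omission.bytesize
  cut = +''
  # /\X/ matches one extended grapheme cluster, so multi-byte characters
  # are never split in the middle.
  string.scan(/\X/) do |grapheme|
    if cut.bytesize + grapheme.bytesize <= real_limit
      cut << grapheme
    else
      cut << omission
      break
    end
  end
  cut
end

sample    = "h\u00E9llo w\u00F6rld"              # "héllo wörld", 13 bytes
unchanged = limit_bytesize_sketch('hello', 10)   # already within the limit
truncated = limit_bytesize_sketch(sample, 8)     # "héll" (5 bytes) + marker (3 bytes)
```

With an 8-byte limit and a 3-byte marker, only 5 bytes of the original text fit, so the result stays a valid UTF-8 string of exactly 8 bytes.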

.node_descendant_text(node) ⇒ Object


Expects a Nokogiri HTML node and returns the textual content of the node and all of its descendants.



# File 'lib/connectors_shared/extraction_utils.rb', line 34

def self.node_descendant_text(node)
  return '' unless node&.present?

  unless node.respond_to?(:children) && node.respond_to?(:name) && node.respond_to?(:text?)
    raise ArgumentError, "Expecting something node-like but got a #{node.class}"
  end

  to_process_stack = [node]
  text = []

  loop do
    # Get the next node to process
    node = to_process_stack.pop
    break unless node

    # Base cases where we append content to the text buffer
    if node.kind_of?(String)
      text << node unless node == ' ' && text.last == ' '
      next
    end

    # Remove tags that do not contain any text (and which sometimes are treated as CDATA, generating garbage text in jruby)
    next if NON_CONTENT_TAGS.include?(node.name)

    # Tags that need to be replaced by spaces according to the standards
    if replace_with_whitespace?(node)
      text << ' ' unless text.last == ' '
      next
    end

    # Extract the text from all text nodes
    if node.text?
      content = node.content
      text << content.squish if content
      next
    end

    # Add spaces before all tags
    to_process_stack << ' '

    # Recursion by adding the node's children to the stack and looping
    node.children.reverse_each { |child| to_process_stack << child }

    # Add spaces after all tags
    to_process_stack << ' '
  end

  # Remove any duplicate spaces and return the content
  text.join.squish!
end
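
The stack-based traversal above can be illustrated without Nokogiri or ActiveSupport. The sketch below is an approximation under stated assumptions: `SketchNode` is a hypothetical stand-in that exposes only `name`, `content`, `children`, and `text?`; `squish_sketch` imitates ActiveSupport's `squish`; and the tag sets are copied from the constants listed on this page.

```ruby
require 'set'

# Hypothetical stand-in for a Nokogiri node (assumption for illustration).
SketchNode = Struct.new(:name, :content, :children) do
  def text?
    name == 'text'
  end
end

SKIP_TAGS_SKETCH  = Set.new(%w[comment object script style svg video]).freeze
BREAK_TAGS_SKETCH = Set.new(%w[br hr]).freeze

# Plain-Ruby imitation of ActiveSupport's String#squish.
def squish_sketch(str)
  str.gsub(/[[:space:]]+/, ' ').strip
end

def descendant_text_sketch(root)
  stack = [root]
  text = []
  until stack.empty?
    node = stack.pop
    if node.is_a?(String)
      # Pending separator: append unless the buffer already ends in a space.
      text << node unless node == ' ' && text.last == ' '
      next
    end
    next if SKIP_TAGS_SKETCH.include?(node.name)
    if BREAK_TAGS_SKETCH.include?(node.name)
      text << ' ' unless text.last == ' '
      next
    end
    if node.text?
      text << squish_sketch(node.content.to_s)
      next
    end
    # Element node: a separator, then the children in document order
    # (pushed reversed because the stack is LIFO), then another separator.
    stack << ' '
    node.children.reverse_each { |child| stack << child }
    stack << ' '
  end
  squish_sketch(text.join)
end

def text_node_sketch(value)
  SketchNode.new('text', value, [])
end

tree = SketchNode.new('div', nil, [
  SketchNode.new('p', nil, [text_node_sketch('Hello'),
                            SketchNode.new('br', nil, []),
                            text_node_sketch('world')]),
  SketchNode.new('script', nil, [text_node_sketch('alert(1)')]),
  SketchNode.new('p', nil, [text_node_sketch('  spaced   out  ')])
])

extracted = descendant_text_sketch(tree) # "Hello world spaced out"
```

Using an explicit stack instead of recursion keeps the traversal safe on deeply nested documents, at the cost of pushing children in reverse so they are popped in document order.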

.replace_with_whitespace?(node) ⇒ Boolean


Returns true if the node should be replaced with a space when extracting text from a document

Returns:

  • (Boolean)


# File 'lib/connectors_shared/extraction_utils.rb', line 87

def self.replace_with_whitespace?(node)
  BREAK_ELEMENTS.include?(node.name)
end
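
As a small illustration of the predicate above, the following sketch checks a few tag names against a copy of the set. `TagStub`, `BREAK_ELEMENTS_SKETCH`, and `break_element_sketch?` are hypothetical names introduced here. The lookup is an exact string match, so the sketch assumes tag names arrive in lowercase.

```ruby
require 'set'

# Copy of the BREAK_ELEMENTS values listed on this page.
BREAK_ELEMENTS_SKETCH = Set.new(%w[br hr]).freeze

# Hypothetical minimal node: only a tag name is needed for the check.
TagStub = Struct.new(:name)

def break_element_sketch?(node)
  BREAK_ELEMENTS_SKETCH.include?(node.name)
end

br_is_break    = break_element_sketch?(TagStub.new('br'))   # true
span_is_break  = break_element_sketch?(TagStub.new('span')) # false
upper_is_break = break_element_sketch?(TagStub.new('BR'))   # false: exact match only
```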