Module: ConnectorsShared::ExtractionUtils
- Defined in:
- lib/connectors_shared/extraction_utils.rb
Constant Summary
- NON_CONTENT_TAGS =
A list of tags we want to remove before extracting content
Set.new(%w[ comment object script style svg video ]).freeze
- BREAK_ELEMENTS =
Tags that generate a word or line break when rendered
Set.new(%w[ br hr ]).freeze
- OMISSION =
The character used to signal that a string has been truncated
'…'
Class Method Summary
- .limit_bytesize(string, limit) ⇒ Object
Limits the size of a given string to a given limit (in bytes). Heavily inspired by github.com/rails/rails/pull/27319/files.
- .node_descendant_text(node) ⇒ Object
Expects a Nokogiri HTML node and returns the textual content of the node and all of its descendants.
- .replace_with_whitespace?(node) ⇒ Boolean
Returns true if the node should be replaced with a space when extracting text from a document.
Class Method Details
.limit_bytesize(string, limit) ⇒ Object
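Limits the size of a given string to a given limit (in bytes). A nil value, or a string that already fits, is returned unchanged. Otherwise the string is cut on grapheme-cluster boundaries and OMISSION is appended, with the bytes of OMISSION counted against the limit. Heavily inspired by github.com/rails/rails/pull/27319/files.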
# File 'lib/connectors_shared/extraction_utils.rb', line 94

def self.limit_bytesize(string, limit)
  return string if string.nil? || string.bytesize <= limit

  real_limit = limit - OMISSION.bytesize
  (+'').tap do |cut|
    string.scan(/\X/) do |grapheme|
      if cut.bytesize + grapheme.bytesize <= real_limit
        cut << grapheme
      else
        cut << OMISSION
        break
      end
    end
  end
end
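To make the byte accounting concrete, here is a minimal sketch that copies the method into a standalone helper and runs it on a string with multi-byte characters. The top-level `limit_bytesize` and the local `OMISSION` constant are illustrative copies, not the module itself, and the sample strings are made up for this example.

```ruby
OMISSION = '…' # one character, three bytes in UTF-8

# Standalone copy of the documented method, for illustration only.
def limit_bytesize(string, limit)
  return string if string.nil? || string.bytesize <= limit

  real_limit = limit - OMISSION.bytesize
  (+'').tap do |cut|
    # \X matches one extended grapheme cluster, so a cluster is never split.
    string.scan(/\X/) do |grapheme|
      if cut.bytesize + grapheme.bytesize <= real_limit
        cut << grapheme
      else
        cut << OMISSION
        break
      end
    end
  end
end

sample = "h\u00E9llo w\u00F6rld" # "héllo wörld", 13 bytes

puts limit_bytesize('hello', 10)        # hello (already fits)
puts limit_bytesize(sample, 8)          # héll…
puts limit_bytesize(sample, 8).bytesize # 8
```

With a limit of 8, only 5 bytes are left for content once the 3-byte ellipsis is reserved, which is why the result stops after "héll".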
.node_descendant_text(node) ⇒ Object
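Expects a Nokogiri HTML node and returns the textual content of the node and all of its descendants. Returns an empty string for a nil or blank node, and raises ArgumentError if the argument does not respond to children, name and text?.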
# File 'lib/connectors_shared/extraction_utils.rb', line 34

def self.node_descendant_text(node)
  return '' unless node&.present?

  unless node.respond_to?(:children) && node.respond_to?(:name) && node.respond_to?(:text?)
    raise ArgumentError, "Expecting something node-like but got a #{node.class}"
  end

  to_process_stack = [node]
  text = []

  loop do
    # Get the next node to process
    node = to_process_stack.pop
    break unless node

    # Base cases where we append content to the text buffer
    if node.kind_of?(String)
      text << node unless node == ' ' && text.last == ' '
      next
    end

    # Remove tags that do not contain any text (and which sometimes are treated as CDATA, generating garbage text in jruby)
    next if NON_CONTENT_TAGS.include?(node.name)

    # Tags, that need to be replaced by spaces according to the standards
    if replace_with_whitespace?(node)
      text << ' ' unless text.last == ' '
      next
    end

    # Extract the text from all text nodes
    if node.text?
      content = node.content
      text << content.squish if content
      next
    end

    # Add spaces before all tags
    to_process_stack << ' '

    # Recursion by adding the node's children to the stack and looping
    node.children.reverse_each { |child| to_process_stack << child }

    # Add spaces after all tags
    to_process_stack << ' '
  end

  # Remove any duplicate spaces and return the content
  text.join.squish!
end
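The stack-based traversal can be tried without Nokogiri or ActiveSupport. The sketch below is an approximation under stated assumptions: `FakeNode`, `txt`, `el`, `squish` and `descendant_text` are hypothetical names introduced here, `FakeNode` only mimics the four node methods the listing calls, and `squish` is a plain-Ruby stand-in for ActiveSupport's `String#squish`.

```ruby
require 'set'

NON_CONTENT_TAGS = Set.new(%w[comment object script style svg video]).freeze
BREAK_ELEMENTS = Set.new(%w[br hr]).freeze

# Hypothetical stand-in for a Nokogiri node (name, content, children, text?).
FakeNode = Struct.new(:name, :content, :children) do
  def text?
    name == 'text'
  end
end

def txt(str)
  FakeNode.new('text', str, [])
end

def el(name, *children)
  FakeNode.new(name, nil, children)
end

# Plain-Ruby approximation of ActiveSupport's String#squish.
def squish(str)
  str.gsub(/[[:space:]]+/, ' ').strip
end

def descendant_text(root)
  stack = [root]
  out = []
  until stack.empty?
    node = stack.pop
    if node.is_a?(String)
      # Separator pushed around an element; never emit two in a row.
      out << node unless node == ' ' && out.last == ' '
    elsif NON_CONTENT_TAGS.include?(node.name)
      next # skip the whole subtree
    elsif BREAK_ELEMENTS.include?(node.name)
      out << ' ' unless out.last == ' '
    elsif node.text?
      out << squish(node.content) if node.content
    else
      # LIFO order: the last space pushed is popped before the children,
      # the first one after them.
      stack << ' '
      node.children.reverse_each { |child| stack << child }
      stack << ' '
    end
  end
  squish(out.join)
end

doc = el('div',
         el('p', txt("Hello,\n  world")),
         el('script', txt('alert(1)')),
         el('p', txt('line one'), el('br'), txt('line two')))

puts descendant_text(doc) # Hello, world line one line two
```

Pushing children in reverse keeps document order when popping, and using an explicit stack instead of recursion avoids deep call chains on heavily nested markup.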
.replace_with_whitespace?(node) ⇒ Boolean
Returns true if the node should be replaced with a space when extracting text from a document, that is, if its tag name is listed in BREAK_ELEMENTS.
# File 'lib/connectors_shared/extraction_utils.rb', line 87

def self.replace_with_whitespace?(node)
  BREAK_ELEMENTS.include?(node.name)
end
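Because the predicate only reads the node's name, it can be exercised with any object that responds to `name`. A small sketch, where `TagOnly` is a hypothetical stand-in for a real Nokogiri node and the method is copied to the top level for illustration:

```ruby
require 'set'

BREAK_ELEMENTS = Set.new(%w[br hr]).freeze

# Hypothetical minimal node: the predicate needs nothing but #name.
TagOnly = Struct.new(:name)

def replace_with_whitespace?(node)
  BREAK_ELEMENTS.include?(node.name)
end

puts replace_with_whitespace?(TagOnly.new('br')) # true
puts replace_with_whitespace?(TagOnly.new('hr')) # true
puts replace_with_whitespace?(TagOnly.new('p'))  # false
```

The lookup is an exact string match against the set, so a name in a different case, such as 'BR', would not match in this sketch.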