Module: HtmlToPlainText

Defined in:
lib/html_to_plain_text.rb

Overview

The module's main method, plain_text, converts a string of HTML into a plain text approximation.

Constant Summary

IGNORE_TAGS =
%w(script style object applet iframe).inject({}){|h, t| h[t] = true; h}.freeze
PARAGRAPH_TAGS =
%w(p h1 h2 h3 h4 h5 h6 table ol ul dl dd blockquote dialog figure aside section).inject({}){|h, t| h[t] = true; h}.freeze
BLOCK_TAGS =
%w(div address li dt center del article header footer nav pre legend tr).inject({}){|h, t| h[t] = true; h}.freeze
WHITESPACE =
[" ", "\n", "\r"].freeze
PLAINTEXT =
"plaintext".freeze
PRE =
"pre".freeze
BR =
"br".freeze
HR =
"hr".freeze
TD =
"td".freeze
TH =
"th".freeze
TR =
"tr".freeze
OL =
"ol".freeze
UL =
"ul".freeze
LI =
"li".freeze
NUMBERS =
["1", "a"].freeze
ABSOLUTE_URL_PATTERN =
/^[a-z]+:\/\/[a-z0-9]/i.freeze
HTML_PATTERN =
/[<&]/.freeze
TRAILING_WHITESPACE =
/[ \t]+$/.freeze
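
The tag constants above are frozen hashes whose keys are tag names, so they can be queried like sets, and the three patterns are ordinary regular expressions. The following is a minimal sketch of how such constants behave when queried. The module name ConstantsSketch is hypothetical and exists only so the example runs on its own; how the library itself consults these constants during conversion is not shown in this page.

```ruby
# Hypothetical namespace so the sketch does not clash with the real module.
module ConstantsSketch
  # Same construction as in the listing above: a frozen Hash used as a set.
  IGNORE_TAGS = %w(script style object applet iframe).inject({}) { |h, t| h[t] = true; h }.freeze
  ABSOLUTE_URL_PATTERN = /^[a-z]+:\/\/[a-z0-9]/i.freeze
  HTML_PATTERN = /[<&]/.freeze
  TRAILING_WHITESPACE = /[ \t]+$/.freeze
end

include ConstantsSketch

# Membership test: a listed tag maps to true, anything else to nil.
IGNORE_TAGS["script"]        # => true
IGNORE_TAGS["p"]             # => nil
IGNORE_TAGS.key?("style")    # => true

# Only URLs with a scheme followed by "://" count as absolute.
"https://example.com/a".match?(ABSOLUTE_URL_PATTERN)  # => true
"/images/logo.png".match?(ABSOLUTE_URL_PATTERN)       # => false

# A "<" or "&" anywhere in the string marks it as possible HTML.
"Fish &amp; chips".match?(HTML_PATTERN)  # => true
"just text".match?(HTML_PATTERN)         # => false

# Spaces and tabs at the end of each line are matched, newlines are kept.
"a  \nb\t\n".gsub(TRAILING_WHITESPACE, "")  # => "a\nb\n"
```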

Class Method Summary

Instance Method Summary

Class Method Details

.plain_text(html) ⇒ Object

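Convert a string of HTML into a plain text approximation. Returns nil when html is nil, and an unmodified copy of the input when it contains neither "<" nor "&". Otherwise the input is parsed with Nokogiri and the body element is converted; the result has surrounding whitespace stripped and "\r\n" and "\r" line endings normalized to "\n". Returns nil if the parsed document has no body element.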



# File 'lib/html_to_plain_text.rb', line 31

def plain_text(html)
  return nil if html.nil?
  return html.dup unless html.match(HTML_PATTERN)
  body = Nokogiri::HTML::Document.parse(html).css("body").first
  return unless body
  convert_node_to_plain_text(body).strip.gsub(/\r(\n?)/, "\n")
end
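
The two steps of this method that do not involve Nokogiri, the early-return guard and the final line-ending normalization, can be tried in isolation. This is a rough sketch only: the module PlainTextSketch and the helpers skips_parsing? and normalize_line_endings are hypothetical names introduced here, not part of the library.

```ruby
# Hypothetical module: isolates the parts of plain_text that need no parser.
module PlainTextSketch
  HTML_PATTERN = /[<&]/

  module_function

  # True when plain_text would return before parsing:
  # nil input, or a string containing neither "<" nor "&".
  def skips_parsing?(html)
    html.nil? || !html.match?(HTML_PATTERN)
  end

  # Same substitution as the last line of plain_text:
  # "\r\n" and a lone "\r" both become "\n".
  def normalize_line_endings(text)
    text.gsub(/\r(\n?)/, "\n")
  end
end

PlainTextSketch.skips_parsing?(nil)                  # => true
PlainTextSketch.skips_parsing?("just text")          # => true
PlainTextSketch.skips_parsing?("<p>Hello</p>")       # => false
PlainTextSketch.skips_parsing?("Fish &amp; chips")   # => false
PlainTextSketch.normalize_line_endings("a\r\nb\rc")  # => "a\nb\nc"
```

Checking for "<" or "&" first is a cheap way to avoid building a document for input that cannot contain tags or entities.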

Instance Method Details

#plain_text(html) ⇒ Object

Helper instance method for converting HTML into plain text. This method simply calls HtmlToPlainText.plain_text.



# File 'lib/html_to_plain_text.rb', line 25

def plain_text(html)
  HtmlToPlainText.plain_text(html)
end
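
Because the instance method only delegates to the module-level method, including the module in a class gives its instances a plain_text helper. The sketch below illustrates that delegation pattern with a stand-in module so it runs without Nokogiri. DemoConverter, its crude tag-stripping body, and the Mailer class are all assumptions made for illustration; the real conversion is done by HtmlToPlainText.plain_text.

```ruby
# Stand-in for HtmlToPlainText (hypothetical), same two-method shape.
module DemoConverter
  # Module-level entry point. The real method parses the HTML;
  # this stand-in just strips anything that looks like a tag.
  def self.plain_text(html)
    return nil if html.nil?
    html.gsub(/<[^>]*>/, "")
  end

  # Instance helper: simply forwards to the module-level method.
  def plain_text(html)
    DemoConverter.plain_text(html)
  end
end

# Hypothetical class that mixes the helper in.
class Mailer
  include DemoConverter

  def text_part(html_body)
    plain_text(html_body)
  end
end

Mailer.new.text_part("<p>Hello <b>world</b></p>")  # => "Hello world"
DemoConverter.plain_text(nil)                      # => nil
```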