Module: HtmlToPlainText

Defined in:
lib/html_to_plain_text.rb

Overview

The main method on this module, plain_text, converts a string of HTML into a plain text approximation.

Constant Summary

IGNORE_TAGS =
%w(script style object applet iframe).inject({}){|h, t| h[t] = true; h}.freeze
PARAGRAPH_TAGS =
%w(p h1 h2 h3 h4 h5 h6 table ol ul dl dd blockquote dialog figure aside section).inject({}){|h, t| h[t] = true; h}.freeze
BLOCK_TAGS =
%w(div address li dt center del article header header footer nav pre legend tr).inject({}){|h, t| h[t] = true; h}.freeze
WHITESPACE =
[" ", "\n", "\r"].freeze
PLAINTEXT =
"plaintext".freeze
PRE =
"pre".freeze
BR =
"br".freeze
HR =
"hr".freeze
TD =
"td".freeze
TH =
"th".freeze
TR =
"tr".freeze
OL =
"ol".freeze
UL =
"ul".freeze
LI =
"li".freeze
A =
"a".freeze
TABLE =
"table".freeze
NUMBERS =
["1", "a"].freeze
ABSOLUTE_URL_PATTERN =
/^[a-z]+:\/\/[a-z0-9]/i.freeze
HTML_PATTERN =
/[<&]/.freeze
TRAILING_WHITESPACE =
/[ \t]+$/.freeze
BODY_TAG_XPATH =
"/html/body".freeze
CARRIDGE_RETURN_PATTERN =
/\r(\n?)/.freeze
LINE_BREAK_PATTERN =
/[\n\r]/.freeze
NON_PROTOCOL_PATTERN =
/:\/?\/?(.*)/.freeze
NOT_WHITESPACE_PATTERN =
/\S/.freeze
SPACE =
" ".freeze
EMPTY =
"".freeze
NEWLINE =
"\n".freeze
HREF =
"href".freeze
TABLE_SEPARATOR =
" | ".freeze
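
As a reading aid for the listing above, the following sketch rebuilds three of the constants verbatim inside a throwaway module and probes them with sample inputs. The module name ConstantSketch, its helper methods, and the sample strings are assumptions made for illustration and are not part of the library. It suggests that the tag tables are frozen hashes (so membership is a hash lookup), that ABSOLUTE_URL_PATTERN requires a scheme followed by "//", and that HTML_PATTERN only checks for "<" or "&".

```ruby
# ConstantSketch is a hypothetical namespace; the constant values are
# copied from the listing above, the helper methods are illustrative only.
module ConstantSketch
  # Builds a frozen lookup hash: {"script" => true, "style" => true, ...}
  IGNORE_TAGS = %w(script style object applet iframe).inject({}) { |h, t| h[t] = true; h }.freeze
  ABSOLUTE_URL_PATTERN = /^[a-z]+:\/\/[a-z0-9]/i
  HTML_PATTERN = /[<&]/

  # Tag membership is a hash lookup rather than an array scan.
  def self.ignored?(tag_name)
    IGNORE_TAGS.key?(tag_name)
  end

  # True only for URLs that start with a scheme followed by "//".
  def self.absolute_url?(url)
    !!(url =~ ABSOLUTE_URL_PATTERN)
  end

  # True when the string contains "<" or "&" and so may hold markup or entities.
  def self.looks_like_html?(text)
    !!(text =~ HTML_PATTERN)
  end
end

puts ConstantSketch.ignored?("script")
puts ConstantSketch.absolute_url?("/docs/index.html")
puts ConstantSketch.looks_like_html?("a &amp; b")
```

Note that a scheme without "//", such as a mailto: link, does not match ABSOLUTE_URL_PATTERN.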

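Class Method Summary

.plain_text(html) ⇒ Object
Convert some HTML into a plain text approximation.

Instance Method Summary

#plain_text(html) ⇒ Object
Helper instance method for converting HTML into plain text.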

Class Method Details

.plain_text(html) ⇒ Object

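Convert HTML into a plain text approximation. Returns nil when html is nil or when the parsed document has no body, and returns an unchanged copy of the string when it contains neither "<" nor "&". Otherwise the document body is parsed with Nokogiri, converted to text, stripped, and its carriage returns are normalized to newlines.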

# File 'lib/html_to_plain_text.rb', line 44

def plain_text(html)
  return nil if html.nil?
  return html.dup unless html =~ HTML_PATTERN
  body = Nokogiri::HTML::Document.parse(html).xpath(BODY_TAG_XPATH).first
  return unless body
  convert_node_to_plain_text(body).strip.gsub(CARRIDGE_RETURN_PATTERN, NEWLINE)
end
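
The guard clauses and the final newline normalization in the method above can be explored without Nokogiri. The sketch below is not the library's code: GuardSketch, branch_for, and normalize_newlines are hypothetical names introduced here, and only the three constants are copied from the Constant Summary. It reports which branch a given input would take and shows the effect of the final gsub.

```ruby
# GuardSketch is a hypothetical module that mirrors only the guard clauses
# and the last gsub of plain_text; it performs no HTML conversion.
module GuardSketch
  HTML_PATTERN = /[<&]/
  CARRIDGE_RETURN_PATTERN = /\r(\n?)/
  NEWLINE = "\n"

  # Reports which path plain_text would follow for the given input.
  def self.branch_for(html)
    return :nil_input if html.nil?
    return :copied_unchanged unless html =~ HTML_PATTERN
    :parsed_with_nokogiri
  end

  # Replaces "\r\n" and bare "\r" with a single "\n".
  def self.normalize_newlines(text)
    text.gsub(CARRIDGE_RETURN_PATTERN, NEWLINE)
  end
end

p GuardSketch.branch_for("plain words")
p GuardSketch.branch_for("<p>Hello</p>")
p GuardSketch.normalize_newlines("a\r\nb\rc")
```

One consequence worth noting: a string with no tags but a bare ampersand, such as "AT&T", still matches HTML_PATTERN and therefore goes through the parser.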

Instance Method Details

#plain_text(html) ⇒ Object

Helper instance method for converting HTML into plain text. This method simply calls HtmlToPlainText.plain_text.

# File 'lib/html_to_plain_text.rb', line 37

def plain_text(html)
  HtmlToPlainText.plain_text(html)
end
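
The delegation pattern above, an instance method that forwards to the module-level method, is what lets a class gain plain_text by including the module. The following is a minimal sketch of that pattern using a stand-in module, so that it runs without the gem or Nokogiri. PlainTextSketch, Notification, and text_body are hypothetical names, and the stand-in conversion only mirrors the nil and copy behaviour; it does not convert HTML.

```ruby
# PlainTextSketch is a hypothetical stand-in with the same shape as the
# documented module: a module-level method plus a delegating instance method.
module PlainTextSketch
  # Stand-in conversion: returns nil for nil, otherwise a copy of the input.
  # The real method parses and converts HTML; this one does not.
  def self.plain_text(html)
    return nil if html.nil?
    html.dup
  end

  # Instance-level helper, available to any class that includes the module.
  def plain_text(html)
    PlainTextSketch.plain_text(html)
  end
end

# Hypothetical consumer class that mixes in the helper.
class Notification
  include PlainTextSketch

  def initialize(body)
    @body = body
  end

  # Calls the mixed-in instance method, which delegates to the module method.
  def text_body
    plain_text(@body)
  end
end

puts Notification.new("Hello there").text_body
```

With the real library, the same shape would presumably be achieved by including HtmlToPlainText in the consuming class.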