Module: HtmlToPlainText
- Defined in: lib/html_to_plain_text.rb
Overview
The main method on this module, plain_text, converts a string of HTML to a plain text approximation.
Constant Summary
- IGNORE_TAGS =
%w(script style object applet iframe).inject({}){|h, t| h[t] = true; h}.freeze
- PARAGRAPH_TAGS =
%w(p h1 h2 h3 h4 h5 h6 table ol ul dl dd blockquote dialog figure aside section).inject({}){|h, t| h[t] = true; h}.freeze
- BLOCK_TAGS =
%w(div address li dt center del article header header footer nav pre legend tr).inject({}){|h, t| h[t] = true; h}.freeze
- WHITESPACE =
[" ", "\n", "\r"].freeze
- PLAINTEXT =
"plaintext".freeze
- PRE =
"pre".freeze
- BR =
"br".freeze
- HR =
"hr".freeze
- TD =
"td".freeze
- TH =
"th".freeze
- TR =
"tr".freeze
- OL =
"ol".freeze
- UL =
"ul".freeze
- LI =
"li".freeze
- A =
"a".freeze
- TABLE =
"table".freeze
- NUMBERS =
["1", "a"].freeze
- ABSOLUTE_URL_PATTERN =
/^[a-z]+:\/\/[a-z0-9]/i.freeze
- HTML_PATTERN =
/[<&]/.freeze
- TRAILING_WHITESPACE =
/[ \t]+$/.freeze
- BODY_TAG_XPATH =
"/html/body".freeze
- CARRIDGE_RETURN_PATTERN =
/\r(\n?)/.freeze
- LINE_BREAK_PATTERN =
/[\n\r]/.freeze
- NON_PROTOCOL_PATTERN =
/:\/?\/?(.*)/.freeze
- NOT_WHITESPACE_PATTERN =
/\S/.freeze
- SPACE =
" ".freeze
- EMPTY =
"".freeze
- NEWLINE =
"\n".freeze
- HREF =
"href".freeze
- TABLE_SEPARATOR =
" | ".freeze
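A few of the constants above can be tried in isolation. The following is a minimal sketch, not the library's own code: the values are copied from the summary into a hypothetical namespace (ConstantsSketch), and the sample strings are made up. It only shows what the values themselves do; how the module uses them internally is not shown on this page.

```ruby
# ConstantsSketch is a hypothetical namespace; the values are copied from the summary.
module ConstantsSketch
  # A word list folded into a frozen Hash, so tag membership is a single lookup.
  IGNORE_TAGS = %w(script style object applet iframe).inject({}) { |h, t| h[t] = true; h }.freeze

  ABSOLUTE_URL_PATTERN = /^[a-z]+:\/\/[a-z0-9]/i.freeze
  NON_PROTOCOL_PATTERN = /:\/?\/?(.*)/.freeze
  TRAILING_WHITESPACE = /[ \t]+$/.freeze
end

# Hash-as-set lookup: listed tags map to true, anything else to nil.
p ConstantsSketch::IGNORE_TAGS["script"] # => true
p ConstantsSketch::IGNORE_TAGS["p"]      # => nil

# Only strings of the form scheme://host count as absolute URLs.
p ConstantsSketch::ABSOLUTE_URL_PATTERN.match?("http://example.com") # => true
p ConstantsSketch::ABSOLUTE_URL_PATTERN.match?("/relative/path")     # => false

# The first capture group holds everything after the scheme separator.
p "http://example.com/a"[ConstantsSketch::NON_PROTOCOL_PATTERN, 1]   # => "example.com/a"

# Spaces and tabs at the end of each line are matched, newlines are not.
p "a  \nb\t\n".gsub(ConstantsSketch::TRAILING_WHITESPACE, "")        # => "a\nb\n"
```

Building the tag lists as frozen hashes rather than arrays keeps each membership test a constant-time lookup and prevents accidental mutation at runtime.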
Class Method Summary
- .plain_text(html) ⇒ Object
Convert some HTML into a plain text approximation.
Instance Method Summary
- #plain_text(html) ⇒ Object
Helper instance method for converting HTML into plain text.
Class Method Details
.plain_text(html) ⇒ Object
Convert some HTML into a plain text approximation.
    # File 'lib/html_to_plain_text.rb', line 44
    def plain_text(html)
      return nil if html.nil?
      return html.dup unless html =~ HTML_PATTERN
      body = Nokogiri::HTML::Document.parse(html).xpath(BODY_TAG_XPATH).first
      return unless body
      convert_node_to_plain_text(body).strip.gsub(CARRIDGE_RETURN_PATTERN, NEWLINE)
    end
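The guard clauses and the final newline normalization in the method above can be exercised without Nokogiri. The sketch below is an illustration under stated assumptions, not the library's implementation: the constants are copied from the summary, while GuardSketch, fast_path, and normalize_newlines are hypothetical names introduced here, and the :needs_parsing symbol stands in for the Nokogiri parsing step.

```ruby
# GuardSketch is a hypothetical namespace; constants are copied from the summary.
module GuardSketch
  HTML_PATTERN = /[<&]/.freeze
  CARRIDGE_RETURN_PATTERN = /\r(\n?)/.freeze
  NEWLINE = "\n".freeze

  # Mirrors the first two guard clauses: nil passes through, and text with no
  # "<" or "&" is returned as a copy. Anything else would go on to the parser.
  def self.fast_path(html)
    return nil if html.nil?
    return html.dup unless html =~ HTML_PATTERN
    :needs_parsing # placeholder for the parsing branch, which is not reproduced here
  end

  # Mirrors the last step: trim the result and turn "\r\n" or a lone "\r" into "\n".
  def self.normalize_newlines(text)
    text.strip.gsub(CARRIDGE_RETURN_PATTERN, NEWLINE)
  end
end

p GuardSketch.fast_path(nil)               # => nil
p GuardSketch.fast_path("just text")       # => "just text"
p GuardSketch.fast_path("Tom &amp; Jerry") # => :needs_parsing
p GuardSketch.normalize_newlines(" a\r\nb\rc ") # => "a\nb\nc"
```

Because plain input comes back as a dup, the caller can mutate the result without affecting the original string.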
Instance Method Details
#plain_text(html) ⇒ Object
Helper instance method for converting HTML into plain text. This method simply calls HtmlToPlainText.plain_text.
    # File 'lib/html_to_plain_text.rb', line 37
    def plain_text(html)
      HtmlToPlainText.plain_text(html)
    end
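The pairing of a module-level method with an instance-level helper that forwards to it lets a class mix the module in and call plain_text directly. A minimal sketch of that pattern follows; DelegationSketch and SketchMailer are hypothetical names, and the module-level method here is a crude tag-stripping stand-in, not the real conversion.

```ruby
# DelegationSketch is a hypothetical stand-in for the real module.
module DelegationSketch
  # Stand-in for the real conversion: removes anything that looks like a tag.
  def self.plain_text(html)
    return nil if html.nil?
    html.gsub(/<[^>]*>/, "")
  end

  # Instance-level helper that simply forwards to the module-level method.
  def plain_text(html)
    DelegationSketch.plain_text(html)
  end
end

# Hypothetical class that gains the helper by including the module.
class SketchMailer
  include DelegationSketch

  def text_part(html_body)
    plain_text(html_body)
  end
end

puts SketchMailer.new.text_part("<p>Hello <b>world</b></p>") # prints "Hello world"
```

With the real module, the equivalent would be to include HtmlToPlainText in a class, or to call HtmlToPlainText.plain_text directly when no mixin is wanted.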