Module: HtmlToPlainText
- Defined in:
- lib/html_to_plain_text.rb
Overview
The main method of this module, plain_text, converts a string of HTML to a plain text approximation.
Constant Summary
- IGNORE_TAGS =
%w(script style object applet iframe).inject({}){|h, t| h[t] = true; h}.freeze
- PARAGRAPH_TAGS =
%w(p h1 h2 h3 h4 h5 h6 table ol ul dl dd blockquote dialog figure aside section).inject({}){|h, t| h[t] = true; h}.freeze
- BLOCK_TAGS =
%w(div address li dt center del article header header footer nav pre legend tr).inject({}){|h, t| h[t] = true; h}.freeze
- WHITESPACE =
[" ", "\n", "\r"].freeze
- PLAINTEXT =
"plaintext".freeze
- PRE =
"pre".freeze
- BR =
"br".freeze
- HR =
"hr".freeze
- TD =
"td".freeze
- TH =
"th".freeze
- TR =
"tr".freeze
- OL =
"ol".freeze
- UL =
"ul".freeze
- LI =
"li".freeze
- NUMBERS =
["1", "a"].freeze
- ABSOLUTE_URL_PATTERN =
/^[a-z]+:\/\/[a-z0-9]/i.freeze
- HTML_PATTERN =
/[<&]/.freeze
- TRAILING_WHITESPACE =
/[ \t]+$/.freeze
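The tag constants are hashes used as lookup sets, and the pattern constants are ordinary regular expressions. The sketch below copies three of the values listed above into local variables (the lowercase names are illustrative, not part of the library) to show how each behaves on its own. How the module uses them internally is not shown on this page, so treat this only as a demonstration of the values themselves.

```ruby
# Values copied from the constant summary; the local variable names are illustrative.
ignore_tags = %w(script style object applet iframe).inject({}) { |h, t| h[t] = true; h }.freeze
absolute_url_pattern = /^[a-z]+:\/\/[a-z0-9]/i
trailing_whitespace = /[ \t]+$/

# Hash lookup acts as a set membership test: a missing key returns nil (falsy).
ignored = ignore_tags["script"]
kept    = ignore_tags["p"]

# Only strings that start with a scheme followed by "://" count as absolute.
absolute = "https://example.com/a".match?(absolute_url_pattern)
relative = "/docs/index.html".match?(absolute_url_pattern)

# In Ruby, $ matches at the end of every line, so gsub cleans each line.
cleaned = "one  \ntwo\t\n".gsub(trailing_whitespace, "")

puts ignored.inspect, kept.inspect, absolute, relative, cleaned.inspect
```

Building the hash once and freezing it keeps membership checks cheap and prevents accidental mutation of the shared constant.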
Class Method Summary
- .plain_text(html) ⇒ Object
Convert some HTML into a plain text approximation.
Instance Method Summary
- #plain_text(html) ⇒ Object
Helper instance method for converting HTML into plain text.
Class Method Details
.plain_text(html) ⇒ Object
Convert some HTML into a plain text approximation.
    # File 'lib/html_to_plain_text.rb', line 31
    def plain_text(html)
      return nil if html.nil?
      return html.dup unless html.match(HTML_PATTERN)
      body = Nokogiri::HTML::Document.parse(html).css("body").first
      return unless body
      convert_node_to_plain_text(body).strip.gsub(/\r(\n?)/, "\n")
    end
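The source above returns nil for nil input, returns a copy of the input when it contains neither "<" nor "&", and otherwise parses the HTML and normalizes line endings in the result. The following sketch isolates the guard and the final cleanup step so they can be run without Nokogiri. The helper names needs_parsing? and normalize_output are hypothetical and do not exist in the library; the node conversion itself is deliberately left out.

```ruby
# Hypothetical helper: true when the input would reach the HTML parser,
# i.e. it is not nil and contains "<" or "&" (the same test as HTML_PATTERN).
def needs_parsing?(text, pattern = /[<&]/)
  !text.nil? && text.match?(pattern)
end

# Hypothetical helper: the cleanup applied to the converted text.
# Strips surrounding whitespace, then turns "\r\n" and bare "\r" into "\n".
def normalize_output(text)
  text.strip.gsub(/\r(\n?)/, "\n")
end

plain   = needs_parsing?("just text")
entity  = needs_parsing?("Fish &amp; chips")
missing = needs_parsing?(nil)
tidy    = normalize_output("  line one\r\nline two\rline three \n")

puts plain, entity, missing, tidy.inspect
```

Skipping the parser for strings with no markup or entities avoids the cost of building a document for input that is already plain text.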
Instance Method Details
#plain_text(html) ⇒ Object
Helper instance method for converting HTML into plain text. This method simply calls HtmlToPlainText.plain_text.
    # File 'lib/html_to_plain_text.rb', line 25
    def plain_text(html)
      HtmlToPlainText.plain_text(html)
    end
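Because the instance method only delegates to the module-level method, including the module gives a class a plain_text helper. The sketch below shows that delegation pattern with a stand-in module so it runs without the gem. PlainTextDemo, Mailer, and text_part are all hypothetical names, and the tag-stripping body is a crude placeholder, not the library's conversion logic.

```ruby
# Hypothetical stand-in for the real module. The module-level body is a
# placeholder that just drops anything between "<" and ">".
module PlainTextDemo
  def self.plain_text(html)
    html.nil? ? nil : html.gsub(/<[^>]*>/, "")
  end

  # Instance-level helper that delegates to the module-level method,
  # mirroring the shape of the instance method documented above.
  def plain_text(html)
    PlainTextDemo.plain_text(html)
  end
end

# Hypothetical consumer: including the module makes plain_text available
# as an instance method.
class Mailer
  include PlainTextDemo

  def text_part(html)
    plain_text(html)
  end
end

result = Mailer.new.text_part("<p>Hello <b>world</b></p>")
puts result
```

With the real library, the equivalent calls would presumably be HtmlToPlainText.plain_text(html) directly, or plain_text(html) from a class that includes HtmlToPlainText.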