Class: EpubTools::XHTMLCleaner
- Inherits: Object
  - Object
    - EpubTools::XHTMLCleaner
- Defined in: lib/epub_tools/xhtml_cleaner.rb
Overview
Google Docs produces messy EPUBs: instead of proper tag names, its HTML relies on classes for everything. This class does the following to clean up the invalid XHTML:
- Removes any <br /> or <hr /> tags.
- Removes empty <p> tags.
- Using the class_config, it removes <span> tags that are used for bold or italics and replaces them with <b> or <i> tags.
- Unwraps any <span> tags that have no classes assigned.
- Outputs everything to a cleanly formatted .xhtml file.
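The steps above can be sketched on a small fragment. This is only an illustration under stated assumptions: the helper methods this class uses internally are not shown on this page, so the sketch substitutes REXML from Ruby's standard distribution, and the method name `clean_fragment` and the class names `c3` and `c5` are hypothetical. It returns a string rather than writing a pretty-printed file.

```ruby
require 'rexml/document'

# Hypothetical class names standing in for whatever class_config maps to
# bold and italics in a real Google Docs export.
BOLD_CLASSES = %w[c3].freeze
ITALIC_CLASSES = %w[c5].freeze

# Applies the cleaning steps from the overview to an XHTML string.
# Not the library's implementation; a sketch using REXML.
def clean_fragment(xhtml, bold: BOLD_CLASSES, italic: ITALIC_CLASSES)
  doc = REXML::Document.new(xhtml)

  # 1. Remove <br/> and <hr/> elements.
  %w[br hr].each do |name|
    REXML::XPath.match(doc, "//#{name}").each(&:remove)
  end

  # 2. Remove <p> elements with no child elements and only whitespace text.
  REXML::XPath.match(doc, '//p').each do |para|
    text = para.texts.map(&:value).join
    para.remove if para.elements.empty? && text.strip.empty?
  end

  REXML::XPath.match(doc, '//span').each do |span|
    classes = span.attributes['class'].to_s.split

    if (classes & bold).any?
      # 3a. Bold span becomes <b>.
      span.name = 'b'
      span.delete_attribute('class')
    elsif (classes & italic).any?
      # 3b. Italic span becomes <i>.
      span.name = 'i'
      span.delete_attribute('class')
    elsif classes.empty?
      # 4. Unwrap a class-less span: move its children up, then drop it.
      parent = span.parent
      span.children.each { |child| parent.insert_before(span, child) }
      span.remove
    end
  end

  doc.to_s
end
```

Spans whose classes match neither list are left untouched in this sketch, which follows the overview's wording ("no classes assigned") rather than any confirmed behaviour of the library.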
Instance Method Summary
- #initialize(options = {}) ⇒ XHTMLCleaner (constructor)
  Initializes the class.
- #run ⇒ String
  Runs the cleaner.
Constructor Details
#initialize(options = {}) ⇒ XHTMLCleaner
Initializes the class. The :filename option is required; :class_config defaults to 'text_style_classes.yaml'.

    # File 'lib/epub_tools/xhtml_cleaner.rb', line 24
    def initialize(options = {})
      @filename = options.fetch(:filename)
      class_config = options[:class_config] || 'text_style_classes.yaml'
      @classes = YAML.load_file(class_config).transform_keys(&:to_sym)
    end
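To illustrate how the constructor treats its options, here is a minimal, self-contained sketch that mirrors its three lines outside the class. The layout of the YAML file is an assumption: the `bold` and `italic` keys and the class names `c3`, `c5`, `c7` are hypothetical, since this page does not show the contents of `text_style_classes.yaml`. The helper names `load_options` and `demo_load` are likewise invented for the example.

```ruby
require 'yaml'
require 'tempfile'

# Hypothetical config layout; the real file's keys and values may differ.
SAMPLE_CONFIG = <<~CONFIG
  bold:
    - c3
  italic:
    - c5
    - c7
CONFIG

# Mirrors the constructor's option handling:
# - Hash#fetch raises KeyError when :filename is missing.
# - :class_config falls back to a default path.
# - Top-level string keys from YAML are converted to symbols.
def load_options(options = {})
  filename = options.fetch(:filename)
  class_config = options[:class_config] || 'text_style_classes.yaml'
  classes = YAML.load_file(class_config).transform_keys(&:to_sym)
  [filename, classes]
end

# Writes the sample config to a temporary file and loads it back.
def demo_load
  Tempfile.create(['text_style_classes', '.yaml']) do |file|
    file.write(SAMPLE_CONFIG)
    file.flush
    load_options(filename: 'chapter1.xhtml', class_config: file.path)
  end
end
```

Note that `transform_keys` only converts the top-level keys; any nested hashes in the config would keep their string keys.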
Instance Method Details
#run ⇒ String
Runs the cleaner and returns the filename.

    # File 'lib/epub_tools/xhtml_cleaner.rb', line 32
    def run
      raw_content = read_and_strip_problematic_hr
      doc = parse_xml(raw_content)
      remove_empty_paragraphs(doc)
      remove_bold_spans(doc)
      replace_italic_spans(doc)
      unwrap_remaining_spans(doc)
      write_pretty_output(doc)
      @filename
    end