Class: EpubTools::XHTMLCleaner

Inherits:

Object

Defined in: lib/epub_tools/xhtml_cleaner.rb

Overview

Google Docs exports messy EPUBs: its HTML lacks proper tag names and relies on classes for everything. This class does the following to clean the invalid XHTML:

Removes any <br /> or <hr /> tags.
Removes empty <p> tags.
Using the class_config, it replaces <span> tags that are used for bold or italics with <b> or <i> tags.
Unwraps any <span> tags that have no classes assigned.
Outputs everything to a cleanly formatted .xhtml
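
The first two cleanup steps can be sketched with plain regular expressions. This is a minimal illustration, not the class's implementation: `strip_noise` is a hypothetical helper, and treating `<br />` as one of the removed tags and `<p>` as the empty tag being dropped is an assumption.

```ruby
# Hypothetical helper, not part of EpubTools: strips <br>/<hr> tags and
# empty <p> elements from an XHTML string.
def strip_noise(xhtml)
  xhtml
    .gsub(%r{<(?:br|hr)\b[^>]*>}i, '') # matches <br>, <br />, <hr class="x"/>
    .gsub(%r{<p\b[^>]*>\s*</p>}i, '')  # matches <p></p> and whitespace-only <p>
end

sample = '<p class="c1">Hello<br /></p><hr /><p class="c2"> </p>'
puts strip_noise(sample)
```

Regular expressions are enough for these flat cases; the span replacement and unwrapping steps depend on nesting and are better handled on a parsed document.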

Instance Method Summary

#initialize(options = {}) ⇒ XHTMLCleaner constructor

Initializes the class.
#run ⇒ String

Runs the cleaner.

Constructor Details

#initialize(options = {}) ⇒ `XHTMLCleaner`

Initializes the cleaner with the file to clean and the style-class configuration.

Parameters:

options (Hash) (defaults to: {}) —

Configuration options

Options Hash (options):

:filename (String) —

Path to the XHTML file to clean (required)
:class_config (String) —

Path to a YAML file containing the bold and italic classes to check (default: 'text_style_classes.yaml')
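
The "required" and "default" behavior of these options follows from how the constructor reads them: `Hash#fetch` for `:filename` and `||` for `:class_config`. The sketch below mirrors that handling in a stand-in method; `resolve_options` and the file names are hypothetical and exist only for illustration.

```ruby
# Stand-in that mirrors the constructor's option handling; not the real class.
def resolve_options(options = {})
  filename = options.fetch(:filename) # raises KeyError when :filename is absent
  class_config = options[:class_config] || 'text_style_classes.yaml'
  [filename, class_config]
end

# Only :filename given, so the default config path is used.
p resolve_options(filename: 'chapter1.xhtml')

# Omitting :filename fails immediately instead of later during #run.
begin
  resolve_options(class_config: 'styles.yaml')
rescue KeyError => e
  puts "missing option: #{e.message}"
end
```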

# File 'lib/epub_tools/xhtml_cleaner.rb', line 24

def initialize(options = {})
  @filename = options.fetch(:filename)
  class_config = options[:class_config] || 'text_style_classes.yaml'
  @classes = YAML.load_file(class_config).transform_keys(&:to_sym)
end
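
To illustrate what the constructor leaves in `@classes`, here is a small sketch that parses a config and symbolizes its top-level keys the same way. The key names (`bolds`, `italics`) and class names (`c5`, `c7`, `c9`) are assumptions for illustration; the source does not show the actual file layout. The sketch parses a string with `YAML.safe_load` so it runs without a file on disk, whereas the constructor uses `YAML.load_file`.

```ruby
require 'yaml'

# Hypothetical config contents; key names and class names are assumptions.
config_text = <<~CONFIG
  bolds:
    - c5
  italics:
    - c7
    - c9
CONFIG

# Parse the YAML, then convert the string keys to symbols as the
# constructor does with transform_keys(&:to_sym).
classes = YAML.safe_load(config_text).transform_keys(&:to_sym)

puts classes[:bolds].inspect
puts classes[:italics].inspect
```

Only the top-level keys are converted; the class names themselves stay as strings.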

Instance Method Details

#run ⇒ `String`

Runs the cleaner.

Returns:

(String) —

Path to the cleaned file

# File 'lib/epub_tools/xhtml_cleaner.rb', line 32

def run
  raw_content = read_and_strip_problematic_hr
  doc = parse_xml(raw_content)
  remove_empty_paragraphs(doc)
  remove_bold_spans(doc)
  replace_italic_spans(doc)
  unwrap_remaining_spans(doc)
  write_pretty_output(doc)
  @filename
end