Class: EpubTools::XHTMLCleaner

Inherits:
Object
Defined in:
lib/epub_tools/xhtml_cleaner.rb

Overview

Google Docs exports EPUBs whose HTML lacks proper semantic tags and relies on classes for all styling. This class cleans up that invalid XHTML by doing the following:

  • Removes any <br /> or <hr /> tags.

  • Removes empty <p> tags.

  • Using the class_config, replaces <span> tags that are used for bold or italics with <b> or <i> tags.

  • Unwraps any <span> tags that have no classes assigned.

  • Writes the result to a cleanly formatted .xhtml file.
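
The steps above can be sketched in a few lines of Ruby. This is only an illustration under stated assumptions, not the class's actual implementation: it uses REXML as a stand-in parser, the `clean_fragment` helper and the `BOLD_CLASSES` / `ITALIC_CLASSES` lists are hypothetical (the real class names come from the class_config file), and the sample fragment deliberately has no XML namespace.

```ruby
require 'rexml/document'

# Hypothetical class names; in the real cleaner these come from class_config.
BOLD_CLASSES = %w[c1].freeze
ITALIC_CLASSES = %w[c2].freeze

# Illustrative helper (not part of EpubTools): cleans a namespace-free fragment.
def clean_fragment(xhtml)
  doc = REXML::Document.new(xhtml)

  # Drop <br /> and <hr /> elements.
  %w[br hr].each do |tag|
    REXML::XPath.match(doc, "//#{tag}").each(&:remove)
  end

  # Remove paragraphs with no child elements and only whitespace text.
  REXML::XPath.match(doc, '//p').each do |para|
    blank = para.texts.map(&:value).join.strip.empty?
    para.remove if blank && !para.has_elements?
  end

  # Rewrite styled spans and unwrap spans that carry no class.
  REXML::XPath.match(doc, '//span').each do |span|
    classes = span.attributes['class'].to_s.split
    if (classes & BOLD_CLASSES).any?
      span.name = 'b'
      span.delete_attribute('class')
    elsif (classes & ITALIC_CLASSES).any?
      span.name = 'i'
      span.delete_attribute('class')
    elsif classes.empty?
      # Move the children up to the parent, then drop the now-empty span.
      span.children.each { |child| span.parent.insert_before(span, child) }
      span.remove
    end
  end

  doc.to_s
end

sample = '<div><p><span class="c1">Bold</span>, <span class="c2">italic</span> ' \
         'and <span>plain</span><br/></p><p> </p><hr/></div>'
puts clean_fragment(sample)
```

For the sample fragment, the styled spans become <b> and <i> elements, the class-less span is unwrapped, and the <br />, <hr /> and whitespace-only <p> are dropped.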

Instance Method Summary

Constructor Details

#initialize(options = {}) ⇒ XHTMLCleaner

Initializes a new cleaner with the file to clean and the style-class configuration.

Parameters:

  • options (Hash) (defaults to: {})

    Configuration options

Options Hash (options):

  • :filename (String)

    Path to the XHTML file to clean (required)

  • :class_config (String)

    Path to a YAML file containing the bold and italic classes to check (default: 'text_style_classes.yaml')
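
The difference between the required and the defaulted option can be illustrated with a small stand-alone sketch. The `resolve_options` helper and the file names `chapter1.xhtml` and `styles.yaml` are hypothetical; the helper only mirrors the `fetch` / `||` pattern the constructor uses and is not part of EpubTools.

```ruby
# Hypothetical helper mirroring how the constructor reads its options.
def resolve_options(options = {})
  {
    # Hash#fetch raises KeyError when :filename is absent.
    filename: options.fetch(:filename),
    # Falls back to the documented default when :class_config is nil or missing.
    class_config: options[:class_config] || 'text_style_classes.yaml'
  }
end

resolved = resolve_options(filename: 'chapter1.xhtml')
puts resolved[:class_config]

begin
  resolve_options(class_config: 'styles.yaml')
rescue KeyError => e
  puts "Missing option: #{e.message}"
end
```

Using `fetch` for the required key means a missing :filename fails immediately at construction time rather than later when the file is read.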



# File 'lib/epub_tools/xhtml_cleaner.rb', line 24

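def initialize(options = {})
  # :filename is required; fetch raises KeyError when it is missing.
  @filename = options.fetch(:filename)
  class_config = options[:class_config] || 'text_style_classes.yaml'
  # Load the style classes and convert the top-level keys to symbols.
  @classes = YAML.load_file(class_config).transform_keys(&:to_sym)
end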
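
As a rough illustration of the loading step, the sketch below writes a sample config to a temporary file and loads it the same way the constructor does. The key names `bold_classes` and `italic_classes` and the class names are assumptions made for this example; the source does not show the actual layout of text_style_classes.yaml. The `load_style_classes` helper is likewise hypothetical.

```ruby
require 'yaml'
require 'tempfile'

# Hypothetical config contents; the key names below are assumptions.
SAMPLE_CONFIG = <<~CONFIG
  bold_classes:
    - c1
  italic_classes:
    - c2
    - c7
CONFIG

# Same loading step as the constructor: parse the YAML file,
# then symbolize the top-level keys only.
def load_style_classes(path)
  YAML.load_file(path).transform_keys(&:to_sym)
end

Tempfile.create(['text_style_classes', '.yaml']) do |file|
  file.write(SAMPLE_CONFIG)
  file.flush
  classes = load_style_classes(file.path)
  puts classes[:bold_classes].inspect
  puts classes[:italic_classes].inspect
end
```

Note that `transform_keys` only touches the top level, so any nested hashes in the config would keep their string keys.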

Instance Method Details

#run ⇒ String

Runs the cleaner

Returns:

  • (String)

    Path to the cleaned file (the same value passed as :filename)



# File 'lib/epub_tools/xhtml_cleaner.rb', line 32

def run
  raw_content = read_and_strip_problematic_hr
  doc = parse_xml(raw_content)
  remove_empty_paragraphs(doc)
  remove_bold_spans(doc)
  replace_italic_spans(doc)
  unwrap_remaining_spans(doc)
  write_pretty_output(doc)
  @filename
end