
Aggregate Contents From the Web

Since version 0.2.1, EPUB Parser can parse unpacked (unzipped) EPUB files on the web and aggregate the contents of those books.

Let's get the contents of the pretty comic Page Blanche from IDPF's GitHub repository: https://github.com/IDPF/epub3-samples/tree/master/30/page-blanche

We can treat the URI https://raw.githubusercontent.com/IDPF/epub3-samples/master/30/page-blanche/ as the root directory of the book, because the EPUB Open Container Format's container.xml file is available at https://raw.githubusercontent.com/IDPF/epub3-samples/master/30/page-blanche/META-INF/container.xml.

Note: Don't forget the slash at the end of the URI.
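To see why the trailing slash matters, here is a small sketch using only Ruby's standard `uri` library. It shows how a relative path such as META-INF/container.xml resolves against the root URI with and without the slash. The helper name `resolve_entry` is made up for illustration, and whether EPUB Parser resolves paths exactly this way internally is an assumption; the sketch only demonstrates standard URI resolution.

```ruby
require 'uri'

# Hypothetical helper: resolve a path inside the book against the root URI
# using standard relative-reference resolution.
def resolve_entry(root_uri, entry_name)
  URI.join(root_uri, entry_name).to_s
end

with_slash    = 'https://raw.githubusercontent.com/IDPF/epub3-samples/master/30/page-blanche/'
without_slash = 'https://raw.githubusercontent.com/IDPF/epub3-samples/master/30/page-blanche'

# With the slash, the last path segment is kept as a directory:
puts resolve_entry(with_slash, 'META-INF/container.xml')
# => https://raw.githubusercontent.com/IDPF/epub3-samples/master/30/page-blanche/META-INF/container.xml

# Without the slash, the last segment ("page-blanche") is replaced:
puts resolve_entry(without_slash, 'META-INF/container.xml')
# => https://raw.githubusercontent.com/IDPF/epub3-samples/master/30/META-INF/container.xml
```

Without the slash, the resolved URI points outside the book, so container.xml would not be found there.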

EPUB Parser can treat the URI as the path to an EPUB book file and parse its contents by using EPUB::OCF::PhysicalContainer::UnpackedURI:

require 'epub/parser'

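# Root URI of the unpacked book; the trailing slash is required
uri = 'https://raw.githubusercontent.com/IDPF/epub3-samples/master/30/page-blanche/'
# Select the UnpackedURI container adapter so the book is read from the web
epub = EPUB::Parser.parse(uri, container_adapter: :UnpackedURI)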

The trick is to set the container adapter to :UnpackedURI, which makes it possible to parse an EPUB book from the web. Now we can work with EPUB books as usual!

As an example, let's run a script that downloads all the files of a specified EPUB book to a local directory (the source code is available in the repository as examples/aggregate-contents-from-web.rb).

Execution:

$ ruby examples/aggregate-contents-from-web.rb https://raw.githubusercontent.com/IDPF/epub3-samples/master/30/page-blanche/
Started downloading EPUB contents...
  from: https://raw.githubusercontent.com/IDPF/epub3-samples/master/30/page-blanche/
  to: /tmp/epub-parser20150703-13148-ghdtfq
Making mimetype file...
Downloading META-INF/container.xml ...
Downloading EPUB/package.opf ...
Downloading EPUB/Style/style.css ...
Downloading EPUB/Navigation/nav.xhtml ...
Downloading EPUB/Navigation/toc.ncx ...
Downloading EPUB/Content/cover.xhtml ...
Downloading EPUB/Content/PageBlanche_Page_000.xhtml ...
Downloading EPUB/Content/PageBlanche_Page_001.xhtml ...
Downloading EPUB/Content/PageBlanche_Page_002.xhtml ...
Downloading EPUB/Content/PageBlanche_Page_003.xhtml ...
Downloading EPUB/Content/PageBlanche_Page_004.xhtml ...
Downloading EPUB/Content/PageBlanche_Page_005.xhtml ...
Downloading EPUB/Content/PageBlanche_Page_006.xhtml ...
Downloading EPUB/Content/PageBlanche_Page_007.xhtml ...
Downloading EPUB/Content/PageBlanche_Page_008.xhtml ...
Downloading EPUB/Image/cover.jpg ...
Downloading EPUB/Image/PageBlanche_Page_001.jpg ...
Downloading EPUB/Image/PageBlanche_Page_002.jpg ...
Downloading EPUB/Image/PageBlanche_Page_003.jpg ...
Downloading EPUB/Image/PageBlanche_Page_004.jpg ...
Downloading EPUB/Image/PageBlanche_Page_005.jpg ...
Downloading EPUB/Image/PageBlanche_Page_006.jpg ...
Downloading EPUB/Image/PageBlanche_Page_007.jpg ...
Downloading EPUB/Image/PageBlanche_Page_008.jpg ...
/tmp/epub-parser20150703-13148-ghdtfq
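The steps visible in this log (write a mimetype file, then download each entry to the same relative path under a local directory) can be sketched roughly as follows. This is not the repository's script, only a minimal illustration of the same idea. The name `aggregate_contents` and the `fetcher` parameter are hypothetical; the list of entry names is assumed to come from the parsed book, and the fetcher is injected so the sketch can be tried without network access.

```ruby
require 'fileutils'
require 'pathname'
require 'uri'

# Hypothetical helper, not part of EPUB Parser.
# root_uri:    root URI of the unpacked book (with trailing slash)
# entry_names: paths relative to the root, e.g. "META-INF/container.xml"
# dest_dir:    an existing local directory to write into
# fetcher:     any callable that takes a URI string and returns the body
def aggregate_contents(root_uri, entry_names, dest_dir, fetcher)
  dest = Pathname(dest_dir)

  # An EPUB container holds a mimetype file with exactly this content.
  dest.join('mimetype').write('application/epub+zip')

  entry_names.each do |entry_name|
    target = dest.join(entry_name)
    # Recreate the book's directory layout locally before writing.
    FileUtils.mkdir_p(target.dirname.to_s)
    target.binwrite(fetcher.call(URI.join(root_uri, entry_name).to_s))
  end

  dest.to_s
end

# Possible usage with a real HTTP fetcher (needs network access):
#
#   require 'open-uri'
#   require 'tmpdir'
#   fetch = ->(uri) { URI.parse(uri).open.read }
#   dir = Dir.mktmpdir('epub-parser')
#   puts aggregate_contents(root, ['META-INF/container.xml'], dir, fetch)
```

Passing the fetcher as an argument keeps the file layout logic separate from the HTTP details, so either part can be swapped or tested on its own.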

The last line of the output is the path to the directory the contents were downloaded to. We can repackage it as an EPUB file. Let's use the epzip utility to do that easily:

$ epzip /tmp/epub-parser20150703-13148-ghdtfq ./page-blanche.epub

Command-line tools

The command-line tools epubinfo and epub-open can also handle URIs as EPUB books.