pismo - Web page content analysis and metadata extraction

DESCRIPTION:

Pismo extracts machine-usable metadata from unstructured (or poorly structured) English-language HTML documents. The data Pismo can extract includes titles, feed URLs, ledes, body text, image URLs, dates, and keywords. Pismo is used heavily in production on http://coder.io/ to extract data from Web pages.

All tests pass on Ruby 1.8.7 (MRI) and Ruby 1.9.1-p378 (MRI).

USAGE:

A basic example of extracting metadata from a Web page:

require 'pismo'

# Load a Web page (you could pass an IO object or a string with existing HTML data along, as you prefer)
doc = Pismo::Document.new('http://www.rubyinside.com/cramp-asychronous-event-driven-ruby-web-app-framework-2928.html')

doc.title     # => "Cramp: Asychronous Event-Driven Ruby Web App Framework"
doc.author    # => "Peter Cooper"
doc.lede      # => "Cramp (GitHub repo) is a new, asynchronous evented Web app framework by Pratik Naik of 37signals (and the Rails core team). It's built around Ruby's EventMachine library and was designed to use event-driven I/O throughout - making it ideal for situations where you need to handle a large number of open connections (such as Comet systems or streaming APIs.)"
doc.keywords  # => [["cramp", 7], ["controllers", 3], ["app", 3], ["basic", 2], ..., ... ]

There's also a shorter "convenience" method which might be handy in IRB - it does the same as Pismo::Document.new:

Pismo['http://www.rubyflow.com/items/4082'].title   # => "Install Ruby as a non-root User"
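The keywords result in the first example is an array of [word, count] pairs, so ordinary Array and Hash methods are enough to work with it. A minimal sketch, using the sample values shown above rather than a live page; frequent_keywords is a hypothetical helper for illustration and is not part of Pismo:

```ruby
# Sample data copied from the doc.keywords output above (which is truncated there).
keywords = [["cramp", 7], ["controllers", 3], ["app", 3], ["basic", 2]]

# Hypothetical helper (not part of Pismo): return the words that occur
# at least min_count times, preserving the original order.
def frequent_keywords(pairs, min_count)
  pairs.select { |_word, count| count >= min_count }.map { |word, _count| word }
end

# Turn the pairs into a Hash for lookup by word.
counts = Hash[keywords]

frequent = frequent_keywords(keywords, 3)
```

With a real document, keywords would be replaced by doc.keywords, assuming it keeps the shape shown in the example.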

The current metadata methods are:

  • title
  • titles
  • author
  • authors
  • lede
  • keywords
  • sentences(qty)
  • body
  • html_body
  • feed
  • feeds
  • favicon
  • description
  • datetime

These methods are not fully documented here yet - you'll just need to try them out. The plural methods like #titles, #authors, and #feeds will return multiple matches in an array, if present. This is so you can use your own techniques to choose a "best" result in ambiguous cases.
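One possible way to choose a "best" result from a plural method is to prefer the candidate that shows up most often. This is only a sketch of one heuristic, not how Pismo itself decides; pick_best is a hypothetical helper, and it assumes the plural methods return an array of strings (possibly with nil or blank entries):

```ruby
# Hypothetical helper (not part of Pismo): return the candidate that
# appears most often. Ties go to the candidate that appears first.
def pick_best(candidates)
  cleaned = candidates.compact.map(&:strip).reject(&:empty?)
  return nil if cleaned.empty?

  counts = Hash.new(0)
  cleaned.each { |candidate| counts[candidate] += 1 }

  # Compare by frequency first, then by earliest position.
  cleaned.max_by { |candidate| [counts[candidate], -cleaned.index(candidate)] }
end

# Stand-in data; with a real document this might be pick_best(doc.titles).
sample_titles = [
  "Install Ruby",
  "Install Ruby as a non-root User",
  "Install Ruby as a non-root User "
]
best = pick_best(sample_titles)
```

Other heuristics, such as preferring the longest or shortest candidate, would slot into the same helper.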

The html_body and body methods will be of particular interest. They return the "body" of the page as determined by Pismo's "Reader" algorithm (similar to Arc90's Readability or Safari Reader). #body returns it as plain text, while #html_body retains some basic HTML styling.

CAVEATS AND SHORTCOMINGS:

There are some shortcomings or problems that I'm aware of and am going to pursue:

  • I do not know how Pismo fares on JRuby, Rubinius, or others yet.
  • The "Reader" content extraction algorithm is not perfect. It can sometimes return crap and can barf on certain types of characters for sentence extraction.
  • The author name extraction is quite poor.
  • The image extraction only handles images with absolute URLs.
  • The stopword list leaves a bit to be desired. It errs on the side of being too long rather than too short, though (1024 words long!)

OTHER GROOVY STUFF:

Command Line Tool

A command line tool called "pismo" is included so that you can get metadata about a page from the command line. This is great for testing, or perhaps for calling it from a non-Ruby script. The output is currently in YAML.

Usage:

./bin/pismo http://www.rubyinside.com/cramp-asychronous-event-driven-ruby-web-app-framework-2928.html title lede author datetime

Output:

--- 
:url: http://www.rubyinside.com/cramp-asychronous-event-driven-ruby-web-app-framework-2928.html
:title: "Cramp: Asychronous Event-Driven Ruby Web App Framework"
:lede: Cramp (GitHub repo)is a new, asynchronous evented Web app framework by Pratik Naik of 37signals
:author: Peter Cooper
:datetime: 2010-01-07 12:00:00 +00:00

If you call pismo without any arguments (except a URL), it starts an IRB session so you can directly work in Ruby. The URL provided is loaded and assigned to both the constant 'P' and the variable @p.
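Because the output is YAML, another Ruby process can read it back with the standard library. A minimal sketch, assuming output shaped like the sample above, with keys written as Ruby symbols. The text below is an abbreviated copy of that sample, and the pismo command itself is not run here:

```ruby
require 'yaml'

# Abbreviated copy of the sample output shown above.
output = <<~YAML
  ---
  :url: http://www.rubyinside.com/cramp-asychronous-event-driven-ruby-web-app-framework-2928.html
  :title: "Cramp: Asychronous Event-Driven Ruby Web App Framework"
  :author: Peter Cooper
YAML

# The keys are symbols, so Symbol has to be permitted explicitly.
# (The permitted_classes keyword needs a reasonably recent Ruby.)
data = YAML.safe_load(output, permitted_classes: [Symbol])

title  = data[:title]
author = data[:author]
```

If the :datetime field is requested as well, the parser may need Time added to permitted_classes.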

Stopword access

You can access Pismo's stopword list directly:

Pismo.stopwords    # => [.., .., ..]
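One use for the list is filtering common words out of your own text. A minimal sketch, assuming the list is an array of lowercase strings; a tiny stand-in list is used here so the example runs without Pismo, and remove_stopwords is a hypothetical helper:

```ruby
# Stand-in list for illustration; in real use this would be Pismo.stopwords.
stopwords = %w[a the is of and]

# Hypothetical helper (not part of Pismo): split text into lowercase
# words and drop any that appear in the stopword list.
def remove_stopwords(text, stopwords)
  text.downcase.scan(/[a-z']+/).reject { |word| stopwords.include?(word) }
end

remaining = remove_stopwords("Cramp is a new framework", stopwords)
```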

Note on Patches/Pull Requests

  • Fork the project.
  • Make your feature addition or bug fix.
  • Add tests for it. This is important so I don't break it in a future version unintentionally.
  • Commit, but do not mess with the Rakefile, version, or history, as these are handled by Jeweler (which is awesome, btw).
  • Send me a pull request. I may or may not accept it (sorry, practicality rules, but message me and we can talk!)

Apache 2.0 License - See LICENSE for details. Copyright (c) 2009, 2010 Peter Cooper

In short, you can use Pismo for whatever you like, commercial or not, but please include a brief credit (as in the NOTICE file, as per the Apache 2.0 License) somewhere deep in your license file or similar. If you're nice and have the time, let me know if you're using it and/or share any significant changes or improvements you make.

http://github.com/peterc/pismo