Content Focus
This is a small gem that takes raw HTML as input and extracts the most relevant piece of content. This is useful, for example, when doing semantic analysis on HTML pages.
Right now, ContentFocus only supports ‘permanent content extraction’, meaning the content on a page that does not change over time, for example:
- About section
- Author information
- Article body
- Generic information block
The algorithm combines several methods to identify this content and tries to ignore irrelevant pieces of the page (navigation, styling, etc.).
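The gem's own heuristics are not described in this README. As a rough illustration of the general idea only, here is a minimal sketch of one common approach: score each block by how much plain text it holds and how little of that text sits inside links, so navigation bars score low. Every name in it (BLOCK_RE, strip_tags, link_text_length, score_block, most_relevant_block) is hypothetical and not part of ContentFocus, and it uses regular expressions from the standard library instead of an HTML parser.

```ruby
# Hypothetical sketch, NOT ContentFocus's actual algorithm.
# Scores candidate blocks by text length, discounted by link density.

# Matches simple block elements. Assumption: blocks of the same tag are
# not nested; a real implementation would walk a parsed HTML tree instead.
BLOCK_RE = %r{<(div|p|td|article|section)\b[^>]*>(.*?)</\1>}mi

# Removes tags and collapses whitespace, leaving plain text.
def strip_tags(fragment)
  fragment.gsub(/<[^>]+>/, ' ').gsub(/\s+/, ' ').strip
end

# Total length of the text that appears inside <a> elements.
def link_text_length(fragment)
  fragment.scan(%r{<a\b[^>]*>(.*?)</a>}mi).flatten.sum { |t| strip_tags(t).length }
end

# Longer text scores higher; text made up mostly of links scores lower.
def score_block(fragment)
  text = strip_tags(fragment)
  return 0.0 if text.empty?

  link_density = [link_text_length(fragment).to_f / text.length, 1.0].min
  text.length * (1.0 - link_density)
end

# Returns the inner HTML of the highest-scoring block, or nil if none match.
def most_relevant_block(html)
  blocks = html.scan(BLOCK_RE).map { |_tag, inner| inner }
  blocks.max_by { |block| score_block(block) }
end

sample_html = <<~HTML
  <div id="nav"><a href="/">Home</a> <a href="/about">About</a></div>
  <p>This paragraph is the article body and holds most of the readable text.</p>
  <div id="footer"><a href="/terms">Terms</a></div>
HTML

best = most_relevant_block(sample_html)
puts strip_tags(best)
```

In this sample the navigation and footer blocks consist almost entirely of link text, so the paragraph wins. Real pages need more signals than this (nesting, markup classes, punctuation), which is presumably why the gem combines several methods.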
Example
require 'rubygems'
require 'content_focus'

# html_data is a String of raw HTML; 'page.html' is a placeholder path
html_data = File.read('page.html')
content_focus = ContentFocus::HTML.new(html_data)

# Returns the most relevant content as plain text
static_text = content_focus.static_text

# Returns the most relevant block of content as an Hpricot HTML tree element
static_fragment = content_focus.static_fragment
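Once the text is extracted, it can feed into whatever analysis comes next. The following is a small sketch of one possible follow-up step, counting the most frequent terms. The helper top_terms and the STOPWORDS list are hypothetical and not part of the gem, and a literal string stands in for the value that static_text would return.

```ruby
# Hypothetical helper, not part of ContentFocus: ranks the most frequent
# words in a piece of extracted text.
STOPWORDS = %w[the a an and of to in is it this that for on with].freeze

def top_terms(text, limit = 5)
  words = text.downcase.scan(/[a-z']+/)
  words = words.reject { |w| STOPWORDS.include?(w) || w.length < 3 }

  counts = words.each_with_object(Hash.new(0)) { |w, h| h[w] += 1 }
  # Sort by descending count, then alphabetically for a stable order.
  counts.sort_by { |word, count| [-count, word] }.first(limit)
end

# In real use this would be content_focus.static_text.
sample_text = "Ruby gems package Ruby code. A gem can extract content; content matters."
p top_terms(sample_text, 3)
```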
Author
Dominiek ter Heide (Note: I wrote this a while back and thought this could be useful to some developers)