Content Focus

This is a little gem that allows you to input raw HTML and extract the most relevant piece of content. This is useful when doing semantic analysis on HTML pages for example.

Right now, ContentFocus only supports ‘permanent content extraction’. This is the content that’s non-temporal on a page, like for example:

  • About section
  • Author information
  • Article body
  • Generic information block

The algorithm uses several ways of determining this and it will try to neglect irrelevant pieces of content (navigation, styling, etc.)

Example


  require 'rubygems'
  require 'content_focus'
  
  content_focus = ContentFocus::HTML.new(html_data)
  
  # Will return the most relevant content in text
  static_text = content_focus.static_text
  
  # Will return the most relevant block of content in a Hpricot HTML tree element
  static_fragment = content_focus.static_fragment

Author

Dominiek ter Heide (Note: I wrote this a while back and thought this could be useful to some developers)