Class: Boilerpipe::SAX::BoilerpipeHTMLParser

Inherits:
Object
Defined in:
lib/boilerpipe/sax/boilerpipe_html_parser.rb

Class Method Summary

Class Method Details

.parse(text) ⇒ Object



# File 'lib/boilerpipe/sax/boilerpipe_html_parser.rb', line 3

def self.parse(text)
  # strip out tags that cause issues
  text = Preprocessor.strip(text)

  # use nokogiri to fix any bad tags, errors - keep experimenting with this
  text = Nokogiri::HTML(text).to_html
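  # the handler receives SAX callbacks and accumulates the document content
  handler = HTMLContentHandler.new
  noko_parser = Nokogiri::HTML::SAX::Parser.new(handler)
  noko_parser.parse(text)

  # return the document the handler built during parsing
  handler.text_document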
end
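
The method above follows a common SAX pattern: create a handler object, let the parser drive its callbacks, then read the accumulated result back from the handler. The sketch below illustrates that pattern only; it is not Boilerpipe's HTMLContentHandler. To stay dependency-free it swaps in REXML's stream parser (bundled with Ruby) for Nokogiri, so it needs well-formed markup, whereas Nokogiri's HTML parser tolerates broken tags. The names `TextCollector` and `extract_text` are hypothetical. With Nokogiri, the analogous callbacks on a `Nokogiri::XML::SAX::Document` subclass are `start_element`, `end_element`, and `characters`.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Hypothetical handler (not part of Boilerpipe): collects visible text
# and skips anything nested inside <script> or <style>.
class TextCollector
  include REXML::StreamListener

  IGNORED = %w[script style].freeze

  def initialize
    @chunks = []
    @ignore_depth = 0
  end

  # Called for every opening tag.
  def tag_start(name, _attrs)
    @ignore_depth += 1 if IGNORED.include?(name.downcase)
  end

  # Called for every closing tag.
  def tag_end(name)
    @ignore_depth -= 1 if IGNORED.include?(name.downcase)
  end

  # Called for every run of character data.
  def text(data)
    return if @ignore_depth.positive?

    stripped = data.strip
    @chunks << stripped unless stripped.empty?
  end

  # Result accessor, playing the role that text_document plays above.
  def text_document
    @chunks.join("\n")
  end
end

# Hypothetical helper mirroring the shape of .parse:
# build a handler, run the parser, return what the handler collected.
def extract_text(markup)
  handler = TextCollector.new
  REXML::Document.parse_stream(markup, handler)
  handler.text_document
end

markup = '<html><head><style>p { color: red; }</style></head>' \
         '<body><h1>Title</h1><p>First paragraph.</p>' \
         '<script>var x = 1;</script></body></html>'
puts extract_text(markup)
```

Keeping the state in the handler rather than in the parser is what lets `.parse` stay short: the parser only emits events, and the handler decides what counts as content.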