Class: WebPageParser::IndependentPageParserV1
- Inherits:
- BaseParser
- Object
- BaseParser
- WebPageParser::IndependentPageParserV1
- Defined in:
- lib/web-page-parser/parsers/independent_page_parser.rb
Overview
IndependentPageParserV1 parses Independent news web pages.
Instance Attribute Summary
Attributes inherited from BaseParser
Instance Method Summary
- #content ⇒ Object
- #date ⇒ Object
- #guid_from_url ⇒ Object
  Independent articles have a guid in the URL (as of Jan 2014, a seven-digit integer at the end of the URL, before the .html extension).
- #html_doc ⇒ Object
- #title ⇒ Object
Methods inherited from BaseParser
#guid, #hash, #initialize, #page, #retrieve_page
Constructor Details
This class inherits a constructor from WebPageParser::BaseParser
Instance Method Details
#content ⇒ Object
# File 'lib/web-page-parser/parsers/independent_page_parser.rb', line 34

def content
  return @content if @content
  content = []
  story_body = html_doc.css('div.articleContent p')
  story_body.each do |p|
    p.search('script,object').remove
    p = p.text
    content << p.strip.gsub(/\n+/,' ') if p
  end
  @content = content.select { |p| !p.empty? }
end
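The text clean-up step inside #content can be tried on its own, without Nokogiri. The following is a minimal sketch, assuming the paragraph texts have already been extracted into plain strings. The helper name `clean_paragraphs` and the sample strings are hypothetical and are not part of the gem.

```ruby
# Sketch of the paragraph clean-up performed in #content, applied to plain strings.
# `clean_paragraphs` is a hypothetical helper used only for illustration.
def clean_paragraphs(raw_paragraphs)
  cleaned = raw_paragraphs.map do |text|
    # Trim surrounding whitespace, then collapse each run of newlines into one space.
    text.strip.gsub(/\n+/, ' ')
  end
  # Drop paragraphs that are empty after trimming.
  cleaned.reject(&:empty?)
end

raw = ["  First paragraph.\n", "Second\n\nparagraph.", "   \n "]
p clean_paragraphs(raw)
```

Whitespace-only paragraphs become empty strings after `strip`, which is why the final filtering step matters.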
#date ⇒ Object
# File 'lib/web-page-parser/parsers/independent_page_parser.rb', line 46

def date
  return @date if @date
  if meta = html_doc.at_css('meta[property="article:published_time"]')
    @date = DateTime.parse(meta['content']) rescue nil
  end
  @date
end
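#date pairs DateTime.parse with an inline `rescue nil`, so a value that cannot be parsed yields nil instead of raising. The following minimal sketch shows that parse-or-nil behaviour using only the standard library. The helper `parse_published_time` and the sample values are hypothetical stand-ins for the meta tag's content attribute, and the sketch rescues specific exception classes rather than using the modifier form, which is a stylistic substitution and not the gem's code.

```ruby
require 'date'

# Hypothetical helper mirroring the parse-or-nil behaviour of #date.
def parse_published_time(value)
  DateTime.parse(value)
rescue ArgumentError, TypeError
  # Unparseable strings raise ArgumentError (or a subclass); nil raises TypeError.
  nil
end

published = parse_published_time('2014-01-15T10:30:00+00:00')
puts published.year
puts parse_published_time('').inspect
```

Returning nil keeps a missing or malformed timestamp from aborting the rest of the parse, at the cost of hiding the reason for the failure.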
#guid_from_url ⇒ Object
Independent articles have a guid in the URL (as of Jan 2014, a seven-digit integer at the end of the URL, before the .html extension).
# File 'lib/web-page-parser/parsers/independent_page_parser.rb', line 21

def guid_from_url
  # get the last large number from the url, if there is one
  url.to_s.scan(/[0-9]{6,12}/).last
end
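To illustrate the pattern used by guid_from_url, the following minimal sketch applies the same regular expression to made-up URLs. The helper name `extract_guid` and the example.com URLs are hypothetical and exist only for this example. Note that the pattern accepts any run of 6 to 12 digits, not only seven, so shorter numbers such as a four-digit year in the path are ignored.

```ruby
# Standalone sketch of the scan used by guid_from_url.
# `extract_guid` and the sample URLs are hypothetical, for illustration only.
def extract_guid(url)
  # Collect every run of 6 to 12 digits and keep the last one, if any.
  url.to_s.scan(/[0-9]{6,12}/).last
end

sample = 'http://www.example.com/news/2014/some-story-headline-9087654.html'
puts extract_guid(sample).inspect
puts extract_guid('http://www.example.com/about.html').inspect
```

Because `to_s` is called first, a nil URL produces nil rather than an exception.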
#html_doc ⇒ Object
# File 'lib/web-page-parser/parsers/independent_page_parser.rb', line 26

def html_doc
  @html_document ||= Nokogiri::HTML(page)
end
#title ⇒ Object
# File 'lib/web-page-parser/parsers/independent_page_parser.rb', line 30

def title
  @title ||= html_doc.css('div#main h1.title').text.strip
end