Class: DaimonSkycrawlers::SitemapParser
- Inherits: Object
  - Object
  - DaimonSkycrawlers::SitemapParser
- Defined in: lib/daimon_skycrawlers/sitemap_parser.rb
Overview
Parser for sitemap.xml.
Based on https://github.com/benbalter/sitemap-parser. See also https://www.sitemaps.org/.
urls = ["https://example.com/sitemap.xml"]
sitemap_parser = DaimonSkycrawlers::SitemapParser.new(urls)
sitemap_urls = sitemap_parser.parse
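For orientation, the sitemaps.org protocol lists page addresses in `<loc>` elements, nested under `<url>` in a `<urlset>` or under `<sitemap>` in a `<sitemapindex>`. The sketch below shows one way such values could be collected. `extract_locs` is a hypothetical standalone helper written for illustration; it is not this class's own extraction code, which this page does not show. It assumes the REXML gem bundled with Ruby is available.

```ruby
require "rexml/document"

# Hypothetical helper (not part of DaimonSkycrawlers): collect the text of
# every <loc> element in a sitemap or sitemap index document.
def extract_locs(xml)
  doc = REXML::Document.new(xml)
  locs = []
  # Walk all elements and compare local names, so the default sitemap
  # namespace does not need to be declared for the lookup.
  doc.each_recursive do |element|
    locs << element.text.to_s.strip if element.name == "loc"
  end
  locs
end

xml = <<~XML
  <?xml version="1.0" encoding="UTF-8"?>
  <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url><loc>https://example.com/</loc></url>
    <url><loc>https://example.com/about</loc></url>
  </urlset>
XML

puts extract_locs(xml)
```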
Defined Under Namespace
Classes: Error
Instance Method Summary
- #initialize(urls) ⇒ SitemapParser (constructor)
  A new instance of SitemapParser.
- #parse ⇒ Array
  Fetch and parse sitemap.xml.
Constructor Details
#initialize(urls) ⇒ SitemapParser
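Returns a new instance of SitemapParser. urls is the list of sitemap locations that #parse will read; each entry may be an HTTP(S) URL or a local file path.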
# File 'lib/daimon_skycrawlers/sitemap_parser.rb', line 29
def initialize(urls)
  @urls = urls
end
Instance Method Details
#parse ⇒ Array
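Fetches and parses each sitemap passed to the constructor and returns the collected URLs as an Array. Entries whose scheme starts with "http" are fetched over the network; any other entry is read as a local file if it exists.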
# File 'lib/daimon_skycrawlers/sitemap_parser.rb', line 38
def parse
  hydra = Typhoeus::Hydra.new(max_concurrency: 1)
  sitemap_urls = []
  @urls.each do |url|
    uri = URI(url)
    if uri.scheme && uri.scheme.start_with?("http")
      request = Typhoeus::Request.new(url, followlocation: true)
      request.on_complete do |response|
        sitemap_urls.concat(on_complete(response))
      end
      hydra.queue(request)
    else
      if File.exist?(url)
        sitemap_urls.concat(extract_urls(File.read(url)))
      end
    end
  end
  loop do
    hydra.run
    break if hydra.queued_requests.empty?
  end
  sitemap_urls
end
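The scheme check in #parse decides whether an entry is fetched over HTTP or read from disk. The sketch below isolates that decision using only Ruby's standard `uri` library. `remote_source?` is a hypothetical name introduced here for illustration; the library itself performs the check inline.

```ruby
require "uri"

# Hypothetical helper mirroring the condition used in #parse:
# an entry counts as remote when its URI scheme starts with "http".
def remote_source?(source)
  scheme = URI(source).scheme
  !scheme.nil? && scheme.start_with?("http")
end

sources = [
  "https://example.com/sitemap.xml",
  "http://example.com/sitemap.xml",
  "sitemap.xml",
  "/var/www/sitemap.xml"
]

# Split the inputs the same way #parse would route them.
remote, local = sources.partition { |source| remote_source?(source) }
puts "remote: #{remote.inspect}"
puts "local:  #{local.inspect}"
```

Note that in #parse a non-HTTP entry that does not exist on disk is skipped without raising an error, so a mistyped path yields no URLs rather than a failure.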