Class: DaimonSkycrawlers::SitemapParser

Inherits:
Object
Defined in:
lib/daimon_skycrawlers/sitemap_parser.rb

Overview

Parser for sitemap.xml

Based on https://github.com/benbalter/sitemap-parser
See also https://www.sitemaps.org/

urls = ["https://example.com/sitemap.xml"]
sitemap_parser = DaimonSkycrawlers::SitemapParser.new(urls)
sitemap_urls = sitemap_parser.parse
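
For orientation, a sitemap is an XML document whose entries each carry a `<loc>` element holding one URL. The sketch below shows the kind of extraction that parsing a sitemap involves. It is an illustration only: it uses REXML from Ruby's bundled libraries, which is an assumption and not necessarily what this class uses internally, and `extract_locs` is a hypothetical helper name that is not part of DaimonSkycrawlers.

```ruby
require "rexml/document"

# Hypothetical helper (not part of DaimonSkycrawlers):
# collects the text of every <loc> element in a sitemap document.
def extract_locs(xml)
  doc = REXML::Document.new(xml)
  locs = []
  # Walk every element and match on the local name, so the default
  # sitemap namespace does not need to be handled explicitly.
  doc.each_recursive do |element|
    next unless element.name == "loc"
    text = element.text
    locs << text.strip if text
  end
  locs
end

sample = <<~XML
  <?xml version="1.0" encoding="UTF-8"?>
  <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url><loc>https://example.com/</loc></url>
    <url><loc>https://example.com/about</loc></url>
  </urlset>
XML

locs = extract_locs(sample)
```

Matching on the element name covers both `<urlset>` documents and `<sitemapindex>` documents, since both use `<loc>` for their entries.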

Defined Under Namespace

Classes: Error

Instance Method Summary

  • #initialize(urls) ⇒ SitemapParser (constructor)
  • #parse ⇒ Array

Constructor Details

#initialize(urls) ⇒ SitemapParser

Returns a new instance of SitemapParser.

Parameters:

  • urls (Array)

    List of sitemap.xml URLs or local file paths



# File 'lib/daimon_skycrawlers/sitemap_parser.rb', line 29

def initialize(urls)
  @urls = urls
end

Instance Method Details

#parse ⇒ Array

Fetch and parse sitemap.xml

Returns:

  • (Array)

    URLs in sitemap.xml



# File 'lib/daimon_skycrawlers/sitemap_parser.rb', line 38

def parse
  # Requests are processed one at a time.
  hydra = Typhoeus::Hydra.new(max_concurrency: 1)
  sitemap_urls = []
  @urls.each do |url|
    uri = URI(url)
    if uri.scheme && uri.scheme.start_with?("http")
      # Remote sitemap: queue an HTTP request that follows redirects.
      request = Typhoeus::Request.new(url, followlocation: true)
      request.on_complete do |response|
        sitemap_urls.concat(on_complete(response))
      end
      hydra.queue(request)
    elsif File.exist?(url)
      # Local file: read and parse it directly. Missing files are skipped.
      sitemap_urls.concat(extract_urls(File.read(url)))
    end
  end
  # Keep running until no queued requests remain.
  loop do
    hydra.run
    break if hydra.queued_requests.empty?
  end
  sitemap_urls
end
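
The branch at the top of the loop decides whether an entry is fetched over HTTP or read from disk. A minimal sketch of that check in isolation, using only `URI` from the standard library, might look like the following. `remote_source?` is a hypothetical helper introduced here for illustration and does not exist in the class.

```ruby
require "uri"

# Hypothetical helper (not part of DaimonSkycrawlers): mirrors the
# condition #parse uses to choose between an HTTP request and a file read.
def remote_source?(source)
  scheme = URI(source).scheme
  # A bare path such as "tmp/sitemap.xml" has no scheme at all.
  !scheme.nil? && scheme.start_with?("http")
end

sources = ["https://example.com/sitemap.xml", "tmp/sitemap.xml"]

# Split the inputs the same way #parse would treat them.
remote, local = sources.partition { |source| remote_source?(source) }
```

Under this check, any entry whose scheme does not start with "http" (for example an `ftp://` URL) falls through to the file branch, where `parse` skips it unless a file with that exact name exists.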