Class: RightScraper::Main

Inherits: Object

Defined in: lib/right_scraper/main.rb

Overview

Library main entry point. Instantiate this class and call the scrape method to download or update a remote repository to the local disk and run a scraper on the resulting files.

Note that this class was named Scraper in v1-3. That name was confusing because the Scrapers module performs only a subset of the main Scraper class functionality.
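
As a minimal sketch of the instantiate-then-scrape flow described above, the helper below runs one scrape and collects the documented accessors into a Hash. `scrape_and_report` is a hypothetical name, not part of right_scraper; it only relies on `scrape`, `resources`, `errors` and `warnings` as listed on this page. The repository Hash keys in the trailing comment are assumptions; the valid keys are defined by RightScraper::Repositories::Base.

```ruby
# Hypothetical helper (not part of right_scraper): runs one scrape and
# summarizes the outcome using only the methods documented on this page.
def scrape_and_report(scraper, repo)
  ok = scraper.scrape(repo)
  {
    :ok        => ok,
    :resources => scraper.resources.size,
    :errors    => scraper.errors,
    :warnings  => scraper.warnings
  }
end

# Intended use with the real library (requires the gem to be installed).
# The :repo_type and :url keys are assumptions, not confirmed by this page:
#
#   require 'right_scraper'
#   main   = RightScraper::Main.new(:kind => :cookbook, :basedir => '/tmp/scrape')
#   report = scrape_and_report(main, { :repo_type => :git, :url => '...' })
#   puts report[:errors] unless report[:ok]
```

Taking the scraper as an argument keeps the helper independent of the gem, so any object that responds to the same four methods can stand in for it during testing.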

Instance Attribute Summary

#resources ⇒ Object readonly
(Array)

Scraped resources.

Instance Method Summary

#errors ⇒ Object
(Array)

Error messages in case of failure.
#initialize(options = {}) ⇒ Main constructor

Initialize scrape destination directory.
#repo_dir(repo) ⇒ Object

Path to directory where given repo should be or was downloaded.
#scrape(repo, incremental = true, &callback) ⇒ Object

Scrape given repository, depositing files into the scrape directory.
#succeeded? ⇒ Boolean (also: #successful?)

Call errors to get error messages if false.
#warnings ⇒ Object
(Array)

Warnings or empty.

Constructor Details

#initialize(options = {}) ⇒ `Main`

Initialize scrape destination directory

Options

:kind: Type of scraper that will traverse directory for resources, one of :cookbook or :workflow
:basedir: Local directory where files are retrieved and scraped, use temporary directory if nil
:max_bytes: Maximum number of bytes to read from remote repo, unlimited if nil
:max_seconds: Maximum number of seconds to spend reading from remote repo, unlimited if nil
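
One detail of the option handling is worth a sketch. In the constructor listing on this page, defaults that already contain a `:basedir` key are merged with the caller's options before `has_key?(:basedir)` is checked. With a plain Ruby Hash, that check is then always true, so `@temporary` ends up false even when no `:basedir` was passed. The method names below are hypothetical and the defaults are abbreviated; this is an observation about the listing as shown, not a statement about any released version of the gem.

```ruby
# Abbreviated stand-in for the defaults Hash in the constructor listing.
def default_options
  { :kind => nil, :basedir => nil }
end

# Mirrors the listing: merge first, then test for the key.
# The key always exists after the merge, so this is false for any input.
def temporary_by_key?(options)
  !default_options.merge(options).has_key?(:basedir)
end

# Hypothetical alternative: test the merged value instead of the key.
def temporary_by_value?(options)
  default_options.merge(options)[:basedir].nil?
end

puts temporary_by_key?({})                        # false
puts temporary_by_value?({})                      # true
puts temporary_by_value?(:basedir => '/tmp/work') # false
```

If the value-based check matches the intent, a caller who omits `:basedir` would get a temporary directory that `scrape` removes in its `ensure` clause.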

# File 'lib/right_scraper/main.rb', line 50

def initialize(options={})
  options = {
    :kind        => nil,
    :basedir     => nil,
    :max_bytes   => nil,
    :max_seconds => nil,
    :callback    => nil,
    :logger      => nil,
    :s3_key      => nil,
    :s3_secret   => nil,
    :s3_bucket   => nil,
    :errors      => nil,
    :warnings    => nil,
    :scanners    => nil,
    :builders    => nil,
  }.merge(options)
  @temporary = !options.has_key?(:basedir)
  options[:basedir] ||= Dir.mktmpdir
  options[:logger] ||= ::RightScraper::Loggers::Default.new
  @logger = options[:logger]
  @resources = []
  @options = options
end

Instance Attribute Details

#resources ⇒ `Object` (readonly)

(Array): Scraped resources



# File 'lib/right_scraper/main.rb', line 41

def resources
  @resources
end

Instance Method Details

#errors ⇒ `Object`

(Array): Error messages in case of failure



# File 'lib/right_scraper/main.rb', line 166

def errors
  @logger.errors
end

#repo_dir(repo) ⇒ `Object`

Path to directory where given repo should be or was downloaded

Parameters

repo(Hash|RightScraper::Repositories::Base): Remote repository corresponding to local directory

Return

String: Path to local directory that corresponds to given repository



# File 'lib/right_scraper/main.rb', line 161

def repo_dir(repo)
  RightScraper::Retrievers::Base.repo_dir(@options[:basedir], repo)
end

#scrape(repo, incremental = true, &callback) ⇒ `Object`

Scrape given repository, depositing files into the scrape directory. Update content of unique directory incrementally when possible with further calls.

Parameters

repo(Hash|RightScraper::Repositories::Base): Repository to be scraped

Note: repo can either be a Hash or a RightScraper::Repositories::Base instance.
      See the RightScraper::Repositories::Base class for valid Hash keys.

Block

If a block is given, it is called back with progress information. The block should take four arguments:

first argument is one of :begin, :commit, :abort which signifies what the scraper is trying to do and where it is when it does it
second argument is a symbol describing the operation being performed in an easy-to-match way
third argument is optional further explanation
fourth argument is the exception pending (only relevant for :abort)
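
The four-argument block described above could look like the following sketch. `build_progress_handler` is a hypothetical helper, and the sample calls at the end use made-up values purely to exercise it. A `proc` is used rather than a `lambda` so that a call with fewer arguments would not raise.

```ruby
# Hypothetical progress handler: turns each callback into one log line.
def build_progress_handler(log)
  proc do |phase, operation, explanation, exception|
    line = "#{phase} #{operation}"
    line += " (#{explanation})" if explanation && !explanation.to_s.empty?
    # The exception argument is only meaningful for :abort.
    line += " failed: #{exception.message}" if phase == :abort && exception
    log << line
  end
end

log = []
handler = build_progress_handler(log)

# Simulated callbacks; a real scrape would issue these itself.
handler.call(:begin, :retrieving, "from example repo", nil)
handler.call(:abort, :retrieving, "from example repo", RuntimeError.new("timeout"))
puts log
```

Such a handler would be passed to `scrape` as a block, for example `main.scrape(repo, &handler)`.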

Return

true: If scrape was successful
false: If scrape failed, call errors for information on failure

Raise

‘Invalid repository type’: If repository type is not known

# File 'lib/right_scraper/main.rb', line 100

def scrape(repo, incremental=true, &callback)
  errorlen = errors.size
  repo = RightScraper::Repositories::Base.from_hash(repo) if repo.is_a?(Hash)
  @logger.callback = callback
  begin
    # 1. Retrieve the files
    retriever = nil
    repo_dir_changed = false
    @logger.operation(:retrieving, "from #{repo}") do
      # note that the retriever type may be unavailable but allow the
      # retrieve method to raise any such error.
      retriever = repo.retriever(@options)
      repo_dir_changed = retriever.retrieve
    end

    # TEAL FIX: Note that retrieve will now return true iff there has been
    # a change to the last scraped repository directory for efficiency
    # reasons and only for retriever types that support this behavior.
    #
    # Even if the retrieval is skipped due to already having the data on
    # disk we still need to scrape its resources only because of the case
    # of the metadata scraper daemon, which updates multiple repositories
    # of similar criteria.
    #
    # The issue is that a new repo can appear later with the same criteria
    # as an already-scraped repo and will need its own copy of the
    # scraped resources. The easiest (but not most efficient) way to
    # deliver these is to rescrape the already-seen resources. This
    # becomes more expensive as we rely on generating "metadata.json" from
    # "metadata.rb" for cookbooks but is likely not expensive enough to
    # need to improve this logic.


    # 2. Now scrape if there is a scraper in the options
    @logger.operation(:scraping, retriever.repo_dir) do
      if @options[:kind]
        options = @options.merge({:ignorable_paths => retriever.ignorable_paths,
                                  :repo_dir        => retriever.repo_dir,
                                  :repository      => retriever.repository})
        scraper = RightScraper::Scrapers::Base.scraper(options)
        @resources += scraper.scrape
      end
    end
  rescue Exception
    # logger handles communication with the end user and appending
    # to our error list, we just need to keep going.
  ensure
    # ensure basedir is always removed if temporary (even with errors).
    ::FileUtils.remove_entry_secure(@options[:basedir]) rescue nil if @temporary
  end
  @logger.callback = nil
  errors.size == errorlen
end

#succeeded? ⇒ `Boolean` Also known as: successful?

Call errors to get error messages if false

Return

Boolean: true if scrape finished with no error, false otherwise.
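
The two success signals on this page are computed differently: `scrape` returns `errors.size == errorlen`, meaning the call added no new errors, while `succeeded?` returns `errors.empty?`. The sketch below contrasts them with plain arrays. The method names are hypothetical, and it assumes the logger keeps errors from earlier calls on the same instance, which the `errorlen` bookkeeping suggests but this page does not confirm.

```ruby
# Mirrors the return expression of #scrape: true when the run added no errors.
def run_added_no_errors?(errors)
  before = errors.size
  yield errors
  errors.size == before
end

# Mirrors #succeeded?: true only when no errors have accumulated at all.
def no_errors_at_all?(errors)
  errors.empty?
end

errors = []
first  = run_added_no_errors?(errors) { |list| list << "retrieval failed" }
second = run_added_no_errors?(errors) { |_list| } # a clean second run

puts first                      # false
puts second                     # true
puts no_errors_at_all?(errors)  # false
```

Under that assumption, a clean second scrape on a reused instance would return true while `succeeded?` still reports false.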




# File 'lib/right_scraper/main.rb', line 180

def succeeded?
  errors.empty?
end

#warnings ⇒ `Object`

(Array): Warnings or empty



# File 'lib/right_scraper/main.rb', line 171

def warnings
  @logger.warnings
end
