Class: RightScraper::Main

Inherits: Object

Defined in: lib/right_scraper/main.rb

Overview

Library main entry point. Instantiate this class and call the scrape method to download or update a remote repository to the local disk and run a scraper on the resulting files.

Note that this class was known as Scraper in v1-3, but that name was confusing because the Scrapers module performs only a subset of the main Scraper class functionality.

Instance Attribute Summary

#resources ⇒ Object readonly
(Array)

Scraped resources.

Instance Method Summary

#base_dir ⇒ Object

Base directory for any file operations.
#cleanup ⇒ Object

Cleans up temporary files, etc.
#errors ⇒ Object
(Array)

Error messages in case of failure.
#freed_dir(repo) ⇒ Object

Path to directory where scanned artifacts can be copied out of containment due to lack of permissions to write to other directories.
#initialize(options = {}) ⇒ Main constructor

Initialize scrape destination directory.
#repo_dir(repo) ⇒ Object

Path to directory where given repo should be or was downloaded.
#retrieve(repo) ⇒ Object

Retrieves the given repository.
#scan(retrieved) ⇒ Object

Scans a local directory.
#scrape(repo, incremental = true, &callback) ⇒ Object deprecated

Deprecated. The newer methodology performs these operations in stages controlled externally instead of calling this all-in-one method.
#succeeded? ⇒ Boolean (also: #successful?)

Was scraping successful? Call errors to get error messages if false.
#warnings ⇒ Object
(Array)

Warnings or empty.

Constructor Details

#initialize(options = {}) ⇒ `Main`

Initialize scrape destination directory

Options

:kind: Type of scraper that will traverse directory for resources, one of :cookbook or :workflow
:basedir: Local directory where files are retrieved and scraped, use temporary directory if nil
:max_bytes: Maximum number of bytes to read from remote repo, unlimited if nil
:max_seconds: Maximum number of seconds to spend reading from remote repo, unlimited if nil

# File 'lib/right_scraper/main.rb', line 51

def initialize(options={})
  options = ::RightSupport::Data::Mash.new(
    :kind        => nil,
    :basedir     => nil,
    :max_bytes   => nil,
    :max_seconds => nil,
    :logger      => nil,
    :s3_key      => nil,
    :s3_secret   => nil,
    :s3_bucket   => nil,
    :scanners    => nil,
    :builders    => nil,
  ).merge(options)
  @old_logger_callback = nil
  @temporary = !options.has_key?(:basedir)
  options[:basedir] ||= Dir.mktmpdir
  options[:logger] ||= ::RightScraper::Loggers::Default.new
  @logger = options[:logger]
  @resources = []
  options[:errors] = @logger.errors
  options[:warnings] = @logger.warnings

  # load classes from scanners and builders options, if necessary.
  [:scanners, :builders].each do |k|
    list = options[k] || []
    list.each_with_index do |clazz, index|
      unless clazz.kind_of?(::Class)
        list[index] = ::Object.const_get(clazz)
      end
    end
  end
  @options = options
end
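
The loop over :scanners and :builders above accepts either classes or class names. The following is a minimal sketch of that resolution step in isolation, not the library's own code path: `Demo::ManifestScanner` and `resolve_classes` are hypothetical names introduced only for illustration.

```ruby
# Hypothetical stand-in class; RightScraper ships its own scanners and builders.
module Demo
  class ManifestScanner; end
end

# Same idea as the loop in #initialize: any entry that is not already a
# Class is looked up by name and replaced in place.
def resolve_classes(list)
  list.each_with_index do |clazz, index|
    list[index] = ::Object.const_get(clazz) unless clazz.kind_of?(::Class)
  end
  list
end

resolved = resolve_classes(['Demo::ManifestScanner', String])
p resolved
```

Because the lookup goes through `::Object.const_get`, a name that does not resolve to a loaded constant raises `NameError`, so any custom class presumably has to be required before the constructor runs.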

Instance Attribute Details

#resources ⇒ `Object` (readonly)

(Array): Scraped resources



# File 'lib/right_scraper/main.rb', line 42

def resources
  @resources
end

Instance Method Details

#base_dir ⇒ `Object`

Base directory for any file operations.



# File 'lib/right_scraper/main.rb', line 185

def base_dir
  @options[:basedir]
end

#cleanup ⇒ `Object`

Cleans up temporary files, etc.

# File 'lib/right_scraper/main.rb', line 190

def cleanup
  @logger.callback = @old_logger_callback
  @old_logger_callback = nil
  ::FileUtils.remove_entry_secure(base_dir) rescue nil if @temporary
end
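
The temporary-directory lifecycle that #cleanup closes out can be sketched with the standard library alone. This is an illustration under assumptions, not the class itself: the `metadata.json` file name is made up, and only `Dir.mktmpdir` and `FileUtils.remove_entry_secure` are taken from the listings above.

```ruby
require 'tmpdir'
require 'fileutils'

# Create a scratch directory, as the constructor does via Dir.mktmpdir
# when :basedir is nil.
base_dir = Dir.mktmpdir
File.write(File.join(base_dir, 'metadata.json'), '{}') # hypothetical scraped file
existed_before = File.directory?(base_dir)

# Same call and same trailing rescue as #cleanup: a failure while removing
# the directory is swallowed rather than raised to the caller.
::FileUtils.remove_entry_secure(base_dir) rescue nil
exists_after = File.exist?(base_dir)

puts "before cleanup: #{existed_before}, after cleanup: #{exists_after}"
```

The trailing `rescue nil` means a failed removal is silent, so a caller that cares about leftover directories would need to check for them separately.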

#errors ⇒ `Object`

(Array): Error messages in case of failure



# File 'lib/right_scraper/main.rb', line 215

def errors
  @logger.errors
end

#freed_dir(repo) ⇒ `Object`

Path to directory where scanned artifacts can be copied out of containment due to lack of permissions to write to other directories. The freed files can then be reused by subsequent scanners, etc.



# File 'lib/right_scraper/main.rb', line 210

def freed_dir(repo)
  ::File.expand_path('../freed', repo_dir(repo))
end
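
To see what the `'../freed'` expansion produces, here is a small sketch with a made-up repository path (the real value comes from #repo_dir). It assumes a Unix-style filesystem.

```ruby
# Hypothetical repo directory; in the library this comes from #repo_dir.
repo_dir = '/var/tmp/scrape/3f2a/repo'

# '../freed' is resolved relative to repo_dir, so the result is a sibling
# of the repo directory rather than a subdirectory of it.
freed = ::File.expand_path('../freed', repo_dir)

puts freed
puts ::File.dirname(freed) == ::File.dirname(repo_dir)
```

Keeping the freed directory beside the repository, not inside it, means scan output written there does not end up among the retrieved files.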

#repo_dir(repo) ⇒ `Object`

Path to directory where given repo should be or was downloaded

Parameters

repo(Hash|RightScraper::Repositories::Base): Remote repository corresponding to local directory

Return

String: Path to local directory that corresponds to given repository



# File 'lib/right_scraper/main.rb', line 203

def repo_dir(repo)
  RightScraper::Retrievers::Base.repo_dir(base_dir, repo)
end

#retrieve(repo) ⇒ `Object`

Retrieves the given repository. See #scrape for details.

# File 'lib/right_scraper/main.rb', line 132

def retrieve(repo)
  errorlen = errors.size
  unless repo.kind_of?(::RightScraper::Repositories::Base)
    repo = RightScraper::Repositories::Base.from_hash(::RightSupport::Data::Mash.new(repo))
  end
  retriever = nil

  # 1. Retrieve the files
  @logger.operation(:retrieving, "from #{repo}") do
    # note that the retriever type may be unavailable but allow the
    # retrieve method to raise any such error.
    retriever = repo.retriever(@options)
    retriever.retrieve
  end

  if errors.size == errorlen
    # create the freed directory with world-writable permission for
    # subsequent scan output for less-privileged child processes.
    freed_base_path = freed_dir(repo)
    ::FileUtils.rm_rf(freed_base_path) if ::File.exist?(freed_base_path)
    ::FileUtils.mkdir_p(freed_base_path)
    ::File.chmod(0777, freed_base_path)

    # the following hash is needed for running any subsequent scanners.
    {
      ignorable_paths: retriever.ignorable_paths,
      repo_dir: retriever.repo_dir,
      freed_dir: freed_base_path,
      repository: retriever.repository
    }
  else
    nil
  end
end

#scan(retrieved) ⇒ `Object`

Scans a local directory. See #scrape for details.

# File 'lib/right_scraper/main.rb', line 168

def scan(retrieved)
  errorlen = errors.size
  old_callback = @logger.callback
  options = ::RightSupport::Data::Mash.new(@options).merge(retrieved)
  repo = options[:repository]
  unless repo.kind_of?(::RightScraper::Repositories::Base)
    repo = RightScraper::Repositories::Base.from_hash(::RightSupport::Data::Mash.new(repo))
    options[:repository] = repo
  end
  @logger.operation(:scraping, options[:repo_dir]) do
    scraper = ::RightScraper::Scrapers::Base.scraper(options)
    @resources += scraper.scrape
  end
  errors.size == errorlen
end
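
Both #retrieve and #scan decide success by comparing the size of the error list before and after the stage, whereas #succeeded? checks whether the list is empty. The sketch below isolates that difference with a plain array standing in for the logger's error list; `stage_succeeded?` is a hypothetical helper, not part of the library.

```ruby
# Plain array standing in for @logger.errors (an assumption for illustration).
errors = []

# Returns true only if the block added no new errors, mirroring the
# `errors.size == errorlen` check used by #retrieve and #scan.
def stage_succeeded?(errors)
  errorlen = errors.size
  yield
  errors.size == errorlen
end

first  = stage_succeeded?(errors) { }                          # no new errors
second = stage_succeeded?(errors) { errors << 'scan failed' }  # one new error
third  = stage_succeeded?(errors) { }                          # old error still listed

overall = errors.empty? # what #succeeded? would report
puts [first, second, third, overall].inspect
```

Under this pattern a later stage can report success while the instance as a whole still reports failure, because errors accumulate across stages on the same logger.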

#scrape(repo, incremental = true, &callback) ⇒ `Object`

Deprecated. The newer methodology performs these operations in stages controlled externally instead of calling this all-in-one method.

Scrapes and scans a given repository.

Parameters

repo(Hash|RightScraper::Repositories::Base): Repository to be scraped

Note: repo can either be a Hash or a RightScraper::Repositories::Base instance.
      See the RightScraper::Repositories::Base class for valid Hash keys.

Block

If a block is given, it will be called back with progress information. The block should take four arguments:

The first argument is one of :begin, :commit, :abort, which signifies what the scraper is trying to do and where it is when it does it.
The second argument is a symbol describing the operation being performed in an easy-to-match way.
The third argument is optional further explanation.
The fourth argument is the exception pending (only relevant for :abort).
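
A callback with this four-argument shape might look like the following sketch. The invocations at the bottom are simulated by hand, since the real calls come from the logger while scraping; the message format and the `events` array are assumptions made for illustration.

```ruby
events = []

# Four-argument progress callback: phase, operation, explanation, exception.
progress = lambda do |phase, operation, explanation, exception|
  line = "#{phase} #{operation}"
  line += " (#{explanation})" if explanation && !explanation.empty?
  # The exception is only meaningful when the phase is :abort.
  line += " failed: #{exception.message}" if phase == :abort && exception
  events << line
end

# Simulated notifications standing in for what the scraper would send.
progress.call(:begin, :retrieving, 'from repo', nil)
progress.call(:commit, :retrieving, 'from repo', nil)
progress.call(:abort, :scraping, '', RuntimeError.new('boom'))

puts events
```

A lambda enforces the argument count, which makes a mismatch obvious during development; passing it as `&progress` where a block is expected should work the same way as an inline block.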

Return

true: If scrape was successful
false: If scrape failed, call errors for information on failure

Raise

‘Invalid repository type’: If repository type is not known

# File 'lib/right_scraper/main.rb', line 112

def scrape(repo, incremental=true, &callback)
  @old_logger_callback = @logger.callback
  @logger.callback = callback
  errorlen = errors.size
  begin
    if retrieved = retrieve(repo, &callback)
      scan(retrieved, &callback)
    end
  rescue Exception
    # legacy logger handles communication with the end user and appending
    # to our error list; we just need to keep going. the new methodology
    # has no such guaranteed communication so the caller will decide how to
    # handle errors, etc.
  ensure
    cleanup
  end
  errors.size == errorlen
end

#succeeded? ⇒ `Boolean` Also known as: successful?

Was scraping successful? Call errors to get error messages if false

Return

Boolean: true if scrape finished with no error, false otherwise.




# File 'lib/right_scraper/main.rb', line 229

def succeeded?
  errors.empty?
end

#warnings ⇒ `Object`

(Array): Warnings or empty



# File 'lib/right_scraper/main.rb', line 220

def warnings
  @logger.warnings
end
