Class: RightScraper::Main

Inherits: Object
Defined in:
lib/right_scraper/main.rb

Overview

Library main entry point. Instantiate this class and call the scrape method to download or update a remote repository to the local disk and run a scraper on the resulting files.

Note that this class was known as Scraper in v1-3, but that name was confusing because the Scrapers module performs only a subset of the main class's functionality.

Instance Attribute Summary

Instance Method Summary

Constructor Details

#initialize(options = {}) ⇒ Main

Initializes the scrape destination directory.

Options

:kind

Type of scraper that will traverse directory for resources, one of :cookbook or :workflow

:basedir

Local directory where files are retrieved and scraped, use temporary directory if nil

:max_bytes

Maximum number of bytes to read from remote repo, unlimited if nil

:max_seconds

Maximum number of seconds to spend reading from remote repo, unlimited if nil



# File 'lib/right_scraper/main.rb', line 50

def initialize(options={})
  options = ::RightSupport::Data::Mash.new(
    :kind        => nil,
    :basedir     => nil,
    :max_bytes   => nil,
    :max_seconds => nil,
    :logger      => nil,
    :s3_key      => nil,
    :s3_secret   => nil,
    :s3_bucket   => nil,
    :scanners    => nil,
    :builders    => nil,
  ).merge(options)
  @temporary = !options.has_key?(:basedir)
  options[:basedir] ||= Dir.mktmpdir
  options[:logger] ||= ::RightScraper::Loggers::Default.new
  @logger = options[:logger]
  @resources = []
  options[:errors] = @logger.errors
  options[:warnings] = @logger.warnings

  # load classes from scanners and builders options, if necessary.
  [:scanners, :builders].each do |k|
    list = options[k] || []
    list.each_with_index do |clazz, index|
      unless clazz.kind_of?(::Class)
        list[index] = ::Object.const_get(clazz)
      end
    end
  end
  @options = options
end
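As a usage sketch (assuming the right_scraper gem is installed; the option values shown are illustrative, not defaults):

```ruby
require 'right_scraper'

# Create a scraper that traverses retrieved files for cookbook
# resources. With no :basedir, a temporary directory is created and
# later removed by #cleanup; the limits guard against runaway repos.
scraper = RightScraper::Main.new(
  :kind        => :cookbook,
  :max_bytes   => 64 * 1024 * 1024,  # stop after 64 MiB
  :max_seconds => 300                # stop after five minutes
)
```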

Instance Attribute Details

#logger ⇒ Object (readonly)

Returns the value of attribute logger.



# File 'lib/right_scraper/main.rb', line 41

def logger
  @logger
end

#resources ⇒ Object (readonly)

Returns the value of attribute resources.



# File 'lib/right_scraper/main.rb', line 41

def resources
  @resources
end

Instance Method Details

#base_dir ⇒ Object

Base directory for any file operations.



# File 'lib/right_scraper/main.rb', line 195

def base_dir
  @options[:basedir]
end

#builders ⇒ Object

(Array)

Builders or empty



# File 'lib/right_scraper/main.rb', line 233

def builders
  return @options[:builders]
end

#cleanup ⇒ Object

Cleans up temporary files, etc.



# File 'lib/right_scraper/main.rb', line 200

def cleanup
  ::FileUtils.remove_entry_secure(base_dir) rescue nil if @temporary
end

#errors ⇒ Object

(Array)

Error messages in case of failure



# File 'lib/right_scraper/main.rb', line 223

def errors
  @logger.errors
end

#freed_dir(repo) ⇒ Object

Path to the directory where scanned artifacts can be copied out of containment when a lack of permissions prevents writing to other directories. The freed files can then be reused by subsequent scanners, etc.



# File 'lib/right_scraper/main.rb', line 218

def freed_dir(repo)
  ::File.expand_path('../freed', repo_dir(repo))
end

#repo_dir(repo) ⇒ Object

Path to the local directory where the given repo was, or will be, downloaded.

Parameters

repo(Hash|RightScraper::Repositories::Base)

Remote repository corresponding to local directory

Return

String

Path to local directory that corresponds to given repository



# File 'lib/right_scraper/main.rb', line 211

def repo_dir(repo)
  RightScraper::Retrievers::Base.repo_dir(base_dir, repo)
end

#retrieve(repo) ⇒ Object

Retrieves the given repository. See #scrape for details.



# File 'lib/right_scraper/main.rb', line 131

def retrieve(repo)
  errorlen = errors.size
  unless repo.kind_of?(::RightScraper::Repositories::Base)
    repo = ::RightSupport::Data::Mash.new(repo)
    repository_hash = repo.delete(:repository_hash)  # optional
    repo = RightScraper::Repositories::Base.from_hash(repo)
    if repository_hash && repository_hash != repo.repository_hash
      raise RightScraper::Error, "Repository hash mismatch: #{repository_hash} != #{repo.repository_hash}"
    end
  end

  retriever = nil

  # 1. Retrieve the files
  @logger.operation(:retrieving, "from #{repo}") do
    # note that the retriever type may be unavailable but allow the
    # retrieve method to raise any such error.
    retriever = repo.retriever(@options)
    retriever.retrieve
  end

  if errors.size == errorlen
    # create the freed directory with world-writable permission for
    # subsequent scan output for less-privileged child processes.
    freed_base_path = freed_dir(repo)
    ::FileUtils.rm_rf(freed_base_path) if ::File.exist?(freed_base_path)
    ::FileUtils.mkdir_p(freed_base_path)
    ::File.chmod(0777, freed_base_path)

    # the following hash is needed for running any subsequent scanners.
    {
      ignorable_paths: retriever.ignorable_paths,
      repo_dir: retriever.repo_dir,
      freed_dir: freed_base_path,
      repository: retriever.repository
    }
  else
    nil
  end
end

#scan(retrieved) ⇒ Object

Scans a local directory. See #scrape for details.



# File 'lib/right_scraper/main.rb', line 173

def scan(retrieved)
  errorlen = errors.size
  old_callback = @logger.callback
  options = ::RightSupport::Data::Mash.new(@options).merge(retrieved)
  repo = options[:repository]
  unless repo.kind_of?(::RightScraper::Repositories::Base)
    repo = ::RightSupport::Data::Mash.new(repo)
    repository_hash = repo.delete(:repository_hash)  # optional
    repo = RightScraper::Repositories::Base.from_hash(repo)
    if repository_hash && repository_hash != repo.repository_hash
      raise RightScraper::Error, "Repository hash mismatch: #{repository_hash} != #{repo.repository_hash}"
    end
    options[:repository] = repo
  end
  @logger.operation(:scraping, options[:repo_dir]) do
    scraper = ::RightScraper::Scrapers::Base.scraper(options)
    @resources += scraper.scrape
  end
  errors.size == errorlen
end
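Taken together, #retrieve and #scan form the staged replacement for the deprecated all-in-one #scrape. A minimal sketch (the repository hash keys shown are illustrative; see RightScraper::Repositories::Base for the valid keys):

```ruby
require 'right_scraper'

scraper = RightScraper::Main.new(:kind => :cookbook)

# Stage 1: download the repository. Returns nil if retrieval logged
# errors, otherwise a hash suitable for passing to #scan.
retrieved = scraper.retrieve(
  :repo_type => :git,
  :url       => 'git://example.com/cookbooks.git'
)

# Stage 2: scan the retrieved files for resources.
scraper.scan(retrieved) if retrieved

p scraper.resources if scraper.succeeded?
scraper.cleanup
```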

#scanners ⇒ Object

(Array)

Scanners or empty



# File 'lib/right_scraper/main.rb', line 238

def scanners
  return @options[:scanners]
end

#scrape(repo, incremental = true, &callback) ⇒ Object

Deprecated.

The newer methodology will perform these operations in stages controlled externally instead of calling this all-in-one method.

Scrapes and scans a given repository.

Parameters

repo(Hash|RightScraper::Repositories::Base)

Repository to be scraped

Note: repo can either be a Hash or a RightScraper::Repositories::Base instance.
      See the RightScraper::Repositories::Base class for valid Hash keys.

Block

If a block is given, it will be called back with progress information. The block should take four arguments:

  • first argument is one of :begin, :commit, or :abort, which signifies what the scraper is trying to do and where it is when it does it

  • second argument is a symbol describing the operation being performed in an easy-to-match way

  • third argument is optional further explanation

  • fourth argument is the exception pending (only relevant for :abort)

Return

true

If scrape was successful

false

If the scrape failed; call errors for information on the failure

Raise

'Invalid repository type'

If repository type is not known



# File 'lib/right_scraper/main.rb', line 110

def scrape(repo, incremental=true, &callback)
  old_logger_callback = @logger.callback
  @logger.callback = callback
  errorlen = errors.size
  begin
    if retrieved = retrieve(repo)
      scan(retrieved)
    end
  rescue Exception
    # legacy logger handles communication with the end user and appending
    # to our error list; we just need to keep going. the new methodology
    # has no such guaranteed communication so the caller will decide how to
    # handle errors, etc.
  ensure
    @logger.callback = old_logger_callback
    cleanup
  end
  errors.size == errorlen
end
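For legacy callers, the deprecated all-in-one method with a progress callback might look like this (a sketch; the repository hash keys are illustrative, see RightScraper::Repositories::Base for the valid keys):

```ruby
require 'right_scraper'

scraper = RightScraper::Main.new(:kind => :cookbook)

# The four-argument callback receives :begin/:commit/:abort, an
# operation symbol, an optional explanation, and a pending exception
# (relevant only for :abort).
ok = scraper.scrape(:repo_type => :git,
                    :url       => 'git://example.com/cookbooks.git') do |phase, operation, explanation, exception|
  puts "#{phase} #{operation} #{explanation}"
end
puts scraper.errors.inspect unless ok
```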

#succeeded? ⇒ Boolean Also known as: successful?

Was scraping successful? Call errors to get error messages if false

Return

Boolean

true if scrape finished with no error, false otherwise.

Returns:

  • (Boolean)


# File 'lib/right_scraper/main.rb', line 247

def succeeded?
  errors.empty?
end

#warnings ⇒ Object

(Array)

Warnings or empty



# File 'lib/right_scraper/main.rb', line 228

def warnings
  @logger.warnings
end