Class: RightScraper::Main
- Inherits:
-
Object
- Object
- RightScraper::Main
- Defined in:
- lib/right_scraper/main.rb
Overview
Library main entry point. Instantiate this class and call the scrape method to download or update a remote repository to the local disk and run a scraper on the resulting files.
Note that this class was known as Scraper in v1-3 but the name was confusing due to the Scrapers module performing only a subset of the main Scraper class functionality.
Instance Attribute Summary
-
#logger ⇒ Object
readonly
Returns the value of attribute logger.
-
#resources ⇒ Object
readonly
Returns the value of attribute resources.
Instance Method Summary
-
#base_dir ⇒ Object
base directory for any file operations.
-
#builders ⇒ Object
- (Array)
-
builders or empty.
-
#cleanup ⇒ Object
cleans up temporary files, etc.
-
#errors ⇒ Object
- (Array)
-
Error messages in case of failure.
-
#freed_dir(repo) ⇒ Object
Path to directory where scanned artifacts can be copied out of containment due to lack of permissions to write to other directories.
-
#initialize(options = {}) ⇒ Main
constructor
Initialize scrape destination directory.
-
#repo_dir(repo) ⇒ Object
Path to directory where given repo should be or was downloaded.
-
#retrieve(repo) ⇒ Object
Retrieves the given repository.
-
#scan(retrieved) ⇒ Object
Scans a local directory.
-
#scanners ⇒ Object
- (Array)
-
scanners or empty.
-
#scrape(repo, incremental = true, &callback) ⇒ Object
deprecated
Deprecated.
The newer methodology performs these operations in stages controlled externally instead of calling this all-in-one method.
-
#succeeded? ⇒ Boolean
(also: #successful?)
Was scraping successful? Call errors to get error messages if false.
-
#warnings ⇒ Object
- (Array)
-
Warnings or empty.
Constructor Details
#initialize(options = {}) ⇒ Main
Initialize scrape destination directory
Options
:kind
-
Type of scraper that will traverse directory for resources, one of :cookbook or :workflow
:basedir
-
Local directory where files are retrieved and scraped, use temporary directory if nil
:max_bytes
-
Maximum number of bytes to read from remote repo, unlimited if nil
:max_seconds
-
Maximum number of seconds to spend reading from remote repo, unlimited if nil
    # File 'lib/right_scraper/main.rb', line 50
    def initialize(options={})
      options = ::RightSupport::Data::Mash.new(
        :kind => nil,
        :basedir => nil,
        :max_bytes => nil,
        :max_seconds => nil,
        :logger => nil,
        :s3_key => nil,
        :s3_secret => nil,
        :s3_bucket => nil,
        :scanners => nil,
        :builders => nil,
      ).merge(options)
      @temporary = !options.has_key?(:basedir)
      options[:basedir] ||= Dir.mktmpdir
      options[:logger] ||= ::RightScraper::Loggers::Default.new
      @logger = options[:logger]
      @resources = []
      options[:errors] = @logger.errors
      options[:warnings] = @logger.warnings

      # load classes from scanners and builders options, if necessary.
      [:scanners, :builders].each do |k|
        list = options[k] || []
        list.each_with_index do |clazz, index|
          unless clazz.kind_of?(::Class)
            list[index] = ::Object.const_get(clazz)
          end
        end
      end
      @options = options
    end
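The option handling in the constructor follows a common pattern: merge caller options over a table of defaults, then replace any class names given as strings with the classes they name. The following is a minimal sketch of that pattern only. It uses a plain Hash rather than RightSupport's Mash, and `DEFAULTS`, `normalize_options`, and the `Demo` classes are hypothetical names introduced for illustration, not part of right_scraper.

```ruby
# Hypothetical stand-ins for scanner and builder classes (not part of right_scraper).
module Demo
  class ManifestScanner; end
  class ArchiveBuilder; end
end

# Merging over defaults guarantees every known key is present (nil when unset).
DEFAULTS = { kind: nil, basedir: nil, scanners: nil, builders: nil }.freeze

def normalize_options(options = {})
  merged = DEFAULTS.merge(options)
  # Entries may be classes or class names; resolve names to classes in place.
  [:scanners, :builders].each do |key|
    list = merged[key] || []
    list.each_with_index do |clazz, index|
      list[index] = ::Object.const_get(clazz) unless clazz.is_a?(::Class)
    end
  end
  merged
end

opts = normalize_options(kind: :cookbook,
                         scanners: ['Demo::ManifestScanner'],
                         builders: [Demo::ArchiveBuilder])
p opts[:scanners]  # the string has been replaced by the class it names
```

Resolving names lazily like this lets a caller pass scanners and builders from a configuration file without loading the classes up front, at the cost of a NameError if a name cannot be resolved.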
Instance Attribute Details
#logger ⇒ Object (readonly)
Returns the value of attribute logger.
    # File 'lib/right_scraper/main.rb', line 41
    def logger
      @logger
    end
#resources ⇒ Object (readonly)
Returns the value of attribute resources.
    # File 'lib/right_scraper/main.rb', line 41
    def resources
      @resources
    end
Instance Method Details
#base_dir ⇒ Object
base directory for any file operations.
    # File 'lib/right_scraper/main.rb', line 195
    def base_dir
      @options[:basedir]
    end
#builders ⇒ Object
- (Array)
-
builders or empty
    # File 'lib/right_scraper/main.rb', line 233
    def builders
      return @options[:builders]
    end
#cleanup ⇒ Object
cleans up temporary files, etc.
    # File 'lib/right_scraper/main.rb', line 200
    def cleanup
      ::FileUtils.remove_entry_secure(base_dir) rescue nil if @temporary
    end
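As a rough illustration of the scratch-directory lifecycle that #cleanup takes part in, the sketch below creates a directory with `Dir.mktmpdir` and removes it with `FileUtils.remove_entry_secure`, swallowing any failure the way the `rescue nil` modifier does. The helper name `remove_scratch_dir` and the file name are hypothetical; the library's own condition for when removal happens is not reproduced here.

```ruby
require 'fileutils'
require 'tmpdir'

# Remove a scratch directory tree. Returns true on success and nil on failure,
# since a cleanup step should not raise.
def remove_scratch_dir(path)
  FileUtils.remove_entry_secure(path)
  true
rescue StandardError
  nil
end

Dir.mktmpdir do |parent|
  # Scratch directory nested under a private parent directory.
  scratch = Dir.mktmpdir('scrape-', parent)
  File.write(File.join(scratch, 'metadata.json'), '{}')

  remove_scratch_dir(scratch)
  puts File.exist?(scratch)
end
```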
#errors ⇒ Object
- (Array)
-
Error messages in case of failure
    # File 'lib/right_scraper/main.rb', line 223
    def errors
      @logger.errors
    end
#freed_dir(repo) ⇒ Object
Path to directory where scanned artifacts can be copied out of containment due to lack of permissions to write to other directories. The freed files can then be reused by subsequent scanners, etc.
    # File 'lib/right_scraper/main.rb', line 218
    def freed_dir(repo)
      ::File.expand_path('../freed', repo_dir(repo))
    end
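To make the relative path concrete: assuming the freed directory is resolved with `File.expand_path`, `'../freed'` names a sibling of the repository directory, not a child of it. The repository path below is a made-up example, and `sibling_freed_dir` is a hypothetical helper; the real directory layout is decided by the retriever.

```ruby
# Resolve the "freed" directory as a sibling of the given repository directory.
def sibling_freed_dir(repo_dir)
  File.expand_path('../freed', repo_dir)
end

# Hypothetical repository directory; real names come from the retriever.
puts sibling_freed_dir('/var/tmp/scrape/repo_abc123')  # => /var/tmp/scrape/freed (on a POSIX system)
```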
#repo_dir(repo) ⇒ Object
Path to directory where given repo should be or was downloaded
Parameters
- repo(Hash|RightScraper::Repositories::Base)
-
Remote repository corresponding to local directory
Return
- String
-
Path to local directory that corresponds to given repository
    # File 'lib/right_scraper/main.rb', line 211
    def repo_dir(repo)
      RightScraper::Retrievers::Base.repo_dir(base_dir, repo)
    end
#retrieve(repo) ⇒ Object
Retrieves the given repository. See #scrape for details.
    # File 'lib/right_scraper/main.rb', line 131
    def retrieve(repo)
      errorlen = errors.size
      unless repo.kind_of?(::RightScraper::Repositories::Base)
        repo = ::RightSupport::Data::Mash.new(repo)
        repository_hash = repo.delete(:repository_hash)  # optional
        repo = RightScraper::Repositories::Base.from_hash(repo)
        if repository_hash && repository_hash != repo.repository_hash
          raise RightScraper::Error, "Repository hash mismatch: #{repository_hash} != #{repo.repository_hash}"
        end
      end
      retriever = nil

      # 1. Retrieve the files
      @logger.operation(:retrieving, "from #{repo}") do
        # note that the retriever type may be unavailable but allow the
        # retrieve method to raise any such error.
        retriever = repo.retriever(@options)
        retriever.retrieve
      end

      if errors.size == errorlen
        # create the freed directory with world-writable permission for
        # subsequent scan output for less-privileged child processes.
        freed_base_path = freed_dir(repo)
        ::FileUtils.rm_rf(freed_base_path) if ::File.exist?(freed_base_path)
        ::FileUtils.mkdir_p(freed_base_path)
        ::File.chmod(0777, freed_base_path)

        # the following hash is needed for running any subsequent scanners.
        {
          ignorable_paths: retriever.ignorable_paths,
          repo_dir: retriever.repo_dir,
          freed_dir: freed_base_path,
          repository: retriever.repository
        }
      else
        nil
      end
    end
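The freed-directory step in #retrieve (discard any earlier output, recreate the directory, make it world-writable) can be tried in isolation. This is a minimal sketch using only the standard library; the helper name `prepare_freed_dir` is hypothetical.

```ruby
require 'fileutils'
require 'tmpdir'

# Recreate +path+ as an empty directory that any user may write to.
def prepare_freed_dir(path)
  FileUtils.rm_rf(path) if File.exist?(path)  # drop output left by earlier runs
  FileUtils.mkdir_p(path)
  File.chmod(0777, path)                      # explicit mode, so the umask does not apply
  path
end

Dir.mktmpdir do |base|
  freed = prepare_freed_dir(File.join(base, 'freed'))
  printf("%o\n", File.stat(freed).mode & 0777)
end
```

Setting the mode with `File.chmod` after `mkdir_p` matters because a mode passed at creation time is filtered through the process umask, which would typically strip the write bit for other users.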
#scan(retrieved) ⇒ Object
Scans a local directory. See #scrape for details.
    # File 'lib/right_scraper/main.rb', line 173
    def scan(retrieved)
      errorlen = errors.size
      old_callback = @logger.callback
      options = ::RightSupport::Data::Mash.new(@options).merge(retrieved)
      repo = options[:repository]
      unless repo.kind_of?(::RightScraper::Repositories::Base)
        repo = ::RightSupport::Data::Mash.new(repo)
        repository_hash = repo.delete(:repository_hash)  # optional
        repo = RightScraper::Repositories::Base.from_hash(repo)
        if repository_hash && repository_hash != repo.repository_hash
          raise RightScraper::Error, "Repository hash mismatch: #{repository_hash} != #{repo.repository_hash}"
        end
        options[:repository] = repo
      end
      @logger.operation(:scraping, options[:repo_dir]) do
        scraper = ::RightScraper::Scrapers::Base.scraper(options)
        @resources += scraper.scrape
      end
      errors.size == errorlen
    end
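Both #retrieve and #scan judge success by comparing the size of the error list before and after the stage, so a stage reports only on errors it added itself. The sketch below shows that snapshot pattern with a stand-in logger. `ListLogger` and `run_stage` are hypothetical, and the assumption that the logger records a failure and swallows the exception is made for this sketch only; it is not a statement about RightScraper's loggers.

```ruby
# Hypothetical stand-in logger that records failures in a list.
class ListLogger
  attr_reader :errors

  def initialize
    @errors = []
  end

  # Run the block; on failure, record a message instead of raising (an assumption).
  def operation(name)
    yield
  rescue StandardError => e
    @errors << "#{name} failed: #{e.message}"
  end
end

# True only if the stage added no new errors, regardless of earlier ones.
def run_stage(logger, name)
  errorlen = logger.errors.size
  logger.operation(name) { yield }
  logger.errors.size == errorlen
end

logger = ListLogger.new
p run_stage(logger, :retrieving) { :ok }
p run_stage(logger, :scraping) { raise 'unreadable metadata' }
p logger.errors
```

A per-stage result of this kind differs from #succeeded?, which checks whether the whole error list is empty.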
#scanners ⇒ Object
- (Array)
-
scanners or empty
    # File 'lib/right_scraper/main.rb', line 238
    def scanners
      return @options[:scanners]
    end
#scrape(repo, incremental = true, &callback) ⇒ Object
Scrapes and scans a given repository.
Deprecated: the newer methodology performs these operations in stages controlled externally (see #retrieve and #scan) instead of calling this all-in-one method.
Parameters
- repo(Hash|RightScraper::Repositories::Base)
-
Repository to be scraped
Note: repo can either be a Hash or a RightScraper::Repositories::Base instance.
See the RightScraper::Repositories::Base class for valid Hash keys.
Block
If a block is given, it is called back with progress information. The block should take four arguments:
- The first argument is one of :begin, :commit, or :abort, which signifies what the scraper is trying to do and where it is when it does it.
- The second argument is a symbol describing the operation being performed in an easy-to-match way.
- The third argument is optional further explanation.
- The fourth argument is the pending exception (only relevant for :abort).
Return
- true
-
If scrape was successful
- false
-
If scrape failed, call errors for information on failure
Raise
- ‘Invalid repository type’
-
If repository type is not known
    # File 'lib/right_scraper/main.rb', line 110
    def scrape(repo, incremental=true, &callback)
      old_logger_callback = @logger.callback
      @logger.callback = callback
      errorlen = errors.size
      begin
        if retrieved = retrieve(repo)
          scan(retrieved)
        end
      rescue Exception
        # legacy logger handles communication with the end user and appending
        # to our error list; we just need to keep going. the new methodology
        # has no such guaranteed communication so the caller will decide how to
        # handle errors, etc.
      ensure
        @logger.callback = old_logger_callback
        cleanup
      end
      errors.size == errorlen
    end
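The four-argument progress block described for #scrape could be exercised with a sketch like the following. `with_progress` is a hypothetical helper that emits :begin, :commit, and :abort in the documented argument order; how the library's logger actually invokes the callback is not shown on this page, so treat the emission logic as an assumption.

```ruby
# Hypothetical helper: report progress around a unit of work using the
# (phase, operation, explanation, exception) argument order.
def with_progress(callback, operation, explanation = '')
  callback.call(:begin, operation, explanation, nil) if callback
  result = yield
  callback.call(:commit, operation, explanation, nil) if callback
  result
rescue StandardError => e
  # The pending exception is only meaningful for :abort.
  callback.call(:abort, operation, explanation, e) if callback
  raise
end

# A callback of the documented shape that prints one line per event.
progress = lambda do |phase, operation, explanation, exception|
  line = "#{phase} #{operation} #{explanation}".strip
  line += " (#{exception.message})" if exception
  puts line
end

with_progress(progress, :retrieving, 'from example repo') { :done }

begin
  with_progress(progress, :scraping) { raise 'unreadable metadata' }
rescue RuntimeError
  # The caller decides how to handle the failure.
end
```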
#succeeded? ⇒ Boolean Also known as: successful?
Was scraping successful? Call errors to get error messages if false
Return
- Boolean
-
true if scrape finished with no error, false otherwise.
    # File 'lib/right_scraper/main.rb', line 247
    def succeeded?
      errors.empty?
    end
#warnings ⇒ Object
- (Array)
-
Warnings or empty
    # File 'lib/right_scraper/main.rb', line 228
    def warnings
      @logger.warnings
    end