Class: GeoCombine::Harvester
Inherits: Object
Defined in: lib/geo_combine/harvester.rb
Overview
Harvests GeoBlacklight documents from OpenGeoMetadata for indexing.
Instance Attribute Summary
- #ogm_path ⇒ Object (readonly)
  Returns the value of attribute ogm_path.
- #schema_version ⇒ Object (readonly)
  Returns the value of attribute schema_version.
Class Method Summary
- .denylist ⇒ Object
  Non-metadata repositories that shouldn’t be harvested.
- .ogm_api_uri ⇒ Object
  GitHub API endpoint for OpenGeoMetadata repositories.
Instance Method Summary
- #clone(repo) ⇒ Object
  Clone a repository via git. If the repository already exists, skip it.
- #clone_all ⇒ Object
  Clone all repositories via git. Return the count of repositories cloned.
- #docs_to_index ⇒ Object
  Enumerable of docs to index, for passing to an indexer.
- #initialize(ogm_path: ENV.fetch('OGM_PATH', 'tmp/opengeometadata'), schema_version: ENV.fetch('SCHEMA_VERSION', '1.0')) ⇒ Harvester (constructor)
  A new instance of Harvester.
- #pull(repo) ⇒ Object
  Update a repository via git. If the repository doesn’t exist, clone it.
- #pull_all ⇒ Object
  Update all repositories. Return the count of repositories updated.
Constructor Details
#initialize(ogm_path: ENV.fetch('OGM_PATH', 'tmp/opengeometadata'), schema_version: ENV.fetch('SCHEMA_VERSION', '1.0')) ⇒ Harvester
Returns a new instance of Harvester.
# File 'lib/geo_combine/harvester.rb', line 30
def initialize(
  ogm_path: ENV.fetch('OGM_PATH', 'tmp/opengeometadata'),
  schema_version: ENV.fetch('SCHEMA_VERSION', '1.0')
)
  @ogm_path = ogm_path
  @schema_version = schema_version
end
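The keyword defaults in the constructor are read from the environment when `new` is called, so explicit arguments override `OGM_PATH` and `SCHEMA_VERSION`. The sketch below illustrates that behavior with a hypothetical stand-in class (`HarvesterConfigSketch`) that copies only the constructor signature, so it runs without the gem; the paths and the `'Aardvark'` value are illustrative assumptions.

```ruby
# Hypothetical stand-in that copies only the constructor shown above;
# it is not the GeoCombine class itself.
class HarvesterConfigSketch
  attr_reader :ogm_path, :schema_version

  def initialize(
    ogm_path: ENV.fetch('OGM_PATH', 'tmp/opengeometadata'),
    schema_version: ENV.fetch('SCHEMA_VERSION', '1.0')
  )
    @ogm_path = ogm_path
    @schema_version = schema_version
  end
end

# Explicit keyword arguments take precedence over the environment.
explicit = HarvesterConfigSketch.new(ogm_path: '/data/ogm', schema_version: 'Aardvark')

# With no arguments, the environment is consulted at call time,
# falling back to the literal default when a variable is unset.
ENV['OGM_PATH'] = '/srv/ogm'
ENV.delete('SCHEMA_VERSION')
from_env = HarvesterConfigSketch.new

puts explicit.ogm_path
puts from_env.ogm_path
puts from_env.schema_version
```

With the real class, the equivalent call would presumably be `GeoCombine::Harvester.new(ogm_path: ..., schema_version: ...)`.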
Instance Attribute Details
#ogm_path ⇒ Object (readonly)
Returns the value of attribute ogm_path.
# File 'lib/geo_combine/harvester.rb', line 10
def ogm_path
  @ogm_path
end
#schema_version ⇒ Object (readonly)
Returns the value of attribute schema_version.
# File 'lib/geo_combine/harvester.rb', line 10
def schema_version
  @schema_version
end
Class Method Details
.denylist ⇒ Object
Non-metadata repositories that shouldn’t be harvested
# File 'lib/geo_combine/harvester.rb', line 13
def self.denylist
  [
    'GeoCombine',
    'aardvark',
    'metadata-issues',
    'ogm_utils-python',
    'opengeometadata.github.io',
    'opengeometadata-rails',
    'gbl-1_to_aardvark'
  ]
end
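The listing does not show where the denylist is applied. One plausible use, sketched here as an assumption, is to drop denylisted names from a list of repository names before cloning. The helper `harvestable` and the candidate names are hypothetical, and the denylist is copied locally so the sketch runs without the gem.

```ruby
# Local copy of the names returned by .denylist above.
denylist = %w[
  GeoCombine aardvark metadata-issues ogm_utils-python
  opengeometadata.github.io opengeometadata-rails gbl-1_to_aardvark
]

# Hypothetical helper: keep only repository names that are not denylisted.
def harvestable(names, denylist)
  names.reject { |name| denylist.include?(name) }
end

# Illustrative repository names, not fetched from GitHub.
candidates = ['edu.stanford.purl', 'aardvark', 'GeoCombine', 'edu.nyu']
selected = harvestable(candidates, denylist)

puts selected.inspect
```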
.ogm_api_uri ⇒ Object
GitHub API endpoint for OpenGeoMetadata repositories
# File 'lib/geo_combine/harvester.rb', line 26
def self.ogm_api_uri
  URI('https://api.github.com/orgs/opengeometadata/repos?per_page=1000')
end
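GitHub's REST API documents a maximum of 100 results per page for repository listings, so a `per_page=1000` request is presumably capped at 100. If the organization ever exceeds that, a caller would need to request further pages. The sketch below only builds the per-page URIs; the helper `ogm_repos_page_uri` is hypothetical and not part of GeoCombine, and no network request is made.

```ruby
require 'uri'

# Hypothetical helper: build the URI for one page of the repository listing.
def ogm_repos_page_uri(page, per_page: 100)
  query = URI.encode_www_form(per_page: per_page, page: page)
  URI("https://api.github.com/orgs/opengeometadata/repos?#{query}")
end

first_page = ogm_repos_page_uri(1)
second_page = ogm_repos_page_uri(2)

puts first_page
puts second_page
```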
Instance Method Details
#clone(repo) ⇒ Object
Clone a repository via git. If the repository already exists, skip it.
# File 'lib/geo_combine/harvester.rb', line 76
def clone(repo)
  repo_path = File.join(@ogm_path, repo)
  repo_info = repository_info(repo)

  # Skip if exists; warn if archived or empty
  if File.directory? repo_path
    puts "Skipping clone to #{repo_path}; directory exists"
    return 0
  end
  puts "WARNING: repository '#{repo}' is archived" if repo_info['archived']
  puts "WARNING: repository '#{repo}' is empty" if repo_info['size'].zero?

  repo_url = "https://github.com/OpenGeoMetadata/#{repo}.git"
  Git.clone(repo_url, nil, path: ogm_path, depth: 1)
  puts "Cloned #{repo_url}"
  1
end
#clone_all ⇒ Object
Clone all repositories via git. Return the count of repositories cloned.
# File 'lib/geo_combine/harvester.rb', line 96
def clone_all
  repositories.map(&method(:clone)).reduce(:+)
end
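Each `clone` call returns 1 or 0, so the count is the sum of those values. One detail worth knowing when consuming the result: `reduce(:+)` on an empty array returns `nil`, not 0, so a caller may want to guard for that case. The arrays below are illustrative stand-ins for the per-repository results.

```ruby
# Illustrative per-repository results: two cloned, one skipped.
counts = [1, 0, 1]
total = counts.reduce(:+)

# With no repositories at all, reduce(:+) has nothing to combine.
empty_reduce = [].reduce(:+)

# Array#sum starts from 0, so it yields a number even when empty.
empty_sum = [].sum

puts total
puts empty_reduce.inspect
puts empty_sum
```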
#docs_to_index ⇒ Object
Enumerable of docs to index, for passing to an indexer
# File 'lib/geo_combine/harvester.rb', line 39
def docs_to_index
  return to_enum(:docs_to_index) unless block_given?

  Find.find(@ogm_path) do |path|
    # skip non-json and layers.json files
    next unless File.basename(path).include?('.json') && File.basename(path) != 'layers.json'

    doc = JSON.parse(File.read(path))
    [doc].flatten.each do |record|
      # skip indexing if this record has a different schema version than what we want
      record_schema = record['gbl_mdVersion_s'] || record['geoblacklight_version']
      next unless record_schema == @schema_version

      yield record, path
    end
  end
end
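To see the filtering in action without the gem or a cloned repository, the sketch below reproduces the same logic in a hypothetical stand-in class (`DocFinderSketch`) and runs it over a temporary directory. The file names and record contents are made up for illustration.

```ruby
require 'find'
require 'json'
require 'tmpdir'

# Hypothetical stand-in mirroring the filtering in #docs_to_index above.
class DocFinderSketch
  def initialize(root, schema_version)
    @root = root
    @schema_version = schema_version
  end

  def docs_to_index
    # Without a block, return an Enumerator over the same records.
    return to_enum(:docs_to_index) unless block_given?

    Find.find(@root) do |path|
      name = File.basename(path)
      next unless name.include?('.json') && name != 'layers.json'

      # A file may hold one record or an array of records.
      [JSON.parse(File.read(path))].flatten.each do |record|
        version = record['gbl_mdVersion_s'] || record['geoblacklight_version']
        yield record, path if version == @schema_version
      end
    end
  end
end

matches = Dir.mktmpdir do |root|
  File.write(File.join(root, 'a.json'), JSON.generate('geoblacklight_version' => '1.0', 'id' => 'a'))
  File.write(File.join(root, 'b.json'), JSON.generate([{ 'gbl_mdVersion_s' => 'Aardvark', 'id' => 'b' }]))
  File.write(File.join(root, 'layers.json'), JSON.generate('ignored' => true))

  DocFinderSketch.new(root, '1.0').docs_to_index.map { |record, _path| record['id'] }
end

puts matches.inspect
```

Only the record whose version field equals the requested schema version is yielded; `layers.json` and the record with a different version are skipped.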
#pull(repo) ⇒ Object
Update a repository via git. If the repository doesn’t exist, clone it.
# File 'lib/geo_combine/harvester.rb', line 59
def pull(repo)
  repo_path = File.join(@ogm_path, repo)
  clone(repo) unless File.directory? repo_path

  Git.open(repo_path).pull
  puts "Updated #{repo}"
  1
end
#pull_all ⇒ Object
Update all repositories. Return the count of repositories updated.
# File 'lib/geo_combine/harvester.rb', line 70
def pull_all
  repositories.map(&method(:pull)).reduce(:+)
end