Class: GeoCombine::Harvester

Inherits:
Object
Defined in:
lib/geo_combine/harvester.rb

Overview

Harvests GeoBlacklight documents from OpenGeoMetadata for indexing.

Constructor Details

#initialize(ogm_path: ENV.fetch('OGM_PATH', 'tmp/opengeometadata'), schema_version: ENV.fetch('SCHEMA_VERSION', '1.0')) ⇒ Harvester

Returns a new instance of Harvester.



# File 'lib/geo_combine/harvester.rb', line 30

def initialize(
  ogm_path: ENV.fetch('OGM_PATH', 'tmp/opengeometadata'),
  schema_version: ENV.fetch('SCHEMA_VERSION', '1.0')
)
  @ogm_path = ogm_path
  @schema_version = schema_version
end
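
Both keyword defaults are read from the environment at the moment `new` is called. The following is a minimal sketch of that lookup, not the library's code: `default_harvester_options` is a hypothetical helper, a plain Hash stands in for `ENV`, and `'Aardvark'` is only an example override value.

```ruby
# Hypothetical helper (not part of GeoCombine). `env` is anything that
# responds to #fetch(key, default), such as ENV or a plain Hash.
def default_harvester_options(env)
  {
    ogm_path: env.fetch('OGM_PATH', 'tmp/opengeometadata'),
    schema_version: env.fetch('SCHEMA_VERSION', '1.0')
  }
end

# Nothing set: the documented defaults apply.
defaults = default_harvester_options({})

# One variable set: only that option changes.
overridden = default_harvester_options({ 'SCHEMA_VERSION' => 'Aardvark' })
```

Passing `ogm_path:` or `schema_version:` explicitly to the constructor bypasses the environment lookup entirely, because the default expression is only evaluated when the argument is omitted.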

Instance Attribute Details

#ogm_path ⇒ Object (readonly)

Returns the value of attribute ogm_path.



# File 'lib/geo_combine/harvester.rb', line 10

def ogm_path
  @ogm_path
end

#schema_version ⇒ Object (readonly)

Returns the value of attribute schema_version.



# File 'lib/geo_combine/harvester.rb', line 10

def schema_version
  @schema_version
end

Class Method Details

.denylist ⇒ Object

Non-metadata repositories that shouldn’t be harvested



# File 'lib/geo_combine/harvester.rb', line 13

def self.denylist
  [
    'GeoCombine',
    'aardvark',
    'metadata-issues',
    'ogm_utils-python',
    'opengeometadata.github.io',
    'opengeometadata-rails',
    'gbl-1_to_aardvark'
  ]
end
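
How the denylist is applied is not shown on this page. One plausible use, sketched here under that assumption, is to drop denylisted names from a list of repository names before cloning. `harvestable` and the sample names are hypothetical.

```ruby
# Names copied from Harvester.denylist above.
DENYLIST = %w[
  GeoCombine aardvark metadata-issues ogm_utils-python
  opengeometadata.github.io opengeometadata-rails gbl-1_to_aardvark
].freeze

# Hypothetical helper: keep only names that are not on the denylist.
# Array#include? compares strings exactly, so matching is case-sensitive.
def harvestable(names, denylist = DENYLIST)
  names.reject { |name| denylist.include?(name) }
end

# Made-up repository names, for illustration only.
names = ['edu.example', 'GeoCombine', 'metadata-issues', 'org.example']
kept = harvestable(names)
```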

.ogm_api_uri ⇒ Object

GitHub API endpoint for OpenGeoMetadata repositories



# File 'lib/geo_combine/harvester.rb', line 26

def self.ogm_api_uri
  URI('https://api.github.com/orgs/opengeometadata/repos?per_page=1000')
end
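
The returned value is a parsed `URI` object rather than a string, so its parts can be inspected without any network access. The sketch below only takes the URI apart. Note that GitHub's REST API documentation gives 100 as the maximum page size, so a `per_page` of 1000 is presumably capped by the server; that behaviour is an assumption here and is not stated in the source above.

```ruby
require 'uri'

# Same literal as in Harvester.ogm_api_uri.
uri = URI('https://api.github.com/orgs/opengeometadata/repos?per_page=1000')

host = uri.host                               # API host name
path = uri.path                               # organization repositories endpoint
params = URI.decode_www_form(uri.query).to_h  # query string as a Hash of strings
```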

Instance Method Details

#clone(repo) ⇒ Object

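Clone a repository via git. If the repository directory already exists, skip it. Returns 1 if the repository was cloned and 0 if it was skipped; prints a warning if the repository is archived or empty.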



# File 'lib/geo_combine/harvester.rb', line 76

def clone(repo)
  repo_path = File.join(@ogm_path, repo)
  repo_info = repository_info(repo)

  # Skip if exists; warn if archived or empty
  if File.directory? repo_path
    puts "Skipping clone to #{repo_path}; directory exists"
    return 0
  end
  puts "WARNING: repository '#{repo}' is archived" if repo_info['archived']
  puts "WARNING: repository '#{repo}' is empty" if repo_info['size'].zero?

  repo_url = "https://github.com/OpenGeoMetadata/#{repo}.git"
  Git.clone(repo_url, nil, path: ogm_path, depth: 1)
  puts "Cloned #{repo_url}"
  1
end
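
The return value doubles as a counter: 1 for a fresh clone, 0 for a skip. The sketch below reproduces only that skip-or-clone decision so it can run without git or network access. `clone_or_skip` is a hypothetical stand-in, and the block replaces the real `Git.clone` call.

```ruby
require 'tmpdir'
require 'fileutils'

# Hypothetical stand-in for Harvester#clone. The block performs the "clone";
# here it just creates the directory.
def clone_or_skip(base_path, repo)
  repo_path = File.join(base_path, repo)
  return 0 if File.directory?(repo_path) # already present: skip, count 0

  yield repo_path                        # caller does the actual cloning
  1                                      # one repository cloned
end

counts = Dir.mktmpdir do |base|
  fake_clone = ->(path) { FileUtils.mkdir_p(path) }
  first  = clone_or_skip(base, 'edu.example', &fake_clone) # directory missing
  second = clone_or_skip(base, 'edu.example', &fake_clone) # directory exists
  [first, second]
end
```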

#clone_all ⇒ Object

Clone all repositories via git. Returns the count of repositories cloned.



# File 'lib/geo_combine/harvester.rb', line 96

def clone_all
  repositories.map(&method(:clone)).reduce(:+)
end
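
Because each `clone` call returns 1 or 0, adding the results yields the number of repositories actually cloned. A small sketch of that arithmetic, using made-up results, also shows one edge case worth knowing: `reduce(:+)` on an empty array returns `nil`, whereas `sum` returns 0.

```ruby
# Illustrative per-repository results: two clones and one skip.
results = [1, 0, 1]
total = results.reduce(:+)

# With no repositories at all, the two idioms differ.
empty_reduce = [].reduce(:+)
empty_sum = [].sum
```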

#docs_to_index ⇒ Object

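Enumerable of docs to index, for passing to an indexer. With a block, yields each record whose schema version matches schema_version, together with the path of the file it came from; without a block, returns an Enumerator over the same pairs. Files named layers.json are skipped.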



# File 'lib/geo_combine/harvester.rb', line 39

def docs_to_index
  return to_enum(:docs_to_index) unless block_given?

  Find.find(@ogm_path) do |path|
    # skip non-json and layers.json files
    next unless File.basename(path).include?('.json') && File.basename(path) != 'layers.json'

    doc = JSON.parse(File.read(path))
    [doc].flatten.each do |record|
      # skip indexing if this record has a different schema version than what we want
      record_schema = record['gbl_mdVersion_s'] || record['geoblacklight_version']
      next unless record_schema == @schema_version

      yield record, path
    end
  end
end
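
To make the filtering rules concrete, here is a standalone sketch that copies the traversal logic into a module and runs it against a temporary directory. `DocFinder.each_matching` and the sample files are hypothetical; only the filtering conditions are taken from the source above.

```ruby
require 'find'
require 'json'
require 'tmpdir'

module DocFinder
  # Hypothetical copy of the docs_to_index logic, parameterised so that it
  # runs without a Harvester instance.
  def self.each_matching(root, schema_version)
    return to_enum(:each_matching, root, schema_version) unless block_given?

    Find.find(root) do |path|
      name = File.basename(path)
      # Skip non-JSON files and layers.json.
      next unless name.include?('.json') && name != 'layers.json'

      # A file may hold one record or an array of records.
      [JSON.parse(File.read(path))].flatten.each do |record|
        version = record['gbl_mdVersion_s'] || record['geoblacklight_version']
        yield record, path if version == schema_version
      end
    end
  end
end

ids = Dir.mktmpdir do |root|
  File.write(File.join(root, 'a.json'),
             JSON.generate({ 'id' => 'a', 'geoblacklight_version' => '1.0' }))
  File.write(File.join(root, 'b.json'),
             JSON.generate([{ 'id' => 'b', 'gbl_mdVersion_s' => 'Aardvark' }]))
  File.write(File.join(root, 'layers.json'), JSON.generate({ 'id' => 'skipped' }))

  # No block given, so an Enumerator comes back and can be chained.
  DocFinder.each_matching(root, '1.0').map { |record, _path| record['id'] }
end
```

Returning an Enumerator when no block is given lets callers chain methods such as `map` or `first`, or hand the result to an indexer that expects an enumerable.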

#pull(repo) ⇒ Object

Update a repository via git. If the repository doesn’t exist locally, clone it first. Returns 1.



# File 'lib/geo_combine/harvester.rb', line 59

def pull(repo)
  repo_path = File.join(@ogm_path, repo)
  clone(repo) unless File.directory? repo_path

  Git.open(repo_path).pull
  puts "Updated #{repo}"
  1
end

#pull_all ⇒ Object

Update all repositories. Returns the count of repositories updated.



# File 'lib/geo_combine/harvester.rb', line 70

def pull_all
  repositories.map(&method(:pull)).reduce(:+)
end