Class: Uc3DmpId::Comparator

Inherits:
Object
Defined in:
lib/uc3-dmp-id/comparator.rb

Overview

Class that compares incoming data from an external source to the DMP. It determines whether the two are likely related and assigns a confidence rating.

Constant Summary

MSG_MISSING_DMPS =
'No DMPs were defined. Expected an Array of OpenSearch documents!'
STOP_WORDS =
%w[a an and if of or the then they].freeze

Constructor Details

#initialize(**args) ⇒ Comparator

Expects an Array of OpenSearch documents passed as :dmps in the :args. An optional :logger may also be passed.

Raises:

  • ComparatorError if :dmps is missing or empty

# File 'lib/uc3-dmp-id/comparator.rb', line 22

def initialize(**args)
  @logger = args[:logger]
  @details_hash = {}

  @dmps = args.fetch(:dmps, [])

  @logger&.debug(message: 'Comparator DMPs', details: @dmps)
  raise ComparatorError, MSG_MISSING_DMPS if @dmps.empty?
end
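
The constructor contract above can be sketched in isolation. To stay self-contained, this example defines a stand-in class, `ComparatorSketch`, and a stand-in `ComparatorError` (both hypothetical, and the `StandardError` superclass is an assumption). It mirrors only the constructor lines shown above and is not the library class itself.

```ruby
# Stand-in for the library's error class (assumed to subclass StandardError).
class ComparatorError < StandardError; end

# Hypothetical stand-in that mirrors only the constructor shown above.
class ComparatorSketch
  MSG_MISSING_DMPS = 'No DMPs were defined. Expected an Array of OpenSearch documents!'

  attr_reader :dmps, :logger

  def initialize(**args)
    @logger = args[:logger]        # optional; nil simply disables debug logging
    @dmps = args.fetch(:dmps, [])  # defaults to an empty Array when :dmps is omitted
    @logger&.debug(message: 'Comparator DMPs', details: @dmps)
    raise ComparatorError, MSG_MISSING_DMPS if @dmps.empty?
  end
end

# A non-empty :dmps Array is accepted.
comparator = ComparatorSketch.new(dmps: [{ 'dmp_id' => 'doi.org/11.22222/A1B2c3po' }])
puts comparator.dmps.length

# Omitting :dmps (or passing an empty Array) raises ComparatorError.
begin
  ComparatorSketch.new
rescue ComparatorError => e
  puts e.message
end
```

Because :dmps defaults to an empty Array, forgetting the argument and passing an empty Array fail in the same way.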

Instance Attribute Details

#dmps ⇒ Object

Returns the Array of OpenSearch documents supplied to the constructor.

See the bottom of lib/uc3-dmp-id/comparator.rb for a hard-coded crosswalk between Crossref funder ids and ROR ids. Some APIs do not fully support ROR for funder ids, so we need to be able to reference both.



# File 'lib/uc3-dmp-id/comparator.rb', line 19

def dmps
  @dmps
end

#logger ⇒ Object

Returns the optional logger supplied to the constructor.



# File 'lib/uc3-dmp-id/comparator.rb', line 19

def logger
  @logger
end

Instance Method Details

#compare(hash:) ⇒ Object

Compare the incoming hash with the DMP details that were gathered during initialization.

The incoming Hash should match the documents found in OpenSearch. For example:

{
  "people": ["john doe", "[email protected]"],
  "people_ids": ["https://orcid.org/0000-0000-0000-ZZZZ"],
  "affiliations": ["example college"],
  "affiliation_ids": ["https://ror.org/00000zzzz"],
  "funder_ids": ["https://doi.org/10.13039/00000000000"],
  "funders": ["example funder (example.gov)"],
  "funder_opportunity_ids": ["485yt8325ty"],
  "grant_ids": [],
  "funding_status": "planned",
  "dmp_id": "doi.org/11.22222/A1B2c3po",
  "title": "example data management plan",
  "visibility": "private",
  "featured": 0,
  "description": "the example project abstract",
  "project_start": "2022-01-03",
  "project_end": "2024-12-23",
  "created": "2023-08-07",
  "modified": "2023-08-07",
  "registered": "2023-08-07"
}
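
Note that #compare reads the incoming Hash with String keys (for example `hash['title']`) and returns early unless the argument is a Hash with a non-nil title. The sketch below isolates that check; `comparable?` is a hypothetical helper name used only for illustration, not part of the library.

```ruby
require 'json'

# Hypothetical helper mirroring the guard clause at the top of #compare.
def comparable?(hash)
  hash.is_a?(Hash) && !hash['title'].nil?
end

# JSON.parse yields String keys, which is the shape #compare looks up.
doc = JSON.parse('{"title": "example data management plan", "grant_ids": []}')

puts comparable?(doc)                   # true
puts comparable?({ title: 'example' })  # false: Symbol keys, so hash['title'] is nil
puts comparable?('not a hash')          # false
```

A Symbol-keyed Hash therefore fails the check even when it carries a title, so parsed JSON (or an otherwise String-keyed Hash) is the safer input.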




# File 'lib/uc3-dmp-id/comparator.rb', line 57

def compare(hash:)
  scoring = []
  return scoring unless hash.is_a?(Hash) && !hash['title'].nil?

  @dmps.each do |dmp|
    @logger&.debug(message: 'Incoming external work', details: hash)
    # Compare the grant ids. If we have a match return the response immediately since that is
    # a very positive match!
    response = { confidence: 'None', score: 0, notes: [] }
    response = _grants_match?(array: hash.fetch('grant_ids', []), dmp:, response:)
    scoring << response if response[:confidence] != 'None'
    next if response[:confidence] != 'None'

    # Compare the people involved, their affiliations and any funding opportunity numbers
    response = _opportunities_match?(array: hash.fetch('funder_opportunity_ids', []), dmp:, response:)
    response = _orcids_match?(array: hash.fetch('people_ids', []), dmp:, response:)
    response = _last_name_match?(hash:, dmp:, response:)
    response = _affiliation_match?(hash:, dmp:, response:)

    # Only process the following if we had some matching people, affiliations or opportunity nbrs
    response = _repository_match?(hash:, dmp:, response:) if response[:score].positive?
    response = _text_match?(type: 'title', text: hash['title'], dmp:, response:) if response[:score].positive?
    response = _text_match?(type: 'abstract', text: hash['description'], dmp:, response:) if response[:score].positive?
    # If the score is less than 3 then we have no confidence that it is a match
    # next if response[:score] <= 2

    # Set the confidence level based on the score
    response[:dmp_id] = "DMP##{dmp['dmp_id']}"
    response[:confidence] = if response[:score] > 10
                              'High'
                            else
                              (response[:score] > 5 ? 'Medium' : 'Low')
                            end
    @logger&.debug(message: "Found a match!", details: { dmp: dmp, analysis: response })
    scoring << response
  end

  # TODO: introduce a tie-breaker here (maybe the closest to the project_end date)
  scoring.compact.sort { |a, b| b[:score] <=> a[:score] }&.first
end
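
The score-to-confidence thresholds and the final selection step in the listing above can be sketched on their own. The helper names `confidence_for` and `best_match` are hypothetical and introduced only for illustration. `best_match` uses `max_by` in place of the listing's sort-then-first, which picks the same top score but may differ in which entry wins a tie. The mapping applies to the non-grant path only; how `_grants_match?` sets its confidence is not shown in this section.

```ruby
# Hypothetical helper mirroring the thresholds in #compare:
# above 10 is 'High', above 5 is 'Medium', anything else is 'Low'.
def confidence_for(score)
  return 'High' if score > 10

  score > 5 ? 'Medium' : 'Low'
end

# Hypothetical helper: pick the highest-scoring response, ignoring nil entries.
# Returns nil when there is nothing to choose from.
def best_match(scoring)
  scoring.compact.max_by { |response| response[:score] }
end

responses = [3, 12, 7].map do |score|
  { score: score, confidence: confidence_for(score), notes: [] }
end

top = best_match(responses)
puts "#{top[:confidence]} (#{top[:score]})"
```

Since the `next if response[:score] <= 2` line is commented out in the listing, every DMP that reaches this point is scored, and even a score of 0 is reported as 'Low'.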