Samvera::NestingIndexer

The Samvera::NestingIndexer gem generates the graph related attributes related to querying lineage of objects. It maps a PreservationDocument (e.g. a Fedora object) to an IndexDocument (e.g. a SOLR document) by mapping a PreservationDocument's direct parents to ancestral paths for the given PreservationDocument.

Background
Concepts
Examples
Adapters
Considerations
Public API for Semantic Versioning
Installation

Background

This gem began as a sandbox to work through the reindexing strategy for CurateND Collections. This gem separates the logic of walking a graph, building the attributes necessary for nested collection maintenance, and persistence concerns. In other words, testing this gem does not require spinning up expensive persistence layers (e.g. SOLR, Fedora, or even an RDBMS).

Notation

When B is a member of A, I am using the A ={ B notation. When C is a member of B and B is a member of A, I'll chain these together A ={ B ={ C.

Concepts

While indexing objects, we work with two types of documents:

PreservationDocument - a light-weight representation of a Fedora object
IndexDocument - a light-weight representation of a SOLR document object

The four attributes to consider for indexing the graph:

id - the unique identifier for a document
parent_ids - the ids for of the parents of the given document
pathnames - the paths to traverse from a root document to the given document
ancestors - the pathnames to each ancestral node of the given node (e.g. pathname to my parent, pathname to my grandparent)

See Samvera::NestingIndexer::Documents::IndexDocument for further discussion.

To reindex a single document, use the Samvera::NestingIndexer.reindex_relationships method.

To reindex all documents, use the Samvera::NestingIndexer.reindex_all! method. Warning: This is a very slow process.

From a node's pathname(s), you can query all of that node's descendants (both direct and indirect).

Examples

Given the following PreservationDocuments:

PID	Parents
A	-
B	-
C	A
D	A, B
E	C
F	D

When we reindex the above PreservationDocuments, then we will get the following IndexDocuments:

PID	Parents	Pathnames	Ancestors
A	-	[A]	[]
B	-	[B]	[]
C	A	[A/C]	[A]
D	A, B	[A/D, B/D]	[A, B]
E	C	[A/C/E]	[A, A/C]
F	D	[A/D/F, B/D/F]	[A, A/D, B, B/D]

For more scenarios, look at the Reindex PID and Descendants specs.

Given I want to find the direct descendants of A, then I can query all nodes with that have a parent of A.
Given I want to find the direct and indirect descendants of A, then I can query all notes that have an ancestor entry of A.

Adapters

An AbstractAdapter provides the method interface for others to build against.

The InMemory adapter is a reference implementation (and used to ease testing overhead).

CurateND has implemented the following adapter for its LibraryCollection indexing.

To define the adapter for your application:

# In an application initializer (e.g. config/samvera_indexer_config.rb)
Samvera::NestingIndexer.configure do |config|
  config.adapter = MyCustomAdapter
end

To best ensure you have implemented the adapter to spec:

# In the spec for MyCustomAdapter
require 'samvera/nesting_indexer/adapters/interface_behavior_spec'
RSpec.describe MyCustomAdapter
  it_behaves_like 'a Samvera::NestingIndexer::Adapter'
  # Your implementation details here
end

See CurateND for Notre Dame's adaptor configuration.

Sequence Diagram for Reindexing a Single Document

The following sequence diagram documents the interactions in Samvera::NestingIndexer::RelationshipReindexer.

Reindex Relationship Diagram

See the text-based version of Reindex Relationship diagram, leveraging the Mermaid syntax.

See documentation/README.md for details on updating the diagram.

Considerations

Given a single object A, when we reindex A, we:

Find the parent objects of A to calculate the ancestors and pathnames
Iterate through each descendant, in a breadth-first process, to reindex it (and each descendant's descendants).

This is a potentially time consumptive process and should not be run within the request cycle.

Cycle Detections

When dealing with nested graphs, there is a danger of creating an cycle (e.g. A ={ B ={ A). Samvera::NestingIndexer implements two guards to short-circuit the indexing of cyclic graphs:

Enforcing a maximum nesting depth of the graph
Checking that an object is not its own ancestor (Samvera::NestingIndexer::RelationshipReindexer#guard_against_possiblity_of_self_ancestry)

The ./spec/features/reindex_pid_and_descendants_spec.rb contains examples of behavior.

NOTE: These guards to prevent indexing cyclic graphs do not prevent the underlying preservation document from creating its own cyclic graph.

Detecting Possible Cycles Before Indexing

Given an up to date index and a document, then it is valid to nest the given document beneath any document that:

Is not the given document
Does not have one or more pathnames that includes the given document's ID

For examples of determining if we can nest a document within another document, see the demonstration of nesting.

In implementations, you'll likely want to write queries that answer:

Find the valid IDs that I can nest within?
Find the valid IDs in which I can nest within and am not already nested within?

Public API for Semantic Versioning

Any method with @api public documentation is part of the semantically versioned API. This means you can assume:

the method will exist with the same effective signature (perhaps with changes to defaults)
the method can be called by upstream consumers
deprecations will be announced prior to major version upgrades

Any module or class with @api public documentation is part of the semantically versioned API. This means you can assume:

the module or class can be accessed upstream
all public methods are assumed to be @api public unless otherwise stated
deprecations will be announced prior to major version upgrades

Installation

Add the following to your Gemfile:

gem 'samvera-nesting_indexer'

Bundle up your gems (e.g. bundle).

Then run rails generator install samvera:nesting_indexer:install.

TODO

[X] Incorporate additional logging
[ ] Build methods to allow for fanning out the reindexing. At present, when we reindex a node and its "children", we run that entire process within a single context. Likewise, we run a single process when reindexing EVERYTHING.
[ ] Promote from samvera-labs to samvera via the promotion process.
[ ] Write adapter method to assist in guarding against self-ancestry. We could probably expose a base adapter that has the method through use of the other adapter methods.
[ ] Add templates to generate the configuration