Samvera::NestingIndexer

Build Status Test Coverage Code Climate Documentation Status APACHE 2 License

The Samvera::NestingIndexer gem is responsible for indexing the graph relationship of objects. It maps a PreservationDocument to an IndexDocument by mapping a PreservationDocument's direct parents into the paths to get from a root document to the given PreservationDocument.

Background

This is a sandbox to work through the reindexing strategy as it relates to CurateND Collections. At this point the code is separate to allow for raid testing and prototyping (no sense spinning up SOLR and Fedora to walk an arbitrary graph).

Notation

When B is a member of A, I am using the A ={ B notation. When C is a member of B and B is a member of A, I'll chain these together A ={ B ={ C.

Concepts

As we are indexing objects, we have two types of documents:

  1. PreservationDocument - a light-weight representation of a Fedora object
  2. IndexDocument - a light-weight representation of a SOLR document object

We have four attributes to consider for indexing the graph:

  1. id - the unique identifier for a document
  2. parent_ids - the ids for all of the parents of a given document
  3. pathnames - the paths to traverse from a root document to the given document
  4. ancestors - the pathnames to each node that is an ancestor of the given node (e.g. pathname to my parent, pathname to my grandparent)

See Samvera::NestingIndexer::Documents::IndexDocument for further discussion.

To reindex a single document, we leverage the Samvera::NestingIndexer.reindex_relationships method.

To reindex all of the documents, we leverage the Samvera::NestingIndexer.reindex_all! method. Warning: This is a very slow process.

With a node's pathname(s), we are able to query what are all of my descendants (both direct and indirect) by way of the ancestors.

Examples

Given the following PreservationDocuments:

PID Parents
A -
B -
C A
D A, B
E C
F D

If we were to reindex the above PreservationDocuments, we will generate the following IndexDocuments:

PID Parents Pathnames Ancestors
A - [A] []
B - [B] []
C A [A/C] [A]
D A, B [A/D, B/D] [A, B]
E C [A/C/E] [A, A/C]
F D [A/D/F, B/D/F] [A, A/D, B, B/D]

For more scenarios, look at the Reindex PID and Descendants specs.

  • Given I want to find the direct descendants of A, then I can query all nodes with that have a parent of A.
  • Given I want to find the direct and indirect descendants of A, then I can query all notes that have an ancestor entry of A.

Adapters

An AbstractAdapter provides the method interface for others to build against.

The InMemory adapter is a reference implementation (and used to ease testing overhead).

CurateND has implemented the following adapter for its LibraryCollection indexing.

To define the adapter for your application:

# In an application initializer (e.g. config/samvera_indexer_config.rb)
Samvera::NestingIndexer.configure do |config|
  config.adapter = MyCustomAdapter
end

To best ensure you have implemented the adapter to spec:

# In the spec for MyCustomAdapter
require 'samvera/nesting_indexer/adapters/interface_behavior_spec'
RSpec.describe MyCustomAdapter
  it_behaves_like 'a Samvera::NestingIndexer::Adapter'
end

See CurateND for Notre Dame's adaptor configuration.

Sequence Diagram for Reindexing a Single Document

The following sequence diagram documents the interactions in Samvera::NestingIndexer::RelationshipReindexer.

Reindex Relationship Diagram

See the text-based version of Reindex Relationship diagram, leveraging the Mermaid syntax.

Considerations

Given a single object A, when we reindex A, we:

  • Find the parent objects of A to calculate the ancestors and pathnames
  • Iterate through each descendant, in a breadth-first process, to reindex it (and each descendant's descendants).

This is a potentially time consumptive process and should not be run within the request cycle.

Cycle Detections

When dealing with nested graphs, there is a danger of creating an cycle (e.g. A ={ B ={ A). Samvera::NestingIndexer implements two guards to short-circuit the indexing of cyclic graphs:

  • Enforcing a maximum nesting depth of the graph
  • Checking that an object is not its own ancestor (Samvera::NestingIndexer::RelationshipReindexer#guard_against_possiblity_of_self_ancestry)

The ./spec/features/reindex_pid_and_descendants_spec.rb contains examples of behavior.

NOTE: These guards to prevent indexing cyclic graphs do not prevent the underlying preservation document from creating its own cyclic graph.

Detecting Possible Cycles Before Indexing

Given an up to date index and a document, then it is valid to nest the given document beneath any document that:

  • Is not the given document
  • Does not have one or more pathnames that includes the given document's ID

For examples of determining if we can nest a document within another document, see the demonstration of nesting.

In implementations, you'll likely want to write a queries that answer:

  • What are the valid IDs that I can nest within?
  • What are the valid IDs in which I can nest within and am not already nested within?

TODO

  • [X] Incorporate additional logging
  • [ ] Build methods to allow for fanning out the reindexing. At present, when we reindex a node and its "children", we run that entire process within a single context. Likewise, we run a single process when reindexing EVERYTHING.
  • [ ] Promote from samvera-labs to samvera via the promotion process.
  • [ ] Write adapter method to assist in guarding against self-ancestry. We could probably expose a base adapter that has the method through use of the other adapter methods.