Robot code for accessioning and preservation of Web Archiving Service Seed and Crawl objects.
General Robot Documentation
Check the Wiki in the robot-master repo.
To run, use the lyber-core infrastructure, which uses `bundle exec controller boot` to start all robots defined in
The WAS robots depend on some java projects:
- to extract metadata from web archiving ARC and WARC files, used by wasCrawlPreassemblyWF.
- to index WARC materials for the Stanford Web Archiving Portal, used by cdx-generator step in wasCrawlDisseminationWF
These java projects use jenkinsqa to create deployment artifacts, which are then deployed with Capistrano via config/deploy.rb (see lines 40-54).
was_robot_suite houses these java artifacts in the
Various other dependencies can be teased out of config/environments/example.rb and shared_configs (was-robotsxxx branches).
See the Consul pages in the Web Archival portal, esp. the Web Archiving Development Documentation.
Preassembly workflow (wasCrawlPreassemblyWF) for web archiving crawl objects (which include (W)ARC files); it extracts content and creates metadata streams. It consists of 6 robots:
build-was-crawl-druid-tree: reads the crawl object content (ARCs or WARCs plus logs) from the directory defined by the crawl object label, builds the druid tree, and copies the content to the druid tree's content directory.
metadata_extractor: extracts the metadata from the (W)ARC files using a java jar file. The output is an XML file that includes metadata for the (W)ARCs and general information about the other files.
content_metadata_generator: generates the content metadata based on the XML created from
desc_metadata_generator: generates the descriptive metadata based on the XML created from
technical_metadata_generator: generates the technical metadata based on the XML created from
end_was_crawl_preassembly: initiates the accessionWF (of common-accessioning).
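As a sketch of the druid-tree layout that build-was-crawl-druid-tree creates: in the real robots this comes from the druid-tools gem, so the helper below is hypothetical, showing only the directory shape.

```ruby
# Hypothetical helper showing the druid-tree directory layout;
# the robots get this from the druid-tools gem rather than hand-rolling it.
def druid_tree_path(druid)
  id = druid.sub(/\Adruid:/, '')
  parts = id.match(/\A([a-z]{2})(\d{3})([a-z]{2})(\d{4})\z/) or
    raise ArgumentError, "not a valid druid: #{druid}"
  File.join(*parts.captures, id)
end

puts druid_tree_path('druid:ab123cd4567')
# => ab/123/cd/4567/ab123cd4567
```

The crawl content then lands in a content directory under that leaf of the tree.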
Dissemination workflow (wasCrawlDisseminationWF) for web archiving crawl objects. It is kicked off automatically by the last step of common-accessioning's end-accession, which reads the disseminationWF suitable for this object type from the APO. It consists of 3 robots:
cdx-generator: performs the basic indexing of the WARC/ARC files and generates CDX files (web archive index files used by the Wayback Machine). It generates one CDX file per WARC file; the generated CDX files will be copied to
cdx-merge-sort-publish: performs two main tasks: 1) merges the individual CDX files generated in the previous step into the main index file, and 2) sorts the newly generated index file.
path-indexer: creates an inverted index mapping each WARC file to its physical location on disk, for use by the Wayback Machine.
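The merge and sort that cdx-merge-sort-publish performs can be sketched with plain shell tools (file names and CDX lines here are illustrative; the real index lives wherever the WAS dissemination infrastructure keeps it):

```shell
# Per-WARC CDX entry from the cdx-generator step (fields abbreviated).
printf 'org,example)/page-a 20200101000000 ...\n' > part-0001.cdx
# Existing main index.
printf 'org,example)/page-b 20190101000000 ...\n' > index.cdx
# Merge the new per-WARC entries into the main index, keeping it sorted
# and deduplicated.
sort -u part-0001.cdx index.cdx > index.cdx.new && mv index.cdx.new index.cdx
cat index.cdx
```

CDX lines begin with a canonicalized URL key, so a plain lexical sort keeps the index binary-searchable for replay. The path-indexer output is typically a separate tab-delimited file of WARC filename and absolute path pairs, which is how the Wayback machinery locates each WARC on disk.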
Preassembly workflow for web archiving seed objects. It starts with the output of the registration process (via the was-registrar service): a source XML file containing the metadata for the seed object. The source XML file is expected to be in the appropriate format, which is then converted using XSLT.
It consists of 5 robots:
build-was-seed-druid-tree: reads the seed object source XML file from the /was_unaccessioned_data/seed directory and creates the druid tree under /dor/workspace. The content folder contains the source.xml that has been generated by was-registrar.
desc-metadata-generator: generates the descMetadata in MODS format for the seed object by processing the source.xml with a predefined XSLT chosen based on the metadata source. For example, if the source.xml has the element identifying AIT as the source, the robot will match it with descMetadata_AIT.xslt.
thumbnail-generator: captures a screenshot of the first memento using PhantomJS and includes it as the main image for the object. This image is used in Argo and SearchWorks. If the robot fails to generate a thumbnail, the failure shows as an error in Argo.
content-metadata-generator: generates contentMetadata.xml for the thumbnail by processing the contentMetadata XSLT template against the available thumbnail.jp2.
end-was-seed-preassembly: initiates the accessionWF (of common-accessioning) and opens/closes a version for previously accessioned objects.
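A sketch of the template-selection step in desc-metadata-generator; the element and attribute names below are hypothetical (only the AIT → descMetadata_AIT.xslt pairing comes from the description above):

```ruby
require 'rexml/document'

# Pick an XSLT template based on the metadata source recorded in source.xml.
# The <seed metadataSource="..."> shape is illustrative, not the real schema.
def xslt_for(source_xml)
  doc = REXML::Document.new(source_xml)
  source = doc.root.attributes['metadataSource']
  raise 'no metadata source found in source.xml' unless source
  "descMetadata_#{source}.xslt"
end

puts xslt_for('<seed metadataSource="AIT"/>')
# => descMetadata_AIT.xslt
```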
This workflow provides the connection between the SDR and the actual web archiving components. It consists of 1 robot:
update-thumbnail-generator: sends the seed object URI and DRUID to
Workflow to route web archiving objects to the wasSeedDisseminationWF or wasCrawlDisseminationWF based on content type. Note that the wasDisseminationWF itself is fired off by the accessionWF via the custom dissemination workflow declared in the APO's administrativeMetadata:

```xml
<administrativeMetadata>
  ...
  <dissemination>
    <workflow id="wasDisseminationWF"/>
  </dissemination>
</administrativeMetadata>
```
It consists of 1 robot:
start_special_dissemination: chooses the proper disseminationWF (seed or crawl) based on the WAS object type.
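The routing decision in start_special_dissemination reduces to a sketch like the following (the content-type strings and method name are assumptions, not the robot's actual API):

```ruby
# Hypothetical routing: seed objects go to the seed dissemination workflow,
# everything else (crawls) to the crawl dissemination workflow.
def dissemination_wf_for(content_type)
  content_type == 'webarchive-seed' ? 'wasSeedDisseminationWF' : 'wasCrawlDisseminationWF'
end

puts dissemination_wf_for('webarchive-seed')  # => wasSeedDisseminationWF
puts dissemination_wf_for('webarchive-crawl') # => wasCrawlDisseminationWF
```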