Borealis Preservation Plan

Introduction

Borealis, the Canadian Dataverse Repository, is a digital research data repository available to researchers at participating Canadian universities and research organizations, with infrastructure hosted by the University of Toronto Libraries (UTL) in Ontario, Canada. The Borealis Preservation Plan outlines the objectives, roles and responsibilities, strategies, and actions for preserving the digital files uploaded by users and stored in the repository. Borealis uses the open-source Dataverse software, developed and maintained by the Institute for Quantitative Social Science (IQSS) at Harvard University and community members from around the world. The Preservation Plan complements strategies, policies and procedures for curation and preservation that Participating Institutions may also have in place for datasets in institutional collections within Borealis.

Definitions

Archivematica: an open source, standards-based processing tool for creating well-formed packages for preservation storage. Archivematica performs signature-based file format identification, validation and characterization functions; can normalize copies of files to preservation and access formats; and creates preservation metadata files using the METS and PREMIS standards.

BagIt: a set of formatting conventions that guide creating checksums for, and verifying the fixity of, collections of files. Files contained in a BagIt-formatted directory (commonly called a “bag”) include a manifest of checksums that can be used to ensure that the contents of the directory have retained fixity after transfer or in storage.
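
As an illustration of how a bag's manifest supports fixity verification, the following is a minimal Python sketch. It assumes a bag whose manifest file is named manifest-sha256.txt and lists one checksum and one relative path per line; the function name verify_bag is hypothetical and is not part of Dataverse or Borealis. It does not handle percent-encoded paths or tag manifests, which a complete BagIt validator would.

```python
import hashlib
from pathlib import Path


def verify_bag(bag_dir, algorithm="sha256"):
    """Return the manifest paths that are missing or whose checksum differs.

    Hypothetical helper for illustration only. Each file is read fully into
    memory, which is acceptable for a sketch but not for very large files.
    """
    bag = Path(bag_dir)
    manifest = bag / f"manifest-{algorithm}.txt"
    failures = []
    for line in manifest.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Each manifest line is "<checksum> <relative path>".
        expected, relative_path = line.split(maxsplit=1)
        target = bag / relative_path
        if not target.is_file():
            failures.append(relative_path)
            continue
        actual = hashlib.new(algorithm, target.read_bytes()).hexdigest()
        if actual != expected.lower():
            failures.append(relative_path)
    return failures
```

An empty list returned by this sketch would indicate that every file listed in the manifest is present and unchanged.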

Bit-level preservation: one type of digital preservation strategy. This strategy is focused on ensuring that files retain fixity in storage and that files are stored in multiple locations to protect against accidental loss or corruption. Bit-level preservation does not guarantee any form of future usability/accessibility based on the contents or format of the files in question.

Checksum: a fixed-length numeric or alphanumeric string produced by running a checksum-generating algorithm against a file. The value is, for practical purposes, unique to the file's contents: when the contents of the file are altered in any way, the checksum value will change, indicating that the file no longer has fixity and therefore should be replaced from a good copy. Checksum algorithms include MD5, SHA-1 and SHA-256.
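
To make the definition concrete, the following is a minimal Python sketch of computing a checksum and comparing it against a previously recorded value. The function names file_checksum and has_fixity are hypothetical and chosen for illustration; the sketch relies only on the Python standard library.

```python
import hashlib


def file_checksum(path, algorithm="md5", chunk_size=1024 * 1024):
    """Return the hex digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # Chunked reads keep memory use constant for large files.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def has_fixity(path, expected, algorithm="md5"):
    """True if the file's current checksum matches the recorded value."""
    return file_checksum(path, algorithm) == expected.lower()
```

Changing even a single byte of the file would cause has_fixity to return False for the previously recorded value.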

Dataset: a container for a group of related files. For example, a dataset can include the original source data, code, and/or documentation related to a single study or publication. A dataset must also include metadata added by the user to describe the files, including a title, author(s), description and subject.

Dataverse: the open-source research data repository software application with which the Borealis repository is hosted and operated. Dataverse is developed by the Institute for Quantitative Social Science (IQSS) at Harvard University.

Digital preservation: “the series of managed activities necessary to ensure continued access to digital materials for as long as necessary” (DPC Glossary). Digital preservation activities can include active and ongoing monitoring of files and formats, regular fixity checks, and refreshing of storage media.

Fixity: the property of a digital file having remained unaltered since a known point in time. Fixity is established by computing a checksum and comparing it with a previously recorded value. Fixity information can help establish the integrity of files by providing evidence that their contents have remained unchanged over time.

Ontario Library Research Cloud (OLRC): a five-node community cloud storage network maintained by Scholars Portal, which Borealis uses as part of its operations. The OLRC uses the OpenStack Swift software to connect five storage nodes located at the University of Toronto, the University of Guelph, the University of Ottawa, York University, and Queen’s University. All data stored in the OLRC is replicated across three of the five nodes for reliability and integrity. If one of these copies becomes unreadable, a new copy is created by the system from the two remaining good copies. The OLRC service also includes access to the DuraCloud software for advanced preservation management of packages stored in the OLRC. Additional information about the security of the OLRC is contained in Technology Infrastructure and Security Information.

Permafrost: a hosted digital preservation service offered by Scholars Portal to members of the Ontario Council of University Libraries (OCUL). Permafrost pairs Archivematica with the OLRC to provide access to a technical infrastructure, support, and training to enable OCUL members to actively process digital objects for long-term preservation and access.

Objectives

The objectives of UTL’s preservation activities for the Borealis repository are as follows:

  1. Ensure a minimum level of fixity assurance for all files uploaded by registered users.
    • The priority for this strategy is to protect against the loss of data in the form of accidental deletion, corruption or modification of user-submitted content over time
    • Commonly referred to as “bit-level preservation,” this strategy does not guarantee any form of future usability/accessibility based on the intellectual contents or format of the files in question, but is focused on monitoring the integrity of the whole repository and remediating any errors that may arise in a uniform, scalable, and efficient manner
  2. Store files uploaded by users using a secure, reliable and scalable preservation storage strategy.
  3. Install and maintain all preservation features that are core to the Dataverse application, resulting in selected preservation metadata and format conversion for tabular data uploads.
    • As described under “Strategies: Level 1” below, the Dataverse application includes features that support the preservation of the intellectual contents of data files via file format identification and format conversion for tabular data files
  4. Support Participating Institutions who wish to export independent packages of selected dataset files and metadata from institutional collections in Borealis. See “Strategies: Level 2” below for more details.

Roles and Responsibilities

Users: responsible for uploading data files and metadata to the Borealis repository, as well as viewing and downloading data files and metadata accessible in the repository. Users must adhere to the repository’s Terms of Use as well as any policies and procedures governing their use of the service as set by Participating Institutions.

Participating Institutions: responsible for administering the use of Borealis at their institution. Institutions subscribe to Borealis via consortial agreements and are allocated storage space and administrative rights for staff to manage their institutional collection within the Borealis repository. Institutions are responsible for oversight of the data uploaded to their institutional collection by setting collections policies and deposit guidelines, administering users and user rights, and handling takedown and copyright decisions. Institutions may also validate data deposits for quality and completeness via curation activities, including determining preferred file formats for deposit or supporting depositors with advice on file format conversions. Institutions for which Borealis collection deposits comprise a part of their institutional collections may also design and implement additional preservation policies and procedures for their collection or for selected sub-collections or datasets within that collection.

University of Toronto Libraries: responsible for the technical maintenance and administration of the Borealis repository software and service. UTL ensures the Dataverse application is functional, secure, and updated. UTL also maintains the connected storage infrastructure for datasets, liaises with designated contacts at Participating Institutions, and makes available guides and training to Participating Institutions. UTL maintains no oversight over the quality, completeness, or format of files uploaded by users but will assist in identifying and remediating fixity issues in collaboration with Participating Institutions as they arise.

Strategies

Level 1

Description: The first level of preservation combines two broad sets of activities: bit-level preservation via regular independent fixity checking and safe storage in the OLRC, and maintenance of the preservation-supporting features that are part of the Dataverse application. As technical service provider, UTL is not directly responsible for validating the contents or quality of user-uploaded files. This level of preservation addresses Objectives 1, 2 and 3: that user-uploaded files are safe from loss and that minimum-level preservation functions are run as a necessary precursor to additional preservation strategies.

Scope: All data files deposited by registered users to Borealis. This includes files associated with draft and restricted datasets and different versions of files uploaded by users. It does not include files generated by the Dataverse application itself, such as derivatives, thumbnails and citation metadata files.

Term: UTL will maintain Level 1 preservation activities for as long as an institution is a subscriber to the Borealis service. As designated in the Access and Service agreement signed between the University of Toronto and Participating Institutions, UTL commits to maintaining data deposits for six months after termination of the agreement. UTL will also support any processes for dataset export as required by subscribers.

Activities:

  • Primary storage of all data files in the OLRC
  • Daily backup of all files to tape using IBM Tivoli Storage Manager (TSM)
    • For active files:
      • Seven versions of a file are available for restore for 30 days
      • If a file has not been modified for over 30 days, the most recent version of the file is retained permanently in backup
      • The six previous versions of a file are discarded after 30 days
    • For deleted files:
      • The most recent version of a deleted file is available for restore for 60 days
    • Two copies of the tape backup are retained onsite and one copy is retained offsite
  • Regular independent fixity validation checks
    • When users upload files to the Dataverse application, MD5 checksums are automatically generated and stored in the Dataverse database
    • The Dataverse Native API includes the “Physical Files Validation in a Dataset” call, which downloads a file from storage and validates its checksum against the value stored in the database
    • UTL runs this API call against all files with an assigned File ID every 30 days
    • The record of each fixity check (both positive and negative) is stored in an internal MySQL database
    • Any errors identified during this process will be triaged for correction, either by retrieving a copy of the affected files from the backup or by communicating with the Participating Institution and/or depositing User(s)
  • Maintenance of additional preservation-supporting functionality available as part of the Dataverse application:
    • File format identification using JHOVE
    • Transformation of tabular data formats into non-proprietary tabular text data files (.tab) upon ingest
    • Generation of UNFs (Universal Numeric Fingerprints) for tabular data files
      • UNFs are designed to validate the semantic content of tabular data regardless of format and are assigned at the dataset and file level
      • The Dataverse application provides a UNF when tabular data ingest has been successful, and as a result UNFs (and the derived .tab files) do not require subsequent checks unless this value is missing, in which case individual users are notified of failed ingest by the Dataverse application
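
The fixity validation activity described above can be sketched in Python as follows. This is an illustrative sketch only and does not reproduce UTL's actual process: it recomputes MD5 checksums for local file paths rather than calling the Dataverse API, it uses an in-process SQLite database as a stand-in for the internal MySQL database, and the names run_fixity_audit and fixity_log are hypothetical.

```python
import hashlib
import sqlite3
from datetime import datetime, timezone


def md5_of(path):
    """Return the MD5 hex digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def run_fixity_audit(expected, log):
    """Check every file and record both passes and failures.

    expected: dict mapping file ID -> (path, stored MD5 value)
    log: sqlite3 connection (assumed stand-in for the real log database)
    Returns the list of file IDs that failed the check.
    """
    log.execute(
        "CREATE TABLE IF NOT EXISTS fixity_log "
        "(file_id INTEGER, checked_at TEXT, passed INTEGER)"
    )
    failed = []
    for file_id, (path, stored_md5) in expected.items():
        try:
            passed = md5_of(path) == stored_md5.lower()
        except OSError:
            passed = False  # a missing or unreadable file counts as a failure
        checked_at = datetime.now(timezone.utc).isoformat()
        log.execute(
            "INSERT INTO fixity_log VALUES (?, ?, ?)",
            (file_id, checked_at, int(passed)),
        )
        if not passed:
            failed.append(file_id)
    log.commit()
    return failed
```

In this sketch, the returned list of file IDs corresponds to the errors that would then be triaged for correction from backup or in consultation with the Participating Institution.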

Level 2

Description: This level of preservation is intended for Participating Institutions who require advanced preservation processing and/or the export of independent preservation packages for inclusion in institutional collections and storage in additional preservation environments. Advanced preservation functions may be conducted when Borealis is paired with the Archivematica workflow tool for preservation processing. Archivematica can create independent preservation packages of datasets in any Dataverse repository, and its workflow includes additional functions such as signature-based file format identification, file format validation, characterization and normalization. Independent packages created by Archivematica would then be sent to a preservation storage location of choice. Alternatively, Institutions may opt to create and accept exports of packages from any Dataverse application in BagIt format. Additional information on these features and functionality is described below.

Scope: Participating Institutions are responsible for determining which datasets are eligible for additional processing and export. Administrators, curators or other designates at Participating Institutions may select the complete contents of their institutional collections or a subset as guided by internal appraisal and selection criteria.

Activities:

  • UTL will assist Participating Institutions in the setup and maintenance of connections to Archivematica instances
    • If Participating Institutions are using Permafrost, the functional connection between Borealis and Archivematica will be set up as part of Permafrost technical support activities
    • Datasets processed via Permafrost would be subject to the functionality and limitations of this service
    • If Participating Institutions are using another hosted Archivematica service, or a locally-hosted version of Archivematica, UTL will provide advice and consultation on setup to connect Borealis and Archivematica
  • UTL will assist Participating Institutions wishing to export BagIt-formatted packages from Borealis
    • BagIt exports contain user-uploaded files (except in the case of tabular data uploads, where only the converted .tab version is retained in the bag), and metadata in the form of a JSON-LD serialized OAI-ORE map file and DataCite XML file
    • BagIt packages produced by the Dataverse application are conformant with the Research Data Alliance BagIt profile
    • Exports are conducted at the dataset level using an API call and would be conducted by or at the direction of Administrators, curators or other designates at Participating Institutions
    • Bags may be deposited to a file system location or to space in DuraCloud as required

Acknowledgements

Thank you to the former Dataverse North Policy Working Group for creating an initial policy framework for Borealis which informed the structure and approach for this document. The Alliance RDM Preservation Expert Group’s report Preservation for Dataverse in Canada: Recommendations provided key requirements for the preservation strategies outlined above. Additional sources of inspiration were the Texas Digital Library Digital Preservation Policy and the Harvard Dataverse Preservation Policy.

Published June 23, 2022