Borealis Technology Infrastructure and Security Information

Overview

Borealis, the Canadian Dataverse Repository, is a digital research data repository available to researchers at participating Canadian universities and research organizations with infrastructure hosted by the University of Toronto Libraries (UTL) in Ontario, Canada. UTL commits to maintaining an information technology (IT) environment that appropriately protects the availability, privacy, confidentiality, and integrity of all content and personal information. The Borealis Technology Infrastructure and Security Information document outlines general information, technical infrastructure, application security, and storage and backup details.

General Information

User Accounts and Access

Any member of the public is able to search, view, and download unrestricted data without a user account.

Users who wish to view and download restricted data must have an account.

Users who wish to create collections and/or datasets and upload data and metadata must have a user account affiliated with a participating institution.

Users without an institutional affiliation can create a standard account on the service. Users with an institutional affiliation can create a standard account, or create an institutional account if their institution has registered for the Research & Scholarship Entity Profile through the Canadian Access Federation (CAF), an identity management service for Canadian research institutions run by CANARIE. Institutional accounts use the Shibboleth single sign-on login architecture.

Agreements with Participating Institutions

UTL has agreements with regional university library consortia to provide Borealis as a service to over 50 participating Canadian post-secondary institutions.

Participating institutions are:

  • Provided with an initial amount of storage space (between 1 TB and 10 TB, depending on the agreement with the institution), in the form of an institutional collection, with the option to increase that space as needed;
  • Provided with administrative access to their institutional collection which allows designated support personnel the ability to view and manage all collections and datasets within their institutional collection;
  • Able to create, maintain, and enforce policies and procedures related to their institutional collection and the data within it;
  • Provided with technical support by the Borealis team.

UTL will continue to provide all services outlined in each participating institution’s service level agreement up to and including six (6) months after the termination of the agreement.

User Analytics

The open-source web analytics platform Matomo has been installed to track and analyze traffic. An IP-restricted, detailed dashboard is available to Borealis staff and provides real-time and longitudinal information about traffic to and across the service provider’s websites. Matomo analytics data are stored on local servers and are not shared with any third party.

In addition to the dashboard, Matomo generates a monthly analytics report, which includes detailed summaries on visits to the main webpage, plus visits to the Metrics page, the Data Curation tool, and the Data Explorer tool.

In addition to user analytics provided by Matomo, a metrics report can be viewed by any user. The metrics report includes the number of downloads per month, the number of datasets within the top 15 collections, the size of the top 15 collections, the distribution of file types uploaded, and the distribution of the subject categories used to describe datasets. These metrics can be viewed for the entire service or by institutional collection, and can be downloaded into a spreadsheet for further analysis.

For more information on how personal information is collected, stored, and deleted, please refer to the Privacy Statement.

Metadata Harvesting

All metadata associated with published collections and datasets are harvestable by other digital repositories and global search engines, as per the Terms of Use. The Dataverse platform supports open APIs and the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH), which allows published collections and datasets to be harvested by other digital repositories and systems for the purpose of global data discovery.
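
As a rough illustration of how a harvester might use OAI-PMH, the following Python sketch builds a ListRecords request URL and pulls Dublin Core titles out of a response. The base URL (https://dataverse.example.org/oai), the sample XML, and the function names are assumptions made for this example; they are not taken from the Borealis configuration. The verb, the metadataPrefix parameter, and the XML namespaces come from the OAI-PMH 2.0 specification.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespaces defined by OAI-PMH 2.0 and Dublin Core.
NAMESPACES = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def build_list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Return an OAI-PMH ListRecords request URL for the given endpoint."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


def extract_titles(xml_text):
    """Return every dc:title found in an OAI-PMH response document."""
    root = ET.fromstring(xml_text)
    return [element.text for element in root.iterfind(".//dc:title", NAMESPACES)]


# Hypothetical response, trimmed to a single record for illustration.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>doi:10.5072/FK2/EXAMPLE</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example dataset</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

if __name__ == "__main__":
    print(build_list_records_url("https://dataverse.example.org/oai"))
    print(extract_titles(SAMPLE_RESPONSE))
```

A real harvester would also need to fetch the URL over HTTPS and follow resumption tokens for large result sets, which this sketch leaves out.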

Borealis metadata is regularly crawled and published by the Federated Research Data Repository (FRDR), Google Dataset Search, DataCite, Web of Science Data Citation Index, Mendeley Data, and the Harvard Dataverse.

Borealis also works with the Shared Access Research Ecosystem (SHARE) to integrate public datasets into open web discovery services. The SHARE notification system is a higher education-based initiative that strengthens the effort to identify, discover, and track research output.

Technical Infrastructure

The Dataverse software for Borealis is hosted by UTL using a suite of technologies including:

  • Linux CentOS and Ubuntu operating systems
  • Payara application server
  • PostgreSQL for the application database
  • Java as the runtime environment and application code language
  • Solr for indexing
  • SMTP server for sending emails for password resets and other notifications
  • Persistent identifier service: DOI and handle support
  • Shibboleth for user authentication, connected to the Canadian Access Federation (CAF)
  • OpenStack Swift object storage with Amazon Web Services S3 emulation for all Dataverse data storage (OLRC)
  • Chef, Jenkins, and Ansible for installation and deployment
  • KVM/QEMU virtual machines running on Dell R730xd servers with SSD drives
  • HAProxy for load balancing and Apache web server
  • Nagios, ELK stack, Grafana, Munin and PHP Server Monitor to monitor services and server health and analyze logs
  • Integrated tools:
    • R to ingest .RData files as tabular data and export tabular data as .RData files
    • ImageMagick to generate thumbnail previews
    • JHOVE for file format identification
    • Dropbox integration for uploading files via the Dropbox API
    • jq for parsing JSON output used by the installation script
  • External tools for data exploration, analysis and curation by end-users:
    • Metrics tool for collection and dataset metrics
    • File Previewers to view certain file types directly in the web browser
    • Data Explorer to list the variables in a tabular file and to search, chart, and conduct cross-tabulation analysis
    • Data Curation Tool to view summary statistics for variables and to create and edit variable-level metadata in tabular files.

About Dataverse Software Development and Oversight

The Dataverse software is supported and developed by the Institute for Quantitative Social Science (IQSS) at Harvard University. A dedicated team supports the continuous development of the application, alongside support from community developers, experts in data curation and data preservation, and users.

New versions of the Dataverse software are released on an ongoing basis. The software’s development is guided by a strategic roadmap that incorporates feedback from community members.

The current version of the Dataverse software for Borealis can be found in the bottom right-hand corner of every page on the platform. Specific information about the current version of Dataverse can be found on GitHub.

Dataverse Software Standards

Borealis employs a variety of widely used community standards for metadata export:

  • Dublin Core
  • DDI (Data Documentation Initiative Codebook 2.5)
  • DDI HTML Codebook (A more human-readable, HTML version of the DDI Codebook 2.5 metadata export, added in Dataverse software version 4.16)
  • DataCite 4
  • OAI-ORE (added in Dataverse software version 4.11)
  • OpenAIRE (added in Dataverse software version 4.14)
  • Schema.org JSON-LD (added in Dataverse software version 4.8.4)

Additional standards for application functionality and data access/deposit employed:

  • OAI-PMH for harvesting to improve global data discovery
  • SWORD API for data deposit from other applications
  • Support for W3C Provenance JSON files (added in Dataverse software version 4.9)
  • A robust and well-documented suite of additional APIs for interacting with and managing the application
  • Ability to export RDA-compliant OAI-ORE Bags (added in Dataverse software version 4.11)

Application Security

All Dataverse installations are guided by the instructions in the Securing your installation and Network ports sections of the Dataverse installation guide, among others dealing with the security of the application. These pages include documentation on securing Solr and API endpoints, forcing HTTPS, and using proxies, all to ensure the application is adequately secured from external threats. Borealis staff promptly act on Dataverse security advisory notices sent by IQSS.

The Borealis installation of the Dataverse software is located on servers at the University of Toronto. Data centres at the University of Toronto follow both the Policy on Information Technology and the Policy on Information Security and the Protection of Digital Assets. All digital assets at the University of Toronto are required to follow the Information Security Standard, which provides a set of baseline controls and minimum standards for information security at the University. These standards are endorsed by the University’s Information Security Council and are aligned with the National Institute of Standards and Technology (NIST) 800-171 for the protection of data. These standards also include an Incident Security Response Plan.

The Information Security and Enterprise Architecture department at the University of Toronto, as per the policy for digital assets, has also developed a procedure for reporting an information security incident or event and a set of guidelines for the U of T community to mitigate risks associated with information security. These guidelines include recommendations and requirements for the protection of data centres at the University of Toronto.

User Authentication

Borealis has both remote and local authentication methods enabled.

  • Remote authentication uses managed authentication protocols via Shibboleth single sign-on architecture through the Canadian Access Federation (CAF), an identity management service for Canadian research institutions run by CANARIE. Remote authentication is enabled for users if their institution has registered for the Research & Scholarship Entity Profile.
  • For local authentication accounts, passwords are stored as salted hashes rather than in plain text. Local accounts are also subject to strong password requirements (added in Dataverse software version 4.8).
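
To illustrate the general idea of storing passwords as salted hashes, here is a minimal Python sketch using PBKDF2 from the standard library. This is not the Dataverse implementation: the choice of PBKDF2-HMAC-SHA256, the iteration count, the salt length, and the function names are all assumptions made for the example.

```python
import hashlib
import hmac
import os

# Assumed parameters for illustration only; not taken from Dataverse.
ITERATIONS = 200_000
SALT_BYTES = 16


def hash_password(password, salt=None):
    """Return (salt, digest) for a password, generating a random salt if needed."""
    if salt is None:
        salt = os.urandom(SALT_BYTES)
    digest = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, ITERATIONS
    )
    return salt, digest


def verify_password(password, salt, expected_digest):
    """Recompute the digest with the stored salt and compare in constant time."""
    _, digest = hash_password(password, salt)
    return hmac.compare_digest(digest, expected_digest)


if __name__ == "__main__":
    stored_salt, stored_digest = hash_password("correct horse battery staple")
    print(verify_password("correct horse battery staple", stored_salt, stored_digest))
    print(verify_password("wrong guess", stored_salt, stored_digest))
```

Because each account gets its own random salt, two users with the same password end up with different stored digests, which makes precomputed lookup tables far less useful to an attacker.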

Reporting Security Issues

To communicate security-related issues regarding Borealis, notify the Service Provider.

Storage and Backup

About the Ontario Library Research Cloud

The Ontario Library Research Cloud (OLRC) is a collaboration between Ontario’s university libraries, through the Ontario Council of University Libraries (OCUL), to build a high-capacity, geographically-distributed storage and computing network using proven and scalable open-source cloud technologies.

More information about OLRC hardware and software can be found in the OLRC User Guide. At any time, all data contained within Borealis is stored on at least three (3) of the five (5) OLRC nodes to ensure continuous access and efficient recovery of data in the event of technical issues, natural disasters, or other damaging events.
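
The following Python sketch works through the arithmetic behind keeping three copies across five nodes: it checks every possible placement against every possible combination of node outages. It says nothing about how the OLRC software actually chooses where to place copies; the function names are hypothetical, and the node labels are taken from the OLRC Data Centres section of this document.

```python
from itertools import combinations

# The five host institutions named in the OLRC Data Centres section.
NODES = ["York", "Guelph", "Queen's", "Ottawa", "Toronto"]


def survives(replica_nodes, failed_nodes):
    """True if at least one copy sits on a node that has not failed."""
    return bool(set(replica_nodes) - set(failed_nodes))


def max_tolerated_failures(nodes, replicas):
    """Largest number of simultaneous node outages that no placement can lose data to."""
    tolerated = 0
    for failures in range(1, len(nodes) + 1):
        all_safe = all(
            survives(placement, outage)
            for placement in combinations(nodes, replicas)
            for outage in combinations(nodes, failures)
        )
        if not all_safe:
            break
        tolerated = failures
    return tolerated


if __name__ == "__main__":
    print(max_tolerated_failures(NODES, 3))
```

Under these assumptions, data held on three distinct nodes remains available when any two of the five nodes are offline at the same time.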

Access to the OLRC

Data stored in the OLRC can only be accessed via specific, designated IP addresses. Only systems administration staff have direct access to the Borealis data stored in the OLRC.

The OLRC uses ORION (the Ontario higher-education research network) and GTANet (the research, education, health and public sector community network in the greater Toronto area) to connect the five (5) storage nodes via a virtual private network. Access to the OLRC is controlled by proxy servers located at the University of Toronto, via ORION. All proxy server connections use SSL, are authenticated, and are restricted to authorized IP addresses.
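
Restricting access to authorized IP addresses can be pictured as an allowlist check like the Python sketch below. The networks listed are reserved documentation ranges used purely as placeholders, and the function name is hypothetical; the actual OLRC proxy configuration is not described in this document.

```python
import ipaddress

# Placeholder allowlist built from reserved documentation ranges (RFC 5737).
# These are assumptions for the example, not real OLRC addresses.
ALLOWED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.17/32"),
]


def is_allowed(client_ip):
    """Return True if the client address falls inside any allowed network."""
    address = ipaddress.ip_address(client_ip)
    return any(address in network for network in ALLOWED_NETWORKS)


if __name__ == "__main__":
    for candidate in ("192.0.2.44", "198.51.100.17", "203.0.113.5"):
        print(candidate, is_allowed(candidate))
```

In practice this kind of rule is usually enforced in the proxy or firewall configuration rather than in application code, alongside the encryption and authentication described above.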

Data Ownership

Ownership of data in the OLRC follows the same Terms of Use as Borealis. In other words, while data published via Borealis is stored on the OLRC, ownership remains with the user(s) who posted it as per the licensing terms they provided.

OLRC Data Centres

The OLRC has five (5) data storage nodes at York University, the University of Guelph, Queen’s University, the University of Ottawa, and the University of Toronto. Each of these universities has set up an OLRC node within one of its existing institutional data centres. Security associated with each data centre is based on best practices and IT policies created and enforced by the host institution. Each data centre also has redundant power and cooling systems to prevent data loss or damage due to power supply issues.

Data Centre Security

At a minimum, each data centre is accessible only via a secure keycard and only to qualified and approved institutional IT support personnel. Each data centre has also implemented standard security protocols such as firewalls to limit inbound and outbound traffic to specific ports and to/from specific domains. All data stored on the OLRC, including all data from Borealis, is encrypted at rest.

The data stored on the OLRC is contained within a private VLAN that connects the configured nodes (i.e., the current five (5) institutions). The VLAN is operated by ORION and only they can add/remove access to the private network on the direction of system administrators.

Maintenance and Security Updates

System administrators keep all OLRC software and operating systems up to date and regularly refresh hardware. They receive regular alerts regarding security threats, and critical security patches are applied as soon as possible. All software updates are tested in a development environment before being deployed in production.

Data Backups

All data stored in Borealis is synced to a Network File System (NFS) disk nightly. From that location, all data is sent to the Tivoli Storage Manager (TSM) tape storage at the University of Toronto Data Centre. The TSM backup policy stipulates:

  • Up to 9 versions of each file are stored for up to 30 days
  • If a file is deleted, the latest version is stored for up to 60 days
  • Two copies of the tape backup are retained onsite and one copy is retained offsite
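
One way to read the version and retention rules above is sketched in Python below. This is an interpretation of the bullets only, not a description of how TSM evaluates its policies: the function name is hypothetical, the 60-day window is assumed to start on the deletion date, and real TSM behaviour may differ (for example, in how the current version of an existing file is treated).

```python
from datetime import date, timedelta

MAX_VERSIONS = 9                          # "up to 9 versions of each file"
VERSION_RETENTION = timedelta(days=30)    # "stored for up to 30 days"
DELETED_RETENTION = timedelta(days=60)    # "latest version ... up to 60 days"


def versions_to_keep(backup_dates, today, deleted_on=None):
    """Return the backup dates still retained under this reading of the policy."""
    newest_first = sorted(backup_dates, reverse=True)
    if deleted_on is not None:
        # Deleted file: keep only the latest version, for up to 60 days
        # after the deletion date (an assumption about when the clock starts).
        if newest_first and today - deleted_on <= DELETED_RETENTION:
            return newest_first[:1]
        return []
    # Existing file: keep versions no older than 30 days, capped at 9.
    recent = [d for d in newest_first if today - d <= VERSION_RETENTION]
    return recent[:MAX_VERSIONS]


if __name__ == "__main__":
    today = date(2022, 6, 23)
    nightly = [today - timedelta(days=n) for n in range(12)]
    print(len(versions_to_keep(nightly, today)))
    print(versions_to_keep(nightly, today, deleted_on=today - timedelta(days=10)))
```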

For information about data preservation strategies and activities, please see the Borealis Preservation Plan.

Published June 23, 2022