About WCSA+DC

Full proposal available for download at this link

Project Leadership

Principal Investigator: J. Stephen Downie, Co-Director of HTRC & Professor and Associate Dean for Research, School of Information Sciences, University of Illinois at Urbana-Champaign (Illinois).

Co-Principal Investigator: Beth Plale, Co-Director, HTRC, Director, Data to Insight Center & Professor of Informatics and Computing, Indiana University.

Co-Principal Investigator: Timothy Cole, Mathematics Librarian & Professor of Library and Information Science, Center for Informatics Research in Science and Scholarship (CIRSS), at the School of Information Sciences, Illinois.

Key Research Partners

James Pustejovsky, TJX Feldberg Professor of Computer Science, Brandeis University

Kevin Page, Senior Researcher, Oxford e-Research Centre, University of Oxford

Ted Underwood, Professor of English and Information Science, Illinois

Annika Hinze, Associate Professor, Computer Science, University of Waikato

Executive Summary

The HathiTrust Digital Library comprises the digitized representations of 13.68 million volumes, 6.84 million book titles, 359,528 serial titles, and 4.79 billion pages. Approximately 39% of the items in the HathiTrust corpus are digital representations of print volumes in the public domain. The remaining 61% are works under copyright. Because of copyright restrictions, scholars have come to see this 61% of the HathiTrust collection as sitting behind a “copyright wall” that makes meaningful access to its content next to impossible.

The HathiTrust Research Center (HTRC) is the research arm of the HathiTrust. The HTRC is a collaboration between the University of Illinois and Indiana University. HTRC has been developing models and tools to help scholars conduct interesting new analyses of works found in the HathiTrust corpus. To maximize accessibility to the entire corpus (regardless of copyright status), the HTRC has been prototyping tools to facilitate large-scale analyses under a “non-consumptive research” paradigm. Under this paradigm, analytic algorithms can be applied to that 61% of the HathiTrust collection that has been blocked off by the copyright wall. Once the analyses are run, only results are returned to researchers. Thus, restricted material is never directly “consumed” by scholars.

The project being proposed here builds upon, extends, and integrates two developmental research threads that HTRC has been working on for the past several years aimed at making non-consumptive research using the HT corpus a reality. The first thread originates from work that was conducted in the Workset Collections for Scholarly Analysis (WCSA): Prototyping Project, funded by the Andrew W. Mellon Foundation (1 July 2013 – 20 September 2015). The second thread continues the work of the Data Capsules (DC) project, previously supported by the Alfred P. Sloan Foundation (2011-2014).

Informally, worksets can be understood to consist of two parts: 1) references to the actual data used in a given computational analysis, which could be a whole volume, a given page, an image, or any other type of possible input; and 2) metadata elements that describe the workset itself. This metadata helps manage worksets throughout the research cycle, from their conception, through the various stages of analysis, to their archiving, citation, and eventual retrieval and reuse by later scholars.

HTRC Data Capsules provide the scholar with a virtual machine that has two modes: a maintenance mode, during which a user can freely access the network and install software but cannot access copyrighted data; and a secure mode, in which copyrighted texts become accessible while network and file-system access are highly constrained. (The Data Capsule intentionally drops network access for the virtual machine once the environment is configured, to prevent data leakage during analysis. The running analysis software cannot open network channels and can only access limited, predefined areas of the storage system, preventing data copying and the loading of malicious code.)
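The two-part workset model and the capsule's mode switch described above can be sketched in a few lines of Python. This is purely illustrative: the class names, fields, and the made-up HathiTrust-style volume identifiers are our own stand-ins, not HTRC's actual data model or API.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Workset:
    """A workset's two parts: data references plus descriptive metadata."""
    references: List[str] = field(default_factory=list)   # e.g. volume/page IDs
    metadata: Dict[str, str] = field(default_factory=dict)  # creator, title, ...


class DataCapsule:
    """Toy model of the capsule's maintenance and secure modes."""

    def __init__(self) -> None:
        self.mode = "maintenance"          # network on, copyrighted data hidden
        self.network_enabled = True
        self.copyrighted_data_visible = False

    def enter_secure_mode(self) -> None:
        # The network is dropped and the file system constrained *before*
        # copyrighted texts become readable, so only analysis results,
        # never raw restricted texts, can leave the capsule.
        self.network_enabled = False
        self.copyrighted_data_visible = True
        self.mode = "secure"


ws = Workset(
    references=["mdp.39015012345678", "mdp.39015087654321"],  # made-up IDs
    metadata={"creator": "A. Scholar", "title": "Sample workset"},
)
capsule = DataCapsule()
capsule.enter_secure_mode()
print(capsule.mode, capsule.network_enabled)  # secure False
```

The point of the sketch is the ordering constraint: copyrighted material is only exposed after outbound channels are closed, which is what makes the analysis "non-consumptive."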

The primary objective of the WCSA+DC project is the seamless integration of the workset model and tools with the Data Capsule framework to provide non-consumptive research access to HathiTrust’s massive corpus of data objects, securely and at scale, regardless of copyright status. That is, we plan to surmount the copyright wall on behalf of scholars and their students.

Notwithstanding the substantial preliminary work done on both the WCSA and DC fronts, both are still best characterized as being in the prototyping stage. It is our intention that this proposed Phase I of the project devote an intense two-year burst of effort to moving the suite of WCSA and DC prototypes from proof-of-concept to a firmly integrated, at-scale deployment. We plan to concentrate the requested resources on making our systems as secure and robust at scale as possible.

Phase I will engage four external research partners. Two of the external partners, Kevin Page (Oxford) and Annika Hinze (Waikato), were recipients of WCSA prototyping sub-awards. We are very glad to propose extending and refining aspects of their prototyping work in the context of WCSA+DC. Two other scholars, Ted Underwood (Illinois) and James Pustejovsky (Brandeis), will play critical roles in Phase I as active participants in the development and refinement of the tools and systems from their particular user-scholar perspectives: Underwood, Digital Humanities (DH); Pustejovsky, Computational Linguistics (CL).

Goals

The four key outcomes and benefits of the WCSA+DC, Phase I project are:

  1. The deployment of a new Workset Builder tool that enhances search and discovery across the entire HTDL by complementing traditional volume-level bibliographic metadata with new metadata derived from a variety of sources at various levels of granularity.
  2. The creation of Linked Open Data resources to help scholars find, select, integrate and disseminate a wider range of data as part of their scholarly analysis life-cycle.
  3. A new Data Capsule framework that integrates worksets, runs at scale, and does both in a secure, non-consumptive, manner.
  4. A set of exemplar pre-built Data Capsules that incorporate tools commonly used by both the DH and CL communities that scholars can then customize to their specific needs.
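To give a concrete flavor of the Linked Open Data resources envisioned in goal 2, a workset might be published as a JSON-LD document so that other tools can find and integrate it. The sketch below is hypothetical: the `@context` terms and `example.org` identifiers are placeholders, not HTRC's published vocabulary (only the Dublin Core Terms namespace is a real, widely used vocabulary).

```python
import json

# Hypothetical JSON-LD description of a workset. All example.org URIs
# and the "includes" term are invented placeholders; dc: points at the
# real Dublin Core Terms namespace.
workset_ld = {
    "@context": {
        "dc": "http://purl.org/dc/terms/",
        "includes": "http://example.org/wcsa#includes",
    },
    "@id": "http://example.org/worksets/42",
    "@type": "http://example.org/wcsa#Workset",
    "dc:creator": "A. Scholar",
    "dc:title": "Sample workset",
    "includes": [
        {"@id": "http://example.org/volumes/vol1"},
        {"@id": "http://example.org/volumes/vol2"},
    ],
}

print(json.dumps(workset_ld, indent=2))
```

Publishing worksets this way would let them be cited, retrieved, and recombined by later scholars, which is exactly the life-cycle support goal 2 describes.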