Outcomes

The WCSA+DC project officially ended on August 31, 2018. Outcomes from the project are described below.

Publications & Presentations

The full list of publications and presentations from this project is available on the Publications & Presentations page.

Code Repositories

Links to code repositories for software resulting from WCSA+DC can be found on the Code Repositories page.

Summary

Work on the “Worksets for Scholarly Analysis + Data Capsules: Phase 1” (WCSA+DC) project was centered around four main goals:

  1. The deployment of a new Workset Builder tool that enhances search and discovery across the entire HTDL by complementing traditional volume-level bibliographic metadata with new metadata derived from a variety of sources at various levels of granularity.
  2. The creation of Linked Open Data resources to help scholars find, select, integrate and disseminate a wider range of data as part of their scholarly analysis life-cycle.
  3. A new Data Capsule framework that integrates worksets, runs at scale, and does both in a secure, non-consumptive, manner.
  4. A set of exemplar pre-built Data Capsules that incorporate tools commonly used by both the DH and CL communities that scholars can then customize to their specific needs.

Deployment of New Workset Builder

HTRC’s security policy evolved during the WCSA+DC project years, necessitating a modified plan for a new Workset Builder (WSB) 2.0 to replace the deprecated 1.0 version. Sensitive to the risks of storing full-text data in multiple locations, HTRC decided to move forward with a WSB 2.0 built on unrestricted data. Additionally, a new software review and implementation process, the HTRC Analytics Enhance Proposal (HAEP) process, was created and put into practice. As a result of both factors, the functions of WSB 2.0 were modularized into individual service improvements and deployed individually, with continued improvements to the WSB 2.0 prototype planned after grant-end.

Three developments from WCSA+DC form the suite of tools that make up WSB 2.0: 1) a search and retrieval interface built on the Extracted Features (EF) Version 1.5 Dataset ingested into Solr; 2) a new workset import functionality from HathiTrust’s Collection Builder; and 3) a Virtuoso RDF triple store that currently holds bibliographic records as triples for the entire HT corpus, as well as triples representing worksets from the publicly accessible HTRC Analytics website (https://analytics.hathitrust.org). Each of these tools is described in more detail below.

The WSB 2.0 search and retrieval interface has been successfully deployed to the HTRC development environment and is currently undergoing beta testing and security review. The WSB 2.0 search tool is a Solr 7.4 installation that uses the unigram bag-of-words data available in the EF 1.5 Dataset. It provides new search access to both volume-level (15.7 million) and page-level (5.8 billion) metadata files to allow for workset building at different levels of granularity. Users can now also download individual EF volume or page files, or a bundle of files representing their complete workset. Because the tool is a standard Solr 7.4 installation, advanced users can programmatically build sophisticated queries using its standard API. This prototype also enables users to browse and view page-level EF data (text tokens sorted by document region, token frequency, or part of speech) for each page, with a link back to the HathiTrust page viewer for public domain volumes. New levels of faceting, including genre and Library of Congress classification, were also implemented, allowing for new and more nuanced methods of workset creation. Additionally, the WSB 2.0 search tool supports the creation of worksets of pages rather than volumes, a key development influenced by recommendations from domain expert partners.
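
As an illustration of this kind of programmatic access, the sketch below issues a query against a Solr 7-style select endpoint from Python. The endpoint URL, field names, and facet values are assumptions made for the example, not the deployed WSB 2.0 schema.

    # Minimal sketch of a programmatic query against a Solr 7-style select API,
    # as an advanced user of the WSB 2.0 search tool might issue. The endpoint
    # URL, field names, and facet values are illustrative assumptions, not the
    # deployed schema.
    import requests

    SOLR_SELECT = "https://solr.example.org/solr/wsb/select"  # hypothetical endpoint

    params = {
        "q": 'text:"natural philosophy"',         # unigram bag-of-words search (assumed field)
        "fq": ["genre:fiction", "language:eng"],  # hypothetical facet filters
        "fl": "id,title,pub_date",                # fields to return (assumed)
        "rows": 20,
        "wt": "json",
    }

    response = requests.get(SOLR_SELECT, params=params, timeout=30)
    response.raise_for_status()

    for doc in response.json()["response"]["docs"]:
        print(doc.get("id"), doc.get("title"))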

In addition to the Extracted Features Solr index, functionality supporting the import of HathiTrust Collections (https://babel.hathitrust.org/cgi/mb?colltype=updated) to the Analytics site has also been implemented and is available for exploration under the “Create A Workset” heading on Analytics. The Virtuoso triple store is described in detail under the next heading; this platform will enable contextual browsing as well as serendipitous discovery, both workset building methods that scholars have identified as desirable.

Creation of Linked Open Data Resources

Creating infrastructure and software to leverage Linked Open Data (LOD) to improve HTRC services has been a key piece of the WCSA+DC project. LOD presents novel and potentially more efficient ways of searching and retrieving HathiTrust volumes and pages, especially at scale. To these ends, the WCSA+DC project teams deployed a Virtuoso RDF triple store populated with BIBFRAME XML records for each volume in HathiTrust, generated from MARC records. Using HTRC’s workset model (http://doi.org/10.5334/johd.3), worksets were implemented as RDF objects in the triple store, each with a list of included volume IDs along with workset-level metadata (e.g. creator, creation date, and creator-submitted description of the workset). This proof-of-concept triple store allows for eventual increased incorporation of contextual browsing and serendipitous discovery into HTRC’s information seeking model. Additionally, Virtuoso’s store of workset objects will enable the eventual implementation of search-by-workset and search-within-workset queries, which will allow a form of workset building previously unsupported by HTRC. To simplify interactions with the triple store, an API was developed on top of pre-canned SPARQL queries that allows for browsing of worksets and of volume metadata within worksets, and for discovery of worksets containing specified volumes. The triple store has been actively connected to HTRC’s Analytics Gateway page since September 2018 and has stored all of the worksets created by users since then.
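
As a minimal sketch of the kind of pre-canned query the API wraps, the example below asks a Virtuoso SPARQL endpoint for the volumes gathered in a named workset. The endpoint URL and the workset vocabulary (prefixes and predicate names) are placeholders for illustration, not the actual HTRC triple store schema.

    # Minimal sketch of querying a Virtuoso triple store for the volumes in a
    # workset, in the spirit of the pre-canned queries wrapped by the HTRC API.
    # The endpoint URL and the vocabulary (prefixes, predicate names) are
    # placeholders for illustration only.
    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("https://triplestore.example.org/sparql")  # hypothetical endpoint
    sparql.setReturnFormat(JSON)
    sparql.setQuery("""
        PREFIX dcterms: <http://purl.org/dc/terms/>
        PREFIX ws: <http://example.org/workset#>    # placeholder workset vocabulary

        SELECT ?volume ?title WHERE {
            ?workset a ws:Workset ;
                     dcterms:title "my-fiction-workset" ;  # workset-level metadata
                     ws:includes ?volume .                 # included volume IDs
            OPTIONAL { ?volume dcterms:title ?title . }
        }
        LIMIT 50
    """)

    for row in sparql.query().convert()["results"]["bindings"]:
        print(row["volume"]["value"], row.get("title", {}).get("value", ""))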

In collaboration with Key Research Partners at the Oxford e-Research Centre, led by Dr. Kevin Page, LOD has also been leveraged to develop a proof-of-concept cross-corpora workset builder that enables users to create worksets combining material in the HathiTrust and the Early English Books Online Text Creation Partnership (EEBO-TCP). This is implemented using LOD and federated SPARQL queries over HTRC RDF triples, EEBO RDF created in Oxford from the EEBO-TCP TEI headers, and ‘bridging’ triples reconciled using external authorities (e.g. VIAF) and entity reconciliation. Prior to development, this work required an extensive survey and analysis of existing bibliographic ontologies, including MADSRDF/MODSRDF, BIBFRAME, schema.org, BIBO (http://bibliontology.com/), and FaBiO (https://sparontologies.github.io/fabio/current/fabio.html), regarding their suitability for building and parameterising worksets. Of these, only BIBFRAME was developed specifically with library-centered use cases in mind, and it was therefore chosen for implementation.
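
The cross-corpora builder depends on federated querying, so a hedged sketch of that pattern is shown below: a SERVICE clause pulls matching EEBO triples from a second endpoint while a shared, VIAF-reconciled author URI acts as the bridge. The endpoints, predicates, and VIAF identifier are all placeholders, not the project’s actual data.

    # Hedged sketch of a federated SPARQL query of the kind used for cross-corpora
    # workset building: the SERVICE clause pulls triples from a second (EEBO-TCP)
    # endpoint, with a shared VIAF author URI acting as the bridge. Endpoints,
    # predicates, and the VIAF identifier are placeholders, not the project's data.
    FEDERATED_QUERY = """
    PREFIX dcterms: <http://purl.org/dc/terms/>

    SELECT ?htVolume ?eeboText WHERE {
        # Local (HTRC) triples: volumes attributed to a VIAF-reconciled author
        ?htVolume dcterms:creator <http://viaf.org/viaf/000000000> .

        # Remote (EEBO-TCP) triples fetched through a federated SERVICE call
        SERVICE <https://eebo.example.org/sparql> {
            ?eeboText dcterms:creator <http://viaf.org/viaf/000000000> .
        }
    }
    LIMIT 25
    """

Executing such a query follows the same SPARQLWrapper pattern sketched above.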

The team at the Oxford e-Research Centre also collaborated with HTRC to develop a model characterizing the information-seeking needs of users in large-scale digital libraries, and evaluated that model against the workset model to assess its ability to meet identified user needs for workset building (the focus of papers presented at the ACM Joint Conference on Digital Libraries in 2017, http://dx.doi.org/10.1109/JCDL.2017.7991583, and 2018, http://dx.doi.org/10.1145/3197026.3203886). Further extensions were made to this model to accommodate LOD resources, which are detailed in a paper presented at the ASIS&T 2018 annual meeting, with the proceedings forthcoming.

A New Data Capsule Framework

The Data Capsule framework is a controlled compute environment for conducting computational analysis of restricted data while also protecting the data from unintended uses or uses prohibited by law, policy, or licensing agreement. The Data Capsule framework, implemented as a set of policies and technologies that together enable controlled access to and use of the copyrighted texts of the HathiTrust, has made considerable progress as a result of this award in its availability, stability, scalability, and usability. It is actively serving a growing group of researchers with analysis access to HathiTrust as part of the production software release 4.0 of the HTRC.

The primary areas of contribution are customized Capsule solutions, enhanced user experience, and scaled Capsule capacity and functionality. Each of these three areas responds to one or more proposed activities in the original proposal. The individual contributions are described in more detail in the paragraphs below.

Customized Capsule solutions

With Capsule access to the copyrighted content of HathiTrust newly available, we conceptualized and built several classes of Capsules:

  • Demo Capsule: a smaller Capsule with access to HathiTrust public domain content only. Results cannot be released from the Capsule.
  • Research Capsule (public domain): a customizable Capsule with up to 4 cores and 16GB memory; derived data release is allowed pending review. Access is to HT public domain content only by default. The researcher must agree to the terms of use.
  • Research Capsule (copyright): as above, plus additional information is required on intended use; review and approval are required to access the full HT corpus. This option is currently limited to HT members only.

Enhanced user experience

We developed a novel software package, the HTRC Workset Toolkit, which makes it easier for researchers to import or export their workset to and from their Capsule, and to connect their analysis tools to the HathiTrust collection. The HTRC Workset Toolkit works with the JSON-LD description of worksets. The researcher’s Capsule additionally now comes enriched with pre-installed sample data, with the Voyant data exploration tool, and with a rich set of other user-requested analysis tools and packages, such as Anaconda, Mallet, R, the InPho TopicExplorer, and a number of popular Python libraries like gensim, numpy, scipy, pandas, and nltk. These tools and packages were included after collaboration with WCSA+DC Key Research Partners. Finally, the researcher interface to the Capsule has been upgraded to use encrypted clientless VNC and SSH connections, which allow users to seamlessly and securely access Data Capsules in different computing environments without installing specific VNC or SSH clients.
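
Since the toolkit works from the JSON-LD description of a workset, the sketch below shows one way volume identifiers could be pulled out of such a description inside a Capsule. The property names and handle URLs are assumptions for illustration, not the toolkit’s actual schema or API.

    # Minimal sketch of pulling volume identifiers out of a JSON-LD workset
    # description of the kind the HTRC Workset Toolkit consumes, so they can be
    # handed to analysis tools inside a Capsule. The property names and handle
    # URLs are assumptions for illustration, not the toolkit's actual schema or API.
    import json

    workset_jsonld = """
    {
      "@context": {"gathers": "http://example.org/workset#gathers"},
      "@id": "http://example.org/worksets/my-workset",
      "gathers": [
        {"@id": "https://hdl.handle.net/2027/mdp.39015012345678"},
        {"@id": "https://hdl.handle.net/2027/uc1.b000123456"}
      ]
    }
    """

    workset = json.loads(workset_jsonld)

    # Strip the handle prefix to recover bare volume IDs (e.g. "mdp.39015012345678").
    volume_ids = [item["@id"].rsplit("/", 1)[-1] for item in workset.get("gathers", [])]
    print(volume_ids)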

Scaled Capsule capacity and functionality

Through contributions of hardware by the University of Illinois and Indiana University, the Data Capsule service can now provision Capsules from a pool of 120 cores and 640GB of memory. Researchers can now specify a single Capsule with up to 4 cores and 16GB of memory through the standard web interface; resource needs exceeding that capacity can be requested and addressed through special handling. The Data Capsule threat model has been evaluated against the multi-server hosting environment to verify security and reliability.

Topic modeling is notoriously computationally intensive. Through a combination of hardware upgrades and software refactoring, we were able to increase the capacity of topic modeling inside a research Capsule to approximately 500 volumes per GB of RAM allocated, enabling analysis of up to 8,000 volumes in a 16GB Capsule. With an average volume size of approximately 150,000 words, this corresponds to roughly 1.2 billion words in a single Capsule.
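
For a sense of what that workflow looks like in practice, the sketch below runs LDA topic modeling with gensim, one of the pre-installed Capsule packages, over a toy batch of “volumes”. The documents and parameter choices are illustrative only; real runs would operate on tokenized volume texts sized against the roughly 500-volumes-per-GB capacity noted above.

    # Minimal sketch of topic modeling a batch of volumes with gensim, one of the
    # pre-installed Capsule packages. Real worksets would be tokenized volume
    # texts; the toy documents and parameter choices here are illustrative only.
    from gensim import corpora, models

    # In practice each "document" would be the token stream of one HathiTrust volume.
    documents = [
        ["whale", "ship", "sea", "captain", "voyage"],
        ["marriage", "estate", "letter", "sister", "ball"],
        ["steam", "engine", "railway", "iron", "labour"],
    ]

    dictionary = corpora.Dictionary(documents)
    bow_corpus = [dictionary.doc2bow(doc) for doc in documents]

    # With ~500 volumes per GB of RAM, a 16GB Capsule accommodates roughly 8,000 volumes.
    lda = models.LdaModel(bow_corpus, num_topics=2, id2word=dictionary, passes=10)

    for topic_id, terms in lda.print_topics(num_words=5):
        print(topic_id, terms)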

Exemplar Data Capsules

The plan for creating a set of exemplar Data Capsules to meet the needs of computational linguistics (CL) and digital humanities (DH) users was slightly modified as the Data Capsule service evolved. Three default Capsule formats were instated, as mentioned above, while HTRC, in conjunction with Key Research Partners at the University of Illinois, Brandeis University, and the University of Waikato, moved forward with identifying standard software packages and tools that could be included with all Data Capsules. This model gives users access to a number of domain-specific tools and test data within every Capsule.

Computational Linguistics domain

Serving as expert users in the area of Computational Linguistics (CL), the team at Brandeis University, led by Prof. James Pustejovsky, was tasked with demonstrating a proof-of-concept integration of the LAPPS Grid / Galaxy platform and workflow, a project on which Pustejovsky is a Principal Investigator along with WCSA+DC Advisory Board member Prof. Nancy Ide (Vassar College), within the HTRC Data Capsule environment. The Brandeis team successfully achieved this, integrating the LAPPS Grid natural language processing (NLP) tools and cloud platform, using the Galaxy web front-end, in a Docker container that can be installed within an off-the-shelf Data Capsule. The analysis tools included support for basic text processing (sentence splitting, tokenization, part-of-speech tagging) as well as information extraction (entity recognition, relation extraction) and linguistic analysis (syntactic parsing, anaphora resolution). The LAPPS Grid tools were then integrated with the HTRC Workset Toolkit, a library written for retrieving and interacting with HT texts and metadata within the Data Capsule.
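
The sketch below is not the LAPPS Grid / Galaxy API itself, but an illustration, using NLTK (one of the pre-installed Capsule packages), of the basic processing steps the integrated tools provide: sentence splitting, tokenization, part-of-speech tagging, and entity recognition.

    # Hedged illustration of the basic processing steps the integrated tools
    # provide (sentence splitting, tokenization, part-of-speech tagging, entity
    # recognition), shown with NLTK, one of the pre-installed Capsule packages.
    # This is not the LAPPS Grid / Galaxy API itself.
    import nltk

    # One-time model downloads (in a Capsule these could be pre-installed).
    for resource in ["punkt", "averaged_perceptron_tagger", "maxent_ne_chunker", "words"]:
        nltk.download(resource, quiet=True)

    text = "Herman Melville published Moby-Dick in London in 1851."

    for sentence in nltk.sent_tokenize(text):   # sentence splitting
        tokens = nltk.word_tokenize(sentence)   # tokenization
        tagged = nltk.pos_tag(tokens)           # part-of-speech tagging
        tree = nltk.ne_chunk(tagged)            # named entity recognition
        print(tree)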

With the software installed and running, the LAPPS Grid NLP tools could be evaluated against HT data, as well as modified and extended to include common functionalities used by Digital Humanities researchers, such as entity and relation extraction. As part of this process, datasets were developed to evaluate and tune the NLP tools against HT data and to assess the success of the newly incorporated DH tools.

Digital Humanities domain

Prof. Ted Underwood (University of Illinois) served as a Digital Humanities (DH) domain expert, and was tasked with using the new Data Capsule infrastructure for his projects on gender in fiction and character in biography. The centerpiece of Prof. Underwood’s work on the grant was research on characterization in nineteenth- and twentieth-century fiction, published in the Journal of Cultural Analytics as “The Transformation of Gender in English-Language Fiction” (available here: http://culturalanalytics.org/2018/02/the-transformation-of-gender-in-english-language-fiction/). This research required natural language processing on connected text, so it couldn’t be accomplished with HTRC’s EF Dataset, and was previously impossible to explore due to copyright restrictions on much of the fiction in HathiTrust. This piece on character was well received, with journalistic coverage in Smithsonian (https://www.smithsonianmag.com/arts-culture/what-big-data-can-tell-us-about-women-and-novels-180968153/), The Economist (https://www.economist.com/prospero/2018/03/08/machines-are-getting-better-at-literary-analysis), and The Washington Post (https://www.washingtonpost.com/news/posteverything/wp/2018/07/30/how-computational-analysis-is-teaching-us-to-read-in-new-ways/?utm_term=.aa84c7f01b39). Parts of that research will also be used in a forthcoming book from Dr. Underwood, Distant Horizons: Digital Evidence and Literary Change (Chicago: University of Chicago Press, 2019).

Prof. Underwood was also able to explore additional projects during the WCSA+DC grant period, including a similar analysis of “characters” in biography, focusing on comparing the “characters” in biographies to fictional characters, as well as a project on book reviews, for which he was awarded a fellowship at the National Humanities Center. This project produced several interesting results that are still being written up by the project team, and, with the project on gender in fiction, served as a useful pilot of the enhanced Data Capsule framework, informing the inclusion of a number of standard DH tools (mentioned fully under the “A New Data Capsule Framework” heading) as well as technical resources for off-the-shelf Data Capsules. Prof. Underwood’s project on book reviews is currently ongoing, and has employed the WSB 2.0 prototype to identify his workset, with analysis occurring within the Data Capsule environment. This exciting project will be one of the first to use HTRC services from workset building stage to the final analysis and publication phase.

Concept tags

The University of Waikato team, led by Assoc. Prof. Annika Hinze, was tasked with implementing a prototype of their Capisco concept tagging system within a Data Capsule, to test and potentially develop a workflow for eventual wider release of this tool as a standard inclusion in the Data Capsule. The process involves seeding and tagging of concepts. Seeding is the process of flagging terms from the tokens of the full text because they appear in a Concepts-in-Context (CiC) network the Waikato team generated from Wikipedia data. The generated seeds are then semantically compared with each other to disambiguate them into cogent concepts represented by the text data. This process, along with a comparison of various seeding strategies, is detailed in a paper presented at the ACM Joint Conference on Digital Libraries 2018 meeting, available online at https://doi.org/10.1145/3197026.3203874.
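
The sketch below is not Capisco itself, but a minimal illustration of the seeding and disambiguation idea: flag tokens that appear as labels in a (toy, hand-made) concepts-in-context network, then pick the candidate concept whose context terms overlap most with the words surrounding the flagged token.

    # Minimal sketch of the seeding idea (not Capisco itself): flag tokens that
    # appear as labels in a toy concepts-in-context network, then disambiguate an
    # ambiguous term by comparing its surrounding words against each candidate
    # concept's context terms. The network below is a tiny hand-made stand-in for
    # the Wikipedia-derived CiC network described above.
    TOY_CIC_NETWORK = {
        "bank": {
            "Bank (finance)": {"money", "loan", "deposit", "interest"},
            "Bank (river)": {"river", "water", "shore", "erosion"},
        },
    }

    def disambiguate(term, context_tokens):
        """Pick the candidate concept whose context terms overlap most with the text."""
        candidates = TOY_CIC_NETWORK.get(term.lower(), {})
        scored = {
            concept: len(context_terms & set(context_tokens))
            for concept, context_terms in candidates.items()
        }
        return max(scored, key=scored.get) if scored else None

    tokens = ["the", "river", "bank", "was", "eroded", "by", "water"]
    print(disambiguate("bank", tokens))   # -> "Bank (river)"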

Once seeds were successfully disambiguated, they were evaluated and refined, then added as tags to pages and volumes. These tags enable the eventual implementation of search and retrieval, as well as results filtering, by concept. This has been illustrated in a small, diverse test corpus, with a sample query (searching for the concept ‘Bank’) of this set available for exploration in the WSB 2.0 prototype here: https://solr1.ischool.illinois.edu/solr-ef/index.html?solr-col=dbbridge-fict1055-htrc-configs-storeall&solr-key-q=0132F306CB4EB6FCAD97EB51280D12984#search-results-anchor.

A full description of project outcomes, along with challenges and lessons learned, can be accessed in the full final report (PDF).