HathiTrust Research Center Expands Text Mining Corpus

Good news for text and data mining researchers! After years of court cases and policymaking, the entire 16-million-item collection of the HathiTrust Digital Library, including in-copyright content, is available for text and data mining. (Yay!)

Previously, only public domain materials could be used with HTRC Analytics’ suite of tools. That restriction sharply limited computational research on modern history, since most out-of-copyright items are texts created before 1923. With this update, everyone can perform text analysis on the full corpus, though because HathiTrust is membership-based, some restrictions still apply to researchers at non-member institutions and to independent scholars. Under the new policy, only one service, the HTRC Data Capsule (a virtual computing environment), retains members-only access to the full corpus, and only for requesters with an established research need. There are over 140 member institutions, including the University of Illinois.

Here’s a quick overview of HTRC’s tools and access permissions (from HTRC’s Documentation).

  • HTRC Algorithms: a set of tools for assembling collections of digitized text from the HathiTrust corpus and performing text analysis on them. Includes copyrighted items for ALL USERS.
  • Extracted Features Dataset: a dataset allowing non-consumptive analysis of specific features extracted from the full text of the HathiTrust corpus. Includes copyrighted items for ALL USERS.
  • HathiTrust+Bookworm: a tool for visualizing and analyzing word usage trends in the HathiTrust corpus. Includes copyrighted items for ALL USERS.
  • HTRC Data Capsule: a secure computing environment for researcher-driven text analysis on the HathiTrust corpus. All users may access public domain items. Access to copyrighted items is available ONLY to member-affiliated researchers.

Fair Use to the Rescue!

How is this possible? Through a combination of fair use (Section 107 of the U.S. Copyright Act) and HathiTrust’s policy of allowing only non-consumptive research. Fair use protects the use of copyrighted materials for educational, research, and transformative purposes. Non-consumptive research means that researchers can glean information about works without actually being able to read (consume) them: you see the end results (topic models, word and phrase statistics, etc.) without ever seeing the entire work in human-readable form. Allowing only computational research on a corpus protects rights holders while still benefiting researchers, who can perform text analysis on thousands of texts without reading them all, which is the basis of computational text analysis anyway! Our Copyright Librarian, Sara Benson, recently discussed how fair use factors into HathiTrust’s definition of non-consumptive research.
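The logic of non-consumptive access can be sketched in a few lines of Python. This is a hypothetical illustration of the general idea, not HTRC’s actual pipeline: the readable text stays behind the wall, and the researcher receives only derived features such as word counts.

```python
from collections import Counter

def extract_features(full_text):
    """Reduce readable text to a bag of word counts --
    features a researcher can analyze without being able
    to read (consume) the original prose in order."""
    return Counter(full_text.lower().split())

# The in-copyright page stays server-side (made-up example text)...
page = "the museum retells the history of the american west"

# ...and only aggregate statistics are released to the researcher.
features = extract_features(page)
print(features.most_common(3))
```

From counts like these a researcher can build word-frequency trends or topic models, but the sentence itself can never be read back in order.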

Ready to use HTRC Analytics for text mining? Check out their Getting Started with HTRC Guide for some simple, guided start-up activities.

For general information about the digital library, see our guide on HathiTrust.

Spotlight: JSTOR Labs Text Analyzer

JSTOR Labs has recently rolled out a beta version of the JSTOR Text Analyzer. Its purpose differs from that of other text analysis tools (such as Voyant): the Text Analyzer mines documents you drop into its easy-to-use interface, breaks them down into topics and terms, and then searches JSTOR with those terms. The result? A list of JSTOR articles that relate to your research topic and help fill out your bibliography.

So, how does it work?

You simply drag and drop a file (their demo file is an article named “Retelling the American West in the Museum”), copy and paste text, or select a file from your computer to input into the interface. What you drag and drop does not necessarily have to be an academic article. In fact, after inputting a fairly ordinary image for this blog, the Text Analyzer gave me remarkably useful results relating to blogging and learning, the digital humanities, and libraries.

Results from the Commons Knowledge blog image.

After you drop your file into JSTOR, your document is broken down into terms, which are further sorted into topics, people, locations, and organizations. JSTOR decides which terms it believes are the most important, prioritizes them, and even assigns specific weights to the most important ones. However, you can customize all of this by promoting words from the identified terms to prioritized terms, adding or deleting prioritized terms, and changing the weights of prioritized terms. For example, here are the automatic terms and results from the demo article:

The automatic terms and results from the demo article.

However, I’m going to remove the article’s author from the prioritized terms, add Native Americans and Brazilian art, and change the weights so that the latter two are the most important. This is how my terms and results list will look:

The new terms and results list.

As you can see, the results completely changed!
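JSTOR hasn’t published how the Text Analyzer computes its term weights, but the general idea of scoring terms by how distinctive they are can be illustrated with a standard TF-IDF calculation. Everything below (the document, the background corpus, and the function itself) is made up for illustration and is not JSTOR’s actual algorithm.

```python
import math
from collections import Counter

def tfidf(doc_tokens, corpus_docs):
    """Weight each term by its frequency in the document,
    discounted by how common it is across the corpus
    (a standard scheme; not JSTOR's actual method)."""
    tf = Counter(doc_tokens)
    n_docs = len(corpus_docs)
    scores = {}
    for term, count in tf.items():
        doc_freq = sum(1 for d in corpus_docs if term in d)
        idf = math.log((1 + n_docs) / (1 + doc_freq)) + 1
        scores[term] = (count / len(doc_tokens)) * idf
    return scores

# Hypothetical uploaded document and background corpus.
doc = ["museum", "west", "american", "museum"]
corpus = [["museum", "art"], ["west", "history"], ["library", "blog"]]

weights = tfidf(doc, corpus)
top_term = max(weights, key=weights.get)  # "museum": frequent here
```

Manually raising a term’s weight in the Text Analyzer plays the same role as boosting its score here: documents matching that term move up the results list.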

While the JSTOR Text Analyzer doesn’t function quite like other text analyzers, its ability to find key terms will help you not only find articles on JSTOR, but also reuse those terms in other databases. Further, it can help you think strategically about search strategies on JSTOR and see which search terms yield (perhaps unexpectedly) the most useful results for you. So while the JSTOR Text Analyzer is still in beta, it has the potential to be an incredibly useful tool for researchers, and we’re excited to see where it goes from here!

Meet Eleanor Dickson, the Visiting HathiTrust Digital Humanities Specialist

Photo of Eleanor Dickson

This latest installment in our series of interviews with Scholarly Commons experts and affiliates features Eleanor Dickson, the Visiting HathiTrust Research Center Digital Humanities Specialist.


What is your background education and work experience? What led you to this field?

I have a B.A. in English and History with a minor in Italian Studies. As an undergraduate I worked at a library, which was a really fun experience. I also took an archival research trip to Florida for my undergraduate thesis and realized I wanted to do what the archivist was doing. I have a Master of Science in Information Studies from the University of Texas at Austin, and I completed a postgraduate fellowship at the university archives / Emory Center for Digital Scholarship. And now I’m here!

What is your research agenda?

I research scholarly practice in the humanities and digital scholarship, specifically digital humanities, with a focus on the needs and practices of large-scale text analysis. I also sometimes help develop train-the-trainer curricula so that librarians are better equipped to teach patrons about their options when it comes to digital scholarship.

Do you have any favorite work-related duties?

My favorite work-related duties are talking to researchers and hearing about what they are up to. I am fascinated by the different processes, methods, and resources they’re using. With HathiTrust I get to talk to researchers across the country about text analysis projects.

What are some of your favorite underutilized resources that you would recommend to researchers?

I wish more people came to the Digital Humanities Savvy Researcher workshops. If people have suggestions for what they want to see PLEASE LET US KNOW.

(To see which Savvy Researcher workshops might tickle your fancy, check out our complete workshop calendar.)

If you could recommend only one book to beginning researchers in your field, what would you recommend?

Debates in Digital Humanities, which is an open access book available free online!

Need assistance with a Digital Humanities project? E-mail Eleanor Dickson or the Scholarly Commons.