HathiTrust Research Center Expands Text Mining Corpus

Good news for text and data mining researchers! After years of court cases and policymaking, the entire 16-million-item collection of the HathiTrust Digital Library, including in-copyright content, is available for text and data mining. (Yay!)

Previously, only public domain materials could be used with HTRC Analytics’ suite of tools. That restriction limited the ability to do quality computational research on modern history, since most out-of-copyright items are texts created before 1923. With this update, everyone can perform text analysis on the full corpus with most of the tools. HathiTrust is membership-based, so some restrictions still apply to non-member institutions and independent scholars. Under the new policy, only one service, the HTRC Data Capsule (a virtual computing environment), keeps full-corpus access members-only, and it is available to requesters with an established research need. There are over 140 member institutions, including the University of Illinois.

Here’s a quick overview of HTRC’s tools and access permissions (from HTRC’s Documentation).

  • HTRC Algorithms: a set of tools for assembling collections of digitized text from the HathiTrust corpus and performing text analysis on them. Includes copyrighted items for ALL USERS.
  • Extracted Features Dataset: a dataset allowing non-consumptive analysis of specific features extracted from the full text of the HathiTrust corpus. Includes copyrighted items for ALL USERS.
  • HathiTrust+Bookworm: a tool for visualizing and analyzing word usage trends in the HathiTrust corpus. Includes copyrighted items for ALL USERS.
  • HTRC Data Capsule: a secure computing environment for researcher-driven text analysis on the HathiTrust corpus. All users may access public domain items. Access to copyrighted items is available ONLY to member-affiliated researchers.

Fair Use to the Rescue!

How is this possible? Through both the Fair Use section of the Copyright Act and HathiTrust’s policy of allowing only non-consumptive research. Fair Use protects use of copyrighted materials for educational, research, and transformative purposes. Non-consumptive research means that researchers can glean information about works without actually being able to read (consume) them. You can see the end result (topic models, word and phrase statistics, etc.) without seeing the full text of the work as a human reader would. Allowing only computational research on the corpus protects rights holders and benefits researchers. A researcher can perform text analysis on thousands of texts without reading them all, which is the basis of computational text analysis anyway! Our Copyright Librarian, Sara Benson, recently discussed how Fair Use factors into HathiTrust’s definition of non-consumptive research.
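To make the idea of non-consumptive analysis a little more concrete, here is a minimal Python sketch. It assumes a made-up structure of per-page word counts (the `pages` list and the `volume_word_counts` helper are hypothetical illustrations, not HTRC's actual data format or API). The point is that word statistics can be computed from counts alone, without the readable text ever being available.

```python
from collections import Counter

# Hypothetical per-page word counts for one volume. This structure is an
# assumption for illustration only, not HTRC's actual data format.
pages = [
    {"page": 1, "tokens": {"whale": 3, "sea": 2, "ship": 1}},
    {"page": 2, "tokens": {"whale": 1, "captain": 2, "sea": 1}},
]


def volume_word_counts(pages):
    """Sum per-page word counts into one volume-level Counter."""
    totals = Counter()
    for page in pages:
        # Counter.update adds the counts from each page's mapping.
        totals.update(page["tokens"])
    return totals


counts = volume_word_counts(pages)
print(counts.most_common(2))  # [('whale', 4), ('sea', 3)]
```

Because the counts carry no word order, the original sentences cannot be reconstructed from them, which is roughly the trade-off non-consumptive research relies on.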

Ready to use HTRC Analytics for text mining? Check out their Getting Started with HTRC Guide for some simple, guided start-up activities.

For general information about the digital library, see our guide on HathiTrust.

Meet Eleanor Dickson, the Visiting HathiTrust Digital Humanities Specialist

This latest installment in our series of interviews with Scholarly Commons experts and affiliates features Eleanor Dickson, the Visiting HathiTrust Research Center Digital Humanities Specialist.


What is your background education and work experience? What led you to this field?

I have a B.A. in English and History with a minor in Italian studies. As an undergraduate I worked at a library, which was a really fun experience. I also took an archival research trip to Florida for my undergraduate thesis research and realized I wanted to do what the archivist was doing. I have a Master of Science in Information Studies from the University of Texas at Austin, and completed a postgraduate fellowship at the university archives / Emory Center for Digital Scholarship. And now I’m here!

What is your research agenda?

I research scholarly practice in humanities and digital scholarship, specifically digital humanities with a focus on the needs and practices in large-scale text analysis. I also sometimes help develop train-the-trainer curriculum for librarians, so they are better equipped to teach patrons about their options when it comes to digital scholarship.

Do you have any favorite work-related duties?

My favorite work-related duties are talking to researchers and hearing about what they are up to. I am fascinated by the different processes, methods, and resources they’re using. With HathiTrust I get to talk to researchers across the country about text analysis projects.

What are some of your favorite underutilized resources that you would recommend to researchers?

I wish more people came to the Digital Humanities Savvy Researcher workshops. If people have suggestions for what they want to see PLEASE LET US KNOW.

(To see what Savvy Researcher workshops might tickle your fancy, click here to check out our complete workshop calendar.)

If you could recommend only one book to beginning researchers in your field, what would you recommend?

Debates in the Digital Humanities, which is an open access book available free online!

Need assistance with a Digital Humanities project? E-mail Eleanor Dickson or the Scholarly Commons.

Save the Date! HathiTrust Research Center UnCamp

The HathiTrust Research Center (HTRC) is hosting its third annual HTRC UnCamp in March at the University of Michigan Palmer Commons. The UnCamp is part hands-on coding and demonstration, part inspirational use-cases, part community building, and part informational, all structured in the dynamic setting of an un-conference programming format. It has visionary speakers mixed with boot-camp activities and hands-on sessions with HTRC infrastructure and tools.

When: March 30-31, 2015, 8:00am – 5:00pm

Where: University of Michigan Palmer Commons

100 Washtenaw Avenue, Ann Arbor, MI 48109-2218

Who should attend? The HTRC UnCamp is targeted to digital humanities tool developers, researchers, librarians of HathiTrust member institutions, and graduate students. Attendees will be asked for their input in planning sessions, so please plan to register early!

Registration will be open the first week of February.

As it becomes available, additional information about the UnCamp will be posted to http://www.hathitrust.org/htrc_uncamp2015

Questions? Contact Ryan Dubnicek, HTRC Executive Assistant, at rdubnic2@illinois.edu

Hope to see you in Ann Arbor!