Yuerong Hu: Improving the Scholarly Usability of Cultural Data Provisions for Digital Humanities Research

Title: Improving the Scholarly Usability of Cultural Data Provisions for Digital Humanities Research: Case Studies of Books and Book Reviews
Session Lead: Yuerong Hu
Time: 10 am – 11 am, Thursday, 2022-02-24
Location: Zoom


In recent years, the increasing availability of cultural datasets has opened up unprecedented research opportunities for various scholarly communities. Digital libraries and social media platforms have become two essential sources of research materials for digital humanities (DH) and cultural analytics (CA) studies. However, many issues and gaps have emerged in deploying the curated datasets retrieved from these sources for scholarly research. In this talk, we will reveal and discuss some of these limitations and challenges through two case studies. In the first study, we will review how we improved the representativeness and scholarly usability of an English literature dataset of 178,381 volumes curated by the HathiTrust Research Center (HTRC) to measure changes in three literary genres (fiction, drama, and poetry). Specifically, we analyzed and overcame three common limitations: duplicate volumes, uneven distribution of data, and optical character recognition (OCR) errors. In the second study, we will examine the complexities associated with user-generated book reviews collected from two social reading platforms: Goodreads, based in the U.S., and Douban, based in China. We conducted three exemplar experiments to shed light on the temporal changes, cross-cultural divergence, and power dynamics of crowd opinions about books. Both case studies empirically demonstrate the underlying selection bias, limitations, and complexities of cultural data provisions. Based on what we’ve learned, we suggest that stakeholders of cultural data provisions should flag and address these problems to optimize the usability of their data for CA and DH research. Researchers and scholars working with such datasets should also scrutinize more dimensions of the datasets they use, in order to evaluate and improve the scholarly representativeness and interpretability of those datasets.

Readings: [Box-Folder]
[1] Hu, Y., Jiang, M., Underwood, T., & Downie, J. S. (2020, August). Improving digital libraries’ provision of digital humanities datasets: A case study of HTRC literature dataset. In Proceedings of the ACM/IEEE Joint Conference on Digital Libraries in 2020 (pp. 405-408).
[2] Organisciak, P., Schmidt, B. M., & Downie, J. S. (2022). Giving shape to large digital libraries through exploratory data analysis. Journal of the Association for Information Science and Technology, 73(2), 317-332.