Celebrating Frederick Douglass with Crowdsourced Transcriptions

A flier advertising the Transcribe-a-thon, which includes a photo of Frederick Douglass

On February 14, 2018, the world celebrated Frederick Douglass’ 200th birthday. Douglass, the famed Black social reformer, abolitionist, writer, and statesman, did not know the date of his birth, and chose February 14, 1818 to celebrate his birthday. This year, to celebrate the 200th anniversary of his birth, Colored Conventions, the Smithsonian Transcription Center, and the National Museum of African American History & Culture partnered to host a Transcribe-a-thon of the Freedmen’s Bureau Papers in Douglass’ honor.

The Freedmen’s Bureau Papers consist of 2 million digitized documents, made available through a partnership between the Smithsonian Transcription Center and the National Museum of African American History and Culture. It is the largest crowdsourcing initiative the Smithsonian has ever hosted. The Freedmen’s Bureau helped formerly enslaved individuals with everyday problems, from obtaining clothing and food to finding lost family members. The Bureau operated from 1865 to 1872 and closed due to opposition from Congress and President Andrew Johnson.

The Transcribe-a-thon was held on February 14th from 12-3 PM EST. According to the Smithsonian Transcription Center, over 779 pages of the Freedmen’s Bureau Papers were transcribed during this time, 402 pages were reviewed and approved, and 600 new volunteers registered for the project. Over sixty institutions hosted Transcribe-a-thon locations, many of which bought birthday cakes in Douglass’ honor from African American-owned bakeries in their area. Meanwhile, Colored Conventions livestreamed participants during the event. If you’re interested in seeing more from Douglass Day 2018, check out the Smithsonian Transcription Center’s Twitter Moment.

The Douglass Day Transcribe-a-thon was a fantastic example of people coming together to do great digital humanities work for a great cause. While crowdsourced transcription projects are not new, the enthusiasm for Douglass Day is certainly unique and infectious, and we’re excited to see where this project goes in the future and to get involved ourselves!

 

What Storify Shutting Down Means to Us

The Storify logo.

You may have heard that popular social media story platform Storify will be shutting down on May 16, 2018. Open to the public since 2011, it has hosted everything from academic conference tweet round-ups to “Dear David”, the ongoing saga of Buzzfeed writer Adam Ellis and the ghost that haunts his apartment. So it shocked long-time users in December when Storify suddenly announced that it would be shutting down in just a few months.

Already, Storify is no longer allowing new accounts to be created, and by May 1st, users won’t be able to create new stories. On May 16th, everything disappears. Storify will continue on as Storify 2, a feature of Livefyre, but that will require purchasing a Livefyre license for access. The fact is that many users cannot or will not pay for Livefyre. Essentially, Storify will cease to exist on May 16th for most people.

So… what does this mean?

Of course, it means that you need to export anything that you have stored on Storify and want to save. (They provide instructions for exporting content in their shutdown FAQ.) More than that, however, we need to talk about how we rely on services to archive our materials online, and how that is a dangerous long-term preservation strategy.

The fact is, free Internet services can change in an instant, and without consulting their user base. As we have seen with Storify — as well as other services like Google Reader — what seems permanent can disappear quickly. When it comes to long-term digital preservation, we cannot depend on these services as our only means of preservation.

That is not to say that we cannot use free digital tools like Storify. Storify was a great way to collect Tweets, present stories, and get information out to the public. And if you or your institution did not have the funds or support to create a long-term preservation plan, Storify was a great stop-gap in the meantime. But digital preservation is a marathon, not a sprint, and we will need to continue to find new, innovative ways to ensure that digital material remains accessible.

When I heard Storify was shutting down, I went to our Scholarly Commons intern Matt Pitchford, whose research is on social media and who has a real stake in digital preservation, for his take on the issue. (You can read about Matt’s research here and here.) Here’s what Matt had to say:

Thinking about [Storify shutting down] from a preservation perspective, I think it reinforces the need to develop better archival tools along two dimensions: first, along the lines of navigating the huge amounts of data and information online (like how the Library of Congress has that huge Twitter archive, but no means to access it, and which they recently announced they will stop adding to). Just having all of Storify’s data wouldn’t make it navigable. Second, that archival tools need to be able to “get back” to older forms of data. There is no such thing as a “universally constant” medium. PDFs, twitter, Facebook posts, or word documents all may disappear over time too, despite how important they seem to our lives right now. Floppy disks, older computer games or programs, and even recently CDs, aren’t “accessible” in the way they used to be. I think the same is eventually going to be true of social media.
Matt brings up some great issues here. Storify shutting down could simply be a harbinger of more change online. Social media spaces come and go (who else remembers MySpace and LiveJournal?), and even the nature of posts change (who else remembers when Tweets were just 140 characters?). As archivists, librarians, and scholars, we will have to adopt, adapt, and think quickly in order to stay ahead of forces that are out of our control.
And most importantly, we’ll have to save backups of everything we do.

Spotlight: Unexpected Surprises in the Internet Archive

Image of data banks with the Internet Archive logo on them.

The Internet Archive.

For most of us, our introduction to the Internet Archive was the Wayback Machine, a tool that can show you snapshots of websites from the past. It’s always fun to watch a popular webpage like Google evolve from November 1998 to July 2004 to today, but there is so much more that the Internet Archive has to offer. Today, I’m going to go through a few highlights from the Internet Archive’s treasure trove of material, just to get a glimpse of all the things you can do and see on this amazing website.

Folksoundomy: A Library of Sound

Folksoundomy is the Internet Archive’s collection of sounds, music, and speech. The collection is collaboratively created and tagged, with many participants from outside the library sphere. There are more than 155,960 items in the Folksoundomy collection, with items dating back to Thomas Edison’s invention of recorded sound in 1877. From Hip Hop Mixtapes to Russian audiobooks, sermons to stand-up comedy, music, podcasts, radio shows and more, Folksoundomy is an incredible resource for scholars looking at the history of recorded sound.

TV News Archive

With over 1,572,000 clips collected since 2009, the TV News Archive includes everything from this morning’s news to curated series of fact-checked clips. Special collections within the TV News Archive include Understanding 9/11, Political Ads, and the TV NSA Clip Library. With the ability to search closed captions from US TV news shows, the Internet Archive provides a unique research opportunity for those studying modern US media.

Software Library: MS-DOS Games

Ready to die of dysentery on The Oregon Trail again? Now is your chance! The Internet Archive’s MS-DOS Games Software Library uses EM-DOSBOX, an in-browser emulator, to let you play games that would otherwise seem lost to time. Relive your childhood memories or start researching trends in video games throughout the years with this incredible collection of playable games!

National Security Internet Archive (NSIA)

Created in March 2015, the NSIA collects files from muckraking and national security organizations, as well as historians and activists. With over 2 million files split into 36 collections, the NSIA brings together everything from CIA Lessons Learned from Czechoslovakia to the UFO Files, a collection of declassified UFO files from around the world. Having these files accessible and in one place is incredibly helpful to researchers studying the history of national security, both in the US and elsewhere in the world.

University of Illinois at Urbana-Champaign

That’s right! We’re also on the Internet Archive. The U of I adds content in several areas: Illinois history, culture, and natural resources; US railroad history; rural studies and agriculture; works in translation; and 19th-century “triple-decker” novels and emblem books. Click on the above link to see what your alma mater is contributing to the Internet Archive today!

Of course, this is nowhere near everything! With Classic TV Commercials, the Grateful Dead, Community Software, and more, it’s definitely worth your time to see what the Internet Archive can do for you!

Spotlight: Library of Congress Labs

The Library of Congress Labs banner.

It’s always exciting when an organization with as much influence and reach as the Library of Congress decides to do something different. Library of Congress Labs is a new endeavor by the LoC, “a place to encourage innovation with Library of Congress digital collections”. Launched on September 19, 2017, Labs is a place of experimentation, and will host a rotating selection of “experiments, projects, events and resources” as well as blog posts and video presentations.

In this post, I’ll just be faffing around the Labs website, focusing on the “Experiments” portion of the site. (We’ll look at “LC for Robots” in another post.) As of writing (10/3/17), there are three “Experiments” on the site — Beyond Words, Calling All Storytellers, and the #AsData Poster Series. Right now, Calling All Storytellers is just asking for people’s ideas for the website, so I’ll briefly go over Beyond Words and the #AsData Poster Series and give my thoughts on them.

Beyond Words

Beyond Words is a crowdsourced transcription system for the LoC’s Chronicling America digitized newspaper collection. Users are invited to mark, transcribe, and verify World War I newspapers. Tasks are split, so the user only does one task at a time. Overall, it’s pretty similar to other transcription efforts already on the Internet, though the tools tend to work better and feel less clunky than some other efforts I’ve seen.

#AsData Poster Series

The #AsData Poster Series is a set of posters by artist Oliver Baez Bendorf, commissioned by the LoC for their Collections as Data Summit in September 2016. The posters are beautiful and artistic, and represent the themes of the summit. One aspect I like about this page is that it includes more than just the posters themselves, such as an interview with the artist. That being said, it does seem like a bit of a placeholder.

While I was excited to explore the experiments, I’m hoping to see more innovative ideas from the Library of Congress. The Labs “Experiments” have great potential, and it will be interesting to stay tuned and see where they go next.

Keep an eye on Commons Knowledge in the next few weeks, when we talk about the “LC for Robots” Labs page!

Spotlight: PastPin

The PastPin logo.

Who? What? Where? When? and Why? While these make up a catchy song from Spy Kids, they’re also questions that can get lost when looking at digital images, especially when metadata is missing. PastPin wants to help answer these questions, by tagging the location and time of vintage images on Flickr Commons, with the hope that one day they will be searchable through the Where? and When? of the images themselves. By doing this, PastPin wants to create new ways to do research using public domain images online.

Created by Geopast — a genealogy service — PastPin uses 6,806,043 images from 115 cultural institutions hosted on Flickr. When you bring up the PastPin website, you’ll be presented with images that PastPin believes come from your geographic area. When you click on an image, you can search a map for its specific location and enter a date, which is then saved. The image then becomes searchable by PastPin users through the entered information. The hope is that all of these images will eventually be identified, so that users can search by location or date.

Some images are easier to geolocate and date than others. PastPin pulls in metadata and written descriptions from Flickr, so images that are published by an institution — such as the University Laboratory High School, like several images I encountered — may already have this information readily available, making it easy to type it into the map and save it. Other images are more difficult to locate or date because they lack that information, and take more outside knowledge to suss out. PastPin also lacks adequate guidelines, particularly for locations. Since many of the images that come from the University of Illinois are from digitized books, are they looking for the location where the book was printed? Or of the library it resides in? It’s unclear.

PastPin faces what would seem like a Herculean feat. As I’m writing this, only 1.79% of the nearly seven million images have been located so far, and 2.13% have been dated. Today, there have been 18 updates, including two that I made, so the work moves slowly.

Still, PastPin is an awesome example of the power of crowd-sourced projects, and the potential of new thinking to change the way that we do research. The Internet creates so many opportunities for new kinds of research, and the ability to search through public domain images in new ways is just one of them.

Do you know of other websites that are trying to crowdsource data? How about websites that are trying to push research in new directions? Let us know in the comments!

An Introduction to Traditional Knowledge Labels and Licenses

NOTE: While we are discussing matters relating to the law, this post is not meant as legal advice.

Overview

Fans of Mukurtu CMS, a digital archiving platform, as well as intellectual property nerds may already be familiar with Traditional Knowledge labels and licenses, but for everyone else here’s a quick introduction. Traditional Knowledge labels and licenses were specifically created for researchers and artists working with, or thinking of digitizing, materials created by indigenous groups. Although created more for educational than legal value, these labels aim to allow indigenous groups to take back some control over their cultural heritage and to educate users about how to incorporate these digital heritage items in a more just and culturally sensitive way. The content that TK licenses and labels cover extends beyond digitized visual arts and design to recorded, written, and oral histories and stories. TK licenses and labels are also a standard to consider when working with any cultural heritage created by marginalized communities, and they provide an interesting way to recognize ownership and the proper use of work that is in the public domain. These labels and licenses are administered by Local Contexts, an organization directed by Jane Anderson, a professor at New York University, and Kim Christen, a professor at Washington State University. Local Contexts is dedicated to helping Native Americans and other indigenous groups gain recognition for, and control over, the way their intellectual property is used. The organization has received funding from sources including the National Endowment for the Humanities and the World Intellectual Property Organization.

Traditional Knowledge, or TK, labels and licenses are a way to incorporate protocols for cultural practices into your humanities data management and presentation strategies. This is especially relevant because indigenous cultural heritage items are traditionally viewed by Western intellectual property law as part of the public domain. And, of course, there is a long and troubling history of dehumanizing treatment of Native Americans by American institutions, as well as a lack of formal recognition of their cultural practices, which is only starting to be addressed. Things have been slowly improving; for example, the Native American Graves Protection and Repatriation Act of 1990 was a law specifically created to address institutions, such as museums, that owned and displayed people’s relatives’ remains and related funerary art without their permission or the permission of their surviving relatives (McManamon, 2000). The World Intellectual Property Organization’s Intergovernmental Committee on Intellectual Property and Genetic Resources, Traditional Knowledge and Folklore (IGC) has begun to address and open up conversations about these issues in hopes of coming up with a more consistent legal framework for countries to work with; though, confusingly, most of what Traditional Knowledge labels and licenses apply to are considered “Traditional Cultural Expressions” by WIPO (“Frequently Asked Questions,” n.d.).

To see these labels and licenses in action, take a look at how they are used in the Mira Canning Stock Route Project Archive from Australia (“Mira Canning Stock Route Project Archive,” n.d.).

The main difference between TK labels and licenses is that TK labels are an educational tool for suggested use with indigenous materials, whether or not they are legally owned by an indigenous community, while TK licenses are similar to Creative Commons licenses — though less recognized — and serve as a customizable supplement to traditional copyright law for materials owned by indigenous communities (“Does labeling change anything legally?,” n.d.).

The default types of TK licenses are: TK Education, TK Commercial, TK Attribution, TK Noncommercial.

Four proposed TK licenses

TK Licenses so far (“TK Licenses,” n.d.)

Each license and label, along with a detailed description, can be found on the Local Contexts site, and information about each label is available in English, French, and Spanish.

The types of TK labels are: TK Family, TK Seasonal, TK Outreach, TK Verified, TK Attribution, TK Community Use Only, TK Secret/Sacred, TK Women General, TK Women Restricted, TK Men General, TK Men Restricted, TK Noncommercial, TK Commercial, TK Community Voice, TK Culturally Sensitive (“Traditional Knowledge (TK) Labels,” n.d.).

Example:

TK Women Restricted (TK WR) Label

A TK Women Restricted Label.

“This material has specific gender restrictions on access. It is regarded as important secret and/or ceremonial material that has community-based laws in relation to who can access it. Given its nature it is only to be accessed and used by authorized [and initiated] women in the community. If you are an external third party user and you have accessed this material, you are requested to not download, copy, remix or otherwise circulate this material to others. This material is not freely available within the community and it therefore should not be considered freely available outside the community. This label asks you to think about whether you should be using this material and to respect different cultural values and expectations about circulation and use.” (“TK Women Restricted (TK WR),” n.d.)

Wait, so is this a case where a publicly-funded institution is allowed to restrict content from certain users by gender and other protected categories?

The short answer is that this is not what these labels and licenses are used for. Local Contexts, Mukurtu, and many of the projects and universities associated with the Traditional Knowledge labels and licensing movement are publicly funded. From what I’ve seen, the restrictions are optional, especially for those outside the community (“Does labeling change anything legally?,” n.d.). It’s more a way to point out when something is meant only for members of a certain gender, or to be viewed only during a certain time of year, than to actually restrict access to members of a certain gender. In other words, the gender-based labels, for example, are meant to support the kind of self-regulated viewing of materials that is often found in archival spaces. That being said, some universities have what is called a Memorandum of Understanding with an indigenous community, in which the university agrees to respect Native American culture. The extent to which this applies to digitized cultural heritage held in university archives, for example, is unclear, though most Memoranda of Understanding are not legally binding (“What is a Memorandum of Understanding or Memorandum of Agreement?,” n.d.). Overall, this raises lots of interesting questions about balancing conflicting views of intellectual property, access, and the public domain.

Works Cited:

Does labeling change anything legally? (n.d.). Retrieved August 3, 2017, from http://www.localcontexts.org/project/does-labeling-change-anything-legally/
Frequently Asked Questions. (n.d.). Retrieved August 3, 2017, from http://www.wipo.int/tk/en/resources/faqs.html
McManamon, F. P. (2000). NPS Archeology Program: The Native American Graves Protection and Repatriation Act (NAGPRA). In L. Ellis (Ed.), Archaeological Method and Theory: An Encyclopedia. New York and London: Garland Publishing Co. Retrieved from https://www.nps.gov/archeology/tools/laws/nagpra.htm
Mira Canning Stock Route Project Archive. (n.d.). Retrieved August 3, 2017, from http://mira.canningstockrouteproject.com/
TK Licenses. (n.d.). Retrieved August 3, 2017, from http://www.localcontexts.org/tk-licenses/
TK Women Restricted (TK WR). (n.d.). Retrieved August 3, 2017, from http://www.localcontexts.org/tk/wr/1.0
What is a Memorandum of Understanding or Memorandum of Agreement? (n.d.). Retrieved August 3, 2017, from http://www.localcontexts.org/project/what-is-a-memorandum-of-understandingagreement/

Further Reading:

Christen, K., Merrill, A., & Wynne, M. (2017). A Community of Relations: Mukurtu Hubs and Spokes. D-Lib Magazine, 23(5/6). https://doi.org/10.1045/may2017-christen
Educational Resources. (n.d.). Retrieved August 3, 2017, from http://www.localcontexts.org/educational-resources/
Lord, P. (n.d.). Unrepatriatable: Native American Intellectual Property and Museum Digital Publication. Retrieved from http://www.academia.edu/7770593/Unrepatriatable_Native_American_Intellectual_Property_and_Museum_Digital_Publication
Project Description. (n.d.). Retrieved August 3, 2017, from http://www.sfu.ca/ipinch/about/project-description/

Acknowledgements:

Thank you to the Rare Book and Manuscript Library and to Melissa Salrin in the iSchool for helping me with my questions about indigenous and religious materials in archives and special collections at public institutions; you are the best!

What To Do When OCR Software Doesn’t Seem To Be Working

Optical character recognition can enhance your research!

While optical character recognition (OCR) is a powerful tool, it’s not a perfect one. Feeding a document into OCR software doesn’t guarantee that the software will output something useful every time. Though most documents come out without a hitch, we have a few tips on what to do if your document just isn’t coming out right.

Scanning Issues

The problem may be less with your program and more with your initial scan. Low-quality scans are less likely to be read by OCR software. Here are a few considerations to keep in mind when scanning a document you will be using OCR on:

  • Make sure your document is scanned at 300 DPI
  • Keep your brightness level at 50%
  • Try to keep your scan as straight as possible

If you’re working with a document that you cannot create another scan for, there’s still hope! OCR engines with a GUI tend to have photo editing tools in them. If your OCR software doesn’t have those tools, or if the tools it provides aren’t cutting it, try using a photo manipulation tool such as Photoshop or GIMP to edit your document. Also, remember that OCR software tends to be less effective on photographs than on scans.
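If you’d rather script these clean-up steps than click through them by hand, here’s a minimal sketch using the Pillow imaging library in Python. The file names, contrast factor, and rotation angle are all hypothetical placeholders; treat it as a starting point rather than a recipe that suits every scan.

```python
# A minimal pre-processing sketch with Pillow; adjust the values to your scan.
from PIL import Image, ImageEnhance

scan = Image.open("my_scan.jpg")  # hypothetical input file

# Convert to grayscale and boost contrast, which often helps OCR engines.
gray = scan.convert("L")
high_contrast = ImageEnhance.Contrast(gray).enhance(1.5)

# Nudge a slightly crooked scan back toward straight (the angle is an example).
straightened = high_contrast.rotate(-1.5, expand=True, fillcolor=255)

# Save with 300 DPI metadata, the resolution most OCR engines expect.
straightened.save("my_scan_cleaned.png", dpi=(300, 300))
```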

Textual Issues

The issues you’re having may not stem from the scanning, but from the text itself. These issues can be more difficult to solve, because you cannot change the content of the original document, but they’re still good tips to know, especially when diagnosing issues with OCR.

  • Make sure your document is in a language, and from a time period, that your OCR software recognizes; not all engines are trained on all languages
  • Low contrast in documents can reduce OCR accuracy; contrast can be adjusted in a photo manipulation tool
  • Text created prior to 1850 or with a typewriter can be more difficult for OCR software to read
  • OCR software cannot read handwriting; while we’d all like to digitize our handwritten notes, OCR software just isn’t there yet

Working with Digital Files

Digital files can, in many ways, be more complicated to run OCR software on, simply because someone else may have made the file. That may mean the file is lower quality to begin with, or that whoever scanned it made errors. Most likely, you will run into scenarios that are easy fixes using photo manipulation tools. But there will be times when the images you come across just won’t work. It’s frustrating, but you’re not alone. Check out your options!

Always Remember that OCR is Imperfect

Even with perfect documents that you think will yield perfect results, there will be a certain percentage of mistakes. Most OCR software packages achieve a per-character accuracy between 97% and 99%. While this may not seem like many errors, on a page with 1,800 characters that means between 18 and 54 errors, and in a 300-page book with 1,800 characters per page, it means between 5,400 and 16,200 errors. So always be diligent and clean up your OCR!
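If you want a rough sense of the clean-up work ahead of time, the arithmetic above is easy to reproduce. Here’s a quick sketch using the same example figures (1,800 characters per page, 300 pages, 97-99% accuracy):

```python
# Back-of-the-envelope estimate of OCR errors from per-character accuracy.
chars_per_page = 1800
pages = 300

for accuracy in (0.99, 0.97):
    errors = chars_per_page * pages * (1 - accuracy)
    print(f"{accuracy:.0%} accuracy: roughly {errors:,.0f} errors in the book")
```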

The Scholarly Commons

Here at the Scholarly Commons, we have Adobe Acrobat Pro installed on every computer, and ABBYY FineReader installed on several. We can also help you set up Tesseract on your own computer. If you would like to learn more about OCR, check out our LibGuide and keep your eye open for our next Making Scanned Text Machine Readable through Optical Character Recognition Savvy Researcher workshop!
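Once Tesseract is installed, running it from Python takes only a few lines. Here’s a minimal sketch using the pytesseract wrapper on a hypothetical cleaned-up scan; it assumes the Tesseract binary and its English language data are already installed on your machine.

```python
# A minimal OCR sketch using pytesseract (a thin wrapper around Tesseract).
from PIL import Image
import pytesseract

# Extract plain text from the scan; lang should match your document's language.
text = pytesseract.image_to_string(Image.open("my_scan_cleaned.png"), lang="eng")
print(text)
```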

Spotlight: Open Culture

The Open Culture logo.

The Internet is the world’s hub for culture. You can find anything and everything, from high-definition scans of sixteenth-century art to pixel drawings created yesterday. However, actually finding that content — and knowing which content you are free to use and peruse — can prove a difficult task for many. That’s why Open Culture has made it its mission to “bring together high-quality cultural & entertainment media for the worldwide lifelong learning community.”

Run by Lead Editor Dan Colman, director & associate dean of Stanford’s Continuing Studies Program, Open Culture gathers cultural resources that include online courses, taped lectures, movies, language lessons, recordings, book lists, syllabi, eBooks, audiobooks, textbooks, K-12 resources, art and art images, and music and writing tips, among many others. The website itself does not host any of the content; rather, Colman and his team scour the Internet looking for these resources, some of which may seem obvious, but many of which are obscure. Posting daily, the Open Culture team writes articles ranging from “Stevie Nicks ‘Shows Us How to Kick Ass in High-Heeled Boots’ in a 1983 Women’s Self Defense Manual” to “John F. Kennedy Explains Why Artists & Poets Are Indispensable to American Democracy (October 26th, 1963)”. Open Culture finds content that is useful, whimsical, timely, or all three.

The Open Culture website itself can be a little difficult to navigate. Links to content can seem hidden in the article format of Open Culture, and the various lists on the right side of the screen are clunky and require too much scrolling. However, the content that you find on the site more than makes up for the website design.

Have you used Open Culture before? Do you have other ways to find cultural resources on the web? Let us know in the comments!

Choosing GIMP as a Photoshop Alternative

The GIMP logo.

Image manipulation is a handy skill, but sinking time and money into Adobe Photoshop may not be an option for some people. If you’re looking for an alternative to Photoshop, GIMP is a great bet. Available for almost every operating system, GIMP is open source and free with lots of customization and third party plugin options.

One of the major things you lose when moving from Photoshop to GIMP is the large community and widespread knowledge of the software. While GIMP has its dedicated loyalists and volunteer developers, they lack the kind of institutional power that Adobe has to answer questions, fix bugs, and provide support. While Lynda.com does provide tutorials on GIMP, there are fewer overall resources for tutorials and help than there are for Photoshop.

That being said, GIMP can still be a more powerful tool than Photoshop, especially if you have a programming background (or can convince someone else to do some programming for you). Theoretically, you could add or subtract any features that you so choose by changing the GIMP source code, and you are free to distribute a version of GIMP with those changes to whomever you choose.
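You don’t even have to touch GIMP’s source code to automate it: GIMP 2.x ships with a built-in Python-Fu scripting console. Here’s a minimal sketch, run from that console, that converts a scan to grayscale and saves a copy; the file names are hypothetical, and this is just an illustration of GIMP’s extensibility rather than an official workflow.

```python
# A minimal Python-Fu sketch (run from GIMP's Filters > Python-Fu > Console).
from gimpfu import *  # provides pdb, GIMP's procedure database

# Load an image, convert it to grayscale, flatten the layers, and save a copy.
image = pdb.gimp_file_load("scan.jpg", "scan.jpg")
pdb.gimp_image_convert_grayscale(image)
drawable = pdb.gimp_image_flatten(image)
pdb.gimp_file_save(image, drawable, "scan_gray.png", "scan_gray.png")
pdb.gimp_image_delete(image)  # free the in-memory image when done
```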

There are a number of pros/cons for choosing GIMP over Photoshop, so here’s a handy list.

GIMP Pros:

  • Free
  • Highly customizable and flexible (with coding expertise)
  • Motivated user community run by volunteers
  • High usability
  • Easier to contact leadership regarding issues

GIMP Cons:

  • Less recognized
  • Changes are more slowly implemented
  • No guarantee that the software will be maintained in perpetuity

Of course, there are more pros and cons to using GIMP, but this should give you a basic idea of what to expect if you switch over to this open-source software.

For more information on GIMP, you can check out the GIMP Wiki, which is maintained by GIMP developers, or The GTK+ Project, a toolkit for creating graphical user interfaces (GUIs). GIMP also provides a series of Tutorials. If you’re still loyal to Adobe, you can look at the Adobe products available on the UIUC WebStore, as well as tutorials on Lynda.com.

Do you have opinions on GIMP vs. Photoshop? Let us know in the comments! And stop by the Scholarly Commons, where you can use either (or both!) software for free.

Topic Modeling and the Future of Ebooks

Ebook by Daniel Sancho CC BY 2.0

This semester I’ve had the pleasure of taking a course on Issues in Scholarly Communication with Dr. Maria Bonn at the University of Illinois iSchool. While we’ve touched on a number of fascinating issues in this course, I’ve been particularly interested in JSTOR Labs’ Reimagining the Monograph Project.

This project was inspired by the observation that, while scholarly journal articles have been available in digital form for some time now, scholarly books are only now beginning to become available in this format. Nevertheless, the nature of long-form arguments, that is, the kinds of arguments you find in books, differs in some important ways from the sorts of material you’ll find in journal articles. Moreover, the ways that scholars and researchers engage with books are often different from the ways in which they interact with papers. In light of this, JSTOR Labs has spearheaded an effort to better understand the different ways that scholarly books are used, with an eye toward developing digital monographs that better suit those uses.

Topicgraph logo

In pursuit of this project, the JSTOR Labs team created Topicgraph, a tool that allows researchers to see, at a glance, what topics are covered within a monograph. Users can also navigate directly to the pages that cover the topics they are interested in. While Topicgraph is presented as a beta-level tool, it provides us with a clear example of the untapped potential of digital books.

A topic graph for Suburban Urbanites

Topicgraph uses a method called topic modeling, which comes from natural language processing. A topic model examines a text and infers the topics discussed in it based on the terms being used: terms that frequently appear near one another are taken as evidence that they belong to a common topic.
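To make the idea concrete, here’s a minimal topic-modeling sketch using the gensim library’s LDA implementation. This illustrates the general technique rather than JSTOR Labs’ actual pipeline, and the three tiny “documents” are invented for the example.

```python
# A minimal LDA topic-modeling sketch with gensim; real corpora need real
# pre-processing (stop word removal, lemmatization, many more documents).
from gensim import corpora, models

documents = [
    "suburban housing development and urban planning",
    "railway transport shaped suburban growth in the city",
    "public health problems in crowded urban housing",
]

# Tokenize naively on whitespace and build a bag-of-words corpus.
texts = [doc.lower().split() for doc in documents]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# Fit a two-topic model and print the highest-weighted terms per topic.
lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)
for topic_id, terms in lda.print_topics(num_words=5):
    print(topic_id, terms)
```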

Users can explore Topicgraph using JSTOR Labs’ small collection of open access scholarly books, which span a number of different disciplines, or by uploading their own PDFs for Topicgraph to analyze.

If you would like to learn how to incorporate topic modeling or other forms of text analysis into your research, contact the Scholarly Commons or visit us in the Main Library, room 306.