Using an Art Museum’s Open Data

*Edited by Billy Tringali, based on an original idea and original piece by C. Berman

As a former art history student, I’m incredibly interested in how the study of art history can be aided by the digital humanities. More and more museums have started allowing the public to access a portion of their data. When it comes to open data, museums seem to be lagging a bit behind other cultural heritage institutions, but many are providing great open data for humanists.

For art museums, the amount of data provided varies. Some museums are going the extra mile to give a lot of their metadata to the public. Others are picking and choosing aspects of their collection, such as the Museum of Modern Art’s Exhibition and Staff Histories.

Many museums, especially those that collect modern and contemporary art, can have their hands tied by copyright laws when it comes to the data they present. A few of the data sets currently available from art museums are the Cooper Hewitt’s Collection Data, the Minneapolis Institute of Arts metadata, the Rijksmuseum API, the Tate Collection metadata, and the Getty Vocabularies.

The Metropolitan Museum of Art has recently released all images of the museum’s public domain works under a Creative Commons Zero license.

More museum data can be found here!

Exploring Data Visualization #4

In this monthly series, I share a combination of cool data visualizations, useful tools and resources, and other visualization miscellany. The field of data visualization is full of experts who publish insights in books and on blogs, and I’ll be using this series to introduce you to a few of them. You can find previous posts by looking at the Exploring Data Visualization tag.

Welcome back to this blog series! Here are some of the things I read in May:

a cartoon image of a few buildings, above two cartoon characters, one who is pointing and saying "We missed people here," while the other character shrugs and says "We can't do anything about it"

from Alvin Chang at Vox, “How Republicans are undermining the 2020 census, explained with a cartoon”

1) Alvin Chang, Senior Graphics Reporter at Vox, “covers policy by making explainers with charts and cartoons.” This month he explained the precarious state of the upcoming 2020 U.S. Census.

a dual-axis line chart overlaid with a stick figure drawing of a confused person misreading the chart's data

from Lisa Charlotte Rost at Uncharted, “Why not to use two axes, and what to use instead”

2) Lisa Charlotte Rost, a designer for Datawrapper, explains why dual-axis charts are almost always terrible, and what you can use instead.

text saying "The Wisdom and/or Madness of Crowds," surrounded by a cartoon rendering of a network graph

“The Wisdom and/or Madness of Crowds,” a game created by Nicky Case

3) Play this cute game! Nicky Case combines the logic of network graphs with the science of crowds in an “explorable” that shows why some crowds generate wisdom, while others create madness.

I hope you enjoyed this data visualization news! If you have any data visualization questions, please feel free to email me and set up an appointment at the Scholarly Commons.

Exploring Data Visualization #3

In this monthly series, I share a combination of cool data visualizations, useful tools and resources, and other visualization miscellany. The field of data visualization is full of experts who publish insights in books and on blogs, and I’ll be using this series to introduce you to a few of them. You can find previous posts by looking at the Exploring Data Visualization tag.

Welcome back to this blog series! Here are some of the things I read in April:

a photograph of a knit pattern in a very strange shape, using green yarn

“Make Caows and Shapcho” pattern knit by MeganAnn (https://www.ravelry.com/projects/MeganAnn/skyknit-the-collection)

1) Janelle Shane, who has created a new kind of humor based on neural networks, trained a neural network to generate knitting patterns. Experienced knitters then attempted these patterns so we can see what the computer generated, ranging from reasonable to silly to downright creepy creations.

map showing that many areas of the United States get their first leaf earlier than in the past

from NASA Earth Observatory, “Spring is Arriving Earlier in National Parks”

2) Considering we had snowfall in April, you might not think spring began early this year (I know I don’t!). But broadly speaking, climate change has caused spring to begin earlier and earlier across the United States. The NASA Earth Observatory looked at data published in 2016 to create maps that visualize how climate change has changed the timing of spring.

3) If you want to learn a new tool but aren’t sure what to choose, have a look at Nathan Yau’s suggestions in his post What I Use to Visualize Data. He even divides his list into categories based on where he is in the process, such as initial data processing versus final visualizations.

Using Reddit’s API to Gather Text Data

The Reddit logo.

I initially started my research with an eye to using digital techniques to analyze an encyclopedia that collects a number of conspiracy theories, in order to determine what constitutes the typical features of conspiracy theories. At this point, I realize that there were two flaws in my original plan. First, as discussed in a previous blog post, the book I selected failed to provide the sort of evidence I required to establish typical features of conspiracy theories. Second, the length of the book, though sizable, was nowhere near large enough to provide a corpus that I could use a topic model on in order to derive interesting information.

My hope is that I can shift to online sources of text in order to solve both of these problems. Specifically, I will be collecting posts from Reddit. The first problem was that my original book merely stated the content of a number of conspiracy theories, without making any effort to convince the reader that they were true. As a result, there was little evidence of typical rhetorical and argumentative strategies that might characterize conspiracy theories. Reddit, on the other hand, will provide thousands of instances of people interacting in an effort to convince other Redditors of the truth or falsity of particular conspiracy theories. The sorts of strategies that were absent from the encyclopedia of conspiracy theories will, I hope, be present on Reddit.
The second problem was that the encyclopedia failed to provide a sufficient amount of text. Utilizing Reddit will certainly solve this problem; in less than twenty-four hours, there were over 1,300 comments on a recent post alone. If anything, the solution to this problem represents a whole new problem: how to deal with such a vast (and rapidly changing) body of information.

Before I worry too much about that, it is important that I be able to access the information in the first place. To do this, I’ll need to use Reddit’s API. API stands for Application Programming Interface, and it’s essentially a tool for letting a user interact with a system. In this case, the API allows a user to access information on the Reddit website. Of course, we can already do this with a web browser. The API, however, allows for more fine-grained control than a browser. When I navigate to a Reddit page with my web browser, my requests are interpreted in a very pre-scripted manner. This is convenient; when I’m browsing a website, I don’t want to have to specify what sort of information I want to see every time a new page loads. However, if I’m looking for very specific information, it can be useful to use an API to home in on just the relevant parts of the website.

For my purposes, I’m primarily interested in downloading massive numbers of Reddit posts, with just their text body, along with certain identifiers (e.g., the name of the poster, timestamp, and the relation of that post to other posts). The first obstacle to accessing the information I need is learning how to request just that particular set of information. In order to do this, I’ll need to learn how to write a request in Reddit’s API format. Reddit provides some help with this, but I’ve found these other resources a bit more helpful. The second obstacle is that I will need to write a program that automates my requests, to save myself from having to perform tens of thousands of individual requests. I will be attempting to do this in Python. While doing this, I’ll have to be sure that I abide by Reddit’s regulations for using its API. For example, a limited number of requests per minute are allowed so that the website is not overloaded. There seems to be a dearth of example code on the Internet for text acquisition of this sort, so I’ll be posting a link to any functional code I write in future posts.
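To make the two obstacles above a little more concrete, here is a minimal Python sketch of what such a program might look like. It is not the code used for this project. It assumes that Reddit serves comment threads as JSON "listings" in which each comment carries `author`, `body`, `created_utc`, `name`, and `parent_id` fields and nests its replies under `replies`; those field names, the `Throttle` class, and the helper functions are all illustrative assumptions that should be checked against Reddit's current API documentation and access rules. The requests-per-minute value is a parameter on purpose, since the actual limit is set by Reddit.

```python
import json
import time
import urllib.request
from typing import Callable, Iterator, Optional


class Throttle:
    """Space out calls so that at most `max_per_minute` happen per minute."""

    def __init__(self, max_per_minute: float,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self.min_interval = 60.0 / max_per_minute
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None

    def wait(self) -> None:
        """Sleep just long enough to respect the minimum interval."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def fetch_json(url: str, throttle: Throttle, user_agent: str):
    """Fetch one URL as JSON, waiting first so requests stay under the limit."""
    throttle.wait()
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def flatten_comments(children: list) -> Iterator[dict]:
    """Walk a comment tree and yield one flat record per comment.

    Assumes each item looks like {"kind": "t1", "data": {...}} and that
    an empty reply list is represented by an empty string.
    """
    for child in children:
        if child.get("kind") != "t1":  # skip anything that is not a comment
            continue
        data = child["data"]
        yield {
            "author": data.get("author"),
            "created_utc": data.get("created_utc"),
            "id": data.get("name"),
            "parent_id": data.get("parent_id"),
            "body": data.get("body", ""),
        }
        replies = data.get("replies")
        if isinstance(replies, dict):
            yield from flatten_comments(replies["data"]["children"])
```

In use, one `Throttle` would be shared across every call to `fetch_json`, and the comment portion of each response would be handed to `flatten_comments`. Keeping `parent_id` alongside each comment's own `id` is what preserves the relation of one post to another, so the thread structure can be rebuilt later from the flat records.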

Whimsical Data

Photograph of a Yorkshire terrier in a field of yellow flowers.

It’s finally springtime!

It’s April! After what felt like an eternity, it’s starting to warm up here at the University of Illinois at Urbana-Champaign. So today, in celebration of spring, we’re going to take a look at a few whimsical data sets that have made us laugh, smile, and think.

Dogs of NYC

Dogs of NYC was published by the NYC Department of Health and Mental Hygiene in 2013. The department collected data on 50,000 New York dogs, including their name, gender, breed, birth date, dominant, secondary and third color, and whether they are spayed/neutered or a guard dog, along with the borough they live in and their zip code. WNYC used this data to explore dog names and breeds by area, and Kaylin Pavlik used the data to show the relationship between dog names and dog breeds.

What made us laugh: How high the TF-IDF score for the name Pugsley was for Pugs as compared to other breeds.

What made us think: Does the perceived danger of a dog breed influence what people name them?
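For readers who have not met TF-IDF before, here is a small Python sketch of one common way to compute it, treating each breed as a "document" and each dog name as a "term". The names and counts below are invented toy data, not the NYC data set, and this is not necessarily the exact variant used in the analysis mentioned above.

```python
import math
from collections import Counter

# Invented toy data for illustration only (NOT from the Dogs of NYC data set).
toy_names_by_breed = {
    "Pug": ["Pugsley", "Max", "Bella", "Pugsley"],
    "Labrador": ["Max", "Bella", "Buddy"],
    "Poodle": ["Bella", "Coco", "Max"],
}


def tf_idf(name: str, breed: str, names_by_breed: dict) -> float:
    """Score how distinctive `name` is for `breed`.

    tf  = share of dogs of this breed that have the name
    idf = log(number of breeds / number of breeds where the name appears)
    """
    names = names_by_breed[breed]
    if not names:
        return 0.0
    tf = Counter(names)[name] / len(names)
    breeds_with_name = sum(1 for other in names_by_breed.values() if name in other)
    if breeds_with_name == 0:
        return 0.0
    idf = math.log(len(names_by_breed) / breeds_with_name)
    return tf * idf


pugsley_score = tf_idf("Pugsley", "Pug", toy_names_by_breed)
max_score = tf_idf("Max", "Pug", toy_names_by_breed)
```

In this toy example, "Max" appears in every breed, so its score for Pugs is zero, while "Pugsley" appears only among Pugs and scores well above it. That is the intuition behind a name like Pugsley standing out for one breed: the score rewards names that are common within a breed but rare everywhere else.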

UK Government Hospitality wine cellar annual statement

Each year, the UK publishes an annual statement on the Government Wine Cellar, which they describe as being “used to support the work of Government Hospitality in delivering business hospitality for all government ministers and departments”. The first report was published in July 2014, and the latest was published in September 2017.

What made us laugh: Government Hospitality has an advisory committee that meets four times a year and whose members are known as Masters of Wine. They are unpaid.

What made us think: With threats to government transparency across the globe, it is nice to see data published that some may brush off as inconsequential but that actually deals with large sums of money.

Most Popular Christmas Toys According to Search Data

Published by Reckless in November 2017, this data set uses search data based on the Toys R Us catalog (RIP) to show which toys, video games, and board games were most popular among different age groups. Favorite toys included the Barbie Dreamhouse, Furby Connect, Razor Crazy Cart, and R2D2 Interactive Robotic Droid.

What made us laugh: The Silly Sausage game was one of the most searched board games during this period.

What made us think: Toys play a pivotal role during childhood development. It’s a little astonishing to see that, despite all of her critics, Barbie still reigns supreme in the 2-4 year-old age group.

Do you have a favorite data set? Let us know in the comments!

Exploring Data Visualization #2

In this monthly series, I share a combination of cool data visualizations, useful tools and resources, and other visualization miscellany. The field of data visualization is full of experts who publish insights in books and on blogs, and I’ll be using this series to introduce you to a few of them. You can find previous posts by looking at the Exploring Data Visualization tag.

Welcome back to this blog series! Here are some of the things I read in March:

Chart showing that the sons of black families from the top 1 percent had about the same chance of being incarcerated on a given day as the sons of white families earning $36,000

From The New York Times, “Extensive Data Shows Punishing Reach of Racism for Black Boys”

1) The New York Times took data from a recent study about income inequality and designed a variety of compelling data visualizations. The article text and the visualizations complement each other to convey the pervasive insidiousness of racism, especially for black boys.

A chart legend with the categories

From Elijah Meeks, “Color Advice for Data Visualization with D3.js”

2) D3.js is an open-source JavaScript library that you can use to visualize data. A data visualization engineer at Netflix (what an interesting job!), Elijah Meeks provides some great advice for picking your colors in D3. More importantly, these tips are helpful no matter what visualization tool you use.

A demonstration of selecting bins for histograms, showing too few, too many, and just the right number

From Mikhail Popov, “Plotting the Course Through Charted Waters”

3) Want to learn some data visualization basics? Mikhail Popov from Wikimedia conducted a data visualization literacy workshop for Wikimedia Foundation’s All Hands 2018 staff conference, and he made the entire workshop available online.

I hope you enjoyed this data visualization news! If you have any data visualization questions, please feel free to email me and set up an appointment at the Scholarly Commons.

Meet Aaron King, Scholarly Commons GIS Consultant

picture of Aaron King, GIS Consultant

This latest installment of our series of interviews with Scholarly Commons experts and affiliates features Aaron King, GIS Consultant at the Scholarly Commons. Welcome, Aaron!


What is your background and work experience?

I am from Wisconsin originally, and studied Ecology and Evolutionary Biology at the University of Wisconsin-Whitewater. I focused on wolf and carnivore species populations in northern Wisconsin and in Yellowstone. Then my senior year, I stayed on to study Geography, which led to my career in GIS. I worked as a GIS analyst for one year while finishing up my geography degree. Afterwards, I worked at National Geographic in Washington D.C. Then, I worked as a GIS Analyst and Consultant for Intalytics in Ann Arbor, Michigan, while going to school for a Master’s in GIS and Bachelor of Science in Physics at Eastern Michigan University. I did a stint for the Department of Defense in Madison, Wisconsin. Afterwards, I took time off to become a kayak guide, and decided to finish my schooling here at the University of Illinois.

Currently I work with Remote Sensing of the environment and geostatistics.

What led you to your field?

My background in environmental and climate science, as well as my love for geography, led me into this field. I believe satellite data can be used as a tool to expand this research and hopefully contribute to science and to helping the world as a whole.

What is your research agenda?

I plan on doing research on phenology, using a variety of data science methods. Additionally, I want to explore wildfire risk, and possibly look into health characteristics of greenspaces. Currently I am pursuing my Master’s, and I hope to continue on to my PhD here as well.

Do you have any favorite work-related duties?

When you get into research or your field, your knowledge blinders become very focused on what you are doing. Being in a position like this allows me to think past what I know, and explore areas of GIS that I normally do not think about, reflecting the endless possibilities of GIS. Plus, I just find it fascinating what other people are working on, and I love being part of it.

What are some of your favorite underutilized resources that you would recommend?

Programs for GIS outside of ESRI. There are a wealth of programs, free and open-source, that work just as well but are different than the standard ESRI programs. ESRI is a great option, but the amount of data and programs out there to help you with your problem is staggering. The other resource I would recommend is taking some coding lessons, such as through DataCamp, Codecademy, SoloLearn, or Lynda, because having that underlying knowledge of how programs work helps you understand.

If you could recommend only one book to researchers starting out in the GIS field, what would it be?

There are many great books about GIS. But the book you need to read to get into geography, which is the foundation of GIS, is How to Lie with Maps by Mark Monmonier.

Honorable mention: The Nature of Maps by Arthur Robinson and Barbara Bartz Petchenik.

Note: both books are available through the University Library, here and here.

What fields can use GIS research methods?

I had a professor, in my first class, ask us this same question. His answer was, “There is not a science or business that can’t utilize GIS in some way. Your job is to find it.”

Are there any big names in your field that people should know about?

Dr. Mei-Po Kwan (she works here, tell her I say hi), Dr. Waldo Tobler, Dr. Matthew Zook, William Morris Davis, Immanuel Kant, Arthur Robinson, Michael Jordan (seriously he studied geography, look it up!).

To schedule a consultation with Aaron, contact sc@library.illinois.edu.

Data Purchase Program is Accepting Applications!

The Library’s Ninth Data Purchase Program Round is Accepting Applications!

Through the Library’s Data Purchase Program, the University Library accepts applications from campus researchers to purchase data. All applications must meet the following minimum criteria, in addition to others listed in the full program announcement.

  • The dataset must cost less than $5,000;
  • The dataset must be used for research; and
  • The Library must be able to make the data available for use by everyone at UIUC.

For some examples of past data requests supported by the Data Purchase Program, please explore the list on this page: https://www.library.illinois.edu/sc/dppdatasets

The deadline for first consideration is May 28, 2018, but applications that come in later will be considered based on availability of funds and whether the purchase can be completed by June 30, 2019.

If you have questions about the program or need help identifying data for your research, please contact the Scholarly Commons at sc@library.illinois.edu. We look forward to connecting you with the data you need!

Exploring Data Visualization

Hi everyone! As mentioned in an earlier post, I’m Megan Ozeran, the Data Analytics & Visualization Librarian in the Scholarly Commons. In this new monthly series, I will share a combination of cool data visualizations, useful tools and resources, and other visualization miscellany. The field of data visualization is full of experts who publish insights in books and on blogs, and I’ll be using this series to introduce you to a few of them.

To jump-start this series, here are a few items for February:

A Tableau dashboard analyzing baseball data with regard to African American players

Created by Yoshihito Kimura, “African American baseball players have consisitently [sic] contributed to win”

1) data.world hosted weekly data visualization events related to Black History Month. See the data and the visualizations that people have created by clicking on the dataset links on their Black History Month page. The visualization above was contributed to the Baseball Demographics project.

A movie passes the Lena Waithe Test if there's a black woman in the work, who's in a position of power, and she's in a healthy relationship.

From FiveThirtyEight, “The Next Bechdel Test”

2) FiveThirtyEight, known for telling data-rich stories with visualizations, has made it easier than ever to download their data. For instance, you can download the data behind their article “The Next Bechdel Test” and experiment with how you might visualize it differently.

An 8-bit graphic of a millennial with the caption, "Follow me as I make my way toward a stable financial future."

From HuffPost, “FML”

3) “Why millennials are facing the scariest financial future of any generation since the Great Depression.” This long, intense article combines writing and data visualization in a brand new way. I recommend viewing it in a computer browser because the mobile version may not be as easy to read.

I hope you enjoyed this data visualization news! If you have any data visualization questions, please feel free to email me and set up an appointment at the Scholarly Commons.

Endangered Data Week is Coming

The Endangered Data Week logo

Did you know that Endangered Data Week is happening from February 26 to March 2? Endangered Data Week is a collaborative effort to help highlight public datasets that are in danger of being deleted, repressed, mishandled, or lost. Inspired by recent events that have shown how fragile publicly administered data is, Endangered Data Week hopes to promote care for endangered collections by publicizing datasets and increasing engagement with them, and by advocating for political activism.

The Endangered Data Week organizers hope to cultivate a broad community of supporters of access to public data, people who advocate for open data policies and help cultivate data skills and competencies among students and colleagues. During Endangered Data Week, librarians, scholars and activists will use the #EndangeredData Twitter hashtag, as well as host events across the country.

While this is the first year of Endangered Data Week, the organizers hope both to build on the momentum of similar movements, such as Sunshine Week, Open Access Week, and #DataRescue, and to continue organizing events into the future.

What are you doing during Endangered Data Week? Let us know in the comments!