Data Feminism and Data Justice

“Data” can seem like an abstract term – What counts as data? Who decides what is counted? How is data created? What is it used for?

Outline of a figure surrounded by a pie chart, speech bubble, book, bar chart, and venn diagram to represent different types of data

“Data”. Olena Panasovska. Licensed under a CC BY license.

These questions are some of the ones you might ask when applying a Data Feminist framework to your research. Data Feminism goes beyond looking at the mechanics and logistics of data collection and analysis to uncover the influences of structural power and erasure in the collection, analysis, and application of data.

Data Feminism was developed by Catherine D’Ignazio and Lauren Klein, authors of the book Data Feminism. Their ideas are grounded in the work of Kimberlé Crenshaw, the legal scholar credited with developing the concept of intersectionality. Using this lens, they seek to uncover the ways data science has caused harm to marginalized communities and the ways data justice can be used to remedy those harms in partnership with the communities we aim to help.

The Seven Principles of Data Feminism include:

  • Examine power
  • Challenge power
  • Rethink binaries and hierarchies
  • Elevate emotion and embodiment
  • Embrace pluralism
  • Consider context
  • Make labor visible

Applying data feminist principles to your research might involve working with local communities to co-create consent forms, using data collection to fill gaps in available data about marginalized groups, prioritizing the use of open source, community-created tools, and properly acknowledging and compensating people involved in all stages of the research process. At the heart of this work is the questioning of whose interests drive research and how we can reorient those interests around social justice, equity, and community.

The Feminist Data Manifest-No, authored in part by Anita Say Chan, Associate Professor in the School of Information Sciences and the College of Media, provides additional principles to commit to in data feminist research. These resources, and the scholars and communities engaged in this work, demonstrate how data and research can be used to advance justice, reject neutrality, and prioritize those who have historically experienced the greatest harm at the hands of researchers.

The Data + Feminism Lab at the Massachusetts Institute of Technology, directed by D’Ignazio, is a research organization that “uses data and computational methods to work towards gender and racial equity, particularly as they relate to space and place”. They are members of the Design Justice Network, which seeks to bring together people interested in research that centers marginalized people and aims to address the ways research and data are used to cause harm. These groups provide examples for how to engage in data feminist and data-justice inspired research and action.

Learning how to use tools like SPSS and NVivo is an important aspect of data-related research, but thinking about the seven principles of Data Feminism can inspire us to think critically about our work and engage more fully in our communities.  For more information about data feminism, check out these resources:

Lightning Review: The GIS Guide to Public Domain Data

One of the first challenges encountered by anyone seeking to start a new GIS project is where to find good, high-quality geospatial data. The field of geographic information science has a bit of a problem: there are simultaneously too many possible data sources for any one researcher to be familiar with all of them and too few resources available to help you navigate them. Luckily, The GIS Guide to Public Domain Data is here to help!

The front cover of the book "The GIS Guide to Public Domain Data" by Joseph J. Kerski and Jill Clark.

Using Article Citations to Find Data for Social Science

Whether we like it or not, using quantitative measures in social science research has become increasingly important for getting your work published and recognized. If you’ve never used data before and don’t even know where to start, this can seem a little daunting. The good news is: You most likely won’t have to collect your own data. There is so much data already out there, but the hard part can be finding it. In this post I will explain one strategy for finding social science data: using article citations.

Looney Tunes' Wile E. Coyote searching a landscape with binoculars

You don’t have to look too far to find the right data

Featured Resource: BTAA Geoportal

We at the University of Illinois are lucky to have a library that offers access to more journals and databases than any one person could ever hope to make their way through. The downside of this much access, however, is that it can be easy for resources to get lost in the weeds. For the typical student, once you are familiar with a few databases or methods of searching for information, you tend to not seek out more unless you absolutely need to.

This week, we wanted to fight back against that tendency just a little bit by introducing you to a database that many readers may not have heard of before but that contains a veritable treasure trove of useful geographical information: the Big Ten Academic Alliance Geoportal.

This resource is a compilation of geospatial content from the 12 universities that make up the BTAA. Types of content available include maps (many of which are historic), aerial imagery, and geospatial data. Researchers with a specific need for one of those can easily navigate from the Geoportal homepage to a more specific resource page by selecting the type of information they are looking for here:

A screenshot from the BTAA Geoportal, displaying icons to click on for "Geospatial Data," "Maps," and "Aerial Imagery."

Alternatively, if you don’t particularly care about the type of data you find but rather are looking for data in a particular region, you can use the map on the left side of the display to easily zoom in to a particular part of the world and see what maps and other resources are available.

A screenshot from the BTAA Geoportal showing a world map with numbers in orange, yellow, and green circles scattered around the map.

The numbers on the map represent the number of maps or other data in the Geoportal localized in each rough region of the world. For example, there are 310 maps for Europe and 14 maps for the Atlantic Ocean. As you zoom in on the map, your options get more specific, and the numbers break down to smaller geographic regions:

A close-up of Europe on the same map as above, showing that the one "310" circle on the world map is now divided into many smaller numbered circles around the continent.

When the map is zoomed in close enough that there is only one piece of data for a particular area, the circled numbers are replaced with a blue location icon, such as the ones displayed over Iceland, Sweden, and the Russia-Finland border above. Clicking on one of these icons will take you to a page with the specific image or data source represented on the map. For example, the icon over Iceland takes us to the following page:

A screenshot from the BTAA Geoportal showing a historic map of Iceland with some metadata below.

Information is provided about what type of resource you’re looking at, who created it, what time period it is from, as well as which BTAA member institution uploaded the map (in this case, the University of Minnesota). 

Other tools on the home page, including a search bar and lists of places and subjects represented in the Geoportal, mean that no matter what point you’re starting from you should have no problem finding the data you need!

The Geoportal also maintains a blog with news, featured items and more, so be sure to check it out and keep up-to-date on all things geospatial!

Do you have questions about using the Geoportal, or finding other geospatial data? Stop by the Scholarly Commons or send us an email, and we’ll be happy to help you!

Cool Text Data – Music, Law, and News!

Computational text analysis can be done in virtually any field, from biology to literature. You may use topic modeling to determine which areas are the most heavily researched in your field, or attempt to determine the author of an orphan work. Where can you find text to analyze? So many places! Read on for sources to find unique text content.

Woman with microphone

Genius – the song lyrics database

Genius started as Rap Genius, a site where rap fans could gather to annotate and analyze rap lyrics. It expanded to include other genres in 2014, and now manages a massive database covering Ariana Grande to Fleetwood Mac, and includes both lyrics and fan-submitted annotations. All of this text can be downloaded and analyzed using the Genius API. Using Genius and a text mining method, you could see how themes present in popular music changed over recent years, or understand a particular artist’s creative process.

homepage of, with Ohio highlighted, 147,692 unique cases. 31 reporters. 713,568 pages scanned.

Caselaw Access Project – the case law database

The Caselaw Access Project (CAP) is a fairly recent project that is still ongoing, and publishes machine-readable text digitized from over 40,000 bound volumes of case law from the Harvard Law School Library. The earliest case is from 1658, with the most recent cases from June 2018. An API and bulk data downloads make it easy to get this text data. What can you do with huge amounts of case law? Well, for starters, you can generate a unique case law limerick:

Wheeler, and Martin McCoy.
Plaintiff moved to Illinois.
A drug represents.
Pretrial events.
Rocky was just the decoy.

Check out the rest of their gallery for more project ideas.

Newspapers and More

There are many places you can get text from digitized newspapers, both recent and historical. Some newspapers are hundreds of years old, so there can be problems with the OCR (Optical Character Recognition) that will make it difficult to get accurate results from your text analysis. Making newspaper text machine-readable requires special attention, since newspapers are printed on thin paper and have possibly been stacked up in a dusty closet for 60 years! See OCR considerations here, but the newspaper text described here is already machine-readable and ready for text mining. However, as with any text mining project, you must pay close attention to the quality of your text.

The Chronicling America project sponsored by the Library of Congress contains digital copies of newspapers with machine-readable text from all over the United States and its territories, from 1690 to today. Using newspaper text data, you can analyze how topics discussed in newspapers change over time, among other things.
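As a rough illustration of tracking a topic over time, the sketch below counts how often a single term appears per 1,000 words in each year. The `pages` list and the `term_share_by_year` function are hypothetical stand-ins for text you have already downloaded; nothing here calls a real newspaper API, and a simple word count is only a crude proxy for a topic.

```python
import re
from collections import Counter

# Hypothetical records: (publication year, OCR text) pairs standing in for
# newspaper pages that have already been downloaded.
pages = [
    (1918, "Influenza closes schools; influenza wards are full."),
    (1918, "The war news dominates the front page."),
    (1919, "Schools reopen as influenza cases fall."),
]


def term_share_by_year(pages, term):
    """Return {year: occurrences of `term` per 1,000 words}, sorted by year."""
    hits = Counter()
    totals = Counter()
    for year, text in pages:
        # Lowercase and keep alphabetic runs only; OCR noise would need more care.
        words = re.findall(r"[a-z]+", text.lower())
        totals[year] += len(words)
        hits[year] += words.count(term.lower())
    return {
        year: 1000 * hits[year] / totals[year]
        for year in sorted(totals)
        if totals[year]
    }


for year, rate in term_share_by_year(pages, "influenza").items():
    print(year, round(rate, 1))
```

Normalizing by the total word count per year matters because the amount of digitized text usually differs a great deal from one year to the next.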

newspapers being printed quickly on a rolling press

Looking for newspapers from a different region? The library has contracts with several vendors to conduct text mining, including Gale and ProQuest. Both provide newspaper text suitable for text mining, from The Daily Mail of London (Gale), to the Chinese Newspapers Collection (ProQuest). The way you access the text data itself will differ between the two vendors, and the library will certainly help you navigate the collections. See the Finding Text Data library guide for more information.

The sources mentioned above are just highlights of our text data collection! The Illinois community has access to a huge amount of text, including newspapers and primary sources, but also research articles and books! Check out the Finding Text Data library guide for a more complete list of sources. And, when you’re ready to start your text mining project, contact the Scholarly Commons and let us help you get started!

Wikidata and Wikidata Human Gender Indicators (WHGI)

Wikipedia is a central player in online knowledge production and sharing. Since its founding in 2001, Wikipedia has been committed to open access and open editing, which has made it the most popular reference work on the web. Though students are still warned away from using Wikipedia as a source in their scholarship, it presents well-researched information in an accessible and ostensibly democratic way.

Most people know Wikipedia from its high ranking in most internet searches and tend to use it for its encyclopedic value. The Wikimedia Foundation—which runs Wikipedia—has several other projects which seek to provide free access to knowledge. Among those are Wikimedia Commons, which offers free photos; Wikiversity, which offers free educational materials; and Wikidata, which provides structured data to support the other wikis.

The Wikidata logo

Wikidata provides structured data to support Wikipedia and other Wikimedia Foundation projects

Wikidata is a great tool to study how Wikipedia is structured and what information is available through the online encyclopedia. Since it is presented as structured data, it can be analyzed quantitatively more easily than Wikipedia articles. This has led to many projects that allow users to explore data through visualizations, queries, and other means. Wikidata offers a page of Tools that can be used to analyze Wikidata more quickly and efficiently, as well as Data Access instructions for how to use data from the site.

The webpage for the Wikidata Human Gender Indicators project

The home page for the Wikidata Human Gender Indicators project

An example of a project born out of Wikidata is the Wikidata Human Gender Indicators (WHGI) project. The project uses metadata from Wikidata entries about people to analyze trends in gender disparity over time and across cultures. The project presents the raw data for download, as well as charts and an article written about the discoveries the researchers made while compiling the data. Some of the visualizations they present are confusing (perhaps they could benefit from reading our Lightning Review of Data Visualization for Success), but they succeed in conveying important trends that reveal a bias toward articles about men, as well as an interesting phenomenon surrounding celebrities. Some regions will have a better ratio of women to men biographies due to many articles being written about actresses and female musicians, which reflects cultural differences surrounding fame and gender.

Of course, like many data sources, Wikidata is not perfect. The creators of the WHGI project frequently discovered that articles did not have complete metadata related to gender or nationality, which greatly influenced their ability to analyze the trends present on Wikipedia related to those areas. Since Wikipedia and Wikidata are open to editing by anyone and are governed by practices that the community has agreed upon, it is important for Wikipedians to consider including more metadata in their articles so that researchers can use that data in new and exciting ways.
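To make the effect of missing metadata concrete, here is a minimal sketch of the kind of calculation such a project might run. It is not the WHGI code: the `biographies` records, the field names, and the `women_share_by_decade` function are all hypothetical. Records with no gender value are counted separately instead of being silently dropped, and any recorded gender other than "female" still counts toward the denominator.

```python
from collections import defaultdict

# Hypothetical biography metadata; None marks a missing value.
biographies = [
    {"gender": "female", "birth_decade": 1950},
    {"gender": "male", "birth_decade": 1950},
    {"gender": "male", "birth_decade": 1950},
    {"gender": None, "birth_decade": 1950},
    {"gender": "female", "birth_decade": 1980},
    {"gender": "male", "birth_decade": None},
]


def women_share_by_decade(records):
    """Summarize, per birth decade, the share of women among known genders."""
    known = defaultdict(int)
    women = defaultdict(int)
    missing_gender = defaultdict(int)
    for rec in records:
        decade = rec.get("birth_decade")
        if decade is None:
            continue  # cannot place this record on the timeline
        gender = rec.get("gender")
        if gender is None:
            missing_gender[decade] += 1
            continue
        known[decade] += 1
        if gender == "female":
            women[decade] += 1
    decades = sorted(set(known) | set(missing_gender))
    return {
        d: {
            "share_women": women[d] / known[d] if known[d] else None,
            "known": known[d],
            "missing_gender": missing_gender[d],
        }
        for d in decades
    }


for decade, summary in women_share_by_decade(biographies).items():
    print(decade, summary)
```

Reporting the number of records with missing gender next to each ratio shows how much of the picture rests on incomplete metadata.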

An animated gif of the Wikipedia logo bouncing like a ball

New Uses for Old Technology at the Arctic World Archive

In this era of rapid technological change, it is easy to fall into the mindset that the “big new thing” is always an improvement on the technology that came before it. Certainly this is often true, and here in the Scholarly Commons we are always seeking innovative new tools to help you out with your research. However, every now and then it’s nice to just slow down and take the time to appreciate the strengths and benefits of older technology that has largely fallen out of use.

A photo of the arctic

There is perhaps no better example of this than the Arctic World Archive, a facility on the Norwegian archipelago of Svalbard. Opened in 2017, the Arctic World Archive seeks to preserve the world’s most important cultural, political, and literary works in a way that will ensure that no manner of catastrophe, man-made or otherwise, could destroy them.

If this is all sounding familiar to you, that’s because you’ve probably heard of the Arctic World Archive’s older sibling, the Svalbard Global Seed Vault. The Global Seed Vault, which is much better known and older than the Arctic World Archive, is an archive of seeds from around the world, meant to ensure that humanity would be able to continue growing crops and making food in the event of a catastrophe that wipes out plant life.

Indeed, the two archives have a lot in common. The World Archive is housed deep within a mountain in an abandoned coal mine that once served as the location of the seed vault, and was founded to be for cultural heritage what the seed vault is for crops. But the Arctic World Archive has made innovative use of old technology that makes it an impressive site in its own right.

A photo of the arctic

Perhaps the coolest (pun intended) aspect of the Arctic World Archive is the fact that it does not require electricity to operate. Its extreme northern location (it is near the northernmost town of at least 1,000 people in the world) means that the temperature inside the facility is naturally very cold year-round. As any archivist or rare book librarian who brings a fleece jacket to work in the summer will happily tell you, colder temperatures are ideal for preserving documents, and the ability to store items in a very cold climate without the use of electricity makes the World Archive perfect for sustainable, long-term storage.

But that’s not all: in a real blast from the past, all information stored in this facility is kept on microfilm. Now, I know what you’re thinking: “it’s the 21st century, grandpa! No one uses microfilm anymore!”

It’s true that microfilm is used by a very small minority of people nowadays, but nevertheless it offers distinct advantages that newer digital media just can’t compete with. For example, microfilm is rated to last for at least 500 years without corruption, whereas digital files may not last anywhere near that long. Beyond that, the film format means that the archive is totally independent from the internet, and will outlast any major catastrophe that disrupts part or all of our society’s web capabilities.

A photo of a seal

The Archive is still growing, but it is already home to film versions of Edvard Munch’s The Scream, Dante’s The Divine Comedy, and an assortment of government documents from many countries including Norway, Brazil, and the United States.

As it continues to grow, its importance as a place of safekeeping for the world’s cultural heritage will hopefully serve as a reminder that sometimes, older technology has upsides that new tech just can’t compete with.

Using an Art Museum’s Open Data

*Edits on original idea and original piece by C. Berman by Billy Tringali

As a former art history student, I’m incredibly interested in how the study of art history can be aided by the digital humanities. More and more museums have started allowing the public to access a portion of their data. When it comes to open data, museums seem to be lagging a bit behind other cultural heritage institutions, but many are providing great open data for humanists.

For art museums, the range of data provided varies. Some museums are going the extra mile to give a lot of their metadata to the public. Others are picking and choosing aspects of their collection, such as the Museum of Modern Art’s Exhibition and Staff Histories.

Many museums, especially those that collect modern and contemporary art, can have their hands tied by copyright laws when it comes to the data they present. A few of the data sets currently available from art museums are the Cooper Hewitt’s Collection Data, the Minneapolis Institute of Arts metadata, the Rijksmuseum API, the Tate Collection metadata, and the Getty Vocabularies.

The Metropolitan Museum of Art has recently released all images of the museum’s public domain works under a Creative Commons Zero license.

More museum data can be found here!

Using Reddit’s API to Gather Text Data

The Reddit logo.

I initially started my research with an eye to using digital techniques to analyze an encyclopedia that collects a number of conspiracy theories in order to determine what constitutes typical features of conspiracy theories. At this point, I realize there were two flaws in my original plan. First, as discussed in a previous blog post, the book I selected failed to provide the sort of evidence I required to establish typical features of conspiracy theories. Second, the length of the book, though sizable, was nowhere near large enough to provide a corpus that I could use a topic model on in order to derive interesting information.

My hope is that I can shift to online sources of text in order to solve both of these problems. Specifically, I will be collecting posts from Reddit. The first problem was that my original book merely stated the content of a number of conspiracy theories, without making any effort to convince the reader that they were true. As a result, there was little evidence of typical rhetorical and argumentative strategies that might characterize conspiracy theories. Reddit, on the other hand, will provide thousands of instances of people interacting in an effort to convince other Redditors of the truth or falsity of particular conspiracy theories. The sorts of strategies that were absent from the encyclopedia of conspiracy theories will, I hope, be present on Reddit.

The second problem was that the encyclopedia failed to provide a sufficient amount of text. Utilizing Reddit will certainly solve this problem; in less than twenty-four hours, there were over 1,300 comments on a recent post alone. If anything, the solution to this problem represents a whole new problem: how to deal with such a vast (and rapidly changing) body of information.

Before I worry too much about that, it is important that I be able to access the information in the first place. To do this, I’ll need to use Reddit’s API. API stands for Application Programming Interface, and it’s essentially a tool for letting a user interact with a system. In this case, the API allows a user to access information on the Reddit website. Of course, we can already do this with a web browser. The API, however, allows for more fine-grained control than a browser. When I navigate to a Reddit page with my web browser, my requests are interpreted in a very pre-scripted manner. This is convenient; when I’m browsing a website, I don’t want to have to specify what sort of information I want to see every time a new page loads. However, if I’m looking for very specific information, it can be useful to use an API to home in on just the relevant parts of the website.
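To give a sense of what "specifying what I want" looks like, here is a small sketch that builds, but does not send, a request for one post's comments as JSON. The URL pattern, the `limit` parameter, the subreddit name, the post ID `abc123`, and the User-Agent string are all assumptions for illustration and should be checked against Reddit's own API documentation before use.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_comments_request(subreddit, post_id, limit=100):
    """Build (but do not send) a request for one post's comments as JSON."""
    # Query parameters spell out what a browser would otherwise decide for us.
    query = urlencode({"limit": limit})
    url = f"https://www.reddit.com/r/{subreddit}/comments/{post_id}.json?{query}"
    # A descriptive User-Agent identifies the script; this one is a placeholder.
    return Request(url, headers={"User-Agent": "text-research-sketch/0.1"})


# "abc123" is a made-up post ID used only to show the resulting URL.
request = build_comments_request("conspiracy", "abc123")
print(request.full_url)
```

Keeping the request-building step separate from the step that actually sends it makes it easy to inspect exactly what will be asked for before any traffic reaches the site.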

For my purposes, I’m primarily interested in downloading massive numbers of Reddit posts, with just their text body, along with certain identifiers (e.g., the name of the poster, timestamp, and the relation of that post to other posts). The first obstacle to accessing the information I need is learning how to request just that particular set of information. In order to do this, I’ll need to learn how to write a request in Reddit’s API format. Reddit provides some help with this, but I’ve found these other resources a bit more helpful. The second obstacle is that I will need to write a program that automates my requests, to save myself from having to perform tens of thousands of individual requests. I will be attempting to do this in Python. While doing this, I’ll have to be sure that I abide by Reddit’s regulations for using its API. For example, a limited number of requests per minute are allowed so that the website is not overloaded. There seems to be a dearth of example code on the Internet for text acquisition of this sort, so I’ll be posting a link to any functional code I write in future posts.
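The two obstacles can be sketched in Python without touching the network. The first function pulls just the text body and identifiers out of a nested comment tree; the `sample` data is invented, and its shape (a "Listing" whose children have kind "t1" for comments and "more" for placeholders) is my assumption about how Reddit's JSON responses are structured. The `Throttle` class is a hypothetical helper that spaces calls out; the right interval has to come from Reddit's current API rules, not from this sketch.

```python
import time


def flatten_comments(listing, rows=None):
    """Walk a comment Listing and collect one row per comment, replies included."""
    if rows is None:
        rows = []
    for child in listing.get("data", {}).get("children", []):
        if child.get("kind") != "t1":  # skip "more" placeholders and non-comments
            continue
        data = child["data"]
        rows.append({
            "id": data.get("id"),
            "author": data.get("author"),
            "created_utc": data.get("created_utc"),
            "parent_id": data.get("parent_id"),
            "body": data.get("body"),
        })
        replies = data.get("replies")
        if isinstance(replies, dict):  # an empty string means there are no replies
            flatten_comments(replies, rows)
    return rows


class Throttle:
    """Sleep as needed so that calls to wait() are at least min_interval seconds apart."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._last = None

    def wait(self):
        if self._last is not None:
            remaining = self.min_interval - (time.monotonic() - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()


# Invented sample data in the assumed response shape.
sample = {
    "kind": "Listing",
    "data": {"children": [
        {"kind": "t1", "data": {
            "id": "c1", "author": "user_a", "created_utc": 1500000000.0,
            "parent_id": "t3_abc123", "body": "Top-level comment",
            "replies": {"kind": "Listing", "data": {"children": [
                {"kind": "t1", "data": {
                    "id": "c2", "author": "user_b", "created_utc": 1500000060.0,
                    "parent_id": "t1_c1", "body": "A reply", "replies": ""}},
            ]}},
        }},
        {"kind": "more", "data": {"children": ["c3"]}},
    ]},
}

for row in flatten_comments(sample):
    print(row["id"], row["parent_id"], row["body"])
```

In a real script, `Throttle.wait()` would be called immediately before each request, and the `parent_id` field is what lets the conversation structure be rebuilt later from the flat rows.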