Exploring Data Visualization #11

In this monthly series, I share a combination of cool data visualizations, useful tools and resources, and other visualization miscellany. The field of data visualization is full of experts who publish insights in books and on blogs, and I’ll be using this series to introduce you to a few of them. You can find previous posts by looking at the Exploring Data Visualization tag.

Data Visualization Office Hours and Workshops

A headshot of Megan Ozeran with a border above her reading Data Viz Help and a banner below that reads The Librarian is In

Our amazing Data Visualization Librarian Megan Ozeran is holding open office hours every other Monday for the Spring 2019 semester! Drop by the Scholarly Commons from 2-4 on any of the dates listed below to ask any data viz questions you might have.

Office hours on: February 25, March 11, March 25, April 8, April 22, and May 6.

Additionally, Megan will teach a joint workshop as part of our Savvy Researcher series titled “Network Analysis in Digital Humanities” on Thursday, March 7th. Megan and SC GA Kayla Abner will cover the basics of how to use NodeXL, Palladio, and Cytoscape to show relationships between concepts in your research. Register online on our Savvy Researcher Calendar!

Lifespan of News Stories

A chart showing the search interest for different news stories in October 2018, represented as colored peaks with the apex labeled with a world event.

October was one of the busier times of the year, with eight overlapping news stories. Hurricane Michael tied with Hurricane Florence for the largest number of searches in 2018.

According to trends compiled by the news site Axios, “news cycles for some of the biggest moments of 2018 only lasted for a median of 7 days.” Axios put together a timeline of the year which shows the peaks and valleys of 49 of the top news stories from 2018. A simplified view of the year in the article “What captured America’s attention in 2018” shows the distribution of those 49 stories, while a full site, “The Lifespan of News Stories,” shows search interest by region and links to an article from Axios about the event (clever advertising on their part).

#SWDchallenge: visualize variance

A graph showing the average minimum temperature in Milwaukee, Wisconsin, for January 2000 through January 2019. The points on the chart are connected with light blue lines and filled in with blue to resemble icicles.

Knaflic’s icicle-style design for minimum temperature.

If there were a search interest visualization for the past few weeks in the Midwest, I have no doubt that the highest peak would be for the term “polar vortex.” The weather so far this year has been unusual, thanks to the extreme cold the polar vortex brought in the last week of January. Cole Nussbaumer Knaflic of Storytelling with Data used the cold snap as inspiration for this month’s #SWDchallenge: visualize variance. Knaflic went through a series of visualizations in a blog post to show variation in average temperature in Milwaukee.

I hope you enjoyed this data visualization news! If you have any data visualization questions, please feel free to email the Scholarly Commons.

Cool Text Data – Music, Law, and News!

Computational text analysis can be done in virtually any field, from biology to literature. You might use topic modeling to determine which areas are the most heavily researched in your field, or try to identify the author of an orphan work. Where can you find text to analyze? So many places! Read on for sources of unique text content.

Woman with microphone

Genius – the song lyrics database

Genius started as Rap Genius, a site where rap fans could gather to annotate and analyze rap lyrics. It expanded to include other genres in 2014, and now manages a massive database covering everything from Ariana Grande to Fleetwood Mac, including both lyrics and fan-submitted annotations. All of this text can be downloaded and analyzed using the Genius API. Using Genius and a text mining method, you could see how themes in popular music have changed over recent years, or better understand a particular artist’s creative process.
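
If you want to experiment with the Genius API, a minimal Python sketch might look like the following. It assumes you have registered a (free) API client to obtain an access token; the `/search` endpoint and response shape follow the public API docs, but double-check them before relying on this.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.genius.com"

def build_search_request(term, token):
    """Assemble an authenticated request for the Genius /search endpoint."""
    url = f"{API_BASE}/search?{urllib.parse.urlencode({'q': term})}"
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})

def search_songs(term, token):
    """Return the full titles of songs whose metadata matches the search term."""
    with urllib.request.urlopen(build_search_request(term, token)) as resp:
        data = json.load(resp)
    return [hit["result"]["full_title"] for hit in data["response"]["hits"]]
```

Note that the search endpoint returns song metadata and annotations; depending on what you need, you may have to follow each song’s URL for the full lyrics.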

homepage of case.law, with Ohio highlighted, 147,692 unique cases. 31 reporters. 713,568 pages scanned.

Homepage of case.law

Case.law – the case law database

The Caselaw Access Project (CAP) is a fairly recent project that is still ongoing, and publishes machine-readable text digitized from over 40,000 bound volumes of case law from the Harvard Law School Library. The earliest case is from 1658, with the most recent cases from June 2018. An API and bulk data downloads make it easy to get this text data. What can you do with huge amounts of case law? Well, for starters, you can generate a unique case law limerick:
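
To get a feel for the API, here is a hedged Python sketch; the `/v1/cases/` endpoint and the `search` and `jurisdiction` parameters are taken from the CAP documentation, but verify them against the current docs before building on this.

```python
import json
import urllib.parse
import urllib.request

def cap_search_url(query, jurisdiction=None, page_size=10):
    """Build a Caselaw Access Project /v1/cases/ search URL."""
    params = {"search": query, "page_size": page_size}
    if jurisdiction:
        params["jurisdiction"] = jurisdiction
    return "https://api.case.law/v1/cases/?" + urllib.parse.urlencode(params)

def fetch_cases(query, **kwargs):
    """Fetch one page of matching cases as parsed JSON."""
    with urllib.request.urlopen(cap_search_url(query, **kwargs)) as resp:
        return json.load(resp)["results"]
```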

Wheeler, and Martin McCoy.
Plaintiff moved to Illinois.
A drug represents.
Pretrial events.
Rocky was just the decoy.

Check out the rest of their gallery for more project ideas.

Newspapers and More

There are many places you can get text from digitized newspapers, both recent and historical. Some newspapers are hundreds of years old, so there can be problems with the OCR (Optical Character Recognition) that will make it difficult to get accurate results from your text analysis. Making newspaper text machine-readable requires special attention, since newspapers are printed on thin paper and may have been stacked up in a dusty closet for 60 years! See OCR considerations here; the newspaper text described below is already machine-readable and ready for text mining. Even so, with any text mining project, you must pay close attention to the quality of your text.

The Chronicling America project sponsored by the Library of Congress contains digital copies of newspapers with machine-readable text from all over the United States and its territories, from 1690 to today. Using newspaper text data, you can analyze how topics discussed in newspapers change over time, among other things.
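
As a toy illustration of tracking a topic over time, suppose you have pulled OCR text and publication years from Chronicling America (adding format=json to a search URL returns machine-readable results). A simple term count by year might look like this; the sample pages below are invented for the example:

```python
from collections import Counter

def mentions_by_year(pages, term):
    """Count occurrences of a term in OCR text, grouped by publication year."""
    counts = Counter()
    for year, text in pages:
        counts[year] += text.lower().count(term.lower())
    return dict(counts)

# Invented sample pages, standing in for real OCR text:
pages = [
    (1896, "The bicycle craze sweeps the city; bicycle clubs multiply."),
    (1910, "The automobile replaces the bicycle on many streets."),
]
print(mentions_by_year(pages, "bicycle"))  # {1896: 2, 1910: 1}
```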

newspapers being printed quickly on a rolling press

Looking for newspapers from a different region? The library has contracts with several vendors to conduct text mining, including Gale and ProQuest. Both provide newspaper text suitable for text mining, from The Daily Mail of London (Gale), to the Chinese Newspapers Collection (ProQuest). The way you access the text data itself will differ between the two vendors, and the library will certainly help you navigate the collections. See the Finding Text Data library guide for more information.

The sources mentioned above are just highlights of our text data collection! The Illinois community has access to a huge amount of text, including newspapers and primary sources, but also research articles and books! Check out the Finding Text Data library guide for a more complete list of sources. And, when you’re ready to start your text mining project, contact the Scholarly Commons (sc@library.illinois.edu), and let us help you get started!

Wikidata and Wikidata Human Gender Indicators (WHGI)

Wikipedia is a central player in online knowledge production and sharing. Since its founding in 2001, Wikipedia has been committed to open access and open editing, which has made it the most popular reference work on the web. Though students are still warned away from using Wikipedia as a source in their scholarship, it presents well-researched information in an accessible and ostensibly democratic way.

Most people know Wikipedia from its high ranking in most internet searches and tend to use it for its encyclopedic value. The Wikimedia Foundation—which runs Wikipedia—has several other projects which seek to provide free access to knowledge. Among those are Wikimedia Commons, which offers free photos; Wikiversity, which offers free educational materials; and Wikidata, which provides structured data to support the other wikis.

The Wikidata logo

Wikidata provides structured data to support Wikimedia and other Wikimedia Foundation projects

Wikidata is a great tool to study how Wikipedia is structured and what information is available through the online encyclopedia. Since it is presented as structured data, it can be analyzed quantitatively more easily than Wikipedia articles. This has led to many projects that allow users to explore data through visualizations, queries, and other means. Wikidata offers a page of Tools that can be used to analyze Wikidata more quickly and efficiently, as well as Data Access instructions for how to use data from the site.
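
For example, the public SPARQL endpoint at query.wikidata.org lets you query Wikidata directly. The sketch below builds a request URL for a query that counts humans by gender (Q5 = human, P31 = instance of, P21 = sex or gender are well-known Wikidata identifiers); note that a count over every human may hit the endpoint’s timeout, so add a LIMIT while experimenting.

```python
import urllib.parse

# Well-known Wikidata IDs: Q5 = human, P31 = instance of, P21 = sex or gender.
QUERY = """
SELECT ?gender (COUNT(?person) AS ?total) WHERE {
  ?person wdt:P31 wd:Q5 ;
          wdt:P21 ?gender .
}
GROUP BY ?gender
"""

def wikidata_query_url(sparql):
    """Build a URL for the Wikidata SPARQL endpoint, requesting JSON results."""
    params = urllib.parse.urlencode({"query": sparql, "format": "json"})
    return "https://query.wikidata.org/sparql?" + params
```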

The webpage for the Wikidata Human Gender Indicators project

The home page for the Wikidata Human Gender Indicators project

An example of a project born out of Wikidata is the Wikidata Human Gender Indicators (WHGI) project. The project uses metadata from Wikidata entries about people to analyze trends in gender disparity over time and across cultures. The project presents the raw data for download, as well as charts and an article written about the discoveries the researchers made while compiling the data. Some of the visualizations they present are confusing (perhaps they could benefit from reading our Lightning Review of Data Visualization for Success), but they succeed in conveying important trends that reveal a bias toward articles about men, as well as an interesting phenomenon surrounding celebrities. Some regions will have a better ratio of women to men biographies due to many articles being written about actresses and female musicians, which reflects cultural differences surrounding fame and gender.

Of course, like many data sources, Wikidata is not perfect. The creators of the WHGI project frequently discovered that articles did not have complete metadata related to gender or nationality, which greatly influenced their ability to analyze the trends present on Wikipedia related to those areas. Since Wikipedia and Wikidata are open to editing by anyone and are governed by practices that the community has agreed upon, it is important for Wikipedians to consider including more metadata in their articles so that researchers can use that data in new and exciting ways.

An animated gif of the Wikipedia logo bouncing like a ball

Lightning Review: Data Visualization for Success

Data visualization is where the humanities and sciences meet: viewers are dazzled by the presentation yet informed by research. Lovingly referred to as “the poster child of interdisciplinarity” by Steven Braun, data visualization brings these two fields closer together than ever to help provide insights that may have been impossible without the other. In his book Data Visualization for Success, Braun sits down with forty designers with experience in the field to discuss their approaches to data visualization, common techniques in their work, and tips for beginners.

Braun’s collection of interviews provides an accessible introduction to data visualization. Not only is the book filled with rich images, but each interview is short and meant to offer an individual’s perspective on their own work and the field at large. Each interview begins with a general question about data visualization, contributing to the perpetual debate over what data visualization is and what it can become.

Picture of Braun's "Data Visualization for Success"

Antonio Farach, one of the designers interviewed in the book, calls data visualization “the future of storytelling.” And when you see his work – or really any of the work in this book – you can see why. Each new image has an immediate draw, but it is impossible to move past without exploring a rich narrative. Visualizations in this book cover topics ranging from soccer matches to classic literature, economic disparities, selfie culture, and beyond.

Each interview ends by asking the designer for their advice to beginners, which not only invites new scholars and designers to participate in the field but also dispels any doubt about the hard work put in by these designers or the science at the root of it all. Barbara Hahn and Christine Zimmermann of Hahn+Zimmermann may have put it best: “Data visualization is not making boring data look fancy and interesting. Data visualization is about communicating specific content and giving equal weight to information and aesthetics.”

A leisurely, stunning, yet informative read, Data Visualization for Success offers anyone interested in this explosive field an insider’s look from voices around the world. Drop by the Scholarly Commons during our regular hours to flip through this wonderful read.

And finally, if you have any further interest in data visualization make sure you stay up to date on our Exploring Data Visualization series or take a look at what services the Scholarly Commons provides!

Whimsical Data

Photograph of a Yorkshire terrier in a field of yellow flowers.

It’s finally springtime!

It’s April! After what felt like an eternity, it’s starting to warm up here at the University of Illinois at Urbana-Champaign. So today, in celebration of spring, we’re going to take a look at a few whimsical data sets that have made us laugh, smile, and think.

Dogs of NYC

Dogs of NYC was published by the NYC Department of Health and Mental Hygiene in 2013. The department collected data on 50,000 New York dogs, including their name, gender, breed, birth date, dominant, secondary, and third colors, and whether they are spayed/neutered or a guard dog, along with the borough they live in and their zip code. WNYC used this data to explore dog names and breeds by area, and Kaylin Pavlik used the data to show the relationship between dog names and dog breeds.

What made us laugh: How high the TF-IDF score for the name Pugsley was for Pugs as compared to other breeds.

What made us think: Does the perceived danger of a dog breed influence what people name them?
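
The TF-IDF score mentioned above rewards names that are frequent within one breed but rare across breeds. A toy version, with made-up registration counts, might look like this:

```python
import math

def tf_idf(counts_by_breed, name, breed):
    """Toy TF-IDF: the name's share of one breed's names, discounted by how
    many breeds use that name at all."""
    counts = counts_by_breed[breed]
    tf = counts.get(name, 0) / sum(counts.values())
    breeds_with_name = sum(1 for c in counts_by_breed.values() if name in c)
    idf = math.log(len(counts_by_breed) / breeds_with_name)
    return tf * idf

# Made-up counts, just to illustrate the idea:
data = {
    "Pug": {"Pugsley": 30, "Max": 20},
    "Beagle": {"Max": 40, "Buddy": 10},
    "Poodle": {"Bella": 25, "Max": 25},
}
```

“Max” appears in every breed, so its IDF (and therefore its score) is zero, while “Pugsley” scores highly for Pugs.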

UK Government Hospitality wine cellar annual statement

Each year, the UK publishes an annual statement on the Government Wine Cellar, which they describe as being “used to support the work of Government Hospitality in delivering business hospitality for all government ministers and departments”. The first report was published in July 2014, and the latest was published in September 2017.

What made us laugh: Government Hospitality has an advisory committee that meets four times a year; its members are known as Masters of Wine. They are unpaid.

What made us think: With threats to government transparency across the globe, it is nice to see data that some may brush off as inconsequential, but actually deals with large sums of money.

Most Popular Christmas Toys According to Search Data

Published by Reckless in November 2017, this data set shows search data based on the Toys R Us catalog (RIP) that shows which toys, video games, and board games were most popular among different age groups. Favorite toys included the Barbie Dreamhouse, Furby Connect, Razor Crazy Cart, and R2D2 Interactive Robotic Droid.

What made us laugh: The Silly Sausage game was one of the most searched board games during this period.

What made us think: Toys play a pivotal role during childhood development. It’s a little astonishing to see that, despite all of her critics, Barbie still reigns supreme in the 2-4 year-old age group.

Do you have a favorite data set? Let us know in the comments!

Endangered Data Week is Coming

The Endangered Data Week logo

Did you know that Endangered Data Week is happening from February 26-March 2? Endangered Data Week is a collaborative effort to highlight public datasets that are in danger of being deleted, repressed, mishandled, or lost. Inspired by recent events that have shown how fragile publicly administered data is, Endangered Data Week hopes to promote care for endangered collections by publicizing datasets, increasing engagement with them, and encouraging political advocacy.

The Endangered Data Week organizers hope to cultivate a broad community of supporters who value access to public data, advocate for open data policies, and help build data skills and competencies among students and colleagues. During Endangered Data Week, librarians, scholars, and activists will use the #EndangeredData Twitter hashtag, as well as host events across the country.

While this is the first year of Endangered Data Week, the organizers hope both to build on the momentum of similar movements, such as Sunshine Week, Open Access Week, and #DataRescue, and to continue organizing events into the future.

What are you doing during Endangered Data Week? Let us know in the comments!

Preparing Your Data for Topic Modeling

In keeping with my series of blog posts on my research project, this post is about how to prepare your data for input into a topic modeling package. I used Twitter data in my project, which is relatively sparse at only 140 characters per tweet, but the principles can be applied to any document or set of documents that you want to analyze.

Topic Models:

Topic models work by identifying and grouping words that co-occur into “topics.” As David Blei writes, Latent Dirichlet allocation (LDA) topic modeling makes two fundamental assumptions: “(1) There are a fixed number of patterns of word use, groups of terms that tend to occur together in documents. Call them topics. (2) Each document in the corpus exhibits the topics to varying degree. For example, suppose two of the topics are politics and film. LDA will represent a book like James E. Combs and Sara T. Combs’ Film Propaganda and American Politics: An Analysis and Filmography as partly about politics and partly about film.”

Topic models do not have any actual semantic knowledge of the words, and so do not “read” the sentences. Instead, topic models use math: tokens/words that tend to co-occur are statistically likely to be related to one another. However, that also means the model is susceptible to “noise,” falsely identifying patterns of co-occurrence driven by unimportant but highly repeated terms. As with most computational methods, “garbage in, garbage out.”

In order to make sure that the topic model is identifying interesting or important patterns instead of noise, I had to accomplish the following pre-processing or “cleaning” steps.

  • First, I removed the punctuation marks, like “,.;:?!”. Without this step, commas started showing up in all of my results. Since they didn’t add to the meaning of the text, they were not necessary to analyze.
  • Second, I removed the stop-words, like “I,” “and,” and “the,” because those words are so common in any English sentence that they tend to be over-represented in the results. Many of my tweets were emotional responses, so many authors wrote in the first person. This tended to skew my results, although you should be careful about what stop words you remove. Simply removing stop-words without checking them first means that you can accidentally filter out important data.
  • Finally, I removed too common words that were uniquely present in my data. For example, many of my tweets were retweets and therefore contained the word “rt.” I also ended up removing mentions to other authors because highly retweeted texts tended to mean that I was getting Twitter user handles as significant words in my results.

Cleaning the Data:

My original data set was 10 Excel files of 10,000 tweets each. In order to clean and standardize all these data points, as well as combine the files into one single document, I used OpenRefine. OpenRefine is a powerful tool, and it makes it easy to work with all your data at once, even with a large number of entries. I uploaded all of my datasets, then performed some quick cleaning available under the “Common Transformations” option under the triangle dropdown at the head of each column: I changed everything to lowercase, unescaped HTML characters (to make sure that I didn’t get errors when trying to run it in Python), and removed extra white spaces between words.

OpenRefine also lets you use regular expressions, which is a kind of search tool for finding specific strings of characters inside other text. This allowed me to remove punctuation, hashtags, and author mentions by running a find and replace command.

  • Remove punctuation: grel:value.replace(/(\p{P}(?<!’)(?<!-))/, "")
    • Any punctuation character is removed, except the apostrophes and hyphens that the lookbehinds protect.
  • Remove users: grel:value.replace(/(@\S*)/, "")
    • Any string that begins with an @ is removed. It ends at the space following the word.
  • Remove hashtags: grel:value.replace(/(#\S*)/, "")
    • Any string that begins with a # is removed. It ends at the space following the word.

Regular expressions, commonly abbreviated as “regex,” can take a little getting used to. Fortunately, OpenRefine itself has some solid documentation on the subject, and I also found this cheatsheet valuable as I was trying to get it to work. If you want to create your own regex search strings, regex101.com has a tool that lets you test your expression before you actually deploy it in OpenRefine.
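
If you would rather script these cleanups than run them in OpenRefine, the same three rules translate naturally to Python’s re module (a rough equivalent, not a character-for-character port of the GREL above):

```python
import re

def clean_tweet(text):
    """Strip @mentions, hashtags, and punctuation (keeping apostrophes and
    hyphens, as the GREL lookbehinds do), then collapse leftover whitespace."""
    text = re.sub(r"@\S*", "", text)        # user mentions
    text = re.sub(r"#\S*", "", text)        # hashtags
    text = re.sub(r"[^\w\s'-]", "", text)   # punctuation except ' and -
    return " ".join(text.split())

print(clean_tweet("rt @drlawyercop gun control, now! #guncontrolnow"))
# rt gun control now
```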

After downloading the entire data set as a Comma Separated Value (.csv) file, I then used the Natural Language ToolKit (NLTK) for Python to remove stop-words. The code itself can be found here, but I first saved the content of the tweets as a single text file, and then I told NLTK to go over every line of the document and remove words that are in its common stop word dictionary. The output is then saved in another text file, which is ready to be fed into a topic modeling package, such as MALLET.
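
The line-by-line stop-word filtering reads roughly like the sketch below. To keep the example self-contained it inlines a few stop-words; in practice you would pass NLTK’s full list (nltk.corpus.stopwords.words("english"), after running nltk.download("stopwords")).

```python
def remove_stopwords(line, stopwords):
    """Keep only the words in a line that are not stop-words."""
    return " ".join(word for word in line.split() if word not in stopwords)

# A handful of entries from a typical English stop-word list:
STOPWORDS = {"i", "and", "the", "is", "a", "to", "we"}

lines = ["we moved to illinois", "the drug is a decoy"]
cleaned = [remove_stopwords(line, STOPWORDS) for line in lines]
print(cleaned)  # ['moved illinois', 'drug decoy']
```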

At the end of all these cleaning steps, my resulting data is essentially composed of unique nouns and verbs, so, for example, @Phoenix_Rises13’s tweet “rt @drlawyercop since sensible, national gun control is a steep climb, how about we just start with orlando? #guncontrolnow” becomes instead “since sensible national gun control steep climb start orlando.” This means that the topic modeling will be more focused on the particular words present in each tweet, rather than commonalities of the English language.

Now my data is cleaned from any additional noise, and it is ready to be input into a topic modeling program.

Interested in working with topic models? There are two Savvy Researcher topic modeling workshops, on December 6 and December 8, that focus on the theory and practice of using topic models to answer questions in the humanities. I hope to see you there!

DIY Data Science

Data science is a special blend of statistics and programming with a focus on making complex statistical analyses more understandable and usable to users, typically through visualization. In 2012, the Harvard Business Review published the article, “Data Scientist: The Sexiest Job of the 21st Century” (Davenport, 2012), showing society’s perception of data science. While some of the excitement of 2012 has died down, data science continues on, with data scientists earning a median base salary over $100,000 (Noyes, 2016).

Here at the Scholarly Commons, we believe that a better understanding of statistics makes you less likely to be fooled when they are deployed improperly, and deepens your understanding of the inner workings of data visualization and digital humanities software and techniques. We might not be able to make you a data scientist (though if this post inspires you to enroll in formal coursework, please let us know!), but we can share some resources that let you try before you buy and incorporate methods from this growing field into your own research.

As we have discussed again and again on this blog, whether you want to improve your coding, statistics, or data visualization skills, our collection has some great reads to get you started.

In particular, take a look at:

The Human Face of Big Data created by Rick Smolan and Jennifer Erwitt

  • This is a great coffee table book of data visualizations and a great flip through if you are here in the space. You will learn a little bit more about the world around you and will be inspired with creative ways to communicate your ideas in your next project.

Data Points: Visualization That Means Something by Nathan Yau

  • Nathan Yau is best known for being the man behind Flowing Data, an extensive blog of data visualizations that also offers tutorials on how to create visualizations. In this book he explains the basics of statistics and visualization.

Storytelling with Data by Cole Nussbaumer Knaflic

LibGuides to Get You Started:

And more!

There are also a lot of resources on the web to help you:

The Open Source Data Science Masters

  • This is not an accredited masters program but rather a curated collection of suggested free and low-cost print and online resources for learning the various skills needed to become a data scientist. This list was created and is maintained by Clare Corthell of Luminant Data Science Consulting
  • This list does suggest many MOOCS from universities across the country, some even available for free

Dataquest

  • This is a project-based data science course created by Vik Paruchuri, a former Foreign Service Officer turned data scientist
  • It mostly consists of a beginner Python tutorial, though it is only one of many that are out there
  • Twenty-two quests and portfolio projects are available for free, though the two premium versions offer unlimited quests, more feedback, a Slack community, and opportunities for one-on-one tutoring

David Venturi’s Data Science Masters

  • A DIY data science course, which includes a resource list, and, perhaps most importantly, includes links to reviews of data science online courses with up to date information. If you are interested in taking an online course or participating in a MOOC this is a great place to get started

Mitch Crowe Learn Data Science the Hard Way

  • Another curated list of data science learning resources, this time based on Zed Shaw’s Learn Code the Hard Way series. This list comes from Mitch Crowe, a Canadian data scientist

So, is data science still sexy? Let us know what you think and what resources you have used to learn data science skills in the comments!

Works Cited:

Davenport, T. H., & Patil, D. J. (2012, October 1). Data Scientist: The Sexiest Job of the 21st Century. Retrieved June 1, 2017, from https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century
Noyes, K. (2016, January 21). Why “data scientist” is this year’s hottest job. Retrieved June 1, 2017, from http://www.pcworld.com/article/3025502/why-data-scientist-is-this-years-hottest-job.html

Adventures at the Spring 2017 Library Hackathon

This year I participated in an event called HackCulture: A Hackathon for the Humanities, which was organized by the University Library. This interdisciplinary hackathon brought together participants and judges from a variety of fields.

This event is different from your average campus hackathon. For one, it’s about expanding humanities knowledge. In this event, teams of undergraduate and graduate students — typically affiliated with the iSchool in some way — spend a few weeks working on data-driven projects related to humanities research topics. This year, in celebration of the sesquicentennial of the University of Illinois at Urbana-Champaign, we looked at data about a variety of facets of university life provided by the University Archives.

This was a good experience. We got firsthand experience working with data, though my teammates and I struggled with OpenRefine and ended up coding data by hand. I now know way too much about the majors that are available at UIUC and how many of them have only come into existence in the last thirty years. It is always cool to see how much has changed and how much has stayed the same.

The other big challenge we had was not everyone on the team had experience with design, and trying to convince folks not to fall into certain traps was tricky.

For an idea of how our group functioned, I outlined how we were feeling during the various checkpoints across the process.

Opening:

We had grand plans and great dreams and all kinds of data to work with. How young and naive we were.

Midpoint Check:

Laura was working on the Python script and sent a well-timed email about what was and wasn’t possible to get done in the time we were given. I find public speaking challenging so that was not my favorite workshop. I would say it went alright.

Final:

We prevailed and presented something that worked in public. Laura wrote a great Python script and cleaned up a lot of the data. You can even find it here. One day in the near future it will be in IDEALS as well, where you can already check out projects from our fellow humanities hackers.

Key takeaways:

  • Choose your teammates wisely; try to pick a team of folks you’ve worked with in advance. Working with a mix of new and not-so-new people in a short time frame is hard.
  • Talk to your potential client base! This was definitely something we should have done more of.
  • Go to workshops and ask for help. I wish we had asked for more help.
  • Practicing your presentation in advance, as well as usability testing, is key. Yes, using the actual Usability Lab at the Scholarly Commons is ideal, but at the very least take time to make sure the instructions for using what you created are accurate. It’s amazing what steps you will leave off when you have used an app more than twice. Similarly, make sure that you can run your program and another program at the same time; if you can’t, chances are you might crash someone’s browser when they use it.

Overall, if you get a chance to participate in a library hackathon, go for it, it’s a great way to do a cool project and get more experience working with data!

Register Today for ICPSR’s Summer Program in Quantitative Methods of Social Research

The ICPSR logo.

The Inter-university Consortium for Political and Social Research (ICPSR) is once again offering its summer workshops for researchers! Workshops range from Rational Choice Theories of Politics and Society to Survival Analysis, Event History Modeling, and Duration Analysis. There are so many fantastic choices across the country that we can hardly decide which we’d want to go to the most!

Here is how the ICPSR website describes the workshops:

Since 1963, the Inter-university Consortium for Political and Social Research (ICPSR) has offered the ICPSR Summer Program in Quantitative Methods of Social Research as a complement to its data services. The ICPSR Summer Program provides rigorous, hands-on training in statistical techniques, research methodologies, and data analysis. ICPSR Summer Program courses emphasize the integration of methodological strategies with the theoretical and practical concerns that arise in research on substantive issues. The Summer Program’s broad curriculum is designed to fulfill the needs of researchers throughout their careers. Participants in each year’s Summer Program generally represent about 30 different disciplines from more than 350 colleges, universities, and organizations around the world. Because of the premier quality of instruction and unparalleled opportunities for networking, the ICPSR Summer Program is internationally recognized as the leader for training in research methodologies and technologies used across the social, behavioral, and medical sciences.

Courses are available in 4-week sessions (June 26 – July 21, 2017 and July 24 – August 18, 2017) as well as shorter workshops lasting 3-to-5 days (beginning May 8). More details about the courses can be found here.

Details about registration deadlines, fees, and other important information can be found here.

If you want some help figuring out which workshops are most appropriate for you or just want to chat about the exciting offerings, come on over to the Scholarly Commons, where our social science experts can give you a hand!