Preparing Your Data for Topic Modeling

In keeping with my series of blog posts on my research project, this post is about how to prepare your data for input into a topic modeling package. I used Twitter data in my project, which is relatively sparse at only 140 characters per tweet, but the principles can be applied to any document or set of documents that you want to analyze.

Topic Models:

Topic models work by identifying and grouping words that co-occur into “topics.” As David Blei writes, Latent Dirichlet allocation (LDA) topic modeling makes two fundamental assumptions: “(1) There are a fixed number of patterns of word use, groups of terms that tend to occur together in documents. Call them topics. (2) Each document in the corpus exhibits the topics to varying degree. For example, suppose two of the topics are politics and film. LDA will represent a book like James E. Combs and Sara T. Combs’ Film Propaganda and American Politics: An Analysis and Filmography as partly about politics and partly about film.”

Topic models do not have any actual semantic knowledge of the words, and so do not “read” the sentence. Instead, topic models use math. The tokens/words that tend to co-occur are statistically likely to be related to one another. However, that also means that the model is susceptible to “noise,” or falsely identifying patterns of co-occurrence when unimportant but highly repeated terms are used. As with most computational methods, “garbage in, garbage out.”
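To make the idea of co-occurrence concrete, here is a minimal Python sketch that counts how often pairs of words appear in the same document. It is only an illustration of the counting intuition, not LDA itself, and the three toy documents are made up for the example.

```python
from collections import Counter
from itertools import combinations

# Toy documents, invented purely for illustration.
docs = [
    "gun control senate vote",
    "senate vote gun reform",
    "film review new release",
]

pair_counts = Counter()
for doc in docs:
    # Use each word once per document, sorted so every pair has one canonical order.
    words = sorted(set(doc.split()))
    pair_counts.update(combinations(words, 2))

# Words that share documents get higher counts than words that never meet.
print(pair_counts[("gun", "senate")])
print(pair_counts[("film", "gun")])
```

A word that appears in every document, such as “rt,” would pair with everything and drown out the more meaningful pairs, which is why the cleaning steps below matter.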

In order to make sure that the topic model is identifying interesting or important patterns instead of noise, I had to accomplish the following pre-processing or “cleaning” steps.

  • First, I removed the punctuation marks, like “,.;:?!”. Without this step, commas started showing up in all of my results. Since they didn’t add to the meaning of the text, they were not necessary to analyze.
  • Second, I removed the stop-words, like “I,” “and,” and “the,” because those words are so common in any English sentence that they tend to be over-represented in the results. Many of my tweets were emotional responses written in the first person, which skewed my results toward words like “I.” Be careful about which stop-words you remove, though: removing them without checking the list first means that you can accidentally filter out important data.
  • Finally, I removed overly common words that were specific to my data. For example, many of my tweets were retweets and therefore contained the word “rt.” I also ended up removing mentions of other authors, because highly retweeted texts meant that Twitter user handles were showing up as significant words in my results.

Cleaning the Data:

My original data set was 10 Excel files of 10,000 tweets each. To clean and standardize all these data points, and to combine my files into a single document, I used OpenRefine. OpenRefine is a powerful tool that makes it easy to work with all your data at once, even when there are a large number of entries. I uploaded all of my datasets, then performed some quick cleaning using the “Common Transformations” option under the triangle dropdown at the head of each column: I changed everything to lowercase, unescaped HTML characters (to make sure that I didn’t get errors when trying to run it in Python), and removed extra white space between words.
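If you would rather do these quick transformations in code, the following is a rough Python equivalent. It is a sketch, not what OpenRefine runs internally, and the function name basic_clean is a hypothetical one chosen for this example.

```python
import html
import re

def basic_clean(text: str) -> str:
    """Unescape HTML entities, lowercase, and collapse repeated whitespace."""
    text = html.unescape(text)          # "&amp;" becomes "&"
    text = text.lower()
    return re.sub(r"\s+", " ", text).strip()

print(basic_clean("Gun  control &amp; the   Senate"))
```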

OpenRefine also lets you use regular expressions, which are a kind of search pattern for finding specific strings of characters inside other text. This allowed me to remove punctuation, hashtags, and author mentions by running a find and replace command.

  • Remove punctuation: grel:value.replace(/(\p{P}(?<!’)(?<!-))/, "")
    • Any punctuation character is removed, except apostrophes (’) and hyphens, which the two negative lookbehinds skip.
  • Remove users: grel:value.replace(/(@\S*)/, "")
    • Any string that begins with an @ is removed. The match ends at the next whitespace character.
  • Remove hashtags: grel:value.replace(/(#\S*)/, "")
    • Any string that begins with a # is removed. The match ends at the next whitespace character.
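The same three replacements can be approximated in Python. This is a sketch under a few assumptions: Python’s built-in re module does not support \p{P}, so the punctuation step uses Unicode categories from unicodedata instead, and the function names are hypothetical. Note that @ and # are themselves punctuation characters, so in this sketch mentions and hashtags are removed before punctuation; otherwise the handle or tag text would be left behind as an ordinary word.

```python
import re
import unicodedata

# Characters the GREL pattern's lookbehinds leave in place.
KEEP = {"’", "-"}

def remove_mentions(text: str) -> str:
    """Drop any run of non-space characters that starts with @."""
    return re.sub(r"@\S*", "", text)

def remove_hashtags(text: str) -> str:
    """Drop any run of non-space characters that starts with #."""
    return re.sub(r"#\S*", "", text)

def remove_punctuation(text: str) -> str:
    """Drop every Unicode punctuation character except those in KEEP."""
    return "".join(
        ch for ch in text
        if ch in KEEP or not unicodedata.category(ch).startswith("P")
    )

def clean_tweet(text: str) -> str:
    text = remove_mentions(text)
    text = remove_hashtags(text)
    text = remove_punctuation(text)
    return " ".join(text.split())  # collapse the gaps left behind

tweet = ("rt @drlawyercop since sensible, national gun control is a steep climb, "
         "how about we just start with orlando? #guncontrolnow")
print(clean_tweet(tweet))
```

The sample tweet is the one quoted later in this post; at this stage the stop-words and “rt” are still present.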

Regular expressions, commonly abbreviated as “regex,” can take a little getting used to. Fortunately, OpenRefine itself has some solid documentation on the subject, and I also found this cheatsheet valuable as I was trying to get it to work. If you want to create your own regex search strings, regex101.com has a tool that lets you test your expression before you actually deploy it in OpenRefine.

After downloading the entire data set as a Comma Separated Value (.csv) file, I then used the Natural Language ToolKit (NLTK) for Python to remove stop-words. The code itself can be found here, but I first saved the content of the tweets as a single text file, and then I told NLTK to go over every line of the document and remove words that are in its common stop word dictionary. The output is then saved in another text file, which is ready to be fed into a topic modeling package, such as MALLET.

At the end of all these cleaning steps, my resulting data is essentially composed of unique nouns and verbs, so, for example, @Phoenix_Rises13’s tweet “rt @drlawyercop since sensible, national gun control is a steep climb, how about we just start with orlando? #guncontrolnow” becomes instead “since sensible national gun control steep climb start orlando.” This means that the topic modeling will be more focused on the particular words present in each tweet, rather than commonalities of the English language.
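The stop-word step can be sketched in a few lines of Python. This is not the script linked above. To keep the example self-contained it uses a small hand-written stop-word set, which is an assumption standing in for NLTK’s English list (available as nltk.corpus.stopwords.words("english") once the “stopwords” corpus has been downloaded), with the dataset-specific “rt” added.

```python
# Hand-picked stand-in for a real stop-word list; extend it for your own data.
STOP_WORDS = {"rt", "i", "and", "the", "is", "a", "how", "about", "we", "just", "with"}

def remove_stop_words(line: str, stop_words=STOP_WORDS) -> str:
    """Keep only the words of a line that are not in the stop-word set."""
    return " ".join(word for word in line.split() if word not in stop_words)

# The example tweet after mentions, hashtags, and punctuation have been stripped.
cleaned = ("rt since sensible national gun control is a steep climb "
           "how about we just start with orlando")
print(remove_stop_words(cleaned))
```

Applied line by line to a text file of tweets, a filter like this produces the kind of output file described above.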

Now my data is cleaned from any additional noise, and it is ready to be input into a topic modeling program.

Interested in working with topic models? There are two Savvy Researcher topic modeling workshops, on December 6 and December 8, that focus on the theory and practice of using topic models to answer questions in the humanities. I hope to see you there!

DIY Data Science

Data science is a special blend of statistics and programming with a focus on making complex statistical analyses more understandable and usable, typically through visualization. In 2012, the Harvard Business Review published the article “Data Scientist: The Sexiest Job of the 21st Century” (Davenport & Patil, 2012), which captured society’s perception of data science at the time. While some of the excitement of 2012 has died down, data science continues on, with data scientists earning a median base salary over $100,000 (Noyes, 2016).

Here at the Scholarly Commons, we believe that a better understanding of statistics makes you less likely to get fooled when they are deployed improperly, and helps you understand the inner workings of data visualization and digital humanities software applications and techniques. We might not be able to make you a data scientist (though please let us know if this post inspires you to enroll in formal coursework), but we can share some resources that let you try before you buy and incorporate methods from this growing field in your own research.

As we have discussed again and again on this blog, whether you want to improve your coding, statistics, or data visualization skills, our collection has some great reads to get you started.

In particular, take a look at:

The Human Face of Big Data created by Rick Smolan and Jennifer Erwitt

  • This is a great coffee table book of data visualizations and a great flip through if you are here in the space. You will learn a little bit more about the world around you and will be inspired with creative ways to communicate your ideas in your next project.

Data Points: Visualization That Means Something by Nathan Yau

  • Nathan Yau is best known for being the man behind Flowing Data, an extensive blog of data visualizations that also offers tutorials on how to create visualizations. In this book he explains the basics of statistics and visualization.

Storytelling with Data by Cole Nussbaumer Knaflic

LibGuides to Get You Started:

And more!

There are also a lot of resources on the web to help you:

The Open Source Data Science Masters

  • This is not an accredited masters program but rather a curated collection of suggested free and low-cost print and online resources for learning the various skills needed to become a data scientist. This list was created and is maintained by Clare Corthell of Luminant Data Science Consulting.
  • This list does suggest many MOOCs from universities across the country, some even available for free

Dataquest

  • This is a project-based data science course created by Vik Paruchuri, a former Foreign Service Officer turned data scientist
  • It mostly consists of a beginner Python tutorial, though it is only one of many that are out there
  • Twenty-two quests and portfolio projects are available for free, though the two premium versions offer unlimited quests, more feedback, a Slack community, and opportunities for one-on-one tutoring

David Venturi’s Data Science Masters

  • A DIY data science course, which includes a resource list, and, perhaps most importantly, includes links to reviews of data science online courses with up to date information. If you are interested in taking an online course or participating in a MOOC this is a great place to get started

Mitch Crowe Learn Data Science the Hard Way

  • Another curated list of data science learning resources, this time based on Zed Shaw’s Learn Code the Hard Way series. This list comes from Mitch Crowe, a Canadian data scientist

So, is data science still sexy? Let us know what you think and what resources you have used to learn data science skills in the comments!

Works Cited:

Davenport, T. H., & Patil, D. J. (2012, October 1). Data Scientist: The Sexiest Job of the 21st Century. Retrieved June 1, 2017, from https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century
Noyes, K. (2016, January 21). Why “data scientist” is this year’s hottest job. Retrieved June 1, 2017, from http://www.pcworld.com/article/3025502/why-data-scientist-is-this-years-hottest-job.html

Adventures at the Spring 2017 Library Hackathon

This year I participated in an event called HackCulture: A Hackathon for the Humanities, which was organized by the University Library. This interdisciplinary hackathon brought together participants and judges from a variety of fields.

This event is different from your average campus hackathon. For one, it’s about expanding humanities knowledge. In this event, teams of undergraduate and graduate students (typically affiliated with the iSchool in some way) spend a few weeks working on data-driven projects related to humanities research topics. This year, in celebration of the sesquicentennial of the University of Illinois at Urbana-Champaign, we looked at data about a variety of facets of university life provided by the University Archives.

This was a good experience. We got firsthand experience working with data, though my teammates and I struggled with OpenRefine and ended up coding data by hand. I now know way too much about the majors that are available at UIUC and how many of them have only come into existence in the last thirty years. It is always cool to see how much has changed and how much has stayed the same.

The other big challenge we had was not everyone on the team had experience with design, and trying to convince folks not to fall into certain traps was tricky.

For an idea of how our group functioned, I outlined how we were feeling during the various checkpoints across the process.

Opening:

We had grand plans and great dreams and all kinds of data to work with. How young and naive we were.

Midpoint Check:

Laura was working on the Python script and sent a well-timed email about what was and wasn’t possible to get done in the time we were given. I find public speaking challenging so that was not my favorite workshop. I would say it went alright.

Final:

We prevailed and presented something that worked in public. Laura wrote a great Python script and cleaned up a lot of the data. You can even find it here. One day in the near future it will be in IDEALS as well where you can already check out projects from our fellow humanities hackers.

Key takeaways:

  • Choose your teammates wisely; try to pick a team of folks you’ve worked with in advance. Working with a mix of new and not-so-new people in a short time frame is hard.
  • Talk to your potential client base! This was definitely something we should have done more of.
  • Go to workshops and ask for help. I wish we had asked for more help.
  • Practicing your presentation in advance, as well as usability testing, is key. Yes, using the actual Usability Lab at Scholarly Commons is ideal, but at the very least take time to make sure the instructions for using what you created are accurate. It’s amazing what steps you will leave off when you have used an app more than twice. Similarly, make sure that you can run your program and another program at the same time, because if you can’t, chances are you might crash someone’s browser when they use it.

Overall, if you get a chance to participate in a library hackathon, go for it! It’s a great way to do a cool project and get more experience working with data.

Learn Python Summer 2017

Are you sitting around thinking to yourself, golly, the bloggers at Commons Knowledge have not tried to convince me to learn Python in a few weeks, what’s going on over there? Well, no worries! We’re back with another post going over the reasons why you should learn Python. And to answer your next question no, the constant Python promotion isn’t us taking orders from some sinister serpentine society. We just really like playing with Python and coding here at the Scholarly Commons.

Why should I learn Python?

Python is a coding language with many applications for data science, bioinformatics, digital humanities, GIS, and even video games! Python is a great way to get started with coding and beef up your resume. It’s also considered one of the easier coding languages to learn and whether or not you are a student in LIS 452, we have resources here for you! And if you need help you can always email the Scholarly Commons with questions!

Where can I get started at Scholarly Commons?

We have a small section of great books aimed at new coders and those working on specific projects here in the space and online through the library catalog. Along with the classic Think Python book, some highlights include:

Python Crash Course: A Hands-On, Project-Based Introduction to Programming

Python Crash Course is an introductory textbook for Python, which goes over programming concepts and is full of examples and practice exercises. One unique feature of this book is that it also includes three longer multi-step projects: a game, a data visualization, and a web app, which you can follow for further practice. With these instructions available, you have something to base your own long-term Python projects on, whether for your research or a course. Don’t forget to check out the updates to the book at their website.

Automate the Boring Stuff with Python: Practical Programming for Total Beginners

Automate the Boring Stuff with Python is a solid introduction to Python with lots of examples. The target audience is non-programmers who plan to stay non-programmers; the author aims to provide the minimum amount of information necessary so that users can ultimately use Python for useful tasks, such as batch organizing files. It is still a lot of information, and I feel some of the visual metaphors are more confusing than helpful. Of course, having a programming background helps, despite the premise of the book.

This book can also be found online for free on this website.

Learn Python the Hard Way: A Very Simple Introduction to the Terrifyingly Beautiful World of Computers and Code

Although focused on Python 2, this is a book about teaching programming skills to newbie coders. Although the author does not specifically use this term, the book is based on what is known in psychology as deliberate practice, or “the hard way,” which is described in Cal Newport’s blog post “The Grandmaster in the Corner Office” (Newport, 2010). And Learn Python the Hard Way certainly lives up to the title. Even the basic command line instructions prove difficult. But based on my own learning experiences with deliberate practice, if you follow the instructions I imagine you will have a solid understanding of Python, of programming, and, from what I’ve read in the book, of some of your more techie friends’ programming jokes.

Online Resources

If the command line makes you scared or if you want to get started right away, definitely check out PythonAnywhere, which offers a basic plan that allows users to create and run Python programs in their browser. If PythonAnywhere isn’t your speed, check out this article, which lists the 45 best places to learn to code online.

Interested in joining an online Python learning group this summer?

Definitely check out Advent of Python, an online Python co-learning group through The Digital Humanities Slack. It started Tuesday, May 30 with introductions, and every week there will be Python puzzles to help you develop your skills. IT IS NOT TOO LATE TO JOIN! The first check-in and puzzle solutions will be June 6. The solutions and check-ins are going to be every Tuesday, except the Fourth of July; that meeting will be on Wednesday, July 5. There is a Slack, a Google Doc, and subreddits.

Living in Champaign-Urbana?

Be sure to check out Py-CU, a Maker/Hacker group in Urbana that welcomes coders of all experience levels; the next meeting is on June 3rd. And obligatory heads up: the Urbana Makerspace is pretty much located in Narnia.

Question for the comments: how did you learn to code? What websites, books, and resources do you recommend for the newbie coder?

Works Cited:

Newport, C. (2010, January 6). The Grandmaster in the Corner Office: What the Study of Chess Experts Teaches Us about Building a Remarkable Life. Retrieved May 30, 2017, from http://calnewport.com/blog/2010/01/06/the-grandmaster-in-the-corner-office-what-the-study-of-chess-experts-teaches-us-about-building-a-remarkable-life/

Love and Big Data

Can big data help you find true love?

It’s Love Your Data Week, but did you know people have been using big data to optimize their ability to find their soul mate with the power of data science? Wired magazine profiled mathematician and data scientist Chris McKinlay in “How to Hack OkCupid.” There’s even a book spin-off, “Optimal Cupid,” which unfortunately is not at any nearby libraries.

But really, we know you’re all wondering, where can I learn the data science techniques needed to find “The One”, especially if I’m not a math genius?

ETHICS NOTE: WE DO NOT ENDORSE OR RECOMMEND TRYING TO CREATE SPYWARE, ESPECIALLY NOT ON COMPUTERS IN THE SPACE. WE ALSO DON’T GUARANTEE USING BIG DATA WILL HELP YOU FIND LOVE.

What did Chris McKinlay do?

Methods used:

  • Automating tasks, such as writing a Python script to answer questions on OkCupid
  • Scraping data from dating websites
  • Surveying
  • Statistical analysis
  • Machine learning to figure out how to rank the importance of answers to questions
  • Bots to visit people’s pages
  • Actually talking to people in the real world!

Things we can help you with at Scholarly Commons:

Selected workshops and resources, come by the space to find more!

Whether you reach out to us by email, by phone, or in person, our experts are ready to help with all of your questions and to help you make the most of your data! You might not find “The One” with our software tools, but we can definitely help you have a better relationship with your data!

Playing With Python: Learning Coding Through Online Games

“GALAGA” by Kevin Simpson, hosted on flickr.com.

For those of us who want to learn to code, but don’t necessarily have the time or patience to sit down with a thick book on Python, there are a few ways out there to trick yourself into learning a coding language while still having a fun time. Online games that help players learn coding through game play have been popping up on the Internet and can be helpful tools for those who want to start coding but aren’t sure where to begin. This article is an overview of a few fun games that you can play to increase your coding skills.

Empire of Code

Made for beginners, Empire of Code allows the player to use either Python or JavaScript to build, protect, and rule a space kingdom. Game play is largely focused on timing, where certain elements are created more quickly and efficiently when certain algorithms are used. The game is still in beta, and can run slowly at times. Further, coding isn’t necessary to game play, and there are lulls where you can’t do much. But of the games on the list, it’s the most aesthetically pleasing choice, and has a lot in common with popular apps.

Screeps

Screeps is a way for beginning JavaScript learners to flex their muscles. This “MMO sandbox strategy game” lets players control “units” in real time by writing JavaScript. Unlike some of these other games, Screeps is for people with at least a basic working knowledge of JavaScript. In the game, you create your room, gather resources, as well as interact with other players. As you go on, the scope of things you can do within the Screeps universe expands, as well as your knowledge of JavaScript.

Code Combat

Code Combat is a game created for younger students to learn computer science, so it may come off as a little cheesy to older players. But if you’re looking for a painless way to learn some code, Code Combat may be for you! Each level is a different adventure, where you can choose a character and coding language to solve various puzzles and mazes. It’s a well-designed game in or outside of the classroom, and a helpful tool for true beginners.

CodinGame

CodinGame is a little less user-friendly than some of the other games on this list, but is also the most versatile. There are a number of mini-games that you can play which range in difficulty. You can also play any of these games in twenty-five different coding languages, making the game as a whole incredibly useful to someone who wants to learn a language that’s a little more niche, or who wants a wide range of coding options. However, the game does throw you in very quickly, and with very little instruction. At least a little prior knowledge of coding will be helpful if you want to tackle CodinGame.

Do you have a particular coding game that you play? Let us know in the comments!

Explore coding and other technical skills with free online resources

Computer programming and other technical skills are increasingly in demand, both in academia and the private sector. Fortunately, as these skills have become more central to all sectors and industries, a wide variety of resources for learning them have emerged. In this post, we’d like to highlight just a few resources for getting started with programming.

Codecademy is one of the better known online resources for learning programming languages and other technical skills. When you explore the site, you’ll see that it has courses divided into different categories, including web developer skills, languages, and simple projects (for instance, how to create an animation of your name). Each course is divided into several units, which are further divided into lessons that are built in a step-by-step manner. Lessons often begin by introducing the basics of a concept, and then having you apply the concept by walking through a simple procedure. For instance, the JavaScript course introduces functions, and then has you create a simple “rock, paper, scissors” game that is built out of functions.

One downside to Codecademy is that, due to its step-by-step design, you may feel that you aren’t acquiring an understanding of the relevant concepts at the level you desire. So depending on your learning style, you might want to consider supplementing Codecademy with other resources.

One option would be Lynda.com. Lynda offers video tutorials on a wide variety of topics and skills, with a focus on software and technical skills. For many topics, beginner-level tutorials are offered. These provide a general overview of the subject matter, along with accessible explanations of key concepts. Lessons of this sort may serve as a nice complement to a more hands-on, step-by-step setup, such as that offered in Codecademy. University of Illinois students, faculty, and staff have free access to Lynda’s resources. To log in with your Illinois credentials, visit go.illinois.edu/lynda.

If you’re simply interested in familiarizing yourself with common programming terms and concepts, check out MIT’s Scratch. Scratch is a programming language and online community where you can create your own interactive stories, games, and animations. Scratch is, admittedly, designed with children in mind (in fact, it’s a project of the MIT Media Lab’s Lifelong Kindergarten group). Nevertheless, it can serve as a wonderful resource, especially for those completely new to coding (and I can report from first-hand experience that it is used in at least one class at the iSchool at Illinois). It can also be a lot of fun!

A simple (and very clunky!) maze game I made for one of my classes using Scratch. Scratch is developed by the Lifelong Kindergarten Group at the MIT Media Lab. See http://scratch.mit.edu.

If you would like some one-on-one assistance with programming projects, you can drop by the Scholarly Commons for Data Help Open Hours, a joint service of the Scholarly Commons and the Research Data Service. In addition to getting help with Python coding, you can get help with R, SQL, and XML. If you’re ready to go more in-depth, check out our reference collection which contains books on Python, Java, R, and many more topics.

Do you know of any other good resources for learning to program? Let us know in the comments below!