Stata vs. R vs. SPSS for Data Analysis

As you do research with larger amounts of data, it becomes necessary to graduate from doing your data analysis in Excel to a more powerful software package. This can seem like a daunting task, especially if you have never attempted to analyze big data before. There are a number of data analysis software systems out there, but it is not always clear which one will work best for your research. The nature of your research data, your technological expertise, and your own personal preferences will all play a role in which software works best for you. In this post I will explain the pros and cons of Stata, R, and SPSS with regard to quantitative data analysis and provide links to additional resources. Every program I talk about in this post is available to University of Illinois students, faculty, and staff on the Scholarly Commons computers, and you can schedule a consultation with CITL if you have specific questions.

Short video loop of a kid sitting at a computer and putting on sun glasses

Rock your research with the right tools!


STATA

Stata logo. Blue block lettering spelling out Stata.

Among researchers, Stata is often credited as the most user-friendly data analysis software. It is popular in the social sciences, particularly economics and political science. It is a complete, integrated statistical software package, meaning it can accomplish pretty much any statistical task you need it to, including visualizations. It has both a point-and-click user interface and a command line with easy-to-learn command syntax. Furthermore, it has a form of version control: you can save the syntax from a given job in a “do-file” to refer to later. Stata is not free to have on your personal computer. Unlike with an open-source program, you cannot program your own functions into Stata, so you are limited to the functions it already supports. Finally, its functions are limited to numeric or categorical data; it cannot analyze spatial data and certain other types of data.

 

Pros

  • User friendly and easy to learn
  • Version control
  • Many free online resources for learning

Cons

  • An individual license can cost between $125 and $425 annually
  • Limited to certain types of data
  • You cannot program new functions into Stata



R

R logo. Blue capital letter R wrapped with a gray oval.

R and its graphical user interface companion RStudio are incredibly popular for a number of reasons. The first and probably most important is that R is free, open-source software that is compatible with any operating system. As such, there is a strong and loyal community of users who share their work and advice online. It has the same features as Stata, such as a point-and-click user interface, a command line, savable files, and strong data analysis and visualization capabilities. It also has some capabilities Stata does not, because users with more technical expertise can program new functions in R to use it for different types of data and projects. The problem a lot of people run into with R is that it is not easy to learn. The programming language it is built on is not intuitive, and it is prone to errors. Despite this steep learning curve, there is an abundance of free online resources for learning R.

Pros

  • Free open-source software
  • Strong online user community
  • Programmable with more functions for data analysis

Cons

  • Steep learning curve
  • Can be slow

Additional Resources:

  • Introduction to R Library Guide: Find valuable overviews and tutorials on this guide published by the University of Illinois Library.
  • Quick-R by DataCamp: This website offers tutorials and examples of syntax for a whole host of data analysis functions in R, everything from installing the package to advanced data visualizations.
  • Learn R on Codecademy: A free self-paced online class for learning to use R for data science and beyond.
  • Nabble forum: A forum where individuals can ask specific questions about using R and get answers from the user community.

SPSS

SPSS logo. Red background with white block lettering spelling SPSS.

SPSS is an IBM product that is used for quantitative data analysis. It does not have a command line feature but rather has a user interface that is entirely point-and-click and somewhat resembles Microsoft Excel. Although it looks a lot like Excel, it can handle larger data sets faster and with more ease. One of the main complaints about SPSS is that it is prohibitively expensive to use, with individual packages ranging from $1,290 to $8,540 a year. To make up for how expensive it is, it is incredibly easy to learn. As a non-technical person, I learned how to use it in under an hour by following an online tutorial from the University of Illinois Library. However, my take on this software is that unless you really need a more powerful tool, you should just stick to Excel. The two are too similar to justify seeking out this specialized software.

Pros

  • Quick and easy to learn
  • Can handle large amounts of data
  • Great user interface

Cons

  • By far the most expensive
  • Limited functionality
  • Very similar to Excel


Gif of Kermit the frog dancing and flailing his arms with the words "Yay Statistics" in block letters above

Thanks for reading! Let us know in the comments if you have any thoughts or questions about any of these data analysis software programs. We love hearing from our readers!

 

Cool Text Data – Music, Law, and News!

Computational text analysis can be done in virtually any field, from biology to literature. You may use topic modeling to determine which areas are the most heavily researched in your field, or attempt to determine the author of an orphan work. Where can you find text to analyze? So many places! Read on for sources to find unique text content.

Woman with microphone

Genius – the song lyrics database

Genius started as Rap Genius, a site where rap fans could gather to annotate and analyze rap lyrics. It expanded to include other genres in 2014, and now manages a massive database covering artists from Ariana Grande to Fleetwood Mac, including both lyrics and fan-submitted annotations. All of this text can be downloaded and analyzed using the Genius API. Using Genius and a text mining method, you could see how the themes present in popular music have changed in recent years, or understand a particular artist’s creative process.

homepage of case.law, with Ohio highlighted, 147,692 unique cases. 31 reporters. 713,568 pages scanned.

Homepage of case.law

Case.law – the case law database

The Caselaw Access Project (CAP) is a fairly recent project that is still ongoing, and publishes machine-readable text digitized from over 40,000 bound volumes of case law from the Harvard Law School Library. The earliest case is from 1658, with the most recent cases from June 2018. An API and bulk data downloads make it easy to get this text data. What can you do with huge amounts of case law? Well, for starters, you can generate a unique case law limerick:

Wheeler, and Martin McCoy.
Plaintiff moved to Illinois.
A drug represents.
Pretrial events.
Rocky was just the decoy.

Check out the rest of their gallery for more project ideas.

Newspapers and More

There are many places you can get text from digitized newspapers, both recent and historical. Some newspapers are hundreds of years old, so there can be problems with the OCR (Optical Character Recognition) that will make it difficult to get accurate results from your text analysis. Making newspaper text machine-readable requires special attention, since newspapers are printed on thin paper and have possibly been stacked up in a dusty closet for 60 years! See OCR considerations here. The newspaper text described below is already machine-readable and ready for text mining, but as with any text mining project, you must pay close attention to the quality of your text.

The Chronicling America project sponsored by the Library of Congress contains digital copies of newspapers with machine-readable text from all over the United States and its territories, from 1690 to today. Using newspaper text data, you can analyze how topics discussed in newspapers change over time, among other things.

newspapers being printed quickly on a rolling press

Looking for newspapers from a different region? The library has contracts with several vendors to conduct text mining, including Gale and ProQuest. Both provide newspaper text suitable for text mining, from The Daily Mail of London (Gale), to the Chinese Newspapers Collection (ProQuest). The way you access the text data itself will differ between the two vendors, and the library will certainly help you navigate the collections. See the Finding Text Data library guide for more information.

The sources mentioned above are just highlights of our text data collection! The Illinois community has access to a huge amount of text, including newspapers and primary sources, but also research articles and books! Check out the Finding Text Data library guide for a more complete list of sources. And, when you’re ready to start your text mining project, contact the Scholarly Commons (sc@library.illinois.edu), and let us help you get started!

Preparing Your Data for Topic Modeling

In keeping with my series of blog posts on my research project, this post is about how to prepare your data for input into a topic modeling package. I used Twitter data in my project, which is relatively sparse at only 140 characters per tweet, but the principles can be applied to any document or set of documents that you want to analyze.

Topic Models:

Topic models work by identifying and grouping words that co-occur into “topics.” As David Blei writes, Latent Dirichlet allocation (LDA) topic modeling makes two fundamental assumptions: “(1) There are a fixed number of patterns of word use, groups of terms that tend to occur together in documents. Call them topics. (2) Each document in the corpus exhibits the topics to varying degree. For example, suppose two of the topics are politics and film. LDA will represent a book like James E. Combs and Sara T. Combs’ Film Propaganda and American Politics: An Analysis and Filmography as partly about politics and partly about film.”

Topic models do not have any actual semantic knowledge of the words, and so do not “read” the sentence. Instead, topic models use math. The tokens/words that tend to co-occur are statistically likely to be related to one another. However, that also means that the model is susceptible to “noise,” or falsely identifying patterns of co-occurrence when unimportant but highly repeated terms are used. As with most computational methods, “garbage in, garbage out.”
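To make the co-occurrence idea concrete, here is a minimal Python sketch. It is not LDA and is not the code used in this project; it only counts how often two words appear in the same document. The function name `cooccurrence_counts` and the three toy documents are made up for illustration.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(documents):
    """Count how often each unordered pair of distinct words shares a document."""
    counts = Counter()
    for doc in documents:
        # Use a sorted set so each pair is counted once per document
        # and always appears in the same (alphabetical) order.
        words = sorted(set(doc.lower().split()))
        counts.update(combinations(words, 2))
    return counts


# Hypothetical toy corpus: "rt" is filler that shows up in every document.
docs = [
    "rt gun control vote",
    "rt film politics",
    "rt gun control debate",
]
pairs = cooccurrence_counts(docs)

for pair, n in pairs.most_common(5):
    print(pair, n)
```

In this toy corpus the filler token “rt” is paired with “gun” exactly as often as “control” is, which is the kind of noise that pre-processing is meant to remove.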

In order to make sure that the topic model is identifying interesting or important patterns instead of noise, I had to accomplish the following pre-processing or “cleaning” steps.

  • First, I removed the punctuation marks, like “,.;:?!”. Without this step, commas started showing up in all of my results. Since they didn’t add to the meaning of the text, they were not necessary to analyze.
  • Second, I removed the stop-words, like “I,” “and,” and “the,” because those words are so common in any English sentence that they tend to be over-represented in the results. Many of my tweets were emotional responses, so many authors wrote in the first person. This tended to skew my results, although you should be careful about what stop words you remove. Simply removing stop-words without checking them first means that you can accidentally filter out important data.
  • Finally, I removed words that were overly common in my particular data set. For example, many of my tweets were retweets and therefore contained the word “rt.” I also ended up removing mentions of other authors, because highly retweeted texts meant that I was getting Twitter user handles as significant words in my results.

Cleaning the Data:

My original data set was 10 Excel files of 10,000 tweets each. In order to clean and standardize all these data points, as well as to combine my files into a single document, I used OpenRefine. OpenRefine is a powerful tool, and it makes it easy to work with all your data at once, even if it is a large number of entries. I uploaded all of my datasets, then performed some quick cleaning available under the “Common Transformations” option under the triangle dropdown at the head of each column: I changed everything to lowercase, unescaped HTML characters (to make sure that I didn’t get errors when trying to run it in Python), and removed extra white spaces between words.
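For readers who would rather script these transformations, the following is a rough Python equivalent using only the standard library. It is a sketch of the same three steps (unescape HTML, lowercase, collapse whitespace), not an export of what OpenRefine does internally; the function name `normalize` is my own.

```python
import html
import re


def normalize(text):
    """Unescape HTML entities, lowercase, and collapse runs of whitespace."""
    text = html.unescape(text)       # e.g. "&amp;" becomes "&"
    text = text.lower()
    text = re.sub(r"\s+", " ", text)  # any run of whitespace becomes one space
    return text.strip()


sample = "  Guns &amp; Roses   ROCK "
print(normalize(sample))
```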

OpenRefine also lets you use regular expressions, which is a kind of search tool for finding specific strings of characters inside other text. This allowed me to remove punctuation, hashtags, and author mentions by running a find and replace command.

  • Remove punctuation: grel:value.replace(/(\p{P}(?<!’)(?<!-))/, "")
    • Any punctuation character is removed, except apostrophes (’) and hyphens (-), which the two lookbehinds exclude.
  • Remove users: grel:value.replace(/(@\S*)/, "")
    • Any string that begins with an @ is removed. The match ends at the next whitespace character.
  • Remove hashtags: grel:value.replace(/(#\S*)/, "")
    • Any string that begins with a # is removed. The match ends at the next whitespace character.
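The three find-and-replace steps above can be approximated in Python. This is a sketch under a few assumptions: Python’s built-in `re` module does not support `\p{P}`, so punctuation is detected with `unicodedata` instead, and the helper names (`strip_handles_and_hashtags`, `strip_punctuation`, `clean`) are hypothetical. Note that @ and # are themselves punctuation characters, so this sketch removes handles and hashtags first; otherwise the handle text would be left behind as an ordinary word.

```python
import re
import unicodedata

# Characters that the GREL punctuation expression above leaves in place.
KEEP = {"’", "-"}


def strip_handles_and_hashtags(text):
    """Remove @mentions and #hashtags, up to the next whitespace."""
    text = re.sub(r"@\S*", "", text)
    return re.sub(r"#\S*", "", text)


def strip_punctuation(text):
    """Drop every Unicode punctuation character except those in KEEP."""
    return "".join(
        ch for ch in text
        if ch in KEEP or not unicodedata.category(ch).startswith("P")
    )


def clean(text):
    text = strip_handles_and_hashtags(text)
    text = strip_punctuation(text)
    return " ".join(text.split())  # tidy up the gaps left by removals


tweet = ("rt @drlawyercop since sensible, national gun control is a steep "
         "climb, how about we just start with orlando? #guncontrolnow")
cleaned = clean(tweet)
print(cleaned)
```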

Regular expressions, commonly abbreviated as “regex,” can take a little getting used to. Fortunately, OpenRefine itself has some solid documentation on the subject, and I also found this cheatsheet valuable as I was trying to get it to work. If you want to create your own regex search strings, regex101.com has a tool that lets you test your expression before you actually deploy it in OpenRefine.

After downloading the entire data set as a Comma Separated Value (.csv) file, I used the Natural Language Toolkit (NLTK) for Python to remove stop-words. The code itself can be found here; in short, I first saved the content of the tweets as a single text file, and then I told NLTK to go over every line of the document and remove words that are in its common stop-word dictionary. The output is then saved in another text file, which is ready to be fed into a topic modeling package, such as MALLET.

At the end of all these cleaning steps, my resulting data is essentially composed of unique nouns and verbs, so, for example, @Phoenix_Rises13’s tweet “rt @drlawyercop since sensible, national gun control is a steep climb, how about we just start with orlando? #guncontrolnow” becomes instead “since sensible national gun control steep climb start orlando.” This means that the topic modeling will be more focused on the particular words present in each tweet, rather than commonalities of the English language.
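The stop-word step can be sketched in a few lines of Python. To keep the example self-contained, it uses a small hand-picked stop-word set rather than NLTK’s English list, so treat `STOP_WORDS` as a stand-in and not as the dictionary used in this project; `CUSTOM_WORDS` and `remove_stop_words` are likewise illustrative names.

```python
# Assumption: a tiny stand-in for a full stop-word list such as NLTK's.
STOP_WORDS = {"i", "and", "the", "is", "a", "how", "about", "we", "just", "with"}

# Words that are only "too common" in this particular data set.
CUSTOM_WORDS = {"rt"}


def remove_stop_words(text, stop_words=frozenset(STOP_WORDS | CUSTOM_WORDS)):
    """Keep only the whitespace-separated words that are not in stop_words."""
    return " ".join(word for word in text.split() if word not in stop_words)


# The example tweet after punctuation, handles, and hashtags were removed.
line = ("rt since sensible national gun control is a steep climb "
        "how about we just start with orlando")
result = remove_stop_words(line)
print(result)
```

Keeping the data-specific words in their own set makes it easier to review them separately, which matters because removing stop-words without checking them can filter out important data.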

Now my data has been cleaned of any additional noise, and it is ready to be input into a topic modeling program.

Interested in working with topic models? There are two Savvy Researcher topic modeling workshops, on December 6 and December 8, that focus on the theory and practice of using topic models to answer questions in the humanities. I hope to see you there!

CITL Workshops and Statistical Consulting Fall 2017

CITL is back at it again with the statistics, survey, and data consulting services! They have a busy fall 2017, with a full schedule of workshops on the way, as well as their daily consulting hours in the Scholarly Commons.

Their workshops are as follows:

  • 9/19: R I: Getting Started with R
  • 10/17: R I: Getting Started with R
  • 9/26: R II: Inferential Statistics
  • 10/24: R II: Inferential Statistics
  • 10/3: SAS I: Getting Started with SAS
  • 10/10: SAS II: Inferential Statistics with SAS
  • 10/4: STATA I: Getting Started with Stata
  • 9/20: SPSS I: Getting Started with SPSS
  • 9/27: SPSS II: Inferential Statistics with SPSS
  • 10/11: ATLAS.ti I: Qualitative Data Analysis
  • 10/12: ATLAS.ti II: Data Exploration and Analysis

Workshops are free, but participants must register beforehand. For more information about each workshop, and to register, head to the CITL Workshop Details and Resources page.

And remember that CITL is at the Scholarly Commons Monday – Friday, 10 AM – 4 PM. You can always request a consultation, or walk in.

Adventures at the Spring 2017 Library Hackathon

This year I participated in an event called HackCulture: A Hackathon for the Humanities, which was organized by the University Library. This interdisciplinary hackathon brought together participants and judges from a variety of fields.

This event is different than your average campus hackathon. For one, it’s about expanding humanities knowledge. In this event, teams of undergraduate and graduate students — typically affiliated with the iSchool in some way — spend a few weeks working on data-driven projects related to humanities research topics. This year, in celebration of the sesquicentennial of the University of Illinois at Urbana-Champaign, we looked at data about a variety of facets of university life provided by the University Archives.

This was a good experience. We got firsthand experience working with data, though my teammates and I struggled with OpenRefine, so we ended up coding data by hand. I now know way too much about the majors that are available at UIUC and how many majors have only come into existence in the last thirty years. It is always cool to see how much has changed and how much has stayed the same.

The other big challenge we had was not everyone on the team had experience with design, and trying to convince folks not to fall into certain traps was tricky.

For an idea of how our group functioned, I outlined how we were feeling during the various checkpoints across the process.

Opening:

We had grand plans and great dreams and all kinds of data to work with. How young and naive we were.

Midpoint Check:

Laura was working on the Python script and sent a well-timed email about what was and wasn’t possible to get done in the time we were given. I find public speaking challenging so that was not my favorite workshop. I would say it went alright.

Final:

We prevailed and presented something that worked in public. Laura wrote a great Python script and cleaned up a lot of the data. You can even find it here. One day in the near future it will be in IDEALS as well, where you can already check out projects from our fellow humanities hackers.

Key takeaways:

  • Choose your teammates wisely; try to pick a team of folks you’ve worked with in advance. Working with a mix of new and not-so-new people in a short time frame is hard.
  • Talk to your potential client base! This was definitely something we should have done more of.
  • Go to workshops and ask for help. I wish we had asked for more help.
  • Practicing your presentation in advance as well as usability testing is key. Yes, using the actual Usability Lab at Scholarly Commons is ideal but at the very least take time to make sure the instructions for using what you created are accurate. It’s amazing what steps you will leave off when you have used an app more than twice. Similarly make sure that you can run your program and another program at the same time because if you can’t chances are it means you might crash someone’s browser when they use it.

Overall, if you get a chance to participate in a library hackathon, go for it! It’s a great way to do a cool project and get more experience working with data.

Scholarly Smackdown: StoryMap JS vs. Story Maps

In today’s very spatial Scholarly Smackdown post we are covering two popular mapping visualization products, Story Maps and StoryMap JS. Yes, they both have “story” and “map” in the name, and they both let you create interactive multimedia maps without needing a server. However, they are different products!

StoryMap JS

StoryMap JS, from the Knight Lab at Northwestern, is a simple tool for creating interactive maps and timelines for journalists and historians with limited technical experience.

One example of a project on StoryMap JS is “Hockey, hip-hop, and other Green Line highlights” by Andy Sturdevant for the Minneapolis Post, which connects the stops of the Green Line train to historical and cultural sites of St. Paul and Minneapolis, Minnesota.

StoryMap JS uses Google products and map software from OpenStreetMap.

Using the StoryMap JS editor, you create slides with uploaded or linked media within their template. You then search the map and select a location and the slide will connect with the selected point. You can embed your finished map into your website, but Google-based links can deteriorate over time! So save copies of all your files!

More advanced users will enjoy the Gigapixel mode which allows users to create exhibits around an uploaded image or a historic map.

Story Maps

Story Maps is a custom map-based exhibit tool based on ArcGIS Online.

My favorite example of a project on Story Maps is The Great New Zealand Road Trip by Andrew Douglas-Clifford, which makes me want to drop everything and go to New Zealand (and learn to drive). But honestly, I can spend all day looking at the different examples in the Story Maps Gallery.

Story Maps offers a greater number of ways to display stories than StoryMap JS, especially in the paid version. The paid version even includes a crowdsourced Story Map where you can incorporate content from respondents, such as their 2016 GIS Day Events map.

With a free non-commercial public ArcGIS Online account you can create a variety of types of maps. Although there does not appear to be a way to overlay a historical map, there is a comparison tool which could be used to show changes over time. In the free edition of this software you have to use images hosted elsewhere, such as in Google Photos. Story Maps are created through a wizard where you add links to photos/videos, followed by information about these objects, and then search for and add the location. It is very easy to use, almost as easy as StoryMap JS. However, since this is proprietary software, there are limits to what you can do with the free account, and perhaps worries about pricing and accessing materials at a later date.

Overall, I can’t really say there’s a clear winner. If you need to tell a story with a map, both tools do a fine job. StoryMap JS is, in my totally unscientific opinion, slightly easier to use, but we have workshops for Story Maps here at Scholarly Commons! Either way you will be fine, even with limited technical or map-making experience.

If you are interested in learning more about data visualization, ArcGIS Story Maps, or geospatial data in general, check out these upcoming workshops here at Scholarly Commons, or contact our GIS expert, James Whitacre!

Love and Big Data

Can big data help you find true love?

It’s Love Your Data Week, but did you know people have been using big data and the power of data science to improve their chances of finding their soul mate? Wired Magazine profiled mathematician and data scientist Chris McKinlay in “How to Hack OkCupid.” There’s even a book spin-off from this, “Optimal Cupid,” which unfortunately is not at any nearby libraries.

But really, we know you’re all wondering, where can I learn the data science techniques needed to find “The One”, especially if I’m not a math genius?

ETHICS NOTE: WE DO NOT ENDORSE OR RECOMMEND TRYING TO CREATE SPYWARE, ESPECIALLY NOT ON COMPUTERS IN THE SPACE. WE ALSO DON’T GUARANTEE USING BIG DATA WILL HELP YOU FIND LOVE.

What did Chris McKinlay do?

Methods used:

  • Automating tasks, such as writing a Python script to answer questions on OkCupid
  • Scraping data from dating websites
  • Surveying
  • Statistical analysis
  • Machine learning to figure out how to rank the importance of answers to questions
  • Bots to visit people’s pages
  • Actually talking to people in the real world!

Things we can help you with at Scholarly Commons:

Selected workshops and resources, come by the space to find more!

Whether you reach out to us by email, by phone, or in person, our experts are ready to help with all of your questions and to help you make the most of your data! You might not find “The One” with our software tools, but we can definitely help you have a better relationship with your data!

Register for Spring 2017 Workshops at CITL!

Exciting news for anyone interested in learning the basics of statistical and qualitative analysis software! Registration is open for workshops to be held throughout the spring semester at the Center for Innovation in Teaching and Learning! There will be workshops on ATLAS.ti, R, SAS, Stata, SPSS, and Questionnaire Design on Tuesdays and Wednesdays in February and March from 5:30 to 7:30 pm. To learn more and to register, click here to go to the workshops offered by CITL page. And if you need a place to use these statistical and qualitative software packages, such as to practice the skills you gained at the workshops, stop by Scholarly Commons, Monday-Friday, 9 am-6 pm! And don’t forget, you can also schedule a consultation with our experts here for specific questions about using statistical and qualitative analysis software for your research!

JMP Pro Pilot Run at the Scholarly Commons Through March 14

Library patrons have the opportunity to use JMP Pro predictive analytics software through March 14 at the Scholarly Commons. JMP Pro is a sophisticated statistical discovery tool from SAS designed for advanced data science. Two Scholarly Commons computers are currently equipped with version 12. To learn more about the software, visit the JMP Pro website or the Webstore’s product page.

During this trial period, we are hoping to get a sense of whether this software would provide a useful tool for our patrons. We encourage anyone who is interested in this opportunity to visit the Scholarly Commons to take JMP Pro for a test drive and share your thoughts about the software with us. If you’re an experienced JMP Pro user, we’d also love to hear about your impressions of the package. You can contact us by email, or leave a message in the comments below.

 

 

ICPSR 2014 Summer Program in Quantitative Methods of Social Research

Still making a list of summer plans? As you gear up for summer, keep in mind that the Institute for Social Research at the University of Michigan is offering a wide range of classes on quantitative data-analysis. Whether you are a beginner or you are ready to study more advanced techniques, the program has something unique to offer each individual. Course instruction is centered around interactive, participatory data-analysis within a broader context of substantive social research.

Courses for the summer 2014 program are offered in two four-week sessions, May through August. These sessions include lecture, seminar, and workshop formats with participants from a diverse range of departments, universities, and organizations.

The following are a few examples of courses that will be offered:

Basic Foundation
Introduction to Statistics and Data Analysis
Introduction to Regression
Introduction to Computing

Linear Models and Beyond
Regression Analysis
Hierarchical Linear and Multilevel Models
Categorical Data Analysis

Substantive Topics
Race and Ethnicity
Curating Data & Providing Data Services
Designing, Conducting, and Analyzing Field Experiments

Advanced Techniques
Applied Bayesian Modeling
Advanced Time Series
The R Statistical Computing Environment

Multivariate Techniques
Multivariate Statistical Methods
Scaling and Dimensional Analysis
Intro & Advanced Network Analysis

Formal Modeling
Game Theory
Rational Choice
Empirical Modeling for Theory Evaluation

Registration is now open. There are also a few free workshops that will be offered over the summer, but registration for those sessions ends May 15, 2014 and seats are limited!

For a full list of courses, fee and discount information, and to fill out an application visit the website.

Questions?
Call: (734) 763-7400
Email: sumprog@icpsr.umich.edu