Open Source Tools for Social Media Analysis

Photograph of a person holding an iPhone with various social media icons.

This post was guest authored by Kayla Abner.


Interested in social media analytics, but don’t want to shell out the bucks to get started? There are a few open source tools you can use to dabble in this field, and some even integrate data visualization. Recently, we at the Scholarly Commons tested a few of these tools, and as expected, each one has strengths and weaknesses. For our exploration, we exclusively analyzed Twitter data.

NodeXL

NodeXL’s graph for #halloween (2,000 tweets)

tl;dr: Light system footprint and provides some interesting data visualization options. Useful if you don’t have a pre-existing data set, but the one generated here is fairly small.

NodeXL is essentially a complex Excel template (it’s classified as a Microsoft Office customization), which means it doesn’t take up a lot of space on your hard drive. It does have advantages; it’s easy to use, requiring only a simple search to retrieve tweets for you to analyze. However, its capabilities for large-scale analysis are limited; the user is restricted to retrieving the most recent 2,000 tweets. For example, searching Twitter for #halloween imported 2,000 tweets, every single one from the date of this writing. It is worth mentioning that there is a fancy, paid version that will expand your limit to 18,000 tweets (the maximum allowed by Twitter’s API) or roughly the past seven days, whichever comes first. Even then, you cannot restrict your data retrieval by date. NodeXL is a tool that would be most successful in pulling recent social media data. In addition, if you want to study something besides Twitter, you will have to pay to get any other type of dataset, e.g., Facebook, YouTube, or Flickr.

Strengths: Good for a beginner, differentiates between Mentions/Retweets and original Tweets, provides a dataset, some light data visualization tools, offers Help hints on hover

Weaknesses: 2,000 Tweet limit, free version restricted to Twitter Search Network

TAGS

TAGSExplorer’s data graph (2,902 tweets). It must mean something…

tl;dr: Add-on for Google Sheets, giving it a light system footprint as well. Higher limit on the number of tweets retrieved. TAGS has the added benefit of automated data retrieval, so you can track trends over time. Data visualization tool in beta, needs more development.

TAGS is another complex spreadsheet template, this time created for use with Google Sheets. TAGS does not have a paid version with more social media options; it can only be used for Twitter analysis. However, it does not have the same tweet retrieval limit as NodeXL. The only limit is 18,000 tweets or the past seven days, which is dictated by Twitter’s Terms of Service, not the creators of this tool. The same search for #halloween, with a limit set at 10,000, retrieved 9,902 tweets from the past seven days.
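
Under the hood, both NodeXL and TAGS are front ends to Twitter’s standard Search API, which is where the 18,000-tweet and roughly seven-day ceilings come from. If you’re curious what that request looks like outside of a spreadsheet, here is a minimal sketch in Python using the requests and requests_oauthlib libraries; the credential strings are placeholders for keys you would generate with your own Twitter developer account.

```python
# A minimal sketch of the kind of query NodeXL and TAGS send to
# Twitter's standard Search API (v1.1). The four credential values are
# placeholders you would replace with your own developer keys.
import requests
from requests_oauthlib import OAuth1

auth = OAuth1("CONSUMER_KEY", "CONSUMER_SECRET",
              "ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")

params = {
    "q": "#halloween",       # the same hashtag search used above
    "count": 100,            # 100 tweets per request is the API maximum
    "result_type": "recent",
}
resp = requests.get("https://api.twitter.com/1.1/search/tweets.json",
                    auth=auth, params=params)
resp.raise_for_status()

for tweet in resp.json()["statuses"]:
    print(tweet["user"]["screen_name"], tweet["text"][:80])
```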

TAGS also offers a data visualization tool, TAGSExplorer, that is promising but still needs work to realize its potential. As it stands now in beta mode, even a dataset of 2,000 records puts so much strain on the program that it cannot keep up with the user. It can be used with smaller datasets, but still needs work. It does offer a few interesting additional analysis parameters that NodeXL lacks, such as the ability to see Top Tweeters and Top Hashtags, which work better than the graph.

Image of hashtag search. These graphs have meaning!

Strengths: More data fields, such as the user’s follower and friend count, location, and language (if available), better advanced search (Boolean capabilities, restrict by date or follower count), automated data retrieval

Weaknesses: Data visualization tool needs work

Hydrator

Simple interface for Documenting the Now’s Hydrator

tl;dr: A tool used for “re-hydrating” tweet IDs into full tweets, to comply with Twitter’s Terms of Service. Not used for data analysis; useful for retrieving large datasets. Limited to datasets already available.

Documenting the Now, a group focused on collecting and preserving digital content, created the Hydrator tool to comply with Twitter’s Terms of Service. Download and distribution of full tweets to third parties are not allowed, but distribution of tweet IDs is. The organization manages a Tweet Catalog with files that can be downloaded and run through the Hydrator to recover the full tweets. Researchers are also invited to submit their own datasets of tweet IDs, but this requires other software to download them. This tool does not offer any data visualization, but it is useful for studying and sharing large datasets (the file for the 115th US Congress contains 1,430,133 tweets!). Researchers are limited to what has already been collected, but multiple organizations provide publicly downloadable tweet ID datasets, such as Harvard’s Dataverse. Note that the rate of hydration is also limited by Twitter’s API, and the Hydrator tool manages that for you. Some of these datasets contain millions of tweet IDs and will take days to be transformed into full tweets.
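
Hydration itself is just a lookup against Twitter’s API: you send batches of tweet IDs and receive the full tweets back, and the Hydrator app handles the batching and rate limits for you. For the curious, here is a rough sketch of the underlying call in Python; the credentials are placeholders, and tweet_ids.txt stands in for one of the downloaded ID files.

```python
# Rough sketch of "hydrating" tweet IDs via Twitter's statuses/lookup
# endpoint -- the underlying operation the Hydrator app performs,
# minus its rate-limit handling and progress tracking.
import json
import requests
from requests_oauthlib import OAuth1

auth = OAuth1("CONSUMER_KEY", "CONSUMER_SECRET",
              "ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")

# tweet_ids.txt is a placeholder file with one tweet ID per line,
# like the files distributed in Documenting the Now's Tweet Catalog.
with open("tweet_ids.txt") as f:
    ids = [line.strip() for line in f if line.strip()]

with open("hydrated.jsonl", "w") as out:
    # The lookup endpoint accepts at most 100 IDs per request.
    for i in range(0, len(ids), 100):
        batch = ids[i:i + 100]
        resp = requests.post("https://api.twitter.com/1.1/statuses/lookup.json",
                             auth=auth, data={"id": ",".join(batch)})
        resp.raise_for_status()
        for tweet in resp.json():
            out.write(json.dumps(tweet) + "\n")
```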

Strengths: Provides full tweets for analysis, straightforward interface

Weaknesses: No data analysis tools

Crimson Hexagon

If you’re looking for more robust analytics tools, Crimson Hexagon is a data analytics platform that specializes in social media. Not limited to Twitter, it can retrieve data from Facebook, Instagram, YouTube, and basically any other online source, like blogs or forums. The company has a partnership with Twitter and pays for greater access to their data, giving the researcher higher download limits and a longer time range than they would receive from either NodeXL or TAGS. One can access tweets dating back to Twitter’s inception, but these features cost money! The University of Illinois at Urbana-Champaign is one such entity paying for this platform, so researchers affiliated with our university can request access. One of the Scholarly Commons interns, Matt Pitchford, uses this tool in his research on Twitter response to terrorism.

Whether you’re an experienced text analyst or just want to play around, these open source tools are worth considering for different uses, all without you spending a dime.

If you’d like to know more, researcher Rebekah K. Tromble recently gave a lecture at the Data Scientist Training for Librarians (DST4L) conference regarding how different (paid) platforms influence or bias analyses of social media data. As you start a real project analyzing social media, you’ll want to know how the data you have gathered may be limited, so that you can adjust your analysis accordingly.

Spotlight: PastPin

The PastPin logo.

Who? What? Where? When? and Why? While these make up a catchy song from Spy Kids, they’re also questions that can get lost when looking at digital images, especially when metadata is missing. PastPin wants to help answer these questions, by tagging the location and time of vintage images on Flickr Commons, with the hope that one day they will be searchable through the Where? and When? of the images themselves. By doing this, PastPin wants to create new ways to do research using public domain images online.

Created by Geopast, a genealogy service, PastPin uses 6,806,043 images from 115 cultural institutions hosted on Flickr. When users bring up the PastPin website, they’ll be prompted with images that PastPin believes come from their geographic area. When you click on an image, you can then search a map for its specific location and enter a date, which is then saved. The image then becomes searchable by PastPin users through the entered information. The hope is that all of these images will eventually be identified, so that all users can search by location or date.

Some images are easier to geolocate and date than others. PastPin pulls in metadata and written descriptions from Flickr, so images published by an institution (such as the University Laboratory High School, like several images I encountered) may already have this information readily available, making it easy to type into the map and save. Other images are more difficult to locate or date because they lack that information and take more outside knowledge to suss out. PastPin also lacks adequate guidelines, particularly for locations. Since many of the images that come from the University of Illinois are from digitized books, are they looking for the location where the book was printed? Or of the library it resides in? It’s unclear.

PastPin faces what would seem like a Herculean feat. As I’m writing this, only 1.79% of the nearly seven million images have been located so far, and 2.13% have been dated. Today, there have been 18 updates, including two that I made, so the work moves slowly.

Still, PastPin is an awesome example of the power of crowd-sourced projects, and of the potential of new thinking to change the way that we do research. The Internet creates so many new opportunities for new kinds of research, and the ability to search through public domain images in new ways is just one of them.

Do you know of other websites that are trying to crowd source data? How about websites that are trying to push research into new directions? Let us know in the comments!

What To Do When OCR Software Doesn’t Seem To Be Working

Optical character recognition can enhance your research!

While optical character recognition (OCR) is a powerful tool, it’s not a perfect one. Inputting a document into OCR software doesn’t necessarily mean that the software will actually output something useful 100% of the time. Though most documents come out without a hitch, we have a few tips on what to do if your document just isn’t coming out right.

Scanning Issues

The problem may be less with your program and more with your initial scan. Low-quality scans are less likely to be read by OCR software. Here are a few considerations to keep in mind when scanning a document you will be using OCR on:

  • Make sure your document is scanned at 300 DPI
  • Keep your brightness level at 50%
  • Try to keep your scan as straight as possible

If you’re working with a document that you cannot create another scan for, there’s still hope! OCR engines with a GUI tend to have photo editing tools built in. If your OCR software doesn’t have those tools, or if its provided tools aren’t cutting it, try using a photo manipulation tool such as Photoshop or GIMP to edit your document. Also, remember that OCR software tends to be less effective when used on photographs than on scans.
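
If you’d rather script these cleanup steps than click through them, the same kinds of adjustments can be made programmatically. Here is a minimal sketch using the open source Tesseract engine through the Pillow and pytesseract Python packages; the filename and the rotation angle are placeholders for your own scan.

```python
# Minimal sketch: light cleanup of a scan before running Tesseract OCR.
# Assumes Tesseract is installed, plus the Pillow and pytesseract packages.
from PIL import Image, ImageEnhance
import pytesseract

img = Image.open("scan.png")                          # placeholder filename

img = img.convert("L")                                # convert to grayscale
img = ImageEnhance.Contrast(img).enhance(2.0)         # boost low contrast
img = img.rotate(-1.5, expand=True, fillcolor=255)    # straighten a crooked scan
                                                      # (angle is a guess, for illustration)

text = pytesseract.image_to_string(img)
print(text)
```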

Textual Issues

The issues you’re having may not stem from the scanning, but from the text itself. These issues can be more difficult to solve, because you cannot change the content of the original document, but they’re still good tips to know, especially when diagnosing issues with OCR.

  • Make sure that your document is in a language, and from a period, that your OCR software recognizes; not all engines are trained to recognize all languages (see the short sketch after this list)
  • Low contrast in documents can reduce OCR accuracy; contrast can be adjusted in a photo manipulation tool
  • Text created prior to 1850 or with a typewriter can be more difficult for OCR software to read
  • OCR software cannot read handwriting; while we’d all like to digitize our handwritten notes, OCR software just isn’t there yet
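
To illustrate the language point above: Tesseract, for example, only reads languages whose trained data files are installed, and you have to tell it which one to use. A quick sketch with pytesseract, assuming the French language pack (fra) happens to be installed:

```python
# Tell Tesseract which language model to use; 'fra' assumes the French
# traineddata file is installed alongside Tesseract.
import pytesseract
from PIL import Image

print(pytesseract.get_languages())   # list the language packs Tesseract can see
text = pytesseract.image_to_string(Image.open("page.png"), lang="fra")
print(text)
```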

Working with Digital Files

Digital files can, in many ways, be more complicated to run OCR software on, simply because someone else may have made the file. That can mean the file is lower-quality to begin with, or that whoever scanned it made errors. Most likely, you will run into scenarios that are easy fixes using photo manipulation tools. But there will be times when the images you come across just won’t work. It’s frustrating, but you’re not alone. Check out your options!

Always Remember that OCR is Imperfect

Even with perfect documents that you think will yield perfect results, there will be a certain percentage of mistakes. Most OCR software packages have a per-character accuracy rate between 97% and 99%, in other words, an error rate of 1-3%. While this may not seem like many errors, on a page with 1,800 characters there will be between 18 and 54 of them. In a 300-page book with 1,800 characters per page, that’s between 5,400 and 16,200 errors. So always be diligent and clean up your OCR!
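
If you want to estimate the cleanup workload for your own project, the arithmetic above is easy to reproduce; this little sketch simply restates it in Python, with placeholder page and character counts you can swap for your own.

```python
# Back-of-the-envelope estimate of OCR errors, using the figures above.
chars_per_page = 1800
pages = 300
for error_rate in (0.01, 0.03):   # i.e., 99% and 97% per-character accuracy
    errors_per_page = chars_per_page * error_rate
    print(f"{error_rate:.0%} error rate: {errors_per_page:.0f} errors/page, "
          f"{errors_per_page * pages:.0f} errors in a {pages}-page book")
```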

The Scholarly Commons

Here at the Scholarly Commons, we have Adobe Acrobat Pro installed on every computer, and ABBYY FineReader installed on several. We can also help you set up Tesseract on your own computer. If you would like to learn more about OCR, check out our LibGuide and keep your eye open for our next Making Scanned Text Machine Readable through Optical Character Recognition Savvy Researcher workshop!

The Library’s Eighth Data Purchase Program Round is Accepting Applications!

We’re starting a bit earlier than in past years to help researchers acquire the data they need for their research! Through the Library’s Data Purchase Program, the University Library accepts applications from campus researchers to purchase data. All applications will be reviewed by the Library’s Data Discovery and Support committee, which looks for requests that meet the following minimum criteria, in addition to others listed in the full program announcement:

  • The dataset must cost less than $5,000;
  • The dataset must be used for research; and
  • The Library must be able to make the data available for use by everyone at UIUC.

For some examples of past data requests supported by the Data Purchase Program, please explore this list.

The deadline for first consideration is May 26, 2017, but the Committee will consider applications that come in later based on availability of funds and whether the purchase can be completed by June 30, 2018.

If you have questions about the program or need help identifying data for your research, please contact the Scholarly Commons at sc@library.illinois.edu. We look forward to connecting you with the data you need!

Register Today for ICPSR’s Summer Program in Quantitative Methods of Social Research

The ICPSR logo.

The Inter-university Consortium for Political and Social Research (ICPSR) is once again offering its summer workshops for researchers! Workshops range from Rational Choice Theories of Politics and Society to Survival Analysis, Event History Modeling, and Duration Analysis. There are so many fantastic choices across the country that we can hardly decide which we’d want to go to the most!

Here is how the ICPSR website describes the workshops:

Since 1963, the Inter-university Consortium for Political and Social Research (ICPSR) has offered the ICPSR Summer Program in Quantitative Methods of Social Research as a complement to its data services. The ICPSR Summer Program provides rigorous, hands-on training in statistical techniques, research methodologies, and data analysis. ICPSR Summer Program courses emphasize the integration of methodological strategies with the theoretical and practical concerns that arise in research on substantive issues. The Summer Program’s broad curriculum is designed to fulfill the needs of researchers throughout their careers. Participants in each year’s Summer Program generally represent about 30 different disciplines from more than 350 colleges, universities, and organizations around the world. Because of the premier quality of instruction and unparalleled opportunities for networking, the ICPSR Summer Program is internationally recognized as the leader for training in research methodologies and technologies used across the social, behavioral, and medical sciences.

Courses are available in 4-week sessions (June 26 – July 21, 2017 and July 24 – August 18, 2017) as well as shorter workshops lasting 3-to-5 days (beginning May 8). More details about the courses can be found here.

Details about registration deadlines, fees, and other important information can be found here.

If you want some help figuring out which workshops are most appropriate for you or just want to chat about the exciting offerings, come on over to the Scholarly Commons, where our social science experts can give you a hand!

Scholarly Smackdown: StoryMap JS vs. Story Maps

In today’s very spatial Scholarly Smackdown post, we are covering two popular mapping visualization products, Story Maps and StoryMap JS. Yes, they both have “story” and “map” in the name, and they both let you create interactive multimedia maps without needing a server. However, they are different products!

StoryMap JS

StoryMap JS, from the Knight Lab at Northwestern, is a simple tool for creating interactive maps and timelines for journalists and historians with limited technical experience.

One example of a project on StoryMap JS is “Hockey, hip-hop, and other Green Line highlights” by Andy Sturdevant for the Minneapolis Post, which connects the stops of the Green Line train to historical and cultural sites of St. Paul and Minneapolis, Minnesota.

StoryMap JS uses Google products and map software from OpenStreetMap.

Using the StoryMap JS editor, you create slides with uploaded or linked media within their template. You then search the map and select a location, and the slide will connect with the selected point. You can embed your finished map into your website, but Google-based links can deteriorate over time, so save copies of all your files!

More advanced users will enjoy the Gigapixel mode, which allows users to create exhibits around an uploaded image or a historic map.

Story Maps

Story Maps is a custom map-based exhibit tool built on ArcGIS Online.

My favorite example of a project on Story Maps is The Great New Zealand Road Trip by Andrew Douglas-Clifford, which makes me want to drop everything and go to New Zealand (and learn to drive). But honestly, I can spend all day looking at the different examples in the Story Maps Gallery.

Story Maps offers a greater number of ways to display stories than StoryMap JS, especially in the paid version. The paid version even includes a crowdsourced Story Map where you can incorporate content from respondents, such as their 2016 GIS Day Events map.

With a free non-commercial public ArcGIS Online account you can create a variety of types of maps. Although there does not appear to be a way to overlay a historical map, there is a comparison tool which could be used to show changes over time. In the free edition of this software you have to use images hosted elsewhere, such as in Google Photos. Story Maps are created through a wizard where you add links to photos/videos, followed by information about these objects, and then search for and add the location. It is very easy to use, almost as easy as StoryMap JS. However, since this is proprietary software, there are limits to what you can do with the free account, and perhaps worries about pricing and accessing materials at a later date.

Overall, we can’t really say there’s a clear winner. If you need to tell a story with a map, both products do a fine job. StoryMap JS is, in my totally unscientific opinion, slightly easier to use, but we have workshops for Story Maps here at the Scholarly Commons! Either way you will be fine, even with limited technical or map-making experience.

If you are interested in learning more about data visualization, ArcGIS Story Maps, or geospatial data in general, check out these upcoming workshops here at the Scholarly Commons, or contact our GIS expert, James Whitacre!

Finding the Right Data at the Scholarly Commons

As you probably know, February 13-17th is Love Your Data Week, an annual event that aims to help researchers take better care of their data. The theme for today — Thursday, February 16th — is finding the right data, a problem that almost all researchers will run into while doing their work at some point or another. And the Scholarly Commons is here to help you out! Here are a few ways that you can “find the right data” through the services we provide here at the Scholarly Commons.

Online resources

The University of Illinois subscribes to an almost countless number of online resources on which you can find datasets and data files. It can be hard to figure out where to start, but oftentimes there will be a LibGuide that can point you toward a few sources you will find helpful. The Finding Numeric Data LibGuide specializes in data for the world, the United States, and Illinois, and can generally be used for projects in the social sciences. If you’re looking for GIS data, you can head to the Geographic Information Systems (GIS) LibGuide. We even have an area where you can browse all of the Library’s LibGuides and see which guide will be of the most use to you.

Purchasing data

If you’ve found a dataset that you truly need, but cannot get it through one of the services UIUC subscribes to, you may be eligible for the 2017 Data Purchase Program. Researchers can submit an application that outlines their data needs, and the University Library may choose to purchase the data and make it available for general use by the campus community. For more information, see the Data Purchase Program website, linked above.

Attending a Savvy Researcher workshop

Throughout the semester, the Scholarly Commons and other Library departments run Savvy Researcher workshops, which teach attendees various skills that will help them become better researchers. While many deal with finding or organizing data, here is a sampling of a few upcoming workshops that deal directly with finding data: Finding and Organizing Primary Source Materials in DPLA, Advanced Text Mining Techniques with Python and HathiTrust Data, and GIS for Research II: GIS Research, Data Management, and Visualization. For the full schedule of Savvy Researcher workshops, head to the Savvy Researcher calendar. You can also get an idea of what’s going on with the Savvy Researcher workshops by looking at the #savvyresearcher hashtag on Twitter!

Making an appointment with an expert

A central part of the Scholarly Commons’ mission is to connect you to the people you need to get the help you need. If you’re looking for data help, take a gander at our Scholarly Commons Experts page and see if there is someone on staff who can help you find what you need. If you’re still not sure, don’t worry! You can always fill out a consultation request form, or email us, and we’ll help you get in touch with someone who can guide you.

Love and Big Data

Can big data help you find true love?

It’s Love Your Data Week, but did you know people have been using big data to optimize their search for a soul mate with the power of data science? Wired Magazine profiled mathematician and data scientist Chris McKinlay in “How to Hack OkCupid”. There’s even a book spin-off from this, “Optimal Cupid”, which unfortunately is not at any nearby libraries.

But really, we know you’re all wondering, where can I learn the data science techniques needed to find “The One”, especially if I’m not a math genius?

ETHICS NOTE: WE DO NOT ENDORSE OR RECOMMEND TRYING TO CREATE SPYWARE, ESPECIALLY NOT ON COMPUTERS IN THE SPACE. WE ALSO DON’T GUARANTEE USING BIG DATA WILL HELP YOU FIND LOVE.

What did Chris McKinlay do?

Methods used:

  • Automating tasks, such as writing a Python script to answer questions on OkCupid
  • Scraping data from dating websites
  • Surveying
  • Statistical analysis
  • Machine learning to figure out how to rank the importance of answers to questions (a rough sketch of this idea follows the list)
  • Bots to visit people’s pages
  • Actually talking to people in the real world!
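
To give a flavor of the machine-learning step mentioned above, McKinlay’s approach involved grouping OkCupid users by how they answered the site’s survey questions. The sketch below is purely illustrative: it clusters made-up answer data with scikit-learn, and the numbers of users, questions, and clusters are invented placeholders, not his actual method.

```python
# Illustrative only: cluster survey respondents by their multiple-choice
# answers, roughly the flavor of analysis described in the Wired piece.
# The data here is random placeholder data, not real OkCupid answers.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# 500 hypothetical users, 50 questions, answers coded 0-3
answers = rng.integers(0, 4, size=(500, 50))

kmeans = KMeans(n_clusters=7, n_init=10, random_state=42).fit(answers)

# How many hypothetical users land in each cluster
print(np.bincount(kmeans.labels_))
```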

Things we can help you with at the Scholarly Commons:

Selected workshops and resources; come by the space to find more!

Whether you reach out to us by email, phone, or in person, our experts are ready to help with all of your questions and to help you make the most of your data! You might not find “The One” with our software tools, but we can definitely help you have a better relationship with your data!

Love Your Data Week 2017

The Scholarly Commons is excited to announce our participation in Love Your Data Week 2017. Taking place from February 13-17th, Love Your Data is an annual event that aims to “build a community to engage on topics related to research data management, sharing, preservation, reuse, and library-based research data services.” The 2017 theme is data quality.

Love Your Data Week takes place online, and you’ll find us posting content both on this blog (look out for our post on February 16th) and at our Twitter, @ScholCommons. We’ll be posting new content for each day of Love Your Data Week, so stay tuned! You can follow the wider conversation by looking at the hashtags #LYD17 and #loveyourdata on Twitter and elsewhere. You can also check out the University of Illinois Research Data Service’s Twitter @ILresearchdata for their Love Your Data Week content!

Each day of Love Your Data Week has a different theme. This year the themes are as follows:

  • Monday: Defining Data Quality
  • Tuesday: Documenting, Describing, Defining
  • Wednesday: Good Data Examples
  • Thursday: Finding the Right Data
  • Friday: Rescuing Unloved Data

Got something to say about data? Or just want to be a part of the action? Tweet @scholcommons or comment on this article!

Finding Data on Champaign County

The Champaign County Courthouse, taken by Beyond My Ken and hosted on Wikimedia Commons.

Many scholars at the University of Illinois at Urbana-Champaign keep their research local. But sometimes, finding data for a specific locale can be difficult. These suggestions are only a sampling of the resources that University of Illinois students, faculty, and staff have at their disposal for finding local data, but they’re a good place to start.

American FactFinder

American FactFinder is a free-to-use service provided by the United States Census Bureau. It contains basic facts in its Community Facts section, and allows for more detailed research through its Advanced Search option, which we suggest researchers use for more in-depth questions. At the time of writing, it contains census data for Champaign County from 2000 through 2015.

Social Explorer

Social Explorer uses census data to create map visualizations. It is important that you access Social Explorer through the University of Illinois Library, and not through a Google search, as the latter will give you limited functionality on the site. Social Explorer offers information dating back to 1790, as well as a good deal of customization. Maps that you create with Social Explorer can be downloaded and used as visuals.

SimplyMap

SimplyMap uses a mix of census and market research data to create map visualizations. A little clunkier than Social Explorer, it allows you to compare and contrast different variables with census and market research data, giving you powerful visualizations. Though you cannot download the visualizations themselves, you can download the datasets and tabular reports SimplyMap creates for you. As with Social Explorer, you should enter SimplyMap through the Library, and create an account using your U of I email address.

These are just three of many data sources for Champaign County. Do these fill your needs? Do you have a favorite data source, either listed here or not listed? Let us know in the comments!