What Storify Shutting Down Means to Us

The Storify logo.

You may have heard that popular social media story platform Storify will be shutting down on May 16, 2018. Open to the public since 2011, it has hosted everything from academic conference tweet round-ups to “Dear David”, the ongoing saga of Buzzfeed writer Adam Ellis and the ghost that haunts his apartment. So it shocked long-time users in December when Storify suddenly announced that it would be shutting down in just a few months.

Already, Storify is no longer allowing new accounts to be created, and by May 1st, users won’t be able to create new stories. On May 16th, everything disappears. Storify will live on as Storify 2, a feature of Livefyre, but access will require purchasing a Livefyre license. The fact is that many users cannot or will not pay for Livefyre, so for most people, Storify will effectively cease to exist on May 16th.

So… what does this mean?

Of course, it means that you need to export anything you have stored on Storify and want to save. (They provide instructions for exporting content in their shutdown FAQ.) More than that, however, we need to talk about how we rely on free services to archive our materials online, and why that is a dangerous long-term preservation strategy.

The fact is, free Internet services can change in an instant, and without consulting their user base. As we have seen with Storify, and with other services like Google Reader, what seems permanent can disappear quickly. When it comes to long-term digital preservation, we cannot depend on these services as our only means of keeping material safe.

That is not to say that we cannot use free digital tools like Storify. Storify was a great way to collect Tweets, present stories, and get information out to the public. And if you or your institution did not have the funds or support to create a long-term preservation plan, Storify was a great stop-gap in the meantime. But digital preservation is a marathon, not a sprint, and we will need to continue to find new, innovative ways to ensure that digital material remains accessible.

When I heard Storify was shutting down, I went to our Scholarly Commons intern Matt Pitchford, whose research focuses on social media and who has a real stake in digital preservation, for his take on the issue. (You can read about Matt’s research here and here.) Here’s what Matt had to say:

Thinking about [Storify shutting down] from a preservation perspective, I think it reinforces the need to develop better archival tools along two dimensions: first, along the lines of navigating the huge amounts of data and information online (like how the Library of Congress has that huge Twitter archive, but no means to access it, and which they recently announced they will stop adding to). Just having all of Storify’s data wouldn’t make it navigable. Second, that archival tools need to be able to “get back” to older forms of data. There is no such thing as a “universally constant” medium. PDFs, twitter, Facebook posts, or word documents all may disappear over time too, despite how important they seem to our lives right now. Floppy disks, older computer games or programs, and even recently CDs, aren’t “accessible” in the way they used to be. I think the same is eventually going to be true of social media.
Matt brings up some great issues here. Storify shutting down could simply be a harbinger of more change online. Social media spaces come and go (who else remembers MySpace and LiveJournal?), and even the nature of posts change (who else remembers when Tweets were just 140 characters?). As archivists, librarians, and scholars, we will have to adopt, adapt, and think quickly in order to stay ahead of forces that are out of our control.
And most importantly, we’ll have to save backups of everything we do.

Open Source Tools for Social Media Analysis

Photograph of a person holding an iPhone with various social media icons.

This post was guest authored by Kayla Abner.


Interested in social media analytics, but don’t want to shell out the bucks to get started? There are a few open source tools you can use to dabble in this field, and some even integrate data visualization. Recently, we at the Scholarly Commons tested a few of these tools, and as expected, each one has strengths and weaknesses. For our exploration, we exclusively analyzed Twitter data.

NodeXL

NodeXL’s graph for #halloween (2,000 tweets)

tl;dr: Light system footprint and provides some interesting data visualization options. Useful if you don’t have a pre-existing data set, but the one generated here is fairly small.

NodeXL is essentially a complex Excel template (it’s classified as a Microsoft Office customization), which means it doesn’t take up much space on your hard drive. It has its advantages: it’s easy to use, requiring only a simple search to retrieve tweets for you to analyze. However, its capabilities for large-scale analysis are limited; the user is restricted to retrieving the most recent 2,000 tweets. For example, searching Twitter for #halloween imported 2,000 tweets, every single one from the date of this writing. It is worth mentioning that there is a fancy, paid version that expands your limit to 18,000 tweets (the maximum allowed by Twitter’s API) or roughly 7 to 8 days back, whichever comes first. Even then, you cannot restrict your data retrieval by date. NodeXL is therefore most useful for pulling recent social media data. In addition, if you want to study a platform besides Twitter, you will have to pay to get any other type of dataset, e.g., Facebook, YouTube, or Flickr.
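That seven-day ceiling comes from Twitter itself, not from NodeXL: the standard (free) search API only indexes roughly the past week of tweets. As a rough illustration only, the hedged Python sketch below shows what the same #halloween search looks like when you call the v1.1 search endpoint directly; anything older than about a week simply is not returned. The bearer token is a placeholder you would have to create with your own Twitter developer account.

```python
import requests

# Placeholder: supply your own Twitter app bearer token.
BEARER_TOKEN = "YOUR_BEARER_TOKEN"

# Twitter's v1.1 standard search endpoint only covers roughly the last 7 days,
# which is why tools built on it (NodeXL, TAGS) share the same ceiling.
url = "https://api.twitter.com/1.1/search/tweets.json"
params = {"q": "#halloween", "count": 100, "result_type": "recent"}
headers = {"Authorization": f"Bearer {BEARER_TOKEN}"}

response = requests.get(url, headers=headers, params=params)
response.raise_for_status()

for tweet in response.json()["statuses"]:
    print(tweet["created_at"], tweet["text"][:80])
```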

Strengths: Good for a beginner, differentiates between Mentions/Retweets and original Tweets, provides a dataset, some light data visualization tools, offers Help hints on hover

Weaknesses: 2,000 Tweet limit, free version restricted to Twitter Search Network

TAGS

TAGSExplorer’s data graph (2,902 tweets). It must mean something…

tl;dr: Add-on for Google Sheets, giving it a light system footprint as well. Higher tweet-retrieval limit than NodeXL. TAGS has the added benefit of automated data retrieval, so you can track trends over time. Its data visualization tool is in beta and needs more development.

TAGS is another complex spreadsheet template, this time built for Google Sheets. TAGS does not have a paid version with more social media options; it can only be used for Twitter analysis. However, it does not have the same tweet retrieval limit as NodeXL. The only limit is 18,000 tweets or seven days back, which is dictated by Twitter’s Terms of Service, not by the creators of this tool. My same search for #halloween, with a limit set at 10,000, retrieved 9,902 tweets from the past seven days.

TAGS also offers a data visualization tool, TAGSExplorer, that is promising but still needs work to realize its potential. As it stands now in beta, even a dataset of 2,000 records puts so much strain on the program that it cannot keep up with the user, so for now it is only practical with smaller datasets. It does offer a few interesting analysis features that NodeXL lacks, such as the ability to see Top Tweeters and Top Hashtags, which works better than the graph.

Image of a hashtag search. These graphs have meaning!

Strengths: More data fields, such as the user’s follower and friend count, location, and language (if available), better advanced search (Boolean capabilities, restrict by date or follower count), automated data retrieval

Weaknesses: Data visualization tool needs work

Hydrator

Simple interface for Documenting the Now’s Hydrator

tl;dr: A tool used for “re-hydrating” tweet IDs into full tweets, to comply with Twitter’s Terms of Service. Not used for data analysis; useful for retrieving large datasets. Limited to datasets already available.

Documenting the Now, a group focused on collecting and preserving digital content, created the Hydrator tool to comply with Twitter’s Terms of Service: downloading and distributing full tweets to third parties is not allowed, but distributing tweet IDs is. The organization maintains a Tweet Catalog of files that can be downloaded and run through the Hydrator to recover the full tweets. Researchers are also invited to submit their own datasets of tweet IDs, though collecting those requires other software. This tool does not offer any data visualization, but it is useful for studying and sharing large datasets (the file for the 115th US Congress contains 1,430,133 tweets!). Researchers are limited to what has already been collected, but multiple organizations provide publicly downloadable tweet ID datasets, such as Harvard’s Dataverse. Note that the rate of hydration is also limited by Twitter’s API; the Hydrator manages that for you. Some of these datasets contain millions of tweet IDs and will take days to be transformed into full tweets.
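If you prefer to script the process rather than use the GUI, hydration is straightforward to sketch in Python. The example below is a rough illustration, not DocNow’s actual implementation: it assumes you have your own bearer token and a plain-text file of tweet IDs (the file name is a placeholder), it batches IDs in groups of 100 (the maximum the v1.1 statuses/lookup endpoint accepts), and it ignores rate limiting, which the real Hydrator handles for you.

```python
import requests

BEARER_TOKEN = "YOUR_BEARER_TOKEN"  # placeholder; create your own Twitter app token
LOOKUP_URL = "https://api.twitter.com/1.1/statuses/lookup.json"

# Read one tweet ID per line from a local file (hypothetical file name).
with open("tweet_ids.txt") as f:
    tweet_ids = [line.strip() for line in f if line.strip()]

hydrated = []
# statuses/lookup accepts at most 100 IDs per request.
for start in range(0, len(tweet_ids), 100):
    batch = tweet_ids[start:start + 100]
    response = requests.get(
        LOOKUP_URL,
        headers={"Authorization": f"Bearer {BEARER_TOKEN}"},
        params={"id": ",".join(batch), "tweet_mode": "extended"},
    )
    response.raise_for_status()
    # Tweets that have since been deleted or made private are simply not returned.
    hydrated.extend(response.json())

print(f"Hydrated {len(hydrated)} of {len(tweet_ids)} tweet IDs")
```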

Strengths: Provides full tweets for analysis, straightforward interface

Weaknesses: No data analysis tools

Crimson Hexagon

If you’re looking for more robust analytics tools, Crimson Hexagon is a data analytics platform that specializes in social media. Not limited to Twitter, it can retrieve data from Facebook, Instagram, Youtube, and basically any other online source, like blogs or forums. The company has a partnership with Twitter and pays for greater access to their data, giving the researcher higher download limits and a longer time range than they would receive from either NodeXL or TAGS. One can access tweets starting from Twitter’s inception, but these features cost money! The University of Illinois at Urbana-Champaign is one such entity paying for this platform, so researchers affiliated with our university can request access. One of the Scholarly Commons interns, Matt Pitchford, uses this tool in his research on Twitter response to terrorism.

Whether you’re an experienced text analyst or just want to play around, these open source tools are worth considering for different uses, all without spending a dime.

If you’d like to know more, researcher Rebekah K. Tromble recently gave a lecture at the Data Scientist Training for Librarians (DST4L) conference on how different (paid) platforms influence or bias analyses of social media data. As you start a real project analyzing social media, you’ll want to know how the data you have gathered may be limited so you can adjust your analysis accordingly.

Studying Rhetorical Responses to Terrorism on Twitter

As a part of my internship at the Scholarly Commons, I’m going to do a series of posts describing the tools and methodologies that I’ve used to work on my dissertation project. This write-up serves as an introduction to my project, its larger goals, and the tools that I use to start working with my data.

The Dissertation Project

In general, my dissertation draws on computational methodologies to account for the digital circulation and fragmentation of political movement texts in new media environments. In particular, I will examine the rhetorical responses on Twitter to three terrorist attacks in the U.S.: the 2013 Boston Marathon bombing, the 2015 San Bernardino shooting, and the 2016 Orlando nightclub shooting. I begin with the idea that terrorism is a kind of message directed at an audience, and I am interested in how digital audiences in the U.S. come to understand, make meaning of, and navigate uncertainty following a terrorist attack. I am interested in the patterns of narratives, community construction, and expressions of affect that characterize terrorism as a social media phenomenon.

I am interested in the following questions: What methods might rhetorical scholars use to better understand the vast numbers of texts, posts, and “tweets” that make up our social media? How do digital audiences construct meanings in light of terrorist attacks? How does the interwoven agency and materiality of digital spaces influence forms of rhetorical action, such as invention and style? To better address such challenges, I turn to the tools and techniques of the Digital Humanities as computational modes of analysis for examining the digitally circulated rhetoric surrounding terror events. Investigating this rhetoric using topic models will help scholars not only understand particular aspects of terrorism as a social media phenomenon, but also better see the ways that community and identity are themselves formed amid digitally circulated texts.

At the beginning of this project, I had no experience working with textual data, so the following posts represent a cleaned and edited version of the learning process I went through. There was a lot of mess and exploration involved, but it also meant I came to understand a lot more.

Gathering The Tools

I use a Mac, so accessing the command line is as simple as firing up Terminal.app. Windows users have to do a bit more work to get all of these tools, but plenty of tutorials can be found with a quick search.

Python (Anaconda)
The first big choice was whether to learn R or Python. I’d heard that Python was better for text and R was better for statistical work, but it seems to mostly come down to personal preference, since you can find people doing both kinds of work in either language. Both R and Python have a bit of a learning curve, but a quick search for topic modeling in Python gave me a ton of useful results, so I chose to start there.
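To give a taste of what those search results point to, here is a minimal topic-model sketch. This post doesn’t name a specific library, so treat this as one common route (gensim’s LdaModel) rather than the exact code used in my project; the three toy “documents” are placeholders, not real data.

```python
from gensim import corpora, models

# Toy stand-ins for cleaned, tokenized tweets (placeholders, not project data).
texts = [
    ["boston", "marathon", "thoughts", "prayers"],
    ["community", "vigil", "tonight", "downtown"],
    ["breaking", "police", "investigation", "suspect"],
]

# Map each token to an integer ID, then convert documents to bag-of-words vectors.
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(doc) for doc in texts]

# Fit a small LDA model; real corpora need far more documents and topics.
lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)

for topic_id, words in lda.print_topics():
    print(topic_id, words)
```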

Anaconda is a package management system for the Python language. What’s great about Anaconda is not only that it has a robust management system (so I can easily download the tools and libraries I need without having to worry about dependencies or other errors), but also that it encourages the creation of “environments” to work in. This means that I can make mistakes, or install and uninstall packages, without having to worry about messing up my overall system or my other environments.

Instructions for downloading Anaconda can be found here, and I found this cheat-sheet very useful in setting up my initial environments. Python has a ton of documentation, so these pages are useful, and there are plenty of tutorials online. Each environment comes with a few default packages, and I quickly added some toolkits for processing text and plotting graphs.

Confirming the Conda installation in Terminal, activating an environment, and listing the installed packages.

StackOverflow
Lots of people working with Python run into the same problems and issues that I did. Whenever my code encountered an error, or when I didn’t know how to do something like write to a .txt file, searching StackOverflow usually got me on the right track. Most answers link to the relevant Python documentation, so not only did I fix what was wrong, but I also learned why.
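To take the .txt example: the pattern I kept landing on in those answers looks roughly like this (the file name and the tweet list are placeholders for whatever your own pipeline produces).

```python
# Hypothetical list of tweet texts collected earlier in the pipeline.
tweets = ["First example tweet", "Second example tweet"]

# "w" overwrites the file each run; use "a" to append instead.
with open("tweets.txt", "w", encoding="utf-8") as outfile:
    for tweet in tweets:
        outfile.write(tweet + "\n")
```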

GitHub
Sometimes scholars put their code on GitHub to share it, advance research, and let others confirm their findings. I found code there for topic modeling in Python, and I also set up repositories for my own work. Because GitHub is built on the Git version control system, it also meant that I never “lost” old code and could track changes over time.

Programming Historian
This is a site for scholars interested in learning how to use tools for Digital Humanities work. There are some great tutorials here on a range of topics, including how to set up and use Python. It’s approachable and does a good job of covering everything you need to know.

These tools, taken together, form the basis of my workspace for dealing with my data. Upcoming topics will cover Data Collection, Cleaning the Data, Topic Models, and Graphing the Results.

Use Sifter for Twitter Research

For many academics, Twitter is an increasingly important source. Whether you love it or hate it, Twitter dominates information dissemination and discourse, and will continue to do so for the foreseeable future. However, actually sorting through Twitter, especially for large-scale projects, can be deceptively difficult, and that difficulty deters would-be Twitter scholars. That is where Sifter comes in: it will go through Twitter for you.

Sifter is a paid service (discussed in greater detail below) that provides search and retrieval access to undeleted tweets. Retrieved tweets are stored in an Enterprise DiscoverText account, which allows the user to perform data analytics on them. The DiscoverText account comes with a fourteen-day free trial, but for prolonged use the user will have to pay for account access.

However, Sifter can become prohibitively expensive. Each user gets three free estimates a day; beyond that, it is $20 per day of data retrieval and $30 per 100,000 tweets. Larger purchases (over $500 and over $1,500, respectively) come with longer DiscoverText trials and access for additional users. There are no refunds. So prior to making your purchase, make sure that you have done enough research to know exactly what data you want and which filters you’d like to use.
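For a rough sense of scale, here is a back-of-the-envelope estimate. It assumes the day and volume charges simply add together and that the per-tweet charge is billed in full 100,000-tweet blocks; the day count and tweet volume are hypothetical, and you should confirm the actual pricing model with your own Sifter estimate.

```python
import math

days = 5            # hypothetical: five days of data retrieval
tweets = 250_000    # hypothetical: estimated tweet volume

# Assumes whole 100,000-tweet blocks are billed; confirm with a real estimate.
cost = days * 20 + math.ceil(tweets / 100_000) * 30
print(f"Estimated cost: ${cost}")  # 5 * $20 + 3 * $30 = $190
```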

Possible filters that you can request when using Sifter.

Have you used Sifter? Or DiscoverText? What was your experience like? Alternatively, do you have a free resource that you prefer to use for Twitter data analytics? Please let us know in the comments!