Preparing Your Data for Topic Modeling

In keeping with my series of blog posts on my research project, this post is about how to prepare your data for input into a topic modeling package. I used Twitter data in my project, which is relatively sparse at only 140 characters per tweet, but the principles can be applied to any document or set of documents that you want to analyze.

Topic Models:

Topic models work by identifying and grouping words that co-occur into “topics.” As David Blei writes, Latent Dirichlet allocation (LDA) topic modeling makes two fundamental assumptions: “(1) There are a fixed number of patterns of word use, groups of terms that tend to occur together in documents. Call them topics. (2) Each document in the corpus exhibits the topics to varying degree. For example, suppose two of the topics are politics and film. LDA will represent a book like James E. Combs and Sara T. Combs’ Film Propaganda and American Politics: An Analysis and Filmography as partly about politics and partly about film.”

Topic models do not have any actual semantic knowledge of the words, and so do not “read” the sentence. Instead, topic models use math. The tokens/words that tend to co-occur are statistically likely to be related to one another. However, that also means that the model is susceptible to “noise,” or falsely identifying patterns of cooccurrence if non-important but highly-repeated terms are used. As with most computational methods, “garbage in, garbage out.”

In order to make sure that the topic model is identifying interesting or important patterns instead of noise, I had to accomplish the following pre-processing or “cleaning” steps.

  • First, I removed the punctuation marks, like “,.;:?!”. Without this step, commas started showing up in all of my results. Since they didn’t add to the meaning of the text, they were not necessary to analyze.
  • Second, I removed the stop-words, like “I,” “and,” and “the,” because those words are so common in any English sentence that they tend to be over-represented in the results. Many of my tweets were emotional responses, so many authors wrote in the first person. This tended to skew my results, although you should be careful about what stop words you remove. Simply removing stop-words without checking them first means that you can accidentally filter out important data.
  • Finally, I removed too common words that were uniquely present in my data. For example, many of my tweets were retweets and therefore contained the word “rt.” I also ended up removing mentions to other authors because highly retweeted texts tended to mean that I was getting Twitter user handles as significant words in my results.

Cleaning the Data:

My original data set was 10 Excel files of 10,000 tweets each. In order to clean and standardize all these data points, as well as combining my file into one single document, I used OpenRefine. OpenRefine is a powerful tool, and it makes it easy to work with all your data at once, even if it is a large number of entries. I uploaded all of my datasets, then performed some quick cleaning available under the “Common Transformations” option under the triangle dropdown at the head of each column: I changed everything to lowercase, unescaped HTML characters (to make sure that I didn’t get errors when trying to run it in Python), and removed extra white spaces between words.

OpenRefine also lets you use regular expressions, which is a kind of search tool for finding specific strings of characters inside other text. This allowed me to remove punctuation, hashtags, and author mentions by running a find and replace command.

  • Remove punctuation: grel:value.replace(/(\p{P}(?<!’)(?<!-))/, “”)
    • Any punctuation character is removed.
  • Remove users: grel:value.replace(/(@\S*)/, “”)
    • Any string that begins with an @ is removed. It ends at the space following the word.
  • Remove hashtags: grel:value.replace(/(#\S*)/,””)
    • Any string that begins with a # is removed. It ends at the space following the word.

Regular expressions, commonly abbreviated as “regex,” can take a little getting used to in order to understand how they work. Fortunately, OpenRefine itself has some solid documentation on the subject, and I also found this cheatsheet valuable as I was trying to get it work. If you want to create your own regex search strings, regex101.com has a tool that lets you test your expression before you actually deploy it in OpenRefine.

After downloading the entire data set as a Comma Separated Value (.csv) file, I then used the Natural Language ToolKit (NLTK) for Python to remove stop-words. The code itself can be found here, but I first saved the content of the tweets as a single text file, and then I told NLTK to go over every line of the document and remove words that are in its common stop word dictionary. The output is then saved in another text file, which is ready to be fed into a topic modeling package, such as MALLET.

At the end of all these cleaning steps, my resulting data is essentially composed of unique nouns and verbs, so, for example, @Phoenix_Rises13’s tweet “rt @drlawyercop since sensible, national gun control is a steep climb, how about we just start with orlando? #guncontrolnow” becomes instead “since sensible national gun control steep climb start orlando.” This means that the topic modeling will be more focused on the particular words present in each tweet, rather than commonalities of the English language.

Now my data is cleaned from any additional noise, and it is ready to be input into a topic modeling program.

Interested in working with topic models? There are two Savvy Researcher topic modeling workshops, on December 6 and December 8, that focus on the theory and practice of using topic models to answer questions in the humanities. I hope to see you there!

Neatline 101: Getting Started

Here at Commons Knowledge we love easy-to-use interactive map creation software! We’ve compared and contrasted different tools, and talked about StoryMap JS and Shanti Interactive. The Scholarly Commons is a great place to get help on GIS projects, from ArcGIS StoryMaps and beyond. But if you want something where you can have both a map and a timeline, and if you are willing to spend money on your own server, definitely consider using Neatline.

Neatline is a plugin created by Scholar’s Lab at University of Virginia that lets you create interactive maps and timelines in Omeka exhibits. My personal favorite example is the demo site by Paul Mawyer “‘I am it and it is I’: Lovecraft in Providence” with the map tiles from Stamen Design under CC-BY 3.0 license.

Screenshot of Lovecraft Neatline exhibit

*As far as the location of Lovecraft’s most famous creation, let’s just say “Ph’nglui mglw’nafh Cthulhu R’lyeh wgah’nagl fhtagn.”

Now one caveat — Neatline requires a server. I used Reclaim Hosting which is straightforward, and which I have used for Scalar and Mukurtu. The cheapest plan available on Reclaim Hosting was $32 a year. Once I signed up for the website and domain name, I took advantage of one nice feature of Reclaim Hosting, which lets you one-click install the Omeka.org content management system (CMS). The Omeka CMS is a popular choice for digital humanities users. Other popular content management systems include Wordpress and Scalar.

One click install of Omeka through Reclaim Hosting

BUT WAIT, WHAT ABOUT OMEKA THROUGH SCHOLARLY COMMONS?

Here at the Scholarly Commons we can set up an Omeka.net site for you. You can find more information on setting up an Omeka.net site through the Scholarly Commons here. This is a great option for people who want to create a regular Omeka exhibit. However, Neatline is only available as a plugin on Omeka.org, which needs a server to host. As far as I know, there is currently no Neatline plugin for Omeka.net and I don’t think that will be happening anytime soon. On Reclaim you can install Omeka on any LAMP server. And side advice from your very forgetful blogger, write down whatever username and password you make up when you set up your Omeka site, that will save you a lot of trouble later, especially considering how many accounts you end up with when you use a server to host a site.

Okay, I’m still interested, but what do I do once I have Omeka.org installed? 

So back to the demo. I used the instructions on the documentation page on Neatline, which were good for defining a lot of the terms but not so good at explaining exactly what to do. I am focusing on the original Neatline plugin but there are other Neatline plugins like NeatlineText depending on your needs. However all plugins are installed in a similar way. You can follow the official instructions here at Installing Neatline.

But I have also provided some because the official instructions just didn’t do it for me.

So first off, download the Neatline zip file.

Go to your Control Panel, cPanel in Reclaim Hosting, and click on “File Manager.”

File Manager circled on Reclaim Hosting

Sorry this looks so goofy, Windows snipping tool free form is only for those with a steady hand.

Navigate to the the Plugins folder.

arrow points at plugins folder in file manager

Double click to open the folder. Click Upload Files.

more arrows pointing at tiny upload option in Plugins folder

If you’re using Reclaim Hosting, IGNORE THE INSTRUCTIONS DO NOT UNZIP THE ZIP FILE ON YOUR COMPUTER JUST PLOP THAT PUPPY RIGHT INTO YOUR PLUGINS FOLDER.

Upload the entire zip file

                      Plop it in!

Go back to the Plugins folder. Right click the Neatline zip file and click extract. Save extracted files in Plugins.

Extract Neatline files in File Manager

Sign into your Omeka site at [yourdomainname].[com/name/whatever]/admin if you aren’t already.

Omeka dashboard with arrows pointing at Plugins

Install Neatline for real.

Omeka Plugins page

Still confused or having trouble with setup?

Check out these tutorials as well!

Open Street Maps is great and all but what if I want to create a fancy historical map?

To create historical maps on Neatline you have two options, only one of which is included in the actual documentation for Neatline.

Officially, you are supposed to use GeoServer. GeoServer is an open source server application built in Java. Even if you have your own server, it has a lot more dependencies to run than what’s required for Omeka / Neatline.

If you want one-click Neatline installation with GeoServer and have money to spend you might want to check out AcuGIS Neatline Cloud Hosting which is recommended in the Neatline documentation and the lowest cost plan starts at $250 a year.

Unofficially, there is a tutorial for this available at Lincoln Mullen’s blog “The Backward Glance” specifically his 2015 post “How to Use Neatline with Map Warper Instead of Geoserver.”

Let us know about the ways you incorporate geospatial data in your research!  And stay tuned for Neatline 102: Creating a simple exhibit!

Works Cited:

Extending Omeka with Plugins. (2016, July 5). Retrieved May 23, 2017, from http://history2016.doingdh.org/week-1-wednesday/extending-omeka-with-plugins/

Installing Neatline Neatline Documentation. (n.d.). Retrieved May 23, 2017, from http://docs.neatline.org/installing-neatline.html

Mawyer, Paul. (n.d.). “I am it and it is I”: Lovecraft in Providence. Retrieved May 23, 2017, from http://lovecraft.neatline.org/neatline-exhibits/show/lovecraft-in-providence/fullscreen

Mullen, Lincoln. (2015).  “How to Use Neatline with Map Warper Instead of Geoserver.” Retrieved May 23, 2017 from http://lincolnmullen.com/blog/how-to-use-neatline-with-map-warper-instead-of-geoserver/

Uploading Plugins to Omeka. (n.d.). Retrieved May 23, 2017, from https://community.reclaimhosting.com/t/uploading-plugins-to-omeka/195

Working with Omeka. (n.d.). Retrieved May 23, 2017, from https://community.reclaimhosting.com/t/working-with-omeka/194

Spotlight: Shanti Interactive

sint_logo

If you’re looking for tools that will help you create web-based visualizations, images or maps, Shanti Interactive may have exactly what you need. Shanti Interactive, a suite of tools made available from the University of Virginia’s Sciences, Humanities & Arts Network of Technological Initiatives (SHANTI), is free to use and a helpful resource for individuals seeking to show their data visually.

The Shanti Interactive suite includes five programs: Qmedia, SHIVA, MapScholar, VisualEyes, and VisualEyes 5. Qmedia creates instructional and scholarly videos. SHIVA creates “data-driven visualizations,” such as charts, graphs, maps, image montages and timelines. MapScholar creates geospatial visualizations while VisualEyes — arguably the most well-known tool from the suite — creates historic visualizations by weaving images, maps, charts, video and data into online exhibits. While we could write an entire post on each member of the suite (and maybe someday we will), I will quickly go over some of the main functions of the Shanti Interactive suite.

Qmedia

A screenshot of QMedia's demo video.

A screenshot of Qmedia’s live demo.

Qmedia creates an interactive video experience. The screen is broken up into various, customizable boxes, which the user can then interact with. In its own words, Qmedia “delineraizes” the video, allowing it to be scanned. Tools in Qmedia include table of contents, clickable, searchable transcripts, graphical concept maps, images, live maps, interactive visualizations, web apps and websites! While this list can be a little overwhelming, you can see the incredible results with Qmedia’s live demo.

SHIVA

SHIVA's timeline capability.

SHIVA’s timeline capability.

Think of SHIVA as a multi-faceted data visualization tool. It can create charts, maps, timelines, videos, images, graphs, subway maps, word clouds as well as plain text. SHIVA works with open source and open access web tools, such as Google’s Visualization Toolkit and Maps, YouTube, and Flickr. When a user inputs data, they do so through Google Docs. One fantastic feature in SHIVA is the ability to add on layers of annotations onto your data. For more on SHIVA’s capabilities and partners, see the SHIVA about page.

MapScholar

MapScholar is a great tool for creating what they call digital “atlases,” allowing scholars to use historic maps to compare and contrast how different areas have been depicted by mapmakers through time. For example, here is the base map on the eastern United States:

And here is that map overlayed with a Native American map from 1721:

VisualEyes and VisualEyes 5

VisualEyes is a multi-faceted online exhibit toolkit, which helps create interactive websites to display data. There are two versions: Flash-based VisualEyes, and HTML5-based VisualEyes 5, which is recommended. In many ways, VisualEyes is a combination of the rest of the suite’s tools, providing a platform for some incredible integration of sources. VisualEyes’ current example is a tour of Thomas Jefferson’s life (as the program was created at the University of Virginia), and worth a look if you’re interested in the program’s capabilities! It is far more interactive than one screengrab can communicate.

This project includes historic and modern maps, a timeline, and text, which all work together to create the story of Thomas Jefferson’s life.

Shanti Interactive includes diverse, free resources that can transform the way that you present your data to the world. If you need help getting started, or want to brainstorm ideas, stop by the Scholarly Commons and we’ll have someone ready to chat!

Text Analysis Basics – See Your Words in Voyant!

Interested in doing basic text analysis but have no or limited programming experience? Do you feel intimidated by the command line? One way to get started with text analysis, visualization, and uncovering patterns in large amounts of text is with browser-based programs! And today we have a mega blockbuster blog post extravaganza about Voyant Tools!

Voyant is a great solid browser based tool for text analysis. It is part of the Text Analysis Portal for Research (TAPoR)  http://tapor.ca/home. The current project leads are Stéfan Sinclair at McGill University (one of the minds behind BonPatron!) and Geoffrey Rockwell at the University of Alberta.

Analyzing a corpus:

I wanted to know what I needed to know to get a job so I got as many job ads as I could and ran them through very basic browser-based text analysis tools (to learn more about Word Clouds check out this recent post for Commons Knowledge all about them!) in order to see if what I needed to study in library school would emerge and I could then use that information to determine which courses I should take. This was an interesting idea and I mostly found that jobs prefer you have an ALA-accredited degree, which was consistent with what I had heard from talking to librarians. Now I have collected even more job ads (around December from the ALA job list mostly with a few from i-Link and elsewhere) to see what I can find out (and hopefully figure out some more skills I should be developing while I’m still in school).

Number of job ads = 300 there may be a few duplicates and this is not the cleanest data.

Uploading a corpus:

Voyant Tools is found at https://voyant-tools.org.

Voyant Home Page

For small amounts of text, copy and paste into the “Add Text” box. Otherwise, add files by clicking “Upload” and choosing the Word or Text files you want to analyze. Then click “Reveal”.

So I added in my corpus and here’s what comes up:

To choose a different view click  the small rectangle icon and choose from a variety of views. To save the visualization you created in order to later incorporate it into your research click the arrow and rectangle “Upload” icon and choose which aspect of the visualization you want to save.

Mode change option circled

“Stop words” are words excluded because they are very common words such as “the” or “and” that don’t always tell us anything significant about the content of our corpus. If you are interested in adding stop words beyond the default settings, you can do that with the following steps:

Summary button on Voyant circled

1. Click on Summary

Home screen for Voyant with the edit settings circled

2. Click on the define options button

Clicking on edit list in Voyant

3. If you want to add more words to the default StopList click Edit List

Edit StopList window in Voyant

4. Type in new words and edit the ones already there in the default StopList and click Save to save.

Mouse click on New User Defined List

5. Or to add your own list click New User Defined List and paste in your own list in the Edit list feature instead of editing the default list.

Here are some of the cool different views you can choose from in Voyant:

Word Cloud:

The Links mode, which shows connections between different words and how often they are paired with the thickness of the line between them.

My favorite mode is TextArc based on the text analysis and visualization project of the same name created by W. Brad Paley in the early 2000s. More information about this project can be found at http://www.textarc.org/, where you can also find Text arc versions of classic literature.

Voyant is pretty basic, it will give you a bunch of stuff you probably already knew, such as to get a library job it helps to have library experience. The advantage of the TextArc setting is that it puts everything out there and lets you see the connections between different words. And okay, it looks really cool too.

Check it out the original animated below! Warning this may slow down or even crash your browser:  https://voyant-tools.org/?corpus=3de9f7190e781ce7566e01454014a969&view=TextualArc

I also like the Bubbles feature (not to be confused with the Bubblelines feature) though none of the other GAs or staff here do, one going so far as to refer to it as an “abomination”.

Circles with corpus words (also listed in side pane) on inside

Truly abominable

The reason I have not included a link to this is DEFAULT VERSION MAY NOT MEET WC3 WEB DESIGN EPILEPSY GUIDELINES. DO NOT TRY IF YOU ARE PRONE TO PHOTOSENSITIVE SEIZURES. It is adapted from the much less flashy “Letter Pairs” project created by Martin Ignacio Bereciartua. This mode can also crash your browser.

To learn more about applying for jobs we have a Savvy Researcher workshop!

If you thought these tools were cool, to learn more advanced text mining techniques we have an upcoming Savvy Researcher workshop, also on March 6 :

Happy text mining and job searching! Hope to see some of you here at Scholarly Commons on March 6!

Spotlight on DiRT Directory: Digital Research Tools

The DiRT logo.

As a researcher, it can sometimes be frustrating knowing that someone out there has created a useful tool that will help you with what you’re working on, but being unable to find it. Google searches prove fruitless, and your network of friends don’t necessarily know what you’re talking about. In that moment of panic and frustration, you may just need to get a little DiRT-y.

DiRT Directory: Digital Research Tools is a directory of research tools for scholarly use. Using TaDiRAH (the Taxonomy of Digital Research Activities in the Humanities), DiRT breaks down the stages of a research project, and groups tools that are relevant to each stage: Capture, Creation, Enrichment, Analysis, Interpretation, Storage, and Dissemination. Users can either search for tools using these categories — broken down into subcategories whose specificity helps to narrow down the many tools found in the DiRT Directory — through a search box or by tag. Personally, I feel that searching through the TaDiRAH categories allows you to find relevant tools, but also allows you to explore options that you may not have previously thought of as being available, making it the most fruitful way to browse tools.

One nice aspect of DiRT is its search platform. After you choose your category, you have the option to search within the category for these criteria: Platform, Cost, Exclude, License, and Research Objects, as well as sort order. For researchers concerned with cost, this tool is especially useful, as you can limit your search to what is in your budget.

After you complete your search, you are offered a list of different tools. Tools range from well-known sources, like Google Docs, to things you have probably never heard of before. Each source includes a description, outlining what kind of tool it is — online, software, etc. — what its capabilities are, and in many cases, a note on its past or future development. Each entry also includes a link to the tool’s website, their license, and the date of DiRT’s most recent update on the source information.

An example tool entry on DiRT for Scrivener writing software.

An example tool entry on DiRT for Scrivener writing software on the search page.

Finally, each tool has its own page that you can access from the search function. This page holds a wealth of information, including an expanded description that outlines the nitty gritty aspects of the tool — from platforms to cost bracket to tags. It also includes screenshots of the tool in action, a list of recent edits to the page, and a comments section. However, not all tools have the same level of detail in their pages.

capture2

Scrivener’s page, which includes a description, screenshots, a list of contributors, and a comments section.

While the selection presented on DiRT can be almost overwhelming, digging through DiRT can help you find the perfect tools for your project.

If you still can’t find what you want in DiRT Directory, or need some guidance in what to search for in the first place, stop by the Scholarly Commons, located in Main Library Room 306, open from 9am-6pm on weekdays. Or, email us! We are always happy to help you with your research needs.