Data Sets

I’m currently searching for datasets. I work with publicly available datasets as well as datasets I build using my own techniques (in Python and RStudio).

  1. This dataset contains all of the Amazon reviews from May 1996 to July 2014, approximately 143 million reviews, courtesy of Julian McAuley (UCSD). This set is cleaner and less repetitive than the better-known Stanford dataset.
  2. This dataset contains 1.6 billion public Reddit comments. The link also has good directions on how to clean up the data. Here is the API for pulling the data.
  3. This sample data is a structured CSV file pulled from The New York Times (via their API). It contains all of the NYT online comments from the summer of 2015 (June, July, and August), approximately 450,000 unique comments. I am currently structuring all of the comments dating back to August 2012, about 5 million online comments. Why 2012? Because of the 2012 election cycle.
  4. Net Data Directory (aggregation of internet data)
  5. Pew Research on the 25th Anniversary of the Internet (sign-up required).
  6. The IPEDS Analytics: Delta Cost Project Database. This includes a longitudinal database derived from IPEDS finance, enrollment, staffing, completions and student aid data for academic years 1986-87 through 2011-12.
  7. Gnip (buying large social media datasets)
  8. Datasift (for datasets of social media corporations)
  9. Open Knowledge (catalog of resources)
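
As a rough illustration of working with a comment file like the NYT sample in item 3, here is a minimal Python sketch that reads a CSV, keeps only comments from June through August 2015, and drops duplicates. The column names (`comment_id`, `created`, `body`), the timestamp format, and the inline sample rows are all assumptions made for the example; the actual file's schema may differ.

```python
import csv
import io
from datetime import datetime

# Hypothetical sample standing in for the real file; the column names
# (comment_id, created, body) are assumptions, not the actual schema.
SAMPLE_CSV = """comment_id,created,body
1,2015-06-03T10:15:00,First comment
2,2015-07-21T08:00:00,Second comment
2,2015-07-21T08:00:00,Second comment
3,2015-09-01T12:30:00,Outside the summer window
"""


def summer_2015_comments(handle):
    """Return unique comments dated June through August 2015."""
    seen = set()   # comment IDs already kept, used to drop duplicates
    kept = []
    for row in csv.DictReader(handle):
        created = datetime.fromisoformat(row["created"])
        in_window = created.year == 2015 and created.month in (6, 7, 8)
        if in_window and row["comment_id"] not in seen:
            seen.add(row["comment_id"])
            kept.append(row)
    return kept


comments = summer_2015_comments(io.StringIO(SAMPLE_CSV))
print(len(comments))
```

To run this against a file on disk, the `io.StringIO(...)` argument could be swapped for `open(path, newline="", encoding="utf-8")`, with the column names adjusted to match the real header row.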

If you know of any interesting datasets of online comments, please feel free to send me an email:

University of Illinois, Urbana-Champaign