What is web-scraping?
Put simply, web-scraping is the act of taking information from a website and placing it into a file so that you can analyze it. In practice, this usually means copying text from a web page into a plain-text or CSV file. This page collects resources about web-scraping. If you think I should include something, please email me at johng@illinois.edu. Thanks!
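To make the definition concrete, here is a minimal sketch in Python that pulls the paragraph text out of a page and writes it out as CSV. It uses only the standard library. The HTML snippet, the `ParagraphCollector` class, and the `scrape_paragraphs` and `to_csv` helpers are all made up for illustration; a real scraper would download the page first and would probably need sturdier parsing.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical page content; a real scraper would download this from a URL.
SAMPLE_HTML = """
<html><body>
  <h1>Example page</h1>
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
</body></html>
"""


class ParagraphCollector(HTMLParser):
    """Collect the text found inside every <p> element."""

    def __init__(self):
        super().__init__()
        self.in_paragraph = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_paragraph = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_paragraph = False

    def handle_data(self, data):
        # Only keep text that appears between <p> and </p>.
        if self.in_paragraph:
            self.paragraphs[-1] += data


def scrape_paragraphs(html):
    """Return the stripped text of each paragraph in the HTML string."""
    parser = ParagraphCollector()
    parser.feed(html)
    return [p.strip() for p in parser.paragraphs]


def to_csv(rows):
    """Render the rows as CSV text with a one-column header."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["paragraph"])
    for row in rows:
        writer.writerow([row])
    return buffer.getvalue()


paragraphs = scrape_paragraphs(SAMPLE_HTML)
csv_text = to_csv(paragraphs)
print(csv_text)
```

Writing `csv_text` to a file with `open("output.csv", "w", newline="")` would give you something you can open in a spreadsheet program.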
Readings and theory
Rawson and Muñoz’s “Against Cleaning”
Michael Black’s A Textual History of Mozilla in DHQ
Marres and Weltevrede’s Scraping the Social?
E.J.T. Weltevrede’s Repurposing digital methods: The research affordances of platforms and engines
Basic workshop on web-scraping using XPath
Using Google Sheets as a basic web-scraper
- Sample Google Sheet (YouTube)
- Sample Google Sheet (Game of Thrones Wikipedia)
- Game of Thrones characters
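For readers who want to try XPath outside of a spreadsheet, the sketch below shows the same idea in Python: an XPath expression picks one column out of a table. The table contents, the `characters` class name, and the `extract_column` helper are hypothetical and stand in for a real page. Note that Python's built-in `xml.etree.ElementTree` supports only a subset of XPath and expects well-formed markup, so messy real-world HTML would likely call for a third-party parser instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup standing in for a table on a real page.
SAMPLE_XHTML = """
<html><body>
  <table class="characters">
    <tr><th>Name</th><th>House</th></tr>
    <tr><td>Character A</td><td>House X</td></tr>
    <tr><td>Character B</td><td>House Y</td></tr>
  </table>
</body></html>
"""


def extract_column(root, column):
    """Return the text of the given 1-based <td> column from each row.

    The header row is skipped automatically because it contains
    <th> cells rather than <td> cells.
    """
    xpath = f".//table[@class='characters']/tr/td[{column}]"
    return [cell.text for cell in root.findall(xpath)]


root = ET.fromstring(SAMPLE_XHTML.strip())
names = extract_column(root, 1)
houses = extract_column(root, 2)

for name, house in zip(names, houses):
    print(name, "-", house)
```

The expression reads from left to right: find a `table` whose `class` is `characters`, step into each `tr` row, and take the cell at the requested position.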
Useful summaries of web-scraping
Coding resources
Free non-coding resources
MassMine (highly recommended: it is completely free, funded by humanities research, and co-created by Aaron Beveridge, a good friend of mine). Here is a scholarly article about MassMine.
Octoparse (free up to 10,000 records per scrape)
Scrapolysis (Chrome extension)
Web Scraper (webscraper.io; free up to 12,000 records per scrape)