Web-scraping resources

What is web-scraping?

Put simply, web-scraping is the act of extracting information from a website and saving it to a file so that you can analyze it. In practice, this usually means copying text from a web page into a plain-text or CSV file. This page collects resources about web-scraping. If you think I should include something, please email me at johng@illinois.edu. Thanks!
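To make the "website to CSV" idea concrete, here is a minimal sketch that uses only the Python standard library. The HTML snippet, the class name TableParser, and the function html_table_to_csv are all made up for illustration; a real scraper would first download the page, and most projects would use one of the libraries listed under "Coding resources" rather than a hand-written parser.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical page content standing in for a downloaded web page.
SAMPLE_HTML = """
<table>
  <tr><th>Title</th><th>Year</th></tr>
  <tr><td>First example</td><td>2001</td></tr>
  <tr><td>Second example</td><td>2002</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def html_table_to_csv(html):
    """Return the table cells found in `html` as CSV text."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(parser.rows)
    return out.getvalue()


csv_text = html_table_to_csv(SAMPLE_HTML)
print(csv_text)
```

Writing csv_text to a file ending in .csv gives you something you can open in a spreadsheet or load into R or Python for analysis.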

Readings and theory

Rawson and Muñoz’s “Against Cleaning”

Michael Black’s The World Wide Web as Complex Data Set: Expanding the Digital Humanities into the Twentieth Century and Beyond 

Michael Black’s A Textual History of Mozilla in DHQ

Marres and Weltevrede’s Scraping the social

E.J.T. Weltevrede’s Repurposing digital methods: The research affordances of platforms and engines

Basic workshop on web-scraping using XPath

Using Google Sheets as a basic web-scraper

XPath query language

PowerPoint slides

Worksheet Guide
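For a rough sense of what an XPath query looks like before opening the workshop materials, here is a small sketch in Python. Google Sheets exposes XPath through its IMPORTXML function; this example instead uses Python's standard-library xml.etree.ElementTree, which supports only a limited subset of XPath and needs well-formed markup. The page content and the variable names are invented for illustration and are not taken from the workshop.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page used only to demonstrate the queries.
PAGE = """
<html>
  <body>
    <h1>Reading list</h1>
    <ul class="readings">
      <li><a href="/first">First reading</a></li>
      <li><a href="/second">Second reading</a></li>
    </ul>
    <ul class="tools">
      <li><a href="/tool">A tool</a></li>
    </ul>
  </body>
</html>
""".strip()

root = ET.fromstring(PAGE)

# ".//h1" means: any <h1> element anywhere below the root.
heading = root.find(".//h1").text

# Select links only inside the list whose class attribute is "readings".
reading_links = [
    (a.text, a.get("href"))
    for a in root.findall(".//ul[@class='readings']/li/a")
]

# "[2]" picks the second <li> within that list (XPath positions start at 1).
second_link = root.find(".//ul[@class='readings']/li[2]/a").get("href")

print(heading)
print(reading_links)
print(second_link)
```

Real web pages are often not well-formed XML, so for actual scraping you would typically pair XPath with a more forgiving HTML parser.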

Useful summaries of web-scraping

Python web-scraping

Web-scraping in R

Coding resources

Beautiful Soup

rvest

Tidyverse

Scrapy

Scrapy and MongoDB

Free non-coding resources

MassMine (highly recommended: it is completely free, funded by humanities research, and its co-creator, Aaron Beveridge, is a good friend of mine). Here is a scholarly article about MassMine.

Octoparse (free up to 10,000 records per scrape)

Scrapolysis (Chrome extension)

Webscraper.io (free up to 12,000 records per scrape)

RapidMiner

University of Illinois, Urbana-Champaign