The GovData Project MIT-Harvard Winter Course 2011
Module 2: Data Wrangling Techniques
Data Wrangling Outline

Three easy pieces:
1. Scraping & Parsing Tools & Techniques
2. Data “Cleaning” Tools & Techniques
3. A bit of database technology (MongoDB primer)
Data Wrangling Outline

Motivations:
1. Scraping & Parsing Tools & Techniques: because the web, especially complex data portals, contains lots of data
2. Data “Cleaning” Tools & Techniques: because the data, even coming from DB-backed sites, is often “dirty”
3. A bit of database technology (MongoDB primer): because you want to be able to serve up the data too
Data Wrangling Outline

Three easy pieces:
   prelude: How the web works: request / response
1. Scraping
   interlude: How the web works: HTML
   ... and then Parsing Tools & Techniques
2. Data “Cleaning” Tools & Techniques
3. A bit of database technology (MongoDB primer)
How the Web Works (sort of)

SERVER: a (powerful) computer that hosts a webpage.
CLIENT: your computer, where you view the webpage.
How the Web Works (sort of)

[Diagram: “the web” as many SERVERs, joined by many CLIENTs, all interconnected]
How the Web Works (sort of)

[Diagram: the client’s browser requests http://thissite.com/thispage; the request heads into the web toward the server hosting thissite.com/thispage]
How the Web Works (sort of)

Key fact: all the servers know all the other servers’ addresses, and know how to forward the message on properly.
How the Web Works (sort of)

[Diagram: the client issues “GET me: thissite.com/thispage”; servers across the web pass the request along toward thissite.com]
How the Web Works (sort of)

[Diagram: the “GET me: thissite.com/thispage” request is forwarded from server to server until it reaches the one hosting the page]
How the Web Works (sort of)

The server computes the response: the requested URL acts like function input, and the returned page (thissite.com/thispage.html) is the output.

Simple web page = static = little computation.
Complex page = dynamic (e.g. DB-backed) = more computation.
How the Web Works (sort of)

A typical request/response pair looks like: [screenshots of a request header and the corresponding response header]
How the Web Works (sort of) The real contents of the response. It’s HTML.
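In plain text, an exchange of this shape might look like the following (the hostname is the slides’ placeholder, and the headers shown are a minimal illustrative subset):

```
GET /thispage HTTP/1.1
Host: thissite.com
User-Agent: Mozilla/5.0

HTTP/1.1 200 OK
Content-Type: text/html

<html><body>...the page’s actual content...</body></html>
```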
How the Web Works: HTML

Your web browser renders HTML into something meaningful.
Data Wrangling Outline Revisited

Three easy pieces:
1. Scraping & Parsing Tools & Techniques: issue the right requests / transform the resulting HTML into a data structure more suited to analytical manipulation
2. Data “Cleaning” Tools & Techniques: correct and enrich the data structure
3. A bit of database technology (MongoDB primer): repackage the data structure and make it available to others, just as it was made available to you (but better)
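The “transform HTML into a data structure” step can be sketched with the standard library’s HTMLParser; this toy parser pulls every link out of a page (the page snippet and class name below are made up for illustration):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href and anchor text of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._in_a = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_a = True
            self.links.append({"href": dict(attrs).get("href"), "text": ""})

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_a = False

    def handle_data(self, data):
        if self._in_a:
            self.links[-1]["text"] += data

p = LinkExtractor()
p.feed('<html><body><a href="/a">First</a> <a href="/b">Second</a></body></html>')
# p.links is now a list of dicts, ready for analytical manipulation.
```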
Scraping

The idea of scraping: issue a GET request not through the browser but through some other route, so that you can direct the response to your own program to analyze its contents and extract its structured information (as opposed to having it rendered in the browser window).

Sub-issues of scraping:
- How to issue the request
- How to figure out which requests to issue in the first place
- How to extract (that is, parse) data from the response into a usable data structure
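Issuing a GET request outside the browser can be done with the standard library alone (urllib2 in the Python 2 of these slides; urllib.request in Python 3). A minimal sketch, using the slides’ placeholder URL:

```python
from urllib.request import Request, urlopen

# Build a GET request for the slides' placeholder URL. Some sites refuse
# requests that lack a browser-like User-Agent header, so we set one.
req = Request("http://thissite.com/thispage",
              headers={"User-Agent": "Mozilla/5.0"})

# urlopen(req) would send the request and return a response object whose
# .read() is the raw HTML body -- the same bytes the browser would render:
# html = urlopen(req).read().decode("utf-8")
```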
Scraping

How to issue the request: a spectrum of tools.

More like programming:
- Command-line tools: wget, curl

In between:
- python urllib, urllib2
- mechanize
- selenium

More like browsing:
- GUI web scrapers
Scraping: wget / curl
You can integrate it into your python scripts trivially:
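For instance, a minimal sketch that shells out to curl via subprocess (the flags are standard curl options; a wget equivalent is noted in the docstring):

```python
import subprocess

def curl_command(url, outfile):
    # Build the argument list: run curl silently and write the body to outfile.
    return ["curl", "--silent", "--output", outfile, url]

def fetch(url, outfile):
    """Shell out to curl and save the response body to outfile.

    curl must be on your PATH; with wget, the equivalent argument list
    would be ["wget", "-q", "-O", outfile, url].
    """
    return subprocess.run(curl_command(url, outfile)).returncode

# e.g. fetch("http://thissite.com/thispage", "thispage.html")
```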
Scraping: wget / curl

wget has all kinds of options: recursively getting many subpages of a page, following links in different ways, using passwords with secure pages, controlling how output files are named, setting request headers, etc.
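For instance (flag names taken from wget’s manual; the URL is the slides’ placeholder):

```
# Mirror a site two levels deep, keep only HTML files, wait 1s between requests:
wget --recursive --level=2 --accept html --wait=1 http://thissite.com/

# Save the response under a name you choose:
wget -O thispage.html http://thissite.com/thispage
```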
curl is basically the same
Scraping: wget / curl

NYTimes was a pretty simple example. Some are harder.
The resulting page just doesn’t have the stuff in it. But it had to have gotten to your computer somehow. Time for Firebug.