Data Wrangling Techniques - MIT

The GovData Project MIT-Harvard Winter Course 2011

Module 2: Data Wrangling Techniques

Data Wrangling Outline
Three easy pieces:
1. Scraping & Parsing Tools & Techniques
2. Data “Cleaning” Tools & Techniques
3. A bit of database technology (MongoDB primer)

Data Wrangling Outline
Motivations:
1. Scraping & Parsing Tools & Techniques: because the web, especially complex data portals, contains lots of data
2. Data “Cleaning” Tools & Techniques: because the data, even coming from DB-backed sites, is often “dirty”
3. A bit of database technology (MongoDB primer): because you want to be able to serve up the data too

Data Wrangling Outline
Three easy pieces:
prelude: How the web works: request / response
1. Scraping
interlude: How the web works: HTML
... and then Parsing Tools & Techniques
2. Data “Cleaning” Tools & Techniques
3. A bit of database technology (MongoDB primer)

How the Web Works (sort of)
SERVER: a (powerful) computer that hosts a webpage.
CLIENT: your computer, where you view the webpage.

How the Web Works (sort of)
[Diagram sequence: the web as a network of interconnected servers, with many clients attached to it.]

How the Web Works (sort of)
You type http://thissite.com/thispage into your browser.
[Diagram: your client is connected to the web of servers; somewhere in that web sits the server for thissite.com/thispage.]

How the Web Works (sort of)
Key fact: all the servers know all the other servers’ addresses, and know how to forward the message on properly.

How the Web Works (sort of)
Your client issues the request “GET me: thissite.com/thispage”.
[Diagram sequence: the request is handed from server to server across the web until it reaches the server for thissite.com/thispage.]

How the Web Works (sort of)
The server computes the response. The request (thissite.com/thispage.html) is like function input.
Simple web page = static = little computation. Complex page = dynamic (e.g. DB-backed) = more computation.

How the Web Works (sort of)
[Diagram sequence: the computed response travels back through the web of servers to your client.]

How the Web Works (sort of)
A typical request/response pair looks like: [screenshots: the request, then the response header]
The real contents of the response: it’s HTML.
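A minimal sketch of inspecting such a request/response pair yourself, in Python 2 (urllib2 is introduced later in this module; the URL is just a stand-in for whatever page the slides showed):

    import urllib2

    # Issue a GET request for a page.
    response = urllib2.urlopen("http://www.nytimes.com/")

    # The response header: metadata about what came back.
    print response.getcode()   # status code, e.g. 200
    print response.info()      # headers: Content-Type, Date, ...

    # The real contents of the response: HTML.
    html = response.read()
    print html[:500]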

How the Web Works: HTML
Your web browser renders HTML into something meaningful.

Data Wrangling Outline Revisited
Three easy pieces:
1. Scraping & Parsing Tools & Techniques: issue the right requests / transform the resulting HTML into a data structure more suited to analytical manipulation
2. Data “Cleaning” Tools & Techniques: correct and enrich the data structure
3. A bit of database technology (MongoDB primer): repackage the data structure and make it available to others, just as it was made available to you (but better)

Scraping
The idea of scraping: issue a GET request not through the browser, but instead through some other route, so as to be able to direct the response to your own program, analyze its contents, and extract its structured information (as opposed to having it rendered in the browser window).
Sub-issues of scraping:
- How to issue the request
- How to figure out which requests to issue in the first place
- How to extract (that is, parse) data from the response into a usable data structure (a parsing sketch follows below)
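For the parsing sub-issue, one common Python choice (not necessarily the one this course settles on) is BeautifulSoup; a minimal sketch that turns a response into a plain Python data structure:

    import urllib2
    from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3, Python 2 era

    html = urllib2.urlopen("http://www.example.com/").read()  # stand-in URL
    soup = BeautifulSoup(html)

    # Extract every link target into a list: a structure far better
    # suited to analytical manipulation than raw HTML.
    links = [a["href"] for a in soup.findAll("a", href=True)]
    print links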

Scraping
How to issue the request: a spectrum of tools, from more like programming to more like browsing:
- Command-line tools: wget, curl
- Python libraries: urllib, urllib2
- mechanize
- selenium
- GUI web scrapers
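At the “more like browsing” end, mechanize drives a stateful browser object from Python, keeping cookies and handling forms; a rough sketch (the URL is hypothetical):

    import mechanize

    br = mechanize.Browser()
    br.set_handle_robots(False)          # ignore robots.txt (use judiciously)
    br.open("http://www.example.com/")   # hypothetical starting page

    # mechanize can enumerate and follow links, much as a user would.
    for link in br.links():
        print link.url, link.text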

Scraping: wget / curl

You can integrate them into your Python scripts trivially:
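The slide’s code isn’t reproduced here, but a sketch of the idea: call wget (or curl) through subprocess (the NYTimes URL stands in for the example page used on the slides):

    import subprocess

    url = "http://www.nytimes.com/"

    # Fetch the page with wget, saving it to a named local file.
    subprocess.check_call(["wget", "-O", "nytimes.html", url])

    # curl is basically the same; it writes to stdout unless told otherwise.
    subprocess.check_call(["curl", "-o", "nytimes.html", url])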


Scraping: wget / curl
wget has all kinds of options: recursively getting many subpages of a page, following links in different ways, using passwords with secure pages, controlling how output files are named, configuring request headers, etc.
curl is basically the same.
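A sketch of a few such options (these flag spellings are standard wget; the exact invocation from the slides isn’t preserved):

    import subprocess

    subprocess.check_call([
        "wget",
        "--recursive",                       # fetch subpages too
        "--level=2",                         # ... but only two hops deep
        "--accept=html,htm",                 # follow/keep only HTML files
        "--directory-prefix=out",            # control where output lands
        "--header=User-Agent: my-scraper",   # configure a request header
        "http://www.example.com/",           # hypothetical starting URL
    ])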

Scraping: wget / curl
NYTimes was a pretty simple example. Some are harder.
The resulting page just doesn’t have the stuff in it. But it had to have gotten to your computer somehow. Time for Firebug.
[Screenshot sequence: using Firebug to watch the requests the page actually makes, which reveals the real URL:]
http://www.tiffany.com/shared/media/products/26598044_l_over_M_3.jpg
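Once Firebug reveals the real request, you can issue it directly yourself; a minimal sketch using the image URL above (the local filename is arbitrary):

    import urllib

    # Fetch the image Firebug uncovered, bypassing the page that loads it.
    url = ("http://www.tiffany.com/shared/media/"
           "products/26598044_l_over_M_3.jpg")
    urllib.urlretrieve(url, "26598044_l_over_M_3.jpg")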


Scraping: urllib / urllib2
Python’s urllib and urllib2 are pure-Python libraries for doing similar things.
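A minimal urllib2 sketch of the same kind of fetch (the custom User-Agent header is an optional extra, included because some sites reject the default one):

    import urllib2

    request = urllib2.Request(
        "http://www.nytimes.com/",              # stand-in URL
        headers={"User-Agent": "my-scraper"},   # optional custom header
    )
    html = urllib2.urlopen(request).read()
    print len(html)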

Scraping: wget / curl
So let’s look, e.g., at BEA. Navigate our way through to the page we want. What happens when we click on “Download All Years”? Something downloads. From where?

Scraping: wget / curl
Let’s look at the source: [screenshot: the page source reveals the download URL]
All right, let’s try it: [terminal screenshot]
Stuck on this... Go to Firebug and look for something fancy? NO!
We forgot to use quotes: the URL contains characters the shell interprets (like “&”) unless the URL is quoted.
OK!
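The lesson, restated: the download URL contains “&”, which the shell treats as its own operator unless the URL is quoted. In Python, passing the URL as a single list element to subprocess avoids the shell entirely (the URL below is a hypothetical stand-in for the real BEA link):

    import subprocess

    # Hypothetical stand-in for the BEA "Download All Years" URL;
    # note the "&", which an unquoted shell command would misinterpret.
    url = "http://www.bea.gov/download?table=1&years=all"

    # Each list element reaches wget as-is: no shell, no quoting bugs.
    subprocess.check_call(["wget", "-O", "bea_all_years.csv", url])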

Scraping: wget / curl
How to figure out which requests to issue in the first place?