Topics of this coding session
Publication of crawling papers by year
Source: Claussen, Jörg and Peukert, Christian, Obtaining Data from the Internet: A Guide to Data Crawling in Management Research (June 2019).
Which python package is needed?
- Requests
- Beautiful Soup
- Scrapy
- Selenium
+ installing the package:
pip install BeautifulSoup4
# Import packages + set options
from IPython.display import display
import json
import pandas as pd
pd.options.display.max_columns = None # Display all columns of a dataframe
pd.options.display.max_rows = 700
from pprint import pprint
import re
HTTP protocol = way of communication between the client (browser) and the web server.
HTTPS protocol = the S stands for secured.
$\Rightarrow$ It works by exchanging Requests and Responses.
Notes:
All interactions between a client and a web server are split into a request and a response:
- Requests contain relevant data regarding your request call.
- Responses contain relevant data returned by the server.
import requests
url='https://ballotpedia.org/List_of_current_members_of_the_U.S._Congress'
response = requests.get(url)
Request's attributes
request = response.request
print('request: ',request)
print('-----')
print('url: ',request.url)
print('-----')
print('path_url: ', request.path_url)
print('-----')
print('Method: ', request.method)
print('-----')
print('Headers: ', request.headers)
Response's attributes
.text returns the response contents in Unicode format.
.content returns the response contents in bytes.
print('', response)
print('-----')
print('Text:', response.text[:50])
print('-----')
print('Status_code:', response.status_code)
print('-----')
print('Headers:', response.headers)
These attributes carry important information: whether your request was successful, whether it is missing data, or whether it is missing credentials.
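For instance, a minimal sketch of checking the status code of the response obtained above (the specific codes shown are just common examples):
# Check whether the request above succeeded
if response.status_code == 200:
    print('Success!')
elif response.status_code == 404:
    print('Page not found.')
print('ok:', response.ok)  # True for any status code below 400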
Query parameters, e.g. p?search=jerry:
- ? introduces the query string
- & separates multiple query parameters
Other example of URL: https://opendata.swiss/en/dataset?political_level=commune&q=health.
Try to change the search and selection parameters and observe how that affects your URL.
Next, try to change the values directly in your URL. See what happens when you paste the following URL into your browser’s address bar:
Conclusion: When you explore URLs, you can get information on how to retrieve data from the website’s server.
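The same query parameters can also be built from Python; here is a small sketch using the params argument of requests.get() (the opendata.swiss parameters above are reused purely as an example):
import requests

# requests builds the query string ?political_level=commune&q=health for us
params = {'political_level': 'commune', 'q': 'health'}
response = requests.get('https://opendata.swiss/en/dataset', params=params)
print(response.url)  # the full URL, with the encoded query string appended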
We use the inspect
function (right click) to access the underlying HTML interactively.
html is great but intricate $\Rightarrow$ it becomes much more digestible with beautifulsoup.
import requests
url='https://ballotpedia.org/List_of_current_members_of_the_U.S._Congress'
response = requests.get(url)
html=response.text
html[:500]
html looks messy.
Using the prettify()
function from BeautifulSoup
helps
# Parse raw HTML
from bs4 import BeautifulSoup # package for parsing HTML
soup = BeautifulSoup(html, 'html.parser') # parse html of web page
print(soup.prettify()[:1000])
Objective: extract the urls of the senators from the webpage, in order to build a list of urls that will be used for scraping info on each senator.
In an HTML web page, every element can have an id attribute assigned.
Can be used to directly access the element.
balance=soup.find(id='Leadership_and_partisan_balance')
print(balance.prettify()[:500])
Find the id & get the soup for the table entitled *List of current U.S. Senate members*.
officeholder_table=soup.find(id='officeholder-table')
print(officeholder_table.prettify()[:500])
Because the result is not unique, use find_all instead of find.
Let's rely on the html structure to find the rows of the table.
thead= officeholder_table.find('thead')
thead
rows=officeholder_table.find_all('tr')
len(rows) # consistent: 100 members + headline
url for one example row:
row=rows[1]
#row
tds=row.find_all('td')
tds[:4]
url= tds[1].find_all('a')
print('a list:', url)
print('its unique element', url[0])
print('url wanted', url[0]['href'] )
print('Text content', url[0].get_text())
Use the code for one row in order to build a loop that gives a list of all of the wanted urls.
list_url=[]
for row in rows[1:]:
    tds = row.find_all('td')
    url = tds[1].find_all('a')[0]['href']
    list_url.append(url)
list_url[:10]
Then, the same logic can be implemented to get the info from the senators' pages (e.g. https://ballotpedia.org/Jerry_Moran). The following code extracts info from the first 10 urls in the list scraped above.
from bs4 import NavigableString, Tag
# the dataframe in which we will put the scraper's output
df_parsed=pd.DataFrame()
for url in list_url[:10]:
    print('--------', url, '--------')
    # 1. Get the soup
    response = requests.get(url)
    html = response.text
    soup = BeautifulSoup(html, 'html.parser')  # parse html of web page
    dic_text_by_header = dict()
    # 2. Extract info from the soup:
    # get the text content between two headers (h2), skipping the last header
    for header in soup.find_all('h2')[:-1]:
        # print('--------', header.get_text())
        nextNode = header
        # use the nextSibling method to walk through the header's siblings
        while True:
            nextNode = nextNode.nextSibling
            if nextNode is None:
                break
            if isinstance(nextNode, Tag):
                if nextNode.name == "h2":
                    break
                # print(nextNode.get_text(strip=True).strip())
                # The result is put in a dictionary as a value for key=corresponding header
                dic_text_by_header[header.get_text()] = [nextNode.get_text(strip=True).strip()]
    # put the dictionary into a dataframe
    temp = pd.DataFrame.from_dict(dic_text_by_header)
    # Concat the temporary dataframe with the global one
    df_parsed = pd.concat([temp, df_parsed])
df_parsed.head()
Save the DataFrame in a pickle format
The pickle format stores python objects; it is supported by pandas (using to_pickle and read_pickle) and, more generally, by the pickle package.
The os package:
- os.getcwd(): fetches the current path
- os.path.dirname(): goes back to the parent directory
- os.path.join(): concatenates several paths
import os
parent_path=os.path.dirname(os.getcwd()) # os.getcwd() fetches the current path
data_path=os.path.join(parent_path, 'data')
df_parsed.to_pickle(os.path.join(data_path, 'df_senators.pickle'))
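A short sketch of loading the file back later with read_pickle (same path assumptions as above):
df_senators = pd.read_pickle(os.path.join(data_path, 'df_senators.pickle'))  # reload the saved DataFrame
df_senators.head()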
There are also dynamic websites: the server does not always send back plain HTML; your browser also receives and interprets JavaScript code that you cannot retrieve from the HTML. You receive JavaScript code that you cannot parse using beautiful soup, but that you would need to execute like a browser does.
Solutions:
requests-html
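A minimal sketch with requests-html (assuming pip install requests-html; note that render() downloads a headless Chromium the first time it runs, and the example URL is arbitrary):
from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://example.org')
r.html.render()  # executes the page's JavaScript in a headless browser
print(r.html.find('h1', first=True).text)  # query the rendered HTML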
A communication layer that allows different systems to talk to each other without having to understand exactly what each other does.
$\Rightarrow$ APIs provide programmable access to data.
The website Programmable Web lists more than 225,353 APIs from sites as diverse as Google, Amazon, YouTube, the New York Times, del.icio.us, LinkedIn, and many others.
Source: Programmable Web
You send a request for information or data, and you receive a response with what you requested (often in json format).
HTTP Method | Description | Requests method |
---|---|---|
POST | Create a new resource. | requests.post() |
GET | Read an existing resource. | requests.get() |
PUT | Update an existing resource. | requests.put() |
DELETE | Delete an existing resource. | requests.delete() |
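A hedged sketch of the non-GET methods, here against httpbin.org, a public echo service chosen only for illustration:
import requests

# httpbin.org echoes back whatever it receives, which is handy for testing
r = requests.post('https://httpbin.org/post', json={'name': 'Jerry'})
print(r.status_code, r.json()['json'])
r = requests.put('https://httpbin.org/put', data={'name': 'Jerry Moran'})
print(r.status_code)
r = requests.delete('https://httpbin.org/delete')
print(r.status_code)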
Forecasts from the Carbon Intensity API (includes CO2 emissions related to electricity generation only).
See the API documentation
import requests
headers = {
'Accept': 'application/json'
}
# fetch (or get) data from the URL
requests.get('https://api.carbonintensity.org.uk', params={}, headers = headers)
response = requests.get('https://api.carbonintensity.org.uk', params={}, headers = headers)
print(response.text[:500])
The intensity endpoint:
# Get Carbon Intensity data for the current half hour
r = requests.get('https://api.carbonintensity.org.uk/intensity', params={}, headers = headers)
# Different outputs:
print("--- text ---")
pprint(r.text)
print("--- Content ---")
pprint(r.content)
print("--- JSON---")
pprint(r.json())
json
A json object = a python dictionary.
# json objects work just like any other dictionary in Python
json=r.json()  # note: this name shadows the json module imported above
json['data']
# get the actual intensity value:
json['data'][0]['intensity']['actual']
r = requests.get('https://api.carbonintensity.org.uk/intensity/factors', params={}, headers = headers)
pprint(r.json())
# Get Carbon Intensity data for current half hour for GB regions
r = requests.get('https://api.carbonintensity.org.uk/regional', params={}, headers = headers)
#pprint(r.json())
Passing parameters in the url
# In the carbonintensity API, it works differently: the dates are placed directly in the url path
from_="2018-08-25T12:35Z"
to="2018-08-25T13:35Z"
r = requests.get('https://api.carbonintensity.org.uk/regional/intensity/{}/{}'.format(from_, to), params={}, headers = headers)
#pprint(r.json())
To prevent the collection of huge amounts of individual data, many APIs require you to obtain “credentials”, i.e. codes/passwords that identify you and determine which types of data you are allowed to access.
endpoint = "https://api.nasa.gov/mars-photos/api/v1/rovers/perseverance/photos"
# Replace DEMO_KEY below with your own key if you generated one.
api_key = "DEMO_KEY"
# You can add the API key to your request by appending the api_key= query parameter:
query_params = {"api_key": api_key, "earth_date": "2021-02-27"}
response = requests.get(endpoint, params=query_params)
response
Authentication was a success!
response.json()
photos = response.json()["photos"]
print(f"Found {len(photos)} photos")
photos[50]["img_src"]
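As a follow-up, a small sketch that downloads that image to disk (the local filename is just an example):
# Fetch the image bytes and save them locally
img_url = photos[50]["img_src"]
img_bytes = requests.get(img_url).content
with open("perseverance_photo.jpg", "wb") as f:
    f.write(img_bytes)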
Please fill in this short survey about the class.