Topics of this coding session
Publication of crawling papers by year
Source: Claussen, Jörg and Peukert, Christian, Obtaining Data from the Internet: A Guide to Data Crawling in Management Research (June 2019).
Which python package is needed?
- Requests
- Beautiful Soup
- Scrapy
- Selenium
+ installing the package:
pip install BeautifulSoup4
# Import packages + set options
from IPython.display import display
import json
import pandas as pd
pd.options.display.max_columns = None # Display all columns of a dataframe
pd.options.display.max_rows = 700
from pprint import pprint
import re
HTTP protocol = way of communication between the client (browser) and the web server.
HTTPS protocol = the S stands for secured.
$\Rightarrow$ It works by exchanging Requests and Responses.
Notes:
All interactions between a client and a web server are split into a request and a response:
- Requests contain relevant data regarding your request call.
- Responses contain relevant data returned by the server.
import requests
url='https://ballotpedia.org/List_of_current_members_of_the_U.S._Congress'
response = requests.get(url)
Request's attributes
request = response.request
print('request: ',request)
print('-----')
print('url: ',request.url)
print('-----')
print('path_url: ', request.path_url)
print('-----')
print('Method: ', request.method)
print('-----')
print('Headers: ', request.headers)
Response's attributes
.text returns the response contents in Unicode format.
.content returns the response contents in bytes.
print('', response)
print('-----')
print('Text:', response.text[:50])
print('-----')
print('Status_code:', response.status_code)
print('-----')
print('Headers:', response.headers)
These attributes carry important information: whether your request was successful, whether it is missing data, or whether it is missing credentials.
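For instance, a minimal sketch of checking the status code of the response obtained above (the specific codes shown are just common examples):
# Check whether the request above succeeded
if response.status_code == 200:
    print('Success!')
elif response.status_code == 404:
    print('Page not found.')
print('ok:', response.ok)  # True for any status code below 400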
Query parameters, e.g. p?search=jerry:
- ? introduces the query string
- & separates multiple query parameters
Other example of URL: https://opendata.swiss/en/dataset?political_level=commune&q=health.
Try to change the search and selection parameters and observe how that affects your URL.
Next, try to change the values directly in your URL. See what happens when you paste the following URL into your browser’s address bar:
Conclusion: When you explore URLs, you can get information on how to retrieve data from the website’s server.
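The same query parameters can also be built from Python; here is a small sketch using the params argument of requests.get() (the opendata.swiss parameters above are reused purely as an example):
import requests

# requests builds the query string ?political_level=commune&q=health for us
params = {'political_level': 'commune', 'q': 'health'}
response = requests.get('https://opendata.swiss/en/dataset', params=params)
print(response.url)  # the full URL, with the encoded query string appended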
We use the inspect
function (right click) to access the underlying HTML interactively.
html is great but intricate $\Rightarrow$ it becomes much more digestible with beautifulsoup.
import requests
url='https://ballotpedia.org/List_of_current_members_of_the_U.S._Congress'
response = requests.get(url)
html=response.text
html[:500]
html looks messy.
Using the prettify()
function from BeautifulSoup
helps
# Parse raw HTML
from bs4 import BeautifulSoup # package for parsing HTML
soup = BeautifulSoup(html, 'html.parser') # parse html of web page
print(soup.prettify()[:1000])
Objective: extract the urls of the senators from the webpage, in order to build a list of urls that will be used for scraping info on each senator.
In an HTML web page, every element can have an id attribute assigned.
Can be used to directly access the element.
balance=soup.find(id='Leadership_and_partisan_balance')
print(balance.prettify()[:500])
Find the id & get the soup for the table entitled *List of current U.S. Senate members*.
officeholder_table=soup.find(id='officeholder-table')
print(officeholder_table.prettify()[:500])
Because the result is not unique, use find_all instead of find.
Let's rely on the html structure to find the rows of the table.
thead= officeholder_table.find('thead')
thead
rows=officeholder_table.find_all('tr')
len(rows) # consistent: 100 members + headline
url for one example row:
row=rows[1]
#row
tds=row.find_all('td')
tds[:4]
url= tds[1].find_all('a')
print('a list:', url)
print('its unique element', url[0])
print('url wanted', url[0]['href'] )
print('Text content', url[0].get_text())
Use the code for one row in order to build a loop that gives a list of all of the wanted urls.
list_url=[]
for row in rows[1:]:
    tds = row.find_all('td')
    url = tds[1].find_all('a')[0]['href']
    list_url.append(url)
list_url[:10]
Then, the same logic can be implemented to get the info from the senators' pages (e.g. https://ballotpedia.org/Jerry_Moran). The following code extracts info from the first 10 urls in the list scraped above.
from bs4 import NavigableString, Tag
# the dataframe in which we will put the scraper's output
df_parsed=pd.DataFrame()
for url in list_url[:10]:
    print('--------', url, '--------')
    # 1. Get the soup
    response = requests.get(url)
    html = response.text
    soup = BeautifulSoup(html, 'html.parser')  # parse html of web page
    dic_text_by_header = dict()
    # 2. Extract info from the soup:
    # get the text content between two headers (h2), skipping the last header
    for header in soup.find_all('h2')[:-1]:
        # print('--------', header.get_text())
        nextNode = header
        # use the nextSibling method to walk through the header's siblings
        while True:
            nextNode = nextNode.nextSibling
            if nextNode is None:
                break
            if isinstance(nextNode, Tag):
                if nextNode.name == "h2":
                    break
                # print(nextNode.get_text(strip=True).strip())
                # The result is put in a dictionary as a value for key=corresponding header
                dic_text_by_header[header.get_text()] = [nextNode.get_text(strip=True).strip()]
    # put the dictionary into a dataframe
    temp = pd.DataFrame.from_dict(dic_text_by_header)
    # Concat the temporary dataframe with the global one
    df_parsed = pd.concat([temp, df_parsed])
df_parsed.head()
Save the DataFrame in a pickle format
The pickle format stores python objects; it is supported by pandas (using to_pickle and read_pickle) and, more generally, by the pickle package.
The os package:
- os.getcwd(): fetches the current path
- os.path.dirname(): goes back to the parent directory
- os.path.join(): concatenates several paths
import os
parent_path=os.path.dirname(os.getcwd()) # os.getcwd() fetches the current path
data_path=os.path.join(parent_path, 'data')
df_parsed.to_pickle(os.path.join(data_path, 'df_senators.pickle'))
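A short sketch of loading the file back later with read_pickle (same path assumptions as above):
df_senators = pd.read_pickle(os.path.join(data_path, 'df_senators.pickle'))  # reload the saved DataFrame
df_senators.head()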
There are also dynamic websites: the server does not always send back plain HTML; your browser also receives and interprets JavaScript code that you cannot retrieve from the HTML. You receive JavaScript code that you cannot parse using beautiful soup, but that you would need to execute like a browser does.
Solutions:
requests-html
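A minimal sketch with requests-html (assuming pip install requests-html; note that render() downloads a headless Chromium the first time it runs, and the example URL is arbitrary):
from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://example.org')
r.html.render()  # executes the page's JavaScript in a headless browser
print(r.html.find('h1', first=True).text)  # query the rendered HTML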
A communication layer that allows different systems to talk to each other without having to understand exactly what each other does.
$\Rightarrow$ APIs provide programmable access to data.
The website Programmable Web lists more than 225,353 APIs from sites as diverse as Google, Amazon, YouTube, the New York Times, del.icio.us, LinkedIn, and many others.
Source: Programmable Web
You send a request for information or data, and you receive a response with what you requested (often in json format).
HTTP Method | Description | Requests method |
---|---|---|
POST | Create a new resource. | requests.post() |
GET | Read an existing resource. | requests.get() |
PUT | Update an existing resource. | requests.put() |
DELETE | Delete an existing resource. | requests.delete() |
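A hedged sketch of the non-GET methods, here against httpbin.org, a public echo service chosen only for illustration:
import requests

# httpbin.org echoes back whatever it receives, which is handy for testing
r = requests.post('https://httpbin.org/post', json={'name': 'Jerry'})
print(r.status_code, r.json()['json'])
r = requests.put('https://httpbin.org/put', data={'name': 'Jerry Moran'})
print(r.status_code)
r = requests.delete('https://httpbin.org/delete')
print(r.status_code)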
Forecasts from the Carbon Intensity API (includes CO2 emissions related to electricity generation only).
See the API documentation
import requests
headers = {
'Accept': 'application/json'
}
# fetch (or get) data from the URL
requests.get('https://api.carbonintensity.org.uk', params={}, headers = headers)
response = requests.get('https://api.carbonintensity.org.uk', params={}, headers = headers)
print(response.text[:500])
The intensity endpoint:
# Get Carbon Intensity data for the current half hour
r = requests.get('https://api.carbonintensity.org.uk/intensity', params={}, headers = headers)
# Different outputs:
print("--- text ---")
pprint(r.text)
print("--- Content ---")
pprint(r.content)
print("--- JSON---")
pprint(r.json())
json
A json object = a python dictionary.
# json objects work just like any other dictionary in Python
json=r.json()  # note: this name shadows the json module imported above
json['data']
# get the actual intensity value:
json['data'][0]['intensity']['actual']
r = requests.get('https://api.carbonintensity.org.uk/intensity/factors', params={}, headers = headers)
pprint(r.json())
# Get Carbon Intensity data for current half hour for GB regions
r = requests.get('https://api.carbonintensity.org.uk/regional', params={}, headers = headers)
#pprint(r.json())
Passing parameters in the url
# In the carbonintensity API, it works differently: the dates are placed directly in the url path
from_="2018-08-25T12:35Z"
to="2018-08-25T13:35Z"
r = requests.get('https://api.carbonintensity.org.uk/regional/intensity/{}/{}'.format(from_, to), params={}, headers = headers)
#pprint(r.json())
To prevent the collection of huge amounts of individual data, many APIs require you to obtain “credentials”, i.e. codes/passwords that identify you and determine which types of data you are allowed to access.
endpoint = "https://api.nasa.gov/mars-photos/api/v1/rovers/perseverance/photos"
# Replace DEMO_KEY below with your own key if you generated one.
api_key = "DEMO_KEY"
# You can add the API key to your request by appending the api_key= query parameter:
query_params = {"api_key": api_key, "earth_date": "2021-02-27"}
response = requests.get(endpoint, params=query_params)
response
Authentication was a success!
response.json()
photos = response.json()["photos"]
print(f"Found {len(photos)} photos")
photos[50]["img_src"]
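As a follow-up, a small sketch that downloads that image to disk (the local filename is just an example):
# Fetch the image bytes and save them locally
img_url = photos[50]["img_src"]
img_bytes = requests.get(img_url).content
with open("perseverance_photo.jpg", "wb") as f:
    f.write(img_bytes)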
Please fill in this short survey about the class.