Objective: get structured & tidy data
= The process of transforming raw data into a set of data tables that can be used for a variety of downstream purposes, such as analytics
Name | Treatment A | Treatment B |
---|---|---|
John Smith | - | 2 |
Jane Doe | 16 | 11 |
Mary Johnson | 3 | 1 |
Name | Treatment | Result |
---|---|---|
John Smith | a | - |
Jane Doe | a | 16 |
Mary Johnson | a | 3 |
John Smith | b | 2 |
Jane Doe | b | 11 |
Mary Johnson | b | 1 |
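The step from the first (wide) table to the second (tidy) table can be sketched with `melt`. This is a minimal sketch, assuming the "-" entry stands for a missing value; the variable names `wide` and `tidy` are illustrative only.

```python
import pandas as pd

# Wide layout: one column per treatment.
# Assumption: the "-" in the table is a missing value, stored here as None.
wide = pd.DataFrame(
    {
        "Name": ["John Smith", "Jane Doe", "Mary Johnson"],
        "Treatment A": [None, 16, 3],
        "Treatment B": [2, 11, 1],
    }
)

# Tidy layout: one row per (person, treatment) observation.
tidy = wide.melt(id_vars="Name", var_name="Treatment", value_name="Result")

# Shorten "Treatment A" / "Treatment B" to "a" / "b", as in the second table.
tidy["Treatment"] = (
    tidy["Treatment"].str.replace("Treatment ", "", regex=False).str.lower()
)
print(tidy)
```

Each row of `tidy` is now a single observation, which is the layout most pandas operations expect.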
⇒ we will be mainly working with structured data + learning how to go from unstructured to structured
What types of variables do you know?
Quantitative vs. qualitative variables
Stock vs. flow variables
Missing values need to be identified & should be counted
Potential selection bias: is data missing at random?
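Counting missing values per variable might look like the sketch below. The small table and its column names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in both columns.
df = pd.DataFrame(
    {"age": [22, np.nan, 58, 35], "fare": [7.25, 71.28, np.nan, np.nan]}
)

# isna() marks each missing cell; summing counts them per column.
missing_counts = df.isna().sum()

# The mean of the same boolean table gives the share of missing values.
missing_share = df.isna().mean()

print(missing_counts)
print(missing_share)
```

Whether the gaps are missing at random cannot be read off these counts; they only show where to look.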
Solutions:
Extreme values: substantially larger or smaller values for one or a handful of observations.
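One simple way to spot observations with substantially larger or smaller values is to list the largest and smallest ones and compare the mean with the median. This is a sketch on made-up numbers, not a rule for deciding what to do with such observations.

```python
import pandas as pd

# Hypothetical variable with one value far above the rest.
revenue = pd.Series([12, 15, 9, 14, 11, 950], name="revenue")

# Inspect both ends of the distribution.
largest = revenue.nlargest(2)
smallest = revenue.nsmallest(2)

# A single extreme value pulls the mean far away from the median.
print(largest)
print(revenue.mean(), revenue.median())
```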
Pandas
= central piece of the Python ecosystem for data science
The DataFrame:
pandas: package for handling data
os: package for handling paths
import pandas as pd
Let's create a table by hand
df = pd.DataFrame(
    {
        "Name": [
            "Braund, Mr. Owen Harris",
            "Allen, Mr. William Henry",
            "Bonnell, Miss. Elizabeth",
        ],
        "Age": [22, 35, 58],
        "Sex": ["male", "male", "female"],
    }
)
 | Name | Age | Sex |
---|---|---|---|
0 | Braund, Mr. Owen Harris | 22 | male |
1 | Allen, Mr. William Henry | 35 | male |
2 | Bonnell, Miss. Elizabeth | 58 | female |
Each column in a DataFrame is a Series: cf. exercise
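A quick check of this, using a shortened version of the hand-made table: selecting one column with single brackets returns a Series, while double brackets keep a DataFrame.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Name": ["Braund, Mr. Owen Harris", "Bonnell, Miss. Elizabeth"],
        "Age": [22, 58],
    }
)

# Single brackets: one column, returned as a Series.
ages = df["Age"]
print(type(ages))

# Double brackets: a list of columns, returned as a DataFrame.
ages_table = df[["Age"]]
print(type(ages_table))

# A Series has its own methods, e.g. the maximum.
print(ages.max())
```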
titanic = pd.read_csv("data/titanic.csv")
titanic.to_excel("titanic.xlsx", sheet_name="passengers", index=False)
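Reading and writing can be tried without the Titanic file. The sketch below writes a small table to a temporary CSV file and reads it back; the file name `passengers.csv` and the table are made up, and CSV is used instead of Excel so that no extra Excel engine is needed.

```python
import os
import tempfile

import pandas as pd

passengers = pd.DataFrame(
    {
        "Name": ["Braund, Mr. Owen Harris", "Bonnell, Miss. Elizabeth"],
        "Age": [22, 58],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    # os.path.join builds a path that works on any operating system.
    path = os.path.join(tmp_dir, "passengers.csv")

    # index=False leaves the row labels out of the file.
    passengers.to_csv(path, index=False)

    # Read the file back into a new DataFrame.
    reloaded = pd.read_csv(path)

print(reloaded)
```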
Select only some of the columns:
df["Name"]
df[["Name", "Age"]]
Select only some of the rows:
above_35 = titanic[titanic["Age"] > 35]
above_35.head()
Select a combination of rows and columns:
adult_names = titanic.loc[titanic["Age"] > 35, "Name"]
Take a few minutes to :
df.head()
Combine values across a column into a single value
titanic["Age"].mean()
titanic[["Age", "Fare"]].median()
Apply a function to every row, possibly creating more or fewer columns
titanic["Initial"] = titanic["Name"].map(lambda x: x[0])
titanic["survived labeled"] = titanic["survived"].map({0: "died", 1: "survived"})
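Both uses of `map` shown above, with a function and with a dictionary, can be tried on a small made-up table. The column names `Survived` and `Outcome` are illustrative.

```python
import pandas as pd

people = pd.DataFrame(
    {
        "Name": ["Braund, Mr. Owen Harris", "Bonnell, Miss. Elizabeth"],
        "Survived": [0, 1],
    }
)

# With a function: applied to each value in turn.
people["Initial"] = people["Name"].map(lambda name: name[0])

# With a dictionary: each value is looked up as a key.
# Values that are not keys of the dictionary become NaN.
people["Outcome"] = people["Survived"].map({0: "died", 1: "survived"})

print(people)
```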
Split the data into groups and compute a statistic for each group
titanic[["Sex", "Age"]].groupby("Sex").mean()
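The same group-wise mean on the small hand-made table, as a sketch that can be run without the Titanic file:

```python
import pandas as pd

people = pd.DataFrame(
    {"Age": [22, 35, 58], "Sex": ["male", "male", "female"]}
)

# Split the rows by Sex, then average Age within each group.
mean_age_by_sex = people.groupby("Sex")["Age"].mean()
print(mean_age_by_sex)
```

The result is indexed by the group labels, so `mean_age_by_sex["male"]` returns the mean age of the male rows.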
result = pd.concat([df1, df2], axis=0) # Concatenating row-wise
result = pd.concat([df1, df3], axis=1) # Concatenating column-wise
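The two `concat` calls refer to `df1`, `df2` and `df3`, which are not defined in these notes. A minimal sketch with made-up tables:

```python
import pandas as pd

# Hypothetical inputs: df1 and df2 share columns, df3 adds a new column.
df1 = pd.DataFrame({"Name": ["Braund", "Allen"], "Age": [22, 35]})
df2 = pd.DataFrame({"Name": ["Bonnell", "Doe"], "Age": [58, 41]})
df3 = pd.DataFrame({"Sex": ["male", "male"]})

# Row-wise: stack df2 under df1; ignore_index renumbers the rows 0..3.
stacked = pd.concat([df1, df2], axis=0, ignore_index=True)

# Column-wise: place df3 next to df1, aligning on the row index.
side_by_side = pd.concat([df1, df3], axis=1)

print(stacked)
print(side_by_side)
```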
pd.merge(left, right, how='outer', left_on="key", right_on="key", suffixes=('_left', '_right'), indicator=False)
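A small sketch of an outer merge, with made-up `left` and `right` tables. Here `indicator=True` is used, unlike in the call above, to show which table each row came from.

```python
import pandas as pd

left = pd.DataFrame({"key": [1, 2, 3], "value": ["a", "b", "c"]})
right = pd.DataFrame({"key": [2, 3, 4], "value": ["x", "y", "z"]})

merged = pd.merge(
    left,
    right,
    how="outer",                    # keep keys found in either table
    on="key",                       # same as left_on="key", right_on="key"
    suffixes=("_left", "_right"),   # disambiguate the two "value" columns
    indicator=True,                 # adds a "_merge" column
)
print(merged)
```

Keys 2 and 3 appear in both tables; key 1 is marked `left_only` and key 4 `right_only`, with missing values filled in on the other side.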
Look at the data
Transform the variables into a known type
The type matters for what we do with them
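Converting variables to a known type could look like the sketch below. The raw table is made up; `errors="coerce"` is one possible choice that turns unparseable entries into missing values.

```python
import pandas as pd

# Hypothetical raw data: everything was read in as text.
raw = pd.DataFrame(
    {"age": ["22", "35", "unknown"], "sex": ["male", "male", "female"]}
)
print(raw.dtypes)

# Quantitative variable: convert to numbers; "unknown" becomes NaN.
raw["age"] = pd.to_numeric(raw["age"], errors="coerce")

# Qualitative variable: store as a categorical type.
raw["sex"] = raw["sex"].astype("category")

print(raw.dtypes)
# The mean is only available once age is numeric.
print(raw["age"].mean())
```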
To check data cleaning (part of iterative process)
To guide subsequent analysis (for further analysis)
To give context to the results of subsequent analysis (for interpretation)
To ask additional questions (for specifying the (research) question)
Offer simple, but possibly important answers to questions.
For any given variable, a statistic is a meaningful number that we can compute from a dataset.
Basic summary statistics describe the most important features of distributions of variables.
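In pandas, the basic summary statistics of a variable are available through `describe`. A sketch on the three ages from the hand-made table:

```python
import pandas as pd

ages = pd.Series([22, 35, 58], name="Age")

# count, mean, std, min, quartiles and max in one call.
stats = ages.describe()
print(stats)

# Individual statistics can be read off by label.
print(stats["50%"])  # the median
```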
All variables have a distribution.
The distribution of a variable tells the
Beware of
Histogram reveals important properties of a distribution.
Number and location of modes
Approximate regions for center and tails
Symmetric or not
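The numbers behind a histogram can be computed without plotting. This sketch uses made-up values and `numpy.histogram`; the choice of four bins is arbitrary.

```python
import numpy as np
import pandas as pd

# Hypothetical variable with one value far to the right.
values = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 5, 12])

# Count how many observations fall into each of 4 equal-width bins.
counts, bin_edges = np.histogram(values, bins=4)
print(counts)
print(bin_edges)

# The most frequent value.
print(values.mode().iloc[0])

# Positive skewness indicates a longer right tail.
print(values.skew())
```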
Theoretical distributions can be helpful
Have well-known properties!
If a variable in our data is well approximated by a theoretical distribution, we can attribute the properties of that distribution to the variable
In real life, many variables are surprisingly close to theoretical distributions.
Will be useful when generalizing from data
Histogram is bell-shaped
Distribution is captured by two parameters: the mean and the standard deviation
Symmetric = median, mean (and mode) are the same.
Example: height of people, IQs, etc.
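These properties can be checked on simulated data. The sketch assumes the bell-shaped distribution described here is the normal distribution; the mean of 170 and standard deviation of 10 are made-up values.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical heights: normal with mean 170 and standard deviation 10.
heights = rng.normal(loc=170, scale=10, size=100_000)

# Symmetry: mean and median are (almost) the same.
mean = heights.mean()
median = np.median(heights)
print(mean, median)

# Share of observations within one standard deviation of the mean.
share_within_one_sd = (np.abs(heights - mean) <= heights.std()).mean()
print(share_within_one_sd)
```

For a normal distribution, roughly 68% of observations lie within one standard deviation of the mean.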
Asymmetrically distributed with long right tails.
Steps:
Always non-negative
Example: distributions of income or firm size.
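A sketch on simulated data, assuming the asymmetric distribution described here is the log-normal distribution, that is, a variable whose logarithm is normally distributed. The parameters are made up.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical incomes: the log of the variable is normal(0, 1).
incomes = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)

# Always non-negative.
print(incomes.min())

# Long right tail: the mean lies above the median.
print(incomes.mean(), np.median(incomes))

# Taking logs recovers a symmetric, bell-shaped variable.
log_incomes = np.log(incomes)
print(log_incomes.mean(), np.median(log_incomes))
```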
Also called the Pareto distribution
Very large extreme values are well approximated
Relative frequency of close-by values is the same among large and small values
Real world: many examples, but often not the whole distribution
Example: frequency of words, city population, wealth
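One way to read the point about relative frequencies: under a Pareto distribution, the share of observations above 2x relative to the share above x is the same whether x is small or large. The sketch below checks this on simulated data; the shape parameter `alpha` and minimum value `x_min` are made-up values.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical parameters: shape alpha and minimum value x_min.
alpha, x_min = 2.0, 1.0

# numpy's pareto() draws a shifted variant; adding 1 and scaling by x_min
# gives the classical Pareto distribution starting at x_min.
sample = (rng.pareto(alpha, size=1_000_000) + 1) * x_min


def tail_ratio(x):
    """Share of observations above 2x relative to the share above x."""
    return (sample > 2 * x).mean() / (sample > x).mean()


# The ratio is close to 2 ** -alpha = 0.25 for small and for large x.
ratio_small = tail_ratio(1.5)
ratio_large = tail_ratio(5.0)
print(ratio_small, ratio_large)
```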