Foundations in Data Science and Machine Learning¶

Module 4: Data management with pandas¶

Malka Guillot¶

HSG Logo

What is (modern) pandas?¶

In [2]:
import pandas as pd

What is pandas?¶

  • Industry standard DataFrame library in Python
  • Covers all you need for data management
    • Loading datasets in many formats
    • Cleaning data
    • Generating variables
    • Reshaping datasets
  • Compatible with all plotting and statistics libraries

Modern pandas?¶

  • Pandas was created in 2008 and has some baggage
  • With version 3.0 many things will improve
  • Those features can already be enabled now:
    • More speed and less memory usage through better dtypes
    • Less confusion through copy-on-write
    • Better handling of missing values
    • Removal of the inplace argument

How to use modern pandas¶

  • Install version 2.1 or higher of pandas
  • Install version 13.0 or higher of pyarrow
  • Set some options after import
  • When loading datasets, use engine="pyarrow" if available
import pandas as pd
pd.options.mode.copy_on_write = True    # copy-on-write behaviour (planned default in pandas 3.0)
pd.options.future.infer_string = True   # infer Arrow-backed string dtype instead of object

What is a DataFrame?¶

  • Tabular data format, typically loaded from a file
  • Two mental models:
    1. Matrix/Array with labels
    2. Dictionary of columns
  • Variables are columns
  • Observations are rows
  • Can be manipulated in Python

DataFrame and Series¶

Creating a DataFrame¶

  • Data for a DataFrame can be nested lists or other array-like structures
  • Columns and index can be ints, strings and tuples
  • Powerful strategy: Whenever you learn pandas or debug problems, create tiny DataFrames and Series to gain better understanding
In [3]:
df = pd.DataFrame(
    data=[[1, "bla"], [3, "blubb"]],
    columns=["a", "b"],
    index=["c", "d"]
)
df
Out[3]:
a b
c 1 bla
d 3 blubb

Creating a DataFrame from a dictionary¶

In [4]:
df = pd.DataFrame(
    data={
        "a": [1, 3],
        "b": ["bla", "blubb"]
    },
    index=["c", "d"]
)
df
Out[4]:
a b
c 1 bla
d 3 blubb

What is a Series?¶

  • Each column of a DataFrame is a Series
  • Mental model: Vector with an index
  • All entries in a Series have the same dtype

Creating a Series¶

In [5]:
pd.Series(
    [3.0, 4.5], index=["x", "y"],
)
Out[5]:
x    3.0
y    4.5
dtype: float64

Assignment is index aligned!¶

  • New columns can be assigned with square brackets
  • Index is automatically aligned!
    • Makes many things safer!
    • Can make pandas slow
In [6]:
sr = pd.Series(
    [3.0, 4.5], index=["c", "d"],
)
df["new_col"] = sr
df
Out[6]:
a b new_col
c 1 bla 3.0
d 3 blubb 4.5

Data types¶

In [7]:
df = pd.read_csv(
    "https://raw.githubusercontent.com/HSG-8276-Spring-2025/class-ressources/refs/heads/main/data/winemag-data-130k-v3.csv", index_col=0,
    engine="pyarrow"
)
df.head()
Out[7]:
Unnamed: 0 serial country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
0 0 0 Italy Aromas include tropical fruit, broom, brimston... Vulkà Bianco 87 NaN Sicily & Sardinia Etna None Kerin O’Keefe @kerinokeefe Nicosia 2013 Vulkà Bianco (Etna) White Blend Nicosia
1 1 1 Portugal This is ripe and fruity, a wine that is smooth... Avidagos 87 15.0 Douro None None Roger Voss @vossroger Quinta dos Avidagos 2011 Avidagos Red (Douro) Portuguese Red Quinta dos Avidagos
2 2 2 US Tart and snappy, the flavors of lime flesh and... None 87 14.0 Oregon Willamette Valley Willamette Valley Paul Gregutt @paulgwine Rainstorm 2013 Pinot Gris (Willamette Valley) Pinot Gris Rainstorm
3 3 3 US Pineapple rind, lemon pith and orange blossom ... Reserve Late Harvest 87 13.0 Michigan Lake Michigan Shore None Alexander Peartree None St. Julian 2013 Reserve Late Harvest Riesling ... Riesling St. Julian
4 4 4 US Much like the regular bottling from 2012, this... Vintner's Reserve Wild Child Block 87 65.0 Oregon Willamette Valley Willamette Valley Paul Gregutt @paulgwine Sweet Cheeks 2012 Vintner's Reserve Wild Child... Pinot Noir Sweet Cheeks

The need for different data types¶

  • Each column has a dtype
  • Enables efficient storage and fast computation
  • Dtypes are not always set optimally after loading data
In [8]:
df.dtypes
Out[8]:
Unnamed: 0                 int64
serial                     int64
country                   object
description               object
designation               object
points                     int64
price                    float64
province                  object
region_1                  object
region_2                  object
taster_name               object
taster_twitter_handle     object
title                     object
variety                   object
winery                    object
dtype: object

Benefits of good type representation¶

  • Fast calculations in a low level language
  • Access to operations that are only relevant for some types
  • Memory efficiency
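A minimal sketch of the memory argument, using .memory_usage on a toy column (exact byte counts will differ on your machine):

import pandas as pd

raw = pd.Series(range(1000), dtype="int64")   # 8 bytes per value
compact = raw.astype(pd.UInt16Dtype())        # 2 bytes per value plus a validity mask

raw.memory_usage(deep=True)       # includes the index
compact.memory_usage(deep=True)   # noticeably smaller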

Converting to efficient dtypes¶

In [9]:
better_dtypes = {
 "country": pd.CategoricalDtype(),
 "points": pd.UInt8Dtype(),
 "price": pd.Float32Dtype(),
}

df = df.astype(better_dtypes)
df.dtypes
Out[9]:
Unnamed: 0                  int64
serial                      int64
country                  category
description                object
designation                object
points                      UInt8
price                     Float32
province                   object
region_1                   object
region_2                   object
taster_name                object
taster_twitter_handle      object
title                      object
variety                    object
winery                     object
dtype: object

Overview of numeric dtypes¶

Type Properties
pd.Int8Dtype() Byte (-128 to 127)
pd.Int16Dtype() Integer (-32768 to 32767)
pd.Int32Dtype() Integer (-2147483648 to 2147483647)
pd.Int64Dtype() Integer (-9223372036854775808 to 9223372036854775807)
pd.UInt8Dtype() Unsigned Integer (0 to 255)
pd.UInt16Dtype() Unsigned Integer (0 to 65535)
pd.UInt32Dtype() Unsigned Integer (0 to 4294967295)
pd.UInt64Dtype() Unsigned Integer (0 to 18446744073709551615)
pd.Float32Dtype() Float (single precision, up to ±3.4028235e+38)
pd.Float64Dtype() Float (double precision)
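These nullable dtypes can also hold missing values without falling back to float64. A small illustration:

import pandas as pd

sr = pd.Series([1, 2, pd.NA], dtype=pd.UInt8Dtype())
sr.dtype    # UInt8
sr.isna()   # False, False, True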

String vs. Categorical¶

  • pd.CategoricalDtype() is for data that takes on a limited number of values

    • Internally stored as integers
    • Very fast relabeling or reordering of categories
  • pd.StringDtype() is for actual text data

    • Internally stored as pyarrow array
    • Fast string functions similar to str in Python
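A small sketch of converting a toy column either way (the colors Series is made up for illustration):

import pandas as pd

colors = pd.Series(["red", "blue", "red", "red"])

colors.astype("category")        # few distinct values -> categorical
colors.astype(pd.StringDtype())  # free-form text -> dedicated string dtype (pass storage="pyarrow" for the Arrow-backed version)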

Working with Strings¶

  • The .str accessor provides access to the string methods
  • Vectorized and fast implementations!
In [10]:
sr = pd.Series(["Guido", "Tim", "Raymond"])
sr.str.lower()
Out[10]:
0      guido
1        tim
2    raymond
dtype: object
In [11]:
sr.str.replace("i", "iii")
Out[11]:
0    Guiiido
1      Tiiim
2    Raymond
dtype: object
  • Other examples:

    • sr.str.len
    • sr.str.contains
  • See this tutorial for more string methods
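For instance, with the same sr as above:

sr.str.len()          # 5, 3, 7
sr.str.contains("i")  # True, True, False ("Raymond" has no "i")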

Working with categoricals¶

  • Categories are defined independent of data
    • Protection against invalid categories
    • Good for visualization!
  • The .cat accessor provides access to the categorical methods
  • See this tutorial for more methods
In [12]:
cat_type = pd.CategoricalDtype(
    categories=["low", "middle", "high"],
    ordered=True,
)
sr = pd.Series(
    ["low", "high", "high"],
    dtype=cat_type,
)
sr
Out[12]:
0     low
1    high
2    high
dtype: category
Categories (3, object): ['low' < 'middle' < 'high']
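A small sketch of the .cat accessor, continuing with sr from above:

sr.cat.categories                          # Index(['low', 'middle', 'high'], dtype='object')
sr.cat.codes                               # 0, 2, 2 -- the integer codes behind the data
sr.cat.rename_categories(["L", "M", "H"])  # relabels without touching the codes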

Loading and saving data¶


Example: Loading a csv file¶

  • first argument is the file name (with relative or absolute path)
  • engine="pyarrow" ensures we are getting modern pandas dtypes
  • many other optional arguments
In [13]:
df = pd.read_csv(
    "https://raw.githubusercontent.com/HSG-8276-Spring-2025/class-ressources/refs/heads/main/data/winemag-data-130k-v3.csv", index_col=0, 
    engine="pyarrow"
)
df.head()
Out[13]:
Unnamed: 0 serial country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
0 0 0 Italy Aromas include tropical fruit, broom, brimston... Vulkà Bianco 87 NaN Sicily & Sardinia Etna None Kerin O’Keefe @kerinokeefe Nicosia 2013 Vulkà Bianco (Etna) White Blend Nicosia
1 1 1 Portugal This is ripe and fruity, a wine that is smooth... Avidagos 87 15.0 Douro None None Roger Voss @vossroger Quinta dos Avidagos 2011 Avidagos Red (Douro) Portuguese Red Quinta dos Avidagos
2 2 2 US Tart and snappy, the flavors of lime flesh and... None 87 14.0 Oregon Willamette Valley Willamette Valley Paul Gregutt @paulgwine Rainstorm 2013 Pinot Gris (Willamette Valley) Pinot Gris Rainstorm
3 3 3 US Pineapple rind, lemon pith and orange blossom ... Reserve Late Harvest 87 13.0 Michigan Lake Michigan Shore None Alexander Peartree None St. Julian 2013 Reserve Late Harvest Riesling ... Riesling St. Julian
4 4 4 US Much like the regular bottling from 2012, this... Vintner's Reserve Wild Child Block 87 65.0 Oregon Willamette Valley Willamette Valley Paul Gregutt @paulgwine Sweet Cheeks 2012 Vintner's Reserve Wild Child... Pinot Noir Sweet Cheeks

Other read functions¶

Reader Extension Comment
pd.read_csv .csv Often need to use optional arguments to make it work
pd.read_pickle .pkl Good for intermediate files; Python specific.
pd.read_feather .arrow Very modern and powerful file format.
pd.read_stata .dta Stata’s proprietary format. Avoid if you can.

File format recommendations¶

  • Use .pkl format for processed datasets that you do not share with others
    • Very fast to read and write
    • Preserves every aspect of your DataFrame (e.g. dtypes)
  • Use .parquet to save files you want to share with others
    • Can be read by many languages and programs
    • Efficient compression
  • Use .dta iff sharing with Stata users
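A minimal round-trip sketch under these recommendations (file names are placeholders; to_parquet needs pyarrow installed):

import pandas as pd

df = pd.DataFrame({"a": [1, 3], "b": ["bla", "blubb"]})

df.to_pickle("intermediate.pkl")         # Python-only, preserves dtypes exactly
df = pd.read_pickle("intermediate.pkl")

df.to_parquet("shared.parquet")          # shareable, compressed, language-agnostic
df = pd.read_parquet("shared.parquet")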

Setting and renaming columns and indices¶

Why the Index is important¶

  • We have seen that pandas aligns new columns in a DataFrame by index
  • Many other operations are aligned by index
  • Using a meaningful index makes this even safer
  • Index should be unique and not contain floats

Setting and resetting the index¶

  • set_index and reset_index are inverse functions
  • set_index can take any column or list of columns
  • Optional argument drop=True or drop=False determines what happens with the old index in set_index
In [14]:
df.index
Out[14]:
Index([     0,      1,      2,      3,      4,      5,      6,      7,      8,
            9,
       ...
       129961, 129962, 129963, 129964, 129965, 129966, 129967, 129968, 129969,
       129970],
      dtype='int64', name='', length=129971)
In [15]:
df = df.set_index(["country", "winery"])
df.index
Out[15]:
MultiIndex([(   'Italy',                                  'Nicosia'),
            ('Portugal',                      'Quinta dos Avidagos'),
            (      'US',                                'Rainstorm'),
            (      'US',                               'St. Julian'),
            (      'US',                             'Sweet Cheeks'),
            (   'Spain',                                   'Tandem'),
            (   'Italy',                          'Terre di Giurfo'),
            (  'France',                                 'Trimbach'),
            ( 'Germany',                              'Heinz Eifel'),
            (  'France',                       'Jean-Baptiste Adam'),
            ...
            (   'Italy',                                      'COS'),
            (   'Italy',                                 'Cusumano'),
            (  'Israel',                                   'Dalton'),
            (  'France',                          'Domaine Ehrhart'),
            (  'France',                  'Domaine Rieflé-Landmann'),
            ( 'Germany', 'Dr. H. Thanisch (Erben Müller-Burggraef)'),
            (      'US',                                 'Citation'),
            (  'France',                          'Domaine Gresser'),
            (  'France',                     'Domaine Marcel Deiss'),
            (  'France',                         'Domaine Schoffit')],
           names=['country', 'winery'], length=129971)
In [16]:
df = df.reset_index()
df.index
Out[16]:
RangeIndex(start=0, stop=129971, step=1)

Renaming columns¶

  • Dict can contain only the subset of variables that is actually renamed
  • Renaming the index works the same way but is rarely needed
In [17]:
df.columns
Out[17]:
Index(['country', 'winery', 'Unnamed: 0', 'serial', 'description',
       'designation', 'points', 'price', 'province', 'region_1', 'region_2',
       'taster_name', 'taster_twitter_handle', 'title', 'variety'],
      dtype='object')
In [18]:
new_names = {
    "country": "country name",
    "taster_name": "taster name",
}
df = df.rename(columns=new_names)
df.columns
Out[18]:
Index(['country name', 'winery', 'Unnamed: 0', 'serial', 'description',
       'designation', 'points', 'price', 'province', 'region_1', 'region_2',
       'taster name', 'taster_twitter_handle', 'title', 'variety'],
      dtype='object')
  • Instead of a dict, you can provide a function that converts old names to new names!
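For example, a minimal sketch passing str.lower (df_small is made up for illustration; any callable mapping old names to new names works):

import pandas as pd

df_small = pd.DataFrame({"Price": [15.0], "Points": [87]})
df_small.rename(columns=str.lower).columns
# Index(['price', 'points'], dtype='object')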

Selecting rows and columns¶

Selecting columns¶

  • Use square brackets to select columns
In [19]:
df['country name'].head(6)
Out[19]:
0       Italy
1    Portugal
2          US
3          US
4          US
5       Spain
Name: country name, dtype: object
  • For multiple columns you need double brackets:
    • Outer: selecting columns
    • Inner: defining a list of variables
In [20]:
df[['country name', 'taster name']]
Out[20]:
country name taster name
0 Italy Kerin O’Keefe
1 Portugal Roger Voss
2 US Paul Gregutt
3 US Alexander Peartree
4 US Paul Gregutt
... ... ...
129966 Germany Anna Lee C. Iijima
129967 US Paul Gregutt
129968 France Roger Voss
129969 France Roger Voss
129970 France Roger Voss

129971 rows × 2 columns

Selecting individual rows¶

  • Selection of rows needs .loc[]
  • Selection is label based!
In [21]:
df.loc[1]
Out[21]:
country name                                                      Portugal
winery                                                 Quinta dos Avidagos
Unnamed: 0                                                               1
serial                                                                   1
description              This is ripe and fruity, a wine that is smooth...
designation                                                       Avidagos
points                                                                  87
price                                                                 15.0
province                                                             Douro
region_1                                                              None
region_2                                                              None
taster name                                                     Roger Voss
taster_twitter_handle                                           @vossroger
title                        Quinta dos Avidagos 2011 Avidagos Red (Douro)
variety                                                     Portuguese Red
Name: 1, dtype: object
  • For a MultiIndex you can specify some or all levels
In [22]:
df = df.set_index(["country name", "winery"])
df.loc["Italy"]
Out[22]:
Unnamed: 0 serial description designation points price province region_1 region_2 taster name taster_twitter_handle title variety
winery
Nicosia 0 0 Aromas include tropical fruit, broom, brimston... Vulkà Bianco 87 NaN Sicily & Sardinia Etna None Kerin O’Keefe @kerinokeefe Nicosia 2013 Vulkà Bianco (Etna) White Blend
Terre di Giurfo 6 6 Here's a bright, informal red that opens with ... Belsito 87 16.0 Sicily & Sardinia Vittoria None Kerin O’Keefe @kerinokeefe Terre di Giurfo 2013 Belsito Frappato (Vittoria) Frappato
Masseria Setteporte 13 13 This is dominated by oak and oak-driven aromas... Rosso 87 NaN Sicily & Sardinia Etna None Kerin O’Keefe @kerinokeefe Masseria Setteporte 2012 Rosso (Etna) Nerello Mascalese
Baglio di Pianetto 22 22 Delicate aromas recall white flower and citrus... Ficiligno 87 19.0 Sicily & Sardinia Sicilia None Kerin O’Keefe @kerinokeefe Baglio di Pianetto 2007 Ficiligno White (Sicilia) White Blend
Canicattì 24 24 Aromas of prune, blackcurrant, toast and oak c... Aynat 87 35.0 Sicily & Sardinia Sicilia None Kerin O’Keefe @kerinokeefe Canicattì 2009 Aynat Nero d'Avola (Sicilia) Nero d'Avola
... ... ... ... ... ... ... ... ... ... ... ... ... ...
Col Vetoraz Spumanti 129929 129929 This luminous sparkler has a sweet, fruit-forw... None 91 38.0 Veneto Prosecco Superiore di Cartizze None None None Col Vetoraz Spumanti NV Prosecco Superiore di... Prosecco
Baglio del Cristo di Campobello 129943 129943 A blend of Nero d'Avola and Syrah, this convey... Adènzia 90 29.0 Sicily & Sardinia Sicilia None Kerin O’Keefe @kerinokeefe Baglio del Cristo di Campobello 2012 Adènzia R... Red Blend
Feudo Principi di Butera 129947 129947 A blend of 65% Cabernet Sauvignon, 30% Merlot ... Symposio 90 20.0 Sicily & Sardinia Terre Siciliane None Kerin O’Keefe @kerinokeefe Feudo Principi di Butera 2012 Symposio Red (Te... Red Blend
COS 129961 129961 Intense aromas of wild cherry, baking spice, t... None 90 30.0 Sicily & Sardinia Sicilia None Kerin O’Keefe @kerinokeefe COS 2013 Frappato (Sicilia) Frappato
Cusumano 129962 129962 Blackberry, cassis, grilled herb and toasted a... Sàgana Tenuta San Giacomo 90 40.0 Sicily & Sardinia Sicilia None Kerin O’Keefe @kerinokeefe Cusumano 2012 Sàgana Tenuta San Giacomo Nero d... Nero d'Avola

19540 rows × 13 columns

Selecting rows and columns¶

  • Use .loc[rows, columns] to select rows and columns
  • Can use everything seen before
In [23]:
df.loc["Italy", "taster name"]
/var/folders/cg/tgk7cwd906x_71j8jdd3gzc00000gn/T/ipykernel_90395/898296892.py:1: PerformanceWarning: indexing past lexsort depth may impact performance.
  df.loc["Italy", "taster name"]
Out[23]:
winery
Nicosia                            Kerin O’Keefe
Terre di Giurfo                    Kerin O’Keefe
Masseria Setteporte                Kerin O’Keefe
Baglio di Pianetto                 Kerin O’Keefe
Canicattì                          Kerin O’Keefe
                                       ...      
Col Vetoraz Spumanti                        None
Baglio del Cristo di Campobello    Kerin O’Keefe
Feudo Principi di Butera           Kerin O’Keefe
COS                                Kerin O’Keefe
Cusumano                           Kerin O’Keefe
Name: taster name, Length: 19540, dtype: object
In [24]:
df.loc[["Italy", "Australia"], ["taster name", "points"]]
Out[24]:
taster name points
country name winery
Italy Nicosia Kerin O’Keefe 87
Terre di Giurfo Kerin O’Keefe 87
Masseria Setteporte Kerin O’Keefe 87
Baglio di Pianetto Kerin O’Keefe 87
Canicattì Kerin O’Keefe 87
... ... ... ...
Australia Kilikanoon None 89
Atze's Corner None 89
Moorooroo None 87
Mr. Riggs None 87
Henschke Joe Czerwinski 90

21869 rows × 2 columns

In [25]:
df = df.reset_index()

Selecting rows using Boolean Series¶

  • Comparisons of Series produce Boolean Series!
  • Complex conditions can be built with | and & (combined example below)
  • Boolean Series can be used for selecting rows
  • Works also inside .loc
In [26]:
df['points']>88
Out[26]:
0         False
1         False
2         False
3         False
4         False
          ...  
129966     True
129967     True
129968     True
129969     True
129970     True
Name: points, Length: 129971, dtype: bool

Just selecting a narrower set of columns for better display.

In [27]:
df = df[['country name', 'points', 'price', "taster name"]] 

Using the Boolean Series

In [28]:
df[df['points']>88]
Out[28]:
country name points price taster name
119 France 92 80.0 None
120 Italy 92 70.0 None
121 US 92 36.0 None
122 US 92 39.0 None
123 Australia 92 40.0 Joe Czerwinski
... ... ... ... ...
129966 Germany 90 28.0 Anna Lee C. Iijima
129967 US 90 75.0 Paul Gregutt
129968 France 90 30.0 Roger Voss
129969 France 90 32.0 Roger Voss
129970 France 90 21.0 Roger Voss

61271 rows × 4 columns
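A sketch combining two conditions on the df from above; each comparison needs its own parentheses because & and | bind more tightly than comparisons:

cheap_and_good = df[(df['points'] > 88) & (df['price'] < 20)]

# The same selection inside .loc, restricted to two columns
df.loc[(df['points'] > 88) & (df['price'] < 20), ['country name', 'price']]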

Inspecting and summarizing data¶

  • Real datasets don’t fit on a screen
  • Need quick ways to:
    • Look at subsets
    • Calculate summary statistics
    • Plot distributions
In [29]:
df
Out[29]:
country name points price taster name
0 Italy 87 NaN Kerin O’Keefe
1 Portugal 87 15.0 Roger Voss
2 US 87 14.0 Paul Gregutt
3 US 87 13.0 Alexander Peartree
4 US 87 65.0 Paul Gregutt
... ... ... ... ...
129966 Germany 90 28.0 Anna Lee C. Iijima
129967 US 90 75.0 Paul Gregutt
129968 France 90 30.0 Roger Voss
129969 France 90 32.0 Roger Voss
129970 France 90 21.0 Roger Voss

129971 rows × 4 columns

Summarize an entire DataFrame¶

  • .describe can summarize entire DataFrames
  • Result is again a DataFrame
  • Often a good idea to select a subset of columns
In [30]:
relevant_columns = ['points', 'price']
df[relevant_columns].describe()
Out[30]:
points price
count 129971.000000 120975.000000
mean 88.447138 35.363389
std 3.039730 41.022218
min 80.000000 4.000000
25% 86.000000 17.000000
50% 88.000000 25.000000
75% 91.000000 42.000000
max 100.000000 3300.000000

Calculate specific statistics¶

  • Standard summary statistics are implemented and named as expected:
    • std
    • min and max
    • median and quantile
  • Vectorized and really fast implementations
In [31]:
df['points'].mean()
Out[31]:
np.float64(88.44713820775404)
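The other statistics listed above work the same way, for example:

df['price'].median()        # 25.0, matching the describe() output above
df['price'].quantile(0.75)  # 42.0
df['points'].max()          # 100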

Grouping and aggregating¶

In [32]:
df.groupby('country name')['points'].mean()[:5]
Out[32]:
country name
Argentina                 86.710263
Armenia                   87.500000
Australia                 88.580507
Austria                   90.101345
Bosnia and Herzegovina    86.500000
Name: points, dtype: float64

Quick plotting: Series¶

  • Any Series has a .plot method
  • Any Series has a .hist method
  • Summary statistics based on groupby return Series which can again be plotted
In [33]:
pd.options.plotting.backend = "plotly"
import plotly.io as pio
pio.renderers.default = "notebook"  # For Jupyter compatibility
In [34]:
df.loc[df['country name'] == 'Bosnia and Herzegovina', 'country name'] = 'BH'
In [35]:
fig = df.groupby("country name")["points"].mean().plot(width=700, height=400) 
fig.update_layout(xaxis=dict(tickfont=dict(size=10)))  # Adjust the font size as needed
fig.show() 

Quick plotting: DataFrames¶

  • Any DataFrame has a .plot method
  • Defaults to a line plot; .plot.scatter and many others are available
  • The notebook gives you interactive plots
In [36]:
df.plot.scatter(x="points", y="price", width=700, height=400)

Statistics for categorical data¶

In [37]:
df['country name'].unique()[:3]
Out[37]:
array(['Italy', 'Portugal', 'US'], dtype=object)
In [38]:
df['country name'].value_counts().sort_index()[:3]
Out[38]:
country name
Argentina    3800
Armenia         2
Australia    2329
Name: count, dtype: int64

Creating new variables¶

Using numpy math functions¶

  • All functions you’ll ever need are implemented:
    • np.log
    • np.exp
    • np.sqrt
    • np.power
  • See docs for details
  • Index is preserved
  • Very fast, vectorized implementation
In [39]:
import numpy as np
df['log_price'] = np.log(df['price'])
df['log_price'].head()
Out[39]:
0         NaN
1    2.708050
2    2.639057
3    2.564949
4    4.174387
Name: log_price, dtype: float64

Arithmetic with Series¶

  • * , + , - , / , … work as expected
  • All calculations are aligned by index
  • Not all Series have to come from the same DataFrame or be assigned to a DataFrame
In [40]:
df['price_thousand'] = df['price'] / 1000
df['price_thousand'].head()
Out[40]:
0      NaN
1    0.015
2    0.014
3    0.013
4    0.065
Name: price_thousand, dtype: float64
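A minimal sketch of index alignment with two standalone Series: matching labels are combined, and labels present on only one side produce a missing value.

import pandas as pd

a = pd.Series([1.0, 2.0], index=["c", "d"])
b = pd.Series([10.0, 20.0], index=["d", "e"])

a + b   # c -> NaN, d -> 12.0, e -> NaN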

Recoding values¶

  • Can be useful to create new variables or fix typos in string variables
  • Not super fast, but faster than any looping approach
In [41]:
df['country code'] = df['country name'].replace(
    {
        "Italy": "IT",
        "France": "FR",
        "Spain": "ES",
        "USA": "US",
    }
) 
df[['country name', 'country code']].head()
Out[41]:
country name country code
0 Italy IT
1 Portugal Portugal
2 US US
3 US US
4 US US

Merging datasets¶

Why merge datasets?¶

  • Often when you download data, it comes in several files
  • While you might not like it, this is often because the data providers respected the normal forms!
  • Or it comes from very different sources
  • In this screencast we show you how to merge or concatenate DataFrames

Concatenating DataFrames vertically¶


result = pd.concat([df1, df2], axis=0) # Concatenating row-wise
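A runnable sketch with two toy frames (df1 and df2 are made up for illustration):

import pandas as pd

df1 = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})
df2 = pd.DataFrame({"a": [3, 4], "b": ["z", "w"]})

pd.concat([df1, df2], axis=0, ignore_index=True)  # stacks the rows, builds a fresh 0..3 index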

Concatenating DataFrames horizontally¶


result = pd.concat([df1, df3], axis=1) # Concatenating column-wise
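And column-wise, where rows are matched on the index (df3 is again a toy frame):

import pandas as pd

df1 = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})
df3 = pd.DataFrame({"c": [0.1, 0.2]})

pd.concat([df1, df3], axis=1)  # the column of df3 appears next to those of df1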

Merge¶


pd.merge(left, right, how='outer', left_on="key", right_on="key", suffixes=('_left', '_right'), indicator=False)
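A runnable sketch of the call above with two toy frames (left and right are made up):

import pandas as pd

left = pd.DataFrame({"key": ["a", "b"], "val": [1, 2]})
right = pd.DataFrame({"key": ["b", "c"], "val": [3, 4]})

# how="outer" keeps keys from both sides; unmatched cells become missing,
# suffixes disambiguate the two "val" columns, indicator adds a "_merge" column
pd.merge(
    left, right,
    how="outer", left_on="key", right_on="key",
    suffixes=("_left", "_right"), indicator=True,
)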

Rules for data management¶

1. Never ever change source data¶

  • Source data: Original dataset as downloaded or collected
  • Commit the source data to git and never change it
  • All modified datasets should be stored under different names
  • Modified datasets should not be under version control!

2. Separate data management and analysis¶

  • Data management: Converting source data to formats your analysis programs need
  • Separate data management code from analysis code
  • Never modify the content of a variable outside the data management code!

3. Values have no internal structure¶

  • a.k.a. the first normal form
  • I.e., no need for parsing values before using them
  • E.g. store first names and last names separately
  • Not too often a problem in economic data
    • X-digit industrial or educational classifiers
    • Store each digit level you need in a separate variable

4. No redundant information in tables¶

  • a.k.a. the second normal form
  • In a panel structure: Store time-constant characteristics in a separate table
  • Violations make things much harder and error-prone:
    • Changes to data
    • Consistency checks
    • Selecting observations

5. No structure in variable names (and use meaningful names)¶

  • a.k.a. use long format if you can
  • There should not be different variables with similar content referring to different time periods etc.
  • If you need wide format for regressions, still do your data management in long format