%%HTML
<link rel="stylesheet" type="text/css" href="https://raw.githubusercontent.com/malkaguillot/Foundations-in-Data-Science-and-Machine-Learning/refs/heads/main/docs/utils/custom.css">
<link rel="stylesheet" type="text/css" href="../utils/custom.css">

Foundations in Data Science and Machine Learning¶

Module 1: Introduction¶

Malka Guillot¶

HSG Logo

Overview of the class¶

Introduction: Who am I?¶

  • PhD in economics from the Paris School of Economics
  • Postdoc at ETH Zürich
  • Associate professor at the University of Liège, Belgium
  • Interested in public economics questions: inequality and taxation
  • Using the standard econometric toolbox + natural language processing + machine learning

Who are you?¶

  • Where did you do your bachelor's degree?

  • Major?

  • Programming experience?

  • A specific interest in: a method / a theme?

Class objective¶

The class focuses on the concepts and skills needed to manage data, which are central to putting data to use.

Goals:

  • Equip you with the standard data science toolkit.
  • Put it to work on a real-world project.

Backbone of the class¶

  1. The skills:
    • Data manipulation:
      • Cleaning, Pipelines, data structure
    • Data visualisation
    • Data modelling
  2. The tools:
    • shell
    • Python
    • git
  3. The concepts:
    • Project management: documenting, sharing & managing code
    • Reproducibility
  4. The methods:
    • Machine Learning
    • Natural Language Processing

Target audience: anyone using data for projects.

What this course is, and is not¶

  • It is:
    • Applied and oriented towards practice;
    • A general overview of different techniques: what they are and how to use them.
    • Data analysis in general, not restricted to research or to one field (economics, political science).
    • In Python.
  • It is not:
    • Computer science. We’re not coding up models from scratch.
    • Mathematical statistics. We’re not deriving the functions by hand.

Example project¶

How did the underdeveloped, agricultural Switzerland of the 1850s become such an economically successful country?

A research question in history... where data are central

  • Timing of the development of certain economic sectors?
  • Geographical distribution of the firms?
  • Interaction with public policies?


Example: REFLEX project¶

  • Approach: The Swiss Commercial Registry (1883-)

  • Methods:

    • Transforming the PDFs into structured data using:
      • Google Vision
      • OpenAI
      • deep learning

Example: REFLEX project¶

  • Objectives:
  • Challenges:
    • Correcting for OCR mistakes
    • Building a firm identifier
    • 3 languages
    • ...

Course organisation & logistics¶

How does the class work? Spirit¶

Sessions are designed to be interactive

  • mix of live concepts & coding exercises
  • I want to get you comfortable using your computing environment to solve problems
    • bring your laptop!
    • I expect you to have completed the installation guide and to have all software installed.
    • ask questions!

Online Course Materials¶

  • Website:

    • with the course schedule + architecture + content
  • GitHub repository:

    • includes course resources (notebooks)
  • GitHub Classroom:

    • for collective in-class exercises
  • StudyNet:

    • For homework
    • Course announcements and forum

Shell¶

Command line interfaces (CLIs) vs. Graphical User Interfaces (GUIs)¶

CLI¶

  • Faster if you know commands
  • More precise
  • Do everything in one place and don’t change windows
  • Multiple steps can be chained and automated to become one

GUI¶

  • Faster for beginners
  • Less need to memorize
  • Use an interface that is designed specifically for your task
  • Little potential for automation or command chaining

Navigation in the shell¶

  • pwd (print working directory)
    • Display your working directory
  • Use cd to navigate to different directories
    • cd .. to go up one directory
  • Use ls to list the contents of a directory
  • mkdir to create a new directory
  • touch to create a new (empty) file
  • rm to remove a file
  • rmdir to remove an (empty) directory
  • cp to copy a file
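
The commands above can be tried safely in a scratch directory. The following is a minimal sketch for a Mac/Linux shell; the names `demo_dir`, `project`, `notes.txt` and `backup.txt` are made up for illustration.

```shell
# Work in a throwaway directory so nothing important is touched.
demo_dir="$(mktemp -d)"   # hypothetical scratch location
cd "$demo_dir"

pwd                       # print the current working directory
mkdir project             # create a new directory
cd project                # move into it
touch notes.txt           # create an empty file
cp notes.txt backup.txt   # copy the file
ls                        # list the contents: backup.txt and notes.txt
rm notes.txt              # remove a file (there is no trash bin)
cd ..                     # go up one directory
rm project/backup.txt     # empty the directory first ...
rmdir project             # ... because rmdir only removes empty directories
```

Running the lines one at a time, with `ls` in between, is a good way to see what each command changes.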

Windows Users¶

  • There are two shells:
    • PowerShell
    • Windows Command Prompt
  • You can use either one, but PowerShell is more powerful and has more features (and accepts many of the Mac/Linux shell commands, such as ls, cd and pwd, through aliases)

Git¶

What is git?¶

Definition: A version control system

Enables coordination $\rightarrow$ no code change is lost or accidentally overwritten.

Provides an organized sharing platform $\rightarrow$ open source & documentation

$\Rightarrow$ key tool from our project management perspective

$\Rightarrow$ widely used in companies, not enough in research:

  • Software development
  • Scientific research
  • Anything involving coding (even LaTeX)

What is GitHub?¶

Definition: A web-based platform for version control and collaboration (using git)

  • Practical: Something that lets you:
    • host your code online
    • share your code with others
    • collaborate with others on the same project
    • keep track of changes made by different people
    • manage issues and pull requests

We (researchers, data analysts, developers) need git and GitHub¶

  • Most papers/projects are written by groups
  • Empirical and computational projects become more complex
  • The publication process entails multiple rounds of revisions
  • Reproducibility and transparency are key

Learning git¶

  • Learning git is a long journey
  • The payoffs are enormous, and once you have learned it you will not want to go back!
  • The best way to learn it is:
    • Learn git on your local computer first
    • Only use the shell!
    • Practice, practice, practice!
  • Don’t be afraid of an ugly and convoluted commit history, we won’t deduct anything for that

There is a learning curve, but I promise you it's worth it.

Git model¶

  1. You do work in your working directory
  2. Then you add it to your staging area
  3. Once you've staged all your changes for one discrete task, commit a snapshot of the staging area
  4. If you have a remote repository, push your commit
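
The four steps can be sketched in the shell as follows. This is an illustration under assumptions, not a prescribed workflow: the scratch directory, the file `analysis.py`, and the identity "Demo User" are placeholders, and the push is left as a comment because no remote is set up here.

```shell
repo_dir="$(mktemp -d)"                  # hypothetical scratch location
cd "$repo_dir"
git init -q -b main                      # -b needs Git 2.28 or newer
git config user.name "Demo User"         # placeholder identity, local to this repo
git config user.email "demo@example.com"

echo "print('hello')" > analysis.py      # 1. do work in the working directory
git add analysis.py                      # 2. add the change to the staging area
git commit -q -m "Add analysis script"   # 3. commit a snapshot of the staging area
git log --oneline                        # the snapshot is now part of the history

# 4. With a remote configured under the usual name "origin", you would then run:
# git push origin main
```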

Getting started with git¶

Set up a GitHub account¶

  • Navigate to GitHub's homepage and click "Sign Up" in the top right-hand corner of the page
    • Go through the account setup steps ("Verify your email address"...)

Getting started with Git(Hub)¶

  1. Install Git (Linux, Mac, Windows) if not already installed

  2. Git comes with a command line interface (powerful!).

  3. You might want to add a graphical interface to make things easier:

    • GitHub Desktop
      • You can link it with your GitHub account
    • Or use the one built into VS Code

[Hint] What actually is the Git repository?¶

  • The Git local repository is associated with a particular directory
  • Open the directory in your Git interface to see your options
  • Git stores all its workings in that directory in a hidden subfolder called “.git”

3 special files:

  • README.md: description of the directory
  • .gitignore: what should be ignored by the tracking system
  • LICENSE $\rightarrow$ open source?
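
To see the hidden folder for yourself, a small sketch like the one below can help. It assumes a Mac/Linux shell and a throwaway directory; the three files are created empty purely to show where they live.

```shell
repo_dir="$(mktemp -d)"              # hypothetical scratch location
cd "$repo_dir"
git init -q                          # turns this directory into a Git repository
touch README.md .gitignore LICENSE   # the three special files, empty for now

ls                                   # plain ls hides names starting with a dot
ls -a                                # -a also shows .git and .gitignore
```

Deleting the `.git` folder removes the whole history, so it is best left alone.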

[Hint] What should I include?¶

  • At a minimum:

    • Code (.do, .py, .R, .m, .jl, and so on)
    • Text files (.txt)
    • LATEX documents (.tex)
  • I also recommend:

    • Raw .csv datasets, if small (<10 MB)
    • PDF files
    • Word, Excel, PowerPoint files
  • PDF and Office files are binary, so you can’t see differences between versions. I recommend including them anyway.

  • Some people also include all datasets.

    • Note that GitHub blocks files larger than 100 MB and recommends keeping repositories under 1 GB.

For datasets, look into Git Large File Storage.

[Hint] What should I exclude?¶

In order to avoid driving your collaborators crazy, you must tell Git to ignore the junk files using a file called .gitignore. It looks like this:

  • Junk created by LaTeX: *.synctex.gz, *.out, *.log
  • Junk created by Python: *.pyc

Best practice: use .gitignore to explicitly exclude everything that you don’t want to include, and commit .gitignore like any other regular file.

GitHub maintains a list of standard .gitignore files for many common languages.
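
As a rough illustration, the sketch below writes a `.gitignore` containing the patterns listed above and checks that Git really skips the matching files. The file names (`paper.tex`, `paper.log`, `script.py`, `script.pyc`) are invented for the example.

```shell
repo_dir="$(mktemp -d)"      # hypothetical scratch location
cd "$repo_dir"
git init -q

# Write a small .gitignore: one pattern per line, "#" starts a comment.
cat > .gitignore <<'EOF'
# Junk created by LaTeX
*.synctex.gz
*.out
*.log
# Junk created by Python
*.pyc
EOF

touch paper.tex paper.log script.py script.pyc
git status --short                         # only .gitignore, paper.tex and script.py show up
git check-ignore -v paper.log script.pyc   # shows which pattern ignores each file
```

`git check-ignore -v` is handy when a file is unexpectedly missing from `git status`.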

Backbone of git: Commits & branches¶

The lifecycle of files¶

  • Save: ctrl+S
  • Add: git add <filename> or git add . (all files)
    • You can see the staged files with git status
  • Commit: git commit -m "message"
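
One way to watch a file move through this lifecycle is to run `git status --short` after each step. The sketch below assumes a scratch repository and a made-up file `script.py`; the two-letter codes in the comments are what the short status format prints.

```shell
repo_dir="$(mktemp -d)"              # hypothetical scratch location
cd "$repo_dir"
git init -q
git config user.name "Demo User"     # placeholder identity, local to this repo
git config user.email "demo@example.com"

echo "x = 1" > script.py             # save a new file
git status --short                   # "?? script.py": untracked
git add script.py
git status --short                   # "A  script.py": staged
git commit -q -m "Add script"
git status --short                   # no output: nothing to commit
echo "x = 2" > script.py             # edit the file again
git status --short                   # " M script.py": modified, not yet staged
```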

Commits: saving a snapshot¶

"One discrete task" = a collection of changes, across multiple files (or not), that does one thing.

Examples:

  • Change the formatting of a variable from string to numeric, and treat it properly across multiple scripts
  • Change your regression specification in code, in the output, and in your paper and supporting documentation
  • Add a new function

Before you commit¶

  • Your code should run properly $\rightarrow$ run tests
  • No compilation errors (in LaTeX, for example)
  • Output should be consistent inside the commit (including comments)

But it’s better to have frequent commits (that might have small mistakes) than to have giant, infrequent commits.

Checking the status of your repository (in the shell)¶

git status

Viewing changes when committing¶


Commit message¶

Examples:

  • “Change the formatting of start date variable from string to date format”
  • “Add year dummies to regression specification”

→ The more detail, the more your future self will thank you.

git commit -m "Add slide on the commit git lecture"

Viewing commit history¶

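
In the shell, the history is shown by `git log`. The sketch below builds a two-commit repository just to have something to look at; the file `variables.txt` and the commit messages are illustrative assumptions.

```shell
repo_dir="$(mktemp -d)"              # hypothetical scratch location
cd "$repo_dir"
git init -q
git config user.name "Demo User"     # placeholder identity, local to this repo
git config user.email "demo@example.com"

echo "start_date" > variables.txt
git add variables.txt
git commit -q -m "Add list of variables"
echo "year" >> variables.txt
git commit -q -a -m "Add year variable"   # -a stages changes to already tracked files

git log --oneline                    # one line per commit, newest first
git log -1 --stat                    # the latest commit and the files it touched
```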

When things go wrong: go back in time!¶

What happens when a commit was a mistake? Revert it: this creates a new commit that undoes it.

git revert <commit_id>
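
A small end-to-end sketch of a revert, assuming a scratch repository and an invented file `model.py`. Here the commit id is read with `git rev-parse HEAD`; in practice you would copy it from `git log`.

```shell
repo_dir="$(mktemp -d)"              # hypothetical scratch location
cd "$repo_dir"
git init -q
git config user.name "Demo User"     # placeholder identity, local to this repo
git config user.email "demo@example.com"

echo "controls = ['age']" > model.py
git add model.py
git commit -q -m "Add model specification"
echo "controls = ['age', 'typo']" > model.py
git commit -q -a -m "Add a control variable"   # suppose this commit was a mistake

bad_commit="$(git rev-parse HEAD)"   # id of the commit to undo
git revert --no-edit "$bad_commit"   # new commit that undoes it, with a default message
git log --oneline                    # three commits: the mistake stays in the history
cat model.py                         # content is back to the first version
```

Because the revert is itself a commit, nothing is erased and the revert can be undone in the same way.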

Or in the GUI¶


This can happen!¶


Source: https://xkcd.com/1296/

Much more to learn on git & GitHub¶

  • Branches
  • Merging
    • Resolving conflicts
  • Pull requests
  • Collaboration
  • Issues

Branches: trying things out¶

Branches are the most powerful part of Git

  • By default, all the work you do goes into the “main” branch

  • Want to experiment? Start a new branch

    • You can switch between branches, and make commits to either branch
  • If your experiment works out, commit and merge back into the main branch

    • If there are conflicts between the commits you’ve made on the two branches, Git will ask you to resolve them
    • This is easiest with a graphical interface like GitKraken
    • Only works with text files, not binary files
  • If your experiment doesn’t work out, delete the new branch painlessly
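
The experiment-then-merge cycle can be sketched as below. The branch name `experiment` and the file `spec.txt` are hypothetical, and the merge is the simplest case (a fast-forward, with no conflicts).

```shell
repo_dir="$(mktemp -d)"              # hypothetical scratch location
cd "$repo_dir"
git init -q -b main                  # -b needs Git 2.28 or newer
git config user.name "Demo User"     # placeholder identity, local to this repo
git config user.email "demo@example.com"

echo "baseline" > spec.txt
git add spec.txt
git commit -q -m "Add baseline specification"

git switch -q -c experiment          # create a new branch and switch to it
echo "with year dummies" >> spec.txt
git commit -q -a -m "Add year dummies"

git switch -q main                   # back on main, spec.txt has one line again
git merge -q experiment              # the experiment worked: merge it into main
git branch -d experiment             # the branch is merged, so it can be deleted
git branch                           # only main is left
```

Had the experiment failed, `git branch -D experiment` (capital D) would delete the unmerged branch instead.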

Collaborating: interacting with the remote repository¶

The remote repository is on a server, and holds a record of your commits and branches

  • You push to the remote repository to save all your commits
  • You pull from the remote repository to load all new commits
  • Always commit before pushing or pulling
  • If what you’re doing is an experiment, make a new branch to avoid any trouble for your coauthors
  • If there are conflicts between your commits and your colleagues’ commits, Git will ask you to resolve them

Basic workflow: push - pull¶


This is what happens between your computer (local) and your repository (remote).

git push origin <branch_name>
git pull origin <branch_name>
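
To practise push and pull without touching GitHub, a bare repository on your own disk can stand in for the remote. The sketch below makes that assumption explicit: `remote.git` plays the role of the GitHub repository, and `alice` and `bob` are two hypothetical collaborators' working copies.

```shell
base_dir="$(mktemp -d)"                  # hypothetical scratch location
cd "$base_dir"
git init -q --bare -b main remote.git    # local stand-in for a GitHub repository

git init -q -b main alice                # Alice's local repository
cd alice
git config user.name "Alice"             # placeholder identity
git config user.email "alice@example.com"
git remote add origin "$base_dir/remote.git"   # tell Git where "origin" is
echo "# Project" > README.md
git add README.md
git commit -q -m "Add README"
git push -q origin main                  # send the commit to the remote

cd "$base_dir"
git clone -q remote.git bob              # Bob gets his own copy

cd alice
echo "More details" >> README.md
git commit -q -a -m "Extend README"
git push -q origin main                  # Alice shares a second commit

cd ../bob
git pull -q origin main                  # Bob loads Alice's new commit
git log --oneline                        # both commits are now in Bob's copy
```

With GitHub the commands are the same; only the address behind `origin` changes.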

Pushing to the remote repository (GitHub Desktop)¶

git push origin <branch_name>

Sending my commits to the internet!

Create a remote repository¶

  • Make sure you tick the box to initialize it with a README
  • .gitignore $\rightarrow$ Python template
  • License


The README.md¶

  • Very important file!
    • Objective: communicate important information about your project
  • A markdown file
  • Markdown = a lightweight markup language
    • The guide
    • The syntax

Not only useful for the README: e.g., these slides are written in Markdown!

Basic work: clone or fork?¶


Cloning a repo¶

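
Cloning copies a repository, with its full history, to your computer. The sketch below clones from a local directory so that it runs anywhere; the names `source` and `my-copy` are made up, and for a repository hosted on GitHub you would pass its URL instead, as in the comment.

```shell
base_dir="$(mktemp -d)"              # hypothetical scratch location
cd "$base_dir"

# Build a tiny repository to clone from (stand-in for one hosted on GitHub).
git init -q -b main source
git -C source -c user.name="Demo User" -c user.email="demo@example.com" \
    commit -q --allow-empty -m "Initial commit"

git clone -q source my-copy          # in practice: git clone https://github.com/<user>/<repo>.git
cd my-copy
git remote -v                        # "origin" points back to where the clone came from
git log --oneline                    # the history came along with the files
```

A fork, by contrast, is created on GitHub itself: it is a copy of someone else's repository under your own account, which you can then clone in the same way.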