Class evaluation

Grading

Grade decomposition

  • Class project (hand-in: May 22nd): 80%
    • Data management: 15%
    • Descriptive analysis: 15%
    • Modeling (ML and/or NLP): 15%
    • Code format (clean code and GitHub): 15%
    • Oral presentation (individual + group component, May 27th): 20%
  • Project pitch: 10%
  • Active class participation (individual): 10%

Course project

In short

The bulk of the evaluation comes from the course project that students have to hand in. It is a computing project carried out in groups of two. The topic is free, but the project must include:

  • The use of several datasets that will be merged and cleaned;
    • Can be collected from the web (scraping, API) or downloaded from existing databases;
  • Descriptive statistics with meaningful visualisations;
  • Modelling (ML and/or NLP) with interpretation of the results.

Students are invited to propose themes that they care about.

The project has to use Git and should be made available on GitHub.

Project expectations

The project is a problem to which you want to find an answer, using data.

The first step is therefore to frame the question and put it in context. You should investigate a subject that appeals to you, so that you can motivate the reader to engage with your approach.

There are three dimensions to the project. For each of them, you can go more or less far, but you should go deeper on at least one of the three (i.e. the “advanced” bullet points below).

1. Data management

The data can be directly available in the form of .txt, .csv files, etc. or come from websites (scraping, API).

You are likely to get data that is not ‘clean’: set up cleaning protocols so that you end this stage with one or more reliable and robust datasets on which to conduct your analysis. This is also the time to create variables that are more understandable, better identified, etc.
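As an illustration of what a cleaning protocol followed by a merge can look like, here is a minimal sketch with pandas. The two tables, their column names and their values are hypothetical and only serve the example; your own data will need its own rules.

```python
import pandas as pd

# Hypothetical raw tables; in a real project they would come from files, an API or scraping.
population = pd.DataFrame({
    "Commune ": [" Liège", "Namur", "Namur", "Mons"],
    "Pop.": ["195,000", "112,000", "112,000", None],
})
income = pd.DataFrame({
    "commune": ["liège", "namur", "charleroi"],
    "median_income": [21000, 24000, 19500],
})


def clean_population(df: pd.DataFrame) -> pd.DataFrame:
    """Return a tidy copy: clear column names, normalised keys, numeric types."""
    out = df.rename(columns={"Commune ": "commune", "Pop.": "population"})
    # Normalise the merge key so that " Liège" and "liège" match.
    out["commune"] = out["commune"].str.strip().str.lower()
    # Remove thousands separators, then convert to numbers (invalid values become NaN).
    out["population"] = pd.to_numeric(
        out["population"].str.replace(",", ""), errors="coerce"
    )
    # Drop exact duplicate rows and rows without a population value.
    return out.drop_duplicates().dropna(subset=["population"]).reset_index(drop=True)


population_clean = clean_population(population)

# validate= raises an error if the key is not unique on both sides;
# indicator= adds a "_merge" column showing where each row was found.
merged = population_clean.merge(
    income, on="commune", how="left", validate="one_to_one", indicator=True
)
print(merged)
```

Putting the cleaning steps in a function makes the protocol explicit and repeatable, and checking the `_merge` column is a quick way to spot rows that did not find a match.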

2. Descriptive analysis and graphical representations

With descriptive statistics, you seek to have an overall view of the major trends in your data: the link with the problem, how it allows you to answer it, what the first elements of the answer are… Each result must be interpreted (what does it show, how does it validate/contradict your argument?). In terms of graphical representations, several levels are possible. You can simply represent your data using matplotlib, go further with seaborn or scikit-plot. The basis of a good visualization is to find the right type of graph for what you want to show (do you need a scatter or a line to represent an evolution?) and to make it visible: a legend that makes sense, axes with names etc.
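To make the point about readable graphs concrete, here is a minimal matplotlib sketch of a line chart showing an evolution over time, with named axes, a title and a legend. The series and labels are hypothetical and purely illustrative.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the script also runs without a display
import matplotlib.pyplot as plt

# Hypothetical yearly values, purely illustrative.
years = [2019, 2020, 2021, 2022, 2023]
series = {
    "City A": [100, 96, 103, 110, 118],
    "City B": [100, 99, 101, 104, 109],
}

fig, ax = plt.subplots(figsize=(6, 4))
for label, values in series.items():
    # A line with markers suits an evolution over time.
    ax.plot(years, values, marker="o", label=label)

ax.set_xlabel("Year")
ax.set_ylabel("Price index (2019 = 100)")
ax.set_title("Evolution of the price index by city")
ax.set_xticks(years)  # show whole years only
ax.legend(title="City")
fig.tight_layout()
```

In a notebook you would normally drop the `matplotlib.use("Agg")` line so that the figure is displayed inline; `fig.savefig(...)` can be used to export it for the slides.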

3. Modelling

Finally, you will propose a modelling approach to complete or reinforce the descriptive analysis. The choice of model does not matter in itself (linear regression, random forest or other), but it must be appropriate (suited to your question) and justified. You can also compare several models that do not have the same purpose. The results must be interpreted as well.
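One possible way to compare two supervised models is to evaluate them with the same cross-validation folds and the same metric. The sketch below does this with scikit-learn on synthetic data; the dataset, the two models and the R² metric are assumptions made for the example, not a required setup.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data, standing in for your own features and target.
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

models = {
    "linear_regression": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

# The same shuffled folds are used for both models so that scores are comparable.
cv = KFold(n_splits=5, shuffle=True, random_state=0)

scores = {}
for name, model in models.items():
    r2 = cross_val_score(model, X, y, cv=cv, scoring="r2")
    scores[name] = r2.mean()
    print(f"{name}: mean R2 = {r2.mean():.3f} (std {r2.std():.3f})")
```

The scores are only the starting point: the report should still explain why each model is a sensible candidate and what the difference in performance tells you about the question.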

Minimum expectations for each part:

Requirements

  1. Data retrieval
    • Standard:
      • Use several data sources, data cleaning
    • Advanced:
      • Data collection with web scraping
      • Merge several datasets
  2. Data visualisation
    • Standard:
      • Propose at least 4 exhibits (graphs or tables)
    • Advanced:
      • One more elaborate figure, such as a map or an unsupervised learning representation
  3. Data modelling
    • Standard:
      • Supervised ML approach: compare at least 2 models
      • or NLP
    • Advanced:
      • Both an ML model and some NLP (for example)

Organisation

Hand-in requirement [due date: May 15th]

  • A report taking the form of a Jupyter Notebook
    • Exception: you may instead develop an application (Dash or Streamlit)
    • Make sure your notebook is “correctly compiled” (i.e. cells numbered [1]-[N])
    • You might need other Python or notebook files for the upstream analysis
  • Everything should be hosted on a GitHub repository: data, code, notebook, slides
    • The project folder should have a well-defined structure
    • Code should be clean, e.g.:
      • functions instead of copy-paste
    • You must have a README.md in the main directory with:
      • Instructions on how we can build the assignment and what it does;
      • It should clearly indicate which file corresponds to the report (or the link to the deployed application).
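As a small illustration of "functions instead of copy-paste", the sketch below applies one function to several columns rather than repeating the same lines for each of them. The function name, the column names and the values are hypothetical.

```python
from statistics import mean, stdev


def standardise(values):
    """Return z-scores: (x - mean) / sample standard deviation."""
    mu = mean(values)
    sigma = stdev(values)
    return [(x - mu) / sigma for x in values]


# Hypothetical columns, purely illustrative.
columns = {
    "price": [100.0, 120.0, 140.0],
    "surface": [50.0, 60.0, 70.0],
}

# One call per column instead of a duplicated block of code per column.
standardised = {name: standardise(values) for name, values in columns.items()}
print(standardised)
```

If the transformation has to change later, it is changed in one place only, which also makes the notebook shorter and easier to review.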

Project pitch [due date: April 11th in class, or April 18th by email]

We will discuss your proposed project together during the project kick-off, so that I can evaluate whether it is doable within the time frame. If you cannot attend this session, send me an email by April 18th specifying the group members and detailing the question, the data and the methods that you plan to use.

Final presentation [May 27th]

In May, you have to hand in the project and present your project to the instructors.

The oral presentation should be based on a beamer presentation. It should cover the question that you aim to answer, the underlying data, and the descriptive and modelling steps. You will be asked questions about your choice of visualisations and modelling approaches.

While the overall grade of the course project is at the group level, the presentation includes an individual dimension.

Submission format

Invite @malkaguillot to collaborate on your GitHub repository by the due date.

Some ideas and past projects

  • List of the 2024 projects
  • A platform inspired by bxl-malade for Liège using opendata Liège and WalStat
  • A map of Belgium with municipality-level information from WalStat for Wallonia, IBSA for Brussels, and statistics-flanders.
  • From past years:
    • Corporate performance and employee sentiment (github)
    • Modelling hotel prices in Paris using data from booking
    • Visualisation of the Real Estate Landscape in the Province of Liège using data from Immoweb