Class project

In short

The bulk (3/4) of the evaluation comes from the course project that students hand in at the end of the semester. This is a computing project carried out in groups of 2. The content is free, but it must include:

  • Enrichment of one or more open-data datasets, or of datasets collected through scraping or APIs;
  • Some visualization;
  • Some modeling.

Students are invited to propose themes that they care about.

The project has to use Git and should be made available on GitHub.

Project expectations

The project is a problem to which you want to find an answer, using one or more data sets.

The first step is therefore to frame and contextualize a problem. You should investigate a subject that appeals to you, so that you can motivate the reader to engage with your approach.

There are three dimensions to the project. For each of these parts, you can go into more or less depth, but you should go deep into at least one of the three.

1. Data collection and cleaning

The data can be directly available as .txt or .csv files, etc., or come from websites (scraping, APIs). The more work goes into the data-retrieval step (e.g. scraping several sites), the more points this part will earn. If the dataset used is a direct download of an existing one, you will have to enrich it in one way or another to get points for this part.
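As an illustration of retrieving data from a web page, the sketch below extracts table rows using only the Python standard library. In a real project you would first fetch the page (e.g. with the requests package); here the HTML snippet is inlined so the example is self-contained, and the commune figures are made up.

```python
# Sketch: extracting rows from an HTML table with the standard library only.
from html.parser import HTMLParser

class TableParser(HTMLParser):
    """Collects the text of every <td> cell, grouped by table row."""
    def __init__(self):
        super().__init__()
        self.rows = []        # all completed rows
        self._row = None      # cells of the row currently being read
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td":
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "tr" and self._row:
            self.rows.append(self._row)
            self._row = None
        elif tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell and self._row is not None:
            self._row.append(data.strip())

# Inlined page fragment (in practice: the body of an HTTP response).
html = """
<table>
  <tr><td>Liège</td><td>195278</td></tr>
  <tr><td>Namur</td><td>111432</td></tr>
</table>
"""

parser = TableParser()
parser.feed(html)
print(parser.rows)  # [['Liège', '195278'], ['Namur', '111432']]
```

For heavier scraping, dedicated libraries such as BeautifulSoup or Scrapy make this kind of extraction much shorter, but the logic is the same: locate the elements, pull out their text, and store them in a tabular structure.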

You are likely to get data that is not ‘clean’: set up cleaning protocols so that you end this stage with a reliable and robust dataset (or datasets) on which to conduct your analysis. This is also the time to create variables that are more understandable, better identified, etc.
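A minimal cleaning protocol with pandas might look like the sketch below (the column names and values are hypothetical): tidy the column names, drop duplicates, coerce types, and discard unusable rows.

```python
# Sketch of a small cleaning protocol with pandas.
# The dataset is invented for illustration.
import pandas as pd

raw = pd.DataFrame({
    "Commune ": ["Liège", "Liège", "Namur", None],   # messy column name, duplicate row
    "pop_2023": ["195278", "195278", "111432", "n/a"],  # numbers stored as strings
})

clean = (
    raw
    .rename(columns=lambda c: c.strip().lower())      # tidy column names
    .drop_duplicates()                                # remove exact duplicates
    .assign(pop_2023=lambda d: pd.to_numeric(d["pop_2023"], errors="coerce"))
    .dropna()                                         # drop rows that cannot be used
    .reset_index(drop=True)
)

print(clean)
```

Writing the protocol as one explicit pipeline (rather than ad-hoc edits) makes the cleaning reproducible: rerunning the notebook on fresh raw data yields the same clean dataset.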

2. Descriptive analysis and graphical representations

With descriptive statistics, you seek an overall view of the major trends in your data: the link with the problem, how the data allow you to answer it, and what the first elements of the answer are. Each result must be interpreted (what does it show? how does it validate or contradict your argument?).

In terms of graphical representation, several levels are possible: you can simply plot your data with matplotlib, or go further with seaborn or scikit-plot. The basis of a good visualization is to find the right type of graph for what you want to show (do you need a scatter plot or a line to represent an evolution?) and to make it readable: a legend that makes sense, named axes, etc.
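For instance, an evolution over time calls for a line plot with labelled axes and a legend. A minimal matplotlib sketch (the price series is invented) could be:

```python
# Sketch: a line plot for an evolution over time, with explicit labels.
# The data are made up for illustration.
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, safe in scripts and CI
import matplotlib.pyplot as plt

years = [2019, 2020, 2021, 2022, 2023]
prices = [1850, 1900, 2050, 2180, 2230]  # hypothetical average prices

fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(years, prices, marker="o", label="Average price (EUR/m²)")
ax.set_xlabel("Year")                      # named axes
ax.set_ylabel("Price (EUR/m²)")
ax.set_title("Hypothetical evolution of housing prices")
ax.legend()                                # a legend that makes sense
fig.savefig("prices.png", dpi=150)
```

The same data plotted as a scatter would hide the trend; choosing the graph type to match the message is the core of the exercise.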

3. Modelling

Last, you will propose a modelling approach to complete or reinforce the descriptive analysis. The specific model does not matter (linear regression, random forest, or another): it must be appropriate (suited to your problem) and justified. You can also compare several models that do not serve the same purpose. The results must be interpreted as well.
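A typical comparison of two supervised models with scikit-learn might look like the sketch below, using cross-validation on a synthetic classification task (the dataset is artificial; in your project it would be your cleaned data).

```python
# Sketch: comparing two classifiers with 5-fold cross-validation.
# The dataset is synthetic, generated only for illustration.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

scores = {}
for name, model in models.items():
    cv = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    scores[name] = cv.mean()
    print(f"{name}: mean accuracy = {cv.mean():.3f}")
```

Reporting cross-validated scores rather than a single train/test split makes the comparison more robust, and the interpretation (why one model wins, what its errors look like) is what earns the points.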

Minimal expectations for each part:

Requirements

  1. Data retrieval
    • Standard: get some data from the web; data cleaning
    • Advanced: data collection with web scraping
  2. Data visualisation
    • Standard: propose at least 4 exhibits (graphs or tables)
    • Advanced: more elaborate figures, such as maps or unsupervised-learning representations
  3. Data modelling
    • Standard: a supervised ML approach comparing at least 2 models, or NLP
    • Advanced: both an ML model and some NLP (for example)

Organisation

Hand-in requirement

  • A report taking the form of a Jupyter Notebook
    • Exception: if you want to develop an application (Dash or Streamlit)
  • Everything should be hosted on a GitHub repository: data, code, notebook, slides
    • The project folder should have a well-defined structure
    • The code should be clean, e.g.:
      • functions instead of copy-paste
    • You must have a README.md in the main directory with:
      • Instructions on how to run the project and what it does;
      • It should clearly indicate which file corresponds to the report (or the link to the deployed application).
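A possible repository layout satisfying the requirements above (the folder and file names are only illustrative):

```
project-name/
├── README.md          # how to run the project, which file is the report
├── requirements.txt   # pinned dependencies, for reproducibility
├── data/              # raw and cleaned datasets (or scripts to download them)
├── src/               # reusable functions (scraping, cleaning, plotting)
├── notebooks/
│   └── report.ipynb   # the report notebook
└── slides/            # presentation slides
```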

Project milestones

We will discuss your proposed project several times so that we can evaluate whether it is doable within the time frame. The project will be organized around the following milestones, materialized by a discussion in class:

  1. Project idea discussion: a round table to exchange about your interests and potential ideas for projects.
  2. Intermediate presentation: short presentation of the data and the method (data collection, visualisation & modelling plans) during the last class of the semester.

Final presentation

In January, you have to hand in the project and present it to the instructors.

The oral exam should be based on a slide (beamer) presentation covering the question you aim to answer, the underlying data, the descriptive analysis, and the modelling steps. You will be asked questions about your choice of visualisation and modelling approaches.

While the overall grade of the course project is at the group level, the presentation includes an individual dimension.

Submission format

Invite @malkaguillot to collaborate on your GitHub repository by the due date (see here).

Indicative grade decomposition (total = 20)

Category                                          Weight
Data (collection and cleaning)                    16%
Descriptive analysis                              16%
Modeling                                          16%
Scientific approach and project reproducibility    8%
Code format (clean code and GitHub)                8%
Oral presentation                                 16%

Some ideas and past projects

  • A platform inspired by bxl-malade for Liège, using open data from Liège and WalStat
  • A map of Belgium with communal level information from WalStat for Wallonia, IBSA for Bruxelles, and statistics-flanders.
  • From past years:
    • Corporate performance and employee sentiment (github)
    • Modelling hotel prices in Paris using data from booking
    • Visualisation of the Real Estate Landscape in the Province of Liège using data from Immoweb