The bulk (3/4) of the evaluation comes from the course project, which students hand in at the end of the semester. It is a computing project carried out in groups of two. The topic is free, but the project will have to include:
Students are invited to propose themes that they care about.
The project has to use Git and should be made available on GitHub.
The project is a problem to which you want to find an answer, using one or more data sets.
The first step is therefore to frame your question and put it in context. You should investigate a subject that appeals to you, so that you can motivate the reader to become involved in your approach.
There are three dimensions to the project. You can go into more or less depth on each of them, but you should explore at least one of the three in depth.
The data can be directly available as files (.txt, .csv, etc.) or come from websites (scraping, API). The more work goes into the data retrieval step (e.g. scraping several sites), the more points this part will get. If your dataset is a direct download of an existing one, you will have to complete it in one way or another to get points for this part.
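As an illustration of completing a downloaded file with data from another source, here is a minimal sketch using only the standard library. The endpoint `API_URL`, the helper names (`fetch_json`, `read_csv_text`, `enrich`) and the toy values are all assumptions made for the example, not part of the course material.

```python
import csv
import io
import json
from urllib.request import urlopen

# Hypothetical endpoint: replace it with the API you actually use.
API_URL = "https://example.org/api/records?format=json"


def fetch_json(url: str, timeout: float = 10.0) -> list:
    """Download a JSON array of records from an API endpoint."""
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def read_csv_text(text: str) -> list:
    """Parse CSV text (e.g. the content of a downloaded file) into row dicts."""
    return list(csv.DictReader(io.StringIO(text)))


def enrich(rows: list, extra: list, key: str) -> list:
    """Left-join `extra` onto `rows` using the shared column `key`."""
    lookup = {str(record[key]): record for record in extra}
    return [{**row, **lookup.get(str(row[key]), {})} for row in rows]


# Offline illustration with invented values. In a real project,
# `api_records` would come from fetch_json(API_URL).
csv_text = "unit,income\nA,100\nB,80\nC,60\n"
api_records = [{"unit": "A", "population": 10}, {"unit": "B", "population": 5}]
dataset = enrich(read_csv_text(csv_text), api_records, key="unit")
```

Rows without a match in the second source (here unit `C`) are kept unchanged, which makes missing values visible at the cleaning stage instead of silently dropping observations.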
You are likely to get data that is not ‘clean’: set up cleaning protocols so that, at the end of this stage, you have one or more reliable and robust datasets on which to conduct your analysis. This is also the time to create variables that are more understandable, better identified, etc.
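A cleaning protocol is easier to reproduce when it is written as a function that takes the raw data and returns a cleaned copy. The sketch below assumes pandas is installed. The column names and values are invented for the example.

```python
import pandas as pd

# Toy raw data, invented for illustration only.
raw = pd.DataFrame(
    {
        "City ": [" paris", "Lyon", "Lyon", None],
        "income_eur": ["1 200", "950", "950", "700"],
    }
)


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Apply a small, repeatable cleaning protocol and return a new frame."""
    out = df.copy()
    # Normalise column names: strip whitespace and lower-case them.
    out.columns = out.columns.str.strip().str.lower()
    # Normalise text values.
    out["city"] = out["city"].str.strip().str.title()
    # Remove the space used as thousands separator, then convert to numbers
    # (values that cannot be parsed become NaN instead of raising).
    out["income_eur"] = pd.to_numeric(
        out["income_eur"].str.replace(" ", "", regex=False), errors="coerce"
    )
    # Drop rows without an identifier, then exact duplicates.
    out = out.dropna(subset=["city"]).drop_duplicates().reset_index(drop=True)
    return out


cleaned = clean(raw)
```

Keeping the raw frame untouched and returning a copy makes it possible to rerun the whole protocol from scratch, which helps with reproducibility.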
With descriptive statistics, you seek to have an overall view of the major trends in your data: the link with the problem, how it allows you to answer it, what the first elements of the answer are… Each result must be interpreted (what does it show, how does it validate/contradict your argument?). In terms of graphic representation, several levels are possible. You can simply represent your data using matplotlib, go further with seaborn or scikit-plot. The basis of a good visualization is to find the right type of graph for what you want to show (do you need a scatter or a line to represent an evolution?) and to make it visible: a legend that makes sense, axes with names etc.
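To make the point about legends and axis names concrete, here is a minimal matplotlib sketch of an evolution over time drawn as lines. The series, labels and output file name are invented for the example.

```python
import matplotlib

matplotlib.use("Agg")  # render to a file, no display needed
import matplotlib.pyplot as plt

# Invented values, for illustration only.
years = [2018, 2019, 2020, 2021]
series = {
    "Group A": [1.0, 1.2, 0.9, 1.4],
    "Group B": [0.8, 0.9, 1.1, 1.0],
}

fig, ax = plt.subplots(figsize=(6, 4))
for label, values in series.items():
    # A line suits an evolution over time; markers show the actual observations.
    ax.plot(years, values, marker="o", label=label)

ax.set_xlabel("Year")
ax.set_ylabel("Indicator (hypothetical unit)")
ax.set_title("Evolution of the indicator by group")
ax.set_xticks(years)  # avoid fractional years on the axis
ax.legend(title="Group")
fig.tight_layout()
fig.savefig("evolution.png", dpi=150)
```

A scatter plot would be the better choice if the aim were to show the relationship between two variables rather than a trend over time.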
Finally, you will propose a modelling approach to complete or reinforce the descriptive analysis. The choice of model is open (linear regression, random forest or other), but it must be appropriate (suited to your problem) and justified. You can also compare several models that do not have the same purpose. The results must be interpreted as well.
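One possible way to justify a model choice is to compare candidates with cross-validation. The sketch below assumes scikit-learn and NumPy are installed and uses synthetic data generated on the spot. It is an illustration, not a required method.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data: the outcome depends linearly on two of three features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

models = {
    "linear regression": LinearRegression(),
    "random forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

# Mean out-of-sample R^2 over 5 folds for each candidate model.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    for name, model in models.items()
}

for name, score in scores.items():
    print(f"{name}: mean R^2 = {score:.2f}")
```

The scores only say how well each model predicts. You still have to interpret what the fitted model tells you about your question, for example through the coefficients of the linear regression.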
Requirements
Dash or Streamlit
README.md in the main directory with:
We will discuss your proposed project several times so that we can evaluate whether it is doable within the time frame. The project will be organized around the following milestones, each marked by a discussion in class:
In January, you have to hand in the project and present it to the instructors.
The oral presentation should be based on Beamer slides. It should cover the question that you aim to answer, the underlying data, and the descriptive and modelling steps. You will be asked questions about your choice of visualisation and modelling approaches.
While the overall grade of the course project is at the group level, the presentation includes an individual dimension.
Invite @malkaguillot to collaborate on your GitHub repository by the due date (see here).
Category | Weight |
---|---|
Data (collection and cleaning) | 16% |
Descriptive analysis | 16% |
Modeling | 16% |
Scientific approach and project reproducibility | 8% |
Code format (clean code and GitHub) | 8% |
Oral presentation | 16% |
Total | 80% |