ML + NLP Challenge
1️⃣ Project Overview
This is a 3-hour in-class challenge where teams of up to 2 students build an end-to-end pipeline that combines natural language processing (NLP) with a machine-learning classifier to predict the political orientation (e.g., left, centre, right) of news articles.
The deliverable is a working model, a performance evaluation, and reproducible code.
2️⃣ Timeline & Milestones
| Phase | Duration | What Happens | Deliverables |
|---|---|---|---|
| In‑class challenge | 3 hours (during class) | • Professors/TAs available for hints.<br>• Explore the data, set up a baseline, iterate on models. | A runnable prototype (script or notebook) that trains a model and outputs predictions on a validation split. |
| Immediate check‑in | ≈ 15 min before the challenge ends | • Each group announces its best‑performing model (validation accuracy/F1).<br>• Brief discussion of challenges and next steps. | One‑sentence summary of the top model and its metric. |
| Post‑class refinement | 48 hours (deadline: [date + 48 h]) | • Continue training, fine‑tune hyper‑parameters, conduct error analysis, improve reproducibility.<br>• No further professor input is allowed. | Fully committed GitHub repository (see § 4) **with a notebook that follows the good‑practice guidelines**. |
3️⃣ Materials & Repository
- GitHub Classroom repository – each team receives a private repo:
  https://classroom.github.com/a/sOTgvaAz
- The repo contains:
  - `data/` – sample training/validation CSV files.
  - `src/` – starter scripts for data loading, preprocessing, and a baseline model.
  - `requirements.txt` – required Python packages (scikit‑learn, transformers, pandas, …).
  - `README.md` – template with sections for description, setup, results, and usage.
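Loading a split might look like the sketch below; the in-memory CSV and the `text`/`label` column names are assumptions, so check the files in `data/` and the starter scripts in `src/` for the real paths and names:

```python
import io

import pandas as pd

# Stand-in for a file such as data/<split>.csv; column names are assumed.
SAMPLE_CSV = "text,label\nparliament passes new budget,centre\nunion strike continues,left\n"

def load_split(source):
    """Return (texts, labels) from a CSV with assumed 'text' and 'label' columns."""
    df = pd.read_csv(source)
    return df["text"], df["label"]

# With real data you would pass a file path instead of a StringIO buffer.
texts, labels = load_split(io.StringIO(SAMPLE_CSV))
```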
4️⃣ Good Notebook Practices
To ensure that the submitted notebook can be executed by anyone (TA, future students, or yourself weeks later), follow these conventions:
- Reproducibility
  - Set a global random seed at the top of the notebook (e.g., `np.random.seed(42); torch.manual_seed(42)`).
  - Record the exact package versions used (e.g., `pip freeze > requirements.txt`).
  - Store any trained artefacts (model weights, vectorizers) using deterministic filenames (e.g., `model_v1.pkl`).
- Run From Start to Finish
  - The notebook must be executable cell by cell from the first cell without manual edits in the middle.
  - Avoid hidden state: do not rely on variables created in previous interactive sessions.
  - Include a “Setup” section that installs packages (if needed) and loads data.
- Comments & Documentation
  - Use Markdown cells liberally to explain why a step is performed, not just what is done.
  - Add inline comments (`# comment`) in code cells for non‑obvious lines.
  - Summarise results (tables, plots) in a concluding Markdown cell that interprets the numbers.
- Clear Cell Structure
  - Use section headers (`# Section`) in code cells to mirror the logical flow (e.g., `# 1️⃣ Data Loading`).
  - Keep each cell focused on a single logical operation (loading, cleaning, modelling, evaluation).
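The reproducibility advice above could translate into a first code cell like this sketch (the `torch` line is kept as a comment since PyTorch is only needed for the transformer route):

```python
import random

import numpy as np

SEED = 42
random.seed(SEED)     # Python's built-in RNG
np.random.seed(SEED)  # NumPy RNG, also used by scikit-learn by default
# If using PyTorch: import torch; torch.manual_seed(SEED)
```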
5️⃣ Evaluation Criteria
- Correctness – code runs end‑to‑end and reproduces reported numbers
- Performance – validation metric (accuracy / macro‑F1) compared to baseline
- Pipeline completeness – data preprocessing, model training, prediction, and serialization
- Documentation & reproducibility – clear README, environment file, commit history, and notebook best practices
- Ethical awareness – brief discussion of bias, data limits, or potential misuse
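The serialization criterion can be met with a few lines of joblib (installed alongside scikit-learn); the toy fit below only illustrates the dump/load round trip, and `model_v1.pkl` follows the filename convention from § 4:

```python
import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Minimal fitted pipeline so there is something to serialize.
model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(["lower taxes now", "more public spending"], ["right", "left"])

joblib.dump(model, "model_v1.pkl")   # deterministic filename
restored = joblib.load("model_v1.pkl")
```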
6️⃣ Submission Procedure
- Commit all source files, notebooks, `requirements.txt`, and the final `README.md` to the main branch of your GitHub Classroom repo.
- Tag the final commit with `v1.0-final` to help TAs locate the correct version.
- Ensure the repository is visible to teaching staff (they have access through the Classroom organization).
- No additional zip files or external links are required.
7️⃣ Helpful Tips
- Start with the baseline (e.g., TF‑IDF + Logistic Regression) to obtain a quick score, then try a more complex approach (e.g., fine‑tuned DistilBERT) if time permits.
- Modularise the pipeline using `sklearn.pipeline.Pipeline` objects or separate functions – this simplifies later refinements.
- Document any external resources (pre‑trained weights, extra libraries) in the README.
- Keep the notebook clean and runnable from top to bottom; remove stray print statements or debugging code before the final commit.
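The suggested baseline can be sketched as a single `Pipeline`; the four toy articles below are placeholders for the real training split:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Placeholder texts/labels; in practice load these from data/.
texts = [
    "tax cuts will boost business growth",
    "expand public healthcare for everyone",
    "a balanced, pragmatic budget compromise",
    "raise the minimum wage nationwide",
]
labels = ["right", "left", "centre", "left"]

# One object holds vectorization + classification, so swapping either
# component later (e.g., for a transformer encoder) is a contained change.
baseline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("clf", LogisticRegression(max_iter=1000)),
])
baseline.fit(texts, labels)
preds = baseline.predict(texts)
```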
Good luck, and enjoy building your NLP + ML solution! 🎉
8️⃣ Learning Objectives
- Implement a full end-to-end NLP + ML pipeline
- Practise speed, teamwork, and rapid prototyping