Project Requirements
This page contains the requirements for the major project assignment for this unit.
Your project will be done in a group, usually of four people, and should be a significant analysis of some data to answer research questions. The goal of the project is to make use of the tools and techniques that we have studied in the unit and apply them to a real-world problem. This is your chance to demonstrate your mastery of these techniques and develop a personal project that you can be proud of.
The core requirements of the project are:
1. The analysis has well-defined questions or purposes
2. It involves some data preparation and exploration
3. You define a baseline performance with a simple model
4. You make use of at least two analysis/prediction techniques from the unit
5. You develop some kind of visualisation of the data or results
Requirement 2 means that selecting a pre-defined problem from Kaggle is not a suitable project. In most cases Kaggle projects have data that has been prepared for you and define a fixed problem. In this project we want some of your work to be on preparing data for analysis. This could mean transforming it from some format to a data frame or combining data from more than one source. You may find Kaggle useful as a source of data, but you should then augment what it gives you with other data to answer your question. We are also looking for data exploration: e.g. variable identification, univariate analysis, bivariate analysis, missing value treatment etc.
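As a rough illustration of the preparation and exploration steps described above, the sketch below combines two sources and treats missing values with pandas. The two tables, their column names and the fill strategy are made up for the example; a real project would load its own files instead.

```python
import pandas as pd

# Hypothetical data: two small tables standing in for two separate sources.
players = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "name": ["A", "B", "C", "D"],
    "height": [180.0, None, 175.0, 190.0],
    "nation": ["Germany", "Brazil", "Brazil", None],
})
clubs = pd.DataFrame({
    "id": [1, 2, 3],
    "league": ["Bundesliga", "Serie A", "La Liga"],
})

# Combine the two sources on their shared key; a left join keeps every player.
merged = players.merge(clubs, on="id", how="left")

# Variable identification: column types and missing values per column.
column_types = merged.dtypes
missing_counts = merged.isna().sum()

# Univariate analysis of one numeric variable.
height_summary = merged["height"].describe()

# Missing value treatment (one possible choice, not the only one):
# median for the numeric column, an explicit label for the categorical ones.
cleaned = merged.copy()
cleaned["height"] = cleaned["height"].fillna(cleaned["height"].median())
cleaned[["nation", "league"]] = cleaned[["nation", "league"]].fillna("Unknown")

print(missing_counts)
print(cleaned)
```

Whether to fill, drop or flag missing values depends on the question being asked, so the choice should be justified in the notebook.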
For requirement 4 you are asked to use more than one analysis technique. For example, you might use clustering to find groups within the data and then perform a linear regression on some variables within the groups. Or, you might use logistic regression to establish a baseline classification performance and then apply a neural network to see if you can improve performance. This would also satisfy requirement 3.
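The second example above (a logistic regression baseline followed by a neural network) could be sketched with scikit-learn roughly as follows. The synthetic data set, the network size and the train/test split are assumptions chosen only to keep the example self-contained.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a prepared data set (assumption, not project data).
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Baseline: a simple, well-understood model.
baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
baseline.fit(X_train, y_train)
baseline_acc = accuracy_score(y_test, baseline.predict(X_test))

# Candidate: a small neural network evaluated on the same held-out split.
candidate = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000, random_state=0),
)
candidate.fit(X_train, y_train)
candidate_acc = accuracy_score(y_test, candidate.predict(X_test))

print(f"baseline accuracy:  {baseline_acc:.3f}")
print(f"candidate accuracy: {candidate_acc:.3f}")
```

Evaluating both models on the same held-out split is what makes the comparison meaningful; the more complex model is not guaranteed to win.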
Requirement 5 can be met in any part of the project: the data itself, data exploration, or data analysis. For example, you may use a bar chart to visualise a categorical variable or a histogram for a numerical variable.
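A minimal sketch of those two chart types with matplotlib is shown below. The data frame and its `position` and `height` columns are invented for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data: one categorical and one numerical variable.
df = pd.DataFrame({
    "position": ["ST", "CB", "ST", "GK", "CM", "CB", "ST"],
    "height": [178, 190, 182, 193, 175, 188, 180],
})

# Count how often each category occurs.
position_counts = df["position"].value_counts()

fig, (ax_bar, ax_hist) = plt.subplots(1, 2, figsize=(10, 4))

# Bar chart for the categorical variable.
ax_bar.bar(list(position_counts.index), position_counts.values)
ax_bar.set_xlabel("position")
ax_bar.set_ylabel("number of players")

# Histogram for the numerical variable.
ax_hist.hist(df["height"], bins=5)
ax_hist.set_xlabel("height")
ax_hist.set_ylabel("number of players")

fig.tight_layout()
```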
Here are a few suggestions for the advanced project:
1. Use linear regression as a predictive model and improve it using polynomial regression. Find important features using recursive feature elimination (RFE).
2. Make use of various classification/prediction/clustering techniques from the unit.
3. Use various criteria (or metrics) for evaluation: e.g. Mean Squared Error (MSE), Mean Absolute Error (MAE), and R-squared (R2) for a regression problem; accuracy, F-score, and Area Under the ROC Curve (AUC) for a classification problem.
4. First implement a simple algorithm (or model) as a baseline and then improve on the baseline using more complex models/techniques.
5. Do parameter analysis to find out which configuration of parameters gives the best model performance. For example, compare the performance under different values of k for the KNN algorithm.
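The parameter analysis in the last suggestion could look roughly like the sketch below, which compares KNN under several values of k using 5-fold cross-validation and the F-score. The synthetic data, the list of k values and the choice of metric are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a prepared data set (assumption, not project data).
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)

# Candidate values of k to compare (an arbitrary illustrative grid).
k_values = [1, 3, 5, 7, 9, 11, 15]

mean_f1_by_k = {}
for k in k_values:
    # Scale features first, since KNN is distance based.
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    # 5-fold cross-validated F-score for this value of k.
    scores = cross_val_score(model, X, y, cv=5, scoring="f1")
    mean_f1_by_k[k] = scores.mean()

# Pick the k with the highest mean cross-validated F-score.
best_k = max(mean_f1_by_k, key=mean_f1_by_k.get)

for k, score in mean_f1_by_k.items():
    print(f"k={k:2d}  mean F1={score:.3f}")
print("best k:", best_k)
```

Plotting the mean score against k is a natural way to report the result, and the same loop works for any other single parameter.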
The project to be done:
The project is to be done in a Jupyter notebook, with appropriate graphics such as heat maps, analysis methods such as logistic regression and clustering, and suitable datasets such as .csv files.
Problem statement (What problem are you trying to solve?)
Are we able to derive meaning from data that would inform the decision making of a football club?
Data Sources (What are the data sources you worked on in this project?)
Methods used (What are the different data analysis methods you applied on your datasets)
Question Template
1. Data points needed (Name, Shooting, DOB, Nation)
2. Data analysis method
3. Explanation of data analysis method
1. Do some countries naturally produce particular types of players?
a) ‘id’, ‘name’, ‘height’, ‘weight’, ‘position’, ‘traits’, ‘pace’, ‘shooting’, ‘passing’, ‘dribbling’, ‘defending’, ‘physicality’
b) Plotting
c) A descending bar graph for each nation that shows the most common attributes and their rank (e.g. Germans are good defenders because they have a higher than average height, which allows them to defend corners and free kicks well). Could categorise nations into continents and see if continents produce particular players too
2. What types of players do different leagues/nations produce (attackers, midfielders, defenders, goalkeepers) (hard question to do since players are not categorised into general positions - we can just do strikers)
a) Names, position, league they play in/nation they’re from
b) Plotting
c) have a bar graph that shows how many players of each type are produced by each league/nation
3. Which league/teams/nations produce the highest potential players (players under 21 that have a rating of 70/75+)
a) Player name, player age, rating, league/team/nation they play for
b) not sure but can use same as above for now (plot) (I feel like we can do some clustering model for this)
c) bar graph that shows the distribution of players below the age of 21 with a rating above 70 (sure that we can use a different analysis technique for this but not exactly sure what)
4. (depending on football-api data) predict the highest goal scorers/assist makers/clean sheets/passes completed/LEAGUE STANDINGS - using historical data, predict the results for the 2020/2021 season
a) Depending on which focus area we go with this will differ
b) Logistic regression (or even linear regression if logistic is too complex)
c) Prediction model to compare to real-life results and see whether our data is able to provide a reliable predictor
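Questions 1 and 3 above both come down to grouping players and aggregating, which can be sketched in pandas as follows. The table is invented, and the column names `age` and `rating` are assumed names for the player age and rating fields; the real dataset may label them differently.

```python
import pandas as pd

# Hypothetical player table; values are made up for illustration.
players = pd.DataFrame({
    "name": ["P1", "P2", "P3", "P4", "P5", "P6"],
    "nation": ["Germany", "Germany", "Brazil", "Brazil", "Brazil", "France"],
    "age": [19, 25, 20, 18, 30, 20],
    "rating": [74, 82, 71, 72, 85, 77],
    "defending": [70, 80, 40, 35, 50, 60],
})

# Question 1: average of one attribute per nation, sorted descending,
# ready to be drawn as a descending bar graph.
mean_defending = (
    players.groupby("nation")["defending"].mean().sort_values(ascending=False)
)

# Question 3: players under 21 with a rating of 70 or more, counted per nation.
is_prospect = (players["age"] < 21) & (players["rating"] >= 70)
prospects_per_nation = (
    players[is_prospect].groupby("nation").size().sort_values(ascending=False)
)

print(mean_defending)
print(prospects_per_nation)
```

Either result is a pandas Series, so calling `.plot.bar()` on it in the notebook gives the bar graph described in the outline. Grouping by league or team instead of nation only changes the `groupby` key.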