RStudio - needs to be done in R Markdown
161.324 Data Mining
Assignment 1, 2022
General information
This assignment is assessed. Your work must be submitted by 11:59pm on 27th March 2022.
All work should be done by altering the markdown file provided below.
Make sure you enter your name in the author field of the markdown document.
Marks are clearly stated next to each question.
All graphs should have clear axis labelling and legends if needed.
The answer alone will not give full marks; you should include explanation and/or descriptions of plots alongside your answers.
Once done, ‘Knit’ your document to HTML and submit the HTML file to Stream.
Start by downloading the project file by right-clicking and choosing Save File As… here: (
Then load it into RStudio.
Before you start, make sure you can Knit this document to produce an HTML file from it.
Exercise 1: Exploratory analysis of UK weather records
This exercise is concerned with weather records obtained for three cities in England, namely Durham, Sheffield and Oxford. For each month for years 1929-2012, data are recorded on the rainfall (in millimetres) and number of hours of sunshine. There are some missing values.
In the assignment markdown file, this dataset is read in as the data frame uksun.
1. Which variables have missing data? For each of these variables, how many data points are missing? 2 marks
2. Which city has the highest mean sunshine hours, and which has the lowest mean sunshine hours? 2 marks
3. Obtain estimates of the mean rainfall per month in all three cities. Ensure you clearly present your code as well as your answers. 3 marks
4. Reproduce the following graphic as closely as you can. 10 marks

Some hints:
Pivoting the data longer might be useful. The names_prefix option may assist, and you might need to do this in two steps (for rain and then sun). patchwork can be used to assemble plots.
The light lines represent each year’s pattern. They’re black with some transparency.
The curves are smoothers across all years. Smoothers operate only on numeric data.
The colours are ‘steelblue’, ‘brown’ and ‘tan’.
The vector contains month abbreviations.
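As a rough illustration of the two-step pivoting hint, here is a minimal sketch on a toy data frame. The column names (rain_durham, sun_oxford and so on) and all values are hypothetical and are not taken from uksun, whose layout may differ; the sketch only shows how names_prefix strips a common prefix while pivoting longer.

```r
library(tidyr)

# Toy wide data. Column names and values are hypothetical, not those of uksun.
toy <- data.frame(
  year        = c(2000, 2000, 2001, 2001),
  month       = c(1, 2, 1, 2),
  rain_durham = c(50.1, 42.3, 61.0, NA),
  rain_oxford = c(55.4, 40.2, 58.7, 39.9),
  sun_durham  = c(48.0, 70.5, NA, 66.1),
  sun_oxford  = c(60.2, 75.9, 57.3, 80.4)
)

# Step 1: pivot only the rain columns. names_prefix removes "rain_",
# so the new city column holds just the city name.
rain_long <- pivot_longer(
  toy[, c("year", "month", "rain_durham", "rain_oxford")],
  cols         = starts_with("rain_"),
  names_to     = "city",
  names_prefix = "rain_",
  values_to    = "rain"
)

# Step 2: the same for the sun columns.
sun_long <- pivot_longer(
  toy[, c("year", "month", "sun_durham", "sun_oxford")],
  cols         = starts_with("sun_"),
  names_to     = "city",
  names_prefix = "sun_",
  values_to    = "sun"
)

# Combine the two long tables: one row per year, month and city.
long <- merge(rain_long, sun_long, by = c("year", "month", "city"))
```

Keeping month numeric at this stage leaves it usable by a smoother; month labels can be applied to the axis afterwards.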
Exercise 2: Imputation of UK weather records
1. Consider just the July sunshine hours for each year in Durham city. Use mean imputation to fill in the missing values, and then produce a histogram of the July sunshine hours in Durham, colouring by whether the values were imputed or not. 5 marks
2. Compute the standard deviation of the July sunshine hours in Durham city before and after mean imputation. Which dataset is more dispersed? Is this what you expect? Clearly explain your answer. 4 marks
3. Use k-nearest neighbour imputation, with k=5 and then again with k=100, to fill in all the missing values in the dataset. In computing these imputations, use all the weather data but not the year and month variables.
Produce separate scatterplots for k=5 and k=100 of Durham city sunshine against Oxford sunshine, with imputed data points coloured red and all other data points coloured black.
Do the imputed data points appear to follow the same trend as the real data? Clearly explain whether this should be expected or not. 6 marks
Exercise 3: Woops!
During preparation of gene samples on 1024 subjects, a careless lab technician contaminated one (and only one) of them. They must figure out which is the corrupted sample, otherwise all will have to be discarded. The lab technician has come seeking a data miner for help.
In the assignment markdown file, this dataset is read in as the data frame woops.
You will see that woops contains 1024 rows (one for each subject) and 4 columns: a row identifier id, and 3 columns of genes labelled gene1, gene2 and gene3. The entries in the gene columns are standardised measures of gene expression.
Using whichever means you like, identify the incorrect row in this data. Your answer should clearly state which row is incorrect in addition to describing how you found the answer, including any code used. 8 marks
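Since any method is allowed, one possible approach (among many) is a multivariate distance check. The sketch below is an assumption-laden illustration on synthetic data, not the woops data: it plants one row that looks ordinary in each column separately but breaks the relationship between two columns, then ranks rows by Mahalanobis distance. Whether this particular method reveals the contaminated row in woops is not something the sketch can confirm.

```r
set.seed(1)

# Synthetic stand-in for three standardised gene columns; NOT the woops data.
n   <- 200
g1  <- rnorm(n)
toy <- data.frame(
  id    = 1:n,
  gene1 = g1,
  gene2 = g1 + rnorm(n, sd = 0.1),  # strongly correlated with gene1
  gene3 = rnorm(n)
)

# Plant one row whose values are unremarkable column by column
# but inconsistent with the gene1/gene2 relationship.
toy$gene1[57] <- 1
toy$gene2[57] <- -1

# Squared Mahalanobis distance of every row from the centre of the data,
# which accounts for the correlation between columns.
x  <- as.matrix(toy[, c("gene1", "gene2", "gene3")])
d2 <- mahalanobis(x, center = colMeans(x), cov = cov(x))

# The row with the largest distance is the candidate outlier.
suspect <- toy$id[which.max(d2)]
```

A pairs plot of the gene columns is a useful visual companion to a check like this, because an outlier of this kind does not stand out in any single-column summary.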
