This assignment is assessed. Your work must be submitted by 11:59pm on 24th March 2025.
Start by downloading the project file: right-click the link below and choose Save File As…
https://www.massey.ac.nz/~jcmarsha/161324/assessment/ass1.Rmd
Then load it into RStudio.
Before you start, make sure you can Knit this document to produce an HTML file from it.
This exercise concerns weather records from three cities in England: Durham, Sheffield and Oxford. For each month in the years 1929-2012, the rainfall (in millimetres) and the number of hours of sunshine are recorded. There are some missing values.
In the assignment markdown file, this dataset is read in as the data frame uksun.
Which variables have missing data? For each of these variables, how many data points are missing? 2 marks
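A hint on technique: one common base R idiom for counting missing values per column is sketched below. The data frame `toy` and its column names are made up for illustration and are not those of uksun.

```r
# Toy data frame standing in for real data; the column names are hypothetical.
toy <- data.frame(
  year   = c(2000, 2000, 2001, 2001),
  rain_a = c(50.2, NA, 61.0, 48.3),
  sun_a  = c(NA, 120.5, NA, 98.1)
)

# is.na() gives a logical matrix; colSums() counts the TRUEs in each column.
n_missing <- colSums(is.na(toy))

# Keep only the columns that actually contain missing values.
n_missing[n_missing > 0]
```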
Which city has the highest mean sunshine hours, and which has the lowest mean sunshine hours? 2 marks
Obtain estimates of the mean rainfall per month in all three cities. Ensure you clearly present your code as well as your answers. 3 marks
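Both questions on means involve columns with missing values, which affects how a mean is computed in R. A minimal sketch on made-up data follows; the names `toy`, `city_a` and `city_b` are hypothetical.

```r
# Made-up measurements for two hypothetical cities, each with a missing value.
toy <- data.frame(
  city_a = c(50, NA, 70),
  city_b = c(40, 60, NA)
)

# By default, mean() returns NA when any value is missing.
mean(toy$city_a)

# na.rm = TRUE drops the missing values before averaging.
city_means <- colMeans(toy, na.rm = TRUE)
city_means
```

Note that dropping missing values silently changes how many observations each mean is based on, which is worth keeping in mind when comparing cities.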
Reproduce the following graphic as closely as you can. 10 marks
Some hints:
names_prefix option may assist, and you might need to do this in two steps (for rain and then sun).
patchwork can be used to assemble plots.
month.abb vector contains month abbreviations.
Consider just the July sunshine hours for each year in Durham city. Use mean imputation to fill in the missing values, and then produce a histogram of the July sunshine hours in Durham, colouring by whether the values were imputed or not. 5 marks
Compute the standard deviation of the July sunshine hours in Durham city before and after mean imputation. Which dataset is more dispersed? Is this what you expect? Clearly explain your answer. 4 marks
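The mechanics of mean imputation, and of comparing the spread before and after, can be sketched on a short made-up vector. The values in `sun` are invented for illustration and do not come from uksun.

```r
# Hypothetical sunshine hours with two missing years.
sun <- c(150, 180, NA, 210, NA, 160)

# Flag which entries are missing before filling them in.
imputed <- is.na(sun)

# Replace each missing entry with the mean of the observed values.
sun_filled <- sun
sun_filled[imputed] <- mean(sun, na.rm = TRUE)

# Standard deviation of the observed values only, and of the filled-in vector.
sd_before <- sd(sun, na.rm = TRUE)
sd_after  <- sd(sun_filled)
```

Keeping the `imputed` flag alongside the filled-in values is what allows a plot to be coloured by whether each value was imputed.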
Use k-nearest neighbour imputation, with \(k=5\) and then again with \(k=100\), to fill in all the missing values in the dataset. In computing these imputations, use all the weather data but not the year and month variables.
Produce separate scatterplots for \(k=5\) and \(k=100\) of Durham city sunshine against Oxford sunshine, with imputed data points coloured red and all other data points coloured black.
Do the imputed data points appear to follow the same trend as the real data? Clearly explain whether or not this should be expected. 6 marks
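To make the idea behind k-nearest neighbour imputation concrete, here is a hand-rolled sketch for a single column. It is not the implementation of any particular package: the function `knn_impute_one` is hypothetical, it assumes the predictor columns are complete, and it fills each gap with the mean of the neighbours' values (packages may use a different summary, such as the median).

```r
# Sketch of kNN imputation for one column. Distances are Euclidean and are
# computed from the other columns, which are assumed to have no missing values.
knn_impute_one <- function(data, target, k) {
  predictors <- setdiff(names(data), target)
  x <- as.matrix(data[predictors])
  y <- data[[target]]
  observed <- which(!is.na(y))
  for (i in which(is.na(y))) {
    # Distance from row i to every row where the target was observed.
    diffs <- sweep(x[observed, , drop = FALSE], 2, x[i, ])
    d <- sqrt(rowSums(diffs^2))
    # Take the k closest rows (or all of them if fewer than k exist).
    nearest <- observed[order(d)[seq_len(min(k, length(observed)))]]
    # Fill the gap with the mean of the neighbours' originally observed values.
    y[i] <- mean(data[[target]][nearest])
  }
  y
}

# Made-up data: b is roughly 10 times a, with one missing value in b.
toy <- data.frame(
  a = c(1, 2, 3, 10, 11, 2.5),
  b = c(10, 20, 30, 100, 110, NA)
)

small_k <- knn_impute_one(toy, "b", k = 2)  # uses the two rows closest in a
large_k <- knn_impute_one(toy, "b", k = 5)  # uses all five observed rows
```

Comparing `small_k[6]` with `large_k[6]` shows how the choice of k changes which rows contribute to the imputed value.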
During preparation of gene samples on 1024 subjects, a careless lab technician contaminated one (and only one) of them. They must figure out which is the corrupted sample, otherwise all will have to be discarded. The lab technician has come seeking a data miner for help.
In the assignment markdown file, this dataset is read in as the data frame woops.
You will see that woops contains 1024 rows (one for each subject) and 4 columns: a row identifier id, and 3 columns of genes labelled gene1, gene2 and gene3. The entries in the gene columns are standardised measures of gene expression.
Using whichever means you like, identify the incorrect row in this data. Your answer should clearly state which row is incorrect in addition to describing how you found the answer, including any code used. 8 marks
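The method is left open, so the following is only one possible line of attack, shown on simulated data. It assumes the odd row departs from the joint pattern of the variables even though each value looks unremarkable on its own; whether that holds for woops is for you to investigate. The data frame `toy`, its columns `g1` and `g2`, and the planted row are all invented.

```r
set.seed(1)

# Simulated data: two strongly correlated standardised measurements.
x <- rnorm(200)
toy <- data.frame(id = 1:200, g1 = x, g2 = x + rnorm(200, sd = 0.1))

# Plant one row that breaks the joint pattern; neither value is extreme alone.
toy[57, c("g1", "g2")] <- c(1.5, -1.5)

# Squared Mahalanobis distance measures how unusual each row is
# relative to the centre and covariance of all rows.
m <- as.matrix(toy[c("g1", "g2")])
d2 <- mahalanobis(m, center = colMeans(m), cov = cov(m))

# id of the row with the largest distance.
suspect <- toy$id[which.max(d2)]
suspect
```

Plots of pairs of variables are a natural complement to a numerical summary like this, since they show directly whether any point sits away from the rest.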
You are required to provide a summary of any assistance you utilised to complete the assignment. This may include things such as:
In addition, you should write a paragraph or more (no more than 200 words) reflecting on how the above sources helped you learn things for the assignment, for the course, or about data mining generally. 10 marks