General information

This assignment is assessed. Your work must be submitted by 11:59pm on 24th March 2025.

Start by downloading the project file: right-click the link below and choose Save File As…

https://www.massey.ac.nz/~jcmarsha/161324/assessment/ass1.Rmd

Then load it into RStudio.

Before you start, make sure you can Knit this document to produce an HTML file from it.

Exercise 1: Exploratory analysis of UK weather records

This exercise is concerned with weather records obtained for three cities in England, namely Durham, Sheffield and Oxford. For each month of the years 1929 to 2012, the rainfall (in millimetres) and the number of hours of sunshine are recorded. There are some missing values.

In the assignment markdown file, this dataset is read in as the data frame uksun.

  1. Which variables have missing data? For each of these variables, how many data points are missing? 2 marks

  2. Which city has the highest mean sunshine hours, and which has the lowest mean sunshine hours? 2 marks

  3. Obtain estimates of the mean rainfall per month in all three cities. Ensure you clearly present your code as well as your answers. 3 marks

  4. Reproduce the following graphic as closely as you can. 10 marks

    Some hints:

    • Pivoting the data longer might be useful. The names_prefix option may assist, and you might need to do this in two steps (for rain and then sun).
    • patchwork can be used to assemble plots.
    • The light lines represent each year’s pattern. They’re black with some transparency.
    • The curves are smoothers across all years. Smoothers operate only on numeric data.
    • The colours are ‘steelblue’, ‘brown’ and ‘tan’.
    • The month.abb vector contains month abbreviations.
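The pivoting and smoothing hints above can be illustrated on made-up data. This is only a sketch under stated assumptions: the data frame toy and its columns rain_A and rain_B are hypothetical, and the real column names in uksun may differ, so the prefix and column selection would need adapting.

```r
library(tidyr)
library(ggplot2)

# Hypothetical wide data: one rainfall column per city (names are made up).
toy <- data.frame(
  year   = rep(2001:2003, each = 12),
  month  = rep(1:12, times = 3),
  rain_A = seq(40, 75, length.out = 36),
  rain_B = seq(60, 95, length.out = 36)
)

# names_prefix strips "rain_" so the new city column holds just "A" or "B".
rain_long <- pivot_longer(
  toy,
  cols         = starts_with("rain_"),
  names_to     = "city",
  names_prefix = "rain_",
  values_to    = "rain"
)

# month stays numeric so the smoother can use it;
# month.abb is used only to label the axis.
p <- ggplot(rain_long, aes(x = month, y = rain)) +
  geom_line(aes(group = year), colour = "black", alpha = 0.3) +
  geom_smooth(colour = "steelblue", se = FALSE) +
  scale_x_continuous(breaks = 1:12, labels = month.abb) +
  facet_wrap(vars(city))
```

Printing p draws the plot. With patchwork loaded, two plots built this way can be combined with operators such as p1 / p2 (stacked) or p1 + p2 (side by side).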

Exercise 2: Imputation of UK weather records

  1. Consider just the July sunshine hours for each year in Durham city. Use mean imputation to fill in the missing values, and then produce a histogram of the July sunshine hours in Durham, colouring by whether the values were imputed or not. 5 marks

  2. Compute the standard deviation of the July sunshine hours in Durham city before and after mean imputation. Which dataset is more dispersed? Is this what you expect? Clearly explain your answer. 4 marks

  3. Use k-nearest neighbour imputation, with \(k=5\) and then again with \(k=100\), to fill in all the missing values in the dataset. In computing these imputations, use all the weather data but not the year and month variables.

    Produce separate scatterplots for \(k=5\) and \(k=100\) of Durham city sunshine against Oxford sunshine, with imputed data points coloured red and all other data points coloured black.

    Do the imputed data points appear to follow the same trend as the real data? Clearly explain whether or not this should be expected. 6 marks
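The mechanics of the two imputation methods named in this exercise can be sketched in base R on made-up data. Everything here is an assumption for illustration: the data frame toy, its columns x1, x2 and y, and the helper knn_impute are hypothetical and not part of the assignment file. The helper imputes a single variable from fully observed predictors and uses the median of the neighbours, which is one possible choice. A package implementation, which also copes with missing values in the predictors, would normally be used on the real data.

```r
# Toy data only: values and column names are made up for illustration.
set.seed(1)
toy <- data.frame(x1 = rnorm(30), x2 = rnorm(30))
toy$y <- 2 * toy$x1 + rnorm(30, sd = 0.3)
toy$y[c(3, 11, 20)] <- NA

# Mean imputation: record which entries were missing before filling them,
# so they can be coloured differently in a plot later.
was_missing <- is.na(toy$y)
y_mean <- ifelse(was_missing, mean(toy$y, na.rm = TRUE), toy$y)

# Spread before (ignoring NAs) and after mean imputation, for comparison.
sd_before <- sd(toy$y, na.rm = TRUE)
sd_after  <- sd(y_mean)

# Hypothetical helper: k-nearest-neighbour imputation of one variable.
# 'predictors' must be a numeric matrix with no missing values.
knn_impute <- function(target, predictors, k) {
  filled <- target
  donors <- which(!is.na(target))
  for (i in which(is.na(target))) {
    # Euclidean distance from row i to every row with an observed target.
    diffs <- sweep(predictors[donors, , drop = FALSE], 2, predictors[i, ])
    d <- sqrt(rowSums(diffs^2))
    # Take the k closest donors (or all of them if fewer than k exist).
    nearest <- donors[order(d)[seq_len(min(k, length(donors)))]]
    filled[i] <- median(target[nearest])
  }
  filled
}

X <- as.matrix(toy[c("x1", "x2")])
y_knn5   <- knn_impute(toy$y, X, k = 5)
y_knn100 <- knn_impute(toy$y, X, k = 100)
```

Because toy has only 27 observed values of y, the call with k = 100 falls back to using every observed row as a neighbour. Comparing y_knn5 and y_knn100 at the positions flagged by was_missing shows how the choice of k changes the imputed values.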

Exercise 3: Woops!

During preparation of gene samples on 1024 subjects, a careless lab technician contaminated one (and only one) of them. They must figure out which is the corrupted sample, otherwise all will have to be discarded. The lab technician has come seeking a data miner for help.

In the assignment markdown file, this dataset is read in as the data frame woops.

You will see that woops contains 1024 rows (one for each subject) and 4 columns: a row identifier id, and 3 columns of genes labelled gene1, gene2 and gene3. The entries in the gene columns are standardised measures of gene expression.

Using whichever means you like, identify the incorrect row in this data. Your answer should clearly state which row is incorrect in addition to describing how you found the answer, including any code used. 8 marks

Exercise 4: Reflection

You are required to provide a summary of any assistance you utilised to complete the assignment. This may include things such as:

In addition, you should write a paragraph or more (no more than 200 words) reflecting on how the above sources helped you learn things for the assignment, for the course, or about data mining generally. 10 marks