161.324 Data Mining

General information

This assignment is assessed. Your work must be submitted by 11:59pm on 24th March 2025.

All work should be done by altering the markdown file provided below.
Make sure you enter your name in the author field of the markdown document.
Marks are clearly stated next to each question.
All graphs should have clear axis labelling and legends if needed.
The answer alone will not give full marks; you should include explanation and/or descriptions of plots alongside your answers.
Once done, ‘Knit’ your document to HTML and submit the HTML file to Stream.
The last exercise requires you to note down any external sources you used to help you answer the assignment questions and to reflect on how they were used.

Start by downloading the project file: right-click the link below and choose Save File As…

https://www.massey.ac.nz/~jcmarsha/161324/assessment/ass1.Rmd

Then load it into RStudio.

Before you start, make sure you can Knit this document to produce an HTML file from it.

Exercise 1: Exploratory analysis of UK weather records

This exercise is concerned with weather records obtained for three cities in England, namely Durham, Sheffield and Oxford. For each month of the years 1929-2012, the rainfall (in millimetres) and the number of hours of sunshine are recorded. There are some missing values.

In the assignment markdown file, this dataset is read in as the data frame uksun.

Which variables have missing data? For each of these variables, how many data points are missing? 2 marks
Which city has the highest mean sunshine hours, and which has the lowest mean sunshine hours? 2 marks
Obtain estimates of the mean rainfall per month in all three cities. Ensure you clearly present your code as well as your answers. 3 marks
Reproduce the following graphic as closely as you can. 10 marks

Some hints:
- Pivoting the data longer might be useful. The names_prefix option may assist, and you might need to do this in two steps (for rain and then sun).
- patchwork can be used to assemble plots.
- The light lines represent each year’s pattern. They’re black with some transparency.
- The curves are smoothers across all years. Smoothers operate only on numeric data.
- The colours are ‘steelblue’, ‘brown’ and ‘tan’.
- The month.abb vector contains month abbreviations.
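
As a rough illustration of the pivoting hint, the sketch below reshapes a small made-up data frame from wide to long with `tidyr::pivot_longer()`. The data frame `toy` and its column names (`rain_a`, `rain_b`) are hypothetical and are not the names used in uksun, so the prefix and column selection would need adapting.

```r
library(tidyr)

# Toy wide data: the values and column names are invented for illustration
toy <- data.frame(
  year   = c(2000, 2000),
  month  = c(1, 2),
  rain_a = c(50.1, 42.3),
  rain_b = c(61.0, NA)
)

# Gather the rain_* columns into one column, stripping the common prefix
# so that the new 'city' column holds just "a" and "b"
toy_long <- pivot_longer(
  toy,
  cols         = starts_with("rain_"),
  names_to     = "city",
  names_prefix = "rain_",
  values_to    = "rain"
)

# month stays numeric (useful for smoothers); month.abb supplies the labels
toy_long$month_name <- month.abb[toy_long$month]
```

Keeping `month` numeric and storing the abbreviation separately is one possible design; the labels could equally be applied only at plotting time.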

Exercise 2: Imputation of UK weather records

Consider just the July sunshine hours for each year in Durham city. Use mean imputation to fill in the missing values, and then produce a histogram of the July sunshine hours in Durham, colouring by whether the values were imputed or not. 5 marks
Compute the standard deviation of the July sunshine hours in Durham city before and after mean imputation. Which dataset is more dispersed? Is this what you expect? Clearly explain your answer. 4 marks
Use k-nearest neighbour imputation, with \(k=5\) and then again with \(k=100\), to fill in all the missing values in the dataset. In computing these imputations, use all the weather data but not the year and month variables.

Produce separate scatterplots for \(k=5\) and \(k=100\) of Durham city sunshine against Oxford sunshine, with imputed data points coloured red and all other data points coloured black.

Do the imputed data points appear to follow the same trend as the real data? Clearly explain whether or not this should be expected. 6 marks
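
For the mean-imputation step earlier in this exercise, a minimal base R sketch on an invented vector is shown below. It records which entries were missing before filling them, so that the flag is still available afterwards (for example, for colouring a plot). The vector `x` and the names `imputed`, `x_filled` and `toy` are hypothetical and unrelated to uksun.

```r
# Toy values with gaps: these numbers are invented for illustration
x <- c(4, NA, 6, 8, NA)

# Flag the missing entries first; the flag cannot be recovered after filling
imputed <- is.na(x)

# Replace each missing entry with the mean of the observed values
x_filled <- ifelse(imputed, mean(x, na.rm = TRUE), x)

# Keep the values and the flag side by side
toy <- data.frame(value = x_filled, imputed = imputed)
```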

Exercise 3: Woops!

During preparation of gene samples on 1024 subjects, a careless lab technician contaminated one (and only one) of them. They must figure out which is the corrupted sample, otherwise all will have to be discarded. The lab technician has come seeking a data miner for help.

In the assignment markdown file, this dataset is read in as the data frame woops.

You will see that woops contains 1024 rows (one for each subject) and 4 columns: a row identifier id, and 3 gene columns labelled gene1, gene2 and gene3. The entries in the gene columns are standardised measures of gene expression.

Using whichever means you like, identify the incorrect row in this data. Your answer should clearly state which row is incorrect in addition to describing how you found the answer, including any code used. 8 marks

Exercise 4: Reflection

You are required to provide a summary of any assistance you utilised to complete the assignment. This may include things such as:

The use of similar examples from the workshops, lectures or notes
Any questions you asked the teaching team (and their answers)
Any questions you asked of your fellow students (and their answers)
Any information sourced from websites (and their URLs)
Any prompts and answers from a chatbot.