--- title: 'Workshop 9: More classifiers' output: html_document --- In this workshop we'll be looking at some more classifiers, all of which you have met before: - k-nearest neighbours - trees and forests - neural networks As always, we'll use the `tidymodels` framework for fitting, estimating classes and for model accuracy. ## Return to the Crime Scene Let's reload the data on glass from earlier. ```{r, message=FALSE} library(tidyverse) library(rsample) library(parsnip) library(yardstick) set.seed(1234) glass <- read_csv("https://www.massey.ac.nz/~jcmarsha/data/glass.csv") |> mutate(Type = as_factor(Type)) split <- initial_split(glass, prop=0.75) glass.train <- training(split) glass.test <- testing(split) ``` ### Try yourself 1. Fit the following model types to the `glass.train` data: - A k-nearest neighbour classifier with `nearest_neighbour()` - A classification tree with `decision_tree()` - A random forest with `rand_forest()` - A neural network with `mlp()` 2. Evaluate their performance on the `glass.test` data using confusion matrices and accuracy. 3. You may wish to try tuning the models, e.g. altering the variables used, changing `k` for the knn classifier, or `size` for the neural network. 4. Which model(s) perform best? ## Singling out Scribes: The "Avila" Bible The Avila data set has been extracted from 800 images of the the "Avila Bible", a giant Latin copy of the whole Bible produced during the XII century between Italy and Spain. The palaeographic analysis of the manuscript has individuated the presence of 12 copyists, or scribes. There are 10 numeric predictors, each of them normalised to mean 0 and standard deviation 1. The following predictors are available: Variable Description ------------ ----------------------------------------------------- `icd` Inter-column distance (i.e. distance between columns of text on the page) `upmar` Upper margin above the text `lomar` Lower margin below the text `exploit` How much of a column is exploited (filled with ink) for text. `numrow` Number of rows on a page `charratio` The average ratio of width:height of characters `ils` Inter-line spacing `weight` How much each row is filled with ink `numpeaks` How many peaks there are in the ink density across a row (counts vertical bars in letters). `crils` Ratio of charratio and ils `scribe` Which copyist or scribe, identified as unique letters. Target variable. The goal is to assign sections of the text (rows) to the correct scribe. The data has been partitioned into a training and (artificial) testing set: ```{r, message=FALSE} avila.train <- read_csv("https://www.massey.ac.nz/~jcmarsha/data/avila-train.csv") |> mutate(scribe = factor(scribe)) avila.test <- read_csv("https://www.massey.ac.nz/~jcmarsha/data/avila-test.csv") |> mutate(scribe = factor(scribe)) ``` ### Try yourself 1. Start by exploring the data. e.g. you might want to assess the relationship between each variable and the target `scribe`. 2. Fit a range of classification models to the training data. 3. Evaluate their predictive performance on the testing data. 4. Which model do you think works best? 5. If, instead of classification performance, the goal was to understand how the scribes differ, which model would you prefer?