---
title: "How to weigh a donkey in Morocco"
author: "Joanne Blogs"
output: html_document
---

## Introduction

In this notebook we'll look at data collected on 386 donkeys used to transport goods to market in Morocco. It is a sample of the data collected for *Estimation of the liveweight and body condition of working donkeys in Morocco* by R.A. Pearson and M. Ouassat, published in the Veterinary Record, 1996. The following variables are available.

Variable   Description
--------   -----------
Sex        Sex (male/female)
Age        Age (years)
Bodywt     Weight (kg)
Heartgirth Girth at the heart (cm)
Umbgirth   Girth at the umbilicus (cm)
Length     Length from elbow to buttocks (cm)
Height     Height to the withers (cm)

The goal of the study was to characterise body condition score and to find a method for estimating body weight where no reliable access to scales are available. In this exercise we'll look at how the body weight and other variables are distributed.

At the top of the document is a small set of code describing the title, author etc. **Start by changing the author field to your own name.**

We start by reading the data in from the web and looking at the first few rows:

```{r}
library(ggplot2)
donkey = read.csv("https://www.massey.ac.nz/~jcmarsha/227212/data/donkey.csv")
head(donkey)
```

Next, we'll perform a summary of the donkey data. **Running this code will give an error**. Recall that R is case-sensitive. So the `Donkey` (with capital D) object is unknown to R. Changing it to `donkey` will make it work. Do that now.

```{r}
summary(Donkey)
```

### Distribution of body weight

Recall what is of interest is to find a way to estimate the body weight of donkeys without having to weigh them (donkey scales being somewhat difficult to carry around, while tape measures are more practical). So what we want to be able to do is predict body weight. Let's start with finding out how the body weight is distributed:

```{r}
ggplot(data=donkey) + geom_histogram(mapping=aes(x=Bodywt))
```

You might want to try altering the number of bins for the histogram (noticing that it warns about using a default). You can do this by adding `bins=15` as an extra argument to the `geom_histogram` function like this:

```{r}
ggplot(data=donkey) + geom_histogram(mapping=aes(x=Bodywt), bins=15)
```

*Note: that arguments to functions are separated using the comma.*

Try different bin sizes out, and settle on what you think provides a good summary of the data, balancing the signal from the data against the noise of too many jagged bins.

How would you describe this distribution? Add a sentence or two below describing the distribution. You can potentially comment on center, shape and spread.

The next code block uses a density plot instead. Try altering it you could add `fill=Sex` to change the fill colour for each sex. You might want to add `alpha=0.5` inside the `geom_density` function to make them partially transparent.

```{r}
ggplot(data=donkey) + geom_density(mapping=aes(x=Bodywt))
```

The next code block uses a boxplot instead to look at the distribution of body weights by sex. **Add a sentence describing what you see - is there much of a difference in body weight between male and female donkeys?**

```{r}
ggplot(data=donkey) + geom_boxplot(mapping=aes(x=Sex, y=Bodywt))
```

### Relationship between body weight and other measures

OK, now that we know how the body weights are distributed, we have some idea as to what a donkey weighs. We know it is unlikely to be less than 50kg, and probably no more than 220kg or so. But that is kinda useless for an individual donkey. Let's see if there is a relationship between body weight and some of the other measurements. e.g. to assess body weight versus height we can use:

```{r}
ggplot(data=donkey) + geom_point(mapping=aes(x=Height, y=Bodywt))
```

You should see a clear outlier. Try clicking on the `donkey` dataset in the `Environment` tab (top right) and searching through to find which observation it is. Clicking on the columns to sort them should help! You should find it is observation 52.

A good way to identify this one (and get rid of it) might be to use the `subset` command. Noticing that the outlier is the only one with `Height` more than 150cm, we could use:

```{r}
only_donkeys = subset(donkey, Height < 150)
```

This creates a new data frame (notice it's in your Environment) with all the donkeys except the outlier. Redo the plot above (copy and paste the code) using this new data frame instead.

```{r}

```

Now, add some more code (you can do it in the code block above, or add a new one) to produce plots of body weight versus the other measurement variables (Length, Heartgirth, Umbgirth). Add some notes in response to the following questions:

 - How would you describe the relationships you see in the graphs?
 - Are they increasing or decreasing?
 - Are they straight lined or curved?
 - Which one would describe the **strongest** relationship do you think? What criteria are you using to assess this?

You should have noticed a *slight* curve in the body weight vs heartgirth plot. It turns out that the main tool we have in statistics is fitting a straight line to some data. So we usually try and make sure everything looks like a straight line. A classic way to do this is by using a log transformation. You can do this two ways: Firstly, we could replace `Heartgirth` and `Bodywt` in the `aes` bit with transformed versions `log(Heartgirth)` and `log(Bodywt)`. Try that in the code block below.

```{r}

```

An alternate is to leave them as-is, and add on `+ scale_x_log10()` after the `geom_point()` to switch the x axis to a log scale. In the code below I've also told the `scale_x_log10` function where I'd like the axis breaks/labels to go.

```{r}
ggplot(data=only_donkeys) + geom_point(mapping=aes(x=Heartgirth, y=Bodywt)) +
  scale_x_log10(breaks=c(80,100,120,140)) + scale_y_log10(breaks=c(50,100,150,200))
```

Just for interest, try using a smoother (to show the 'average' relationship). You can do this by switching the `geom_point` to `geom_smooth`.

You can also add both to the same plot - i.e. add the `geom_point` bit and then add on `geom_smooth` as well.

Once you're happy with your plot, add a comment or two about the relationship you see to your notebook to finish it up.