---
title: "The sampling distribution"
author: "Joanne Blogs"
output:
  html_document:
    df_print: paged
---

## Introduction

In this notebook we'll look at the population of all Massey students from last year and see how various statistics computed on a random sample change from sample to sample.

The idea is that normally we only have one sample, a subset of the population. Our goal, though, is to say something about the population, such as estimating a population proportion or population mean. By knowing how these quantities vary from sample to sample, we can get an idea of how to work back from a single sample to the population.

Our population data contain the following variables for students at Massey University.

Variable   Description
--------   -----------
GPA        Grade point average (between 0 and 9 inclusive)
Age        Age (years)
Sex        Sex (female/male)

We start by loading some needed packages/functions, then read the population data from the web and look at the first few rows:

```{r}
library(ggplot2)
# helper functions used below: take_samples(), sample_summary(), proportion()
source("https://www.massey.ac.nz/~jcmarsha/227212/sampling.R")
massey = read.csv("https://www.massey.ac.nz/~jcmarsha/227212/data/massey.csv")
head(massey)
```

## Distribution of a sample proportion

We start by assessing what the true proportion of females is in the population. In most situations we won't know this - instead we'll be trying to estimate it from a sample. We're working backwards today so we can get a feel for how what we see in a sample relates to the truth in the population.

```{r}
ggplot(data=massey) + geom_bar(mapping=aes(x=Sex))
proportion(massey$Sex, "female")
```

Next, let's take a sample and work out the proportion of females in it to see how it compares to the population. We'll start with a sample of size 20. Try running this code block a few times - you'll get a different sample each time (and so a different plot and proportion).

```{r}
# one random sample of 20 students
samp = take_samples(massey, 20)
ggplot(data=samp) + geom_bar(mapping=aes(x=Sex))
proportion(samp$Sex, "female")
```

Let's automate the process of repetition. We'll take 20 samples, each of size 20, and see what we get for the sample proportions across those 20 samples. You should notice that the 20 samples differ a little from one another, and that the proportions of females in the samples are, on average, close to the population proportion.

```{r}
# 20 samples, each of size 20
samps = take_samples(massey, 20, 20)
ggplot(data=samps) +
  geom_bar(mapping=aes(x=Sex)) +
  facet_wrap(vars(Sample))
props = sample_summary(samps, Sex, proportion, "female")
summary(props)
```

Going a step further, let's take 1000 samples to get a better feel for the distribution of the sample proportions (our 'estimates') and how this compares to the population proportion (the 'truth').

```{r}
# 1000 samples, each of size 20
samp20 = take_samples(massey, 20, 1000)
props20 = sample_summary(samp20, Sex, proportion, "female")
summary(props20)
ggplot(data=props20) + geom_histogram(mapping=aes(x=proportion), binwidth=0.05)
```

*What do you notice about the distribution of the proportion of females you get across the 1000 samples?*

Finally, we'll repeat the process with some larger sample sizes, to help figure out how sample size influences the distribution.

```{r}
# 1000 samples at each of three sample sizes
samp20 = take_samples(massey, 20, 1000)
samp80 = take_samples(massey, 80, 1000)
samp320 = take_samples(massey, 320, 1000)
props20 = sample_summary(samp20, Sex, proportion, "female")
props80 = sample_summary(samp80, Sex, proportion, "female")
props320 = sample_summary(samp320, Sex, proportion, "female")
# stack the three sets of proportions so they can be plotted side by side
props_all = rbind(props20, props80, props320)
ggplot(data=props_all) +
  geom_histogram(mapping=aes(x=proportion), binwidth=0.05) +
  facet_wrap(vars(Size))
```

*What do you notice about the distribution of the sample proportion?
How does that distribution change when the sample size increases?*

## Distribution of a sample mean

We start by looking at the distribution of ages in the population, and compute the mean age in the population.

```{r}
mean(massey$Age)
ggplot(data=massey) + geom_histogram(mapping=aes(x=Age), bins=30)
```

We next take 20 samples, each of size 20, and see how the ages in each are distributed. We also compute their sample means and assess how the sample means vary.

```{r}
# 20 samples, each of size 20
samps = take_samples(massey, 20, 20)
ggplot(data=samps) +
  geom_histogram(mapping=aes(x=Age), bins=8) +
  facet_wrap(vars(Sample))
means = sample_summary(samps, Age, mean)
summary(means)
```

You should notice that the mean ages in the samples are, on average, close to the population mean age.

Lastly, we take 1000 samples at each of the sizes 20, 80 and 320. We then compute the sample means and assess how they vary for each of these sample sizes.

```{r}
# 1000 samples at each of three sample sizes
samp20 = take_samples(massey, 20, 1000)
samp80 = take_samples(massey, 80, 1000)
samp320 = take_samples(massey, 320, 1000)
means20 = sample_summary(samp20, Age, mean)
means80 = sample_summary(samp80, Age, mean)
means320 = sample_summary(samp320, Age, mean)
# stack the three sets of means so they can be plotted side by side
means_all = rbind(means20, means80, means320)
ggplot(data=means_all) +
  geom_histogram(mapping=aes(x=mean), bins=20) +
  facet_wrap(vars(Size))
```

*What do you notice about the distribution of the sample mean? How does that distribution change when the sample size increases?*

What is your conclusion from all this? What can you say about how you'd expect sample means or sample proportions to be distributed when repeatedly sampling from a population? How might this help you if you have just one sample mean or proportion, say from a sample of size 80? Could it be used to give a measure of precision, maybe?

Discuss this with those around you, and write some notes below.
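
If you'd like a starting point for the precision question, one thing to check is whether the spread of the sample means shrinks roughly like the population standard deviation divided by the square root of the sample size. The chunk below is only a sketch under stated assumptions: it uses a simulated population of ages rather than the Massey data, base R rather than the helpers from `sampling.R`, and the names `population` and `sd_of_sample_means` are made up for this illustration.

```{r}
# Hypothetical population: 10,000 simulated, right-skewed ages.
# This is NOT the Massey data, just a stand-in for illustration.
set.seed(1)
population = 18 + rgamma(10000, shape=2, scale=4)

# Standard deviation of the means of `reps` random samples of size `n`
# (hypothetical helper, not part of sampling.R).
sd_of_sample_means = function(pop, n, reps=1000) {
  sd(replicate(reps, mean(sample(pop, n))))
}

sizes = c(20, 80, 320)
# spread of the sample means actually observed at each sample size
observed = sapply(sizes, function(n) sd_of_sample_means(population, n))
# spread predicted by (population sd) / sqrt(sample size)
predicted = sd(population) / sqrt(sizes)

data.frame(Size=sizes, observed=observed, predicted=predicted)
```

If the two columns roughly agree, then quadrupling the sample size about halves the spread of the sample means, which suggests how a single sample of size 80 might come with a rough measure of its own precision.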