---
title: "The sampling distribution"
author: "Joanne Blogs"
output:
  html_document:
    df_print: paged
---

## Introduction

In this lab we'll look at the population of all Massey students from last year and see how various statistics computed on a random sample change from sample to sample.

The idea is that normally **we only have one sample**, a subset of the population. Our goal, though, is **to say something about the population**, such as estimating a population proportion or population mean. By knowing how these quantities vary from sample to sample, we can get an idea of how to work back from a single sample to the population.

Our population data contains the following variables for each student at Massey University.

Variable Description
-------- -----------
GPA      Grade point average (between 0 and 9 inclusive)
Age      Age (years)
Sex      Sex (female/male)

We'll be taking samples of various sizes and seeing how the summary statistic from these samples varies, and how it compares to the corresponding population parameter.

You won't have to write any code here - just run the code blocks and add comments about what you see.

We start by loading some needed packages/functions and then reading the population data from the web and looking at the first few rows:

```{r}
# Load the plotting package and the helper functions used in this lab
library(ggplot2)
source("https://www.massey.ac.nz/~jcmarsha/sampling.R")

# Read the population data and look at the first few rows
massey = read.csv("https://www.massey.ac.nz/~jcmarsha/193301/data/massey.csv")
head(massey)
```

## Distribution of a sample proportion

We start by assessing what the true proportion of females is in the population. In most situations we won't know this - instead we'll be trying to estimate it from a sample. We're kind of working backwards today so we can get a feel for how what we see in a sample relates to the truth in the population.

```{r}
# Bar chart of Sex in the population, and the population proportion of females
ggplot(data=massey) +
  geom_bar(mapping=aes(x=Sex))
proportion(massey$Sex, "female")
```

Next, let's take a single sample of size 20 and work out the proportion of females there to see how it compares to the population.

```{r}
# One sample of size 20
samp = take_samples(massey, 20)
ggplot(data=samp) +
  geom_bar(mapping=aes(x=Sex))
proportion(samp$Sex, "female")
```

Let's repeat this process - take 20 samples, each of size 20, and see what we get for the sample proportions across those 20 samples.

```{r}
# 20 samples, each of size 20, with one bar chart per sample
samps = take_samples(massey, 20, 20)
ggplot(data=samps) +
  geom_bar(mapping=aes(x=Sex)) +
  facet_wrap(vars(Sample))
props = sample_summary(samps, Sex, proportion, "female")
summary(props)
```

Going a step further, let's take 1000 samples to get a better feel for the distribution that we get in the sample proportions (our 'estimates') and how this compares to the population proportion (the 'truth').

```{r}
# 1000 samples, each of size 20
samp20 = take_samples(massey, 20, 1000)
props20 = sample_summary(samp20, Sex, proportion, "female")
summary(props20)
ggplot(data=props20) +
  geom_histogram(mapping=aes(x=proportion), binwidth=0.05)
```

Finally, we'll repeat the process with some larger sample sizes, to see how sample size influences the spread of the sample proportions.

```{r}
# 1000 samples at each of three sample sizes
samp20 = take_samples(massey, 20, 1000)
samp80 = take_samples(massey, 80, 1000)
samp320 = take_samples(massey, 320, 1000)
props20 = sample_summary(samp20, Sex, proportion, "female")
props80 = sample_summary(samp80, Sex, proportion, "female")
props320 = sample_summary(samp320, Sex, proportion, "female")

# Combine the results and draw one histogram per sample size
props_all = rbind(props20, props80, props320)
ggplot(data=props_all) +
  geom_histogram(mapping=aes(x=proportion), binwidth=0.05) +
  facet_wrap(vars(Size))
```

## Distribution of a sample mean

We start by looking at the distribution of ages in the population, and compute the mean age in the population.

```{r}
mean(massey$Age)
ggplot(data=massey) +
  geom_histogram(mapping=aes(x=Age), bins=30)
```

We next take 20 samples each of size 20, and see how they are distributed.
We also compute their sample means and assess how the sample means vary.

```{r}
samps = take_samples(massey, 20, 20)
ggplot(data=samps) +
  geom_histogram(mapping=aes(x=Age), bins=8) +
  facet_wrap(vars(Sample))
means = sample_summary(samps, Age, mean)
summary(means)
```

Lastly, we take 1000 samples each of size 20, 80 and 320. We then compute the sample means and assess how the sample means vary for each of these sample sizes.

```{r}
samp20 = take_samples(massey, 20, 1000)
samp80 = take_samples(massey, 80, 1000)
samp320 = take_samples(massey, 320, 1000)
means20 = sample_summary(samp20, Age, mean)
means80 = sample_summary(samp80, Age, mean)
means320 = sample_summary(samp320, Age, mean)
means_all = rbind(means20, means80, means320)
ggplot(data=means_all) +
  geom_histogram(mapping=aes(x=mean), bins=20) +
  facet_wrap(vars(Size))
```
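
As an optional extra, here is a minimal sketch of the pattern the histograms above should show: the spread of a sample statistic shrinks as the sample size grows. It is only an illustration under stated assumptions. It uses a made-up population (not the Massey data) and base R's `sample()` rather than `take_samples()`, and the names `population`, `simulate_sd` and `comparison` are hypothetical. It compares the simulated standard deviation of the sample proportion with the usual approximation $\sqrt{p(1-p)/n}$.

```{r}
# Illustrative sketch only: a made-up population, not the Massey data.
set.seed(2024)
population = rbinom(20000, size=1, prob=0.6)  # 1 = "female", 0 = "male" (hypothetical)
p = mean(population)                          # the 'truth' in this made-up population

# Standard deviation of the sample proportion across many samples of size n
simulate_sd = function(n, num_samples=1000) {
  props = replicate(num_samples, mean(sample(population, n)))
  sd(props)
}

sizes = c(20, 80, 320)
comparison = data.frame(
  Size = sizes,
  Simulated = sapply(sizes, simulate_sd),
  Theory = sqrt(p * (1 - p) / sizes)
)
comparison
```

Each sample size here is four times the previous one, so under this approximation the spread should roughly halve at each step. You could check whether the histograms for sizes 20, 80 and 320 above look consistent with that.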