193.301 Biostatistics Lab 4

Introduction

In this lab we’ll look at the population of all Massey students from last year and see how various statistics computed on a random sample change from sample to sample. The idea is that normally we only have one sample, a subset of the population. Our goal is to say something about the population though, such as to estimate a population proportion or population mean. By knowing how these quantities vary from sample to sample, we can get an idea of how to work back from a single sample to the population.

Start by downloading the R Markdown file for today from the link below (right click -> Save file as):

https://www.massey.ac.nz/~jcmarsha/193301/lab04_sampling.Rmd

Load up RStudio, and then use File->Open File… to load the markdown file in. We’ll be writing directly into this document like we did last time.

Our population is all students at Massey University, and for each student we have the following variables.

Variable	Description
GPA	Grade point average (between 0 and 9 inclusive)
Age	Age (years)
Sex	Sex (female/male)

We’ll be taking samples of various sizes and seeing how the summary statistic from these samples varies, and how it compares to the corresponding population parameter. You won’t have to write any code here - just run the code blocks and add comments about what you see.

Start by running the first code block, which should look mostly familiar to you. It loads the ggplot2 library for plots, reads in the data, and shows a summary. It also uses the source function to pull in some helper functions that make taking samples from a population a bit easier. Once that line has run, you’ll notice some Functions in your Environment (like proportion, sample_summary and take_samples).

Distribution of the sample proportion

The next code block produces a plot and works out the proportion of female students at Massey. Note that because there is more than one output, the notebook will display them both inline, and you can choose between them. I’ve used the proportion function here (one of the custom functions loaded in the source command above)¹.
The next code block takes a sample of 20 Massey students, produces a graph, and then works out the proportion of females in the sample. Try running this a few times. Notice that your sample is different each time, and thus the sample proportion may also be different.
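If you’re curious what taking a sample looks like without the helper functions, here is a minimal base R sketch. It does not use the lab’s data: the population below is a made-up stand-in (20,000 students, roughly 60% female), and the names population, one_sample, pop_prop and sample_prop are just for illustration.

```r
# Hypothetical stand-in for the lab's population (NOT the real Massey data):
# 20,000 students, each female with probability 0.6.
set.seed(2024)
population <- data.frame(
  Sex = sample(c("female", "male"), size = 20000, replace = TRUE,
               prob = c(0.6, 0.4))
)

# Population proportion of females (the parameter we'd like to estimate)
pop_prop <- mean(population$Sex == "female")

# Draw one random sample of 20 students, without replacement
rows <- sample(nrow(population), size = 20)
one_sample <- population[rows, , drop = FALSE]

# Sample proportion of females (the statistic)
sample_prop <- mean(one_sample$Sex == "female")

pop_prop
sample_prop
```

Remove the set.seed line and re-run the last few lines to see the sample proportion change from sample to sample while the population proportion stays fixed.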
We then take 20 samples and plot and summarise them. Notice the use of facet_wrap to split the data up by Sample to produce small multiples. This is a useful way of seeing how a relationship or distribution changes between groups. Also notice the range of proportions that we’re getting across our 20 samples. You should notice that the mean of the proportions from the 20 samples is reasonably close to the population proportion we calculated above. Add a sentence or two below this code block about what you observe.
The next code block takes this further - we create 1000 samples each of size 20, and compute the proportion of females in each of them. These are then summarised numerically, and then graphically with a histogram. You should notice that the mean of the proportions is fairly close to the population proportion, and that the distribution of sample proportions is fairly symmetric about that mean, with a spread from around 0.25 to 0.75 or so. We’ve used binwidth here instead of bins because the proportions can only ever be out of 20, so the smallest step between them is 0.05. This ensures each bar represents a unique proportion. If you want to, try using bins instead, and vary the number of bins from, say, 8 to 12. You should find the shape of the histogram changes quite a bit, as the proportions in our samples are often on the edge of the bins.
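The same idea can be sketched in base R, again on a made-up population rather than the lab’s data (the vector sex and the 60% figure are assumptions for illustration only). The table at the end plays the role of the binwidth = 0.05 histogram: each row is one of the possible proportions out of 20.

```r
# Hypothetical population of 20,000 students (NOT the real data)
set.seed(1)
sex <- sample(c("female", "male"), size = 20000, replace = TRUE,
              prob = c(0.6, 0.4))

# 1000 samples of size 20, keeping only the proportion of females in each
props <- replicate(1000, mean(sample(sex, size = 20) == "female"))

mean(props)            # compare with the population proportion below
mean(sex == "female")
range(props)

# Proportions can only be 0/20, 1/20, ..., 20/20, i.e. steps of 0.05,
# so counting each distinct value is like a histogram with binwidth 0.05
table(props)
```

Because every value is a multiple of 0.05, a bin edge that happens to land on one of those multiples decides which bar that whole stack of samples falls into, which is why changing the number of bins can change the shape so much.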
We then repeat the process from above, but this time take 1000 samples of size 20, 80, and 320 respectively. We then compute all the summaries, and bind them up into a single data frame to plot. What do you notice about the center, shape and spread of the distribution of sample proportions as the sample size increases? Write some comments about this in your notebook under the code block.
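To check your observations numerically, here is a rough base R sketch of the same comparison. The population is hypothetical (not the lab’s data), and sizes and summaries are illustrative names; the lab’s own code uses its helper functions and ggplot2 instead.

```r
# Hypothetical population of 20,000 students (NOT the real data)
set.seed(42)
sex <- sample(c("female", "male"), size = 20000, replace = TRUE,
              prob = c(0.6, 0.4))

sizes <- c(20, 80, 320)

# For each sample size: take 1000 samples, compute the proportion of
# females in each, then summarise the centre and spread of those proportions
summaries <- do.call(rbind, lapply(sizes, function(n) {
  props <- replicate(1000, mean(sample(sex, size = n) == "female"))
  data.frame(size = n, mean = mean(props), sd = sd(props))
}))

mean(sex == "female")  # population proportion, for reference
summaries
```

Look at how the sd column changes each time the sample size is multiplied by four, and compare that with what you saw in the plots.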
The trick of statistics is how to convert what we’ve learnt about how the sample proportions are distributed into something we can use when we have only one sample. It goes something like this:
- we know when we take lots of samples of size 20, the sample proportions vary from about 0.3 to 0.7 or so, or about 0.2 below and above a central point.
- so the uncertainty from the sampling process is about +/- 0.2.
- for any single sample, we just work out the proportion of females, and then from that go +/- 0.2 to give an uncertainty interval (also known as a confidence interval).
It turns out that we can figure out the 0.2 value above using some maths rather than having to take lots and lots of samples. So we can just take one sample but still give a range for the uncertainty.
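The lab doesn’t spell out the maths at this point, but one standard approach is the normal approximation: the standard error of a sample proportion is sqrt(p(1 - p)/n), and going about 2 standard errors either side gives a rough 95% interval. A minimal sketch, assuming a hypothetical sample of size 20 in which half the students are female:

```r
# Hypothetical single sample: 10 females out of 20 (illustrative numbers)
p_hat <- 0.5
n <- 20

# Standard error of the sample proportion (normal approximation)
se <- sqrt(p_hat * (1 - p_hat) / n)

# Roughly 2 standard errors either side of the estimate
margin <- 2 * se
margin                                    # about 0.22, close to the 0.2 above

c(lower = p_hat - margin, upper = p_hat + margin)
```

The approximation is fairly rough for samples as small as 20, so treat this as a sketch of where the 0.2 comes from rather than a recommended method.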

Distribution of the sample mean

The next code block shows a histogram of the age of students at Massey, and computes the population mean age. Run this and write a comment about the distribution you see under the code block.
We then take 20 samples of size 20 and show the distribution of those samples, along with computing the sample mean age for each sample. You should notice that the mean of the sample means is approximately the same as the population mean we found. Run this a few times. What do you notice about the distribution of the sample compared to the distribution of the population?
We then take 1000 samples of size 20, 80, and 320 respectively, compute the mean age in each of the 1000 samples, and plot the distribution of the sample mean ages. What do you notice about the distribution? Is it always centered on the population mean age?
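Here is a rough base R sketch of the same experiment for means. The ages are simulated from a made-up distribution (17 plus a gamma-distributed amount), chosen purely for illustration and not meant to match the real Massey data. The theory column uses the usual standard error formula, population standard deviation divided by sqrt(n), for comparison.

```r
# Hypothetical ages for 20,000 students (NOT the real data)
set.seed(7)
age <- 17 + rgamma(20000, shape = 2, scale = 4)

sizes <- c(20, 80, 320)

# For each sample size: 1000 samples, the mean age of each, then a summary
mean_summaries <- do.call(rbind, lapply(sizes, function(n) {
  means <- replicate(1000, mean(sample(age, size = n)))
  data.frame(size = n,
             mean_of_means = mean(means),
             sd_of_means = sd(means),
             theory = sd(age) / sqrt(n))   # standard error formula
}))

mean(age)  # population mean age, for reference
mean_summaries
```

Try hist(replicate(1000, mean(sample(age, size = 80)))) next to hist(age) to compare the shape of the sample means with the shape of the population.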
What is your conclusion from all this? What can you say about how you’d expect sample means to be distributed when repeatedly sampling from a population? How might this help you if you have just one sample, say of size 80? Could it be used to give a measure of precision maybe? Discuss this with those around you, and write some notes into the notebook.
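Once you’ve had a go at the discussion, one possible way to turn a single sample of size 80 into a measure of precision is sketched below. It assumes the usual standard error of the mean (sample standard deviation divided by sqrt(n)) and a rough "estimate plus or minus 2 standard errors" interval; the ages are simulated from a made-up distribution, not the lab’s data.

```r
# Hypothetical ages for 20,000 students (NOT the real data)
set.seed(99)
age <- 17 + rgamma(20000, shape = 2, scale = 4)

# Pretend this is the only sample we have
one_sample <- sample(age, size = 80)

x_bar <- mean(one_sample)                          # sample mean
se <- sd(one_sample) / sqrt(length(one_sample))    # estimated standard error

# Rough uncertainty interval: about 2 standard errors either side
c(estimate = x_bar, lower = x_bar - 2 * se, upper = x_bar + 2 * se)

mean(age)  # the population mean, which we normally wouldn't know
```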
Finally, you might want to knit your notebook so you have everything in one spot.

¹ If you want to have a look at the code for the proportion function, just click on it in the Environment. You’ll notice it uses the sum command to add up the number of rows that contain the value specified. Don’t worry if this looks complicated - the goal for this exercise is to learn about the sampling process, not the code so much!
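As a rough idea of how a helper like that could work, here is a tiny sketch. This is not the lab’s actual proportion function (click on it in the Environment to see the real one); sketch_proportion is a hypothetical name used only here.

```r
# Hypothetical helper: the share of entries in x that equal `value`
sketch_proportion <- function(x, value) {
  # x == value gives TRUE/FALSE for each entry; sum() counts the TRUEs
  sum(x == value) / length(x)
}

sketch_proportion(c("female", "male", "female", "female"), "female")  # 0.75
```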