"This unit applies probabilistic reasoning to sampling, introducing students to sampling distributions of statistics they will use when performing inference in Units 6 and 7. Students should understand that sample statistics can be used to estimate corresponding population parameters and that measures of center (mean) and variability (standard deviation) for these sampling distributions can be determined directly from the population parameters when certain sampling criteria are met. For large enough samples from any population, these sampling distributions can be approximated by a normal distribution. Simulating sampling distributions helps students to understand how the values of statistics vary in repeated random sampling from populations with known parameters." -- College Board, AP Statistics course description
A sampling distribution is the distribution of a statistic (such as a sample mean or sample proportion) where we take ALL possible samples of a given size from a population and put those sample statistics together as a data set.
For example, let's say we are looking at the average number of snap peas taken from a field. If we take all possible samples of size 30, average each sample, and then average those averages together, we would get a REALLY good picture of what the population parameter was (though taking every possible sample is likely unrealistic in practice). Sampling distributions are important because they lead the way to statistical inference: the act of making a prediction or testing a claim regarding a population parameter.
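To make the "all possible samples" idea concrete, here is a minimal sketch using a made-up population of six values (the numbers and the sample size of 2 are assumptions chosen only to keep the list of samples short). It builds every possible sample, finds each sample's mean, and compares the average of those means to the population mean.

```python
from itertools import combinations
from statistics import mean

# Hypothetical tiny population: snap peas counted on six plants (made-up values).
population = [4, 7, 8, 10, 12, 13]
sample_size = 2  # kept small so every possible sample can be listed

# Every possible sample of the given size (sampling without replacement).
all_samples = list(combinations(population, sample_size))

# The sampling distribution of the sample mean: one mean per possible sample.
sample_means = [mean(sample) for sample in all_samples]

population_mean = mean(population)
mean_of_sample_means = mean(sample_means)

print("number of possible samples:", len(all_samples))
print("population mean:", population_mean)
print("mean of all sample means:", mean_of_sample_means)
```

With a real field and samples of size 30, the number of possible samples is far too large to list, which is why simulation and formulas are used instead.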
The first type of sampling distribution you will encounter is a sampling distribution for proportions used to estimate a population proportion.
For a sampling distribution for proportions, we take the sample proportion from all possible samples of our given size and average those together to find the mean of our sampling distribution, which equals the population proportion p. Our standard deviation is found using a formula given on the reference page: the square root of p(1-p)/n, where n is the sample size. Once you have those two things, you have the crux of a sampling distribution for a population proportion.
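As a quick illustration of the center and spread for proportions, the sketch below uses the standard formulas (mean equal to p, standard deviation equal to the square root of p(1-p)/n). The function name and the example values p = 0.5 and n = 100 are assumptions made up for this example.

```python
import math

def sample_proportion_center_and_spread(p, n):
    """Return (mean, standard deviation) of the sampling distribution of p-hat.

    p is the population proportion and n is the sample size.
    """
    if not 0 <= p <= 1:
        raise ValueError("p must be between 0 and 1")
    if n <= 0:
        raise ValueError("n must be positive")
    mean = p                          # center: the population proportion
    sd = math.sqrt(p * (1 - p) / n)   # spread: sqrt(p(1-p)/n)
    return mean, sd

# Hypothetical example: population proportion 0.5, samples of size 100.
mean, sd = sample_proportion_center_and_spread(0.5, 100)
print(mean, sd)
```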
As we get into statistical inference, you'll find that sampling distributions hinge on certain conditions that make our sampling distributions an accurate portrayal of our population proportion.
The first and possibly most important condition necessary for creating a sampling distribution is that our sample is randomly selected. If our sample is not randomly selected, then all the math and calculations we do are for naught because our point estimate, or sample statistic, is biased. 😱
In order for the standard deviation formula to be accurate, the observations in our sample have to be independent of one another. Since we are usually sampling without replacement, this is technically impossible. However, by checking the 10% condition, we can determine that the amount of dependence is so negligible that our observations are essentially independent.
In order to check this condition, you need to make sure that the population is at least 10 times our sample size! ✅
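The 10% check is simple enough to write as a one-line comparison. The sketch below is only an illustration; the function name and the population and sample sizes are hypothetical.

```python
def meets_ten_percent_condition(population_size, sample_size):
    """Return True when the population is at least 10 times the sample size."""
    if population_size <= 0 or sample_size <= 0:
        raise ValueError("sizes must be positive")
    return population_size >= 10 * sample_size

# Hypothetical: 30 plants sampled from a field of 2,000 plants.
field_ok = meets_ten_percent_condition(population_size=2000, sample_size=30)

# Hypothetical: 30 students sampled from a class of 120 students.
class_ok = meets_ten_percent_condition(population_size=120, sample_size=30)

print(field_ok, class_ok)
```

In the second case the sample is 25% of the population, so the standard deviation formula would overstate the true spread.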
In order to eventually calculate the probability of obtaining certain samples using a sampling distribution, we need to verify that our sampling distribution is approximately normal.
For categorical data (proportions), we need to check the large counts condition, which states that the expected number of successes and the expected number of failures are each at least 10. In other words, np is greater than or equal to 10 and n(1-p) is greater than or equal to 10.
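Once the large counts condition holds, the normal approximation can be used to find the probability of getting a particular range of sample proportions. The sketch below checks the condition and then uses Python's `statistics.NormalDist`; the function names and the values p = 0.4, n = 100, and a cutoff of 0.5 are assumptions for illustration.

```python
import math
from statistics import NormalDist

def large_counts_met(p, n):
    """Check np >= 10 and n(1-p) >= 10."""
    return n * p >= 10 and n * (1 - p) >= 10

def prob_sample_proportion_at_least(p, n, threshold):
    """Approximate P(p-hat >= threshold) with a normal model.

    Assumes the random and 10% conditions have already been checked.
    """
    if not large_counts_met(p, n):
        raise ValueError("large counts condition not met; normal model not justified")
    sd = math.sqrt(p * (1 - p) / n)
    return 1 - NormalDist(mu=p, sigma=sd).cdf(threshold)

# Hypothetical: true proportion 0.4, samples of size 100,
# probability that a sample proportion comes out 0.5 or higher.
prob = prob_sample_proportion_at_least(p=0.4, n=100, threshold=0.5)
print(round(prob, 4))
```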
When dealing with means, our center is the average of all of our sample means from all possible samples of size n. In other words, it's the average of the averages. Our standard deviation is found by dividing our population standard deviation by the square root of our sample size. As our sample size increases, our standard deviation decreases, which plays a huge part in why a large sample size is vital in accurately estimating our population mean. 🤓
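To see how the spread shrinks as the sample size grows, here is a small sketch of the formula described above (population standard deviation divided by the square root of n). The population standard deviation of 10 and the sample sizes are made-up values.

```python
import math

def sample_mean_sd(population_sd, n):
    """Standard deviation of the sampling distribution of the sample mean."""
    if n <= 0:
        raise ValueError("n must be positive")
    return population_sd / math.sqrt(n)

# Hypothetical population standard deviation of 10, three sample sizes.
sds = {n: sample_mean_sd(10, n) for n in (25, 100, 400)}

for n, sd in sds.items():
    print(f"n = {n:>3}: standard deviation of sample means = {sd}")
```

Notice that quadrupling the sample size only cuts the standard deviation in half, because n sits under a square root.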
As you will find as we get into statistical inference, sampling distributions hinge on certain conditions that make our sampling distributions an accurate portrayal of our population mean.
Just as with estimating population proportions, it is essential that our sampling distribution is based on random samples. No mathematics or fancy statistics can "fix" a biased sample. 😕
Again, we must check the 10% condition the same way as we do for population proportions: the population must be at least 10 times our sample size.
Our check to be sure that our sampling distribution is approximately normal is different from our condition for population proportions. In order to make sure the sampling distribution for our mean is approximately normal, we must verify one of two things: either that our population is normally distributed or that our sample size is at least 30. The second option is justified by the central limit theorem, which says that for large enough samples the sampling distribution of the sample mean is approximately normal no matter what shape the population has.
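A simulation can show the central limit theorem at work. The sketch below draws repeated samples of size 30 from a strongly right-skewed population and checks that the sample means still behave like a normal distribution. The choice of an exponential population (mean 1, standard deviation 1), the seed, and the number of repetitions are all assumptions for illustration.

```python
import math
import random
from statistics import fmean, stdev

random.seed(0)  # fixed seed so the sketch is repeatable

SAMPLE_SIZE = 30
NUM_SAMPLES = 5000

# Assumed population: exponential with rate 1 (right-skewed),
# so the population mean and standard deviation are both 1.
POP_MEAN = 1.0
POP_SD = 1.0

# Take many random samples and record each sample's mean.
sample_means = [
    sum(random.expovariate(1.0) for _ in range(SAMPLE_SIZE)) / SAMPLE_SIZE
    for _ in range(NUM_SAMPLES)
]

simulated_center = fmean(sample_means)
simulated_spread = stdev(sample_means)
theoretical_spread = POP_SD / math.sqrt(SAMPLE_SIZE)

# For a normal distribution, about 95% of values fall within 2 SDs of the mean.
within_two_sd = sum(
    abs(m - POP_MEAN) <= 2 * theoretical_spread for m in sample_means
) / NUM_SAMPLES

print("center of simulated sample means:", round(simulated_center, 3))
print("spread of simulated sample means:", round(simulated_spread, 3))
print("spread predicted by the formula: ", round(theoretical_spread, 3))
print("share within 2 SDs of the mean:  ", round(within_two_sd, 3))
```

This simulates only some of the possible samples rather than all of them, so the results are close to, but not exactly equal to, the theoretical values.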
The last type of sampling distribution we encounter is when we are seeing if there is a difference between two populations. In this type of sampling distribution, our center is the difference between our two population parameters (which is 0 if the two populations are not different). The necessary formulas for the center and spread of these sampling distributions can be found on the reference page. This plays a huge part in statistical inference when checking if two populations are in fact different, which is essential in experimental studies.
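For reference, the center and spread for a difference can be sketched as below, using the standard formulas for independent samples: the centers subtract, while the variances add. The function names and the example numbers are hypothetical.

```python
import math

def diff_proportions_center_and_spread(p1, n1, p2, n2):
    """Center and spread of the sampling distribution of p-hat1 - p-hat2."""
    center = p1 - p2
    spread = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return center, spread

def diff_means_center_and_spread(mu1, sd1, n1, mu2, sd2, n2):
    """Center and spread of the sampling distribution of x-bar1 - x-bar2."""
    center = mu1 - mu2
    spread = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    return center, spread

# Hypothetical: two populations with the same proportion, so the center is 0.
print(diff_proportions_center_and_spread(0.5, 100, 0.5, 100))

# Hypothetical: two populations with different means and standard deviations.
print(diff_means_center_and_spread(20, 3, 9, 18, 4, 16))
```

These formulas assume the two samples are independent of each other, which is part of what the condition checks are meant to confirm.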
In order to check the conditions for inference when there are two samples, you are basically doing the same checks as above but doing them twice: checking randomness, independence, and normality for both samples. 🏡