[A, SfS] Chapter 5: Confidence Intervals: 5.2: CI for a mean
Confidence Interval for the Population Mean of a Quantitative Variable
In this lesson, you will learn how to compute a confidence interval for the population mean of a variable.
In section 3.2 we defined $z_\alpha$ as the quantile of the standard normal distribution which has probability $\alpha$ to its left, such that $P(Z \le z_\alpha) = \alpha$. This is the notation used in probability. But in statistics, it will be more convenient to use this notation for the quantile which has probability $\alpha$ to its right, such that $P(Z \ge z_\alpha) = \alpha$.
So from here on in this course, $z_\alpha$ denotes the quantile of the standard normal distribution which has probability $\alpha$ to its right, in the right tail of the standard normal distribution. Then, because the standard normal distribution is symmetric about zero, $-z_\alpha$ denotes the quantile of the standard normal distribution which has probability $\alpha$ to its left, in the left tail of the standard normal distribution. So if $Z \sim N(0, 1)$ then $P(Z > z_\alpha) = \alpha$ and $P(Z < -z_\alpha) = \alpha$. (In R, this distinction is made by adding the setting lower.tail=TRUE or lower.tail=FALSE in the qnorm command; R also accepts abbreviations such as low=F.)
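As a quick check of the two tail conventions, the following R sketch evaluates both quantiles for an illustrative value of alpha. The value 0.05 is an assumption chosen only for demonstration.

```r
# Illustrative tail probability (an assumption for this demo, not from the text)
alpha <- 0.05

# Quantile with probability alpha to its RIGHT: z_alpha in this course's notation
z_right <- qnorm(alpha, lower.tail = FALSE)

# Quantile with probability alpha to its LEFT: by symmetry this equals -z_alpha
z_left <- qnorm(alpha, lower.tail = TRUE)

# The two quantiles mirror each other about zero
c(z_right = z_right, z_left = z_left)

# Verify the tail areas directly
pnorm(z_right, lower.tail = FALSE)  # area to the right of z_alpha
pnorm(z_left)                       # area to the left of -z_alpha
```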
Suppose $X$ is a quantitative variable which, if $X$ could be measured on every element of some population of interest, would have a mean of $\mu$. Because $\mu$ represents the value of $X$ for a “typical” element of the population, we would like to have a precise estimate of its value, based on a preferred level of confidence.
So we plan to select a random sample of size $n$, and measure $X$ on each element of the sample, to obtain $x_1, x_2, \ldots, x_n$. From these measurements we will compute the sample mean $\bar{x}$ and the sample standard deviation $s$.
There are several options for the method of computing a confidence interval for $\mu$, based on the answers to three questions. These questions are:
- Can we assume that $X$ has a normal distribution on the population?
- Do we know the value of the population variance $\sigma^2$ of $X$?
- Can we consider the sample size $n$ to be large?
We first consider the situation in which the answer to questions 1 and 2 is “Yes”. If so, the answer to question 3 is irrelevant.
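We assume $X \sim N(\mu, \sigma^2)$ and the value of $\sigma^2$ is known. Then
$$\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right) \quad \text{and so} \quad \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1).$$
This means that:
$$P\left(-z_{\alpha/2} < \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} < z_{\alpha/2}\right) = 1 - \alpha.$$
If we rearrange this inequality with some algebraic manipulation, we have:
$$P\left(\bar{X} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}} < \mu < \bar{X} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right) = 1 - \alpha.$$
Hence, before the sample is drawn, the interval $\bar{X} \pm z_{\alpha/2}\,\sigma/\sqrt{n}$ has probability $1 - \alpha$ of containing $\mu$.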
Confidence Interval for a Population Mean
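Suppose $X \sim N(\mu, \sigma^2)$ and the value of the population variance $\sigma^2$ is known.
Then the formula for computing a $(1 - \alpha)100\%$ confidence interval for the population mean $\mu$, based on a random sample of size $n$ for which the sample mean is $\bar{x}$, is:
$$\left(\bar{x} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}},\ \bar{x} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right)$$
In this setting, $E = z_{\alpha/2}\,\sigma/\sqrt{n}$ is the margin of error of the CI, and the CI can be reported as:
$$\bar{x} \pm E$$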
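The formula in this box can be sketched in R as a small helper. The function name z_ci, its argument names, and the input values below are hypothetical and made up purely for illustration; they do not come from the lesson.

```r
# Hypothetical helper (name and arguments are illustrative, not from the text):
# z-based CI for a mean when the population standard deviation sigma is known.
z_ci <- function(xbar, sigma, n, conf_level = 0.95) {
  alpha <- 1 - conf_level
  z <- qnorm(alpha / 2, lower.tail = FALSE)  # z_{alpha/2}
  E <- z * sigma / sqrt(n)                   # margin of error
  c(lower = xbar - E, upper = xbar + E, margin = E)
}

# Made-up inputs purely for demonstration
res <- z_ci(xbar = 50, sigma = 4, n = 25, conf_level = 0.95)
res
```

Returning the margin alongside the bounds makes it easy to report the interval in either form, as a pair of endpoints or as the sample mean plus or minus the margin of error.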
Controlling the Margin of Error
As you can plainly see, increasing $n$ will decrease the margin of error.
If you are planning a study and you prefer a CI for $\mu$ to have a margin of error no larger than some $E_0$ while maintaining the same confidence level $(1 - \alpha)100\%$, you must determine the minimum sample size necessary to achieve this goal.
That is, if you want $z_{\alpha/2}\,\sigma/\sqrt{n} \le E_0$ then you need:
$$n \ge \left(\frac{z_{\alpha/2}\,\sigma}{E_0}\right)^2$$
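A minimal R sketch of this sample size calculation follows. The function name min_sample_size, the argument E0 (the target margin of error), and the input values are assumptions made for illustration only.

```r
# Hypothetical helper: smallest n for which z_{alpha/2} * sigma / sqrt(n) <= E0,
# where E0 is the largest margin of error we are willing to accept.
min_sample_size <- function(sigma, E0, conf_level = 0.95) {
  z <- qnorm((1 - conf_level) / 2, lower.tail = FALSE)  # z_{alpha/2}
  ceiling((z * sigma / E0)^2)  # round up, since n must be a whole number
}

# Made-up inputs purely for demonstration
n_needed <- min_sample_size(sigma = 4, E0 = 1, conf_level = 0.95)
n_needed
```

Rounding up rather than to the nearest integer matters here: rounding down would give a margin of error slightly larger than the target.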
Consider the population of all steel wires of a certain type. Let $X$ denote the breaking strength (in kN) of a wire from this population.
Suppose that we know that , but the value of the mean breaking strength $\mu$ of this type of steel wire is unknown. We want to estimate $\mu$ using a 95% CI. Suppose that for a particular sample with we find kN.
For a 95% CI, $\alpha = 0.05$, so $\alpha/2 = 0.025$. Then the middle 95% of the $N(0, 1)$ density falls between $-z_{0.025}$ and $z_{0.025}$.
In R we can find $z_{0.025}$ using the command:
> qnorm(0.025, lower.tail=FALSE)
or
> qnorm(0.975)
We find $z_{0.025} \approx 1.96$.
Then a 95% CI for $\mu$ is:
So the mean breaking strength of this type of steel wire is estimated to be a value between kN and kN, with 95% confidence.
In this example, the margin of error of the CI is:
To achieve a margin of error no larger than , we must increase the sample size to at least:
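Going back to the three questions, suppose the answer to the first question is either “no” or “I’m not sure”, but the answers to the second and third questions are both “yes”. That is, we don’t know if $X$ has a normal distribution on the population, but the sample size $n$ is large and the value of $\sigma^2$ is known.
The Central Limit Theorem tells us that $\bar{X}$ has an approximate $N(\mu, \sigma^2/n)$ distribution, so we can assume
$$\frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \overset{\text{approx.}}{\sim} N(0, 1),$$
and the confidence interval formula for the known-variance case can be used as an approximation.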
Now suppose the answer to the first question is still either “no” or “I’m not sure”, the answer to the third question is still “yes”, but the answer to the second question is “no”. That is, we don’t know the actual value of $\sigma^2$, which is a much more realistic situation.
If the sample size is large, then the value of the sample variance $s^2$ should be close to the actual value of $\sigma^2$. Combined with the Central Limit Theorem, we can say that
$$\frac{\bar{X} - \mu}{S/\sqrt{n}} \overset{\text{approx.}}{\sim} N(0, 1).$$
Confidence Interval for a Population Mean, Unknown Population Variance, Large Sample
Suppose $X$ has a non-normal or otherwise unknown distribution, the value of the population variance $\sigma^2$ is unknown, but the sample size $n$ is large enough for the Central Limit Theorem to apply.
Then the formula for computing the upper and lower bounds of the $(1 - \alpha)100\%$ CI for $\mu$ is:
$$\bar{x} \pm z_{\alpha/2}\frac{s}{\sqrt{n}}$$
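The large-sample interval can be sketched in R as follows. The data here are simulated from an exponential distribution only to provide a clearly non-normal sample; the seed, sample size, and rate are all assumptions for demonstration.

```r
set.seed(1)                 # for reproducibility of the simulated data
x <- rexp(200, rate = 0.5)  # simulated, clearly non-normal sample (illustrative only)

n <- length(x)
xbar <- mean(x)
s <- sd(x)                  # sample standard deviation stands in for sigma

alpha <- 0.05
z <- qnorm(alpha / 2, lower.tail = FALSE)  # z_{alpha/2}

# Large-sample CI: xbar plus or minus z_{alpha/2} * s / sqrt(n)
ci <- c(lower = xbar - z * s / sqrt(n), upper = xbar + z * s / sqrt(n))
ci
```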
Finally, suppose the answer to the first question is “yes” and the answers to the second and third questions are “no”. So $X \sim N(\mu, \sigma^2)$ but we don’t know the value of $\sigma^2$ and the sample size $n$ is not large.
In this case, we have to work with $\dfrac{\bar{X} - \mu}{S/\sqrt{n}}$ again, but this random variable does not have an approximate $N(0, 1)$ distribution.
However, it can be shown in this setting that $\dfrac{\bar{X} - \mu}{S/\sqrt{n}}$ has the Student’s t-distribution with $n - 1$ degrees of freedom. So we would write:
$$\frac{\bar{X} - \mu}{S/\sqrt{n}} \sim t_{n-1}$$
Now the structure of the CI is essentially the same, but in place of $z_{\alpha/2}$ we use $t_{\alpha/2,\,n-1}$, the quantile of the $t_{n-1}$ distribution for which $1 - \alpha/2$ of the area below the density curve falls to its left (and thus $\alpha/2$ of the area falls to its right).
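To see how the t quantile relates to the z quantile it replaces, the following R sketch compares the two for several degrees of freedom. The value of alpha and the list of degrees of freedom are illustrative assumptions.

```r
alpha <- 0.05
df <- c(2, 5, 10, 30, 100, 1000)  # illustrative degrees of freedom

# Right-tail quantiles: area alpha/2 lies to the right of each value
t_quantiles <- qt(alpha / 2, df, lower.tail = FALSE)
z_quantile <- qnorm(alpha / 2, lower.tail = FALSE)

# The t quantiles exceed the z quantile and shrink toward it as df grows
data.frame(df = df, t = t_quantiles, z = z_quantile)
```

The larger t quantiles for small samples widen the interval, which reflects the extra uncertainty from estimating the standard deviation with few observations.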
Confidence Interval for a Population Mean, Unknown Population Variance, Small Sample
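Suppose $X \sim N(\mu, \sigma^2)$, but the value of the population variance $\sigma^2$ is unknown, and the sample size $n$ is not large enough for the Central Limit Theorem to apply.
Then the formula for computing the upper and lower bounds of the $(1 - \alpha)100\%$ CI for $\mu$ is:
$$\bar{x} \pm t_{\alpha/2,\,n-1}\frac{s}{\sqrt{n}}$$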
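A minimal R sketch of the t-based interval is given below, cross-checked against R's built-in t.test function. The helper name t_ci and the simulated sample (seed, size, mean, and standard deviation) are assumptions for illustration; qt is R's quantile function for the t-distribution.

```r
# Hypothetical helper: t-based CI for a mean from summary statistics
t_ci <- function(xbar, s, n, conf_level = 0.95) {
  alpha <- 1 - conf_level
  t_crit <- qt(alpha / 2, df = n - 1, lower.tail = FALSE)  # t_{alpha/2, n-1}
  E <- t_crit * s / sqrt(n)                                # margin of error
  c(lower = xbar - E, upper = xbar + E)
}

set.seed(42)
x <- rnorm(10, mean = 100, sd = 5)  # simulated normal sample (illustrative only)

# Interval from the formula, and the same interval from t.test
manual <- t_ci(mean(x), sd(x), length(x), conf_level = 0.99)
builtin <- t.test(x, conf.level = 0.99)$conf.int

manual
builtin
```

When the raw data are available, t.test is usually the more convenient route; the helper is useful when only the sample mean, standard deviation, and size are reported.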
Calculating Quantiles of the Student's t-Distribution in R
To calculate the quantile $t_{\alpha/2,\,n-1}$ of the Student's t-distribution with $n - 1$ degrees of freedom in R use:
> qt(1 - alpha/2, n - 1)
or
> qt(alpha/2, n - 1, lower.tail=FALSE)
Consider again the population of all steel wires of a certain type, where $X$ denotes the breaking strength (in kN) of a wire from this population. We assume $X \sim N(\mu, \sigma^2)$ with $\sigma^2$ unknown.
Let’s estimate the mean breaking strength $\mu$ of this type of steel wire using a 99% CI. Suppose that for a particular sample with $n = 10$ we find kN and kN.
For a 99% CI, $\alpha = 0.01$, so $\alpha/2 = 0.005$. Then the middle 99% of the $t_9$ distribution falls between $-t_{0.005,\,9}$ and $t_{0.005,\,9}$.
In R we can find $t_{0.005,\,9}$ using the command:
> qt(0.005, 9, lower.tail=FALSE)
or
> qt(0.995, 9)
We find $t_{0.005,\,9} \approx 3.25$.
Then a 99% CI for $\mu$ is:
We didn’t address every possible situation from our three questions (such as if the answers to all three questions are “no”), but there are methods available for those situations as well. Those are not covered in this course.