[A, SfS] Chapter 5: Confidence Intervals: 5.4: CI for difference in means
Confidence Interval for the Difference Between Two Population Means for a Quantitative Variable
Given a quantitative variable measured on two different groups, how different are the two means of that variable between the two groups?
#\text{}#
Now suppose there are two distinct, independent populations, arbitrarily labelled #1# and #2#, and the continuous variable #X# is measured on random samples from both populations. For population #1#, the mean and variance of #X# are denoted #\mu_1# and #\sigma^2_1#, respectively, while for population #2#, the mean and variance of #X# are denoted #\mu_2# and #\sigma^2_2#, respectively.
In this setting, we are not interested in estimating the values of #\mu_1# or #\mu_2#, but in estimating their difference, #\mu_1 - \mu_2#. The random sample from population #1# has size #n_1#, sample mean #\bar{X}_1# and sample variance #s_1^2#, while the random sample from population #2# has size #n_2#, sample mean #\bar{X}_2# and sample variance #s_2^2#.
As when we were estimating the value of a single population mean, the procedure for computing a #(1 - \alpha)100\%# confidence interval for the difference in two population means depends on the answers to several questions. This time there are four:
- Can we assume that #X# has a normal distribution on each of the two populations?
- Do we know the values of the population variances #\sigma^2_1# and #\sigma^2_2# of #X#?
- Can we assume that #\sigma_1^2=\sigma_2^2#?
- Can we consider both the sample sizes #n_1# and #n_2# to be large?
We first consider the situation in which the answers to questions 1 and 2 are “Yes” and the answer to question 3 is “No” (or “I’m not sure”). If so, the answer to question 4 is irrelevant.
Confidence Interval for a Difference in Population Means, Normally Distributed, Unequal Population Variances
Suppose a continuous variable #X# is normally distributed on two populations. Furthermore, suppose the population variances are known but unequal.
Then the difference between sample means is distributed as follows:\[\bar{X}_1 - \bar{X}_2 \sim N \bigg( \mu_1 - \mu_2, \cfrac{\sigma_1^2}{n_1} + \cfrac{\sigma_2^2}{n_2}\bigg)\]
The margin of error of a #(1 - \alpha)100\%# confidence interval for #\mu_1 - \mu_2# is then:
\[z_{\alpha /2}\sqrt{\cfrac{\sigma^2_1}{n_1} + \cfrac{\sigma^2_2}{n_2}}\]
The lower and upper bounds of the CI are computed by subtracting this margin of error from #\bar{X}_1 - \bar{X}_2# and by adding the margin of error to #\bar{X}_1 - \bar{X}_2#, respectively.
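As a quick sketch of this computation in #\mathrm{R}# (the sample means, variances, and sizes below are hypothetical, chosen only for illustration):

```r
# Hypothetical data: known, unequal population variances
xbar1 <- 52.1; xbar2 <- 48.6     # sample means
var1  <- 9.0;  var2  <- 16.0     # known population variances sigma_1^2, sigma_2^2
n1 <- 30; n2 <- 35               # sample sizes
alpha <- 0.05                    # for a 95% confidence interval

me <- qnorm(1 - alpha/2) * sqrt(var1/n1 + var2/n2)   # margin of error
ci <- c((xbar1 - xbar2) - me, (xbar1 - xbar2) + me)  # lower and upper bounds
ci   # roughly (1.79, 5.21)
```

Here #\mathtt{qnorm(1 - alpha/2)}# returns the critical value #z_{\alpha/2}#.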
#\text{}#
But suppose the answer to question 3 is “Yes”. If #\sigma^2_1 = \sigma^2_2#, then we don’t need to distinguish them with subscripts; we can simply write #\sigma^2# for both.
Confidence Interval for a Difference in Population Means, Normally Distributed, Equal Population Variances
Suppose a continuous variable #X# is normally distributed on two populations. Furthermore, suppose the population variances are known and equal.
Then the difference between sample means is distributed as follows:\[\bar{X}_1 - \bar{X}_2 \sim N \bigg( \mu_1 - \mu_2, \cfrac{\sigma^2}{n_1} + \cfrac{\sigma^2}{n_2}\bigg) = N \bigg( \mu_1 - \mu_2, \sigma^2 \Big(\cfrac{1}{n_1} + \cfrac{1}{n_2}\Big)\bigg)\]
The margin of error of a #(1 - \alpha)100\%# confidence interval for #\mu_1 - \mu_2# is then:
\[z_{\alpha /2}\,\sigma\sqrt{\cfrac{1}{n_1} + \cfrac{1}{n_2}}\]
The lower and upper bounds of the CI are computed by subtracting this margin of error from #\bar{X}_1 - \bar{X}_2# and by adding the margin of error to #\bar{X}_1 - \bar{X}_2#, respectively.
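With a common known variance the computation simplifies accordingly; a hypothetical #\mathrm{R}# sketch:

```r
# Hypothetical data: common known variance sigma^2 = 4 (so sigma = 2)
xbar1 <- 10.3; xbar2 <- 9.1      # sample means
sigma <- 2                       # common population standard deviation
n1 <- 20; n2 <- 25
alpha <- 0.05                    # for a 95% confidence interval

me <- qnorm(1 - alpha/2) * sigma * sqrt(1/n1 + 1/n2)   # margin of error
ci <- c((xbar1 - xbar2) - me, (xbar1 - xbar2) + me)
ci   # roughly (0.02, 2.38)
```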
If the answer to question 1 is “No” (we can’t assume normal distributions), but the answer to question 4 is “Yes” (we have large sample sizes), then the Central Limit Theorem allows us to use the same formulas given above to compute the margin of error and the bounds of the CI, depending again on the answer to question 3.
#\text{}#
Now we consider a more realistic scenario in which the answer to question 2 is “No”, i.e., we do not know the values of #\sigma_1^2# and #\sigma_2^2#. We must estimate them with the values of #s_1^2# and #s_2^2#.
If the answer to question 4 (large sample sizes) is “Yes” then the answer to question 1 (normality) is irrelevant.
Confidence Interval for a Difference in Population Means, Large Samples, Unequal and Unknown Population Variances
Suppose #X# is a continuous variable measured on two large samples drawn from two populations whose variances are unknown and cannot be assumed equal.
Then the sample variances will be close in value to the population variances and the difference in sample means will be approximately normally distributed with the following parameters:
\[\bar{X}_1 - \bar{X}_2 \sim N\bigg(\mu_1 - \mu_2, \cfrac{s_1^2}{n_1} + \cfrac{s^2_2}{n_2}\bigg)\]
The margin of error of a #(1 - \alpha)100\%# confidence interval for #\mu_1 - \mu_2# is then:
\[z_{\alpha /2}\sqrt{\cfrac{s^2_1}{n_1} + \cfrac{s^2_2}{n_2}}\]
The lower and upper bounds of the CI are computed by subtracting this margin of error from #\bar{X}_1 - \bar{X}_2# and by adding the margin of error to #\bar{X}_1 - \bar{X}_2#, respectively.
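The only change from the known-variance case is that the sample variances replace the population variances under the square root. A hypothetical #\mathrm{R}# sketch:

```r
# Hypothetical large samples: variances unknown, estimated by s1^2 and s2^2
xbar1 <- 74.2; xbar2 <- 71.8     # sample means
s1 <- 5.1; s2 <- 6.3             # sample standard deviations
n1 <- 60; n2 <- 80               # large sample sizes
alpha <- 0.05

me <- qnorm(1 - alpha/2) * sqrt(s1^2/n1 + s2^2/n2)   # margin of error
ci <- c((xbar1 - xbar2) - me, (xbar1 - xbar2) + me)
ci   # roughly (0.51, 4.29)
```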
Important: Even if #s_1^2 = s_2^2# we cannot assume based on this alone that #\sigma_1^2 = \sigma_2^2#. It is quite possible for the sample variances to be the same even if the population variances are quite different.
#\text{}#
If the answer to question 3 is “Yes” (equal variances), then we can estimate the common variance #\sigma^2# by combining the two sample variances together to create the pooled sample variance.
Confidence Interval for a Difference in Population Means, Large Samples, Equal and Unknown Population Variances
Suppose #X# is a continuous variable measured on two large samples drawn from two populations with equal variances.
In such cases the pooled sample variance can be calculated with the following formula:
\[s_p^2 = \cfrac{(n_1 -1)s^2_1 + (n_2 - 1)s^2_2}{n_1 + n_2 - 2}\]
Then the difference in sample means will be approximately normally distributed with the following parameters:
\[\bar{X}_1 - \bar{X}_2 \sim N\Bigg(\mu_1 - \mu_2, s_p^2\bigg(\cfrac{1}{n_1} + \cfrac{1}{n_2}\bigg)\Bigg)\]
The margin of error of a #(1 - \alpha)100\%# confidence interval for #\mu_1 - \mu_2# is then:
\[z_{\alpha /2}\,s_p\sqrt{\cfrac{1}{n_1} + \cfrac{1}{n_2}}\]
The lower and upper bounds of the CI are computed by subtracting this margin of error from #\bar{X}_1 - \bar{X}_2# and by adding the margin of error to #\bar{X}_1 - \bar{X}_2#, respectively.
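The pooled computation can be sketched in #\mathrm{R}# as follows (all numbers hypothetical):

```r
# Hypothetical large samples with equal (but unknown) population variances
xbar1 <- 33.0; xbar2 <- 30.5     # sample means
s1 <- 4.0; s2 <- 4.4             # sample standard deviations
n1 <- 50; n2 <- 40
alpha <- 0.05

sp2 <- ((n1 - 1)*s1^2 + (n2 - 1)*s2^2) / (n1 + n2 - 2)  # pooled sample variance
me  <- qnorm(1 - alpha/2) * sqrt(sp2) * sqrt(1/n1 + 1/n2)
ci  <- c((xbar1 - xbar2) - me, (xbar1 - xbar2) + me)
ci   # roughly (0.76, 4.24)
```

Note that #s_p^2# is a weighted average of #s_1^2# and #s_2^2#, with the larger sample receiving more weight.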
#\text{}#
Finally, we consider the situation in which the answers to questions 2 and 4 are both “No” but the answer to question 1 (normality) is “Yes”. Once again we must use the Student’s t-distribution, because although #\bar{X}_1 - \bar{X}_2# has a normal distribution we cannot estimate the variance of that distribution accurately.
First suppose that the answer to question 3 (equal variances) is “No”.
Confidence Interval for a Difference in Population Means, Normally Distributed, Small Samples, Unequal and Unknown Population Variances
Suppose #X# is a continuous variable measured on two small samples drawn from two normally distributed populations with unequal variances.
In such cases, the computation of the degrees of freedom of the t-distribution requires a bit of labour. We represent the degrees of freedom with the Greek letter #\nu# ("nu"). Its value is given by the Welch-Satterthwaite approximation:
\[\nu = \cfrac{\bigg(\cfrac{s_1^2}{n_1} + \cfrac{s_2^2}{n_2}\bigg)^2}{\cfrac{(s_1^2/n_1)^2}{n_1 - 1} + \cfrac{(s_2^2/n_2)^2}{n_2 - 1}}\]
(See below for an easy way to compute #\nu# using #\mathtt{R}#.)
Then for a #(1 - \alpha)100\%# confidence interval for #\mu_1 - \mu_2# we use the quantile #t_{\nu,\alpha/2}# to obtain the following margin of error:
\[t_{\nu,\alpha /2}\sqrt{\cfrac{s^2_1}{n_1} + \cfrac{s^2_2}{n_2}}\]
The lower and upper bounds of the CI are computed by subtracting this margin of error from #\bar{X}_1 - \bar{X}_2# and by adding the margin of error to #\bar{X}_1 - \bar{X}_2#, respectively.
You can add the following user-defined function to your #\mathrm{R}# workspace if you need to compute the degrees of freedom #\nu# for this situation, when the population variances are not assumed equal:
DF <- function(sd1, sd2, n1, n2) {
  # Welch-Satterthwaite degrees of freedom from sds and sample sizes
  return((sd1^2/n1 + sd2^2/n2)^2 / ((sd1^2/n1)^2/(n1 - 1) + (sd2^2/n2)^2/(n2 - 1)))
}
For example, suppose #s_1 = 1.7#, #s_2 = 1.9#, #n_1 = 12# and #n_2 = 13#. Then
> DF(1.7,1.9,12,13)
would return the degrees of freedom #\nu \approx 22.98#.
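Putting the pieces together, the full interval for this case might be computed in #\mathrm{R}# as below. The standard deviations and sample sizes match the example above; the sample means are hypothetical, added only for illustration:

```r
# Welch-style CI: small samples, normal populations, unequal unknown variances
xbar1 <- 24.9; xbar2 <- 23.1     # hypothetical sample means
s1 <- 1.7; s2 <- 1.9             # sample standard deviations
n1 <- 12; n2 <- 13               # small sample sizes
alpha <- 0.05

# degrees of freedom (same formula as the DF function above)
nu <- (s1^2/n1 + s2^2/n2)^2 /
  ((s1^2/n1)^2/(n1 - 1) + (s2^2/n2)^2/(n2 - 1))    # about 22.98

me <- qt(1 - alpha/2, df = nu) * sqrt(s1^2/n1 + s2^2/n2)  # t-based margin
ci <- c((xbar1 - xbar2) - me, (xbar1 - xbar2) + me)
ci   # roughly (0.31, 3.29)
```

Note that #\mathtt{qt}# accepts a non-integer value for #\mathtt{df}#, so #\nu# does not need to be rounded.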
#\text{}#
However, if the answer to question 3 is “Yes”, i.e., we can assume the population variances are equal, then once again we use the pooled sample variance.
Confidence Interval for a Difference in Population Means, Normally Distributed, Small Samples, Equal and Unknown Population Variances
Suppose #X# is a continuous variable measured on two small samples drawn from two normally distributed populations with equal variances.
The calculation of the pooled variance is the same as before:
\[s_p^2 = \cfrac{(n_1 -1)s^2_1 + (n_2 - 1)s^2_2}{n_1 + n_2 - 2}\]
When equal variances are assumed, the degrees of freedom of the t-distribution is calculated as follows:
\[\nu = n_1 + n_2 - 2\]
Then for a #(1 - \alpha)100\%# confidence interval for #\mu_1 - \mu_2# we use the quantile #t_{\nu, \alpha/2}# to obtain the following margin of error:
\[t_{\nu, \alpha /2}\,s_p\sqrt{\cfrac{1}{n_1} + \cfrac{1}{n_2}}\]
The lower and upper bounds of the CI are computed by subtracting this margin of error from #\bar{X}_1 - \bar{X}_2# and by adding the margin of error to #\bar{X}_1 - \bar{X}_2#, respectively.
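This last case can be sketched in #\mathrm{R}# as follows (all numbers hypothetical):

```r
# Hypothetical small samples with equal (but unknown) population variances
xbar1 <- 5.8; xbar2 <- 5.1       # sample means
s1 <- 0.9; s2 <- 1.1             # sample standard deviations
n1 <- 10; n2 <- 12               # small sample sizes
alpha <- 0.05

nu  <- n1 + n2 - 2                                       # degrees of freedom
sp2 <- ((n1 - 1)*s1^2 + (n2 - 1)*s2^2) / nu              # pooled sample variance
me  <- qt(1 - alpha/2, df = nu) * sqrt(sp2) * sqrt(1/n1 + 1/n2)
ci  <- c((xbar1 - xbar2) - me, (xbar1 - xbar2) + me)
ci   # roughly (-0.21, 1.61)
```

Since this interval contains #0#, these (hypothetical) data would not rule out #\mu_1 = \mu_2# at the #95\%# confidence level.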
We haven’t addressed every possible combination of answers to the 4 questions, but for other possible combinations we require techniques beyond the scope of this course.
Or visit omptest.org if you are taking an OMPT exam.