In Chapter 4 of the textbook Practical Methods for Design and Analysis of Complex Surveys, the treatment of nonsampling errors concentrates on techniques for adjusting for unit nonresponse and item nonresponse. Unit nonresponse refers to the situation in which no data are available in the survey data set for some of the sampled units. Reweighting can then be applied to the observations from the respondents, using auxiliary information that is available for both the respondents and the nonrespondents. Item nonresponse means that, in the data set to be analysed, some values are present for a sample element but a value is missing for at least one item. Imputation means that a missing value of the study variable y for a sample element k in the data matrix is replaced by an imputed value.
In Training Key 114, the effect of unit nonresponse on the bias of an estimator is demonstrated by reproducing the results of Example 4.1.
In Training Key 117, reweighting is demonstrated by reproducing the results of Example 4.2. A reweighted Horvitz-Thompson estimator, the response homogeneity group (RHG) method and ratio estimation are applied for a SRSWOR sample involving some degree of unit nonresponse.
In Training Key 123, single imputation and multiple imputation are demonstrated by reproducing the results of Example 4.3. Three single imputation methods (mean imputation, the nearest neighbor method and ratio estimation) are applied to an SRSWOR sample involving some degree of item nonresponse. In addition, multiple imputation is demonstrated briefly.
Evaluation of the bias due to nonresponse in the Province’91 population.
Nonresponse can be harmful in two ways:
Estimation can be biased if the characteristics of the nonrespondents systematically differ from those of the respondents. For example, in the case of a population total \(T\), this difference may cause nonresponse bias defined as \(BIAS(\hat{t}) = E(\hat{t}_{HT(r)}) - T\), where \(\hat{t}_{HT(r)}\) is a Horvitz-Thompson estimator calculated from the respondent data.
Estimation can be less efficient than planned, because under unit nonresponse the number of measurements is smaller than the original sample size \((n(r) < n)\). The smaller denominator in the standard error formula leads to a larger standard error.
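The loss of efficiency in the second point can be illustrated with the usual standard error of an estimated total under SRSWOR, \(N\sqrt{(1 - n/N)\,s^2/n}\). The following is a minimal sketch only: the function name `se_total` and the values of `s`, `n` and `n_r` are hypothetical and are not taken from the Province’91 data, and the respondents are treated as if they were a smaller simple random sample.

```r
# Standard error of the estimated total under SRSWOR (hypothetical helper)
se_total <- function(N, n, s) {
  N * sqrt((1 - n / N) * s^2 / n)
}

N   <- 32    # population size
s   <- 500   # assumed sample standard deviation (hypothetical value)
n   <- 8     # planned sample size (hypothetical value)
n_r <- 6     # number of respondents (hypothetical value)

se_planned  <- se_total(N, n, s)
se_realised <- se_total(N, n_r, s)

# Ratio above 1: the standard error grows when only n_r < n units respond
print(se_realised / se_planned)
```

The sketch addresses precision only. It says nothing about bias, which depends on how the nonrespondents differ from the respondents.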
We investigate the nonresponse bias in the Province’91 population in a hypothetical case, assuming that all the southern municipalities were unable to complete the records for the unemployed in time. These municipalities are Kuhmoinen, Joutsa, Luhanka, Leivonmäki and Toivakka. A variable RESP_POP is created to indicate whether a valid response has been received (RESP_POP = 1) or has not been received (RESP_POP = 2) from a municipality. The population of municipalities is thus divided into two subpopulations, the group of respondents (\(N_1 = 27\)) and the group of nonrespondents (\(N_2 = 5\)).
# load Province'91 data (see Chapter 2)
load("province91.Rdata")
library(tidyverse)
# add variable RESP_POP to province91 data
obs <- c(18, 9, 22, 21, 30) # Ids of the municipalities Kuhmoinen, Joutsa, Luhanka, Leivonmäki and Toivakka
`%notin%` <- Negate(`%in%`) # create a "not in" operator
# initialise the column before the subset assignments below
province91$RESP_POP <- NA_real_
# RESP_POP = 2 if a valid response has not been received
province91$RESP_POP[province91$Id %in% obs] <- 2
# RESP_POP = 1 if a valid response has been received
province91$RESP_POP[province91$Id %notin% obs] <- 1
print("List data:")
## [1] "List data:"
province91
## # A tibble: 32 x 10
## Stratum Cluster Id Municipality POP91 LAB91 UE91 HOU85 URB85 RESP_POP
## <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 1 1 Jyväskylä 67200 33786 4123 26881 1 1
## 2 1 2 2 Jämsä 12907 6016 666 4663 1 1
## 3 1 2 3 Jämsänkoski 8118 3818 528 3019 1 1
## 4 1 2 4 Keuruu 12707 5919 760 4896 1 1
## 5 1 3 5 Saarijärvi 10774 4930 721 3730 1 1
## 6 1 5 6 Suolahti 6159 3022 457 2389 1 1
## 7 1 3 7 Äänekoski 11595 5823 767 4264 1 1
## 8 2 5 8 Hankasalmi 6080 2594 391 2179 0 1
## 9 2 6 9 Joutsa 4594 2069 194 1823 0 2
## 10 2 7 10 Jyväskmlk 29349 13727 1623 9230 0 1
## # ... with 22 more rows
print("Calculate group totals, sizes and means by response groups:")
## [1] "Calculate group totals, sizes and means by response groups:"
province91 %>%
group_by(RESP_POP) %>%
summarise(Freq = n(), UE91_sum = sum(UE91), UE91_mean = mean(UE91))
## # A tibble: 2 x 4
## RESP_POP Freq UE91_sum UE91_mean
## <dbl> <int> <dbl> <dbl>
## 1 1 27 14475 536.
## 2 2 5 623 125.
print("Calculate group totals, sizes and means for whole data:")
## [1] "Calculate group totals, sizes and means for whole data:"
province91 %>%
summarise(Freq = n(), UE91_sum = sum(UE91), UE91_mean = mean(UE91))
## # A tibble: 1 x 3
## Freq UE91_sum UE91_mean
## <int> <dbl> <dbl>
## 1 32 15098 472.
print("Means by response groups: RESP_POP == 1")
## [1] "Means by response groups: RESP_POP == 1"
UE91_m1 <- province91 %>%
filter(RESP_POP == 1) %>%
summarise(UE91_m1 = mean(UE91))
print("Means by response groups: RESP_POP == 2")
## [1] "Means by response groups: RESP_POP == 2"
UE91_m2 <- province91 %>%
filter(RESP_POP == 2) %>%
summarise(UE91_m2 = mean(UE91))
Expected value of the total estimator and unit nonresponse bias:
print("Expected value of the total estimator is")
## [1] "Expected value of the total estimator is"
(EXPECTED <- 32*UE91_m1)
## UE91_m1
## 1 17155.56
print("Nonresponse bias is")
## [1] "Nonresponse bias is"
(BIAS <- 5*(UE91_m1-UE91_m2))
## UE91_m1
## 1 2057.556
The bias calculated under this setting amounts to 2057.6, which is quite large: about 14% of the true population total of 15098.
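The two formulas used above are linked by a simple identity. With respondent mean \(\bar{Y}_1\) and nonrespondent mean \(\bar{Y}_2\), the expected value of the respondent-based estimator is \(N\bar{Y}_1\), and since \(T = N_1\bar{Y}_1 + N_2\bar{Y}_2\), the bias is \(N\bar{Y}_1 - T = N_2(\bar{Y}_1 - \bar{Y}_2)\). The following check is a minimal sketch that uses only the group sizes and totals printed in the output above; the variable names are illustrative and not part of the original code.

```r
# Group sizes and UE91 totals, copied from the summary output above
N1 <- 27;    N2 <- 5
T1 <- 14475; T2 <- 623

N       <- N1 + N2        # population size, 32
T_total <- T1 + T2        # true population total of UE91

expected    <- N * T1 / N1                  # N times the respondent mean
bias_direct <- expected - T_total           # E(t_hat) - T
bias_groups <- N2 * (T1 / N1 - T2 / N2)     # N2 * (mean_1 - mean_2)

print(c(expected = expected,
        bias_direct = bias_direct,
        bias_groups = bias_groups))

# Relative bias in percent of the true total
print(round(100 * bias_direct / T_total, 1))
```

Both routes give the same bias, so the relative bias can be read directly against the true total.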
Techniques for adjusting for unit nonresponse are worked out in this Training Key by reweighting for an SRSWOR sample drawn from the Province’91 population. The methods include reweighting by the response homogeneity group method and ratio estimation. In addition, accounting for the extra variation due to reweighting is illustrated in variance estimation of an HT estimator of the total of UE91.
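The three reweighting approaches can be sketched on a toy example. This is a minimal illustration under stated assumptions, not the code of the Training Key: the data frame `smp`, the grouping variable `group`, the auxiliary variable `x` and the auxiliary population total `T_x` are all hypothetical, and variance estimation is not covered.

```r
# Hypothetical SRSWOR sample of n = 8 elements from a population of N = 32.
# y is NA for unit nonrespondents; group and x are known for all sampled elements.
N <- 32
smp <- data.frame(
  group = c(1, 1, 1, 1, 2, 2, 2, 2),                  # response homogeneity groups
  x     = c(400, 300, 500, 450, 200, 250, 180, 220),  # auxiliary variable
  y     = c(60, NA, 80, 70, 25, NA, NA, 30)           # study variable
)
n    <- nrow(smp)
resp <- !is.na(smp$y)

# 1) Reweighted HT estimator: design weight N/n inflated by n / n_r overall
w_adj  <- (N / n) * (n / sum(resp))
t_ht_r <- sum(w_adj * smp$y[resp])

# 2) RHG method: design weight inflated by n_c / m_c within each group c
n_c   <- tapply(resp, smp$group, length)   # sampled elements per group
m_c   <- tapply(resp, smp$group, sum)      # respondents per group
w_rhg <- (N / n) * as.vector((n_c / m_c)[as.character(smp$group)])
t_rhg <- sum(w_rhg[resp] * smp$y[resp])

# 3) Ratio estimation: scale the known total of x by the respondent ratio y/x
T_x   <- 10000                             # hypothetical population total of x
t_rat <- T_x * sum(smp$y[resp]) / sum(smp$x[resp])

print(c(HT_reweighted = t_ht_r, RHG = t_rhg, ratio = t_rat))
```

In this toy sample the response rate differs between the two groups, so the RHG estimate differs from the overall reweighted HT estimate. When the response rate is the same in every group, the two sets of weights coincide.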
Please download the SAS code for your own further training. Select your own sample or several samples and practise reweighting with different sample sizes for an SRSWOR sample. The macro parameters used in the application are DATA = data set, n = sample size and SEED = seed for the random number generator. You may choose \(1 < n < 32\) elements in the sample, and by changing the seed a different sample configuration will be obtained.
NOTE! You need to have access to SAS on your computer.
Imputation implies that a missing value of the study variable \(y\) for a sample element \(k\) in the data matrix is substituted by an imputed value \(\hat{y}_k\). For example, in some computer packages, a technique called mean imputation is available, in which an overall respondent mean \(\bar{y}_{(r)}\), calculated from the respondent values of the study variable, is inserted in place of the missing values for that variable. Then the imputed value for element \(k\) is \(\hat{y}_k= \bar{y}_{(r)}\). This method is not generally valid, and alternative methods are demonstrated in this Training Key. The methods include single imputation methods and multiple imputation methods. Further instructions will be given once you start.
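The single imputation methods named in this Training Key (mean imputation, the nearest neighbor method and ratio estimation) can be sketched as follows. This is a minimal illustration only: the toy data frame `smp`, the auxiliary variable `x` and the helper `nn_value` are hypothetical, and the code is not the SAS code of the Training Key.

```r
# Hypothetical sample with item nonresponse in y; x is observed for every element
smp <- data.frame(
  x = c(400, 300, 500, 450, 200, 250, 180, 220),
  y = c(60, NA, 80, 70, 25, NA, NA, 30)
)
miss   <- is.na(smp$y)
donors <- smp[!miss, ]               # respondents act as donors

# Mean imputation: every missing y gets the respondent mean
y_mean <- ifelse(miss, mean(smp$y, na.rm = TRUE), smp$y)

# Nearest neighbor: borrow y from the respondent with the closest x
nn_value <- function(x0) donors$y[which.min(abs(donors$x - x0))]
y_nn <- smp$y
y_nn[miss] <- sapply(smp$x[miss], nn_value)

# Ratio imputation: y_hat_k = x_k * (sum of respondent y / sum of respondent x)
b     <- sum(donors$y) / sum(donors$x)
y_rat <- ifelse(miss, b * smp$x, smp$y)

print(data.frame(x = smp$x, y = smp$y, y_mean, y_nn, y_rat))

# Mean imputation shrinks the spread of the completed variable
print(c(var_respondents = var(donors$y), var_mean_imputed = var(y_mean)))
```

The last line shows one reason why mean imputation is problematic: the completed variable has a smaller variance than the respondent values, so variance estimates computed from it as if it were fully observed tend to be too small.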
In multiple imputation, we predict \(m\) values \(\hat{y}_1, \ldots, \hat{y}_j, \ldots, \hat{y}_m\) for each missing item. We thus create \(m\) completed data sets. Further instructions will be given once you start.
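The idea of creating \(m\) completed data sets and combining their results can be sketched as follows. This is a simplified illustration under stated assumptions: the toy data and the seed are hypothetical, the imputed values are drawn by a simple random hot-deck from the respondents (a swapped-in technique that does not reflect all the uncertainty a proper multiple imputation model would), and the estimates are combined with Rubin's rules, \(W + (1 + 1/m)B\) for the total variance.

```r
set.seed(1)                       # hypothetical seed, for reproducibility only
N <- 32                           # population size
m <- 5                            # number of completed data sets
smp <- data.frame(y = c(60, NA, 80, 70, 25, NA, NA, 30))  # toy SRSWOR sample
n      <- nrow(smp)
miss   <- is.na(smp$y)
donors <- smp$y[!miss]

est <- numeric(m)                 # estimated totals, one per completed data set
v   <- numeric(m)                 # their estimated variances

for (j in seq_len(m)) {
  y_j <- smp$y
  # fill each missing item with a randomly drawn respondent value
  y_j[miss] <- sample(donors, sum(miss), replace = TRUE)
  est[j] <- N * mean(y_j)                         # estimated total under SRSWOR
  v[j]   <- N^2 * (1 - n / N) * var(y_j) / n      # SRSWOR variance estimate
}

t_mi <- mean(est)                 # combined point estimate
W    <- mean(v)                   # within-imputation variance
B    <- var(est)                  # between-imputation variance
v_mi <- W + (1 + 1 / m) * B       # total variance by Rubin's rules

print(c(total = t_mi, within = W, between = B, total_var = v_mi))
```

The between-imputation component `B` is what single imputation ignores: it measures how much the estimate changes from one completed data set to the next.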
Please download the SAS code for your own further training. Select your own sample or several samples and practise imputation with different sample sizes for an SRSWOR sample. The macro parameters used in the application are DATA = data set, n = sample size, SEED = seed for the random number generator, VAR = dependent variable, AUX = auxiliary variable and REP = number of completed data sets in multiple imputation (recommended values are m = 2, 3, 4 or 5). You may choose \(1 < n < 32\) elements in the sample, and by changing the seed a different sample configuration will be obtained.
NOTE! You need to have access to SAS on your computer.