In Chapter 4 of the textbook Practical Methods for Design and Analysis of Complex Surveys, handling nonsampling errors concentrates on techniques to adjust for unit nonresponse and item nonresponse. Unit nonresponse refers to the situation in which data are not available within the survey data set for a number of sampling units. Reweighting can then be used and applied to the observations from the respondents, with the auxiliary information available for both the respondents and the nonrespondents. Item nonresponse means that in the data set to be analysed some values are present for a sample element, but at least for one item a value is missing for that element. Imputation implies simply that a missing value of the study variable y for a sample element k in the data matrix is substituted by an imputed value.

In Training Key 114, the effect of unit nonresponse on the bias of an estimator is demonstrated by reproducing the results of Example 4.1

In Training Key 117, reweighting is demonstrated by reproducing the results of Example 4.2. A reweighted Horvitz-Thompson estimator, the response homogeneity group (RHG) method and ratio estimation are applied for a SRSWOR sample involving some degree of unit nonresponse.

In Training Key 123, single imputation and multiple imputation are demonstrated by reproducing the results of Example 4.3. Mean imputation, the nearest neighbor method and ratio estimation, providing single imputation methods, are applied for a SRSWOR sample involving some degree of item nonresponse. In addition, multiple imputation is demonstrated briefly.

TRAINING KEY 114: Nonresponse

INTRODUCTION

A) REFERENCE EXAMPLE 4.1: SAS CODE AND OUTPUT

Evaluation of the bias due to nonresponse in the Province’91 population.

Nonresponse can be harmful in two ways:

  1. Estimation can be biased if the characteristics of the nonrespondents systematically differ from those of the respondets. For example, in the case of a population total \(T\), this difference may cause nonresponse bias defined as \(BIAS(\hat{t}) = E(\hat{t}_{HT(r)}) - T\), where \(\hat{t}_{HT(r)}\) is a Horvitz-Thompson estimator calculated from the respondent data.

  2. Estimation can be less efficient than planned, because under unit nonresponse, the number of measurements is less than the original sample size \((n(r) < n)\) thus decreasing the denominator of the standard error formula.

Start SAS Training Key

We investigate the nonresponse bias in the Province’91 population in a hypotetical case by assuming that all the southern municipalities were unable to complete in time the records for the unemployed. These municipalities are Kuhmoinen, Joutsa, Luhanka, Leivonmäki and Toivakka. A variable RES_POP will be created to indicate that a valid response has been received (RES_POP = 1) or has not received (RES_POP = 2), from a municipality. The population of municipalities is thus divided into two subpopulations, the group of respondents (N_1 = 27) and the group of nonrespondents (N_2 = 5).

# load Province'91 data (see Chapter 2)
load("province91.Rdata")
library(tidyverse)

# add variable RESP_POP to province91 data
obs <- c(18, 9, 22, 21, 30) # municipalities: Kuhmoinen, Joutsa, Luhanka, Leivonmäki and Toivakka 
# create a RESP_POP indicator: RESP_POP=2 if valid responses has not been received
province91$RESP_POP[province91$Id %in% obs] <- 2
`%notin%` <- Negate(`%in%`) # create notin operator
# RESP_POP=1 if valid responses has been received
province91$RESP_POP[province91$Id %notin% obs] <- 1

print("List data:")
## [1] "List data:"
province91
## # A tibble: 32 x 10
##    Stratum Cluster    Id Municipality POP91 LAB91  UE91 HOU85 URB85 RESP_POP
##      <dbl>   <dbl> <dbl> <chr>        <dbl> <dbl> <dbl> <dbl> <dbl>    <dbl>
##  1       1       1     1 Jyväskylä    67200 33786  4123 26881     1        1
##  2       1       2     2 Jämsä        12907  6016   666  4663     1        1
##  3       1       2     3 Jämsänkoski   8118  3818   528  3019     1        1
##  4       1       2     4 Keuruu       12707  5919   760  4896     1        1
##  5       1       3     5 Saarijärvi   10774  4930   721  3730     1        1
##  6       1       5     6 Suolahti      6159  3022   457  2389     1        1
##  7       1       3     7 Äänekoski    11595  5823   767  4264     1        1
##  8       2       5     8 Hankasalmi    6080  2594   391  2179     0        1
##  9       2       6     9 Joutsa        4594  2069   194  1823     0        2
## 10       2       7    10 Jyväskmlk    29349 13727  1623  9230     0        1
## # ... with 22 more rows
print("Calculate group totals, sizes and means by response groups:")
## [1] "Calculate group totals, sizes and means by response groups:"
province91 %>% 
  group_by(RESP_POP) %>% 
  summarise(Freq = n(), UE91_sum = sum(UE91), UE91_mean = mean(UE91))
## # A tibble: 2 x 4
##   RESP_POP  Freq UE91_sum UE91_mean
##      <dbl> <int>    <dbl>     <dbl>
## 1        1    27    14475      536.
## 2        2     5      623      125.
print("Calculate group totals, sizes and means for whole data:")
## [1] "Calculate group totals, sizes and means for whole data:"
province91 %>% 
  summarise(Freq = n(), UE91_sum = sum(UE91), UE91_mean = mean(UE91))
## # A tibble: 1 x 3
##    Freq UE91_sum UE91_mean
##   <int>    <dbl>     <dbl>
## 1    32    15098      472.
print("Means by response groups: RESP_POP == 1")
## [1] "Means by response groups: RESP_POP == 1"
UE91_m1 <- province91 %>% 
  filter(RESP_POP == 1) %>% 
  summarise(UE91_m1 = mean(UE91))

print("Means by response groups: RESP_POP == 2")
## [1] "Means by response groups: RESP_POP == 2"
UE91_m2 <- province91 %>% 
  filter(RESP_POP == 2) %>% 
  summarise(UE91_m2 = mean(UE91))

Expected value of the total estimator and unit response bias:

print("Expected value of the total estimator is")
## [1] "Expected value of the total estimator is"
(EXPECTED <- 32*UE91_m1)
##    UE91_m1
## 1 17155.56
print("Nonresponse bias is")
## [1] "Nonresponse bias is"
(BIAS <- 5*(UE91_m1-UE91_m2))
##    UE91_m1
## 1 2057.556

The bias calculated under this setting amounts to 2057.6 which is quite large.

TRAINING KEY 117: Reweighting

INTRODUCTION

A) REFERENCE EXAMPLE 4.2: SAS CODE AND OUTPUT

Techniques for adjusting to unit nonresponse are worked out in this Training Key by reweighting for an SRSWOR sample drawn from the Province’91 population. The methods include reweighting by the response homogeneity group method and ratio estimation. In addition, accounting for the extra variation because of reweighting is illustrated in variance estimation of a HT estimator of the total of UE91.

Start SAS Training Key

B) INTERACTIVE SAS USE

Please download the SAS code for your own further training. Select your own sample or several samples and exercise reweighting with different sample sizes for a SRSWOR sample. The macro parameters used in the application are DATA=data, n = sample size and SEED = seed for the random number generator. You may choose \(1 < n < 32\) elements in the sample and by changing the seed different sample configuration will be obtained.

NOTE! You need to have access to SAS in your computer.

Download SAS code

TRAINING KEY 123: Imputation for Item Nonresponse

INTRODUCTION

A) REFERENCE EXAMPLE 4.3: SAS CODE AND OUTPUT (Single Imputation)

Imputation implies that a missing value of the study variable \(y\) for a sample element \(k\) in the data matrix is substituted by an imputed value \(\hat{y}_k\). For example, in some computer packages, a technique called mean imputation is available, in which an overall respondent mean \(\bar{y}_{(r)}\), calculated from the respondent values of the study variable, is inserted in place of the missing values for that variable. Then the imputed value for element \(k\) is \(\hat{y}_k= \bar{y}_{(r)}\). This method is not generally valid, and alternative methods are demonstrated in this Training Key. The methods include single imputation methods and multiple imputation methods. Further instructions will be given once you start.

Start

B) REFERENCE EXAMPLE 4.3: SAS CODE AND OUTPUT (Multiple Imputation)

In multiple imputation, we predict \(m\) values \(\hat{y}_1, \ldots, \hat{y}_j, \ldots, \hat{y}_m\) for each missing item. We thus create m completed data sets. Further instructions will be given once you start.

Start

C) INTERACTIVE SAS USE

Please download the SAS code for your own further training. Select your own sample or several samples and exercise imputation with different sample sizes for a SRSWOR sample. The macro parameters used in the application are DATA = data set now, n = sample size, SEED = seed for the random number generator, VAR = dependent variable, AUX = auxiliary variable and REP = (in MI) for number of complete data sets (recommendation m = 2,3,4 or 5). You may choose \(1 < n < 32\) elements in the sample and by changing the seed a different sample configuration will be obtained.

NOTE! You need to have access to SAS in your computer.

Download SAS code, Single Imputation macro

Download SAS code, Multiple Imputation macro

Further Reading

Chapter 4: Further Reading

TRAINING KEY 114: Nonresponse

TRAINING KEY 117: Reweighting

TRAINING KEY 123: Imputation

  • Schafer J. L. (2000) Analysis of Incomplete Multivariate Data New York: Chapman & Hall