In Chapter 5 of the textbook Practical Methods for Design and Analysis of Complex Surveys, various approximate techniques for variance and covariance estimation of nonlinear estimators are introduced. Variance estimation is needed to obtain standard error estimates of sample means and proportions for the total population and, more importantly, for various subpopulations. In modelling procedures, variance estimates of estimated model coefficients, such as regression coefficients, are needed for proper test statistics. The linearization method is used as the basic approximation method. Alternative methods include balanced half-samples, jackknife and bootstrap, which are based on sample reuse techniques. The variance approximation methods are demonstrated for the Mini-Finland Health Survey, providing a complex analytical survey in which stratified cluster sampling is used with regional stratification and two regional sample clusters per stratum. A more complex sampling design is provided by the Occupational Health Care Survey (OHC), to be used for covariance matrix estimation of a vector of nonlinear estimators. The sampling design of the OHC Survey is a combination of stratified one-stage and two-stage sampling with industrial establishments as clusters.

In Training Key 145, the linearization method is demonstrated by reproducing the results of Example 5.1, using the Mini-Finland Health Survey data set. Two response variables are used: chronic morbidity, an example of a binary variable, and systolic blood pressure, an example of a continuous variable.

In Training Key 158, the jackknife technique is demonstrated by reproducing the results of Example 5.3.

In Training Key 162, the bootstrap technique is demonstrated by reproducing the results of Example 5.4. An option is provided for examining the distribution of the bootstrap estimates as the number of generated bootstrap samples varies.

5.1 The Mini-Finland health survey

Download SAS data | Download R data | Download SPSS data | Download Excel file

MFH SURVEY OVERVIEW

Introduction to MFH Survey sampling design

The Mini-Finland Health Survey was designed to obtain a comprehensive picture of health and of the need for care in Finnish adults, and to develop methods for monitoring health in the population. A two-stage stratified cluster-sampling design was used in such a way that one cluster was sampled from each of the 40 geographical strata. The one-cluster-per-stratum design was used to attain a deep stratification of the population of clusters. The sample of 8000 persons was allocated to achieve an epsem sample (equal probability of selection method; see Section 3.2). Because a variance cannot be estimated from a single cluster per stratum, the collapsed stratum technique, which yields two sample clusters per stratum for variance estimation, was used with the linearization and sample re-use methods. See pages 132-137 and 146 in Lehtonen, R. & Pahkinen, E. (2004).

The list of variables and the MFH data set are shown below.

Mini-Finland Health Survey data set: Subpopulation of persons aged 30-64 years

Variable Label
STR Stratum ID
CLU1 Cluster ID for the 1st cluster
CHRON1 Sample sum of Chronic morbidity for 1st cluster
SYSBP1 Sample sum for Systolic blood pressure for 1st cluster
n1 Number of sample elements in 1st cluster
CLU2 Cluster ID for the 2nd cluster
CHRON2 Sample sum of Chronic morbidity for 2nd cluster
SYSBP2 Sample sum for Systolic blood pressure for 2nd cluster
n2 Number of sample elements in 2nd cluster

TRAINING KEY 145: Linearization method

INTRODUCTION

Example 5.1. In this Training Key we concentrate on the use of the [linearization method](concepts/concepts.html) in the MFH Survey. We examine the linearization method in the estimation of the variance of a subpopulation proportion estimator for the binary response variable CHRON (chronic morbidity) and a subpopulation mean estimator for the continuous response variable SYSBP (systolic blood pressure). The subgroup considered is males aged 30-64 years.

A) LINEARIZATION METHOD

We will demonstrate how the estimation of the variance of a combined ratio type proportion or mean estimator can be carried out with the linearization method.

Start SAS Training Key

Linearization is needed because the proportion and mean estimators must be treated here as nonlinear estimators of type \(y/x\), where both the numerator \(y\) and the denominator \(x\) are random variables. The denominator \(x\) is random because the MFH Survey uses a stratified two-stage design in which the cluster sample sizes are not fixed in advance, and the subgroup size is not fixed either.
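
For reference, the approximation used in the R code of this Training Key can be written out as follows. The notation is an assumption made here for illustration: \(y_{hi}\) is the sample sum of the response and \(x_{hi}\) the number of sample elements in cluster \(i\) of stratum \(h\), there are two sample clusters per stratum, and the clusters are treated as if drawn with replacement.

```latex
\[
\hat{r} = \frac{y}{x}, \qquad
y = \sum_{h=1}^{H} (y_{h1} + y_{h2}), \qquad
x = \sum_{h=1}^{H} (x_{h1} + x_{h2})
\]
\[
\hat{v}_{des}(\hat{r}) = \hat{r}^{2}
\left[ \frac{\hat{v}(y)}{y^{2}} + \frac{\hat{v}(x)}{x^{2}}
     - \frac{2\,\hat{v}(y,x)}{y\,x} \right]
\]
\[
\hat{v}(y) = \sum_{h=1}^{H} (y_{h1} - y_{h2})^{2}, \qquad
\hat{v}(x) = \sum_{h=1}^{H} (x_{h1} - x_{h2})^{2}, \qquad
\hat{v}(y,x) = \sum_{h=1}^{H} (y_{h1} - y_{h2})(x_{h1} - x_{h2})
\]
```

In the R code, the columns Chr_Var and Sys_Var correspond to \(\hat{v}(y)\), n_Var to \(\hat{v}(x)\), and Chr_n_Cov and Sys_n_Cov to \(\hat{v}(y,x)\).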

load("mfh.Rdata")

mfh$Chr_Sum <- mfh$CHRON1 + mfh$CHRON2
mfh$Sys_Sum <- mfh$SYSBP1 + mfh$SYSBP2
mfh$n_Sum <- mfh$n1 + mfh$n2
mfh$Chr_Var <- (mfh$CHRON1 - mfh$CHRON2)^2
mfh$Sys_Var <- (mfh$SYSBP1 - mfh$SYSBP2)^2
mfh$n_Var <- (mfh$n1 - mfh$n2)^2
mfh$Chr_n_Cov <- (mfh$CHRON1 - mfh$CHRON2)*(mfh$n1 - mfh$n2)
mfh$Sys_n_Cov <- (mfh$SYSBP1 - mfh$SYSBP2)*(mfh$n1 - mfh$n2)

library(tidyverse)
Sumall <- mfh %>% summarise(Chr_Sum=sum(Chr_Sum), Sys_Sum=sum(Sys_Sum), n_Sum=sum(n_Sum), 
                  Chr_Var=sum(Chr_Var), Sys_Var=sum(Sys_Var), n_Var=sum(n_Var), 
                  Chr_n_Cov=sum(Chr_n_Cov), Sys_n_Cov=sum(Sys_n_Cov))
library(knitr)
kable(Sumall, caption = "**Cluster-level sample sums of CHRON and SYSBP and the sample size n, with the corresponding variance and covariance terms.**")
Cluster-level sample sums of CHRON and SYSBP and the sample size n, with the corresponding variance and covariance terms.
Chr_Sum Sys_Sum n_Sum Chr_Var Sys_Var n_Var Chr_n_Cov Sys_n_Cov
1073 382678 2699 1545 50469516 2527 1435 349962
paste("The variable phat is the estimated proportion of chronic morbidity (CHRON) among males aged 30-64 years, and V_des_phat is its variance approximation. The variable yhat is the estimated mean systolic blood pressure (SYSBP) among males aged 30-64 years, and V_des_yhat is its variance approximation.")
## [1] "The variable phat is the estimated proportion of chronic morbidity (CHRON) among males aged 30-64 years, and V_des_phat is its variance approximation. The variable yhat is the estimated mean systolic blood pressure (SYSBP) among males aged 30-64 years, and V_des_yhat is its variance approximation."
paste("phat is")
## [1] "phat is"
(phat <- Sumall$Chr_Sum / Sumall$n_Sum)
## [1] 0.3975546
paste("V_des_phat is")
## [1] "V_des_phat is"
(V_des_phat <- (phat^2)*(Sumall$Chr_Sum^-2 * Sumall$Chr_Var + Sumall$n_Sum^-2 * Sumall$n_Var-2*(Sumall$Chr_Sum * Sumall$n_Sum)^-1 * Sumall$Chr_n_Cov))
## [1] 0.0001102888
paste("yhat is")
## [1] "yhat is"
(yhat <-  Sumall$Sys_Sum /  Sumall$n_Sum)
## [1] 141.7851
paste("V_des_yhat is")
## [1] "V_des_yhat is"
(V_des_yhat <-  (yhat^2) * ( Sumall$Sys_Sum^-2 *  Sumall$Sys_Var +  Sumall$n_Sum^-2 *  Sumall$n_Var-2*( Sumall$Sys_Sum *  Sumall$n_Sum)^-1 *  Sumall$Sys_n_Cov))
## [1] 0.2788127
paste("Deff_phat is")
## [1] "Deff_phat is"
(Deff_phat <- V_des_phat/(phat*(1-phat)/ Sumall$n_Sum))
## [1] 1.242853
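
The calculation above can also be wrapped in a small reusable function. The following is a minimal sketch, not part of the original Training Key: the function name `ratio_lin_var`, its argument names and the toy data are hypothetical, and the design is assumed to have exactly two sample clusters per stratum.

```r
# Linearization variance of a ratio estimator r = y / x for a design with
# two sample clusters per stratum. Function and argument names are
# hypothetical; inputs are per-stratum vectors of cluster-level sums.
ratio_lin_var <- function(y1, y2, x1, x2) {
  y <- sum(y1 + y2)                      # numerator total
  x <- sum(x1 + x2)                      # denominator total
  r <- y / x                             # ratio estimate
  v_y  <- sum((y1 - y2)^2)               # variance term of y
  v_x  <- sum((x1 - x2)^2)               # variance term of x
  v_yx <- sum((y1 - y2) * (x1 - x2))     # covariance term of y and x
  v_r  <- r^2 * (v_y / y^2 + v_x / x^2 - 2 * v_yx / (y * x))
  c(estimate = r, variance = v_r, se = sqrt(v_r))
}

# Synthetic toy data (NOT the MFH data): three strata, two clusters each.
toy <- data.frame(y1 = c(3, 5, 6), y2 = c(4, 2, 7),
                  x1 = c(10, 12, 9), x2 = c(9, 11, 13))
res <- ratio_lin_var(toy$y1, toy$y2, toy$x1, toy$x2)
print(res)

# With the MFH data loaded as `mfh`, the call would be, for example:
# ratio_lin_var(mfh$CHRON1, mfh$CHRON2, mfh$n1, mfh$n2)
```

Applied to the MFH columns, the function should reproduce phat and V_des_phat for CHRON, and yhat and V_des_yhat when SYSBP1 and SYSBP2 are passed instead.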

B) INTERACTIVE SAS USE

Please download the SAS code for your own further training in the linearization method. One of the tasks is to complete the code so that it calculates the design effects (deffs) of the CHRON proportion estimator and the SYSBP mean estimator. Instructions are given in the SAS code once you have downloaded it.

NOTE! You need to have access to SAS on your computer.

Download SAS code

TRAINING KEY 158: The JRR technique

INTRODUCTION

Example 5.3. Like the linearization method and the other sample re-use methods, the JRR technique (jackknife repeated replications) can be used for variance approximation of a nonlinear estimator under a complex sampling design. In this Training Key we apply the JRR technique to the variance approximation of a subpopulation proportion estimator for the binary response variable CHRON (chronic morbidity) and a subpopulation mean estimator for the continuous response variable SYSBP (systolic blood pressure) in the MFH Survey. The subgroup considered is males aged 30-64 years.

A) JRR TECHNIQUE

In this part we will demonstrate how the estimation of the variance of a combined ratio type proportion or mean estimator can be carried out with the JRR technique. An approximate variance estimator is needed because the proportion and mean estimators must be treated here as nonlinear estimators of type \(y/x\), where both the numerator \(y\) and the denominator \(x\) are random variables. The denominator \(x\) is random because the MFH Survey uses a stratified two-stage design in which the cluster sample sizes are not fixed in advance, and the subgroup size is not fixed either.
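
The idea can be sketched in R as follows. This is one common variant of the jackknife for two sample clusters per stratum, not necessarily the one implemented in the SAS macro JRR: in each stratum, one cluster is deleted and the other is counted twice, the ratio is recomputed, and the squared deviations of the two complementary replicates from the full-sample estimate are averaged. The function name `jrr_ratio_var`, its arguments and the toy data are hypothetical.

```r
# JRR variance sketch for a ratio estimator r = y / x with two sample
# clusters per stratum. Names are hypothetical; inputs are per-stratum
# vectors of cluster-level sums.
jrr_ratio_var <- function(y1, y2, x1, x2) {
  y <- sum(y1 + y2)
  x <- sum(x1 + x2)
  r <- y / x                                # full-sample estimate
  # Replicate a (one per stratum): cluster 1 deleted, cluster 2 counted twice
  r_a <- (y - y1 + y2) / (x - x1 + x2)
  # Replicate b (the complement): cluster 2 deleted, cluster 1 counted twice
  r_b <- (y + y1 - y2) / (x + x1 - x2)
  # Average the two squared deviations within each stratum, then sum over strata
  v <- sum(((r_a - r)^2 + (r_b - r)^2) / 2)
  c(estimate = r, variance = v, se = sqrt(v))
}

# Synthetic toy data (NOT the MFH data): three strata, two clusters each.
toy <- data.frame(y1 = c(3, 5, 6), y2 = c(4, 2, 7),
                  x1 = c(10, 12, 9), x2 = c(9, 11, 13))
jrr <- jrr_ratio_var(toy$y1, toy$y2, toy$x1, toy$x2)
print(jrr)
```

When the denominator does not vary between the clusters of a stratum, the estimator is linear and this JRR estimate coincides with the sum of squared within-stratum differences of \(y\) divided by \(x^2\), which is a useful check of an implementation.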

Start SAS Training Key

B) INTERACTIVE SAS USE

Please download the SAS macro JRR for your own further training in the JRR technique. Instructions are given in the code once you have downloaded it.

NOTE! You need to have access to SAS on your computer.

Download SAS code

TRAINING KEY 162: Bootstrap Technique

INTRODUCTION

Example 5.4. Like the linearization method and the other sample re-use methods, the bootstrap can be used for variance approximation of a nonlinear estimator under a complex sampling design. In this Training Key we apply the BOOT technique to the variance approximation of a subpopulation proportion estimator for the binary response variable CHRON (chronic morbidity) and a subpopulation mean estimator for the continuous response variable SYSBP (systolic blood pressure) in the MFH Survey. The subgroup considered is males aged 30-64 years.

A) BOOTSTRAP TECHNIQUE

In this part we will demonstrate how the estimation of the variance of a combined ratio type proportion or mean estimator can be carried out with the bootstrap method. We will also demonstrate how the number of generated bootstrap samples affects the distribution of the bootstrap estimates. An approximate variance estimator is needed because the proportion and mean estimators must be treated here as nonlinear estimators of type \(y/x\), where both the numerator \(y\) and the denominator \(x\) are random variables. The denominator \(x\) is random because the MFH Survey uses a stratified two-stage design in which the cluster sample sizes are not fixed in advance, and the subgroup size is not fixed either.
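
A minimal R sketch of the idea follows. It uses the rescaled bootstrap of Rao, Wu and Yue (1992) as an assumption: with two sample clusters per stratum, one cluster is drawn at random from each stratum and given weight 2. The SAS macro BOOT may use a different resampling scheme. The function name `boot_ratio_var`, its arguments and the toy data are hypothetical.

```r
# Rescaled bootstrap sketch for a ratio estimator r = y / x with two sample
# clusters per stratum: draw one cluster per stratum and give it weight 2.
# Names are hypothetical; inputs are per-stratum vectors of cluster-level sums.
boot_ratio_var <- function(y1, y2, x1, x2, B = 1000) {
  H <- length(y1)
  r_star <- replicate(B, {
    pick <- sample(c(TRUE, FALSE), H, replace = TRUE)  # TRUE selects cluster 1
    y_b <- sum(2 * ifelse(pick, y1, y2))               # bootstrap numerator
    x_b <- sum(2 * ifelse(pick, x1, x2))               # bootstrap denominator
    y_b / x_b
  })
  # Bootstrap variance: mean squared deviation around the bootstrap mean
  list(estimates = r_star, variance = mean((r_star - mean(r_star))^2))
}

# Synthetic toy data (NOT the MFH data): three strata, two clusters each.
toy <- data.frame(y1 = c(3, 5, 6), y2 = c(4, 2, 7),
                  x1 = c(10, 12, 9), x2 = c(9, 11, 13))

# Compare the bootstrap distribution for different numbers of replicates B.
set.seed(2024)
for (B in c(100, 1000, 5000)) {
  b <- boot_ratio_var(toy$y1, toy$y2, toy$x1, toy$x2, B = B)
  cat(sprintf("B = %4d  mean = %.4f  variance = %.6f\n",
              B, mean(b$estimates), b$variance))
}
```

The variance estimate fluctuates noticeably for small B and stabilizes as B grows, which is the effect examined in this Training Key. A histogram of `b$estimates` for each B shows the same thing graphically.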

Start SAS Training Key

B) INTERACTIVE SAS USE

Please download the SAS macro BOOT for your own further training in the bootstrap method. Instructions are given in the code once you have downloaded it.

NOTE! You need to have access to SAS on your computer.

Download SAS code

OHC Survey overview

Download SAS data | Download R data | Download SPSS data | Download Text file


Introduction to OHC Survey sampling design

The sampling design of the OHC Survey is an example of stratified cluster sampling where both one-stage and two-stage sampling have been used. The pedagogical data set constructed for training purposes in this example includes a total of 250 clusters, i.e. industrial establishments, organized in five strata (strata 2 to 6), and a total of 7841 persons. Stratification is based on the type of industry and cluster size (the number of salaried employees). Clusters having at least 10 employees are included in the OHC example data set. The number of clusters per stratum varies in the design. The average cluster sample size is 11.2 employees. A more detailed description of the study design and sampling design of the OHC Survey is given in Section 5.6 of Lehtonen and Pahkinen (2004).

To give an idea of the conditional distributions and correlations of the response and predictor variables, basic descriptive statistics and the correlation matrix of AGE, PHYS, CHRON, PSYCH and PSYCH2 are also shown, separately for each sex.

The list of variables in the OHC Survey data set is shown below.

Variable Label
CLUSTER Cluster Identification
STRATUM Stratum Identification, 2 to 6
SUBJECT CLUSTER is transformed to a new variable SUBJECT having values 1,2,…,250
ID Identification number
SEX 1 males, 2 females
AGE in years, range 15 to 64
AGE2 Age group, 1 = 44 years or under, 2 = 45 years or over
PHYS Physical Health Hazards, 1 present, 0 otherwise
CHRON Chronic morbidity, 1 present, 0 otherwise
PSYCH Psychic Strain, standardized first principal component of nine psychic symptoms
PSYCH2 Psychic Strain, constructed from PSYCH such that score below median = 0 and above median = 1
PSYCH3 Psychic Strain, constructed from PSYCH. 1 = none, 2 = some, 3 = severe

Further Reading

Chapter 5: Further Reading

TRAINING KEY 145: Linearization Method

  • Kish L. (1995) Methods for design effects. Journal of Official Statistics 11, 55-77.
  • Wolter K.M. (1985) Introduction to Variance Estimation. New York: Springer.

TRAINING KEY 158: JRR Technique

  • Frankel M.R. (1971) Inference from Survey Samples. Ann Arbor: Institute for Social Research, The University of Michigan.
  • Wolter K.M. (1985) Introduction to Variance Estimation. New York: Springer.

TRAINING KEY 162: Bootstrap Method

  • Kalton G. (1983) Introduction to Survey Sampling. Beverly Hills: Sage.
  • Rao J.N.K., Wu C.F.J. and Yue K. (1992) Some recent work on resampling methods for complex surveys. Survey Methodology 18, 209-217.
  • Verma V., Scott C. and O’Muircheartaigh C. (1980) Sample designs and sampling errors for the World Fertility Survey. Journal of the Royal Statistical Society A 143, 431-473.
  • Wolter K.M. (1985) Introduction to Variance Estimation. New York: Springer.