In Chapter 5 of the textbook Practical Methods for Design and Analysis of Complex Surveys, various approximate techniques for variance and covariance estimation of nonlinear estimators are introduced. Variance estimation is needed to obtain standard error estimates of sample means and proportions for the total population and, more importantly, for various subpopulations. In modelling procedures, variance estimates of estimated model coefficients, such as regression coefficients, are needed for proper test statistics. The linearization method is used as the basic approximation method. Alternative methods include balanced half-samples, jackknife and bootstrap, which are based on sample reuse techniques. The variance approximation methods are demonstrated for the Mini-Finland Health Survey, providing a complex analytical survey in which stratified cluster sampling is used with regional stratification and two regional sample clusters per stratum. A more complex sampling design is provided by the Occupational Health Care Survey (OHC), to be used for covariance matrix estimation of a vector of nonlinear estimators. The sampling design of the OHC Survey is a combination of stratified one-stage and two-stage sampling with industrial establishments as clusters.
In Training Key 145, the linearization method is demonstrated by reproducing the results of Example 5.1 with the Mini-Finland Health Survey data set. Two response variables are used: chronic morbidity, as an example of a binary variable, and systolic blood pressure, as an example of a continuous variable.
In Training Key 158, the jackknife technique is demonstrated by reproducing the results of Example 5.3.
In Training Key 162, the bootstrap technique is demonstrated by reproducing the results of Example 5.4. An option is provided for examining the distribution of bootstrap estimates with a varying number of generated bootstrap samples.
Download SAS data | Download R data | Download SPSS data | Download Excel file
The Mini-Finland Health Survey was designed to obtain a comprehensive picture of health and of the need for care in Finnish adults, and to develop methods for monitoring health in the population. A two-stage stratified cluster-sampling design was used in such a way that one cluster was sampled from each of the 40 geographical strata. The one-cluster-per-stratum design was used to attain a deep stratification of the population of the clusters. The sample of 8000 persons was allocated to achieve an epsem sample (equal probability of selection method; see Section 3.2). The collapsed stratum technique was used in variance estimation with the linearization and sample re-use methods. See pages 132-137 and 146 in Lehtonen, R. & Pahkinen, E. (2004).
The list of variables and the MFH data set are shown below.
Mini-Finland Health Survey data set: Subpopulation of persons aged 30-64 years
Variable | Label |
---|---|
STR | Stratum ID |
CLU1 | Cluster ID for the 1st cluster |
CHRON1 | Sample sum of Chronic morbidity for 1st cluster |
SYSBP1 | Sample sum for Systolic blood pressure for 1st cluster |
n1 | Number of sample elements in 1st cluster |
CLU2 | Cluster ID for the 2nd cluster |
CHRON2 | Sample sum of Chronic morbidity for 2nd cluster |
SYSBP2 | Sample sum for Systolic blood pressure for 2nd cluster |
n2 | Number of sample elements in 2nd cluster |
Example 5.1. In this Training Key we concentrate on the use of the [linearization method](concepts/concepts.html) in the MFH Survey. We examine the linearization method in the estimation of the variance of a subpopulation proportion estimator for the binary response variable CHRON (chronic morbidity) and a subpopulation mean estimator for the continuous response variable SYSBP (systolic blood pressure). The subgroup considered is males aged 30-64 years.
We will demonstrate how the estimation of the variance of a combined ratio type proportion or mean estimator can be carried out with the linearization method.
The reason for the use of linearization is that the proportion and mean estimators should be treated here as nonlinear estimators of type \(y/x\) where both the numerator \(y\) and the denominator \(x\) are random variables. The denominator \(x\) is random because the sampling design of the MFH Survey is a stratified two-stage design where the cluster sample sizes are not fixed in advance, and the subgroup size is not fixed.
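The variance approximation evaluated by the R code in this Training Key can be written compactly as follows. The notation is chosen here for illustration and is an assumption, not taken from the textbook: \(y_{h1}, y_{h2}\) are the two cluster-level sample sums of the response in stratum \(h\), \(x_{h1}, x_{h2}\) are the corresponding numbers of sample elements, and \(H\) is the number of strata (rows of the data set).

\[
\hat{r} = \frac{y}{x}, \qquad y = \sum_{h=1}^{H} (y_{h1} + y_{h2}), \qquad x = \sum_{h=1}^{H} (x_{h1} + x_{h2})
\]

\[
\hat{v}_{des}(\hat{r}) = \hat{r}^{2} \left[ \frac{\hat{v}(y)}{y^{2}} + \frac{\hat{v}(x)}{x^{2}} - \frac{2\,\hat{c}(y,x)}{y\,x} \right]
\]

\[
\hat{v}(y) = \sum_{h=1}^{H} (y_{h1} - y_{h2})^{2}, \qquad \hat{v}(x) = \sum_{h=1}^{H} (x_{h1} - x_{h2})^{2}, \qquad \hat{c}(y,x) = \sum_{h=1}^{H} (y_{h1} - y_{h2})(x_{h1} - x_{h2})
\]

In the code, \(\hat{v}(y)\) corresponds to Chr_Var or Sys_Var, \(\hat{v}(x)\) to n_Var, and \(\hat{c}(y,x)\) to Chr_n_Cov or Sys_n_Cov.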
load("mfh.Rdata")
mfh$Chr_Sum <- mfh$CHRON1 + mfh$CHRON2
mfh$Sys_Sum <- mfh$SYSBP1 + mfh$SYSBP2
mfh$n_Sum <- mfh$n1 + mfh$n2
mfh$Chr_Var <- (mfh$CHRON1 - mfh$CHRON2)^2
mfh$Sys_Var <- (mfh$SYSBP1 - mfh$SYSBP2)^2
mfh$n_Var <- (mfh$n1 - mfh$n2)^2
mfh$Chr_n_Cov <- (mfh$CHRON1 - mfh$CHRON2)*(mfh$n1 - mfh$n2)
mfh$Sys_n_Cov <- (mfh$SYSBP1 - mfh$SYSBP2)*(mfh$n1 - mfh$n2)
library(tidyverse)
Sumall <- mfh %>% summarise(Chr_Sum=sum(Chr_Sum), Sys_Sum=sum(Sys_Sum), n_Sum=sum(n_Sum),
Chr_Var=sum(Chr_Var), Sys_Var=sum(Sys_Var), n_Var=sum(n_Var),
Chr_n_Cov=sum(Chr_n_Cov), Sys_n_Cov=sum(Sys_n_Cov))
library(knitr)
kable(Sumall, caption = "**The cluster sample sums of CHRON and SYSBP and the sample size n, with the corresponding variance and covariance terms.**")
Chr_Sum | Sys_Sum | n_Sum | Chr_Var | Sys_Var | n_Var | Chr_n_Cov | Sys_n_Cov |
---|---|---|---|---|---|---|---|
1073 | 382678 | 2699 | 1545 | 50469516 | 2527 | 1435 | 349962 |
paste("The variable phat is the estimated proportion of chronic morbidity (CHRON) among males aged 30-64 years, and V_des_phat is its variance approximation. The variable yhat is the estimated mean systolic blood pressure (SYSBP) among males aged 30-64 years, and V_des_yhat is its variance approximation.")
## [1] "The variable phat is the estimated proportion of chronic morbidity (CHRON) among males aged 30-64 years, and V_des_phat is its variance approximation. The variable yhat is the estimated mean systolic blood pressure (SYSBP) among males aged 30-64 years, and V_des_yhat is its variance approximation."
paste("phat is")
## [1] "phat is"
(phat <- Sumall$Chr_Sum / Sumall$n_Sum)
## [1] 0.3975546
paste("V_des_phat is")
## [1] "V_des_phat is"
(V_des_phat <- (phat^2)*(Sumall$Chr_Sum^-2 * Sumall$Chr_Var + Sumall$n_Sum^-2 * Sumall$n_Var-2*(Sumall$Chr_Sum * Sumall$n_Sum)^-1 * Sumall$Chr_n_Cov))
## [1] 0.0001102888
paste("yhat is")
## [1] "yhat is"
(yhat <- Sumall$Sys_Sum / Sumall$n_Sum)
## [1] 141.7851
paste("V_des_yhat is")
## [1] "V_des_yhat is"
(V_des_yhat <- (yhat^2) * ( Sumall$Sys_Sum^-2 * Sumall$Sys_Var + Sumall$n_Sum^-2 * Sumall$n_Var-2*( Sumall$Sys_Sum * Sumall$n_Sum)^-1 * Sumall$Sys_n_Cov))
## [1] 0.2788127
paste("Deff_phat is")
## [1] "Deff_phat is"
(Deff_phat <- V_des_phat/(phat*(1-phat)/ Sumall$n_Sum))
## [1] 1.242853
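The steps above can also be wrapped into a single reusable function. The following is a minimal sketch, not part of the original Training Key: the function name ratio_lin_var, its arguments and the toy data are hypothetical, and the data are assumed to hold one row per stratum with the two cluster sums side by side, as in the MFH data set.

```r
# Linearization variance of a ratio estimator r = y / x for a design with
# two sample clusters per stratum (one row per stratum).
# Function and argument names are hypothetical, chosen for this sketch.
ratio_lin_var <- function(y1, y2, x1, x2) {
  y <- sum(y1 + y2)                       # sample sum of the response
  x <- sum(x1 + x2)                       # number of sample elements
  r <- y / x                              # ratio (proportion or mean) estimate
  v_y  <- sum((y1 - y2)^2)                # variance term for the numerator
  v_x  <- sum((x1 - x2)^2)                # variance term for the denominator
  c_yx <- sum((y1 - y2) * (x1 - x2))      # covariance term
  v_r  <- r^2 * (v_y / y^2 + v_x / x^2 - 2 * c_yx / (y * x))
  list(estimate = r, variance = v_r, se = sqrt(v_r), n = x)
}

# Toy data, made up for illustration (not the MFH data)
toy <- data.frame(y1 = c(12, 20, 15), y2 = c(10, 25, 11),
                  x1 = c(30, 41, 38), x2 = c(28, 50, 35))
res <- with(toy, ratio_lin_var(y1, y2, x1, x2))

# Design effect of a proportion estimator, relative to the binomial variance
deff <- res$variance / (res$estimate * (1 - res$estimate) / res$n)
```

With the MFH data loaded, a call such as with(mfh, ratio_lin_var(CHRON1, CHRON2, n1, n2)) should reproduce phat and V_des_phat, because the function performs the same arithmetic as the step-by-step code.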
Please download the SAS code for your own further linearization method training. One of the tasks is to complete the code so that it calculates the design effects (deffs) of the CHRON proportion estimator and the SYSBP mean estimator. Instructions are given in the SAS code once you have downloaded it.
NOTE! You need to have access to SAS on your computer.
Example 5.3. Similar to the other sample re-use methods and the linearization method, the JRR technique (Jackknife repeated replications) can be used for variance approximation of a nonlinear estimator under a complex sampling design. In this Training Key we apply the JRR technique for variance approximation of a subpopulation proportion estimator for the binary response variable CHRON (chronic morbidity) and a subpopulation mean estimator for the continuous response variable SYSBP (systolic blood pressure) in the MFH survey. The subgroup considered is males aged 30-64 years.
In this part we will demonstrate how the estimation of the variance of a combined ratio type proportion or mean estimator can be carried out with the JRR technique. The reason for the use of an approximate variance estimator is that the proportion and mean estimators should be treated here as nonlinear estimators of type y/x where both the numerator y and the denominator x are random variables. The denominator x is random because the sampling design of the MFH Survey is a stratified two-stage design where the cluster sample sizes are not fixed in advance, and the subgroup size is not fixed.
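One common form of the JRR technique for a design with two sample clusters per stratum can be sketched in R as follows. This is an illustration under stated assumptions and is not the SAS macro JRR: the function name jrr_ratio_var, its arguments and the toy data are hypothetical. For each stratum in turn, one cluster is dropped and the remaining cluster is counted twice, and the variance approximation is the sum of squared deviations of these pseudo-estimates from the full-sample estimate. The complementary set of pseudo-estimates (dropping the other cluster) is also computed, since the two versions generally differ slightly for a nonlinear estimator.

```r
# JRR variance approximation for a ratio estimator r = y / x,
# two sample clusters per stratum, one row per stratum.
# Function and argument names are hypothetical, chosen for this sketch.
jrr_ratio_var <- function(y1, y2, x1, x2) {
  Y <- sum(y1 + y2)
  X <- sum(x1 + x2)
  r_full <- Y / X

  # Stratum h: drop cluster 1 and count cluster 2 twice,
  # i.e. replace (y1 + y2) by 2 * y2 in the total (vectorised over strata)
  r_h  <- (Y - y1 + y2) / (X - x1 + x2)
  # Complement: drop cluster 2 and count cluster 1 twice
  r_hc <- (Y + y1 - y2) / (X + x1 - x2)

  v1 <- sum((r_h  - r_full)^2)
  v2 <- sum((r_hc - r_full)^2)
  list(estimate = r_full, v_jrr = v1, v_jrr_c = v2, v_avg = (v1 + v2) / 2)
}

# Toy data, made up for illustration (not the MFH data)
toy <- data.frame(y1 = c(12, 20, 15), y2 = c(10, 25, 11),
                  x1 = c(30, 41, 38), x2 = c(28, 50, 35))
res <- with(toy, jrr_ratio_var(y1, y2, x1, x2))
```

When the denominator is the same constant in every cluster, the estimator is linear and this JRR approximation coincides with the linearization result, which gives a simple sanity check.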
Please download the SAS macro JRR for your own further JRR technique training. Instructions will be given in the code once you download it.
NOTE! You need to have access to SAS on your computer.
Example 5.4. Similar to the other sample re-use methods and the linearization method, the bootstrap can be used for variance approximation of a nonlinear estimator under a complex sampling design. In this Training Key we apply the BOOT technique for variance approximation of a subpopulation proportion estimator for the binary response variable CHRON (chronic morbidity) and a subpopulation mean estimator for the continuous response variable SYSBP (systolic blood pressure) in the MFH survey. The subgroup considered is males aged 30-64 years.
In this part we will demonstrate how the estimation of the variance of a combined ratio type proportion or mean estimator can be carried out with the bootstrap method. We will also demonstrate how the number of generated bootstrap samples affects the distribution of bootstrap estimates. The reason for the use of an approximate variance estimator is that the proportion and mean estimators should be treated here as nonlinear estimators of type y/x where both the numerator y and the denominator x are random variables. The denominator x is random because the sampling design of the MFH Survey is a stratified two-stage design where the cluster sample sizes are not fixed in advance, and the subgroup size is not fixed.
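A bootstrap variance approximation for the same kind of data can be sketched in R as follows. This is an illustration under stated assumptions and may differ from what the SAS macro BOOT does: the function name boot_ratio_var, its arguments and the toy data are hypothetical. The sketch uses a rescaled bootstrap variant in which one of the two clusters is drawn at random in each stratum and given weight 2, a choice made here because naive resampling of two clusters out of two would understate the variance.

```r
# Rescaled bootstrap for a ratio estimator r = y / x,
# two sample clusters per stratum, one row per stratum.
# Function and argument names are hypothetical, chosen for this sketch.
boot_ratio_var <- function(y1, y2, x1, x2, B = 1000, seed = 1) {
  set.seed(seed)                          # reproducible replicates
  H <- length(y1)
  r_full <- sum(y1 + y2) / sum(x1 + x2)

  r_b <- replicate(B, {
    # In each stratum pick cluster 1 or cluster 2 with equal probability
    pick1 <- sample(c(TRUE, FALSE), H, replace = TRUE)
    # The selected cluster gets weight 2 so that stratum totals are preserved
    y_b <- 2 * ifelse(pick1, y1, y2)
    x_b <- 2 * ifelse(pick1, x1, x2)
    sum(y_b) / sum(x_b)
  })

  list(estimate = r_full,
       v_boot = mean((r_b - r_full)^2),   # spread around the full-sample estimate
       replicates = r_b)
}

# Toy data, made up for illustration (not the MFH data)
toy <- data.frame(y1 = c(12, 20, 15), y2 = c(10, 25, 11),
                  x1 = c(30, 41, 38), x2 = c(28, 50, 35))
res <- with(toy, boot_ratio_var(y1, y2, x1, x2, B = 200))

# Effect of the number of bootstrap samples on the variance approximation
v_by_B <- sapply(c(50, 200, 1000), function(B)
  with(toy, boot_ratio_var(y1, y2, x1, x2, B = B)$v_boot))
```

Plotting a histogram of res$replicates for different values of B is one way to examine how the distribution of bootstrap estimates stabilises as more bootstrap samples are generated.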
Please download the SAS macro BOOT for your own further bootstrap method training. Instructions will be given in the code once you download it.
NOTE! You need to have access to SAS on your computer.
Download SAS data | Download R data | Download SPSS data | Download Text file
The sampling design of the OHC Survey is an example of stratified cluster sampling where both one-stage and two-stage sampling have been used. The pedagogical data set constructed for training purposes in this example includes a total of 250 clusters, i.e. industrial establishments, organized in five strata (strata 2 to 6), and a total of 7841 persons. Stratification is based on the type of industry and cluster size (the number of salaried employees). Clusters having at least 10 employees are included in the OHC example data set. There is a variable number of clusters per stratum in the design. The average cluster sample size is 11.2 employees. A more detailed description of the study design and sampling design of the OHC Survey is given in Section 5.6 of Lehtonen and Pahkinen (2004).
To give you an idea of the conditional distributions and correlations of the response variable and predictor variables, basic descriptive statistics with the correlation matrix of AGE, PHYS, CHRON, PSYCH and PSYCH2 are also shown, separately for each sex.
The list of variables in the OHC Survey data set is shown below.
Variable | Label |
---|---|
CLUSTER | Cluster Identification |
STRATUM | Stratum Identification, 2 to 6 |
SUBJECT | CLUSTER is transformed to a new variable SUBJECT having values 1,2,…,250 |
ID | Identification number |
SEX | 1 males, 2 females |
AGE | in years, range 15 to 64 |
AGE2 | Age group, 1 = 44 years or younger, 2 = 45 years or older |
PHYS | Physical Health Hazards, 1 present, 0 otherwise |
CHRON | Chronic morbidity, 1 present, 0 otherwise |
PSYCH | Psychic Strain, standardized first principal component of nine psychic symptoms |
PSYCH2 | Psychic Strain, constructed from PSYCH such that score below median = 0 and above median = 1 |
PSYCH3 | Psychic Strain, constructed from PSYCH. 1 = none, 2 = some, 3 = severe |