In Chapter 8 of the textbook Practical Methods for Design and Analysis of Complex Surveys, multivariate analysis for complex surveys is discussed in the case of one response variable and a set of predictor or explanatory variables. For this kind of analysis situation, logistic models and linear models are widely used. Proper methods are introduced for fitting these models for intra-cluster correlated response variables from complex sampling designs. Design-based analysis of categorical data using a logit ANOVA model, and methods for design-based logistic and linear regression analysis, are discussed and applied for data from a complex sample survey.
In Training Key 277, design-based logit ANOVA modelling is examined reproducing the results of Example 8.1. A step-wise ANOVA model building procedure is demonstrated. A program for generalized weighted least squares (GWLS) estimation is examined in detail. The Occupational Health Care Survey data set is used.
In Training Key 288, logistic analysis of covariance (ANCOVA) is demonstrated for a binary response variable and the results of Example 8.2 are reproduced. Pseudolikelihood (PML) estimation is used for the OHC Survey data set, accounting for the sampling complexities. An option is provided for a detailed examination of the role of interaction effects in a logistic ANCOVA model.
In Training Key 293, a linear ANCOVA model is fitted for a continuous response variable from the OHC Survey data set by using weighted least squares estimation.
The results of Example 8.4 will be reproduced.
In Training Key 298, alternative multivariate analyses for a binary response variable and a continuous response variable from the OHC Survey data set are demonstrated. The methods include the PML and generalized estimating equations (GEE) estimation for fixed-effects logistic and linear models and residual maximum likelihood (REML) estimation for logistic and linear mixed models. The methods extend the methodology presented in the textbook. It will be noted that closely agreeing numerical results can be obtained by the different methods available.
In this Training Key we will practice design-based multivariate model fitting for categorical data in the OHC survey setting. In point A, the aim is to model the variation of domain proportions of the binary response variable PSYCH2 (psychic strain; 1 present, 0 otherwise) across the eight domains formed by a cross-classification of SEX, a two-category variable AGE (-44 years, 45- years) and the variable PHYS (physical health hazards of work; 1 present, 0 otherwise) by fitting the model in stepwise manner. In point B, the aim is to describe and practice the GWLS method from more detailed and theoretical perspective. Point C is the option for interactive analysis.
In this part we will exercise how to fit a logit ANOVA model in the OHC survey setting described above. We use pseudolikelihood (PML) estimation in this exercise.
Main effects logit ANOVA model is fitted for the OHC Survey data to examine the variation of domain proportions of the binary response variable PSYCH2 (Psychic strain, 1 present; 0 otherwise) across the eight domains formed by a cross-classification of variables SEX, two-category AGE and binary PHYS (Physical health hazards, 1 present; 0 otherwise).
We will fit a reduced model that is parsimonious and fits reasonably well. The reduced model can be used for the interpretation of the relationships of the predictors with the response variable. In this context PHYS is the main subject matter predictor. Age-sex adjusted odds ratio (OR) statistics will be calculated for PHYS and can be used for the interpretation of the results.
print("Load ohc data:")
## [1] "Load ohc data:"
load("ohc.Rdata")
library(tidyverse)
library(survey)
# Construct a ohc.design object.
# This ohc.design object will be used for all subsequent analysis commands:
# NOTE. nest = TRUE
ohc.design <- svydesign(ids = ~CLUSTER, data = ohc, strata = ~STRATUM, nest = TRUE)
print("Results for Logit-ANOVA / Model terms: Intercept Sex Age2 Phys:")
## [1] "Results for Logit-ANOVA / Model terms: Intercept Sex Age2 Phys:"
fit.reduced <- svyglm(PSYCH2 ~ as_factor(SEX) + as_factor(AGE2) + as_factor(PHYS), family=binomial(link = "logit"), ohc.design)
summary(fit.reduced)
##
## Call:
## svyglm(formula = PSYCH2 ~ as_factor(SEX) + as_factor(AGE2) +
## as_factor(PHYS), design = ohc.design, family = binomial(link = "logit"))
##
## Survey design:
## svydesign(ids = ~CLUSTER, data = ohc, strata = ~STRATUM, nest = TRUE)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.34579 0.04724 -7.319 3.67e-12 ***
## as_factor(SEX)2 0.49397 0.05921 8.343 5.55e-15 ***
## as_factor(AGE2)2 0.12339 0.05788 2.132 0.034 *
## as_factor(PHYS)1 0.28607 0.05798 4.934 1.50e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1.00005)
##
## Number of Fisher Scoring iterations: 4
print("Odds ratios and confidence intervals:")
## [1] "Odds ratios and confidence intervals:"
exp(cbind(coefficients(fit.reduced), confint(fit.reduced)))
## 2.5 % 97.5 %
## (Intercept) 0.7076602 0.645076 0.7763162
## as_factor(SEX)2 1.6388142 1.459256 1.8404663
## as_factor(AGE2)2 1.1313215 1.009985 1.2672352
## as_factor(PHYS)1 1.3311856 1.188195 1.4913848
This option is related to Example 8.1 and is for those who wish to gain deeper understanding on how the GWLS method functions in logit ANOVA. SAS/IML macro is provided with the description of each step of the model fitting procedure. Full design-based (DES) option and model-assisted BIN option (assuming simple random sampling) are compared.
Further instructions are given in the code once you download.
Example 8.2: In this Training Key we concentrate on performing design-based logistic ANCOVA (Analysis of Covariance) modelling with the pseudolikelihood or PML method. We fit a multivariate logistic ANCOVA model by entering in the model some of the predictors as continuous measurements and some as discrete variables. We will also demonstrate the effect of interaction terms in the interpretation of the results. The data is again from the OHC Survey.
In this part we will study how to perform design-based logistic ANCOVA modelling with the PML method for a binary response variable in the case of three continuous predictors and one discrete predictor. We will build the model by removing interaction terms from a model including all possible interaction terms of the discrete predictor with the continuous predictors. Graphical displays are used to show the effect of presence or absence of an interaction term on predicted proportions calculated for a given logistic ANCOVA model. The results can be compared with those from the logit ANOVA exercise (key 277).
The binary response variable PSYCH2 measures psychic strain (1: present, 2: otherwise). AGE (in years), PHYS (binary; Physical health hazards; 1 present, 0 otherwise) and CHRON (binary; Chronic morbidity; 1 present, 0 otherwise) are treated as continuous predictors. SEX is the discrete predictor. Thus there are four predictors available. Because SEX*AGE is statistically significant interaction term it is included in the model.
print("Load ohc data:")
## [1] "Load ohc data:"
load("ohc.Rdata")
# relevel variable sex (to compare the results from Table 8.8)
ohc$sex <- factor(ohc$SEX, levels = c(1,2))
ohc$sex <- relevel(ohc$sex, ref = "2")
# define the design
ohc.design <- svydesign(ids = ~CLUSTER, data = ohc, strata = ~STRATUM, nest = TRUE)
# fit the ancova model: LOGIT(PSYCH2) = INTERCEPT + SEX + AGE + PHYS + CHRON + SEX*AGE
fit.ancova <- svyglm(PSYCH2 ~ sex * AGE + CHRON + PHYS, family=binomial(link = "logit"), ohc.design)
print("Results for Logistic ANCOVA:")
## [1] "Results for Logistic ANCOVA:"
summary(fit.ancova)
##
## Call:
## svyglm(formula = PSYCH2 ~ sex * AGE + CHRON + PHYS, design = ohc.design,
## family = binomial(link = "logit"))
##
## Survey design:
## svydesign(ids = ~CLUSTER, data = ohc, strata = ~STRATUM, nest = TRUE)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.196442 0.157232 1.249 0.2127
## sex1 -0.992571 0.203339 -4.881 1.92e-06 ***
## AGE -0.004560 0.004060 -1.123 0.2624
## CHRON 0.564103 0.057469 9.816 < 2e-16 ***
## PHYS 0.276471 0.059596 4.639 5.76e-06 ***
## sex1:AGE 0.013081 0.005111 2.559 0.0111 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1.00009)
##
## Number of Fisher Scoring iterations: 4
print("Odds ratio estimates and 95% confidence intervals for the final logistic ANCOVA model:")
## [1] "Odds ratio estimates and 95% confidence intervals for the final logistic ANCOVA model:"
exp(cbind(coefficients(fit.ancova), confint(fit.ancova)))
## 2.5 % 97.5 %
## (Intercept) 1.2170647 0.8942882 1.6563412
## sex1 0.3706225 0.2487996 0.5520953
## AGE 0.9954501 0.9875610 1.0034022
## CHRON 1.7578697 1.5706133 1.9674517
## PHYS 1.3184692 1.1731185 1.4818289
## sex1:AGE 1.0131668 1.0030681 1.0233671
Please compare the results with the logit ANOVA exercise (KEY 277). What conclusions can be drawn?
Please first download the reduced version of the OHC data set (only relevant variables are included) and save it to the disk.
Then download the enclosed SAS code for your own further training of model fitting by logistic ANCOVA.
NOTE! You need to have access to SAS and SUDAAN in your computer.
Example 8.4. In this Training Key we concentrate on performing design-based linear ANCOVA (Analysis of Covariance) modelling with the weighted least squares WLS method. We fit a linear ANCOVA model by entering in the model some of the predictors as continuous measurements and some as discrete variables. The data we are using are again from the OHC survey.
In this part we will study how to perform design-based linear ANCOVA modelling with the PML method for a continuous response in the case of three continuous predictors, one discrete predictor and one continuous interaction term. Instructions will be given once you start.
A linear ANCOVA model is fitted on the original continuous variable PSYCH, whose values are scores of the first standardized principal component of nine psychic symptoms. Thus, the average of PSYCH is zero and the variance is one. We include one qualitative variable (SEX) and three continuous predictors (AGE, PHYS and CHRON), and also the pair-wise interaction of SEX and AGE. Thus, the fitted model is:
PSYCH = Intercept + SEX + AGE + PHYS + CHRON + SEX*AGE.
fit.lin.ancova <- svyglm(PSYCH ~ sex * AGE + CHRON + PHYS, family=gaussian(), deff=TRUE, ohc.design)
print("Results for Linear ANCOVA:")
## [1] "Results for Linear ANCOVA:"
summary(fit.lin.ancova)
##
## Call:
## svyglm(formula = PSYCH ~ sex * AGE + CHRON + PHYS, design = ohc.design,
## family = gaussian(), deff = TRUE)
##
## Survey design:
## svydesign(ids = ~CLUSTER, data = ohc, strata = ~STRATUM, nest = TRUE)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.208e-02 8.315e-02 -0.145 0.8846
## sex1 -4.975e-01 9.968e-02 -4.991 1.16e-06 ***
## AGE -5.224e-05 2.122e-03 -0.025 0.9804
## CHRON 3.922e-01 2.941e-02 13.334 < 2e-16 ***
## PHYS 1.772e-01 2.899e-02 6.114 3.92e-09 ***
## sex1:AGE 5.712e-03 2.536e-03 2.252 0.0252 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.9419035)
##
## Number of Fisher Scoring iterations: 2
NOTE! The signs of model coefficients and the t-test results follow a similar pattern to those in corresponding logistic ANCOVA model in TRAINING KEY 288. However, model coefficients have different interpretations from those in the logistic model. In a logit ANCOVA we were working on a logit scale on the binary response whereas we are now dealing with continuous measurements. Therefore the results of the linear ANCOVA model can be interpreted in the usual linear regression context.
Please first download the reduced version of the OHC data set (only relevant variables are included) and save it to the disk.
Then download the enclosed SAS code for your own further training of a linear ANCOVA model fitting. NOTE! You need to have access to SAS and SUDAAN in your computer.
In this Training Key, accounting for the sampling design complexities is studied for multivariate survey analysis. The main concern is in the clustering effects. Design-based and model-based analysis options are compared empirically, using the appropriate SUDAAN and SAS analysis procedures, and R functions. We use the OHC Survey data set in our analyses.
For a continuous study variable, modeling takes place under a fixed-effects linear model and under a mixed linear model.
The continuous variable PSYCH is the first principal component of nine psychic symptoms (explaining 42 % of the total variation of the symptoms). The variable PSYCH is standardized for mean zero and variance one. Psychic strain increases with increasing the value of PSYCH. The overall design effect of PSYCH is about 2 indicating substantial clustering effect.
Our main explanatory variable is CHRON (Chronic morbidity). Age and sex are used as additional predictors in the models; we calculate sex-age adjusted figures.
library(nlme)
# PSYCH = Intercept + AGE + SEX + CHRON
# For R functions, the variable CLUSTER is transformed to a new variable SUBJECT having values 1,2,...,250.
print("Results for Linear mixed-effects model:")
## [1] "Results for Linear mixed-effects model:"
summary(lme(PSYCH ~ AGE + SEX + CHRON, random = ∼1|SUBJECT, data = ohc))
## Linear mixed-effects model fit by REML
## Data: ohc
## AIC BIC logLik
## 21842.84 21884.64 -10915.42
##
## Random effects:
## Formula: ~1 | SUBJECT
## (Intercept) Residual
## StdDev: 0.1462131 0.9638634
##
## Fixed effects: PSYCH ~ AGE + SEX + CHRON
## Value Std.Error DF t-value p-value
## (Intercept) -0.5395868 0.05415573 7588 -9.963614 0.0000
## AGE 0.0025240 0.00109003 7588 2.315554 0.0206
## SEX 0.2319722 0.02392106 7588 9.697405 0.0000
## CHRON 0.3967436 0.02516863 7588 15.763412 0.0000
## Correlation:
## (Intr) AGE SEX
## AGE -0.714
## SEX -0.627 -0.007
## CHRON 0.069 -0.275 0.003
##
## Standardized Within-Group Residuals:
## Min Q1 Med Q3 Max
## -1.6533735 -0.7564286 -0.2508930 0.5381757 4.9672219
##
## Number of Observations: 7841
## Number of Groups: 250
library(gee)
# PSYCH = Intercept + AGE + SEX + CHRON
results.gee <- gee(PSYCH ~ AGE + SEX + CHRON, id = SUBJECT, data = ohc,
family = gaussian, corstr = 'exchangeable')
print("Results:")
[1] “Results:”
library(stargazer)
stargazer(results.gee, type='html')
Dependent variable: | |
PSYCH | |
AGE | 0.003** |
(0.001) | |
SEX | 0.233*** |
(0.027) | |
CHRON | 0.397*** |
(0.029) | |
Constant | -0.542*** |
(0.059) | |
Observations | 7,841 |
Note: | p<0.1; p<0.05; p<0.01 |
For a binary study variable, modeling takes place under a fixed-effects logistic model and under a mixed logistic model. For a fixed-effects logistic model, GEE estimation is used with the SUDAAN procedure LOGISTIC (RLOGIST), the SAS procedure GENMOD and the R function gee. For a mixed logistic model, REML estimation is used with the SAS macro GLIMMIX and the R function glmmPQL. Further instructions will be given once you start.
Please download the SAS code for your own further training (NOTE! you need to have access to SUDAAN (a SAS callable version)).
Further instructions are given in the code once you download.