Concepts

ANCOVA

Analysis of covariance is used to test the main and interaction effects of categorical variables on a continuous dependent variable, controlling for the effects of selected other continuous variables which covary with the dependent variable. The control variable is called the “covariate.” There may be more than one covariate. One may also perform planned or post hoc comparisons to see which values of a factor contribute most to the explanation of the dependent variable.

See chapter 8 in Lehtonen, R. & Pahkinen, E. (2004)

Auxiliary Variable

The auxiliary variable represents additional information on the finite population and is usually assumed known for all the N population elements.

See page 16 in Lehtonen, R. & Pahkinen, E. (2004)

The BOOT Technique

Similar to the other sample reuse methods, the bootstrap can be used for variance approximation of a non-linear estimator under a complex sampling design. The method, however, differs from BRR and JRR in many respects; in particular, the generation of pseudosamples is quite different. We consider the bootstrap technique for variance estimation of a ratio estimator under a two-stage stratified epsem design where a constant number of clusters (which may be greater than two) is drawn with replacement from each stratum.
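
The resampling idea can be sketched as follows. This is a minimal illustration, not the book's exact procedure: it assumes the naive bootstrap (as many clusters are redrawn with replacement in each stratum as were sampled), and the function name, the `(y_total, x_total)` cluster layout and the data are hypothetical.

```python
import random

def bootstrap_variance_of_ratio(strata, replicates=500, seed=1):
    """Monte Carlo bootstrap variance of r = sum(y) / sum(x).

    strata: list of strata, each a list of clusters given as (y_total, x_total).
    Assumption: naive bootstrap, redrawing len(stratum) clusters with replacement.
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(replicates):
        y_sum = x_sum = 0.0
        for clusters in strata:
            # Redraw the sample clusters of this stratum with replacement.
            for y_c, x_c in rng.choices(clusters, k=len(clusters)):
                y_sum += y_c
                x_sum += x_c
        estimates.append(y_sum / x_sum)
    mean = sum(estimates) / len(estimates)
    return sum((e - mean) ** 2 for e in estimates) / (len(estimates) - 1)

# Hypothetical data: two strata with three sample clusters each.
example_strata = [
    [(10.0, 20.0), (12.0, 20.0), (9.0, 18.0)],
    [(8.0, 16.0), (6.0, 16.0), (7.0, 15.0)],
]
v_boot = bootstrap_variance_of_ratio(example_strata)
print(f"bootstrap variance estimate: {v_boot:.6f}")
```

Fixing the seed makes the pseudosamples reproducible; the number of replicates is a tuning choice rather than part of the method.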

See page 160 in Lehtonen, R. & Pahkinen, E. (2004)

The BRR Technique

The technique of balanced half-samples (BHS, BRR) was introduced by McCarthy (1966, 1969) for variance approximation of a nonlinear estimator under an epsem design, where a large number of strata are formed and exactly two clusters are drawn with replacement from each stratum. The way of forming pseudosamples in the BRR technique starts from the fact that, with H strata and \(m_h = 2\) sample clusters per stratum, the total sample can be split into \(2^H\) overlapping half-samples each with \(H\) sample clusters.
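
In practice only a balanced subset of the \(2^H\) half-samples is used. The sketch below selects that subset with a Hadamard matrix and computes the variance of a ratio estimator from the half-sample estimates. It is an illustration under assumptions: the matrix order is taken as the smallest power of two greater than \(H\) (Sylvester construction), the variance is the mean squared deviation of the half-sample ratios from the full-sample ratio, and all names and data are hypothetical.

```python
def sylvester_hadamard(order):
    """Hadamard matrix with +1/-1 entries; order must be a power of two."""
    matrix = [[1]]
    while len(matrix) < order:
        matrix = ([row + row for row in matrix]
                  + [row + [-v for v in row] for row in matrix])
    return matrix

def brr_variance_of_ratio(strata):
    """strata: list of H pairs of clusters, each cluster given as (y_total, x_total)."""
    n_strata = len(strata)
    order = 1
    while order <= n_strata:  # smallest power of two strictly greater than H
        order *= 2
    y_all = sum(c[0] for pair in strata for c in pair)
    x_all = sum(c[1] for pair in strata for c in pair)
    r_full = y_all / x_all
    squared_deviations = []
    for row in sylvester_hadamard(order):
        # Column h + 1 skips the all-ones first column of the matrix.
        picked = [pair[0] if row[h + 1] == 1 else pair[1]
                  for h, pair in enumerate(strata)]
        r_half = sum(c[0] for c in picked) / sum(c[1] for c in picked)
        squared_deviations.append((r_half - r_full) ** 2)
    return r_full, sum(squared_deviations) / len(squared_deviations)

# Hypothetical data: H = 3 strata, two sample clusters per stratum.
strata = [((10, 20), (12, 20)), ((8, 16), (6, 16)), ((15, 30), (15, 30))]
r_hat, v_brr = brr_variance_of_ratio(strata)
print(r_hat, v_brr)
```

With \(H = 3\) the sketch uses 4 half-samples instead of all \(2^3 = 8\).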

See pages 150 ff. in Lehtonen, R. & Pahkinen, E. (2004)

Cluster sampling (CLU)

The population is assumed to be readily divided into naturally formed subgroups called clusters. Subgroups often used in practice are, for example, clusters of pupils in schools and clusters of people in households. A sample of clusters is drawn from the population of clusters. If the clusters are internally homogeneous, which is usually the case, then cluster sampling is less efficient than simple random sampling.

See pages 17, 60 and 70 in Lehtonen, R. & Pahkinen, E. (2004)

Coefficient of variation

For an estimator \(\hat{t}\) of the total \(T\), for example, the coefficient of variation is

\(c.v(\hat{t}) = s.e(\hat{t}) / \hat{t}\).

It measures the relative precision, and hence the stability, of the estimator.
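
As a small numerical illustration, the formula can be evaluated directly from a point estimate and its estimated variance. The function name and the numbers are hypothetical.

```python
import math

def coefficient_of_variation(estimate, variance_estimate):
    """Return c.v = s.e / estimate, with s.e the square root of the variance estimate."""
    standard_error = math.sqrt(variance_estimate)
    return standard_error / estimate

# Hypothetical numbers: estimated total 20000 with estimated variance 1000000.
cv = coefficient_of_variation(20000.0, 1_000_000.0)
print(f"c.v = {cv:.3f}")  # s.e = 1000, so c.v = 0.050
```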

See page 29 in Lehtonen, R. & Pahkinen, E. (2004)

Design effect DEFF

A convenient way to evaluate a sampling design is to compare the design variance of an estimator to the design variance from a reference sampling scheme of the same (expected) sample size. Usually, simple random sampling without replacement is chosen as the reference. For example, for an estimator \(\hat{t}\) of the total \(T\), the ratio of the two design variances is called the design effect and abbreviated to DEFF. In practice, an estimate of the design effect is calculated using the corresponding variance estimators for the sample data set. An estimator of the design effect is thus \(deff_{p(s)}(\hat{t}) = \hat{v}_{p(s)}(\hat{t})/\hat{v}_{srs}(\hat{t})\), where \(p(s)\) refers to the actual sampling design.
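
A minimal sketch of the calculation follows. It assumes the usual SRSWOR variance estimator of a total, \(N^2 (1 - n/N) s^2 / n\), for the denominator; the design-based variance estimate in the numerator is simply a made-up number here, and all names are hypothetical.

```python
from statistics import variance

def srswor_variance_of_total(sample, population_size):
    """SRSWOR variance estimator of a total: N^2 * (1 - n/N) * s^2 / n."""
    n = len(sample)
    return population_size ** 2 * (1 - n / population_size) * variance(sample) / n

def design_effect(v_design, v_srs):
    """Estimated design effect: design-based variance over SRS variance."""
    return v_design / v_srs

sample = [4.0, 6.0, 8.0, 10.0, 12.0]          # hypothetical sample values, s^2 = 10
v_srs = srswor_variance_of_total(sample, 50)  # 2500 * 0.9 * 10 / 5 = 4500
v_design = 9000.0                             # hypothetical design-based estimate
deff = design_effect(v_design, v_srs)
print(f"deff = {deff:.2f}")
```

A value above 1, as here, indicates that the actual design is less efficient than simple random sampling of the same size for this estimator.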

See page 15 in Lehtonen, R. & Pahkinen, E. (2004)

Design-based approach

When a survey is analysed in practice, it is emphasized that the estimation should take into account the structure of the sampling scheme. To accomplish this, the analysis is carried out using the so-called design-based approach. An essential property of the design-based approach is that any of the complexities due to the sampling scheme can be properly accounted for in the estimation. These complexities can arise, for example, when elements have unequal selection probabilities.

See page 10 in Lehtonen, R. & Pahkinen, E. (2004)

Generalized estimating equations (GEE)

A preliminary GEE method with an independent correlation assumption relates to the standard PML method where observations are assumed independent within clusters for the estimation of the regression coefficients, but are allowed to be correlated for the estimation of the covariance matrix of the estimated regression coefficients. In a more advanced GEE method, assuming an exchangeable correlation structure, observations are allowed to be correlated within clusters in the estimation of both regression coefficients and the covariance matrix of estimated regression coefficients.

See pages 287-289 in Lehtonen, R. & Pahkinen, E. (2004)

Generalized regression estimator (GREG)

Under a design-based approach for domain estimation, auxiliary data are usually incorporated in an estimation procedure by model-assisted techniques. Linear regression models are often adopted in the construction of generalized regression estimators (GREG; Särndal et al. 1992; Estevao et al. 1995) that are nearly design-unbiased. GREG estimators can often be expected to be precise for domains that are large enough, but in the smallest domains they tend to have a large variance. Thus, small domains seem to call for model-dependent (or composite) estimators.

See page 97 in Lehtonen, R. & Pahkinen, E. (2004)

Generalized Weighted Least Squares (GWLS)

The GWLS method of generalized weighted least squares estimation provides a simple technique for the analysis of categorical data with ANOVA-type logit and linear models on domain proportions. Allowing all the complexities of a sampling design including stratification, clustering and weighting, the design-based option provides a generally valid GWLS analysis. Analysis under the weighted or unweighted SRS options assuming simple random sampling serves as a reference when studying the effects of clustering and weighting on results. This simple noniterative method will be discussed in Section 8.3 for logit and linear modelling of categorical data. The GWLS method was introduced in Grizzle et al. (1969) and Koch et al. (1975).

See pages 269 ff. in Lehtonen, R. & Pahkinen, E. (2004)

Horvitz-Thompson estimator

As described in Horvitz and Thompson (1952), the Horvitz-Thompson estimator is an unbiased estimator of the total of a finite population, applicable in the general case where individuals are sampled with unequal probabilities.
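
The estimator weights each sampled value by the inverse of its inclusion probability, \(\hat{t}_{HT} = \sum_{k \in s} y_k / \pi_k\). A minimal sketch, with hypothetical names and data:

```python
def horvitz_thompson_total(y, pi):
    """Sum of sampled values, each weighted by 1 / (its inclusion probability)."""
    return sum(y_k / pi_k for y_k, pi_k in zip(y, pi))

# Hypothetical sample of three elements with unequal inclusion probabilities.
y = [10.0, 20.0, 30.0]
pi = [0.1, 0.2, 0.5]
t_ht = horvitz_thompson_total(y, pi)  # 100 + 100 + 60
print(f"HT estimate of the total: {t_ht:.1f}")
```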

See pages 25, 118 in Lehtonen, R. & Pahkinen, E. (2004)

Imputation

Imputation for item nonresponse means that a missing value of a measurement \(y_k\) is filled in by a predicted value \(\hat{y}_k\). The goal of imputation is to achieve a complete data matrix for further analysis. Imputation can be performed under single or multiple imputation methods.
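
As a deliberately simple illustration of single imputation, the sketch below uses the respondent mean as the predicted value \(\hat{y}_k\). This choice of predictor is an assumption made for the example only; regression or donor-based predictions would fit the same pattern. Missing values are represented by `None`.

```python
def impute_with_respondent_mean(values):
    """Fill each missing value (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    respondent_mean = sum(observed) / len(observed)
    return [respondent_mean if v is None else v for v in values]

completed = impute_with_respondent_mean([3.0, None, 5.0, None, 7.0])
print(completed)  # [3.0, 5.0, 5.0, 5.0, 7.0]
```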

See pages 115 and 122 in Lehtonen, R. & Pahkinen, E. (2004)

The JRR Technique

The particular jackknife method based on jackknife repeated replications has many features of the BRR technique, since only the method of forming the pseudosamples is different. We consider the JRR technique in the simplest case where the number of sample clusters per stratum is exactly two, and the clusters are assumed to be drawn with replacement. JRR variance estimators are derived for a ratio estimator, which is a subpopulation proportion or mean estimator.
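
For the two-clusters-per-stratum case, a pseudosample can be formed by dropping one cluster in a stratum and counting the other one twice. The sketch below uses one common variant of the variance estimator, which averages over both ways of dropping a cluster in each stratum; it is not necessarily the variant preferred in the book, and the names and data are hypothetical.

```python
def jrr_variance_of_ratio(strata):
    """strata: list of H pairs of clusters, each cluster given as (y_total, x_total)."""
    y_all = sum(c[0] for pair in strata for c in pair)
    x_all = sum(c[1] for pair in strata for c in pair)
    r_full = y_all / x_all
    total = 0.0
    for pair in strata:
        for kept in (0, 1):
            # Drop one cluster of this stratum and count the kept one twice.
            y_rep = y_all - pair[0][0] - pair[1][0] + 2 * pair[kept][0]
            x_rep = x_all - pair[0][1] - pair[1][1] + 2 * pair[kept][1]
            total += (y_rep / x_rep - r_full) ** 2
    # Average over the two complementary pseudosamples in each stratum.
    return r_full, total / 2

# Hypothetical data: H = 3 strata, two sample clusters per stratum.
jrr_strata = [((10, 20), (12, 20)), ((8, 16), (6, 16)), ((15, 30), (15, 30))]
r_jrr, v_jrr = jrr_variance_of_ratio(jrr_strata)
print(r_jrr, v_jrr)
```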

See page 156 in Lehtonen, R. & Pahkinen, E. (2004)

Linearization method

In the estimation of the variance of a general non-linear estimator, we adopt the method based on the so-called Taylor series expansion. The method is usually called the linearization method because we first reduce the original non-linear quantity to an approximate linear quantity by using the linear terms of the corresponding Taylor series expansion, and then construct the variance formula and an estimator of the variance of this linearized quantity.
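
As a worked illustration, consider a ratio estimator \(\hat{r} = \hat{t}_y / \hat{t}_x\) of \(R = T_y / T_x\). The symbols \(\hat{t}_y\), \(\hat{t}_x\), \(T_y\) and \(T_x\) are introduced here for the example only. Keeping the linear terms of the Taylor series expansion around \((T_y, T_x)\) gives

\[
\hat{r} \approx R + \frac{1}{T_x}\left(\hat{t}_y - R\,\hat{t}_x\right),
\]

so that the variance of the linearized quantity, with the unknown parameters replaced by their estimates, leads to the variance estimator

\[
\hat{v}(\hat{r}) = \frac{1}{\hat{t}_x^{2}}\left[\hat{v}(\hat{t}_y) + \hat{r}^{2}\,\hat{v}(\hat{t}_x) - 2\,\hat{r}\,\widehat{\mathrm{cov}}(\hat{t}_y, \hat{t}_x)\right].
\]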

See page 141 in Lehtonen, R. & Pahkinen, E. (2004)

The Mini-Finland Health Survey (MFH)

The Mini-Finland Health Survey was designed to obtain a comprehensive picture of health and the need for care in Finnish adults, and to develop methods for monitoring health in the population. A two-stage stratified cluster-sampling design was used in such a way that one cluster was sampled from each of the 40 geographical strata. The one-cluster-per-stratum design was used to attain a deep stratification of the population of the clusters. The sample of 8000 persons was allocated to achieve an epsem sample (equal probability of selection method; see Section 3.2). The collapsed stratum technique was used in variance estimation with linearization and sample re-use methods.

See pages 132-137 and 146 in Lehtonen, R. & Pahkinen, E. (2004)

Model-assisted estimation

Model-assisted estimation refers to the property of the estimators that models such as linear regression are used in incorporating the auxiliary information in the estimation procedure for the finite-population parameters of interest, such as totals.

See page 87 in Lehtonen, R. & Pahkinen, E. (2004)

Nonresponse

Failure to obtain all the intended measurements or responses from all the selected sample members is called nonresponse. Nonresponse causes missing data, i.e. results in a data set whose size for the study variable y is smaller than planned. Two types of missing data (unit nonresponse and item nonresponse) are distinguished for a sample element.

See pages 112-113 in Lehtonen, R. & Pahkinen, E. (2004)

The Occupational Health Care Survey (OHC)

The sampling design of the OHC Survey is an example of stratified cluster sampling in which both one-stage and two-stage sampling are used. Thus, the OHC Survey sampling design is slightly more complex than that of the MFH Survey. Moreover, in the OHC Survey sampling design a large number (250) of sample clusters are available, and the design produces noticeable clustering effects for several response variables. Therefore, this sampling design is very suitable for examining the effects of clustering in the analysis of complex surveys. The OHC Survey will be used for further examples given in Chapters 7 and 8.

See pages 166-171 and 180 in Lehtonen, R. & Pahkinen, E. (2004)

Planned domain structure

For planned domains, the domain sample sizes are fixed in advance by stratification. Stratified sampling in connection with a suitable allocation scheme is often used in practical applications.

See pages 188-190 in Lehtonen, R. & Pahkinen, E. (2004)

Pseudo Maximum Likelihood (PML)

The PML method of pseudolikelihood estimation can be used in analysis situations similar to the GWLS method, but its main applications are in logistic regression with continuous predictors. Under the design-based option, the PML method provides valid logit analysis for complex surveys. It is also beneficial for the PML method that the number of sample clusters is large, and similar adjustments are available for unstable cases, as for the GWLS method. The PML method covers not only logistic regression models but also other model types from the class of generalized linear models. So, linear models on continuous responses are also covered.

See page 261 in Lehtonen, R. & Pahkinen, E. (2004)

Poststratification

Poststratification can be used for improvement of efficiency of an estimator if a discrete auxiliary variable is available. This variable is used to stratify the sample data set after the sample has been selected.
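
A minimal sketch of a poststratified estimator of a total follows. It assumes a simple random sample and known population counts \(N_g\) for each poststratum, and estimates the total as \(\sum_g N_g \bar{y}_g\); the function name, group labels and data are hypothetical.

```python
from collections import defaultdict

def poststratified_total(y, groups, population_counts):
    """Sum over poststrata of (known population count) * (sample mean in the poststratum)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for y_k, g in zip(y, groups):
        sums[g] += y_k
        counts[g] += 1
    return sum(population_counts[g] * sums[g] / counts[g] for g in counts)

# Hypothetical sample classified into poststrata "a" and "b" after selection.
y = [2.0, 4.0, 10.0, 14.0]
groups = ["a", "a", "b", "b"]
t_post = poststratified_total(y, groups, {"a": 60, "b": 40})  # 60 * 3 + 40 * 12
print(f"poststratified estimate: {t_post:.1f}")
```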

See pages 88-90 in Lehtonen, R. & Pahkinen, E. (2004)

PPS sampling

Sampling with probability proportional to size (PPS) provides a practical technique when sampling from populations with large variation in the values of the study variable, and often gives a considerable gain in efficiency. Efficiency varies considerably according to the type of parameter to be estimated. Under PPS sampling an auxiliary size measure \(z\) must be available, and for efficient estimation the size measure should be strongly related to the study variable \(y\). A condition for this is that the variable pair \((z, y^2/z)\) is positively correlated. The ratio \(Y_k / Z_k\) should also remain nearly constant over the population.
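
The role of the ratio \(Y_k / Z_k\) can be seen in a small sketch of PPS sampling with replacement, using the Hansen-Hurwitz estimator \(\frac{1}{n}\sum y_k / p_k\) with \(p_k = Z_k / T_z\). The with-replacement scheme is an assumption made for simplicity, and the names and data are hypothetical. When the ratio is exactly constant, every draw gives the same value \(y_k / p_k\), so the estimate equals the true total whatever sample is drawn.

```python
import random

def pps_wr_estimate_total(y, z, n, seed=0):
    """Draw n elements with replacement, with probability proportional to z,
    and return the Hansen-Hurwitz estimate of the total of y."""
    rng = random.Random(seed)
    t_z = sum(z)
    p = [z_k / t_z for z_k in z]                        # single-draw probabilities
    drawn = rng.choices(range(len(y)), weights=z, k=n)  # indices of drawn elements
    return sum(y[k] / p[k] for k in drawn) / n

# Hypothetical population where y_k = 2 * z_k, so Y_k / Z_k is constant.
z = [10.0, 20.0, 30.0, 40.0]
y_prop = [20.0, 40.0, 60.0, 80.0]
estimate = pps_wr_estimate_total(y_prop, z, n=2)
print(f"estimate: {estimate:.1f}, true total: {sum(y_prop):.1f}")
```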

See pages 49 to 61 in Lehtonen, R. & Pahkinen, E. (2004)

PPS sampling without replacement (PPSWOR)

When selecting without replacement, a new problem arises concerning the computation of inclusion probabilities. With the selection of the first element, the single-draw probability is exactly \(\pi_k = p_k = Z_k / T_z\). When the first sample element has been selected, the single-draw selection probability changes because the total \(T_z\) of the remaining \(N - 1\) elements in the population decreases.
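
The change in the draw probabilities can be illustrated numerically. In this sketch the second-draw probability of a remaining element \(k\) is taken as \(Z_k\) divided by the total size of the elements still in the population; the function name and the size values are hypothetical.

```python
def second_draw_probabilities(z, first):
    """Draw probabilities for the second draw after element `first` was selected."""
    remaining_total = sum(z) - z[first]
    return [0.0 if k == first else z_k / remaining_total
            for k, z_k in enumerate(z)]

z = [10.0, 20.0, 30.0, 40.0]                     # hypothetical size measures, T_z = 100
first_draw = [z_k / sum(z) for z_k in z]         # 0.1, 0.2, 0.3, 0.4
after = second_draw_probabilities(z, first=3)    # the remaining total is now 60
print(first_draw)
print(after)
```

Removing the largest element raises the draw probability of the third element from 0.3 to 0.5, which is why the inclusion probabilities are no longer simple multiples of \(Z_k / T_z\).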

See page 51 in Lehtonen, R. & Pahkinen, E. (2004)

PROVINCE’91 population

To illustrate the main ideas, a small data set under the title Province’91 has been taken from the official statistics of Finland. This data set will be used as a sampling frame in Chapters 2 to 4. Finland is divided into 14 provinces from which one has been selected for demonstration. This province comprises 32 municipalities and had a total population of 254 584 inhabitants on 31 December, 1991. The data set is presented in Table 2.1.

See pages 18-19 in Lehtonen, R. & Pahkinen, E. (2004)

Ratio Estimation of Population Total

Ratio estimation can be used to improve the efficiency of the estimation of \(T\), if a continuous auxiliary variable \(z\) is available. The population total \(T_z\) and the \(n\) sample values \(z_k\) of \(z\) are required for this method. This information can be used to improve the estimation of \(T\) by first calculating the sample estimator of the ratio \(R = T/T_z\) and multiplying it by the known total \(T_z\). Ratio estimation of the total can be very efficient if the ratio \(Y_k / Z_k\) of the values of the study and auxiliary variables is nearly constant across the population. Ratio estimators are usually effective but slightly biased.
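
The two steps described above (estimate the ratio, then multiply by the known total \(T_z\)) can be sketched as follows. The ratio is estimated here from Horvitz-Thompson totals of \(y\) and \(z\), which is one common choice; the function name and the data are hypothetical.

```python
def ratio_estimate_of_total(y, z, pi, t_z_known):
    """Known total of z times the estimated ratio of HT totals of y and z."""
    t_y_ht = sum(y_k / pi_k for y_k, pi_k in zip(y, pi))
    t_z_ht = sum(z_k / pi_k for z_k, pi_k in zip(z, pi))
    return t_z_known * t_y_ht / t_z_ht

# Hypothetical sample with equal inclusion probabilities 0.25 and known T_z = 100.
y = [12.0, 21.0, 27.0]
z = [6.0, 10.0, 14.0]
t_rat = ratio_estimate_of_total(y, z, [0.25] * 3, t_z_known=100.0)  # ratio 60 / 30 = 2
print(f"ratio estimate of the total: {t_rat:.1f}")
```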

See pages 93-95 in Lehtonen, R. & Pahkinen, E. (2004)

Regression estimation of totals

Regression estimation of the population total \(T\) of a study variable \(y\) is based on the linear regression between \(y\) and a continuous auxiliary variable \(z\). The linear regression can, for example, be given by \(E_M(y_k) = \alpha + \beta z_k\) with a variance \(V_M(y_k) = \sigma^2\), where \(y_k\) are independent random variables with the population values \(Y_k\) as their assumed realizations, \(\alpha\), \(\beta\) and \(\sigma^2\) are unknown parameters, \(Z_k\) are known population values of \(z\), and \(E_M\) and \(V_M\) refer respectively to the expectation and variance under the model. A regression estimator is given by

\(\hat{t}_{reg} = \hat{t}_{HT} + b(T_Z - \hat{t}_{ZHT})\) where \(\hat{t}_{HT}\) is the Horvitz-Thompson estimator of the total \(T\), \(b\) is the estimated slope coefficient, and \(T_Z\) and \(\hat{t}_{ZHT}\) are the known population total of \(z\) and the HT estimator of \(T_Z\), respectively.
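
A minimal sketch of this estimator is given below. The slope \(b\) is computed by least squares weighted with the inverse inclusion probabilities, which is an assumption of the example rather than the only possible choice; the function name and the data are hypothetical.

```python
def regression_estimate_of_total(y, z, pi, t_z_known):
    """t_reg = t_HT + b * (T_Z - t_ZHT), with b a weighted least squares slope."""
    w = [1.0 / pi_k for pi_k in pi]
    t_y_ht = sum(w_k * y_k for w_k, y_k in zip(w, y))
    t_z_ht = sum(w_k * z_k for w_k, z_k in zip(w, z))
    n_hat = sum(w)  # estimated population size
    y_bar, z_bar = t_y_ht / n_hat, t_z_ht / n_hat
    numerator = sum(w_k * (z_k - z_bar) * (y_k - y_bar)
                    for w_k, z_k, y_k in zip(w, z, y))
    denominator = sum(w_k * (z_k - z_bar) ** 2 for w_k, z_k in zip(w, z))
    b = numerator / denominator
    return t_y_ht + b * (t_z_known - t_z_ht)

# Hypothetical sample with y_k = 1 + 2 * z_k, inclusion probabilities 0.1, T_Z = 110.
z = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]
t_reg = regression_estimate_of_total(y, z, [0.1] * 4, t_z_known=110.0)
print(f"regression estimate of the total: {t_reg:.1f}")  # 240 + 2 * (110 - 100)
```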

See pages 97-101 in Lehtonen, R. & Pahkinen, E. (2004)

Simple Random Sampling Without Replacement (SRSWOR)

This forms the basis of many other more complex sampling plans and is the ‘gold standard’ against which all other sampling plans are compared. It often happens that more complex sampling plans consist of a series of simple random samples that are combined in a complex fashion.

In this design, once the frame of units has been enumerated, a sample of size n is selected without replacement from the N population units.
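
With an enumerated frame, the selection can be sketched with Python's standard `random` module, whose `sample` method draws without replacement. The frame of 32 unit labels and the function name are hypothetical.

```python
import random

def srswor(frame, n, seed=None):
    """Select n distinct units from the frame, each subset of size n equally likely."""
    return random.Random(seed).sample(frame, n)

frame = list(range(1, 33))          # hypothetical frame of N = 32 unit labels
sample = srswor(frame, 8, seed=42)  # n = 8, so each unit has inclusion probability 8/32
print(sorted(sample))
```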

See pages 22 ff. in Lehtonen, R. & Pahkinen, E. (2004)

Stratified sampling (STR)

In stratified sampling, the target population is divided into non-overlapping subpopulations called strata. These are conceptually regarded as separate populations in which sampling can be performed independently. To carry out stratification, appropriate auxiliary information is required in the sampling frame. Regional, demographic and socioeconomic variables are often used as the stratifying auxiliary variables.

See pages 16 and 61 in Lehtonen, R. & Pahkinen, E. (2004)

Systematic sampling (SYS)

Systematic sampling is one of the most frequently used sample selection techniques. A list of population elements or a computerized register serves as the selection frame from which every \(q\)th element can be systematically selected. Systematic sampling may in some cases be more efficient than simple random sampling.
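
A minimal sketch of one-random-start systematic selection follows. It assumes that the frame size is a multiple of the sample size, so that the interval is a whole number; the function name and the frame are hypothetical.

```python
import random

def systematic_sample(frame, n, seed=None):
    """Pick a random start in 1..q and then every q-th element of the frame."""
    q = len(frame) // n                        # sampling interval (assumes N is a multiple of n)
    start = random.Random(seed).randint(1, q)  # 1-based position of the first selection
    return [frame[i - 1] for i in range(start, len(frame) + 1, q)][:n]

frame = list(range(1, 21))  # hypothetical frame of N = 20 unit labels
sys_sample = systematic_sample(frame, 5, seed=3)
print(sys_sample)
```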

See pages 16 and 37 in Lehtonen, R. & Pahkinen, E. (2004)

Systematic PPS sampling (PPSSYS)

In this method, the properties of systematic sampling and sampling proportional to size are combined into a single sampling scheme. In ordinary systematic sampling, the sampling interval is determined by the quotient \(q = N/n\). In systematic PPS sampling, the sampling interval is given by \(q = T_z/n\). As in the ordinary one-random-start systematic sampling, we first select a random number from the closed interval \([1, q]\).
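
The selection can be sketched with cumulative sizes: the random start and every \(q\)th number after it are located in the cumulated size measures, and the element whose interval contains the number is selected. The sketch assumes integer size measures, a total that is a multiple of \(n\), and no element larger than the interval \(q\) (otherwise an element could be selected twice); the function name and the sizes are hypothetical.

```python
import random
from bisect import bisect_left
from itertools import accumulate

def systematic_pps_sample(z, n, seed=None):
    """Return the indices of the n elements hit by start, start + q, start + 2q, ..."""
    q = sum(z) // n                            # sampling interval (assumes sum(z) is a multiple of n)
    start = random.Random(seed).randint(1, q)  # random integer from the closed interval [1, q]
    cumulative = list(accumulate(z))           # upper end of each element's size interval
    points = [start + j * q for j in range(n)]
    # Element k is selected when cumulative[k - 1] < point <= cumulative[k].
    return [bisect_left(cumulative, point) for point in points]

sizes = [10, 20, 30, 40]                       # hypothetical size measures, total 100
selected = systematic_pps_sample(sizes, n=2, seed=7)
print(selected)
```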

See pages 51-52 in Lehtonen, R. & Pahkinen, E. (2004)

Unit Nonresponse

All intended measurements for a selected sample element could be missing, e.g. owing to a refusal to participate in a personal interview. Unit nonresponse results in a sample data set whose size \(n_{(r)}\) is smaller than the intended sample size \(n\), thus increasing the standard errors of the estimates.

See pages 113-114 in Lehtonen, R. & Pahkinen, E. (2004)

Unplanned domain structure

A domain is called unplanned if the domain sample size \(n_{sd}\) is not fixed in the sampling design. This is the case when the desired domain structure is not a part of the sampling design. Thus, the domain sample sizes are random quantities, which introduces an increase in the variance estimates of domain estimators. In addition, an extremely small number (even zero) of sample elements in a domain can be realized in this case, if the domain size in the population is small.

See pages 188-190 in Lehtonen, R. & Pahkinen, E. (2004)