Complex Surveys: A Guide to Analysis Using R
Thomas Lumley
University of Washington
Department of Biostatistics
Seattle, Washington
Acknowledgments
Preface
Acronyms
1 Basic Tools
1.1 Goals of inference
1.1.1 Population or process?
1.1.2 Probability samples
1.1.3 Sampling weights
1.1.4 Design effects
1.2 An introduction to the data
1.2.1 Real surveys
1.2.2 Populations
1.3 Obtaining the software
1.3.1 Obtaining R
1.3.2 Obtaining the survey package
1.4 Using R
1.4.1 Reading plain text data
1.4.2 Reading data from other packages
1.4.3 Simple computations
Exercises
2 Simple and Stratified sampling
2.1 Analyzing simple random samples
2.1.1 Confidence intervals
2.1.2 Describing the sample to R
2.2 Stratified sampling
2.3 Replicate weights
2.3.1 Specifying replicate weights to R
2.3.2 Creating replicate weights in R
2.4 Other population summaries
2.4.1 Quantiles
2.4.2 Contingency tables
2.5 Estimates in subpopulations
2.6 Design of stratified samples
Exercises
3 Cluster sampling
3.1 Introduction
3.1.1 Why clusters: the NHANES II design
3.1.2 Single-stage and multistage designs
3.2 Describing multistage designs to R
3.2.1 Strata with only one PSU
3.2.2 How good is the single-stage approximation?
3.2.3 Replicate weights for multistage samples
3.3 Sampling by size
3.3.1 Loss of information from sampling clusters
3.4 Repeated measurements
Exercises
4 Graphics
4.1 Why is survey data different?
4.2 Plotting a table
4.3 One continuous variable
4.3.1 Graphs based on the distribution function
4.3.2 Graphs based on the density
4.4 Two continuous variables
4.4.1 Scatterplots
4.4.2 Aggregation and smoothing
4.4.3 Scatterplot smoothers
4.5 Conditioning plots
4.6 Maps
4.6.1 Design and estimation issues
4.6.2 Drawing maps in R
Exercises
5 Ratios and linear regression
5.1 Ratio estimation
5.1.1 Estimating ratios
5.1.2 Ratios for subpopulation estimates
5.1.3 Ratio estimators of totals
5.2 Linear regression
5.2.1 The least-squares slope as an estimated population summary
5.2.2 Regression estimation of population totals
5.2.3 Confounding and other criteria for model choice
5.2.4 Linear models in the survey package
5.3 Is weighting needed in regression models?
Exercises
6 Categorical data regression
6.1 Logistic regression
6.1.1 Relative risk regression
6.2 Ordinal regression
6.2.1 Other cumulative link models
6.3 Loglinear models
6.3.1 Choosing models
6.3.2 Linear association models
Exercises
7 Post-stratification, raking and calibration
7.1 Introduction
7.2 Post-stratification
7.3 Raking
7.4 Generalized raking, GREG estimation, and calibration
7.4.1 Calibration in R
7.5 Basu’s elephants
7.6 Selecting auxiliary variables for non-response
7.6.1 Direct standardization
7.6.2 Standard error estimation
Exercises
8 Two-phase sampling
8.1 Multistage and multiphase sampling
8.2 Sampling for stratification
8.3 The case-control design
8.3.1 * Simulations: efficiency of the design-based estimator
8.3.2 Frequency matching
8.4 Sampling from existing cohorts
8.4.1 Logistic regression
8.4.2 Two-phase case-control designs in R
8.4.3 Survival analysis
8.4.4 Case-cohort designs in R
8.5 Using auxiliary information from phase one
8.5.1 Population calibration for regression models
8.5.2 Two-phase designs
8.5.3 Some history of the two-phase calibration estimator
Exercises
9 Missing data
9.1 Item non-response
9.2 Two-phase estimation for missing data
9.2.1 Calibration for item non-response
9.2.2 Models for response probability
9.2.3 Effect on precision
9.2.4 * Doubly-robust estimators
9.3 Imputation of missing data
9.3.1 Describing multiple imputations to R
9.3.2 Example: NHANES III imputations
Exercises
10 * Causal inference
10.1 IPTW estimators
10.1.1 Randomized trials and calibration
10.1.2 Estimated weights for IPTW
10.1.3 Double robustness
10.2 Marginal Structural Models
Appendix A: Analytic Details
A.1 Asymptotics
A.1.1 Embedding in an infinite sequence
A.1.2 Asymptotic unbiasedness
A.1.3 Asymptotic normality and consistency
A.2 Variances by linearization
A.2.1 Subpopulation inference
A.3 Tests in contingency tables
A.4 Multiple imputation
A.5 Calibration and influence functions
A.6 Calibration in randomized trials and ANCOVA
Appendix B: Basic R
B.1 Reading data
B.1.1 Plain text data
B.2 Data manipulation
B.2.1 Merging
B.2.2 Factors
B.3 Randomness
B.4 Methods and objects
B.5 * Writing functions
B.5.1 Repetition
B.5.2 Strings
Appendix C: Computational details
C.1 Linearization
C.1.1 Generalized linear models and expected information
C.2 Replicate weights
C.2.1 Choice of estimators
C.2.2 Hadamard matrices
C.3 Scatterplot smoothers
C.4 Quantiles
C.5 Bug reports and feature requests
Appendix D: Database-backed design objects
D.1 Large data
D.2 Setting up database interfaces
D.2.1 ODBC
D.2.2 DBI
Appendix E: Extending the package
E.1 A case study: negative binomial regression
E.2 Using a Poisson model
E.3 Replicate weights
E.4 Linearization
References
Author Index
Topic Index