Exploring the data and models introduced in the causal chapter of Lumley’s (2010) book on surveys – and a few other models to see how they compare.
library(tidyverse)
Registered S3 methods overwritten by 'dbplyr':
method from
print.tbl_lazy
print.tbl_sql
-- Attaching packages ------------------------- tidyverse 1.3.1 --
v ggplot2 3.3.5 v purrr 0.3.4
v tibble 3.1.2 v dplyr 1.0.7
v tidyr 1.1.3 v stringr 1.4.0
v readr 1.4.0 v forcats 0.5.1
-- Conflicts ---------------------------- tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag() masks stats::lag()
library(janitor)
Attaching package: ‘janitor’
The following objects are masked from ‘package:stats’:
chisq.test, fisher.test
library(survey)
Loading required package: grid
Loading required package: Matrix
Attaching package: ‘Matrix’
The following objects are masked from ‘package:tidyr’:
expand, pack, unpack
Loading required package: survival
Attaching package: ‘survey’
The following object is masked from ‘package:graphics’:
dotchart
library(srvyr)
Attaching package: ‘srvyr’
The following object is masked from ‘package:stats’:
filter
library(modelsummary)
Registered S3 methods overwritten by 'htmltools':
method from
print.html tools:rstudio
print.shiny.tag tools:rstudio
print.shiny.tag.list tools:rstudio
# LMMs
library(lme4)
# CEM
library(MatchIt)
library(lmtest) #coeftest
Loading required package: zoo
Attaching package: ‘zoo’
The following objects are masked from ‘package:base’:
as.Date, as.Date.numeric
library(sandwich) #vcovCL
The data are over here, in a collection from another book mentioned by Lumley.
dat <- read.csv("kidsmoking.txt", sep = "\t")
head(dat)
It has the right number of rows – first good sign that it’s the same dataset.
nrow(dat)
[1] 654
Fix the factors:
dat <- dat %>%
mutate(smoke = factor(smoke),
smoke = recode(smoke, `0` = "No", `1` = "Yes"),
sex = factor(sex),
sex = recode(sex, `0` = "Female", `1` = "Male"))
Lumley mentions a family ID, but it’s not there. My guess is that it’s embedded in the overall ID with the last two digits providing the child ID and the rest the family. Let’s give that a go.
dat <- dat %>%
mutate(family_id = floor(id/100))
Should be 290 families:
unique(dat$family_id) %>% length()
[1] 290
YES.
Now we need a variable to encode whether each child has a sibling who smokes. For each family we can count the total number of (child) smokers. Then for a given child we work out if a sib smokes by rule:
Does child of focus smoke? If yes, then sib smokes if are there 2 or more smokers in the family. If no, then sib smokes if are there 1 or more smokers in the family.
Let’s give that a go. First, count the child smokers in each family.
fam_smokers <- dat %>%
group_by(family_id) %>%
summarise(fam_smokers = sum(smoke == "Yes"))
Merge the count back into the dataframe and work out whether each child has a sib who smokes:
dat <- dat %>%
left_join(fam_smokers) %>%
mutate(
sib_smokes = ifelse(smoke == "Yes",
fam_smokers >= 2,
#else
fam_smokers >= 1)
)
Joining, by = "family_id"
Done.
Finally, make the IDs factors.
dat <- dat %>%
mutate(id = factor(id),
family_id = factor(family_id))
Now some pictures:
dat %>%
ggplot(aes(smoke, fev1)) +
geom_jitter(width = 0.1, alpha = 0.5, size = 0.5) +
geom_boxplot(alpha = 0.2, colour = "magenta") +
labs(x = "Smoking",
y = expression(FEV[1]~(l/s))) +
coord_flip()
set.seed(1337)
dat %>%
ggplot(aes(age, fev1, colour = smoke)) +
geom_jitter(alpha = 0.4, width = 0.1, height = 0) +
labs(x = "Age",
y = expression(FEV[1]~(l/s)),
colour = "Smoking") +
scale_colour_manual(values = c("darkgrey", "magenta")) +
theme_bw()
what’s the earliest age someone smokes?
dat %>%
mutate(age = ordered(age)) %>%
tabyl(age, smoke)
age No Yes
3 2 0
4 9 0
5 28 0
6 37 0
7 54 0
8 85 0
9 93 1
10 76 5
11 81 9
12 50 7
13 30 13
14 18 7
15 9 10
16 6 7
17 6 2
18 4 2
19 1 2
It’s 9, but, as Lumley points out, it’s only one person.
dat10plus <- dat %>%
filter(age >= 10)
Just going to copy the analyses to start with, to check it’s really the same dataset:
m1 <- glm(I(smoke == "Yes") ~ age * sex, data = dat10plus,
family = binomial)
m2 <- glm(I(smoke == "Yes") ~ height * sex, data = dat10plus,
family = binomial)
m1a <- glm(I(smoke == "Yes") ~ (age + sib_smokes) * sex, data = dat10plus,
family = binomial)
m2a <- glm(I(smoke == "Yes") ~ (height + sib_smokes) * sex, data = dat10plus,
family = binomial)
dat10plus <- dat10plus %>%
mutate(
pi1 = ifelse(smoke == "Yes",
fitted(m1),
1 - fitted(m1)),
pi2 = ifelse(smoke == "Yes",
fitted(m2),
1 - fitted(m2)),
pi1a = ifelse(smoke == "Yes",
fitted(m1a),
1 - fitted(m1a)),
pi2a = ifelse(smoke == "Yes",
fitted(m2a),
1 - fitted(m2a)),
w1 = 1 / pi1,
w2 = 1 / pi2,
w1a = 1 / pi1a,
w2a = 1 / pi2a
)
Note below I had apparently confused as_survey
by including a formula in the weights option: the reciprocal of the propensity score. Easily fixed and a reminder to check the design object is setup correctly before using it in analyses.
the_designs <- list()
the_designs[["m1"]] <- dat10plus %>%
as_survey(ids = family_id,
weights = w1)
the_designs[["m2"]] <- dat10plus %>%
as_survey(ids = family_id,
weights = w2)
the_designs[["m1a"]] <- dat10plus %>%
as_survey(ids = family_id,
weights = w1a)
the_designs[["m2a"]] <- dat10plus %>%
as_survey(ids = family_id,
weights = w2a)
the_designs[["unweighted"]] <- dat10plus %>%
as_survey(ids = family_id,
weights = NULL)
the_designs
$m1
1 - level Cluster Sampling design (with replacement)
With (209) clusters.
Called via srvyr
Sampling variables:
- ids: family_id
- weights: w1
Data variables: id (int), age (int), fev1 (dbl), height (dbl),
sex (fct), smoke (fct), family_id (dbl), fam_smokers (int),
sib_smokes (lgl), pi1 (dbl), pi2 (dbl), pi1a (dbl), pi2a (dbl),
w1 (dbl), w2 (dbl), w1a (dbl), w2a (dbl)
$m2
1 - level Cluster Sampling design (with replacement)
With (209) clusters.
Called via srvyr
Sampling variables:
- ids: family_id
- weights: w2
Data variables: id (int), age (int), fev1 (dbl), height (dbl),
sex (fct), smoke (fct), family_id (dbl), fam_smokers (int),
sib_smokes (lgl), pi1 (dbl), pi2 (dbl), pi1a (dbl), pi2a (dbl),
w1 (dbl), w2 (dbl), w1a (dbl), w2a (dbl)
$m1a
1 - level Cluster Sampling design (with replacement)
With (209) clusters.
Called via srvyr
Sampling variables:
- ids: family_id
- weights: w1a
Data variables: id (int), age (int), fev1 (dbl), height (dbl),
sex (fct), smoke (fct), family_id (dbl), fam_smokers (int),
sib_smokes (lgl), pi1 (dbl), pi2 (dbl), pi1a (dbl), pi2a (dbl),
w1 (dbl), w2 (dbl), w1a (dbl), w2a (dbl)
$m2a
1 - level Cluster Sampling design (with replacement)
With (209) clusters.
Called via srvyr
Sampling variables:
- ids: family_id
- weights: w2a
Data variables: id (int), age (int), fev1 (dbl), height (dbl),
sex (fct), smoke (fct), family_id (dbl), fam_smokers (int),
sib_smokes (lgl), pi1 (dbl), pi2 (dbl), pi1a (dbl), pi2a (dbl),
w1 (dbl), w2 (dbl), w1a (dbl), w2a (dbl)
$unweighted
1 - level Cluster Sampling design (with replacement)
With (209) clusters.
Called via srvyr
Sampling variables:
- ids: family_id
Data variables: id (int), age (int), fev1 (dbl), height (dbl),
sex (fct), smoke (fct), family_id (dbl), fam_smokers (int),
sib_smokes (lgl), pi1 (dbl), pi2 (dbl), pi1a (dbl), pi2a (dbl),
w1 (dbl), w2 (dbl), w1a (dbl), w2a (dbl)
the_mods <- lapply(the_designs, \(d) svyglm(fev1 ~ smoke, design = d))
the_mods[["LMM"]] <- lmer(fev1 ~ smoke + (1|family_id),
data = dat10plus)
modelsummary(the_mods,
fmt = 2,
gof_omit = ".*",
coef_map = c("smokeYes" = "Smoking"))
m1 | m2 | m1a | m2a | unweighted | LMM | |
---|---|---|---|---|---|---|
Smoking | 0.11 | 0.03 | 0.19 | 0.17 | 0.15 | 0.19 |
(0.13) | (0.16) | (0.15) | (0.13) | (0.10) | (0.10) |
The coefficients are still pointing the wrong way: smoking implies better lung capacity. Lumley tries the analysis again by subseting to children aged 12 or above and with a doubly-robust approach, also known as augmented IPW (AIPW). I’m curious to see what happens with coarsened exact matching.
mat <- matchit(smoke ~ age + height + sex + sib_smokes,
data = dat10plus,
method = "cem",
estimate = "ATE")
mat
A matchit object
- method: Coarsened exact matching
- number of obs.: 345 (original), 129 (matched)
- target estimand: ATT
- covariates: age, height, sex, sib_smokes
summary(mat)
Call:
matchit(formula = smoke ~ age + height + sex + sib_smokes, data = dat10plus,
method = "cem", estimate = "ATE")
Summary of Balance for All Data:
Means Treated Means Control Std. Mean Diff.
age 13.5938 11.8043 0.7825
height 66.0781 64.6616 0.4636
sexFemale 0.6094 0.4413 0.3445
sexMale 0.3906 0.5587 -0.3445
sib_smokesTRUE 0.3438 0.1708 0.3641
Var. Ratio eCDF Mean eCDF Max
age 1.4697 0.1789 0.4085
height 0.6335 0.0881 0.2177
sexFemale . 0.1681 0.1681
sexMale . 0.1681 0.1681
sib_smokesTRUE . 0.1729 0.1729
Summary of Balance for Matched Data:
Means Treated Means Control Std. Mean Diff.
age 12.5135 12.5135 0.000
height 66.2568 65.9878 0.088
sexFemale 0.6216 0.6216 0.000
sexMale 0.3784 0.3784 0.000
sib_smokesTRUE 0.2162 0.2162 0.000
Var. Ratio eCDF Mean eCDF Max Std. Pair Dist.
age 1.0112 0.0000 0.0000 0.0000
height 0.8981 0.0222 0.1149 0.2456
sexFemale . 0.0000 0.0000 0.0000
sexMale . 0.0000 0.0000 0.0000
sib_smokesTRUE . 0.0000 0.0000 0.0000
Percent Balance Improvement:
Std. Mean Diff. Var. Ratio eCDF Mean eCDF Max
age 100 97.1 100.0 100.0
height 81 76.5 74.8 47.2
sexFemale 100 . 100.0 100.0
sexMale 100 . 100.0 100.0
sib_smokesTRUE 100 . 100.0 100.0
Sample Sizes:
Control Treated
All 281. 64
Matched (ESS) 61.9 37
Matched 92. 37
Unmatched 189. 27
Discarded 0. 0
Here’s what the help recommends:
mat_dat <- match.data(mat)
cem_fit <- lm(fev1 ~ smoke, data = mat_dat, weights = weights)
coeftest(cem_fit, vcov. = vcovCL, cluster = ~subclass)
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.340655 0.103134 32.3913 <2e-16 ***
smokeYes 0.061724 0.089078 0.6929 0.4896
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Try also using {survey}:
cem_design <- mat_dat %>%
as_survey(
ids = subclass,
weights = weights
)
cem_design
1 - level Cluster Sampling design (with replacement)
With (32) clusters.
Called via srvyr
Sampling variables:
- ids: subclass
- weights: weights
Data variables: id (int), age (int), fev1 (dbl), height (dbl),
sex (fct), smoke (fct), family_id (dbl), fam_smokers (int),
sib_smokes (lgl), pi1 (dbl), pi2 (dbl), pi1a (dbl), pi2a (dbl),
w1 (dbl), w2 (dbl), w1a (dbl), w2a (dbl), weights (dbl),
subclass (fct)
svyglm(fev1 ~ smoke, design = cem_design) %>%
summary()
Call:
svyglm(formula = fev1 ~ smoke, design = cem_design)
Survey design:
Called via srvyr
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.34065 0.10273 32.519 <2e-16 ***
smokeYes 0.06172 0.08873 0.696 0.492
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for gaussian family taken to be 0.4161119)
Number of Fisher Scoring iterations: 2
Also try a multilevel model with the weights. {WeMix} seems to do it.