Conducting Exposome-Wide Association Studies
This module covers the statistical machinery behind ExWAS:
svydesign() and svyglm() vs. lm()

NHANES uses complex survey sampling, not simple random sampling.
Consequence: Standard lm() produces biased standard errors and invalid p-values.
The survey package in R handles complex survey designs:
Two key functions:
- svydesign() — declare the survey design
- svyglm() — fit regression models accounting for the design

Survey-weighted regression:
Ordinary regression (ignoring survey design):
| method | term | estimate | std.error | p.value |
|---|---|---|---|---|
| svyglm | (Intercept) | 20.6623 | 0.1447 | 0 |
| svyglm | RIDAGEYR | 0.1455 | 0.0034 | 0 |
| lm | (Intercept) | 20.4280 | 0.0778 | 0 |
| lm | RIDAGEYR | 0.1416 | 0.0020 | 0 |
Note: coefficients may be similar, but standard errors differ — and that affects p-values.
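The comparison above can be sketched with the survey package. This is a minimal, self-contained sketch on simulated data (the real analysis uses NHData.train); the design variable names SDMVSTRA, SDMVPSU, and WTMEC2YR follow standard NHANES conventions and are assumptions here.

```r
library(survey)

# Simulated stand-in for the NHANES training data; design variable
# names (SDMVSTRA, SDMVPSU, WTMEC2YR) follow NHANES conventions
set.seed(1)
dat <- data.frame(
  RIDAGEYR = runif(500, 20, 80),
  SDMVSTRA = rep(1:5, each = 100),   # sampling strata
  SDMVPSU  = rep(1:2, times = 250),  # PSUs within strata
  WTMEC2YR = runif(500, 1000, 5000)  # examination weights
)
dat$BMXBMI <- 20 + 0.15 * dat$RIDAGEYR + rnorm(500, sd = 4)

# Declare the complex survey design
des <- svydesign(ids = ~SDMVPSU, strata = ~SDMVSTRA,
                 weights = ~WTMEC2YR, nest = TRUE, data = dat)

# Survey-weighted regression vs. ordinary regression
fit_svy <- svyglm(BMXBMI ~ RIDAGEYR, design = des)
fit_lm  <- lm(BMXBMI ~ RIDAGEYR, data = dat)

# Point estimates are close; the standard errors differ
summary(fit_svy)$coefficients
summary(fit_lm)$coefficients
```

With real NHANES data, ignoring the design (as `fit_lm` does) understates the sampling variability, which is exactly the standard-error gap shown in the table above.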
Including multiple predictors in a single model:
\[\text{BMI} = \alpha + \beta_1 \cdot \text{Age} + \beta_2 \cdot \text{Sex} + \beta_3 \cdot \text{Race} + \beta_4 \cdot \text{Income} + \epsilon\]
| term | estimate | std.error | p.value |
|---|---|---|---|
| (Intercept) | 20.4261 | 0.2579 | 0.0000 |
| RIDAGEYR | 0.1524 | 0.0034 | 0.0000 |
| male | -0.2090 | 0.1329 | 0.1302 |
| black | 1.5605 | 0.1810 | 0.0000 |
| mexican | 0.9928 | 0.1902 | 0.0000 |
| other_hispanic | 0.5532 | 0.2813 | 0.0620 |
| other_eth | -0.0129 | 0.6124 | 0.9833 |
| INDFMPIR | -0.0620 | 0.0569 | 0.2879 |
Each coefficient represents the association holding all other variables constant:
- RIDAGEYR: change in BMI per 1-year increase in age, adjusted for sex, race, and income
- male: difference in BMI for males vs. females, adjusted for age, race, and income
- black, mexican, etc.: difference in BMI relative to white (the reference group)

The distribution of blood lead (LBXBPB) is right-skewed — most values are low with a long right tail.
log(LBXBPB + 0.001) ensures all values are defined (the small offset keeps zero measurements from producing log(0)).

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 98.294 | 0.628 | 156.535 | 0 |
| log(LBXBPB + 0.001) | 4.322 | 0.664 | 6.511 | 0 |
A 1-unit change in log(exposure) corresponds to a ~2.7-fold change in the raw exposure (since \(e^1 \approx 2.72\)).
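This fold-change interpretation can be checked directly in R:

```r
# A 1-unit increase on the log scale multiplies the raw value by e
x <- 5
exp(log(x) + 1) / x  # equals exp(1), approximately 2.718
```

The ratio is exp(1) regardless of the starting value x, which is why a 1-unit log-scale change is always a ~2.7-fold raw-scale change.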
Different exposures have different units (µg/dL, ng/mL, etc.). To compare across exposures:
\[z = \frac{x - \bar{x}}{s_x}\]
```r
NHData.train <- NHData.train %>%
  mutate(lead_scaled = scale(log(LBXBPB + 0.001)))

NHData.train %>%
  summarize(
    raw_mean = mean(log(LBXBPB + 0.001), na.rm = TRUE),
    raw_sd = sd(log(LBXBPB + 0.001), na.rm = TRUE),
    scaled_mean = mean(lead_scaled, na.rm = TRUE),
    scaled_sd = sd(lead_scaled, na.rm = TRUE)
  ) %>%
  kable(digits = 3) %>% kable_styling()
```

| raw_mean | raw_sd | scaled_mean | scaled_sd |
|---|---|---|---|
| 0.46 | 0.713 | 0 | 1 |
After scaling:

- the transformed exposure has mean 0 and standard deviation 1
- coefficients are comparable across exposures: each is the change in the outcome per 1-SD increase in log exposure
For each exposure \(E_j\), we fit:
\[\text{Phenotype} = \alpha + \beta_j \cdot \text{scale}(\log(E_j + 0.001)) + \gamma_1 \cdot \text{Age} + \gamma_2 \cdot \text{Sex} + \gamma_3 \cdot \text{Race} + \gamma_4 \cdot \text{Income} + \epsilon\]
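The per-exposure loop can be sketched as follows. This is a hypothetical minimal version on simulated data, using lm for brevity; the real analysis fits svyglm on the declared survey design, and the column names (exp1, exp2, exp3, age, bmi) are placeholders.

```r
library(dplyr)
library(purrr)
library(broom)

# Simulated data: only exp1 truly affects the outcome
set.seed(2)
dat <- tibble(
  age  = runif(300, 20, 80),
  exp1 = rlnorm(300),
  exp2 = rlnorm(300),
  exp3 = rlnorm(300),
  bmi  = 20 + 0.1 * age + 1.0 * scale(log(exp1 + 0.001))[, 1] +
         rnorm(300, sd = 3)
)

exposures <- c("exp1", "exp2", "exp3")

# Fit one adjusted model per exposure and keep the exposure term
exwas <- map_dfr(exposures, function(e) {
  f <- reformulate(c(sprintf("scale(log(%s + 0.001))", e), "age"),
                   response = "bmi")
  tidy(lm(f, data = dat)) %>%
    filter(grepl(e, term, fixed = TRUE)) %>%
    mutate(exposure = e, .before = 1)
})
exwas
```

Each row of `exwas` is one exposure-phenotype test; collecting the p-values across all rows is what creates the multiple-testing problem addressed next.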
When testing \(m\) hypotheses simultaneously, the chance of at least one false positive (the family-wise error rate) grows rapidly with \(m\), so raw p-values can no longer be compared against \(\alpha = 0.05\).
The simplest approach — divide \(\alpha\) by the number of tests:
\[\alpha_{\text{Bonf}} = \frac{\alpha}{m} = \frac{0.05}{160} = 0.000313\]
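Both the Bonferroni threshold and the uncorrected family-wise error rate are quick to compute:

```r
m <- 160
alpha <- 0.05

# Bonferroni-adjusted significance threshold
alpha / m            # 0.0003125

# Probability of at least one false positive with NO correction,
# assuming independent tests
1 - (1 - alpha)^m    # approximately 0.9997
```

With 160 tests and no correction, at least one false positive is nearly guaranteed even when every null hypothesis is true.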
A less conservative approach controlling the False Discovery Rate:
```r
# Simulated p-values for demonstration
set.seed(42)
pvalues <- c(runif(10, 0, 0.001), runif(150, 0, 1)) # 10 "real" + 150 null

# Bonferroni
bonf_adjusted <- p.adjust(pvalues, method = "bonferroni")

# Benjamini-Hochberg FDR
fdr_adjusted <- p.adjust(pvalues, method = "BH")

tibble(
  raw_p = sort(pvalues)[1:5],
  bonferroni = sort(bonf_adjusted)[1:5],
  fdr = sort(fdr_adjusted)[1:5]
) %>% kable(digits = 6) %>% kable_styling()
```

| raw_p | bonferroni | fdr |
|---|---|---|
| 0.000135 | 0.021547 | 0.01363 |
| 0.000239 | 0.038223 | 0.01363 |
| 0.000286 | 0.045782 | 0.01363 |
| 0.000519 | 0.083055 | 0.01363 |
| 0.000642 | 0.102679 | 0.01363 |
```r
tibble(raw = sort(pvalues), bonf = sort(bonf_adjusted), fdr = sort(fdr_adjusted),
       rank = 1:length(pvalues)) %>%
  pivot_longer(cols = c(bonf, fdr), names_to = "method", values_to = "adjusted_p") %>%
  ggplot(aes(x = rank, y = adjusted_p, color = method)) +
  geom_point(alpha = 0.5) +
  geom_hline(yintercept = 0.05, linetype = "dashed") +
  scale_y_log10() +
  labs(x = "Rank", y = "Adjusted p-value (log scale)", title = "Bonferroni vs. FDR") +
  theme_minimal()
```

A confounder C is a variable that:

- is associated with the exposure E
- independently affects the phenotype P
- is not on the causal pathway from E to P
If we don’t adjust for C, the E-P association is biased.
DAGs represent causal relationships:
```
    C
   / \
  v   v
  E   P
```
Example: Age (C) affects both lead exposure (E) and BMI (P).
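This bias can be demonstrated with a small simulation (hypothetical numbers; by construction, lead has no effect on BMI at all):

```r
# Age (C) drives both lead (E) and BMI (P); lead itself has NO
# effect on BMI in this simulation
set.seed(3)
n <- 2000
age  <- runif(n, 20, 80)
lead <- 0.02 * age + rnorm(n)
bmi  <- 25 + 0.10 * age + rnorm(n)

coef(lm(bmi ~ lead))["lead"]        # biased well away from 0
coef(lm(bmi ~ lead + age))["lead"]  # near 0 once C is adjusted for
```

The unadjusted model attributes the age-driven covariation to lead; adding age as a covariate removes the spurious association.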
In our ExWAS model, we adjust for:
| Confounder | Why? |
|---|---|
| Age | Affects both exposure levels and health outcomes |
| Sex | Biological differences in metabolism and exposure |
| Race/Ethnicity | Sociostructural differences in exposure and health |
| Income (INDFMPIR) | Socioeconomic factors affect exposure and health |
A fundamental limitation of screening approaches like ExWAS:
Each exposure-phenotype pair has a different confounding structure — but we apply the same covariate set to all tests.
- Lead → BMI: confounders = age, SES, housing, occupation
- Vitamin D → BMI: confounders = age, SES, physical activity, sun exposure
- Cotinine → BMI: confounders = age, SES, alcohol use, diet quality
There is no single DAG for the exposome — there are hundreds of DAGs, one per association.
Using a “one-size-fits-all” covariate set means:

- some associations are over-adjusted (a covariate blocks part of the causal path)
- others are under-adjusted (a true confounder is left out)
Example: Adjusting for BMI when testing diet → glucose may block a causal path (over-adjust), while failing to adjust for physical activity when testing vitamin D → BMI may leave confounding open (under-adjust).
This does not invalidate ExWAS — but it means results should be read as screening signals: significant associations are candidates for follow-up with association-specific confounder adjustment, not causal estimates.
\(R^2\) measures the proportion of variance in the outcome explained by the model:
\[R^2 = 1 - \frac{SS_{\text{res}}}{SS_{\text{tot}}}\]
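The formula can be verified against R's built-in value on a small simulated fit:

```r
# Verify R^2 = 1 - SS_res / SS_tot by hand
set.seed(4)
x <- rnorm(100)
y <- 2 * x + rnorm(100)
fit <- lm(y ~ x)

ss_res <- sum(residuals(fit)^2)     # residual sum of squares
ss_tot <- sum((y - mean(y))^2)      # total sum of squares
r2_manual <- 1 - ss_res / ss_tot

all.equal(r2_manual, summary(fit)$r.squared)  # TRUE
```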
How much additional variance does the exposure explain beyond confounders?
\[\Delta R^2 = R^2_{\text{full}} - R^2_{\text{base}}\]
```r
# Base model (confounders only)
r2_base <- summary(lm(BMXBMI ~ RIDAGEYR + male +
                        black + mexican + other_hispanic + other_eth +
                        INDFMPIR,
                      data = NHData.train))$r.squared

# Full model (confounders + exposure)
r2_full <- summary(lm(BMXBMI ~ scale(log(LBXBPB + 0.001)) +
                        RIDAGEYR + male +
                        black + mexican + other_hispanic + other_eth +
                        INDFMPIR,
                      data = NHData.train))$r.squared

tibble(r2_base = r2_base, r2_full = r2_full,
       delta_r2 = r2_full - r2_base) %>%
  kable(digits = 5) %>% kable_styling()
```

| r2_base | r2_full | delta_r2 |
|---|---|---|
| 0.23431 | 0.25109 | 0.01678 |
Key points from this module:

- Declare the NHANES survey design with svydesign()
- Fit exposure-phenotype models with svyglm(), adjusting for confounders
- Survey-aware tools (svydesign, svyglm) are essential for NHANES

Module 4: The nhanespewas Package

- pe_flex_adjust()

This course is supported by the National Institutes of Health (NIH).