Module 7: Advanced Topics

Conducting Exposome-Wide Association Studies

Chirag J Patel

Overview

This module covers advanced ExWAS topics:

A. Meta-analysis across NHANES cycles

B. Interaction testing

C. Microbiome-exposome studies

D. Future directions

Part A: Meta-Analysis Across NHANES Cycles

Why Meta-Analyze?

Each NHANES cycle (2 years) has a limited sample size
Associations may vary across cycles due to:
- Changing exposure levels over time
- Demographic shifts
- Assay changes
Meta-analysis pools estimates across cycles for:
- Greater statistical power
- Assessment of heterogeneity over time

Strategy: Per-Cycle Estimation

Run the ExWAS association separately in each NHANES cycle
Collect the estimate and standard error from each cycle
Pool using a meta-analytic model
Assess heterogeneity
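The four steps above can be sketched with plain arithmetic. The numbers below are made up for illustration (they are not NHANES results), and the pooling shown is the simple inverse-variance (fixed-effect) estimator, which is only one of several possible meta-analytic models.

```r
# Hypothetical per-cycle estimates and standard errors (illustrative only)
cycle_est <- c(-0.42, -0.35, -0.51, -0.28)
cycle_se  <- c(0.10, 0.12, 0.09, 0.15)

# Inverse-variance weights: more precise cycles count for more
w <- 1 / cycle_se^2

# Pooled estimate and its standard error
pooled_est <- sum(w * cycle_est) / sum(w)
pooled_se  <- sqrt(1 / sum(w))

# Two-sided p-value from a normal approximation
pooled_z <- pooled_est / pooled_se
pooled_p <- 2 * pnorm(abs(pooled_z), lower.tail = FALSE)

data.frame(pooled_est, pooled_se, pooled_z, pooled_p)
```

The pooled standard error is smaller than any single cycle's standard error, which is where the gain in power comes from.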

Meta-Analysis with nhanespewas

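library(nhanespewas)
library(tidyverse)   # %>%, map_dfr(), tibble(), ggplot(), fct_inorder()
library(broom)       # tidy()
library(knitr)       # kable()
library(kableExtra)  # kable_styling()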
con <- connect_pewas_data()

# pe_flex_adjust returns results per cycle
result <- pe_flex_adjust(
  phenotype = "BMXBMI",
  exposure = "LBXBPB",
  adjustment_model = adjustment_models[[4]],
  con = con
)

# Extract per-cycle estimates
per_cycle <- result %>%
  map_dfr(~ tidy(.), .id = "cycle") %>%
  filter(grepl("LBXBPB", term))

Per-Cycle Estimates

per_cycle %>%
  select(cycle, estimate, std.error, p.value) %>%
  kable(digits = 4) %>% kable_styling()

The stanley_meta() UWLS Method

The nhanespewas package provides stanley_meta() using the Unrestricted Weighted Least Squares (UWLS) method:

# UWLS meta-analysis
meta_result <- stanley_meta(
  estimates = per_cycle$estimate,
  standard_errors = per_cycle$std.error
)

meta_result

UWLS is robust to heterogeneity and does not require distributional assumptions about the random effects.
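To make the idea concrete, here is a base-R sketch of UWLS as it is commonly described: a no-intercept regression of each t-statistic on its precision (1/SE). This is not the source of stanley_meta(), whose internals may differ, and the input numbers are hypothetical.

```r
# Hypothetical per-cycle estimates and standard errors (illustrative only)
est <- c(-0.42, -0.35, -0.51, -0.28)
se  <- c(0.10, 0.12, 0.09, 0.15)

# UWLS as a no-intercept regression of t-statistics on precision
t_stat    <- est / se
precision <- 1 / se
uwls_fit  <- lm(t_stat ~ 0 + precision)

uwls_est <- unname(coef(uwls_fit)["precision"])
uwls_se  <- sqrt(vcov(uwls_fit)[1, 1])

# For comparison: the fixed-effect (inverse-variance) estimate and SE
w      <- 1 / se^2
fe_est <- sum(w * est) / sum(w)
fe_se  <- sqrt(1 / sum(w))

c(uwls_est = uwls_est, uwls_se = uwls_se, fe_est = fe_est, fe_se = fe_se)
```

The point estimate matches the fixed-effect estimate. The UWLS standard error is the fixed-effect standard error multiplied by the residual standard deviation of the regression, so it widens when the cycles disagree more than sampling error would predict.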

DerSimonian-Laird with metafor

Compare with the traditional DerSimonian-Laird random-effects model:

library(metafor)

rma_result <- rma(
  yi = per_cycle$estimate,
  sei = per_cycle$std.error,
  method = "DL"
)

summary(rma_result)

Comparing Meta-Analytic Methods

tibble(
  method = c("UWLS (stanley_meta)", "DL (metafor::rma)"),
  estimate = c(meta_result$estimate, rma_result$b[1]),
  se = c(meta_result$se, rma_result$se[1]),
  p.value = c(meta_result$p.value, rma_result$pval[1])
) %>%
  kable(digits = 4) %>% kable_styling()

Forest Plot

Visualize per-cycle estimates and the pooled estimate:

forest_data <- per_cycle %>%
  select(cycle, estimate, std.error) %>%
  mutate(
    ci_lower = estimate - 1.96 * std.error,
    ci_upper = estimate + 1.96 * std.error,
    label = paste0("Cycle ", cycle)
  )

# Add pooled estimate
pooled <- tibble(
  label = "Pooled (DL)",
  estimate = rma_result$b[1],
  ci_lower = rma_result$ci.lb,
  ci_upper = rma_result$ci.ub
)

forest_data <- bind_rows(
  forest_data %>% select(label, estimate, ci_lower, ci_upper),
  pooled
) %>%
  mutate(label = fct_inorder(label))

ggplot(forest_data, aes(x = estimate, y = label)) +
  geom_point(size = 3) +
  geom_errorbarh(aes(xmin = ci_lower, xmax = ci_upper), height = 0.2) +
  geom_vline(xintercept = 0, linetype = "dashed") +
  labs(x = "Effect Estimate", y = "", title = "Forest Plot: Lead and BMI") +
  theme_minimal()

Heterogeneity Statistics

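I² statistic: Percentage of the total variability in per-cycle estimates that is attributable to between-cycle heterogeneity rather than sampling error

Rough benchmarks (rules of thumb, not hard cut-offs):

I² = 0%: no observed heterogeneity
I² = 25%: low
I² = 50%: moderate
I² = 75%: high

Cochran’s Q test: Tests whether the observed between-cycle variability exceeds what sampling error alone would produce; it has low power when the number of cycles is small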

tibble(
  I2 = rma_result$I2,
  Q = rma_result$QE,
  Q_pvalue = rma_result$QEp,
  tau2 = rma_result$tau2
) %>%
  kable(digits = 3) %>% kable_styling()
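The same quantities can be computed by hand, which shows what each statistic measures. This is a sketch with hypothetical inputs, using the standard DerSimonian-Laird moment formulas; for method = "DL" the results should agree with metafor, though that comparison is not run here.

```r
# Hypothetical per-cycle estimates and standard errors (illustrative only)
est <- c(-0.60, -0.20, -0.55, -0.10)
se  <- c(0.10, 0.12, 0.09, 0.15)

w      <- 1 / se^2                      # inverse-variance weights
k      <- length(est)                   # number of cycles
fe_est <- sum(w * est) / sum(w)         # fixed-effect pooled estimate

# Cochran's Q: weighted squared deviations from the pooled estimate
Q        <- sum(w * (est - fe_est)^2)
Q_pvalue <- pchisq(Q, df = k - 1, lower.tail = FALSE)

# I^2: share of Q in excess of its expectation (k - 1) under homogeneity
I2 <- max(0, (Q - (k - 1)) / Q) * 100

# DerSimonian-Laird between-cycle variance
c_dl <- sum(w) - sum(w^2) / sum(w)
tau2 <- max(0, (Q - (k - 1)) / c_dl)

data.frame(Q, Q_pvalue, I2, tau2)
```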

Stouffer’s Method

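An alternative to pooling effect sizes is to pool p-values using Stouffer’s method. Each two-sided p-value is converted to a z-score that carries the sign of the corresponding estimate, so that cycles with opposite directions of effect offset rather than reinforce each other:

\[Z = \frac{\sum_{i=1}^{k} w_i Z_i}{\sqrt{\sum_{i=1}^{k} w_i^2}}, \qquad Z_i = \operatorname{sign}(\hat{\beta}_i)\,\Phi^{-1}\!\left(1 - \frac{p_i}{2}\right)\]

# Weighted Stouffer's method for two-sided p-values
# signs:   +1 / -1 direction of each estimate (defaults to all positive)
# weights: e.g., sqrt(n) per cycle (defaults to equal weights)
stouffer_z <- function(pvalues, signs = NULL, weights = NULL) {
  if (is.null(signs)) signs <- rep(1, length(pvalues))
  if (is.null(weights)) weights <- rep(1, length(pvalues))
  z_scores <- signs * qnorm(pvalues / 2, lower.tail = FALSE)
  z_combined <- sum(weights * z_scores) / sqrt(sum(weights^2))
  p_combined <- 2 * pnorm(abs(z_combined), lower.tail = FALSE)
  list(z = z_combined, p = p_combined)
}

stouffer_result <- stouffer_z(per_cycle$p.value,
                              signs = sign(per_cycle$estimate))
stouffer_result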

Meta-ExWAS: Looping Over All Exposures

run_meta_exwas <- function(exposure_vars, phenotype, adj_model, con) {
  map_dfr(exposure_vars, function(exp_var) {
    tryCatch({
      result <- pe_flex_adjust(phenotype, exp_var, adj_model, con)
      per_cycle <- result %>%
        map_dfr(~ tidy(.), .id = "cycle") %>%
        filter(grepl(exp_var, term))

      if (nrow(per_cycle) < 2) return(NULL)

      rma_fit <- rma(yi = per_cycle$estimate,
                     sei = per_cycle$std.error, method = "DL")
      tibble(
        exposure = exp_var,
        estimate = rma_fit$b[1],
        se = rma_fit$se[1],
        p.value = rma_fit$pval[1],
        I2 = rma_fit$I2,
        n_cycles = nrow(per_cycle)
      )
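    }, error = function(e) {
      # Report which exposure failed instead of dropping it silently
      message("Skipping ", exp_var, ": ", conditionMessage(e))
      NULL
    })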
  })
}

Part B: Interaction Testing

Interaction Testing

Question: Does the exposure-phenotype association differ by a modifier (e.g., sex, race)?

\[P = \beta_0 + \beta_1 E + \beta_2 M + \beta_3 (E \times M) + \text{covariates}\]

\(\beta_3\) is the interaction term — tests whether the E-P association differs across levels of M.
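A minimal sketch of this model on simulated data, using ordinary lm() rather than the survey-weighted regression used elsewhere in the course. All variable names and effect sizes are hypothetical.

```r
set.seed(7)
n   <- 2000
M   <- rbinom(n, 1, 0.5)      # binary modifier (e.g., sex coded 0/1)
E   <- rnorm(n)               # standardized exposure
age <- rnorm(n, 45, 12)       # one covariate

# True model: slope of E is 0.2 when M = 0 and 0.2 + 0.5 when M = 1
P <- 1 + 0.2 * E + 0.3 * M + 0.5 * E * M + 0.01 * age + rnorm(n)

# E * M expands to E + M + E:M, so beta_3 is the "E:M" coefficient
fit   <- lm(P ~ E * M + age)
coefs <- summary(fit)$coefficients
beta3 <- coefs["E:M", ]
beta3
```

The estimated E:M coefficient should land near the simulated value of 0.5, and the slope of E within the M = 1 group is the sum of the E and E:M coefficients.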

Interaction with nhanespewas

# Test interaction between lead and sex on BMI
# Some pe_flex_adjust implementations support interact_with parameter
result_interaction <- pe_flex_adjust(
  phenotype = "BMXBMI",
  exposure = "LBXBPB",
  adjustment_model = adjustment_models[[4]],
  con = con,
  interact_with = "RIAGENDR"
)

Wald Test for Interaction

The Wald test assesses whether the interaction term is significantly different from zero:

# Extract interaction term from the model
result_interaction %>%
  map_dfr(~ tidy(.), .id = "cycle") %>%
  filter(grepl(":", term)) %>%
  select(cycle, term, estimate, std.error, p.value) %>%
  kable(digits = 4) %>% kable_styling()

A small p-value for the interaction term is evidence that the lead-BMI association differs between males and females; the interaction coefficient estimates how much the slope differs.
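The Wald statistic itself is just the estimate divided by its standard error. The sketch below uses hypothetical numbers, and wald_test() is a helper defined here for illustration only. It uses a normal reference distribution as an approximation; survey-weighted models typically use a t reference with design-based degrees of freedom.

```r
# Wald test for a single coefficient (normal approximation)
wald_test <- function(estimate, std_error, level = 0.95) {
  z    <- estimate / std_error
  p    <- 2 * pnorm(abs(z), lower.tail = FALSE)
  crit <- qnorm(1 - (1 - level) / 2)
  data.frame(
    z        = z,
    p.value  = p,
    ci_lower = estimate - crit * std_error,
    ci_upper = estimate + crit * std_error
  )
}

# Hypothetical interaction estimate and standard error
wald_out <- wald_test(estimate = 0.30, std_error = 0.12)
wald_out
```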

Part C: Microbiome-Exposome Studies

The Microbiome as an Exposure

The gut microbiome is part of the exposome:

Shaped by diet, medications, environment
Influences metabolism, immunity, disease risk
Compositional data requiring special statistical treatment

CLR Transformation

Microbiome data is compositional (proportions sum to 1). The Centered Log-Ratio (CLR) transformation addresses this:

\[\text{CLR}(x_i) = \log\left(\frac{x_i}{\text{geometric mean}(x)}\right)\]

# CLR transformation function
clr_transform <- function(x) {
  # Replace zeros with a pseudocount (0.5 suits raw counts; use a smaller value for proportions)
  x[x == 0] <- 0.5
  log_x <- log(x)
  log_x - mean(log_x)
}

# Example: apply to a row of microbiome abundances
# abundances <- c(100, 50, 20, 5, 1)
# clr_transform(abundances)
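A worked example on a small, made-up count table shows the defining property of CLR: the transformed values in each sample sum to zero. The helper is redefined under the name clr() so that the block runs on its own.

```r
# CLR for one sample; zeros get a pseudocount before taking logs
clr <- function(x, pseudocount = 0.5) {
  x[x == 0] <- pseudocount
  log_x <- log(x)
  log_x - mean(log_x)   # subtracting mean(log) = dividing by geometric mean
}

# Hypothetical counts: rows are samples, columns are taxa
counts <- matrix(
  c(100, 50, 20, 5, 0,
     10, 10, 10, 10, 10),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("sample_1", "sample_2"), paste0("taxon_", 1:5))
)

# apply() returns taxa in rows, so transpose back to samples x taxa
clr_mat <- t(apply(counts, 1, clr))

round(clr_mat, 3)
rowSums(clr_mat)   # zero up to floating-point error
```

A sample with identical counts for every taxon maps to all zeros, because each taxon equals the geometric mean.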

Microbiome-ExWAS Approach

Obtain microbiome composition data (e.g., 16S rRNA or shotgun metagenomics)
Apply CLR transformation to taxa abundances
Treat each CLR-transformed taxon as an “exposure”
Run ExWAS: phenotype ~ CLR(taxon) + covariates
Correct for multiple testing
Validate in independent cohorts
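The transformation, per-taxon regression, and multiple-testing steps can be sketched on simulated data. Everything here is hypothetical: the counts are Poisson draws, only taxon_1 is given a true effect, and plain lm() stands in for whatever model the study design calls for.

```r
set.seed(11)
n_samples <- 300
n_taxa    <- 20

# Simulated taxa counts (samples x taxa)
counts <- matrix(
  rpois(n_samples * n_taxa, lambda = 50),
  nrow = n_samples,
  dimnames = list(NULL, paste0("taxon_", seq_len(n_taxa)))
)

# CLR-transform each sample (row)
clr_mat <- t(apply(counts, 1, function(x) {
  x[x == 0] <- 0.5
  log_x <- log(x)
  log_x - mean(log_x)
}))

# Simulated covariate and phenotype; only taxon_1 has a true effect
age       <- rnorm(n_samples, 45, 12)
phenotype <- 25 + 8 * clr_mat[, "taxon_1"] + 0.05 * age + rnorm(n_samples)

# One regression per taxon: phenotype ~ CLR(taxon) + covariates
results <- do.call(rbind, lapply(colnames(clr_mat), function(taxon) {
  fit <- lm(phenotype ~ clr_mat[, taxon] + age)
  row <- summary(fit)$coefficients[2, ]   # row 2 is the taxon term
  data.frame(
    taxon     = taxon,
    estimate  = row[["Estimate"]],
    std.error = row[["Std. Error"]],
    p.value   = row[["Pr(>|t|)"]]
  )
}))

# Benjamini-Hochberg FDR across taxa, then rank by p-value
results$q.value <- p.adjust(results$p.value, method = "BH")
results <- results[order(results$p.value), ]
head(results, 3)
```

Because CLR values within a sample sum to zero, taxa are mildly negatively correlated by construction, which is one reason validation in independent cohorts matters.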

Part D: Future Directions

Causal Inference Methods

Moving from association to causation:

Mendelian Randomization (MR)

Use genetic variants as instruments for exposures
If a gene variant affects the exposure and is associated with the outcome only through that exposure, the association supports a causal effect of the exposure
Exploits random assortment of alleles at conception

Negative Control Exposures

Test an exposure known NOT to cause the outcome
If it shows an association, this indicates residual confounding or another source of bias

Machine Learning for Mixtures

Exposures occur as mixtures, not in isolation:

LASSO (Least Absolute Shrinkage and Selection Operator)

Selects the most important exposures from a large panel
Among highly correlated exposures, tends to keep one and shrink the others to zero
Identifies sparse models

BKMR (Bayesian Kernel Machine Regression)

Models the joint effect of exposure mixtures
Estimates individual and interaction effects
Accounts for non-linearity

LASSO for Exposure Selection

library(glmnet)

# Prepare exposure matrix
# X <- NHData.train %>%
#   select(all_of(ExposureList)) %>%
#   mutate(across(everything(), ~ as.numeric(scale(log(.x + 0.001))))) %>%
#   as.matrix()
# y <- NHData.train$BMXBMI

# Fit LASSO
# cv_fit <- cv.glmnet(X, y, alpha = 1)
# coef(cv_fit, s = "lambda.min")

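LASSO keeps exposures that predict the phenotype conditional on the others. When exposures are highly correlated, it tends to retain one from each correlated group, so the selected set can be unstable and should not be read as a list of causal exposures.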

The Per-Association Confounding Problem

Recall from Module 3: each exposure-phenotype pair has a different confounding structure, but ExWAS applies the same covariate set to all tests.

This is arguably the single biggest methodological challenge in exposome-wide screening.

Bias Type	Problem	Example
Unmeasured confounding	Missing confounder not in the model	Occupational exposure not in NHANES
Over-adjustment	Conditioning on a mediator or collider	Adjusting for BMI when it lies on the causal path from exposure to outcome
Differential confounding	Bias direction/magnitude differs per exposure	Age confounds lead-BMI differently than vitamin D-BMI

Mitigation Strategy 1: Negative Controls

Negative control exposures and negative control outcomes detect residual confounding:

A negative control exposure should have no biological effect on the outcome — if it appears significant, confounding is likely
A negative control outcome should not be affected by the exposure — significance suggests bias

Example: If lead is associated with both glucose (plausible) and a phenotype with no biological link (implausible), the latter flags confounding.

Lipsitch, Tchetgen Tchetgen, Cohen (2010) introduced this framework for epidemiology.
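A small simulation illustrates the logic of a negative control exposure. The variables and effect sizes are hypothetical: ses plays the role of an unmeasured confounder, and nc_exposure shares that confounder with the real exposure but has no effect on the outcome.

```r
set.seed(3)
n <- 5000

ses         <- rnorm(n)                  # confounder (unmeasured in practice)
exposure    <- 0.6 * ses + rnorm(n)      # real exposure, true effect 0.3
nc_exposure <- 0.6 * ses + rnorm(n)      # negative control: no effect on outcome
outcome     <- 0.3 * exposure + 0.8 * ses + rnorm(n)

# Crude model: the negative control looks "associated" purely via ses
crude <- summary(lm(outcome ~ nc_exposure))$coefficients["nc_exposure", ]

# Only possible in a simulation: adjusting for ses removes the signal
adjusted <- summary(
  lm(outcome ~ nc_exposure + exposure + ses)
)$coefficients["nc_exposure", ]

rbind(crude = crude, adjusted = adjusted)
```

In real data the confounder cannot be adjusted for, so a clearly non-null crude association for the negative control is the warning sign.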

Mitigation Strategy 2: Mendelian Randomization

Mendelian randomization (MR) uses genetic variants as instrumental variables:

Gene (Z) → Exposure (E) → Phenotype (P)
              ↑
         Confounders (C)

Because genotypes are assigned at conception (random with respect to confounders), MR estimates are less susceptible to confounding.

Requires a strong genetic instrument for the exposure
Assumptions: relevance, independence, exclusion restriction
Can be applied to ExWAS hits as a second-stage validation

Key limitation: For many environmental exposures, GWAS have not been conducted or are severely underpowered — making valid genetic instruments hard to find. Unlike well-studied traits (e.g., BMI, lipids), most chemical exposures lack large-scale GWAS, so MR may only be feasible for a small subset of ExWAS hits.
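The simplest MR estimator, the Wald ratio, divides the gene-outcome association by the gene-exposure association. The simulation below is a sketch under idealized assumptions (a single valid, fairly strong instrument and linear effects); all names and effect sizes are hypothetical.

```r
set.seed(21)
n <- 20000

Z <- rbinom(n, 2, 0.3)              # genotype: 0, 1, or 2 effect alleles
U <- rnorm(n)                       # unmeasured confounder
E <- 0.5 * Z + U + rnorm(n)         # exposure depends on genotype and U
P <- 0.3 * E + U + rnorm(n)         # true causal effect of E on P is 0.3

# Naive regression is biased upward because U drives both E and P
naive <- unname(coef(lm(P ~ E))["E"])

# Wald ratio: (gene -> outcome) / (gene -> exposure)
beta_ZE    <- unname(coef(lm(E ~ Z))["Z"])
beta_ZP    <- unname(coef(lm(P ~ Z))["Z"])
wald_ratio <- beta_ZP / beta_ZE

c(truth = 0.3, naive = naive, wald_ratio = wald_ratio)
```

The ratio recovers the simulated effect only because the genotype here is independent of U and affects P solely through E; violating either assumption would bias it.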

Mitigation Strategy 3: Multiple Adjustment Models

Already built into nhanespewas (Module 4):

9 adjustment models from unadjusted to fully adjusted
If an association is sensitive to covariate choice, it is more likely confounded
If it is robust across models, residual confounding is less likely (though not ruled out)
The atlas (Module 9) found 15% of associations reversed sign between models — a direct measure of confounding sensitivity
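The sensitivity check can be sketched with a handful of nested models on simulated data. The three formulas below are hypothetical stand-ins and not the nine nhanespewas adjustment models.

```r
set.seed(5)
n <- 3000

age       <- rnorm(n, 45, 12)
income    <- rnorm(n)
exposure  <- 0.03 * age - 0.4 * income + rnorm(n)
phenotype <- 0.10 * exposure + 0.05 * age - 0.5 * income + rnorm(n)
dat <- data.frame(phenotype, exposure, age, income)

# Increasingly adjusted models for the same exposure-phenotype pair
models <- list(
  unadjusted      = phenotype ~ exposure,
  plus_age        = phenotype ~ exposure + age,
  plus_age_income = phenotype ~ exposure + age + income
)

# Collect the exposure coefficient from each model
sensitivity <- do.call(rbind, lapply(names(models), function(m) {
  row <- summary(lm(models[[m]], data = dat))$coefficients["exposure", ]
  data.frame(
    model     = m,
    estimate  = row[["Estimate"]],
    std.error = row[["Std. Error"]]
  )
}))

# Flag models whose sign disagrees with the most-adjusted model
full_sign <- sign(sensitivity$estimate[nrow(sensitivity)])
sensitivity$sign_differs_from_full <- sign(sensitivity$estimate) != full_sign
sensitivity
```

A large drop in the estimate between the unadjusted and adjusted rows is the pattern that suggests the crude association was mostly confounding.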

Mitigation Strategy 4: Data-Driven Confounder Selection

Emerging methods that select confounders per association:

Double/Debiased Machine Learning (DML)

Uses ML (e.g., LASSO, random forest) to flexibly model both the exposure and the outcome as functions of potential confounders
Estimates the causal effect after partialing out the ML-predicted confounding
Does not require pre-specifying which variables are confounders

Targeted Maximum Likelihood Estimation (TMLE)

Semiparametric method that combines ML-based nuisance estimation with targeted bias correction
Provides valid inference even with flexible confounder models

Both approaches allow the confounders to differ per exposure without manually specifying each DAG.
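The partialling-out idea behind DML can be sketched in base R with two-fold cross-fitting. To keep the block dependency-free, lm() stands in for the machine-learning learners; in practice one would substitute LASSO, random forests, or similar. Data, names, and effect sizes are hypothetical, and the final standard error from lm() is only a rough guide.

```r
set.seed(9)
n <- 4000

# Five candidate confounders; only some actually matter
C <- matrix(rnorm(n * 5), ncol = 5,
            dimnames = list(NULL, paste0("c", 1:5)))
E <- 0.5 * C[, 1] - 0.3 * C[, 2] + rnorm(n)
P <- 0.25 * E + 0.7 * C[, 1] + 0.4 * C[, 3] + rnorm(n)  # true effect 0.25
dat <- data.frame(P = P, E = E, C)

# Cross-fitting: fit nuisance models on one fold, residualize the other
folds <- sample(rep(1:2, length.out = n))
res_P <- numeric(n)
res_E <- numeric(n)

for (k in 1:2) {
  train <- dat[folds != k, ]
  test  <- dat[folds == k, ]
  fit_P <- lm(P ~ c1 + c2 + c3 + c4 + c5, data = train)  # outcome model
  fit_E <- lm(E ~ c1 + c2 + c3 + c4 + c5, data = train)  # exposure model
  res_P[folds == k] <- test$P - predict(fit_P, newdata = test)
  res_E[folds == k] <- test$E - predict(fit_E, newdata = test)
}

# Final stage: regress outcome residuals on exposure residuals
dml_fit <- lm(res_P ~ 0 + res_E)
dml_est <- unname(coef(dml_fit)["res_E"])
dml_est
```

Because the nuisance models see all candidate confounders, the analyst does not have to decide in advance which ones matter for this particular exposure.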

Mitigation Strategy 5: Triangulation

No single method eliminates confounding. Triangulation combines evidence from multiple approaches with different biases:

Approach	Bias Profile
ExWAS (observational)	Unmeasured confounding, reverse causation
Mendelian randomization	Pleiotropy, weak instruments
Negative controls	Detects but doesn’t correct bias
Longitudinal studies	Residual confounding, attrition
Cross-population replication	Different confounding structures

If an association survives across approaches with different biases, it is more likely to be real.

Key references:

Munafò MR, Davey Smith G. Robust research needs many lines of evidence. Nature 2018; 553:399-401.
Lawlor DA, Tilling K, Davey Smith G. Triangulation in aetiological epidemiology. Int J Epidemiol 2016; 45(6):1866-1886.
Munafò MR, Higgins JPT, Davey Smith G. Triangulating evidence through the inclusion of genetically informed designs. Cold Spring Harb Perspect Med 2021; 11:a040659.

A Two-Stage Workflow

A practical approach that addresses per-association confounding:

Stage 1 — Screen broadly (ExWAS)

Apply uniform covariate set to hundreds of exposures
Accept that some findings are confounded
Use FDR and replication to reduce false positives

Stage 2 — Investigate deeply (per-hit)

For each top hit, construct an exposure-specific DAG
Apply targeted methods: MR, negative controls, DML/TMLE
Triangulate across methods and populations

This separates discovery (broad, practical) from validation (rigorous, per-association).
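The Stage 1 screening step reduces to a multiple-testing correction over all exposures, after which the surviving hits are passed to Stage 2. The sketch below uses simulated z-scores (180 nulls and 20 true signals, both hypothetical) and the Benjamini-Hochberg procedure from base R.

```r
set.seed(13)
n_null   <- 180
n_signal <- 20

# Simulated test statistics: nulls centered at 0, signals centered at 4
z <- c(rnorm(n_null, mean = 0), rnorm(n_signal, mean = 4))

stage1 <- data.frame(
  exposure = paste0("exposure_", seq_along(z)),
  p.value  = 2 * pnorm(abs(z), lower.tail = FALSE)
)

# Benjamini-Hochberg FDR across all exposures
stage1$q.value <- p.adjust(stage1$p.value, method = "BH")

# Hits carried forward to Stage 2, strongest first
hits <- stage1[stage1$q.value < 0.05, ]
hits <- hits[order(hits$p.value), ]
nrow(hits)
```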

Multi-Omics Integration

The exposome intersects with other -omics layers:

Layer	Data Type	Integration
Genomics	SNPs, PRS	Gene-environment interaction
Transcriptomics	Gene expression	Exposure response signatures
Metabolomics	Metabolite levels	Intermediate biomarkers of exposure
Proteomics	Protein levels	Biomarkers of exposure
Epigenomics	DNA methylation	Exposure memory

Building Shiny Dashboards

Interactive exploration of ExWAS results:

library(shiny)

# A Shiny app could allow users to:
# 1. Select a phenotype
# 2. Choose adjustment models
# 3. View interactive volcano plots
# 4. Explore individual associations
# 5. Compare across NHANES cycles

# See: https://shiny.rstudio.com/

Course Recap

Module	Key Skill
1	Exposome concepts, NHANES structure
2	Tidyverse data manipulation and visualization
3	Survey-weighted regression, FDR, confounding
4	nhanespewas package and database
5	Running a full ExWAS pipeline
6	Interpreting and visualizing results
7	Meta-analysis, interactions, advanced methods

Key Takeaways

The exposome complements the genome in understanding chronic disease
NHANES provides a rich, publicly available resource for ExWAS
Proper statistical methods (survey weighting, FDR, replication) are essential
The nhanespewas package streamlines the analysis workflow
Always consider confounding, reverse causation, and measurement error
Meta-analysis across cycles increases power and assesses heterogeneity
Future directions include causal inference, ML, and multi-omics

Resources

nhanespewas package: https://github.com/chiragjp/nhanespewas
PE Atlas: pe.exposomeatlas.com
NHANES: https://www.cdc.gov/nchs/nhanes/index.htm
Patel et al. 2010: PLoS ONE, ExWAS for T2D
Patel et al. 2016: Scientific Data, NHANES exposome database
Chung et al. 2024: Exposome journal, ExWAS review

Cleaning Up

disconnect_pewas_data(con)

Thank You

Congratulations on completing the ExWAS course!

You now have the skills to:

Design and conduct an ExWAS
Interpret and visualize results
Replicate and validate findings
Apply advanced methods (meta-analysis, interaction testing)
Critically evaluate exposome research

Questions? Comments? Reach out!

Supported By

This course is supported by the National Institutes of Health (NIH):

National Institute of Environmental Health Sciences (NIEHS): R01ES032470, U24ES036819
National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK): R01DK137993