Module 4: The nhanespewas Package

Conducting Exposome-Wide Association Studies

Chirag J Patel

Overview

This module introduces the nhanespewas R package and covers:

  1. Installation from GitHub
  2. Database download from Figshare
  3. Connecting and disconnecting
  4. Exploring the phenotype and exposure catalogs
  5. Retrieving data tables
  6. Survey weight handling
  7. Adjustment models framework
  8. First look at pe_flex_adjust()

Why nhanespewas?

In Modules 1-3, we used a bundled .Rdata file with pre-selected variables.

The nhanespewas package provides access to:

  • All NHANES variables across multiple survey cycles
  • A SQLite database with structured storage
  • Built-in functions for ExWAS with proper survey weighting
  • Flexible adjustment models (9+ covariate scenarios)
  • Meta-analysis and interaction testing tools

Installing nhanespewas

Install from GitHub using devtools:

# Install devtools only if it is not already available
if (!requireNamespace("devtools", quietly = TRUE)) {
  install.packages("devtools")
}

# Install nhanespewas
devtools::install_github("chiragjp/nhanespewas")
library(nhanespewas)

Downloading the Database

The package uses a SQLite database hosted on Figshare:

# Download the database (only need to do this once)
# This will download a ~2GB file to your working directory
download_pewas_data()

The database contains NHANES data organized into tables matching the original NHANES data files.

Connecting to the Database

# Connect to the database
con <- connect_pewas_data()

This returns a DBI connection object to the SQLite database.

Important: Always disconnect when you’re done:

# When finished
disconnect_pewas_data(con)
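
One way to make the disconnect step hard to forget is to wrap the work in a function and register the cleanup with on.exit(), which runs even if an error occurs midway. The sketch below is an illustration, not part of nhanespewas: with_connection() is a hypothetical helper, and the fake connect/disconnect functions only mimic a connection so the example runs without the database.

```r
# Hypothetical helper (not part of nhanespewas): open a connection, run `work`,
# and always close the connection, even if `work` throws an error.
with_connection <- function(connect, disconnect, work) {
  con <- connect()
  on.exit(disconnect(con), add = TRUE)  # cleanup is registered before any work runs
  work(con)
}

# Stand-ins so the sketch runs without the real database.
fake_connect <- function() {
  con <- new.env()
  con$open <- TRUE
  con
}
fake_disconnect <- function(con) {
  con$open <- FALSE
  invisible(TRUE)
}

# Normal use: the value of `work` is returned and the connection is closed.
last_con <- NULL
answer <- with_connection(fake_connect, fake_disconnect, function(con) {
  last_con <<- con
  "query finished"
})

# Failure inside `work`: the error is caught here, and the connection is still closed.
failed_con <- NULL
outcome <- tryCatch(
  with_connection(fake_connect, fake_disconnect, function(con) {
    failed_con <<- con
    stop("query failed")
  }),
  error = function(e) conditionMessage(e)
)
```

With the real package, the call would presumably look like with_connection(connect_pewas_data, disconnect_pewas_data, function(con) get_tables("DEMO", con)), assuming both functions behave as shown on this slide.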

The Phenotype Catalog (p_catalog)

The p_catalog contains metadata about all available phenotype variables:

# View the phenotype catalog
p_catalog

# How many phenotype variables?
nrow(p_catalog)
# What columns does it have?
colnames(p_catalog)

Exploring p_catalog

library(dplyr)  # provides %>% and filter()

# Search for BMI-related variables
p_catalog %>%
  filter(grepl("body mass", var_desc, ignore.case = TRUE))

# Search for glucose
p_catalog %>%
  filter(grepl("glucose", var_desc, ignore.case = TRUE))

The Exposure Catalog (e_catalog)

The e_catalog contains metadata about environmental exposure variables:

# View the exposure catalog
e_catalog

# How many exposure variables?
nrow(e_catalog)

Exploring e_catalog

# Search for lead-related exposures
e_catalog %>%
  filter(grepl("lead", var_desc, ignore.case = TRUE))
# All metals
e_catalog %>%
  filter(grepl("metal|cadmium|mercury|lead|arsenic", var_desc, ignore.case = TRUE))
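
A caveat with substring patterns: grepl("lead", ...) matches any description containing those four letters, so it can pick up unrelated rows. The sketch below uses a small made-up catalog (the rows and the var_name column are illustrative assumptions, not taken from e_catalog) to show how the word-boundary anchor \b narrows the match.

```r
# Toy stand-in for e_catalog; rows are invented for illustration only.
toy_catalog <- data.frame(
  var_name = c("X1", "X2", "X3", "X4"),
  var_desc = c(
    "Blood lead (ug/dL)",
    "Lead, urine (ug/L)",
    "Leading cause questionnaire item",
    "Cadmium, blood (ug/L)"
  ),
  stringsAsFactors = FALSE
)

# Plain substring search: also matches "Leading".
loose <- toy_catalog[grepl("lead", toy_catalog$var_desc, ignore.case = TRUE), ]

# Word-boundary search: matches "lead" only as a whole word.
strict <- toy_catalog[grepl("\\blead\\b", toy_catalog$var_desc, ignore.case = TRUE), ]

loose$var_name
strict$var_name
```

The same pattern can be passed to filter(grepl(...)) in the dplyr pipelines above. It is worth scanning the matched descriptions by eye before using them as an exposure list.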

Finding Tables for a Variable

Each NHANES variable comes from a specific data table. Use get_table_names_for_varname():

# Which tables contain BMI?
get_table_names_for_varname("BMXBMI", con)
# Which tables contain fasting glucose?
get_table_names_for_varname("LBXGLU", con)

Retrieving Data Tables

Use get_tables() to pull actual data from the database:

# Get BMI data from a specific table
bmi_data <- get_tables("BMX", con)
head(bmi_data)
# Get demographic data
demo_data <- get_tables("DEMO", con)
head(demo_data)

Survey Weight Handling

NHANES uses different weights for different subsample components. The package helps identify the correct weight:

# Determine the appropriate weight for a variable
figure_out_weight("LBXBPB")

Common 2-year weight variables include:

  • WTMEC2YR: MEC exam weight (most common)
  • WTSA2YR: Subsample A weight
  • WTSB2YR: Subsample B weight

Multi-Year Weights

When combining multiple NHANES cycles, weights must be adjusted:

# Calculate multi-year weight
figure_out_multiyear_weight("LBXBPB", cycles = c(1, 2))

The multi-year weight is typically the 2-year weight divided by the number of cycles being combined.
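
To make the arithmetic concrete, the sketch below pools two toy cycles and rescales the 2-year weight. The data frame, its values, and the WT_multiyear column name are made up for illustration. The package's figure_out_multiyear_weight() may implement this differently, and NHANES publishes its own 4-year weights for the 1999-2002 cycles.

```r
# Toy pooled sample from two cycles; weights are invented for illustration.
pooled <- data.frame(
  cycle    = c("A", "A", "B", "B"),
  WTMEC2YR = c(10000, 30000, 20000, 20000),
  bmi      = c(24, 30, 27, 22)
)

# Each 2-year weight represents the full population for its own cycle,
# so stacking k cycles would count the population k times. Dividing by k fixes that.
n_cycles <- length(unique(pooled$cycle))
pooled$WT_multiyear <- pooled$WTMEC2YR / n_cycles

# Compare the population totals implied by the two sets of weights.
sum(pooled$WTMEC2YR)      # total with the population counted once per cycle
sum(pooled$WT_multiyear)  # rescaled total

# A constant rescaling leaves weighted means unchanged; totals are what change.
weighted.mean(pooled$bmi, pooled$WT_multiyear)
```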

The Adjustment Models Framework

nhanespewas provides 9 standard adjustment scenarios via adjustment_models:

# View all adjustment models
adjustment_models

Adjustment Model Details

Model  Covariates
1      Unadjusted (exposure only)
2      Age, sex
3      Age, sex, race/ethnicity
4      Age, sex, race/ethnicity, income
5      Age, sex, race/ethnicity, income, BMI
6      Age, sex, race/ethnicity, income, smoking
7      Age, sex, race/ethnicity, income, BMI, smoking
8      Age, sex, race/ethnicity, income + urinary creatinine
9      Age, sex, race/ethnicity, income + dietary variables

Examining a Specific Adjustment Model

# Model 4: age, sex, race, income (our standard model)
adjustment_models[[4]]
# List the covariate names for each model
lapply(adjustment_models, function(x) x$covariates)
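
Conceptually, each adjustment model is a covariate set that is appended to the exposure term in a regression formula. The sketch below shows that idea with base R's reformulate(). The covariate names are standard NHANES demographic variables, but whether adjustment_models stores exactly these names, and how pe_flex_adjust() builds its formula internally, are assumptions here. build_formula() is a hypothetical helper.

```r
# Assumed covariate set resembling Model 4 (age, sex, race/ethnicity, income).
covariates_m4 <- c("RIDAGEYR", "RIAGENDR", "RIDRETH1", "INDFMPIR")

# Hypothetical helper: phenotype ~ exposure + covariates.
build_formula <- function(phenotype, exposure, covariates = character(0)) {
  reformulate(termlabels = c(exposure, covariates), response = phenotype)
}

f_unadjusted <- build_formula("BMXBMI", "LBXBPB")
f_model4     <- build_formula("BMXBMI", "LBXBPB", covariates_m4)

f_unadjusted
f_model4
```

In a real model, coded variables such as RIAGENDR and RIDRETH1 would need factor encoding rather than entering as plain numbers.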

pe_flex_adjust(): The Core Function

pe_flex_adjust() runs a single phenotype-exposure association with flexible adjustment:

pe_flex_adjust(
  phenotype,           # phenotype variable name (string)
  exposure,            # exposure variable name (string)
  adjustment_model,    # which adjustment model to use
  con,                 # database connection
  ...
)

Running Your First Association

# Connect to database
con <- connect_pewas_data()

# Run: BMI ~ Blood Lead, adjusted for age, sex, race, income
result <- pe_flex_adjust(
  phenotype = "BMXBMI",
  exposure = "LBXBPB",
  adjustment_model = adjustment_models[[4]],
  con = con
)

Examining the Result

# The result is a list of models (one per survey cycle)
names(result)

# Look at the first cycle's result
result[[1]]

Extracting Results from the Models List

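library(dplyr)       # %>%, filter(), select()
library(purrr)       # map_dfr()
library(broom)       # tidy(), glance()
library(knitr)       # kable()
library(kableExtra)  # kable_styling()

# Extract tidy coefficients from each cycle
results_tidy <- result %>%
  map_dfr(~ tidy(.x), .id = "cycle")

# Keep only the exposure term and show it as a formatted table
results_tidy %>%
  filter(grepl("LBXBPB", term)) %>%
  select(cycle, term, estimate, std.error, p.value) %>%
  kable(digits = 4) %>% kable_styling()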

Extracting Model Fit Statistics

# Extract model-level statistics
results_glance <- result %>%
  map_dfr(~ glance(.), .id = "cycle")

results_glance %>%
  select(cycle, r.squared, adj.r.squared, nobs) %>%
  kable(digits = 4) %>% kable_styling()

Comparing Adjustment Models

# Run the same association with different adjustment levels
results_by_adj <- map(1:4, function(i) {
  pe_flex_adjust(
    phenotype = "BMXBMI",
    exposure = "LBXBPB",
    adjustment_model = adjustment_models[[i]],
    con = con
  )
})

Sensitivity to Adjustment

# Compare the lead coefficient across adjustment models
sensitivity <- map_dfr(1:4, function(i) {
  results_by_adj[[i]] %>%
    map_dfr(~ tidy(.), .id = "cycle") %>%
    filter(grepl("LBXBPB", term)) %>%
    mutate(adj_model = i)
})

sensitivity %>%
  select(adj_model, cycle, estimate, p.value) %>%
  kable(digits = 4) %>% kable_styling(font_size = 9)
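
Why can the exposure coefficient move as covariates are added? The simulated example below (no NHANES data involved; all variable names and effect sizes are invented) builds an exposure and an outcome that both depend on age, then compares the exposure coefficient with and without age in the model.

```r
set.seed(42)
n <- 2000

# Age drives both the exposure and the outcome, so it confounds their association.
age      <- runif(n, 20, 80)
exposure <- 0.05 * age + rnorm(n)
outcome  <- 0.10 * age + 0.50 * exposure + rnorm(n)

fit_unadjusted <- lm(outcome ~ exposure)
fit_adjusted   <- lm(outcome ~ exposure + age)

# Collect the exposure coefficient from each model, as in the sensitivity table.
comparison <- data.frame(
  adj_model = c("unadjusted", "age-adjusted"),
  estimate  = c(coef(fit_unadjusted)[["exposure"]],
                coef(fit_adjusted)[["exposure"]])
)
comparison
```

The unadjusted estimate absorbs part of the age effect and overstates the simulated exposure effect of 0.50. The age-adjusted estimate should land near it. A large shift between adjustment models in real data is a prompt to look at which added covariate is responsible.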

Checking Exposure Data Type

Before running an ExWAS, check whether each exposure is continuous or categorical:

# Check data type of an exposure
check_e_data_type("LBXBPB", con)  # continuous
check_e_data_type("SMQ020", con)  # categorical (smoking status)

This determines how the exposure enters the regression: as a numeric term (continuous) or as a factor with one coefficient per non-reference level (categorical).
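
The practical difference shows up in the model terms. In the simulated sketch below (invented data; the 1/2/9 coding is only meant to resemble a questionnaire item), treating a coded answer as numeric forces a single slope through the codes, while factor encoding estimates one contrast per non-reference level.

```r
set.seed(1)
n <- 300

# Coded categorical answer: 1 = yes, 2 = no, 9 = don't know (illustrative coding).
code <- sample(c(1, 2, 9), n, replace = TRUE)
y    <- 25 + 2 * (code == 1) + rnorm(n)

fit_numeric <- lm(y ~ code)          # one slope: treats code 9 as "more" than code 2
fit_factor  <- lm(y ~ factor(code))  # one coefficient per non-reference level

names(coef(fit_numeric))
names(coef(fit_factor))
```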

Working with Categorical Exposures

For categorical exposures, the package handles factor encoding:

# Categorical exposure example
result_cat <- pe_flex_adjust(
  phenotype = "BMXBMI",
  exposure = "SMQ020",  # ever smoked
  adjustment_model = adjustment_models[[4]],
  con = con
)

Adding Custom Covariates

You can modify adjustment models to include additional covariates:

# Add BMI as a covariate when studying glucose
custom_model <- adjustment_models[[4]]
custom_model$covariates <- c(custom_model$covariates, "BMXBMI")

result_with_bmi <- pe_flex_adjust(
  phenotype = "LBXGLU",
  exposure = "LBXBPB",
  adjustment_model = custom_model,
  con = con
)

Database Structure Summary

The nhanespewas SQLite database contains:

Component     Description
Data tables   Raw NHANES data by table name and cycle
p_catalog     Phenotype variable metadata
e_catalog     Exposure variable metadata
Demographics  DEMO tables with survey design variables
Weights       Survey weights for each subsample

Best Practices

  1. Always disconnect from the database when done
  2. Check data types before running associations
  3. Use appropriate weights (the package helps identify the correct one)
  4. Start with adjustment model 4 (age, sex, race, income) as the default
  5. Compare across adjustment models to assess sensitivity
  6. Use tryCatch() for error handling when looping over many exposures
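
Item 6 can be sketched as follows. run_one() is a hypothetical stand-in for a call such as pe_flex_adjust(); it fails on purpose for one exposure so the example runs without the database and shows the loop continuing past the error.

```r
# Hypothetical stand-in for a single association; fails for one exposure.
run_one <- function(exposure) {
  if (exposure == "BAD_VAR") stop("no data for exposure: ", exposure)
  list(exposure = exposure, status = "ok")
}

exposures <- c("LBXBPB", "BAD_VAR", "LBXBCD")

# Wrap each call so an error becomes a recorded failure instead of stopping the loop.
results <- lapply(exposures, function(e) {
  tryCatch(
    run_one(e),
    error = function(err) {
      list(exposure = e, status = "failed", message = conditionMessage(err))
    }
  )
})
names(results) <- exposures

# Summarize which exposures succeeded.
statuses <- vapply(results, function(r) r$status, character(1))
statuses
```

Recording the error message alongside the failure makes it possible to review afterwards which exposures were skipped and why, rather than silently dropping them.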

Cleaning Up

# Always disconnect when done
disconnect_pewas_data(con)

Summary

  • nhanespewas provides programmatic access to all NHANES data via SQLite
  • Catalogs (p_catalog, e_catalog) help discover available variables
  • get_tables() retrieves raw data; get_table_names_for_varname() finds table names
  • Survey weights are handled automatically by figure_out_weight()
  • 9 adjustment models provide standardized covariate scenarios
  • pe_flex_adjust() is the core function for running associations

What’s Next?

Module 5: Conducting an ExWAS

  • Building the ExWAS loop over all exposures
  • Error handling with tryCatch
  • FDR correction
  • Replication across NHANES series
  • Parallelization with furrr

Supported By

This course is supported by the National Institutes of Health (NIH):

  • National Institute of Environmental Health Sciences (NIEHS): R01ES032470, U24ES036819
  • National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK): R01DK137993