Module 4: The nhanespewas Package

Conducting Exposome-Wide Association Studies

Chirag J Patel

Overview

This module introduces the nhanespewas R package:

Installation from GitHub
Database download from Figshare
Connecting and disconnecting
Exploring the phenotype and exposure catalogs
Retrieving data tables
Survey weight handling
Adjustment models framework
First look at pe_flex_adjust()

Why nhanespewas?

In Modules 1-3, we used a bundled .Rdata file with pre-selected variables.

The nhanespewas package provides access to:

All NHANES variables across multiple survey cycles
A SQLite database with structured storage
Built-in functions for ExWAS with proper survey weighting
Flexible adjustment models (9+ covariate scenarios)
Meta-analysis and interaction testing tools

Installing nhanespewas

Install from GitHub using devtools:

# Install devtools only if it is not already available
if (!requireNamespace("devtools", quietly = TRUE)) {
  install.packages("devtools")
}

# Install nhanespewas
devtools::install_github("chiragjp/nhanespewas")

library(nhanespewas)

Downloading the Database

The package uses a SQLite database hosted on Figshare:

# Download the database (only need to do this once)
# This will download a ~2GB file to your working directory
download_pewas_data()

The database contains NHANES data organized into tables matching the original NHANES data files.

Connecting to the Database

# Connect to the database
con <- connect_pewas_data()

This returns a DBI connection object to the SQLite database.

Important: Always disconnect when you’re done:

# When finished
disconnect_pewas_data(con)
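
One way to make sure the connection is closed even when an error interrupts the analysis is to wrap the work in a function and register the disconnect with on.exit(). The sketch below uses an in-memory SQLite database through DBI and RSQLite as a stand-in for the nhanespewas database. with_toy_connection() is a hypothetical helper for illustration, not part of the package.

```r
library(DBI)

# Hypothetical helper: open a connection, run work(con), always disconnect.
with_toy_connection <- function(work) {
  # In-memory SQLite database; a stand-in for connect_pewas_data()
  con <- dbConnect(RSQLite::SQLite(), ":memory:")
  # Registered now, executed when the function exits, even after an error
  on.exit(dbDisconnect(con), add = TRUE)
  work(con)
}

# Usage: write a tiny made-up table, then list the tables in the database
tables <- with_toy_connection(function(con) {
  dbWriteTable(con, "BMX", data.frame(SEQN = 1:3, BMXBMI = c(22.1, 27.4, 31.0)))
  dbListTables(con)
})
tables
```

The same pattern should carry over to the real database by swapping in connect_pewas_data() and disconnect_pewas_data(con).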

The Phenotype Catalog (p_catalog)

The p_catalog contains metadata about all available phenotype variables:

# View the phenotype catalog
p_catalog

# How many phenotype variables?
nrow(p_catalog)

# What columns does it have?
colnames(p_catalog)

Exploring p_catalog

library(dplyr)  # provides %>% and filter()

# Search for BMI-related variables
p_catalog %>%
  filter(grepl("body mass", var_desc, ignore.case = TRUE))

# Search for glucose
p_catalog %>%
  filter(grepl("glucose", var_desc, ignore.case = TRUE))

The Exposure Catalog (e_catalog)

The e_catalog contains metadata about environmental exposure variables:

# View the exposure catalog
e_catalog

# How many exposure variables?
nrow(e_catalog)

Exploring e_catalog

# Search for lead-related exposures
e_catalog %>%
  filter(grepl("lead", var_desc, ignore.case = TRUE))

# All metals
e_catalog %>%
  filter(grepl("metal|cadmium|mercury|lead|arsenic", var_desc, ignore.case = TRUE))
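
A short pattern such as "lead" also matches longer words that contain it. The base R sketch below shows the difference a word boundary makes, using a toy data frame in place of e_catalog. The var_name column and the "XXXLEAD" row are assumptions made up for this illustration.

```r
# Toy stand-in for e_catalog; the real catalog has more columns and rows.
# "XXXLEAD" is a made-up entry included only to show a false match.
toy_catalog <- data.frame(
  var_name = c("LBXBPB", "LBXBCD", "LBXTHG", "XXXLEAD"),
  var_desc = c("Lead, blood (ug/dL)", "Cadmium, blood (ug/L)",
               "Mercury, total, blood (ug/L)", "Leading cause questionnaire item"),
  stringsAsFactors = FALSE
)

# A plain "lead" pattern also matches "Leading"
loose <- toy_catalog[grepl("lead", toy_catalog$var_desc, ignore.case = TRUE), ]

# \\b marks a word boundary, so only the standalone word "lead" matches
strict <- toy_catalog[grepl("\\blead\\b", toy_catalog$var_desc, ignore.case = TRUE), ]

loose$var_name
strict$var_name
```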

Finding Tables for a Variable

Each NHANES variable comes from a specific data table. Use get_table_names_for_varname():

# Which tables contain BMI?
get_table_names_for_varname("BMXBMI", con)

# Which tables contain fasting glucose?
get_table_names_for_varname("LBXGLU", con)

Retrieving Data Tables

Use get_tables() to pull actual data from the database:

# Get BMI data from a specific table
bmi_data <- get_tables("BMX", con)
head(bmi_data)

# Get demographic data
demo_data <- get_tables("DEMO", con)
head(demo_data)
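
NHANES tables share the respondent sequence number SEQN, so tables retrieved separately can be combined by joining on it. A minimal base R sketch with made-up values follows. The toy data frames only imitate what get_tables() might return.

```r
# Toy stand-ins for retrieved tables (all values are made up)
demo_toy <- data.frame(SEQN = c(1, 2, 3, 4), RIDAGEYR = c(34, 51, 67, 45))
bmx_toy  <- data.frame(SEQN = c(2, 3, 4, 5), BMXBMI = c(27.4, 31.0, 24.8, 22.0))

# Inner join on SEQN keeps only participants present in both tables
merged <- merge(demo_toy, bmx_toy, by = "SEQN")
merged
```

An inner join drops participants who lack a body measures record, which is usually what a complete-case analysis needs.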

Survey Weight Handling

NHANES uses different weights for different subsample components. The package helps identify the correct weight:

# Determine the appropriate weight for a variable
figure_out_weight("LBXBPB")

Commonly used weight variables include:

WTMEC2YR: MEC exam weight (most common)
WTSA2YR: Subsample A weight
WTSB2YR: Subsample B weight
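
To show where such a weight ends up, here is a sketch of a survey-weighted regression with the survey package on simulated data. SDMVPSU and SDMVSTRA are the NHANES design variables for primary sampling unit and stratum. The data, the model, and the assumption that the package fits models along these lines are illustrative only.

```r
library(survey)

set.seed(1)
n <- 200

# Simulated stand-in for an NHANES analysis table (all values are made up)
toy <- data.frame(
  SDMVSTRA = rep(1:4, each = 50),           # 4 strata
  SDMVPSU  = rep(rep(1:2, each = 25), 4),   # 2 PSUs nested within each stratum
  WTMEC2YR = runif(n, 5000, 50000),         # MEC exam weight
  LBXBPB   = rlnorm(n, 0, 0.5),             # blood lead
  RIDAGEYR = sample(20:80, n, replace = TRUE)
)
toy$BMXBMI <- 25 + 0.02 * toy$RIDAGEYR + rnorm(n, sd = 4)

# Declare the complex design: clusters, strata, and sampling weights
dsn <- svydesign(ids = ~SDMVPSU, strata = ~SDMVSTRA,
                 weights = ~WTMEC2YR, nest = TRUE, data = toy)

# Weighted linear regression with design-based standard errors
fit <- svyglm(BMXBMI ~ log(LBXBPB) + RIDAGEYR, design = dsn)
coef(summary(fit))
```

Choosing the wrong weight column here (for example the MEC weight for a subsample measurement) would change both the estimates and their standard errors.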

Multi-Year Weights

When combining multiple NHANES cycles, weights must be adjusted:

# Calculate multi-year weight
figure_out_multiyear_weight("LBXBPB", cycles = c(1, 2))

The multi-year weight is typically the 2-year weight divided by the number of cycles being combined, so pooling two cycles halves each participant's 2-year weight.
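
As a worked example of that arithmetic, with made-up weights and assuming the simple divide-by-cycles rule applies:

```r
# Made-up 2-year MEC weights for three participants
wt_2yr <- c(12000, 30500, 8200)
n_cycles <- 2  # two 2-year cycles pooled

# Each participant's weight is scaled down by the number of pooled cycles
wt_multi <- wt_2yr / n_cycles
wt_multi
```

Without the rescaling, the pooled sample would appear to represent the population once per cycle.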

The Adjustment Models Framework

nhanespewas provides 9 standard adjustment scenarios via adjustment_models:

# View all adjustment models
adjustment_models

Adjustment Model Details

Model	Covariates
1	Unadjusted (exposure only)
2	Age, sex
3	Age, sex, race/ethnicity
4	Age, sex, race/ethnicity, income
5	Age, sex, race/ethnicity, income, BMI
6	Age, sex, race/ethnicity, income, smoking
7	Age, sex, race/ethnicity, income, BMI, smoking
8	Age, sex, race/ethnicity, income, urinary creatinine
9	Age, sex, race/ethnicity, income, dietary variables

Examining a Specific Adjustment Model

# Model 4: age, sex, race, income (our standard model)
adjustment_models[[4]]

# List the covariate names for each model
lapply(adjustment_models, function(x) x$covariates)
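
A covariate vector like this maps naturally onto a model formula. The sketch below builds one with base R's reformulate(). The covariate names are assumed NHANES variables for age, sex, race/ethnicity, and income-to-poverty ratio, and may not match what adjustment_models actually stores.

```r
# Assumed covariate names for a model like "age, sex, race/ethnicity, income"
covariates <- c("RIDAGEYR", "RIAGENDR", "RIDRETH1", "INDFMPIR")

# reformulate() builds: response ~ term1 + term2 + ...
fml <- reformulate(termlabels = c("LBXBPB", covariates), response = "BMXBMI")
fml
```

Building the formula from a vector makes it easy to swap adjustment sets without retyping the model.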

pe_flex_adjust(): The Core Function

pe_flex_adjust() runs a single phenotype-exposure association with flexible adjustment:

pe_flex_adjust(
  phenotype,           # phenotype variable name (string)
  exposure,            # exposure variable name (string)
  adjustment_model,    # which adjustment model to use
  con,                 # database connection
  ...
)

Running Your First Association

# Connect to database
con <- connect_pewas_data()

# Run: BMI ~ Blood Lead, adjusted for age, sex, race, income
result <- pe_flex_adjust(
  phenotype = "BMXBMI",
  exposure = "LBXBPB",
  adjustment_model = adjustment_models[[4]],
  con = con
)

Examining the Result

# The result is a list of models (one per survey cycle)
names(result)

# Look at the first cycle's result
result[[1]]

Extracting Results from the Models List

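library(dplyr)       # filter(), select(), %>%
library(purrr)       # map_dfr()
library(broom)       # tidy(), glance()
library(knitr)       # kable()
library(kableExtra)  # kable_styling()

# Extract tidy coefficients from each cycle
results_tidy <- result %>%
  map_dfr(~ tidy(.), .id = "cycle")

results_tidy %>%
  filter(grepl("LBXBPB", term)) %>%
  select(cycle, term, estimate, std.error, p.value) %>%
  kable(digits = 4) %>% kable_styling()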
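
The pattern here is "one model per cycle in, one row per cycle out". The base R sketch below reproduces it on simulated data with ordinary lm() fits, so it runs without the database. The cycle labels and effect sizes are made up, and the real result may hold survey-weighted models instead.

```r
set.seed(42)

# Simulate one small dataset per "cycle" and fit a model to each
cycles <- c("A", "B", "C")
fits <- lapply(cycles, function(cy) {
  d <- data.frame(LBXBPB = rlnorm(100, 0, 0.5))
  d$BMXBMI <- 27 - 0.5 * d$LBXBPB + rnorm(100, sd = 4)
  lm(BMXBMI ~ LBXBPB, data = d)
})
names(fits) <- cycles

# Pull the exposure row out of each coefficient table and stack the rows
exposure_rows <- do.call(rbind, lapply(names(fits), function(cy) {
  co <- coef(summary(fits[[cy]]))["LBXBPB", ]
  data.frame(cycle = cy,
             estimate = co[["Estimate"]],
             std.error = co[["Std. Error"]],
             p.value = co[["Pr(>|t|)"]])
}))
exposure_rows
```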

Extracting Model Fit Statistics

# Extract model-level statistics
results_glance <- result %>%
  map_dfr(~ glance(.), .id = "cycle")

results_glance %>%
  select(cycle, r.squared, adj.r.squared, nobs) %>%
  kable(digits = 4) %>% kable_styling()

Comparing Adjustment Models

# Run the same association with different adjustment levels
results_by_adj <- map(1:4, function(i) {
  pe_flex_adjust(
    phenotype = "BMXBMI",
    exposure = "LBXBPB",
    adjustment_model = adjustment_models[[i]],
    con = con
  )
})

Sensitivity to Adjustment

# Compare the lead coefficient across adjustment models
sensitivity <- map_dfr(1:4, function(i) {
  results_by_adj[[i]] %>%
    map_dfr(~ tidy(.), .id = "cycle") %>%
    filter(grepl("LBXBPB", term)) %>%
    mutate(adj_model = i)
})

sensitivity %>%
  select(adj_model, cycle, estimate, p.value) %>%
  kable(digits = 4) %>% kable_styling(font_size = 9)

Checking Exposure Data Type

Before running an ExWAS, check whether each exposure is continuous or categorical:

# Check data type of an exposure
check_e_data_type("LBXBPB", con)  # continuous
check_e_data_type("SMQ020", con)  # categorical (smoking status)

This determines whether the exposure enters the regression as a numeric term (continuous) or as a factor with one coefficient per non-reference level (categorical).
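
To make the distinction concrete, here is a sketch of one possible heuristic: treat a numeric variable with only a handful of distinct values as categorical. guess_type() and its max_levels threshold are hypothetical and are not the rule check_e_data_type() applies.

```r
# Hypothetical heuristic, not the package's rule:
# numeric with many distinct values => continuous, otherwise categorical
guess_type <- function(x, max_levels = 10) {
  n_distinct <- length(unique(x[!is.na(x)]))
  if (is.numeric(x) && n_distinct > max_levels) "continuous" else "categorical"
}

set.seed(7)
blood_lead  <- rlnorm(200, meanlog = 0, sdlog = 0.6)   # many distinct values
ever_smoked <- sample(c(1, 2), 200, replace = TRUE)    # coded 1 = yes, 2 = no

guess_type(blood_lead)
guess_type(ever_smoked)

# A categorical exposure is converted to a factor before modeling
levels(factor(ever_smoked))
```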

Working with Categorical Exposures

For categorical exposures, the package handles factor encoding:

# Categorical exposure example
result_cat <- pe_flex_adjust(
  phenotype = "BMXBMI",
  exposure = "SMQ020",  # ever smoked
  adjustment_model = adjustment_models[[4]],
  con = con
)

Adding Custom Covariates

You can modify adjustment models to include additional covariates:

# Add BMI as a covariate when studying glucose
custom_model <- adjustment_models[[4]]
custom_model$covariates <- c(custom_model$covariates, "BMXBMI")

result_with_bmi <- pe_flex_adjust(
  phenotype = "LBXGLU",
  exposure = "LBXBPB",
  adjustment_model = custom_model,
  con = con
)

Database Structure Summary

The nhanespewas SQLite database contains:

Component	Description
Data tables	Raw NHANES data by table name and cycle
p_catalog	Phenotype variable metadata
e_catalog	Exposure variable metadata
Demographics	DEMO tables with survey design variables
Weights	Survey weights for each subsample

Best Practices

Always disconnect from the database when done
Check data types before running associations
Use appropriate weights; the package handles this for you
Start with adjustment model 4 (age, sex, race, income) as the default
Compare across adjustment models to assess sensitivity
Use tryCatch() for error handling when looping over many exposures
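
The last practice can be sketched in base R as follows. run_one() is a hypothetical stand-in for a per-exposure analysis that fails for one variable, and its "estimate" is a dummy value, so the example runs without the database.

```r
# Hypothetical stand-in for a per-exposure analysis that sometimes fails
run_one <- function(exposure) {
  if (exposure == "BAD_VAR") stop("no data for ", exposure)
  data.frame(exposure = exposure, estimate = nchar(exposure) / 10)  # dummy value
}

exposures <- c("LBXBPB", "BAD_VAR", "LBXBCD")

results <- lapply(exposures, function(e) {
  tryCatch(
    run_one(e),
    error = function(err) {
      message("Skipping ", e, ": ", conditionMessage(err))
      NULL  # return NULL so the loop continues with the next exposure
    }
  )
})

# Drop the failures and stack what succeeded
ok <- do.call(rbind, Filter(Negate(is.null), results))
ok
```

Logging the failure message rather than silently discarding it makes it possible to review later which exposures were skipped and why.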

Cleaning Up

# Always disconnect when done
disconnect_pewas_data(con)

Summary

nhanespewas provides programmatic access to all NHANES data via SQLite
Catalogs (p_catalog, e_catalog) help discover available variables
get_tables() retrieves raw data; get_table_names_for_varname() finds table names
Survey weights are handled automatically by figure_out_weight()
9 adjustment models provide standardized covariate scenarios
pe_flex_adjust() is the core function for running associations

What’s Next?

Module 5: Conducting an ExWAS

Building the ExWAS loop over all exposures
Error handling with tryCatch
FDR correction
Replication across NHANES series
Parallelization with furrr

Supported By

This course is supported by the National Institutes of Health (NIH):

National Institute of Environmental Health Sciences (NIEHS): R01ES032470, U24ES036819
National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK): R01DK137993