Module 9: Untargeted Exposomics with LC-HRMS

Conducting Exposome-Wide Association Studies

Chirag J Patel

Overview

This module covers the next frontier of exposome measurement:

Limitations of targeted exposure panels
What is LC-HRMS?
What can it measure?
Untargeted vs. targeted exposomics
The annotation challenge
Connecting LC-HRMS to the ExWAS framework
Statistical considerations at massive scale
Open challenges and future directions

The Measurement Problem

Throughout this course, we have worked with targeted NHANES exposure panels:

~619 exposures across 10 waves
Each analyte requires a pre-specified assay
Only measures what we already know to look for

But the true chemical exposome is vastly larger:

Estimated tens of thousands of chemicals in commercial use
Endogenous metabolites, dietary compounds, drugs, microbiome products
Transformation products and reactive intermediates
Novel chemicals not yet cataloged

How do we measure what we don’t know to look for?

What is LC-HRMS?

Liquid Chromatography-High Resolution Mass Spectrometry couples two technologies:

Liquid Chromatography (LC)

Separates chemicals in a biological sample (blood, urine) by passing them through a column
Different compounds elute at different retention times based on their physicochemical properties
Common column chemistries: reverse-phase C18, anion exchange (AE), HILIC

High-Resolution Mass Spectrometry (HRMS)

Ionizes eluting molecules and measures their mass-to-charge ratio (m/z) with very high accuracy (<5 ppm mass error)
Common instruments: Orbitrap, quadrupole time-of-flight (Q-TOF)
High resolution distinguishes chemicals that share the same nominal mass but differ in exact mass
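The <5 ppm criterion can be made concrete with a short calculation. This is a minimal sketch: the helper name ppm_error and the observed m/z are assumptions for illustration; the theoretical value is the monoisotopic m/z of protonated caffeine ([M+H]+, C8H10N4O2).

```r
# Mass error in parts per million (ppm) between an observed and a theoretical m/z.
# `ppm_error` is a hypothetical helper name, not part of any package.
ppm_error <- function(observed_mz, theoretical_mz) {
  (observed_mz - theoretical_mz) / theoretical_mz * 1e6
}

theoretical_mz <- 195.0877   # caffeine [M+H]+ (monoisotopic)
observed_mz    <- 195.0882   # illustrative instrument reading (assumed)

err <- ppm_error(observed_mz, theoretical_mz)
abs(err) < 5                 # TRUE: within a 5 ppm tolerance
```

At m/z 195, a 5 ppm window is only about 0.001 Da wide, which is what lets high-resolution instruments separate ions that a unit-resolution instrument would merge.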

How LC-HRMS Works

The workflow for a single sample:

Extract chemicals from a biological specimen (blood, serum, urine)
Inject extract into the LC column
Separate compounds by retention time (minutes)
Ionize eluting molecules (electrospray ionization)
Measure accurate mass (m/z) at high resolution
Detect thousands of molecular features (m/z-retention time pairs)

A single LC-HRMS run detects 5,000-20,000+ features per sample.

Each feature is a potential chemical signal — an endogenous metabolite, an exogenous exposure, a drug, or a transformation product.
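The output of this workflow is usually handled as a samples-by-features intensity table. The sketch below builds a toy version; every m/z, retention time, and intensity is made up, and the "F" prefix on the feature IDs is an assumed convention that keeps the IDs valid as R column names.

```r
# Toy peak list: one row per detected feature (values are illustrative only)
peaks <- data.frame(
  mz     = c(195.0877, 177.1022, 180.0655),
  rt_min = c(5.32, 3.10, 4.75)
)

# Feature IDs of the form "F<m/z>_<RT>"; the prefix makes them syntactic R names
feature_id <- sprintf("F%.4f_%.2f", peaks$mz, peaks$rt_min)

# Simulated intensity matrix: rows are samples, columns are features
set.seed(1)
n_samples <- 4
intensity <- matrix(
  rlnorm(n_samples * nrow(peaks), meanlog = 10, sdlog = 1),
  nrow = n_samples,
  dimnames = list(paste0("sample", seq_len(n_samples)), feature_id)
)

dim(intensity)   # samples x features
```

A real run would have thousands of columns, but the shape is the same, and it is this matrix that feeds the ExWAS loop later in the module.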

What Can LC-HRMS Measure?

Category	Examples
Endogenous metabolites	Amino acids, lipids, bile acids, acylcarnitines, organic acids, steroids
Exogenous chemicals	Pesticides, plasticizers (phthalates, BPA), PFAS, flame retardants, PAHs, heavy metal complexes
Drugs and their metabolites	Pharmaceuticals, phase I/II metabolites (glucuronides, sulfates)
Dietary compounds	Polyphenols, phytochemicals, food additives, caffeine metabolites
Microbiome-derived	TMAO, hippuric acid, p-cresol sulfate, indoles, secondary bile acids
Adducts	HSA-Cys34 adducts from reactive electrophiles (oxidative stress, pollutants)

The key insight: LC-HRMS captures both the internal chemical environment (endogenous) and external exposures (exogenous) simultaneously.

The Blood Exposome

Rappaport et al. (2014) defined the blood exposome as the totality of chemicals circulating in blood from both endogenous and exogenous sources:

Chemicals enter the blood from external sources (air, water, diet, drugs, occupation)
Chemicals also arise from endogenous processes (inflammation, oxidative stress, lipid peroxidation, gut microbiome)
The blood integrates all sources into a single measurable compartment
LC-HRMS can profile this integrated signal in a single analytical run

This “top-down” approach (measure what’s in the blood) complements the “bottom-up” approach (measure every external source) used in traditional exposure assessment.

Reference: Rappaport SM, Barupal DK, Wishart D, Vineis P, Scalbert A. The blood exposome and its role in discovering causes of disease. Environ Health Perspect 2014; 122(8):769-774.

Untargeted vs. Targeted: A Comparison

Dimension	Targeted (e.g., NHANES)	Untargeted LC-HRMS
Coverage	Hundreds of pre-specified chemicals	Thousands of features (known + unknown)
Selection	Must know what to measure a priori	Agnostic, discovery-based
Quantification	Absolute (ng/mL) with reference standards	Semi-quantitative (relative intensity)
Sensitivity	Very high for targeted analytes (ppb-ppt)	Lower for trace xenobiotics
Discovery	Limited to known chemicals	Can find novel/unexpected exposures
Annotation	Known identity	~80-95% of features are unannotated
Sample volume	Large volumes for full panel	Small volumes (< 100 \(\mu\)L)
Cost	Expensive per analyte	Cost-effective per feature

Targeted and untargeted approaches are complementary, not competing.

The Annotation Challenge

The single biggest bottleneck in untargeted exposomics:

Typically only ~5-20% of detected features are confidently annotated.

The Schymanski confidence levels provide a standardized framework:

Level	Confidence	Evidence Required
1	Confirmed	Reference standard match (RT + MS + MS/MS)
2	Probable	Library MS/MS spectral match
3	Tentative	Molecular formula, partial structural evidence
4	Formula	Unequivocal molecular formula only
5	Mass	Exact mass (m/z) only

Most features in an untargeted run are Level 4-5 — the “dark matter” of the exposome.

Reference: Schymanski EL, et al. Identifying small molecules via high resolution mass spectrometry: communicating confidence. Environ Sci Technol 2014; 48(4):2097-2098.
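To see why Level 5 is such weak evidence, consider what an exact-mass lookup actually does. The sketch below matches an observed m/z against a tiny, hypothetical reference list within a ppm tolerance. The function name match_by_mass, the three-compound list, and the observed values are all assumptions for illustration; a hit is only a candidate, since isomers and other formulas can fall inside the same window.

```r
# Hypothetical mini reference list of [M+H]+ ions (monoisotopic m/z)
reference <- data.frame(
  name = c("caffeine", "cotinine", "hippuric acid"),
  mz   = c(195.0877, 177.1022, 180.0655),
  stringsAsFactors = FALSE
)

# Return the names of reference ions within `tol_ppm` of an observed m/z
match_by_mass <- function(observed_mz, reference, tol_ppm = 5) {
  ppm <- abs(observed_mz - reference$mz) / reference$mz * 1e6
  reference$name[ppm <= tol_ppm]
}

match_by_mass(177.1025, reference)   # "cotinine" (candidate only)
match_by_mass(250.1234, reference)   # character(0): remains an unknown
```

Moving a candidate like this up to Level 2 or Level 1 requires MS/MS library matching or a reference standard, as in the table above.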

Why Dark Matter Matters

The ~80-95% of unannotated features are a mix of:

Truly novel chemicals not in any database
Known chemicals missing from spectral libraries
Transformation products (metabolites of metabolites)
Adducts and in-source fragments (analytical artifacts)
Informatic noise (false peaks from feature detection algorithms)

This means that an LC-HRMS-based ExWAS may find strong associations with features we cannot yet identify.

Strategies to reduce dark matter:

Expand spectral libraries (MassBank, HMDB, METLIN, mzCloud)
Improve computational annotation (SIRIUS, MS-DIAL, GNPS)
Use molecular networking to group related unknowns
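Molecular networking links features whose MS/MS spectra look alike, so that one annotated member can inform its unknown neighbors. The sketch below shows the basic idea with a plain cosine similarity on binned spectra. It is a simplification: the function names, bin width, and fragment values are assumptions, and dedicated tools use more elaborate scoring than this.

```r
# Sum intensities into fixed-width m/z bins (bin width and range are assumptions)
bin_spectrum <- function(mz, intensity, bin_width = 0.01, mz_max = 500) {
  bins <- numeric(round(mz_max / bin_width) + 1)
  idx <- round(mz / bin_width) + 1
  for (i in seq_along(idx)) bins[idx[i]] <- bins[idx[i]] + intensity[i]
  bins
}

# Cosine similarity between two binned spectra (1 = identical pattern, 0 = no overlap)
cosine_similarity <- function(a, b) {
  sum(a * b) / sqrt(sum(a^2) * sum(b^2))
}

# Made-up fragment spectra: B shares two of three fragments with A, C shares none
spec_a <- bin_spectrum(c(110.07, 138.07, 195.09), c(30, 100, 60))
spec_b <- bin_spectrum(c(110.07, 138.07, 181.07), c(25, 100, 40))
spec_c <- bin_spectrum(c(91.05, 119.05), c(100, 50))

cosine_similarity(spec_a, spec_b)   # high: candidates for the same cluster
cosine_similarity(spec_a, spec_c)   # 0: no shared fragments
```

Pairs scoring above a chosen cutoff become edges in the network; the cutoff itself is a tuning choice, not something fixed by the method.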

Connecting LC-HRMS to ExWAS

The ExWAS framework from Modules 4-5 extends naturally to untargeted data:

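# Conceptual LC-HRMS ExWAS
# Instead of ~619 targeted NHANES exposures,
# we now have ~10,000 LC-HRMS features
library(survey)  # svyglm()
library(dplyr)   # filter(), mutate(), %>%
library(purrr)   # map_dfr()
library(broom)   # tidy()

# Assumes `dsn` is a survey design whose data contain `phenotype`,
# the covariates, and one column per LC-HRMS feature
features <- colnames(lchrms_matrix)  # m/z_RT feature IDs
covariates <- c("age", "sex")        # placeholder adjustment set

results <- map_dfr(features, function(feat) {
  tryCatch({
    # Backticks keep non-syntactic IDs (e.g., "195.0877_5.32") valid in a formula
    rhs <- paste(c(paste0("scale(`", feat, "`)"), covariates), collapse = " + ")
    mod <- svyglm(as.formula(paste("scale(phenotype) ~", rhs)), design = dsn)
    # fixed = TRUE matches the feature ID literally rather than as a regex
    tidy(mod) %>% filter(grepl(feat, term, fixed = TRUE)) %>%
      mutate(feature = feat)
  }, error = function(e) NULL)
})

# Apply FDR
results <- results %>%
  mutate(fdr = p.adjust(p.value, method = "BH"))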

The same statistical machinery applies — but the scale changes dramatically.

The Scale Problem

Setting	Features tested	Bonferroni threshold
NHANES ExWAS (Module 5)	~619	\(8 \times 10^{-5}\)
LC-HRMS ExWAS	~10,000	\(5 \times 10^{-6}\)
LC-HRMS \(\times\) multiple phenotypes	~100,000+	\(5 \times 10^{-7}\)

With 10,000 features:

At an unadjusted \(\alpha = 0.05\), expect about 500 false positives by chance if no feature is truly associated
Many features are highly correlated (same compound, different adducts; co-eluting chemicals; shared pathways)
Effective number of independent tests is smaller than the total — but hard to estimate
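The thresholds in the table above and the expected false-positive count follow from simple arithmetic, reproduced here as a check. The vector names are labels chosen for this sketch.

```r
# Number of tests in each setting from the table above
n_tests <- c(nhanes = 619, lchrms = 1e4, lchrms_multi_pheno = 1e5)
alpha <- 0.05

# Bonferroni threshold: alpha divided by the number of tests
bonferroni <- alpha / n_tests
signif(bonferroni, 1)      # 8e-05, 5e-06, 5e-07

# Expected false positives at unadjusted alpha if every null hypothesis were true
expected_fp <- alpha * n_tests
expected_fp[["lchrms"]]    # 500
```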

Statistical Considerations

Key differences from targeted ExWAS:

1. Feature correlation is extreme

In-source fragments, adducts ([M+H]+, [M+Na]+, [M+K]+), and isotope peaks of the same compound are almost perfectly correlated
Co-eluting compounds from the same biological pathway are highly correlated
Must account for this when interpreting significance
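One way to account for this redundancy is to estimate an effective number of independent tests from the eigenvalues of the feature correlation matrix. The sketch below uses the Li and Ji style heuristic from the genetics literature on simulated data; it is one option among several, not a method prescribed by this course, and the simulation settings and function name are assumptions.

```r
set.seed(42)
n        <- 200   # samples
n_latent <- 5     # simulated independent "compounds"
copies   <- 4     # adducts/isotopes/fragments per compound

latent <- matrix(rnorm(n * n_latent), nrow = n)
# Each compound appears as several highly correlated features plus small noise
x <- latent[, rep(seq_len(n_latent), each = copies)] +
  matrix(rnorm(n * n_latent * copies, sd = 0.1), nrow = n)

# Eigenvalue-based effective number of tests:
# each eigenvalue contributes I(ev >= 1) plus its fractional part
effective_tests <- function(x) {
  ev <- abs(eigen(cor(x), symmetric = TRUE, only.values = TRUE)$values)
  sum(as.numeric(ev >= 1) + (ev - floor(ev)))
}

m_eff <- effective_tests(x)
c(features = ncol(x), effective = m_eff)
0.05 / m_eff   # Bonferroni-style threshold using the effective number of tests
```

With 20 features generated from 5 underlying compounds, the estimate falls well below 20. On real LC-HRMS data with more features than samples, the correlation matrix is rank-deficient and such estimates should be treated as rough guides.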

2. Semi-quantitative data

Relative intensities, not absolute concentrations
Log-transformation and scaling (Module 3) still apply
But comparability across studies requires careful normalization

3. Missing data patterns differ

Features below the limit of detection are common
Missingness may be informative (low abundance \(\neq\) zero exposure)
Imputation strategies matter more than in targeted panels
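Points 2 and 3 can be sketched together on a toy matrix. Half-minimum imputation is used here only because it is simple; it is an assumed choice for illustration, not a recommendation, and the intensities and the helper name impute_half_min are made up.

```r
# Toy intensity matrix (samples x features); NA marks a non-detect
intensity <- matrix(
  c(1200,  NA, 5400,
     980, 310, 6100,
      NA, 290, 5800,
    1500, 450,   NA),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("s", 1:4), c("F1", "F2", "F3"))
)

# Half-minimum imputation: replace non-detects with half the smallest observed value
impute_half_min <- function(x) {
  x[is.na(x)] <- min(x, na.rm = TRUE) / 2
  x
}

imputed   <- apply(intensity, 2, impute_half_min)
processed <- scale(log2(imputed))   # log-transform, then center and scale each feature

anyNA(processed)                    # FALSE
round(colMeans(processed), 10)      # each feature now has mean 0
```

Because the imputed value is tied to each feature's own minimum, a non-detect is treated as low but nonzero, in line with the point that low abundance does not mean zero exposure.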

A Practical Workflow

Integrating LC-HRMS with the ExWAS pipeline:

Step 1: Pre-processing

Peak detection and alignment (e.g., XCMS, MS-DIAL, MZmine)
Blank subtraction, batch correction, normalization
Filter to reproducible features (present in >50% of QC samples)
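The QC detection filter in the last bullet can be sketched as follows. The QC matrix is simulated and the feature names are placeholders; in practice the matrix would come from the peak-picking software.

```r
set.seed(7)
n_qc       <- 10
n_features <- 6

# Simulated pooled-QC intensities (rows are QC injections, columns are features)
qc <- matrix(
  rlnorm(n_qc * n_features, meanlog = 8),
  nrow = n_qc,
  dimnames = list(NULL, paste0("F", seq_len(n_features)))
)

# Simulate two poorly detected features
qc[1:8, "F2"] <- NA   # detected in 20% of QC injections
qc[1:6, "F5"] <- NA   # detected in 40% of QC injections

# Keep features detected in more than 50% of QC injections
detection_rate <- colMeans(!is.na(qc))
keep <- names(detection_rate)[detection_rate > 0.5]
keep   # "F1" "F3" "F4" "F6"
```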

Step 2: ExWAS scan

Log-transform, scale, apply ExWAS loop (Modules 3-5)
FDR correction across all features

Step 3: Annotation of hits

Prioritize significant features for identification
Match to spectral databases (Level 1-3 annotation)
Use molecular networking to group related unknowns

Step 4: Validation

Replication in independent samples
Targeted confirmation of top hits with reference standards
Triangulation with targeted NHANES data where available

Batch Effects: A Critical Challenge

Large-scale LC-HRMS studies run samples across multiple batches:

Signal drift within a batch (instrument sensitivity changes over hours)
Batch-to-batch variation (column aging, solvent lots, temperature)
Long-term drift (months between batches in a large cohort)

These systematic variations can confound biological signals.

Mitigation strategies:

Pooled QC samples run every 10-20 samples to monitor drift
Reference standardization (Go, Walker et al. 2015)
Signal normalization: median fold change, LOESS regression, ComBat
Randomized run order to decouple batch from biological variables

Reference: Go YM, Walker DI, et al. Reference standardization for mass spectrometry and high-resolution metabolomics applications to exposome research. Toxicol Sci 2015; 148(2):531-543.
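As an illustration of the LOESS option listed above, the sketch below corrects within-batch drift for a single simulated feature using pooled QC injections. The drift shape, QC spacing, and smoothing settings are assumptions; this is not the reference standardization procedure of Go, Walker et al.

```r
set.seed(1)
n <- 60
run_order <- seq_len(n)
# Pooled QC at injections 1, 11, ..., 51 and at the last injection
is_qc <- run_order %% 10 == 1 | run_order == n

# Simulated feature: pooled QC has constant true abundance; study samples vary
true_log  <- ifelse(is_qc, log(1000), rnorm(n, mean = log(1000), sd = 0.3))
drift_log <- log(1 - 0.005 * run_order)   # sensitivity falls over the run
dat <- data.frame(run_order, is_qc, log_int = true_log + drift_log)

# Fit the drift curve on QC injections only, then evaluate it for every injection
fit   <- loess(log_int ~ run_order, data = dat[dat$is_qc, ], span = 1, degree = 1)
trend <- predict(fit, newdata = dat)

# Remove the trend and restore the median QC level
dat$corrected <- dat$log_int - trend + median(dat$log_int[dat$is_qc])

sd(dat$log_int[dat$is_qc])     # QC spread before correction
sd(dat$corrected[dat$is_qc])   # QC spread after correction (much smaller)
```

Placing QC injections at both ends of the run keeps every sample inside the fitted range; loess does not extrapolate by default and would return NA outside it.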

Scalable Workflows for Population Studies

Moving LC-HRMS from small studies to population-scale cohorts:

Hu, Walker et al. (2021) demonstrated a scalable single-step extraction workflow:

Combined LC-HRMS and GC-HRMS in a single extraction
Validated across hundreds of samples
Demonstrated reproducibility for population-scale deployment
Detected both endogenous metabolites and exogenous chemicals

This is the kind of infrastructure needed to create a “next-generation NHANES” with untargeted exposomics.

Reference: Hu X, Walker DI, et al. A scalable workflow to characterize the human exposome. Nat Commun 2021; 12:5575.

What LC-HRMS Has Already Found

Selected findings from untargeted exposome studies:

Novel exposure-disease associations not detectable with targeted panels (e.g., previously unmeasured dietary metabolites associated with cardiovascular risk)
Exposure-metabolite networks revealing how exogenous chemicals perturb endogenous pathways (Jeong et al., Sci Rep 2021)
Occupational exposures detected in firefighters vs. office workers through differential metabolomic profiles
Environmental chemical mixtures that co-occur and may have joint effects on health

The discovery potential is the key advantage — finding associations with chemicals we didn’t know to measure.

The Future: Untargeted ExWAS at Scale

Imagine combining the PE Atlas approach (Module 8) with untargeted LC-HRMS:

Current (NHANES targeted)	Future (LC-HRMS untargeted)
619 exposures	10,000-20,000 features
Known chemicals only	Known + unknown chemicals
Pre-specified assays	Discovery-driven
~120,000 associations	~3-6 million associations
Targeted replication	Targeted confirmation of unknowns

The statistical and computational challenges scale accordingly — but the ExWAS framework (Modules 3-7) provides the foundation.

Key References

Foundational:

Wild CP. Complementing the genome with an “exposome.” Cancer Epidemiol Biomarkers Prev 2005; 14(8):1847-1850.
Rappaport SM, Smith MT. Environment and disease risks. Science 2010; 330:460-461.
Vermeulen R, Schymanski EL, Barabasi AL, Miller GW. The exposome and health: where chemistry meets biology. Science 2020; 367:392-396.

Methodology:

Rappaport SM, et al. The blood exposome and its role in discovering causes of disease. Environ Health Perspect 2014; 122(8):769-774.
Go YM, Walker DI, et al. Reference standardization for mass spectrometry and high-resolution metabolomics applications to exposome research. Toxicol Sci 2015; 148(2):531-543.
Hu X, Walker DI, et al. A scalable workflow to characterize the human exposome. Nat Commun 2021; 12:5575.

Key References (continued)

Annotation and standards:

Schymanski EL, et al. Identifying small molecules via high resolution mass spectrometry: communicating confidence. Environ Sci Technol 2014; 48(4):2097-2098.
Jones DP. Sequencing the exposome: a call to action. Toxicol Rep 2016; 3:29-45.

ExWAS and data science:

Patel CJ, Bhattacharya J, Butte AJ. An Environment-Wide Association Study (EWAS) on Type 2 Diabetes Mellitus. PLoS ONE 2010; 5(5):e10746.
Chung MK, et al. The exposome and exposome-wide association studies. Exposome 2024.
Patel CJ, et al. Decoding the exposome: data science methodologies and implications in ExWAS. Exposome 2024; 4(1):osae001.

Summary

LC-HRMS enables untargeted measurement of thousands of chemical features in a single sample
It captures endogenous metabolites, exogenous chemicals, drugs, dietary compounds, and microbiome products simultaneously
The annotation bottleneck (~80-95% unannotated) is the major challenge
The ExWAS framework from this course extends directly to untargeted data — same statistics, larger scale
Batch effects and semi-quantitative data require careful pre-processing
Untargeted and targeted approaches are complementary — targeted validates untargeted discoveries
The future of exposome epidemiology lies in combining LC-HRMS measurement with the ExWAS analytical pipeline at population scale

What’s Next?

The tools are in place:

nhanespewas provides the targeted ExWAS infrastructure (Modules 4-8)
LC-HRMS provides the next-generation measurement platform (this module)
Statistical methods from Modules 3 and 7 scale to untargeted data

The exposome is no longer limited to what we know to measure — LC-HRMS opens the door to discovering the unknown unknowns of environmental health.

Supported By

This course is supported by the National Institutes of Health (NIH):

National Institute of Environmental Health Sciences (NIEHS): R01ES032470, U24ES036819
National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK): R01DK137993