Research Capabilities
Detecting Biomarkers in High-dimensional Data in the Presence of Unobserved Confounding Variables
This project will exploit the high dimensionality of biological data to estimate unobserved covariates and correct for their effects, thereby increasing the accuracy with which we identify unique biosignatures and understand exposure-response relationships.
Predicting host molecular responses to environmental exposures requires sound statistical methods to detect and control for the effects of unobserved confounding factors. Techniques such as gas chromatography, NMR, and mass spectrometry (all incorporated in this focus area), which simultaneously measure hundreds or thousands of biological products, are being used to identify and understand host response to infection and other environmental insults. Identifying specific and sensitive biosignatures in these datasets requires controlling for confounding factors, but it is generally not possible to know in advance what these factors are, or to measure them all.
In this project, PNNL will exploit the high dimensionality of metabonomic and proteomic data generated by companion projects to address this problem. The high-dimensional measurements common in metabonomic and proteomic applications often capture traces of unobserved covariates (remnants of lifestyle, hidden disease states, or even genotype) that can be estimated and removed, thereby improving the accuracy of the biosignature. This research will develop new statistical methods that use the high-dimensional nature of biomarker discovery data to estimate and correct for unobserved confounding effects, control bias, and increase our confidence in the discovery of biomarkers of response.
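As an illustration of how traces of an unobserved covariate might be estimated from high-dimensional measurements and removed, the following is a minimal sketch on simulated data. It is not the project's methodology, which the text above does not specify. It assumes a simple linear model, a single hidden factor, and that only a small fraction of features truly respond to the exposure; the function names, the simulated effect sizes, and the SVD-of-residuals approach are all illustrative assumptions.

```python
import numpy as np


def naive_and_corrected_effects(Y, x, n_factors=1):
    """Per-feature exposure effects, before and after a latent-factor correction.

    Y: (n_samples, n_features) measurements; x: (n_samples,) exposure.
    Assumes (illustratively) that most features have no true exposure effect.
    """
    n = len(x)
    X = np.column_stack([np.ones(n), x])          # intercept + exposure
    coef, *_ = np.linalg.lstsq(X, Y, rcond=None)  # one regression per feature
    naive = coef[1]                               # naive exposure effects

    # Variation left after removing the exposure still carries the hidden
    # factor; its leading right singular vectors estimate the factor loadings.
    residuals = Y - X @ coef
    _, _, Vt = np.linalg.svd(residuals, full_matrices=False)
    V = Vt[:n_factors].T                          # (n_features, n_factors)

    # If most true effects are zero, the part of the naive effects that is
    # explained by the loadings is attributed to confounding and removed.
    alpha, *_ = np.linalg.lstsq(V, naive, rcond=None)
    corrected = naive - V @ alpha
    return naive, corrected


def simulate(n=200, p=500, n_true=10, seed=0):
    """Simulated data: a hidden covariate z correlated with the exposure x."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    z = 0.7 * x + 0.7 * rng.normal(size=n)        # unobserved confounder
    beta = np.zeros(p)
    beta[:n_true] = 2.0                           # the few true biomarkers
    gamma = rng.normal(size=p)                    # confounder loadings
    Y = np.outer(x, beta) + np.outer(z, gamma) + rng.normal(size=(n, p))
    return Y, x, beta


def rmse(estimate, truth):
    return float(np.sqrt(np.mean((estimate - truth) ** 2)))


Y, x, beta = simulate()
naive, corrected = naive_and_corrected_effects(Y, x)
print("naive RMSE:    ", rmse(naive, beta))
print("corrected RMSE:", rmse(corrected, beta))
```

In this simulation the correction works because thousands of features share the same hidden factor, so its loadings can be recovered from the data even though the factor itself was never measured. With real data the number of hidden factors is unknown and the sparsity assumption may not hold, which is part of why dedicated methodology is needed.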