Our research program is motivated by understanding human health and the prediction, diagnosis, prevention, and treatment of common diseases such as cancer, cardiovascular disease, and neuropsychiatric diseases. We start with the fundamental assumption that common diseases are the result of genetic and environmental perturbations to a complex adaptive system that is dynamic in time and space and driven by numerous nonlinear biomolecular interactions. We have published extensively on gene-gene interactions (i.e. epistasis), gene-environment interactions (i.e. plastic reaction norms), and genetic heterogeneity as phenomena that drive the complexity of human health. These complexities define unique individuals and groups thus motivating precision health. A central focus of the lab is on the development, evaluation, and application of artificial intelligence, machine learning, and systems approaches for modeling the relationship between biochemical, clinical, demographic, genetic, genomic, environmental, and physiologic measures and health-related endpoints such as disease susceptibility in human populations and clinical studies. We briefly review below some of our past and current applied and methodological work. Information about our current research funding can be found here.
We are very interested in developing computational methods that can analyze complex data as a human would. To this end, we developed Exploratory Modeling for Extracting Relationships using Genetic and Evolutionary Navigation Techniques (EMERGENT), a system for modeling the relationship between genetic variation and clinical outcomes. A review of this approach can be found in this book chapter. The EMERGENT approach builds on our previous work extending genetic programming to gene-gene interaction analysis and classification. For example, we published a series of papers showing how expert knowledge can be harnessed to improve search through multi-objective fitness evaluation, crossover, and population initialization. We have applied what we learned from the EMERGENT project to create PennAI, a new accessible artificial intelligence system for automated machine learning. Our initial paper describing PennAI is here. Here is a paper describing our evaluation of PennAI and its application to a clinical dataset, with a comparison to deep learning. Source code for PennAI can be found on GitHub. We are very interested in methods that incorporate the human into AI approaches. Here is a recent editorial on this topic. We also believe that evolutionary computing has an important role to play in AI and may be the next big thing after deep learning.
One approach to understanding complex systems is to perform computer experiments that help explain how the interacting components of a system give rise to emergent properties. We keep a close eye on this discipline and have carried out artificial life experiments or used these approaches as inspiration for genetic analysis. For example, we used grammatical evolution to discover Petri net systems models that are consistent with complex genotype-phenotype relationships. Several publications on Petri nets can be found here, here, and here. A review of these approaches can be found here. We later extended this approach to a grid-based cellular system. We have also simulated digital organisms to study epistasis. Here is our recent paper on using computational thought experiments to show that a complex genetic architecture driven largely by epistasis is consistent with results from GWAS. We have also tried our hand at soft robotics. These papers were presented and published as part of the 2018 Artificial Life Conference.
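To make the Petri net formalism concrete, here is a minimal sketch of the token-firing rule that drives such models (a toy illustration in Python, not our grammatical-evolution implementation; the place and transition names are hypothetical):

```python
def fire(marking, transition):
    """Fire a Petri net transition if enabled: consume tokens from the input
    places and deposit tokens in the output places; return None if disabled."""
    inputs, outputs = transition
    if any(marking.get(place, 0) < n for place, n in inputs.items()):
        return None  # not enough tokens: the transition cannot fire
    new = dict(marking)
    for place, n in inputs.items():
        new[place] -= n
    for place, n in outputs.items():
        new[place] = new.get(place, 0) + n
    return new

# Toy biochemical step: substrate + enzyme -> product + enzyme
reaction = ({"substrate": 1, "enzyme": 1}, {"product": 1, "enzyme": 1})
state = fire({"substrate": 2, "enzyme": 1}, reaction)
print(state)  # {'substrate': 1, 'enzyme': 1, 'product': 1}
```

A systems model is then a collection of such places and transitions, and grammatical evolution searches over these structures for networks whose dynamics match the observed genotype-phenotype pattern.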
Automated Machine Learning
A central challenge of machine learning is constructing the pipeline that pieces together pre-processing methods, feature selection, feature engineering, and the computational analysis methods with their parameter settings. Numerous decisions must be made at each step of the pipeline construction process, and each requires significant knowledge and experience. We were one of the first groups to introduce a comprehensive approach to automated machine learning (AutoML) through our tree-based pipeline optimization tool (TPOT). TPOT uses genetic programming to optimize the selection of different computational methods available in the scikit-learn library. We have published several papers on TPOT. They can be found here and here. We published a review of TPOT as a chapter in the first book on automated machine learning. Our recent work has focused on scaling TPOT to big biomedical data. Here is a TPOT paper exploring neural networks and a paper reporting a method for covariate adjustment within TPOT. Source code is available here. We recently published a review of automated machine learning for genetic and genomics analysis.
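The evolutionary pipeline-search idea can be sketched in a few lines of plain Python (a toy, not the actual TPOT API: the search space, the target pipeline, and the stand-in fitness function are all hypothetical; a real system would assemble and cross-validate the corresponding scikit-learn pipeline instead):

```python
import random

random.seed(1)

# A candidate "pipeline" is one choice per stage; mutation swaps a choice,
# and elitist selection keeps the fittest half of the population.
SPACE = {
    "scaler":   ["none", "standard", "minmax"],
    "selector": ["none", "top5", "top10"],
    "model":    ["knn", "tree", "logreg"],
}

# Stand-in for cross-validated accuracy (hypothetical best configuration).
TARGET = {"scaler": "standard", "selector": "top10", "model": "tree"}

def fitness(pipe):
    return sum(pipe[k] == TARGET[k] for k in SPACE) / len(SPACE)

def mutate(pipe):
    step = random.choice(list(SPACE))
    return {**pipe, step: random.choice(SPACE[step])}

def evolve(generations=100, pop_size=20):
    pop = [{k: random.choice(v) for k, v in SPACE.items()} for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]            # elitist selection
        pop = parents + [mutate(random.choice(parents)) for _ in parents]
    return max(pop, key=fitness)

best = evolve()
print(best == TARGET)  # True: the search recovers the best pipeline
```

TPOT's real representation is richer (trees of operators rather than a fixed sequence of slots), but the generate-mutate-select loop is the same.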
We have shown previously that neural networks show promise for the analysis of human genetics data. For example, we showed that we could use genetic programming to optimize both the weights and architecture of a neural network to detect epistasis. Deep learning neural networks are now computationally feasible and have become very popular because of their performance on problems such as image classification. We have been exploring deep learning approaches for several problems in biomedical research. For example, we developed a deep learning autoencoder approach to the imputation of missing data in electronic health records (EHR). We have also used deep learning for mapping patient trajectories in the EHR. We are currently exploring automated machine learning approaches to optimizing neural networks for genetic analysis.
A central challenge of machine learning is to provide a proper encoding of the data. This is particularly important in genetic studies where combinations of alleles or genotypes have specific phenotypic effects. We were the first to introduce a feature engineering method for genetic studies. Our multifactor dimensionality reduction (MDR) method provides a data-driven mapping of multilocus genotype combinations into a new feature that is able to capture nonlinear interactions or epistasis. Our 2001 paper introducing this method has been highly cited, along with follow-up papers evaluating MDR power with simulation and an early software package written in C. We have extended MDR with balanced accuracy, for family-based studies, significance testing, GPU computing, permutation testing, covariate adjustment, survival analysis, robust analysis, gene ontology analysis, analysis of quantitative traits, genome-wide analysis, and grid computing. You can find a recent review of MDR here. Here is the MDR page on Wikipedia. Source code for MDR is available in Java and Python.
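The core MDR encoding step can be sketched in a few lines of Python (a toy illustration with hypothetical haploid 0/1 data, not our released software, which works on 0/1/2 genotypes and adds cross-validation and model selection):

```python
from collections import defaultdict

def mdr_encode(genotypes, labels, threshold=1.0):
    """Core MDR step: pool samples by multilocus genotype and label each
    genotype cell high risk (1) when its case:control ratio meets the
    threshold (the overall case:control ratio; 1.0 for balanced data)."""
    cases, ctrls = defaultdict(int), defaultdict(int)
    for g, label in zip(genotypes, labels):
        (cases if label == 1 else ctrls)[g] += 1
    return {
        g: 1 if cases[g] / max(ctrls[g], 1) >= threshold else 0
        for g in set(cases) | set(ctrls)
    }

# Toy epistatic data: disease status is the XOR of two loci, so neither
# locus is predictive by itself.
genos, ys = [], []
for g1 in (0, 1):
    for g2 in (0, 1):
        for _ in range(50):
            genos.append((g1, g2))
            ys.append(g1 ^ g2)

risk_map = mdr_encode(genos, ys)
new_feature = [risk_map[g] for g in genos]
print([risk_map[g] for g in sorted(risk_map)])  # [0, 1, 1, 0] -- the XOR pattern
print(new_feature == ys)                        # True
```

The mapping collapses a multi-dimensional genotype table into a single constructed feature, which is what lets a downstream one-dimensional classifier capture the purely nonadditive effect.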
We have more recently been developing feature engineering methods using genetic programming to discover useful encodings. Preprints for our new feature engineering wrapper (FEW) method can be found here and here. Source code is available here.
We have also used random forests to engineer features at the gene and pathway level for gene-gene interaction analysis. Papers on these methods can be found here and here.
Big data often has more features than can be comfortably handled by most machine learning methods. A key strategy is to select a subset of features based on expert knowledge about their biology or based on information derived from prior computational or statistical analysis. We have worked on both strategies. For example, in this study of obesity using MDR we first generated a subset of SNPs in 12 genes robustly associated with body mass index. This dramatically reduced the search space for detecting gene-gene interactions. We have also developed novel extensions to the ReliefF algorithm for computational filtering of features. ReliefF was an attractive algorithm to focus on because it can detect nonadditive interactions among features without an exhaustive combinatorial search of feature combinations. We were able to make ReliefF dramatically better for genetic analysis by first adding a wrapper that systematically removed the worst features followed by re-estimation of the scores. We called this tuned ReliefF or TuRF. This was followed by spatially-uniform ReliefF (SURF) and its extension SURF*. Here is a recent paper on benchmarking these algorithms. We have also collaborated on statistical inference methods for ReliefF. We have reviewed these approaches here and in a recent JBI paper. Source code is available here.
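A minimal sketch of the Relief scoring idea and the TuRF wrapper, assuming binary features and a toy XOR dataset (this illustrates the nearest-hit/nearest-miss principle, not our optimized ReliefF/SURF implementations):

```python
import random

random.seed(0)

def relief_scores(X, y):
    """Basic Relief for discrete features: for each instance, find its nearest
    hit (same class) and nearest miss (other class); reward features that
    differ on the miss and penalize features that differ on the hit."""
    n, m = len(X), len(X[0])
    w = [0.0] * m
    for i, xi in enumerate(X):
        def dist(j):
            return sum(abs(xi[f] - X[j][f]) for f in range(m))
        hit = min((j for j in range(n) if j != i and y[j] == y[i]), key=dist)
        miss = min((j for j in range(n) if y[j] != y[i]), key=dist)
        for f in range(m):
            w[f] += abs(xi[f] - X[miss][f]) - abs(xi[f] - X[hit][f])
    return w

def turf(X, y, n_keep):
    """TuRF-style wrapper (sketch): rescore, drop the worst feature, repeat."""
    feats = list(range(len(X[0])))
    while len(feats) > n_keep:
        w = relief_scores([[row[f] for f in feats] for row in X], y)
        feats.pop(w.index(min(w)))
    return feats

# Toy data: the label is the XOR of features 0 and 1; features 2-4 are noise,
# so no single feature is predictive on its own.
X, y = [], []
for g1 in (0, 1):
    for g2 in (0, 1):
        for _ in range(25):
            X.append([g1, g2] + [random.randint(0, 1) for _ in range(3)])
            y.append(g1 ^ g2)

print(sorted(turf(X, y, n_keep=2)))  # [0, 1]: the interacting pair survives
```

Because nearest neighbors implicitly condition on all features at once, the two interacting features score highly even though neither has a marginal effect, and the wrapper's iterative removal of low-scoring features cleans up the noise.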
Information theory has been used for decades to measure nonadditive interactions among features. The concept of interaction information was introduced in the 1950s and has been rediscovered or re-purposed regularly since. We were the first to adapt this methodology for the study of gene-gene interactions in population-based studies of human disease. We have introduced several methodological extensions including evaporative cooling, statistical testing, and measuring three-way interactions. Our review of these methods for genetic analysis can be found here.
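The interaction information measure is straightforward to compute from entropies. Here is a minimal sketch for two SNPs and a case/control status on toy XOR data (an illustration of the measure itself, not our published pipeline):

```python
from collections import Counter
from math import log2

def H(*cols):
    """Joint Shannon entropy (in bits) of one or more discrete variables."""
    n = len(cols[0])
    return -sum((c / n) * log2(c / n) for c in Counter(zip(*cols)).values())

def I(x, y):
    """Mutual information I(X;Y)."""
    return H(x) + H(y) - H(x, y)

def interaction_info(a, b, c):
    """IG(A;B;C): information about C carried by A and B jointly, beyond
    what each carries alone. Positive values indicate synergy (epistasis)."""
    ab = list(zip(a, b))
    return I(ab, c) - I(a, c) - I(b, c)

# Toy XOR model: neither SNP alone predicts status, but together they do.
a = [0, 0, 1, 1] * 25
b = [0, 1, 0, 1] * 25
status = [x ^ y for x, y in zip(a, b)]
print(I(a, status), I(b, status))      # 0.0 0.0 -- no main effects
print(interaction_info(a, b, status))  # 1.0 -- one full bit of synergy
```

A purely epistatic pattern like XOR yields zero mutual information for each SNP individually but a full bit of interaction information, which is exactly why the measure is useful for screening gene-gene interactions.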
Learning Classifier Systems
A learning classifier system (LCS) is a type of rule-based machine learning where the model is a set of rules that each describe a portion of the data. We became interested in this approach as a strategy for modeling genetic heterogeneity. We have published extensively on this method and numerous modifications to adapt it to complex modeling problems in human genetics. An example application paper can be found here. An early review of the method can be found here. A video overview can be found here. A new book on the subject can be found here. LCS software can be found here. A published description of the software can be found here.
In addition to developing machine learning methods for detecting combinations of risk factors for disease, we have focused on methods for characterizing networks of interacting factors as a way to capture the complexity of common diseases. We were the first to infer networks of genetic interactions using entropy-based measures. This study revealed that genetic interaction networks associated with bladder cancer are much larger and more complex than expected by chance. These results were supported by functional genomics studies. We were also able to show that pairwise networks are useful for detecting higher-order interactions and that nodes of a particular type tend to interact with each other. We released a software package called ViSEN for visualizing networks of pairwise and three-way gene-gene interactions. ViSEN is freely available here.
We have also developed methods for building human phenotype networks from information about shared genetic risk factors. We have introduced bipartite network methods using SNP relationships, shared pathways, and shared environmental factors. We have applied these approaches to the genome-wide genetic analysis of glaucoma and diabetes, for example. A review of these methods can be found here.
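The bipartite idea reduces to a simple projection: phenotypes form one node set, shared genetic risk factors the other, and two phenotypes are linked when they share a factor. A minimal sketch (the phenotype-to-SNP map below is entirely hypothetical toy data, not real findings):

```python
from itertools import combinations

# Hypothetical phenotype -> risk-SNP associations (toy identifiers):
assoc = {
    "glaucoma": {"rs1", "rs2", "rs3"},
    "diabetes": {"rs3", "rs4"},
    "asthma":   {"rs5"},
}

def phenotype_network(assoc):
    """Project the bipartite phenotype-SNP graph onto phenotypes: link two
    phenotypes when they share at least one risk SNP; edge weight is the
    number of shared SNPs."""
    return {
        (p, q): len(assoc[p] & assoc[q])
        for p, q in combinations(sorted(assoc), 2)
        if assoc[p] & assoc[q]
    }

print(phenotype_network(assoc))  # {('diabetes', 'glaucoma'): 1}
```

The same projection works with shared pathways or shared environmental factors in place of SNPs, which is how the different flavors of human phenotype network are built.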
It is our working hypothesis that common diseases are characterized by robust biochemical and physiological systems that are buffered from genetic and environmental perturbations. As such, we have been interested in pathway analysis to look at aggregated genetic effects that disrupt health. We were one of the very first groups to apply gene set enrichment analysis to genome-wide association study (GWAS) data, with applications to depression and schizophrenia. We were also among the first to apply pathway analysis to genome-wide epistasis analysis. We have participated in several reviews of pathway-based approaches. These can be found here and here. We have also developed methods for improving pathway analysis by using genomics data to improve gene sets. A paper describing our entropy minimization over variable clusters (EMVC) method can be found here. We later extended this approach to include principal components analysis, spectral methods, and an independent filter approach. We are currently exploring the use of functional genomic annotations such as those from ENCODE to improve genetic analyses. Here are new papers using promoter-enhancer interactions and contactomics data to guide epistasis analysis.
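For intuition, the simplest flavor of pathway analysis is an over-representation test: given a set of significant genes, ask whether a pathway contains more of them than chance predicts. Here is a hypergeometric sketch (a simplified illustration, not the rank-based GSEA statistic used in our papers; the counts are hypothetical):

```python
from math import comb

def enrichment_p(n_universe, n_pathway, n_hits, n_overlap):
    """Hypergeometric over-representation p-value: the probability of seeing
    at least n_overlap pathway genes among n_hits significant genes when the
    hits are drawn at random from the gene universe."""
    total = comb(n_universe, n_hits)
    return sum(
        comb(n_pathway, k) * comb(n_universe - n_pathway, n_hits - k)
        for k in range(n_overlap, min(n_pathway, n_hits) + 1)
    ) / total

# 1000 genes tested, 40 in the pathway, 50 significant hits, 12 in the
# pathway: expected overlap by chance is only 50 * 40 / 1000 = 2 genes.
print(enrichment_p(1000, 40, 50, 12) < 1e-5)  # True -- strong enrichment
print(enrichment_p(1000, 40, 50, 0))          # 1.0  -- overlap >= 0 is certain
```

Rank-based approaches such as GSEA refine this by using the full association ranking rather than a hard significance cutoff, but the question asked is the same.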
A key to developing powerful computational methods is the ability to simulate complex data where the ground truth is known. We developed the Genetic Architecture Model Emulator for Testing and Evaluating Software (GAMETES) approach to simulating epistasis in population-based data using multilocus penetrance functions. Our initial paper describing this approach can be found here. We also published papers on predicting the difficulty of GAMETES models and on describing their shape. The GAMETES software is available here. We have also developed a Heuristic Identification of Biological Architectures for simulating Complex Hierarchical Interactions (HIBACHI) method that attempts to simulate data that is more closely aligned with the hierarchical complexity of biological systems. A paper describing this approach can be found here. We recently reworked HIBACHI to be driven by genetic programming. We have used these simulated data in combination with real data to create benchmark data for evaluating machine learning algorithms. This led to the creation of the Penn Machine Learning Benchmarks (PMLB). Here is a more in-depth analysis of these benchmarking results along with some recommendations. An updated version of PMLB (1.0) was published here.
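The penetrance-function idea can be sketched directly: draw genotypes under Hardy-Weinberg equilibrium, then assign case/control status from a table giving disease probability per multilocus genotype. The example uses a classic hand-built purely epistatic model, not a table generated by GAMETES itself:

```python
import random

random.seed(42)

# A purely epistatic penetrance function (requires MAF = 0.5): risk is
# elevated only when exactly one of the two loci is heterozygous, so
# neither locus has a marginal effect on its own.
penetrance = {
    (g1, g2): 0.8 if (g1 == 1) ^ (g2 == 1) else 0.2
    for g1 in range(3) for g2 in range(3)
}

def simulate(penetrance, mafs, n):
    """Draw genotypes under Hardy-Weinberg equilibrium from minor-allele
    frequencies, then assign case (1) / control (0) status from the
    multilocus penetrance table."""
    data = []
    for _ in range(n):
        geno = tuple(
            (random.random() < maf) + (random.random() < maf)  # 0/1/2 minor alleles
            for maf in mafs
        )
        data.append((geno, int(random.random() < penetrance[geno])))
    return data

data = simulate(penetrance, mafs=(0.5, 0.5), n=2000)
case_rate = sum(status for _, status in data) / len(data)
# case_rate lands near 0.5, and within each genotype cell the observed case
# fraction tracks the corresponding penetrance table entry.
```

GAMETES automates the hard part, which is generating penetrance tables like this one with prescribed heritability and minimal marginal effects; the sampling step is the same.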
Unsupervised Machine Learning
Unsupervised machine learning methods are important for identifying patterns in data when the endpoint is unknown. For example, you might use these methods to identify subtypes of disease from genomic or clinical data. We recently released a Bioconductor package for a biclustering method called runibic. We have also developed a powerful biclustering method called ebic, which uses evolutionary computing to optimize pattern discovery. Here is a short paper describing the software and a review paper on scaling these methods to big biomedical data.
We believe the future of data science and informatics is with visualization. Data is too big and too complex for exploration in a spreadsheet. The same is increasingly true for data analysis results. We have been using video game engines to visualize high-dimensional data and research results. For example, we used the Unity 3D engine to develop an interactive 3D heatmap application that was applied to microbiome data. We have also shown how data visualized in a video game can be sent to a 3D printer or used in conjunction with artificial intelligence approaches. Here is our open-source software for the 3D heatmap programmed in Unity 3D. We are also interested in using visualization as a tool for interpreting machine learning models. Here is an R package for combining heatmaps with decision trees for model interpretation.