Our research program is motivated by understanding human health and the prediction, diagnosis, prevention, and treatment of common diseases such as cancer, cardiovascular disease, and neuropsychiatric diseases. We start with the fundamental assumption that common diseases are the result of genetic and environmental perturbations to a complex adaptive system that is dynamic in time and space and driven by numerous nonlinear biomolecular interactions. We have published extensively on gene-gene interactions (i.e. epistasis), gene-environment interactions (i.e. plastic reaction norms), and genetic heterogeneity as phenomena that drive the complexity of human health. A central focus of the lab is on the development, evaluation, and application of artificial intelligence, machine learning, and systems approaches for modeling the relationship between biochemical, clinical, demographic, genetic, genomic, environmental, and physiologic measures and health-related endpoints such as disease susceptibility in human populations or clinical databases. We briefly review below some of our past and current applied and methodological work. Information about our current research funding can be found here.
We are very interested in developing computational methods that can analyze complex data as a human would. To this end, we developed Exploratory Modeling for Extracting Relationships using Genetic and Evolutionary Navigation Techniques (EMERGENT), a system for modeling the relationship between genetic variation and clinical outcomes. A review of this approach can be found in this book chapter. The EMERGENT approach builds on our previous work extending genetic programming to gene-gene interaction analysis and classification. For example, we published a series of papers showing how expert knowledge can be harnessed to improve search through multi-objective fitness evaluation, crossover, and population initialization. We have applied what we learned from the EMERGENT project to create a new accessible artificial intelligence system for adaptive machine learning in collaboration with faculty and staff from the Penn Institute for Biomedical Informatics. This new method and its accompanying software are called PennAI. A preprint can be found here along with some press. Source code for PennAI will be available in 2018.
One approach to understanding complex systems is to perform computer experiments that help explain how the interacting components of a system give rise to emergent properties. We keep a close eye on this discipline and have carried out artificial life experiments or used these approaches as inspiration for genetic analysis. For example, we used grammatical evolution to discover Petri net systems models that are consistent with complex genotype-phenotype relationships. Several publications on Petri nets can be found here, here, and here. A review of these approaches can be found here. We later extended this approach to a grid-based cellular system. We have also simulated digital organisms to study epistasis.
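The core Petri net semantics underlying these systems models can be sketched in a few lines. The following is a toy illustration only, with a hypothetical two-step biochemical pathway; the evolved models in our papers are far more elaborate.

```python
# Minimal Petri net simulator: places hold tokens, and a transition fires
# when every input place has at least one token, consuming one token per
# input and producing one per output.

def enabled(marking, inputs):
    """A transition is enabled if each input place holds >= 1 token."""
    return all(marking.get(p, 0) >= 1 for p in inputs)

def fire(marking, inputs, outputs):
    """Fire a transition: consume input tokens, add output tokens."""
    m = dict(marking)
    for p in inputs:
        m[p] -= 1
    for p in outputs:
        m[p] = m.get(p, 0) + 1
    return m

# Hypothetical pathway: substrate + enzyme -> intermediate (+ enzyme back),
# then intermediate -> product.
transitions = [
    (["substrate", "enzyme"], ["intermediate", "enzyme"]),
    (["intermediate"], ["product"]),
]

marking = {"substrate": 2, "enzyme": 1}

# Fire enabled transitions in a fixed order until the net is dead.
changed = True
while changed:
    changed = False
    for inputs, outputs in transitions:
        if enabled(marking, inputs):
            marking = fire(marking, inputs, outputs)
            changed = True
```

Grammatical evolution then searches over net structures like `transitions` above, scoring each candidate by how well its dynamics match the observed genotype-phenotype relationship.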
Automated Machine Learning
A central challenge of machine learning is constructing the pipeline that pieces together pre-processing methods, feature selection, feature engineering, and the computational analysis methods and their parameter settings. Numerous decisions must be made at each step of the pipeline construction process, and each requires significant knowledge and experience. We were one of the first groups to introduce a comprehensive approach to automated machine learning (AutoML) through our tree-based pipeline optimization tool (TPOT). TPOT uses genetic programming to optimize the selection of the different computational methods available in the scikit-learn library. We have published several papers on TPOT. They can be found here and here. Source code is available here. A related area is automated statistics.
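The search-over-pipelines idea can be conveyed with a deliberately tiny sketch: enumerate combinations of a preprocessing step and a model setting, score each candidate, and keep the best. TPOT itself evolves tree-shaped scikit-learn pipelines with genetic programming rather than exhaustive search; the components and data below are hypothetical toys.

```python
# Toy AutoML sketch: each "pipeline" is a (preprocessor, threshold) pair,
# scored by classification accuracy on a small 1-D dataset.

# Toy dataset of (feature value, binary label) pairs.
data = [(0.1, 0), (0.4, 0), (0.35, 0), (0.8, 1), (0.9, 1), (0.65, 1)]

def identity(x):
    return x

def square(x):          # a hypothetical feature-engineering step
    return x * x

def accuracy(prep, threshold):
    """Score a candidate pipeline: preprocess the feature, then threshold it."""
    correct = sum((prep(x) > threshold) == bool(y) for x, y in data)
    return correct / len(data)

# Exhaustive search over the small space of candidate pipelines.
candidates = [(prep, t) for prep in (identity, square)
              for t in (0.2, 0.5, 0.7)]
best = max(candidates, key=lambda c: accuracy(*c))
best_score = accuracy(*best)
```

A genetic programming search replaces the exhaustive enumeration with mutation and crossover of pipeline trees, which scales to the much larger spaces of real operators and hyperparameters.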
Another central challenge of machine learning is providing a proper encoding of the data. This is particularly important in genetic studies, where combinations of alleles or genotypes have specific phenotypic effects. We were the first to introduce a feature engineering method for genetic studies. Our multifactor dimensionality reduction (MDR) method provides a data-driven mapping of multilocus genotype combinations into a new feature that is able to capture nonlinear interactions or epistasis. Our 2001 paper introducing this method has been highly cited, along with follow-up papers evaluating MDR power with simulation and an early software package written in C. We have extended MDR with balanced accuracy, for family-based studies, significance testing, GPU computing, permutation testing, covariate adjustment, survival analysis, robust analysis, gene ontology analysis, analysis of quantitative traits, genome-wide analysis, and grid computing. You can find a recent review of MDR here. Here is the MDR page on Wikipedia. Source code for MDR is available in Java and Python.
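The core MDR construction is compact enough to sketch directly: pool each multilocus genotype combination into "high risk" or "low risk" according to its observed case/control ratio, yielding one engineered binary feature. This is an illustrative sketch with toy data, not the published software, which adds cross-validation and model selection on top.

```python
# Sketch of the MDR feature construction step.

def mdr_feature(genotype_pairs, labels, threshold=1.0):
    """Map each (SNP1, SNP2) genotype combination to 1 (high risk) or 0."""
    cases, controls = {}, {}
    for g, y in zip(genotype_pairs, labels):
        if y == 1:
            cases[g] = cases.get(g, 0) + 1
        else:
            controls[g] = controls.get(g, 0) + 1
    risk = {}
    for g in set(cases) | set(controls):
        # max(..., 1) avoids division by zero for cells with no controls.
        ratio = cases.get(g, 0) / max(controls.get(g, 0), 1)
        risk[g] = 1 if ratio > threshold else 0
    return [risk[g] for g in genotype_pairs], risk

# Toy XOR-like epistasis: risk is high only when exactly one locus carries
# the variant genotype, so neither locus is predictive on its own.
genos = [(0, 0), (0, 1), (1, 0), (1, 1)] * 2
labels = [0, 1, 1, 0, 0, 1, 1, 0]
feature, risk_map = mdr_feature(genos, labels)
```

The engineered `feature` collapses the two-locus pattern into a single column that any downstream classifier can use, which is how MDR makes purely epistatic effects visible to otherwise additive models.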
We have more recently been developing feature engineering methods using genetic programming to discover useful encodings. Preprints for our new feature engineering wrapper (FEW) method can be found here and here. Source code is available here.
Big data often has more features than can be comfortably handled by most machine learning methods. A key strategy is to select a subset of features based on expert knowledge about their biology or based on information derived from prior computational or statistical analysis. We have worked on both strategies. For example, in this study of obesity using MDR we first generated a subset of SNPs in 12 genes robustly associated with body mass index. This dramatically reduced the search space for detecting gene-gene interactions. We have also developed novel extensions to the ReliefF algorithm for computational filtering of features. ReliefF was an attractive algorithm to focus on because it can detect nonadditive interactions among features without an exhaustive combinatorial search of feature combinations. We were able to make ReliefF dramatically better for genetic analysis by adding a wrapper that systematically removes the worst features and then re-estimates the scores of those that remain. We called this tuned ReliefF, or TuRF. This was followed by spatially-uniform ReliefF (SURF) and its extension SURF*. We have reviewed these approaches here. Source code is available here.
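The Relief scoring idea plus the TuRF wrapper can be sketched compactly: Relief rewards features that differ between a sample and its nearest "miss" (opposite class) and penalizes features that differ from its nearest "hit" (same class); TuRF repeatedly drops the lowest-scoring feature and re-estimates scores on the remainder. This toy uses a simplified single-neighbor Relief, not the published ReliefF/TuRF implementations.

```python
# Simplified Relief scorer and a TuRF-style iterative-removal wrapper.

def relief_scores(X, y, feats):
    """Weight each feature by nearest-miss minus nearest-hit differences."""
    w = {f: 0.0 for f in feats}
    for i, xi in enumerate(X):
        def dist(j):
            return sum(abs(xi[f] - X[j][f]) for f in feats)
        hits = [j for j in range(len(X)) if j != i and y[j] == y[i]]
        misses = [j for j in range(len(X)) if y[j] != y[i]]
        h, m = min(hits, key=dist), min(misses, key=dist)
        for f in feats:
            w[f] += abs(xi[f] - X[m][f]) - abs(xi[f] - X[h][f])
    return w

def turf(X, y, n_keep):
    """TuRF-style wrapper: drop the worst feature, then re-score."""
    feats = list(range(len(X[0])))
    while len(feats) > n_keep:
        w = relief_scores(X, y, feats)
        feats.remove(min(feats, key=lambda f: w[f]))
    return feats

# Toy data: feature 0 tracks the class label; feature 1 is constant noise.
X = [(0, 5), (0, 5), (1, 5), (1, 5)]
y = [0, 0, 1, 1]
kept = turf(X, y, n_keep=1)
```

The re-estimation step matters because noisy features distort the nearest-neighbor distances; removing them sharpens the scores of the features that remain, which is the intuition behind TuRF's improvement on plain ReliefF.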
Information theory has been used for decades to measure nonadditive interactions among features. The concept of interaction information was introduced in the 1950s and has been rediscovered or repurposed regularly since. We were the first to adapt this methodology for the study of gene-gene interactions in population-based studies of human disease. We have introduced several methodological extensions including evaporative cooling, statistical testing, and measuring three-way interactions. Our review of these methods for genetic analysis can be found here.
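A minimal sketch of interaction information for two SNPs A, B and a phenotype Y, using the decomposition I(A;B;Y) = I(A;B|Y) − I(A;B): a positive value indicates synergy, meaning the pair carries information about Y beyond the two main effects. (Sign conventions vary across the literature; the toy data below is a pure XOR pattern.)

```python
from math import log2
from collections import Counter

def entropy(samples):
    """Shannon entropy (bits) of an empirical distribution."""
    n = len(samples)
    return -sum(c / n * log2(c / n) for c in Counter(samples).values())

def mutual_info(xs, ys):
    # I(X;Y) = H(X) + H(Y) - H(X,Y)
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

def cond_mutual_info(xs, ys, zs):
    # I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(Z) - H(X,Y,Z)
    return (entropy(list(zip(xs, zs))) + entropy(list(zip(ys, zs)))
            - entropy(zs) - entropy(list(zip(xs, ys, zs))))

def interaction_info(a, b, y):
    """Positive = synergy between a and b with respect to y."""
    return cond_mutual_info(a, b, y) - mutual_info(a, b)

# Pure XOR epistasis: neither locus alone predicts Y, but together they do.
a = [0, 0, 1, 1]
b = [0, 1, 0, 1]
y = [ai ^ bi for ai, bi in zip(a, b)]
```

For this XOR pattern the main-effect terms I(A;Y) and I(B;Y) are both zero while the interaction information is a full bit, which is exactly the signature these methods are designed to detect in case-control genetic data.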
Learning Classifier Systems
A learning classifier system (LCS) is a type of rule-based machine learning in which the model is a set of rules (partial solutions), each of which describes a portion of the data. We became interested in this approach as a strategy for modeling genetic heterogeneity. We have published extensively on this method and on numerous modifications to adapt it to complex modeling problems in human genetics. An example application paper can be found here. An early review of the method can be found here. A video overview can be found here. A new book on the subject can be found here. LCS software can be found here. A published description of the software can be found here.
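The rule-matching core of an LCS, and why it suits genetic heterogeneity, can be sketched in a few lines: conditions with wildcards ('#') each cover only part of the data, so different subgroups of cases can be explained by different rules. This is an illustrative toy with hypothetical rules, not the published systems, which learn and weight the rule population from data.

```python
# Minimal LCS-style rule matching over genotype strings.

def matches(condition, instance):
    """A condition matches if every non-wildcard position agrees."""
    return all(c == "#" or c == x for c, x in zip(condition, instance))

def predict(rules, instance, default=0):
    """First matching rule wins (real LCSs vote among matching rules)."""
    for condition, label in rules:
        if matches(condition, instance):
            return label
    return default

# Hypothetical heterogeneity: one subgroup's risk is driven by locus 0,
# another's by locus 2; no single rule covers all cases.
rules = [
    ("1##", 1),   # risk genotype at locus 0
    ("##1", 1),   # risk genotype at locus 2
    ("###", 0),   # default: unaffected
]

label = predict(rules, "100")   # covered by the first rule
```

Because each rule only needs to be accurate on the niche it covers, the rule set as a whole can represent heterogeneous genetic architectures that defeat single global models.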
In addition to developing machine learning methods for detecting combinations of risk factors for disease, we have focused on methods for characterizing networks of interacting factors as a way to capture the complexity of common diseases. We were the first to infer networks of genetic interactions using entropy-based measures. This study revealed that the genetic interaction networks associated with bladder cancer are much larger and more complex than expected by chance. These results were supported by functional genomics studies. We were also able to show that pairwise networks are useful for detecting higher-order interactions and that nodes of a particular type tend to interact with each other. We released a software package called ViSEN for visualizing networks of pairwise and three-way gene-gene interactions. ViSEN is freely available here.
We have also developed methods for building human phenotype networks from information about shared genetic risk factors. We have introduced bipartite network methods using SNP relationships, shared pathways, and shared environmental factors. We have applied these approaches to the genome-wide genetic analysis of glaucoma and diabetes, for example. A review of these methods can be found here.
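The bipartite construction behind these phenotype networks is simple to sketch: phenotypes form one node set and SNPs the other, and the one-mode projection links two phenotypes whenever they share an associated SNP. The phenotype-SNP associations below are hypothetical placeholders; the published methods use genome-wide association results.

```python
from itertools import combinations

# Hypothetical phenotype -> associated-SNP sets (the bipartite network).
assoc = {
    "glaucoma": {"rs1", "rs2"},
    "diabetes": {"rs2", "rs3"},
    "asthma":   {"rs4"},
}

def phenotype_network(assoc):
    """One-mode projection: edges between phenotypes, weighted by the
    number of shared SNPs; phenotypes sharing nothing stay unlinked."""
    edges = {}
    for p, q in combinations(sorted(assoc), 2):
        shared = assoc[p] & assoc[q]
        if shared:
            edges[(p, q)] = len(shared)
    return edges

edges = phenotype_network(assoc)
```

The same projection works with shared pathways or shared environmental factors in place of SNPs, which is how the different bipartite variants we introduced relate to one another.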
It is our working hypothesis that common diseases are characterized by robust biochemical and physiological systems that are buffered from genetic and environmental perturbations. As such, we have been interested in pathway analysis to look at aggregated genetic effects that disrupt health. We were one of the very first groups to apply gene set enrichment analysis to genome-wide association study (GWAS) data with application to depression and schizophrenia. We were also among the first to apply pathway analysis to genome-wide epistasis analysis. We have participated in several reviews of pathway-based approaches. These can be found here and here. We have also developed methods for improving pathway analysis by using genomics data to improve gene sets. A paper describing our entropy minimization over variable clusters (EMVC) method can be found here. We later extended this approach to include principal components analysis, spectral methods, and an independent filter approach. We are currently exploring the use of functional genomic annotations such as those from ENCODE to improve genetic analyses.
A key to developing powerful computational methods is the ability to simulate complex data where the ground truth is known. We developed the Genetic Architecture Model Emulator for Testing and Evaluating Software (GAMETES) approach to simulating epistasis in population-based data using multilocus penetrance functions. Our initial paper describing this approach can be found here. We also published papers on predicting the difficulty of GAMETES models and on describing their shape. The GAMETES software is available here. We have also developed a Heuristic Identification of Biological Architectures for simulating Complex Hierarchical Interactions (HIBACHI) method that attempts to simulate data that is more closely aligned with the hierarchical complexity of biological systems. A paper describing this approach can be found here. We are currently reworking HIBACHI to be driven by genetic programming. A new paper and software package are being developed now.
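The essential simulation step can be sketched as sampling case-control data from a two-locus penetrance function. The table below is a toy XOR-like model with made-up parameters (and binary rather than three-level genotypes), not a published GAMETES architecture: each genotype pair maps to a disease probability, and neither locus has a marginal effect, so all of the signal is epistatic.

```python
import random

# Toy penetrance table indexed by (genotype at locus A, genotype at locus B).
penetrance = {
    (0, 0): 0.1, (0, 1): 0.9,
    (1, 0): 0.9, (1, 1): 0.1,
}

def simulate(n, rng):
    """Draw n individuals: random genotypes, disease status from penetrance."""
    data = []
    for _ in range(n):
        g = (rng.randint(0, 1), rng.randint(0, 1))
        y = 1 if rng.random() < penetrance[g] else 0
        data.append((g, y))
    return data

def case_rate(data, locus, value):
    """Empirical disease rate among individuals with a given genotype."""
    rows = [y for g, y in data if g[locus] == value]
    return sum(rows) / len(rows)

rng = random.Random(42)   # fixed seed for reproducibility
data = simulate(2000, rng)
```

Because every marginal case rate sits near 0.5 while the joint genotype cells sit near 0.1 or 0.9, data like this provides a known ground truth for testing whether a method can detect pure epistasis.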
We believe the future of data science and informatics lies in visualization. Data is too big and too complex for exploration in a spreadsheet, and the same is increasingly true for data analysis results. We have been using video game engines to visualize high-dimensional data and research results. For example, we used the Unity 3D engine to develop an interactive 3D heatmap application that we applied to microbiome data. We have also shown how data visualized in a video game can be sent to a 3D printer or used in conjunction with artificial intelligence approaches. We are now working with the Penn Institute for Biomedical Informatics to incorporate these approaches into their new Idea Factory for immersive visual analytics.