Predicting Traits from Genomic Data Using the Microsoft Azure Linux Data Science VM

This post is by Mary Wahl, Data Scientist at Microsoft.

We are thrilled to share a new interactive walkthrough that shows how to predict a person's traits from genetic data, using open-source genomics software and cloud-based data science resources available from Microsoft.

As genomic data proliferates, industries ranging from medicine to agriculture and forensics have harnessed the power of machine learning to interpret this data for high-impact applications. Today, doctors can select safe and effective therapies based on a patient's genetic risk of harmful side effects, or on the predicted susceptibility of a specific tumor or pathogen to drug treatments. Investigators can determine a suspect's or victim's appearance, even in the absence of witnesses, using DNA evidence from a crime scene. Scientists can find and cultivate naturally occurring mutations that make crops more resilient to extreme weather, improving global access to food. All of these applications share the common goal of predicting trait or disease status – in biological parlance, phenotype – from genomic data.

Such applications also face many common challenges that reflect the unique qualities of genomic data. The number of samples available for training a predictive model is often orders of magnitude smaller than the number of genomic features included in the model: overfitting is a frequent issue for such underdetermined problems. Population structure, unrecorded covariates, unobserved genomic variation, and other effects of data collection strategies can confound predictions. Furthermore, since the proportion of these genomic features that directly affect a given trait is typically small, identifying a true association between genomic variants and phenotype can be like finding a needle in a haystack.
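To make the overfitting risk concrete, here is a minimal sketch on simulated data. It is not taken from the tutorial: the sample sizes, the allele frequency, the number of causal variants, and the helper names are all assumptions chosen for illustration. An unregularized linear fit with far more features than samples reproduces the training phenotypes almost exactly, yet explains little of the variation in held-out samples.

```python
import numpy as np

def fit_min_norm(X, y):
    """Unregularized least squares; returns the minimum-norm solution when p > n."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

def r_squared(y, y_hat):
    """Fraction of phenotypic variance explained by the predictions."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(0)

# Illustrative (assumed) sizes: 20x more features than training samples.
n_train, n_test, n_features, n_causal = 100, 100, 2000, 5

# Simulated genotypes: 0, 1, or 2 copies of the minor allele per variant.
X = rng.binomial(2, 0.3, size=(n_train + n_test, n_features)).astype(float)
X -= X.mean(axis=0)  # center each variant

# Only a handful of variants truly affect the phenotype.
beta_true = np.zeros(n_features)
beta_true[:n_causal] = 1.0
y = X @ beta_true + rng.normal(scale=1.0, size=n_train + n_test)

X_tr, X_te = X[:n_train], X[n_train:]
y_tr, y_te = y[:n_train], y[n_train:]

beta_hat = fit_min_norm(X_tr, y_tr)
train_r2 = r_squared(y_tr, X_tr @ beta_hat)
test_r2 = r_squared(y_te, X_te @ beta_hat)

print(f"training R^2: {train_r2:.3f}")
print(f"held-out R^2: {test_r2:.3f}")
```

The gap between the two scores is the signature of an underdetermined problem: the model has enough free parameters to memorize the training samples, so a good training fit says little about predictive accuracy.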

Thanks in part to recent algorithmic advances, linear mixed models (LMMs) have emerged as the industry-standard approach for identifying causal features and predicting phenotypes. Like a standard linear model, an LMM includes fixed effects for each genomic feature and any recorded covariates, such as age or sex. LMMs also include random effects, which in genomic models are correlated between individuals on the basis of their genetic similarity. These random effects can account for heritable differences in phenotype that are not captured by the genomic features or covariates in the model, which reduces overfitting.
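The structure of an LMM can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not FaST-LMM's implementation or API: the data are simulated, the function names are hypothetical, and the two variance components are treated as known, whereas real LMM software estimates them from the data. The phenotype is modeled as y = X·beta + g + e, where the random effect g has covariance proportional to a genetic similarity matrix K and e is independent noise. Given that covariance, the fixed effects can be estimated by generalized least squares.

```python
import numpy as np

def standardize(G):
    """Center and scale each variant (column) of a genotype matrix."""
    return (G - G.mean(axis=0)) / G.std(axis=0)

def gls_fixed_effects(X, y, V):
    """Generalized least squares: solves (X' V^-1 X) beta = X' V^-1 y."""
    Vinv_X = np.linalg.solve(V, X)
    Vinv_y = np.linalg.solve(V, y)
    return np.linalg.solve(X.T @ Vinv_X, X.T @ Vinv_y)

rng = np.random.default_rng(1)
n, p = 400, 1000  # assumed sizes: individuals, background variants

# Background genotypes define genetic similarity between individuals.
G = rng.binomial(2, 0.3, size=(n, p)).astype(float)
Z = standardize(G)
K = Z @ Z.T / p  # n x n genetic similarity matrix

# Assumed variance components (treated as known in this sketch).
sigma_g2, sigma_e2 = 0.5, 0.5

# Random effect with covariance sigma_g2 * K, plus independent noise.
u = rng.normal(size=p)
g = np.sqrt(sigma_g2 / p) * (Z @ u)
e = rng.normal(scale=np.sqrt(sigma_e2), size=n)

# Fixed effects: intercept, one covariate (age), and one tested variant.
age = rng.normal(size=n)
snp = rng.binomial(2, 0.3, size=n).astype(float)
X = np.column_stack([np.ones(n), age, snp])
beta_true = np.array([1.0, 0.5, 0.8])
y = X @ beta_true + g + e

# Phenotypic covariance implied by the model, then the GLS estimate.
V = sigma_g2 * K + sigma_e2 * np.eye(n)
beta_hat = gls_fixed_effects(X, y, V)

print("true fixed effects:     ", beta_true)
print("estimated fixed effects:", np.round(beta_hat, 2))
```

Because heritable background variation is absorbed by the single random effect rather than by a thousand additional fixed effects, the model needs only a handful of free parameters, which is how the random-effects term keeps overfitting in check.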

Data scientists and engineers can use our tutorial on Predicting Phenotypes from Genomic Data using Microsoft Azure's Linux Data Science Virtual Machine as a quick-start guide to applying LMMs in a preconfigured compute context. After completing this walkthrough, users can dynamically scale the virtual machine's size to accommodate large datasets for their own real-world applications. This tutorial employs FaST-LMM, an open-source algorithm suite developed by Microsoft Research that facilitates result analysis through direct integration with Python and R. To demonstrate the accuracy of the approach, we pair real sequence data from the International HapMap Project with phenotype data simulated in a gallery experiment in Azure Machine Learning Studio.