Precision medicine pipeline with genomics


This article presents a solution for genomic analysis and reporting. The processes and results are appropriate for precision medicine scenarios, or areas of medical care that use genetic profiling.

Architecture

Architecture diagram showing how information flows through a genomics analysis and reporting pipeline. Azure Data Factory orchestrates the components on the left. Numbered labels on the arrows correspond with the numbered steps in the workflow. Two arrows end in the Clinician views box: one at a clinician icon and one at a Power BI icon.

Download a Visio file of this architecture.

Workflow

Azure Data Factory orchestrates the workflow:

  1. Data Factory transfers the initial sample file to Azure Blob Storage. The file is in FASTQ format.

  2. Microsoft Genomics runs secondary analysis on the file.

  3. Microsoft Genomics stores the output in Blob Storage in one of these formats:

    • Variant call format (VCF)
    • Genomic VCF (GVCF)

  4. Jupyter Notebook annotates the output file. The notebook runs on Azure Databricks.

  5. Azure Data Lake Storage stores the annotated file.

  6. Jupyter Notebook merges the file with other datasets and analyzes the data. The notebook runs on Azure Databricks.

  7. Data Lake Storage stores the processed data.

  8. Azure Healthcare APIs packs the data into a Fast Healthcare Interoperability Resources (FHIR) bundle. The clinical data then enters the patient electronic health record (EHR).

  9. Clinicians view the results in Power BI dashboards.
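
As context for steps 3 and 4, the VCF output that Microsoft Genomics writes is tab-separated text. The following minimal sketch shows how an annotation notebook might read one VCF data record; the column layout follows the VCF specification, and the sample line in the usage note is illustrative, not taken from this solution.

```python
def parse_vcf_line(line):
    """Parse one data line of a VCF file into a dict.

    Fixed columns per the VCF specification:
    CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO.
    Simplified sketch; ignores per-sample genotype columns.
    """
    chrom, pos, vid, ref, alt, qual, flt, info = line.rstrip("\n").split("\t")[:8]
    return {
        "chrom": chrom,
        "pos": int(pos),
        "id": vid,
        "ref": ref,
        "alt": alt.split(","),  # ALT can list several alternate alleles
        "qual": None if qual == "." else float(qual),
        "filter": flt,
        # INFO is a semicolon-separated list of key=value pairs or flags
        "info": dict(kv.split("=", 1) if "=" in kv else (kv, True)
                     for kv in info.split(";")),
    }
```

For example, `parse_vcf_line("chr1\t101\trs123\tA\tG\t50\tPASS\tDP=20")` yields a record whose `pos` is `101` and whose `info` map holds the read depth `DP`.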

Components

The solution uses the following components:

Microsoft Genomics

Microsoft Genomics offers an efficient and accurate genomics pipeline that implements the industry's best practices. Its high-performance engine is optimized for these tasks:

  • Reading large files of genomic data
  • Processing them efficiently across many cores
  • Sorting and filtering the results
  • Writing the results to output files

To maximize throughput, this engine runs the Burrows-Wheeler Aligner (BWA) and the Genome Analysis Toolkit (GATK) HaplotypeCaller variant caller. The engine also includes the other stages that make up standard genomics pipelines, such as duplicate marking, base quality score recalibration, and indexing. Starting from raw reads, the engine can process a single genomic sample on a single multi-core server in a few hours, producing aligned reads and variant calls.
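
The raw reads that the engine starts from arrive in FASTQ format, where each read spans four lines. The following sketch illustrates that record structure; it's a simplified in-memory reader, whereas real pipelines stream compressed, multi-gigabyte files.

```python
def read_fastq(lines):
    """Yield (read_id, sequence, quality) tuples from FASTQ text.

    Each FASTQ record spans four lines:
      @read_id, sequence, '+' separator, quality string.
    Illustrative only; assumes well-formed input.
    """
    it = iter(lines)
    for header in it:
        seq = next(it).strip()
        next(it)                      # '+' separator line
        qual = next(it).strip()
        yield header.strip().lstrip("@"), seq, qual
```

For instance, feeding the reader the four lines `@r1`, `ACGT`, `+`, `IIII` produces the single record `("r1", "ACGT", "IIII")`.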

Internally, the Microsoft Genomics controller manages these aspects of the process:

  • Distributing batches of genomes across pools of machines in the cloud
  • Maintaining a queue of incoming requests
  • Distributing the requests to servers that run the genomics engine
  • Monitoring the servers' performance and progress
  • Evaluating the results
  • Ensuring that processing runs reliably and securely at scale, behind a secure web service API

You can easily use Microsoft Genomics results in tertiary analysis and machine learning services. And because Microsoft Genomics is a cloud service, you don't need to manage or update hardware or software.
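
The controller's queue-and-dispatch pattern can be sketched as follows. This is a toy round-robin dispatcher to illustrate the idea; the real Microsoft Genomics controller also monitors server health, retries failures, and batches genomes.

```python
from collections import deque

def dispatch(requests, servers):
    """Distribute a queue of incoming requests across servers, round robin.

    Illustrative sketch of the controller's queue-and-dispatch pattern;
    not the actual Microsoft Genomics implementation.
    """
    queue = deque(requests)               # maintain a queue of incoming requests
    assignments = {server: [] for server in servers}
    i = 0
    while queue:
        # distribute the next request to the next server in rotation
        assignments[servers[i % len(servers)]].append(queue.popleft())
        i += 1
    return assignments
```

Calling `dispatch(["g1", "g2", "g3"], ["s1", "s2"])` assigns `g1` and `g3` to `s1` and `g2` to `s2`.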

Other components

  • Data Factory is an integration service that works with data from disparate data stores. You can use this fully managed, serverless platform to orchestrate and automate workflows. Specifically, Data Factory pipelines transfer data to Azure in this solution. A sequence of pipelines then triggers each step of the workflow.

  • Blob Storage offers optimized cloud object storage for large amounts of unstructured data. In this scenario, Blob Storage provides the initial landing zone for the FASTQ file. This service also functions as the output target for the VCF and GVCF files that Microsoft Genomics generates. Tiering functionality in Blob Storage provides a way to archive FASTQ files in inexpensive long-term storage after processing.

  • Azure Databricks is a data analytics platform. Its fully managed Spark clusters process large streams of data from various sources. In this solution, Azure Databricks provides the computational resources that Jupyter Notebook needs to annotate, merge, and analyze the data.

  • Data Lake Storage is a scalable and secure data lake for high-performance analytics workloads. This service can manage multiple petabytes of information while sustaining hundreds of gigabits per second of throughput. The data can be structured, semi-structured, or unstructured and typically comes from multiple, heterogeneous sources. In this architecture, Data Lake Storage provides the final landing zone for the annotated files and the merged datasets. It also gives downstream systems access to the final output.

  • Power BI is a collection of software services and apps that display analytics information. With Power BI, you can connect to and visualize unrelated sources of data. In this solution, Power BI dashboards are populated with the results, and clinicians can create visuals from the final dataset.

  • Azure Healthcare APIs is a managed, standards-based, compliant interface for accessing clinical health data. In this scenario, Azure Healthcare APIs passes a FHIR bundle that contains the clinical data to the EHR.
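
The FHIR bundle that carries the clinical data can be sketched as a minimal JSON structure. The `Observation` content below is hypothetical; a production bundle sent through Azure Healthcare APIs would carry complete FHIR resources with patient references and provenance.

```python
import json

def make_fhir_bundle(resources):
    """Wrap clinical resources in a minimal FHIR Bundle structure.

    Simplified sketch of a FHIR "collection" bundle; real bundles
    include fuller resource content and metadata.
    """
    return {
        "resourceType": "Bundle",
        "type": "collection",
        "entry": [{"resource": resource} for resource in resources],
    }

# Hypothetical genomic finding packaged for the EHR
bundle = make_fhir_bundle([
    {"resourceType": "Observation", "status": "final",
     "code": {"text": "BRCA1 variant detected"}},
])
payload = json.dumps(bundle)  # serialized body for the API call
```

The serialized `payload` is what would be posted to the FHIR endpoint.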

Scenario details

This article presents a solution for genomic analysis and reporting. The processes and results are appropriate for precision medicine scenarios, or areas of medical care that use genetic profiling. Specifically, the solution provides a clinical genomics workflow that automates these tasks:

  • Taking data from a sequencer
  • Moving the data through secondary analysis
  • Providing results that clinicians can consume

The growing scale, complexity, and security requirements of genomics make it an ideal candidate for moving to the cloud. Consequently, the solution uses Azure cloud services in addition to open-source tools. This approach takes advantage of the security, performance, and scalability features of the Azure cloud:

  • Scientists plan to sequence hundreds of thousands of genomes in the coming years. Storing and analyzing this data requires significant computing power and storage capacity. With data centers around the world that provide these resources, Azure can meet these demands.
  • Azure is certified for major global security and privacy standards, such as ISO 27001.
  • Azure complies with the security and provenance standards that the Health Insurance Portability and Accountability Act (HIPAA) establishes for personal health information.

A key component of the solution is Microsoft Genomics. This service offers an optimized secondary analysis implementation that can process a 30x genome in a few hours. Standard technologies can take days.
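
To give a rough sense of the scale behind that figure, the following back-of-the-envelope calculation assumes a ~3.2 Gb human genome and 150 bp short reads. Both values are typical industry figures, not numbers from this article.

```python
# Rough scale of a 30x whole-genome sample (assumed typical values)
genome_size = 3_200_000_000   # base pairs in a human genome (approximate)
coverage = 30                 # "30x" means each base is sequenced ~30 times
read_length = 150             # base pairs per short read (typical)

bases_needed = genome_size * coverage          # total sequenced bases
reads_needed = bases_needed // read_length     # number of short reads
print(f"{reads_needed:,} reads (~{bases_needed / 1e9:.0f} Gb of sequence)")
```

Under these assumptions, a single 30x sample involves on the order of 640 million reads, which is why an optimized secondary analysis engine matters.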

Potential use cases

This solution is ideal for the healthcare industry. It applies to many areas:

  • Risk scoring patients for cancer
  • Identifying patients with genetic markers that predispose them to disease
  • Generating patient cohorts for studies

Considerations

The following considerations align with the Microsoft Azure Well-Architected Framework and apply to this solution:

Availability

The service level agreements (SLAs) of most Azure components guarantee availability.

Scalability

Most Azure services are scalable by design.

Security

Security provides assurances against deliberate attacks and the abuse of your valuable data and systems. For more information, see Overview of the security pillar.

The technologies in this solution meet most companies' requirements for security.

Guidelines

Because of the sensitive nature of medical data, establish governance and security by following guidelines in the areas that the following sections cover.

Regulatory compliance

General security features

Several components also secure data in other ways.

Cost optimization

Cost optimization is about looking at ways to reduce unnecessary expenses and improve operational efficiencies. For more information, see Overview of the cost optimization pillar.

With most Azure services, you can reduce costs by only paying for what you use.

Contributors

This article is maintained by Microsoft. It was originally written by the following contributors.

Principal authors:


Next steps

Fully deployable architectures:

  • Data Factory solutions
  • Analytics solutions
  • Healthcare solutions