This solution for a data consortium uses Azure components. It meets these goals:
- Provide a way for multiple organizations to share data.
- Centralize data orchestration efforts.
- Ensure data security.
- Guarantee patient privacy.
- Support data interoperability.
- Offer customization options to meet specific organizations' requirements.
Architecture
Download a Visio file of this architecture.
Dataflow
Raw data originates in on-premises and third-party sources. Members of the consortium load this data into storage services that work with Azure Data Share, such as Azure Synapse Analytics, Azure SQL Database, Azure Data Lake Storage, or Azure Data Explorer.
The consortium asks members to share data. As data producers, members can either share snapshots or use in-place sharing.
As a data consumer, the consortium receives the shared member data. This data enters Data Lake Storage in the consortium's Data Share for further transformation.
Azure Data Factory and Azure Databricks clean the member data and transform it into a common format.
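As an illustration of this step, a Databricks notebook can standardize each member's extract before the data is combined. The following PySpark sketch assumes hypothetical member columns (`pat_id`, `dob`), paths, and a target schema; it isn't the consortium's actual mapping.

```python
# Minimal PySpark sketch (Databricks notebook) that maps one member's raw extract
# to a hypothetical common schema. Column names and paths are illustrative only.
# `spark` is predefined in Databricks notebooks.
from pyspark.sql import functions as F

raw = (spark.read
       .option("header", "true")
       .csv("abfss://raw@consortiumlake.dfs.core.windows.net/member-a/patients.csv"))

common = (raw
          .withColumnRenamed("pat_id", "patient_id")
          .withColumn("birth_date", F.to_date("dob", "MM/dd/yyyy"))
          .withColumn("source_member", F.lit("member-a"))
          .dropDuplicates(["patient_id"])
          .select("patient_id", "birth_date", "source_member"))

# Write the standardized data to a curated zone in Delta format.
common.write.format("delta").mode("overwrite").save(
    "abfss://curated@consortiumlake.dfs.core.windows.net/patients/member-a")
```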
The consortium combines the member data and stores it in a storage service. The structure and volume of the data determine which service is most suitable. Possibilities include:
- Azure Synapse Analytics
- Azure SQL Database
- Azure Data Lake Storage
- Azure Data Explorer
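After each member's extract is standardized, the combine step can be as simple as a union followed by a write to the chosen store. This PySpark sketch reuses the hypothetical curated paths from the previous example and writes to Data Lake Storage; writing to Azure Synapse Analytics or SQL Database would use a JDBC or dedicated connector instead.

```python
# Minimal sketch: union the standardized member datasets and persist the combined
# set in Data Lake Storage. Paths are illustrative; `spark` is predefined in Databricks.
from functools import reduce
from pyspark.sql import DataFrame

member_paths = [
    "abfss://curated@consortiumlake.dfs.core.windows.net/patients/member-a",
    "abfss://curated@consortiumlake.dfs.core.windows.net/patients/member-b",
]

frames = [spark.read.format("delta").load(path) for path in member_paths]
combined = reduce(DataFrame.unionByName, frames)

combined.write.format("delta").mode("overwrite").save(
    "abfss://curated@consortiumlake.dfs.core.windows.net/patients/all-members")
```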
As a data share producer, the consortium invites members to receive data. Members can accept either snapshot data or in-place sharing data.
As data consumers, members receive the shared data. The data enters member data stores for research and analysis.
Throughout the system:
- Microsoft Entra ID, Azure Key Vault, and Microsoft Defender for Cloud manage access and provide security.
- Azure Pipelines, a service of Azure DevOps, builds, tests, and releases code.
Components
This solution uses the following components:
Healthcare platforms
Electronic Health Records (EHRs) are digital, real-time records of patient information.
Fast Healthcare Interoperability Resources (FHIR) is a standard for healthcare data exchange that Health Level Seven International (HL7) publishes.
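Because FHIR resources are exchanged as JSON (or XML) documents, standard tooling can read them. The following sketch parses a minimal, hypothetical Patient resource with only the Python standard library; a real workload would typically validate the resource against the FHIR specification or use a FHIR server API.

```python
# Minimal sketch: parse a hypothetical FHIR R4 Patient resource.
import json

patient_json = """
{
  "resourceType": "Patient",
  "id": "example",
  "name": [{"family": "Chalmers", "given": ["Peter"]}],
  "birthDate": "1974-12-25"
}
"""

patient = json.loads(patient_json)
family = patient["name"][0]["family"]
given = " ".join(patient["name"][0]["given"])
print(f"{patient['resourceType']} {patient['id']}: {given} {family}, born {patient['birthDate']}")
```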
The Internet of Medical Things (IoMT) is the collection of medical devices and apps that connect to IT systems through online computer networks.
Genomics data provides information on how genes interact with each other and the environment.
Imaging data includes the images that radiology, cardiology imaging, radiotherapy, and other devices produce.
Customer relationship management (CRM), billing, and third-party systems provide data on patients.
Azure components
Azure Data Share provides a way for multiple organizations to securely share data. With this service, data providers stay in control of data that they share. It's simple to manage and monitor who shared what data at what time. Data Share also makes it easy to enrich analytics and AI scenarios by combining data from different members.
Azure Synapse Analytics is an analytics service for data warehouses and big data systems. With this product, you can query data with serverless, on-demand resources or with provisioned ones. Azure Synapse Analytics works well with a high volume of structured data.
Azure SQL Database is a fully managed platform as a service (PaaS) database engine. With AI-powered, automated features, SQL Database handles database management functions like upgrading, patching, backups, and monitoring. This service is a good fit for structured data.
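For either of these SQL-based stores, members and the consortium can query the combined data over a standard ODBC connection. This sketch is a minimal example that assumes a hypothetical server, database, and table, and that the Microsoft ODBC Driver 18 for SQL Server is installed.

```python
# Minimal sketch: query structured consortium data in Azure Synapse Analytics or
# Azure SQL Database over ODBC. Server, database, and table names are hypothetical.
import pyodbc

conn_str = (
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=tcp:consortium-sql.database.windows.net,1433;"
    "Database=consortium;"
    "Authentication=ActiveDirectoryInteractive;"  # Microsoft Entra ID sign-in
    "Encrypt=yes;"
)

with pyodbc.connect(conn_str) as conn:
    cursor = conn.cursor()
    cursor.execute("SELECT TOP 10 patient_id, birth_date FROM dbo.patients;")
    for row in cursor.fetchall():
        print(row.patient_id, row.birth_date)
```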
Data Lake Storage is a massively scalable and secure data lake for high-performance analytics workloads. This service can manage multiple petabytes of information while sustaining hundreds of gigabits of throughput. Data Lake Storage provides a way to store structured and unstructured data from multiple members in one location.
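As a sketch of how a member might land raw files in Data Lake Storage before sharing, the following example uses the azure-storage-file-datalake and azure-identity packages; the account, file system, and path names are hypothetical.

```python
# Minimal sketch: upload a local file to Azure Data Lake Storage Gen2.
# Account, file system, and path names are hypothetical.
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

credential = DefaultAzureCredential()
service = DataLakeServiceClient(
    account_url="https://consortiumlake.dfs.core.windows.net",
    credential=credential,
)

file_system = service.get_file_system_client("raw")
file_client = file_system.get_file_client("member-a/patients.csv")

with open("patients.csv", "rb") as data:
    file_client.upload_data(data, overwrite=True)
```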
Azure Data Explorer is a fast, fully managed data analytics service. You can use this service for real-time analysis on large volumes of data. Azure Data Explorer can handle diverse data streams from applications, websites, IoT devices, and other sources. Azure Data Explorer is a good fit for in-place sharing of streaming telemetry and log data.
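For example, a consumer can run a Kusto Query Language (KQL) query against a shared database by using the azure-kusto-data package. The cluster URI, database, and table names here are hypothetical.

```python
# Minimal sketch: query shared telemetry in Azure Data Explorer with KQL.
# Cluster, database, and table names are hypothetical.
from azure.kusto.data import KustoClient, KustoConnectionStringBuilder

cluster = "https://consortiumadx.westus2.kusto.windows.net"
kcsb = KustoConnectionStringBuilder.with_az_cli_authentication(cluster)
client = KustoClient(kcsb)

query = "DeviceTelemetry | where Timestamp > ago(1d) | summarize count() by DeviceId"

response = client.execute("SharedTelemetry", query)
for row in response.primary_results[0]:
    print(row["DeviceId"], row["count_"])
```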
Azure Data Factory is a hybrid data integration service. You can use this fully managed, serverless solution for data integration and transformation workflows. Data Factory offers a code-free UI and an easy-to-use monitoring panel. In this solution, Data Factory pipelines ingest data from disparate member data shares.
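For instance, the consortium could start an ingestion pipeline run programmatically by using the azure-mgmt-datafactory package. The subscription, resource group, factory, pipeline, and parameter names are hypothetical, and the pipeline is assumed to already exist.

```python
# Minimal sketch: start an existing Data Factory pipeline run and check its status.
# Subscription, resource group, factory, and pipeline names are hypothetical.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

credential = DefaultAzureCredential()
adf = DataFactoryManagementClient(credential, "<subscription-id>")

run = adf.pipelines.create_run(
    resource_group_name="consortium-rg",
    factory_name="consortium-adf",
    pipeline_name="IngestMemberShares",
    parameters={"member": "member-a"},
)

status = adf.pipeline_runs.get("consortium-rg", "consortium-adf", run.run_id)
print(status.status)  # For example: InProgress, Succeeded, or Failed
```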
Azure Databricks is a data analytics platform. Based on the latest Apache Spark distributed processing system, Azure Databricks supports seamless integration with open-source libraries. This solution uses Azure Databricks notebooks to transform all member data into a common format.
Microsoft Entra ID is a multi-tenant, cloud-based identity and access management service.
Azure Key Vault securely stores and controls access to secrets like API keys, passwords, certificates, and cryptographic keys. This cloud service also manages security certificates.
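Together, these two services let application code authenticate with Microsoft Entra ID and read secrets at run time instead of embedding them. The following sketch uses the azure-identity and azure-keyvault-secrets packages; the vault and secret names are hypothetical.

```python
# Minimal sketch: authenticate with Microsoft Entra ID and read a secret from Key Vault.
# Vault and secret names are hypothetical.
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

credential = DefaultAzureCredential()  # Uses a managed identity, the Azure CLI, and so on
client = SecretClient(
    vault_url="https://consortium-kv.vault.azure.net",
    credential=credential,
)

secret = client.get_secret("sql-connection-string")
print(secret.name)  # Avoid logging secret.value in a real workload
```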
Azure Pipelines automatically builds and tests code projects. This Azure DevOps service combines continuous integration and continuous delivery (CI/CD). Using these practices, Azure Pipelines constantly and consistently tests and builds code and ships it to any target.
Defender for Cloud provides unified security management and advanced threat protection across hybrid cloud workloads.
Alternatives
Many data storage services work with Data Share. Your choice of service depends on your sharing method and on the volume and type of your data:
For snapshot sharing of batch data, use any of these services:
- Azure Synapse Analytics
- SQL Database
- Data Lake Storage
- Azure Blob Storage
For in-place sharing of streaming telemetry and log data, use Azure Data Explorer. For more information on analyzing data from various sources, see Azure Data Explorer interactive analytics.
Some datasets are large or non-relational. Some don't contain data in standardized formats. For these types of datasets, Blob Storage and Azure Data Lake Storage work better than Azure Synapse Analytics and SQL Database for exchanging data with Data Share. For more information on storing medical data efficiently, see Medical data storage solutions.
If Data Share isn't an option, consider a virtual private network (VPN) instead. You can use a site-to-site VPN to transfer data between member and consortium data stores.
Scenario details
Traditional clinical trials can be complex, time consuming, and costly. To address these issues, a growing number of healthcare organizations are partnering to build data consortiums for conducting clinical trials.
Data consortiums benefit healthcare in many ways:
- Make research data available.
- Provide new revenue streams.
- Lead to cost-effective regulatory decisions by providing quick access to data.
- Keep patients safer and healthier by accelerating innovation.
Potential use cases
Many types of healthcare organizations and professionals can benefit from this solution:
- Organizations that use real-world observational data like patient outcomes to determine treatments.
- Physicians who specialize in personalized or precision medicine.
- Telemedicine providers who need easy access to patient data.
- Researchers who work with genomic data.
Considerations
These considerations implement the pillars of the Azure Well-Architected Framework, which is a set of guiding tenets that can be used to improve the quality of a workload. For more information, see Microsoft Azure Well-Architected Framework.
The technologies in this solution meet most companies' requirements for security, scalability, and availability.
Security
Security provides assurances against deliberate attacks and the abuse of your valuable data and systems. For more information, see Overview of the security pillar.
Because of the sensitivity of medical information, several components play a role in securing data:
Security features in Data Share protect data in these ways:
- Encrypting data at rest, where the underlying data store supports at-rest encryption.
- Encrypting data in transit by using Transport Layer Security (TLS) 1.2.
- Encrypting metadata about a data share at rest and in transit.
- Not storing contents of shared customer data.
Azure Synapse Analytics offers a comprehensive security model. You can use its fine-grained controls to secure your data at every level, from single cells to entire databases.
SQL Database uses a layered approach to protect customer data. The strategy covers these areas:
- Network security
- Access management
- Threat protection
- Information protection
Data Lake Storage provides an access control model that supports these types of controls:
- Azure role-based access control (RBAC)
- Portable Operating System Interface (POSIX) access control lists (ACLs)
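As an illustration of the POSIX ACL model, the following sketch grants a hypothetical member's service principal read and execute access to that member's directory by using the azure-storage-file-datalake package. The account, file system, directory, and object ID are placeholders.

```python
# Minimal sketch: set a POSIX ACL on a member's directory in Data Lake Storage Gen2.
# The account, file system, directory, and object ID are hypothetical placeholders.
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://consortiumlake.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)

directory = service.get_file_system_client("raw").get_directory_client("member-a")

# Owner keeps full control; the member's service principal gets read and execute.
member_object_id = "00000000-0000-0000-0000-000000000000"
directory.set_access_control(
    acl=f"user::rwx,group::r-x,other::---,user:{member_object_id}:r-x"
)
```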
Azure Data Explorer protects data in these ways:
- Uses Microsoft Entra ID–managed identities for Azure resources.
- Uses RBAC to segregate duties and limit access.
- Blocks traffic that originates from network segments outside Azure Data Explorer.
- Safeguards data and helps you meet commitments by using Azure Disk Encryption, which provides volume encryption for the OS and data disks of virtual machines. Azure Disk Encryption also integrates with Key Vault, which encrypts secrets with Microsoft-managed keys or customer-managed keys.
Availability
This solution uses a single-region deployment. Some scenarios require a multi-region deployment for high availability, disaster recovery, or proximity. For those cases, the following services offer paired Azure regions for high availability:
- Azure Synapse Analytics provides high warehouse availability by using database snapshots.
- The high-availability architecture of SQL Database provides a 99.99 percent uptime guarantee.
- Azure Data Explorer offers high availability through a persistence layer, a compute layer, and a leader-follower configuration.
Cost optimization
Cost optimization is about looking at ways to reduce unnecessary expenses and improve operational efficiencies. For more information, see Overview of the cost optimization pillar.
Pricing for this solution depends on several factors:
- The services you choose
- Your system's capacity and throughput
- The transformations that you use on data
- Your business continuity level
- Your disaster recovery level
For more information, see Pricing details.
Contributors
This article is maintained by Microsoft. It was originally written by the following contributors.
Principal authors:
- Matt Hansen | Senior Cloud Solution Architect
- Aruna Ranganathan | Principal Customer Engineering Manager
Next steps
Determine how to customize the solution by clarifying these points:
- The data sources that are available
- The location of each data source
- Which Azure services members can use to receive source data
- Which data members can share with the consortium
- How members can share data: in batches as snapshots or as data streams with in-place sharing
- Which Azure services the consortium can use to receive shared data
- The format of the member data and whether it needs cleaning or transforming
- Which data the consortium can share with members
Product documentation:
- What is Microsoft Cloud for Healthcare?
- What is Azure Data Share?
- What is Azure Synapse Analytics?
- What is Azure SQL Database?
- Introduction to Azure Data Lake Storage
- What is Azure Data Explorer?
- What is Azure Data Factory?
- What is Azure Databricks?
- What is Microsoft Entra ID?
- About Azure Key Vault
- What is Azure Pipelines?