Predict hospital readmissions with traditional and automated machine learning techniques

Machine Learning
Synapse Analytics
Data Factory

This architecture provides a predictive health analytics framework in the cloud to accelerate the path of model development, deployment, and consumption.

Architecture

This framework makes use of native Azure analytics services for data ingestion, storage, data processing, analysis, and model deployment.

Diagram demonstrates the architecture of a multi-tier app.

Workflow

The workflow of this architecture is described in terms of the roles of the participants.

  1. Data Engineer: Responsible for ingesting the data from the source systems and orchestrating data pipelines to move data from the source to the target. May also be responsible for performing data transformations on the raw data.

    • In this scenario, historical hospital readmissions data is stored in an on-premises SQL Server database.
    • The expected output is readmissions data that's stored in a cloud-based storage account.
  2. Data Scientist: Responsible for performing various tasks on the data in the target storage layer, to prepare it for model prediction. The tasks include cleansing, feature engineering, and data standardization.

    • Cleansing: Pre-process the data, removing null values, dropping unneeded columns, and so on. In this scenario, drop columns with too many missing values.
    • Feature Engineering:
      1. Determine the inputs that are needed to predict the desired output.
      2. Determine possible predictors for readmittance, perhaps by talking to professionals such as doctors and nurses. For example, real-world evidence may suggest that a diabetic patient being overweight is a predictor for hospital readmission.
    • Data Standardization:
      1. Characterize the location and variability of the data to prepare it for machine learning tasks. The characterizations should include data distribution, skewness, and kurtosis.
        • Skewness answers the question: How asymmetric is the distribution around its mean?
        • Kurtosis answers the question: How heavy are the tails of the distribution?
      2. Identify and correct anomalies in the dataset. The prediction model should be trained on a dataset with a normal distribution.
      3. The expected output is these training datasets:
        • One to use for creating a satisfactory prediction model that's ready for deployment.
        • One that can be given to a Citizen Data Scientist for automated model prediction (AutoML).
  3. Citizen Data Scientist: Responsible for building a prediction model that's based on training data from the Data Scientist. A Citizen Data Scientist most likely uses an AutoML capability that doesn't require heavy coding skills to create prediction models.

    The expected output is a satisfactory prediction model that's ready for deployment.

  4. Business Intelligence (BI) Analyst: Responsible for performing operational analytics on raw data that the Data Engineer produces. The BI Analyst may be involved in creating relational data from unstructured data, writing SQL scripts, and creating dashboards.

    The expected output is relational queries, BI reports, and dashboards.

  5. MLOps Engineer: Responsible for putting the models that the Data Scientist or Citizen Data Scientist provides into production.

    The expected output is reproducible models that are ready for production.

Although this list provides a comprehensive view of all the potential roles that may be interacting with healthcare data at any point in the workflow, the roles may be consolidated or expanded as needed.
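The cleansing and data standardization tasks described for the Data Scientist can be sketched in a few lines of plain Python. This is a minimal illustration, not the notebook code used in the architecture: the function names, the 40% missing-value threshold, the `"?"` missing-value marker, and the sample rows are all assumptions made for the example. Skewness and kurtosis are computed from population central moments, and the kurtosis is reported as excess kurtosis (zero for a normal distribution).

```python
# Hypothetical helpers for illustration; names and thresholds are assumptions.
MISSING_MARKERS = (None, "", "?")  # values treated as missing (assumed)


def drop_sparse_columns(rows, max_missing_fraction=0.4):
    """Return rows without the columns whose missing share exceeds the threshold."""
    if not rows:
        return []
    total = len(rows)
    keep = []
    for column in rows[0].keys():
        missing = sum(1 for row in rows if row.get(column) in MISSING_MARKERS)
        if missing / total <= max_missing_fraction:
            keep.append(column)
    return [{column: row[column] for column in keep} for row in rows]


def central_moment(values, order):
    """Population central moment of the given order."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** order for v in values) / len(values)


def skewness(values):
    """Asymmetry of the distribution: 0 if symmetric, > 0 if right-tailed."""
    variance = central_moment(values, 2)
    if variance == 0:
        raise ValueError("skewness is undefined for constant data")
    return central_moment(values, 3) / variance ** 1.5


def excess_kurtosis(values):
    """Tail heaviness relative to a normal distribution (which scores 0)."""
    variance = central_moment(values, 2)
    if variance == 0:
        raise ValueError("kurtosis is undefined for constant data")
    return central_moment(values, 4) / variance ** 2 - 3.0


# Made-up sample rows: 'weight' is missing in two of three rows and is dropped.
sample_rows = [
    {"age": 50, "weight": None, "readmitted": 1},
    {"age": 60, "weight": "?", "readmitted": 0},
    {"age": None, "weight": 80, "readmitted": 0},
]
cleaned_rows = drop_sparse_columns(sample_rows)
```

A strongly positive skewness or a large excess kurtosis on a numeric feature is a signal to look for outliers or to consider a transformation before training.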

Components

  • Azure Data Factory is an orchestration service that can move data from on-premises systems to Azure, to work with other Azure data services. Pipelines are used for data movement, and mapping data flows are used to perform various transformation tasks such as extract, transform, load (ETL) and extract, load, transform (ELT). In this architecture, the Data Engineer uses Data Factory to run a pipeline that copies historical hospital readmission data from an on-premises SQL Server to cloud storage.
  • Azure Databricks is a Spark-based analytics and machine learning service that's used for data engineering and ML workloads. In this architecture, the Data Engineer uses a Data Factory pipeline to run a Databricks notebook. The notebook is developed by the Data Scientist to handle the initial data cleansing and feature engineering tasks. The Data Scientist may write code in additional notebooks to standardize the data and to build and deploy prediction models.
  • Azure Data Lake Storage is a massively scalable and secure storage service for high-performance analytics workloads. In this architecture, the Data Engineer uses Data Lake Storage to define the initial landing zone for the on-premises data that's loaded to Azure, and the final landing zone for the training data. The data, in raw or final format, is ready for consumption by various downstream systems.
  • Azure Machine Learning is a collaborative environment that's used to train, deploy, automate, manage, and track machine learning models. Automated machine learning (AutoML) is a capability that automates the time-consuming and iterative tasks that are involved in ML model development. The Data Scientist uses Machine Learning to track ML runs from Databricks, and to create AutoML models to serve as a performance benchmark for the Data Scientist's ML models. A Citizen Data Scientist uses this service to quickly run training data through AutoML to generate models, without needing detailed knowledge of machine learning algorithms.
  • Azure Synapse Analytics is an analytics service that unifies data integration, enterprise data warehousing, and big data analytics. Users have the freedom to query data by using serverless or dedicated resources, at scale. In this architecture:
    • The Data Engineer uses Synapse Analytics to easily create relational tables from data in the data lake to be the foundation for operational analytics.
    • The Data Scientist uses it to quickly query data in the data lake and develop prediction models by using Spark notebooks.
    • The BI Analyst uses it to run queries using familiar SQL syntax.
  • Microsoft Power BI is a collection of software services, apps, and connectors that work together to turn unrelated sources of data into coherent, visually immersive, and interactive insights. The BI Analyst uses Power BI to develop visualizations from the data, such as a map of each patient's home location and nearest hospital.
  • Azure Active Directory (Azure AD) is a cloud-based identity and access management service. In this architecture, it controls access to the Azure services.
  • Azure Key Vault is a cloud service that provides a secure store for secrets such as keys, passwords, and certificates. Key Vault holds the secrets that Databricks uses to gain write access to the data lake.
  • Microsoft Defender for Cloud is a unified infrastructure security management system that strengthens the security posture of data centers, and provides advanced threat protection across hybrid workloads in the cloud and on-premises. You can use it to monitor security threats against the Azure environment.
  • Azure Kubernetes Service (AKS) is a fully managed Kubernetes service for deploying and managing containerized applications. AKS simplifies deployment of a managed AKS cluster in Azure by offloading the operational overhead to Azure.
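As an illustration of how the Data Factory and Databricks components in the list above fit together, the following sketch builds a pipeline definition as a Python dictionary and serializes it to JSON. It assumes a copy activity from SQL Server to Parquet files in the data lake, followed by a Databricks notebook activity that runs only if the copy succeeds. Every name in it (pipeline, datasets, linked service, notebook path) is hypothetical, and the field layout reflects the Data Factory pipeline JSON format as commonly documented, so check it against the current schema before use.

```python
import json

# All resource names below are hypothetical placeholders for this sketch.
pipeline = {
    "name": "CopyReadmissionsAndPrepare",
    "properties": {
        "activities": [
            {
                # Step 1: copy historical readmission data to the data lake.
                "name": "CopyFromSqlServer",
                "type": "Copy",
                "inputs": [
                    {"referenceName": "OnPremReadmissionsTable", "type": "DatasetReference"}
                ],
                "outputs": [
                    {"referenceName": "DataLakeRawReadmissions", "type": "DatasetReference"}
                ],
                "typeProperties": {
                    "source": {"type": "SqlServerSource"},
                    "sink": {"type": "ParquetSink"},
                },
            },
            {
                # Step 2: run the cleansing and feature engineering notebook,
                # but only after the copy activity has succeeded.
                "name": "RunCleansingNotebook",
                "type": "DatabricksNotebook",
                "dependsOn": [
                    {"activity": "CopyFromSqlServer", "dependencyConditions": ["Succeeded"]}
                ],
                "linkedServiceName": {
                    "referenceName": "AzureDatabricksLinkedService",
                    "type": "LinkedServiceReference",
                },
                "typeProperties": {
                    "notebookPath": "/Shared/readmissions/cleanse_and_engineer"
                },
            },
        ]
    },
}

# Serialize to the JSON text that would be stored or deployed.
pipeline_json = json.dumps(pipeline, indent=2)
```

Keeping the notebook step dependent on the copy step means the Data Scientist's code never runs against a partially loaded landing zone.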

Alternatives

  • Data Movement: You can use Databricks to copy data from an on-premises system to the data lake. Typically, Databricks is appropriate for data that has a streaming or real-time requirement, such as telemetry from a medical device.

  • Machine Learning: H2O.ai, DataRobot, Dataiku, and other vendors offer automated machine learning capabilities that are similar to Machine Learning AutoML. You can use such platforms to supplement Azure data engineering and machine learning activities.

Scenario details

This architecture represents a sample end-to-end workflow for predicting hospital readmissions for diabetes patients, using publicly available data from 130 US hospitals over the 10 years from 1999 to 2008. First, it evaluates a binary classification algorithm for predictive power, and then it benchmarks that algorithm against predictive models that are generated by using automated machine learning. In situations where automated machine learning can't correct for imbalanced data, alternative techniques should be applied. A final model is selected for deployment and consumption.
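The article doesn't name the alternative techniques for imbalanced data. One common option, shown here purely as an assumption, is to weight each class inversely to its frequency so that the rarer class (for example, readmitted patients) counts for more during training. The sketch below computes such weights and also shows why accuracy alone can mislead on imbalanced labels: a model that always predicts "not readmitted" scores high accuracy but zero recall. The function names and the ten sample labels are made up for the example.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Share of actual positives that the model found."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    false_neg = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0


# Made-up labels: 1 = readmitted (rare), 0 = not readmitted (common).
y_true = [0] * 9 + [1]
weights = balanced_class_weights(y_true)

# A naive baseline that never predicts a readmission.
y_naive = [0] * len(y_true)
naive_accuracy = accuracy(y_true, y_naive)
naive_recall = recall(y_true, y_naive)
```

When comparing a hand-built classifier against AutoML models, reporting recall (or a similar class-sensitive metric) alongside accuracy makes this kind of failure visible.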

As healthcare and life science organizations strive to provide a more personalized experience for patients and caregivers, they're challenged to use data from legacy systems to provide predictive insights that are relevant, accurate, and timely. Data collection has moved beyond traditional operational systems and electronic health records (EHRs), and increasingly into unstructured forms from consumer health apps, fitness wearables, and smart medical devices. Organizations need the ability to quickly centralize this data and harness the power of data science and machine learning to stay relevant to their customers.

To achieve these objectives, healthcare and life science organizations should aim to:

  • Create a data source from which predictive analytics can provide real-time value to healthcare providers, hospital administrators, drug manufacturers, and others.
  • Accommodate their industry subject matter experts (SMEs) who don't have data science and machine learning skills.
  • Provide data science and machine learning (ML) SMEs with the flexible tools that they need to create and deploy predictive models efficiently, accurately, and at scale.

Potential use cases

  • Predict hospital readmissions
  • Accelerate patient diagnosis through ML-powered imaging
  • Perform text analytics on physician notes
  • Predict adverse events by analyzing remote patient monitoring data from the Internet of Medical Things (IoMT)

Considerations

These considerations implement the pillars of the Azure Well-Architected Framework, which is a set of guiding tenets that can be used to improve the quality of a workload. For more information, see Microsoft Azure Well-Architected Framework.

Availability

Providing real-time clinical data and insights is critical for many healthcare organizations. Here are ways to minimize downtime and keep data safe:

Performance

The Data Factory self-hosted integration runtime can be scaled out across multiple nodes for high availability and scalability.

Security

Security provides assurances against deliberate attacks and the abuse of your valuable data and systems. For more information, see Overview of the security pillar.

Healthcare data often includes sensitive protected health information (PHI) and personal information. The following resources are available to secure this data:

Cost optimization

Cost optimization is about looking at ways to reduce unnecessary expenses and improve operational efficiencies. For more information, see Overview of the cost optimization pillar.

Pricing for this solution is based on:

  • The Azure services that are used.
  • Volume of data.
  • Capacity and throughput requirements.
  • ETL/ELT transformations that are needed.
  • Compute resources that are needed to perform machine learning tasks.

You can estimate costs by using the Azure pricing calculator.

Contributors

This article is maintained by Microsoft. It was originally written by the following contributors.

Principal authors:
