Use Healthcare data foundations in healthcare data solutions (preview)

[This article is prerelease documentation and is subject to change.]

The Healthcare data foundations capability enhances Fast Healthcare Interoperability Resources (FHIR) data processing within the data lake environment and efficiently structures data for analytics and AI/machine learning modeling. These data pipelines flatten or transform the ingested FHIR JSON data into a tabular structure. Because the resulting tables are accessible through traditional SQL tooling, you can conduct exploratory analysis across various aspects of healthcare data, including the clinical, financial (claims and explanation of benefits), and administrative data modules.

The foundation of healthcare data solutions (preview) is based on the medallion lakehouse architecture, in which data is logically organized and processed in multiple layers. The goal is to incrementally and progressively improve the structure and quality of data as it flows through each layer of the architecture. The first layer, bronze, maintains the raw state of the data source. The second layer, silver, represents a validated, enriched version of the data. The third and final layer, gold, contains highly refined and aggregated data. After data flattening, traditional SQL tooling can use the clinical data to conduct exploratory analysis. Healthcare data foundations also supports the following functionalities:

  1. Using Microsoft Fabric notebooks that support seamless interaction with lakehouse data using popular PySpark and Python libraries for data exploration and processing.
  2. Using SQL endpoints with T-SQL to query the tabular data and conduct ad hoc or exploratory analysis.
  3. Using Power BI to visualize data stored in OneLake. You can create dashboards, reports, charts, and graphs to explore and present the data in a meaningful way.
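To make the flattening concrete, the following minimal sketch shows the general idea of turning FHIR JSON resources into a tabular structure: primitive top-level elements become columns. The patient data, column selection, and variable names here are illustrative assumptions, not the pipeline's actual implementation.

```python
import json

# Two simplified FHIR Patient resources (illustrative sample data).
fhir_patients = [
    {"resourceType": "Patient", "id": "p1", "gender": "male",
     "birthDate": "1980-01-15",
     "name": [{"family": "Shaw", "given": ["Peter"]}]},
    {"resourceType": "Patient", "id": "p2", "gender": "female",
     "birthDate": "1987-02-20",
     "name": [{"family": "Chalmers", "given": ["Amy"]}]},
]

# Primitive elements map directly to tabular columns.
columns = ["id", "gender", "birthDate"]
rows = [[p.get(c) for c in columns] for p in fhir_patients]
print(rows[0])  # ['p1', 'male', '1980-01-15']
```

Once data is in this tabular shape, it can be queried with T-SQL through the SQL endpoint or visualized in Power BI.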

To learn more about the capability and understand how to deploy and configure it, go to:

The capability also includes the global configuration notebook healthcare#_msft_config_notebook, which helps you set up and manage the configuration necessary for all the data transformations in healthcare data solutions (preview).

Important

Avoid executing this notebook directly, as it's executed from the other notebooks during setup.

The Healthcare data foundations capability is required to run other healthcare data solutions (preview) capabilities. Hence, ensure that you successfully deploy this capability before attempting to deploy other capabilities. However, before you can run the Healthcare data foundations pipelines, you must set up and execute the FHIR data ingestion pipeline to ingest your FHIR data or the sample data. For more information, go to:

Prerequisites

Before executing the FHIR export service notebook, make sure you've completed the following steps:

Bronze ingestion

This section explains how to use the healthcare#_msft_raw_bronze_ingestion notebook deployed as part of the Healthcare data foundations capability. The notebook invokes the BronzeIngestionService module in the healthcare data solutions (preview) library to ingest FHIR data from the NDJSON source files into the corresponding table in the bronze lakehouse. For more information about this notebook, see healthcare#_msft_raw_bronze_ingestion.

The notebook can be run on demand or on a preferred schedule as part of a data pipeline in Microsoft Fabric. For more information, see Data pipeline runs in Fabric.
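Conceptually, bronze ingestion reads the NDJSON source files (one JSON resource per line) and lands the resources in the bronze table corresponding to their resource type. The following sketch illustrates that grouping step; the inline data and variable names are assumptions for illustration, not the BronzeIngestionService API.

```python
import json
from collections import defaultdict

# Illustrative NDJSON content: one FHIR resource per line.
ndjson_lines = [
    '{"resourceType": "Patient", "id": "p1", "gender": "male"}',
    '{"resourceType": "Patient", "id": "p2", "gender": "female"}',
    '{"resourceType": "Observation", "id": "o1", "status": "final"}',
]

# Group resources by resourceType, the way bronze ingestion lands
# one lakehouse table per FHIR resource type.
tables = defaultdict(list)
for line in ndjson_lines:
    resource = json.loads(line)
    tables[resource["resourceType"]].append(resource)

print(len(tables["Patient"]))  # 2
```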

After executing the notebook, you can verify whether the records were successfully ingested into the corresponding bronze lakehouse table by running a SELECT query against it, as shown in the following example with the Patient table:

A screenshot displaying a sample SELECT query for verifying a successful notebook run.

Note

By default, the notebook is configured to use the sample data provided with healthcare data solutions (preview). If you want to use your own data instead, update the source_path_pattern variable in the notebook to point to the location of your data.

Silver flattening

This section explains how to use the healthcare#_msft_bronze_silver_flatten notebook deployed as part of the Healthcare data foundations capability. The notebook invokes the SilverIngestionService module in the healthcare data solutions (preview) library to flatten the FHIR resources from the bronze lakehouse tables and ingest the resulting data into the corresponding silver lakehouse tables. By default, you aren't expected to make any changes to this notebook. If you prefer pointing to different source and target lakehouses, you can change the values in the healthcare#_msft_config_notebook.

We recommend scheduling this notebook job to run every four hours. The initial run might not have data to consume because of concurrent and dependent jobs, leading to latency. Adjusting the frequency of the higher-layer jobs can reduce this latency.

After executing the notebook, you can verify whether the records were successfully ingested into the corresponding silver lakehouse table by running a SELECT query against it, as shown in the following example with the Patient table:

A screenshot displaying a sample SELECT query for verifying a successful notebook run for silver flattening.

Guidelines for silver flattening

The flattening of each FHIR domain resource is processed with the following rules:

  • Primitive FHIR elements, such as string, integer, and boolean, are flattened and encoded into native SQL types (in the delta or parquet storage format). A few examples of such FHIR elements are gender and birthDate in the Patient resource.

    A screenshot displaying the example primitive elements.

  • Complex and multi-value FHIR elements are persisted as structs and arrays (in the parquet format). A few examples of such FHIR elements are identifier, name, and telecom in the Patient resource.

    A screenshot displaying the example multi-value elements.

  • Currently, the SQL analytics endpoint doesn't surface complex types (such as structs and arrays). Each complex column is converted to a string representation of the complex value and labeled with a _string suffix. You can then query it from the SQL analytics endpoint. A few examples of such FHIR elements are name_string, telecom_string, and identifier_string in the Patient resource.

    A screenshot displaying the example string elements.
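The rules above can be sketched in a few lines of Python. This is an illustrative approximation of the behavior described, not the SilverIngestionService implementation; the function name and JSON serialization choice are assumptions.

```python
import json

def to_sql_endpoint_view(row: dict) -> dict:
    """Convert struct/array columns to JSON strings with a _string
    suffix, mirroring how complex columns are surfaced for the SQL
    analytics endpoint. Primitives pass through unchanged."""
    out = {}
    for key, value in row.items():
        if isinstance(value, (dict, list)):
            out[key + "_string"] = json.dumps(value)  # complex -> string
        else:
            out[key] = value  # primitive -> native column
    return out

silver_row = {
    "id": "p1",
    "gender": "female",
    "name": [{"family": "Chalmers", "given": ["Amy"]}],
}
view = to_sql_endpoint_view(silver_row)
print(sorted(view))  # ['gender', 'id', 'name_string']
```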

Normalization

Normalization is the process of reducing data redundancy in a table and improving data integrity. The system performs it as an extra step after flattening the data, while transforming data from the bronze layer to the silver layer. Common data types, such as datetime fields and reference fields, are normalized based on the following HL7 SQL on FHIR guidelines:

  • A single reference and resource ID is normalized using the following rules:

    • All resource IDs are normalized by applying the SHA-256 hash function to the following specific format: sourceSystemURL/resourceType/resourceId. This process ensures that each resource has a consistent and unique identifier across the system.

    • The original ID is then preserved as id_orig.

    This approach of normalizing a single resource object is recursively used to normalize an array of reference objects and nested references.

  • FHIR date elements are converted to UTC and their original value is persisted in an _orig column.
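The two rules above can be sketched as follows. The hashed-ID format (sourceSystemURL/resourceType/resourceId) and the id_orig/_orig column names come from the description above; the function names and example values are illustrative assumptions.

```python
import hashlib
from datetime import datetime, timezone

def normalize_resource_id(source_system_url: str,
                          resource_type: str,
                          resource_id: str) -> str:
    """SHA-256 hash of 'sourceSystemURL/resourceType/resourceId',
    giving each resource a consistent, unique identifier."""
    raw = f"{source_system_url}/{resource_type}/{resource_id}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()

def normalize_date(fhir_datetime: str) -> dict:
    """Convert a FHIR dateTime to UTC, preserving the original
    value in a companion _orig column."""
    parsed = datetime.fromisoformat(fhir_datetime)
    return {
        "value": parsed.astimezone(timezone.utc).isoformat(),
        "value_orig": fhir_datetime,
    }

new_id = normalize_resource_id("https://fhir.example.org", "Patient", "p1")
print(len(new_id))  # 64 hex characters

norm = normalize_date("2023-05-01T10:30:00+02:00")
print(norm["value"])  # 2023-05-01T08:30:00+00:00
```

Because the hash input includes the source system URL, the same resource ID from two different source systems yields two distinct normalized IDs.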

See also