Use Healthcare data foundations in healthcare data solutions (preview)

[This article is prerelease documentation and is subject to change.]

The Healthcare data foundations capability enhances Fast Healthcare Interoperability Resources (FHIR) data processing within the data lake environment and efficiently structures data for analytics and AI/machine learning modeling. These data pipelines flatten or transform the ingested FHIR JSON data into a tabular structure. Because the resulting tables are accessible through traditional SQL tooling, you can conduct exploratory analysis across various aspects of healthcare data, including the clinical, financial (claims and explanation of benefits), and administrative data modules.

The foundation of healthcare data solutions (preview) is based on the medallion lakehouse architecture, in which data is logically organized and processed in multiple layers. The goal is to incrementally and progressively improve the structure and quality of data as it flows through each layer of the architecture. The first layer, bronze, maintains the raw state of the data source. The second layer, silver, represents a validated, enriched version of the data. The third and final layer, gold, contains highly refined and aggregated data. After data flattening, traditional SQL tooling can use the clinical data to conduct exploratory analysis. Healthcare data foundations also supports the following functionalities:

  1. Using Microsoft Fabric notebooks that support seamless interaction with lakehouse data using popular PySpark and Python libraries for data exploration and processing.
  2. Using SQL endpoints with T-SQL to query the tabular data and conduct ad hoc or exploratory analysis.
  3. Using Power BI to visualize data stored in OneLake. You can create dashboards, reports, charts, and graphs to explore and present the data in a meaningful way.
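To make the flattening concrete, the following minimal sketch shows the general idea of turning FHIR JSON resources into a tabular structure: primitive top-level elements become columns. The patient data, column selection, and variable names here are illustrative assumptions, not the pipeline's actual implementation.

```python
import json

# Two simplified FHIR Patient resources (illustrative sample data).
fhir_patients = [
    {"resourceType": "Patient", "id": "p1", "gender": "male",
     "birthDate": "1980-01-15",
     "name": [{"family": "Shaw", "given": ["Peter"]}]},
    {"resourceType": "Patient", "id": "p2", "gender": "female",
     "birthDate": "1987-02-20",
     "name": [{"family": "Chalmers", "given": ["Amy"]}]},
]

# Primitive elements map directly to tabular columns.
columns = ["id", "gender", "birthDate"]
rows = [[p.get(c) for c in columns] for p in fhir_patients]
print(rows[0])  # ['p1', 'male', '1980-01-15']
```

Once data is in this tabular shape, it can be queried with T-SQL through the SQL endpoint or visualized in Power BI.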

To learn more about the capability and understand how to deploy and configure it, go to:

The capability also includes the global configuration notebook healthcare#_msft_config_notebook, which helps you set up and manage the configuration necessary for all the data transformations in healthcare data solutions (preview).

Important

Avoid executing this notebook directly, as it's executed from the other notebooks during setup.

The Healthcare data foundations capability is required to run other healthcare data solutions (preview) capabilities. Hence, ensure that you successfully deploy this capability before attempting to deploy other capabilities. However, before you can run the Healthcare data foundations pipelines, you must set up and execute the FHIR data ingestion pipeline to ingest your FHIR data or the sample data. For more information, go to:

Prerequisites

Before executing the FHIR export service notebook, make sure you've completed the following steps:

Bronze ingestion

This section explains how to use the healthcare#_msft_raw_bronze_ingestion notebook deployed as part of the Healthcare data foundations capability. The notebook invokes the BronzeIngestionService module in the healthcare data solutions (preview) library to ingest FHIR data from the NDJSON source files into the corresponding table in the bronze lakehouse. For more information about this notebook, see healthcare#_msft_raw_bronze_ingestion.

The notebook can be run on demand or on a preferred schedule as part of a data pipeline in Microsoft Fabric. For more information, see Data pipeline runs in Fabric.
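Conceptually, bronze ingestion reads the NDJSON source files (one JSON resource per line) and lands the resources in the bronze table corresponding to their resource type. The following sketch illustrates that grouping step; the inline data and variable names are assumptions for illustration, not the BronzeIngestionService API.

```python
import json
from collections import defaultdict

# Illustrative NDJSON content: one FHIR resource per line.
ndjson_lines = [
    '{"resourceType": "Patient", "id": "p1", "gender": "male"}',
    '{"resourceType": "Patient", "id": "p2", "gender": "female"}',
    '{"resourceType": "Observation", "id": "o1", "status": "final"}',
]

# Group resources by resourceType, the way bronze ingestion lands
# one lakehouse table per FHIR resource type.
tables = defaultdict(list)
for line in ndjson_lines:
    resource = json.loads(line)
    tables[resource["resourceType"]].append(resource)

print(len(tables["Patient"]))  # 2
```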

After executing the notebook, you can verify whether the records were successfully ingested into the corresponding bronze lakehouse table by running a SELECT query against it, as shown in the following example with the Patient table:

A screenshot displaying a sample SELECT query for verifying a successful notebook run.

Note

By default, the notebook is configured to use the sample data provided with healthcare data solutions (preview). If you want to use your own data instead, update the source_path_pattern variable in the notebook to point to the location of your data.

Silver flattening

This section explains how to use the healthcare#_msft_bronze_silver_flatten notebook deployed as part of the Healthcare data foundations capability. The notebook invokes the SilverIngestionService module in the healthcare data solutions (preview) library to flatten the FHIR resources from the bronze lakehouse tables and ingest the resulting data into the corresponding silver lakehouse tables. By default, you aren't expected to make any changes to this notebook. If you prefer pointing to different source and target lakehouses, you can change the values in the healthcare#_msft_config_notebook.

We recommend scheduling this notebook job to run every four hours. The initial run might not have data to consume because of concurrent and dependent jobs, leading to latency. Adjusting the frequency of the higher-layer jobs can reduce this latency.

After executing the notebook, you can verify whether the records were successfully ingested into the corresponding silver lakehouse table by running a SELECT query against it, as shown in the following example with the Patient table:

A screenshot displaying a sample SELECT query for verifying a successful notebook run for silver flattening.

Guidelines for silver flattening

The flattening of each FHIR domain resource is processed with the following rules:

  • Primitive FHIR elements, such as string, integer, and boolean, are flattened and encoded into native SQL types (in the delta or parquet storage format). A few examples of such FHIR elements are gender and birthDate in the Patient resource.

    A screenshot displaying the example primitive elements.

  • Complex and multi-value FHIR elements are persisted as structs and arrays (in the parquet format). A few examples of such FHIR elements are identifier, name, and telecom in the Patient resource.

    A screenshot displaying the example multi-value elements.

  • Currently, the SQL analytics endpoint doesn't surface complex types (such as structs and arrays). Each complex column is converted to a string representation of the complex value and labeled with a _string suffix. You can then query it from the SQL analytics endpoint. A few examples of such FHIR elements are name_string, telecom_string, and identifier_string in the Patient resource.

    A screenshot displaying the example string elements.
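The rules above can be sketched in a few lines of Python. This is an illustrative approximation of the behavior described, not the SilverIngestionService implementation; the function name and JSON serialization choice are assumptions.

```python
import json

def to_sql_endpoint_view(row: dict) -> dict:
    """Convert struct/array columns to JSON strings with a _string
    suffix, mirroring how complex columns are surfaced for the SQL
    analytics endpoint. Primitives pass through unchanged."""
    out = {}
    for key, value in row.items():
        if isinstance(value, (dict, list)):
            out[key + "_string"] = json.dumps(value)  # complex -> string
        else:
            out[key] = value  # primitive -> native column
    return out

silver_row = {
    "id": "p1",
    "gender": "female",
    "name": [{"family": "Chalmers", "given": ["Amy"]}],
}
view = to_sql_endpoint_view(silver_row)
print(sorted(view))  # ['gender', 'id', 'name_string']
```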

Normalization

Normalization is the process of reducing data redundancy in a table and improving data integrity. The system performs it as an extra step after flattening the data, while transforming data from the bronze layer to the silver layer. Common data types, such as datetime fields and reference fields, are normalized based on the following HL7 SQL on FHIR guidelines:

  • A single reference and resource ID is normalized using the following rules:

    • All resource IDs are normalized by applying the SHA-256 hash function to the following specific format: sourceSystemURL/resourceType/resourceId. This process ensures that each resource has a consistent and unique identifier across the system.

    • The original ID is then preserved as id_orig.

    This approach of normalizing a single resource object is recursively used to normalize an array of reference objects and nested references.

  • FHIR date elements are converted to UTC and their original value is persisted in an _orig column.
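The two rules above can be sketched as follows. The hashed-ID format (sourceSystemURL/resourceType/resourceId) and the id_orig/_orig column names come from the description above; the function names and example values are illustrative assumptions.

```python
import hashlib
from datetime import datetime, timezone

def normalize_resource_id(source_system_url: str,
                          resource_type: str,
                          resource_id: str) -> str:
    """SHA-256 hash of 'sourceSystemURL/resourceType/resourceId',
    giving each resource a consistent, unique identifier."""
    raw = f"{source_system_url}/{resource_type}/{resource_id}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()

def normalize_date(fhir_datetime: str) -> dict:
    """Convert a FHIR dateTime to UTC, preserving the original
    value in a companion _orig column."""
    parsed = datetime.fromisoformat(fhir_datetime)
    return {
        "value": parsed.astimezone(timezone.utc).isoformat(),
        "value_orig": fhir_datetime,
    }

new_id = normalize_resource_id("https://fhir.example.org", "Patient", "p1")
print(len(new_id))  # 64 hex characters

norm = normalize_date("2023-05-01T10:30:00+02:00")
print(norm["value"])  # 2023-05-01T08:30:00+00:00
```

Because the hash input includes the source system URL, the same resource ID from two different source systems yields two distinct normalized IDs.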

See also