This article outlines key considerations to review before using the DICOM data transformation capability. Understanding these factors ensures smooth integration and operation within your healthcare data solutions environment. The article also helps you navigate potential challenges and limitations of the capability.
Ingestion file size
There's a logical size limit of 8 GB for ZIP files and 4 GB for native DCM files. If your files exceed these limits, you might experience longer execution times or failed ingestion. Split larger ZIP files into smaller segments (under 4 GB) to ensure successful execution. For native DCM files larger than 4 GB, scale up the Spark nodes (executors) as needed.
Note
ZIP files are supported only in the Ingest ingestion pattern.
DICOM tag extraction
We prioritize promoting the 29 tags present in the bronze lakehouse ImagingDicom delta table for two reasons:
- These 29 tags are crucial for general querying and exploration of DICOM data, such as modality and laterality.
- These tags are necessary for generating and populating the silver (FHIR) and gold (OMOP) delta tables in subsequent execution steps.
You can extend and promote other DICOM tags of interest. However, the DICOM data transformation notebooks don't automatically recognize or process any other columns of DICOM tags that you add to the ImagingDicom delta table in the bronze lakehouse. You need to process the extra columns independently.
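For example, the following minimal sketch reads an extra promoted tag column from the ImagingDicom table and processes it in a separate notebook step. It assumes a Fabric Spark notebook where `spark` is predefined; the column name `bodyPartExamined` and the target table name are illustrative only, not part of the shipped capability.

```python
# Hypothetical sketch: independently process an extra tag column you promoted.
# "bodyPartExamined" and "ImagingDicomExtraTags" are illustrative names only.
extra_tags = (
    spark.read.table("ImagingDicom")
    .select("studyInstanceUid", "bodyPartExamined")
)
extra_tags.write.mode("overwrite").saveAsTable("ImagingDicomExtraTags")
```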
Append pattern in the bronze lakehouse
All newly ingested DCM (or ZIP) files are appended to the ImagingDicom delta table in the bronze lakehouse. For every successfully ingested DCM file, a new record entry is created in the ImagingDicom delta table. There's no business logic for merge or update operations at the bronze lakehouse level.
The ImagingDicom delta table reflects every ingested DCM file at the DICOM instance level and should be treated as such. If the same DCM file is ingested again into the Ingest folder, another entry is added to the ImagingDicom delta table for the same file. However, the file names differ due to the Unix timestamp prefix. Depending on the date of ingestion, the file might also be placed within a different YYYY\MM\DD folder.
OMOP version and imaging extensions
The current implementation of the gold lakehouse is based on Observational Medical Outcomes Partnership (OMOP) Common Data Model version 5.4. OMOP doesn't yet have a normative extension to support imaging data. Therefore, the capability implements the extension proposed in Development of Medical Imaging Data Standardization for Imaging-Based Observational Research: OMOP Common Data Model Extension. Published on February 5, 2024, this extension is the most recent proposal in the imaging research field. The current release of the DICOM data transformation capability is limited to the Image_Occurrence table in the gold lakehouse.
Structured streaming in Spark
Structured streaming is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine. You can express a streaming computation the same way you express a batch computation on static data. The system ensures end-to-end fault tolerance through checkpoints and write-ahead logs. To learn more about structured streaming, see the Structured Streaming Programming Guide (v3.5.1).
We use the foreachBatch method to process the incremental data. This method lets you apply arbitrary operations and custom write logic to the output of every micro-batch of a streaming query. In the DICOM data transformation capability, structured streaming is used in the following execution steps:
Execution step | Checkpoint folder location | Tracked objects |
---|---|---|
Extract DICOM metadata into the bronze lakehouse | healthcare#.HealthDataManager\DMHCheckpoint\medical_imaging\dicom_metadata_extraction | DCM files in the bronze lakehouse under Files\Process\Imaging\DICOM\YYYY\MM\DD. |
Convert DICOM metadata to the FHIR format | healthcare#.HealthDataManager\DMHCheckpoint\medical_imaging\dicom_to_fhir | Delta table ImagingDicom in the bronze lakehouse. |
Ingest data into the bronze lakehouse ImagingStudy delta table | healthcare#.HealthDataManager\DMHCheckpoint\<bronzelakehouse>\ImagingStudy | FHIR NDJSON files in the bronze lakehouse under Files\Process\Clinical\FHIR NDJSON\YYYY\MM\DD\ImagingStudy. |
Ingest data into the silver lakehouse ImagingStudy delta table | healthcare#.HealthDataManager\DMHCheckpoint\<silverlakehouse>\ImagingStudy | Delta table ImagingStudy in the bronze lakehouse. |
Tip
You can use OneLake file explorer to view the content of the files and folders listed in the table. For more information, see Use OneLake file explorer.
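The following minimal sketch illustrates the foreachBatch pattern in general terms. It isn't the capability's actual code; the table paths and checkpoint location are placeholders, and `spark` is assumed to be the predefined session in a Fabric notebook.

```python
from pyspark.sql import DataFrame

def process_batch(batch_df: DataFrame, batch_id: int) -> None:
    # Arbitrary per-micro-batch logic goes here; for example, merging or
    # appending the batch into a target delta table.
    batch_df.write.format("delta").mode("append").save("Tables/example_target")

(
    spark.readStream.format("delta")
    .load("Tables/example_source")  # placeholder source delta table
    .writeStream.foreachBatch(process_batch)
    .option("checkpointLocation", "Files/checkpoints/example")  # placeholder checkpoint folder
    .start()
)
```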
Group pattern in the bronze lakehouse
The group pattern applies when you ingest new records from the ImagingDicom delta table in the bronze lakehouse to the ImagingStudy delta table in the bronze lakehouse. The DICOM data transformation capability groups all the instance-level records in the ImagingDicom delta table by the study level. It creates one record per DICOM study as an ImagingStudy, and then inserts the record into the ImagingStudy delta table in the bronze lakehouse.
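Conceptually, the grouping resembles the following hypothetical Spark aggregation. Column names other than studyInstanceUid are illustrative, and this isn't the capability's actual code.

```python
from pyspark.sql import functions as F

# Group instance-level records up to the study level; one row per DICOM study.
instances = spark.read.table("ImagingDicom")
studies = instances.groupBy("studyInstanceUid").agg(
    F.count("*").alias("instance_count"),
    F.collect_list("sopInstanceUid").alias("instance_uids"),  # illustrative column name
)
```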
Upsert pattern in the silver lakehouse
The upsert operation compares the FHIR delta tables between the bronze and silver lakehouses based on the {FHIRResource}.id value:
- If a match is identified, the silver record is updated with the new bronze record.
- If there's no match identified, the bronze record is inserted as a new record in the silver lakehouse.
We use this pattern to create resources in the silver lakehouse ImagingStudy table.
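A minimal sketch of this upsert, assuming Delta Lake's merge API and placeholder table paths (this isn't the capability's actual code):

```python
from delta.tables import DeltaTable

silver_table = DeltaTable.forPath(spark, "<silver ImagingStudy path>")  # placeholder
bronze_updates = spark.read.format("delta").load("<bronze ImagingStudy path>")  # placeholder

(
    silver_table.alias("silver")
    .merge(bronze_updates.alias("bronze"), "silver.id = bronze.id")  # match on {FHIRResource}.id
    .whenMatchedUpdateAll()     # match found: update the silver record
    .whenNotMatchedInsertAll()  # no match: insert as a new record
    .execute()
)
```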
Caution
To prevent data loss during the data merge into the silver lakehouse, avoid purging, deleting, or updating records from the ImagingDicom table.
OMOP tracking approach
The healthcare#_msft_omop_silver_gold_transformation notebook uses the OMOP API to monitor changes in the silver lakehouse delta table. It identifies newly modified or added records that require upserting into the gold lakehouse delta tables. This process is called watermarking.
The OMOP API compares the date and time values between {Silver.FHIRDeltatable.modified_date} and {Gold.OMOPDeltatable.SourceModifiedOn} to determine the incremental records that were modified or added since the last notebook execution. However, this OMOP tracking approach doesn't apply to all delta tables in the gold lakehouse. The following tables aren't ingested from the delta table in the silver lakehouse:
- concept
- concept_ancestor
- concept_class
- concept_relationship
- concept_synonym
- fhir_system_to_omop_vocab_mapping
- vocabulary
These gold delta tables are populated using the vocabulary data included in the OMOP sample data deployment. The vocabulary dataset in that folder is managed using structured streaming in Spark.
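For the tables that do use this tracking, the watermark comparison can be pictured as the following hypothetical sketch. The table names are placeholders that follow the patterns described above, and this isn't the OMOP API's actual code.

```python
from pyspark.sql import functions as F

# Latest watermark already recorded in the gold table
# (first-run handling, when no watermark exists yet, is omitted here).
last_watermark = (
    spark.read.table("image_occurrence")  # placeholder gold table name
    .agg(F.max("SourceModifiedOn"))
    .first()[0]
)

# Incremental silver records modified or added since the last execution.
incremental = (
    spark.read.table("ImagingStudy")  # placeholder silver table name
    .where(F.col("modified_date") > F.lit(last_watermark))
)
```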
Upsert pattern in the gold lakehouse
The upsert pattern in the gold lakehouse is different from the silver lakehouse. The OMOP API used by the healthcare#_msft_omop_silver_gold_transformation notebook creates new IDs for each entry in the delta tables of the gold lakehouse. The API creates these IDs when it ingests or converts new records from the silver to gold lakehouse. The OMOP API also maintains internal mappings between the newly created IDs and their corresponding internal IDs in the silver lakehouse delta table.
The API works as follows:
- When converting a record from a silver to a gold delta table for the first time, the API generates a new ID in the OMOP gold lakehouse and maps it to the original ID in the silver lakehouse. It then inserts the record into the gold delta table with the newly generated ID.
- If the same record in the silver lakehouse undergoes some modification and is ingested again into the gold lakehouse, the OMOP API recognizes the existing record in the gold lakehouse (using the mapping information). It then updates the record in the gold lakehouse with the same ID that it generated before.
Mappings between the newly generated IDs (ADRM_ID) in the gold lakehouse and the original IDs (INTERNAL_ID) for each OMOP delta table are stored in OneLake parquet files. You can locate the parquet files at the following file path:
[OneLakePath]\[workspace]\healthcare#.HealthDataManager\DMHCheckpoint\dtt\dtt_state_db\KEY_MAPPING\[OMOPTableName]_ID_MAPPING
You can also query the parquet files in a Spark notebook to view the mapping.
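For example, here's a hypothetical Spark query against one of these mapping files. The resolved path is illustrative, and the column names follow the terms used above; verify both against your own files.

```python
# Read an ID mapping parquet file; replace the path segments with your values.
mapping = spark.read.parquet(
    "Files/DMHCheckpoint/dtt/dtt_state_db/KEY_MAPPING/<OMOPTableName>_ID_MAPPING"  # placeholder path
)
mapping.select("ADRM_ID", "INTERNAL_ID").show()
```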
ImagingMetastore design in the silver lakehouse
A single DICOM file can contain up to 5,000 distinct tags, making it inefficient and resource-intensive to map and create fields for all of them in the silver lakehouse. However, retaining access to the complete set of tags is essential to prevent data loss and maintain flexibility, especially if you require tags beyond the 29 extracted and represented in the data model. To address this problem, the silver lakehouse ImagingMetastore delta table stores all DICOM tags in the metadata_string column. The tags are represented as key-value pairs in a stringified JSON format, enabling efficient querying through the SQL analytics endpoint. This approach aligns with standard practices for managing complex JSON data across all fields in the silver lakehouse.
The transformation from the ImagingDicom table in the bronze lakehouse to the ImagingMetastore table in the silver lakehouse doesn't perform any grouping; resources are represented at the instance level in both tables. However, the {FHIRResource}.id is included in the ImagingMetastore table. This value allows you to query all instance-level artifacts associated with a specific study by referencing its unique ID.
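For instance, here's a hypothetical Spark query that pulls one tag out of the stringified JSON. The JSON key 00080060 (Modality) is an assumption about the key format; inspect your metadata_string values before relying on it.

```python
from pyspark.sql import functions as F

metastore = spark.read.table("ImagingMetastore")
metastore.select(
    F.get_json_object(F.col("metadata_string"), "$.00080060").alias("modality")  # assumed key format
).show()
```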
Integration with DICOM service
The current integration between the DICOM data transformation capability and the Azure Health Data Services DICOM service supports only Create and Update events. You can create new imaging studies, series, and instances, or update existing ones. However, the integration doesn't support Delete events. If you delete a study, series, or instance in the DICOM service, the DICOM data transformation capability doesn't reflect this change. The imaging data remains unchanged and isn't deleted.
Table warnings
Warnings appear for all tables in each lakehouse where one or more columns use complex object-oriented data types to represent data. In the ImagingDicom and ImagingMetastore tables, the metadata_string column uses a JSON structure to map DICOM tags as key-value pairs. This design accommodates the limitation of Fabric SQL endpoints, which don't support complex data types such as structs, arrays, and maps. You can query these columns as strings using the SQL endpoint (T-SQL) or work with their native types (structs, arrays, maps) using Spark.
Performance optimization
The ImagingDicom table in the bronze lakehouse represents data at the DICOM instance level, while the ImagingStudy table in the silver lakehouse represents it at the DICOM study level. The DICOM data transformation capability groups all instance-level records in the ImagingDicom delta table by the study level. It creates one record per DICOM study and inserts it into the ImagingStudy delta table in the silver lakehouse.
During incremental ingestion, new DCM files are appended to the bronze lakehouse and upserted into the silver lakehouse. For a successful upsert, the complete study information from the bronze lakehouse must be retrieved, combining existing and incremental instances. This process can be memory intensive, especially during large-scale ingestion, such as the initial data load.
Z-ordering technique for incremental data capture
To optimize data retrieval and prevent out-of-memory issues when working with millions of rows of data, use the Z-ordering technique. This technique improves query performance by colocating related data in the same set of files. It reduces the amount of data read during queries for faster and more efficient results. During NDJSON file generation, healthcare data solutions rely heavily on the studyInstanceUid column to query and transform the imaging DICOM metadata to NDJSON. Therefore, we recommend Z-ordering the ImagingDicom table to optimize query performance.
Follow these steps to use Z-ordering while ingesting large datasets:
1. In your healthcare data solutions workspace, create a new notebook with the following Z-ordering query and save it.

   ```python
   from delta.tables import DeltaTable

   # Load the delta table
   delta_table = DeltaTable.forPath(spark, "<path to imaging dicom table>")

   # Optimize the delta table with Z-ordering on the study instance UID
   delta_table.optimize().executeZOrderBy("studyInstanceUid")
   ```

2. Open the healthcare#_msft_imaging_with_clinical_foundation_ingestion imaging data pipeline.

3. On the top toolbar, select the Save as option to save a copy of the pipeline.

4. Open the pipeline copy, and select the + symbol between the imaging_dicom_extract_bronze_ingestion and imaging_dicom_fhir_conversion notebooks.

5. In the insert activity menu, scroll to the Transform section and select Notebook.

6. Select the notebook you created in step 1 and link it to the pipeline.

7. Save the pipeline.

You can now apply Z-ordering to optimize DICOM data ingestion.
Control pixel data reading in DCM files
In DICOM, pixel data consumes significant memory and computing resources during tag extraction. By default, the extraction process reads the entire DCM file and excludes the DICOM pixel data tag (7FE0,0010). To further optimize performance when working with large DCM files (over 4 GB), you can avoid reading the pixel data entirely by using the stop_before_pixels parameter in the pydicom library, configured in the admin lakehouse.
The default configuration loads the entire DCM file into memory and removes only the pixel data tag, preserving any trailing tags. When you update the configuration to stop loading before the pixel data tag, any tags that appear after it are also skipped. Since the pixel data tag is typically the last tag in the sequence, this approach suits most use cases. However, if your files contain essential tags after the pixel data tag, retain the default extraction logic.
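To see the difference at the pydicom level, here's a minimal sketch. The file name example.dcm is a placeholder, and this isn't the capability's internal extraction code.

```python
import pydicom

# Default-like behavior: read the whole file; the pixel data tag
# (7FE0,0010) can then be dropped afterward while trailing tags remain.
ds_full = pydicom.dcmread("example.dcm")

# stop_before_pixels=True: parsing stops before the pixel data tag,
# so any tags that appear after it are skipped too.
ds_meta = pydicom.dcmread("example.dcm", stop_before_pixels=True)
```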
Follow these steps to control reading the pixel data:
1. Go to the admin lakehouse.

2. Under Files\system-configurations, copy the contents of the deploymentParametersConfiguration.json file and create a local JSON file.

3. Find the dicom_extract_lib_params parameter definition in the dicom_metadata_extraction notebook section.

4. Update the parameter value for stop_before_pixels to True.

5. Name the modified JSON file deploymentParametersConfiguration.json.

6. Upload the modified JSON file to the original location in the admin lakehouse and overwrite the existing file.
Suppress DCM file tag validation
DCM files can sometimes include nonstandard tags due to inaccurate representation of information in the source system. They can also contain private tags that represent information not covered by the DICOM standard.
The default tag extraction process validates tags against the DICOM standard, resulting in a validation error if any nonstandard tags are detected. If you're aware of any nonstandard or private tags in your data, you can update the suppress_validation_tags parameter to bypass the tag validation step during metadata extraction. This approach prevents validation errors and helps you represent nonstandard tags as part of the metadata in the bronze lakehouse.
Follow these steps to skip the tag validation process:
1. Go to the admin lakehouse.

2. Under Files\system-configurations, copy the contents of the deploymentParametersConfiguration.json file and create a local JSON file.

3. Find the dicom_extract_lib_params parameter definition in the dicom_metadata_extraction notebook section.

4. Update the parameter value for suppress_validation_tags to True.

5. Name the modified JSON file deploymentParametersConfiguration.json.

6. Upload the modified JSON file to the original location in the admin lakehouse and overwrite the existing file.