Use DICOM data ingestion in healthcare data solutions (preview)

[This article is prerelease documentation and is subject to change.]

The DICOM data ingestion capability in healthcare data solutions (preview) allows you to ingest, store, and analyze Digital Imaging and Communications in Medicine (DICOM) data from various sources. To learn more about the capability and understand how to deploy and configure it, go to the capability's overview and deployment articles.

DICOM data ingestion is an optional capability with healthcare data solutions in Microsoft Fabric (preview). However, the capability has a direct dependency on the Healthcare data foundations capability. Ensure that you successfully deploy, configure, and execute the Healthcare data foundations pipelines first.

Prerequisites

Before you execute the DICOM data ingestion pipeline, ensure you meet the following requirements:

Data ingestion options

This article provides step-by-step guidance on how to use the DICOM ingestion capability to ingest, transform, and unify the DICOM imaging dataset. The capability supports the following four execution options:

  • Option 1: End-to-end ingestion of DICOM files. The DICOM files, either in the native (DCM) or compressed (ZIP) format, are ingested into the lakehouse. This option is called the Ingest option.

  • Option 2: DICOM data ingestion from Azure Data Lake Storage Gen2. The ingestion process accesses the DICOM files directly from their original Data Lake Storage location. Unlike option 1, there's no need to copy or move the DCM files from their original location. This option is called the Bring Your Own Storage (BYOS) ingestion option.

  • Option 3: End-to-end integration with the DICOM service. The ingestion is facilitated through native integration with the DICOM service in Azure Health Data Services. In this option, the DCM files are first transferred from the Azure Health Data Services DICOM service to Data Lake Storage Gen2. The pipeline then follows the Bring Your Own Storage execution. This option is called the Azure Health Data Services (AHDS) option.

  • Option 4: Ingestion of imaging diagnostic reports. This ingestion method is an optional execution that you can use to ingest the diagnostic reports provided in the sample imaging dataset.

Here's an overview of the different execution options. The following article sections explain each execution step in detail.

A diagram displaying the end-to-end execution steps.

Option 1: End-to-end ingestion of DICOM files

In this option, we ingest the imaging data from DICOM files into the healthcare data solutions (preview) lakehouses. You can use the imaging sample dataset that has both ZIP and native DCM files. The end-to-end execution consists of the following consecutive steps:

  1. Ingest DICOM files into OneLake
  2. Organize DICOM files in OneLake
  3. Extract DICOM metadata into the bronze lakehouse
  4. Convert DICOM metadata to the FHIR (Fast Healthcare Interoperability Resources) format
  5. Ingest data into the ImagingStudy delta table in the bronze lakehouse
  6. Flatten and transform data into the ImagingStudy delta table in the silver lakehouse
  7. Convert and ingest data into the Image_Occurrence table in the gold lakehouse (optional)

Ingest DICOM files into OneLake

The Ingest folder in the bronze lakehouse represents a drop (queue) folder. You can drop the DICOM files inside this folder. The files then move to an organized folder structure within the bronze lakehouse.

Note

The capability doesn't automatically deploy the Ingest\Imaging\DICOM folders in your environment. You can set up these folders using the guidance in step 7 of Deploy DICOM data ingestion.

  1. Navigate to the Ingest\Imaging\DICOM folder in the bronze lakehouse.

  2. Select ... (ellipsis) > Upload > Upload files.

  3. Select and upload the imaging dataset from the DICOM folder in the sample data.

    You can upload either the native DCM files or ZIP files that contain compressed DCM files. Within the ZIP files, the DCM files might be structured into many nested subfolders.

There's no limitation on the number of DCM files or the number, depth, and nesting of subfolders within the ingested ZIP files. For information on the file size limitation, see Ingestion file size.
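
If you prefer to script the upload rather than use the Fabric UI, OneLake also exposes an ADLS Gen2-compatible endpoint. The following minimal sketch uses the azure-storage-file-datalake package to drop files into the Ingest folder; the workspace, lakehouse, and local folder names are placeholders for your own values.

```python
# Minimal sketch: upload DCM/ZIP files to the bronze lakehouse Ingest folder
# through OneLake's ADLS Gen2-compatible endpoint. Workspace, lakehouse, and
# local path names are placeholders, not values from this article.
import os
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

ONELAKE_URL = "https://onelake.dfs.fabric.microsoft.com"
WORKSPACE = "MyHealthcareWorkspace"  # hypothetical workspace name
INGEST_PATH = "MyBronzeLakehouse.Lakehouse/Files/Ingest/Imaging/DICOM"

service = DataLakeServiceClient(ONELAKE_URL, credential=DefaultAzureCredential())
fs = service.get_file_system_client(WORKSPACE)

local_dir = "./sample-data/DICOM"  # hypothetical local folder with sample files
for name in os.listdir(local_dir):
    if name.lower().endswith((".dcm", ".zip")):
        with open(os.path.join(local_dir, name), "rb") as f:
            fs.get_file_client(f"{INGEST_PATH}/{name}").upload_data(f, overwrite=True)
        print(f"Uploaded {name}")
```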

Organize DICOM files in OneLake

After the DCM and ZIP files move to the bronze lakehouse folders, you can execute the healthcare#_msft_imaging_raw_data_movement notebook to begin organizing the files for processing. For more information about this notebook, see DICOM data ingestion notebooks.

The notebook is preconfigured, so you don't need to reconfigure any parameters. It uses the ImagingRawDataMovementService module in the healthcare data solutions (preview) library to move the imaging files to an optimized folder structure for further processing.

To execute the notebook, open it and select the Run all button. The notebook will:

  1. Transfer files from the Ingest folder to a new optimized folder structure Files\Process\Imaging\DICOM\yyyy\mm\dd inside the bronze lakehouse. This scalable, data lake-friendly folder structure follows Best practices for Azure Data Lake Storage directory structure. For source files in ZIP format with multiple DCM files, the notebook extracts and moves each DCM file to the optimized folder structure, regardless of the original folder hierarchy inside the source ZIP files.

  2. Add a Unix timestamp prefix to the file names. The timestamp generates at the millisecond level to ensure uniqueness across file names. This feature is useful for environments with multiple Picture Archiving and Communication System (PACS) and Vendor Neutral Archive (VNA) systems, where file name uniqueness isn't guaranteed.

  3. If a data movement fails, the failed files (with the Unix timestamp prefix) are saved under the Failed folder within the following optimized folder structure: Files\Failed\Imaging\DICOM\yyyy\mm\dd\.
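
The following simplified sketch illustrates the movement logic just described. It's illustrative only; the actual implementation lives in the ImagingRawDataMovementService module, and local filesystem paths stand in for the lakehouse Files paths.

```python
# Illustrative sketch of the organization logic above: unzip, prefix each file
# name with a millisecond Unix timestamp, and land files under yyyy/mm/dd.
import os
import shutil
import time
import zipfile
from datetime import datetime, timezone

def organize(src_path: str, process_root: str, failed_root: str) -> None:
    today = datetime.now(timezone.utc).strftime("%Y/%m/%d")
    dest_dir = os.path.join(process_root, today)  # ...\Process\Imaging\DICOM\yyyy\mm\dd
    os.makedirs(dest_dir, exist_ok=True)
    try:
        if src_path.lower().endswith(".zip"):
            # Extract every DCM file regardless of nesting inside the ZIP.
            with zipfile.ZipFile(src_path) as z:
                for member in z.namelist():
                    if member.lower().endswith(".dcm"):
                        stamp = int(time.time() * 1000)  # millisecond Unix prefix
                        target = os.path.join(dest_dir, f"{stamp}_{os.path.basename(member)}")
                        with z.open(member) as src, open(target, "wb") as dst:
                            shutil.copyfileobj(src, dst)
        else:
            stamp = int(time.time() * 1000)
            shutil.copy(src_path, os.path.join(dest_dir, f"{stamp}_{os.path.basename(src_path)}"))
        os.remove(src_path)  # processed files are removed from the Ingest folder
    except Exception:
        # Failed files land under Failed\...\yyyy\mm\dd\ as described above.
        failed_dir = os.path.join(failed_root, today)
        os.makedirs(failed_dir, exist_ok=True)
        shutil.move(src_path, failed_dir)
```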

After the notebook successfully completes execution on the imaging sample dataset, you should find:

  • All the DCM files removed from the Ingest source folder, uncompressed, renamed, and copied to the Process destination folder under the respective modality (Imaging), format (DICOM), and date folders.
  • A total of 96 DCM files in the Process folder.

Extract DICOM metadata into the bronze lakehouse

This step uses the healthcare#_msft_imaging_dicom_extract_bronze_ingestion notebook to track and process the newly moved files in the Process folder using Structured streaming in Spark. The notebook uses the MetadataExtractionOrchestrator module in the healthcare data solutions (preview) library to perform the following actions:

  1. Extract the DICOM tags (DICOM data elements) available in the Process folder DCM files and ingest them into the dicomimagingmetastore delta table in the bronze lakehouse. For more information about this transformation process, go to Transformation mapping for DICOM metadata to bronze delta table. A minimal tag-extraction sketch follows these steps.

  2. Compress the DCM files to ZIP format for cost and storage efficiency. Review the compression parameters under the kwargs dictionary on the notebook configuration page. You can also provide these parameters when you invoke the service library in the notebook code. The compression step is optional and enabled by default. You don't need to modify any preconfigured notebook parameters for this step.

  3. For data extraction failures, the notebook saves the failed file with the Unix timestamp prefix under the Failed folder in the bronze lakehouse, within the following optimized folder structure: Files\Failed\Imaging\DICOM\Namespace\yyyy\mm\dd\.

    Data extraction might fail for several reasons:

    1. File parsing fails due to unknown or unexpected errors.
    2. The DCM files have invalid content that isn't compliant with the DICOM standard format.
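
As a rough illustration of the tag extraction in step 1, here's a minimal sketch using the pydicom package. The tag selection is an assumption for the example; the actual extraction is handled by the MetadataExtractionOrchestrator module.

```python
# Minimal, illustrative DICOM tag extraction with pydicom.
from pydicom import dcmread
from pydicom.errors import InvalidDicomError

def extract_tags(dcm_path: str) -> dict:
    # Read metadata only; skip pixel data for speed.
    ds = dcmread(dcm_path, stop_before_pixels=True)
    return {
        "StudyInstanceUID": str(ds.get("StudyInstanceUID", "")),
        "SeriesInstanceUID": str(ds.get("SeriesInstanceUID", "")),
        "SOPInstanceUID": str(ds.get("SOPInstanceUID", "")),
        "Modality": str(ds.get("Modality", "")),
    }

try:
    tags = extract_tags("example.dcm")  # hypothetical file path
except InvalidDicomError:
    # Non-compliant content: the kind of file the notebook routes to Failed.
    tags = None
```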

To execute the notebook, open it and select the Run all button. After the execution completes on the imaging dataset, you should find:

  • A total of 96 ZIP files under the yyyy/mm/dd folder structure of the Process folder.

  • A new table called dicomimagingmetastore created in the bronze lakehouse. If you can't find this table, refresh the Fabric UI and OneLake file explorer.

  • A total of 96 rows in the dicomimagingmetastore delta table in the bronze lakehouse. Each record represents an instance object in the DICOM hierarchy and a single DCM file in the Process folder.

    A screenshot displaying data in the meta store after notebook execution completion.

Convert DICOM metadata to the FHIR format

After you finish ingesting the DCM files and populating the dicomimagingmetastore delta table with the DICOM tags, the next step is to convert the DICOM metadata to the FHIR format.

  1. In your healthcare data solutions (preview) environment, open the notebook healthcare#_msft_imaging_dicom_fhir_conversion. For more information, see DICOM data ingestion notebooks.

    This notebook uses Structured streaming in Spark to track and process recently modified delta tables in the bronze lakehouse, including dicomimagingmetastore. The notebook is preconfigured with default parameter values. It uses the MetadataToFhirConvertor module in the healthcare data solutions (preview) library to convert the DICOM metadata in the dicomimagingmetastore bronze delta table. The conversion transforms metadata from the dicomimagingmetastore table into FHIR ImagingStudy resources in the FHIR R4.3 format and saves the output as NDJSON files. For more information about the transformation, go to Transformation mapping for DICOM metadata to bronze delta table.

  2. To execute the notebook, open it and select Run all.

    The notebook converts the DICOM metadata to FHIR ImagingStudy and writes the NDJSON files in another optimized folder structure for FHIR files in the bronze lakehouse. The folder structure is Files\Process\Clinical\FHIR NDJSON\yyyy\mm\dd\ImagingStudy. The notebook generates only one NDJSON file for all the DICOM metadata processed in a single notebook execution. If you can't find the new folders, refresh the Fabric UI and OneLake file explorer.
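
For a sense of the output shape, here's a minimal sketch that writes FHIR R4 ImagingStudy resources as NDJSON, one resource per line. The field selection is a simplified assumption for illustration; the actual conversion is performed by the MetadataToFhirConvertor module.

```python
import json

# Hypothetical grouped metadata; in practice this comes from the
# dicomimagingmetastore delta table.
studies = [
    {
        "StudyInstanceUID": "1.2.840.113619.2.1",
        "series": [
            {
                "SeriesInstanceUID": "1.2.840.113619.2.1.1",
                "Modality": "CT",
                "SOPInstanceUIDs": ["1.2.840.113619.2.1.1.1"],
            }
        ],
    }
]

def to_imaging_study(study: dict) -> dict:
    # Map one grouped DICOM study to a simplified FHIR ImagingStudy resource.
    return {
        "resourceType": "ImagingStudy",
        "status": "available",
        "identifier": [{"system": "urn:dicom:uid",
                        "value": "urn:oid:" + study["StudyInstanceUID"]}],
        "series": [
            {
                "uid": s["SeriesInstanceUID"],
                "modality": {"system": "http://dicom.nema.org/resources/ontology/DCM",
                             "code": s["Modality"]},
                "instance": [{"uid": uid} for uid in s["SOPInstanceUIDs"]],
            }
            for s in study["series"]
        ],
    }

# NDJSON output: one resource per line; this step writes a single file per run.
with open("ImagingStudy.ndjson", "w") as out:
    for study in studies:
        out.write(json.dumps(to_imaging_study(study)) + "\n")
```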

Ingest data into the bronze lakehouse ImagingStudy delta table

After ingesting the DICOM data and converting it to the FHIR format, you can reuse the FHIR data ingestion capability notebooks. This execution runs a simple FHIR data ingestion pipeline, similar to ingesting any other FHIR resource.

  1. Navigate to the healthcare#_msft_raw_bronze_ingestion notebook in your environment and open it to review the configuration. This notebook uses Structured streaming in Spark to track and process newly generated files in the configured folder location.

  2. Ensure you configure the source_path_pattern parameter value as explained in Configure data ingestion from the FHIR ImagingStudy files.

  3. Execute the notebook by selecting Run all.

    The notebook converts the data in the FHIR ImagingStudy NDJSON file to an ImagingStudy delta table in the bronze lakehouse. This delta table maintains the raw state of the data source.

The notebook groups the instance-level data of the same study into one DICOM study record. For more information on this grouping pattern, see Group pattern in the bronze lakehouse.
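
The following PySpark sketch illustrates that grouping pattern. The table and column names are assumptions for the example, not the actual bronze schema; `spark` is the ambient notebook session.

```python
from pyspark.sql import functions as F

# Collapse instance-level rows that share a StudyInstanceUID into one
# study-level record (illustrative column names).
studies_df = (
    spark.table("dicomimagingmetastore")  # bronze metadata table
    .groupBy("StudyInstanceUID")          # one output row per study
    .agg(
        F.collect_set("SeriesInstanceUID").alias("series_uids"),
        F.collect_list("SOPInstanceUID").alias("instance_uids"),
        F.count("*").alias("instance_count"),
    )
)
studies_df.show(truncate=False)  # the sample dataset yields 9 study rows
```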

After the notebook completes execution based on the imaging sample dataset, you can query the ImagingStudy bronze delta table to find nine records. Each record represents a study object in the DICOM hierarchy.

A screenshot displaying the nine imaging study files.

Ingest data into the silver lakehouse ImagingStudy delta table

  1. In your healthcare data solutions (preview) environment, navigate to the healthcare#_msft_bronze_silver_flatten notebook and open it.

  2. Select Run all to execute the notebook. You don't need to edit any of the preconfigured parameters.

    This notebook uses Structured streaming in Spark to track and process the newly added records in the bronze lakehouse. The notebook flattens and transforms data from the ImagingStudy delta table in the bronze lakehouse to the ImagingStudy delta table in the silver lakehouse, in accordance with the FHIR resource (R4.3) format.

    The notebook upserts the ImagingStudy records from the bronze to the silver lakehouse. To learn more about the upsert pattern, go to Upsert pattern in the silver lakehouse. Transformation mapping for bronze to silver delta table explains this transformation process in detail.
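
To illustrate the upsert pattern, here's a minimal Delta Lake merge sketch. The table names and the match key are assumptions for the example, not the notebook's actual configuration.

```python
from delta.tables import DeltaTable

# `spark` is the ambient Fabric notebook session.
silver = DeltaTable.forName(spark, "silver_lakehouse.imagingstudy")  # assumed name
updates = spark.table("bronze_lakehouse.imagingstudy")               # flattened source rows

(
    silver.alias("t")
    .merge(updates.alias("s"), "t.id = s.id")  # match on the FHIR resource id (assumed key)
    .whenMatchedUpdateAll()                    # refresh studies that already exist
    .whenNotMatchedInsertAll()                 # insert studies seen for the first time
    .execute()
)
```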

After the notebook completes execution, you can see nine records in the ImagingStudy delta table in the silver lakehouse.

A screenshot displaying the records in the silver lakehouse.

Convert and ingest data into the gold lakehouse

Important

Follow this optional execution step only if you've deployed and configured the OMOP analytics capability in healthcare data solutions (preview). Otherwise, you can skip this step.

For the final step, follow this guidance to convert and ingest data into the Image_Occurrence delta table in the gold lakehouse:

  1. In your healthcare data solutions (preview) environment, navigate to the healthcare#_msft_silver_omop notebook and open it.

    This notebook uses the OMOP APIs (in the healthcare data solutions (preview) library) to transform resources from the silver lakehouse into OMOP Common Data Model delta tables in the gold lakehouse. By default, you don't need to make any changes to the notebook configuration. If you prefer configuring different source (silver) and target (gold) lakehouses, review the guidance in Global configuration.

  2. Select Run all to execute the notebook.

    The notebook implements the OMOP tracking approach to track and process newly inserted or updated records in the ImagingStudy delta table in the silver lakehouse. It converts data in the FHIR delta tables in the silver lakehouse (including the ImagingStudy table) to the respective OMOP delta tables in the gold lakehouse (including the Image_Occurrence table). For more information on this transformation, go to Transformation mapping for silver to gold delta table.

    Refer to FHIR to OMOP mapping for the mapping details of all the supported OMOP tables.

After the notebook completes execution based on the imaging sample dataset, you can query and find 24 records in the Image_Occurrence delta table in the gold lakehouse. Each record represents a series object in the DICOM hierarchy.

A screenshot showing the files converted and ingested into the gold lakehouse.
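
A quick way to verify this result from a Fabric notebook, assuming the gold lakehouse is attached as the notebook's default lakehouse (table name casing may differ in your environment):

```python
# Count the series-level records produced in the gold lakehouse.
spark.sql("SELECT COUNT(*) AS record_count FROM image_occurrence").show()
# For the sample dataset, expect 24 records, one per DICOM series.
```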

Option 2: DICOM data ingestion from Azure Data Lake Storage Gen2

In this ingestion option, you can directly access the DICOM files in their original Data Lake Storage Gen2 location for processing. You don't need to copy or move the DCM files from their original location, unlike in Option 1: End-to-end ingestion of DICOM files.

To complete the execution, follow these steps:

  1. To access the DICOM files, Create an Azure Data Lake Storage Gen2 shortcut in the bronze lakehouse.

    There's no limitation on where to create the shortcut in the bronze lakehouse. However, we recommend using the following folder structure to maintain consistency with other healthcare data solutions (preview) artifacts: Files\External\Imaging\DICOM\[Namespace]\[ShortcutName]. We use Namespace in the folder structure for the logical separation of shortcuts from different source systems. For example, you can use the Data Lake Storage Gen2 name for the Namespace value. If you prefer to script the shortcut creation, see the sketch after these steps.

  2. Now, all your DICOM files, in their source Data Lake Storage Gen2 location and in any folder hierarchy or structure, are available for the DICOM data ingestion capability to process. This process assumes read-only access to the original Data Lake Storage Gen2 data and ensures that no data movement or write operations are performed on the shortcut folder.

  3. Reconfigure the required parameters in the healthcare#_msft_imaging_dicom_extract_bronze_ingestion notebook, as explained in Configure Azure Data Lake Storage ingestion.

  4. Repeat steps 3 to 7 listed in Option 1: End-to-end ingestion of DICOM files. Ensure you use the reconfigured parameters for these steps.
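
If you want to script the shortcut creation from step 1 instead of using the Fabric UI, OneLake shortcuts can also be created through the Fabric REST API. The following is a rough sketch; all IDs, names, and the connection are placeholders, and you should confirm the request shape against the current Fabric REST API reference.

```python
import requests

workspace_id = "<workspace-guid>"         # placeholder
lakehouse_id = "<bronze-lakehouse-guid>"  # placeholder
token = "<bearer-token>"                  # a token with the Fabric API scope

body = {
    "path": "Files/External/Imaging/DICOM/MyAdlsAccount",  # Namespace convention above
    "name": "MyDicomShortcut",
    "target": {
        "adlsGen2": {
            "location": "https://myadlsaccount.dfs.core.windows.net",
            "subpath": "/dicom-container/studies",
            "connectionId": "<connection-guid>",  # existing Fabric connection
        }
    },
}

resp = requests.post(
    f"https://api.fabric.microsoft.com/v1/workspaces/{workspace_id}"
    f"/items/{lakehouse_id}/shortcuts",
    headers={"Authorization": f"Bearer {token}"},
    json=body,
)
resp.raise_for_status()
print(resp.json())
```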

Option 3: End-to-end integration with the DICOM service

Important

Follow this execution pipeline only if you're using the Azure Health Data Services DICOM service and have deployed the DICOM API. Otherwise, you can skip this option.

  1. Review and complete the deployment procedure in Deploy the DICOM API in Azure Health Data Services.

  2. After deploying the Azure DICOM service, ingest DCM files through the Store (STOW-RS) API.

  3. Depending on your preferred language, upload the DCM file provided in the sample data (case7_000.dcm) using one of the following options:

    If using Python, you can (a minimal STOW-RS sketch also follows this procedure):

    1. Create a .py file.
    2. Follow the instructions and the code snippet in Use DICOMweb standard APIs with Python.
    3. Upload a DCM file from a local machine location to the DICOM server.
    4. Use the Retrieve (WADO-RS) API to verify a successful file upload operation.

    You can also verify successful file upload using the following steps:

    1. On the Azure portal, select the Azure storage account linked to the DICOM service.
    2. Navigate to Containers and follow the path [ContainerName]/AHDS/[AzureHealthDataServicesWorkspaceName]/dicom/[DICOMServiceName].
    3. Verify if you can see the DCM file uploaded here.

    An Azure portal screenshot displaying the uploaded data.

    Note

    The DCM case file name case7_000.dcm might change when uploaded to the server. However, the file content remains unchanged.

  4. After successfully uploading the data to the DICOM service and verifying the file in your Data Lake Storage Gen2 location, proceed to the next step.

  5. Follow the guidance in Option 2: DICOM data ingestion from Azure Data Lake Storage Gen2. These steps explain how to create a shortcut and extract the DICOM metadata.
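
For reference, here's a minimal sketch of the STOW-RS upload from step 3 using plain requests, based on the DICOMweb standard. The service URL and token are placeholders; follow Use DICOMweb standard APIs with Python for the complete, supported walkthrough.

```python
import requests

base_url = "https://<workspace>-<dicom-service>.dicom.azurehealthcareapis.com/v2"
token = "<bearer-token-for-the-dicom-service>"  # placeholder
boundary = "myboundary"

with open("case7_000.dcm", "rb") as f:
    dcm_bytes = f.read()

# Build the multipart/related body the STOW-RS transaction expects.
body = (
    f"--{boundary}\r\nContent-Type: application/dicom\r\n\r\n".encode()
    + dcm_bytes
    + f"\r\n--{boundary}--".encode()
)

resp = requests.post(
    f"{base_url}/studies",
    data=body,
    headers={
        "Authorization": f"Bearer {token}",
        "Accept": "application/dicom+json",
        "Content-Type": f'multipart/related; type="application/dicom"; boundary={boundary}',
    },
)
resp.raise_for_status()  # a 2xx response indicates the instance was stored
```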

Note

For details on integration limitations with the Azure Health Data Services DICOM service, see Integration with DICOM service.

Option 4: Ingestion of imaging diagnostic reports

Important

There's a known issue with the diagnostic reports in the sample data. The diagnostic reports are currently deployed in JSON format instead of NDJSON format. You can't ingest these files as-is; they must be converted to NDJSON first. We'll work on resolving this issue in the next release.
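
Until then, a minimal conversion sketch follows. It assumes each sample JSON file contains either a single FHIR resource or a Bundle with an entry array; verify against the actual sample files before use.

```python
# Hedged sketch: convert sample diagnostic report JSON files to NDJSON.
import json
from pathlib import Path

def json_to_ndjson(src: Path, dest: Path) -> None:
    doc = json.loads(src.read_text())
    # A Bundle carries resources in its `entry` array; otherwise treat the
    # document as a single resource.
    if doc.get("resourceType") == "Bundle":
        resources = [e["resource"] for e in doc.get("entry", [])]
    else:
        resources = [doc]
    with dest.open("w") as out:
        for res in resources:
            out.write(json.dumps(res) + "\n")  # one resource per line

for path in Path("diagnostic-reports").glob("*.json"):  # hypothetical local folder
    json_to_ndjson(path, path.with_suffix(".ndjson"))
```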

The provided imaging sample dataset also includes diagnostic reports data in the FHIR format. This data is ingested as a regular FHIR NDJSON resource. To ingest these reports:

  1. Copy the NDJSON files to the respective Process folders in the bronze lakehouse: Files\Process\Clinical\FHIR NDJSON\Fabric.HDS\yyyy\mm\dd\DiagnosticReport.

    A screenshot showing the diagnostic report files in the bronze lakehouse.

  2. Navigate to the healthcare#_msft_raw_bronze_ingestion notebook in your environment. This notebook uses Structured streaming in Spark to track and process newly generated files in the configured folder location.

  3. Ensure you configure the source_path_pattern as explained in Configure data ingestion from the FHIR DiagnosticReport files.

  4. Select Run all to execute the notebook.

    The notebook converts the data in the FHIR DiagnosticReport NDJSON files to the respective DiagnosticReport delta table in the bronze lakehouse. This delta table maintains the raw state of the data source.

  5. After the notebook completes execution on the sample dataset, you can query and find six records in the DiagnosticReport bronze lakehouse delta table.

  6. Now, repeat the steps explained in Ingest data into the silver lakehouse ImagingStudy delta table to ingest the diagnostic report data into the silver lakehouse. After successful completion, you can find six records in the DiagnosticReport delta table in the silver lakehouse.

    A screenshot of the diagnostic reports in the silver lakehouse.

Note

The preview release of the DICOM data ingestion capability doesn't yet support mapping the silver lakehouse DiagnosticReport delta table to the corresponding OMOP delta table in the gold lakehouse.

See also