Exercise - Explore imaging data in the lakehouse

Completed

As the system ingests your DICOM data, it uses the DICOM data to populate the structured data. The first step is to extract the DICOM tags (DICOM Data Elements) from the DCM files and then store them in the ImagingDicom staging table in the bronze lakehouse.

This table is a staging area as the system prepares the imaging data for transformation into the silver lakehouse. You shouldn't use the staging table except to troubleshoot suspected transformation issues that occur. Currently, the DICOM standard has more than 5,000 defined DICOM tags. The extraction promotes 29 of these tags that are the most relevant to support generic query and exploration. Each tag has a column on the ImagingDicom table. For example, tag (0010,0010) is mapped to the [patientname] column.

DICOM data ingestion groups all instance-level records in the ImagingDicom table by the study level. Then, it creates a single record (for each DICOM Study) as ImagingStudy, which it then adds to the ImagingDicom table in the bronze lakehouse.

The following queries explore the ImagingDicom table.

Count by modality

    
    SELECT 
        modality, 
        COUNT(*) AS studyCount
    FROM [healthcare1_msft_bronze].[dbo].[ImagingDicom]
    GROUP BY modality
    ORDER BY studyCount DESC;
    List records for patient by ID
    SELECT   *
    FROM [healthcare1_msft_bronze].[dbo].[ImagingDicom]
    where patientId = 'TCGA-17-Z058'

The ImagingDicom table in the bronze lakehouse needs to be processed to move the data into the silver lakehouse. This process flattens and transforms the data from the ImagingDicom delta table in the bronze lakehouse to the ImagingStudy delta table in the silver lakehouse. This flattening is in accordance with the ImagingStudy Fast Healthcare Interoperability Resources (FHIR) resource (R4.3). During this process, the system transforms the source DICOM metadata that's stored in ImagingDicom into a column in the FHIR ImageStudy table in the silver lakehouse. You can review the mapping details in Transformation mapping for DICOM metadata to bronze delta table.

You can run the following queries on the silver lakehouse data.

Number of imaging procedures by modality

The following query retrieves the number of imaging procedures that are grouped by modality, showing the count of procedures for each type of imaging modality.


    SELECT 
        modality_string AS modality_name, 
        COUNT(id) AS num_procedures
    FROM [healthcare1_msft_silver].[dbo].[ImagingStudy]
    GROUP BY modality_string;

List imaging procedures for patient

The following query retrieves the imaging occurrences for a specific patient, including the patient's identifier value that's extracted from the JSON in the subject_string.


    SELECT 
        id AS image_occurrence_id, 
        started AS image_occurrence_date, 
        modality_string AS modality_name, 
        JSON_VALUE(subject_string, '$.identifier.value') AS patientIdValue
    FROM [healthcare1_msft_silver].[dbo].[ImagingStudy]
    WHERE JSON_VALUE(subject_string, '$.identifier.value') = 'TCGA-17-Z058';

Summarize imaging studies by status by using Spark

The following query aggregates studies based on their status (such as completed or in-progress) to get an overview of the current study states.


    result_df = imaging_study_df.groupBy("status") \
        .agg(F.count("id").alias("num_studies")) \
        .orderBy(F.desc("num_studies"))
    result_df.show()