Exercise - Explore imaging data in the lakehouse
As the system ingests your DICOM data, it uses the DICOM data to populate the structured data. The first step is to extract the DICOM tags (DICOM Data Elements) from the DCM files and then store them in the ImagingDicom staging table in the bronze lakehouse.
This table is a staging area as the system prepares the imaging data for transformation into the silver lakehouse. You shouldn't use the staging table except to troubleshoot suspected transformation issues that occur. Currently, the DICOM standard has more than 5,000 defined DICOM tags. The extraction promotes 29 of these tags that are the most relevant to support generic query and exploration. Each tag has a column on the ImagingDicom table. For example, tag (0010,0010) is mapped to the [patientname] column.
DICOM data ingestion groups all instance-level records in the ImagingDicom table by the study level. Then, it creates a single record (for each DICOM Study) as ImagingStudy, which it then adds to the ImagingDicom table in the bronze lakehouse.
The following queries explore the ImagingDicom table.
Count by modality
SELECT
modality,
COUNT(*) AS studyCount
FROM [healthcare1_msft_bronze].[dbo].[ImagingDicom]
GROUP BY modality
ORDER BY studyCount DESC;
List records for patient by ID
SELECT *
FROM [healthcare1_msft_bronze].[dbo].[ImagingDicom]
where patientId = 'TCGA-17-Z058'
The ImagingDicom table in the bronze lakehouse needs to be processed to move the data into the silver lakehouse. This process flattens and transforms the data from the ImagingDicom delta table in the bronze lakehouse to the ImagingStudy delta table in the silver lakehouse. This flattening is in accordance with the ImagingStudy Fast Healthcare Interoperability Resources (FHIR) resource (R4.3). During this process, the system transforms the source DICOM metadata that's stored in ImagingDicom into a column in the FHIR ImageStudy table in the silver lakehouse. You can review the mapping details in Transformation mapping for DICOM metadata to bronze delta table.
You can run the following queries on the silver lakehouse data.
Number of imaging procedures by modality
The following query retrieves the number of imaging procedures that are grouped by modality, showing the count of procedures for each type of imaging modality.
SELECT
modality_string AS modality_name,
COUNT(id) AS num_procedures
FROM [healthcare1_msft_silver].[dbo].[ImagingStudy]
GROUP BY modality_string;
List imaging procedures for patient
The following query retrieves the imaging occurrences for a specific patient, including the patient's identifier value that's extracted from the JSON in the subject_string.
SELECT
id AS image_occurrence_id,
started AS image_occurrence_date,
modality_string AS modality_name,
JSON_VALUE(subject_string, '$.identifier.value') AS patientIdValue
FROM [healthcare1_msft_silver].[dbo].[ImagingStudy]
WHERE JSON_VALUE(subject_string, '$.identifier.value') = 'TCGA-17-Z058';
Summarize imaging studies by status by using Spark
The following query aggregates studies based on their status (such as completed or in-progress) to get an overview of the current study states.
result_df = imaging_study_df.groupBy("status") \
.agg(F.count("id").alias("num_studies")) \
.orderBy(F.desc("num_studies"))
result_df.show()