Manage location data in SDOH public datasets

Public social determinants of health (SDOH) datasets provide social determinant information at specific geographic levels, such as county or state, making location a key element. It contextualizes health outcomes within a particular area. However, without industry standards for collecting and storing SDOH data, location representation often varies across datasets and modalities. This inconsistency makes it hard to unify social determinant information for building use cases.

Common challenges in unifying location data include:

  • Datasets might lack a unique location identifier or include multiple identifiers.
  • Different datasets might follow distinct hierarchies, often without a common link (such as postal regions, statistical areas, or school districts).
  • Some scenarios might require location elements not present in the datasets.
  • While location serves as a common link across modalities, core healthcare modalities use FHIR location, which differs from SDOH location.

The SDOH datasets - Transformations capability offers a scalable and efficient way to address these challenges, making it easier to work with SDOH location information. This article explains three common scenarios related to location data in SDOH and provides step-by-step guidance to work with them using sample queries.

Use SDOH location information from the silver lakehouse

After you prepare the location information in the raw dataset in the designated format and run the ingestion pipeline, the information is stored in specific columns within the SocialDeterminant table in the silver lakehouse.

  • To work with the primary location element where the social determinant value is calculated, use the LocationType and LocationValue columns.
  • To retrieve all associated location information, including parent elements like the state, use the LocationJson column. This column contains location details in JSON key-value pairs.

For example:

  • To build a report using the total population of a county, filter using LocationType = 'COUNTYFIPS' or LocationType = 'COUNTYNAME'.
  • To aggregate the total population at the state level, extract the STATE element from the LocationJson column.

To represent the JSON information in a tabular format, follow these steps and the corresponding sample queries. The output includes all location elements in a structured view.

Step 1: Get all the location data from the SocialDeterminant table

from pyspark.sql.functions import from_json, col
from pyspark.sql.types import MapType, StringType

metadataId = spark.sql(
    "SELECT SocialDeterminantDatasetMetadataId FROM healthcare#_msft_silver.SocialDeterminantDatasetMetadata WHERE DatasetName = '<Add the Dataset Name>'"
).first()[0]

df = spark.sql(f"SELECT * FROM healthcare#_msft_silver.SocialDeterminant WHERE DatasetMetadataId = {metadataId}")

Step 2: Map the JSON string to the correct JSON structure

locationJsonDf = df.withColumn("LocationJson", from_json(df.LocationJson, MapType(StringType(), StringType())))

Step 3: Select the required location types from the location JSON and create a table

newdf = locationJsonDf.select(
    col('LocationJson.<Add the required location type1>').alias('<Add the preferred alias1>'),
    col('LocationJson.<Add the required location type2>').alias('<Add the preferred alias2>'),
    col('LocationJson.<Add the required location type3>').alias('<Add the preferred alias3>'))

Unify SDOH data with FHIR-based modalities

FHIR-based modalities such as clinical, claims, and imaging store patient-level information, while SDOH public datasets store geography-level information. Although inherently disjoint, unifying them is essential for building use cases such as care management analytics. You can achieve this unification by linking the location element in the patient's address with the corresponding location used to calculate the social determinant value.

In the United States:

  • ZIP code serves as the unique identifier for a patient's address, equivalent to the postal code in a FHIR address.
  • County FIPS code is commonly used as the unique identifier in SDOH datasets that store social determinant values at the county level.

To simplify cross-modality data unification, the SDOH datasets - Transformations capability offers:

  • ZIP to County FIPS mapping dataset: Available under ReferenceData/SDOH/LocationDatasets, this dataset is built using information from the United States Census Bureau and the United States Department of Housing and Urban Development.

  • Predefined table schema: After deploying the capability, use this ready-to-use table schema to load the ZIP code to County FIPS code data.

    A screenshot displaying the table schema.

Run the following script to load the ZIP to FIPS reference data into the ZipToFipsMapping table. You can also use other methods such as adding a copy activity to the SDOH pipeline.

import pandas as pd

excel_file_path = "/lakehouse/default/Files/ReferenceData/SDOH/LocationDatasets/ZIPToFIPSMapping.xlsx"

df_excel = pd.read_excel(excel_file_path)
# Cast to string and pad with leading zeros, which numeric Excel cells drop.
df_excel['zip'] = df_excel['zip'].astype(str).str.zfill(5)
df_excel['countyFips'] = df_excel['countyFips'].astype(str).str.zfill(5)
df_spark = spark.createDataFrame(df_excel)

# Write Spark DataFrame to the delta table

delta_table_path = "abfss://<workspace_id>@onelake.dfs.fabric.microsoft.com/<silver_lakehouse_id>/Tables/ZipToFipsMapping"

df_spark.write.format("delta").mode("overwrite").save(delta_table_path)

# Verify by reading the delta table

df_read = spark.read.format("delta").load(delta_table_path)
df_read.show()

Note

Attach the bronze lakehouse to your notebook before running the query.

Ingest custom location cohorts and unify them with SDOH data

In some cases, you might want to analyze data at a location level not present in the SDOH datasets. For example, the dataset might represent the total population at the county level with corresponding state information, but you want to surface the total population at the hospital zone level to analyze the addressable patient base. Hospital zone isn't a standard location element but a custom location element that might be specific to your organization.

To achieve this outcome, ingest a dataset mapping the custom location element to the available location elements. Follow these steps:

  1. Prepare your dataset containing the custom location cohort. It must share at least one column with the location fields stored for the relevant social determinants.
  2. Upload this dataset to ReferenceData/SDOH/LocationDatasets, then follow these step-by-step queries.

Step 1: Get location data from the SocialDeterminant table

from pyspark.sql.functions import from_json, col
from pyspark.sql.types import MapType, StringType

metadataId = spark.sql(
    "SELECT SocialDeterminantDatasetMetadataId FROM healthcare#_msft_silver.SocialDeterminantDatasetMetadata WHERE DatasetName = '<Add the Dataset Name>'"
).first()[0]
 
df = spark.sql(f"SELECT * FROM healthcare#_msft_silver.SocialDeterminant WHERE DatasetMetadataId = {metadataId}")

Step 2: Map the JSON string to the correct JSON structure

locationJsonDf = df.withColumn("LocationJson", from_json(df.LocationJson, MapType(StringType(), StringType())))

Step 3: Select the required location types from the location JSON

requiredLocationsdf = locationJsonDf.select(
    col('LocationJson.<Add the required location type1>').alias('<Add the preferred alias1>'),
    col('LocationJson.<Add the required location type2>').alias('<Add the preferred alias2>'),
    col('LocationJson.<Add the required location type3>').alias('<Add the preferred alias3>'))

Step 4: Select the required location types from the custom location table

customLocation = spark.sql("SELECT * FROM healthcare#_msft_silver.<Add the table name containing the custom location information>")

Step 5: Perform an inner join as required

joinedDf = requiredLocationsdf.join(customLocation,
    requiredLocationsdf.<Common location column name> == customLocation.<Common location column name>,
    "inner").drop(requiredLocationsdf.<Common location column name>)