In Databricks, how to remove duplicates from a column of type array<array<string>>

Question

I'm running a select with collect_set on a field in an array of structs, which results in a column of type array<array<string>>. I would like to get the distinct strings within each of the nested arrays. How do I do this? Calling array_distinct on the column does not deduplicate the values inside the nested arrays.

This is an example of the array<array<string>> value:

[["3598f1fc3c1611ef8766a4ae111c4c27","3598f1fc3c1611ef8766a4ae111c4c27"],["4cf7301e3cb811ef9ae1a4ae111c4c27"],["7d46227a3c9f11efb16654b203f6e487","74af78c43ca311efb16654b203f6e487"],["ee8c38a63bf611ef8766a4ae111c4c27","ee8c38a63bf611ef8766a4ae111c4c27","ee8c38a63bf611ef8766a4ae111c4c27","ee8c38a63bf611ef8766a4ae111c4c27"]]

Accepted Answer

To remove duplicates from a column of type array<array<string>> in Databricks, use the transform() function to apply array_distinct() to each nested array. Here's an example query that demonstrates this:

SELECT transform(nested_array, x -> array_distinct(x)) AS distinct_nested_array
FROM my_table

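In this query, nested_array is the column of type array<array<string>> that you want to deduplicate, and distinct_nested_array is the resulting column with duplicates removed from each nested array. Calling array_distinct directly on nested_array only compares the inner arrays with one another as whole values, which is why it leaves repeated strings inside each inner array untouched.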
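To try this without a real table, the following is a minimal sketch that builds a one-row input from some of the sample values in the question. The CTE name example_rows is a placeholder introduced here for illustration and is not part of the original question or answer.

```sql
-- Self-contained sketch: example_rows is a hypothetical CTE standing in for my_table.
WITH example_rows AS (
  SELECT array(
    array('3598f1fc3c1611ef8766a4ae111c4c27', '3598f1fc3c1611ef8766a4ae111c4c27'),
    array('4cf7301e3cb811ef9ae1a4ae111c4c27'),
    array('7d46227a3c9f11efb16654b203f6e487', '74af78c43ca311efb16654b203f6e487')
  ) AS nested_array
)
SELECT
  -- Compares whole inner arrays with each other, so repeats inside an inner array remain.
  array_distinct(nested_array) AS outer_distinct,
  -- Applies array_distinct to every inner array in turn.
  transform(nested_array, x -> array_distinct(x)) AS distinct_nested_array
FROM example_rows;
```

With this input, outer_distinct should still contain the repeated value in the first inner array, because the three inner arrays all differ from one another, while distinct_nested_array should reduce the first inner array to a single string and leave the other two inner arrays as they are.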
