In Databricks, how to remove duplicates from a column of type array<array<string>>

Sara Watson 20 Reputation points
2024-07-13T06:28:22.3266667+00:00

I'm doing a select collect_set on a field in an array of struct type which results in a column of type array<array<strings>>. I would like to get the distinct strings in each of the nested arrays. How do I do this? The function array_distinct does not distinct the values within the nested arrays.

This is an example of the array<array<strings>>

[["3598f1fc3c1611ef8766a4ae111c4c27","3598f1fc3c1611ef8766a4ae111c4c27"],["4cf7301e3cb811ef9ae1a4ae111c4c27"],["7d46227a3c9f11efb16654b203f6e487","74af78c43ca311efb16654b203f6e487"],["ee8c38a63bf611ef8766a4ae111c4c27","ee8c38a63bf611ef8766a4ae111c4c27","ee8c38a63bf611ef8766a4ae111c4c27","ee8c38a63bf611ef8766a4ae111c4c27"]]

Azure Databricks
Azure Databricks
An Apache Spark-based analytics platform optimized for Azure.
2,171 questions
0 comments No comments
{count} votes

Accepted answer
  1. Q&A Assist
    2024-07-13T06:28:22.36+00:00
    AI generated content. This question contains an answer created with AI from Q&A Assist. Learn more

    To remove duplicates from a column of type array<array<string>> in Databricks, you can use the array_distinct() function in combination with the transform() function to apply it to each nested array. Here's an example query that demonstrates this:

    SELECT transform(nested_array, x -> array_distinct(x)) AS distinct_nested_array
    FROM my_table
    

    In this query, nested_array is the column of type array<array<string>> that you want to remove duplicates from, and distinct_nested_array is the resulting column with the duplicates removed from each nested array.


    References:

    1 person found this answer helpful.
    0 comments No comments

0 additional answers

Sort by: Most helpful

Your answer

Answers can be marked as Accepted Answers by the question author, which helps users to know the answer solved the author's problem.