How to faster/optimize data deduplicate process in ADX Cluster with 'take_any' and 'leftanti' operators?

Ori Bandel 30 Reputation points
2024-02-28T10:48:50.0666667+00:00

Hi
We are encountering a complex issue with deduplicating our data in our ADX cluster We're ingesting our data into the cluster via Auto Ingest from/with our Azure Blob, the ingestion process works super fast and is capable of scaling rapidly -- the issue is that we MUST do a deduplication process to our data (which is clients' events data) which become our data pipeline bottle-neck We use a user-defined function, on the raw, based on summarize 'take_any(*)' and a join 'leftanti' (below is the KQL syntax)

| summarize hint.strategy=shuffle hint.num_partitions=3 take_any(*) by event_id| join hint.num_partitions=3 kind=leftanti ((aws_etl datascope=hotcache | where ingest_ts >= last_ts - dedup_lookback_window | project event_id)) 

This process is pretty much limited to ~3k events/rows per second We tried to increase/double the nodes we use in the cluster but it did NOTHING to quick the process Any knowledge re how to alleviate the process is bliss Tnx!

Azure Data Explorer
Azure Data Explorer
An Azure data analytics service for real-time analysis on large volumes of data streaming from sources including applications, websites, and internet of things devices.
521 questions
Azure Data Factory
Azure Data Factory
An Azure service for ingesting, preparing, and transforming data at scale.
10,584 questions
{count} votes

1 answer

Sort by: Most helpful
  1. Bhargava-MSFT 30,891 Reputation points Microsoft Employee
    2024-03-04T20:29:10.6233333+00:00

    Hello Ori Bandel

    Sorry for the confusion. I misunderstood it earlier.

    From the document:

    The left anti-join flavor returns all records from the left side that don't match any record from the right side. The anti-join models the "NOT IN" query.

    Aliases: anti, left anti-semi

    The "anti" join and the "leftanti" join are equivalent, and both return all the rows from the left table that do not have a matching row in the right table.

    https://learn.microsoft.com/en-us/azure/data-explorer/kusto/query/join-leftanti

    1 person found this answer helpful.
    0 comments No comments

Your answer

Answers can be marked as Accepted Answers by the question author, which helps users to know the answer solved the author's problem.