Double-check the data in your lakehouse to ensure there are no actual duplicates; sometimes the issue comes from discrepancies in the data source itself.

How does caching work in Apache Spark?
Caching in Apache Spark is an optimization technique that stores data in memory to speed up repeated queries. In your code, you attempted to disable caching with spark.conf.set("spark.synapse.vegas.useCache", "false"). However, caching behavior can still be influenced by other settings and by the Spark DataFrame's lineage.
Make sure there are no earlier cache() or persist() calls on the df DataFrame before performing the merge. A cached DataFrame stays in memory until it is explicitly unpersisted, so if df was cached earlier it can still be served from memory despite the configuration change above.
Also try clearing the cache with df.unpersist() before performing the merge, so the DataFrame is not served from cache when the merge executes:
df.unpersist()             # remove df from Spark's cache (no-op if not cached)
merged_df = df.merge(...)  # the merge now reads uncached data