Azure Synapse Analytics is an Azure analytics service that brings together data integration, enterprise data warehousing, and big data analytics. It was previously known as Azure SQL Data Warehouse.
When working with Azure Synapse Link for Dataverse, data from the Dataverse audit logs is continuously streamed into a Lake Database in Azure Synapse Analytics. That data is stored in Azure Data Lake Storage (ADLS) as Parquet files (partitioned by year), alongside temporary CSV files.
Since you’re using Synapse Link for Dataverse, the Audit and Audit_Partitioned tables in your Lake Database are populated from the Parquet files stored in ADLS. It’s important to note that Parquet files are append-only and immutable, so you cannot simply run a DELETE statement on the Lake Database. Instead, you need to delete the underlying Parquet files to remove old audit entries.
Why You Can’t Delete Directly from the Lake Database
The Audit and Audit_Partitioned tables in Synapse Link for Dataverse are just external references to the Parquet files in ADLS. You cannot use SQL DELETE statements on the Lake Database because the Parquet files are immutable; SQL queries only operate on metadata and do not delete the actual underlying files.
Recommended Ways to Delete Old Audit Entries
Since you cannot delete records directly, here are some effective methods to manage and delete old audit data:
Lifecycle Management Policy (No-Code Approach)
Best For - Fully automated, no-code solution to delete old audit files.
How it Works - Azure Storage Lifecycle Management policies can automatically delete Parquet files after a certain retention period (e.g., six months).
Pros - Completely automated, requires no manual intervention or complex setups.
Cons - Lifecycle management operates based on file modification dates, not partition values, and deletes entire files, not specific rows.
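As a rough sketch, a lifecycle rule that removes audit Parquet files roughly six months after they were last modified might look like the following. It is built here as a Python dict mirroring the JSON policy document you would apply to the storage account (via the portal, az CLI, or the management SDK); the "dataverse-container/audit" prefix is a placeholder you would replace with the container and folder your Synapse Link profile actually writes to.

```python
import json

RETENTION_DAYS = 180  # ~six months; adjust to your retention requirement

# Lifecycle policy document for the storage account. The prefix below is
# a placeholder - substitute the container/folder that Synapse Link for
# Dataverse writes the Audit table into.
lifecycle_policy = {
    "rules": [
        {
            "enabled": True,
            "name": "delete-old-audit-parquet",
            "type": "Lifecycle",
            "definition": {
                "filters": {
                    "blobTypes": ["blockBlob"],
                    "prefixMatch": ["dataverse-container/audit"],
                },
                "actions": {
                    "baseBlob": {
                        "delete": {
                            "daysAfterModificationGreaterThan": RETENTION_DAYS
                        }
                    }
                },
            },
        }
    ]
}

print(json.dumps(lifecycle_policy, indent=2))
```

Note that the rule deletes whole blobs once they pass the modification-age threshold, which matches the file-level (not row-level) behavior described above.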
SQL Drop Partition (SQL-Based)
Best For - If your Audit_Partitioned table is partitioned by a field like year (e.g., year=2023).
How it Works - You can drop partitions with an ALTER TABLE ... DROP PARTITION statement. However, this only removes the partition metadata from the Lake Database; it does not delete the underlying Parquet files in ADLS.
Pros - Simple SQL operation if partitioned by time. It's a fast metadata-only operation.
Cons - You must manually manage the files in ADLS. Dropping a partition doesn't delete the Parquet files themselves.
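Because dropping a partition leaves the files behind, you still need a file-level cleanup pass afterwards. A minimal sketch, assuming the year=YYYY folder layout described above, of deciding which partition folders fall outside your retention window (the actual deletion would then go through your storage tool of choice):

```python
import re

def partitions_to_delete(folder_names, keep_from_year):
    """Return the year=YYYY partition folders older than keep_from_year.

    folder_names: iterable of partition folder names, e.g. ["year=2022", ...]
    keep_from_year: the first year to keep; earlier partitions are returned.
    """
    pattern = re.compile(r"^year=(\d{4})$")
    to_delete = []
    for name in folder_names:
        match = pattern.match(name)
        if match and int(match.group(1)) < keep_from_year:
            to_delete.append(name)
    return sorted(to_delete)

# Example: keep 2024 onwards, flag everything earlier for deletion.
print(partitions_to_delete(["year=2022", "year=2023", "year=2024"], 2024))
# -> ['year=2022', 'year=2023']
```

Keeping the selection logic separate from the deletion itself makes it easy to dry-run the list before removing anything from ADLS.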
Azure Data Factory (ADF) Pipeline (Custom Approach)
Best For - More controlled, scheduled deletion of old audit data.
How it Works - You can create an ADF pipeline to identify and delete Parquet files older than a certain threshold. This can be scheduled to run periodically.
Pros - Gives you more control over which files to delete and when.
Cons - Requires ADF pipeline setup and can incur costs based on the number of pipeline runs.
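The filtering logic such a pipeline applies (list the files with a Get Metadata activity, keep only those whose last-modified timestamp is past the threshold, then hand them to a Delete activity) can be sketched in plain Python. The file names and the 180-day threshold here are illustrative only:

```python
from datetime import datetime, timedelta, timezone

def files_older_than(files, retention_days, now=None):
    """Select files whose last-modified time is past the retention window.

    files: iterable of (name, last_modified_datetime) pairs, as you would
    get from listing the ADLS folder. Returns the names to delete.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=retention_days)
    return [name for name, modified in files if modified < cutoff]

# Illustrative listing with fixed timestamps so the result is deterministic.
now = datetime(2024, 7, 1, tzinfo=timezone.utc)
listing = [
    ("audit-2023-part-0001.parquet", datetime(2023, 9, 1, tzinfo=timezone.utc)),
    ("audit-2024-part-0001.parquet", datetime(2024, 6, 1, tzinfo=timezone.utc)),
]
print(files_older_than(listing, retention_days=180, now=now))
# -> ['audit-2023-part-0001.parquet']
```

In ADF itself this comparison would live in a Filter activity expression over the Get Metadata output, but the decision rule is the same.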
Conclusion and Recommendations
Best Choice for Simplicity - Use Lifecycle Management if you just want to automatically delete files older than a certain period with minimal configuration.
Best for Partitioned Tables - Use SQL DROP PARTITION if your audit table is partitioned, and you only need to clean up metadata (but you’ll still need to manually remove the Parquet files from ADLS).
Best for Granular Control - Use ADF Pipelines if you need more control over the deletion process, such as selective removal of files or additional logic for deletion.
This approach provides flexibility in managing your audit data based on your specific requirements.
I hope this information helps. Please do let us know if you have any further queries.
Kindly consider upvoting the comment if the information provided is helpful. This can assist other community members in resolving similar issues.
Thank you.