Change data in Azure Data Lake

13 minutes

When you select the Enable near real-time data changes option, data is inserted, updated, and deleted in your data lake in near real-time. As data changes in your finance and operations environment, the same data is updated in the data lake within a few minutes. You also get the data changes in a separate Change feed folder.

Change data in a data lake lets you build near-real-time data pipelines that react to data changes in finance and operations apps. The Change feed folder in the data lake contains every data change in finance and operations apps. This folder is automatically created and maintained by the Export to Data Lake feature.

Why do you need change data in a data lake?

Data in a data lake is often used for reporting purposes. Although you can use the table data in the data lake to create reports, you can also create additional copies of the data to improve your reporting. For example, you might have a data mart that is designed to enable your power users. In this data mart, you might have simplified, often aggregated, fact tables and dimension tables.

As table data in the data lake is updated, you must keep the corresponding fact tables and dimension tables in the data lake updated. Otherwise, your reports won't reflect the latest data.

The easiest way to update fact tables and dimension tables is to periodically create a full copy by using tables. However, this approach can be inefficient. If your tables are large (for example, if they have tens of millions or hundreds of millions of rows), the process of updating a fact table by making a full copy might take hours and consume lots of compute resources. Therefore, your users might not have the reports in time (that is, they might have to wait hours to see the latest data in reports). Additionally, because compute resources are consumed every time that data is reprocessed, you might receive a larger bill from the services that you've consumed.

Incremental update of your fact tables and dimension tables provides the answer to both problems (time consumption and compute resource consumption). In an incremental update, you select only the changed records from source tables, and update them in corresponding fact tables and dimension tables.

Incremental update is a standard capability in most data transformation tools, such as Azure Data Factory. However, for the incremental update feature to work, you must identify the records that changed in source tables.

The Change feed folder provides a history of table data changes in the data lake. This history can be used for data pipelines that use incremental update.

For more detailed information about the Change feed folder, see Change data in Azure Data Lake.

Best practices when change feeds are used

This section examines some best practices that you should follow when change feeds are used.

Updating near-real-time data marts

If you require that your data warehouse or data marts be updated in near-real time (in other words, if they must be updated within minutes of a data change in finance and operations apps), you should use change feeds.

However, there are several important concepts that must be understood:

Change records are grouped into files that are around 4 megabytes (MB) or 8 MB in size. Microsoft has optimized file sizes so that they provide optimum query response times when the files are queried by Synapse SQL Serverless. Optimized file sizes (and batched writes) also reduce the Azure charges that you might incur as the data lake is updated.
Change records are only appended. Files in the Change feed folder are never updated. Although each change record contains the LSN number and date/time of a change, you can also use the date/time stamp of the CSV files to identify changes.
When you reactivate a table in finance and operations, the Change feed folder is cleared, and the system starts change feeds from the next available change. This behavior ensures that the changes are consistent with the Tables folder. When tables are reactivated, you should consider triggering a full refresh of the downstream data pipeline.
Your downstream jobs can be orchestrated on a periodic basis (for example, every 10 minutes), or they can be triggered when a new change feed file is added to a folder.
In either case, your downstream data pipelines must have a last processed marker (also known as a watermark). Whenever you can, you should rely on the LSN field in the record as the watermark. However, you can also use the File Date-Time stamp as the watermark. By relying on the LSN, you ensure that you consume the changes in the same sequence in which they were committed in the finance and operations database.

For an example that shows how you can build a downstream pipeline, see the Synapse data ingestion template. You can use it to incrementally ingest data into a SQL-based data warehouse.

Continue

Why do you need change data in a data lake?

Best practices when change feeds are used

Updating near-real-time data marts

Feedback