API fetching data from Git, loading to Storage Account in Parquet

ThakarPrateekS-5118 0 Reputation points
2024-10-21T18:03:34.96+00:00

I have an API that I am calling via ADF using a Copy activity.

The data I am bringing in covers 28 days, and I want to build up historical data.

In the incoming data there is a column called "day" which holds a date.

I want to reference that column and build the ADF pipeline so it writes incrementally.

What would be the approach?

Azure Data Factory
An Azure service for ingesting, preparing, and transforming data at scale.

1 answer

  1. Chandra Boorla 14,510 Reputation points Microsoft External Staff Moderator
    2024-10-21T23:39:51.1733333+00:00

    Hi @ThakarPrateekS-5118

    Greetings & Welcome to Microsoft Q&A forum! Thanks for posting your query!

    As I understand, you have an API providing 28 days of data and want to build a historical data pipeline in Azure Data Factory, using the 'day' column to incrementally load new or updated data.

    The three main ADF activities needed for this use case are the Lookup activity, the Copy data activity, and the Stored procedure activity.

    Lookup activity: Lookup activity reads and returns the content of a configuration file or table. It also returns the result of executing a query.
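    As an illustrative sketch (assuming a hypothetical watermark control table dbo.watermarktable, shown under "Watermark column" below, and a hypothetical staging table dbo.api_staging_table; adjust names to your setup), two Lookup activities could run queries like these to fetch the old and new watermark values:

    ```sql
    -- Lookup 1: read the watermark stored by the previous run
    SELECT WatermarkValue
    FROM   dbo.watermarktable
    WHERE  TableName = 'api_staging_table';

    -- Lookup 2: read the newest "day" value available in the freshly landed data
    SELECT MAX([day]) AS NewWatermarkValue
    FROM   dbo.api_staging_table;
    ```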

    Copy activity: Copy activity copies data among data stores located on-premises and in the cloud.
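    In the Copy activity source, a dynamic query can then select only the rows whose "day" falls between the two watermark values. This is only a sketch using hypothetical activity names (LookupOldWatermark, LookupNewWatermark) with ADF dynamic-content expressions wrapped around the SQL:

    ```sql
    -- Copy activity source query (entered as dynamic content in ADF)
    SELECT *
    FROM   dbo.api_staging_table
    WHERE  [day] >  '@{activity('LookupOldWatermark').output.firstRow.WatermarkValue}'
      AND  [day] <= '@{activity('LookupNewWatermark').output.firstRow.NewWatermarkValue}'
    ```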

    Stored procedures: A stored procedure is prepared SQL code saved in the database so that it can be reused over and over again. You can also pass parameters to a stored procedure so that it can act based on the parameter value(s) passed to it.
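    For example, a small stored procedure (hypothetical names, shown only as a sketch) can advance the watermark after each successful copy, so the next run starts where the last one finished:

    ```sql
    -- Hypothetical stored procedure that advances the watermark after a successful copy
    CREATE PROCEDURE dbo.usp_write_watermark
        @LastModifiedTime DATETIME,
        @TableName VARCHAR(100)
    AS
    BEGIN
        UPDATE dbo.watermarktable
        SET    WatermarkValue = @LastModifiedTime
        WHERE  TableName = @TableName;
    END
    ```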

    Watermark column: A watermark is a column in each table that indicates when the corresponding row was last created or modified. The watermark column is used to find out or slice the new or updated records for every run.

    Most often, a timestamp (or date) column is chosen as the watermark column; in your case, the "day" column can serve this purpose.
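    A minimal watermark table for this scenario might look like the following (table and column names are illustrative, not a fixed ADF requirement); it is seeded once with a starting date so the first run picks up all 28 days:

    ```sql
    -- Hypothetical control table holding one watermark row per source table
    CREATE TABLE dbo.watermarktable
    (
        TableName      VARCHAR(100),
        WatermarkValue DATETIME
    );

    -- Seed it so the first pipeline run loads everything from the beginning
    INSERT INTO dbo.watermarktable (TableName, WatermarkValue)
    VALUES ('api_staging_table', '2024-01-01');
    ```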

    For repro purposes, I used a SQL database as the source and a storage account as the sink.

    Here is a step-by-step guide to building an incremental data pipeline in Azure Data Factory.

    Sample data and Stored procedure in SQL database:

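    As a rough equivalent of the screenshots, the repro setup could be a small staging table with a date column plus the watermark objects sketched above. The table and rows below are purely illustrative, not the actual data from the original post:

    ```sql
    -- Hypothetical staging table standing in for the data landed from the API
    CREATE TABLE dbo.api_staging_table
    (
        Id    INT,
        [day] DATE,
        Value VARCHAR(100)
    );

    -- A few illustrative rows
    INSERT INTO dbo.api_staging_table (Id, [day], Value)
    VALUES (1, '2024-09-24', 'row 1'),
           (2, '2024-09-25', 'row 2');
    ```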

    Lookup activity configuration: [screenshot]

    Copy data activity: [screenshot]

    Stored procedure activity: [screenshot]

    Pipeline status: [screenshot]

    Output: [screenshot]

    Then I added 2 more rows to the source table: [screenshot]

    After adding the extra rows and running the pipeline again, the output is: [screenshot]

    By following these steps, you can create a historical data pipeline in Azure Data Factory that incrementally loads data based on the "day" column in your incoming data.

    For more details, please refer to the links below:

    https://learn.microsoft.com/en-us/azure/data-factory/tutorial-incremental-copy-portal

    https://www.youtube.com/watch?v=AOClU3s9jXw&t=12s

    I hope this information helps. Please do let us know if you have any further queries.

