Trouble with GET request to Copy Data from REST API to Data Lake

Question

Hello!

Some context: my pipeline makes a GET request to a REST API (auth type: OAuth2 Client Credential) to import data into the Data Lake (ADLS Gen2) in Parquet format. Later, a stored procedure creates a view that includes every file in a predefined directory.

I would like to request data from the API on an hourly basis (or maybe every 30 minutes) to get the data for the previous hour. The problem is that each hourly response contains almost 36 million records.

The response body contains no reference to the page number or the total number of pages. There is only data.

On the other hand, the response headers include "first-page" and "next-page" (the latter appears only if there are further pages in the response, and it also gives no indication of the total number of pages).

I was wondering if there are any useful suggestions to make my Copy Data activity work differently.
Right now, and because of what I mentioned above, the pagination rule is set to RFC5988.
I would like my requested data to be partitioned in some way.

Also, I was wondering if there is another way to approach this issue.

Thanks!

Mateo

Answer

@Mateo Estrada Bazan Hello and welcome to Microsoft Q&A.

As I understand it, you want to partition the data coming from a third-party REST API and write it to the Data Lake.

At first it sounded like you just want to break the data up into smaller files.

However, a meaningful partition would split on some aspect of the data. This assumes you are able to pass a query or filter to the REST API, for example "get family names where family size > 3".

You could then write all the expected filter values to an array variable and pass it to a ForEach activity. Inside the ForEach activity, your Copy activity uses each value to populate the filter / query sent to the REST API and to build a unique file name in the Data Lake.
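To illustrate the ForEach idea outside of Data Factory, here is a minimal Python sketch that turns a list of filter values into one request URL and one unique file name per value. Everything specific in it is an assumption for illustration: the function name build_partitions, the base URL, the "category" query parameter, and the file naming pattern are hypothetical and would need to match whatever your API and folder layout actually support.

```python
from datetime import datetime
from urllib.parse import urlencode


def build_partitions(base_url, filter_values, window_start):
    """Return one request URL and one sink file name per filter value."""
    # Hour stamp so files from different hourly runs do not collide.
    stamp = window_start.strftime("%Y%m%dT%H")
    partitions = []
    for value in filter_values:
        # "category" is a hypothetical query parameter, not a real API field.
        query = urlencode({"category": value})
        partitions.append(
            {
                "url": f"{base_url}?{query}",
                "file_name": f"{stamp}_{value}.parquet",
            }
        )
    return partitions


# Hypothetical base URL and filter values.
parts = build_partitions(
    "https://api.example.com/records", ["A", "B"], datetime(2024, 1, 1, 13)
)
for part in parts:
    print(part["url"], "->", part["file_name"])
```

In the pipeline, the list of filter values plays the role of the array variable, and each dictionary corresponds to one iteration of the ForEach activity.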

If you can't specify a query / filter and just want to break the data up into equal-sized files, there is a way to do that. Have your sink dataset point to a folder and leave the file name blank.

Then in the Copy Activity Sink tab, you can specify Max rows per file.
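To make the effect of a row limit per file concrete, here is a rough Python sketch of the same splitting logic. It is not how the Copy activity is implemented; the function name split_into_files and the part_00000.parquet naming pattern are assumptions for illustration only.

```python
from itertools import islice


def split_into_files(rows, max_rows_per_file):
    """Yield (file_name, chunk) pairs with at most max_rows_per_file rows each."""
    if max_rows_per_file <= 0:
        raise ValueError("max_rows_per_file must be positive")
    iterator = iter(rows)
    index = 0
    while True:
        # Take the next slice of rows without loading the whole input at once.
        chunk = list(islice(iterator, max_rows_per_file))
        if not chunk:
            break
        # Hypothetical naming pattern; real output names may differ.
        yield f"part_{index:05d}.parquet", chunk
        index += 1


# Small demonstration: 10 rows with a limit of 4 rows per file.
files = list(split_into_files(range(10), 4))
for name, chunk in files:
    print(name, len(chunk))
```

With the volume mentioned in the question, a limit of 1,000,000 rows per file would turn roughly 36 million rows per hour into about 36 files per hourly run.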

