
How to overcome ADF lookup activity limitation

Saravanan R 0 Reputation points
2025-08-20T17:44:26.0833333+00:00

Scenario

I have a use case where data needs to be read from a source (SQL, Blob, S3, etc.), sent to a REST API for a custom transformation, and then the API response must be written back to a sink (SQL, S3, Blob, etc.).

Current Approach

  • Step 1: Use a Lookup activity to read the source dataset.
  • Step 2: Call the REST API in a Copy activity (source = REST, sink = chosen sink).
  • Issue: Lookup activity has a hard limit of 5,000 rows / 4 MB of output JSON, so this approach does not scale for large datasets.

To work around this, I currently:

  • Split the source into chunks of 5,000 rows.
  • For SQL, I use OFFSET + LIMIT queries to paginate data.
  • Use ForEach to process each chunk and call the API.

Problem

This chunking solution works but introduces serious performance issues:

  • For 1M or 10M records, the pipeline takes too long due to serial/limited parallel processing.
  • Copy activity only returns metadata (rows copied, duration, etc.) — it does not expose the response body from the REST API, so I cannot directly capture and write it back to the sink of my choice.

Questions

  1. How can I overcome the 5k row limit and avoid Lookup while still sending bulk data to the REST API?
  2. Is there any alternative approach in ADF to capture the actual REST response of a Copy activity (not just metadata) and write it to a sink?
  3. Apart from re-engineering the REST API itself to handle batching, what ADF-native solutions exist for this pattern?

Thanks in advance. Any response towards solving this scenario is much appreciated.

Azure Data Factory

An Azure service for ingesting, preparing, and transforming data at scale.


2 answers

  1. Anonymous
    2025-08-21T11:10:21+00:00

    Hi Saravanan R,

    Thanks for posting your query on Microsoft Q&A!

    To answer your questions, please check below.

    Q1: How can I overcome the 5k row limit and avoid Lookup while still sending bulk data to the REST API?

    Use Databricks or Azure Functions to read and batch data outside of Lookup. These services can handle millions of rows efficiently and support full response capture.

    Q2: Is there any alternative approach in ADF to capture the actual REST response of a Copy activity (not just metadata) and write it to a sink?

    Unfortunately, no. ADF’s Copy activity does not expose the response body. To capture and persist the full JSON response, you’ll need to use Databricks or Azure Functions.

    Q3: Apart from re-engineering the REST API itself to handle batching, what ADF-native solutions exist for this pattern?

    While ADF itself doesn’t support response capture natively, it can orchestrate external compute services that do.

    Azure Databricks Notebook Approach (recommended for very large datasets)

    1. Provision an Azure Databricks workspace and cluster sized for your data volume and API throughput needs.
    2. In ADF, create a “Databricks Notebook” activity. Pass into it:
    • Source dataset path (e.g., ADLS/Blob container, or a SQL connection string if reading via Spark).
    • Chunk size or partitioning hints.
    • REST endpoint URL and any authentication tokens.
    3. In your Databricks notebook (Python/Spark):
    • Read the source data as a Spark DataFrame.
    • Repartition the DataFrame based on desired parallelism (e.g., '.repartition(200)').
    • Use mapPartitions or foreachPartition to collect each partition's rows into a batch JSON payload, call the REST API with Python's 'requests' to send the batch, parse the JSON response body, and yield the response records into a new Spark RDD/DataFrame.
    • Write the responses DataFrame directly to your sink (Parquet/CSV to ADLS, SQL via JDBC, or upload to S3).
    4. Monitor the Databricks job via ADF’s Databricks activity output or via the Databricks REST API.
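
    Before these steps run, the notebook needs the values the ADF activity passes in (step 2). A minimal sketch of reading them as notebook widgets is below; the parameter names (rest_url, jdbc_url, api_token, num_partitions) are illustrative, not fixed.

    # Read the base parameters passed from the ADF Databricks Notebook activity.
    # dbutils is available implicitly inside Databricks notebooks.
    dbutils.widgets.text("rest_url", "")
    dbutils.widgets.text("jdbc_url", "")
    dbutils.widgets.text("api_token", "")
    dbutils.widgets.text("num_partitions", "200")

    rest_url = dbutils.widgets.get("rest_url")
    jdbc_url = dbutils.widgets.get("jdbc_url")
    token = dbutils.widgets.get("api_token")
    num_partitions = int(dbutils.widgets.get("num_partitions"))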

    Example code: this is just a sample; change the details to match your environment (container and account names, DataFrame names, JDBC URL, REST endpoint, token, etc.).

    import json
    import requests
    from pyspark.sql import SparkSession

    # Placeholders - replace with your own values (or read them from notebook widgets as shown above)
    jdbc_url = "jdbc:sqlserver://<server>.database.windows.net:1433;databaseName=<db>;user=<user>;password=<password>"
    rest_url = "https://<your-api>/transform"
    token = "<api-bearer-token>"

    # Initialize Spark
    spark = SparkSession.builder.appName("BulkRESTCalls").getOrCreate()

    # Step 1: Read source data (from SQL or ADLS)
    df = spark.read.format("jdbc").option("url", jdbc_url).option("query", "SELECT * FROM source_table").load()

    # Step 2: Repartition for parallelism
    df = df.repartition(200)  # Adjust based on cluster size and API limits

    # Step 3: Define function to call the REST API once per partition
    def call_rest_api(partition):
        import json
        import requests
        rows = list(partition)
        if not rows:
            return
        payload = json.dumps([row.asDict() for row in rows])
        headers = {"Authorization": f"Bearer {token}", "Content-Type": "application/json"}
        response = requests.post(rest_url, data=payload, headers=headers)
        if response.status_code == 200:
            # Assumes the API returns a JSON array with one object per input row
            for item in response.json():
                yield item
        else:
            yield {"error": f"Failed with status {response.status_code}"}

    # Step 4: Apply the REST call per partition and collect the responses
    response_rdd = df.rdd.mapPartitions(call_rest_api)
    response_df = spark.createDataFrame(response_rdd)

    # Step 5: Write the responses to the sink (ADLS, Blob, SQL)
    response_df.write.mode("overwrite").parquet("abfss://******@account.dfs.core.windows.net/output")

    Azure Functions Approach

    For smaller or simpler workloads:

    • Partition source data into blob chunks.
    • Use ADF’s ForEach to trigger multiple Azure Functions in parallel.
    • Each Function calls the REST API and writes the full response to a sink.
    • Consolidate results afterward using a Copy activity.
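
    A minimal sketch of one such Function (Python v1 programming model, HTTP trigger) is below. The endpoint, connection setting, container names, and the blobName request field (REST_URL, STORAGE_CONN_STR, input-chunks, api-responses) are illustrative placeholders to adapt, not fixed names.

    import json
    import os

    import requests
    import azure.functions as func
    from azure.storage.blob import BlobServiceClient

    # Hypothetical settings - replace with your own app settings and containers
    REST_URL = os.environ.get("REST_URL", "https://<your-api>/transform")
    STORAGE_CONN_STR = os.environ["STORAGE_CONN_STR"]
    INPUT_CONTAINER = "input-chunks"
    OUTPUT_CONTAINER = "api-responses"

    def main(req: func.HttpRequest) -> func.HttpResponse:
        # ADF's ForEach passes the chunk blob name in the request body
        blob_name = req.get_json()["blobName"]

        blob_service = BlobServiceClient.from_connection_string(STORAGE_CONN_STR)

        # Read one chunk of source rows (e.g. a JSON array staged by a Copy activity)
        chunk = blob_service.get_blob_client(INPUT_CONTAINER, blob_name).download_blob().readall()

        # Call the REST API with the whole chunk and capture the full response body
        resp = requests.post(REST_URL, data=chunk, headers={"Content-Type": "application/json"})
        resp.raise_for_status()

        # Persist the full response so a later Copy activity can consolidate results
        blob_service.get_blob_client(OUTPUT_CONTAINER, blob_name).upload_blob(resp.content, overwrite=True)

        # Return only a small status payload to ADF, not the response body itself
        return func.HttpResponse(
            json.dumps({"blob": blob_name, "status": resp.status_code}),
            mimetype="application/json",
        )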

    Please let me know if these approaches work for you or not.

    If this information helps, please Upvote and Accept the Answer so that it benefits other community members.

    Thanks,

    Kalyani

    1 person found this answer helpful.

  2. Amira Bedhiafi 41,386 Reputation points MVP Volunteer Moderator
    2025-08-21T11:46:26.67+00:00

    Hello Saravanan !

    Thank you for posting on Microsoft Learn.

    In your case, I think you should avoid using Lookup for data movement; instead, use a Copy activity (or mapping data flows) to land the full source to ADLS/Blob in big partitions.

    You can work with partitioned reads (SQL partitionOptions + parallelCopies) and set the ForEach to parallel (raise the Batch count) to scale.

    You can call the REST API from compute that can return response bodies. If you are comfortable with Azure Functions, you can use them (ideally Durable Functions for fan-out/fan-in): read the staged file or partition, call the API in parallel, write the API response directly to your sink (SQL/Blob/S3), and return only the status to ADF.
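
    A minimal fan-out/fan-in orchestrator sketch in Python Durable Functions is below, assuming a hypothetical activity function named "CallRestApi" that reads one staged partition, calls the API, and writes the response to the sink; the input shape with a "partitions" list is illustrative.

    import azure.durable_functions as df

    def orchestrator_function(context: df.DurableOrchestrationContext):
        # Input passed when ADF starts the orchestration,
        # e.g. {"partitions": ["part-0001.json", "part-0002.json"]}
        partitions = context.get_input()["partitions"]

        # Fan out: one activity invocation per staged partition, run in parallel.
        # "CallRestApi" is a hypothetical activity function that reads the partition,
        # calls the REST API, and writes the full response body to the sink.
        tasks = [context.call_activity("CallRestApi", p) for p in partitions]
        results = yield context.task_all(tasks)

        # Fan in: return only a compact status summary to ADF
        failed = [r for r in results if r.get("status") != 200]
        return {"processed": len(results), "failed": failed}

    main = df.Orchestrator.create(orchestrator_function)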

    Or you can simply use a Databricks Notebook activity / Custom activity on a SHIR to do the same at scale. This avoids the Copy-REST-sink limitation, since Copy won’t give you the response body.

    If you must stay “ADF-only” and capture the response (small payloads only), you can use a Web activity to POST and get the response body (size-limited), then write it:

    • For files: have the Web activity/Function write to Blob, then Copy the Blob into SQL/S3.
    • For JSON: pass @activity('WebCall').output into a Copy with a REST/HTTP source that uses the activity output.
