Hi Saravanan R,
Thanks for posting your query on Microsoft Q&A!
Please find answers to your questions below.
Q1: How can I overcome the 5k row limit and avoid Lookup while still sending bulk data to the REST API?
Use Azure Databricks or Azure Functions to read and batch the data outside of the Lookup activity. Both services can process millions of rows efficiently and capture the full API response.
Q2: Is there any alternative approach in ADF to capture the actual REST response of a Copy activity (not just metadata) and write it to a sink?
Unfortunately, no. ADF’s Copy activity does not expose the response body. To capture and persist the full JSON response, you’ll need to use Databricks or Azure Functions.
Q3: Apart from re-engineering the REST API itself to handle batching, what ADF-native solutions exist for this pattern?
While ADF itself doesn’t support response capture natively, it can orchestrate external compute services that do.
Azure Databricks Notebook Approach (recommended for very large datasets)
- Provision an Azure Databricks workspace and cluster sized for your data volume and API throughput needs.
- In ADF, create a "Databricks Notebook" activity and pass into it (the notebook reads these as widget parameters; see the sketch after the example code):
  - The source dataset path (e.g., an ADLS/Blob container, or a SQL connection string if reading via Spark).
  - A chunk size or partitioning hints.
  - The REST endpoint URL and any authentication tokens.
- In your Databricks notebook (Python/Spark):
  - Read the source data as a Spark DataFrame.
  - Repartition the DataFrame for the desired parallelism (e.g., '.repartition(200)').
  - Use 'mapPartitions' or 'foreachPartition' to collect each partition's rows into a batched JSON payload, call the REST API with Python's 'requests' library to send the batch, parse the JSON response body, and yield the response records into a new Spark RDD/DataFrame.
  - Write the responses DataFrame directly to your sink (Parquet/CSV to ADLS, SQL via JDBC, or upload to S3).
- Monitor the Databricks job via the ADF Databricks activity output or via the Databricks Jobs REST API (see the monitoring sketch after the example code).
Example code (a sample only; adjust details such as container and account names, JDBC settings, and DataFrame names to your environment):
import json
import requests
from pyspark.sql import Row, SparkSession

# Initialize Spark
spark = SparkSession.builder.appName("BulkRESTCalls").getOrCreate()

# Placeholder connection details -- replace with your own values
jdbc_url = "jdbc:sqlserver://<server>.database.windows.net:1433;database=<db>"
rest_url = "https://<your-api-endpoint>/bulk"
token = "<bearer-token>"

# Step 1: Read source data (JDBC shown; use spark.read.parquet/csv for ADLS)
df = (spark.read.format("jdbc")
      .option("url", jdbc_url)
      .option("query", "SELECT * FROM source_table")
      .load())

# Step 2: Repartition for parallelism
df = df.repartition(200)  # Adjust based on cluster size and API rate limits

# Step 3: Define a function that calls the REST API once per partition
def call_rest_api(partition):
    rows = list(partition)
    if not rows:
        return
    payload = json.dumps([row.asDict() for row in rows])
    headers = {"Authorization": f"Bearer {token}", "Content-Type": "application/json"}
    response = requests.post(rest_url, data=payload, headers=headers)
    if response.status_code == 200:
        # Assumes the API returns a JSON array of records with a consistent schema
        for item in response.json():
            yield item
    else:
        # Error rows have a different shape; normalize the schema for production use
        yield {"error": f"Failed with status {response.status_code}"}

# Step 4: Apply the REST call per partition and collect the responses
response_rdd = df.rdd.mapPartitions(call_rest_api).map(lambda d: Row(**d))
response_df = spark.createDataFrame(response_rdd)

# Step 5: Write the responses to your sink (ADLS, Blob, SQL)
response_df.write.mode("overwrite").parquet("abfss://******@account.dfs.core.windows.net/output")
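A note on parameter passing: below is a minimal sketch of how the notebook can read the values passed from the ADF Databricks Notebook activity's baseParameters. The widget names ("source_path", "rest_url", "chunk_size") are assumptions; they must match whatever you configure in ADF.
# Read parameters passed from ADF (names are illustrative; match your baseParameters)
dbutils.widgets.text("source_path", "")
dbutils.widgets.text("rest_url", "")
dbutils.widgets.text("chunk_size", "200")

source_path = dbutils.widgets.get("source_path")
rest_url = dbutils.widgets.get("rest_url")
num_partitions = int(dbutils.widgets.get("chunk_size"))

# Example: read from ADLS instead of JDBC and repartition using the passed hint
df = spark.read.parquet(source_path).repartition(num_partitions)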
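For monitoring outside ADF, here is a minimal sketch that polls the run status through the Databricks Jobs REST API; the workspace URL, token, and run ID are placeholders you would supply (the run ID is available in the ADF Databricks activity output).
import requests

workspace_url = "https://<databricks-instance>.azuredatabricks.net"  # placeholder
token = "<personal-access-token>"  # placeholder
run_id = 12345  # placeholder, e.g., taken from the ADF activity output

# Poll the run status via the Jobs API
resp = requests.get(
    f"{workspace_url}/api/2.1/jobs/runs/get",
    headers={"Authorization": f"Bearer {token}"},
    params={"run_id": run_id},
)
resp.raise_for_status()
state = resp.json()["state"]
print(state.get("life_cycle_state"), state.get("result_state"))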
Azure Functions Approach
For smaller or simpler workloads:
- Partition the source data into blob chunks.
- Use ADF's ForEach activity to trigger multiple Azure Functions in parallel.
- Each Function calls the REST API and writes the full response to a sink (see the sketch below this list).
- Consolidate the results afterward with a Copy activity.
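As an illustration, here is a minimal sketch of an HTTP-triggered Python Function that receives one chunk, calls the REST API, and writes the full response to Blob Storage. The app settings (REST_URL, REST_TOKEN, STORAGE_CONNECTION_STRING), the container name, and the chunkId query parameter are assumptions to adapt to your setup.
import json
import os

import azure.functions as func
import requests
from azure.storage.blob import BlobClient

def main(req: func.HttpRequest) -> func.HttpResponse:
    # The request body is assumed to be a JSON array of records (one chunk)
    records = req.get_json()

    # Call the downstream REST API with the batch
    resp = requests.post(
        os.environ["REST_URL"],
        json=records,
        headers={"Authorization": f"Bearer {os.environ['REST_TOKEN']}"},
    )
    resp.raise_for_status()

    # Persist the full response body to Blob Storage for later consolidation
    blob = BlobClient.from_connection_string(
        os.environ["STORAGE_CONNECTION_STRING"],
        container_name="api-responses",
        blob_name=f"response-{req.params.get('chunkId', 'unknown')}.json",
    )
    blob.upload_blob(json.dumps(resp.json()), overwrite=True)

    return func.HttpResponse("OK", status_code=200)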
Please let me know whether these approaches work for you.
If this information helps, please Upvote and Accept the answer so that it benefits other community members.
Thanks,
Kalyani