
How to overcome ADF lookup activity limitation

Saravanan R 0 Reputation points
2025-08-20T17:44:26.0833333+00:00

Scenario

I have a use case where data needs to be read from a source (SQL, Blob, S3, etc.), sent to a REST API for a custom transformation, and then the API response must be written back to a sink (SQL, S3, Blob, etc.).

Current Approach

  • Step 1: Use a Lookup activity to read the source dataset.
  • Step 2: Call the REST API in a Copy activity (source = REST, sink = chosen sink).
  • Issue: Lookup activity has a hard limit of 5,000 rows / 4 MB of output JSON, so this approach does not scale for large datasets.

To work around this, I currently:

  • Split the source into chunks of 5,000 rows.
  • For SQL, I use OFFSET + LIMIT queries to paginate data.
  • Use ForEach to process each chunk and call the API.

Problem

This chunking solution works but introduces serious performance issues:

  • For 1M or 10M records, the pipeline takes too long due to serial/limited parallel processing.
  • Copy activity only returns metadata (rows copied, duration, etc.) — it does not expose the response body from the REST API, so I cannot directly capture and write it back to the sink of my choice.

Questions

  1. How can I overcome the 5k row limit and avoid Lookup while still sending bulk data to the REST API?
  2. Is there any alternative approach in ADF to capture the actual REST response of a Copy activity (not just metadata) and write it to a sink?
  3. Apart from re-engineering the REST API itself to handle batching, what ADF-native solutions exist for this pattern?

Thanks in advance. Any response towards solving this scenario is much appreciated.

Azure Data Factory

An Azure service for ingesting, preparing, and transforming data at scale.


2 answers

  1. Anonymous
    2025-08-21T11:10:21+00:00

    Hi Saravanan R,

    Thanks for posting your query on Microsoft Q&A!

    To answer your questions, please check below.

    Q1: How can I overcome the 5k row limit and avoid Lookup while still sending bulk data to the REST API?

    Use Databricks or Azure Functions to read and batch data outside of Lookup. These services can handle millions of rows efficiently and support full response capture.

    Q2: Is there any alternative approach in ADF to capture the actual REST response of a Copy activity (not just metadata) and write it to a sink?

    Unfortunately, no. ADF’s Copy activity does not expose the response body. To capture and persist the full JSON response, you’ll need to use Databricks or Azure Functions.

    Q3: Apart from re-engineering the REST API itself to handle batching, what ADF-native solutions exist for this pattern?

    While ADF itself doesn’t support response capture natively, it can orchestrate external compute services that do.

    Azure Databricks Notebook Approach (recommended for very large datasets)

    1. Provision an Azure Databricks workspace and cluster sized for your data volume and API throughput needs.
    2. In ADF, create a “Databricks Notebook” activity. Pass into it:
    • Source dataset path (e.g., ADLS/Blob container, or a SQL connection string if reading via Spark).
    • Chunk size or partitioning hints.
    • REST endpoint URL and any authentication tokens.
    3. In your Databricks notebook (Python/Spark):
    • Read the source data as a Spark DataFrame.
    • Repartition the DataFrame based on desired parallelism (e.g., '.repartition(200)').
    • Use mapPartitions or foreachPartition to collect each partition's rows into a batch JSON payload, call the REST API with Python's 'requests' to send the batch, parse the JSON response body, and yield the response records into a new Spark RDD/DataFrame.
    • Write the responses DataFrame directly to your sink (Parquet/CSV to ADLS, SQL via JDBC, or upload to S3).
    4. Monitor the Databricks job via ADF’s Databricks activity output or via the Databricks REST API.
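
    Before these steps run, the notebook needs the values the ADF activity passes in (step 2). A minimal sketch of reading them as notebook widgets is below; the parameter names (rest_url, jdbc_url, api_token, num_partitions) are illustrative, not fixed.

    # Read the base parameters passed from the ADF Databricks Notebook activity.
    # dbutils is available implicitly inside Databricks notebooks.
    dbutils.widgets.text("rest_url", "")
    dbutils.widgets.text("jdbc_url", "")
    dbutils.widgets.text("api_token", "")
    dbutils.widgets.text("num_partitions", "200")

    rest_url = dbutils.widgets.get("rest_url")
    jdbc_url = dbutils.widgets.get("jdbc_url")
    token = dbutils.widgets.get("api_token")
    num_partitions = int(dbutils.widgets.get("num_partitions"))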

    Example code: this is just a sample; change the details to match your environment (container and account names, DataFrame names, JDBC URL, REST endpoint, token, etc.).

    import json
    import requests
    from pyspark.sql import SparkSession

    # Placeholders - replace with your own values (or read them from notebook widgets as shown above)
    jdbc_url = "jdbc:sqlserver://<server>.database.windows.net:1433;databaseName=<db>;user=<user>;password=<password>"
    rest_url = "https://<your-api>/transform"
    token = "<api-bearer-token>"

    # Initialize Spark
    spark = SparkSession.builder.appName("BulkRESTCalls").getOrCreate()

    # Step 1: Read source data (from SQL or ADLS)
    df = spark.read.format("jdbc").option("url", jdbc_url).option("query", "SELECT * FROM source_table").load()

    # Step 2: Repartition for parallelism
    df = df.repartition(200)  # Adjust based on cluster size and API limits

    # Step 3: Define function to call the REST API once per partition
    def call_rest_api(partition):
        import json
        import requests
        rows = list(partition)
        if not rows:
            return
        payload = json.dumps([row.asDict() for row in rows])
        headers = {"Authorization": f"Bearer {token}", "Content-Type": "application/json"}
        response = requests.post(rest_url, data=payload, headers=headers)
        if response.status_code == 200:
            # Assumes the API returns a JSON array with one object per input row
            for item in response.json():
                yield item
        else:
            yield {"error": f"Failed with status {response.status_code}"}

    # Step 4: Apply the REST call per partition and collect the responses
    response_rdd = df.rdd.mapPartitions(call_rest_api)
    response_df = spark.createDataFrame(response_rdd)

    # Step 5: Write the responses to the sink (ADLS, Blob, SQL)
    response_df.write.mode("overwrite").parquet("abfss://******@account.dfs.core.windows.net/output")

    Azure Functions Approach

    For smaller or simpler workloads:

    • Partition source data into blob chunks.
    • Use ADF’s ForEach to trigger multiple Azure Functions in parallel.
    • Each Function calls the REST API and writes the full response to a sink.
    • Consolidate results afterward using a Copy activity.
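
    A minimal sketch of one such Function (Python v1 programming model, HTTP trigger) is below. The endpoint, connection setting, container names, and the blobName request field (REST_URL, STORAGE_CONN_STR, input-chunks, api-responses) are illustrative placeholders to adapt, not fixed names.

    import json
    import os

    import requests
    import azure.functions as func
    from azure.storage.blob import BlobServiceClient

    # Hypothetical settings - replace with your own app settings and containers
    REST_URL = os.environ.get("REST_URL", "https://<your-api>/transform")
    STORAGE_CONN_STR = os.environ["STORAGE_CONN_STR"]
    INPUT_CONTAINER = "input-chunks"
    OUTPUT_CONTAINER = "api-responses"

    def main(req: func.HttpRequest) -> func.HttpResponse:
        # ADF's ForEach passes the chunk blob name in the request body
        blob_name = req.get_json()["blobName"]

        blob_service = BlobServiceClient.from_connection_string(STORAGE_CONN_STR)

        # Read one chunk of source rows (e.g. a JSON array staged by a Copy activity)
        chunk = blob_service.get_blob_client(INPUT_CONTAINER, blob_name).download_blob().readall()

        # Call the REST API with the whole chunk and capture the full response body
        resp = requests.post(REST_URL, data=chunk, headers={"Content-Type": "application/json"})
        resp.raise_for_status()

        # Persist the full response so a later Copy activity can consolidate results
        blob_service.get_blob_client(OUTPUT_CONTAINER, blob_name).upload_blob(resp.content, overwrite=True)

        # Return only a small status payload to ADF, not the response body itself
        return func.HttpResponse(
            json.dumps({"blob": blob_name, "status": resp.status_code}),
            mimetype="application/json",
        )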

    Please let me know if these approaches work for you or not.

    If this information helps, please Upvote and Accept the Answer so that it benefits other community members.

    Thanks,

    Kalyani

    1 person found this answer helpful.

  2. Amira Bedhiafi 41,386 Reputation points MVP Volunteer Moderator
    2025-08-21T11:46:26.67+00:00

    Hello Saravanan !

    Thank you for posting on Microsoft Learn.

    In your case, I think you should avoid using Lookup for data movement; instead, use a Copy activity (or mapping data flows) to land the full source to ADLS/Blob in big partitions.

    You can work with partitioned reads (SQL partitionOptions + parallelCopies) and set the ForEach to parallel (raise the Batch count) to scale.

    You can call the REST API from compute that can return response bodies. If you are comfortable with Azure Functions, you can use them (ideally Durable Functions for fan-out/fan-in): read the staged file or partition, call the API in parallel, write the API response directly to your sink (SQL/Blob/S3), and return only the status to ADF.
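
    A minimal fan-out/fan-in orchestrator sketch in Python Durable Functions is below, assuming a hypothetical activity function named "CallRestApi" that reads one staged partition, calls the API, and writes the response to the sink; the input shape with a "partitions" list is illustrative.

    import azure.durable_functions as df

    def orchestrator_function(context: df.DurableOrchestrationContext):
        # Input passed when ADF starts the orchestration,
        # e.g. {"partitions": ["part-0001.json", "part-0002.json"]}
        partitions = context.get_input()["partitions"]

        # Fan out: one activity invocation per staged partition, run in parallel.
        # "CallRestApi" is a hypothetical activity function that reads the partition,
        # calls the REST API, and writes the full response body to the sink.
        tasks = [context.call_activity("CallRestApi", p) for p in partitions]
        results = yield context.task_all(tasks)

        # Fan in: return only a compact status summary to ADF
        failed = [r for r in results if r.get("status") != 200]
        return {"processed": len(results), "failed": failed}

    main = df.Orchestrator.create(orchestrator_function)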

    Or you can simply use a Databricks Notebook activity / Custom activity on a SHIR to do the same at scale. This avoids the Copy-REST-sink limitation, since Copy won’t give you the response body.

    If you must stay “ADF-only” and capture the response (small payloads only), you can use a Web activity to POST and get the response body (size-limited), then write it:

    • For files: have the Web activity/Function write to Blob, then Copy the Blob into SQL/S3.
    • For JSON: pass @activity('WebCall').output into a Copy with a REST/HTTP source that uses the activity output.
