Suggest strategy/architecture for API Data Ingestion - Data factory or Function apps or something else?

Amrale, Siddhesh 40 Reputation points
2025-03-19T12:59:12.7166667+00:00

Hi!

Description and Goal
I currently have 50 external APIs and 2 on-premises APIs.
I want to pull data from these APIs (some of them are paginated) and land it in some storage.
I might pull one API every 2 minutes, another every 10 minutes, another daily, and so on.
The API payloads can be simple or complex, roughly 10 KB to 100 MB per pull.
The data format is mostly JSON or XML.

Example: I need to pull parking data from an API and put it into storage. After putting it into storage, I will create my own API and provide this parking data to internal teams. My manager might create a Power BI dashboard on this data.

Should I have different layers?

What would be best for ingestion, and then for using that data? Please suggest the best strategy for my use case in detail: which Azure technology should I use? What file format should I use when storing the data? How can I query it? Should I create another layer that picks up the raw data from storage and converts it into something else?

Also, is there any scenario where Data Factory beats an Azure Function App? Whatever Data Factory would do in this scenario, an Azure Function can do faster and cheaper, and I would have more control and flexibility.

#AzureDataFactory #AzureFunctionApp


Accepted answer
  1. Smaran Thoomu 24,185 Reputation points Microsoft External Staff Moderator
    2025-03-19T14:23:11.9833333+00:00

    Hi @Amrale, Siddhesh
    Your use case involves ingesting data from multiple APIs with varying frequencies, handling different data formats (JSON/XML), and ensuring it is stored optimally for querying and analysis. Below is a recommended architecture using Azure services:

    Data Ingestion Layer

    Azure Function Apps vs. Azure Data Factory (ADF)

    | Criteria | Azure Function Apps | Azure Data Factory |
    | --- | --- | --- |
    | Triggering flexibility | Event-driven, supports CRON scheduling, and can handle frequent invocations (every 2 min, 10 min, etc.). | Best for batch-oriented workloads; not ideal for very high-frequency triggers. |
    | Performance & cost | Cost-effective for frequent small payloads; can scale dynamically. | Better suited for large-scale ETL processes with less frequent execution. |
    | Complex orchestration | More control over API calls, including custom pagination handling. | Ideal for workflows involving multiple dependencies, transformations, and monitoring. |
    | Pagination & API handling | Requires custom logic but provides full control over API request handling. | Some pagination support, but less flexible than Functions. |
    | Monitoring & debugging | Can integrate with Azure Application Insights for detailed logging. | Built-in monitoring with logs and execution history in the Azure portal. |
    1. For high-frequency, small-to-medium-sized API calls (e.g., every 2 or 10 minutes) → use Azure Function Apps for cost efficiency, flexibility, and event-driven execution (see the sketch after this list).
    2. For scheduled batch pulls or complex workflows (e.g., daily ingestion from multiple APIs) → use Azure Data Factory, especially if you need data transformations and orchestration.
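    As a minimal sketch of the Function-based ingestion path (Python v2 programming model), a timer trigger can pull a paginated API and land the raw payload in storage. The endpoint, the `nextPage` pagination field, and the container/connection names below are assumptions for illustration:

    ```python
    import json
    import os
    from datetime import datetime, timezone

    import azure.functions as func
    import requests
    from azure.storage.blob import BlobServiceClient

    app = func.FunctionApp()

    # NCRONTAB schedule: run every 2 minutes; add more functions with
    # different schedules for the 10-minute or daily pulls.
    @app.timer_trigger(schedule="0 */2 * * * *", arg_name="timer")
    def pull_parking_api(timer: func.TimerRequest) -> None:
        pages = []
        url = "https://example.com/api/parking"  # hypothetical endpoint
        while url:
            resp = requests.get(url, timeout=30)
            resp.raise_for_status()
            body = resp.json()
            pages.append(body)
            url = body.get("nextPage")  # assumed pagination contract

        # Land the raw pull in the storage account, partitioned by pull time.
        client = BlobServiceClient.from_connection_string(os.environ["STORAGE_CONN"])
        blob_name = f"parking/{datetime.now(timezone.utc):%Y/%m/%d/%H%M%S}.json"
        client.get_blob_client("raw", blob_name).upload_blob(
            json.dumps(pages), overwrite=True
        )
    ```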

    Storage Layer

    1. Raw Storage: Store data in Azure Data Lake Storage (ADLS) Gen2 for scalability and hierarchical structure.
    2. File Format:
      • JSON: If minimal transformation is needed before querying.
      • Parquet: If optimized querying is required (better for Power BI and analytical queries); a conversion sketch follows this list.
    3. Schema Evolution: Use Azure Synapse or Databricks if schema transformations or merging of different API data sources is needed.
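    A minimal sketch of the JSON-to-Parquet step, assuming pandas and pyarrow are available and that each pull is a JSON array of objects (the file name and fields are hypothetical):

    ```python
    import json

    import pandas as pd

    # One raw pull from the raw zone; file name and fields are hypothetical.
    with open("parking_raw.json") as f:
        records = json.load(f)  # assumed: a JSON array of objects

    # json_normalize flattens nested objects into columns (e.g. location.lat).
    df = pd.json_normalize(records)
    df.to_parquet("parking.parquet", engine="pyarrow", index=False)
    ```

    Because Parquet is columnar, analytical queries scan only the columns they need, which is why it generally beats raw JSON for Power BI and SQL workloads.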

    Processing & Transformation Layer

    1. Direct Querying: If querying raw JSON/XML or Parquet in place, use Azure Synapse Serverless SQL to query directly from ADLS (a query sketch follows this list).
    2. Transformation Needs:
      • Databricks or Synapse Pipelines: If further transformation, aggregation, or normalization is required before exposing to internal APIs.
      • Azure Function Apps: If lightweight transformation logic is needed.
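    As a sketch of the direct-query option: Synapse serverless SQL can read Parquet straight out of ADLS via `OPENROWSET`. The workspace endpoint, authentication method, and storage URL below are placeholders, and the `pyodbc` wrapper is just one way to run the query:

    ```python
    import pyodbc

    # Placeholder workspace endpoint; the authentication method depends on
    # your setup (interactive Entra ID sign-in shown here).
    conn = pyodbc.connect(
        "Driver={ODBC Driver 18 for SQL Server};"
        "Server=<workspace>-ondemand.sql.azuresynapse.net;"
        "Database=master;"
        "Authentication=ActiveDirectoryInteractive;"
    )

    # OPENROWSET lets the serverless pool read Parquet files in place.
    sql = """
    SELECT TOP 10 *
    FROM OPENROWSET(
        BULK 'https://<account>.dfs.core.windows.net/raw/parking/*.parquet',
        FORMAT = 'PARQUET'
    ) AS rows
    """

    for row in conn.execute(sql):
        print(row)
    ```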

    API Exposure & Consumption Layer

    1. Azure API Management (APIM): To expose the processed data as an internal API for consumers (a sketch of a backing function follows this list).
    2. Power BI Integration: Power BI can connect directly to ADLS (using Synapse) or via an API exposed through APIM.
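    APIM itself only fronts a backend; a lightweight backend could be an HTTP-triggered Function that serves the curated data. A minimal sketch (Python v2 model), with assumed container and blob names:

    ```python
    import os

    import azure.functions as func
    from azure.storage.blob import BlobServiceClient

    app = func.FunctionApp()

    # APIM would front this endpoint; container/blob names are assumptions.
    @app.route(route="parking", auth_level=func.AuthLevel.FUNCTION)
    def get_parking(req: func.HttpRequest) -> func.HttpResponse:
        client = BlobServiceClient.from_connection_string(os.environ["STORAGE_CONN"])
        blob = client.get_blob_client("curated", "parking/latest.json")
        return func.HttpResponse(
            blob.download_blob().readall(),
            mimetype="application/json",
        )
    ```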

    When does ADF outperform Azure Functions?

    • If dealing with large-scale batch processing with dependency chaining.
    • When built-in connectors simplify ingestion (e.g., database ingestion instead of API calls).
    • If monitoring, retry mechanisms, and logging via ADF UI are preferred.

    Control & Flexibility

    • Azure Function Apps provide better flexibility for API interactions and custom logic.
    • ADF is more suitable when orchestration across multiple data sources is needed.

    To summarize the points above:

    1. Use Azure Function Apps for API ingestion due to the need for frequent, dynamic calls.
    2. Store data in ADLS Gen2 in Parquet format for optimized querying.
    3. If necessary, use Azure Synapse or Databricks for further transformations.
    4. Expose processed data via Azure API Management for internal teams.
    5. Power BI can connect to Synapse or APIs based on reporting needs.

    I hope this helps. Please let us know if you need any further clarification.

    Kindly consider upvoting the comment if the information provided is helpful. This can assist other community members in resolving similar issues.

    1 person found this answer helpful.

1 additional answer

  1. Nandan Hegde 36,151 Reputation points MVP Volunteer Moderator
    2025-03-19T14:17:29.0766667+00:00

    Azure Functions on the Consumption plan have a default timeout of 5 minutes and an upper limit of 10 minutes. If you need a longer timeout, you have to switch to a different plan, such as the Premium plan, which provides longer timeouts and dedicated resources.

    So an Azure Function may not be a good fit here, as some pulls could hit that timeout.

    I would therefore suggest Azure Data Factory if you want a low-code/no-code approach, or Azure Databricks if performance matters more.

    You can use a Medallion architecture: a bronze layer (a container within Azure Blob Storage) holds the raw data as-is, as the different paginated JSON files from each pull.

    A silver layer, a separate container, then merges all the individual files of a single entity into one file.

    Finally, a gold layer container holds the data shaped for reporting purposes. A sketch of the bronze-to-silver merge follows.
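    A minimal PySpark sketch of that bronze-to-silver merge on Databricks; the storage account, container names, and the `spot_id` key are assumptions:

    ```python
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Bronze: every paginated JSON file from every pull, kept as-is.
    bronze = spark.read.json(
        "abfss://bronze@<account>.dfs.core.windows.net/parking/*.json"
    )

    # Silver: consolidate the pages of one entity; 'spot_id' is an assumed
    # business key used to drop duplicate records across pages.
    silver = bronze.dropDuplicates(["spot_id"])

    # coalesce(1) writes a single file per entity, matching the merge step.
    silver.coalesce(1).write.mode("overwrite").parquet(
        "abfss://silver@<account>.dfs.core.windows.net/parking/"
    )
    ```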

    1 person found this answer helpful.
