Design Pattern Options for Data Ingestion into Databricks

Suleman Butt 1 Reputation point
2023-09-13T22:07:59.9333333+00:00

Dear MS Azure Databricks Community,

Currently, we are in the process of formulating several design pattern options for the ingestion of data (both structured and semi-structured) into Databricks. 

We have come up with the following options, and we would greatly appreciate your insights on their suitability, taking into account the various dimensions outlined below.

 

Design Pattern Options for Data Ingestion:

  1. Databricks Auto Loader or COPY INTO: The client can ingest data into a designated container using the ADLS client library. Our Lakehouse platform team can configure the container and provide the access details (AD credentials, SAS tokens, or account keys) the client needs to authenticate and land data there. Once the data is written to the container, either our platform team or the client (if they have Databricks access) can run Auto Loader-based jobs or COPY INTO statements to process the data and insert it into the target Databricks Delta table or database schema object (a minimal sketch follows this list).
  2. SQL Warehouse: A Service Principal (SP) can be granted the necessary write permissions on the target Databricks Delta table or database schema objects. The SP details and the SQL Warehouse connection information can be shared with the client, who then authenticates as the SP and inserts data using SQL queries (a connector-based sketch follows this list). Note that this approach assumes the data is not temporarily staged in, or sourced from, cloud blob storage.
  3. Custom API: Our platform team can implement a custom API (exposed as a POST endpoint) that internally uses an SP to insert data into Databricks Delta tables or database schema objects via SQL commands. The API can be made configurable enough to accept insert payloads as input parameters and route the data into any desired Delta table or schema object, and we can control which clients are permitted to use it (a hypothetical API sketch follows this list). As with the previous option, this method assumes the data is not temporarily staged in, or derived from, cloud blob storage.
  4. Apache Kafka (platform-team managed): Clients write data to the relevant Kafka topic, which our platform team then consumes and inserts into the appropriate Databricks Delta table or database schema object (a streaming sketch follows this list).
  5. Additional options, such as direct integration with external or partner ingestion tools.
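
To make option 1 concrete, here is a minimal Auto Loader sketch (PySpark, assuming it runs in a Databricks notebook where `spark` is predefined). The landing path, checkpoint location, and target table name are placeholders rather than our actual values; a roughly equivalent COPY INTO statement is shown as a comment.

```python
# Minimal Auto Loader sketch for option 1. All paths and the table name are
# placeholders, not values from our environment.
landing_path = "abfss://landing@<storage-account>.dfs.core.windows.net/client-a/events/"
checkpoint_path = "abfss://lake@<storage-account>.dfs.core.windows.net/_checkpoints/client-a/events/"
target_table = "main.raw.client_a_events"

df = (
    spark.readStream
    .format("cloudFiles")                      # Auto Loader source
    .option("cloudFiles.format", "json")       # structured/semi-structured input
    .option("cloudFiles.schemaLocation", checkpoint_path)
    .load(landing_path)
)

(
    df.writeStream
    .option("checkpointLocation", checkpoint_path)
    .trigger(availableNow=True)                # run as an incremental batch job
    .toTable(target_table)
)

# Roughly equivalent COPY INTO statement for ad-hoc or batch ingestion:
# spark.sql(f"""
#   COPY INTO {target_table}
#   FROM '{landing_path}'
#   FILEFORMAT = JSON
# """)
```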
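For option 2, a sketch of how a client could insert through the SQL Warehouse using the databricks-sql-connector Python package, assuming the client has already obtained an AAD/OAuth access token for the SP. The hostname, HTTP path, token, table name, and column names are placeholders.

```python
# Sketch for option 2: inserting via a SQL warehouse as a service principal.
# Connection details, token retrieval, and the target table are placeholders.
from databricks import sql  # pip install databricks-sql-connector

access_token = "<AAD-or-OAuth-token-obtained-for-the-service-principal>"

with sql.connect(
    server_hostname="adb-1234567890123456.7.azuredatabricks.net",
    http_path="/sql/1.0/warehouses/abcdef1234567890",
    access_token=access_token,
) as connection:
    with connection.cursor() as cursor:
        # Named parameter markers assume a recent connector version with
        # native parameterized query support.
        cursor.execute(
            "INSERT INTO main.raw.client_a_events (event_id, payload) "
            "VALUES (:event_id, :payload)",
            {"event_id": "evt-001", "payload": '{"k": "v"}'},
        )
```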
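For option 3, a hypothetical sketch of such a POST endpoint using FastAPI and the Databricks SQL connector; the endpoint path, payload model, table allow-list, and connection details are illustrative assumptions rather than an agreed design.

```python
# Hypothetical sketch for option 3: a POST endpoint that inserts rows on behalf
# of approved clients. FastAPI, the payload model, and the allow-list are
# illustrative choices, not part of the original design.
from databricks import sql
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI()

# Only tables explicitly listed here can be written to through the API.
ALLOWED_TABLES = {"main.raw.client_a_events"}

class InsertRequest(BaseModel):
    table: str
    event_id: str
    payload: str

def insert_row(table: str, event_id: str, payload: str) -> None:
    # Connect as the service principal (placeholders as in the previous sketch).
    with sql.connect(
        server_hostname="adb-1234567890123456.7.azuredatabricks.net",
        http_path="/sql/1.0/warehouses/abcdef1234567890",
        access_token="<token-for-the-service-principal>",
    ) as conn, conn.cursor() as cursor:
        # Table name comes from the allow-list above; values are parameterized.
        cursor.execute(
            f"INSERT INTO {table} (event_id, payload) VALUES (:event_id, :payload)",
            {"event_id": event_id, "payload": payload},
        )

@app.post("/ingest")
def ingest(req: InsertRequest):
    if req.table not in ALLOWED_TABLES:
        raise HTTPException(status_code=403, detail="table not permitted")
    insert_row(req.table, req.event_id, req.payload)
    return {"status": "ok"}
```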
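For option 4, a minimal Structured Streaming sketch (PySpark, Databricks notebook) that consumes a Kafka topic and appends to a Delta table; broker addresses, topic, checkpoint path, and table name are placeholders, and any SASL/TLS options required by the actual Kafka cluster would still need to be added.

```python
# Sketch for option 4: Kafka -> Delta via Structured Streaming, run by the
# platform team. Brokers, topic, checkpoint path, and table are placeholders.
raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
    .option("subscribe", "client-a-events")
    .option("startingOffsets", "earliest")
    .load()
)

# Kafka key/value arrive as binary; cast to strings for downstream parsing.
events = raw.selectExpr(
    "CAST(key AS STRING) AS event_id",
    "CAST(value AS STRING) AS payload",
    "timestamp AS kafka_timestamp",
)

(
    events.writeStream
    .option("checkpointLocation",
            "abfss://lake@<storage-account>.dfs.core.windows.net/_checkpoints/kafka/client-a/")
    .toTable("main.raw.client_a_events")
)
```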

 

Now, let's explore various dimensions that should guide our decision-making process and determine which option is most suitable for specific scenarios:

 

  • Scalability in Relation to Data Size/Payload: Data sizes vary from gigabytes (e.g., for data scientists) down to megabytes or even kilobytes, so it's essential to consider which of the listed options scales best. Should data be temporarily stored in an intermediate cloud location before ingestion? What size and number of individual files optimize performance?
  • Change Frequency: The frequency at which data needs to be inserted, whether it's on a second-by-second, minute-by-minute, or hourly basis compared to daily, weekly, or monthly insertions, plays a crucial role in selecting the appropriate option. This decision should take into account the data size aspect mentioned earlier.
  • Cost: It's imperative to assess the cost implications of each option, both in terms of infrastructure and implementation effort. Some methods may require more resources or involve higher development costs than others.
  • Reusability vs. Custom Development: Consider how efficiently and rapidly we can reuse or reconfigure existing pipelines vs the effort required to build a new one from scratch each time a new client requires data ingestion.
  • Additional Dimensions: Explore any other relevant dimensions that could impact the choice of data ingestion design pattern.

 

Taking into account these dimensions while evaluating the data ingestion design pattern options, our goal is to create a matrix that provides a clear order of preference based on these considerations.

 

Notes:

  • Client: Either an internal or an external company's application, which may or may not have access to a Databricks workspace.
  • Our main focus here is to streamline data ingestion options for non-SAP data sources. For SAP data sources, we generally follow a data ingestion process based on SAP Data Intelligence (SAP DI), where SAP DI handles the initial data landing into our data lake. Subsequently, we employ Databricks Auto Loader-driven jobs to move the data through the different stages within our data lake and ultimately store it in Databricks objects.

 

Kindly share your thoughts and experience in outlining design best practices for ingesting data into Databricks.

 

Many thanks.

 

Suleman Butt

Solutions Architect


1 answer

  1. Amira Bedhiafi 23,011 Reputation points
    2023-09-14T11:58:59.8+00:00

    I checked some articles and resources, and I have concluded the following best practices by case:

    **1. Scalability in Relation to Data Size/Payload:**

    • Databricks Autoloader/COPY INTO: Highly scalable and efficient for ingesting large datasets. Auto Loader's ability to handle bulk inserts makes it particularly useful for bigger payloads, and it can scale horizontally based on the data size.
    • SQL Warehouse: Suited for structured data and offers scalability. However, it might not perform as well as Autoloader for large datasets.
    • Custom API: Best suited for moderate to smaller payloads. It adds a layer of abstraction which could become a bottleneck with massive datasets.
    • Apache Kafka: Extremely scalable and can handle varying data sizes, but the design of the Kafka consumer in Databricks will determine actual performance.
    **2. Change Frequency:**
    
    • Databricks Autoloader/COPY INTO: Efficient for infrequent, bulk inserts.
    • SQL Warehouse: Best for scenarios where frequent inserts are not critical.
    • Custom API: Suitable for frequent, smaller changes, especially if near real-time ingestion isn't a necessity.
    • Apache Kafka: Ideal for real-time or near real-time ingestion with high-frequency changes.
    **3. Cost:**
    
    • Databricks Autoloader/COPY INTO: While a cost is associated with storage and compute, Auto Loader minimizes unnecessary compute usage by efficiently managing incremental loads.
    • SQL Warehouse: Depending on the solution (e.g., Azure Synapse), costs can be significant but predictable.
    • Custom API: Overhead includes both the infrastructure for the API and the development/maintenance.
    • Apache Kafka: While Kafka itself can be cost-effective, managing and scaling a Kafka cluster might add to the costs.

    **4. Reusability vs. Custom Development:**
    • Databricks Autoloader/COPY INTO: With established patterns, the solution can be replicated across multiple data ingestion scenarios. (One point for reusability)
    • SQL Warehouse: Moderate reusability. Each client's schema and data might necessitate tweaks.
    • Custom API: Lower reusability, as each new requirement might need adjustments to the API.
    • Apache Kafka: High reusability, especially if topics and consumers are designed generically.
    **5. Additional Dimensions:**
    • Complexity: A custom API or Kafka setup requires a more complex architecture and expertise to manage, compared to direct ingestion methods.
    • Data Integrity and Reliability: While tools like Kafka provide built-in mechanisms to ensure data integrity, custom solutions would need additional considerations.
    • Security: Direct ingestion methods are often more secure since they don't expose additional endpoints (like an API). However, each method needs security considerations, especially regarding data encryption and access controls.

    I assume your matrix would look like this: if cost and scalability are your top concerns, Autoloader might rank first; if real-time ingestion is a top priority, Kafka could be the top choice. You may run some POCs and then analyze the outputs. Don't forget that whatever you choose needs to be aligned with Bayer's data governance, security policies, and long-term strategy.
