I checked some articles and resources, and I concluded the following best practices by case:
**1. Scalability in Relation to Data Size/Payload:**
- Databricks Autoloader/COPY INTO: Highly scalable and efficient for ingesting large datasets. Autoloader's ability to handle bulk inserts makes it particularly useful for bigger payloads, and it can scale horizontally with the data size (see the sketch after this list).
- SQL Warehouse: Suited for structured data and offers scalability, but it might not perform as well as Autoloader for large datasets.
- Custom API: Best suited for moderate to smaller payloads. It adds a layer of abstraction that could become a bottleneck with massive datasets.
- Apache Kafka: Extremely scalable and can handle varying data sizes, but the design of the Kafka consumer in Databricks will determine actual performance.
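To make the Autoloader option concrete, here is a minimal PySpark sketch of incremental file ingestion with Auto Loader. The storage paths, file format, and target table name are placeholders I made up, so treat this as a starting point rather than a finished pipeline:

```python
# Minimal Auto Loader sketch (PySpark on Databricks).
# All paths and the target table name are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

raw_path = "abfss://landing@mystorage.dfs.core.windows.net/events/"          # assumed landing zone
checkpoint_path = "abfss://bronze@mystorage.dfs.core.windows.net/_chk/events/"

stream = (
    spark.readStream.format("cloudFiles")                  # Auto Loader source
    .option("cloudFiles.format", "json")                   # incoming file format (assumed JSON)
    .option("cloudFiles.schemaLocation", checkpoint_path)  # lets Auto Loader track/evolve the schema
    .load(raw_path)
)

(
    stream.writeStream.format("delta")
    .option("checkpointLocation", checkpoint_path)  # incremental, exactly-once progress tracking
    .trigger(availableNow=True)                     # process all new files, then stop (batch-style run)
    .toTable("bronze.events")                       # hypothetical target Delta table
)
```

Because the checkpoint tracks which files were already processed, rerunning the job only picks up new files, which is what keeps compute usage proportional to the incremental payload.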
**2. Change Frequency:**
- Databricks Autoloader/COPY INTO: Efficient for infrequent, bulk inserts.
- SQL Warehouse: Best for scenarios where frequent inserts are not critical.
- Custom API: Suitable for frequent, smaller changes, especially if near real-time ingestion isn't a necessity.
- Apache Kafka: Ideal for real-time or near real-time ingestion with high-frequency changes (see the streaming sketch below).
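For comparison with the Autoloader sketch, a minimal Structured Streaming consumer for the Kafka path could look like the following. The broker address, topic, and target table are assumptions, and authentication options (SASL/SSL) are omitted for brevity:

```python
# Minimal Kafka-to-Delta sketch (PySpark Structured Streaming on Databricks).
# Broker, topic, and table names are hypothetical; add auth options as your cluster requires.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

kafka_stream = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")  # assumed broker endpoint
    .option("subscribe", "client-events")               # assumed topic
    .option("startingOffsets", "latest")                 # only consume new messages
    .load()
)

# Kafka delivers key/value as binary; cast the payload to string before parsing downstream.
parsed = kafka_stream.select(
    col("key").cast("string"),
    col("value").cast("string"),
    "timestamp",
)

(
    parsed.writeStream.format("delta")
    .option("checkpointLocation", "/mnt/checkpoints/client_events")  # hypothetical checkpoint path
    .toTable("bronze.client_events")                                 # hypothetical target table
)
```

Running this without a trigger keeps the stream continuously active, which is what gives you the near real-time behavior; how you partition the topic and size the consumer cluster drives the actual latency and cost.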
**3. Cost:**
- Databricks Autoloader/COPY INTO: While costs are associated with storage and compute, Autoloader minimizes unnecessary compute usage by efficiently managing incremental loads.
- SQL Warehouse: Depending on the solution (e.g., Azure Synapse), costs can be significant but predictable.
- Custom API: Overhead includes both the infrastructure for the API and its development/maintenance.
- Apache Kafka: While Kafka itself can be cost-effective, managing and scaling a Kafka cluster might add to the costs.
**4. Reusability vs. Custom Development:**
- Databricks Autoloader/COPY INTO: With established patterns, the solution can be replicated across multiple data ingestion scenarios, which is a point in favor of reusability (see the parameterized sketch after this list).
- SQL Warehouse: Moderate reusability. Each client's schema and data might necessitate tweaks.
- Custom API: Lower reusability, as each new requirement might need adjustments to the API.
- Apache Kafka: High reusability, especially if topics and consumers are designed generically.
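As a sketch of what an "established pattern" could look like in practice, the helper below parameterizes an Auto Loader ingestion so the same function can be reused per client or source. The function name, parameters, and paths are my own illustration, not an existing API:

```python
# Hypothetical reusable ingestion helper: the function and its parameters are
# illustrative only, showing how one Auto Loader pattern can serve many sources.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def ingest_source(source_path: str, file_format: str, target_table: str, checkpoint_path: str):
    """Ingest new files from source_path into target_table using Auto Loader."""
    stream = (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", file_format)
        .option("cloudFiles.schemaLocation", checkpoint_path)
        .load(source_path)
    )
    return (
        stream.writeStream.format("delta")
        .option("checkpointLocation", checkpoint_path)
        .trigger(availableNow=True)
        .toTable(target_table)
    )

# Example: the same pattern reused for two different clients (paths are placeholders).
ingest_source("/mnt/landing/client_a/", "json", "bronze.client_a_events", "/mnt/chk/client_a/")
ingest_source("/mnt/landing/client_b/", "csv", "bronze.client_b_events", "/mnt/chk/client_b/")
```

The point is that only configuration changes per client; a custom API, by contrast, tends to accumulate per-client logic inside the service itself.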
**5. Additional Dimensions:**
- Complexity: A custom API or Kafka setup requires a more complex architecture and expertise to manage, compared to direct ingestion methods.
- Data Integrity and Reliability: While tools like Kafka provide built-in mechanisms to ensure data integrity, custom solutions would need additional considerations.
- Security: Direct ingestion methods are often more secure since they don't expose additional endpoints (like an API). However, each method needs security considerations, especially regarding data encryption and access controls.

I assume your decision matrix would look something like this: if cost and scalability are your top concerns, Autoloader might rank first; if real-time ingestion is a top priority, Kafka could be the top choice (a simple scoring sketch follows below). You may run some POCs and then analyze the outputs. Don't forget that whatever you choose needs to be aligned with Bayer's data governance, security policies, and long-term strategy.
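If it helps to structure the POC comparison, here is a small plain-Python sketch of the kind of weighted scoring matrix I have in mind. The weights and ratings are made-up examples, not a recommendation; replace them with your own assessments after the POCs:

```python
# Illustrative weighted decision matrix: weights and ratings are placeholders.
weights = {"scalability": 0.3, "change_frequency": 0.2, "cost": 0.3, "reusability": 0.2}

# Ratings from 1 (weak) to 5 (strong) per criterion; the values below are only examples.
options = {
    "Autoloader/COPY INTO": {"scalability": 5, "change_frequency": 3, "cost": 4, "reusability": 4},
    "SQL Warehouse":        {"scalability": 3, "change_frequency": 3, "cost": 3, "reusability": 3},
    "Custom API":           {"scalability": 2, "change_frequency": 4, "cost": 2, "reusability": 2},
    "Apache Kafka":         {"scalability": 5, "change_frequency": 5, "cost": 3, "reusability": 4},
}

# Weighted sum per option, highest score first.
scores = {
    name: sum(weights[criterion] * rating for criterion, rating in ratings.items())
    for name, ratings in options.items()
}

for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.2f}")
```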