Lakeflow Connect offers simple and efficient connectors to ingest data from local files, popular enterprise applications, databases, cloud storage, message buses, and more. This page outlines some of the ways that Lakeflow Connect can improve ETL performance. It also covers common use cases and the range of supported ingestion tools, from fully managed connectors to fully customizable frameworks.
Flexible service models
Lakeflow Connect offers a broad range of connectors for enterprise applications, cloud storage, databases, message buses, and more. It also gives you the flexibility to choose between the following:
Service model | Description |
---|---|
A fully managed service | Out-of-the-box connectors that democratize data access with simple UIs and powerful APIs. This allows you to quickly create robust ingestion pipelines while minimizing long-term maintenance costs. |
A custom pipeline | If you need more customization, you can use Lakeflow Declarative Pipelines or Structured Streaming. This versatility enables Lakeflow Connect to meet your organization's specific needs. |
Unification with core Databricks tools
Lakeflow Connect uses core Databricks features to provide comprehensive data management. For example, it offers governance using Unity Catalog, orchestration using Lakeflow Jobs, and holistic monitoring across your pipelines. This helps your organization manage data security, quality, and cost while unifying your ingestion processes with your other data engineering tools. Lakeflow Connect is built on an open Data Intelligence Platform, with full flexibility to incorporate your preferred third-party tools. This ensures a tailored solution that aligns with your existing infrastructure and future data strategies.
Fast, scalable ingestion
Lakeflow Connect uses incremental reads and writes to enable efficient ingestion. When combined with incremental transformations downstream, this can significantly improve ETL performance.
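For example, incremental file ingestion with Auto Loader might look like the following sketch; the volume paths, checkpoint locations, and table name are placeholders, and `spark` is assumed to be the active session in a Databricks notebook or job.

```python
# Minimal sketch: incremental ingestion with Auto Loader (Structured Streaming).
# Paths, checkpoint locations, and the target table name are placeholders.
df = (
    spark.readStream.format("cloudFiles")          # Auto Loader source
    .option("cloudFiles.format", "json")           # format of the incoming files
    .option("cloudFiles.schemaLocation", "/Volumes/main/default/checkpoints/events_schema")
    .load("/Volumes/main/default/raw/events/")     # only new files are processed on each run
)

(
    df.writeStream
    .option("checkpointLocation", "/Volumes/main/default/checkpoints/events")
    .trigger(availableNow=True)                    # process all available data, then stop
    .toTable("main.default.events_bronze")         # incremental append to a Delta table
)
```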
Common use cases
Customers ingest data to solve their organizations' most challenging problems. Sample use cases include the following:
Use case | Description |
---|---|
Customer 360 | Measuring campaign performance and customer lead scoring |
Portfolio management | Maximizing ROI with historical and forecasting models |
Consumer analytics | Personalizing your customers' purchasing experiences |
Centralized human resources | Supporting your organization's workforce |
Digital twins | Increasing manufacturing efficiency |
RAG chatbots | Building chatbots to help users understand policies, products, and more |
Layers of the ETL stack
Some connectors operate at only one layer of the ETL stack. For example, Databricks offers fully managed connectors for enterprise applications like Salesforce and databases like SQL Server. Other connectors operate at multiple layers of the ETL stack: you can use standard connectors in either Structured Streaming for full customization or Lakeflow Declarative Pipelines for a more managed experience. You can similarly choose your level of customization for streaming data from Apache Kafka, Amazon Kinesis, Google Pub/Sub, and Apache Pulsar.
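For instance, at the most customizable layer, a Structured Streaming read from Apache Kafka might look like the following sketch; the broker address, topic, checkpoint path, and target table are placeholders.

```python
# Minimal sketch: reading a Kafka topic with Structured Streaming (the most customizable layer).
# The broker address, topic name, checkpoint path, and target table are placeholders.
raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")   # placeholder broker
    .option("subscribe", "orders")                        # placeholder topic
    .option("startingOffsets", "earliest")
    .load()
)

# Kafka delivers binary keys and values; cast them to strings for downstream parsing.
parsed = raw.selectExpr(
    "CAST(key AS STRING) AS key",
    "CAST(value AS STRING) AS value",
    "timestamp",
)

(
    parsed.writeStream
    .option("checkpointLocation", "/Volumes/main/default/checkpoints/orders")
    .toTable("main.default.orders_raw")                   # incremental append to a Delta table
)
```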
Databricks recommends starting with the most managed layer. If it doesn't satisfy your requirements (for example, if it doesn't support your data source), drop down to the next layer. Databricks plans to expand support for more connectors across all three layers.
The following table describes the three layers of ingestion products, ordered from most customizable to most managed:
Layer | Description |
---|---|
Structured Streaming | Structured Streaming is an API for incremental stream processing in near real-time. It provides strong performance, scalability, and fault tolerance. |
Lakeflow Declarative Pipelines | Lakeflow Declarative Pipelines builds on Structured Streaming, offering a more declarative framework for creating data pipelines. You can define the transformations to perform on your data, and Lakeflow Declarative Pipelines manages orchestration, monitoring, data quality, errors, and more. Therefore, it offers more automation and less overhead than Structured Streaming. |
Fully managed connectors | Fully managed connectors build on Lakeflow Declarative Pipelines, adding even more automation for the most popular data sources. They extend Lakeflow Declarative Pipelines functionality to include source-specific authentication, CDC, edge-case handling, long-term API maintenance, automated retries, automated schema evolution, and more. |
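To make the contrast between the first two layers concrete, the following is a minimal sketch of the same Auto Loader ingestion expressed with Lakeflow Declarative Pipelines using the Python `dlt` module; the source path and table name are placeholders, and the pipeline manages checkpoints, retries, and monitoring for you.

```python
# Minimal sketch: declarative ingestion with Lakeflow Declarative Pipelines (Python `dlt` module).
# The source path is a placeholder; checkpointing and orchestration are handled by the pipeline.
import dlt


@dlt.table(comment="Raw events ingested incrementally with Auto Loader")
def events_bronze():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/Volumes/main/default/raw/events/")   # placeholder source path
    )
```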
Managed connectors
You can use fully managed connectors to ingest from enterprise applications and databases.
Supported connectors include enterprise applications such as Salesforce and databases such as SQL Server.
Supported interfaces include:
- Databricks UI
- Databricks Asset Bundles
- Databricks APIs
- Databricks SDKs
- Databricks CLI
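As an illustration of the API and SDK route, the following hedged sketch uses the Databricks SDK for Python to create a managed ingestion pipeline; the connection, catalog, schema, and table names are placeholders, and the exact `ingestion_definition` classes and fields shown are assumptions to verify against the SDK reference for your connector.

```python
# Hedged sketch: creating a managed ingestion pipeline with the Databricks SDK for Python.
# Connection, catalog, schema, and table names are placeholders; the ingestion_definition
# classes and fields used here are assumptions to verify against the SDK reference.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import pipelines

w = WorkspaceClient()  # authenticates from the environment or a Databricks config profile

created = w.pipelines.create(
    name="salesforce-accounts-ingest",  # placeholder pipeline name
    ingestion_definition=pipelines.IngestionPipelineDefinition(
        connection_name="my_salesforce_connection",  # Unity Catalog connection (placeholder)
        objects=[
            pipelines.IngestionConfig(
                table=pipelines.TableSpec(
                    source_table="Account",           # placeholder source object
                    destination_catalog="main",
                    destination_schema="sales",
                )
            )
        ],
    ),
)
print(created.pipeline_id)
```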
Standard connectors
In addition to the managed connectors, Databricks offers customizable connectors for cloud object storage and message buses. See Standard connectors in Lakeflow Connect.
File upload and download
You can ingest files that reside on your local network, files that have been uploaded to a volume, or files that are downloaded from an internet location. See Files.
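For example, a batch read of an uploaded file from a Unity Catalog volume might look like the following sketch; the catalog, schema, volume, and file names are placeholders.

```python
# Minimal sketch: batch-reading an uploaded file from a Unity Catalog volume.
# The catalog, schema, volume, and file names are placeholders.
df = (
    spark.read.format("csv")
    .option("header", "true")
    .load("/Volumes/main/default/uploads/customers.csv")
)
df.write.mode("overwrite").saveAsTable("main.default.customers_bronze")
```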
Ingestion partners
Many third-party tools support batch or streaming ingestion into Databricks. Databricks validates various third-party integrations, although the steps to configure access to source systems and ingest data vary by tool. See Ingestion partners for a list of validated tools. Some technology partners are also featured in Databricks Partner Connect, which has a UI that simplifies connecting third-party tools to Lakehouse data.
DIY ingestion
Databricks provides a general compute platform. As a result, you can create your own ingestion connectors using any programming language supported by Databricks, like Python or Java. You can also import and use popular open source connector libraries like data load tool, Airbyte, and Debezium.
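For instance, a do-it-yourself connector can be as simple as calling a REST API and appending the results to a Delta table, as in the following sketch; the endpoint, response shape, and table name are hypothetical.

```python
# Minimal DIY sketch: pull records from a hypothetical REST endpoint and append them
# to a Delta table. The URL, response shape, and table name are placeholders.
import requests

resp = requests.get("https://api.example.com/v1/orders", timeout=30)  # hypothetical endpoint
resp.raise_for_status()
records = resp.json()  # assumes a JSON array of flat records

if records:
    df = spark.createDataFrame(records)
    df.write.mode("append").saveAsTable("main.default.orders_raw")  # placeholder table
```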
Ingestion alternatives
Databricks recommends ingestion for most use cases because it scales to accommodate high data volumes, low-latency querying, and third-party API limits. Ingestion copies data from your source systems to Azure Databricks, which results in duplicate data that might become stale over time. If you don't want to copy data, you can use the following tools:
Tool | Description |
---|---|
Lakehouse Federation | Allows you to query external data sources without moving your data. |
Delta Sharing | Allows you to securely share data across platforms, clouds, and regions. |
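For example, after you configure a foreign catalog with Lakehouse Federation, you can query the external source in place like any other Unity Catalog table; the catalog, schema, and table names in the following sketch are placeholders.

```python
# Minimal sketch: querying an external source in place through a Lakehouse Federation
# foreign catalog. The catalog, schema, and table names are placeholders.
df = spark.table("postgres_foreign_catalog.sales.orders")
df.groupBy("region").count().show()
```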