You can use Azure Databricks to query streaming data sources using Structured Streaming. Azure Databricks provides extensive support for streaming workloads in Python and Scala, and supports most Structured Streaming functionality with SQL.
The following examples demonstrate using a memory sink for manual inspection of streaming data during interactive development in notebooks. Because of row output limits in the notebook UI, you might not observe all data read by streaming queries. In production workloads, you should only trigger streaming queries by writing them to a target table or external system.
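As an illustration, the following is a minimal sketch of writing to a memory sink explicitly and then querying the in-memory table; streaming_df is a placeholder for a streaming DataFrame you have already defined:

Python

# Write the streaming DataFrame to an in-memory table for manual inspection.
# streaming_df is a placeholder; replace it with your own streaming DataFrame.
query = (streaming_df.writeStream
    .format("memory")           # keep results in an in-memory table
    .queryName("inspection")    # name of the in-memory table to query
    .outputMode("append")
    .start()
)

# Query the accumulated rows; the notebook UI limits how many rows are shown.
display(spark.sql("SELECT * FROM inspection"))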
Azure Databricks provides streaming data readers for the following streaming systems:
Kafka
Kinesis
PubSub
Pulsar
You must provide configuration details when you initialize queries against these systems; the required options vary depending on your configured environment and the system you read from. See Configure streaming data sources.
Common workloads that involve streaming systems include data ingestion to the lakehouse and stream processing to sink data to external systems. For more on streaming workloads, see Streaming on Azure Databricks.
The following examples demonstrate an interactive streaming read from Kafka:
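The broker address and topic shown here are placeholders that you replace with values for your environment:

Python

display(spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "<server:ip>")   # placeholder broker address
    .option("subscribe", "<topic>")                     # placeholder topic name
    .option("startingOffsets", "latest")
    .load()
)

SQL

-- Requires a runtime that supports the read_kafka table-valued function.
SELECT * FROM read_kafka(
  bootstrapServers => '<server:ip>',
  subscribe => '<topic>',
  startingOffsets => 'latest'
);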
Azure Databricks creates all tables using Delta Lake by default. When you perform a streaming query against a Delta table, the query automatically picks up new records when a version of the table is committed. By default, streaming queries expect source tables to contain only appended records. If you need to work with streaming data that contains updates and deletes, Databricks recommends using Delta Live Tables and APPLY CHANGES INTO. See The APPLY CHANGES APIs: Simplify change data capture with Delta Live Tables.
The following examples demonstrate an interactive streaming read from a table:
Python
display(spark.readStream.table("table_name"))
SQL
SELECT * FROM STREAM table_name
Query data in cloud object storage with Auto Loader
You can stream data from cloud object storage using Auto Loader, the Azure Databricks cloud data connector. You can use the connector with files stored in Unity Catalog volumes or other cloud object storage locations. Databricks recommends using volumes to manage access to data in cloud object storage. See Connect to data sources.
Azure Databricks optimizes this connector for streaming ingestion of data in cloud object storage that is stored in popular structured, semi-structured, and unstructured formats. Databricks recommends storing ingested data in a nearly raw format to maximize throughput and minimize potential data loss due to corrupt records or schema changes.
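As an illustration, the following sketch shows an interactive Auto Loader read; the source path, file format, and schema location are placeholders you replace with values for your environment:

Python

display(spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")                               # placeholder file format
    .option("cloudFiles.schemaLocation", "<path-to-schema-location>")  # placeholder schema tracking path
    .load("<path-to-source-data>")                                     # placeholder source directory
)

SQL

-- Requires a runtime that supports the read_files table-valued function.
SELECT * FROM STREAM read_files(
  '<path-to-source-data>',
  format => 'json'
);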