Interact with external data on Azure Databricks

Databricks Runtime provides bindings to popular data sources and formats to make importing and exporting data from the lakehouse simple. This article provides information to help you identify formats and integrations that have built-in support. You can also discover ways to extend Azure Databricks to interact with even more systems. Most data on Azure Databricks lives in cloud object storage. See Where’s my data?.

Azure Databricks provides a number of optimizations for data loading and ingestion.

Azure Databricks also supports query federation for both SQL and DataFrame users. See What is query federation?.

If you have not read or written data with Azure Databricks before, consider reviewing the DataFrames tutorial for Python or Scala. Even for users familiar with Apache Spark, this tutorial might address new challenges associated with accessing data in the cloud.

Partner Connect provides optimized, easy-to-configure integrations to many enterprise solutions. See What is Databricks Partner Connect?.

What data formats can you use in Azure Databricks?

Azure Databricks has built-in keyword bindings for all the data formats natively supported by Apache Spark. Azure Databricks uses Delta Lake as the default protocol for reading and writing data and tables, whereas Apache Spark uses Parquet.

The following data formats all have built-in keyword configurations in Apache Spark DataFrames and SQL:

  • Delta Lake
  • Parquet
  • ORC
  • JSON
  • CSV
  • Avro
  • Text
  • Binary file

Azure Databricks also provides a custom keyword for loading MLflow experiments.

Work with streaming data sources on Azure Databricks

Azure Databricks can integrate with stream messaging services for near-real-time data ingestion into the Databricks Lakehouse. Azure Databricks can also sync enriched and transformed data in the lakehouse with other streaming systems.

Structured Streaming provides native streaming access to file formats supported by Apache Spark, but Databricks recommends Auto Loader for most Structured Streaming operations that read data from cloud object storage. See What is Auto Loader?.

Ingesting streaming messages to Delta Lake lets you retain messages indefinitely, so you can replay data streams without fear of losing data to retention thresholds.

Azure Databricks has specific features for working with semi-structured data fields contained in Avro, protocol buffers, and JSON data payloads. To learn more, see:

To learn more about specific configurations for streaming from or to message queues, see:

What data sources connect to Azure Databricks with JDBC?

You can use JDBC to connect with many data sources. Databricks Runtime includes drivers for a number of JDBC databases, but you might need to install a driver or different driver version to connect to your preferred database. Supported databases include the following:
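A minimal sketch of assembling the options for a JDBC read is shown below. The hostname, database, table, credentials, and driver class are all placeholders (PostgreSQL is used purely as an example), and the commented `spark.read` call requires a matching driver JAR installed on the cluster:

```python
def jdbc_read_options(host: str, port: int, database: str, table: str,
                      user: str, password: str) -> dict:
    """Assemble the option map for a JDBC read; all values are placeholders."""
    return {
        "url": f"jdbc:postgresql://{host}:{port}/{database}",
        "dbtable": table,
        "user": user,
        "password": password,
        # Only needed when the driver class is not inferred from the URL.
        "driver": "org.postgresql.Driver",
    }

opts = jdbc_read_options("db.example.com", 5432, "sales", "public.orders",
                         "reader", "secret")

# With the driver installed on the cluster, the read itself is:
# df = spark.read.format("jdbc").options(**opts).load()
print(opts["url"])
```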

What data services does Azure Databricks integrate with?

The following data services require you to configure connection settings, security credentials, and networking settings. You might need administrator or power user privileges in your Azure account or Azure Databricks workspace. Some also require that you create an Azure Databricks library and install it in a cluster:

Data formats with special considerations

The following data formats may require additional configuration or special considerations for use:

  • Databricks recommends loading images as binary data.
  • XML is not natively supported, but can be used after installing a library.
  • Hive tables are also natively supported by Apache Spark, but require configuration on Azure Databricks.
  • Azure Databricks can directly read many file formats while still compressed. You can also unzip compressed files on Azure Databricks if necessary.
  • LZO requires a codec installation.

For more information about Apache Spark data sources, see Generic Load/Save Functions and Generic File Source Options.