Interact with external data on Azure Databricks

Databricks Runtime provides bindings to popular data sources and formats to make importing data to and exporting data from the lakehouse simple. This article helps you identify the formats and integrations that have built-in support. You can also discover ways to extend Azure Databricks to interact with even more systems.

Azure Databricks provides a number of optimizations for data loading and ingestion.
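For example, Auto Loader can incrementally ingest new files from cloud object storage as they arrive. The following is a minimal Python sketch; the source path, schema location, checkpoint location, and table name are placeholders, not values from this article:

```python
# Minimal Auto Loader sketch: incrementally ingest JSON files from object storage.
# `spark` is the SparkSession provided in Databricks notebooks.
# All paths and the table name below are placeholders.
(spark.readStream
    .format("cloudFiles")                                  # Auto Loader source
    .option("cloudFiles.format", "json")                   # format of the incoming files
    .option("cloudFiles.schemaLocation", "/tmp/schemas/events")  # where inferred schema is tracked
    .load("abfss://raw@mystorageaccount.dfs.core.windows.net/events")
    .writeStream
    .option("checkpointLocation", "/tmp/checkpoints/events")
    .toTable("raw_events"))                                # write the stream to a Delta table
```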

Azure Databricks also supports query federation for both SQL and DataFrame users. See What is query federation?.
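As an illustration, once a federated connection has been configured, you can query remote tables by name from either SQL or the DataFrame API. In the sketch below, the catalog, schema, and table names (`postgres_prod.sales.orders`) are hypothetical placeholders:

```python
# Sketch: querying a federated table after a connection and foreign catalog
# have been configured. The three-part table name is a placeholder.
orders = spark.table("postgres_prod.sales.orders")      # DataFrame API

recent = spark.sql("""
    SELECT order_id, amount
    FROM postgres_prod.sales.orders
    WHERE order_date >= date_sub(current_date(), 7)
""")                                                     # SQL API
display(recent)
```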

If you have not read or written data with Azure Databricks before, consider reviewing the DataFrames tutorial for Python or Scala. Even users familiar with Apache Spark might encounter new challenges when accessing data in the cloud, and the tutorial addresses these.

Partner Connect provides optimized, easy-to-configure integrations to many enterprise solutions. See What is Databricks Partner Connect?.

What data formats can you use in Azure Databricks?

Azure Databricks has built-in keyword bindings for all the data formats natively supported by Apache Spark. Azure Databricks uses Delta Lake as the default protocol for reading and writing data and tables, whereas Apache Spark uses Parquet by default.
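The following sketch illustrates the difference; the table name and output path are placeholders:

```python
# `spark` is the SparkSession provided in Databricks notebooks.
df = spark.range(10)

# On Azure Databricks, tables are written in Delta Lake format by default.
df.write.saveAsTable("delta_example")        # placeholder table name

# To use another format, name it explicitly, for example plain Parquet files.
df.write.format("parquet").mode("overwrite").save("/tmp/parquet_example")
```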

The following data formats all have built-in keyword configurations in Apache Spark DataFrames and SQL:

Azure Databricks also provides a custom keyword for loading MLflow experiments.
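These format keywords are passed to the DataFrame reader (or the `USING` clause in SQL). The sketch below uses placeholder paths, and the experiment ID passed to the Databricks-specific `mlflow-experiment` reader is illustrative:

```python
# Built-in format keywords are passed to the DataFrame reader; paths are placeholders.
csv_df = spark.read.format("csv").option("header", "true").load("/tmp/data/people.csv")
json_df = spark.read.format("json").load("/tmp/data/events.json")

# Databricks-specific keyword for loading MLflow experiment run data.
# The experiment ID below is illustrative.
runs_df = spark.read.format("mlflow-experiment").load("1234567890123456")
```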

Data formats with special considerations

The following data formats may require additional configuration or special consideration for use:

  • Databricks recommends loading images as binary data; see the sketch after this list.
  • XML is not natively supported, but can be used after installing a library.
  • Hive tables are also natively supported by Apache Spark, but require configuration on Azure Databricks.
  • Azure Databricks can directly read many file formats while still compressed. You can also unzip compressed files on Azure Databricks if necessary.
  • LZO requires a codec installation.
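For example, the `binaryFile` reader loads each image file as a row containing its raw bytes and file metadata. A minimal sketch with a placeholder path:

```python
# Load image files as binary data; the path and glob pattern are placeholders.
images = (spark.read.format("binaryFile")
    .option("pathGlobFilter", "*.png")        # limit to one image type
    .load("/tmp/images"))

# Each row contains the file path, metadata, and raw bytes in the `content` column.
images.select("path", "length", "modificationTime").show(truncate=False)
```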

For more information about Apache Spark data sources, see Generic Load/Save Functions and Generic File Source Options.

How do you configure cloud object storage for Azure Databricks?

Azure Databricks uses cloud object storage to store data files and tables. During workspace deployment, Azure Databricks configures a cloud object storage location known as the DBFS root. You can configure connections to other cloud object storage locations in your account.

In almost all cases, the data files you interact with using Apache Spark on Azure Databricks are stored in cloud object storage. See the following articles for guidance on configuring connections:
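As an illustration, once access is configured you address files in cloud object storage by URI. The sketch below assumes an Azure Data Lake Storage Gen2 location; the container and storage account names are placeholders:

```python
# Read from an ADLS Gen2 location after access has been configured.
# The container and storage account names are placeholders.
path = "abfss://my-container@mystorageaccount.dfs.core.windows.net/raw/events"
df = spark.read.format("json").load(path)
df.printSchema()
```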

What data sources connect to Azure Databricks with JDBC?

You can use JDBC to connect to many data sources. Databricks Runtime includes drivers for a number of JDBC databases, but you might need to install a driver or a different driver version to connect to your preferred database. Supported databases include the following:
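A generic JDBC read follows the pattern in the sketch below. The connection URL, table name, credentials, and secret scope are placeholders, and the example assumes the appropriate JDBC driver is available in the runtime:

```python
# Generic JDBC read; all connection details below are placeholders.
jdbc_df = (spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://db.example.com:5432/sales")
    .option("dbtable", "public.orders")
    .option("user", "reader")
    # Retrieve the password from a Databricks secret scope (placeholder scope/key).
    .option("password", dbutils.secrets.get(scope="db", key="reader-password"))
    .load())
```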

What data services does Azure Databricks integrate with?

The following data services require you to configure connection settings, security credentials, and networking settings. You might need administrator or power user privileges in your Azure account or Azure Databricks workspace. Some also require that you create an Azure Databricks library and install it in a cluster: