Load data using a Unity Catalog external location

Important

This feature is in Public Preview.

This article describes how to use the add data UI to create a managed table from data in Azure Data Lake Storage Gen2 using a Unity Catalog external location. An external location is an object that combines a cloud storage path with a storage credential that authorizes access to the cloud storage path.

For other approaches to loading data using external locations, see Create a table from files stored in your cloud tenant.

Before you begin

Before you begin, you must have the following:

File types

The following file types are supported:

  • CSV
  • TSV
  • JSON
  • XML
  • AVRO
  • Parquet

Step 1: Confirm access to the external location

To confirm access to the external location, do the following:

  1. In the sidebar of your Azure Databricks workspace, click Catalog.
  2. In Catalog Explorer, click External Data > External Locations.

Step 2: Create the managed table

To create the managed table, do the following:

  1. In the sidebar of your workspace, click + New > Add data.

  2. In the add data UI, click Azure Data Lake Storage.

  3. Select an external location from the drop-down list.

  4. Select the folders and the files that you want to load into Azure Databricks, and then click Preview table.

  5. Select a catalog and a schema from the drop-down lists.

  6. (Optional) Edit the table name.

  7. (Optional) To set advanced format options by file type, click Advanced attributes, turn off Automatically detect file type, and then select a file type.

    For a list of format options, see the following section.

  8. (Optional) To edit the column name, click the input box at the top of the column.

    Column names don’t support commas, backslashes, or unicode characters (such as emojis).

  9. (Optional) To edit column types, click the icon with the type.

  10. Click Create table.

File type format options

The following format options are available, depending on the file type:

Format option Description Supported file types
Column delimiter The separator character between columns. Only a single character is allowed, and backslash is not supported.

The default is a comma.
CSV
Escape character The escape character to use when parsing the data.

The default is a quotation mark.
CSV
First row contains the header This option specifies whether the file contains a header.

Enabled by default.
CSV
Automatically detect file type Automatically detect file type. Default is true. XML
Automatically detect column types Automatically detect column types from file content. You can edit types in the preview table. If this is set to false, all column types are inferred as STRING.

Enabled by default.
* CSV

* JSON
* XML
Rows span multiple lines Whether a column’s value can span multiple lines in the file.

Disabled by default.
* CSV

* JSON
Merge the schema across multiple files Whether to infer the schema across multiple files and to merge the schema of each file.

Enabled by default.
CSV
Allow comments Whether comments are allowed in the file.

Enabled by default.
JSON
Allow single quotes Whether single quotes are allowed in the file.

Enabled by default.
JSON
Infer timestamp Whether to try to infer timestamp strings as TimestampType.

Enabled by default.
JSON
Rescued data column Whether to save columns that don’t match the schema. For more information, see What is the rescued data column?.

Enabled by default.
* CSV

* JSON
* Avro
* Parquet
Exclude attribute Whether to exclude attributes in elements. Default is false. XML
Attribute prefix The prefix for attributes to differentiate attributes and elements. Default is _. XML

Column data types

The following column data types are supported. For more information about individual data types see SQL data types.

Data Type Description
BIGINT 8-byte signed integer numbers.
BOOLEAN Boolean (true, false) values.
DATE and day, without a time-zone.
DECIMAL (P,S) Numbers with maximum precision P and fixed scale S.
DOUBLE 8-byte double-precision floating point numbers.
STRING Character string values.
TIMESTAMP Values comprising values of fields year, month, day, hour, minute, and second, with the session local timezone.

Known issues

  • You might experience issues with special characters in complex data types, such as a JSON object with a key containing a backtick or a colon.
  • Some JSON files might require that you manually select JSON for the file type. To manually select a file type after you select files, click Advanced attributes, turn off Automatically detect file type, and then select JSON.
  • Nested timestamps and decimals inside complex types might encounter issues.