Load data using a Unity Catalog external location
Important
This feature is in Public Preview.
This article describes how to use the add data UI to create a managed table from data in Azure Data Lake Storage Gen2 using a Unity Catalog external location. An external location is an object that combines a cloud storage path with a storage credential that authorizes access to the cloud storage path.
Before you begin
Before you begin, you must have the following:
- A workspace with Unity Catalog enabled. For more information, see Set up and manage Unity Catalog.
- The `READ FILES` privilege on the external location. For more information, see Create an external location to connect cloud storage to Azure Databricks.
- The `CREATE TABLE` privilege on the schema in which you want to create the managed table, the `USE SCHEMA` privilege on the schema, and the `USE CATALOG` privilege on the parent catalog. For more information, see Unity Catalog privileges and securable objects. (A SQL sketch of granting these privileges follows this list.)
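If an administrator needs to grant these privileges, the following Databricks SQL sketch shows one way to do it. The principal, external location, catalog, and schema names (`data_engineers`, `my_external_location`, `main`, `default`) are placeholders, not values from this article.

```sql
-- Grant the privileges listed above to a hypothetical group.
-- All object names here are placeholders; substitute your own.
GRANT READ FILES ON EXTERNAL LOCATION my_external_location TO `data_engineers`;
GRANT USE CATALOG ON CATALOG main TO `data_engineers`;
GRANT USE SCHEMA ON SCHEMA main.default TO `data_engineers`;
GRANT CREATE TABLE ON SCHEMA main.default TO `data_engineers`;
```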
File types
The following file types are supported:
- CSV
- TSV
- JSON
- XML
- AVRO
- Parquet
Step 1: Confirm access to the external location
To confirm access to the external location, do the following:
- In the sidebar of your Azure Databricks workspace, click Catalog.
- In Catalog Explorer, click External Data > External Locations.
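If you prefer SQL, a quick way to sanity-check the same access is sketched below. The external location name and the `abfss://` path are placeholders; listing files under the location's path requires the `READ FILES` privilege, so a successful result confirms it.

```sql
-- Placeholders: my_external_location and the abfss:// path are illustrative.
SHOW GRANTS ON EXTERNAL LOCATION my_external_location;

-- Succeeds only if you can read files at this path.
LIST 'abfss://my-container@mystorageaccount.dfs.core.windows.net/raw/';
```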
Step 2: Create the managed table
To create the managed table, do the following:
- In the sidebar of your workspace, click + New > Add data.
- In the add data UI, click Azure Data Lake Storage.
- Select an external location from the drop-down list.
- Select the folders and files that you want to load into Azure Databricks, and then click Preview table.
- Select a catalog and a schema from the drop-down lists.
- (Optional) Edit the table name.
- (Optional) To set advanced format options by file type, click Advanced attributes, turn off Automatically detect file type, and then select a file type. For a list of format options, see File type format options below.
- (Optional) To edit a column name, click the input box at the top of the column. Column names don’t support commas, backslashes, or Unicode characters (such as emojis).
- (Optional) To edit a column type, click the icon with the type.
- Click Create table.
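The UI workflow above is roughly equivalent to creating the managed table yourself with Databricks SQL and the `read_files` table-valued function. A minimal sketch, assuming a folder of CSV files and placeholder catalog, schema, table, and path names:

```sql
-- Placeholders throughout: adjust the three-level table name and the path.
CREATE TABLE main.default.sales_raw AS
SELECT *
FROM read_files(
  'abfss://my-container@mystorageaccount.dfs.core.windows.net/raw/sales/',
  format => 'csv',
  header => true
);
```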
File type format options
The following format options are available, depending on the file type:
| Format option | Description | Supported file types |
|---|---|---|
| Column delimiter | The separator character between columns. Only a single character is allowed, and backslash is not supported. The default is a comma. | CSV |
| Escape character | The escape character to use when parsing the data. The default is a quotation mark. | CSV |
| First row contains the header | This option specifies whether the file contains a header. Enabled by default. | CSV |
| Automatically detect file type | Automatically detect the file type. Default is `true`. | XML |
| Automatically detect column types | Automatically detect column types from file content. You can edit types in the preview table. If this is set to `false`, all column types are inferred as `STRING`. Enabled by default. | CSV, JSON, XML |
| Rows span multiple lines | Whether a column’s value can span multiple lines in the file. Disabled by default. | CSV, JSON |
| Merge the schema across multiple files | Whether to infer the schema across multiple files and to merge the schema of each file. Enabled by default. | CSV |
| Allow comments | Whether comments are allowed in the file. Enabled by default. | JSON |
| Allow single quotes | Whether single quotes are allowed in the file. Enabled by default. | JSON |
| Infer timestamp | Whether to try to infer timestamp strings as `TimestampType`. Enabled by default. | JSON |
| Rescued data column | Whether to save columns that don’t match the schema. For more information, see What is the rescued data column?. Enabled by default. | CSV, JSON, Avro, Parquet |
| Exclude attribute | Whether to exclude attributes in elements. Default is `false`. | XML |
| Attribute prefix | The prefix for attributes to differentiate attributes and elements. Default is `_`. | XML |
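If you load data with SQL instead of the UI, several of these options have rough counterparts among the `read_files` reader options. The mapping below is an approximation for CSV files only, and the path is a placeholder:

```sql
-- Approximate counterparts to the UI format options above (CSV case).
SELECT *
FROM read_files(
  'abfss://my-container@mystorageaccount.dfs.core.windows.net/raw/sales/',
  format    => 'csv',
  header    => true,   -- "First row contains the header"
  sep       => ';',    -- "Column delimiter"
  multiLine => true    -- "Rows span multiple lines"
);
```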
Column data types
The following column data types are supported. For more information about individual data types, see SQL data types.

| Data Type | Description |
|---|---|
| `BIGINT` | 8-byte signed integer numbers. |
| `BOOLEAN` | Boolean (`true`, `false`) values. |
| `DATE` | Values comprising values of fields year, month, and day, without a time zone. |
| `DECIMAL(P,S)` | Numbers with maximum precision `P` and fixed scale `S`. |
| `DOUBLE` | 8-byte double-precision floating point numbers. |
| `STRING` | Character string values. |
| `TIMESTAMP` | Values comprising values of fields year, month, day, hour, minute, and second, with the session local time zone. |
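For reference, here is a sketch of a table declared directly with these column data types. The table and column names are made up for illustration and do not come from this article:

```sql
-- Illustrative table using the data types above; all names are placeholders.
CREATE TABLE main.default.orders_typed (
  order_id   BIGINT,
  is_paid    BOOLEAN,
  order_date DATE,
  amount     DECIMAL(10, 2),
  unit_price DOUBLE,
  customer   STRING,
  created_at TIMESTAMP
);
```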
Known issues
- You might experience issues with special characters in complex data types, such as a JSON object with a key containing a backtick or a colon.
- Some JSON files might require that you manually select JSON for the file type. To manually select a file type after you select files, click Advanced attributes, turn off Automatically detect file type, and then select JSON.
- You might encounter issues with nested timestamps and decimals inside complex types.