Create and query catalog tables

So far we've considered Delta Lake table instances created from dataframes and modified through the Delta Lake API. You can also define Delta Lake tables as catalog tables in the metastore and work with them using SQL.

External vs managed tables

Tables in a Spark catalog, including Delta Lake tables, can be managed or external, and it's important to understand the distinction between the two kinds of table (a quick way to check a table's type is shown after the list).

  • A managed table is defined without a specified location, and the data files are stored within the storage used by the metastore. Dropping the table not only removes its metadata from the catalog, but also deletes the folder in which its data files are stored.
  • An external table is defined for a custom file location, where the data for the table is stored. The metadata for the table is defined in the Spark catalog. Dropping the table deletes the metadata from the catalog, but doesn't affect the data files.
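
If you're not sure whether an existing table is managed or external, you can inspect its metadata. The following sketch (assuming a table named MyExternalTable already exists in the catalog) filters the output of DESCRIBE EXTENDED down to its Type row, which shows MANAGED or EXTERNAL:

# Check whether a table is managed or external
# (assumes a table named MyExternalTable exists in the catalog)
table_details = spark.sql("DESCRIBE EXTENDED MyExternalTable")
table_details.filter("col_name = 'Type'").show(truncate=False)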

Creating catalog tables

There are several ways to create catalog tables.

Creating a catalog table from a dataframe

You can create a catalog table by writing a dataframe with the saveAsTable operation. By default the table is managed; specifying a path option creates an external table instead, as shown in the following examples:

# Save a dataframe as a managed table
df.write.format("delta").saveAsTable("MyManagedTable")

# Specify a path option to save as an external table
df.write.format("delta").option("path", "/mydata").saveAsTable("MyExternalTable")

Creating a catalog table using SQL

You can also create a catalog table by using the CREATE TABLE SQL statement with the USING DELTA clause, and an optional LOCATION parameter for external tables. You can run the statement using the Spark SQL API, as in the following example:

spark.sql("CREATE TABLE MyExternalTable USING DELTA LOCATION '/mydata'")

Alternatively, you can use the native SQL support in Spark to run the statement:

%sql

CREATE TABLE MyExternalTable
USING DELTA
LOCATION '/mydata'

Tip

The CREATE TABLE statement returns an error if a table with the specified name already exists in the catalog. To avoid this error, you can use a CREATE TABLE IF NOT EXISTS statement or a CREATE OR REPLACE TABLE statement.
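
For example, both of the following statements succeed even when the table already exists (a sketch that reuses the MyExternalTable definition above):

# Create the table only if it doesn't already exist
spark.sql("CREATE TABLE IF NOT EXISTS MyExternalTable USING DELTA LOCATION '/mydata'")

# Or replace any existing definition with a new one
spark.sql("CREATE OR REPLACE TABLE MyExternalTable USING DELTA LOCATION '/mydata'")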

Defining the table schema

In all of the examples so far, the table is created without an explicit schema. For tables created by writing a dataframe, the schema is inherited from the dataframe. When creating an external table, the schema is inherited from any files that are currently stored in the table location. However, when creating a new managed table, or an external table with a currently empty location, you define the table schema by specifying the column names, types, and nullability as part of the CREATE TABLE statement, as shown in the following example:

%sql

CREATE TABLE ManagedSalesOrders
(
    Orderid INT NOT NULL,
    OrderDate TIMESTAMP NOT NULL,
    CustomerName STRING,
    SalesTotal FLOAT NOT NULL
)
USING DELTA

When using Delta Lake, table schemas are enforced: all inserts and updates must comply with the specified column nullability and data types.
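
For example, an insert that violates a NOT NULL constraint is rejected at write time. This is a minimal sketch against the ManagedSalesOrders table defined above, using hypothetical values:

# Try to insert a row with a NULL OrderDate; Delta Lake rejects the write
# because the column was declared NOT NULL
try:
    spark.sql("INSERT INTO ManagedSalesOrders VALUES (1, NULL, 'Contoso', 99.90)")
except Exception as e:
    print("Insert rejected:", e)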

Using the DeltaTableBuilder API

You can use the DeltaTableBuilder API (part of the Delta Lake API) to create a catalog table, as shown in the following example:

from delta.tables import *

DeltaTable.create(spark) \
  .tableName("default.ManagedProducts") \
  .addColumn("Productid", "INT") \
  .addColumn("ProductName", "STRING") \
  .addColumn("Category", "STRING") \
  .addColumn("Price", "FLOAT") \
  .execute()

Similarly to the CREATE TABLE SQL statement, the create method returns an error if a table with the specified name already exists. You can avoid this error by using the createIfNotExists or createOrReplace method instead.
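
For example, a sketch of the same table definition using createIfNotExists, so the code can be rerun safely:

from delta.tables import *

# Create the table only if it doesn't already exist in the catalog
DeltaTable.createIfNotExists(spark) \
  .tableName("default.ManagedProducts") \
  .addColumn("Productid", "INT") \
  .addColumn("ProductName", "STRING") \
  .addColumn("Category", "STRING") \
  .addColumn("Price", "FLOAT") \
  .execute()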

Using catalog tables

You can use catalog tables like tables in any SQL-based relational database, querying and manipulating them with standard SQL statements. For example, the following code uses a SELECT statement to query the ManagedSalesOrders table:

%sql

SELECT orderid, salestotal
FROM ManagedSalesOrders
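
You can run the same query from PySpark and work with the results as a dataframe; a minimal sketch:

# Query the catalog table from PySpark and display the results
orders_df = spark.sql("SELECT orderid, salestotal FROM ManagedSalesOrders")
orders_df.show()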

Tip

For more information about working with Delta Lake, see Table batch reads and writes in the Delta Lake documentation.