Create catalog tables
So far we've considered Delta Lake table instances created from dataframes and modified through the Delta Lake API. You can also define Delta Lake tables as catalog tables in the Hive metastore for your Spark pool, and work with them using SQL.
External vs managed tables
Tables in a Spark catalog, including Delta Lake tables, can be managed or external, and it's important to understand the distinction between these two kinds of table.
- A managed table is defined without a specified location, and the data files are stored within the storage used by the metastore. Dropping the table not only removes its metadata from the catalog, but also deletes the folder in which its data files are stored.
- An external table is defined for a custom file location, where the data for the table is stored. The metadata for the table is defined in the Spark catalog. Dropping the table deletes the metadata from the catalog, but doesn't affect the data files.
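The practical effect of this distinction shows up when you drop a table. The following is a minimal sketch, assuming a managed table and an external table like the ones created in the examples later in this unit:

# Dropping a managed table removes its metadata AND deletes its data files
spark.sql("DROP TABLE MyManagedTable")

# Dropping an external table removes its metadata only;
# the data files at its location (for example, /mydata) are left intact
spark.sql("DROP TABLE MyExternalTable")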
Creating catalog tables
There are several ways to create catalog tables.
Creating a catalog table from a dataframe
You can create managed tables by writing a dataframe using the saveAsTable operation, as shown in the following examples:
# Save a dataframe as a managed table
df.write.format("delta").saveAsTable("MyManagedTable")

# Specify a path option to save as an external table
df.write.format("delta").option("path", "/mydata").saveAsTable("MyExternalTable")
Creating a catalog table using SQL
You can also create a catalog table by using the CREATE TABLE SQL statement with the USING DELTA clause, and an optional LOCATION parameter for external tables. You can run the statement using the Spark SQL API, as in the following example:
spark.sql("CREATE TABLE MyExternalTable USING DELTA LOCATION '/mydata'")
Alternatively, you can use the native SQL support in Spark to run the statement:
%%sql
CREATE TABLE MyExternalTable
USING DELTA
LOCATION '/mydata'
Tip
The CREATE TABLE statement returns an error if a table with the specified name already exists in the catalog. To mitigate this behavior, you can use a CREATE TABLE IF NOT EXISTS statement or the CREATE OR REPLACE TABLE statement.
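For example, a minimal sketch using the Spark SQL API, so that rerunning the cell doesn't fail if the table already exists:

# No error is raised if MyExternalTable already exists
spark.sql("CREATE TABLE IF NOT EXISTS MyExternalTable USING DELTA LOCATION '/mydata'")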
Defining the table schema
In all of the examples so far, the table is created without an explicit schema. For tables created by writing a dataframe, the table schema is inherited from the dataframe. When creating an external table, the schema is inherited from any files that are currently stored in the table location. However, when creating a new managed table, or an external table with a currently empty location, you define the table schema by specifying the column names, types, and nullability as part of the CREATE TABLE statement, as shown in the following example:
%%sql
CREATE TABLE ManagedSalesOrders
(
    Orderid INT NOT NULL,
    OrderDate TIMESTAMP NOT NULL,
    CustomerName STRING,
    SalesTotal FLOAT NOT NULL
)
USING DELTA
When using Delta Lake, table schemas are enforced: all inserts and updates must comply with the specified column nullability and data types.
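For example, with the ManagedSalesOrders table defined above, an insert that supplies all of the NOT NULL columns succeeds, while one that violates the constraint fails. A minimal sketch with hypothetical values:

# Succeeds: every NOT NULL column has a value of the declared type
spark.sql("INSERT INTO ManagedSalesOrders VALUES (1, TIMESTAMP '2024-01-01 10:00:00', 'Contoso', 120.50)")

# Fails: SalesTotal is declared NOT NULL, so this insert raises an error
# spark.sql("INSERT INTO ManagedSalesOrders VALUES (2, TIMESTAMP '2024-01-01 11:00:00', 'Fabrikam', NULL)")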
Using the DeltaTableBuilder API
You can use the DeltaTableBuilder API (part of the Delta Lake API) to create a catalog table, as shown in the following example:
from delta.tables import *

DeltaTable.create(spark) \
  .tableName("default.ManagedProducts") \
  .addColumn("Productid", "INT") \
  .addColumn("ProductName", "STRING") \
  .addColumn("Category", "STRING") \
  .addColumn("Price", "FLOAT") \
  .execute()
Similarly to the CREATE TABLE SQL statement, the create method returns an error if a table with the specified name already exists. You can mitigate this behavior by using the createIfNotExists or createOrReplace method.
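For example, here's a minimal sketch of the same table definition using createIfNotExists, so the cell can be rerun without error:

from delta.tables import *

# Does nothing if default.ManagedProducts already exists
DeltaTable.createIfNotExists(spark) \
  .tableName("default.ManagedProducts") \
  .addColumn("Productid", "INT") \
  .addColumn("ProductName", "STRING") \
  .addColumn("Category", "STRING") \
  .addColumn("Price", "FLOAT") \
  .execute()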
Using catalog tables
You can use catalog tables like tables in any SQL-based relational database, querying and manipulating them by using standard SQL statements. For example, the following code uses a SELECT statement to query the ManagedSalesOrders table:
%%sql
SELECT orderid, salestotal
FROM ManagedSalesOrders
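Catalog tables are also accessible from Python; for example, you can load one into a dataframe with spark.table. A minimal sketch:

# Load the catalog table into a dataframe for further processing
df = spark.table("ManagedSalesOrders")
df.select("Orderid", "SalesTotal").show()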
Tip
For more information about working with Delta Lake, see Table batch reads and writes in the Delta Lake documentation.