Apache Hive compatibility

Applies to: Databricks Runtime

Apache Spark SQL in Azure Databricks is designed to be compatible with Apache Hive, including metastore connectivity, SerDes, and UDFs.

SerDes and UDFs

Hive SerDes and UDFs are based on Hive 1.2.1.
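For example, a Hive UDF packaged in a JAR can be registered and invoked directly from Spark SQL. A minimal sketch, assuming a hypothetical UDF class com.example.hive.udf.ToUpper and JAR path:

    -- Register a permanent function backed by a Hive UDF class
    -- (class name and JAR path are hypothetical)
    CREATE FUNCTION to_upper AS 'com.example.hive.udf.ToUpper'
      USING JAR '/mnt/libs/hive-udfs.jar';

    -- Call it like any built-in function
    SELECT to_upper(name) FROM my_table;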

Metastore connectivity

See External Apache Hive metastore (legacy) for information on how to connect Azure Databricks to an externally hosted Hive metastore.
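As a rough sketch, such a connection is typically configured through Spark properties on the cluster. The keys below are the common ones for a JDBC-backed metastore; all values are placeholders, and the linked article is the authoritative reference:

    # Hive metastore version and client JARs (values are illustrative)
    spark.sql.hive.metastore.version 2.3.9
    spark.sql.hive.metastore.jars builtin

    # JDBC connection to the external metastore database
    spark.hadoop.javax.jdo.option.ConnectionURL jdbc:mysql://<host>:3306/metastore
    spark.hadoop.javax.jdo.option.ConnectionDriverName org.mariadb.jdbc.Driver
    spark.hadoop.javax.jdo.option.ConnectionUserName <user>
    spark.hadoop.javax.jdo.option.ConnectionPassword <password>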

Supported Hive features

Spark SQL supports the vast majority of Hive features (a combined example appears after this list), such as:

  • Hive query statements, including:
    • SELECT
    • GROUP BY
    • ORDER BY
    • CLUSTER BY
    • SORT BY
  • All Hive expressions, including:
    • Relational expressions (=, <=>, ==, <>, <, >, >=, <=, etc.)
    • Arithmetic expressions (+, -, *, /, %, etc.)
    • Logical expressions (AND, &&, OR, ||, etc.)
    • Complex type constructors
    • Mathematical expressions (sign, ln, cos, etc.)
    • String expressions (instr, length, printf, etc.)
  • User defined functions (UDF)
  • User defined aggregation functions (UDAF)
  • User defined serialization formats (SerDes)
  • Window functions
  • Joins
    • JOIN
    • {LEFT|RIGHT|FULL} OUTER JOIN
    • LEFT SEMI JOIN
    • CROSS JOIN
  • Unions
  • Sub-queries
    • SELECT col FROM (SELECT a + b AS col FROM t1) t2
  • Sampling
  • Explain
  • Partitioned tables including dynamic partition insertion
  • Views
  • Vast majority of DDL statements, including:
    • CREATE TABLE
    • CREATE TABLE AS SELECT
    • ALTER TABLE
  • Most Hive data types, including:
    • TINYINT
    • SMALLINT
    • INT
    • BIGINT
    • BOOLEAN
    • FLOAT
    • DOUBLE
    • STRING
    • BINARY
    • TIMESTAMP
    • DATE
    • ARRAY<>
    • MAP<>
    • STRUCT<>
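These features compose as they do in Hive. A combined sketch, using hypothetical tables sales and regions, exercises a join, a window function, a complex type, and EXPLAIN:

    -- Hypothetical tables for illustration only
    CREATE TABLE regions (region_id INT, name STRING);
    CREATE TABLE sales (id INT, region_id INT, amount DOUBLE, tags ARRAY<STRING>);

    -- JOIN plus a window function over the joined rows
    SELECT r.name,
           s.amount,
           SUM(s.amount) OVER (PARTITION BY r.name ORDER BY s.id) AS running_total
    FROM sales s
    JOIN regions r ON s.region_id = r.region_id;

    -- EXPLAIN prints the plan Spark SQL will execute
    EXPLAIN SELECT region_id, COUNT(*) FROM sales GROUP BY region_id;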

Unsupported Hive functionality

The following sections contain a list of Hive features that Spark SQL doesn’t support. Most of these features are rarely used in Hive deployments.

Major Hive features

  • Writing to a bucketed table created by Hive
  • ACID fine-grained updates

Esoteric Hive features

  • Union type
  • Unique join
  • Column statistics collection: Spark SQL does not currently piggyback scans to collect column statistics; it only supports populating the sizeInBytes field of the Hive metastore.

Hive input and output formats

  • File format for CLI: For results displayed back in the CLI, Spark SQL supports only TextOutputFormat.
  • Hadoop archive

Hive optimizations

A handful of Hive optimizations are not included in Spark. Some of these (such as indexes) are less important due to Spark SQL’s in-memory computational model.

  • Block-level bitmap indexes and virtual columns (used to build indexes).
  • Automatically determine the number of reducers for joins and group-bys: In Spark SQL, you need to control the degree of parallelism post-shuffle using SET spark.sql.shuffle.partitions=[num_tasks]; (see the sketch after this list).
  • Skew data flag: Spark SQL does not follow the skew data flag in Hive.
  • STREAMTABLE hint in join: Spark SQL does not follow the STREAMTABLE hint.
  • Merge multiple small files for query results: If the result output contains many small files, Hive can optionally merge them into fewer, larger files to avoid overflowing the HDFS metadata. Spark SQL does not support that.
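A minimal sketch of setting post-shuffle parallelism by hand (the table is hypothetical):

    -- Override the default of 200 shuffle partitions for a small dataset
    SET spark.sql.shuffle.partitions = 64;

    -- The aggregation's shuffle now runs with 64 post-shuffle tasks
    SELECT region_id, COUNT(*) AS orders
    FROM sales
    GROUP BY region_id;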