What are user-defined functions (UDFs)?

A user-defined function (UDF) is a function defined by a user, allowing custom logic to be reused in the user environment. Azure Databricks supports many types of UDFs for distributing extensible logic. This article introduces some of the general strengths and limitations of UDFs.

Note

Not all forms of UDFs are available in all execution environments on Azure Databricks. If you are working with Unity Catalog, see User-defined functions (UDFs) in Unity Catalog.


Defining custom logic without serialization penalties

Azure Databricks inherits much of its UDF behavior from Apache Spark, including the efficiency limitations of many types of UDFs. See Which UDFs are most efficient?.

You can modularize your code without incurring the efficiency tradeoffs associated with UDFs by defining your logic as a series of Spark built-in methods using SQL or Spark DataFrames. For example, the following SQL and Python functions combine Spark built-in methods to define a unit conversion as a reusable function:

SQL

CREATE FUNCTION convert_f_to_c(unit STRING, temp DOUBLE)
RETURNS DOUBLE
RETURN CASE
  WHEN unit = 'F' THEN (temp - 32) * (5/9)
  ELSE temp
END;

SELECT convert_f_to_c(unit, temp) AS c_temp
FROM tv_temp;

Python

from pyspark.sql.functions import col, when

def convertFtoC(unitCol, tempCol):
  # Uses only built-in column expressions, so no Python UDF is involved
  return when(unitCol == "F", (tempCol - 32) * (5/9)).otherwise(tempCol)

df_query = df.select(convertFtoC(col("unit"), col("temp"))).toDF("c_temp")
display(df_query)

To run the functions above, create the sample dataset described in Example data for example UDFs.

Which UDFs are most efficient?

UDFs might introduce significant processing bottlenecks into code execution. Azure Databricks automatically applies a number of optimizers to code written with built-in Apache Spark, SQL, and Delta Lake syntax. When UDFs introduce custom logic, these optimizers cannot efficiently plan tasks around it. In addition, logic that executes outside the JVM incurs additional data serialization costs.

Note

Azure Databricks optimizes many functions using Photon if you use Photon-enabled compute. Only functions that chain together Spark SQL or DataFrame commands can be optimized by Photon.

Some UDFs are more efficient than others. In terms of performance:

  • Built-in functions will be fastest because of Azure Databricks optimizers.
  • Code that executes in the JVM (Scala, Java, Hive UDFs) will be faster than Python UDFs.
  • Pandas UDFs use Arrow to reduce serialization costs associated with Python UDFs.
  • Python UDFs work well for procedural logic, but should be avoided for production ETL workloads on large datasets.
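
To illustrate the difference between the last two items, the following is a minimal sketch, not a definitive implementation, of the same temperature conversion written as a Python UDF and as a pandas UDF. The names f_to_c, build_udfs, scalar_udf, and vectorized_udf are hypothetical and chosen for this example only. The sketch assumes the unit and temp columns from the sample dataset used elsewhere in this article.

```python
def f_to_c(unit, temp):
    """Row-at-a-time conversion logic in plain Python (hypothetical helper)."""
    if temp is None:
        return None
    return (temp - 32) * 5 / 9 if unit == "F" else temp


def build_udfs():
    """Wrap the same logic as a Python UDF and as a pandas UDF.

    Imports are local so that f_to_c stays usable without a Spark session.
    """
    import pandas as pd
    from pyspark.sql.functions import pandas_udf, udf
    from pyspark.sql.types import DoubleType

    # Python UDF: Spark calls f_to_c once per row.
    scalar_udf = udf(f_to_c, DoubleType())

    # pandas UDF: Spark passes batches of rows as pandas Series,
    # transferred with Arrow, so the arithmetic runs on whole batches.
    @pandas_udf("double")
    def vectorized_udf(unit: pd.Series, temp: pd.Series) -> pd.Series:
        # Keep temp where the unit is not "F"; otherwise use the converted value.
        return temp.where(unit != "F", (temp - 32) * 5 / 9)

    return scalar_udf, vectorized_udf
```

With a DataFrame that has unit and temp columns, either function returned by build_udfs() can be applied in a select, for example df.select(vectorized_udf("unit", "temp").alias("c_temp")). Both variants still run outside the JVM, so the built-in version shown earlier remains the preferred option where it is available.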

Note

In Databricks Runtime 12.2 LTS and below, Python scalar UDFs and Pandas UDFs are not supported in Unity Catalog on clusters that use shared access mode. These UDFs are supported in Databricks Runtime 13.3 LTS and above for all access modes.

In Databricks Runtime 14.1 and below, Scala scalar UDFs are not supported in Unity Catalog on clusters that use shared access mode. These UDFs are supported for all access modes in Databricks Runtime 14.2 and above.

In Databricks Runtime 13.3 LTS and above, you can register scalar Python UDFs to Unity Catalog using SQL syntax. See User-defined functions (UDFs) in Unity Catalog.
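
As a rough sketch of what registering a scalar Python UDF with SQL syntax might look like, the following Python helper builds a CREATE FUNCTION statement and submits it through a Spark session. The function name convert_f_to_c_py, the helpers create_function_sql and register, and the catalog and schema arguments are all assumptions made for illustration. Consult the Unity Catalog UDF article for the authoritative syntax and requirements.

```python
# Body of the UDF as it would appear between the $$ delimiters.
UDF_BODY = """
if temp is None:
    return None
return (temp - 32) * 5 / 9 if unit == "F" else temp
"""


def create_function_sql(catalog, schema, name="convert_f_to_c_py"):
    """Build a CREATE FUNCTION statement for a scalar Python UDF.

    The catalog, schema, and function names are placeholders.
    """
    return (
        f"CREATE OR REPLACE FUNCTION {catalog}.{schema}.{name}"
        "(unit STRING, temp DOUBLE)\n"
        "RETURNS DOUBLE\n"
        "LANGUAGE PYTHON\n"
        f"AS $$\n{UDF_BODY.strip()}\n$$"
    )


def register(spark, catalog, schema):
    """Run the statement with an existing Spark session."""
    spark.sql(create_function_sql(catalog, schema))
```

Calling register(spark, "main", "default") would submit the statement, assuming a catalog and schema with those names exist and you have permission to create functions in them.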

| Type | Optimized | Execution environment |
|---|---|---|
| Hive UDF | No | JVM |
| Python UDF | No | Python |
| Pandas UDF | No | Python (Arrow) |
| Scala UDF | No | JVM |
| Spark SQL | Yes | JVM (Photon) |
| Spark DataFrame | Yes | JVM (Photon) |

When should you use a UDF?

A major benefit of UDFs is that they allow users to express logic in familiar languages, reducing the human cost associated with refactoring code. For ad hoc queries, manual data cleansing, exploratory data analysis, and most operations on small or medium-sized datasets, latency overhead costs associated with UDFs are unlikely to outweigh costs associated with refactoring code.

For ETL jobs, streaming operations, operations on very large datasets, or other workloads that are executed regularly or continuously, refactoring logic to use native Apache Spark methods quickly pays dividends.

Example data for example UDFs

The code examples in this article use UDFs to convert temperatures between Celsius and Fahrenheit. If you wish to execute these functions, you can create a sample dataset with the following Python code:

import numpy as np
import pandas as pd

Fdf = pd.DataFrame(np.random.normal(55, 25, 10000000), columns=["temp"])
Fdf["unit"] = "F"

Cdf = pd.DataFrame(np.random.normal(10, 10, 10000000), columns=["temp"])
Cdf["unit"] = "C"

df = spark.createDataFrame(pd.concat([Fdf, Cdf]).sample(frac=1))

df.cache().count()
df.createOrReplaceTempView("tv_temp")