Photon runtime

Photon is the native vectorized query engine on Azure Databricks, written to be directly compatible with Apache Spark APIs so it works with your existing code. It is developed in C++ to take advantage of modern hardware, and uses the latest techniques in vectorized query processing to capitalize on data- and instruction-level parallelism in CPUs, enhancing performance on real-world data and applications-—all natively on your data lake. Photon is part of a high-performance runtime that runs your existing SQL and DataFrame API calls faster and reduces your total cost per workload. Photon is used by default in Databricks SQL warehouses.

Azure Databricks clusters

Photon is available for clusters running Databricks Runtime 9.1 LTS and above.

To enable Photon acceleration, select the Use Photon Acceleration checkbox when you create the cluster. If you create the cluster using the clusters API, set runtime_engine to PHOTON.

Photon supports a number of instance types on the driver and worker nodes. Photon instance types consume DBUs at a different rate than the same instance type running the non-Photon runtime. For more information about Photon instances and DBU consumption, see the Azure Databricks pricing page.

Photon advantages

  • Supports SQL and equivalent DataFrame operations against Delta and Parquet tables.
  • Accelerates queries that process a significant amount of data (100GB+) and include aggregations and joins.
  • Faster performance when data is accessed repeatedly from the disk cache.
  • More robust scan performance on tables with many columns and many small files.
  • Faster Delta and Parquet writing using UPDATE, DELETE, MERGE INTO, INSERT, and CREATE TABLE AS SELECT, especially for wide tables (hundreds to thousands of columns).
  • Replaces sort-merge joins with hash-joins.

Photon coverage

Operators

  • Scan, Filter, Project
  • Hash Aggregate/Join/Shuffle
  • Nested-Loop Join
  • Null-Aware Anti Join
  • Union, Expand, ScalarSubquery
  • Delta/Parquet Write Sink
  • Sort
  • Window Function

Expressions

  • Comparison / Logic
  • Arithmetic / Math (most)
  • Conditional (IF, CASE, etc.)
  • String (common ones)
  • Casts
  • Aggregates(most common ones)
  • Date/Timestamp

Data types

  • Byte/Short/Int/Long
  • Boolean
  • String/Binary
  • Decimal
  • Float/Double
  • Date/Timestamp
  • Struct
  • Array
  • Map

The following table lists supported Azure Databricks expressions and the minimum Databricks Runtime release version that supports it.

Name
Abs
Acos
Add
AddMonths
AesDecrypt
AesEncrypt
And
ArrayContains
ArrayDistinct
ArrayExcept
ArrayExists
ArrayFilter
ArrayForAll
ArrayIntersect
ArrayJoin
ArraySize
ArrayTransform
ArrayUnion
Atan
Atan2
Average
Base64
Bin
BitAndAgg
BitLength
BitOrAgg
BitwiseAnd
BitwiseNot
BitwiseOr
BitwiseReverse
BitwiseXor
BitXorAgg
BoundaryAsGeojson
BoundaryAsWkb
BoundaryAsWkt
Cast
Cbrt
CeilExpressionBuilder
CenterAsGeojson
CenterAsWkb
CenterAsWkt
Chr
Coalesce
CollectList
Concat
ConcatWs
Conv
Cos
Count
CreateArray
CreateMap
CreateNamedStruct
CreateStruct
CurrentCatalog
CurrentDatabase
CurrentDate
CurrentTimestamp
CurrentTimeZone
CurrentUser
DateAdd
DateDiff
DateFormatClass
DateFromUnixDate
DateSub
DayOfMonth
DayOfWeek
DayOfYear
Decode
DenseRank
Divide
ElementAt
EqualNullSafe
EqualTo
Exp
Explode
Extract
First
FloorExpressionBuilder
FromUnixTime
FromUTCTimestamp*
Get
GetJsonObject
GreaterThan
GreaterThanOrEqual
Greatest
GridDistance
H3ToString
Hex
Hour
If
In
InitCap
InputFileBlockLength
InputFileBlockStart
InputFileName
InSet
IntegralDivide
IsChildOf
IsNaN
IsNotNull
IsNull
IsPentagon
IsValid
JsonToStructs
Lag
Last
LastDay
Lead
Least
Length
LengthOfJsonArray
LessThan
Levenshtein
Like
Log
Log2
LongLatAsH3
LongLatAsH3String
Lower
LPadExpressionBuilder
MakeDate
MakeTimestamp
Max
MaxChild
Md5
MicrosToTimestamp
MillisToTimestamp
Min
MinChild
Minute
MonotonicallyIncreasingID
Month
MonthsBetween
Multiply
Murmur3Hash
NaNvl
NextDay
Not
Now
NthValue
NTile
NullIf
Nvl
Nvl2
OctetLength
ParseToDate
ParseToTimestamp
Percentile
PercentRank
Pi
Pmod
PosExplode
Pow
Quarter
Rand
Rank
RegExpExtract
RegExpExtractAll
RegExpReplace
RegrAvgX
RegrAvgY
Remainder
Resolution
Reverse
Reverse
RLike
Round
RowNumber
RPadExpressionBuilder
Second
SecondsToTimestamp
Sha1
Sha2
ShiftLeft
ShiftRight
ShiftRightUnsigned
Sin
Size
Slice
SoundEx
SparkVersion
Sqrt
StddevPop
StddevSamp
StringInstr
StringLocate
StringRepeat
StringSpace
StringSplit
StringToH3
StringTranslate
StringTrim
StringTrimBoth
StringTrimLeft
StringTrimRight
StructsToJson
Substring
Subtract
Sum
Tan
ToChildren
ToParent
ToRadians
ToUnixTimestamp
ToUTCTimestamp
TruncDate
TruncTimestamp
TryElementAt
TryValidate
UnaryMinus
UnBase64
Unhex
UnixDate
UnixMicros
UnixMillis
UnixSeconds
UnixTimestamp
Upper
Uuid
Validate
VarianceSamp
WeekDay
WeekOfYear
XxHash64
Year
*from_utc_timestamp is not fully supported by Photon. See from_utc_timestamp for more information.

Limitations

  • Structured Streaming: Photon currently supports stateless streaming with Delta, Parquet, and CSV. Kafka and Kinesis support is in Public Preview
  • Does not support UDFs.
  • Does not support RDD APIs.
  • Not expected to improve short-running queries (<2 seconds), for example, queries against small amounts of data.

Features not supported by Photon run the same way they would with Databricks Runtime; there is no performance advantage for those features.