
Serverless job typed dataset transformations fail

David Regan 0 Reputation points
2026-03-20T16:06:00.1633333+00:00

I've been experiencing exceptions when trying to run a custom Scala JAR in a serverless configuration.

Eventually, I found that typed dataset operations cause the error.

I've put together a minimal repro of the situation.

I write some test data to a Delta table ("main.temp.tiny_elements").

I then run 4 jobs in parallel that do a streaming read from this table, apply some basic transformations, and write the results to a temp table (to materialize the operations).

DataFrame processing works, but any operation that manipulates typed datasets fails.

The four test cases are:

  1. ReadDataFrameAndWrite: takes the streamed DataFrame and writes it to a temp table
  2. ReadDataFrameAsDataSetAndWrite: takes the streamed DataFrame, treats it as a typed Dataset and writes it to a temp table
  3. ReadDataFrameFilterDataSetAndWrite: takes the streamed DataFrame, treats it as a typed Dataset, filters the Dataset and writes it to a temp table
  4. ReadDataFrameMapDataSetAndWrite: takes the streamed DataFrame, treats it as a typed Dataset, maps the Dataset and writes it to a temp table

Tests 1 and 2 work; tests 3 and 4 fail and return the attached exception (which I believe is a secondary exception masking the underlying issue).
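For context, here is a minimal sketch of the failing typed path. This is not the attached code; the case class, its field names, the checkpoint path, and the output table name are all illustrative assumptions:

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Illustrative element type; assume the input table has matching columns.
case class TinyElement(id: Long, label: String)

object ReadDataFrameMapDataSetAndWrite {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().getOrCreate()
    import spark.implicits._

    // Streaming read from the input Delta table.
    val df = spark.readStream.table("main.temp.tiny_elements")

    // Treat the DataFrame as a typed Dataset and map it.
    // This is the step that fails in tests 3 and 4; tests 1 and 2
    // stop before any typed transformation runs.
    val ds: Dataset[TinyElement] = df.as[TinyElement]
    val mapped = ds.map(e => e.copy(label = e.label.toUpperCase))

    // Materialize the result by writing to a temp table.
    mapped.writeStream
      .option("checkpointLocation", "/tmp/checkpoints/map_dataset") // assumed path
      .toTable("main.temp.tiny_elements_out") // assumed output table
      .awaitTermination()
  }
}
```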

Attached is a zip of:

  • A Scala project to build the JAR file for the assembly to add to the serverless environment. Simply run "sbt assembly" to build net.opentrading.databricks-assembly-1.1.jar
  • YAML for the test jobs. This will need to be adjusted for the location of the JAR dependency.
  • A test Python notebook to create test data in the input Delta table
Azure Databricks

An Apache Spark-based analytics platform optimized for Azure.


1 answer

  1. Vinodh247 42,051 Reputation points MVP Volunteer Moderator
    2026-03-21T17:13:16.7733333+00:00

    Hi,

    Thanks for reaching out to Microsoft Q&A.

    This is a known limitation rather than a bug in your code. In serverless Spark environments (especially Databricks Serverless), typed Dataset operations (map, filter with case classes, encoders) rely on JVM-level serialization (encoders, closures, bytecode generation). Serverless isolates execution and restricts parts of the JVM execution model, so these encoder-based transformations often fail at runtime, especially in Structured Streaming.

    Why your tests behave this way:

    • Test 1 (DataFrame) -> works because it uses Catalyst + Tungsten (no JVM object encoding)
    • Test 2 (as[Type] only) -> works because no transformation is executed yet
    • Test 3 & 4 (filter, map) -> fail because they trigger encoder-based execution + closure serialization, which is not fully supported in serverless (see the plan sketch below)
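You can see this difference directly in the physical plan. A minimal sketch, using a batch read for simplicity; the TinyElement case class and its fields are assumptions matching the repro sketch above:

```scala
import org.apache.spark.sql.SparkSession

// Assumed to match the schema of main.temp.tiny_elements.
case class TinyElement(id: Long, label: String)

object PlanComparison {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().getOrCreate()
    import spark.implicits._

    val df = spark.table("main.temp.tiny_elements")

    // as[T] alone: the physical plan is the same as the DataFrame plan;
    // nothing is deserialized into JVM objects yet.
    df.as[TinyElement].explain()

    // A typed map: the plan now contains DeserializeToObject and
    // SerializeFromObject nodes plus a serialized closure -- the
    // JVM-level path that serverless restricts.
    df.as[TinyElement].map(e => e.copy(label = e.label.trim)).explain()
  }
}
```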

    The exception you see is typically a wrapper that hides the real issue (an unsupported encoder / serialization path).

    Bottom line: Typed Datasets are not reliably supported in serverless streaming jobs. This is by design in current implementations.

    What you should do:

    • Stick to DataFrame APIs (select, withColumn, where) on serverless
    • Avoid map, flatMap, and strongly typed filter
    • If you need typed logic, either:
      • switch to standard (non-serverless) clusters, or
      • rewrite the logic using SQL/DataFrame expressions (see the sketch below)
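As a sketch of that rewrite, assuming the hypothetical id and label columns from the repro above (the typed version is shown in the comment; paths and table names are illustrative):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, upper}

object DeclarativeRewrite {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().getOrCreate()
    val df = spark.readStream.table("main.temp.tiny_elements")

    // Typed version (fails on serverless):
    //   df.as[TinyElement].filter(_.id > 0)
    //     .map(e => e.copy(label = e.label.toUpperCase))

    // Declarative equivalent that stays entirely inside Catalyst:
    val result = df
      .where(col("id") > 0)
      .withColumn("label", upper(col("label")))

    result.writeStream
      .option("checkpointLocation", "/tmp/checkpoints/declarative") // assumed path
      .toTable("main.temp.tiny_elements_out") // assumed output table
      .awaitTermination()
  }
}
```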
      
    

    Architectural takeaway (important for you as a data architect): serverless Spark is optimised for declarative transformations, not JVM-level functional transformations. Treat it as closer to a SQL engine plus distributed optimizer than a full Scala runtime.

     

    Please 'Upvote' (thumbs-up) and 'Accept as answer' if the reply was helpful. This will benefit other community members who face the same issue.

