Error message "com.databricks.sql.cloudfiles.errors.CloudFilesIOException: Failed to write to the schema log at location". Databricks notebook is encountering an issue while writing to the schema log in Databricks Cloud Files.

Anna Louise Juul Willumsen 15 Reputation points
2023-02-09T14:13:58.14+00:00

Hello everyone and nice to meet you! :-)

Does anyone have a clue what could be causing the following error message? It concerns configuring schema inference and evolution in Auto Loader (see Configure schema inference and evolution in Auto Loader - Azure Databricks | Microsoft Learn).

Specifically, it occurs when running something similar to the following commands in a Python file deployed with Terraform:

(spark.readStream.format("cloudFiles")
  .option("cloudFiles.format", "parquet")
  # The schema location directory keeps track of your data schema over time
  .option("cloudFiles.schemaLocation", "<path_to_checkpoint>")
  .load("<path_to_source_data>")
)
2 answers

  1. Anna Louise Juul Willumsen 15 Reputation points
    2023-02-15T12:38:43.6933333+00:00

    This is my code:

    basePath = f"/mnt/raw/{system_name}/"
    baseCheckpointPath = f"{basePath}_____checkpoints/"
    baseSchemasPath = f"{basePath}_____autoloaderSchemas/"
    # COMMAND ----------
    def stream_csv_table_from_tablename(tableName):
        tableDf = (
            spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "csv")
            # The schema location directory keeps track of your data schema over time
            .option("cloudFiles.schemaLocation", f"{baseSchemasPath}{tableName}")
            .option("cloudFiles.inferColumnTypes", True)
            .option("header", True)
            .option("delimiter", ",")
            .load(f"{basePath}/{tableName}/")
        )
        return tableDf
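
    For reference, a write side consuming this helper might look like the sketch below, mirroring the documentation snippet further down; the table name and target path are hypothetical:

    # Sketch: persist the streaming DataFrame, tracking progress in a checkpoint
    # directory ("customers" and the target path are hypothetical examples).
    tableDf = stream_csv_table_from_tablename("customers")
    (tableDf.writeStream
        .option("checkpointLocation", f"{baseCheckpointPath}customers")
        .start(f"{basePath}customers_output/")
    )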
    

    This is the sample/documentation code:

    (spark.readStream.format("cloudFiles")
      .option("cloudFiles.format", "parquet")
      # The schema location directory keeps track of your data schema over time
      .option("cloudFiles.schemaLocation", "<path_to_checkpoint>")
      .load("<path_to_source_data>")
      .writeStream
      .option("checkpointLocation", "<path_to_checkpoint>")
      .start("<path_to_target")
    )
    

    The <path_to_target> (csv table directory) is an existing location, and I have the necessary permissions for this folder/directory. The <path_to_checkpoint> (schema information location) does not exist yet, though; as far as I have understood, it should be created automatically when the notebook runs. Could this be the cause of the error?
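
    One quick way to rule that out (a sketch; "my_table" is a hypothetical table name) is to create the schema directory yourself and write a small test file to it before starting the stream. Auto Loader normally creates this directory automatically, so a failure here points at permissions or the storage account rather than a missing folder:

    # Sketch: verify the schema location can be created and written to.
    # "my_table" is a hypothetical table name.
    schemaPath = f"{baseSchemasPath}my_table"
    dbutils.fs.mkdirs(schemaPath)                             # create the directory
    dbutils.fs.put(f"{schemaPath}/_write_test", "ok", True)   # write a small test file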

    Our runtime version is higher than the suggested one.


  2. Christian Binderkrantz 0 Reputation points
    2023-03-28T08:39:18.04+00:00

    I had a similar issue but solved it by enabling "Hierarchical namespace" on the storage account I used.
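
    If you want to confirm whether it is already enabled before changing anything, one option is the sketch below using the azure-mgmt-storage SDK; the subscription, resource group, and account names are placeholders:

    # Sketch: check whether Hierarchical Namespace (ADLS Gen2) is enabled on the account.
    # Assumes the azure-identity and azure-mgmt-storage packages are installed;
    # all angle-bracketed values are placeholders.
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.storage import StorageManagementClient

    client = StorageManagementClient(DefaultAzureCredential(), "<subscription_id>")
    account = client.storage_accounts.get_properties("<resource_group>", "<storage_account>")
    print("Hierarchical namespace enabled:", account.is_hns_enabled)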
