Data lake not preserving NOT NULL constraint on save and load

AJITH KUMAR RAI 0 Reputation points
2024-02-13T05:05:06.7+00:00

Hi Team, the data lake is not preserving the NOT NULL constraint while saving data.

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types._

    val currencySchema = StructType(List(
      StructField("CurrencyCode", StringType, nullable = false),
      StructField("CurrencyName", StringType, nullable = true),
      StructField("CurrencyPrecision", IntegerType, nullable = true),
      StructField("FractionalPart", IntegerType, nullable = false)
    ))

    // Sample data
    val data = Seq(
      Row("USD", "US Dollar", null, 2),
      Row("EUR", "Euro", 3, 2),
      Row("GBP", null, 2, 2),
      Row("JPY", "Japanese Yen", 0, 2)
    )

    val df = spark.createDataFrame(spark.sparkContext.parallelize(data), currencySchema)
    df.printSchema()

Now, after saving to the lake and reading it back, the NOT NULL constraint is not the same.

For example:

On save, the schema is:

    root
     |-- CurrencyCode: string (nullable = false)
     |-- CurrencyName: string (nullable = true)
     |-- CurrencyPrecision: integer (nullable = true)
     |-- FractionalPart: integer (nullable = false)

On read, the schema is:

    root
     |-- CurrencyCode: string (nullable = true)
     |-- CurrencyName: string (nullable = true)
     |-- CurrencyPrecision: integer (nullable = true)
     |-- FractionalPart: integer (nullable = true)
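For reference, the save/read round trip that produces this looks like the following (a minimal sketch; the exact write/read calls were not shown above, and `<lake-path>` is a placeholder):

    df.write.format("parquet").mode("overwrite").save("<lake-path>")

    val readBack = spark.read.format("parquet").load("<lake-path>")
    readBack.printSchema()  // every column now reports nullable = true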

How can I preserve the nullable property?

After this, I also want to create a table on this lake path that preserves the NOT NULL constraint.

Azure Data Lake Storage
Azure Synapse Analytics

2 answers

  1. Amira Bedhiafi 33,071 Reputation points Volunteer Moderator
    2024-02-13T10:15:36.0066667+00:00

    You can enforce NOT NULL constraints when reading the data back into Spark by explicitly specifying the schema with the nullable property set as required. This won't prevent nulls at the storage level, but it may cause Spark to raise errors at runtime if null values are encountered in columns marked as non-nullable.

    import org.apache.spark.sql.types._

    // Define the schema with NOT NULL constraints where applicable
    val enforcedSchema = StructType(List(
      StructField("CurrencyCode", StringType, nullable = false),
      StructField("CurrencyName", StringType, nullable = true),
      StructField("CurrencyPrecision", IntegerType, nullable = true),
      StructField("FractionalPart", IntegerType, nullable = false)
    ))

    // Read the data back with the enforced schema
    val dfEnforced = spark.read.schema(enforcedSchema).format("parquet").load("path_to_your_data_lake")
    

    Before saving your data, perform validation to ensure there are no null values in columns that should be non-nullable. This can be done with DataFrame operations that filter out or fix records which do not meet the schema constraints, as sketched below.
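    A minimal sketch of such a validation pass, assuming `df` and the `currencySchema` from the question are in scope (the names are illustrative):

    import org.apache.spark.sql.functions.col

    // Columns the schema declares as non-nullable
    val nonNullableCols = currencySchema.fields.filter(!_.nullable).map(_.name)

    // Keep only rows where every non-nullable column is populated
    val validated = nonNullableCols.foldLeft(df) { (acc, c) =>
      acc.filter(col(c).isNotNull)
    }

    Alternatively, you can use an external table that declares the constraints: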

    CREATE EXTERNAL TABLE your_table_name (
      CurrencyCode STRING NOT NULL,
      CurrencyName STRING,
      CurrencyPrecision INT,
      FractionalPart INT NOT NULL
    )
    WITH (
      LOCATION = 'path/to/your/data/in/adls',
      DATA_SOURCE = your_external_data_source,
      FILE_FORMAT = your_file_format
    );
    

    1 person found this answer helpful.

  2. Smaran Thoomu 24,110 Reputation points Microsoft External Staff Moderator
    2024-02-16T11:55:47.26+00:00

    @AJITH KUMAR RAI To create a table on the Azure Data Lake database with a NOT NULL constraint, you can use the following SQL statement:

    CREATE TABLE currencySchema.currencytbl (
        CurrencyCode string NOT NULL,
        CurrencyName string,
        CurrencyPrecision int,
        FractionalPart int NOT NULL
    )
    USING DELTA
    LOCATION '<your-lake-path>'
    

    This statement creates a table named currencytbl in the currencySchema schema with the CurrencyCode and FractionalPart columns set to NOT NULL, matching the non-nullable columns in your schema. The table is created using the Delta format and points at the specified lake path.
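    Once the Delta table exists, Delta Lake enforces the declared NOT NULL constraints at write time. A minimal sketch, assuming the Delta Lake connector is available and `<your-lake-path>` is the same location as above:

    // Appends that violate a declared NOT NULL constraint fail at write time
    df.write.format("delta").mode("append").save("<your-lake-path>")
    // A row with a null CurrencyCode or FractionalPart would cause this
    // write to fail with a constraint (invariant) violation error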

    If you are still facing issues with the NOT NULL constraint even when the data looks correct, you can validate the DataFrame before writing it to the lake. Use the isNull method to check whether a column contains null values and take appropriate action to handle them. Here's an example code snippet:

    // Validate the data in the DataFrame before writing it to the lake
    if (df.filter(df("CurrencyPrecision").isNull || df("FractionalPart").isNull).count() > 0) {
      // Handle null values in the DataFrame
      // ...
    } else {
      // Write the DataFrame to the Azure Data Lake
      df.write.format("parquet").mode("overwrite").save("<your-lake-path>")
    }
    

    This code snippet checks whether the CurrencyCode and FractionalPart columns contain null values. If null values are present, they need to be handled before the write (one option is sketched below); if not, the DataFrame is written to the Azure Data Lake.
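    For example, one way to handle the offending rows (a sketch; whether you drop rows or fill defaults depends on your data rules):

    // Drop rows that have nulls in the non-nullable columns...
    val cleaned = df.na.drop(Seq("CurrencyCode", "FractionalPart"))
    // ...or fill a default value for a numeric column instead
    val filled = df.na.fill(Map("FractionalPart" -> 0))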

    I hope this helps. Let me know if you have any further questions or concerns.

