Data lake not preserving Not NULL Constraint on Save and Load

Question

Data lake not preserving Not NULL Constraint on Save and Load

AJITH KUMAR RAI 0

Hi Team, Data lake not preserving not null constraint while saving data.

val currencySchema = StructType(List( StructField("CurrencyCode", StringType, nullable = false), StructField("CurrencyName", StringType, nullable = true), StructField("CurrencyPrecision", IntegerType, nullable = true), StructField("FractionalPart", IntegerType, nullable = false) ))

// Sample data val data = Seq( Row("USD", "US Dollar", null, 2), Row("EUR", "Euro", 3, 2), Row("GBP", null, 2, 2), Row("JPY", "Japanese Yen", 0, 2)//,

)

val df = spark.createDataFrame(spark.sparkContext.parallelize(data), exrate02currencySchema) df.printSchema()

Now after saving to lake and trying to read lake both NOT NULL Constraint is same.

For example:

On save schema is root |-- **CurrencyCode: string (nullable = false) ** |-- CurrencyName: string (nullable = true) |-- CurrencyPrecision: integer (nullable = true) |-- FractionalPart: integer (nullable = false)

**On Read Schema is ** root |-- CurrencyCode: string (nullable = true) |-- CurrencyName: string (nullable = true) |-- CurrencyPrecision: integer (nullable = true) |-- FractionalPart: integer (nullable = true)

How to preserve the nullable property

After this I want to create table on this lake path by preserving the NOT NULL Constraint also.

2 answers

Your answer

Answer 1

Amira Bedhiafi 33,631 Volunteer Moderator

You can enforce NOT NULL constraints when reading the data back into Spark by explicitly specifying the schema with the nullable property set as required. This won't prevent nulls at the storage level but will cause Spark to throw errors if null values are encountered in columns marked as non-nullable.

// Define the schema with NOT NULL constraints where applicable
val enforcedSchema = StructType(List(
  StructField("CurrencyCode", StringType, nullable = false),
  StructField("CurrencyName", StringType, nullable = true),
  StructField("CurrencyPrecision", IntegerType, nullable = true),
  StructField("FractionalPart", IntegerType, nullable = false)
))
// Read the data with the enforced schema
val dfEnforced = spark.read.schema(enforcedSchema).format("parquet").load("path_to_your_data_lake")

Before saving your data, perform data validation to ensure that there are no null values in columns that should be non-nullable. This can be done using DataFrame operations to filter out or fix records that do not meet the schema constraints. Also , you can use the external tables :

CREATE EXTERNAL TABLE your_table_name (
  CurrencyCode STRING NOT NULL,
  CurrencyName STRING,
  CurrencyPrecision INT,
  FractionalPart INT NOT NULL
)
WITH (
  LOCATION = 'path/to/your/data/in/adls',
  DATA_SOURCE = your_external_data_source,
  FILE_FORMAT = your_file_format
);

AJITH KUMAR RAI 0 Reputation points

2024-02-14T06:08:00.09+00:00
Thank you for response.

Here it is a Lake database not a external table.
How to achieve Not Null Constraint with Lake database as mentioned below
It is falling because of NOT NULL even though data present.
always expecting NULL Fields.

Could you please provide information on below one.

%%sql CREATE TABLE testSchema.samplecurrency ( CurrencyCode string, CurrencyName string, CurrencyPrecision int NOT NULL, FractionalPart int NOT NULL ) USING DELTA LOCATION 'path'
AJITH KUMAR RAI 0 Reputation points

2024-02-14T06:10:47.5966667+00:00
Thank you for response.
Here we are trying to create Lake data base table not external table. Like below but it is not working as expected always data is NULL from lake path fields.

CREATE TABLE currencySchema.currencytbl ( CurrencyCode string, CurrencyName string, CurrencyPrecision int NOT NULL, FractionalPart int NOT NULL ) USING DELTA LOCATION '
AJITH KUMAR RAI 0 Reputation points

2024-02-15T05:17:09.0533333+00:00

Hi @Amira Bedhiafi Thank you for response. Here it is a Lake database not a external table.
How to achieve Not Null Constraint with Lake database as mentioned below
It is falling because of NOT NULL even though data present.
always expecting NULL Fields. Could you please provide information on below one. %%sql CREATE TABLE testSchema.samplecurrency ( CurrencyCode string, CurrencyName string, CurrencyPrecision int NOT NULL, FractionalPart int NOT NULL ) USING DELTA LOCATION 'path'

Answer 2

Smaran Thoomu 24,260 Microsoft External Staff Moderator

AJITH KUMAR RAI To create a table on the Azure Data Lake database with NOT NULL constraint, you can use the following SQL statement:

CREATE TABLE currencySchema.currencytbl (
    CurrencyCode string,
    CurrencyName string,
    CurrencyPrecision int NOT NULL,
    FractionalPart int NOT NULL
)
USING DELTA
LOCATION '<your-lake-path>'

This statement creates a table named currencytbl in the currencySchema schema with the CurrencyPrecision and FractionalPart columns set to NOT NULL. The table is created using the Delta format and is located at the specified lake path.

If you are still facing issues with the NOT NULL constraint even though data is present, you can try to validate the data in the DataFrame before writing it to the lake. You can use the isNull method to check if a column contains null values and take appropriate actions to handle them. Here's an example code snippet that demonstrates how to validate the data in the DataFrame before writing it to the lake:

// Validate the data in the DataFrame before writing it to the lake
if (df.filter(df("CurrencyPrecision").isNull || df("FractionalPart").isNull).count() > 0) {
  // Handle null values in the DataFrame
  // ...
} else {
  // Write the DataFrame to the Azure Data Lake
  df.write.format("parquet").mode("overwrite").save("<your-lake-path>")
}

This code snippet checks if the CurrencyPrecision and FractionalPart columns contain null values in the DataFrame. If null values are present, appropriate actions can be taken to handle them. If no null values are present, the DataFrame is written to the Azure Data Lake.

I hope this helps. Let me know if you have any further questions or concerns.

AJITH KUMAR RAI 0 Reputation points

2024-02-20T06:13:20.0266667+00:00
Hi @Smaran Thoomu
Thank you for response.
Yes with the help of isNull we can verify data before processing Just for information: For table creating scenario we are facing below issue and also please find sample scala note book. CREATE TABLE currencySchema.currencytbl ( CurrencyCode string, CurrencyName string, CurrencyPrecision int NOT NULL, FractionalPart int NOT NULL ) USING DELTA LOCATION '<your-lake-path>'

== Specified ==

root -- CurrencyCode: string (nullable = true) -- CurrencyName: string (nullable = true) -- CurrencyPrecision: integer (nullable = false) -- FractionalPart: integer (nullable = false)

== Existing == root -- CurrencyCode: string (nullable = true) -- CurrencyName: string (nullable = true) -- CurrencyPrecision: integer (nullable = true) -- FractionalPart: integer (nullable = true)

== Differences==

Field CurrencyPrecision is non-nullable in specified schema but nullable in existing schema.

Field FractionalPart is non-nullable in specified schema but nullable in existing schema. If your intention is to keep the existing schema, you can omit the schema from the create table command. Otherwise please ensure that the schema matches.

import org.apache.spark.sql.Row import org.apache.spark.sql.DataFrame import scala.util.Try import org.apache.spark.sql.functions._ import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType, TimestampType} val currencySchema = StructType(List( StructField("CurrencyCode", StringType, nullable = false), StructField("CurrencyName", StringType, nullable = true), StructField("CurrencyPrecision", IntegerType, nullable = true), StructField("FractionalPart", IntegerType, nullable = false) )) // Sample data val data = Seq( Row("USD", "US Dollar", null, 2), Row("EUR", "Euro", 3, 2), Row("GBP", null, 2, 2), Row("JPY", "Japanese Yen", 0, 2)//, // Row(null, null, null, null) // Row with null values ) // Create DataFrame val df = spark.createDataFrame(spark.sparkContext.parallelize(data), currencySchema) df.printSchema() ////val isBuildingIdNullDF = existingDataFrame.where(col("CurrencyCode").isNull) df.write.format("delta") .mode("overwrite") .save("<path>") var newdf = spark.read .format("delta") .load("<path>") newdf.printSchema()
Smaran Thoomu 24,260 Reputation points Microsoft External Staff Moderator

2024-02-20T19:04:25.88+00:00
Hi AJITH KUMAR RAI,

I understand that you're facing an issue while creating a table with a non-nullable schema in Delta Lake. The error message indicates that the specified schema is different from the existing schema, and the "CurrencyPrecision" and "FractionalPart" columns are non-nullable in the specified schema but nullable in the existing schema.

To resolve this issue, you can either omit the schema from the create table command to keep the existing schema or ensure that the schema matches the specified schema.

Here's an example of how to create a table with a non-nullable schema in Delta Lake:

CREATE TABLE currencytbl ( CurrencyCode STRING, CurrencyName STRING, CurrencyPrecision INT NOT NULL, FractionalPart INT NOT NULL ) USING DELTA LOCATION '<your-lake-path>'

Please note that the schema of the table should match the schema of the data frame that you're trying to save to the table.

I hope this helps! Let me know if you have any further questions or concerns.

Share via

Data lake not preserving Not NULL Constraint on Save and Load

2 answers

Your answer