Data masking in Azure Databricks

Naga-4366 asked:

Hi Team,

Looking for some leads on a step-by-step process to implement data masking in Azure Databricks.

The source would be a table (SQL Server) or ADLS files (.csv or .txt); the goal is to implement masking in Azure Databricks and store the masked data in Azure Data Lake Storage (ADLS).

Thanks in advance!

Regards,
NagaSri

Tags: azure-databricks, azure-data-lake-storage

1 Answer

Shalvin answered:

Hello @Naga-4366

A simple approach to masking data while reading a set of CSV files from storage is to:

  1. Create a masking function (Python/Scala)

  2. Register the function as a Spark UDF

  3. Use spark.read or spark.readStream with a selectExpr containing the UDF to load the data into a DataFrame

  4. Save the data to a table

The sample code below reads all CSVs from a storage account path into a Spark database table, masking a column along the way.

 import hashlib

 from pyspark.sql import SparkSession

 class Mask:
     """Salted hashing helpers to register as Spark UDFs for masking."""

     def __init__(self, salt: str):
         self.salt = salt

     def sha512(self, value):
         # Salted SHA-512; returns a 128-character hex digest.
         return hashlib.sha512(f'{value}{self.salt}'.encode()).hexdigest()

     def shake_128(self, value):
         # Salted SHAKE-128; returns a 64-character hex digest (32 bytes).
         return hashlib.shake_128(f'{value}{self.salt}'.encode()).hexdigest(32)

     def register(self, spark: SparkSession):
         # Expose the methods as SQL-callable UDFs named sha512 / shake128.
         spark.udf.register('sha512', self.sha512)
         spark.udf.register('shake128', self.shake_128)
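
As a quick sanity check, the Mask class can be exercised in plain Python before Spark is involved (the salt is just the example value used below, and 'secret' is an arbitrary input):

 # Plain-Python check of the masking helpers; no Spark session needed.
 m = Mask('123456789')
 print(m.sha512('secret'))     # 128-character hex digest
 print(m.shake_128('secret'))  # 64-character hex digest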

Create the SparkSession, set the configuration to read from storage, and register the UDFs:

 spark = SparkSession.builder.getOrCreate()

 # Authenticate to the storage account with an account key.
 spark.conf.set('fs.azure.account.key.<my_storage>.blob.core.windows.net', '<my_storage_key>')
 path = 'wasbs://<my_container>@<my_storage>.blob.core.windows.net/*.csv'

 m = Mask('123456789')
 m.register(spark)

Now, use the following code to read the source files and save them to a database table:

 (spark.read
     .format('csv')
     .option('inferSchema', True)
     .option('header', True)
     .load(path)
     # Mask the password column with the registered shake128 UDF;
     # alias the result with AS so the output column keeps a clean name.
     .selectExpr(['user_name', 'shake128(password) AS password'])
     .write
     .mode('append')
     .saveAsTable('my_table'))
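
The original question asks to store the masked data in ADLS rather than a table. The same DataFrame can be written back to a storage path instead; a minimal sketch, reusing spark and path from the cells above (the masked/ output folder and Parquet format are my assumptions, not part of the original answer):

 # Sketch: write the masked DataFrame back to ADLS instead of a table.
 out_path = 'wasbs://<my_container>@<my_storage>.blob.core.windows.net/masked/'

 (spark.read
     .format('csv')
     .option('header', True)
     .load(path)
     .selectExpr(['user_name', 'shake128(password) AS password'])
     .write
     .mode('overwrite')
     .parquet(out_path))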


To run the above code and see it working:

  1. Go to the storage account <my_storage>

  2. Open the container <my_container>

  3. Upload some CSV files with user_name and password columns and some values

  4. Copy the above code (the code blocks can go in the same cell or separate cells) into a Databricks Python notebook

  5. Replace the <> placeholders with the correct storage account name, container name, and account key

  6. Run all cells in order

  7. In a new cell, run: %sql SELECT * FROM my_table

  8. You should see the masked data displayed
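
Step 3 of the approach above also mentions spark.readStream. For completeness, a minimal streaming sketch, again reusing spark and path; the schema and checkpoint path here are my assumptions (streaming file sources need an explicit schema, and toTable requires Spark 3.1+):

 from pyspark.sql.types import StructType, StructField, StringType

 # Streaming CSV sources cannot infer a schema, so declare one up front.
 schema = StructType([
     StructField('user_name', StringType()),
     StructField('password', StringType()),
 ])

 (spark.readStream
     .format('csv')
     .option('header', True)
     .schema(schema)
     .load(path)
     .selectExpr(['user_name', 'shake128(password) AS password'])
     .writeStream
     .option('checkpointLocation', '/tmp/mask_checkpoint')  # assumed path
     .toTable('my_table_stream'))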



Thanks,
Shalvin

Naga-4366 commented:

Hi Shalvin/Sharma - I'm able to run the above, but I'm getting the error below and unable to find any leads on it.

Can't extract value from shake128(Dob#2505): need struct type but got string.

Also, I added a new function as below; even with this function, I get the same error.

 def mask_func(self, value):
     charList = list(value)
     charList[4:12] = 'x' * 8
     return "".join(charList)

Regards,
NagaSri

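
For reference, Spark raises "need struct type but got string" when the SQL parser reads something after the UDF call as struct-field extraction. A likely cause, assuming the expression string contained a trailing .name (for example 'shake128(Dob).Dob', or a DataFrame-style .alias(...) inside selectExpr): in a SQL expression string, alias with AS instead. In the sketch below, df stands for whatever DataFrame is being masked:

 # 'shake128(Dob).x' is parsed as extracting field x from the UDF's string
 # result, hence "need struct type but got string". Alias with AS instead:
 df.selectExpr(['user_name', 'shake128(Dob) AS Dob'])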