Data masking in Azure Databricks

Naga 71 Reputation points
2021-10-29T14:26:37.943+00:00

Hi Team,

Looking for some leads on a step-by-step process to implement data masking in Azure Databricks.

The source would be a table (SQL Server) or ADLS files (.csv or .txt); the goal is to implement masking in Azure Databricks and store the masked data in Azure Data Lake Storage (ADLS).

Thanks in advance!!!

Regards,
NagaSri

Azure Data Lake Storage
Azure Databricks

1 answer

  1. Shalvin 161 Reputation points
    2021-10-30T23:07:42.447+00:00

    Hello @Naga

    A simple approach to masking data while reading a set of CSV files from storage is to:

    1. Create a masking function (Python / Scala)
    2. Register the function as a Spark UDF
    3. Use spark.read or spark.readStream with a selectExpr containing the UDF to load the data into a DataFrame
    4. Save the data to a table

    The sample code below shows how to read all CSVs from a storage account path into a Spark database table.

    from pyspark.sql import SparkSession
    import hashlib

    class Mask:
        """Salted one-way hashing helpers, registered as Spark SQL UDFs."""
        def __init__(self, salt: str):
            self.salt = salt

        def sha512(self, value):
            # 128-character hex digest
            return hashlib.sha512(f'{value}{self.salt}'.encode()).hexdigest()

        def shake_128(self, value):
            # 64-character hex digest (32 bytes)
            return hashlib.shake_128(f'{value}{self.salt}'.encode()).hexdigest(32)

        def register(self, spark: SparkSession):
            # Make the masking functions available to selectExpr / Spark SQL
            spark.udf.register('sha512', self.sha512)
            spark.udf.register('shake128', self.shake_128)
       
    

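    For a quick sanity check before wiring the class into Spark, the helpers can be called directly in Python; the salt is the same example value used below, and 'alice' is just an arbitrary test input.

    m = Mask('123456789')
    print(m.sha512('alice'))     # 128-character hex digest
    print(m.shake_128('alice'))  # 64-character hex digest (32 bytes)
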
    Create the Spark Session, set config to read from storage, and register UDFs.

    spark = SparkSession.builder.getOrCreate()

    # Storage account access key and source path (replace the <...> placeholders)
    spark.conf.set('fs.azure.account.key.<my_storage>.blob.core.windows.net', '<my_storage_key>')
    path = 'wasbs://<my_container>@<my_storage>.blob.core.windows.net/*.csv'

    m = Mask('123456789')
    m.register(spark)
    

    Now, use the following code to read the source files and save them to a database table.

    # alias the masked column so the saved table gets a valid column name
    spark.read \
        .format('csv') \
        .option('inferSchema', True) \
        .option('header', True) \
        .load(path) \
        .selectExpr(['user_name', 'shake128(password) AS password']) \
        .write \
        .mode('append') \
        .saveAsTable('my_table')
    
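    Since the question asks for the masked data to land in ADLS rather than a metastore table, here is a minimal variant that writes the masked DataFrame back to storage as Parquet instead of calling saveAsTable; <my_output_container> is a placeholder for an output container in the same storage account.

    output_path = 'wasbs://<my_output_container>@<my_storage>.blob.core.windows.net/masked/'

    # same read-and-mask pipeline, but the result is written as files to storage
    spark.read \
        .format('csv') \
        .option('inferSchema', True) \
        .option('header', True) \
        .load(path) \
        .selectExpr(['user_name', 'shake128(password) AS password']) \
        .write \
        .mode('append') \
        .parquet(output_path)

    For an ADLS Gen2 account you would typically use the abfss:// scheme instead of wasbs://.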

    To run the above code and see it working,

    1. Go to the storage account <my_storage>
    2. Open the container <my_container>
    3. Upload some CSV files containing user_name and password columns with a few rows of values
    4. Copy the above code (the code blocks can go in one cell or in separate cells) into a Databricks Python notebook
    5. Replace the <...> placeholders with the correct storage account name, container name, and account key
    6. Run all cells in order
    7. In a new cell, run the following query
    8. %sql SELECT * FROM my_table
    9. You should see the masked data displayed
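
    If the source is a SQL Server table (the other case mentioned in the question), the same registered UDFs can be applied to a JDBC read. This is only a sketch: the server, database, table, credential, and column names below are placeholders and assumptions, and the SQL Server JDBC driver must be available on the cluster.

    jdbc_url = 'jdbc:sqlserver://<my_server>.database.windows.net:1433;database=<my_db>'

    # read the table over JDBC, mask with the registered UDF, write masked output to storage
    spark.read \
        .format('jdbc') \
        .option('url', jdbc_url) \
        .option('dbtable', '<my_schema>.<my_table>') \
        .option('user', '<my_user>') \
        .option('password', '<my_password>') \
        .load() \
        .selectExpr(['user_name', 'shake128(password) AS password']) \
        .write \
        .mode('overwrite') \
        .parquet('wasbs://<my_output_container>@<my_storage>.blob.core.windows.net/masked_sql/')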

    Thanks,
    Shalvin

