Safe execution of Eval function in spark notebook inside synapse pipeline

Heta Desai 357 Reputation points
2022-05-27T18:09:18.49+00:00

I need to create a notebook which will read a csv file from ADLS. The schema to read the csv file is parameterized from Azure SQL database as a string. I am using the Eval() function to convert this schema into the StructType. The notebook is executing using the synapse pipeline.

Here is the example :

Below is the string which I will get from database into pyspark notebook using synapse pipeline:

Schema = " StructType().add("ID",IntegerType(),True).add("Name",StringType(),True).add("Address",StringType(),True).add("StartTimestamp",TimestampType(),True) "

Using below command converting a string into StructType:

SchemaStruct = eval(Schema) 

To read csv file this converted StructType will be used:

df_read = spark.read.format("csv") \
      .option("header", True) \
      .schema(SchemaStruct) \
      .load(InputFilePath)

As I am passing the StructType as string to Eval() from Azure database, for the safe execution of Eval() I want to limit it to the specified types only. Which means the Eval() should convert the string into StructType only when IntegerType() and StringType() is used in it.

Is there any other way to secure the eval() execution specific for StructType ?

Azure Synapse Analytics
Azure Synapse Analytics
An Azure analytics service that brings together data integration, enterprise data warehousing, and big data analytics. Previously known as Azure SQL Data Warehouse.
{count} votes

Answer accepted by question author
  1. MartinJaffer-MSFT 26,161 Reputation points
    2022-05-31T15:44:25.393+00:00

    I can give it a try @Anonymous . You are using pyspark, so I'll reference the Python regex documentation.

    Original:  
    StructType().add("ID",IntegerType(),True).add("Name",StringType(),True).add("Address",StringType(),True).add("StartTimestamp",TimestampType(),True)  
      
    Broken into segments for working sub-problems:  
      
    StructType()  
    .add("ID",IntegerType(),True)  
    .add("Name",StringType(),True)  
    .add("Address",StringType(),True)  
    .add("StartTimestamp",TimestampType(),True)  
    

    So, looking at your example, it starts with StructType() before going into anything else. ( ) are metacharacters so we need to escape them. ^ means "This is beginning of string".

    ^StructType\(\)  
    

    Your example ends with a timestamp. You mentioned only accepting new integer and strings, so I'm guessing you want the timestamp hardcoded in always. We need to escape it too. $ means "this is end of entire string"

    \.add\("StartTimestamp","TimestampType\(\),True\)$  
    

    I see in the middle is where we have the parts that may vary. They share a similar structure, differing in the field name and data type. Since you did not give me further insight into your requirements, I am going to assume any of these may change, and the order may change. Also the number of them may change.

    The sub-expression for both of these looks like

    \.add\("\w+",(IntegerType)|(StringType)\(\),True\)  
    

    The \w means any alphanumberic character, and + means one or more, so valid examples are a, a1, 1a, A, A1, 1A, 1, aA, Aa, AA, aa 11. This should exclude special characters, so while someone could do ArrayType as a name they cannot do ArrayType().
    The unescaped parentheses ( ) are a grouping, and pipe | means exclusive or. So it will accept IntegerType or StringType but not both and not ArrayType.

    Now we want to allow this middle pattern to repeat. So we need to group it together and specify how many times it should repeat. If we use asterisk * it means 0 or more times. + means one or more times. There are other mechanisms if you have a certain number range you want to enforce. This would make it (I think).

    (\.add\("\w+",(IntegerType)|(StringType)\(\),True\))+  
    

    We should test each component before putting it all together. Together it would look like:

    ^StructType\(\)(\.add\("\w+",(IntegerType)|(StringType)\(\),True\))+\.add\("StartTimestamp","TimestampType\(\),True\)$  
    
    
    import re  
    pattern = re.compile(r"^StructType\(\)(\.add\("\w+",(IntegerType)|(StringType)\(\),True\))+\.add\("StartTimestamp","TimestampType\(\),True\)$")  
    if None == pattern.match(myThingToTest):  
        Not A match  
    else:  
      A match!  
    

    I have not tested this yet. I wanted to get this out before I lose my work.

    Hope this will help. Please let us know if any further queries.

    ------------------------------

    • Please don't forget to click on 130616-image.png or upvote 130671-image.png button whenever the information provided helps you. Original posters help the community find answers faster by identifying the correct answer. Here is how
    • Want a reminder to come back and check responses? Here is how to subscribe to a notification
    • If you are interested in joining the VM program and help shape the future of Q&A: Here is how you can be part of Q&A Volunteer Moderators
    1 person found this answer helpful.

1 additional answer

Sort by: Most helpful
  1. MartinJaffer-MSFT 26,161 Reputation points
    2022-06-01T19:01:28.08+00:00

    I've got it now @ HetaDesai-5854 .

    X = r'^\s*StructType\(\)(\.add\("\w+",((IntegerType)|(StringType))\(\),True\))+\.add\("StartTimestamp",TimestampType\(\),True\)\s*$'  
      
    Y = r'^\s*StructType\(\)(\.add\("\w+",(IntegerType|StringType|TimestampType)\(\),True\))+\s*$'  
    

    Y should be more flexible. You can add more types if you want. The order of the type does not matter. The quantity of Type()'s should be at least 1. If you want to allow adding 0 types, then use Z

    Z = r'^\s*StructType\(\)(\.add\("\w+",(IntegerType|StringType|TimestampType)\(\),True\))*\s*$'  
    

    Hope this will help. Please let us know if any further queries.

    ------------------------------

    • Please don't forget to click on 130616-image.png or upvote 130671-image.png button whenever the information provided helps you. Original posters help the community find answers faster by identifying the correct answer. Here is how
    • Want a reminder to come back and check responses? Here is how to subscribe to a notification
    • If you are interested in joining the VM program and help shape the future of Q&A: Here is how you can be part of Q&A Volunteer Moderators
    0 comments No comments

Your answer

Answers can be marked as 'Accepted' by the question author and 'Recommended' by moderators, which helps users know the answer solved the author's problem.