Safe execution of Eval function in spark notebook inside synapse pipeline

Question

Safe execution of Eval function in spark notebook inside synapse pipeline

Heta Desai 357

I need to create a notebook which will read a csv file from ADLS. The schema to read the csv file is parameterized from Azure SQL database as a string. I am using the Eval() function to convert this schema into the StructType. The notebook is executing using the synapse pipeline.

Here is the example :

Below is the string which I will get from database into pyspark notebook using synapse pipeline:

Schema = " StructType().add("ID",IntegerType(),True).add("Name",StringType(),True).add("Address",StringType(),True).add("StartTimestamp",TimestampType(),True) "

Using below command converting a string into StructType:

SchemaStruct = eval(Schema)

To read csv file this converted StructType will be used:

df_read = spark.read.format("csv") \
      .option("header", True) \
      .schema(SchemaStruct) \
      .load(InputFilePath)

As I am passing the StructType as string to Eval() from Azure database, for the safe execution of Eval() I want to limit it to the specified types only. Which means the Eval() should convert the string into StructType only when IntegerType() and StringType() is used in it.

Is there any other way to secure the eval() execution specific for StructType ?

MartinJaffer-MSFT 26,161 Reputation points

2022-05-30T21:31:08.423+00:00

Hello and welcome to Microsoft Q&A @Anonymous .

So if I understand right, you want to limit whether or not that string is used by eval, based upon the contents of the string.

I think we can do that by using regular expressions (regex).

The requirements you have provided, are only positive reqs. What MUST be inside.
Just as important, if not more so for security, are what MUST NOT be inside. Someone could include the things you are looking for, but also add in other stuff.

Depending upon the rigidity of your expected structure, we could have a single expression to match, with a blank spot for variations. On the other side, we might need multiple checks for things to avoid.

Since you are using eval(), I expect you are comfortable with coding. Am I wrong?
Heta Desai 357 Reputation points

2022-05-31T07:10:37.173+00:00

Yes I am comfortable with coding. Can you provide me a sample code to implement this ?

Answer accepted by question author

1 additional answer

Your answer

MartinJaffer-MSFT 26,161 Reputation points

2022-05-30T21:31:08.423+00:00

Hello and welcome to Microsoft Q&A @Anonymous .

So if I understand right, you want to limit whether or not that string is used by eval, based upon the contents of the string.

I think we can do that by using regular expressions (regex).

The requirements you have provided, are only positive reqs. What MUST be inside.
Just as important, if not more so for security, are what MUST NOT be inside. Someone could include the things you are looking for, but also add in other stuff.

Depending upon the rigidity of your expected structure, we could have a single expression to match, with a blank spot for variations. On the other side, we might need multiple checks for things to avoid.

Since you are using eval(), I expect you are comfortable with coding. Am I wrong?
Heta Desai 357 Reputation points

2022-05-31T07:10:37.173+00:00

Yes I am comfortable with coding. Can you provide me a sample code to implement this ?

Answer 1

I can give it a try @Anonymous . You are using pyspark, so I'll reference the Python regex documentation.

Original:  
StructType().add("ID",IntegerType(),True).add("Name",StringType(),True).add("Address",StringType(),True).add("StartTimestamp",TimestampType(),True)  
  
Broken into segments for working sub-problems:  
  
StructType()  
.add("ID",IntegerType(),True)  
.add("Name",StringType(),True)  
.add("Address",StringType(),True)  
.add("StartTimestamp",TimestampType(),True)

So, looking at your example, it starts with StructType() before going into anything else. ( ) are metacharacters so we need to escape them. ^ means "This is beginning of string".

^StructType\(\)

Your example ends with a timestamp. You mentioned only accepting new integer and strings, so I'm guessing you want the timestamp hardcoded in always. We need to escape it too. $ means "this is end of entire string"

\.add\("StartTimestamp","TimestampType\(\),True\)$

I see in the middle is where we have the parts that may vary. They share a similar structure, differing in the field name and data type. Since you did not give me further insight into your requirements, I am going to assume any of these may change, and the order may change. Also the number of them may change.

The sub-expression for both of these looks like

\.add\("\w+",(IntegerType)|(StringType)\(\),True\)

The \w means any alphanumberic character, and + means one or more, so valid examples are a, a1, 1a, A, A1, 1A, 1, aA, Aa, AA, aa 11. This should exclude special characters, so while someone could do ArrayType as a name they cannot do ArrayType().
The unescaped parentheses ( ) are a grouping, and pipe | means exclusive or. So it will accept IntegerType or StringType but not both and not ArrayType.

Now we want to allow this middle pattern to repeat. So we need to group it together and specify how many times it should repeat. If we use asterisk * it means 0 or more times. + means one or more times. There are other mechanisms if you have a certain number range you want to enforce. This would make it (I think).

(\.add\("\w+",(IntegerType)|(StringType)\(\),True\))+

We should test each component before putting it all together. Together it would look like:

^StructType\(\)(\.add\("\w+",(IntegerType)|(StringType)\(\),True\))+\.add\("StartTimestamp","TimestampType\(\),True\)$  


import re  
pattern = re.compile(r"^StructType\(\)(\.add\("\w+",(IntegerType)|(StringType)\(\),True\))+\.add\("StartTimestamp","TimestampType\(\),True\)$")  
if None == pattern.match(myThingToTest):  
    Not A match  
else:  
  A match!

I have not tested this yet. I wanted to get this out before I lose my work.

Hope this will help. Please let us know if any further queries.

------------------------------

Please don't forget to click on or upvote button whenever the information provided helps you. Original posters help the community find answers faster by identifying the correct answer. Here is how
Want a reminder to come back and check responses? Here is how to subscribe to a notification
If you are interested in joining the VM program and help shape the future of Q&A: Here is how you can be part of Q&A Volunteer Moderators

MartinJaffer-MSFT 26,161 Reputation points

2022-05-31T16:09:16.58+00:00

Testing, and have a few minor corrections. I need to replace the outermost double quotes " with single quotes ' so it would be pattern = re.compile(r' .... ')

Also your original example has some whitespace I didn't account for. The regex whitespace class is \s. The adjustment would make it:

^\s*StructType(\.add$"\w+",(IntegerType)|(StringType)\($,True\))+\.add$"StartTimestamp","TimestampType\($,True\)\s*$

Hmm, that didn't work all as one, so testing the components.

>>> start = re.compile(r'^\s*StructType') >>> None == start.match(Schema) False

Start is good.

Found a stray " in the ending.
.add("StartTimestamp",TimestampType(),True)\s*$

but something is still wrong.

>>> end = re.compile(r'\.add$"StartTimestamp",TimestampType\($,True\)\s*$') >>> None == end.match(Schema) True
Heta Desai 357 Reputation points

2022-06-01T05:43:40.947+00:00

My requirement is StructType() should work with datatype used in given string. I need to consider Timestamp as well.

Answer 2

I've got it now @ HetaDesai-5854 .

X = r'^\s*StructType\(\)(\.add\("\w+",((IntegerType)|(StringType))\(\),True\))+\.add\("StartTimestamp",TimestampType\(\),True\)\s*$'  
  
Y = r'^\s*StructType\(\)(\.add\("\w+",(IntegerType|StringType|TimestampType)\(\),True\))+\s*$'

Y should be more flexible. You can add more types if you want. The order of the type does not matter. The quantity of Type()'s should be at least 1. If you want to allow adding 0 types, then use Z

Z = r'^\s*StructType\(\)(\.add\("\w+",(IntegerType|StringType|TimestampType)\(\),True\))*\s*$'

Hope this will help. Please let us know if any further queries.

------------------------------

Please don't forget to click on or upvote button whenever the information provided helps you. Original posters help the community find answers faster by identifying the correct answer. Here is how
Want a reminder to come back and check responses? Here is how to subscribe to a notification
If you are interested in joining the VM program and help shape the future of Q&A: Here is how you can be part of Q&A Volunteer Moderators

Share via

Safe execution of Eval function in spark notebook inside synapse pipeline

1 additional answer

Your answer