I can give it a try @Anonymous . You are using pyspark, so I'll reference the Python regex documentation.
Original:
StructType().add("ID",IntegerType(),True).add("Name",StringType(),True).add("Address",StringType(),True).add("StartTimestamp",TimestampType(),True)
Broken into segments for working sub-problems:
StructType()
.add("ID",IntegerType(),True)
.add("Name",StringType(),True)
.add("Address",StringType(),True)
.add("StartTimestamp",TimestampType(),True)
So, looking at your example, it starts with StructType() before going into anything else. ( ) are metacharacters so we need to escape them. ^ means "This is beginning of string".
^StructType\(\)
Your example ends with a timestamp. You mentioned only accepting new integer and strings, so I'm guessing you want the timestamp hardcoded in always. We need to escape it too. $ means "this is end of entire string"
\.add\("StartTimestamp","TimestampType\(\),True\)$
I see in the middle is where we have the parts that may vary. They share a similar structure, differing in the field name and data type. Since you did not give me further insight into your requirements, I am going to assume any of these may change, and the order may change. Also the number of them may change.
The sub-expression for both of these looks like
\.add\("\w+",(IntegerType)|(StringType)\(\),True\)
The \w means any alphanumberic character, and + means one or more, so valid examples are a, a1, 1a, A, A1, 1A, 1, aA, Aa, AA, aa 11. This should exclude special characters, so while someone could do ArrayType as a name they cannot do ArrayType().
The unescaped parentheses ( ) are a grouping, and pipe | means exclusive or. So it will accept IntegerType or StringType but not both and not ArrayType.
Now we want to allow this middle pattern to repeat. So we need to group it together and specify how many times it should repeat. If we use asterisk * it means 0 or more times. + means one or more times. There are other mechanisms if you have a certain number range you want to enforce. This would make it (I think).
(\.add\("\w+",(IntegerType)|(StringType)\(\),True\))+
We should test each component before putting it all together. Together it would look like:
^StructType\(\)(\.add\("\w+",(IntegerType)|(StringType)\(\),True\))+\.add\("StartTimestamp","TimestampType\(\),True\)$
import re
pattern = re.compile(r"^StructType\(\)(\.add\("\w+",(IntegerType)|(StringType)\(\),True\))+\.add\("StartTimestamp","TimestampType\(\),True\)$")
if None == pattern.match(myThingToTest):
Not A match
else:
A match!
I have not tested this yet. I wanted to get this out before I lose my work.
Hope this will help. Please let us know if any further queries.
------------------------------
- Please don't forget to click on
or upvote
button whenever the information provided helps you. Original posters help the community find answers faster by identifying the correct answer. Here is how - Want a reminder to come back and check responses? Here is how to subscribe to a notification
- If you are interested in joining the VM program and help shape the future of Q&A: Here is how you can be part of Q&A Volunteer Moderators