Generate Unique Identifiers (UID) in U-SQL on Azure Data Lake Analytics with Python extension scripts
U-SQL doesn't provide a construct for generating unique identifiers for the rows of a text file. The script below generates a unique identifier for every row in the input file.
The steps are:
- Extract the data file with the EXTRACT statement.
- Reduce the rowset on the customer code, so that each distinct customer code forms one group for the reducer. Too few or too many reducers can both cause performance issues, so pick a column that splits the data fairly evenly, but do not pick a unique column (that would create one group per row).
- For every reduced data set, the Python script is invoked with that group as a data frame. The script adds a column "sguid" to the data frame and fills it with a newly generated, encoded UID for each row.
- The output of the reducer has the new column sguid.
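The steps above can be sketched in plain Python, outside of Azure Data Lake Analytics, to see what the reduce step does to each group. This is only an illustration under stated assumptions: `reduce_on`, `new_sguid`, and the sample rows are hypothetical names and data, rows are plain dictionaries rather than a data frame, and the grouping only imitates what REDUCE ... ON does.

```python
import base64
import uuid
from collections import defaultdict


def new_sguid():
    """Return a URL-safe Base64 text form of a time-based UUID (hypothetical helper)."""
    return base64.urlsafe_b64encode(uuid.uuid1().bytes).decode("ascii")


def reduce_on(rows, key):
    """Group rows by `key`, then tag every row of every group with a new sguid."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)

    output = []
    for group in groups.values():
        # In the real job, each group would reach the Python script as one data frame.
        for row in group:
            output.append({"sguid": new_sguid(), **row})
    return output


# Hypothetical sample rows; only two of the input columns are shown.
rows = [
    {"OrderNo": "1001", "CustomerCode": "C01"},
    {"OrderNo": "1002", "CustomerCode": "C02"},
    {"OrderNo": "1003", "CustomerCode": "C01"},
]
tagged = reduce_on(rows, "CustomerCode")
```

Every input row comes back exactly once with an extra sguid field. The grouping column affects only how the work is divided, not which rows receive an identifier.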
REFERENCE ASSEMBLY [ExtPython];
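DECLARE @ReduceScript = @"
import uuid
import base64

def usqlml_main(df):
    # Add the new column, then fill it with one encoded, time-based UUID per row.
    # decode() yields plain text; str() on bytes would add a b'...' wrapper on Python 3.
    df['sguid'] = ''
    df['sguid'] = df.sguid.apply(lambda _: base64.urlsafe_b64encode(uuid.uuid1().bytes).decode('ascii'))
    return df
";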
@AllData =
    EXTRACT OrderNo string,
            Date string,
            CustomerCode string,
            ProductCode string,
            SalesArea string,
            OrderValue string
    FROM "/DataLoads/Input/TempFile.csv"
    USING Extractors.Text(delimiter: ',', skipFirstNRows: 1);

@ReducedData =
    REDUCE @AllData ON CustomerCode
    PRODUCE sguid string,
            OrderNo string,
            Date string,
            CustomerCode string,
            ProductCode string,
            SalesArea string,
            OrderValue string
    USING new Extension.Python.Reducer(pyScript:@ReduceScript);

OUTPUT @ReducedData
TO "/DataLoads/CSVOutputwithGUID.txt"
USING Outputters.Text();
Note: U-SQL extensions must be enabled on your ADL-A account before this script can reference [ExtPython].
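One detail worth checking locally is how the encoded bytes become text. The following is a minimal sketch, assuming a Python 3 interpreter, that compares wrapping the Base64 result in str() with decoding it. The variable names are illustrative only, and stripping the padding is an optional extra that the script above does not do.

```python
import base64
import uuid

# A UUID is 16 bytes, so its Base64 form is 24 characters, ending in "==" padding.
raw = base64.urlsafe_b64encode(uuid.uuid1().bytes)

as_str = str(raw)              # Python 3: the bytes repr, wrapped in b'...'
decoded = raw.decode("ascii")  # bare text, no wrapper
trimmed = decoded.rstrip("=")  # optional: drop the two padding characters

print(len(as_str), len(decoded), len(trimmed))
```

If the identifiers are later decoded back to UUID bytes, the padding has to be restored first, so trimming is only worthwhile when the value is treated as an opaque key.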