preserving relationships between Multiple Columns with delimited text using Azure pipeline

Vamsi Madhav 46 Reputation points
2022-10-13T20:19:58.843+00:00

Hi i could do with some help on a requirement I have that I need to achieve using synapse pipeline dataflow.

My data looks like this:
user_id, certification_id, certification_title, expiry_date
1000, "201, 202, 203", "Azure Certification1, Azure Certification2, Azure Certification3", "2099-12-31, 2022-12-12, 2023-05-05"

expected output is
user_id, certification_id, certification_title, expiry_date
1000, 201, Azure Certification1, 2099-12-31
1000, 202, Azure Certification2, 2022-12-12
1000, 203, Azure Certification3, 2023-05-05

this is how i tried it

my source data is from a rest api end point
certification_id stream 1: split function in dervied columns >> flatten >> denseRank function in window
new branch - certification_title stream 2: split function in dervied columns >> flatten >> denseRank function in window
new branch - expirry_date stream3: split function in dervied columns >> flatten >> denseRank function in window

then i am joining the data back again using a join with user_Id and rank as key columns.

the data works in most cases but because i need to sort the data to use the window option - i am losing the consistency in the output data format.
also i am not sure if its a very performant way of doing this.

any advice would be greatly appreciated.

Azure Synapse Analytics
Azure Synapse Analytics
An Azure analytics service that brings together data integration, enterprise data warehousing, and big data analytics. Previously known as Azure SQL Data Warehouse.
5,374 questions
{count} votes

1 answer

Sort by: Most helpful
  1. MartinJaffer-MSFT 26,236 Reputation points
    2022-10-17T17:59:09.427+00:00

    Okay, I got it now @Anonymous !

    Also, I found the blocker earlier; the space after the comma makes a difference! So putting space after the comma in delimiter took care of that.

    Below I past the dataflow script. It works pretty much like I originally wanted.

    source(output(  
    		user_id as short,  
    		certification_id as string,  
    		certification_title as string,  
    		expiry_date as string  
    	),  
    	useSchema: false,  
    	allowSchemaDrift: true,  
    	validateSchema: false,  
    	ignoreNoFilesFound: false,  
    	format: 'delimited',  
    	container: 'input',  
    	fileName: 'vasmi.csv',  
    	columnDelimiter: ', ',  
    	escapeChar: '',  
    	quoteChar: '\"',  
    	columnNamesAsHeader: true) ~> source1  
    source1 derive(certification_id = split(trim(certification_id,'"'),", "),  
    		certification_title = split(trim(certification_title,'"'),", "),  
    		expiry_date = split(trim(expiry_date,'"'),", ")) ~> MakeIntoArrays  
    MakeIntoArrays derive(compound = unfold(mapLoop(  
        size(certification_id),  
        array(certification_id[#index],certification_title[#index],expiry_date[#index])  
    ))) ~> CompoundAndUnfold  
    CompoundAndUnfold select(mapColumn(  
    		user_id,  
    		compound  
    	),  
    	skipDuplicateMapInputs: true,  
    	skipDuplicateMapOutputs: true) ~> CleanColumns  
    CleanColumns derive(cert_id = compound[1],  
    		cert_title = compound[2],  
    		expiry_date = compound[3]) ~> SeparateIntoProperties  
    SeparateIntoProperties sink(allowSchemaDrift: true,  
    	validateSchema: false,  
    	format: 'delimited',  
    	fileSystem: 'data',  
    	folderPath: 'out',  
    	columnDelimiter: ',',  
    	escapeChar: '\\',  
    	quoteChar: '\"',  
    	columnNamesAsHeader: true,  
    	umask: 0022,  
    	preCommands: [],  
    	postCommands: [],  
    	skipDuplicateMapInputs: true,  
    	skipDuplicateMapOutputs: true,  
    	mapColumn(  
    		user_id,  
    		cert_id,  
    		cert_title,  
    		expiry_date  
    	)) ~> sink1  
    

    251175-image.png

    unfold(mapLoop(  
        size(certification_id),  
        array(certification_id[#index],certification_title[#index],expiry_date[#index])  
    ))  
    

    251189-image.png

    0 comments No comments

Your answer

Answers can be marked as Accepted Answers by the question author, which helps users to know the answer solved the author's problem.