I checked the file you provided and tried reading each entire row as a single column.
This way the data is not split into multiple columns on commas at source time.
If relying on CRLF (\r\n) as the row delimiter alone is not in the cards, I would go for splitting the rows into columns manually with a split expression based on the comma.
While splitting, you need to be careful with the quotes and commas as they can be part of the data as well.
You will have a challenge with multi-line fields, so you need to create a rule or condition to identify such cases.
One way to identify them is by counting the double quotes in the row: an odd (unbalanced) number of quotes is an indicator of a multi-line field.
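As a quick sanity check outside the data flow, the quote-parity rule can be sketched in Python (the sample rows are made up for illustration):

```python
def quote_count(row: str) -> int:
    """Count double quotes in a raw CSV row."""
    return row.count('"')

def is_boundary(row: str) -> bool:
    """Odd quote count => the row opens or closes a multi-line field."""
    return quote_count(row) % 2 == 1

print(is_boundary('1,"acme","plain value"'))     # 4 quotes (even) -> False
print(is_boundary('2,"acme","first line of a'))  # 3 quotes (odd)  -> True
```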
If a multi-line field is identified, you can merge it with the next row. This step might be iterative until all fields are properly aligned.
Now, once you have a clean and structured row, you might want to do a final cleanup, like removing extra quotes or any additional special characters that might have been introduced during the process.
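Putting the merge step together, here is a minimal Python sketch of the iterative merging (the sample rows and the trailing-row handling are assumptions for illustration):

```python
def merge_multiline_rows(raw_rows):
    """Merge physical lines until each logical row has an even quote count."""
    merged, buffer = [], ""
    for line in raw_rows:
        buffer = buffer + "\n" + line if buffer else line
        # An even number of quotes means all quoted fields are closed.
        if buffer.count('"') % 2 == 0:
            merged.append(buffer)
            buffer = ""
    if buffer:  # trailing unterminated row, keep it as-is
        merged.append(buffer)
    return merged

rows = ['1,"acme","first line', 'second line"', '2,"beta","ok"']
print(merge_multiline_rows(rows))
# ['1,"acme","first line\nsecond line"', '2,"beta","ok"']
```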
Here is my logic:
- in your source settings, set the row delimiter as CRLF
- in a Derived Column, use expressions to apply the logic that merges rows and splits them into columns.
For example, if you read each row as a single string, you might do something like this:
split(column1, ',')[0] // To get the value of "site_name"
split(column1, ',')[1] // To get the value of "container_uuid"
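Note that a plain split on the comma breaks as soon as a quoted field itself contains a comma. Outside the data flow, a quote-aware split can be sketched with Python's csv module (the sample values are made up; only the column names come from your file):

```python
import csv
import io

def split_row(row: str):
    """Split one logical CSV row, honouring quotes around commas."""
    return next(csv.reader(io.StringIO(row)))

fields = split_row('"my site","1234-abcd","a, comma inside"')
print(fields[0])  # site_name      -> my site
print(fields[1])  # container_uuid -> 1234-abcd
```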
Here is an example that flags a row based on its quote count. A single stateless expression cannot tell a 'start' from an 'end', so odd-count rows are flagged as 'boundary' and paired up by order (the first one starts the field, the next one ends it):
iif((length(column1) - length(replace(column1, '"', ''))) % 2 == 1, 'boundary', 'complete')
- in a Conditional Split you can route data rows to different outputs based on these conditions:
- rows with an odd number of quotes open or close a multi-line field: the first such row is the 'start', the next one is the 'end', so send them to the merge branch.
- rows with an even quote count are complete and don’t require further processing; they can be sent directly to the output.
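The routing above can be mimicked in Python to validate the rule before building the Conditional Split; the flag names follow the ones above, and tracking the open/closed state across rows is my own addition for illustration:

```python
def flag_row(row: str, inside: bool):
    """Return (flag, new_inside_state) for one raw row."""
    odd = row.count('"') % 2 == 1
    if odd and not inside:
        return "start", True       # opens a multi-line field
    if odd and inside:
        return "end", False        # closes the open field
    if inside:
        return "middle", True      # even-count line inside an open field
    return "complete", False       # self-contained row

state = False
for row in ['1,"a","x', 'y"', '2,"b","z"']:
    flag, state = flag_row(row, state)
    print(flag)
# start
# end
# complete
```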