data validation - conditional split

Question

data validation - conditional split

arkiboys 9,711

Hello,
In dataflow, I am reading data and trying to do some datachecks which I find difficult to do.
I have started with conditional split (I guess this is one method of achieving my goal) but finding it difficult to apply the rules (See below) to the dataset.
I attached sample file here.

Can you help please?

Thank you

Question:
How is it possible to have a conditional split for the following datacheck rules?

Note: Data as you see in attached file is grouped by V_Number and m_name

datacheck rules:
1-
If
[actual] date of [t DateName] rows is not null
and [actual] date of [l DateName] rows is not null
and [actual] date of [v DateName] rows is not null
and weight is not null
then re-direct all the rows for V_Number to valid.
2-
If
[actual] date of [t DateName] rows is null
and [estimated] date of [t DateName] rows is null
then re-direct all rows for V_Number to in-valid.

3-
If
[actual] date of [t DateName] rows is null
and [estimated] date of [t DateName] falls within the Previous month
then re-direct all rows for V_Number in-valid.

4-
if
each group of v_Number has more than one party
then re-direct to in-valid

5-
if
each group of V_Number has more than one [actual] date for [t DateName] rows
then re-direct to in-valid.

MartinJaffer-MSFT 26,161 Reputation points

2022-05-09T19:55:55.34+00:00

Rule 1, as you stated sends to Valid. Does this mean the remaining rules are ignored if rule 1 passes? @arkiboys

Answer accepted by question author

MartinJaffer-MSFT 26,161

source(output(\n\t\tV_Number as string,\n\t\tm_name as string,\n\t\tparty as string,\n\t\tm_shipment as string,\n\t\tName as string,\n\t\testimated as string,\n\t\tactual as string,\n\t\tml_name as string,\n\t\tDoc_Name as string,\n\t\tweight as double\n\t),\n\tallowSchemaDrift: true,\n\tvalidateSchema: false,\n\tignoreNoFilesFound: false,\n\tformat: 'delimited',\n\tfileSystem: 'martin',\n\tfileName: 'question.txt',\n\tcolumnDelimiter: ',',\n\tescapeChar: '\\',\n\tquoteChar: '\"',\n\tnullValue: 'null',\n\tcolumnNamesAsHeader: true) ~> source1\njoin1 split(UnionRules,\n\tdisjoint: false) ~> split1@(Invalid, Valid)\nsource1 aggregate(groupBy(V_Number),\n\tweightIsNull = 0 < (count(isNull(actual)) + count(isNull(weight)) ),\n\t\tmultipleParty = 1 < countAllDistinct(party),\n\t\tcolumn1 = countIf(in(["t DateName","l DateName"], Name), Doc_Name),\n\t\tCountDistinctIfDate = 1 < countDistinct(iif(like(Name,"t DateName"),actual))) ~> DoRules\nsource1, CleanRules join(source1@V_Number == CleanRules@V_Number,\n\tjoinType:'inner',\n\tbroadcast: 'auto')~> join1\nDoRules derive(UnionRules = weightIsNull || multipleParty || CountDistinctIfDate) ~> HasError\nHasError select(mapColumn(\n\t\tV_Number,\n\t\tUnionRules\n\t),\n\tskipDuplicateMapInputs: true,\n\tskipDuplicateMapOutputs: true) ~> CleanRules

arkiboys 9,711 Reputation points

2022-05-10T07:26:59.213+00:00

thank you.
This helped a-lot

1 additional answer

Your answer

MartinJaffer-MSFT 26,161 Reputation points

2022-05-09T19:55:55.34+00:00

Rule 1, as you stated sends to Valid. Does this mean the remaining rules are ignored if rule 1 passes? @arkiboys
arkiboys 9,711 Reputation points

2022-05-10T07:26:59.213+00:00

thank you.
This helped a-lot

Answer 1

Hi again @arkiboys . Looks like you brought us a fun one this time.
I think I see where the difficulty is. The Conditional Split works as if each row stands on its own, but the conditions you bring are more aggregate-like.

I'm not sure whether everything can be done at once, but it is often better to break up into steps anyway. I'm going to try to refactor some of this logic, tell me if this checks for you. "any" refers to any row in a set of [v_number].

rule 1 covers [actual] for all 3 values of [Name] so we can simplify. None of the rules depend on [m_Name], so we can group by [V_number] only.
in rule 2, the check on [actual] has been handled in rule 1, so it just becomes a check on [estimated] being null? Or is this paring important? Please double check my logic. Are any nulls allowed?

Invalid when any [actual] is null
Invalid when any [weight] is null
Invalid when any(?) estimated is null
Invalid when 1 < count distinct [party]
Invalid when 1 < count distinct [actual] where [name] = t_DateName

Anyway, the solution I am trying is an aggregation, Grouping by only [V_Number] becaues you want to move entire sets of [V_Number]. The aggregation will be the rules. After that Conditional Split based upon join of the aggregates to the original data.
In case I am wrong in my logic on the rules, I'll try to make countIf separate the values of [name] during aggregation.

MartinJaffer-MSFT 26,161 Reputation points

2022-05-09T19:54:01.09+00:00

Code so far:

Rule 1: True when any row has null [weight] OR any row has null [actual]

0 < (count(isNull(actual)) + count(isNull(weight)) )

Rule 4: True when there is more than one value for [party]. Nulls are included as separate value. To ignore nulls in [party], replace CountAllDistinct with CountDistinct

1 < countAllDistinct(party)

Rule 5: Counts Distinct [actual] when [name] like 't DateName' . Needs test with nulls.

countDistinct(iif(like(Name,"t DateName"),actual))

Share via

data validation - conditional split

1 additional answer

Your answer