Share via

data validation - conditional split

arkiboys 9,711 Reputation points
2022-05-07T00:53:12.773+00:00

Hello,
In dataflow, I am reading data and trying to do some datachecks which I find difficult to do.
I have started with conditional split (I guess this is one method of achieving my goal) but finding it difficult to apply the rules (See below) to the dataset.
I attached sample file here.

Can you help please?

Thank you

Question:
How is it possible to have a conditional split for the following datacheck rules?

Note: Data as you see in attached file is grouped by V_Number and m_name

datacheck rules:
1-
If
[actual] date of [t DateName] rows is not null
and [actual] date of [l DateName] rows is not null
and [actual] date of [v DateName] rows is not null
and weight is not null
then re-direct all the rows for V_Number to valid.
2-
If
[actual] date of [t DateName] rows is null
and [estimated] date of [t DateName] rows is null
then re-direct all rows for V_Number to in-valid.

3-
If
[actual] date of [t DateName] rows is null
and [estimated] date of [t DateName] falls within the Previous month
then re-direct all rows for V_Number in-valid.

4-
if
each group of v_Number has more than one party
then re-direct to in-valid

5-
if
each group of V_Number has more than one [actual] date for [t DateName] rows
then re-direct to in-valid.

Azure Data Factory
Azure Data Factory

An Azure service for ingesting, preparing, and transforming data at scale.


Answer accepted by question author

MartinJaffer-MSFT 26,161 Reputation points
2022-05-09T20:58:08.673+00:00
source(output(\n\t\tV_Number as string,\n\t\tm_name as string,\n\t\tparty as string,\n\t\tm_shipment as string,\n\t\tName as string,\n\t\testimated as string,\n\t\tactual as string,\n\t\tml_name as string,\n\t\tDoc_Name as string,\n\t\tweight as double\n\t),\n\tallowSchemaDrift: true,\n\tvalidateSchema: false,\n\tignoreNoFilesFound: false,\n\tformat: 'delimited',\n\tfileSystem: 'martin',\n\tfileName: 'question.txt',\n\tcolumnDelimiter: ',',\n\tescapeChar: '\\',\n\tquoteChar: '\"',\n\tnullValue: 'null',\n\tcolumnNamesAsHeader: true) ~> source1\njoin1 split(UnionRules,\n\tdisjoint: false) ~> split1@(Invalid, Valid)\nsource1 aggregate(groupBy(V_Number),\n\tweightIsNull = 0 < (count(isNull(actual)) + count(isNull(weight)) ),\n\t\tmultipleParty = 1 < countAllDistinct(party),\n\t\tcolumn1 = countIf(in(["t DateName","l DateName"], Name), Doc_Name),\n\t\tCountDistinctIfDate = 1 < countDistinct(iif(like(Name,"t DateName"),actual))) ~> DoRules\nsource1, CleanRules join(source1@V_Number == CleanRules@V_Number,\n\tjoinType:'inner',\n\tbroadcast: 'auto')~> join1\nDoRules derive(UnionRules = weightIsNull || multipleParty || CountDistinctIfDate) ~> HasError\nHasError select(mapColumn(\n\t\tV_Number,\n\t\tUnionRules\n\t),\n\tskipDuplicateMapInputs: true,\n\tskipDuplicateMapOutputs: true) ~> CleanRules

Was this answer helpful?


1 additional answer

Sort by: Most helpful
  1. MartinJaffer-MSFT 26,161 Reputation points
    2022-05-09T19:43:19.267+00:00

    Hi again @arkiboys . Looks like you brought us a fun one this time.
    I think I see where the difficulty is. The Conditional Split works as if each row stands on its own, but the conditions you bring are more aggregate-like.

    I'm not sure whether everything can be done at once, but it is often better to break up into steps anyway. I'm going to try to refactor some of this logic, tell me if this checks for you. "any" refers to any row in a set of [v_number].

    rule 1 covers [actual] for all 3 values of [Name] so we can simplify. None of the rules depend on [m_Name], so we can group by [V_number] only.
    in rule 2, the check on [actual] has been handled in rule 1, so it just becomes a check on [estimated] being null? Or is this paring important? Please double check my logic. Are any nulls allowed?

    • Invalid when any [actual] is null
    • Invalid when any [weight] is null
    • Invalid when any(?) estimated is null
    • Invalid when 1 < count distinct [party]
    • Invalid when 1 < count distinct [actual] where [name] = t_DateName

    Anyway, the solution I am trying is an aggregation, Grouping by only [V_Number] becaues you want to move entire sets of [V_Number]. The aggregation will be the rules. After that Conditional Split based upon join of the aggregates to the original data.
    In case I am wrong in my logic on the rules, I'll try to make countIf separate the values of [name] during aggregation.

    Was this answer helpful?


Your answer

Answers can be marked as 'Accepted' by the question author and 'Recommended' by moderators, which helps users know the answer solved the author's problem.