Azure Data Factory Data Flow Not Correctly Joining Tables

Drew Nelson 31 Reputation points
2024-11-19T16:24:20.0333333+00:00

I am attempting to create an Azure Data Flow in which I join data from an Azure SQL Database to several different tables within a SQL Server database that's hosted on an Azure VM. Data Factory connects to both servers via Managed Private Endpoints.

I built the flow and noticed some issues, so I've simplified it down to just a single join across servers, comparing a single column in each table. Source1 is a view with about 196 rows. Source2 is a table with about 18,000 rows. I am comparing a column called CntrctCode, and I expect most or all of those values in Source1 to be in Source2. For now I am sinking to a temporary table I set up on the same DB as Source1, just to investigate this issue.

User's image User's image

When I do a data preview or run the data flow, I'm seeing only 85 of the 196 being found. I have spot-checked several of them by manually querying the database for Source2 and did indeed find them, so the dataflow join is not working properly.

I have debug settings configured to allow 20,000 records from Source2, I have also published this Data Flow and ran a pipeline that uses a data flow activity to run it, with the same results. I tried various combinations of settings on the Join action itself and saw no change, so currently anything other than what I have pictured is set to its default value.

Is there some configuration necessary that I'm missing? Or is this just broken?

Azure Data Factory
Azure Data Factory
An Azure service for ingesting, preparing, and transforming data at scale.
10,990 questions
{count} votes

2 answers

Sort by: Most helpful
  1. Amira Bedhiafi 27,131 Reputation points
    2024-11-19T21:29:21.93+00:00

    Verify if the CntrctCode column is case-sensitive in the comparison. Azure Data Factory joins are case-sensitive by default, so "abc" and "ABC" will not match. Ensure the case matches in both Source1 and Source2. You can use a Derived Column transformation to standardize the case (e.g., converting both columns to lowercase or uppercase).

    Ensure there are no leading or trailing spaces in the CntrctCode values. Use a Derived Column transformation to trim the values.

    Verify also if the CntrctCode columns in both Source1 and Source2 have the same data type. If one is a string and the other is an integer, or if one has a different character encoding, the join might fail. Use a Derived Column transformation to cast the column to the same type if necessary.

    Since Source1 is a view, confirm the view logic is accurate and returns the expected data. Manually query Source1 and ensure the 196 rows match the expected values for the join condition.

    In Data Preview, increase the data sampling size to ensure that all rows are included during debugging. For large datasets, ADF might only process a subset of rows during the preview

    What I suggest :

    1. Add a Derived Column transformation to:
      • trim and lowercase both CntrctCode columns.
    2. Explicitly cast the CntrctCode columns to the same data type using a Derived Column.
    3. Test the join using a subset of the data to confirm if specific rows fail.
    4. Increase debug limits and run the Data Flow again.

  2. Pinaki Ghatak 5,230 Reputation points Microsoft Employee
    2024-11-22T10:52:31.7566667+00:00

    Greetings, @Drew Nelson

    1. Check the data types of the columns you are joining on. Make sure they are the same data type in both tables. If they are not, you may need to convert one of the columns to match the other.
    2. Check for leading or trailing spaces in the values of the columns you are joining on. This can cause mismatches in the join operation. You can use the TRIM function to remove any leading or trailing spaces.
    3. Check the join type you are using. Make sure it is the correct type for your data. For example, if you are expecting all values in Source1 to be in Source2, you may want to use a left outer join.
    4. Check the join condition. Make sure it is correct and that it is using the correct columns.
    5. Check the data preview for both sources to make sure the data is being read correctly.
    6. Check the debug logs for any errors or warnings related to the join operation.
    7. Try running the join operation on a smaller subset of data to see if it produces the expected results.

    Let us know if this helps. Have a great weekend.


Your answer

Answers can be marked as Accepted Answers by the question author, which helps users to know the answer solved the author's problem.