Best way to cleanse huge data

Radi Soufan 6 Reputation points
2021-07-17T11:52:13.823+00:00

I have a huge table in a database, around 40 million records, and I have to check that all the data in the fields are correct based on some business rules.
Let us assume this is my huge table, and the rules are:

  1. Only US can have [national_id_type]=NID; others must be VISA
  2. Only US with NA can own a house; other nationalities with VISA can't own a house

[image: gcdRL.png]

What I think:

  1. First, land the table into a landing table to preserve the original data
  2. Rewrite the error into the column to display the errors and their count in a Power BI dashboard so the quality department can fix them
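The landing step above can be sketched as a single SELECT ... INTO that copies the data while rewriting invalid values as error messages. This is only a sketch: dbo.source_table, dbo.landing_table, and the customer_id column are assumed names, not from the question.

```sql
-- Sketch only: copy the source into a landing table, replacing invalid
-- values with the error messages for the two business rules.
-- dbo.source_table, dbo.landing_table, and customer_id are assumptions.
SELECT customer_id,
       nationality,
       CASE WHEN nationality <> 'US' AND NATINAL_ID_TYPE = 'NID'
            THEN '99\US only can have [national_id_type]=NID'
            ELSE NATINAL_ID_TYPE END AS NATINAL_ID_TYPE,
       CASE WHEN nationality <> 'US' AND NATINAL_ID_TYPE <> 'NA' AND own_house = 'YES'
            THEN '99\US only with NA can own house'
            ELSE own_house END AS own_house
INTO   dbo.landing_table
FROM   dbo.source_table;
```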
[image: https://i.stack.imgur.com/gcdRL.png]

What I will do is:

    CASE WHEN nationality <> 'US' AND NATINAL_ID_TYPE = 'NID'
         THEN '99\US only can have [national_id_type]=NID'
         ELSE NATINAL_ID_TYPE END AS NATINAL_ID_TYPE,
    CASE WHEN nationality <> 'US' AND NATINAL_ID_TYPE <> 'NA' AND own_house = 'YES'
         THEN '99\US only with NA can own house'
         ELSE own_house END AS own_house

Is this an efficient way to detect errors in 40 million records x 70 columns? I think this will be very slow...

Can you give me some tips for surfacing the errors for dashboarding and cleansing with good performance?

Thank you

SQL Server
A family of Microsoft relational database management and analysis systems for e-commerce, line-of-business, and data warehousing solutions.
Transact-SQL
A Microsoft extension to the ANSI SQL language that includes procedural programming, local variables, and various support functions.

2 answers

  1. Erland Sommarskog 111.6K Reputation points MVP
    2021-07-17T22:07:04.833+00:00

    Well, you will have to scan that big table; there is no way around that.

    It is not clear to me whether you are only going to look for errors, or also correct them.

    Sometimes it can be a good idea to break up operations like this into batches, but I'm not sure that it is worth it here. That depends on how many rows you will need to update - if you are updating at all.

    Although, if you need to do this while the table is being accessed by other users, it's slightly different.
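    A batched update along those lines might look like the sketch below. dbo.landing_table is an assumed name, the batch size of 10,000 is arbitrary, and the condition implements rule 1 from the question:

    ```sql
    -- Sketch: flag rule-1 violations in batches so each transaction stays
    -- small. dbo.landing_table is an assumed name; the error text is from
    -- the question.
    DECLARE @rows int = 1;

    WHILE @rows > 0
    BEGIN
        UPDATE TOP (10000) dbo.landing_table
        SET    NATINAL_ID_TYPE = '99\US only can have [national_id_type]=NID'
        WHERE  nationality <> 'US'
          AND  NATINAL_ID_TYPE = 'NID';   -- flagged rows no longer match

        SET @rows = @@ROWCOUNT;
    END;
    ```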


  2. MelissaMa-MSFT 24,201 Reputation points
    2021-07-19T02:40:24.843+00:00

    Hi @Radi Soufan ,

    Welcome to Microsoft Q&A!

    I found that the two image links you provided are the same; please check and provide the correct ones.

    What environment is your table or database in? If it is in production, you have to be very careful and test thoroughly before updating.

    You could try adding indexes on columns like nationality, NATINAL_ID_TYPE, and own_house.
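    For example, one nonclustered index covering the three rule columns might help; the index name and dbo.landing_table are assumptions, not from the question:

    ```sql
    -- Sketch: a nonclustered index covering the columns used by the
    -- rule checks (index and table names are assumptions).
    CREATE NONCLUSTERED INDEX IX_landing_rule_check
        ON dbo.landing_table (nationality, NATINAL_ID_TYPE, own_house);
    ```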

    Besides, you could try updating in batches of 1,000 or 10,000 rows.

    What is the recovery model of your database? If it is FULL, you also have to pay attention to the size of the transaction log.
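    You can check the current recovery model with a query against the sys.databases catalog view:

    ```sql
    -- Check the recovery model of the current database
    SELECT name, recovery_model_desc
    FROM   sys.databases
    WHERE  name = DB_NAME();
    ```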

    Please refer to the links below, which may be helpful to you.
    How to update large table with millions of rows in SQL Server?
    Best way to update huge table

    Best regards,
    Melissa


    If the answer is helpful, please click "Accept Answer" and upvote it.
    Note: Please follow the steps in our documentation to enable e-mail notifications if you want to receive the related email notification for this thread.

