Why One Row Did Not Insert?

JAL 571 Reputation points
2022-09-04T01:00:00.663+00:00

About five years ago our company created an app to record the findings of a human evaluation.

A person examines a device, checks it against each criterion onscreen, and then marks it "Pass" or "Fail". Each criterion is one row in our DB.

The app hasn't changed much in the last five years. About 1.3 million evaluations done so far. About 50 million criterion-rows.

Typically it's a two-person process:

  • Evaluator-1
  • Evaluator-2 (a senior evaluator).

Evaluator-2 doesn't start fresh. Rather the rows of Evalutator-1 are first cloned (but marked as Evaluator-2).

After five years of success, suddenly we are seeing an occasional failure in the cloning process, where Evaluator-2 is mysteriously missing exactly one row. The cloning is done in a stored procedure that looks roughly like this:

Insert Into dbo.Workspace (PassOrFail, EvaluationID, Iteration)
Select PassOrFail, EvaluationID, 'Evaluator-2'
From dbo.Workspace where EvaluationID = @EvalID and Iteration = 'Evaluator-1'
-- The same SP ends with the following validation to verify that all rows got cloned:
Declare @count1 int = (Select count() From dbo.Workspace where EvaluationID = @EvalID and Iteration = 'Evaluation-1');
Declare @count2 int = (Select count(
) From dbo.Workspace where EvaluationID = @EvalID and Iteration = 'Evaluation-2');
If @count1 <> @count2 Throw 51000, N'Error in cloning', 1

  • That last step (the validation-by-count) has been in place for five years, and has never thrown an exception - until now - and it is always exactly ONE ROW that is missing. If we delete the cloned rows and re-run the app, all rows clone properly.

We tried to brainstorm on this - maybe another app is deleting one cloned row? But we can't find anything in our system that would do this.

Frankly, we're fresh out of theories. Any ideas, anyone?

Note:

  • The INSERT statement is not wrapped in a try-catch, nor in an explicit tran.
  • Seems xact_abort is Off. I suppose we should turn it on?
Transact-SQL
Transact-SQL
A Microsoft extension to the ANSI SQL language that includes procedural programming, local variables, and various support functions.
4,552 questions
{count} votes

6 answers

Sort by: Most helpful
  1. Erland Sommarskog 100.9K Reputation points MVP
    2022-09-04T09:41:13.35+00:00

    We can't debug code we don't see. But I like to add that beside that there is an error in the cloning code, the issue could also be that one more row was added to the source while the operation is running, and that is why you get the error in validation.

    But yes, using TRY-CATCH and XACT_ABORT ON is best practice. Whether it would address this issue, I don't know.

    0 comments No comments

  2. JAL 571 Reputation points
    2022-09-04T13:18:43.797+00:00

    Thanks guys for responding.

    "What transaction isolation level are you using?"

    At the top of the SP is this (not my decision and I don't think I will be allowed to change it).
    Set Transaction Isolation Level Read Uncommitted;

    "Do you mean that @count1 is 1 and @count2 is 0?"

    Typically Evaluator-1 has, say, 30 rows. But when the problem occurs, only 29 rows copied to the Evaluator-2 workspace. The least number of rows where this has happened is 9 rows for Evaluator-1, and then of course only 8 rows for Evaluator-2.

    0 comments No comments

  3. Erland Sommarskog 100.9K Reputation points MVP
    2022-09-04T13:33:02.047+00:00

    Set Transaction Isolation Level Read Uncommitted;

    Which basically means "I don't care if I get correct result or not".

    It is not clear to me what you are doing, since you only gave rough outline of the code. But if you are doing an INSERT and then a count, and you want to be sure that nothing changes, you need to have the either the serializable ro the snapshot isolation level. The former protects you against phantom reads, that is a row is inserted while the operation is running and before the SELECT COUNT runs. With SNAPSHOT you read from a version store to get the state of the database when the transaction started.

    With READ UNCOMMITTED all sorts of interesting thing can happen, like rows being read twice, or committed rows not being read at all, because there are concurrent operation occurring while the query is running.

    I don't think I will be allowed to change it).

    That sounds serious. Using READ UNCOMMITTED in production code without understanding the full implications is very careless. Even worse if you "you must change that". I would only agree to using that isolation level in production code with a special sign-off and a very good explanation of why it is needed, and a good analysis that there cannot be issues with it.

    Well, at least you have implemented a test if the operation succeeded. Now you also need to implement a retry for when the operation processes the incorrect number of rows.

    0 comments No comments

  4. JAL 571 Reputation points
    2022-09-04T14:02:15.573+00:00

    "We can't debug code we don't see."

    Thanks for wanting to help me but I can't show company code. I had to handwrite this version, renaming all the columns, to comply with company policy. But the SP really is pretty simple - it's pretty much the code that I showed you.

    I'm not expecting a necessarily "correct" answer - just hoping that experts like you would brainstorm for possibilities that I haven't thought of.

    "But I like to add that beside that there is an error in the cloning code..."

    If you're referring to any typos, that's probably because I hand-wrote the code for this post.

    "... the issue could also be that one more row was added to the source while the operation is running, and that is why you get the error in validation."

    Not likely that a row got added late. The Evaluator-1 rows come from an original Criterion table and, in each case, the correct number of Evaluator-1 rows exist with Pass or Fail. If a row got added late, the Pass/Fail column on the new row would be null. And we have plenty of validations preventing Evaluator-1 from submitting his work if he hasn't marked all, say, 30 rows. Only when he has finished all 30 rows does the evaluation go into the senior queue where Evaluator-2 can find it and begin working on it. Evaluator-2 assigns the evaluation to himself, and then TRIES to open a web page to begin working on it. That's when the cloning happens. In these cases, Evaluator-2 has never been able to open the page, not even once, because the cloning-SP throws the exception, "Error in cloning". Instead of a page containing 30 items, all he sees is an error message.

    Also, if a row got added late, we would see out-of-sequence numbers. The dbo.Workspace table has an integer identity column as primary key. We checked for numbers of out sequence but didn't find any,

    As I now recall, it was in virtue of sequencing that we pretty much ruled out the idea that Evaluator 2 lost a row at some point. To summarize:

    • Seems no row got added late.
    • No row got deleted.

    Therefore: Seems the cloning process simply failed to copy one of the rows.


  5. JAL 571 Reputation points
    2022-09-04T14:06:26.553+00:00

    When I said the provided code was "roughly" accurate, I was referring to

    • renaming the columns
    • omitting some columns irrelevant to this discussion

    The code I gave you is actually very accurate.

    0 comments No comments