SQL Server test databases fail when multiple databases open

Paul Tobey 1 Reputation point
2022-03-22T20:53:30.793+00:00

I'm working on an intermittent issue experienced with:

  • Entity Framework Core 3.1, C#
  • SQL Server 2019
  • Unit tests (xUnit)

Tests:
* About 2000 tests total
* About 150 tests which exercise system elements using SQL Server database, each test creating a new database using a GUID-based database name for uniqueness. When a given test starts, it creates the database it needs and then runs some operations against it, deleting the database when complete
* Test failures appear random. Total failure set is between 1-10 of 150 and the same tests are not reliably failing, nor is any single test immune

Failure:
The failure is always the same, although the name of the database (of course) varies, and the test triggering the failure is different each time as noted above. The failure appears to occur either during creation or migration of the database. The attempt to open/create the database seems to fail after trying multiple times (as shown in the SQL Server log further below).

[xUnit.net 00:05:35.37]       System.InvalidOperationException : An exception has been raised that is likely due to a transient failure. Consider enabling transient error resiliency by adding 'EnableRetryOnFailure()' to the 'UseSqlServer' call.
[xUnit.net 00:05:35.37]       ---- Microsoft.Data.SqlClient.SqlException : A connection was successfully established with the server, but then an error occurred during the login process. (provider: Shared Memory Provider, error: 0 - No process is on the other end of the pipe.)

The SQL Server log (SQL Server Express 2019 in this case), shows this each time (other I/O for simultaneously-running tests removed):

03/22/2022 13:27:50,Logon,Unknown,Login failed for user 'DOMAIN\ptobey'. Reason: Failed to open the explicitly specified database 'zFlowLogsRepositoryTest_Sqld00bb232661c400991e4397fda214c1e'. [CLIENT: <local machine>]
03/22/2022 13:27:50,Logon,Unknown,Error: 18456<c/> Severity: 14<c/> State: 38.
03/22/2022 13:27:49,Logon,Unknown,Login failed for user 'DOMAIN\ptobey'. Reason: Failed to open the explicitly specified database 'zSensorRepositoryTest_Sql7d2bacd79f3c4a57b02f4f4aedc9aea9'. [CLIENT: <local machine>]
03/22/2022 13:27:49,Logon,Unknown,Error: 18456<c/> Severity: 14<c/> State: 38.
03/22/2022 13:27:49,Logon,Unknown,Login failed for user 'DOMAIN\ptobey'. Reason: Failed to open the explicitly specified database 'zSensorRepositoryTest_Sql7d2bacd79f3c4a57b02f4f4aedc9aea9'. [CLIENT: <local machine>]
03/22/2022 13:27:49,Logon,Unknown,Error: 18456<c/> Severity: 14<c/> State: 38.
03/22/2022 13:27:47,spid52,Unknown,Parallel redo is shutdown for database 'zFlowLogsRepositoryTest_Sqlec352b5994904f9bb2cb7f563ad1f26b' with worker pool size [4].
03/22/2022 13:27:47,spid52,Unknown,Parallel redo is started for database 'zFlowLogsRepositoryTest_Sqlec352b5994904f9bb2cb7f563ad1f26b' with worker pool size [4].
03/22/2022 13:27:47,spid52,Unknown,Starting up database 'zFlowLogsRepositoryTest_Sqlec352b5994904f9bb2cb7f563ad1f26b'.
03/22/2022 13:27:46,spid65,Unknown,Parallel redo is shutdown for database 'zSatelliteRepositoryTests_Delete267dd7a8e81240f39de88da975d152f2' with worker pool size [4].
03/22/2022 13:27:46,spid65,Unknown,Parallel redo is started for database 'zSatelliteRepositoryTests_Delete267dd7a8e81240f39de88da975d152f2' with worker pool size [4].
03/22/2022 13:27:45,spid65,Unknown,Starting up database 'zSatelliteRepositoryTests_Delete267dd7a8e81240f39de88da975d152f2'.
...
03/22/2022 13:27:45,Logon,Unknown,Login failed for user 'DOMAIN\ptobey'. Reason: Failed to open the explicitly specified database 'zFlowElementRepositoryTest_Sql8999d59d52534d65a85054dac805af14'. [CLIENT: <local machine>]
03/22/2022 13:27:45,Logon,Unknown,Error: 18456<c/> Severity: 14<c/> State: 38.

So Error: 18456, Severity: 14, State: 38 basically seems to indicate a login failure. However, the same user is opening all of the databases (all 150) and only 1-10 are failing in this way, so it's not a real login issue. As you can see Windows identity is being used, if that helps identify the issue. There can be anywhere from 4-5 to 30-40 databases open at the same time, depending on the state of the corresponding tests and which tests are active. xUnit is managing when tests run so it might run a bunch of non-SQL tests, then mix in a few SQL tests, then mostly SQL tests (as the faster non-SQL tests complete).

I've limited which pieces of which tests can be running at the same time without impact on the result; I've locked access to certain parts of the code so that only one or only a few simultaneous operations can occur without impact on the result. The only reliable improvement is running only one database test at a time, generally completing all tests successfully. I'm looking at possible issues with the domain controller identity checking but this seems a stretch. Help!

Entity Framework Core
Entity Framework Core
A lightweight, extensible, open-source, and cross-platform version of the Entity Framework data access technology.
694 questions
SQL Server
SQL Server
A family of Microsoft relational database management and analysis systems for e-commerce, line-of-business, and data warehousing solutions.
12,642 questions
{count} votes

7 answers

Sort by: Most helpful
  1. Paul Tobey 1 Reputation point
    2022-03-22T22:52:07.24+00:00

    @Erland Sommarskog

    So it is only the first 10 tests that fail? Or rather any of the first ten? Does the tests proceed to run remaining tests once there has been a failure, or does it give up?

    No, not the first or last or any pattern that I can see. 1-10 tests somewhere in the full set fail. It continues running tests until all have run, passed or failed.

    Does these ten initial tests all run in the same database?

    Every test is running aginst its own unique database; they're very isolated from one another. If you're familiar with xUnit, each test is either a [Fact] or a [Theory]/[InlineData] so they all run on a different instance of their test class.

    When the script creates the database does it wait for the database creation complete? Or does that happen on an asynchronous thread?

    The database creation precedes execution of each test. Database creation occurs in the test class constructor and the [Fact]/[Theory] isn't executed until the constructor completes.

    I ask these questions, because the error messages suggest that the database is not ready to access yet.

    Understood, and thanks!


  2. Paul Tobey 1 Reputation point
    2022-03-23T14:49:00.537+00:00

    @Erland Sommarskog Thanks!

    I guess I should draw diagrams but the steps go something like this and are performed sequentially (step 1 completes before step 2 begins and step 2 completes before step 3 begins) within each of the 150 tests.

    1. Test class encapsulates all database initialization operations. When an instance of this class is created for a single test of the 150, a new database is built with a unique name (chosen using an GUID so there's zero chance of a name conflict). Starting conditions for the test are configured here (maybe some seeding of the database, etc.) All of the tables, stored procedures, types are created (just as they would be for an empty production database). When this step is complete we essentially have a production database (other than the name). All initialization steps, stored procedures, etc. carefully avoid directly accessing the database by name so there's no interference with each other and no use of the production database's actual name.
    2. Once the class instance is built (and the database constructed), one test is run against that database. That test might do some CRUD operations checking that our code does the right thing. There are 150 different versions of this step each testing a different aspect of our code. They all run in the context of a database opened specifically for them above and destroyed when their testing is complete. The tests are all synchronous in the sense that when they check the results, the database is in a known state (or at least there's a state we expect and we check that against the actual state as part of the test); there are no triggers or other async operations running in the database that might leave it in an unexpected state with respect to the test.
    3. Once the single test sequence in step 2 for this unique database is completed, either passing or failing, the database is torn down and the result logged

    This sequence is performed for each of the 150 tests mentioned and the tests are executed in parallel with each other. If all 150 tests were active at once (and they could be), you'd have 150 databases open at the same time, each in a different state based on where they are during construction, testing or destruction of the database. They all have the same schema but the state differs.

    The intermittent failure occurs during between one and ten of the tests, during step 1. The database for that test appears not to be created and this confusing error is returned. The other tests, 140+ work fine and give the passing results expected.

    0 comments No comments

  3. Erland Sommarskog 100.8K Reputation points MVP
    2022-03-23T22:11:12.1+00:00

    Thanks for the details. I think I'm starting to get an understanding of the setup.

    You say that it always step 1 that fails, and this step and generic of the test. No surprise then that any test could fail. I should also say that I was expecting that the failure was in step 1 given the error messages.

    Have you investigated where in step 1, the failure occurs? It seems that this step perform quite a few action. I would expect that the failure on the first connection after having created the database. Again, this is based on the error message.

    It seems that the issue is that the database is not ready for you when you attempt to connect to it. SQL Server is not really geared for this scenario. That is, creating a database is assumed to be an occasional event, not 150 being created in parallel. So it may be that you simply have to reduce the number of threads that run in parallel. Or have a startup part that create all 150 databases in sequence. They would just be created as empty, and population could still be done in parallel.

    Alternatively, when you get this error, you will need to have some sort of retry, either coming back to this database a couple of seconds later. Or just drop it and start over.

    0 comments No comments

  4. Paul Tobey 1 Reputation point
    2022-03-23T22:45:23.93+00:00

    @Erland Sommarskog Unfortunately that would involve debugging through EF Core code, so no, I don't have a sense of more-precisely where it fails. The database does not seem to exist yet when the error is thrown, so initial contact with SQL Server seems most-likely. We call the Migrate() method of our database context and EF Core takes it from there. It creates the database if necessary, checks any migrations needed bringing schema up-to-date with code, and applies any which are missing. When Migrate() returns the database is in a working state and we can modify it directly with suitable SQL calls or via the EF Core ORM elements. Basically all I know is that something in the most-complex part of the process throws a non-informative exception from the SQL client code...intermittently! :-(

    Yes, that could be; I can't find any evidence that this is a known limitation of EF Core and the error message from SQL Server is not mentioned in the context of multiple clients connecting, via TCP sockets or whatever to the server. I've used semaphores limiting the active number of Migrate() calls down to as few as five (no more than five tests in that state, migration, at a time), without improving the result noticeably, one or more tests still failing. Also based on the SQL Server logs, it looks like either SQL Server itself or the client code in EF Core is retrying the open attempt, unsuccessfully.

    My best result so far, better on my development machine but not yet verified on our build/test environment, is changing from Windows authentication to SQL Server authentication. On my machine, the error no longer appears after switching the server to mixed auth mode, enabling the sa account, and changing the test connection string using sa rather than Windows identity (all 150 operations work the same but use the sa identity). Not much of a confidence-builder that it won't reappear later but maybe a work-around.

    Much appreciate your patience and brainstorming!


  5. Paul Tobey 1 Reputation point
    2022-03-25T15:13:51.387+00:00

    @Erland Sommarskog Thanks for traveling the path with me. When I have a free week I'll correlate with the EFCore database code (which I believe is using the SQL Server client library), trying to get a deeper understanding of the failure point.

    0 comments No comments