MPI error in Azure Batch

Green, Jim 75 Reputation points
2024-10-31T15:30:54.37+00:00

In MPI-enabled tasks, we occasionally get this error:

Aborting: smpd on a3315966100000A failed to communicate with child smpd manager

If we restart the exact same task, it completes.

This seems to suggest that a node has terminated and is no longer communicating. The application normally catches and reports exceptions, but nothing is generated in these cases. I've tried various ways of terminating a node prematurely, but in those tests MPI still manages to report the failure.

What could we do to track down the cause of this? I don't know whether this is an application problem or just a Batch glitch that we have to live with.
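
One first step might be to check whether the failures cluster on particular nodes. Below is a minimal sketch that tallies the hostnames named in the error message across collected stderr text. It assumes the message format quoted above is stable; the function name `tally_failing_hosts` and the sample log lines are hypothetical, and how the stderr text is gathered from the tasks is left out.

```python
import re
from collections import Counter

# Pattern based on the error message quoted above (assumed to be stable).
SMPD_FAILURE = re.compile(
    r"smpd on (\S+) failed to communicate with child smpd manager"
)


def tally_failing_hosts(stderr_texts):
    """Count how often each hostname appears in the smpd failure message."""
    counts = Counter()
    for text in stderr_texts:
        counts.update(SMPD_FAILURE.findall(text))
    return counts


if __name__ == "__main__":
    # Hypothetical sample: stderr text collected from two task runs.
    logs = [
        "Aborting: smpd on a3315966100000A failed to communicate "
        "with child smpd manager",
        "task completed normally",
    ]
    for host, n in tally_failing_hosts(logs).most_common():
        print(host, n)
```

If the same hostname keeps appearing, that would point toward a node-level issue; an even spread would make an application or network cause more plausible.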

Thanks.

Azure Batch