Periods of prolonged inactivity and frequent MessageLockLostException in QueueClient

Lakshmi Kanthan M 26 Reputation points
2020-07-22T06:36:27.15+00:00

Hello everyone,

Background
We have a data transfer solution with Azure Service Bus as the message broker. We are transferring data from x datasets through x queues, with x dedicated QueueClients as senders. Some senders publish messages at the rate of one message every two seconds, while others publish one every 15 minutes.

The application on the data source side (where senders are) is working just fine, giving us the desired throughput.

On the other side, we have an application with one QueueClient receiver per queue with the following configuration:

  • maxConcurrentCalls = 1
  • autoComplete = true (when receive mode = RECEIVEANDDELETE) and false (when receive mode = PEEKLOCK) - for some receivers we want the messages preserved in the Service Bus queue if the receiver shuts down unexpectedly
  • maxAutoRenewDuration = 3 minutes (lock duration on all queues = 30 seconds)
  • an ExecutorService with a single thread (a rough setup sketch follows this list)
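
For reference, each receiver is wired up roughly as below. The connection string and queue name are placeholders, and MyMessageHandler stands in for our IMessageHandler implementation (its onMessageAsync() is shown further down); the exact code may differ slightly.

import java.time.Duration;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import com.microsoft.azure.servicebus.MessageHandlerOptions;
import com.microsoft.azure.servicebus.QueueClient;
import com.microsoft.azure.servicebus.ReceiveMode;
import com.microsoft.azure.servicebus.primitives.ConnectionStringBuilder;

public class ReceiverSetup {
    public static void main(String[] args) throws Exception {
        // receiver-side QueueClient (PEEKLOCK variant shown; RECEIVEANDDELETE is analogous)
        QueueClient receiveClient = new QueueClient(
                new ConnectionStringBuilder("<namespace-connection-string>", "<queue-name>"),
                ReceiveMode.PEEKLOCK);

        // the single-threaded executor that runs the registered message handler
        ExecutorService executor = Executors.newSingleThreadExecutor();

        // maxConcurrentCalls = 1, autoComplete = false (PEEKLOCK), maxAutoRenewDuration = 3 minutes
        MessageHandlerOptions options = new MessageHandlerOptions(1, false, Duration.ofMinutes(3));

        // MyMessageHandler is a placeholder for our IMessageHandler implementation
        receiveClient.registerMessageHandler(new MyMessageHandler(receiveClient), options, executor);
    }
}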

The MessageHandler registered with each of these receivers does the following:

public CompletableFuture<Void> onMessageAsync(final IMessage message) {

    // deserialize the message body
    final CustomObject customObject = (CustomObject) SerializationUtils.deserialize(message.getMessageBody().getBinaryData().get(0));

    // run processDB1() and processDB2() asynchronously
    final List<CompletableFuture<Boolean>> processFutures = new ArrayList<CompletableFuture<Boolean>>();

    processFutures.add(processDB1(customObject));  // processDB1() returns CompletableFuture<Boolean>
    processFutures.add(processDB2(customObject));  // processDB2() returns CompletableFuture<Boolean>

    // join both CompletableFutures to get the result Booleans
    final List<Boolean> results = CompletableFuture.allOf(processFutures.toArray(new CompletableFuture[processFutures.size()]))
        .thenApply(ignored -> processFutures.stream()
            .map(CompletableFuture::join)
            .collect(Collectors.toList()))
        .join();

    if (results.contains(false)) {
        // dead-letter the message if any result is false
        return getQueueClient().deadLetterAsync(message.getLockToken());
    } else {
        // complete the message otherwise
        return getQueueClient().completeAsync(message.getLockToken());
    }
}

We tested with the following scenarios:

Scenario 1 - receive mode = RECEIVEANDDELETE, message publish rate: 30/minute
Expected Behavior
The messages should be received continuously with a constant throughput (which need not necessarily be the throughput at source, where messages are published).

Actual behavior
We observe random, long periods of inactivity from the QueueClient, ranging from minutes to hours: there are no outgoing messages from the Service Bus namespace (as observed on the Metrics charts) and there are no consumption logs for the same time periods!

Scenario 2 - receive mode = PEEKLOCK, message publish rate: 30/minute
Expected Behavior

The messages should be received continuously with a constant throughput (which need not necessarily be the throughput at source, where messages are published).

Actual behavior
We keep seeing MessageLockLostException constantly, starting 20-30 minutes into the run of the application.

We tried the following:

  • we reduced the prefetch count (from 20 * processing rate, as mentioned in the Best Practices guide) to a bare minimum (even 0 in one test cycle), to reduce the number of messages that are locked for the client
  • increased the maxAutoRenewDuration to 5 minutes; our processDB1() and processDB2() take no more than a second or two in almost 90% of cases, so I do not think the 30-second lock duration or the maxAutoRenewDuration are the issue here (a sketch of how we applied these two settings follows this list)
  • removed the blocking join on the CompletableFutures and made the processing synchronous.
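
Roughly, the first two tweaks map to the calls below (receiveClient, handler and executor are the same objects as in the setup sketch above; this is an illustration of the settings rather than our exact test code, and it assumes setPrefetchCount() is available directly on the QueueClient, as we believe it is in SDK 3.x):

import java.time.Duration;
import java.util.concurrent.ExecutorService;

import com.microsoft.azure.servicebus.IMessageHandler;
import com.microsoft.azure.servicebus.MessageHandlerOptions;
import com.microsoft.azure.servicebus.QueueClient;

public class ReceiverTweaks {
    static void registerWithTweaks(QueueClient receiveClient, IMessageHandler handler, ExecutorService executor) throws Exception {
        // prefetch 0: the receiver does not fetch and lock messages ahead of processing
        receiveClient.setPrefetchCount(0);

        // maxConcurrentCalls = 1, autoComplete = false, maxAutoRenewDuration raised from 3 to 5 minutes
        MessageHandlerOptions options = new MessageHandlerOptions(1, false, Duration.ofMinutes(5));
        receiveClient.registerMessageHandler(handler, options, executor);
    }
}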

None of these tweaks helped us fix the issue. What we observed is that the COMPLETE or RENEWMESSAGELOCK operations are the ones throwing the MessageLockLostException.

We need help with finding answers for the following:

  • why are there long periods of inactivity of the QueueClient in scenario 1?
  • how do we know the MessageLockLostExceptions are thrown because the locks have indeed expired? We suspect the locks cannot be expiring that soon, as our processing completes within a second or two. Disabling prefetch also did not solve this for us.

Versions and Service Bus details

  • Java - openjdk-11-jre
  • Azure Service Bus namespace tier: Standard
  • Java SDK version - 3.4.0

Accepted answer
  1. MayankBargali-MSFT 70,466 Reputation points
    2020-07-23T08:34:12.813+00:00

    Hi @Lakshmi Kanthan M

    For Scenario 1
    Once you observed that your client application had stopped pulling messages from the queue, how did the issue get mitigated? Did you restart the application? Did you observe any exception/error when your client application stopped consuming messages? I am assuming that the sender application was still sending messages during that time and you could see the active message count increasing for that queue.

    There could be multiple reasons why the client was not able to pull messages from the queue:
    • There might be an issue at the client end causing this behavior, such as updates, high utilization, etc. Where is the client application hosted, and have you checked the client for issues during those windows?
    • There can be server-busy/internal errors at the Service Bus end; your client would then retry (assuming you have implemented retry logic in your code), and if the retries are exhausted, the exception is thrown at the client end and your application stops.

    For Scenario 2:
    A lock-lost exception is generally a client-side error. There are two broad causes: either there was an issue at the Service Bus end (high utilization, noisy neighbor, etc.), or the client was too busy to complete the request in a timely manner (or the network was very slow).

    If you are observing lock-lost exceptions this frequently, and since you mention that your business logic takes 1 or 2 seconds, I suggest you add logging in your application code to confirm this. Add a timer around the business logic that runs between acquiring the lock and calling complete on the message. The log should capture the message ID your application is consuming, the timeframe in UTC, and the lock token acquired on that message, along with the execution time of each statement (business logic).
    You can also add a timer before and after the completeAsync call to check how long it takes for the message to be marked as completed. I also suggest you remove prefetch while performing this test.
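
    As a rough illustration only, the instrumentation could look like the snippet below. processMessage() is a placeholder for your processDB1()/processDB2() logic, getQueueClient() is the same accessor used in your handler, and the log format is just an example; if lockedUntilUtc is already in the past by the time the completion runs, the lock really did expire.

    import java.time.Duration;
    import java.time.Instant;
    import java.util.concurrent.CompletableFuture;

    import com.microsoft.azure.servicebus.IMessage;

    public CompletableFuture<Void> onMessageAsync(final IMessage message) {
        final Instant start = Instant.now();

        // record which message and which lock this run is working with
        System.out.printf("start messageId=%s lockToken=%s lockedUntilUtc=%s startUtc=%s%n",
                message.getMessageId(), message.getLockToken(), message.getLockedUntilUtc(), start);

        // placeholder for the existing processDB1()/processDB2() business logic
        final boolean success = processMessage(message);
        System.out.printf("processed messageId=%s businessLogicMs=%d%n",
                message.getMessageId(), Duration.between(start, Instant.now()).toMillis());

        // time the settle call (complete or dead-letter) separately
        final Instant beforeSettle = Instant.now();
        final CompletableFuture<Void> settle = success
                ? getQueueClient().completeAsync(message.getLockToken())
                : getQueueClient().deadLetterAsync(message.getLockToken());

        return settle.whenComplete((ignored, error) ->
                System.out.printf("settled messageId=%s settleMs=%d error=%s%n",
                        message.getMessageId(),
                        Duration.between(beforeSettle, Instant.now()).toMillis(),
                        error));
    }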

    This needs in-depth troubleshooting from both the client and the Service Bus end. Once you have the above information and have added more logging to your application, please send an email with the subject line “Attn:Mayank” to AzCommunity[at]Microsoft[dot]com referencing this thread, along with your Azure subscription ID, Service Bus name, queue name, receiver application details, and the exception message, message ID, lock token, and timeframe for the individual messages for which you observed the issue.

