Spot the defect (2/2): Custom Pipeline Component Sample "FixMsg" can fail without any apparent reason

Part 1 Part 2

In part 1 , I pointed out that the FixMsg sample has some issues. I am sure you all understood what is wrong.

What is wrong?

When Read() is called the first time with count = 10, the code did not yet run. We determine that we need to read 10 bytes (byteRead = 10) and we copy the data to the output buffer:

  {

     bytesRead = count > prependData.Length ? prependData.Length : count;

     Array.Copy(prependData, prependDataOffset, buffer, offset, bytesRead);

     if (bytesRead==prependData.Length)

       prependData = null; // prependData consumed

                       

     prependDataOffset += bytesRead;

     offset += bytesRead;

The following line then sets count to 0:


count -= bytesRead;

     ret += bytesRead;

  }

We attempt to read 0 byte from the original stream, which triggers an exception:

  bytesRead = stm.Read(buffer, offset, count); <-- Attempts to read 0 byte

So the current implementation assumes that the first read will fully consume the prependData buffer and there will be room in the output buffer for at least one byte coming from the original message.

How can we fix it?

The fix involves figuring out how much is left to read in prependData and making sure that always fill out the buffer and not much. We cannot assume any buffer size so we have to honor the request no matter what.

The following code snippet implements one possible way to fix this. It does not consider the appendData part. Only the prependData is dealt with. It is left as an exercise to the reader to implement the append part.

override public int Read(byte[] buffer, int offset, int count)

{

   int ret = 0;

   int bytesRead = 0;

   // Do we have data to prepend?

   if (prependData != null)

   {

      // Compute how much prepend data is left

      int availablePrependData = prependData.Length - prependDataOffset;

      bytesRead = count > availablePrependData ? availablePrependData : count;

      Array.Copy(prependData, prependDataOffset, buffer, offset, bytesRead);

      prependDataOffset += bytesRead;

      offset += bytesRead;

      count -= bytesRead;

      ret += bytesRead;

      // Did we consume all prepend data?

      if (prependDataOffset >= prependData.Length)

              prependData = null;

   }

   // Did we fill the buffer with prepend data?

   if (count > 0)

   {

      bytesRead = stm.Read(buffer, offset, count);

      ret += bytesRead;

   }

   return ret;

}

Note that size calculations are performed with signed integers so integer arithmetic overflow is possible. Fixing this is left as an exercise to the reader.

Why is the sample so complex?

In my previous post , Patrick asks why the "FixMsg" sample is so complicated and also suggests the following pseudo-code:

stream a···= BodyPart.GetOriginalDataStream
stream b···= converttostream(prependdata)
stream c···= converttostream(appenddata)
stream new·= nothing

write new b
write new a
write new c

OutBodyPart.Data = new

This code creates a new stream, populates it with prependData then copies the original message and the adds appendData. The new stream is then returned to BizTalk for further processing.

From a logical point of view, this code does the job. It modifies the stream the way we expect it to be. However, in production, this code should be avoided. Let's analyze the performance of this code.

Every time a new message comes in, Patrick's code creates a new stream and fully populates it. This means that the code forces the messaging engine to fully read the incoming message and copy it into some in-memory stream. If the message is large, we just allocated a large amount of memory which we do not really need (at least for this example).

Also, depending at which stage your custom component is running, you might not need to read the whole document. Maybe your component is in a pipeline right before a disassembler that only needs the first 2Kb of data and discards everything else. In that case, you allocated resources that you will never use. Moreover, you payed for the full initialization of a data structure that you did not need: reading the full stream may be expensive.

There is at least another thing you should consider. The messaging engine exposes messages as streams because in theory, streams do not need to be fully populated when you read from them. Let's assume I have a 40Mb message. The messaging engine can just offer a stream object that knows the current position in the data and the location to read from. This consumes a very very small amount of memory and delays expensive operations (reading the actual data) as lon as possible. If your component reads chunks of 4Kb, only 4Kb are actually being allocated (by your component) and if you tune the buffer carefully, you can achieve impressive throughput.

This is why the sample does not create a temporary stream. It attempts to demonstrate a best practice: you should avoid building a fully populated stream as much as possible for performance reasons. The sample maintains three "data sources" and reads from them, switching from one to the other when hitting End Of File. This way, the allocated memory is very small, we do not force the messaging engine to reads the message and we move only the requested amount of data.

Of course, for a proof of concept or quick prototyping, you can implement a component by creating a temporary stream and fully populate it. However, if you are attempting to prove that you can handle load, I suggest switching from streams as the sample does instead.