Effects of incorrect QoS policies: A story behind a slow file copy...

Hi there,

In this blog post, I’ll talk about another network trace analysis scenario.

The problem was that some Windows XP clients were copying files from a NAS device very slowly compared to others. As one of the most useful logs to troubleshoot such problems, I requested a network trace to be collected on a problem Windows XP client. Normally it’s best to collect simultaneous network traces but it was a bit diffcult to collect a trace at the NAS device side so we were limited to a client side trace.

Before I start explaining how I got to the bottom of the issue, I would like to provide you with some background on how files are read by Windows via SMB protocol so that you’ll better understand the resolution part:

Windows XP and Windows 2003 use SMB v1 protocol for remote file system access (like creating/reading/writing/deleting/locking files over a network connection). Since it was a file read from the remote server in this scenario, the following SMB activity would be seen between the client and server:

Client Server

===== ======

The client will open the file at the server first:

SMB Create AndX request ---->

                                          <---- SMB Create AndX response

Then the client will send SMB Read AndX requests to retrieve blocks of the file:

SMB Read AndX request ----> (61440 bytes)

                                          <---- SMB Read AndX response

SMB Read AndX request ----> (61440 bytes)

                    <---- SMB Read AndX response

SMB Read AndX request ----> (61440 bytes)

                                          <---- SMB Read AndX response

SMB Read AndX request ----> (61440 bytes)

                                          <---- SMB Read AndX response

...

Note: SMBv1 protocol could request 61 KB of data at most in one SMB Read AndX request.

After this short overview, let’s get back to the original problem and analyze the packets taken from the real network trace:

Frame# Time delta Source IP Destination IP Protocol Information

===== ======== ========= ========== ====== ========

....

59269 0.000000 10.1.1.1 10.1.1.2 SMB Read AndX Request, FID: 0x0494, 61440 bytes at offset 263823360

59270 0.000000 10.1.1.2 10.1.1.1 SMB Read AndX Response, 61440 bytes

59271 0.000000 10.1.1.2 10.1.1.1 TCP [Continuation to #59270] microsoft-ds > foliocorp [ACK] Seq=65993793

59272 0.000000 10.1.1.2 10.1.1.1 TCP [Continuation to #59270] microsoft-ds > foliocorp [ACK] Seq=65995249

59273 0.000000 10.1.1.2 10.1.1.1 TCP [Continuation to #59270] microsoft-ds > foliocorp [ACK] Seq=65996705

...

59320 0.000000 10.1.1.2 10.1.1.1 TCP [Continuation to #59270] microsoft-ds > foliocorp [ACK] Seq=66049121

59321 0.000000 10.1.1.2 10.1.1.1 TCP [Continuation to #59270] microsoft-ds > foliocorp [ACK] Seq=66050577

59322 0.000000 10.1.1.2 10.1.1.1 TCP [Continuation to #59270] microsoft-ds > foliocorp [ACK] Seq=66052033

59323 0.000000 10.1.1.1 10.1.1.2 TCP foliocorp > microsoft-ds [ACK] Seq=67600 Ack=66053489 Win=65535

59325 0.406250 10.1.1.2 10.1.1.1 TCP [Continuation to #59270] microsoft-ds > folioc [PSH, ACK]Seq=66053489

59326 0.000000 10.1.1.1 10.1.1.2 SMB Read AndX Request, FID: 0x0494, 61440 bytes at offset 263884800

59327 0.000000 10.1.1.2 10.1.1.1 SMB Read AndX Response, 61440 bytes

59328 0.000000 10.1.1.2 10.1.1.1 TCP [Continuation to #59327] microsoft-ds > foliocorp [ACK] Seq=66055297

...

Now let’s take a closer look at some related frames:

Frame# 59269 => Client requests the next 61 KB of data at offset 263823360 from the file represented with FID 0x0494 (this FID is assigned by server side when the file is first opened/created)

Frame# 59270 => Server starts sending 61440 bytes of data back to the client in SMB Read AndX response.

Frame# 59271 => The remaining parts are sent in 1460 bytes chunks because of TCP MSS negotiated, it’s generally 1460 bytes. (like frame# 59272, frame# 59273 etc)

The most noticable thing in the network trace was to see many such 0.4 seconds delays (like the one that we see at frame #59325). Those 0.4 seconds delays were always present at the last fragment of 61 KB of data returned by the server.

Normally 0.4 seconds could be seen as a very low delay but considering that the client will send n x SMB Read Andx request to the server to read the file it will quickly be clear that 0.4 seconds of delay is huge (for example, the client needs to send 1000 SMB Read AndX requests to read a 64 MB file)

Generally we’re used to see some delays in network traces due to packet retransmissions (due to packet loss) or link transfer delayes etc. But seeing a constant delay of 0.4 seconds in every last fragment of a 61 KB block made me suspect that a QoS implementation was in place somewhere between the client and server. By delaying every read request about 0.4 seconds, actually file copy is being slowed down on purpose: traffic shaping/limiting.

Since we didn’t have a network trace collected at the NAS device side, we couldn’t check if the QoS policy was in effect at the NAS device side or on a network device running in between the two. (we checked the client side and there was no QoS configuration in place). After further checking the network devices, it turned out that there was an incorrectly configured QoS policy on one of them. After making the required changes, the problem was resolved...

Hope this helps

Thanks,

Murat