External users cannot sign in to Lync 2010

Today I’ll be talking about a problem where the external Lync 2010 clients fail to sign in through Lync 2010 Edge server intermittently and rebooting the Edge server helps resolve the issue for sometime until the problem reappears. There could be multiple problems that might prevent an external user from signing in successfully like authentication issues, service problems, communication problems so on and so forth. In this instance it was a pure communications problem. Let me explain how I troubleshooted this issue now...

 

First of all we collected a number of logs from relevant systems to better understand at which point the sign-in process was failing. Taking corrective actions without logs is generally shooting in the dark so the best way to deal with a problem is to cover the problem with appropriate logs. Here is the action plan that was used to collect the logs when the problem occurred:

 

Action plan:

=========

1) Please wait until the issue occurs and then run the following commands on respective systems once the issue occurs:

 

=> On Edge server:

- Please start a new network trace

- Please start SIPStack & S4 logging

- Please run the following command from an elevated command prompt:

netsh trace start provider=Microsoft-Windows-TCPIP provider=Microsoft-Windows-Winsock-AFD tracefile=Edge_tcpip_winsock.etl maxsize=500 capture=yes

 

=> On Frontend server:

- Please start a new network trace

- Please start SIPStack & S4 logging

- Please run the following command from an elevated command prompt:

netsh trace start provider=Microsoft-Windows-TCPIP provider=Microsoft-Windows-Winsock-AFD tracefile=FE_tcpip_winsock.etl maxsize=500 capture=yes

 

=> On the internal client:

- Please enable Lync tracing from Lync client

 

=> On the external client:

- Please enable Lync tracing from Lync client

 

3) Now please reproduce the problem with signing in with a user from the external client

 

4) Now please stop logging on all endpoints. (network traces, Lync client side logging, Lync server side logging). You can also run the following command to stop netsh tracing on Edge and Frontend servers:

 

netsh trace stop

 

Troubleshooting:

==============

After the logs were collected as per the action plan, I analyzed them starting from external client side:

 

Note: Please note that all IP addresses, server names, user names etc are replaced for privacy purposes.

External client IP: 10.1.1.1

Access Edge IP: 172.16.1.1

Edge internal IP: 192.168.1.1

FE server IP: 192.168.3.10

1) Remote client to Edge server activity

 

- Remote client has SIP connectivity to Edge server but those sessions are closed shortly after the session is established

 

Example:

ip.addr==10.1.1.1 and tcp.port==62973

 

No. Time Delta Source Destination Protocol Length Info

   4298 2013-01-08 09:18:53.832 0.000 10.1.1.1 172.16.1.1 TCP 66 62973 > 443 [SYN] Seq=2046472569 Win=8192 Len=0 MSS=1380 WS=4 SACK_PERM=1

   4299 2013-01-08 09:18:53.832 0.000 172.16.1.1 10.1.1.1 TCP 62 443 > 62973 [SYN, ACK] Seq=4248548130 Ack=2046472570 Win=8192 Len=0 MSS=1460 SACK_PERM=1

   4313 2013-01-08 09:18:54.441 0.609 10.1.1.1 172.16.1.1 TCP 60 62973 > 443 [ACK] Seq=2046472570 Ack=4248548131 Win=16560 Len=0

   4316 2013-01-08 09:18:56.020 1.578 10.1.1.1 172.16.1.1 TLSv1 181 Client Hello

   4317 2013-01-08 09:18:56.037 0.016 172.16.1.1 10.1.1.1 TLSv1 1434 Server Hello, Certificate[Unreassembled Packet]

   4318 2013-01-08 09:18:56.037 0.000 172.16.1.1 10.1.1.1 TLSv1 1434 Ignored Unknown Record

   4319 2013-01-08 09:18:56.417 0.380 10.1.1.1 172.16.1.1 TCP 60 62973 > 443 [ACK] Seq=2046472697 Ack=4248550891 Win=16560 Len=0

   4320 2013-01-08 09:18:56.417 0.000 172.16.1.1 10.1.1.1 TLSv1 1434 Ignored Unknown Record

   4321 2013-01-08 09:18:56.417 0.000 172.16.1.1 10.1.1.1 TLSv1 560 Ignored Unknown Record

   4322 2013-01-08 09:18:56.787 0.370 10.1.1.1 172.16.1.1 TCP 60 62973 > 443 [ACK] Seq=2046472697 Ack=4248552777 Win=16560 Len=0

   4323 2013-01-08 09:18:56.791 0.004 10.1.1.1 172.16.1.1 TLSv1 380 Client Key Exchange, Change Cipher Spec, Encrypted Handshake Message

   4324 2013-01-08 09:18:56.796 0.004 172.16.1.1 10.1.1.1 TLSv1 113 Change Cipher Spec, Encrypted Handshake Message

   4327 2013-01-08 09:18:57.193 0.397 10.1.1.1 172.16.1.1 TLSv1 379 Application Data

   4328 2013-01-08 09:18:57.196 0.003 172.16.1.1 10.1.1.1 TLSv1 475 Application Data

   4333 2013-01-08 09:18:57.599 0.402 10.1.1.1 172.16.1.1 TLSv1 891 Application Data

   4337 2013-01-08 09:18:57.802 0.203 172.16.1.1 10.1.1.1 TCP 54 443 > 62973 [ACK] Seq=4248553257 Ack=2046474185 Win=64860 Len=0

   4506 2013-01-08 09:19:18.612 20.809 172.16.1.1 10.1.1.1 TLSv1 507 Application Data

   4509 2013-01-08 09:19:19.094 0.482 10.1.1.1 172.16.1.1 TCP 60 62973 > 443 [FIN, ACK] Seq=2046474185 Ack=4248553710 Win=15627 Len=0

   4510 2013-01-08 09:19:19.094 0.000 172.16.1.1 10.1.1.1 TCP 54 443 > 62973 [ACK] Seq=4248553710 Ack=2046474186 Win=64860 Len=0

   4511 2013-01-08 09:19:19.094 0.000 172.16.1.1 10.1.1.1 TCP 54 443 > 62973 [RST, ACK] Seq=4248553710 Ack=2046474186 Win=0 Len=0

 

Approximately 21 seconds later Edge server returns the response to Client (SIP server timeout) and that the connection is closed. I’ll explain from where that 21 seconds come from later.

 

=> Looking at the same activity from Edge server SIPStack log:

 

TL_INFO(TF_PROTOCOL) [0]05E0.0AAC::01/08/2013-09:18:57.600.02e733e9 (SIPStack,SIPAdminLog::TraceProtocolRecord:SIPAdminLog.cpp(125))$$begin_record

Trace-Correlation-Id: 3490147474

Instance-Id: 000AE0DB

Direction: incoming;source="external edge";destination="internal edge"

Peer: 10.1.1.1:62973

Message-Type: request

Start-Line: REGISTER sip:contoso.com SIP/2.0

From: <sip:username@contoso.com>;tag=eee093290f;epid=927c603558

To: <sip:username@contoso.com>

CSeq: 1 REGISTER

Call-ID: 20d82282a7d94f4fb66a0a6df0c0377f

Via: SIP/2.0/TLS 192.168.2.98:62973

Max-Forwards: 70

Contact: <sip:192.168.2.98:62973;transport=tls;ms-opaque=6b23bb065a>;methods="INVITE, MESSAGE, INFO, OPTIONS, BYE, CANCEL, NOTIFY, ACK, REFER, BENOTIFY";proxy=replace;+sip.instance="<urn:uuid:7C426EA7-E80D-53A0-95E6-ED07D8D0CD1E>"

User-Agent: UCCAPI/4.0.7577.4356 OC/4.0.7577.4356 (Microsoft Lync 2010)

Supported: gruu-10, adhoclist, msrtc-event-categories

Supported: ms-forking

Supported: ms-cluster-failover

Supported: ms-userservices-state-notification

ms-keep-alive: UAC;hop-hop=yes

Event: registration

Content-Length: 0

Message-Body:

 

 

TL_ERROR(TF_DIAG) [0]05E0.0AAC::01/08/2013-09:19:18.612.02e7707a (SIPStack,SIPAdminLog::TraceDiagRecord:SIPAdminLog.cpp(143))$$begin_record

LogType: diagnostic

Severity: error

Text: Message was not sent because the connection was closed

SIP-Start-Line: REGISTER sip:contoso.com SIP/2.0

SIP-Call-ID: 20d82282a7d94f4fb66a0a6df0c0377f

SIP-CSeq: 1 REGISTER

Peer: fepool1.contoso.com:5061

 

and the response back to client:

 

TL_INFO(TF_PROTOCOL) [0]05E0.0AAC::01/08/2013-09:19:18.613.02e7734c (SIPStack,SIPAdminLog::TraceProtocolRecord:SIPAdminLog.cpp(125))$$begin_record

Trace-Correlation-Id: 3490147474

Instance-Id: 000AE0E9

Direction: outgoing;source="local";destination="external edge"

Peer: 10.1.1.1:62973

Message-Type: response

Start-Line: SIP/2.0 504 Server time-out

From: <sip:username@contoso.com>;tag=eee093290f;epid=927c603558

To: <sip:username@contoso.com>;tag=586991893A5520DA53F83BE6C4EBF4F9

CSeq: 1 REGISTER

Call-ID: 20d82282a7d94f4fb66a0a6df0c0377f

ms-user-logon-data: RemoteUser

Via: SIP/2.0/TLS 192.168.2.98:62973;received=10.1.1.1;ms-received-port=62973;ms-received-cid=2E6400

Server: RTC/4.0

Content-Length: 0

Message-Body:

 

 

2) Communication between the Edge and Frontend servers:

 

=> Edge server side trace:

 

- Edge server try to connect to Frontend server at TCP port 5061 but it fails to do so after three attempts:

 

106631 11:18:57 AM 1/8/2013 713.5078058 RTCSrv.exe (1504) 192.168.1.1 192.168.3.10 TCP TCP:Flags=......S., SrcPort=50357, DstPort=5061, PayloadLen=0, Seq=2232090326, Ack=0, Win=8192 ( ) = 8192

106984 11:19:00 AM 1/8/2013 716.5111523 Idle (0) 192.168.1.1 192.168.3.10 TCP TCP:[SynReTransmit #106631]Flags=......S., SrcPort=50357, DstPort=5061, PayloadLen=0, Seq=2232090326, Ack=0, Win=8192 ( ) = 8192

107347 11:19:06 AM 1/8/2013 722.5093069 Idle (0) 192.168.1.1 192.168.3.10 TCP TCP:[SynReTransmit #106631]Flags=......S., SrcPort=50357, DstPort=5061, PayloadLen=0, Seq=2232090326, Ack=0, Win=8192 ( ) = 8192

 

=> Now let’s see the same activity from TCPIP/AFD drivers perspective:

 

- RTCSrv process on Edge server creates a stream type socket and binds to it (TCPIP driver assigns a local port 50375) and then requests for a connect to 192.168.3.10:5061

 

106612 11:18:57 AM 1/8/2013 713.4871675 RTCSrv.exe (1504) Wscore_MicrosoftWindowsWinsockAFD Wscore_MicrosoftWindowsWinsockAFD:socket: 0 (0x0): Process 0x0000000007AE3060 (0x00000000000005E0), Endpoint 0x0000000007A05010, Family 2 (0x2), Type SOCK_STREAM, Protocol 6 (0x6), Seq 1006 (0x3EE), Status Success

106613 11:18:57 AM 1/8/2013 713.4872298 RTCSrv.exe (1504) TCPIP_MicrosoftWindowsTCPIP TCPIP_MicrosoftWindowsTCPIP:TCP: endpoint (Family=IPV4 PID=1504 (0x5E0)) created.

106614 11:18:57 AM 1/8/2013 713.4872311 RTCSrv.exe (1504) TCPIP_MicrosoftWindowsTCPIP TCPIP_MicrosoftWindowsTCPIP:TCP: endpoint 0x00000000082D19B0 (Family=IPV4, PID=1504 (0x5E0)) created with status = Success.

106615 11:18:57 AM 1/8/2013 713.4872401 RTCSrv.exe (1504) Wscore_MicrosoftWindowsWinsockAFD Wscore_MicrosoftWindowsWinsockAFD:socket: 1 (0x1): Process 0x0000000007AE3060 (0x00000000000005E0), Endpoint 0x0000000007A05010, Family 0 (0x0), Type Unknown value: 0, Protocol 0 (0x0), Seq 1013 (0x3F5), Status Success

106617 11:18:57 AM 1/8/2013 713.5075474 RTCSrv.exe (1504) 192.168.1.1 Wscore_MicrosoftWindowsWinsockAFD Wscore_MicrosoftWindowsWinsockAFD:bind: 0 (0x0): Process 0x0000000007AE3060, Endpoint 0x0000000007A05010, Address 192.168.1.1:0, Seq 7010 (0x1B62), Status Success

106618 11:18:57 AM 1/8/2013 713.5075751 RTCSrv.exe (1504) TCPIP_MicrosoftWindowsTCPIP TCPIP_MicrosoftWindowsTCPIP:TCP: endpoint/connection 0x00000000082D19B0 acquired port number 50357 (0xC4B5).

106619 11:18:57 AM 1/8/2013 713.5075798 RTCSrv.exe (1504) TCPIP_MicrosoftWindowsTCPIP TCPIP_MicrosoftWindowsTCPIP:TCP: endpoint (sockaddr=192.168.1.1:50357) bound.

106620 11:18:57 AM 1/8/2013 713.5075837 RTCSrv.exe (1504) 192.168.1.1 Wscore_MicrosoftWindowsWinsockAFD Wscore_MicrosoftWindowsWinsockAFD:bind: 1 (0x1): Process 0x0000000007AE3060, Endpoint 0x0000000007A05010, Address 192.168.1.1:50357, Seq 7022 (0x1B6E), Status Success

106621 11:18:57 AM 1/8/2013 713.5076201 RTCSrv.exe (1504) Wscore_MicrosoftWindowsWinsockAFD Wscore_MicrosoftWindowsWinsockAFD:Socket option: 4 (0x4): Process 0x0000000007AE3060, Endpoint 0x0000000007A05010, Option FIONBIO, Value 1 (0x1), Seq 11002 (0x2AFA), Status Success

106622 11:18:57 AM 1/8/2013 713.5076435 RTCSrv.exe (1504) 192.168.3.10 Wscore_MicrosoftWindowsWinsockAFD Wscore_MicrosoftWindowsWinsockAFD:connect: 0 (0x0): Process 0x0000000007AE3060, Endpoint 0x0000000007A05010, Address 192.168.3.10:5061, Seq 5023 (0x139F), Status Success

106623 11:18:57 AM 1/8/2013 713.5076564 RTCSrv.exe (1504) TCPIP_MicrosoftWindowsTCPIP TCPIP_MicrosoftWindowsTCPIP:TCP: connection 0x000000000829C860 transition from ClosedState to SynSentState, SndNxt = 0 (0x0).

106624 11:18:57 AM 1/8/2013 713.5076832 RTCSrv.exe (1504) TCPIP_MicrosoftWindowsTCPIP TCPIP_MicrosoftWindowsTCPIP:TCP: connection 0x000000000829C860 attempted to acquire weak reference on port number 50357 (0xC4B5) inherited from endpoint 0x00000000082D19B0. Successful = TRUE.

106625 11:18:57 AM 1/8/2013 713.5076863 RTCSrv.exe (1504) TCPIP_MicrosoftWindowsTCPIP TCPIP_MicrosoftWindowsTCPIP:TCP: Tcb 0x000000000829C860 (local=192.168.1.1:50357 remote=192.168.3.10:5061) requested to connect.

106626 11:18:57 AM 1/8/2013 713.5077111 RTCSrv.exe (1504) TCPIP_MicrosoftWindowsTCPIP TCPIP_MicrosoftWindowsTCPIP:TCP: Inspect Connect has been completed on Tcb 0x000000000829C860 with status = Success.

106627 11:18:57 AM 1/8/2013 713.5077296 RTCSrv.exe (1504) TCPIP_MicrosoftWindowsTCPIP TCPIP_MicrosoftWindowsTCPIP:TCP: Tcb 0x000000000829C860 is going to output SYN with ISN = 2232090326 (0x850AFED6), RcvWnd = 8192 (0x2000), RcvWndScale = 0 (0x0).

106628 11:18:57 AM 1/8/2013 713.5077335 RTCSrv.exe (1504) TCPIP_MicrosoftWindowsTCPIP TCPIP_MicrosoftWindowsTCPIP:TCP: connection 0x000000000829C860 (local=192.168.1.1:50357 remote=192.168.3.10:5061) connect proceeding.

106629 11:18:57 AM 1/8/2013 713.5077421 RTCSrv.exe (1504) TCPIP_MicrosoftWindowsTCPIP TCPIP_MicrosoftWindowsTCPIP:TCP: connection 0x000000000829C860 SRTT measurement started (seq = 2232090326 (0x850AFED6), tick = 233642658 (0xDED1AA2)).

106630 11:18:57 AM 1/8/2013 713.5077433 RTCSrv.exe (1504) TCPIP_MicrosoftWindowsTCPIP TCPIP_MicrosoftWindowsTCPIP:TCP: Connection 0x000000000829C860 RetransmitTimer timer started. Scheduled to expire in 3000 (0xBB8) ms.

 

=> As a result of the upper layer activity, the first SYN sent to the wire:

106631 11:18:57 AM 1/8/2013 713.5078058 RTCSrv.exe (1504) 192.168.1.1 192.168.3.10 TCP TCP:Flags=......S., SrcPort=50357, DstPort=5061, PayloadLen=0, Seq=2232090326, Ack=0, Win=8192 ( ) = 8192

 

- After 3 seconds, the second SYN is sent because the retransmit timer has expired:

 

 

106979 11:19:00 AM 1/8/2013 716.5110434 Idle (0) TCPIP_MicrosoftWindowsTCPIP TCPIP_MicrosoftWindowsTCPIP:TCP: Connection 0x000000000829C860 RetransmitTimer timer has expired.

106980 11:19:00 AM 1/8/2013 716.5110467 Idle (0) TCPIP_MicrosoftWindowsTCPIP TCPIP_MicrosoftWindowsTCPIP:TCP: connection 0x000000000829C860: Send Retransmit round with SndUna = 2232090326 (0x850AFED6), Round = 1 (0x1), SRTT = 0 (0x0), RTO = 300 (0x12C).

106981 11:19:00 AM 1/8/2013 716.5110498 Idle (0) TCPIP_MicrosoftWindowsTCPIP TCPIP_MicrosoftWindowsTCPIP:TCP: connection 0x000000000829C860 (local=192.168.1.1:50357 remote=192.168.3.10:5061) retransmitting connect attempt, RexmitCount = 1 (0x1).

106982 11:19:00 AM 1/8/2013 716.5110733 Idle (0) TCPIP_MicrosoftWindowsTCPIP TCPIP_MicrosoftWindowsTCPIP:TCP: connection 0x000000000829C860: SRTT measurement cancelled.

106983 11:19:00 AM 1/8/2013 716.5110861 Idle (0) TCPIP_MicrosoftWindowsTCPIP TCPIP_MicrosoftWindowsTCPIP:TCP: Connection 0x000000000829C860 RetransmitTimer timer started. Scheduled to expire in 6000 (0x1770) ms.

=> The second SYN sent to the wire:

 

106984 11:19:00 AM 1/8/2013 716.5111523 Idle (0) 192.168.1.1 192.168.3.10 TCP TCP:[SynReTransmit #106631]Flags=......S., SrcPort=50357, DstPort=5061, PayloadLen=0, Seq=2232090326, Ack=0, Win=8192 ( ) = 8192

 

 

- After 6 seconds, the third SYN is sent because the retransmit timer has expired:

 

107342 11:19:06 AM 1/8/2013 722.5091976 Idle (0) TCPIP_MicrosoftWindowsTCPIP TCPIP_MicrosoftWindowsTCPIP:TCP: Connection 0x000000000829C860 RetransmitTimer timer has expired.

107343 11:19:06 AM 1/8/2013 722.5092010 Idle (0) TCPIP_MicrosoftWindowsTCPIP TCPIP_MicrosoftWindowsTCPIP:TCP: connection 0x000000000829C860: Send Retransmit round with SndUna = 2232090326 (0x850AFED6), Round = 2 (0x2), SRTT = 0 (0x0), RTO = 300 (0x12C).

107344 11:19:06 AM 1/8/2013 722.5092043 Idle (0) TCPIP_MicrosoftWindowsTCPIP TCPIP_MicrosoftWindowsTCPIP:TCP: connection 0x000000000829C860 (local=192.168.1.1:50357 remote=192.168.3.10:5061) retransmitting connect attempt, RexmitCount = 2 (0x2).

107345 11:19:06 AM 1/8/2013 722.5092289 Idle (0) TCPIP_MicrosoftWindowsTCPIP TCPIP_MicrosoftWindowsTCPIP:TCP: connection 0x000000000829C860: SRTT measurement cancelled.

107346 11:19:06 AM 1/8/2013 722.5092418 Idle (0) TCPIP_MicrosoftWindowsTCPIP TCPIP_MicrosoftWindowsTCPIP:TCP: Connection 0x000000000829C860 RetransmitTimer timer started. Scheduled to expire in 12000 (0x2EE0) ms.

=> The third SYN sent to the wire:

107347 11:19:06 AM 1/8/2013 722.5093069 Idle (0) 192.168.1.1 192.168.3.10 TCP TCP:[SynReTransmit #106631]Flags=......S., SrcPort=50357, DstPort=5061, PayloadLen=0, Seq=2232090326, Ack=0, Win=8192 ( ) = 8192

 

 

- After 12 seconds, TCPIP driver doesn’t send anymore TCP SYNs and terminates the connection attempt and the socket is closed.

 

107780 11:19:18 AM 1/8/2013 734.5061235 Idle (0) TCPIP_MicrosoftWindowsTCPIP TCPIP_MicrosoftWindowsTCPIP:TCP: Connection 0x000000000829C860 RetransmitTimer timer has expired.

107781 11:19:18 AM 1/8/2013 734.5061285 Idle (0) TCPIP_MicrosoftWindowsTCPIP TCPIP_MicrosoftWindowsTCPIP:TCP: connection 0x000000000829C860 (local=192.168.1.1:50357 remote=192.168.3.10:5061) terminating: retransmission timeout expired.

107782 11:19:18 AM 1/8/2013 734.5061305 Idle (0) TCPIP_MicrosoftWindowsTCPIP TCPIP_MicrosoftWindowsTCPIP:TCP: connection 0x000000000829C860 (local=192.168.1.1:50357 remote=192.168.3.10:5061) shutdown initiated (0xC00000B5 - STATUS_IO_TIMEOUT). PID = 1504 (0x5E0).

107783 11:19:18 AM 1/8/2013 734.5061347 Idle (0) TCPIP_MicrosoftWindowsTCPIP TCPIP_MicrosoftWindowsTCPIP:TCP: connection 0x000000000829C860 transition from SynSentState to ClosedState, SndNxt = 2232090327 (0x850AFED7).

107784 11:19:18 AM 1/8/2013 734.5061433 Idle (0) TCPIP_MicrosoftWindowsTCPIP TCPIP_MicrosoftWindowsTCPIP:TCP: connection 0x000000000829C860 (local=192.168.1.1:50357 remote=192.168.3.10:5061) connect attempt failed with status = 0xC00000B5 - STATUS_IO_TIMEOUT.

107785 11:19:18 AM 1/8/2013 734.5062093 System (4) TCPIP_MicrosoftWindowsTCPIP TCPIP_MicrosoftWindowsTCPIP:TCP: endpoint/connection 0x000000000829C860 released port number 50357 (0xC4B5). WeakReference = TRUE.

107786 11:19:18 AM 1/8/2013 734.5062386 RTCSrv.exe (1504) Wscore_MicrosoftWindowsWinsockAFD Wscore_MicrosoftWindowsWinsockAFD:connect: 1 (0x1): Process 0x0000000007AE3060, Endpoint 0x0000000007A05010, Seq 5024 (0x13A0), Status 0xC00000B5 - STATUS_IO_TIMEOUT

107787 11:19:18 AM 1/8/2013 734.5065252 RTCSrv.exe (1504) Wscore_MicrosoftWindowsWinsockAFD Wscore_MicrosoftWindowsWinsockAFD:socket cleanup: 0 (0x0): Process 0x0000000007AE3060, Endpoint 0x0000000007A05010, Seq 2002 (0x7D2), Status Success

107788 11:19:18 AM 1/8/2013 734.5065297 RTCSrv.exe (1504) TCPIP_MicrosoftWindowsTCPIP TCPIP_MicrosoftWindowsTCPIP:TCP: endpoint (sockaddr=192.168.1.1:50357) closed.

107789 11:19:18 AM 1/8/2013 734.5065339 RTCSrv.exe (1504) TCPIP_MicrosoftWindowsTCPIP TCPIP_MicrosoftWindowsTCPIP:TCP: endpoint/connection 0x00000000082D19B0 released port number 50357 (0xC4B5). WeakReference = FALSE.

107790 11:19:18 AM 1/8/2013 734.5065425 RTCSrv.exe (1504) Wscore_MicrosoftWindowsWinsockAFD Wscore_MicrosoftWindowsWinsockAFD:socket cleanup: 1 (0x1): Process 0x0000000007AE3060, Endpoint 0x0000000007A05010, Seq 2003 (0x7D3), Status Success

107791 11:19:18 AM 1/8/2013 734.5065490 RTCSrv.exe (1504) Wscore_MicrosoftWindowsWinsockAFD Wscore_MicrosoftWindowsWinsockAFD :closesocket: 0 (0x0): Process 0x0000000007AE3060, Endpoint 0x0000000007A05010, Seq 2000 (0x7D0), Status Success

107792 11:19:18 AM 1/8/2013 734.5065498 RTCSrv.exe (1504) Wscore_MicrosoftWindowsWinsockAFD Wscore_MicrosoftWindowsWinsockAFD:closesocket: 1 (0x1): Process 0x0000000007AE3060, Endpoint 0x0000000007A05010, Seq 2001 (0x7D1), Status Success

 

- So the connection times out after 21 seconds. This also explains why the external client to Edge server connection is terminated after 21 seconds. Since Edge server cannot access Frontend server, it returns a SIP 504 and then the client closes the session.

 

- Also the Edge server SIPSTack ETL trace confirms that behavior:

 

TL_ERROR(TF_CONNECTION) [0]05E0.0AAC::01/08/2013-09:19:18.611.02e77050 (SIPStack,SIPAdminLog::TraceConnectionRecord:SIPAdminLog.cpp(160))

$$begin_record

LogType: connection

Severity: error

Text: Failed to complete outbound connection

Peer-IP: 192.168.3.10:5061

Peer-FQDN: fepool01.contoso.com

Connection-ID: 0x2E6603

Transport: TLS

Result-Code: 0x8007274c WSAETIMEDOUT

Data: fqdn=" fepool01.contoso.com ";peer-type="InternalServer";winsock-code="10060"

$$end_record

 

- Also a telnet test confirms the same behavior:

 

Edge:

 

C:\Users\Administrator>telnet 192.168.3.10 5061

Connecting To 192.168.3.10...Could not open connection to the host, on port 5061: Connect failed

 

///////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////

///////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////

 

=> Frontend server side trace:

 

- Frontend server see the incoming connection requests like below:

 

606227 11:19:07 AM 1/8/2013 727.4974204 Idle (0) 192.168.1.1 192.168.3.10 TCP TCP:Flags=......S., SrcPort=50357, DstPort=5061, PayloadLen=0, Seq=2232090326, Ack=2948578651, Win=8192 ( ) = 8192

606232 11:19:07 AM 1/8/2013 727.4975478 Idle (0) 192.168.3.10 192.168.1.1 TCP TCP: [Bad CheckSum]Flags=...A..S., SrcPort=5061, DstPort=50357, PayloadLen=0, Seq=154470907, Ack=2232090327, Win=8192 ( Scale factor not supported ) = 8192

606233 11:19:07 AM 1/8/2013 727.4980222 Idle (0) 192.168.1.1 192.168.3.10 TCP TCP:Flags=...A.R.., SrcPort=50357, DstPort=5061, PayloadLen=0, Seq=2232090327, Ack=154470908, Win=8192

612948 11:19:10 AM 1/8/2013 730.5001959 Idle (0) 192.168.1.1 192.168.3.10 TCP TCP:Flags=......S., SrcPort=50357, DstPort=5061, PayloadLen=0, Seq=2232090326, Ack=2948578651, Win=8192 ( ) = 8192

612953 11:19:10 AM 1/8/2013 730.5003515 Idle (0) 192.168.3.10 192.168.1.1 TCP TCP: [Bad CheckSum]Flags=...A..S., SrcPort=5061, DstPort=50357, PayloadLen=0, Seq=155953667, Ack=2232090327, Win=8192 ( Scale factor not supported ) = 8192

612967 11:19:10 AM 1/8/2013 730.5007007 Idle (0) 192.168.1.1 192.168.3.10 TCP TCP:Flags=...A.R.., SrcPort=50357, DstPort=5061, PayloadLen=0, Seq=2232090327, Ack=155953668, Win=8192

623999 11:19:16 AM 1/8/2013 736.4983502 Idle (0) 192.168.1.1 192.168.3.10 TCP TCP:Flags=......S., SrcPort=50357, DstPort=5061, PayloadLen=0, Seq=2232090326, Ack=2948578651, Win=8192 ( ) = 8192

624004 11:19:16 AM 1/8/2013 736.4985345 Idle (0) 192.168.3.10 192.168.1.1 TCP TCP: [Bad CheckSum]Flags=...A..S., SrcPort=5061, DstPort=50357, PayloadLen=0, Seq=157301184, Ack=2232090327, Win=8192 ( Scale factor not supported ) = 8192

624005 11:19:16 AM 1/8/2013 736.4987452 Idle (0) 192.168.1.1 192.168.3.10 TCP TCP:Flags=...A.R.., SrcPort=50357, DstPort=5061, PayloadLen=0, Seq=2232090327, Ack=157301185, Win=8192

 

 

- It gets the TCP SYN from Edge, sends a TCP SYN ACK back to it but very shortly afterwards it gets a TCP RST from Frontend. In reality that TCP RST is not sent by the Edge server as we have seen above. Most likely that TCP RST comes from a network device running in between. The sender MAC address for that TCP RST is the following (but it doesn’t necessarily mean it’s that device sending the RST:)

  Frame: Number = 606233, Captured Frame Length = 161, MediaType = NetEvent

+ NetEvent:

+ MicrosoftWindowsNDISPacketCapture: Packet Fragment (60 (0x3C) bytes)

- Ethernet: Etype = Internet IP (IPv4),DestinationAddress:[ 00-50-56-33-22-11],SourceAddress:[ 00-0F-8F-11-22-33]

  + DestinationAddress: VMWare, Inc. 8C7B52 [00-50-56-33-22-11]

  + SourceAddress: Cisco Systems 112233 [00-0F-8F-11-22-33]

    EthernetType: Internet IP (IPv4), 2048(0x800)

    UnknownData: Binary Large Object (6 Bytes)

+ Ipv4: Src = 192.168.1.1, Dest = 192.168.3.10, Next Protocol = TCP, Packet ID = 50541, Total IP Length = 40

+ Tcp: Flags=...A.R.., SrcPort=50357, DstPort=5061, PayloadLen=0, Seq=2232090327, Ack=154470908, Win=8192

- Now let’s check the above activity at TCPIP/AFD driver level:

 

=> The first TCP SYN received from the network:

 

612948 11:19:10 AM 1/8/2013 730.5001959 Idle (0) 192.168.1.1 192.168.3.10 TCP TCP:Flags=......S., SrcPort=50357, DstPort=5061, PayloadLen=0, Seq=2232090326, Ack=2948578651, Win=8192 ( ) = 8192

 

=> TCPIP driver sends a TCP SYN ACK:

 

612949 11:19:10 AM 1/8/2013 730.5002398 Idle (0) TCPIP_MicrosoftWindowsTCPIP TCPIP_MicrosoftWindowsTCPIP:TCPIP: NBL 0x000000000CCBE7F0 fell off the receive fast path, Reason: Session state is not compatible. Protocol = TCP, Family = IPV4, Number of NBLs = 1 (0x1). SourceAddress = 192.168.1.1 Null. DestAddress = 192.168.3.10 Null.

612950 11:19:10 AM 1/8/2013 730.5002968 Idle (0) TCPIP_MicrosoftWindowsTCPIP TCPIP_MicrosoftWindowsTCPIP:TCP: connection 0x0000000011D4DCF0 transition from ListenState to SynRcvdState, SndNxt = 0 (0x0).

612951 11:19:10 AM 1/8/2013 730.5003286 Idle (0) TCPIP_MicrosoftWindowsTCPIP TCPIP_MicrosoftWindowsTCPIP:TCP: connection 0x0000000011D4DCF0 SRTT measurement started (seq = 155953667 (0x94BAA03), tick = 1684327497 (0x6464CC49)).

612952 11:19:10 AM 1/8/2013 730.5003309 Idle (0) TCPIP_MicrosoftWindowsTCPIP TCPIP_MicrosoftWindowsTCPIP:TCP: Connection 0x0000000011D4DCF0 RetransmitTimer timer started. Scheduled to expire in 3000 (0xBB8) ms.

612953 11:19:10 AM 1/8/2013 730.5003515 Idle (0) 192.168.3.10 192.168.1.1TCP TCP: [Bad CheckSum]Flags=...A..S., SrcPort=5061, DstPort=50357, PayloadLen=0, Seq=155953667, Ack=2232090327, Win=8192 ( Scale factor not supported ) = 8192

 

- It gets a TCP RST from Edge server side right after this: (this wasn’t sent by the Edge server by the way)

 

612967 11:19:10 AM 1/8/2013 730.5007007 Idle (0) 192.168.1.1192.168.3.10 TCP TCP:Flags=...A.R.., SrcPort=50357, DstPort=5061, PayloadLen=0, Seq=2232090327, Ack=155953668, Win=8192

612968 11:19:10 AM 1/8/2013 730.5007382 Idle (0) TCPIP_MicrosoftWindowsTCPIP TCPIP_MicrosoftWindowsTCPIP:TCP: connection 0x0000000011D4DCF0 (local=192.168.3.10:5061 remote=192.168.1.1:50357) connection terminated: received RST.

612969 11:19:10 AM 1/8/2013 730.5007404 Idle (0) TCPIP_MicrosoftWindowsTCPIP TCPIP_MicrosoftWindowsTCPIP:TCP: connection 0x0000000011D4DCF0 (local=192.168.3.10:5061 remote=192.168.1.1:50357) shutdown initiated (0xC000020D - STATUS_CONNECTION_RESET). PID = 2640 (0xA50).

612970 11:19:10 AM 1/8/2013 730.5007432 Idle (0) TCPIP_MicrosoftWindowsTCPIP TCPIP_MicrosoftWindowsTCPIP:TCP: connection 0x0000000011D4DCF0 transition from SynRcvdState to ClosedState, SndNxt = 155953668 (0x94BAA04).

 

 

- The same action is repeated two more times:

 

623999 11:19:16 AM 1/8/2013 736.4983502 Idle (0) 192.168.1.1192.168.3.10 TCP TCP:Flags=......S., SrcPort=50357, DstPort=5061, PayloadLen=0, Seq=2232090326, Ack=2948578651, Win=8192 ( ) = 8192

624000 11:19:16 AM 1/8/2013 736.4983890 Idle (0) TCPIP_MicrosoftWindowsTCPIP TCPIP_MicrosoftWindowsTCPIP:TCPIP: NBL 0x000000000CCE1E20 fell off the receive fast path, Reason: Session state is not compatible. Protocol = TCP, Family = IPV4, Number of NBLs = 1 (0x1). SourceAddress = 192.168.1.1 Null. DestAddress = 192.168.3.10 Null.

624001 11:19:16 AM 1/8/2013 736.4984630 Idle (0) TCPIP_MicrosoftWindowsTCPIP TCPIP_MicrosoftWindowsTCPIP:TCP: connection 0x000000000EDC7960 transition from ListenState to SynRcvdState, SndNxt = 0 (0x0).

624002 11:19:16 AM 1/8/2013 736.4985060 Idle (0) TCPIP_MicrosoftWindowsTCPIP TCPIP_MicrosoftWindowsTCPIP:TCP: connection 0x000000000EDC7960 SRTT measurement started (seq = 157301184 (0x96039C0), tick = 1684328097 (0x6464CEA1)).

624003 11:19:16 AM 1/8/2013 736.4985125 Idle (0) TCPIP_MicrosoftWindowsTCPIP TCPIP_MicrosoftWindowsTCPIP:TCP: Connection 0x000000000EDC7960 RetransmitTimer timer started. Scheduled to expire in 3000 (0xBB8) ms.

624004 11:19:16 AM 1/8/2013 736.4985345 Idle (0) 192.168.3.10 192.168.1.1TCP TCP: [Bad CheckSum]Flags=...A..S., SrcPort=5061, DstPort=50357, PayloadLen=0, Seq=157301184, Ack=2232090327, Win=8192 ( Scale factor not supported ) = 8192

624005 11:19:16 AM 1/8/2013 736.4987452 Idle (0) 192.168.1.1192.168.3.10 TCP TCP:Flags=...A.R.., SrcPort=50357, DstPort=5061, PayloadLen=0, Seq=2232090327, Ack=157301185, Win=8192

624006 11:19:16 AM 1/8/2013 736.4987656 Idle (0) TCPIP_MicrosoftWindowsTCPIP TCPIP_MicrosoftWindowsTCPIP:TCP: connection 0x000000000EDC7960 (local=192.168.3.10:5061 remote=192.168.1.1:50357) connection terminated: received RST.

624007 11:19:16 AM 1/8/2013 736.4987684 Idle (0) TCPIP_MicrosoftWindowsTCPIP TCPIP_MicrosoftWindowsTCPIP:TCP: connection 0x000000000EDC7960 (local=192.168.3.10:5061 remote=192.168.1.1:50357) shutdown initiated (0xC000020D - STATUS_CONNECTION_RESET). PID = 2640 (0xA50).

624008 11:19:16 AM 1/8/2013 736.4987706 Idle (0) TCPIP_MicrosoftWindowsTCPIP TCPIP_MicrosoftWindowsTCPIP:TCP: connection 0x000000000EDC7960 transition from SynRcvdState to ClosedState, SndNxt = 157301185 (0x96039C1).

...

 

 

SUMMARY:

=========

The behavior we see above points to a problem somewhere between at or below the NIC layer on Frontend server and at or below the NIC layer on Edge server including any network devices sitting in the path in between the two servers like LAN switches, firewalls etc.

 

As such, I have made the following recommendations to our customer

 

a) Involving their networking team for further investigations on the network devices running in between the Edge and frontend servers

b) Making some improvements on the endpoints (edge/frontend servers) from vmware perspective since both servers are virtualized on vmware ESX server(s)

 

=> The NIC could be configured as a VMXNET3 driver in the vmware config:

https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1001805

Choosing a network adapter for your virtual machine

 

=> Vmware recommends turning STP protocol off on the switchport to which VMware ESX server is connected:

https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1003804

STP may cause temporary loss of network connectivity when a failover or failback event occurs

 

=> We can check if the current physical NIC is on the compatibility list of Vmware:

https://partnerweb.vmware.com/comp_guide2/search.php?

(Please choose your vmware server version and network adapter brand/model from the above link)

 

=> Vmware releases patches on releated components. For example the following is one of the latest releases for ESXi 4.0. It is best to check if your Vmware server binaries are recent.

https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2011768

 

=> Also it’s suggested to run virtual machine version 7 for the guest:

https://www.vm-help.com/esx40i/upgrade_virtual_hardware/virtual_hardware_upgrade.php

 

Hope this helps you in troubleshooting similar problems...

 

Thanks,

Murat