Outlook Intermittent Disconnections – Exchange 2010

 

This is part 2 of my troubleshooting weeks, where we were facing issues with Outlook intermittently losing its connection to Exchange AND database copies losing synchronization. Part 1 of this blog focused on what we had troubleshot so far, mostly using database connectivity alerts. In this part we'll focus on the Outlook disconnections and how things heated up over the next few days.

I believe you will all agree on the most annoying, and most interesting, issue to troubleshoot with Outlook & Exchange: Outlook intermittently disconnects or pops up a message saying "Outlook lost connection to the server, trying to reconnect".

The pop-up basically indicates that Outlook has either lost its MAPI connection to the Exchange services and/or is taking a very long time to complete the connection, causing Outlook to behave as if it has lost connectivity.

 

The issue in itself doesn't cause any critical business loss; it's more of an annoyance, and if users in your environment are as "peculiar" about Outlook's well-being as ours, you will be facing the same level of annoyance as we do. Sometimes the level of annoyance is so high that you feel like masking the problem using the custom registry value discussed in the blog linked below ;-).

So here I am, discussing my mid-June, when we had one such painful and interesting issue to troubleshoot: Outlook was generating the RPC disconnection pop-up, and a lot of experts were scratching their heads along with us trying to remediate it.

Stage 2: Outlook Intermittent Disconnections

The second, and more critical, sign of issues started appearing a few days after the database connectivity issues discussed in part 1 of this blog. Users started complaining about:

  • Intermittent Outlook pop-ups (shown above) saying Outlook is trying to connect.
  • Users unable to access shared mailboxes and shared calendars, getting an access denied error. (In our environment, users access shared mailboxes/folders in online mode while their primary mailbox/folder is in cached mode.)

The Exchange hardware report/SCOM was not triggering any alerts for performance or RPC latency. However, the server's MAPI connectivity test failed and reported the error below:

____________________________________________________________________________________________

Error : [Microsoft.Exchange.Data.Storage.TooManyObjectsOpenedException]: Cannot open mailbox /o=Contoso/ou=Exchange Administrative Group (FYDIBOHF23SPDLT)/cn=Configuration/cn=Servers/cn=Server01/cn=Microsoft System Attendant. Inner error [Microsoft.Mapi.MapiExceptionSessionLimit]: MapiExceptionSessionLimit: Unable to open message store. (hr=0x80040112, ec=1246)
Diagnostic context:

_______________________________________________________________________________________________

The error basically indicates that the System Attendant mailbox has run out of RPC connections. However, it was not adding up, as we had already increased the Exchange store limits documented here; notably, Maximum Allowed Sessions Per User was already set as high as 5000 (decimal). Running ExMon (Exchange User Monitor) generally helps in these scenarios, as it points out which user/mailbox is spending the most CPU time on the server or has the highest number of operations against the Exchange store, but sadly not in our scenario.
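For reference, the server MAPI connectivity test above is presumably Test-MapiConnectivity, which, when run against a server, logs on to the System Attendant mailbox of each database on it (which is why the error above references the System Attendant). A minimal sketch, with hypothetical server/database names:

```powershell
# Run from the Exchange 2010 Management Shell.
# Against a server, Test-MapiConnectivity logs on to the System Attendant
# mailbox of every active database on that server.
Test-MapiConnectivity -Server "Server01"

# Or test a single database explicitly (hypothetical name):
Test-MapiConnectivity -Database "DB01"
```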

After some expert help, we were able to determine that the issue was not due to the total number of MAPI connections, but due to the number of Exchange administrative connections open at a given time against the Exchange server store. Exchange administrative connections are primarily used by:

  • Shared calendar/mailbox access
  • BlackBerry Enterprise Server
  • Exchange Search service (not listed at first; we determined this later)

The default value of the Exchange store administrative connection limit Maximum Allowed Exchange Sessions Per Service is 10000; however, per the performance monitors, we were going above it, at which point the Exchange store was locking down all connections and causing the Outlook outage. At this point we increased the following registry values to make sure we did not run into the outage situation again (a hedged example of setting them follows the list):

  • The Maximum Allowed Exchange Sessions Per Service registry to 20000 (default value 10000)
  • The Maximum Allowed Concurrent Exchange Sessions Per Service registry to 100 (default value 50)
  • The Maximum Allowed Service Sessions Per User registry to 64 (default value 32)
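A minimal sketch of bumping these values with PowerShell, assuming the Exchange 2010 store session limits live under MSExchangeIS\ParametersSystem as described in the store limits article referenced earlier; verify the path for your build and back up the key before changing anything:

```powershell
# Assumed location of the Exchange 2010 store session limit values.
$params = 'HKLM:\SYSTEM\CurrentControlSet\Services\MSExchangeIS\ParametersSystem'

# The raised limits discussed above (DWORD, decimal).
$limits = @{
    'Maximum Allowed Exchange Sessions Per Service'            = 20000
    'Maximum Allowed Concurrent Exchange Sessions Per Service' = 100
    'Maximum Allowed Service Sessions Per User'                = 64
}

foreach ($name in $limits.Keys) {
    # Creates the value if it is missing, otherwise overwrites it.
    New-ItemProperty -Path $params -Name $name -Value $limits[$name] `
        -PropertyType DWord -Force | Out-Null
}
```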

From here, we monitored whether we went above these limits again. But for me it still wasn't adding up: we had been running fine all this time, so how could connections spike overnight?

Meanwhile, the continuous Outlook disconnection and reconnection prompts kept bugging users intermittently. We provided a store dump for analysis using the command below and were told that the administrative connections were still spiking and that we needed to bump the limit up to 30000 to resolve the issue, but I still wasn't sold on that.

After going through network captures, logon statistics, and RPC log dumps from the Exchange CAS and MBX servers, I noticed the following:

  1. All users had a high number of connections to the server, irrespective of the client version or device they were using to connect or the server they were on. Also, in the logon statistics dump I could see that the passive nodes of the database copies had logon connections against mailboxes as well, not just the CAS servers. At this point I learnt that the search service also uses administrative connections, and that it is by design that the Microsoft Exchange Search Indexer service on the passive node(s) indexes the active node's database; that way, the content indexes of the database on all database copies in the DAG are always up to date. (See "Get-LogonStatistics Shows Logons to a Mailbox on the Active Node by the Exchange Search Indexer on the Passive Node(s) of a DAG in Exchange 2010".) So when we increased the registry value for service sessions above, we basically allowed the Exchange Search service to open more connections against mailboxes on the database. (A hedged Get-LogonStatistics sketch follows this list.)
  2. At the time of a packet drop, we can see in the network capture that the CAS is losing its connection to the MBX server as well, and soon after it sends a lot of Request Fast-Retransmit packets. I believe this could be one of the factors behind the high number of connections across the environment. Sample packets below:

2586723  3:02:58 PM 6/21/2013  1343.5206036  store.exe  Server01  10.0.0.2  TCP  TCP:[Segment Lost] Flags=...A...., SrcPort=38364, DstPort=32159, PayloadLen=1460, Seq=1546335157 - 1546336617, Ack=3329138042, Win=511  {TCP:361, IPv4:11}

2586724  3:02:58 PM 6/21/2013  1343.5206181  store.exe  Server01  10.0.0.2  TCP  TCP:[Continuation to #2586723] Flags=...A...., SrcPort=38364, DstPort=32159, PayloadLen=279, Seq=1546336617 - 1546336896, Ack=3329138042, Win=511  {TCP:361, IPv4:11}

2586737  3:02:58 PM 6/21/2013  1343.5227346  store.exe  10.0.0.2  Server01  TCP  TCP:[Request Fast-Retransmit #2586723] Flags=...A...., SrcPort=32159, DstPort=38364, PayloadLen=0, Seq=3329138042, Ack=1546335157, Win=3045  {TCP:361, IPv4:11}

  3. Parsing through the CAS logs, I could see a lot of session drops; attached is the connection log for certain users.
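To illustrate the logon statistics dump from point 1, a minimal sketch (server name hypothetical) that counts logons per user on a mailbox server; the Search Indexer logons coming from the passive nodes stand out quickly in this kind of summary:

```powershell
# Hypothetical server name; run from the Exchange Management Shell.
Get-LogonStatistics -Server "MBX01" |
    Group-Object UserName |
    Sort-Object Count -Descending |
    Select-Object -First 20 Name, Count
```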

At this point, I turned back to the network team, who had given up on us, and pushed them hard to check the network switches connected to our Exchange VM farm. It took some pushing, but in the end they expanded monitoring at the switch level from just Exchange traffic to basically all traffic coming into the switch, and at that point they were able to determine the cause of the issue.

The cause appears to have been another VM server in the data center that had 2 trunk ports that were flapping. When the ports would flap it would trigger a spanning-tree change on the LAN. During the spanning-tree re-convergence, traffic would get flooded out through all ports on the core switch. The switch ports connected to this server were set up for 'spanning-tree portfast' (disable spanning tree), but since they were trunk ports this command was not preventing the spanning-tree change during a port flap. We had to disable spanning-tree on each trunk port with the 'spanning-tree portfast trunk' command. This stopped the spanning-tree issues during the port flaps. The server was found to be down and was moved to a powered-down state. This stopped the port flaps from occurring.

So in a nutshell: the VM server with the flapping trunk ports was causing traffic to be flooded out through all switches connected to it, the Exchange switch being one of them. Hence the switch getting overwhelmed, servers sending/receiving Rx Pause packets as they were unable to keep up with the network data rate, hence the packet loss, hence the database disconnections, and hence the intermittent Outlook connectivity issue! Phew!

Once the issue was determined, we removed the two registry values below from the servers, so that we do not allow more than the default number of connections for the Exchange 2010 store service, and added an Exchange admin client connection column to our Exchange hardware report for monitoring (a hedged removal sketch follows the list):

  • The Maximum Allowed Exchange Sessions Per Service registry (default value 10000)
  • The Maximum Allowed Concurrent Exchange Sessions Per Service registry (default value 50)
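A minimal sketch of reverting the two overrides, again assuming the MSExchangeIS\ParametersSystem location used earlier; removing the values lets the store fall back to its built-in defaults:

```powershell
# Assumed location of the Exchange 2010 store session limit values.
$params = 'HKLM:\SYSTEM\CurrentControlSet\Services\MSExchangeIS\ParametersSystem'

# Defaults are 10000 and 50 respectively once the overrides are gone.
# -ErrorAction SilentlyContinue in case a value was already removed on a server.
Remove-ItemProperty -Path $params -Name 'Maximum Allowed Exchange Sessions Per Service' -ErrorAction SilentlyContinue
Remove-ItemProperty -Path $params -Name 'Maximum Allowed Concurrent Exchange Sessions Per Service' -ErrorAction SilentlyContinue
```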

Since then, the administrative connections, which were going above 20000, stay below 2000 even on a peak business day. No more intermittent Outlook disconnections, no more database synchronization issues, and so my interesting yet painful mid-June ended and I could take on the other issues waiting in the queue 😉

Other frequent factors that cause intermittent Outlook disconnections:

  • WAN Accelerators: WAN accelerators generally come in two flavors: compressing the already compressed, or pattern matching. RPC traffic is already compressed, so re-compressing the data does not necessarily yield large performance gains, based on independent testing you can read about on the web. Some products will also keep Outlook sessions open for users; if the session limits set in Exchange are exceeded, you will see event ID 9646 on the server (see the monitoring sketch after this list).
  • Server performance: If the source or destination server is running high on CPU or memory, Exchange lowers the cycles available to the Exchange Replication service, causing a high copy queue or replay queue length. You can monitor server performance using the Exchange RPC counter monitor script available for download here.
  • Storage performance: If the SAN/LUN/spindles backing the Exchange servers (physical or VM) are under heavy IOPS load, there will be a delay in replaying the copied transaction logs against the database, causing a high replay queue length. If the storage array is the same for active & passive databases, as in most environments, this will also degrade database RPC performance or cause high RPC latency, degrading the Outlook experience for end users as well.
  • VMware/Hyper-V physical host performance: If the physical ESX host or Hyper-V server is oversubscribed (e.g. configured at a 2:1 logical-to-physical core ratio) and is running out of resources, there will be an overall performance degradation for the Exchange servers running on that host, causing database replication and RPC performance degradation in the environment.
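To illustrate the checks above, a minimal sketch (server name hypothetical) that looks for event 9646, samples the store RPC latency counter, and reports copy/replay queue lengths from the Exchange Management Shell:

```powershell
# Hypothetical mailbox server name.
$server = "MBX01"

# 1. Event 9646 is logged by MSExchangeIS when a client exceeds a session limit.
Get-EventLog -ComputerName $server -LogName Application -Source MSExchangeIS -After (Get-Date).AddDays(-1) |
    Where-Object { $_.EventID -eq 9646 } |
    Select-Object TimeGenerated, Message

# 2. Sample the averaged store RPC latency (ms); sustained high values usually
#    correlate with the Outlook RPC dialog.
Get-Counter -ComputerName $server -Counter '\MSExchangeIS\RPC Averaged Latency' -SampleInterval 5 -MaxSamples 6

# 3. Copy and replay queue lengths for the database copies on the server.
Get-MailboxDatabaseCopyStatus -Server $server |
    Select-Object Name, Status, CopyQueueLength, ReplayQueueLength
```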

To read more about other factors that cause users to observe the Outlook RPC dialog box discussed above, please read the MS blog below:

Troubleshooting Outlook RPC dialog boxes – revisited

 

 

Another useful tool for analyzing Exchange performance logs is PAL (Performance Analysis of Logs):

https://github.com/clinthuffman/PAL

 

Thanks for reading !
