One Of Vamshidar Rawal's Favorite Bugs

Problem:

We had instrumented our code to emit timing markers at various points in our pages. Automation tracked these markers and measured actual end-user download times 24/7 on the production site. The results showed that 25% of the time, download times exceeded 20 seconds, regardless of whether it was peak usage time or the weekend. That did not make any sense. All our debugging and investigation showed no cause for the behavior, and it was really hurting our end-user experience.

How the cause was determined:

Even after a few months we had neither a cause nor a solution. I monitored the live-site end-user experience for some time and noticed that a couple of times a month, the 20-second download times dropped to about 5 seconds. This triggered another investigation into what had changed on the live site during these 2-hour windows when performance improved. We learned that twice a month, during these 2-hour windows, we were updating the production site with newer content and code. During an update, our operations team takes half the servers out of rotation, updates their code and content, and then swaps them with the other half, which go out of rotation in turn. During these times the long download times improved. It stunned us to see that the site worked better on just half the number of servers; that didn’t make sense.

Issue:

  • Ultimately, further investigation showed that the actual problem was the number of unique OLEDB connection strings created and persisted on each of our front-end servers.
  • We had 42 Front End Servers (FEs) hitting 28 Back End (BE) servers that contained 47 language databases (DBs).
    • As a result, each FE server had to create and persist 28 * 47 = 1316 unique DB connection strings.
    • This number far exceeded the recommended limit of around 100 unique connection strings.
  • While the site was being updated, the number of servers in rotation was halved (21 FEs hitting 14 BE servers that contained 47 DBs).
    • This halved the number of unique connection strings on each FE (14 * 47 = 658).
    • That in turn dropped the long delays from 20 seconds to around 5 seconds.
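
The arithmetic in the bullets above can be checked with a short Python sketch. The function name and the constant are hypothetical, introduced only for illustration; the sketch assumes, as the post does, that each FE holds one unique connection string per (BE server, language DB) pair.

```python
def unique_connection_strings(backend_servers: int, databases: int) -> int:
    """Connection strings one FE must hold: one per (BE server, DB) pair."""
    return backend_servers * databases


# Approximate recommended limit quoted in the post.
RECOMMENDED_LIMIT = 100

full_rotation = unique_connection_strings(28, 47)  # all BE servers in rotation
half_rotation = unique_connection_strings(14, 47)  # during a site update

print(full_rotation, half_rotation)
print(full_rotation > RECOMMENDED_LIMIT, half_rotation > RECOMMENDED_LIMIT)
```

Even with half the servers out of rotation, the count stays well above the limit, which fits the observation that delays improved during updates but did not disappear.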

Result:

This led us to discuss a better way of arranging and linking our FE and BE servers in production so that the number of connection strings would be reduced. We ended up adding a middle tier of 6 servers between the FEs and the BEs, which solved the issue of excessive connection strings.

  • Now we had 42 FE – 6 Search Web Servers – 28 BE servers (47 DBs).
  • This reduced the number of FE connection strings to 42 * 6 = 252 across all FEs, since each FE now only had to reach the 6 Search Web Servers instead of every BE database.
  • The issue with long download times was completely fixed and our download times were sub-second 100% of the time.
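
As a rough comparison of the two layouts, the sketch below counts connection strings before and after the middle tier was added. It assumes each FE holds exactly one connection string per Search Web Server, which the post does not state explicitly; all variable names are hypothetical.

```python
FRONT_ENDS = 42
BACK_ENDS = 28
LANGUAGE_DBS = 47
SEARCH_WEB_SERVERS = 6

# Before: every FE addressed every (BE server, language DB) pair directly.
per_fe_before = BACK_ENDS * LANGUAGE_DBS

# After (assumption): every FE addresses only the middle-tier servers.
per_fe_after = SEARCH_WEB_SERVERS

# The post's 42 * 6 figure, counted across all FEs.
fe_to_middle_total = FRONT_ENDS * SEARCH_WEB_SERVERS

print(per_fe_before, per_fe_after, fe_to_middle_total)
```

Under that assumption, the per-FE count falls from far above the recommended limit of around 100 to well below it.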

Lessons learnt:

  1. A production site can never be fully mimicked, and the cause of issues occurring on it will almost never be obvious.
  2. Every aspect of the production environment must be thought through comprehensively.
  3. Any part of the production environment can and will hurt performance and the end-user experience if it is not well thought out.

 -- Vamshidar Rawal, SDET

 

Do you have a bug whose story you love to tell? Let me know!