Utility Spotlight
12 Steps To Faster Web Pages With Visual Round Trip Analyzer
Jim Pierson
This article is based on a prerelease version of VRTA. All information is subject to change.
Code download available at: Visual Round Trip Analyzer (654 KB)
This article uses the following technologies: Visual Round Trip Analyzer
Contents
Location, Location, Location
12 Easy Steps
Using Visual Round Trip Analyzer
Final Thoughts
So many factors can affect the performance of a Web page—the distance between server and client, the size of the elements on the page, how the browser loads these elements, and the available bandwidth. Finding the bottlenecks and identifying the culprits is no easy task, but doing so can yield significant improvements.
In this article, I'll show you how to spot and remedy some common causes of poor performance. I'll also introduce you to Visual Round Trip Analyzer (VRTA), a tool that can distinguish among the various factors that cause performance problems and present them as a graphical map. Then I'll show you how to perform your analysis using VRTA. But before we begin, find out how much you already know by taking the speed quiz called "How Do You Improve Page Loading?" that is included in this article.
Location, Location, Location
Two of the biggest factors in page load delays are distance, measured as round-trip time (RTT), and the number of round-trips between the client and servers. A round-trip is a network packet sent by a client to a server, which then sends a response packet. Downloading a single Web page can require as few as 2 or 3 round-trips if the page is sparse, or many tens of round-trips for a rich page. Opening a TCP connection is one round-trip. Sending a Get and receiving the first 3KB is another. Each acknowledgment sent for additional data adds still more round-trips.
For users who are located near the Web site data center, say within a few hundred miles, such as a user in Los Angeles connecting to a server in Silicon Valley, RTT is only 20ms (.020 seconds) and not a significant factor in page load time. There could be 50 round-trips back and forth between the client and server in these circumstances and the page would still see only 1 second of network delay before any server time was added (50 RTs × 20ms = 1 sec).
But the story is much different for a user in Europe or Asia connecting to the same Silicon Valley server. Japan and England are currently at least 120ms of round-trip time from California. When you consider that server disk seek time for each request is usually less than 10ms, versus round-trips of hundreds of milliseconds repeated many times over, you'll see that distance and the number of round-trips are easily the most significant contributing factors in long page-load times.
To combat the impact of long round-trips, there are three main solutions you can try: reduce the number of serial round-trips by doing more in parallel and shedding excess weight, reduce the round-trip time by moving your location closer to the user, or reduce server time.
Visualizing the page download accurately, as it really flows, is essential to performance analysis. For example, it's difficult to know which parts of the page are serial and which are downloading in parallel. File downloads are typically depicted as a waterfall (as you see in Figure 1 ) with one file represented per row in the report. But such a depiction does not account for the fact that since HTTP1.1, TCP connections are reused. Keeping connections open and calling additional files on the existing connection reduces the number of round-trips, but it's hard to see this in the traditional waterfall illustration. For a more accurate picture of how a page really loads you need to add the TCP port as a dimension of the drawing.
Figure 1 HTTP1.0 Waterfall Illustration
Figure 2 shows files loading on seven TCP ports, represented by the grey horizontal lines, and the number of milliseconds that have elapsed. The colored bars are the files; red is HTML text, gold represents style sheets, tan represents JavaScript, and blue lines are images. Here a single port is opened first, and then two more ports are opened. After the first files (CSS) are loaded, more files (JavaScript and images) are loaded on the same ports.
Figure 2 HTTP1.1 Keep-Alives
VRTA sits on top of the Microsoft Network Monitor (known as NetMon) 3.2 packet analyzer, though you don't need to know how to use NetMon. In the latest version of this tool, now being released publicly, the Global Performance team added visual details for files and packets. Figure 3 shows the details. Once a TCP connection is created, the pink shaded line shows the TCP response packet coming from the server. The grey line is the time-to-first-byte from the server, and the stairs indicate the response packets ramping up using TCP slow-start.
Figure 3 Details for Files and Packets
12 Easy Steps
It's easy to assume a Web page is always downloading at full speed and that more bytes will only slow down the page. The reality, though, is that unless the user is near the datacenter, it's very likely the browser will not make use of all the available bandwidth. Here are 12 different steps you can take to alleviate the problem.
1. Open Enough Ports Both the browser and the application can limit the number of ports opened in parallel. Internet Explorer 7.0 and earlier versions limit the number of HTTP1.1 ports to two per domain. So, for example, msn.com gets only two ports. If you host all of your static JavaScript, CSS, and images on a single domain, as in Figure 4, then these older browsers will still open only two ports. Spreading the content over multiple domains allows the browser to open more ports (see the markup sketch after Figure 5). Alternatively, browsers can be modified to open more ports; this is currently under consideration by the Internet Explorer team.
Pages often bottleneck on the download of a single file or, worse, on a series of single files, one after the other. Ideally, a page loads six to eight files in parallel. Take a look at the bit rate chart in Figure 4. There you'll see very little traffic for the first three seconds. In Figure 5, the orange file at the top-left corner of the chart is an SSL handshake. Inside the rectangles are indications about the packet flow that I will discuss in more detail later. If you scan the figure from left to right and count the number of files actively downloading at any one time, you'll see that it's not until three seconds into the download that more than one file loads concurrently. The SSL handshake starts out before another two red HTML files are loaded in series; then, finally, six ports are used to load nine files.
Figure 4 Too Few Ports and Many Small Files
Figure 5 Concurrent Ports
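One common way to coax more parallelism out of these older browsers is to spread static content across several host names, even if they all resolve to the same servers. The host names below are invented for illustration; the point is simply that each distinct domain earns its own pair of HTTP1.1 connections:

<!-- Each domain gets its own two connections in Internet Explorer 7.0 and earlier -->
<link rel="stylesheet" href="http://css1.example.com/site.css" />
<script src="http://js1.example.com/menu.js"></script>
<img src="http://img1.example.com/logo.gif" />
<img src="http://img2.example.com/banner.gif" />

Don't overdo it, though: each new domain also costs a DNS lookup and a fresh TCP connect, so a handful of static-content domains is usually enough.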
2. Limit the Number of Small Files to Be Downloaded It's hard to get full utilization of bandwidth across a wide area network (WAN) when shipping many small files. Figure 4 illustrates this throughout the page load but especially after the tenth second. Each of these icon image files is less than 500 bytes. If the server is 200ms RTT from the user and each file is one round-trip, then only five of these files could be loaded per second (5 files/sec × 500 Bytes/file × 8 bits/Byte = 20Kbps per TCP port).
Consolidating small files into larger ones allows more bytes to be sent per round-trip. For images, this is called image clustering or using sprites. A single file is downloaded with many images positioned side-by-side, which are then cropped out and placed on the page. Here's an example of this concept at work: a while back when we shipped Windows 2000, we had reports that it took six hours to download the upgrade to Germany from Redmond. My analysis found that 30,000 little files were being transferred one at a time. After we zipped them all into a single file, the download completed in 35 minutes instead.
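Here is a minimal sketch of the sprite technique; the file name, class names, and offsets are invented for illustration. One combined image holds a strip of 16×16 icons, and CSS background positioning crops out the one each element needs:

/* icons.gif holds all of the 16x16 toolbar icons side by side */
.icon        { width: 16px; height: 16px; background-image: url(icons.gif); }
.icon-mail   { background-position: 0 0; }      /* first icon in the strip  */
.icon-print  { background-position: -16px 0; }  /* second icon in the strip */
.icon-search { background-position: -32px 0; }  /* third icon in the strip  */

One Get now fetches every icon, so the per-file round-trip cost described above is paid once instead of dozens of times.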
3. Load JavaScript Files outside of the JavaScript Engine Internet Explorer 7 and earlier versions block on JavaScript files. Normally the browser downloads files in what's called speculative mode, as quickly as possible. But when it encounters JavaScript, the browser drops out of this mode and focuses on downloading the JavaScript. If you were to view this situation in VRTA, you would see that none of the JavaScript files overlap with other JavaScript files and that very few files of other types load at the same time.
One workaround is to move the JavaScript files to the end, but that misses the point that there is plenty of bandwidth to use. A better solution is to use a document.write of the files so they load outside of the JavaScript engine in the browser, as you see here:
// Write <script> tags into the document so the files download
// in parallel, outside of the JavaScript engine.
function AsyncLoad()
{
  var l = arguments.length;
  for (var i = 0; i < l; i++)
  {
    // The closing tag is split in two so the HTML parser does not
    // mistake it for the end of this enclosing script block.
    document.write("<script src='" + arguments[i] + "'></" + "script>");
  }
}

AsyncLoad(
  "file1.js",
  "file2.js",
  "file3.js");
We are anticipating that Internet Explorer 8.0 will finally resolve this problem.
4. Turn on Keep-Alives Opening a TCP port uses one round-trip, and getting a file uses at least one round-trip. If you can keep the connection open for additional files, you'll reduce the number of round-trips and increase the effective throughput. Note in Figure 6 that there is only one port per TCP connection (the grey horizontal lines). The bit rate jumps only on the larger files. You can see inside of each file rectangle the step pattern from increasing response volume.
Figure 6 Keep-Alives
This is TCP's slow start algorithm ramping up, sending more bytes per successful round-trip. That's how larger files can send at faster rates. Reusing a TCP connection allows subsequent files to continue with the expanded window size instead of starting over. Keep-Alives are on by default in IIS and unless you are hosting only a single file, they should always be left on.
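If you need to verify the setting, Keep-Alive behavior is controlled at the Web server rather than in your pages. As a minimal sketch for IIS 7 (the element and attribute shown are from the standard system.webServer configuration schema; on IIS 6 the equivalent switch lives in the site's properties), the relevant web.config section looks like this:

<configuration>
  <system.webServer>
    <!-- On by default; confirm no one has switched it off -->
    <httpProtocol allowKeepAlive="true" />
  </system.webServer>
</configuration>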
5. Identify Network Congestion It's important to differentiate between server and network delays. Network delays can often be seen by looking at the TCP connect time or by looking for retransmissions. In the delta column of Figure 7 you see the time between packets (in milliseconds). The TCP request to open a connection was the first packet sent by the client (frame 67 in Figure 7 ). Then 2914ms (2.9 seconds) later the client resent the request (frame 306 in the figure) since it did not get a response from the server. After the second request, the server did respond 110ms later (frame 307).
Figure 7 Time between Packets
It is interesting to note that retransmissions of TCP connect packets happen on an exponential back-off basis, while data packets within an established connection are fast-retransmitted. So failed connection attempts will often take two or more seconds before the first retransmission, while data packets might time out after only a few hundred milliseconds.
6. Increase Network Max Transmit Unit (MTU) or TCP Window Size Packet sizes can be as large as 1,514 bytes over Ethernet. Sometimes servers have the MTU (the size of the packets) limited to smaller sizes. If you see large files with response packets significantly smaller than 1,500 bytes, then you should check this.
Another related issue is the TCP window size. Normally this is 16 or 32KB, depending on the OS. It caps the number of bytes TCP can send before having to wait for an acknowledgment. You can think of this as the number of bytes per round-trip, though that is an oversimplification. TCP/HTTP usually starts out by sending a window of two packets in the first response, about 3KB of data. Then the server waits for an acknowledgment before sending another window. The window increases in size with each successful acknowledgment received. This allows TCP to judge how much bandwidth is available between the two devices.
Look for window sizes of less than 16KB coming from a server for extended periods, then change these back to the default. To find the window size, open NetMon and look at the "window size" of the server response packets.
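To get a feel for what the window size means in round-trips, here is a back-of-the-envelope sketch of my own (not part of VRTA) that estimates how many round-trips slow-start needs to deliver a file, assuming a 3KB initial window that doubles on each acknowledged round-trip until it reaches the receive window, with no packet loss:

// Rough model of TCP slow-start: the window starts at ~3KB (two packets)
// and doubles each round-trip until capped by the receive window.
function estimateRoundTrips(fileBytes, maxWindowBytes)
{
  var windowBytes = 3 * 1024;
  var sentBytes = 0;
  var roundTrips = 0;
  while (sentBytes < fileBytes)
  {
    sentBytes += windowBytes;
    roundTrips++;
    windowBytes = Math.min(windowBytes * 2, maxWindowBytes);
  }
  return roundTrips;
}

// A 64KB file needs about 6 round-trips with a 16KB window,
// but only about 5 with a 32KB window.

At 200ms RTT, that one window-size change saves a fifth of a second on every file of that size.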
7. Identify Server Congestion A slow server often shows up with a fast TCP connect time, a fast acknowledgment response at the TCP level, and then a slow Time-To-First-Byte data response packet. Think of the server as operating in layers (as in the Open System Interconnection, or OSI, model), each working independently of the others. The TCP layer is running at the kernel level or in the NIC card; it's very fast. When the request packet comes in, the TCP layer can send off an acknowledgment right away while it forwards the data up to the HTTP layer, which may need to do a disk seek before it can respond (see Figure 8).
Figure 8 Slow Server
8. Check for Unnecessary Round-Trips Not everything that happens during a page download actually needs to happen. Some actions are performed, or decisions made, simply because the browser is not given adequate instructions, such as how long a file can be cached.
9. Set Expiration Dates Static files such as JavaScript, CSS, images, and XML files are cached by the browser for the next time the user loads the page. The browser looks for an expiration date in the file's HTTP header to see if the file is still good. Unfortunately, many sites are not setting these dates. So the browser pulls the file from the cache, doesn't find a date, and sends a Get-if-Modified to the server. In most cases, the server responds with 304 Not-Modified, and the file is used from the cache.
Performant sites set their dates to three years in the future. Changing the file name or path, or adding an argument string, causes the browser to reload the file. Pause to consider how many new servers and how much bandwidth have been acquired just to handle Get requests for files the users already had.
Another benefit of expiration dates is how they affect entity tags (Etags). Etags tend to cause performance problems. They are meant to uniquely identify files using a number generated by the server. In clusters of servers, each server will create a different number. For example, let's say a browser is sending a Get-If-Modified for a file it found in its cache. If there is only one server, the file will most likely match and a 304 Not-Modified response will be sent. But if the user is accessing a larger server farm, the request is likely to hit a different server with a different entity tag, causing that server to resend the entire file to the browser.
Using VRTA, if you notice more than one JavaScript file getting a 200-OK from the server then the server is not recognizing the entity tag and will resend the entire file. Not only is the file reloaded, but often the browser will recognize that the file is the same as the one in the cache and will attempt to stop the server from completely resending it.
The browser does this by sending a Reset flag to close the TCP port. That, in turn, increases the number of round-trips because of the fact that a new port is needed to complete the download of the remaining files. A simple solution is to set the expiration date for the file so the browser can reuse the cached file without making a call to the server.
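In HTTP terms, setting a far-future date just means emitting the right response headers on static content. As a sketch (the date, content type, and max-age value are placeholders; 94,608,000 seconds is roughly three years), a response the browser can cache without any further Get-If-Modified traffic might look like this:

HTTP/1.1 200 OK
Content-Type: application/x-javascript
Expires: Sun, 01 Jan 2012 00:00:00 GMT
Cache-Control: max-age=94608000

Once dates are set this far out, remember that renaming the file or changing its path or argument string is the only way to push an update to users.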
Quiz: How Do You Improve Page Loading?
Answer these Web performance questions before you read the article and again after you've finished it to see how much your own performance has improved!
1. How many ports should you have running concurrently?
a. As few as possible
b. Many on broadband and fewer on narrowband
2. Which is faster to download across a WAN?
a. Many small files
b. A few large files of the same aggregate weight
3. JavaScript only blocks other JavaScript files.
a. False
b. True
4. Are Keep-Alives on by default?
a. Yes
b. No
5. The default expiration date
a. Is blank
b. Is 30 days from the download date
6. How do you stop Etags from reloading the file?
a. Set an expiration date
b. Ask the user to wait for the file to reload
7. Compression of HTML, JavaScript, and CSS files often achieves what reduction ratio?
a. 1.5 to 1
b. 3 or more to 1
8. Image files are always compressed.
a. True
b. False
10. Think before You Redirect Site addresses change over time, and redirects are often used to point users to new locations. However, these redirects chew up time and should be replaced where possible. Consider whether the site could simply respond from the old URL instead.
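The cost is easy to see on the wire. In the sketch below (URLs are placeholders), the browser burns a full round-trip receiving the 301 before it can even request the real page:

GET /oldpage HTTP/1.1
Host: www.example.com

HTTP/1.1 301 Moved Permanently
Location: http://www.example.com/newpage

Only after this exchange does the browser issue a second Get for /newpage; at a 200ms RTT, that's a fifth of a second gone before any useful bytes arrive.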
11. Use Compression Most JavaScript, CSS, and HTML files can be compressed by as much as 4-to-1. XML compresses even more. The neat thing about these file types is that the server can compress them one time and then store the compressed version. This reduces CPU load on the server and the volume of bits on the network.
Dynamic files such as ASPX need to be compressed for every user on the fly, so they are usually compressed by external devices like load balancers. I'm often asked whether compression consumes a lot of time on the browser and whether all browsers handle compression. First, consider the time it takes to load a large file across the network: in the example analyzed here, JavaScript files were taking one to five seconds to load. VRTA runs every file through a compression utility and reports the result in the compressibility column; ratios of 1 or less represent files that are already compressed.
Uncompressing these files takes only milliseconds on most modern PCs. Browsers must send a "will accept gzip, deflate" statement in the Get request to tell the server they can handle compression.
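Here is what that negotiation looks like on the wire (the file name and host are placeholders; the headers are the standard HTTP content-negotiation fields). The browser advertises what it accepts, and the server labels the compressed response:

GET /scripts/site.js HTTP/1.1
Host: www.example.com
Accept-Encoding: gzip, deflate

HTTP/1.1 200 OK
Content-Type: application/x-javascript
Content-Encoding: gzip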
Full-size images should always be reformatted when used as thumbnails. Also, it's a good idea to check out different formats for images to see which compresses the best—JPG, GIF, or PNG. Strangely, I have found images showing up as not compressed, so don't ever assume that all images are compressed properly.
Also, remember to remove white space from files because this can often yield a 20% reduction in size. Note that white space includes tabs, new lines, and especially remarks. There are several tools available to do this. Crunching and compression are both advised as the effect is cumulative.
12. Edit Your CSS For some reason, developers tend to spell out every font instance in their CSS files instead of using the cascading effect. If you read a CSS file and see the same words repeated over and over, it's not written in a performant way.
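As a small before-and-after sketch (class names invented for illustration), compare a style sheet that restates the same font on every selector with one that declares it once on body and lets the cascade carry it down:

/* Before: the same declarations repeated in every rule */
.header  { font-family: Verdana, Arial; font-size: 12px; color: #333; }
.sidebar { font-family: Verdana, Arial; font-size: 12px; color: #333; }
.footer  { font-family: Verdana, Arial; font-size: 12px; color: #333; }

/* After: declare once; child elements inherit through the cascade */
body     { font-family: Verdana, Arial; font-size: 12px; color: #333; }

The second version is smaller on the wire and easier to maintain.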
Using Visual Round Trip Analyzer
You know what to look for; now go back and look at your site. Download VRTA from the MSDN Magazine Web site and install it. You will be prompted to install NetMon as well if you do not have version 3.2 or later already installed. This will require admin privileges on the machine you are using.
After the install is complete, VRTA will start automatically. You'll also find its icon on your desktop. Click the Start button to have VRTA turn on NetMon capturing. Now visit your Web site again. Do only one atomic transaction (known as a Task) at a time. Navigate to one page and let it render—that's one transaction. Stop VRTA to view it. Before clicking on to the next page, start VRTA again, and repeat the process after that page renders.
Next, use the folder icon on the left to open an existing capture file. You can also use it to view all files; VRTA creates a text version of all files, called RTA.txt. The NetMon icon takes you directly to the capture file. Save, copy, paste, and print work as usual. You've already used the Start and Stop buttons. The Zoom buttons let you magnify the file visualization, which is really helpful for examining the smaller details. Use the Select Network control to tell VRTA whether to listen on your wired LAN interface or your wireless one. Set the scale using the maximum time dropdown.
There are three hover windows in the tool. The first is on the Main Chart tab where you can see the downloading files by port. Hover over any of the files to see more details about it. On the All Files tab, you'll see the files sorted in start sequence. Hover over the URI to see details about each file. On the Analysis tab, hover over any of the rules to see the exceptions list of files.
In addition to the Copy and Paste buttons, VRTA creates two files. The NetMon capture file contains the raw packet details. From this you can recreate all of the VRTA UIs. You can also drill down into the capture to go beyond what the expert tool is able to present. The RTA.txt file is a tabular format of the page-load-time statistics. You can open this in Microsoft Office Excel for easier reading.
While you are looking at different Web pages with the tool, I'd like to make sure you are aware of a few important points in Web Performance testing. If you are sitting practically on top of your servers when you test (meaning basically within a few hundred miles), but your users are around the world, then you need to either get out in the world or use a WAN simulator (emulator). These are software utilities that can modify the bit rate, RTT, and packet loss on the network you are using.
At Microsoft, we use a WAN simulator benchmark of 300/64Kbps bit rate, 300ms RTT for global content and 50ms RTT for local Content Delivery Networks. Our packet loss setting is 3%. This benchmark allows us to test across multiple versions of multiple applications and see their relative performance in the same environment.
Remember, a sample of one is just that—a sample of only one. Every capture is slightly different, and often they may be very different because of the changing ad content and congestion on the network or servers. I highly recommend that you do several samples until you are able to see a steady state.
When NetMon installs, it includes a command-line version called NM_cap.exe. Create a script that starts NM_cap, navigates to a page, and then, after the render, stops NM_cap and saves the file. Open and review the result using the VRTA UI.
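A batch file along these lines can automate the capture. Treat the details here as assumptions: the binary may appear as nmcap.exe on your build, the flags are from memory of the NetMon 3.2 command line (run the tool with /? to confirm), and the URL and file names are placeholders:

rem Start capturing all interfaces to a file (verify name and flags with /?)
start nmcap /network * /capture /file pageload.cap
rem Open the page under test
start iexplore http://www.example.com/
rem Wait for the page to finish rendering, stop the capture
rem (Ctrl+C in the capture window), then open pageload.cap in VRTA.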
You should also recreate the first-time user experience (called Page-Load-Time 1) by going to an "About:Blank" page and then deleting the cached files in Internet Explorer | Tools | Internet Options. We use the blank page because Internet Explorer will otherwise automatically reload the page you are viewing. Close the browser to force any TCP connections to close, then open the browser to the page you are testing. For Page-Load-Time 2 (a user returning the next day), just close the browser and reopen it to the test page.
One final warning: beware of corporate proxy servers. These servers firewall corporate resources from malicious intruders on the Internet, but they also distort your performance sample. Compression is often turned off, which may mislead you into thinking it's off at your source server. Proxies also mask Keep-Alives. It's best to test outside of corporate proxies, directly on the Internet.
Final Thoughts
The report generated by VRTA and the best practices you've read about here represent what is known today. But this is not all there is to know. Drill down into the raw data that NetMon captures and look for new patterns and best practices yourself.
Many thanks to a great team for helping me to get this new visualization tool created: Ron Oshima, Lucius Fleuchaus, Jason He, Vinod Mangalpally, and Doug Franklin.
Jim Pierson has been with Microsoft for 12 years, 5 of which were focused on Web Performance Engineering. Jim is Principal Group Manager in the GlobalPerf team that provides tools for measurement, analysis, and training of best practices to the online services of MSN and Windows Live.
Quiz answers: 1=b, 2=b, 3=a, 4=a, 5=a, 6=a, 7=b, 8=b