Monitoring Page Load Times

I've had an ongoing problem for the past several months where load times for site pages that use web services are occasionally 12 seconds, when they should be only 2-3 seconds.

For instance, you will see this on a Display Event page, but not on the homepage because that is cached.

The problem is sporadic, so I now run a cron job once an hour that loads nine pages using PHP's cURL extension. That gives me hard numbers to back up the complaints I file with my web host's support. I've been filing tickets about this for a long time, but they have either blamed me or claimed not to see the problem.

The combined load time for these nine pages is normally 10-20 seconds. However, several hours ago it hit 160-180 seconds.

I think using cURL is a good method for testing page load times. You could also monitor your server's CPU load at the same time.
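
For reference, here is a minimal sketch of the kind of hourly cURL check described above, in PHP. The URLs and log path are placeholders, not the site's actual pages:

<?php
// check_load_times.php - sketch of an hourly page-load check using PHP's cURL extension.
// The URLs and log path below are placeholders standing in for the nine real pages.
$urls = array(
    'http://example.com/display-event?id=1',
    'http://example.com/display-event?id=2',
    // ... the rest of the nine pages to test
);
$logFile = '/var/log/page-load-times.log';
$lines   = array();

foreach ($urls as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // capture the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 120);         // give up eventually on a very bad day
    curl_exec($ch);

    // cURL records detailed timing for each request; the total time is what gets logged here.
    $total   = curl_getinfo($ch, CURLINFO_TOTAL_TIME);
    $lines[] = sprintf('%s %s %.2fs', date('c'), $url, $total);
    curl_close($ch);
}

file_put_contents($logFile, implode(PHP_EOL, $lines) . PHP_EOL, FILE_APPEND);

Run from cron once an hour (for example, 0 * * * * php /path/to/check_load_times.php), this builds a timestamped history that is easy to attach to a support ticket.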

Pingdom

Pingdom has an OK-ish tool for breaking down the elements loaded in a page. IBM has a better one, also free, called IBM Page Detailer. Both are really handy for diagnosing these kinds of issues.

IBM Page Detailer

IBM Page Detailer looks a lot like Google's "Load Time Analyzer" extension for Firefox, though its user interface is better.

Firewall Related?

Could this be firewall-related? I asked support and they said "no". But I found a quote on IBM.com explaining that web service requests can run into trouble at the firewall because they are treated differently:

"While SOAP was designed to work within existing Web application environments, the protocol may introduce firewall and routing problems. Unlike a normal Web server using HTTP, all SOAP messages are the equivalent of HTTP form submits. The calls move much more data than the average HTTP GET or POST call, and network performance is bound to be impacted.

Special testing of the firewall and routing equipment should be undertaken. For example, you should check your firewall's security policy to make certain it doesn't monitor SOAP-requests as Web traffic. If it does, you may find the firewall shunting away traffic that looks like a Denial of Service (DoS) attack."

(Source)

The most recent bout of slow-loading pages lasted for 13 hours (plus or minus two hours). If the problem has a regular duration, I'm hoping that will point to a setting in some piece of software with the same timeframe. Or, if the problem starts at a regular date/time, I can check the server logs to see what is happening then. I suspect it is caused by a surge of traffic, either from regular users (and thus during peak hours on weekdays) or from robots on the weekend (Googlebot and Yahoo's bot).

To the extent I've alleviated the problem with caching (which dramatically reduces the number of web service requests), the benefit mainly goes to regular users, who focus their visits on a small set of pages. The search engine bots, by contrast, will still try to visit all the pages and trigger the delay. So my theory is that the problem is more likely to occur on weekends, or during off-hours (between midnight and 5 am), when Googlebot is most active.

It Has Happened Again

I've noticed it has happened again. So far it seems random: the site is slow for a couple of hours and then back to normal. There isn't enough data yet to detect a pattern.

I'm thinking of analyzing the number of web service hits per hour to see if that is what triggers it. I'm not sure how I'll do this; I don't want to parse my 100 MB+ log files. I guess I could log every use of a web service on the server side by modifying my server code.
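
As a rough sketch of what that server-side logging might look like (the helper name, log path, and service name are all hypothetical, not taken from my actual code):

<?php
// Hypothetical helper: called wherever the application invokes a web service.
// It appends one line per call, so calls per hour can be counted later
// without parsing the main (100 MB+) access logs.
function log_webservice_call($serviceName)
{
    $line = sprintf("%s\t%s\n", date('Y-m-d H:i:s'), $serviceName);
    file_put_contents('/var/log/webservice-calls.log', $line, FILE_APPEND | LOCK_EX);
}

// Example call site (the service name is a placeholder):
log_webservice_call('getEventDetails');

Since each line starts with a timestamp, counting hits per hour is just a matter of grouping on the date-and-hour prefix.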

Problem Might Be Fixed

So my host fixed this problem and their solution was:

"I changed the Resolver Nameservers on the server as the old resolvers don't have enough memory and when the memory reaches its limit they start responding slowly."

In which case I feel vindicated that I was right all along (I kept telling support they had a problem, and they kept denying it). I'm even happier that the issue is resolved; 12-second page load times suck.

On the other hand, the explanation sounds very plausible, but I really have no idea how nameservers work. It's not fun to know there is an entire class of problems, which I don't understand, that could grind my website to a near halt.
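
In hindsight, the timing script above could have pointed at the resolver directly: run on the server itself against the web service endpoints, cURL can break each request down into DNS lookup, connect, and total time. A minimal sketch, with a placeholder URL:

<?php
// Sketch: split one request into DNS lookup vs. connect vs. total time.
// A slow or overloaded resolver shows up as a large name-lookup value.
$ch = curl_init('http://example.com/some-webservice-endpoint'); // placeholder URL
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_exec($ch);

printf(
    "DNS lookup: %.3fs, connect: %.3fs, total: %.3fs\n",
    curl_getinfo($ch, CURLINFO_NAMELOOKUP_TIME),
    curl_getinfo($ch, CURLINFO_CONNECT_TIME),
    curl_getinfo($ch, CURLINFO_TOTAL_TIME)
);
curl_close($ch);

If the name-lookup time dominates, the slowness is in DNS resolution rather than in the application or the network path.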