Yesterday I got an email from a client describing a really, really weird situation that had occurred with our product. Of course, they couldn’t provide a way to reproduce the problem. Fortunately, there were only two users on the system at the time (it was in their integration testing environment), so they could tell what each of them was doing.
One person’s actions I could dismiss pretty quickly as the cause of the problem, so it must have been what the other did. However, her actions also seemed unlikely to have caused the problem. I started exercising the system in ways related to her actions, in hope of reproducing the problem. No luck whatsoever.
So I stepped back a little and started reasoning from the code. What could possibly have caused this? I came up with a scenario, tried it, and sure enough, there it was. But the problem was that my actions in no way resembled the description of the client’s actions. And on top of that, my actions seemed rather bizarre. Why would anyone want to do this?
I know debugging isn’t always an exact science, but my hypothesis was in real need of some testing.
Enter log files. Our product is a web application running in Apache Tomcat, for which it’s pretty easy to enable logging. Tomcat’s access log follows the Common Logfile Format, which looks like this (all on one line):
127.0.0.1 8080 - - [27/Jun/2008:08:41:49 +0200] "GET /docato-composer/getLoginDialog.do HTTP/1.1" 200 3132
HTTP request is logged on a single line, with the IP address of the client first, then some identity information (missing in the example), the time, the kind of request (
URL, the protocol (
HTTP/1.1), the result status code, and the result size. (Tools like Webalizer can parse such log files easily to provide statistics for web sites.)
I got the access log from our client, and put on my CSI hat. For each of the steps in my scenario, I looked up the associated
URL and searched for it in the log. And yes, bizarre as it may have appeared to me, they were all there: conveniently one after the other, from the same IP address and just before the time the client noticed the problem. Case closed.
The morale of this story is that log files are a Good Idea™. Without them I might have dismissed my scenario as too unlikely, and have spent valuable time chasing alternative hypotheses. Also, while browsing the log files, I stumbled upon two other problems that the client didn’t even report. I fixed these as a bonus 😀