Log Files to the Rescue

Yesterday I got an email from a client describing a really, really weird situation that had occurred with our product. Of course, they couldn’t provide a way to reproduce the problem. Fortunately, there were only two users on the system at the time (it was their integration testing environment), so the client could tell me what each user had been doing.

I could quickly rule out one person’s actions as the cause of the problem, so it must have been something the other did. However, her actions also seemed unlikely to have caused it. I started exercising the system in ways related to her actions, hoping to reproduce the problem. No luck whatsoever.

So I stepped back a little and started reasoning from the code. What could possibly have caused this? I came up with a scenario, tried it, and sure enough, there it was. But the problem was that my actions in no way resembled the description of the client’s actions. And on top of that, my actions seemed rather bizarre. Why would anyone want to do this?

I know debugging isn’t always an exact science, but my hypothesis was in real need of some testing.

Enter log files. Our product is a web application running in Apache Tomcat, for which it’s pretty easy to enable logging. Tomcat’s access log follows the Common Log Format, which looks like this (all on one line):

8080 - - [27/Jun/2008:08:41:49 +0200] "GET /docato-composer/getLoginDialog.do HTTP/1.1" 200 3132

Each HTTP request is logged on a single line: first the IP address of the client, then some identity information (missing in the example), the timestamp, the kind of request (GET), the URL, the protocol (HTTP/1.1), the HTTP status code (200), and the response size in bytes (3132). (Tools like Webalizer can parse such log files easily to provide statistics for web sites.)
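To make the field layout concrete, here is a minimal sketch of a Common Log Format parser in Python. This isn’t tooling from our product; the sample IP address is made up for illustration.

```python
import re

# Regex for one Common Log Format line:
# host identd user [timestamp] "request" status size
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<identd>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
)

def parse_clf_line(line):
    """Parse one access-log line into a dict, or return None if malformed."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    # The request field is usually "METHOD URL PROTOCOL"
    parts = entry["request"].split()
    if len(parts) == 3:
        entry["method"], entry["url"], entry["protocol"] = parts
    return entry

# Hypothetical client IP; the rest matches the example line above.
line = ('127.0.0.1 - - [27/Jun/2008:08:41:49 +0200] '
        '"GET /docato-composer/getLoginDialog.do HTTP/1.1" 200 3132')
entry = parse_clf_line(line)
print(entry["method"], entry["url"], entry["status"])
```

Once each line is a dict, filtering by URL, IP address, or time window becomes a one-liner.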

I got the access log from our client, and put on my CSI hat. For each of the steps in my scenario, I looked up the associated URL and searched for it in the log. And yes, bizarre as it may have appeared to me, they were all there: conveniently one after the other, from the same IP address and just before the time the client noticed the problem. Case closed.
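The search itself I did by hand, but it amounts to something like the sketch below: keep every log line that mentions one of the scenario’s URLs, in log order. The sample lines, IP addresses, and URL list are invented for illustration, not taken from the client’s actual log.

```python
# Hypothetical access-log lines; in reality these came from the client's file.
log_lines = [
    '10.0.0.5 - - [27/Jun/2008:08:41:49 +0200] '
    '"GET /docato-composer/getLoginDialog.do HTTP/1.1" 200 3132',
    '10.0.0.7 - - [27/Jun/2008:08:42:03 +0200] '
    '"GET /docato-composer/someOtherAction.do HTTP/1.1" 200 812',
]

# One URL per step in the reproduction scenario (shortened here).
scenario_urls = ["/docato-composer/getLoginDialog.do"]

# Keep lines matching any scenario URL, preserving log order.
matches = [line for line in log_lines
           if any(url in line for url in scenario_urls)]
for line in matches:
    print(line)
```

If the surviving lines come from one IP address, back to back, just before the reported time, you have your confirmation.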

The moral of this story is that log files are a Good Idea™. Without them I might have dismissed my scenario as too unlikely and spent valuable time chasing alternative hypotheses. Also, while browsing the log files, I stumbled upon two other problems that the client hadn’t even reported. I fixed these as a bonus 😀

Strange things happen…

On Fridays, I work from home to avoid wasting time commuting. So today, I started out fresh, ready to rock and roll. But alas, I was off to a slow start.

To access my company’s resources, I use a Virtual Private Network (VPN). In particular, I use Cisco’s VPN client software on Ubuntu GNU/Linux. However, this piece of software tends to break on every kernel update 😦 Yes, you guessed right: this week I received a kernel update to 2.6.24-17.

When I previously upgraded Ubuntu to 8.04, I received kernel 2.6.24-16, and the VPN client broke then as well. I had to apply a patch, but it didn’t apply cleanly: some of its changes were rejected. I then fixed the code by hand to make sure all the changes in the patch were applied. And then the VPN client finally worked.

So I expected another one of those sessions. But this time, googling turned up nothing. Since I’m close to a deadline, I decided to simply restart my computer and choose the 2.6.24-16 kernel from the GRUB boot menu. Since the client used to work with that kernel, I expected it to work now. But no such luck: I still got an error about the Connection Manager being unable to read the connection entry.

Getting a bit desperate, I redid the VPN client installation. Now it worked 😀 Feeling lucky, I rebooted into kernel 2.6.24-17, and it still worked. Sometimes I just don’t understand computers…

Update 2008-08-15: Check out this page with Unofficial Cisco VPN Client Updates for Linux.