Root Cause Analysis

I’ve moved on to a new project recently. It’s quite different from the previous one. Before I worked on a monolythic web application, now we’re using OSGi. As a result, our project consists of a lot of sub-projects (OSGi bundles) which makes it very inconvenient to use Ant. So we’ve switched to Gradle. Our company has also standardized on Perforce, where we used Subversion before.

To summarize, a lot has changed and I’m not really up to speed yet. This is evidenced by the fact that I broke our build 4 times in 5 days. As an aspiring software craftsman, I feel really bad about that. So why did this happen so often? And can I do anything to prevent it from happening again? Enter Root Cause Analysis.

Root cause analysis is performed to not just treat the symptoms, but cure the disease:

The key to effective problem solving is to first make sure you understand the problem that you are
trying to solve – why it needs to be solved, how you will know when you’ve solved it, and what the
root cause is.
Often symptoms show up in one place while the actual cause of the problem is somewhere
completely different. If you just “solve” the symptom without digging deeper it is highly likely that
problem will just reappear later in a different shape.

Henrik Kniberg has written about one way of doing root cause analysis: using Cause-Effect diagrams. Using this method, I ended up with the following:

Build Failure Cause Effect Diagram

I started out with the problem: Build Failure, in the orange rectangle. I then repeatedly asked myself why this is a bad thing and added the effects in the red rectangles. Just repeat this until you find something that conflicts with your goal. Finally, I repeatedly added causes in blue rectangles. You can stop when you’ve found something you can fix, but in general it’s good to keep asking a bit deeper than feels comfortable. This is where the Five Whys technique comes in handy.

As you can see, my main problem is impatience. For those who know me, that won’t come as a surprise 😉 However, in this case this personal flaw of mine gets in the way of my goal of making customers happy.

With the causes identified, it’s time to think up some countermeasures. Pick some causes that you can fix, and think of a way to treat them. The ones I’m going to work on are marked with a star in the figure above:

  • Use a checklist when submitting code to the source code repository. This will prevent me from making silly mistakes, such as forgetting to add a new file
  • Take the time to learn the tools better, in particular Gradle and Groovy
  • In general, try to be more patient

The last one is the hard one, of course. Wish me luck!

Update 2010-08-01: I have created a checklist with things to consider before submitting code to our Perforce repository and used this checklist all week. I broke the build twice this week, both times because of special circumstances that were not accounted for on the checklist. So I think I’m improving, but I’m not there yet.

Advertisement

Performance tuning an Ant build

Common advice in the Agile world is to maintain an automated build that runs in under 10 minutes. I doubt anybody would disagree that a faster build is better than a slower one. But how do we keep the build slick?

Usually the bulk of the build time is spend in executing tests, so that is the first place to start. But as with any optimization effort, you shouldn’t guess where the pain is, but measure. So how do we do that for an Ant build?

This turns out to be not hard at all. Ant will happily inform your listener of any interesting event, such as task started or stopped. Using that information, it is pretty straightforward to write a listener that records the time the tasks and targets take.

But you don’t even have to do that, you can simply use the open source Ant Utilities project. Simply place the jar in Ant’s lib directory and run Ant as follows:

ant -listener net.java.antutility.BuildMetricsListener
[target]

At the end of the Ant build, a profile report will be displayed:

...
BUILD SUCCESSFUL
Total time: 4 minutes 6 seconds
BUILD METRICS:
Local Time, Child Time, Invocation Count, Type, Name, Location
88453, 0, 1, TASK, macker, build-core.xml:448:
70955, 0, 8, TASK, javac, dist\build.xml:608:
36563, 0, 6, TASK, jar, dist\build.xml:628:
31047, 0, 1, TASK, checkstyle, build-core.xml:251:
4031, 0, 1, TASK, exec, dist\build.xml:947:
1922, 0, 1, TASK, taskdef, build-core.xml:345:
1797, 0, 1, TASK, signjar, build-core.xml:233:
1688, 0, 1, TASK, uptodate, build-core.xml:425:
1018, 1557, 21, TASK, for, build.xml:19:
688, 0, 1, TASK, taskdef, build-core.xml:404:
609, 0, 1, TASK, property, build-core.xml:171:
533, 0, 21, TASK, taskdef, dist\build.xml:15:

The first column indicates the time spent in the element (task or target), the last two the name of the element and the build file name and line number.

But it gets better still. You can also run your Ant build from Eclipse and get a visual indication of the pain points in your editor:
Visual indication of slow Ant elements in Eclipse

Using factory classes in Ant tasks

So you have this nice factory class that prevents your client code from knowing the implementation class of the instances it needs to create and that lets it program to an API only.

Of course, at some point somebody needs to know the implementation class. Since the factory is the one creating instances, it either needs to know itself or be told. And since the factory is probably in the same package as the API, it shouldn’t know the implementation class itself, since that would tie the API package to the implementation package. So the factory needs to be told:

public class MyFactory {

  private static Class implementationClass = null;

  private MyFactory() {
    // Utility class
  }

  /**
   * Create a new instance.
   * @param data Data needed to initialize the instance
   * @return The newly created instance
   */
  public static MyInterface newInstance(final Object data) {
      assertImplementationClass();
      final Class clazz = implementationClass;
      if (data == null) {
        try {
          final Constructor constructor = clazz.getConstructor();
          result = (MyInterface) constructor.newInstance(
              new Object[0]);
        } catch (final Exception e) {
          result = null;
        }
      } else {
        final Constructor[] constructors = clazz.getConstructors();
        for (int i = 0; result == null && i < constructors.length; 
            i++) {
          final Constructor constructor = constructors[i];
          if (constructor.getParameterTypes().length == 1
          && constructor.getParameterTypes()[0].isInstance(data)) {
            try {
              result = (MyInterface) constructor.newInstance(
                  new Object[]{data});
            } catch (final Exception e) {
              result = null;
            }
          }
        }
    }

    return result;
  }

  /**
   * Register a class that implements the interface.
   */
  public static void registerImplementation(
      final Class implementation) {
    implementationClass = implementation;
  }

  /**
   * Unregister the implementation class.
   */
  public static void unregisterImplementation() {
    implementationClass = null;
  }

  private static void assertImplementationClass() {
    if (implementationClass == null) {
      throw new IllegalStateException(
          "Implementation class not set");
    }
  }

}

Now, who’s going to tell the factory what class to instantiate? There must be some entry point in the application where this happens. In your tests (you do write tests, right?), you can do that in the set up method. In a web application, you can do that in the ServletContextListener.

Ant

But what about in Ant tasks? You could create an Ant task that does just that and call it from a dependent target:

  <target name="--init-factory" unless="factory.inited">
    <property name="impl.class" 
        value="com.mycompany.myapp.MyImplementation"/>
    <taskdef name="register-impl"
        classname="com.mycompany.myapp.ant.RegisterTask" 
        classpath="..."/>
    <register-impl classname="${impl.class}"/>
    <property name="factory.inited" value="true"/>
  </target>

However, that doesn’t work. So what’s up?

Debugging Ant tasks

Our Ant task seems so simple that it is hard to see what could be wrong with it. So we want to debug it and find out.

You can debug Ant tasks by setting the environment variable ANT_OPTS:

SET ANT_OPTS=-Xdebug -Xrunjdwp:transport=dt_socket,address=6000,server=y,suspend=n

Now when you run your Ant script, you can attach your debugger on port 6000. You may want to use the input task to have the build wait while you attach your debugger.

Debugging reveals something interesting: The registerImplementation method does get called with the right parameter, but when newInstance is called, implementationClass is still null. Apparently Ant is doing some fancy classloader stuff that gets in our way.

The solution is to have the Ant task set a system property that the factory uses:

  private static void assertImplementationClass() {
    if (implementationClass == null) {
      final String className = (String) 
          System.getProperties().get(IMPLEMENTATION_CLASS_PROPERTY);
      if (StringUtils.isBlank(className)) {
        throw new IllegalStateException("Implementation class not set");
      }
      try {
        registerImplementation(Class.forName(className));
      } catch (final ClassNotFoundException e) {
        throw new IllegalStateException("Invalid implementation class: " 
            + className + "\n" + e.getLocalizedMessage());
      }
    }
  }

Automated distribution creation (4)

In this series of posts, I talked about my continuing quest for the fully automated creation of a distribution for our product. I talked about downloading release notes from our issue tracker and how to add those to our NEWS file. Last time, I had the build automatically update some text files. Now, my journey comes to an end.

The final manual step in creating a distribution for our product, is about incorporating the latest version of the manual. Since our product is an extension of the Component Content Management System Docato, we naturally use Docato itfself for writing our manual. This means the source code to the manual is not in Subversion with the rest of the code, and is not easily available to the build system. So what to do?

Well, Docato is pretty versatile. One can have a publication’s output sent to a well-known location on a server, for example. From there, we can download it using Ant’ scp task:

<scp file="${remote.manual.html.files}"
    todir="${local.manual.html.dir}"/>

where ${remote.manual.html.files} can use wildcards, like ${download.dir}/html/AMDS-CMS/*. But the download directory is on a different machine, so you need to provide the username and password for logging into that machine:

<property name="download.dir"
    value="${username}:${password}@${host}:${remote.dir}"/>

Of course, having a username and password in an Ant build file is a security risk. But in this case, the build file is not available outside our firewall, so we’re good. Otherwise, you’d have to provide the user credentials on the command line when running the Ant target:

ant create-dist -Dusername=foo -Dpassword=bar

scp also allows you to rename a file when downloading:

<scp file="${remote.manual.pdf.file}"
    localtofile="${local.manual.pdf.file}"/>

So now we can download the manual and incorporate it in our distribution. But how do we know we have the latest version of the manual?

Again, Docato comes to the rescue. It has the concept of scheduled tasks, actions that automatically run in the background from time to time. These are ideal for making backups, for instance.

So I created a scheduled task that builds a publication, and installed the task code on our manual server. Now every time a tech writer edits something, the edit will be automatically published at most a day later.

And so my journey ends.

But every end is a new beginning. Now that our distribution can be built by running a single Ant target, a whole new world opens up to us. My plan is to create a distribution automatically as part of our CruiseControl build. And then install it automatically, and run some tests against the installed version. Also, the distribution could be made available on some well-known server, so that interested people could always use the latest version for giving demos, for instance. But only when all the tests pass, of course 😉

Automated distribution creation (3)

In previous posts, I talked about my continuing quest for the fully automated creation of a distribution for our product. First, I talked about downloading release notes from our issue tracker. Then I showed how to add those to our NEWS file. This time, it’s time for updating some text files.

Our product is parametrized by some properties in property files. For instance, the number of cache pages to use for our embedded database is stored in the xhive.server.cache property in the file build.properties. In a typical installation on a production server, you’d want to set this to some high number, since you would have lots of RAM. However, on our development machines, we are not so fortunate. So the build.properties file contains a value that works for us developers. But when we build a distribution, we need to increase the value.

Enter the replaceregexp Ant task. This task lets you replace the occurrence of a given regular expression with a substitution pattern in a file. So the following piece of Ant script sets the property to 150000:

<replaceregexp file="${property.file}"
    match="(xhive.server.cache\s*=)\s*.+"
    replace="\1 150000" byline="true"/>

I use a backreference in the regular expression to prevent duplication of xhive.server.cache. Also note the use of \s* to match any amount of whitespace.

But once the value is set this high, I can no longer start up our web application on my local machine, since I don’t have the required amount of RAM for that many cache pages (they’re 4K a piece). So I use a similar piece of code to set it back to the development value once the distribution is built.

I also need to update some shell scripts that start Java programs. These shell scripts specify the maximum amount of RAM available to the JVM using the -Xmx and -Xms command line options to java:

<replaceregexp file="${shell.script.file}"
    match="(-Xmx)[0-9]+(m -Xms)[0-9]+(m)"
    replace="\1512\2128\3" byline="true"/>

The replace pattern isn’t as readable as I would like with the backreferences \1, \2, and \3, but I still prefer that to the duplication of the command line options.

Note however, that this messes up the file permissions on *nix systems. So I use the chmod Ant task to fix that:

<chmod file="${shell.script.file}" perm="755"/>

The above code snippets show how to change property values. But I also need to add a value:

<replaceregexp file="${property.file}"
    match="docato-server\s*=.*"
    replace="" byline="true"/>
<echo file="${property.file}"
    message="docato-server = ${tomcat.host}/docato-composer${line.separator}"
    append="true"/>

I first use a replaceregexp to remove any occurrence of the property, and then add it using the echo task. The replaceregexp is necessary to be able to run the Ant script multiple times without the property being added multiple times. Note the use of ${line.separator} to add a newline in a cross-platform manner.

Automated distribution creation (2)

In my previous post I talked about how I managed to automatically download the release notes from our issue tracker web site. These notes still needed adding to our NEWs file, which describes the changes between releases.

There are really two scenarios to deal with here: the release notes for the current release either are already in the NEWS file, or they are not. They are already there when you rebuild the distribution for a release, for example when you’ve found something wrong with it and fixed that. For a human, this is pretty simple to detect, but how does an Ant script know?

Enter the Ant filter chain. This construct resembles a Unix pipe in that you can use it to feed output of one as input to the other. Here’s how I retrieve the version that is currently in the NEWS file:

<loadfile property="current.version"
    srcFile="${news.file}>
  <filterchain>
    <headfilter lines="1"/>
    <striplinebreaks>
    <tokenfilter>
      <replaceregex pattern="[a-zA-Z\s]*([1-9]+\.[0-9]+).*"
          replace="\1"/>          
      <replacestring from="." to="\."/>
    </tokenfilter>
    </striplinebreaks>
  </filterchain>
</loadfile>

The loadfile task loads the srcFile into the current.version property. But not just as is, no there is a filterchain applied first. The first item in the chain is headfilter, which works just like the Unix head command: in this case it gives the first line of the NEWS file. I don’t want a line, but a string, so next I remove the line ending with the striplinebreaks filter.

Then it’s time for some good old regular expression to extract the version number from the string. The first line of the NEWS file looks like this: Changes in 1.4.0. So I match the text with [a-zA-Z\s]* and then the actual version number with ([1-9]+\.[0-9]+).*.

Note that I use a group to capture only the major and minor version (1.4 in the previous example). The reason for that is that whenever we deliver patch releases, we don’t add a whole new section to the NEWS file, but just expand the current section with the few cases that were fixed by the patch. Since we sort the cases in descending order of reporting, the patch cases will always be at the top.

Following the regular expression there is a replacestring filter that inserts backslashes before points. The reason for that becomes clear when we look at how the Ant script actually uses the current.version property:

<condition property="same.release">
  <matches string="${full.version}"
      pattern="${current.version}"/>
</condition>
<antcall target="--remove-current-release-from-news"/>
<antcall target="--add-current-release-to-news"/>

The --remove-current-release-from-news target is only executed when the same.release property is true:

<target name="--remove-current-release-from-news"
    if="same.release">
  <property name="previous.version.file"
      value="${news.dir}/previous.version.txt"/>
  <echo message="${previous.version}"
      file="${previous.version.file}"/>
  <loadfile srcFile="${previous.version.file}"
      property="escaped.previous.version">
    <filterchain>
      <tokenfilter>
        <replacestring from="." to="\."/>          
      </tokenfilter><tokenfilter>
    </tokenfilter>
  </filterchain>
  </loadfile>
  <delete file="${previous.version.file}"/>
  <replaceregexp file="${news.file}"
      match=".*(Changes in ${escaped.previous.version}.*)"
      replace="\1" flags="s"/>          
</target>

The bulk of the work is done in the final replaceregexp task, where everything before the text Changes in <x>.<y>.<z> is deleted. The code before that is just a convoluted way to escape points in the previous version number. Unfortunately, I’m not aware of any Ant task that can execute a regular expression against a property, so I first put the property into a temporary file and then operate on that file.

Finally, all that is left, is to add the release notes for the current version to the NEWS file:

<target name="--add-current-release-to-news">
  <property name="new.news.file"
      value="${news.dir}/new.news.txt"/>
  <concat destfile="${new.news.file}">
    <path>
      <pathelement location="${release.news.file}"/>
      <pathelement location="${news.file}"/>
    </path>
  </concat>
  <move file="${new.news.file}"
      tofile="${news.file}"/>
  <delete file="${release.news.file}"/>
</target>

The only tricky part here is that the concat task doesn’t allow one of its input files to also be the output file. So I have to introduce a temporary file. Then when all is done, the file containing the NEWS section for this release, release.news.file, is no longer needed.

Automated distribution creation

So we have this automated build with CruiseControl. It generates code, compiles, deploys, and tests. It’s saved my skin a gazillion times. It’s really great.

But it could be even better. It could also build a complete distribution, making the whole software release process a non-event. That’s one of my goals for the coming weeks. So stay tuned. 😉

Currently, the process to build a distribution of our product requires a couple of manual steps. One of these steps is to update the NEWS file, which describes the changes between releases. Of course, everything that changes between releases, is documented in the issue tracking system, in our case FogBugz. (FogBugz is OK to work with most of the time, although I think there are better alternatives, like Jira.)

FogBugz lets you add release notes to each issue (which it calls case), and it provides a standard report to show the release notes for all cases scheduled for a specific release. You can even download this report in XML.

The only problem is that this functionality doesn’t work most of the time. The only time when it is guaranteed to work, is when you try it on the server that hosts FogBugz. Since this machine is in the server room, this is inconvenient to say the least. But even if this functionality worked flawlessly every time, everywhere, it would still be a manual step to collect the XML file.

So I turned to HtmlUnit, a “browser for Java programs. It models HTML documents and provides an API that allows you to invoke pages, fill out forms, click links, etc… just like you do in your normal browser.” We use this great tool a lot to write our acceptance tests.

This time, I used HtmlUnit’s WebClient from within an Ant task to log in into FogBugz, generate the release notes report, extract cases with release notes (some cases have none, since they are too trivial to bother the end user with), and write them to an XML file. This allows me to transform the XML file to plain text using XSLT, giving a NEWS file section for the current release. The next step is to automagically add this to the existing NEWS file. This should be easy enough using Ant’s concat task. I will let you know how this works out.