Last week, we tested the upgrade procedure for the new version of our product. We got a backup from one of our clients that was over 60Gb, so we could put it to good use by testing the performance of the upgrade against it. This sort of testing is always crucial for making sure the upgrade won’t disrupt production too much.
One of the steps in the upgrade was the deletion of stale data. It dealt with two entities in a one-to-many relationship. For this discussion, lets call these entities A and B. For each A, there can be multiple Bs, whereas each B is associated with exactly one A. The upgrade used our product’s API to select A objects matching the required criteria, and then deleting them. The API implementation makes sure that when an A object is deleted, its B objects are also deleted.
This is standard encapsulation practice, nothing fancy. But there was one problem with it: the deletion process was way too slow. We broke it off after over three and a half hours, which is clearly unacceptable.
So we turned to the code, and found two loops: one iterating over the A objects, and within the delete()
of A, one iterating over the B objects. Since there can be many, many B objects to search through, this inner loop really hurts when executed repeatedly. We say that this algorithm is O(n×m), where n is the number of A objects and m the number of B objects. By first deleting all B objects related to A objects that match the criteria, and only then deleting the A objects, we could potentially change the algorithm to O(n+m), which of course is much faster.
That didn’t work out, though, since the delete()
method in class A still contained the loop over B objects, even though we now knew for sure that none of the B objects would match (since we deleted them previously). So we broke encapsulation by extracting a doDelete()
method that just deletes the A object, nothing more.
We had a similar problem with B’s delete()
method. This code sends a notification to its A object and performs other housekeeping. In our situation, this is clearly unnecessary, since that A object is about to be deleted as well. So we again broke encapsulation and extracted a doDelete()
method for class B as well.
Now we had the performance we required: the deletion process was down to two minutes. But we lost encapsulation. Being well-experienced in object oriented techniques, we knew that would open the door to all sorts of trouble. But we also knew that this change was absolutely necessary to get the required performance.
So we went into damage control mode. We made the doDelete()
methods protected, and moved the upgrade code to the same package as the API implementation code, to still be able to call the doDelete()
s. Still not optimal, but sometimes a man’s got to do what a man’s got to do…