Celebrate Learning in Software Development

Every event is either a cause for celebration or an opportunity to learn.

I don’t remember where I came across this quote, but it has stuck with me. I like how it turns every experience into something positive.

Sometimes I need to remind myself of it, however, especially when there are a lot of, well, learning opportunities in a row.

One recent case was when I had started a long-running performance test overnight. When I came back to it the next morning, there was no useful information at all. None whatsoever.

What had happened?

Our system is a fast data solution built on Spring Cloud Data Flow (SCDF). SCDF allows you to compose stream processing solutions out of data microservices built with Spring Boot.

The performance test spun up a local cluster, ingested a lot of data, and spun the cluster down, all the while capturing performance metrics.

(This is early-stage performance testing, so it doesn’t necessarily need to run on a production-like remote cluster.)

Part of the shutdown procedure was to destroy the SCDF stream. The stream destroy command in the SCDF shell is supposed to terminate the apps that make up the stream. It did in our functional tests.

But somehow it hadn’t this time. After the performance test ran, the supporting services were terminated, but the stream apps kept running. And that was the problem. These apps kept trying to connect to the supporting services, failed, and wrote those failures to the log files. The log files had overflowed and the old ones had been removed in an effort to save disk space.

All that was left were log files filled with nothing but connection failures. All the useful information was gone. While I was grateful that I still had space left on my disk, it was definitely not a cause for celebration.

So then what could we learn from this event?

Obviously we need to fix the stream shutdown procedure.

Come to think of it, we had already learned that lesson. The code to shut down our Kubernetes cluster doesn’t use stream destroy, but simply deletes all the replication controllers and pods that SCDF creates.

We did it that way because the alternative proved unreliable. And yet we had failed to update the equivalent code for a local cluster. In other words, we had previously missed an opportunity to learn!

Determined not to make that mistake again, we tried to look beyond fixing the local cluster shutdown code.

One option is to not delete old logs, so we wouldn’t have lost the useful information. However, that almost certainly would have led to a full disk and a world of hurt. So maybe, just maybe, we shouldn’t go there.

Another idea is to not log the connection failures that filled up the log files. Silently ignoring problems isn’t exactly a brilliant strategy either, however. If we don’t log problems, we have nothing to monitor and alert on.

A better idea is to reduce the number of connection attempts in the face of repeated failures. Actually, resiliency features like circuit breakers were already in the backlog, since the need for them was firmly drilled into us by the likes of Nygard.

We just hadn’t worked on that story yet, because we didn’t have much experience in this area and needed to do some homework.

So why not spend a little bit of time to do that research now? It’s not like we could work on analyzing the performance test results.

It turns out that this kind of stuff is very easy to accomplish with the Failsafe library:

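import java.util.concurrent.TimeUnit;
import net.jodah.failsafe.CircuitBreaker;
import net.jodah.failsafe.Failsafe;
import net.jodah.failsafe.SyncFailsafe;

// Open after 3 failures in the last 10 executions; allow a trial call after 1 second.
private final CircuitBreaker circuitBreaker = new CircuitBreaker()
    .withFailureThreshold(3, 10)
    .withDelay(1, TimeUnit.SECONDS);
// Route calls through the breaker and return a default when a call fails.
private final SyncFailsafe<Object> safeService = Failsafe
    .with(circuitBreaker)
    .withFallback(() -> DEFAULT_VALUE);

public void init() {
  // Log state changes so there is still something to monitor and alert on.
  circuitBreaker.onOpen(() -> LOG.warn("Circuit breaker opened"));
  circuitBreaker.onClose(() -> LOG.warn("Circuit breaker closed"));
}

private Object getValue() {
  return safeService.get(() -> remoteService.getValue());
}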
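
To get a feel for what the library does for us, here is a minimal sketch of the circuit breaker idea in plain Java. It is not how Failsafe implements it: the class name SimpleCircuitBreaker and its constructor parameters are made up for illustration, it counts consecutive failures rather than failures within a window of recent executions, and it is not thread-safe.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Illustrative circuit breaker (hypothetical class, not part of Failsafe). */
final class SimpleCircuitBreaker {
  private enum State { CLOSED, OPEN, HALF_OPEN }

  private final int failureThreshold;  // consecutive failures before opening
  private final long openDelayMillis;  // how long to stay open before a trial call
  private final LongSupplier clock;    // injectable time source, e.g. System::currentTimeMillis

  private State state = State.CLOSED;
  private int consecutiveFailures;
  private long openedAt;

  SimpleCircuitBreaker(int failureThreshold, long openDelayMillis, LongSupplier clock) {
    this.failureThreshold = failureThreshold;
    this.openDelayMillis = openDelayMillis;
    this.clock = clock;
  }

  /** Runs the action unless the breaker is open; returns the fallback on failure. */
  <T> T call(Supplier<T> action, T fallback) {
    if (state == State.OPEN) {
      if (clock.getAsLong() - openedAt < openDelayMillis) {
        return fallback;  // fail fast: no connection attempt, so nothing to log
      }
      state = State.HALF_OPEN;  // delay elapsed: let one trial call through
    }
    try {
      T result = action.get();
      consecutiveFailures = 0;
      state = State.CLOSED;  // a success closes the breaker again
      return result;
    } catch (RuntimeException e) {
      consecutiveFailures++;
      if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
        state = State.OPEN;  // too many failures, or the trial call failed
        openedAt = clock.getAsLong();
      }
      return fallback;
    }
  }

  boolean isOpen() {
    return state == State.OPEN;
  }
}
```

With a threshold of three, ten calls in quick succession against a dead service result in only three real connection attempts. The other seven return the fallback right away, which is the kind of reduction in log noise we were after.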

I always feel better after learning something new. Taking every opportunity to learn keeps my job interesting and makes it easier to deal with the inevitable problems that come my way.

Instead of being overwhelmed with negativity, the positive experience of improving my skills keeps me motivated to keep going.

What else could we have learned from this incident? What have you learned recently? Please leave a comment below.
