Every event is either a cause for celebration or an opportunity to learn.
I don’t remember where I came across this quote, but it has stuck with me. I like how it turns every experience into something positive.
Sometimes I need to remind myself of it, though, especially when there are a lot of, well, learning opportunities in a row.
One recent case was when I started a long-running performance test overnight. The next morning, when I came back to it, there was no useful information at all. None whatsoever.
What had happened?
Our system is a fast data solution built on Spring Cloud Data Flow (SCDF). SCDF allows you to compose stream processing solutions out of data microservices built with Spring Boot.
The performance test spun up a local cluster, ingested a lot of data, and spun the cluster down, all the while capturing performance metrics.
(This is early-stage performance testing, so it doesn’t necessarily need to run on a production-like remote cluster.)
Part of the shutdown procedure was to destroy the SCDF stream. The stream destroy command in the SCDF shell is supposed to terminate the apps that make up the stream. It did in our functional tests.
But somehow it hadn’t this time. After the performance test ran, the supporting services were terminated, but the stream apps kept running. And that was the problem: the apps kept trying to connect to the supporting services, failed, and wrote those failures to the log files. The logs had rolled over and the old files had been removed to save disk space.
All that was left were log files filled with nothing but connection failures. All the useful information was gone. While I was grateful to still have disk space left, it was definitely not a cause for celebration.
So then what could we learn from this event?
Obviously we need to fix the stream shutdown procedure.
Come to think of it, we had already learned that lesson. The code to shut down our Kubernetes cluster doesn’t use stream destroy, but simply deletes all the replication controllers and pods that SCDF creates.
We did it that way because the alternative proved unreliable. And yet we had failed to update the equivalent code for the local cluster. In other words, we had previously missed an opportunity to learn!
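For the Kubernetes cluster, the cleanup amounts to a couple of kubectl commands. This is a hypothetical sketch: the label selector is an assumption, not something from our actual scripts, so check kubectl get pods --show-labels to see which labels your deployer really sets.

```shell
# Forcibly remove everything SCDF deployed, instead of relying on
# `stream destroy`. The `role=spring-app` label is an assumption --
# substitute whatever labels your deployer applies to stream apps.
kubectl delete rc -l role=spring-app
kubectl delete pods -l role=spring-app
```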
Determined not to make that mistake again, we tried to look beyond fixing the local cluster shutdown code.
One option is to not delete old logs, so we wouldn’t have lost the useful information. However, that almost certainly would have led to a full disk and a world of hurt. So maybe, just maybe, we shouldn’t go there.
Another idea is to not log the connection failures that filled up the log files. Silently ignoring problems isn’t exactly a brilliant strategy either, however. If we don’t log problems, we have nothing to monitor and alert on.
A better idea is to reduce the number of connection attempts in the face of repeated failures. Actually, resiliency features like circuit breakers were already in the backlog, since the need for them had been firmly drilled into us by the likes of Nygard.
We just hadn’t worked on that story yet, because we didn’t have much experience in this area and needed to do some homework.
So why not spend a little bit of time to do that research now? It’s not like we could work on analyzing the performance test results.
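At its core, a circuit breaker is a small state machine: after a few consecutive failures it “opens” and rejects calls immediately (no connection attempt, no log spam), and after a cool-down it lets a trial call through to see if the service has recovered. Here is a minimal, hypothetical sketch of that idea; the class, method names, and thresholds are ours, not from any library:

```java
// Minimal sketch of the circuit breaker state machine. Time is passed in
// explicitly (milliseconds) to keep the example deterministic and testable.
public class MiniCircuitBreaker {
    public enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;
    private final long coolDownMillis;
    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private long openedAt = 0;

    public MiniCircuitBreaker(int failureThreshold, long coolDownMillis) {
        this.failureThreshold = failureThreshold;
        this.coolDownMillis = coolDownMillis;
    }

    // Should we attempt the remote call at all?
    public boolean allowCall(long now) {
        if (state == State.OPEN && now - openedAt >= coolDownMillis) {
            state = State.HALF_OPEN; // cool-down elapsed: allow one trial call
        }
        return state != State.OPEN;
    }

    public void onSuccess() {
        consecutiveFailures = 0;
        state = State.CLOSED;
    }

    public void onFailure(long now) {
        consecutiveFailures++;
        if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
            state = State.OPEN; // trip the breaker: fail fast from now on
            openedAt = now;
        }
    }

    public static void main(String[] args) {
        MiniCircuitBreaker breaker = new MiniCircuitBreaker(3, 1000);
        for (int i = 0; i < 3; i++) {
            breaker.allowCall(0);
            breaker.onFailure(0); // three consecutive failures trip the breaker
        }
        System.out.println(breaker.allowCall(500));  // false: still open, fail fast
        System.out.println(breaker.allowCall(1500)); // true: cool-down elapsed
    }
}
```

Hand-rolling this in production code is not necessary, though; a library takes care of the edge cases (half-open trials, thread safety, listeners) for you.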
It turns out that this kind of thing is very easy to accomplish with the Failsafe library:
private final CircuitBreaker circuitBreaker = new CircuitBreaker()
    .withFailureThreshold(3, 10)
    .withSuccessThreshold(3)
    .withDelay(1, TimeUnit.SECONDS);

private final SyncFailsafe<Object> safeService = Failsafe
    .with(circuitBreaker)
    .withFallback(() -> DEFAULT_VALUE);

@PostConstruct
public void init() {
    circuitBreaker.onOpen(() -> LOG.warn("Circuit breaker opened"));
    circuitBreaker.onClose(() -> LOG.warn("Circuit breaker closed"));
}

private Object getValue() {
    return safeService.get(() -> remoteService.getValue());
}
I always feel better after learning something new. Taking every opportunity to learn keeps my job interesting and makes it easier to deal with the inevitable problems that come my way.
Instead of being overwhelmed with negativity, the positive experience of improving my skills keeps me motivated to keep going.
What else could we have learned from this incident? What have you learned recently? Please leave a comment below.