On Observability

Running an application without having a proper monitoring is akin to driving without a dashboard. You don’t really know if you still have enough gas, or if you are within the speed limit, or how far are you till your next oil change. There are many uncertainties involved in running an application. Monitoring is instrumental in getting first hand awareness on possible incident or help predict that an incident is about to happen so we can prevent it.

This post outlines some observables that we can monitor and setup alert for along with some recommended practice.

Continue reading

Software Fragmentation – The Golden Path

There is a direct correlation between teams that give their engineers autonomy to own their technical decisions and the team’s ability to hire and retain A-class or Senior talent. There is a tradeoff, but an acceptable level of chaos in exchange for a stronger sense of individual/team ownership is usually the right one and leads to higher performing teams in the long run – at least this is what I’ve been seeing if a couple of companies in Indonesia.

So, how to make sure these “chaotic” things are manageable and actually give the benefit to the team?

Continue reading

Incident & Post Mortem Process

This article is part of On Managing Stability series.

Recurring incidents are the enemy of scalability. Recurring incidents steal time away from our teams – time that could be used to create new functionality and greater value. Our past performance is the best indicator we have of our future performance and our past performance is best described by the incidents we have experienced and the underlying problems that caused those incidents.

Failing to recognize and resolve our past problems means failing to learn from our past mistakes in either architecture, engineering, process, and operations, and also communication.

Continue reading