On Observability

Running an application without proper monitoring is akin to driving without a dashboard. You don’t really know whether you still have enough gas, whether you are within the speed limit, or how far you are from your next oil change. There are many uncertainties involved in running an application. Monitoring is instrumental in gaining first-hand awareness of a possible incident, and it helps predict that an incident is about to happen so that we can prevent it.

This post outlines some observables that we can monitor and set up alerts for, along with some recommended practices.


Incident & Post Mortem Process

This article is part of the On Managing Stability series.

Recurring incidents are the enemy of scalability. Recurring incidents steal time away from our teams – time that could be used to create new functionality and greater value. Our past performance is the best indicator we have of our future performance, and our past performance is best described by the incidents we have experienced and the underlying problems that caused those incidents.

Failing to recognize and resolve our past problems means failing to learn from our past mistakes in architecture, engineering, process, operations, and communication.
