Good Operations (OPS) do matter!
In a world where customers come first, their opinion of your company is highly valued. That opinion is commonly measured with the Net Promoter Score (NPS). For the IT department it is very hard to influence the NPS directly. What IT can do is design, build and maintain highly resilient IT solutions. But, as we all know, incidents will always occur! The most important way to keep the provided services usable for customers is to detect failures and incidents quickly and to recover from them quickly. This is where the Mean Time To Detect (MTTD) and Mean Time To Recovery (MTTR) KPIs come into play.
So how do you shorten both the MTTD and the MTTR? After all, the shorter they are, the smaller the potential outage for customers will be.
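To make the two KPIs concrete, here is a minimal Python sketch that computes them from a list of incidents. The Incident record, its field names and the sample timestamps are made up for illustration. The sketch assumes both durations are measured from the moment the failure started; some teams measure MTTR from the moment of detection instead.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    """Hypothetical incident record; the field names are an assumption."""
    started: datetime    # when the failure actually began
    detected: datetime   # when monitoring or a person noticed it
    recovered: datetime  # when service was restored for customers


def mttd(incidents):
    """Mean Time To Detect: average of (detected - started)."""
    seconds = mean((i.detected - i.started).total_seconds() for i in incidents)
    return timedelta(seconds=seconds)


def mttr(incidents):
    """Mean Time To Recovery: average of (recovered - started).

    Assumption: measured from the start of the failure, not from detection.
    """
    seconds = mean((i.recovered - i.started).total_seconds() for i in incidents)
    return timedelta(seconds=seconds)


# Made-up sample data: two incidents on the same day.
INCIDENTS = [
    Incident(datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 10), datetime(2024, 1, 1, 10, 40)),
    Incident(datetime(2024, 1, 1, 14, 0), datetime(2024, 1, 1, 14, 20), datetime(2024, 1, 1, 15, 0)),
]

if __name__ == "__main__":
    print("MTTD:", mttd(INCIDENTS))  # MTTD: 0:15:00
    print("MTTR:", mttr(INCIDENTS))  # MTTR: 0:50:00
```

With this definition, every minute shaved off detection also comes off the MTTR, which is why proactive monitoring pays off twice.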
Make sure your teams (at least):
0. are very strict about designing highly resilient IT solutions
1. value their Non-Functional Requirements as much as their Functional Requirements
2. value proactive monitoring, detection and alerting
3. start from the bottom and work their way up the monitoring stack
4. design easy and simple methods of recovery
The following open source tools offer an easy way to start the journey towards these goals:
Graphite – Scalable Realtime Graphing
logstash – open source log management