Wednesday, August 29, 2018

Automatic graph monitoring and anomaly detection

In today's world, monitoring a product's performance is of utmost importance, and a product is often judged on it. For instance, Facebook went down for 5 minutes on August 3 and it became big news, even though it is hardly an essential commodity.

So today almost all products produce Key Performance Indicators (KPIs), which are almost always time series data. Most products already have an alarming mechanism integrated into them, but this approach has two profound issues. The first is identifying hard upper and lower limits, which becomes very challenging for parameters that are very dynamic, like the number of connections to a server. How many connections is a good upper limit: 100, 1,000 or 100,000? It can vary hugely depending on special occasions; an e-commerce site during a sale can get huge sudden traffic, for which hard values might not be the best strategy. The second issue is how many alarms are good enough, because for each new alarm you need to write code for sending and clearing the alarm, and even if only one KPI needs to be added, a lot of coding needs to be done again.


So here a generic approach that detects anomalies in patterns and relates them to other time series data can be critical. Every product makes graphs today, but the analysis of those graphs is mostly done after an outage to get insights, or there are dedicated resources to do it. I think there has to be a better and more aggressive approach.

So the anomaly detection software that we are proposing will detect sudden spikes in the trend of time series data, and further it will correlate them with other time series data to look for any pattern similarity or correlation, doing so aggressively, before the issue causes much performance impact. Suppose the response time of a railway site suddenly spikes: it will not only detect the spike, but also correlate it with other time series data like the number of visitors on the site, the CPU utilization of the server or the DB response time, and can help us get better insights, just as we would by looking at a graph but closer to real time. So if it finds a sudden spike in the site's response time and, at the same time, a sudden increase in DB response time, we can raise an alarm with better insights.
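
To make the idea a bit more concrete, here is a minimal sketch in Python. It is only an illustration under my own assumptions, not the final design: I am assuming a rolling z-score (how far a point is from the mean of the previous few points, measured in standard deviations) as the spike detector, and a plain Pearson correlation over the same recent window as the "does this other KPI move together with it" check. The function names (detect_spikes, pearson, explain_spikes), the parameters (window, threshold, min_corr) and the sample numbers are all made up for the example.

```python
from statistics import mean, stdev


def detect_spikes(series, window=5, threshold=3.0, min_sigma=1e-9):
    """Return indices whose value is more than `threshold` standard
    deviations away from the mean of the preceding `window` points.
    No hard upper/lower limit is needed: the baseline follows the data."""
    spikes = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mu = mean(history)
        # Floor sigma so a perfectly flat history does not divide by zero.
        sigma = max(stdev(history), min_sigma)
        if abs(series[i] - mu) / sigma > threshold:
            spikes.append(i)
    return spikes


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # a constant series carries no correlation information
    return cov / (vx * vy) ** 0.5


def explain_spikes(primary, others, window=5, threshold=3.0, min_corr=0.8):
    """For each spike in `primary`, list the other KPIs whose recent
    window (including the spike point) correlates strongly with it."""
    report = {}
    for i in detect_spikes(primary, window, threshold):
        lo = i - window
        segment = primary[lo:i + 1]
        report[i] = [
            name
            for name, series in others.items()
            if abs(pearson(segment, series[lo:i + 1])) >= min_corr
        ]
    return report


# Made-up sample data: one reading per minute for three KPIs.
response_ms = [100, 102, 99, 101, 100, 103, 98, 400, 101, 100]
other_kpis = {
    "db_response_ms": [20, 21, 19, 20, 22, 20, 21, 95, 20, 21],
    "cpu_pct": [40, 41, 39, 40, 42, 41, 40, 41, 39, 40],
}

print(explain_spikes(response_ms, other_kpis))
# prints {7: ['db_response_ms']}
```

With this sample data the response time spike at index 7 is flagged and tied to the DB response time, while CPU utilization is left out because it did not move with it. Adding a new KPI here is just one more entry in the dictionary, which is the kind of saving in alarm coding I am hoping for. A real system would of course need to handle seasonality, missing samples and lag between series, which this sketch ignores.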






1 comment:

  1. Products are becoming complex and big. Such things will help in better troubleshooting of performance-related issues through real-time monitoring of the KPIs. Good, keep exploring more to make it a full-fledged design.
