System Monitoring: A Simple System Monitor with JavaScript and Elasticsearch

Sometimes simple is better

Monitoring an Elasticsearch index, or any index for that matter, is not as complex as it may seem. Some use cases can look deeply complex, as in "only do this if these 10 things happen first." But in the end it all boils down to one simple question with a yes or no answer.

In most cases, the point of monitoring a stream of data is to alert a person, a group of people, or another system when something in that data stream changes. The conditions can be many or few, but in the end it all comes down to one simple question: should I send an alert on this data? That is it, nothing more. Sure, you can then trigger all kinds of other things to happen. You could alert the on-call personnel. If the system load is too high, you could trigger a script to automatically add more capacity. Or, if the value falls below a certain point, you could remove that capacity to save on computing resources. Once the alert is triggered, what you can do is limited only by your imagination and your budget.
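
To make that question concrete, here is a minimal sketch of it as code. It is not part of the script later in this article; the function name `decideAction`, the action names, and the threshold numbers are all made up for illustration.

```javascript
// Hypothetical helper: the name, return values, and thresholds are illustrative only.
// It answers the one question every monitor asks: should something happen?
function decideAction(value, highThreshold, lowThreshold) {
   if (value >= highThreshold) {
      return "add_capacity";     // load too high: alert someone or scale up
   }
   if (value <= lowThreshold) {
      return "remove_capacity";  // load low enough to give capacity back
   }
   return "none";                // nothing changed enough to matter
}

console.log(decideAction(4.2, 4, 0.5));
```

Everything after the returned answer (paging someone, scaling a cluster) is a separate concern from the decision itself.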

This idea flows into all facets of data wrangling, from the smallest micro-mini RFID device to the stock market that this country depends on. I know of companies that collect terabytes of data on their farm equipment. Yes, most of farmland America is fully wired now. And it is all wired towards this one little thing: "Alert me when something changes."

Alerts can be set on many things, for just as many reasons: to be alerted when a system fails or is about to fail, to find product defects, to improve performance, to optimize supply lines, and many, many other uses. In fact, new uses for this data are dreamed up every day. But in the end, it all comes down to a simple alert.

Artificial Intelligence and Machine Learning work the same way, as a system of alerts. If-then, nothing more, nothing less. The question is: how many if-thens do you cycle your data through before you are satisfied with the answer? It is quite that simple. You use if-thens all day, every day, throughout life. (If) you buy things (then) you are expected to pay for them. (If) you work (then) you expect to be paid. (If) you do bad things (then) expect to get in trouble. (If) you have children (then) expect to raise those children. And, in the end, this is what all of this data wrangling, ML, and AI is about: if this happens, then do that.

You will notice that all of those if-thens had expectations tied to them. That is always the second part of the equation: the expected. What you tell a person or system to do is not necessarily going to provide the result you want, though you do expect some kind of response, whether that is another system or a person performing a task and returning the completed result. Any system, anywhere, that is all it is. If the task returns the expected result, everything is good. If not, then more questions have to be asked and answered, the end goal being to finally provide the result you expected in the first place. It is all if-then, from the bit to the Internet.

With all of the buzzwords floating around, growing in number every day, I decided to take on a challenge. I decided that this project would take a step back and say, look, it is a simple alert. It is not that complex. It is the systems that you wrap around the alert that become complex. It is the queries that you use to find the data that become complex. But the alert itself? It is just a series of 3 if-then statements. This is for an alert that recognizes poll cycles, re-alert cycles, and auto-recovery. All in one little 145 line JavaScript program. Actually, the Elasticsearch query, for reference purposes, is not minified. So the actual code logic is only 89 lines.
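
Before the full script, here is a rough sketch of those three if-then statements on their own. This is a simplified illustration, not the script itself: the function name `evaluate` and the state fields `inAlert` and `lastAlertMs` are made up for this example, and the real script tracks a little more per host.

```javascript
// Illustrative sketch only: names and state shape are assumptions, simplified
// from the full script. Returns "alert", "realert", "clear", or null.
function evaluate(state, value, threshold, realertMs, now) {
   // 1. If the value is at or over the threshold and we are not alerting yet, then alert.
   if (value >= threshold && !state.inAlert) {
      state.inAlert = true;
      state.lastAlertMs = now;
      return "alert";
   }
   // 2. If it is still over the threshold and the realert interval has passed, then realert.
   if (value >= threshold && now - state.lastAlertMs >= realertMs) {
      state.lastAlertMs = now;
      return "realert";
   }
   // 3. If it has dropped under the threshold while in alert, then clear (auto-recovery).
   if (value < threshold && state.inAlert) {
      state.inAlert = false;
      return "clear";
   }
   // Otherwise there is nothing to say on this poll cycle.
   return null;
}

// One host polled over time, threshold 1, realert every 30000 ms.
const hostState = { inAlert: false, lastAlertMs: -1 };
console.log(evaluate(hostState, 2, 1, 30000, 0));
console.log(evaluate(hostState, 2, 1, 30000, 2000));
console.log(evaluate(hostState, 2, 1, 30000, 30000));
console.log(evaluate(hostState, 0.5, 1, 30000, 32000));
```

Passing the poll time in as `now` rather than calling `Date.now()` inside the function keeps the decision easy to test by hand.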

So, without further hesitation, here is the simplest JavaScript I could come up with to explain this. It is written in a top-down style and all runs inside of one async function. The point here was simplicity of explanation, not to write a tight production-ready program. It only runs the queries, performs the logic, and prints results to the console. As I mentioned above, what you do with the alert is where the complexities come in. You could write a module to send email, send data over a TCP socket, write log files, or trigger another system to do something. Every single thing you want to do could be written as a function and executed when the program decides that an alert needs to be sent. This is only to explain that part. Additions will come in the future, but for now, this is just a simple flow of: if the condition is met, tell us about it.
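
As a rough idea of what "written as a function" could look like, here is a small sketch of a dispatcher keyed on alert type, in the spirit of the `alert_type` list that appears in the script's monitoring element. It is not part of the script: `alertHandlers`, `sendAlert`, and the shape of the `alert` object are assumptions for illustration, and only a "log" handler is implemented.

```javascript
// Hypothetical dispatcher: names and the alert object shape are illustrative.
// Only "log" exists here; email, socket, etc. would be registered the same way.
const alertHandlers = new Map([
   ["log", (alert) => {
      const line = alert.host + " " + alert.status + " value:" + alert.value;
      console.log(line);
      return line;
   }]
]);

// Run every handler named in alertTypes and collect what each one returns.
// Types with no registered handler are skipped rather than throwing.
function sendAlert(alertTypes, alert) {
   const results = [];
   for (const type of alertTypes) {
      const handler = alertHandlers.get(type);
      if (typeof handler === "function") {
         results.push(handler(alert));
      }
   }
   return results;
}

sendAlert(["log"], { host: "web01", status: "Load Threshold Exceeded", value: 1.7 });
```

With something like this in place, each console.log() in the script becomes a single call, and adding a new kind of notification means registering one more handler.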

The Code

This is the complete, heavily commented script. You could easily run this against any Elasticsearch cluster that has Metricbeat data from the system.yml module, as the system load is part of the standard system data set. All of the console.log() statements are in areas where other functions could be launched to do something with the alert. As mentioned, this was just meant to be a simple "this is basically how it works" exercise. I don't like chunking code or referring to line numbers, so the explanation lives in the comments.


const elastic = require('elasticsearch');
const path = require('path');

require('dotenv').config({
   path: path.join(__dirname, '.env')
})

//Build Elasticsearch Client using dotenv variables
var elasticClient = new elastic.Client({
   host: process.env.ELASTICSEARCH_USER + ":" + process.env.ELASTICSEARCH_PASSWORD + "@" + process.env.ELASTICSEARCH_HOST,
   sniff: true,
   apiVersion: '7.x',
});

//build monitoring element here
//as we move forward we will take these
//monitoring elements to external sources
var monitoringElement = {
   systemLoad: {
      monitor_name: "systemLoad",
      use_index: "metricbeat-*",
      alert_at_value: 1,
      realert_in_ms: 30000,
      auto_ack_on_recovery: false,
      alert_type: ["log"]
   }
}

//object to hold realtime polling information for use calculating poll cycle, alerts, and realerts.
var elementTracker = {};
//Query unminified for reference
//This query looks for the max values for system.load.1, system.load.5 and system.load.15
//Though we only ever alert on system.load.1, the others are available for informational purposes.
//This data is aggregated into buckets based on agent.hostname
//size: 0 means that this query returns no actual documents, only the aggregate buckets
//not to be confused with aggs.Hostname.terms.size: 50 which means look for up to 50 hosts.
//In the query section bool.filter.bool.should.range says to look for documents where system.load.1 is greater than 0
//0 is a holder value. The comparison against monitoringElement.systemLoad.alert_at_value happens in the code below,
//so hosts that have dropped back under the threshold are still returned and can be cleared.
//The @timestamp range filter says look for documents from now - 1 minute up to now.
//Though the poller runs every 2 seconds and the query goes back 1 minute, this is a good coverall to make sure nothing is missed
//The realert logic takes care of not sending alerts until realert_in_ms, not every time the query runs
//Years of experience says this is a good tactic.
var loadQuery = {
   "aggs": {
      "Hostname": {
         "terms": {
            "field": "agent.hostname",
            "order": {
               "Load1": "desc"
            },
            "size": 50
         },
         "aggs": {
            "Load1": {
               "max": {
                  "field": "system.load.1"
               }
            },
            "Load5": {
               "max": {
                  "field": "system.load.5"
               }
            },
            "Load15": {
               "max": {
                  "field": "system.load.15"
               }
            }
         }
      }
   },
   "size": 0,
   "query": {
      "bool": {
         "filter": [{
               "bool": {
                  "should": [{
                     "range": {
                        "system.load.1": {
                           "gt": "0"
                        }
                     }
                  }],
                  "minimum_should_match": 1
               }
            },
            {
               "range": {
                  "@timestamp": {
                     "format": "strict_date_optional_time",
                     "gte": "now-1m",
                     "lte": "now"
                  }
               }
            }
         ]
      }
   }
};

//Start the monitoring timer, run the loop every 2000ms.
var loopTimer = setInterval(loopTheElements, 2000);


async function loopTheElements() {
   try {
      //Stop the timer. We do not want other poll cycles launching if this one stalls
      clearInterval(loopTimer);
      //Loop through monitoring elements
      //We could go straight to the query but this
      //Leaves room to add other monitoring elements in the future.
      for (const [key, value] of Object.entries(monitoringElement)) {
         //Set the date in epoch milliseconds that this poll cycle is happening
         const thisPollTime = Date.now();
         //build the query that the Elasticsearch js module needs using the query above and values from the monitoring Element
         var runThisQuery = {
            index: value.use_index,
            body: loadQuery
         };
         //run the query
         var result = await elasticClient.search(runThisQuery);

         //loop through buckets and check for alert condition
         console.log("--------------------------------------\n");

         for (let i = 0; i < result.aggregations.Hostname.buckets.length; i++) {
            //Shorten bucket name for this process
            var thisBucket = result.aggregations.Hostname.buckets[i];
            console.log("Checking "+thisBucket.key);

            //Add host key and poll start information if host does not exist in the object
            if (!elementTracker[thisBucket.key]) {
               elementTracker[thisBucket.key] = {
                  first_poll: true,
                  needs_alert: false,
                  last_alert_ms: -1,
                  alerts_sent: 1
               };
               //copy monitoring Element data into elementTracker host key
               elementTracker[thisBucket.key][key] = monitoringElement[key];
            }
            //Check to see if the Load1.value for this host is greater than or equal to the alert_at_value that we set from monitoring elements
            if (thisBucket.Load1.value >= elementTracker[thisBucket.key][key].alert_at_value) {
               //If this is the first alert then set the needed keys and print the alert notification
               //If this is not the first alert in this cycle then continue to the else to check if it is time for realert
               if (elementTracker[thisBucket.key].first_poll === true) {
                  console.log("Load Threshold Exceeded " + thisBucket.key + " first notification. Threshold:" + elementTracker[thisBucket.key][key].alert_at_value + " Value:" + thisBucket.Load1.value + " Alert #" + elementTracker[thisBucket.key].alerts_sent);
                  elementTracker[thisBucket.key].first_poll = false;
                  elementTracker[thisBucket.key].needs_alert = true;
                  elementTracker[thisBucket.key].last_alert_ms = thisPollTime;
               } else {
                  //Technically this needs an alert so set the key here
                  //It is used later to detect if auto clear is needed
                  elementTracker[thisBucket.key].needs_alert = true;
                  //calculate the realert time difference
                  var realertTime = thisPollTime - elementTracker[thisBucket.key].last_alert_ms;
                  //This is only calculated for display purposes and is not used anywhere else.
                  //Time remaining is the realert interval minus the time already elapsed, never below 0
                  var timeUntilRealert = Math.max(0, elementTracker[thisBucket.key][key].realert_in_ms - realertTime);
                  console.log("Realert for " + thisBucket.key + " will happen in " + timeUntilRealert + "ms");
                  //if realertTime is greater than the realert_in_ms that we set from the monitoring Element send realert
                  //increment the alerts_sent counter for this host
                  //update last_alert_ms with the current poll time for future calculations
                  if (realertTime >= elementTracker[thisBucket.key][key].realert_in_ms) {
                     ++elementTracker[thisBucket.key].alerts_sent;
                     console.log("Load Threshold Exceeded " + thisBucket.key + " Realert notification. Threshold:" + elementTracker[thisBucket.key][key].alert_at_value + " Value:" + thisBucket.Load1.value + " Alert #" + elementTracker[thisBucket.key].alerts_sent);
                     elementTracker[thisBucket.key].last_alert_ms = thisPollTime;
                  }
               }
            } else {
               //if host was in alert but has now cleared then send the alert cleared message
               //and reset the needs_alert, alerts_sent, and first_poll keys
               //resetting first_poll makes the next threshold breach start a fresh alert cycle
               if (elementTracker[thisBucket.key].needs_alert === true) {
                  ++elementTracker[thisBucket.key].alerts_sent;
                  console.log("Load Threshold Has Returned to Normal " + thisBucket.key + " Alert Clear notification. Threshold:" + elementTracker[thisBucket.key][key].alert_at_value + " Value:" + thisBucket.Load1.value + " Alert #" + elementTracker[thisBucket.key].alerts_sent);
                  elementTracker[thisBucket.key].needs_alert = false;
                  elementTracker[thisBucket.key].alerts_sent = 1;
                  elementTracker[thisBucket.key].first_poll = true;
               }
            }
         }
      }
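   } catch (error) {
      //Print the failure so a bad poll cycle is not silently swallowed
      console.error("Poll cycle failed: " + error.message);
   } finally {
      //This poll cycle is complete, whether it worked or not. Restart the loop timer
      loopTimer = setInterval(loopTheElements, 2000);
   }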
}
  

This is about as simple as it gets. This code will run all day and pop in and out of alert cycles as needed. There are no real error handlers or anything else fancy. Just a raw, brute force query that sends alerts, realerts, and auto clears them when the data returns to normal. What can be done with just this alone is limited only by the data being collected from Metricbeat. Trigger on memory usage, disk usage, process count, website visitors, and so on. It just takes a new query and tweaks to the object names to match your query. Below is a short video that explains the logic of how this works and a demo of the script running while I trigger load events on a couple of systems that are being monitored by Metricbeat.

Video explaining and running the above code
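
To give a rough idea of what "a new query" could look like, here is a sketch of a small helper that builds the same shape of query for any numeric field. It is not part of the script above: `buildMaxQuery` and the `MaxValue` aggregation name are made up for this example, the `exists` filter stands in for the "greater than 0" range filter, and the field name `system.memory.actual.used.pct` is an assumption about what your Metricbeat version writes, so check it against your own index.

```javascript
// Hypothetical helper, not part of the script above. Builds a "max of one
// field per host over a recent time window" query for any numeric field.
function buildMaxQuery(field, windowStart) {
   return {
      size: 0, // aggregation buckets only, no documents
      aggs: {
         Hostname: {
            terms: {
               field: "agent.hostname",
               order: { MaxValue: "desc" },
               size: 50
            },
            aggs: {
               MaxValue: { max: { field: field } }
            }
         }
      },
      query: {
         bool: {
            filter: [
               // Only documents that actually carry the field
               { exists: { field: field } },
               // Only documents inside the look-back window
               { range: { "@timestamp": { gte: windowStart, lte: "now" } } }
            ]
         }
      }
   };
}

// Assumption: this is the memory field name in your Metricbeat data.
const memoryQuery = buildMaxQuery("system.memory.actual.used.pct", "now-1m");
console.log(JSON.stringify(memoryQuery, null, 2));
```

With a query like this, the loop would read `thisBucket.MaxValue.value` instead of `thisBucket.Load1.value`, and the monitoring element would carry a threshold that makes sense for the new field.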