We Monitor Our Infrastructure 130 Million Times A Year

Here at Vokke, we manage hundreds of networking elements and servers to deliver services to our clients. We manage cloud servers in Sydney, as well as edge components in rural Australia, Tasmania, and London. Part of our service is to make sure these elements all work correctly, 24×7, so that our clients can use technology to its fullest potential, unimpeded, whenever they like.

But how do we keep track of all these elements, making sure each one is working correctly? We’ve developed a service called Zipline which allows us to do just that.

Monitoring

It starts with monitoring. Vokke annually purchases a license for an enterprise monitoring tool that allows us to perform checks on various system components 24×7, every day of the year. We currently have hundreds of automated probes that scan and re-scan each of our systems, recording performance and diagnostic data.

With this tool, we perform over 130 million automated system checks each year. That’s around 4 per second, every second, every day.
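For anyone who wants to check the arithmetic, here is a quick sketch in Python. It assumes a 365-day year; the function name is ours for illustration only.

```python
# Number of seconds in a non-leap year: 365 days x 24 h x 60 min x 60 s.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000


def checks_per_second(checks_per_year: int) -> float:
    """Average check rate implied by an annual total."""
    return checks_per_year / SECONDS_PER_YEAR


rate = checks_per_second(130_000_000)
print(f"{rate:.2f} checks per second")  # prints "4.12 checks per second"
```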

We monitor everything from free disk space and free memory on our servers to system timers and even the temperature of our hard drives (so we know if they are getting too hot).

And we’re constantly adding more. Every time we discover a new way a system can fail or degrade, we try to implement a sensor for it.

And when no default sensor has done the job, we’ve built our own and released them to the community for free.
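To give a flavour of what a sensor does, here is a minimal sketch of a free-disk-space check in Python. It is an illustration only, not one of the sensors described above: the function names and the 20% and 10% thresholds are assumptions, and a real sensor would report its result in whatever format the monitoring tool expects.

```python
import shutil


def classify_free_space(total_bytes: int, free_bytes: int,
                        warn_pct: float = 20.0, crit_pct: float = 10.0) -> str:
    """Return 'ok', 'warning', or 'critical' based on the free-space percentage.

    The thresholds are illustrative defaults, not recommended values.
    """
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    free_pct = 100.0 * free_bytes / total_bytes
    if free_pct < crit_pct:
        return "critical"
    if free_pct < warn_pct:
        return "warning"
    return "ok"


def check_disk(path: str = ".") -> str:
    """Measure the disk that holds `path` and classify its free space."""
    usage = shutil.disk_usage(path)  # named tuple with total, used, free (bytes)
    return classify_free_space(usage.total, usage.free)


if __name__ == "__main__":
    print(check_disk("."))
```

Keeping the classification separate from the measurement makes the thresholds easy to test without touching a real disk.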

So what happens if something goes wrong?

This is where Zipline comes into play. If anything goes wrong, an engineer receives an automated phone call indicating there is a problem. A full diagnostic read-out is also sent to their inbox.

The engineer is then able to log in to the system and observe what’s gone wrong. These diagnostics are available to our team anywhere in the world and can even be accessed from a mobile phone.

But what happens if the phone system is down?

We have engaged a third-party VoIP provider to reliably deliver our messages to our engineers. This allows us to leverage their global, carrier-neutral infrastructure so that if any internet service provider (ISP) or telecommunications provider is down, we are still able to send the message.

And what happens if the engineer’s phone is off, or they have lost signal?

If Zipline cannot deliver the message, or an engineer fails to acknowledge a call within a set amount of time, Zipline escalates it to a different engineer. Further, every escalation travels through a different telecommunications provider from the one the first engineer’s phone uses, to work around any local issues.

That is, we use multiple phone numbers and multiple carriers to help ensure delivery.

The aim is to avoid any single point of failure.
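The escalation idea can be sketched in a few lines of Python. This is a simplified illustration under our own assumptions, not Zipline's actual code: the Contact type, the carrier names, and the place_call callback (which is assumed to return True only if the engineer acknowledges in time) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Contact:
    name: str
    carrier: str  # hypothetical: the carrier used to reach this engineer


def escalate(contacts: Sequence[Contact],
             place_call: Callable[[Contact], bool]) -> Optional[Contact]:
    """Try contacts in order and return the first one who acknowledges.

    After a failed call, later contacts on the same carrier are skipped so
    that the next attempt travels over a different route.
    """
    failed_carriers = set()
    for contact in contacts:
        if contact.carrier in failed_carriers:
            continue  # this route has already failed once; try another
        if place_call(contact):
            return contact
        failed_carriers.add(contact.carrier)
    return None  # nobody acknowledged


# Usage: the first engineer does not answer, so the call escalates to the
# next engineer reachable through a different carrier.
roster = [
    Contact("Engineer A", "carrier-1"),
    Contact("Engineer B", "carrier-1"),
    Contact("Engineer C", "carrier-2"),
]
answered = escalate(roster, place_call=lambda c: c.name == "Engineer C")
print(answered.name if answered else "no acknowledgement")  # prints "Engineer C"
```

In this sketch a carrier is ruled out after one failed call. A production system would probably retry and distinguish an unanswered phone from a failed route, but the principle is the same: never depend on one number or one carrier.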

Summary

When you trust Vokke to manage your critical IT services and cyber assets, we take that responsibility very seriously. This year we have some amazing improvements planned – stay tuned to see what we roll out next and how our continual investment in incident response will help improve the resiliency of the IT services we offer.

(By the way, a fun fact. In the time you read this article, we performed around 720 automated system checks!)

Photo by monicore from Pexels
