Breaking out of Silos

Overview
Monitoring practitioners, whether L1, L2, or L3 operators or engineers, tend to operate in a silo such as network monitoring, infrastructure monitoring, application monitoring, or mobile monitoring.
Contributing Members
  • William S Andreas: Principal UX Designer

Practitioners work in the area they understand (e.g., they understand virtual networks), in the work group their organization feels is appropriate (e.g., the Business Operations Network Monitoring team). Many organizations build this pattern into their security policies, so that a practitioner is only permitted to see a limited number of things (e.g., they may only see information about virtual machines in North America).

Only the most senior practitioners (usually architects or senior engineers) look at ALL the things that make up an organization’s infrastructure. They may look for systemic problems (for example, many alarms across multiple kinds of things are raised at 11:00 AM every Tuesday).

They may be doing capacity planning for rolling out a new application (what capacity will the new mobile shopping cart require across all of the infrastructure?). They may be looking at how much time it takes each silo to respond to an issue.

To increase the value proposition of DXI, we need to look at ways to make intelligent monitoring more available in each silo, and we need to make comprehensive monitoring more valuable to the practitioners working in silos. For example, what would make having both application and infrastructure monitoring available valuable to a network engineer?

 
Breaking out of Silos .png
 

A network engineer often works from a support ticket that identifies something wrong with a device. Alternatively, they work from an alarm console, handling critical alarms that have not been acknowledged or that have been assigned to them.

For example, a support ticket is opened because an alarm showed that an interface was seeing a lot of packet loss or errors. Or a critical alarm is raised because an interface is seeing a lot of packet loss or errors.

 

To diagnose the problem with the interface, they find the interface and then go through all the information available on it. For example, they look at the alarms raised on the interface, the amount of traffic flowing through the interface, the errors inbound and outbound, and the discards inbound and outbound.

In an intelligent system, we should correlate this information into a display that is easy to read and comprehend. Using basic pattern matching, we can determine that an increase in discards or errors coincides with an increase in certain types of flow, and that a configuration change occurred at about the same time. We can organize the display to show charts of alarms, discards, flow, and config changes, all with a uniform horizontal time axis and all scaled to the same size, and then draw a “cross time” indicator marking what we think is the cause of the problem.
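The basic pattern matching described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not how any product implements it: the `Sample` type, the `first_spike` and `correlate` helpers, the three-sigma threshold, the ten-minute window, and the sample data are all hypothetical.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import Optional


@dataclass
class Sample:
    minute: int   # minutes since the start of the chart window (assumed unit)
    value: float  # e.g. discards per minute or flow volume


def first_spike(series: list[Sample], baseline_points: int = 5,
                sigmas: float = 3.0) -> Optional[int]:
    """Return the minute of the first sample well above the early baseline."""
    baseline = [s.value for s in series[:baseline_points]]
    # Assumption: floor the deviation at 1.0 so a flat baseline does not
    # produce a threshold equal to the baseline itself.
    threshold = mean(baseline) + sigmas * max(pstdev(baseline), 1.0)
    for s in series[baseline_points:]:
        if s.value > threshold:
            return s.minute
    return None


def correlate(discards: list[Sample], flow: list[Sample],
              config_changes: list[int], window: int = 10) -> Optional[dict]:
    """Find where discards spike and what else happened near that time."""
    spike = first_spike(discards)
    if spike is None:
        return None
    flow_spike = first_spike(flow)
    return {
        # Where to draw the "cross time" indicator on the shared time axis.
        "cross_time": spike,
        "flow_increase": flow_spike is not None
        and abs(flow_spike - spike) <= window,
        "config_changes": [t for t in config_changes
                           if abs(t - spike) <= window],
    }


# Hypothetical data sampled every five minutes.
minutes = range(0, 50, 5)
discards = [Sample(m, v) for m, v in
            zip(minutes, [2, 3, 2, 3, 2, 2, 3, 40, 45, 50])]
flow = [Sample(m, v) for m, v in
        zip(minutes, [100, 110, 105, 95, 100, 105, 400, 420, 430, 410])]
result = correlate(discards, flow, config_changes=[32, 200])
```

Because every chart shares the same time axis, the single `cross_time` value is enough to position the indicator on all of them. A real system would likely need something sturdier than a fixed baseline and threshold.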

Breaking out of Silos 3.png
 

But this is only being intelligent about things that happen within the NetOps silo. Network issues are often caused by something outside of the network proper. For example, someone misconfigured the schedule of a data- and network-intensive application (it’s running at 11:00 AM instead of 11:00 PM). Someone scheduled a flash sale of widgets for 11:00 AM that is only available on the mobile app. Solar flares are disrupting communication between the Midwest and the West Coast.

 
Breaking out of Silos 2.png

To draw a network engineer out of their silo, we need to explore ways to identify information from other monitoring tools that is of interest to the network engineer and offer it to them while they are working in their silo. A simple way of doing this is to list a selection of things to explore in DOI (or any other product).

For example, assume that the probable cause can’t be tracked to a network element. Questions of interest in tracking down the issue might be: What are all the applications that include this interface in their network path? What are all the virtual machines that include this interface in their network path? What are all the alarms raised across the infrastructure at this point in time? What does the infrastructure-level topology look like when expanded out two levels?
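As a rough illustration of how such a list of things to explore could be assembled, the sketch below answers the four questions above from in-memory data. Everything in it is hypothetical: the `APP_PATHS`, `VM_PATHS`, `ALARMS`, and `TOPOLOGY` tables, the identifiers in them, and the 15-minute alarm window are assumptions made for the example, and a real product would query its own data stores instead.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Alarm:
    source: str
    minute: int  # minutes since midnight (assumed unit)


# Hypothetical data: the interfaces on each application's and VM's path.
APP_PATHS = {"mobile-cart": ["if-1", "if-7"],
             "batch-report": ["if-7", "if-9"],
             "intranet": ["if-2"]}
VM_PATHS = {"vm-web-1": ["if-7"], "vm-db-1": ["if-3"]}
ALARMS = [Alarm("mobile-cart", 660), Alarm("host-2", 665),
          Alarm("vm-db-1", 900)]
# Hypothetical adjacency list for the infrastructure topology.
TOPOLOGY = {"if-7": ["switch-a"],
            "switch-a": ["if-7", "host-1", "host-2"],
            "host-1": ["switch-a", "vm-web-1"],
            "host-2": ["switch-a"],
            "vm-web-1": ["host-1"]}


def uses_interface(paths: dict[str, list[str]], interface_id: str) -> list[str]:
    """Names of everything whose network path includes the interface."""
    return sorted(name for name, path in paths.items() if interface_id in path)


def neighborhood(start: str, depth: int = 2) -> list[str]:
    """Breadth-first expansion of the topology out to `depth` levels."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if seen[node] == depth:
            continue  # do not expand beyond the requested depth
        for neighbor in TOPOLOGY.get(node, []):
            if neighbor not in seen:
                seen[neighbor] = seen[node] + 1
                queue.append(neighbor)
    return sorted(n for n in seen if n != start)


def things_to_explore(interface_id: str, minute: int, window: int = 15) -> dict:
    """Collect the links to offer alongside the interface view."""
    return {
        "applications": uses_interface(APP_PATHS, interface_id),
        "virtual_machines": uses_interface(VM_PATHS, interface_id),
        "alarms": [a for a in ALARMS if abs(a.minute - minute) <= window],
        "topology": neighborhood(interface_id, depth=2),
    }


links = things_to_explore("if-7", minute=662)
```

Each entry in `links` corresponds to one link the network engineer could follow without leaving the interface they are already looking at.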

 

We can encourage someone to move from the “bottom” of the monitoring stack (NetOps and Infrastructure Monitoring) “up” the stack by simply providing links to relevant information “up” the stack.