Monitoring Network Health

Craig Risi
Sep 9, 2022

Software developers tend to focus heavily on application monitoring and the health of their applications. However, applications can only run well if the underlying network infrastructure they depend on is itself healthy.

So it's important that software developers learn to work with their respective DevOps and NetOps engineers to understand the health of their networks. Doing so helps them respond to critical outages and better understand how their applications' data flows between different servers and machines, and how that flow impacts the underlying infrastructure. Done correctly, it also offers plenty of opportunities for software optimization.

Even with the growth of cloud computing, where many companies now hand their hosting and network infrastructure over to cloud providers, the network is not something to take for granted. The software still needs to run efficiently and correctly on those cloud servers, so it remains important to keep tabs on the underlying infrastructure.

So, in this post, I want to discuss some of the factors to consider when it comes to monitoring network health and why they are so essential to monitor.

How Does Network Monitoring Work?

Let’s start by first discussing how network monitoring works before we look at the important things to monitor and how we can monitor them. Networks - both physical and software dependent - enable the transfer of data between systems, including physical computers and applications.

The Open Systems Interconnection (OSI) Model breaks down several functions that computer systems rely on to send and receive data. For data to travel across a network, it passes through every layer of the OSI model, from the physical layer at the bottom to the application layer at the top, with different protocols in use at each layer. Network monitoring gives teams visibility into the various components that make up a network and ensures that engineers can troubleshoot network issues at whichever layer they occur.
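
As a quick reference, the seven OSI layers can be written down as a small enumeration. This is only an illustrative sketch; the names `OsiLayer` and `layers_bottom_up` are my own and not part of any monitoring tool.

```python
from enum import IntEnum


class OsiLayer(IntEnum):
    """The seven layers of the OSI model, numbered from the bottom up."""
    PHYSICAL = 1
    DATA_LINK = 2
    NETWORK = 3
    TRANSPORT = 4
    SESSION = 5
    PRESENTATION = 6
    APPLICATION = 7


def layers_bottom_up():
    """Return the layer names in order, physical layer first."""
    return [layer.name for layer in sorted(OsiLayer)]
```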

To monitor these different layers effectively, though, we need to treat the hardware involved in the flow of data separately from the software layers, which reflect the far more complex data needs of the software itself.

Monitoring Network Hardware

Whether you are running a small corporate network between a few servers or managing large data centers, you need to ensure that the physical hardware through which network traffic operates and travels is healthy and operational.

This typically comprises the physical, data link, and network layers in the OSI model (layers 1, 2, and 3). In this device-centric approach to monitoring, companies need to monitor the components that transmit data, such as cabling, routers, switches, and firewalls. A network device may have multiple interfaces that connect it with other devices, and network failures may occur at any interface. Being able to determine quickly where failures arise and what is causing them is vital in helping teams resolve issues and restore network health as soon as possible, and potentially avoid them in the future by detecting issues before they occur.

How to Monitor Network Hardware

Most network devices come equipped with support for the Simple Network Management Protocol (SNMP) standard. Using SNMP, it is possible to monitor inbound and outbound network traffic and other network telemetry that is critical for ensuring the health and performance of on-premises equipment.

Another important protocol to consider is the Internet Protocol (IP). Many people are familiar with IP addresses - a standard that is used on almost all networking systems and allows devices to be given unique addresses that help determine where data should be routed.
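
To make the idea of addressing concrete, here is a minimal sketch using Python's standard `ipaddress` module to check whether a device address belongs to a monitored subnet. The subnet `10.0.0.0/24` and the helper name `is_managed` are assumptions chosen purely for illustration.

```python
import ipaddress

# Hypothetical management subnet, chosen only for this example.
MANAGEMENT_NET = ipaddress.ip_network("10.0.0.0/24")


def is_managed(address: str) -> bool:
    """Return True if the address falls inside the monitored subnet."""
    return ipaddress.ip_address(address) in MANAGEMENT_NET
```

With these assumptions, `is_managed("10.0.0.17")` is true, while an address such as `192.168.1.5` falls outside the subnet.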

NetOps and Infrastructure engineers will typically use network monitoring tools to collect the following types of metrics from network devices:

Uptime

This represents the amount of time that a network device successfully sends and receives data. This can be done through your monitoring tools regularly polling each unique device on the network to verify that it is operating healthily. If any issues occur either through no response received or a delay in received responses, an alert can be raised to notify the team.
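
A rough sketch of how polling results might be turned into an uptime figure and an alert decision is shown below. It assumes each poll is recorded as a simple True (timely response) or False (no response or late response); the function names and the default of three consecutive failures are hypothetical.

```python
from typing import Sequence


def uptime_percent(poll_results: Sequence[bool]) -> float:
    """Share of polls that received a timely response, as a percentage."""
    if not poll_results:
        raise ValueError("no polls recorded")
    return 100.0 * sum(poll_results) / len(poll_results)


def should_alert(poll_results: Sequence[bool], max_consecutive_failures: int = 3) -> bool:
    """Alert once the most recent polls have all failed."""
    recent = list(poll_results)[-max_consecutive_failures:]
    return len(recent) == max_consecutive_failures and not any(recent)
```

Requiring several failures in a row is one way to avoid alerting on a single dropped poll; the right number depends on the polling interval and how quickly the team needs to react.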

CPU utilization

It’s not just the physical cables connecting devices that matter; the computing power available across the different routing devices and connected machines also plays a role in the health of the overall flow of data. By monitoring CPU utilization, teams get a better idea of how much of its computational capacity a network device is using to process input, store data, and create output, and can either route traffic elsewhere or add capacity to resolve the problem. It’s also important for teams to set thresholds for both high and critical usage levels so that devices approaching strain can be identified and the problem rectified before any traffic failures occur.
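
The two-threshold idea can be sketched as follows. The 75% and 90% values are assumptions for illustration only; appropriate thresholds depend on the device and workload.

```python
# Assumed thresholds, in percent; tune these for your own devices.
HIGH_THRESHOLD = 75.0
CRITICAL_THRESHOLD = 90.0


def classify_cpu(percent: float) -> str:
    """Classify a CPU utilization reading as normal, high, or critical."""
    if not 0.0 <= percent <= 100.0:
        raise ValueError("utilization must be between 0 and 100")
    if percent >= CRITICAL_THRESHOLD:
        return "critical"
    if percent >= HIGH_THRESHOLD:
        return "high"
    return "normal"
```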

Bandwidth usage

This is the amount of data, in bytes, that is currently being sent or received by a specific network interface. By monitoring the data coming in and out of each network interface, engineers can see both the volume of traffic being sent and the percentage of total bandwidth that is being utilized. As with CPU utilization, this allows teams to identify when traffic is approaching levels that are likely to cause problems and either reroute traffic elsewhere or adjust routing settings to keep the traffic flow at each interface manageable.

By measuring this, teams can also have a better understanding of how data is traveling through their network and map out more efficient ways of moving traffic across the entire network.
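
As a worked example of the percentage calculation, the sketch below converts a byte count over an interval into a share of the link's capacity. The function name and parameters are hypothetical, and it assumes the link speed is known in bits per second.

```python
def bandwidth_utilization(bytes_transferred: int,
                          interval_seconds: float,
                          link_speed_bps: float) -> float:
    """Percentage of link capacity used during the interval."""
    if interval_seconds <= 0 or link_speed_bps <= 0:
        raise ValueError("interval and link speed must be positive")
    bits_per_second = bytes_transferred * 8 / interval_seconds
    return 100.0 * bits_per_second / link_speed_bps
```

For instance, 750,000,000 bytes transferred in 60 seconds on a 1 Gbps link works out to 100 Mbps, or 10% utilization.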

Throughput

This figure represents the rate of traffic, in bytes per second, passing through an interface on a device during a specific time period. It can be measured at the same time as bandwidth usage but needs to be tracked as a separate metric. This is important as it helps teams understand the speed of data processing at each interface, allowing engineers to identify bottlenecks in the network. Additionally, the data can be used to understand load at different points of a network at different times of the day, so that teams can plan for high loads in advance by changing routing patterns where necessary.
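
Throughput is usually derived from two readings of a cumulative byte counter. The sketch below assumes a 32-bit counter that wraps back to zero when it overflows, as is common for SNMP interface octet counters; the function name is hypothetical, and it only handles a single wrap between readings.

```python
COUNTER32_MODULUS = 2 ** 32  # assumed counter width


def throughput_bytes_per_second(previous: int,
                                current: int,
                                interval_seconds: float,
                                modulus: int = COUNTER32_MODULUS) -> float:
    """Rate of traffic between two counter readings, tolerating one wrap."""
    if interval_seconds <= 0:
        raise ValueError("interval must be positive")
    # The modulo keeps the delta non-negative if the counter wrapped.
    delta = (current - previous) % modulus
    return delta / interval_seconds
```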

Interface errors/discards

These are errors on the receiving device that cause a network interface to drop a data packet. Interface errors and discards can stem from configuration errors, bandwidth issues, or a variety of other reasons. By monitoring the flow of data in and out of each network interface, these variances can be detected and the team alerted. In conjunction with many of the other metrics above, this helps the team identify the root cause and rectify the issue appropriately.
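
One simple way to turn these counters into something alertable is to compare two snapshots and express the dropped packets as a percentage of packets seen. This is a sketch under that assumption; `InterfaceSample` and its fields are hypothetical and not tied to any particular tool.

```python
from dataclasses import dataclass


@dataclass
class InterfaceSample:
    """Cumulative counters read from one interface at one point in time."""
    packets: int
    errors: int
    discards: int


def drop_rate_percent(before: InterfaceSample, after: InterfaceSample) -> float:
    """Errors plus discards between two samples, as a share of packets seen."""
    packets = after.packets - before.packets
    dropped = (after.errors - before.errors) + (after.discards - before.discards)
    if packets <= 0:
        return 0.0
    return 100.0 * dropped / packets
```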

IP metrics

IP metrics, such as time delay and hop count, can measure the speed and efficiency of connections between devices. To measure this, monitoring tools keep track of the flow of data packets from start to finish, counting the number of hops made between interfaces on the network and the time spent at each one. This gives teams vital information on how traffic is flowing across different nodes in the network and lets them make adjustments to find more efficient routes through different parts of the network.
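
Given a list of per-hop delays for a single path (for example, collected by a traceroute-style tool), the hop count and total delay can be summarized as in the sketch below. The function name and the input format are assumptions for illustration.

```python
from typing import Sequence


def summarize_path(hop_delays_ms: Sequence[float]) -> dict:
    """Summarize one path: hop count, total delay, and the slowest hop."""
    if not hop_delays_ms:
        raise ValueError("path has no hops")
    slowest_index = max(range(len(hop_delays_ms)), key=lambda i: hop_delays_ms[i])
    return {
        "hop_count": len(hop_delays_ms),
        "total_delay_ms": sum(hop_delays_ms),
        "slowest_hop": slowest_index + 1,  # hops are numbered from 1
    }
```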

Thankfully with the move to the cloud, many companies have been able to move most of these networking responsibilities over to their respective cloud providers. But that doesn't mean that they no longer need to be concerned about network traffic and monitoring, as the software layers are still essential to the health of a network and the individual clients will each need to take responsibility for monitoring.

Monitoring the Software Layers

These software layers represent the transport and application layers of the OSI model (layer 4 and layer 7). Monitoring these layers helps teams track the health of services, applications, and underlying network dependencies as they communicate over a network.

How to Monitor Application Network Traffic

Network monitoring applications may rely on a variety of methods to monitor these communication protocols, including newer technologies such as the extended Berkeley Packet Filter (eBPF). With minimal overhead, eBPF tracks packets of network data as they flow between dependencies in your environment, and tooling built on it translates the data into a human-readable format, allowing each software node to be adequately tracked.

The following network protocols are important to monitor because they are the foundation for most network communication:

Application Layer (Layer 7)

Hypertext Transfer Protocol (HTTP/HTTPS)

This protocol is used by clients (typically web browsers) to communicate with web servers. HTTPS is a more secure, encrypted version of HTTP. Primary HTTP metrics that need to be monitored include request volume, errors, and latency. These metrics can be tracked by looking at the data transmitted to and from each HTTP application on the network. By capturing these metrics, teams can understand how each application behaves under various traffic flows and, where needed, scale out new servers or change settings to help reduce load, latency, and errors across the network.
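
As a minimal sketch of how those three metrics might be computed from a batch of request records, assume each record is a (status code, latency in milliseconds) pair. Counting only 5xx responses as errors is an assumption here; some teams also count 4xx responses.

```python
import statistics
from typing import Sequence, Tuple


def http_summary(requests: Sequence[Tuple[int, float]]) -> dict:
    """Request volume, server error rate, and median latency for a batch."""
    if not requests:
        raise ValueError("no requests recorded")
    # Assumption: only 5xx status codes count as errors.
    errors = sum(1 for status, _ in requests if status >= 500)
    latencies = [latency for _, latency in requests]
    return {
        "request_count": len(requests),
        "error_rate_percent": 100.0 * errors / len(requests),
        "median_latency_ms": statistics.median(latencies),
    }
```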

Domain Name System (DNS)

This protocol translates computer names (such as “server1.example.com”) to IP addresses through various name servers. These IP addresses then function in a similar way to those described for the hardware layers above, except that here they identify the applications that are sending data to each other. DNS metrics include request volume, errors, response time, and timeouts.
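
Response time and errors for a single lookup can be captured with a small wrapper like the one below. This is a sketch rather than a full DNS monitor: it uses the standard library's `socket.gethostbyname` by default, and the `resolver` parameter is a hypothetical hook that lets a different lookup function be swapped in.

```python
import socket
import time


def timed_lookup(hostname: str, resolver=socket.gethostbyname) -> dict:
    """Resolve a hostname and record how long it took and whether it failed."""
    start = time.perf_counter()
    try:
        address = resolver(hostname)
        error = None
    except OSError as exc:  # socket.gaierror and timeouts are OSError subclasses
        address = None
        error = str(exc)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return {
        "hostname": hostname,
        "address": address,
        "error": error,
        "response_time_ms": elapsed_ms,
    }
```

Aggregating these records over time gives the request volume, error count, and response-time figures mentioned above.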

Transport Layer (Layer 4)

Internet Protocol (IP) - Transmission Control Protocol (TCP)

TCP runs on top of IP: it sequences packets in the correct order and delivers them to the destination IP address. TCP metrics to monitor may include packets delivered, transmission rate, latency, retransmits, and jitter. Jitter represents the variation in latency on a packet flow between two systems, which arises when some packets take longer than others to travel from one system to the other.
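
Jitter can be quantified in more than one way. The sketch below uses one simple definition, the mean absolute difference between consecutive latency samples; the function name is hypothetical, and other definitions exist.

```python
from typing import Sequence


def mean_jitter_ms(latencies_ms: Sequence[float]) -> float:
    """Mean absolute difference between consecutive latency samples."""
    if len(latencies_ms) < 2:
        return 0.0  # jitter needs at least two samples
    diffs = [abs(b - a) for a, b in zip(latencies_ms, latencies_ms[1:])]
    return sum(diffs) / len(diffs)
```

For latencies of 20, 24, 22, and 28 ms, the consecutive differences are 4, 2, and 6 ms, giving a mean jitter of 4 ms.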

User Datagram Protocol (UDP)

UDP typically offers faster transmission than TCP, but without advanced features such as guaranteed delivery or packet sequencing. This can be very useful when speed is especially important and where the software itself can compensate for some packet loss. The flow of traffic over UDP can be measured in the same way as for the protocols above, and the same metrics should be tracked.
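
Because UDP itself does not report lost packets, applications often add their own sequence numbers so that loss can be detected and measured. The sketch below assumes packets are numbered from 0 and that the expected count is known; both the numbering scheme and the function names are hypothetical.

```python
from typing import List, Sequence


def missing_sequence_numbers(received: Sequence[int], expected_count: int) -> List[int]:
    """Sequence numbers in 0..expected_count-1 that never arrived."""
    seen = set(received)
    return [number for number in range(expected_count) if number not in seen]


def packet_loss_percent(received: Sequence[int], expected_count: int) -> float:
    """Share of expected packets that were lost, as a percentage."""
    if expected_count <= 0:
        raise ValueError("expected_count must be positive")
    missing = missing_sequence_numbers(received, expected_count)
    return 100.0 * len(missing) / expected_count
```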

Conclusion

There is a lot that needs to be monitored across a network to be able to ensure network health, but there is no doubt that in monitoring the correct aspects of networks, teams can be well placed to respond to almost every eventuality and design software that is more capable of running on the required infrastructure.
