Revolutionizing SNMP Monitoring: How I Will Tackle SNMP Gluttony with Go and OpenSearch.

In network management, SNMP (Simple Network Management Protocol) gluttony has long plagued administrators. The term refers to excessive, inefficient polling of network devices, which can lead to network congestion, increased latency, and even device failures. Traditional SNMP monitoring tools often exacerbate the problem by polling every interface on a device, including inactive ones. This dawned on me during a particularly challenging project: monitoring approximately 2,300 cable modems on a 5-minute cycle using Go.

The Genesis of a More Efficient Approach

While working on the modem polling project, I confronted the inefficiency head-on. Each cable modem, functioning like a router, possesses multiple interfaces. Polling every single modem, regardless of its operational status, was not only redundant but also resource-intensive. This insight led to the development of a state table to track online modems, focusing polling efforts solely on active modems and their interfaces.

The breakthrough was recognizing that the solution wasn't just applicable to cable modems but could revolutionize SNMP polling for any network device. The principle was simple yet powerful: dynamically track and poll only operational interfaces to minimize unnecessary SNMP traffic and optimize resource usage.

Technical Implementation: A Go-Based Poller

The solution I devised involved implementing a Go-based poller that uses two OpenSearch indices:

  1. Interface State Index: Tracks the status of interfaces, updating periodically or based on specific triggers. OpenSearch can hold millions of interfaces in one index and still query them quickly. Because this index is small and does not store billions of metrics, it also serves as a near-instant snapshot of the latest state of every interface.
  2. Metrics Index: Stores time series metrics collected from active interfaces.

The Go-based poller operates by:

  • Maintaining a State Index: This index tracks the state of all interfaces on the specified devices.
  • Polling Active Interfaces: Only interfaces that are up and running are polled, significantly reducing the load.
  • Dynamic Updates: The state index entry for each active interface is updated every time that interface is polled. The index is also refreshed at configurable intervals (hourly, daily, weekly) or on specific events such as reboots or SNMP traps.

The biggest reason for the state index is that Go workers can fetch chunks of OIDs prepared for one or more interfaces in a single request, rather than relying on SNMP bulk walks that may fetch data from many inactive interfaces. The code decides which approach is better. For instance:

  • Fully Loaded Switch: On a fully loaded 48-port switch, a few SNMP walks are likely more efficient than numerous batched SNMP gets.
  • Router with Many Interfaces: For a router with hundreds or thousands of interfaces but only a small percentage in use, batched SNMP gets for just the active interfaces may be more economical.

The Go poller queries the state index to make these decisions. If 100% of interfaces are up, an SNMP walk is probably a good idea; if it's only 50%, it might not be.
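A minimal sketch of that decision, assuming a single tunable threshold on the share of interfaces that are up. The 0.9 threshold, the chunk size of 4, and the function names are illustrative assumptions, not values taken from the project. The OID prefixes are the standard IF-MIB ifHCInOctets and ifHCOutOctets columns.

```go
package main

import "fmt"

// Strategy is the polling method chosen for one device.
type Strategy string

const (
	Walk        Strategy = "walk"
	BatchedGets Strategy = "batched-gets"
)

// ChooseStrategy picks a walk when the share of up interfaces reaches
// walkThreshold (an assumed tunable), otherwise batched gets.
func ChooseStrategy(up, total int, walkThreshold float64) Strategy {
	if total > 0 && float64(up)/float64(total) >= walkThreshold {
		return Walk
	}
	return BatchedGets
}

// ChunkOIDs splits oids into groups of at most size entries,
// one group per SNMP request.
func ChunkOIDs(oids []string, size int) [][]string {
	if size <= 0 {
		return nil
	}
	var chunks [][]string
	for start := 0; start < len(oids); start += size {
		end := start + size
		if end > len(oids) {
			end = len(oids)
		}
		chunks = append(chunks, oids[start:end])
	}
	return chunks
}

func main() {
	fmt.Println("48 of 48 up:", ChooseStrategy(48, 48, 0.9))
	fmt.Println("40 of 1000 up:", ChooseStrategy(40, 1000, 0.9))

	// Build ifHCInOctets and ifHCOutOctets OIDs for three active ifIndex values.
	var oids []string
	for _, idx := range []int{3, 17, 512} {
		oids = append(oids,
			fmt.Sprintf("1.3.6.1.2.1.31.1.1.1.6.%d", idx),
			fmt.Sprintf("1.3.6.1.2.1.31.1.1.1.10.%d", idx))
	}
	for _, chunk := range ChunkOIDs(oids, 4) {
		fmt.Println(chunk)
	}
}
```

Each chunk would then be handed to a worker as one request. The right threshold and chunk size depend on the devices being polled, so both are best treated as configuration.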

Advantages of OpenSearch Over RRD Files

One of the most significant advancements in this project is the use of OpenSearch indices to store metrics instead of traditional RRD (Round Robin Database) files. This approach offers several key benefits:

  • Flexibility in Data Types: With RRD files, data must conform to predefined types and constraints set during the initial creation of the RRD file. In contrast, OpenSearch allows us to store raw data without such limitations, providing greater flexibility in data analysis and usage.
  • Unlimited Data Compression Options: As the project evolves, it will incorporate built-in options for rolling data up into aggregates (hourly, daily, weekly, or monthly averages, and more). There is no fixed limit on how far the data can be rolled up. If certain data is no longer needed, it can be discarded without affecting the rest of the dataset.
  • No Data Inflexibility: An RRD file's layout (its data sources, step, and archives) is largely fixed when the file is created. OpenSearch, however, allows for dynamic data management, enabling more sophisticated data handling and analysis techniques.
  • Distributed Polling Over Time: Because OpenSearch does not require samples to arrive on a fixed step the way RRD files do, the polling load can be spread over the entire polling cycle. With a 5-minute poll cycle, the design can use the full 5 minutes to gather data from different interfaces. This prevents the system from being hammered with one huge burst of requests, enhancing performance and reducing peak load.

Why This Approach is Superior

  1. Resource Efficiency: By focusing on active interfaces, the system reduces bandwidth and processing power consumption, optimizing resource use.
  2. Scalability: The poller can handle a larger number of devices without performance degradation, a crucial feature for growing networks.
  3. Relevant Data Collection: Collecting data only from active interfaces eliminates the noise from inactive ones, making the data more relevant and actionable.
  4. Resilient to ifIndex Changes: The state table can be wiped at any time without losing monitoring data, because the system relies on the more stable ifDescr (interface description) rather than on volatile ifIndex values.
  5. Dynamic Polling and Calculation: OpenSearch allows for real-time data querying and calculations. Whether the last poll was 5 minutes or 5 minutes and 30 seconds ago, the data can be accurately queried and converted to events per second, providing valid metrics without the strict timing constraints imposed by RRD files.

Conclusion

The journey from recognizing SNMP gluttony to developing an efficient polling system was driven by the need for practical and scalable network monitoring solutions. The Go-based poller, with its state table and dynamic polling mechanisms, represents a significant step forward. By eventually sharing this project on GitHub, I hope to help others implement more efficient SNMP monitoring systems.

For years, I've had similar ideas simmering, but it wasn't until I tackled the cable modem monitoring project that everything clicked. This project sparked the development of an efficient SNMP polling system, which I'm excited to refine and eventually share on GitHub. There's a lot on the roadmap, from APIs for ChartJS and EasyJSON to configurable compression for back-end index management. While there's plenty more to explore, I think the current state of the project will really resonate with anyone who enjoys digging into innovative network monitoring solutions.

Many might argue that sharing so much about my project could lead to someone stealing the idea, but I believe in the power of open innovation. Consider Volvo, which opened its patent for the three-point seat belt to other manufacturers, saving countless lives as a result. By openly sharing my work, I hope to contribute to the community, inspire others, and drive collective progress in network monitoring solutions. After all, true innovation benefits everyone.