Anthony Michaelson

Chop wood, carry water

IPncm

Large-scale network configuration management solution

Introduction

One day, while working in my office at my corporate job, I opened an email from the lead salesperson for our region. They provided me with an SOW and the managed services contract for a new $75 million-a-year deal and asked for a technical review. I noticed one line that seemed rather insane. The customer, or Cisco people, had added a requirement that:

every single IOS device in the customer production environment must be upgraded within 4 hours upon notice of any discovery of a potential security vulnerability!

Since I had developed THE Cisco vulnerability detection framework used up to that time, I knew many of these so-called vulnerabilities could be rather trivial and were detected frequently. This meant that if the customer wanted to demonstrate our inability to meet our contractual requirements, they could likely send an email at any time asking for this to be done.

I walked over to the salesperson’s office and exclaimed, “You know this is not a normal procedure, right?” I tried to talk some sense into him to have that line removed or updated. Of course, the response was, “This is already going to be approved as is by the CEO.” This was pretty normal in that company, as I learned over the years. In response, I decided to ensure our team could satisfy the requirement, and we set out to design a product that could upgrade 30,000+ Cisco IOS devices in 4 hours or less with zero fallout.

This product is the result: https://github.com/tony-michaelson/IPncm

Project Outcomes

Eventually, the product caught the attention of many operations managers, and they began using it for mass automation of all their configuration management duties for their customers. The project was used to manage the configuration state of over 200,000 devices at its peak before the company started shutting down its managed services business. After a 9-year run, the project was retired.

IPncm enabled small teams of network engineers to successfully manage thousands of devices. Rollback features based on network tests run after config changes ensured extremely high guarantees that these business-critical systems remained operational. As time went on, IPncm was expanded to support more device types from all major vendors and was able to manage configurations for Linux systems or anything with an SSH connection.

I focused on a product design that delivered high reliability above all. Teams using the product were provided with reports of every change and could use it for auditing before and after changes to ensure compliance.

The gross income from the project is difficult to estimate, but it could conservatively be credited as the lynchpin in securing the initial $75 million annual contract, which ran for 5 years. The cost savings for reducing 15–50 FTEs, depending on the year, was significant. The product became a bonus for any customer of the managed services line of business, and all customers eventually benefited.

Architecture and Design Choices

The product was written in Perl5 using Net::OpenSSH and threads. I chose Perl5 because it was the primary language used within the core monitoring team at the company and because Perl5 was capable of true multi-threading compared to Python or Ruby at the time.

For scalability, the product was designed to have a central location for issuing change requests with host lists, a Connector, and a Client running on the monitoring servers that had all the hostnames for the devices. Piggybacking on the existing monitoring infrastructure allowed us to guarantee reachability to all devices under management.

If I were designing this product today, I would consider Go, Rust, Nim, Python, TypeScript/Node.js, and Elixir.