I was working on cloud-management software for a private cloud at a major tech company in SV. We had software that would reserve prod IP space for hypervisors: e.g., this hardware SKU can support up to 5 VMs, therefore it needs to reserve 5 IP addresses in the corresponding subnet.
Turned out the API call to reserve IP space from the IP Manager was synchronous (blocking), and because the manager tried to find consecutive space, the runtime increased exponentially with the number of IP addresses requested.
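To make the blowup concrete, here's a minimal sketch of why first-fit consecutive allocation gets dramatically slower as the requested block size grows. The IP Manager's internals weren't visible to us, so `find_consecutive_free` and the fragmentation model are assumptions, not the real implementation:

```python
import random

# Hypothetical model of a consecutive-space allocator: a linear scan for
# the first run of k consecutive free slots in a fragmented subnet.
def find_consecutive_free(used, k):
    """Return the start index of the first run of k free slots, else None."""
    run_start, run_len = 0, 0
    for i, in_use in enumerate(used):
        if in_use:
            run_start, run_len = i + 1, 0
        else:
            run_len += 1
            if run_len == k:
                return run_start
    return None

# In a fragmented subnet, short free runs are everywhere but long runs are
# rare, so the scan distance (and any per-candidate work the manager does)
# blows up with k. With ~50% random occupancy, a free run of length k
# exists at a given position with probability ~2^-k.
random.seed(42)
used = [random.random() < 0.5 for _ in range(4096)]
for k in (7, 15):
    print(k, "->", find_consecutive_free(used, k))
```

Running this, a run of 7 is typically found almost immediately while a run of 15 often doesn't exist at all in the subnet, which is the same cliff we fell off when the per-HV request jumped from 7 to 15.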
In preparation for holiday traffic, we were onboarding a new hardware SKU. It supported more tenants, so instead of requesting 7 IP addresses per hypervisor (HV), we were now asking for 15. That took the latency of a call to the IP Manager from 3-5 seconds to 5-10 minutes. To round off the perfect storm, the code was retrying failed requests without propagating the failure to the Cloud Admins using the software.
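The retry behavior is worth spelling out, because it's a classic anti-pattern. This is a hedged reconstruction, not our actual code; the function name and signature are made up for illustration:

```python
import time

def reserve_ips_with_silent_retries(reserve_fn, count, attempts=3, base_delay=1.0):
    """Retry a flaky reservation call with exponential backoff.

    Bug illustrated here: after exhausting retries we return None instead
    of raising, so the operator driving the tool never sees the failure.
    """
    for attempt in range(attempts):
        try:
            return reserve_fn(count)
        except TimeoutError:
            time.sleep(base_delay * 2 ** attempt)  # back off and try again
    # The silent failure: no exception, no alert, just a None that the
    # caller may never check.
    return None
```

Retries themselves were reasonable; swallowing the final failure is what let reservations pile up for weeks with nobody the wiser.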
One day in October, I received a panicky call from our capacity manager: customers were trying to spin up VMs but were being told there was no IP space left. He knew we'd onboarded all the racks, he'd done the math on the subnets (which were showing as fully reserved), and there still wasn't IP space... WTF!!
Turned out the IP Manager's VIP was cutting off requests after a few minutes (never a possibility when reserving only 7 addresses), but the reservation process wasn't stopping: each IP was being reserved and marked as in-use, yet never actually made it to the networking service to be used by VMs.
Solution: at 2am on a Friday night, I ran a script to manually mark tens of thousands of production IP records as not-in-use in the IP Manager, purely based on grepping through logs from my service and nslookups. But don't worry, we pinged each IP just to be safe :)
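The cleanup could be sketched roughly like this. Heavy assumptions throughout: the log format, the `mark_not_in_use()` call, and the "no reverse DNS and no ping reply means it's safe" heuristic are all reconstructions for illustration, not the real tooling:

```python
import re
import socket
import subprocess

# Hypothetical log line shape: "... reserved 10.1.2.3 for HV hv-042; handoff failed"
RESERVED_RE = re.compile(r"reserved (\d+\.\d+\.\d+\.\d+) for HV \S+; handoff failed")

def orphaned_ips(log_lines):
    """IPs our service reserved but never handed to the networking service."""
    return [m.group(1) for line in log_lines if (m := RESERVED_RE.search(line))]

def looks_unused(ip):
    """Belt and braces: no reverse-DNS record and no ping response."""
    try:
        socket.gethostbyaddr(ip)       # the nslookup step
        return False                   # something answers for this IP
    except socket.herror:
        pass
    # -c 1: send one probe; -W 1: wait one second (Linux iputils ping flags)
    result = subprocess.run(["ping", "-c", "1", "-W", "1", ip],
                            capture_output=True)
    return result.returncode != 0      # no reply -> probably free

def release_orphans(ip_manager, log_lines):
    for ip in orphaned_ips(log_lines):
        if looks_unused(ip):
            ip_manager.mark_not_in_use(ip)  # hypothetical IP Manager API
```

Obvious caveat, which we knew at 2am too: a host can drop ICMP and still be alive, so this heuristic trades a small risk of double-allocation against tens of thousands of stranded addresses.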