Someone, somewhere, is the person who pressed go on that software update. And right now they know exactly what they did and what it has done. And as someone who knows that sinking feeling when you realise you’ve screwed up at work, I cannot imagine the state of them right now.
— Jim Waterson (@jimwaterson) July 19, 2024
Today’s tweet about the CrowdStrike incident, which seemingly brought the modern IT world to a standstill, reminded me of the darkest day of my professional life: the day I accidentally knocked out internet access for a city of over 200,000 people.
It was my second year of university, and I worked as a junior system administrator for the largest local ISP in my home city. We had a large wireless network (~100 km in diameter) covering our whole city and many surrounding rural areas. This network was used by all the major commercial banks and many large enterprises in the area (bank branches, large factories, radio stations, etc.).
To cover such a large area (in Ukraine in the early 2000s), about 50% of which was rural villages and towns, we basically had to build a huge wifi network: a very powerful antenna in the center, with many smaller regional points of presence connecting to it via directional wifi antennas and distributing the traffic locally. The core router connected to the central antenna sat on the top floor of the tallest building in the area, about 20 minutes away from our office.
One day I was working on some monitoring scripts for the central router (which was basically a custom-built FreeBSD server). I’d run those scripts on a test box I had on my desk, make some changes, run them again, and so on. We did not have VMs back then, so experimental work happened on real hardware that was a clone of the production box. In the middle of my local debugging, I received a monitoring alert saying that our production core router had some (non-critical) issues. Since I was on call that day, I decided to take a look. After fixing the issue on the router, I went back to my debugging and successfully finished the job about an hour later.
And that’s where things went wrong… When I wanted to shut down my local machine, I switched to a terminal that was connected to the box, typed “poweroff”, pressed Enter… and only then realized that I had done it on the wrong server! 🤦🏻♂️ That second terminal window had been open ever since the monitoring alert an hour earlier, and I had just shut down the core router for our whole city-wide network!
What’s cool is that there was no blame in the aftermath of the incident. The team understood the mistake and focused on fixing the problem. We ended up having to drive to the central station and manually power the router back on: we had no remote power management set up for that server, and IPMI did not exist yet. Dark times indeed! 😉
As a result of that mistake, our whole city’s banking infrastructure and a number of other important services were down for ~30 minutes. Following the incident, we made a number of improvements to our infrastructure and our processes (I don’t remember the details now) that made the system a lot more resilient to similar errors.
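One safeguard in that spirit (purely illustrative; I don’t claim this is what we actually built) is a guard wrapper that refuses to power off a machine until the operator retypes its hostname, so a terminal pointed at the wrong box fails loudly instead of taking the whole network down. A minimal sketch in Python:

```python
#!/usr/bin/env python3
"""Hypothetical guard around `poweroff`: refuses to shut the machine down
unless the operator retypes its hostname. Names and paths are illustrative."""
import socket
import subprocess
import sys

def main() -> int:
    hostname = socket.gethostname()
    print(f"You are about to power off: {hostname}")
    answer = input("Type the hostname to confirm: ").strip()
    if answer != hostname:
        print("Hostname mismatch; aborting.", file=sys.stderr)
        return 1
    # Only after explicit confirmation do we call the real poweroff binary.
    return subprocess.call(["/sbin/poweroff"])

if __name__ == "__main__":
    sys.exit(main())
```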
Looking back now, huge kudos to my bosses for not firing me back then! This incident profoundly influenced my career in many ways:
First, the thrill of managing such vast infrastructure made me want to stay in technical operations rather than shift to pure software development, a path many of my peers chose at the time. Then, having experienced such a massive error firsthand, I’ve always done my absolute best to safeguard my systems against failures, optimizing for quick recovery and staying paranoid about backups and redundancy. Finally, it was a pivotal moment in my understanding of the value of a blameless incident process, long before modern blameless DevOps and SRE cultures emerged. That management lesson has deeply informed my approach to leadership and system design ever since.