[bars] 911 outage root cause revealed
Matt Wagner
mwaggy at gmail.com
Thu Jun 20 11:46:07 CDT 2024
It's easy to design a system in a lab that can withstand certain
predictable failures. With a large production system at scale, it's
virtually impossible to anticipate everything; you just get much more
exotic failure modes. "100% redundant" does not mean "100% reliable."
Some fun ones I've personally encountered:
- A trivial config change, peer-reviewed and tested in lower environments,
took down all four production load balancers at a previous company. The
config change itself was fine, but it triggered a latent bug in the
healthcheck rules that erroneously flagged every backend webserver as
unhealthy. (This also meant that our alerts steered us in the wrong
direction.) There's a rough sketch of that kind of healthcheck misfire
after this list.
- Our gear in a well-regarded data center went offline during a blizzard.
There were independent A and B power grids with our equipment sitting on
both, and each was protected by a giant UPS and redundant generators. It
turns out the blizzard was a red herring: the data center happened to be
doing routine maintenance, failing one grid over to a generator that came
up putting out 60 volts. It also turned out that some switches upstream of
us were only on the affected power grid, so despite our equipment staying
online and us having redundant switches with VRRP from the carrier,
everything in the building fell off the Internet for at least an hour.
- A small UPS fire in a remote data center caused no impact to my gear. The
Secaucus fire chief ordering that they kill all power sources while they
worked, however, caused a major impact. (So did the company declaring
bankruptcy a few months later, with less than 24 hours' notice before
shutting down operations, but that's another story.)
- A "high-availability" clustered filesystem with STONITH fencing was
designed to detect split-brain scenarios and have the quorum shut down a
rogue node to prevent data corruption. Some network glitch we never
entirely figured out caused multiple partitions, each thinking the others
were bad, so they all powered each other off. By dumb luck, this happened
in our QA environment and not production, so it was a fun adventure rather
than a nightmare.
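
To make the healthcheck anecdote concrete, here's a minimal Python sketch
of the general failure shape. This is not our actual load balancer
configuration; the hostnames, the expected response body, and the check
logic are all invented for illustration. The point is that a check which
lumps "couldn't understand the response" in with "backend is down" can
flag every backend as unhealthy at the same moment when some unrelated
change shifts its assumptions:

    import urllib.request

    EXPECTED_BODY = b"OK"  # assumption: the check expects this literal body

    def backend_is_healthy(url, timeout=2.0):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.status == 200 and resp.read(16).strip() == EXPECTED_BODY
        except Exception:
            # The latent bug: timeouts, TLS hiccups, and misparsed responses
            # all land here and count as "unhealthy", so one bad assumption
            # can mark every backend down at once.
            return False

    backends = ["http://web1.example.internal/healthz",
                "http://web2.example.internal/healthz"]
    healthy = [b for b in backends if backend_is_healthy(b)]
    if not healthy:
        print("all backends flagged unhealthy; the LB has nowhere to send traffic")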
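
And here's an equally hypothetical sketch of the mutual-STONITH scenario,
again in Python rather than the real cluster stack. If each partition
fences whatever it can't see without first confirming that it holds a
strict majority, a multi-way partition ends with every node shooting every
other node:

    ALL_NODES = {"node-a", "node-b", "node-c"}

    def fence_targets(visible):
        # Nodes this partition would power off, given who it can still reach.
        # Buggy logic: fence everything we can't see, with no check that we
        # actually hold a majority of the cluster.
        return ALL_NODES - visible

    # A multi-way network partition: each node can only see itself.
    partitions = [{"node-a"}, {"node-b"}, {"node-c"}]

    for part in partitions:
        print(sorted(part), "would fence", sorted(fence_targets(part)))

    # A sane quorum rule would require len(visible) * 2 > len(ALL_NODES)
    # before fencing anyone; none of these single-node partitions qualifies,
    # so none of them should be allowed to shoot.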
And those anecdotes aren't even the really fun ones, like BGP route
leaks/hijacking.
At this point they haven't said anything about the cause of the outage
except that it involved a firewall. It's possible it was a really foolish,
preventable mistake, but right now no one knows.
On Wed, Jun 19, 2024 at 10:48 PM JWAHAR BAMMI via bars <bars at w1hh.org>
wrote:
> Hmmm…. so the 911 system does not have a redundant system it can fail over
> to if the primary system goes down? Even way more modest enterprise systems
> (like the ones I sell) have 100% redundancy and fail over to the disaster
> recovery site in real time. Does MA not treat 911 as a mission critical
> system? Will be interesting to know from someone who has intimate knowledge
> of the setup.
>
> 73 de k1jbd
> bammi
>
>
> On Jun 19, 2024, at 10:03 PM, Juan Jiménez <k1cpr at bd5.com> wrote:
>
>
> https://www.boston25news.com/news/local/massachusetts-officials-reveal-cause-hourslong-911-system-outage/FA22RGI3QBB6FBPWLJMKO7SOFM/