Sep 29, 2020

Who’s Behind Monday’s 14-State 911 Outage?

Emergency 911 systems were down for more than an hour on Monday in towns and cities across 14 U.S. states. The outages led many news outlets to speculate the problem was related to Microsoft’s Azure web services platform, which also was struggling with a widespread outage at the time. However, multiple sources tell KrebsOnSecurity the 911 issues stemmed from some kind of technical snafu involving Intrado and Lumen, two companies that together handle 911 calls for a broad swath of the United States.

Image: West.com

On the afternoon of Monday, Sept. 28, several states including Arizona, California, Colorado, Delaware, Florida, Illinois, Indiana, Minnesota, Nevada, North Carolina, North Dakota, Ohio, Pennsylvania and Washington reported 911 outages in various cities and localities.

Multiple news reports suggested the outages might have been related to an ongoing service disruption at Microsoft. But a spokesperson for the software giant told KrebsOnSecurity, “we’ve seen no indication that the multi-state 911 outage was a result of yesterday’s Azure service disruption.”

Inquiries made with emergency dispatch centers at several of the towns and cities hit by the 911 outage pointed to a different source: Omaha, Neb.-based Intrado — until last year known as West Safety Communications — a provider of 911 and emergency communications infrastructure, systems and services to telecommunications companies and public safety agencies throughout the country.

Intrado did not respond to multiple requests for comment. But according to officials in Henderson County, NC, which experienced its own 911 failures yesterday, Intrado said the outage was the result of a problem with an unspecified service provider.

“On September 28, 2020, at 4:30pm MT, our 911 Service Provider observed conditions internal to their network that resulted in impacts to 911 call delivery,” reads a statement Intrado provided to county officials. “The impact was mitigated, and service was restored and confirmed to be functional by 5:47PM MT.  Our service provider is currently working to determine root cause.”

The service provider referenced in Intrado’s statement appears to be Lumen, a communications firm and 911 provider that until very recently was known as CenturyLink Inc. A look at the company’s status page indicates multiple Lumen systems experienced total or partial service disruptions on Monday, including its private and internal cloud networks and its control systems network.

Lumen’s status page indicates the company’s private and internal cloud and control system networks had outages or service disruptions on Monday.

In a statement provided to KrebsOnSecurity, Lumen blamed the issue on Intrado.

“At approximately 4:30 p.m. MT, some Lumen customers were affected by a vendor partner event that impacted 911 services in AZ, CO, NC, ND, MN, SD, and UT,” the statement reads. “Service was restored in less than an hour and all 911 traffic is routing properly at this time. The vendor partner is in the process of investigating the event.”

It may be no accident that both of these companies are now operating under new names, as this would hardly be the first time a problem between the two of them has disrupted 911 access for a large number of Americans.

In 2019, Intrado/West and CenturyLink agreed to pay $575,000 to settle an investigation by the Federal Communications Commission (FCC) into an Aug. 2018 outage that lasted 65 minutes. The FCC found that incident was the result of a West Safety technician bungling a configuration change to the company’s 911 routing network.

On April 9, 2014, some 11 million people across the United States were disconnected from 911 services for eight hours thanks to an “entirely preventable” software error tied to Intrado’s systems. The incident affected 81 call dispatch centers, rendering emergency services inoperable in all of Washington and parts of North Carolina, South Carolina, Pennsylvania, California, Minnesota and Florida.

According to a 2014 Washington Post story about a subsequent investigation and report released by the FCC, that issue involved a problem with the way Intrado’s automated system assigns a unique identifying code to each incoming call before passing it on to the appropriate “public safety answering point,” or PSAP.

“On April 9, the software responsible for assigning the codes maxed out at a pre-set limit,” The Post explained. “The counter literally stopped counting at 40 million calls. As a result, the routing system stopped accepting new calls, leading to a bottleneck and a series of cascading failures elsewhere in the 911 infrastructure.”

Compounding the length of the 2014 outage, the FCC found, was that the Intrado server responsible for categorizing and keeping track of service interruptions classified them as “low level” incidents that were never flagged for manual review by human beings.
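To make that failure mode concrete, here is a minimal Python sketch of a call-ID counter that hits a pre-set cap and a router that logs the resulting rejections at a severity nobody reviews. It is an illustration only, not Intrado’s actual software: the names `CallIdAllocator`, `Router` and `needs_human_review`, the severity labels, and the small demo limit are all assumptions made for the example.

```python
from dataclasses import dataclass, field


class CounterExhausted(Exception):
    """Raised when the allocator has handed out every ID it is allowed to."""


@dataclass
class CallIdAllocator:
    # The Post describes a pre-set limit of 40 million; the class itself is hypothetical.
    limit: int = 40_000_000
    next_id: int = 0

    def allocate(self) -> int:
        # Once the cap is reached, no further IDs are issued.
        if self.next_id >= self.limit:
            raise CounterExhausted(f"call ID limit of {self.limit} reached")
        call_id = self.next_id
        self.next_id += 1
        return call_id


@dataclass
class Router:
    allocator: CallIdAllocator
    routed: list = field(default_factory=list)
    alarms: list = field(default_factory=list)

    def route(self, caller: str) -> bool:
        """Return True if the call was assigned an ID and passed on."""
        try:
            call_id = self.allocator.allocate()
        except CounterExhausted as exc:
            # Assumption for illustration: the rejection is logged as "low",
            # so it never reaches a human.
            self.alarms.append(("low", str(exc)))
            return False
        self.routed.append((call_id, caller))
        return True


def needs_human_review(alarms):
    """Only alarms above 'low' severity are escalated in this sketch."""
    return [alarm for alarm in alarms if alarm[0] != "low"]


if __name__ == "__main__":
    router = Router(CallIdAllocator(limit=3))  # tiny cap for the demo
    print([router.route(f"caller-{i}") for i in range(5)])
    # prints [True, True, True, False, False]
    print(needs_human_review(router.alarms))
    # prints []
```

In this toy version, every call after the cap is silently dropped and the escalation list stays empty, which is the combination the FCC described: a hard limit plus alarms that were never surfaced for manual review.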

The FCC ultimately fined Intrado and CenturyLink $17.4 million for the multi-state 2014 outage. An FCC spokesperson declined to comment on Monday’s outage, but said the agency was investigating the incident.


44 comments

  1. West.com (Intrado) was recently targeted by a large DDoS attack on August 25. Additionally, our scans found Intrado had multiple F5 BIG-IP servers vulnerable to CVE-2020-5902. We’re currently monitoring this as a vector of compromise for ransomware and other cyber attacks.

    • Where did you hear about the August 25th breach? I don’t recall hearing about that, and there is nothing in Intrado’s news releases that deals with it.

      • I have no idea whether this happened or not. But I would not expect that Intrado would put it on anything public facing if they could help it. So, the fact that it’s not in their press releases means nothing.

  2. CenturyLink here in central VA was Sprint’s new name a while back, but a quick check shows that these companies acquire new names more often than Zsa Zsa Gabor.

  3. Microsoft of course.

    Anyone depending on MS for life critical operations is flat-out incompetent and an active menace.

    “Sources”? Yeah okay. MS is at the root of it. The local governments and Intrado and Lumen used MS when they should not have and are also responsible but it always goes back to MS.

    • Your WAG is based on ………. absolutely nothing?

      • Well, not sure if this is the argument Mike would have used, but I would have to agree because of the “life critical operations” part. Software that is that important should be at as low a risk as possible of failing because of its complex systems and/or infrastructure. Running the 911 service solely over something like Azure cloud would have been quite a big risk where even a couple of minutes of downtime is unacceptable (if that was how the system was set up in the first place, and I highly doubt that is the case).

        IT solutions have a tendency, for several reasons, to increase in complexity over time. OT, meanwhile, has only one goal, and that is reliability, which seems to go hand in hand with not making complex systems.

        This is also the biggest reason why there has been a hard line between IT and OT operations historically. Big industrial machines are both too dangerous and too expensive to be open to the IT infrastructure, where they could be hacked or be affected by downtime of the IT infrastructure. This is about to change because many organizations want even more production and effectiveness and are looking at the wall between IT and OT as a place to gain more of it. Would be interesting to know if that has happened here.

        Would also mention that a lot of programming languages are designed to handle these kinds of situations where up-time is paramount. They have features that make it so you can update the software and the processes running will keep on going even if the update affects the process in question. So you can update a phone central and the calls will not even get a human-detectable disruption. These types of software often run directly on hardware and work as their own operating system. (Have only heard of these kinds of systems second hand through some other technicians working with the programming language Rust, so take it with a grain of salt. Maybe somebody else can extend discussion on that part? :))

  4. The Sunshine State

    I live in the Sunshine State, and I didn’t hear about this one, not in my raceway county.

    • Is your local company an “independent” or AT&T? These outages seem to be served by databases that supply services to small operating companies, or what used to be known as independents. CenturyLink/Intrado/Lumen is a conglomeration of parts pieced together over the years from small rural telecom companies and, more recently, larger and more diverse telecommunication companies. The old Bell operating companies absorbed by AT&T were not involved from what I can tell. Where the independents relied on Intrado, now called Lumen, AT&T relied on its own mostly in-house databases and routing.

      • You messed up some of your facts. Intrado and Lumen are and always have been separate companies. Lumen used to be known as CenturyLink and Intrado was West Safety Communications. It is all there in the story.

  5. All this shows that the economy is down and people do whatever they can to survive and live a good life. It means we need to start printing money all over the world, so that those who want to earn money will get money without criminal activity like this. Quantitative easing can help the economy go into a boom. We need an economic boom ASAP; an economic boom is for our own security and well-being.

    • This has nothing to do with the economy. It is about a poorly managed mission critical operation that substitutes developing technology for proven dependability. They are using the emergency call system as a lab to find cheaper but unproven ways to provide the service.

    • As an economy, you most of all should know that when you print more money, it becomes worth less. My best description: gold has never changed value. A currency’s change in value can be measured by how much gold you can buy with it.
      Review the printing of money in Venezuela as an example of overprinting money. There are lots of examples you can find.

      • …well yes and no. if the expansion of currency that comes from printing exceeds productivity gains, then yes you have inflation and the currency is worth less…

        …if on the other hand, the expansion of currency does not exceed productivity gains you have the opposite effect and the currency actually increases in value regardless of how much you “printed”…

        …so you need to know the velocity of the currency in % and the productivity in % to know if the “printing” devalued or increased the underlying currency…

  6. If it’s mission critical, why not have one server in your data center, one in Azure and one in AWS?

    • Because the trend is to get Government down to the size of it being able to be drowned in a bathtub. That is accomplished through cuts in funds. What you propose would be a budget item considered against something more critical than an expensive component of a Disaster Recovery Plan.

      • Redundancy is not a word that governments use, because of the almighty dollar talking.

        • Well, we do complain about them spending that dollar. We can’t simultaneously decry government spending then get upset that something went wrong because they cut spending as a result.

          • It can be done, but it takes someone smart enough to put it together.

            If you don’t have a backup plan, you are making a big mistake. Governments do that on a much bigger scale, and don’t learn until after the fact.

            Stupid in makes it stupid out and that famous quote:

            “I don’t recall”

            • Government spending is at its highest point in human history, where exactly are the cuts? More and more money is being printed and handed over to every government agency, most agencies have their highest budgets ever in nominal dollars.

              There is nobody advocating to cut spending on anything, if you actually look at what goes on and stop staring into the TV screen and your phone.

          • “We can’t simultaneously decry government spending then get upset that something went wrong because they cut spending as a result.”

            Politicians from a certain political party can and do.

            • Government spending is at its highest point in human history, where exactly are the cuts? More and more money is being printed and handed over to every government agency, most agencies have their highest budgets ever in nominal dollars.

              There is nobody advocating to cut spending on anything, if you actually look at what goes on and stop staring into the TV screen and your phone.

              • “Nominal dollars,” eh? That’s really misleading, although I suspect you know that.

                My mortgage is also “the highest it’s ever been in nominal dollars,” as are my property taxes. But that completely ignores factors like inflation.

                An honest way of measuring government spending would be to measure it as a % of U.S. GDP, which in 2019 was 34%. That’s likely to skyrocket with the COVID pandemic spending, stimulus, etc. Although I’m not sure how you count money that you created out of thin air…I guess we’ll see.

                “There is nobody advocating to cut spending on anything,” – this is just straight up false. This has been the entire platform of the Republican party since Reagan was president. Although what they say and what they do tend to be two different things. Generally what it really means is take money away from social programs and services for the populace, and throw it into defense spending and military-industrial contracts for wealthy donors.

              • No budget cuts?… OK, if you say so:

                https://www.cbpp.org/research/federal-budget/trumps-2021-budget-would-cut-16-trillion-from-low-income-programs

                I’m sure there are more examples, but Google can explain that.

      • This is not government. This is a for profit stock company.

        • This is a government service being run by the private sector under specifications and pay set by the government.

          • Mmmm…yes and no. The individual states/counties/etc. are the ones that provide the services. They just choose to purchase the software and routing services of private companies to accomplish that. Not to say that I don’t think the 9-1-1 system is broken, because it certainly is. It’s just that these local and state governments don’t have enough money to spend on fixing it. The private companies involved in the public safety and 9-1-1 industries have been pleading for major overhauls for years, but if the state and local governments aren’t willing to spend the money to do so, then there’s nothing that can be done. The private companies can’t work on a deficit to fix it on their own, and they really are a small piece of a much larger puzzle anyway.

            The federal government is barely involved at all, and can hardly even be called a regulator. The FCC has some regulations and requirements for 9-1-1 service, but they’re very broad and bare-bones. The communications providers (Verizon, AT&T, CenturyLink, etc.) actually have the most influence, but no one (e.g. the fed) holds their feet to the fire, so they only do the bare minimum required by the FCC as far as 9-1-1 is concerned.

    • That would be avoiding a single point of failure and not something a private company is as concerned with as making as much profit as possible. This is where the old Bell System excelled. Everything had several redundancies built in, especially 911, but their mission was not profit. Their mission was dependable universal service. In return, they had a protected monopoly.

  7. …no one is yet saying what happened (or failed to happen) in this outage…

    …everything else is pure speculation at best…

  8. Intrado explicitly advertises E911 services integrated with Microsoft Teams, a service that was supposedly “jointly developed with Microsoft, [and] is field-tested and proven in deployments across North America.” https://www.west.com/safety-services/enterprise-e911-solutions/microsoft-teams-e911-solutions/

    So obviously there’s some Microsoft integration at some point inside Intrado. Is it just from Intrado to Teams, or is the infrastructure more dependent on Microsoft products than they’re letting on? If they “jointly developed” this solution with Microsoft my guess is there’s more of Microsoft at other levels of their E911 setup than just this.

    Otherwise it’s one really funny coincidence that this overlapped exactly with Microsoft’s massive worldwide failure yesterday.

    • …Teams = rebranded Skype…

      …so the “integration” was not what you’re thinking…

      …now to the extent the 911 operators could not function w/o Skype – that’s a whole new kettle of phish (pun intended)…

  9. That’s tens of millions of folks. How many died?

  10. Jeannette Anderson

    On another topic, did MS say why they had the Azure outage?

  11. Threatening prosecution for paying ransomware to embargoed countries is incredibly stupid.
    Online attack attribution is extremely unreliable, that’s one problem.
    Ransomware attackers are rarely actually representing governments of nations, that’s another problem.
    And threatening prosecution for companies dealing with losses is a third: unless the government provides some type of loss mitigation, this is purely legislation without representation.