What lessons can the NERC community learn from the AWS outages?

I’m sure just about everyone reading this post was affected by the Amazon Web Services outages on Monday. I have to use the plural “AWS outages” because there were many of them, although few were in AWS services used directly by end users. Instead, the most visible outages were in services provided by AWS customers, including Zoom, Venmo and Instacart. I was tempted to use the term “AWS-caused outages”, but that’s not accurate, since it seems the root cause was a faulty DNS change, probably not made by AWS.[i]

Here are my observations on what the outages may mean for the NERC CIP community, and especially for the effort (which I have been very involved in) to change the NERC CIP standards to allow NERC entities to deploy BES Cyber Systems in the cloud, without being out of compliance with any of the standards.

How can (and can’t) NERC entities utilize the cloud today?

A good friend emailed me on Monday morning with the comment (accompanied by a link to an article on the AWS outages), “Please explain to me why folks want to put their BES Cyber Systems in the cloud…” Note that the question of whether BCS should be deployed in the cloud is quite different from the question of whether BES Cyber System Information (BCSI) should be stored or used in the cloud. The latter question was answered in the affirmative by the Project 2019-02 BES Cyber System Information Access Management Standards Drafting Team. They developed the modified standards CIP-004-7 and CIP-011-3, which took effect on January 1, 2024. If a NERC entity complies with those standards, their storage and use of BCSI in the cloud will be both safe and compliant.

However, “BCS in the cloud” is a much more complex topic. First, I want to define the term “Cloud BES Cyber System” by using wording from the definition of BES Cyber Asset: “A system that, if rendered unavailable, degraded, or misused would, within 15 minutes of its required operation, misoperation, or non-operation, adversely impact one or more Facilities, systems, or equipment, which, if destroyed, degraded, or otherwise rendered unavailable when needed, would affect the reliable operation of the Bulk Electric System. Redundancy of affected Facilities, systems, and equipment shall not be considered when determining adverse impact.”

The most important part of this definition is the requirement that the system have a sub-15-minute impact on the BES. In my opinion, this means the Cloud BCS must have a direct connection to a device that directly monitors or can change the operation of the BES. If there is no direct connection to the BES, the software alone isn’t a BCS and shouldn’t be a concern for the CIP standards, unless it’s a SaaS product that consumes BCSI and is therefore subject to CIP-004-7 and CIP-011-3 compliance.

There are three questions regarding BCS in the cloud:

1.      Is deploying BCS in the cloud safe? Since we’re talking about risk management here, this question needs to be rephrased as, “Is it safe with regard to (risk type)?” We’re concerned with risks to the Bulk Electric System (BES), so we should ask, “Is deploying BCS in the cloud safe with regard to risks to the BES?” This will distinguish the question from, for example, “Is deploying BCS in the cloud safe with regard to financial risks to the NERC entity?” That is a legitimate question, but it’s not one we’re concerned with here.

This is where the issue of cloud outages comes in. We really want to ask, “Is the BES risk posed by cloud outages great enough that deploying BCS in the cloud should be forbidden, either by the NERC CIP standards or otherwise?” That is essentially what my friend was asking.

2.      Is deploying BCS in the cloud CIP compliant? The answer to that question is Yes, but it’s for a trivial reason: Since the word “cloud” is never even mentioned in the current CIP requirements, the requirements don’t “prohibit” cloud use. They also don’t prohibit throwing BCS in a river or incinerating them, yet I wouldn’t recommend either of these as a compliance strategy.

This question is better phrased as, “Will deploying medium or high impact BCS in the cloud put a NERC entity at risk of non-compliance with one or more current NERC CIP requirements?” The answer to that question is Yes, although – once again – nothing in the CIP requirements today prohibits deploying any BCS in the cloud. However, if a NERC entity deploys even one medium or high impact BCS in the cloud, their CSP will need to provide them with evidence that they maintained CIP compliance for over 100 CIP Requirements and Requirement Parts; this includes evidence for highly prescriptive requirements like CIP-007 R2 (patch management) and CIP-010 R1 (configuration management). No CSP is likely to agree to provide this evidence, meaning the NERC entity may be found non-compliant with every current CIP requirement.

Note: One context in which BCS-in-the-cloud often comes up is the question of cloud-based SCADA software (i.e., SCADA as SaaS). Is using cloud SCADA software likely to put a NERC entity out of compliance with one or more CIP requirements? There are three possibilities:

a)     If this is for a low impact CIP environment, there is no restriction on what can be deployed in the cloud, although compliance needs to be maintained. In fact, in this post, I described (with a lot of help from Kevin Perry) how a complete low impact Control Center can be implemented in the cloud, while maintaining full compliance with the current CIP requirements. I’m told that a small number of low impact Control Centers have already been deployed in the cloud, although I don’t know whether they have yet passed a CIP compliance audit.

b)     If the SCADA-in-the-cloud software is being used by a medium or high impact Control Center and has a direct connection to a device deployed on the BES (e.g., a connection to a relay in a generating station or transmission substation that operates a BES-connected device like a circuit breaker), it has a sub-15-minute impact. Thus, it is a medium or high impact BCS. The NERC entity will need to provide compliance evidence for over 100 Requirements and Requirement Parts. As I’ve pointed out many times, no CSP will ever be able to provide this evidence, since software in the cloud never “resides” on a single device (whether virtual or physical) and even if it did, it will frequently move from device to device and data center to data center.

c)      If the SCADA software is being used by a medium or high impact Control Center but is not directly connected to a device deployed on the BES (e.g., the software provides recommendations to a human being or intelligent system[ii], which then makes the “decision” whether to command a circuit breaker to open), it is not a BCS. However, at least some of the data used by the software likely constitutes BCSI and is subject to compliance with CIP-004-7 and CIP-011-3.

3.      What changes are needed to the NERC CIP standards to allow medium and high impact BCS to be deployed in the cloud? This depends on the type of BCS you want to deploy. I believe that the only likely deployments of BCS in the cloud will be a) high and medium impact systems in Control Centers that only perform a monitoring function, b) medium impact BCS in renewables Control Centers (which may not require a real-time connection to the BES), and c) (as I already described) any low impact system deployed in a Control Center. For all other BES Cyber Systems (including all BCS deployed in substations and fossil generating stations), I agree that deployment in the cloud poses too much risk to be advisable. The AWS outages just reinforce that opinion. 
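As a rough illustration, the three cloud SCADA cases in the note under question 2 can be written as a small decision rule. The Python sketch below only restates this post's reasoning; it is not an official NERC classification tool, and the names `Impact`, `CloudScadaDeployment` and `classify` are made up for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Impact(Enum):
    """Impact rating of the CIP environment that uses the software."""
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


@dataclass(frozen=True)
class CloudScadaDeployment:
    """Hypothetical record invented for this illustration."""
    impact: Impact
    # True if the cloud software connects directly to a device deployed
    # on the BES (e.g., a relay that operates a circuit breaker).
    direct_bes_connection: bool


def classify(deployment: CloudScadaDeployment) -> str:
    """Return which case (a, b or c) from the note applies, per this post's reading."""
    if deployment.impact is Impact.LOW:
        # Case a: low impact environment.
        return "a: no restriction on cloud deployment, but compliance must be maintained"
    if deployment.direct_bes_connection:
        # Case b: medium/high impact with a direct BES connection.
        return "b: medium or high impact BCS; evidence needed for over 100 Requirements and Parts"
    # Case c: medium/high impact without a direct BES connection.
    return "c: not a BCS, but its data is likely BCSI (CIP-004-7 and CIP-011-3 apply)"
```

For instance, `classify(CloudScadaDeployment(Impact.MEDIUM, direct_bes_connection=False))` lands in case c. Real classifications depend on facts an auditor would weigh, so treat this as a summary of the argument rather than a compliance test.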

Do you call this “decentralization”?

I was struck by the statement in the WSJ article that a Coinbase customer replied to Coinbase’s acknowledgment of a problem by writing, “A crypto exchange brought down by a centralized cloud. You can’t preach decentralization and still depend on AWS to stay online.” I’ve always said that the decentralization of the power grid is the best protection against a cyber event causing a widespread outage. In other words, I think it is close to impossible for a single cyberattack, or a coordinated set of cyberattacks, to cause a widespread and/or lengthy power outage. The North American power grid is just too diverse, and the entities operating the grid are just too decentralized, for that to happen.

However, I must admit that, if enough grid cyber assets (which could include high, medium or low impact BES Cyber Systems, non-CIP transmission and generation assets, and even purely distribution assets) were deployed with a single CSP that experienced a large-scale problem (even if they were only indirect users, as the Coinbase and Venmo users found out the hard way on Monday), that might well lead to a widespread power outage.

For example, suppose a SaaS grid monitoring application is used by many renewables producers. The app meets the “Cloud BCS” definition because it would have a sub-15-minute impact if lost or compromised. Even if all the on-premises systems that use that application are low impact, the mere fact that all of them lost grid visibility at the same time could leave the grid (or more realistically, one Interconnection) dangerously vulnerable to collapse. This could happen if a serious event occurred during the outage, like a sudden surge in supply, which would require some of those producers to temporarily decouple from the grid to prevent frequency from rising; the renewables producers might not be aware of the need to decouple, due to the outage.[iii]

While it’s not hard to describe the problem, it’s quite hard to identify an easy solution. One way to prevent that might be to limit the number of NERC entities that can contract with one CSP at one time. However, neither NERC nor FERC has any ability to prevent a NERC entity from contracting with whichever CSP they want.

Another way might be to identify best practices that will help CSPs reduce the risk of outages, then require utilities to revise their CSP contracts to include the requirement to follow those best practices. However, there are some huge problems with this approach:

1.      Since CSPs have every incentive in the world to reduce their risk of incurring an outage, it’s quite unlikely that the NERC community could identify best practices the CSPs don’t already know about. Plus, we would need to have an intimate knowledge of their inner workings, which is not available to the general public.

2.      As stated earlier, neither NERC nor FERC can require a NERC entity to contract with, or not to contract with, an individual CSP. By the same token, neither NERC nor FERC can dictate the terms of a contract with a third party to a NERC entity.

3.      Not only do NERC and FERC lack the power to dictate the terms of a NERC entity’s contract with a CSP, but it would also be a violation of antitrust laws to do so. Antitrust laws work both ways: They prevent suppliers from conspiring to raise prices or reduce services for consumers, but they also prevent consumers (in this case, electric utilities) from conspiring to force concessions from a supplier.

I honestly don’t see any good way to write a CIP requirement that would address the cloud outage problem, other than to prohibit NERC entities from using the cloud at all. However, that isn’t a solution to our current problem, which is that the wording of the current CIP standards is inadvertently preventing full use of the cloud by NERC entities that want to use it.

A good example of this problem is Electronic Access Control or Monitoring Systems (EACMS). Today, NERC entities with high or medium impact CIP environments are effectively banned from utilizing most cloud-based security monitoring services. This is because, if the service meets the definition of EACMS, the service provider will need to document their compliance with all the Requirements and Requirement Parts that an EACMS must comply with (well above 50); of course, this is just as impossible for the security services as it is for the CSPs.

However, unlike the other current CIP/cloud problems (like BCS in the cloud and PACS in the cloud), the EACMS problem has a looming deadline: October 1, 2028, when compliance with CIP-015-1, the standard for Internal Network Security Monitoring, will come into effect for medium and high impact BCS with External Routable Connectivity (ERC). Today, most of the services that provide this monitoring are cloud-based; if anything, more of them will be cloud-based in 2028. Even though there will probably still be some on-premises services, they are likely to be less feature-rich and more expensive than their cloud-based counterparts.

I said last fall that I think the current “Cloud CIP” standards drafting team (SDT) will not see their new and revised standards become effective until late 2031. I have revisited that prediction recently, and I have good news and bad news. The good news is that I no longer think 2031 is a good estimate. The bad news is that my estimate – unless the SDT drastically shrinks their idea of what they need to accomplish – is now much later than 2031. (I don’t want to tell you my new estimate without first explaining how I arrived at it. I plan to do that in my next post.)

However, I have an idea of a realistic path to move forward, which will solve the most pressing CIP/cloud problems by 2028, as well as move the other cloud problems (including the CSP outage problem, of course) from the realm of mandatory regulation to the realm of risk management, through a voluntary cloud risk management framework.

In addition to making that change, we need to acknowledge that new cloud risks are being identified all the time; if anything, that trend will accelerate year after year. The NERC standards development process is far too slow to deal with those risks in a realistic timeframe. For example, the risk of a laptop carrying malware being connected to a network and infecting devices on the network was well known, through painful experience, in the late 1990s. When did CIP-010 R4, the requirement for protection of Transient Cyber Assets, come into effect? 2017.

This is another reason why we need to move cloud risks from the realm of regulation to the realm of voluntary risk management. The electric power industry will be able to address new risks quickly, rather than never. (By the way, do you know the status of the new CIP requirement for ransomware? I didn’t think so, since there isn’t one, and none has even been proposed. Ransomware has been a serious threat for more than a decade, yet there will probably never be a ransomware requirement in CIP.)

I will discuss these ideas in much more detail in my next post. I’ll admit that my record for keeping my promises about a new post being imminent is terrible – I have almost never kept a promise like that. However, in this case I can say that the post is almost completely written, although it will require some revisions. I will have it up tomorrow or Saturday.

You can take that to the bank.

 

If you would like to comment on what you have read here, I would love to hear from you. Please email me at [email protected] or comment on this blog’s Substack community chat.

I’m now in the training business! See this post for more information.


[i] My main source of information on the outages is this Wall Street Journal article, which is quite good. The article is behind a paywall. If you don’t want to sign up for the free WSJ trial, drop me an email and I’ll send you a PDF of the article.

[ii] In saying that an “intelligent system” can take the place of a human, I’m going a little out on a limb. However, if a deterministic algorithm for determining the appropriate action can be programmed into the device, that strikes me as something that should be doable, without raising the hackles of a NERC auditor who follows NERC’s current guidance that AI shouldn’t be making grid decisions. However, your mileage may vary; check with your NERC Region before implementing such a system.

[iii] This example is based on several assumptions that might be very unrealistic. However, with more time, an electrical engineer familiar with the workings of the BES could probably come up with a realistic one.
