By Nicky Green - Advisory Director | 13/11/2024

Operational Resilience – Lessons from the CrowdStrike Outage


Inter-reliance between firms is inevitable, whether for technology provision, platform access or outsourced operations. Over time, the FCA has seen an increase in the number of third party issues causing operational incidents. As a result, it has understandably shone a spotlight on operational resilience and, in particular, on reliance on unregulated third parties to deliver important services.

In July 2024, the CrowdStrike outage brought the need for good operational resilience plans into sharp focus. CrowdStrike is a cyber security firm which lists many regulated financial services firms amongst its global client base. When the company released an update to its Falcon sensor security software that included a configuration error, it caused issues globally, taking many systems offline. Although the fix was issued within hours, many of the affected organisations needed manual intervention, so full resolution took a prolonged period. The impact on financial services firms was significant, affecting trading, customer system access and regulatory reporting for many of them. For those of us with a particular interest in operational resilience and third party reliance, it provided an important case study.

So, it’s no surprise that on 31 October 2024, the FCA published its observations on how firms responded to the outage and key lessons for preparedness for future incidents. The firms that had prepared for compliance with PS21/3 Building Operational Resilience undoubtedly came out better, but what worked well and what needs to change?

Resilience of infrastructure

Identifying single points of failure in technology is a basic step in finding weak points, but firms have now realised that more can be done on this front. The FCA has seen that firms have worked on diversifying risk (for example, using devices with different operating systems) to reduce the number of possible failure points. They have also done more to embed change management processes for software and content updates to mitigate risks.

The FCA has recommended that firms test updates first and then use a phased approach to roll-out, so that the risk of a significant outage is minimised.
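To make the idea of a phased roll-out concrete, here is a minimal Python sketch. It is illustrative only and is not taken from the FCA's observations: the wave sizes, the host names and the `apply_update` and `is_healthy` callables are all assumptions standing in for whatever deployment tooling and health checks a firm actually uses.

```python
from typing import Callable, Iterable, List, Sequence


def plan_waves(hosts: Sequence[str], fractions: Iterable[float]) -> List[List[str]]:
    """Split hosts into waves using cumulative fractions, e.g. 0.1, 0.5, 1.0."""
    waves: List[List[str]] = []
    start = 0
    for fraction in fractions:
        # Each wave ends at the cumulative fraction and holds at least one host.
        end = min(len(hosts), max(start + 1, round(len(hosts) * fraction)))
        if start < end:
            waves.append(list(hosts[start:end]))
        start = end
    return waves


def phased_rollout(
    hosts: Sequence[str],
    fractions: Iterable[float],
    apply_update: Callable[[str], None],  # hypothetical deployment hook
    is_healthy: Callable[[str], bool],    # hypothetical post-update health check
) -> List[str]:
    """Update one wave at a time and stop after the first wave that is unhealthy."""
    updated: List[str] = []
    for wave in plan_waves(hosts, fractions):
        for host in wave:
            apply_update(host)
            updated.append(host)
        if not all(is_healthy(host) for host in wave):
            break  # halt so the remaining hosts are never touched
    return updated


if __name__ == "__main__":
    fleet = [f"host-{i}" for i in range(10)]
    touched = phased_rollout(
        fleet,
        fractions=(0.1, 0.5, 1.0),
        apply_update=lambda host: None,             # stand-in for a real update
        is_healthy=lambda host: host != "host-3",   # simulate one bad host
    )
    print(f"Updated {len(touched)} of {len(fleet)} hosts before halting")
```

In this sketch a fault that surfaces in the second wave stops the roll-out with half the fleet untouched, which is the effect a phased approach is meant to achieve.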

Third party management

The disruption caused in the CrowdStrike case was amplified because some of the firms affected provided services to other regulated firms. This meant that the impact went beyond those that were direct CrowdStrike users. The firms that fared best during the outage were those that had fully mapped third party relationships as well as the onward chains of relationships. This type of mapping enabled firms to quickly understand and manage their exposure. The firms that had good mechanisms for information sharing with third parties were able to respond more quickly than other firms.
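As an illustration of what mapping onward chains can look like in practice, the following Python sketch walks a small dependency map to find which services are exposed to a failing supplier, directly or indirectly. Every service and supplier name in it is hypothetical, and the structure is an assumption rather than anything prescribed by the FCA.

```python
from collections import deque
from typing import Dict, List, Sequence, Set

# Hypothetical map: each service or supplier and the third parties it relies on directly.
DEPENDS_ON: Dict[str, List[str]] = {
    "payments service": ["Platform Provider A"],
    "client portal": ["Hosting Firm B"],
    "regulatory reporting": ["Reporting Vendor C"],
    "Platform Provider A": ["Endpoint Security Vendor"],
    "Hosting Firm B": [],
    "Reporting Vendor C": ["Platform Provider A"],
}


def dependency_chain(entity: str, depends_on: Dict[str, List[str]]) -> Set[str]:
    """Return every third party reachable from entity, including indirect ones."""
    seen: Set[str] = set()
    queue = deque(depends_on.get(entity, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # already visited; also guards against circular relationships
        seen.add(current)
        queue.extend(depends_on.get(current, []))
    return seen


def exposed_services(
    services: Sequence[str], failed_party: str, depends_on: Dict[str, List[str]]
) -> List[str]:
    """List the services whose dependency chain includes the failed third party."""
    return [s for s in services if failed_party in dependency_chain(s, depends_on)]


if __name__ == "__main__":
    services = ["payments service", "client portal", "regulatory reporting"]
    print(exposed_services(services, "Endpoint Security Vendor", DEPENDS_ON))
```

Here neither exposed service uses the failing vendor directly, yet both surface as affected through their suppliers, which mirrors how the outage reached firms that were not direct CrowdStrike users.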

The FCA has recommended that firms review third party management frameworks regularly including identifying changes needed to third party categorisation and the assessment and management of risk. Regular reviews of third parties are an integral part of this, looking at performance against agreed service levels and contractual obligations, continuity planning and exit plans. This should include consideration of indirect relationships with third parties as well as direct relationships.

Incident response and communications

The need for good incident response and crisis management arrangements can’t be over-emphasised. Firms that had tried and tested plans available off-line fared the best during the outage. The effectiveness of communication with the FCA varied across impacted firms, with some providing more timely communications that gave a clear view of the impact on important business services. This type of communication enables the FCA to quickly assess the industry-wide impact of an incident, which in turn enables it to understand how the incident might affect its statutory objectives. Communication plans within firms should cover communication not only with clients, third parties, staff and other stakeholders but also with the FCA.

The FCA has recommended ensuring that stakeholder contact details are up to date and readily available. Firms should also consider the use of pre-approved communication templates, service status pages and updates ready to load onto websites and systems. Contracts with third parties should be clear on communication requirements including notification of incidents and timely updates, to ensure that the firm is in an informed position throughout an outage. Post incident reviews can also give insights into how firms can better manage risk and whether any changes to important business services or impact tolerances are needed.

So, what are the key takeaways for firms?

  1. Make sure you have clearly mapped important business services, including interdependencies. Set out a plan for getting services back online in the event that they fail.
  2. Carry out regular impact testing to ensure that your impact tolerances are up to date. Alongside this, carry out scenario testing using severe but plausible scenarios – really think about what’s plausible when planning scenarios, even if it’s an unpalatable scenario. It’s easy to rule out a scenario as implausible, but is that just because it’s not a scenario you really want to face?
  3. Make sure employees are equipped to recognise incidents and trigger an appropriate response.
  4. Have clear communication plans for customers, staff and other stakeholders, including the FCA. Make sure the plan is tested and that it will work when it needs to.
  5. Ensure third party contracts set out responsibilities for communicating incidents, including timely notifications and updates, and take a proactive approach to third party risk management.
  6. Have processes in place for post-incident reviews to assess the root cause of an incident, evaluate the effectiveness of the response and identify opportunities for improvement.
  7. Set the right ownership at senior management level. Good governance structures are essential for ensuring appropriate risk management, and senior managers need to be clear on their responsibilities. Management needs access to good information and reports so that it can assess and challenge how well third party risk is being managed.

How Square 4 can help 

Remember that for firms in-scope of PS21/3, there’s a clear deadline of 31 March 2025 for mapping and testing to have been carried out.

Square 4 Partners has expertise on the management of third party service providers, risk management and operational resilience. Get in touch with us at hello@square4.com if you would like to discuss how we can help to make sure you’re as prepared as possible for outages and incidents.
