Three ways CSOs can mitigate outage impact from the inside out

April 30, 2024

Chief Security Officers (CSOs) and Chief Information Security Officers (CISOs) are facing an ever-increasing barrage of challenges. Balancing tightening budgets with new SEC mandates and increased scrutiny from stakeholders, security executives have plenty keeping them up late at night.

While protecting organizations against external security threats is a widely recognized aspect of the CSO’s role, the truth is, often the call comes from inside the house. 

A 2023 Forrester survey found that 39% of respondents estimated their company lost up to one million dollars to disruptions in the preceding month, so it’s imperative that CSOs work to prevent these outages from the inside out.

Internal issues, such as DNS misconfigurations, network congestion, or monitoring and alerting failures, pose significant risks to organizations' networks, complicating the already intricate task of managing and protecting digital assets. While cybersecurity threats are real and present, networking and connectivity issues are reported as the leading cause of IT service-related outages.

The impact of undiscovered network issues is profound, ranging from immediate financial losses to irreparable damage to customer trust. Beyond the risk of incidents escalating to prolonged outages, soaring consumer expectations demand speedy, responsive, and reliable sites. In fact, 82% of consumers say slow page speeds impact their purchasing decisions, and 40% of consumers won't wait more than three seconds before abandoning a site.

When it comes to a hybrid workforce, employee experience expectations are equally lofty. Even without a widespread, publicized outage, a compromised user experience can negatively impact sales or workforce productivity, resulting in incalculable loss for a company. 

For example, a Slack outage in 2021 caused by an internal DNS misconfiguration left some users unable to access desktop, mobile and web applications for more than 15 hours. With Slack’s status page down due to the same issue, users were left confused and struggling to identify where the disruption originated. Unfortunately, each year we see similar downstream user impact making headlines, frequently due to internal incidents.

These scenarios aren’t uncommon, with 10 to 20 high-profile IT outages or data center events occurring globally each year. Outages like Slack’s serve as a lesson for navigating the modern landscape of challenges: CSOs must command an effective, comprehensive harmonization of development, security, and operations, looking beyond internal threats to the full economic impact on workforce productivity and customer experience.

1. Deploy internal and Internet network monitoring and anomaly detection tools

Beyond addressing security threats, implementing robust Internet Performance Monitoring (IPM) tools enables CSOs to proactively identify issues before they can compromise end user experience or escalate to full-blown incidents.

For instance, comprehensive IPM solutions can identify hijacks, leaks, or performance anomalies, proactively flagging potential issues, whether malicious activity or honest misconfiguration, to reduce response time and minimize impact. A lack of visibility into the end user experience is a frequent, ongoing challenge for IT operations teams.

Application Performance Management (APM) or Network Performance Monitoring (NPM) tools on their own are ill-equipped to handle the variability and instability that come with the Internet touching every aspect of business today. By promptly detecting and addressing these concerns with full-stack observability, organizations can prevent inadvertent disruptions and reduce digital friction, ensuring smooth operations and an uninterrupted user experience.
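As a rough illustration of the kind of anomaly detection described above, the sketch below flags response-time samples that deviate sharply from a rolling baseline. It is a minimal example under stated assumptions, not a description of how any particular IPM product works: the function name, the window size, the threshold, and the sample data are all hypothetical.

```python
from statistics import median


def flag_latency_anomalies(samples_ms, window=10, threshold=5.0):
    """Return indexes of samples that deviate sharply from the recent baseline.

    Each sample is compared with the median of the preceding `window` samples.
    Deviation is measured in units of the median absolute deviation (MAD),
    which is less distorted by earlier spikes than a mean and standard deviation.
    """
    anomalies = []
    for i in range(window, len(samples_ms)):
        baseline = samples_ms[i - window:i]
        center = median(baseline)
        mad = median(abs(x - center) for x in baseline)
        # Guard against a zero MAD when the baseline is perfectly flat
        # (assumed floor of 1 ms, chosen only for illustration).
        scale = max(mad, 1.0)
        if abs(samples_ms[i] - center) / scale > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical response times in milliseconds, with one spike at index 12.
latencies = [120, 118, 125, 122, 119, 121, 124, 120, 123, 118, 122, 121, 950, 120]
print(flag_latency_anomalies(latencies))  # prints [12]
```

A median-based baseline is used here so that a single spike does not inflate the baseline and mask the samples that follow it. A production system would likely also account for seasonality and for measurements taken from multiple vantage points.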

A 2024 survey of Site Reliability Engineers (SREs) found that within IT teams, individual contributors and business leaders unanimously agreed that third-party services are a strategic necessity for modern reliability practices. Investment in such technologies bolsters the overall resilience of the organization against a wide range of internal issues and operational risks.

2. Develop and test incident response plans

A comprehensive incident response plan tailored to the organization's specific needs is crucial to mitigating the impact of incidents when—not if—they occur. Crafting and regularly testing response plans is an essential step for effectively managing both external threats and internal disruptions within an organization.

CSOs should ensure their organization’s response plan outlines clear protocols for detecting and resolving security incidents, as well as procedures for addressing inadvertent disruptions caused by internal factors.

By simulating various scenarios, organizations can evaluate the effectiveness of their response strategies and identify areas for improvement. A well-prepared incident response plan ensures swift and efficient recovery from internal disruptions, minimizing downtime and preserving business continuity.
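One way to make that evaluation concrete is to score each simulated scenario on how long detection and recovery took. The sketch below is a hypothetical example: the DrillResult fields, the sample timestamps, and the 15-minute detection target are assumptions chosen for illustration, not figures drawn from any standard or from the incidents discussed here.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class DrillResult:
    scenario: str
    started: datetime    # when the simulated fault was injected
    detected: datetime   # when the team was alerted or noticed
    resolved: datetime   # when service was restored

    @property
    def time_to_detect(self) -> timedelta:
        return self.detected - self.started

    @property
    def time_to_resolve(self) -> timedelta:
        return self.resolved - self.started


def drills_missing_target(drills, detect_target=timedelta(minutes=15)):
    """Return the scenarios whose detection time exceeded the target."""
    return [d.scenario for d in drills if d.time_to_detect > detect_target]


# Hypothetical drill log: two simulated internal disruptions.
drills = [
    DrillResult("DNS misconfiguration",
                datetime(2024, 4, 1, 9, 0),
                datetime(2024, 4, 1, 9, 40),
                datetime(2024, 4, 1, 11, 0)),
    DrillResult("Network congestion",
                datetime(2024, 4, 8, 9, 0),
                datetime(2024, 4, 8, 9, 5),
                datetime(2024, 4, 8, 9, 45)),
]
print(drills_missing_target(drills))  # prints ['DNS misconfiguration']
```

Tracking these numbers across repeated drills gives a simple way to see whether changes to the plan are shortening detection and recovery over time.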

By proactively preparing for incidents, organizations can reduce financial losses and uphold their reputation when disruptions do occur.

3. Prioritize a balanced, blameless culture

Establishing a blameless culture within the organization is fundamental for preventing internal disruptions. In combination with promoting awareness and continuous learning among employees, organizations can leverage a blameless culture to mitigate the risk of inadvertent incidents caused by human error.

On my team, we fully embrace the idea of “radical transparency.” In fact, Catchpoint was founded in the spirit of radical transparency and failing fast, after founder & CEO Mehdi Daoudi took down the system for three hours as an employee at DoubleClick. Luckily, Mehdi’s manager at the time took a blameless approach, Mehdi kept his job, and the DoubleClick team was able to learn from the failure.

This ethos can be the difference between sustained success and failure for an organization in any industry, but it is particularly relevant as the intersection and collaboration between DevOps and security grows.

Implementing an open and honest culture allows DevSecOps personnel to focus on resolving issues rather than on assigning blame. Incidents should be viewed as valuable learning opportunities, not as occasions for finger-pointing.

Following an incident, teams should conduct thorough post-mortems to analyze what went wrong and identify areas for improvement. By openly discussing incidents and sharing insights, organizations can empower teams to prevent similar incidents in the future.

With an open, blameless, and inquisitive culture in place, employees are empowered to prioritize resilience and accountability, prevent internal disruptions, and fortify defenses against a range of potential risks. As exemplified by Catchpoint’s own inception, leaning into a blameless, transparent culture could lead to important discoveries for your entire team.

In combination, these proactive measures enable swift detection and resolution of issues, ensuring smoother operations and preserving brand reputation. By deploying network monitoring tools, developing robust incident response plans, and fostering a blameless culture, CSOs can prevent internal disruptions and fortify organizational resilience.

With these strategies in hand, CSOs can confidently navigate the challenges of the digital age, safeguarding against costly outages and mitigating potential threats.

Each year, we hear a new version of any CSO’s nightmare: a major, preventable outage with widespread impact on end users, damaging brand reputation and potentially costing millions. Fortunately, by leveraging the right tools and strategies, CSOs can enable their increasingly integrated DevSecOps resources to address issues from the inside out, prevent disruptions, and as a result, sleep a bit better at night.

Leo Vasiliou is Director of Product Marketing at Catchpoint. Vasiliou has 30 years of experience and has progressed from an Electronic Computer and Switching Systems Specialist in the U.S. Air Force to an expert in web performance and IT management, currently enhancing Internet performance and digital experience monitoring at Catchpoint.