Handling Cloud Outages: A Data Platform Architect’s Perspective

Rabby

09/11/2025

In today’s technology-driven world, cloud services create vast opportunities — but they also come with risks. Many businesses experienced this firsthand when AWS went down today. It made me reflect on how I, as a Data Platform Architect, would handle such a situation.

Every platform implementation is unique, depending on whether it’s built on IaaS, PaaS, or SaaS, so your recovery plan should differ accordingly. Your company’s IT governance, disaster recovery (DR), and business continuity (BC) plans will serve as the key guiding frameworks in such events. If you don’t have a plan in place yet, here are a few key areas I would start by checking, grouped into the three sections below.

Immediate Actions

  1. Keep a cool head. Think of yourself as a pilot flying through turbulence in the clouds.
  2. Check the cloud vendor’s (e.g. AWS, Azure) service status page, including the status of your primary region and its Availability Zones. Your other suppliers may use the same cloud provider, so the impact could be wider than your own platform.
  3. Review pipeline failures, gateways, database logs, queues, network ports, DNS, firewalls, CPU, disk status, caches and any authentication failures.
  4. Check the integrity of your database by running a few admin queries. Look for nulls, ghosted rows or corrupted values caused by missing data.
  5. Use data lineage to understand which downstream data products are impacted.
  6. Depending on the criticality, form a control tower team (also known as a crisis management team) with key members, and make sure you have an agreed Recovery Time Objective (RTO) and Recovery Point Objective (RPO).
  7. Alert the admins, impacted business teams and vendors so that the business does not make critical decisions based on corrupt or stale data.
  8. Understand which services in your infrastructure are down, hung, unresponsive or facing performance issues.
  9. Assign support engineers to analyse and monitor the technical issues and report to the control tower at agreed intervals.
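
Step 4 above can be turned into a small repeatable script. The following is a minimal sketch, assuming a SQL database reachable from Python; it uses the standard library's sqlite3 module with an in-memory table purely for illustration. The table name `orders`, its columns and the helper `integrity_report` are hypothetical, and real checks would depend on your own schema and database engine.

```python
import sqlite3


def integrity_report(conn, table, required_columns, key_column):
    """Count rows, NULLs in required columns and duplicated keys.

    Identifiers are interpolated into the SQL, so pass only trusted,
    hard-coded table and column names (never user input).
    """
    cur = conn.cursor()
    report = {"row_count": cur.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]}
    for column in required_columns:
        query = f"SELECT COUNT(*) FROM {table} WHERE {column} IS NULL"
        report[f"null_{column}"] = cur.execute(query).fetchone()[0]
    duplicate_query = (
        f"SELECT COUNT(*) FROM ("
        f"SELECT {key_column} FROM {table} "
        f"GROUP BY {key_column} HAVING COUNT(*) > 1)"
    )
    report["duplicate_keys"] = cur.execute(duplicate_query).fetchone()[0]
    return report


# Hypothetical sample data standing in for a table hit by a partial load.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, amount REAL, loaded_at TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, 10.0, "2025-01-01T08:00:00"),
        (2, None, "2025-01-01T08:05:00"),  # amount missing
        (2, 5.0, "2025-01-01T08:05:00"),   # same key loaded twice
        (3, 7.5, None),                    # load timestamp missing
    ],
)
report = integrity_report(conn, "orders", ["amount", "loaded_at"], "order_id")
print(report)
```

Comparing these counts with the figures from the last known good load gives the control tower a quick, factual view of whether the data can still be trusted.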

Actions for Recovery and Resolution

  1. Consider backups and restore points.
  2. Confirm database integrity and run sanity checks on data volumes.
  3. Consider restarting specific services rather than the whole VM or cluster.
  4. Consider running at a minimal service level, but keep it running. It’s better to have a slower service than none at all.
  5. Refresh downstream data products in priority order.
  6. Ensure your downstream vendors get their data.
  7. Monitor the service and provide extra support until business as usual (BAU) resumes.
  8. Keep stakeholders updated.
  9. Ensure the engineers’ extra effort is recognised by the crisis team and the senior management team.
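
Step 5 above, refreshing downstream data products by priority, can be planned from lineage. The following is a minimal sketch, assuming you can export lineage as a mapping from each data product to the upstream products it reads from, plus a business priority per product (lower number means more critical). The product names, the priorities and the helper `refresh_order` are all hypothetical. It uses Kahn's topological sort with a heap, so a product is only refreshed after its upstreams, and the most critical ready product goes first.

```python
import heapq


def refresh_order(dependencies, priority, default_priority=99):
    """Return a refresh order that respects lineage and business priority.

    dependencies: product -> list of upstream products it reads from.
    priority: product -> int, lower means more critical.
    """
    products = set(dependencies) | {u for ups in dependencies.values() for u in ups}
    # Number of upstreams each product is still waiting for.
    remaining = {p: len(set(dependencies.get(p, []))) for p in products}
    downstream = {p: [] for p in products}
    for product, upstreams in dependencies.items():
        for upstream in set(upstreams):
            downstream[upstream].append(product)

    # Products with no pending upstreams are ready to refresh.
    ready = [(priority.get(p, default_priority), p) for p, n in remaining.items() if n == 0]
    heapq.heapify(ready)

    order = []
    while ready:
        _, product = heapq.heappop(ready)
        order.append(product)
        for child in downstream[product]:
            remaining[child] -= 1
            if remaining[child] == 0:
                heapq.heappush(ready, (priority.get(child, default_priority), child))

    if len(order) != len(products):
        raise ValueError("Lineage graph contains a cycle")
    return order


# Hypothetical lineage and priorities for illustration only.
dependencies = {
    "sales_mart": ["raw_orders"],
    "finance_report": ["sales_mart"],
    "vendor_extract": ["sales_mart"],
    "marketing_dashboard": ["raw_clicks"],
}
priority = {
    "raw_orders": 1,
    "sales_mart": 1,
    "vendor_extract": 1,
    "finance_report": 2,
    "raw_clicks": 3,
    "marketing_dashboard": 3,
}
plan = refresh_order(dependencies, priority)
print(plan)
```

Ties between products of equal priority are broken alphabetically here, which keeps the plan deterministic; in practice the crisis team would agree the priorities up front.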

Preventive Actions for the Future

  1. If your service is SaaS, pray to God it does not happen again. 😊
  2. If it’s IaaS or PaaS, review your DR plan and make sure you append your learnings.
  3. Build or update the DR/BC scenarios from your data platform services’ point of view, taking into account your immediate vendors such as data suppliers and ingestion, data governance and analytical tools.
  4. Raise the risk item in your risk register.
  5. In the cloud world there are always ways to reduce downtime, depending on your RTO and RPO objectives.
  6. Consider how this failure could have been avoided or better handled, build a business case with the crisis management team and propose the change to senior management.
  7. Update your alerts, policies, frameworks and process flows as required.
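
When reviewing the DR plan against your RTO and RPO objectives, it helps to measure what the last incident actually achieved. The following is a minimal sketch, assuming you can pull three timestamps from your incident log: the last good backup, the start of the outage and the moment service was restored. The helper `recovery_metrics` and all timestamps and targets below are hypothetical, illustrative values.

```python
from datetime import datetime, timedelta


def recovery_metrics(last_good_backup, outage_start, service_restored, rto_target, rpo_target):
    """Compare achieved recovery time and data-loss window with the targets."""
    achieved_rto = service_restored - outage_start   # how long the service was down
    achieved_rpo = outage_start - last_good_backup   # how much data was at risk
    return {
        "achieved_rto": achieved_rto,
        "achieved_rpo": achieved_rpo,
        "rto_met": achieved_rto <= rto_target,
        "rpo_met": achieved_rpo <= rpo_target,
    }


# Illustrative values only, not taken from any real incident.
metrics = recovery_metrics(
    last_good_backup=datetime(2025, 1, 1, 6, 0),
    outage_start=datetime(2025, 1, 1, 8, 30),
    service_restored=datetime(2025, 1, 1, 13, 0),
    rto_target=timedelta(hours=4),
    rpo_target=timedelta(hours=3),
)
print(metrics)
```

A missed target, like the recovery time in this illustration, is the kind of evidence that strengthens a business case for more frequent backups or a standby region.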

When an outage like this happens, the impact on the public can be severe if your service is public facing, such as the NHS, National Rail or GOV.UK. So having a clear DR/BC policy for your data platform also has a social and ethical dimension, not just a technical one.

Rabby

Web Manager

© 2025 DataTech Insights. All rights reserved.