Resiliency is a continuous improvement journey. A Business Impact Analysis identifies the critical applications and systems, and the DR strategy is then built around them. The DR environment needs recovery workflows that cut over DNS, promote database read replicas to writers, and so on. These workflows should be invoked from the DR environment itself, so that they still run when the primary region is down. The code pipeline should deploy code and configuration one region at a time so that failures don't cascade. Choose the right DR strategy (backup and restore, pilot light, warm standby, or active/active) and a matching data replication approach to meet RPO and data consistency requirements.
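The one-region-at-a-time rule can be sketched as a small rollout loop. This is only an illustration: `staged_rollout`, `deploy`, and `is_healthy` are hypothetical names, and the two callables stand in for whatever the real pipeline uses to deploy and to verify a region.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def staged_rollout(
    regions: Sequence[str],
    deploy: Callable[[str], object],
    is_healthy: Callable[[str], bool],
) -> Tuple[List[str], Optional[str]]:
    """Deploy to regions in order and stop at the first unhealthy one.

    Returns (regions deployed and verified, region that failed or None).
    `deploy` and `is_healthy` are hypothetical hooks supplied by the caller.
    """
    completed: List[str] = []
    for region in regions:
        deploy(region)
        if not is_healthy(region):
            # Halt here so the bad change never reaches the remaining regions.
            return completed, region
        completed.append(region)
    return completed, None
```

Because the loop returns as soon as a region fails its health check, a bad release is confined to a single region while the others keep serving the previous version.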
Sidenote: P99 latency is the 99th latency percentile. This means 99% of requests complete faster than the given latency number. Put differently, only 1% of requests are slower than your P99 latency. The CAP theorem (also called Brewer's theorem after its originator, Eric Brewer) states that any distributed system or data store can simultaneously provide only two of three guarantees: consistency, availability, and partition tolerance. For multi-region, choose between CP and AP depending on whether consistency or availability matters more.
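To make the P99 definition concrete, here is a minimal sketch that computes a percentile with the nearest-rank method. The function name `percentile` and the choice of nearest-rank are assumptions for illustration; monitoring tools may interpolate and report slightly different values.

```python
import math
from typing import Sequence


def percentile(latencies_ms: Sequence[float], p: float) -> float:
    """Return the p-th percentile of the samples using the nearest-rank method."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(latencies_ms)
    # 1-based rank of the smallest sample that at least p% of samples do not exceed.
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]
```

With 1,000 recorded request latencies, `percentile(samples, 99)` returns the 990th smallest value, so only the slowest 10 requests (1%) took longer than it.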
DR is also a business decision: the cost and complexity of building a DR environment must be weighed against the cost of downtime. Part of DR is to periodically test both failure detection and recovery (with a service such as AWS Fault Injection Service). Before switching over, make sure the required capacity and up-to-date data are present in the DR region. One can automate the steps to reduce RTO, but the decision to switch over remains a manual and involved one. Route 53 Application Recovery Controller provides readiness checks, routing controls, and safety rules for the switchover. To be fully consistent, data writes have to be acknowledged by both regions, which adds latency to every write. Synchronous replication therefore has tradeoffs, and asynchronous writes across regions are advisable.
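The pre-switchover checks (enough capacity, sufficiently fresh data) can be sketched as a simple gate. This is a toy model and not the Route 53 ARC API: `StandbyStatus`, `failover_blockers`, and their fields are hypothetical, and the inputs would come from your own monitoring.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class StandbyStatus:
    """Hypothetical snapshot of the DR region taken before a switchover."""
    replication_lag_s: float  # how far the DR data trails the primary, in seconds
    available_capacity: int   # units (instances, tasks) ready to serve traffic


def failover_blockers(
    status: StandbyStatus, rpo_s: float, required_capacity: int
) -> List[str]:
    """Return the reasons the DR region is not ready; an empty list means ready."""
    blockers: List[str] = []
    if status.replication_lag_s > rpo_s:
        blockers.append(
            f"replication lag {status.replication_lag_s}s exceeds RPO {rpo_s}s"
        )
    if status.available_capacity < required_capacity:
        blockers.append(
            f"capacity {status.available_capacity} is below required {required_capacity}"
        )
    return blockers
```

An empty result only means the automated checks passed. The decision to actually switch over still sits with a person, as noted above.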
Detection relies on observability signals, so the observability stack should be implemented in both regions. The tradeoffs: operational overhead, complexity, and cost.
IaC helps with consistent deployments but has its own challenges: fix environment inconsistencies, use smaller templates, and deploy one region at a time (using wait conditions in CloudFormation). Use latency-based routing to reach the fastest region, and use Route 53 ARC to shift traffic away from a region (by attaching health checks to the latency records) during maintenance. Deploy AWS WAF in front of API Gateway to protect against SQL injection and cross-site scripting (XSS) attacks.
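The combined effect of latency-based routing and health checks can be modelled in a few lines. This is a simplified illustration of the behavior, not how Route 53 is implemented: `pick_region` and its inputs are hypothetical.

```python
from typing import Mapping, Optional


def pick_region(
    latency_ms: Mapping[str, float], healthy: Mapping[str, bool]
) -> Optional[str]:
    """Return the lowest-latency region whose health check passes, or None.

    Regions missing from `healthy` are treated as unhealthy.
    """
    candidates = {
        region: ms for region, ms in latency_ms.items() if healthy.get(region, False)
    }
    if not candidates:
        return None
    return min(candidates, key=lambda region: candidates[region])
```

Marking a region unhealthy, for example during maintenance, removes it from the candidates, so traffic moves to the next fastest region without any change to the latency records themselves.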
Observability needs to happen from multiple viewpoints: server logs (error rates and processing time), checks from outside the region (overall workflow times), and CloudWatch logs for user request durations. For the operations team, policies, procedures, runbooks, and service quotas will need to change when the deployment is multi-region. Use Route 53 ARC to detect inconsistencies between regions so users have a similar experience in all regions.
Operational considerations: latency (choose regions based on where the user base is), cost (some services are priced differently in different regions), and service availability (not every service is offered in every region).