4 changes: 3 additions & 1 deletion content/operate/rs/databases/active-active/_index.md
Original file line number Diff line number Diff line change
Expand Up @@ -59,4 +59,6 @@ Other Redis Enterprise Software features can also be used to enhance the perform

- [Plan your Active-Active deployment]({{< relref "/operate/rs/databases/active-active/planning.md" >}})
- [Get started with Active-Active]({{< relref "/operate/rs/databases/active-active/get-started.md" >}})
- [Create an Active-Active database]({{< relref "/operate/rs/databases/active-active/create.md" >}})
- [Create an Active-Active database]({{< relref "/operate/rs/databases/active-active/create.md" >}})
- [Develop applications with Active-Active databases]({{<relref "/operate/rs/databases/active-active/develop/develop-for-aa">}})
- Review [disaster recovery strategies for Active-Active databases]({{< relref "/operate/rs/databases/active-active/disaster-recovery" >}})
Expand Up @@ -15,6 +15,10 @@ An application deployed with an Active-Active database connects to a replica of
If that replica is not available, the application can fail over to a remote replica, and fail back again if necessary.
In this article we explain how this process works.

{{<note>}}
For other disaster recovery strategies including network-based, proxy-based, and client library approaches, see [Active-Active disaster recovery strategies]({{<relref "/operate/rs/databases/active-active/disaster-recovery">}}).
{{</note>}}

Active-Active connection failover can improve data availability, but can negatively impact data consistency.
Active-Active replication, like Redis replication, is asynchronous.
An application that fails over to another replica can miss write operations.
Expand All @@ -28,6 +32,8 @@ Your application can detect two types of failure:
1. **Local failures** - The local replica is down or otherwise unavailable
1. **Replication failures** - The local replica is available but fails to replicate to or from remote replicas

You can also use [database availability API requests]({{<relref "/operate/rs/monitoring/db-availability">}}) to determine if a database replica is available to handle read and write operations. Lag-aware database availability requests consider CRDT replication lag as a health check criterion to prevent reading stale data during failback scenarios.

### Local Failures

Local failure is detected when the application is unable to connect to the database endpoint for any reason. Reasons for a local failure can include multiple node failures, configuration errors, refused connections, connection timeouts, and unexpected protocol-level errors.
128 changes: 128 additions & 0 deletions content/operate/rs/databases/active-active/disaster-recovery/_index.md
@@ -0,0 +1,128 @@
---
Title: Disaster recovery strategies for Active-Active databases
alwaysopen: false
categories:
- docs
- operate
- rs
- rc
description: Disaster recovery strategies for Active-Active databases using network, proxy, client library, and application-based approaches.
linkTitle: Disaster recovery
weight: 50
---

An application deployed with an Active-Active database connects to a database member that is geographically nearby. If that database member becomes unavailable, the application can fail over to a secondary Active-Active database member, and fail back to the original database member again if it recovers.

However, Active-Active Redis databases do not have a built-in [failover](https://en.wikipedia.org/wiki/Failover) or failback mechanism for application connections. To implement failover and failback, you can use one of the following disaster recovery strategies:

- [Network-based]({{<relref "/operate/rs/databases/active-active/disaster-recovery/network-based">}}): Global traffic managers and load balancers for routing.

- [Proxy-based]({{<relref "/operate/rs/databases/active-active/disaster-recovery/proxy-based">}}): Software proxies handle detection and routing logic.

- [Client library-based]({{<relref "/operate/rs/databases/active-active/disaster-recovery/client-library-based">}}): Database client libraries with built-in failover logic.

- [Application-based]({{<relref "/operate/rs/databases/active-active/disaster-recovery/application-based">}}): Custom application-level monitoring and connectivity management.

## Considerations for disaster recovery

When implementing a disaster recovery strategy for an Active-Active database, consider the following:

- Is the Active-Active database an on-premises, cloud, multi-cloud, or hybrid-cloud deployment?

- Number of regions and availability zones.

- Application server redundancy and deployment locations.

- Acceptable values for the maximum amount of data that can be lost during a failure (Recovery Point Objective) and the maximum acceptable time to restore service after a failure (Recovery Time Objective).

- Latency and throughput requirements.

- Number of application errors that can be tolerated during a failure.

- Tolerance for reading stale but eventually consistent data during a failover scenario.

- Is concurrent access, in which different application servers can read from or write to different Active-Active database members, acceptable?

- Are there any regulatory or policy requirements for disaster recovery?

- Does the application connect to the Active-Active database using a Redis client library or through a development framework or ecosystem?

- Does the Active-Active database use DNS, the [OSS Cluster API]({{<relref "/operate/rs/clusters/optimize/oss-cluster-api">}}), or the [discovery service]({{<relref "/operate/rs/databases/durability-ha/discovery-service">}})?

- Is rate-limiting control needed?

- Can you modify the existing codebase or introduce new components, such as load balancers or proxies?

## Detect failures with health checks

Depending on the failover strategy, you can use the following health checks to detect Active-Active database failures and determine whether to fail over to a secondary Active-Active member, or fail back to the primary member after the preferred endpoint is back online.

To determine which health checks to use, consider factors such as detection speed, integrability with the failover strategy, and writability or durability guarantees.
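
The failover and failback decision described above can be sketched as a preference-ordered selection. This is a minimal illustration, not part of any Redis client or API: `select_endpoint`, the endpoint strings, and the `is_healthy` callable are hypothetical names, and `is_healthy` stands in for whichever health check you choose.

```python
from typing import Callable, Optional, Sequence


def select_endpoint(
    endpoints: Sequence[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first healthy endpoint in preference order, or None.

    `endpoints` is ordered from most to least preferred, so the primary
    member is chosen again (failback) as soon as its health check passes.
    """
    for endpoint in endpoints:
        if is_healthy(endpoint):
            return endpoint
    return None
```

Because failback happens as soon as the primary member passes the check, a check that accounts for replication lag is a natural candidate for `is_healthy` when stale reads after failback are a concern.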

### Lag-aware database availability requests

Lag-aware database availability requests are the recommended method to detect database failures in Redis Enterprise Software deployments. This method guarantees that all the shards of a clustered database are connectable.

See [Lag-aware database availability requests]({{<relref "/operate/rs/monitoring/db-availability#lag-aware">}}) for more information.

{{<note>}}
Lag-aware database availability requests are not supported for Redis Cloud databases.
{{</note>}}

### Redis connection health checks

You can use an existing connection to the database to check its availability.

#### PING command

The [`PING`]({{<relref "/commands/ping">}}) command checks the following:

- The database is connectable.

- The database is readable.

- The dataset is available.

Example response for an available database:

```
127.0.0.1:6379> PING
PONG
```

If a database is connectable but not available for reads, such as when reloading from a snapshot, `PING` returns an error message:

```
127.0.0.1:6379> PING
(error) LOADING Redis is loading the dataset in memory
```
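
The two responses above can be told apart in application code. The following sketch assumes a client object with a `ping()` method, such as a redis-py `redis.Redis` instance. The `Health` enum, the `ping_health` function, and the `loading_errors` parameter are hypothetical names introduced for illustration. With redis-py, the exception type for a `LOADING` reply would be `redis.exceptions.BusyLoadingError`.

```python
from enum import Enum
from typing import Tuple, Type


class Health(Enum):
    AVAILABLE = "available"
    LOADING = "loading"          # connectable, but not readable yet
    UNREACHABLE = "unreachable"


def ping_health(
    client,
    loading_errors: Tuple[Type[BaseException], ...] = (),
) -> Health:
    """Classify a database member by the outcome of PING.

    `client` is any object with a ping() method. `loading_errors` lists the
    exception types that signal a LOADING reply (an assumption about your
    client library; pass the type it raises for that reply).
    """
    try:
        client.ping()
    except loading_errors:
        # Checked first because a loading error may subclass a connection error.
        return Health.LOADING
    except Exception:
        return Health.UNREACHABLE
    return Health.AVAILABLE
```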

#### Connection timeouts or Redis errors

By capturing connection errors, you can determine when to fail over to a secondary Active-Active member or fail back to the primary member based on the [circuit breaker pattern](https://en.wikipedia.org/wiki/Circuit_breaker_design_pattern).
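
A circuit breaker of this kind can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the behavior of any Redis client library: the class name, the default threshold of 3 failures, and the 10-second reset timeout are all hypothetical.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Minimal circuit breaker: closed -> open -> half-open -> closed."""

    def __init__(
        self,
        failure_threshold: int = 3,
        reset_timeout: float = 10.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means closed

    def allow_request(self) -> bool:
        """Return True if the primary member may be tried right now."""
        if self._opened_at is None:
            return True
        # Half-open: after the timeout, let a trial request through.
        return self._clock() - self._opened_at >= self.reset_timeout

    def record_success(self) -> None:
        """Close the circuit, which corresponds to failing back."""
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        """Open the circuit once enough failures have accumulated."""
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()
```

While `allow_request()` returns `False`, the application sends traffic to the secondary member. A failed trial request in the half-open state restarts the timeout.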

#### Custom health check

You can also implement custom health checks.

For example, you can run a write operation against the keyspace, such as using the [`SET`]({{<relref "/commands/set">}}) command to write arbitrary data with a short expiration. This check verifies that database shards are available and writable:

```
SET <randomized_key_name> <arbitrary_value> EX 1
```

Use multiple write operations with different randomized keys so that the checks reach different shards. The more keys you write, the more likely it is that every shard has been verified.
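
The write-based check can be sketched as follows. It assumes a client with a redis-py style `set(name, value, ex=...)` method. The function name `write_probe`, the `healthcheck:` key prefix, and the default of 8 samples are hypothetical choices for illustration.

```python
import uuid


def write_probe(client, samples: int = 8, ttl_seconds: int = 1) -> bool:
    """Return True only if every sampled SET succeeds.

    `client` is any object with a redis-py style set(name, value, ex=...)
    method. Random key names spread the writes across hash slots, so more
    samples make it more likely that every shard is touched.
    """
    for _ in range(samples):
        key = f"healthcheck:{uuid.uuid4().hex}"  # hypothetical key prefix
        try:
            if not client.set(key, "1", ex=ttl_seconds):
                return False
        except Exception:
            # Any error on any sampled key marks the member as unhealthy.
            return False
    return True
```

The short expiration keeps the probe keys from accumulating in the dataset.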

### Health check comparison

| Health check | Connectivity | Readability | Writability | Durability | Notes |
|--------------|--------------|-------------|--------------|------------|-------|
| Database availability requests |:white_check_mark: | | | | No guarantees on readability. For example, the shard might be reloading from a snapshot. |
| `PING` |:white_check_mark: |:white_check_mark: |:white_check_mark: | | No support for clustered databases. All `PING` requests will be forwarded to shard 1. |
| Keyspace sampling |:white_check_mark: |:white_check_mark: |:white_check_mark: |:white_check_mark: | Write operations are persisted, increasing disk usage for AOF and RDB. |

## Disaster recovery strategies

Depending on your requirements for Recovery Point Objective, Recovery Time Objective, consistency, scalability, resources, maintainability, and other factors, choose one of the following strategies to fail over to a secondary Active-Active member or fail back to the primary member.
@@ -0,0 +1,16 @@
---
Title: Application-based disaster recovery
alwaysopen: false
categories:
- docs
- operate
- rs
- rc
description: Application-based disaster recovery for Active-Active databases using custom application-level monitoring and connectivity management.
linkTitle: Application-based
weight: 40
---

For complete control over failover and failback, you can implement disaster recovery mechanisms directly in the application server.

For more information, see [Application failover with Active-Active databases]({{<relref "/operate/rs/databases/active-active/develop/app-failover-active-active">}}).
@@ -0,0 +1,54 @@
---
Title: Client library-based disaster recovery
alwaysopen: false
categories:
- docs
- operate
- rs
- rc
description: Client library-based disaster recovery for Active-Active databases using Redis client libraries with built-in failover logic.
linkTitle: Client library-based
weight: 30
---

Some Redis client libraries support geographic failover and failback. These client libraries monitor all Active-Active database members and instantiate connections for all endpoints in advance to allow faster failover and failback.

Advantages:

- No additional hardware or software components required.

- No high availability considerations.

- No scalability concerns.

- Tighter control over connectivity, such as timeouts, connection retries, and dynamic reconfiguration.

- OSS Cluster API support.

- Low latency.

Considerations:

- Requires code changes for failover and failback logic.

- Concurrent access across replicas is possible, but can be mitigated using the distributed health status provided by the database availability API requests.

- When a development framework uses Redis transparently, failover and failback might not be easy to configure.
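
The idea of instantiating connections for all endpoints in advance can be sketched as follows. This is a simplified illustration, not the API of Jedis, redis-py, or any other client library: `FailoverClient`, `connect`, and `is_healthy` are hypothetical names, and `connect` stands in for whatever creates a connection in your client library.

```python
from typing import Callable, Dict, Sequence, Tuple


class FailoverClient:
    """Hold one pre-built connection per member and pick the best one."""

    def __init__(
        self,
        endpoints: Sequence[str],
        connect: Callable[[str], object],
    ) -> None:
        # Connect to every member up front, in preference order, so that
        # failover does not pay the connection setup cost.
        self._endpoints = list(endpoints)
        self._connections: Dict[str, object] = {
            endpoint: connect(endpoint) for endpoint in self._endpoints
        }

    def active(self, is_healthy: Callable[[object], bool]) -> Tuple[str, object]:
        """Return (endpoint, connection) for the first healthy member."""
        for endpoint in self._endpoints:
            connection = self._connections[endpoint]
            if is_healthy(connection):
                return endpoint, connection
        raise RuntimeError("no healthy Active-Active member")
```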

The following diagram shows a client library-based disaster recovery approach:

<div class="flex justify-center">
<img src="../../../../../../images/active-active-disaster-recovery/client-library.svg" alt="Diagram of client libraries routing traffic to Active-Active database members" width="50%">
</div>

The following diagram shows a client library-based disaster recovery approach that also uses [connection pooling]({{<relref "/develop/clients/pools-and-muxing#connection-pooling">}}):

<div class="flex justify-center">
<img src="../../../../../../images/active-active-disaster-recovery/client-library-connection-pool.svg" alt="Diagram of client libraries with connection pooling routing traffic to Active-Active database members" width="50%">
</div>

For additional information, see the following client library guides for failover and failback:

- [Jedis (Java)]({{<relref "/develop/clients/jedis/failover">}})

- [redis-py (Python)]({{<relref "/develop/clients/redis-py/failover">}})
@@ -0,0 +1,86 @@
---
Title: Network-based disaster recovery
alwaysopen: false
categories:
- docs
- operate
- rs
- rc
description: Network-based disaster recovery for Active-Active databases using global traffic managers and load balancing solutions.
linkTitle: Network-based
weight: 10
---

Network-based solutions use DNS or load balancing to route traffic across regions without application changes.

Advantages:

- Because routing happens at the network level:

  - No application code changes are needed.

  - Development frameworks need no special support and can connect to a single Active-Active database member's endpoint.

## Cross-region availability

For cross-region availability, you can use a global traffic manager or a global load balancer.

Advantages:

- If DNS routing is available at the application level, no additional load balancer is required between the application and the data tier to resolve the Active-Active database member's FQDN, reducing latency.

- Protects against data center failure since failure in one region should not affect services running in another region.

### Global traffic manager

A global traffic manager acts as an intelligent DNS server that directs clients to healthy endpoints based on distance, latency, or availability. You should configure the traffic manager to route to the local region first and fail over to other regions if an issue occurs.

Advantages:

- High availability.

- Latency optimization.

- Seamless disaster recovery.

Considerations:

- DNS propagation delays affect failover time.

- DNS caches can impact proper functioning.

- Limited custom health check support.

- May route traffic during CRDT synchronization, causing stale data reads.
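
As a rough illustration of how DNS affects failover time, the delay before clients reach a healthy member can be estimated as the detection time plus the DNS record's time to live (TTL). This is a back-of-the-envelope sketch, not a property of any particular traffic manager. The function name and parameters are hypothetical, and it assumes that resolvers honor the TTL and that probe timeouts are negligible.

```python
def worst_case_failover_seconds(
    probe_interval: float,
    failures_to_trip: int,
    dns_ttl: float,
) -> float:
    """Estimate an upper bound on failover time for DNS-based routing.

    Detection takes up to `failures_to_trip` consecutive failed probes,
    and clients can keep using a cached record for up to `dns_ttl` seconds.
    """
    detection = probe_interval * failures_to_trip
    return detection + dns_ttl
```

For example, probing every 5 seconds, failing over after 3 failed probes, and using a 30-second TTL gives an estimate of 45 seconds. Caches that ignore the TTL can extend this further.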

The following diagram shows how a global traffic manager with DNS resolution routes traffic:

<div class="flex justify-center">
<img src="../../../../../../images/active-active-disaster-recovery/gtm-with-DNS.svg" alt="Diagram of a global traffic manager routing applications to Active-Active database members across regions" width="50%">
</div>

If the environment does not allow DNS resolution, you can use a load balancer to direct traffic to the cluster nodes:

<div class="flex justify-center">
<img src="../../../../../../images/active-active-disaster-recovery/gtm-with-load-balancer.svg" alt="Diagram of a global traffic manager with a load balancer directing traffic to Active-Active database members across regions" width="50%">
</div>

### Global load balancer

For real-time traffic control and more advanced routing logic for cross-region failover and failback, you can use a global load balancer. However, this solution can have higher latency than a global traffic manager.

The following diagram shows how a global load balancer routes traffic between regions:

<div class="flex justify-center">
<img src="../../../../../../images/active-active-disaster-recovery/global-load-balancer.svg" alt="Diagram of a global load balancer routing traffic between Active-Active database members in different regions" width="50%">
</div>

## Cross-zone availability

If your deployment does not require cross-region availability, you can use a regional load balancer to route requests to a healthy Active-Active database member in a different availability zone within the same region.

The following diagram shows how a regional load balancer routes traffic across availability zones:

<div class="flex justify-center">
<img src="../../../../../../images/active-active-disaster-recovery/regional-load-balancer.svg" alt="Diagram of a regional load balancer routing traffic across availability zones within a single region" width="50%">
</div>