Remote Copy Data Patterns and Partnership Tuning

Hello friends. I know it has been a while, but here is part 2 of the 1920 post I published a while back. That post was largely about tuning the partnership to protect the master volume (the replication source) from impact. This week I would like to explain a bit about how data is replicated, to help you understand why 1920 events and backpressure happen and, hopefully, how to avoid the situation in the first place.

Types of Remote Copy Relationships in SpecV

Metro Mirror (Synchronous/Synchronous)

Metro Mirror is what I call a Sync/Sync replication. By this I mean that the RPO is synchronous or, in layman's terms, the two sides of the mirror are always identical. The data pattern for the replication is synchronous as well, meaning we forward new writes as we receive them.

When the primary host sends a write to the master volume, we forward it to the remote cluster. The remote cluster caches the write and sends an acknowledgement (ack) back to the master cluster. The master cluster also caches the write, but waits until it receives the ack from the remote cluster before sending the ack to the host. This means there is a direct correlation between the round-trip latency between clusters and the latency seen by the host.
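To make that correlation concrete, here is a minimal sketch of the write path described above. It is my own simplified model, not how the product measures anything: it assumes the local cache write and the remote round trip run in parallel and ignores processing overhead. The function name and the sample numbers are made up for illustration.

```python
def metro_mirror_host_latency_ms(local_cache_ms, remote_cache_ms, round_trip_ms):
    """Rough host-visible latency for one synchronous (Metro Mirror) write.

    Simplified assumption: the master caches the write while the copy is in
    flight to the remote cluster, and the host ack is held until both the
    local cache write and the remote ack have completed.
    """
    remote_path_ms = round_trip_ms + remote_cache_ms
    return max(local_cache_ms, remote_path_ms)


# Same clusters, two different inter-site round-trip times (hypothetical values).
print(metro_mirror_host_latency_ms(0.25, 0.25, 1.0))   # 1.25
print(metro_mirror_host_latency_ms(0.25, 0.25, 10.0))  # 10.25
```

In this model every extra millisecond of round-trip time lands directly on the host write, which is why distance matters so much for Metro Mirror.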

Global Mirror (Asynchronous/Synchronous)

Global Mirror, by contrast, is what I refer to as Async/Sync mirroring, meaning the RPO is asynchronous and the two sides are a few seconds apart in terms of consistency. However, it shows a synchronous data pattern by forwarding writes as we receive them.

The main difference between Global Mirror and Metro Mirror is that Global Mirror does not wait for an acknowledgement from the remote cluster before acknowledging write completion to the host. Although this is an asynchronous replication, any impediment to sending the data to the remote cluster will still cause backpressure and increased write latency for the master volume.

Global Mirror with Change Volumes (Asynchronous/Asynchronous)

As you can imagine, having both the synchronous and asynchronous mirroring technologies show a synchronous data pattern is a poor fit for true long-distance links or native IP links, where there will be lots of network loss and a need to retransmit data. The solution that was implemented combines FlashCopy and Global Mirror to make a replication that is truly asynchronous in both RPO (which in this case is minutes) and data pattern. This is accomplished by removing the master volume from the replication task itself.

What happens here is that a FlashCopy is periodically started to the master change volume. The change volume then does a background copy to synchronize the sites. I/O to production hosts continues as if nothing happened; FlashCopy processing still takes place once the write is cached, but that is a conversation for another post. The benefit of this style of replication is that backpressure from the inter-site link or the remote cluster is not noticed by the master volume.

Partnership Considerations for Each Relationship Type

Partnership Settings

svcinfo lspartnership -delim : REDACTED
location:remote                      <==Indicates this is a remote copy partner
code_level: (build 142.6.1805101736000)
gm_link_tolerance:300                <== Link Tolerance
relationship_bandwidth_limit:25      <== Relationship Limit
gm_max_host_delay:5                  <== Max Host Delay
link_bandwidth_mbits:800             <== Link Bandwidth
background_copy_rate:100             <== Background Copy Rate
max_replication_delay:0              <== Max Replication Delay

Metro Mirror and Global Mirror

In Metro Mirror and Global Mirror the bandwidth demand for replication is the peak write data rate for all the replicated volumes. In an ideal world the inter-site link will be able to handle this demand. In practice this is rarely the case. This is where the partnership configuration becomes important to maintaining synchronized copies and minimizing impact to normal operations.

For initial synchronization, the Link Bandwidth should be equal to the inter-site link's bandwidth, and the Background Copy Rate should be the percentage of that bandwidth dedicated to initial sync and resync tasks. In addition, the Relationship Bandwidth Limit is a cap on the bandwidth for a single remote copy relationship and is a system-wide parameter. I suggest leaving it at the default, which is 25 MB/s, in most cases.
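As a quick worked example of how these three settings interact, here is a small sketch using the values from the lspartnership output above (800 Mbits link, 100% background copy rate, 25 MB/s relationship limit). The function names are mine, and the arithmetic is a plain bits-to-bytes conversion that ignores protocol overhead.

```python
import math


def background_copy_budget_mb_s(link_bandwidth_mbits, background_copy_rate_pct):
    """Bandwidth (MB/s) available for sync/resync traffic on the partnership."""
    return link_bandwidth_mbits * background_copy_rate_pct / 100 / 8


def relationships_needed_to_fill(budget_mb_s, relationship_limit_mb_s=25):
    """How many relationships must sync at once to use the whole budget,
    given that each one is capped at the relationship bandwidth limit."""
    return math.ceil(budget_mb_s / relationship_limit_mb_s)


budget = background_copy_budget_mb_s(800, 100)
print(budget)                                # 100.0 MB/s
print(relationships_needed_to_fill(budget))  # 4
```

In other words, with the default limit a single relationship will not use more than a quarter of this link during its initial sync, no matter how idle the link is.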

After the relationship enters a consistent state (the two copies have completed the initial sync), the relationship sends all write data across the link as it is received. This means that while replication is consistent, the bandwidth and limit settings don't matter; only the 1920 policing parameters apply, to stop replication where needed. Because the bandwidth requirement shifts to the incoming write data rate after the initial synchronization, it is important to have sufficient bandwidth on the network to handle the load reliably.
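A rough way to sanity check this is to compare the peak of your aggregate write rate against the link. The sketch below assumes you already have write-rate samples in MB/s for all replicated volumes combined; the sample values and the function name are hypothetical.

```python
def link_headroom_mbits(write_rate_samples_mb_s, link_mbits):
    """Spare link capacity (Mbits) at the peak observed write rate.

    A negative result means the peak write rate exceeds what the link can
    carry, which is where backpressure would be expected to start.
    """
    peak_mb_s = max(write_rate_samples_mb_s)
    return link_mbits - peak_mb_s * 8  # 8 bits per byte


samples = [40, 65, 110, 80]  # hypothetical aggregate write rates in MB/s
print(link_headroom_mbits(samples, link_mbits=800))  # -80
```

Here the 110 MB/s peak needs 880 Mbits, so an 800 Mbits link falls short even though the average looks comfortable.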

For the protection settings, the max host delay should be the maximum average latency your application can handle (in ms), and the link tolerance should be the time (in seconds) that this latency can be tolerated. If you have very sensitive applications, you can set the max replication delay to the highest acceptable peak latency (in seconds). For reference, most host timeout settings are suggested to be either 60 or 120 seconds, but applications may be more sensitive. Setting these values appropriately helps by stopping replication before backpressure from the replication impacts the production host.

Global Mirror with Change Volumes

For Global Mirror with Change Volumes the remote copy itself takes place between change volumes, and by definition this is always a re-synchronization task. This means GMCV depends more heavily on these settings to maintain consistent copies. For GMCV, the link bandwidth should match the data rate you want to achieve (remember this is in Mbps, not MB/s, so convert bytes to bits to get the right value). This value is also pre-compression, so if you are compressing the partnership and want to saturate the link, you need to factor in the compression ratio. The background copy rate should be 100%. This will make the system send as much data as you define (or as close to it as the link allows). We don't concern ourselves too much with the protection settings that would trigger a 1920, because in GMCV link backpressure won't impact master volume response times.
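The unit conversion and the compression adjustment are easy to get wrong, so here is a minimal sketch of both. The helper names are mine, and the compression ratio is an assumption you would have to measure yourself; I express it as uncompressed size divided by compressed size (so 2.0 means 2:1).

```python
def mb_per_s_to_mbits(data_rate_mb_s):
    """Convert a data rate in MB/s to the Mbits value the partnership expects."""
    return data_rate_mb_s * 8


def precompression_link_bandwidth_mbits(physical_link_mbits, compression_ratio=1.0):
    """Link bandwidth value to configure if you want to fill the physical link.

    The setting is pre-compression, so it is scaled up by the (assumed)
    compression ratio of the replicated data.
    """
    return physical_link_mbits * compression_ratio


print(mb_per_s_to_mbits(100))                         # 800
print(precompression_link_bandwidth_mbits(800, 2.0))  # 1600.0
```

So a target of 100 MB/s is a setting of 800, and an 800 Mbits link carrying data that compresses 2:1 would need a setting of about 1600 to stay full.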

I hope you all found this helpful and informative. If you have any questions or concerns, please leave a comment, ask me on Twitter @fincherjc, or reach out to me on LinkedIn.

