Remote Copy Data Patterns and Partnership Tuning

Hello friends. I know it has been a while, but I would like to put out part 2 to the 1920 post from a while back. That post was largely about tuning the partnership to protect against impacts to the master volume (the replication source). This week I would like to explain a bit about how data is replicated, to help you understand why 1920 events and backpressure happen and how to hopefully avoid the situation in the first place.

Types of Remote Copy Relationships in SpecV

Metro Mirror (Synchronous/Synchronous)

Metro Mirror is what I call Sync/Sync replication. By this I mean that the RPO is synchronous, or in layman's terms the two sides of the mirror are always identical. Additionally, the data pattern for the replication is synchronous as well, meaning we forward new writes as we receive them, as shown here:

When the primary host sends a write to the master volume, we forward that write to the remote cluster. The remote cluster caches the write and sends an acknowledgement (ack) back to the master cluster. The master cluster also caches the write, but waits until it receives the ack from the remote cluster before sending its own ack to the host. This means there is a direct correlation between the round-trip latency between the clusters and the write latency seen by the host.
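To make that relationship concrete, here is a toy model of the write path. The function name and all the timings are my own illustrative assumptions, not measurements from any real cluster:

```python
# Toy model of Metro Mirror host write latency. The master caches the
# write locally while the forwarded copy is in flight, so the host sees
# its ack only after the slower of the two paths completes: the local
# cache commit, or the round trip plus the remote cache commit.

def mm_host_write_latency_ms(local_cache_ms, remote_cache_ms, rtt_ms):
    """Approximate host-visible write latency for a Metro Mirror volume."""
    return max(local_cache_ms, rtt_ms + remote_cache_ms)

# Example: 0.3 ms cache commits on both sides, 5 ms round trip between
# clusters. The inter-site round trip dominates what the host sees.
print(mm_host_write_latency_ms(0.3, 0.3, 5.0))  # 5.3
```

The point of the sketch is simply that once the round trip is longer than a cache commit (which it almost always is), host write latency tracks the inter-site round trip nearly one for one.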

Global Mirror (Asynchronous/Synchronous)

Global Mirror, by contrast, is what I refer to as Async/Sync mirroring, meaning the RPO is asynchronous and the two sides are a few seconds apart in terms of consistency. However, it still uses a synchronous data pattern, forwarding writes as we receive them, as shown below:

The main difference between Global Mirror and Metro Mirror is that Global Mirror does not wait for an acknowledgement from the remote cluster before acknowledging write completion to the host. But even though this is asynchronous replication, any impediment to sending the data to the remote cluster will cause backpressure and increased write latency on the master volume.

Global Mirror with Change Volumes (Asynchronous/Asynchronous)

As you can imagine, having both synchronous and asynchronous mirroring technologies use a synchronous data pattern is a bad fit for true long-distance links, or for native IP links where there will be lots of network loss and retransmitted data. The solution that was implemented was to combine FlashCopy and Global Mirror to make replication that is truly asynchronous in both RPO (which in this case is minutes) and data pattern. This is accomplished by removing the master volume from the replication task itself, as shown below:

What happens here is that a FlashCopy to the master change volume is periodically started. The change volume then does a background copy to synchronize the sites. I/O from production hosts continues as if nothing happened; FlashCopy processing will still take place once the write is cached, but that is a conversation for another post. The benefit of this style of replication is that backpressure from the inter-site link or the remote cluster is not noticed by the master volume.
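A rough way to reason about the recovery point this cycling gives you, as a sketch with illustrative numbers, assuming the commonly cited worst case of roughly two cycle periods:

```python
# Worst-case RPO estimate for cycling (change-volume) replication.
# When the remote copy finishes, the consistent image at the remote
# site dates from the *previous* snapshot, so the worst-case RPO is
# about two cycle periods. If a cycle cannot finish inside the
# configured period, the next one starts late and the RPO stretches.

def gmcv_worst_case_rpo_s(cycle_period_s, actual_cycle_time_s):
    """Approximate worst-case RPO in seconds for a cycling relationship."""
    return 2 * max(cycle_period_s, actual_cycle_time_s)

print(gmcv_worst_case_rpo_s(300, 120))   # 600: cycles keep up with the period
print(gmcv_worst_case_rpo_s(300, 900))   # 1800: link too slow, RPO slips
```

This is why undersizing the link for GMCV does not hurt the master volume, but quietly grows the amount of data you would lose in a disaster.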

Partnership Considerations for Each Relationship Type

Partnership Settings

svcinfo lspartnership -delim : REDACTED
location:remote                      <==Indicates this is a remote copy partner
code_level: (build 142.6.1805101736000)
gm_link_tolerance:300                <== Link Tolerance
relationship_bandwidth_limit:25      <== Relationship Limit
gm_max_host_delay:5                  <== Max Host Delay
link_bandwidth_mbits:800             <== Link Bandwidth
background_copy_rate:100             <== Background Copy Rate
max_replication_delay:0              <== Max Replication Delay

Metro Mirror and Global Mirror

In Metro Mirror and Global Mirror the bandwidth demand for replication is the peak write data rate for all the replicated volumes. In an ideal world the inter-site link will be able to handle this demand. In practice this is rarely the case. This is where the partnership configuration becomes important to maintaining synchronized copies and minimizing impact to normal operations.

For initial synchronization, the Link Bandwidth should equal the inter-site link's bandwidth, and the background copy rate should be the percentage of that bandwidth dedicated to initial sync and resync tasks. In addition, the Relationship Bandwidth Limit is a cap on the bandwidth for a single remote copy relationship; it is a system-wide parameter, and I suggest leaving the default of 25 MB/s in most cases.
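As a worked example of that arithmetic, using the sample lspartnership values shown earlier (a sketch only; the helper names are mine, and note the unit mismatch: the link setting is in megabits/s while the relationship limit is in MB/s):

```python
# Rough arithmetic for initial-sync throughput using the example
# partnership values above: link_bandwidth_mbits=800,
# background_copy_rate=100, relationship_bandwidth_limit=25.

def background_copy_budget_MBps(link_bandwidth_mbits, background_copy_rate_pct):
    """Total bandwidth the partnership will spend on sync/resync copies."""
    link_MBps = link_bandwidth_mbits / 8              # megabits -> megabytes
    return link_MBps * background_copy_rate_pct / 100

def per_relationship_MBps(total_budget_MBps, relationship_bandwidth_limit):
    """A single relationship is additionally capped by the system-wide limit."""
    return min(total_budget_MBps, relationship_bandwidth_limit)

total = background_copy_budget_MBps(800, 100)   # 100.0 MB/s across all relationships
single = per_relationship_MBps(total, 25)       # 25 MB/s for any one relationship
print(total, single)
```

So with these example settings, four or more relationships syncing in parallel can fill the link, but a single large volume will sync at no more than 25 MB/s.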

After the relationship enters a consistent state (the two copies have completed the initial sync), the relationship sends all write data across the link on receipt. This means that while replication is consistent, the bandwidth and limit settings don't matter (only the 1920 policing parameters, which stop replication where needed). Because after the initial synchronization the bandwidth requirement shifts to the incoming write data rate, it is important to have sufficient bandwidth on the network to handle the load reliably.

For the protection settings, max host delay should be the maximum average latency your application can handle (in ms), and link tolerance should be the time (in seconds) that this latency can be tolerated. If you have very sensitive applications, you can set max replication delay to the highest acceptable peak latency (in seconds). For reference, most host timeout settings are suggested to be either 60 or 120 seconds, but applications may be more sensitive. Setting these values appropriately helps by stopping replication before backpressure from the replication impacts the production host.

Global Mirror with Change Volumes

For Global Mirror with Change Volumes, the remote copy itself takes place between change volumes, and by definition this is always a re-synchronization task. This means GMCV depends more on these settings to maintain consistent copies. For GMCV, the link bandwidth should match the data rate you want to achieve (remember this is mbits not MB/s - do the byte conversion on your target data rate to get the right value). Also, this value is pre-compression, so if you are compressing the partnership and want to saturate the link, you need to factor in the compression ratio. The background copy rate should be 100%. This will make the system send as much data as you define (or as close to it as the link allows). We don't concern ourselves as much with the protection settings that would trigger a 1920, since with GMCV link backpressure won't impact the master volume's response times.
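Since the mbit/MB conversion and the compression credit both trip people up, here is the arithmetic as a sketch (the function names and figures are illustrative assumptions):

```python
# Back-of-the-envelope numbers for the GMCV link_bandwidth_mbits setting.

def data_rate_to_mbits(target_MBps):
    """The setting is in megabits/s, so convert the desired MB/s rate."""
    return target_MBps * 8

def setting_to_saturate_link(physical_link_mbits, compression_ratio=1.0):
    """The setting is applied pre-compression: with 2:1 compression only
    half the counted bytes hit the wire, so the setting can be scaled up
    by the ratio without oversubscribing the physical link."""
    return physical_link_mbits * compression_ratio

print(data_rate_to_mbits(100))                # 800 mbits to move 100 MB/s
print(setting_to_saturate_link(1000, 2.0))    # 2000.0 mbits over a 1 Gb link at 2:1
```

In other words: size the setting from the data rate you need, then credit compression only if you actually see that ratio on your data.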

I hope you all found this helpful and informative. If you have any questions or concerns please leave a comment, ask me on Twitter @fincherjc, or reach out to me on Linkedin.


  1. Hi, I have one SQL DB on six different IBM flash volumes of 1 TB each. These six volumes are in GMCV replication. Now the customer is asking to take only one 6 TiB volume and start GMCV on it. Is it feasible to take a 6 TiB volume in GMCV?

    Please suggest??

    1. Whether or not this is possible in your environment will largely depend on throughput availability. There is a system-wide parameter called relationshipbandwidthlimit (tunable with the chsystem command) which sets the rate limit for each individual relationship. The parameter is a value in MB/s. Assuming you have the bandwidth available, it should be possible to do this conversion, but you will probably slip RPO for the first few cycles during the initial sync.

  2. I've got an interesting scenario for you. There are two sites in current operation and two new sites as a migration target. The existing sites replicate with IP Global Mirror, as do the two new systems (FS9100s) in the new DC, so my partnership possibilities for migration are zero.

    Old site 1: SVC+A9K. I have installed a V7K on the same SAN; let's call it SWING A.
    New site 1: FS9100. I have also installed a V7K on the same SAN; let's call it SWING B.

    To replicate a LUN from old site 1, I provision an image-mode disk from the V7K and either use it as a FlashCopy target or use it as a volume copy by adding a copy to the LUN (addvdiskcopy).

    I then create an IP Global Mirror partnership between the V7Ks in the old site and new site, and replicate the image-mode LUN from V7K SWING A to V7K SWING B. When done, I break the mirror.

    Then I map the AUX LUN from V7K SWING B to the FS9100 as an image-mode volume, and it should all be OK?

    However, sometimes it's not; sometimes the replication seems not to be complete.

    Any ideas?

    1. I am not confident that I have a full grasp of what you are trying to achieve, do you have a topology diagram that might help better illustrate your goal?

    2. Yes i have a diagram but I can't attach it here.

  3. Hi,

    I have a scenario where I get the 1920 error at around 3am-4am daily. How can I monitor what is happening at that time, and what could be the cause of it on a daily basis?

    1. I would suggest using a long-term performance monitoring tool such as IBM Storage Insights to keep track of the performance statistics on the sending and receiving clusters.

      If you need help troubleshooting replication performance, I recommend getting logs from the systems shortly after an event, as well as any switch logs (assuming it's predictable), and contacting IBM support.

      Typically 1920 events are a response to reduced fabric performance or an increase in replication traffic exceeding the initial design point of the solution.

  4. I have a new GMCV relationship set up between two FS7200s with a cycle time of 15 min (900 sec). The distance is 20 miles and the link is a shared 1GbE. I have set up the partner link at 1000 mbits and have 90% set on the background copy. I am also compressing the replication over the IP link. How can I monitor the status of the replication to determine if changed data is copying over within the given cycle period? There are times when the source might have 300 MB/s of sustained new writes that need to replicate. How would I monitor the "backlog" between the source and target in cases like this? Thanks!

    1. In the GUI you can monitor freeze times to make sure they are updating normally. Additionally, if you have a performance monitoring tool like Storage Insights, you can monitor the "port to remote node" data rates to track the rate at which data is moving. There isn't a good way to measure the volume of data (in GB) by which a given replication is lagging.

