What is a 1920 and why is it happening?

Hello friends. For the next few weeks (while I am in a dry spell for post ideas) I am going to do a series of posts that are frequently asked questions I get from some of my peer support agents as well as others I interact with day-to-day. For the first go-around we will discuss remote copy. Specifically, what exactly is a 1920 error? How do I identify the cause of the error and stop it from happening again?

A 1920 error is a cluster error indicating that the software in the master cluster of a remote copy relationship has decided to kill remote copy for a particular relationship (and its associated consistency group) because the SLA of remote copy was violated as defined in the partnership settings.

The first and probably most important thing to know is what that SLA is and how it is defined. For this, go ahead and run lspartnership against your remote copy partnership (or check the partnership properties in the GUI... your call). If you run lspartnership you will see something like this:

svcinfo lspartnership -delim : REDACTED
location:remote                      <==Indicates this is a remote copy partner
code_level: (build 142.6.1805101736000)
gm_link_tolerance:300                <== Link Tolerance
relationship_bandwidth_limit:25      <== Relationship Limit
gm_max_host_delay:5                  <== Max Host Delay
max_replication_delay:0              <== Max Replication Delay

I know this output is a lot to take in. Let's start with the simple stuff. First, make sure that you are actually looking at the remote partnership (there is a local one for the local cluster itself). This is specified by the location field, which will say remote if it is a remote copy partner. The SLA is generally defined by gm_link_tolerance and gm_max_host_delay.
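If you would rather pull these fields out in a script than read them by eye, a minimal Python sketch could look like the following. The function names parse_partnership and sla_summary are my own for illustration (they are not part of any IBM tool), and the sample text is just the fields from the output above with my annotations removed.

```python
# Sample colon-delimited text, copied from the output above without the
# "<==" annotations (those are notes in this post, not command output).
SAMPLE = """\
location:remote
gm_link_tolerance:300
relationship_bandwidth_limit:25
gm_max_host_delay:5
max_replication_delay:0
"""


def parse_partnership(text):
    """Turn 'key:value' lines into a dict, splitting on the first colon only."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or ":" not in line:
            continue  # skip blank or malformed lines
        key, value = line.split(":", 1)
        fields[key.strip()] = value.strip()
    return fields


def sla_summary(fields):
    """Pick out the fields this post treats as the remote copy SLA."""
    return {
        "is_remote": fields.get("location") == "remote",
        "link_tolerance_s": int(fields["gm_link_tolerance"]),
        "max_host_delay_ms": int(fields["gm_max_host_delay"]),
        "max_replication_delay_s": int(fields["max_replication_delay"]),
    }


if __name__ == "__main__":
    print(sla_summary(parse_partnership(SAMPLE)))
```

Checking is_remote first guards against reading the SLA values from the local partnership entry by mistake.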

gm_max_host_delay is the delay, in milliseconds, that we tell the cluster we can tolerate in communicating between clusters (5 ms is the default).

gm_link_tolerance is the length of time, in seconds, that we tell the cluster we can tolerate the average latency violating gm_max_host_delay (300 seconds in the sample output above).

Combined, the system will monitor each remote copy relationship (rcrel) and if one of them has a latency that is greater than the gm_max_host_delay on average for the time threshold of the gm_link_tolerance, a 1920 event is triggered and that rcrel is stopped. If that rcrel was in a consistency group, that group is stopped as well in order to maintain point in time consistency of the grouped relationships.
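To make the interaction between the two settings concrete, here is a rough Python sketch of the rule as described above. This is a simplification under my own assumptions, not the product's actual algorithm: the function name violates_sla, the fixed sampling interval, and the sliding-window average are all illustrative.

```python
def violates_sla(latencies_ms, interval_s, max_host_delay_ms=5, link_tolerance_s=300):
    """Return True if the average latency over any window as long as the
    link tolerance exceeds the max host delay.

    latencies_ms: latency samples for one relationship, oldest first
    interval_s:   seconds between samples (assumed constant and positive)
    """
    # Number of samples that span one link tolerance period.
    window = max(1, link_tolerance_s // interval_s)
    if len(latencies_ms) < window:
        return False  # not enough history to fill one window

    # Slide the window across the samples, keeping a running total.
    total = sum(latencies_ms[:window])
    if total / window > max_host_delay_ms:
        return True
    for i in range(window, len(latencies_ms)):
        total += latencies_ms[i] - latencies_ms[i - window]
        if total / window > max_host_delay_ms:
            return True
    return False


if __name__ == "__main__":
    # One-minute samples, so a 300 second tolerance is a 5 sample window.
    print(violates_sla([2, 2, 12, 2, 2], interval_s=60))  # short spike
    print(violates_sla([2, 8, 8, 8, 8, 8], interval_s=60))  # sustained latency
```

In this sketch a single 12 ms sample in an otherwise quiet five-minute window averages out to 4 ms and does not trip the rule, while five minutes of sustained 8 ms latency does.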

In addition to this, another parameter can be set to put more restrictions on the SLA for remote copy. This is the max_replication_delay. The max_replication_delay enforces a time-to-live (in seconds) for every datagram sent to the aux cluster. If the master volume fails to receive the ack for any command sent to the remote cluster before that timer runs out, remote copy is then killed and a 1920 event is triggered. The system default for this is 0 (disabled) and is designed to be used in latency sensitive environments to prevent a single long standing i/o from causing impact to the hosts in the primary site.
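The max_replication_delay check can be pictured the same way. The sketch below is only my illustration of the behavior described above; the function name and the idea of passing in a list of ages for outstanding commands are assumptions, not how the software is implemented.

```python
def exceeds_replication_delay(outstanding_ages_s, max_replication_delay_s):
    """Return True if any command sent to the aux cluster has waited longer
    than max_replication_delay for its ack.

    outstanding_ages_s:      seconds each unacknowledged command has waited
    max_replication_delay_s: the partnership setting; 0 means disabled
    """
    if max_replication_delay_s == 0:
        return False  # system default: the check is turned off
    return any(age > max_replication_delay_s for age in outstanding_ages_s)


if __name__ == "__main__":
    ages = [0.2, 0.4, 12.0]  # one long-standing i/o
    print(exceeds_replication_delay(ages, 0))   # disabled, never triggers
    print(exceeds_replication_delay(ages, 10))  # the 12 second command trips it
```

Unlike the averaged check, a single slow command is enough here, which is why this setting suits latency-sensitive environments.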

By definition, 1920 events are a performance problem with only two possible causes:
  1. There was congestion and/or latency on the inter-site link
  2. There was poor performance on the aux cluster impacting the aux volume, creating back-pressure to the master cluster's ability to send data.
Because of the nature of this problem, support (or you) will need a snap and performance data during the time of the event from both clusters involved (available through Spectrum Control or Storage Insights). However, this will only get you far enough to tell if the problem is in the fabric between clusters or in the remote cluster (and what is wrong in the remote cluster to cause it). If the problem turns out to be related to the inter-site link, a fabric analysis will likely be required to see what is impacting the Spectrum Virtualize cluster's ability to send data.

First, we check the event log (show all is helpful here) and find the following events:

This snippet is from an IBM internal tool to analyze logs. You will have to take my word for it for now that this same data is available in the GUI event log if you select show all and inspect the properties of event IDs 50010 (the 1920 error) and 985003 (an informational event indicating we had to retry i/o in remote copy). The value of the informational event is that, if the 1920 stops a whole consistency group, it tells you the specific relationship in that group that triggered the event (as shown above). This will help to narrow down your search (if it comes to it) in the remote cluster performance analysis.

The analysis to determine if the event is related to the fabric or the remote cluster is fairly simple. On the master cluster, take a look at the port to remote node send response time and port to remote node send queue times around the time of the 1920 event. If they spike up, that confirms there was back-pressure on the link inhibiting the ability to send data.

To put these values into perspective, the port to remote node send response time is the average response time for commands sent to the remote cluster. The port to remote node send queue time is how long it takes for the node to be able to send a frame to the remote cluster. The larger spike in queue time here would suggest that something is impairing the ability of the master cluster to send i/o. Typically in the Fibre Channel world this would be buffer credit starvation, so we can check those stats on the remote cluster:

Generally, I wouldn't consider a port blocked % (buffer credit starvation on the 16Gb cards) of 16% to be too bad, but the data correlation here is hard to ignore. From a Spectrum Virtualize perspective, the 1920 event (from the event log) lines up with the time that the port to remote node send queue time and the port time blocked % stats spike up, making this appear to be a fabric-related issue.

If the time blocked % had not spiked up, I would then start suspecting the remote cluster's performance as impeding the ability of the master cluster to send data. On the remote cluster, we review the aux volume of the relationship in question for performance problems in general to see what hurts using the same approach from my previous performance post here.
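The decision process from the last few paragraphs can be summarized in a short Python sketch. Everything here is illustrative: the function names, the "three times the median" spike threshold, and the sample numbers are my own assumptions and not values from Spectrum Virtualize or from the environment discussed in this post.

```python
from statistics import median


def spiked(series, event_index, factor=3.0):
    """Crude spike test: is the sample at the event more than `factor`
    times the median of the other samples? (hypothetical threshold)"""
    others = series[:event_index] + series[event_index + 1:]
    return series[event_index] > factor * median(others)


def triage_1920(send_queue_ms, blocked_pct, event_index):
    """Classify a 1920 as fabric or remote cluster, following the post.

    send_queue_ms: master cluster port to remote node send queue time
    blocked_pct:   port time blocked % on the remote cluster
    """
    if not spiked(send_queue_ms, event_index):
        return "inconclusive: no back-pressure seen on the link"
    if spiked(blocked_pct, event_index):
        return "fabric"
    return "remote cluster"


if __name__ == "__main__":
    queue = [0.1, 0.1, 0.2, 5.0, 0.1]  # made-up one-minute samples
    print(triage_1920(queue, [1, 1, 2, 16, 1], event_index=3))
    print(triage_1920(queue, [1, 1, 2, 1, 1], event_index=3))
```

The median is used as the baseline so that the spike itself does not drag the comparison value upward.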

It is important to note that Spectrum Virtualize averages data over an interval (here 1 minute), which means short bursts are easily masked in the stats. Additionally, the charts in this post and the lspartnership output are samples and not from the same environment for data privacy reasons.
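A quick bit of arithmetic shows how much a one-minute average can hide. The per-second numbers below are made up purely for illustration.

```python
# Hypothetical per-second latencies within a single one-minute interval:
# 57 quiet seconds at 1 ms and a 3 second burst at 80 ms.
per_second_ms = [1.0] * 57 + [80.0] * 3

average_ms = sum(per_second_ms) / len(per_second_ms)
peak_ms = max(per_second_ms)

if __name__ == "__main__":
    print(f"one-minute average: {average_ms:.2f} ms, peak: {peak_ms:.0f} ms")
```

The interval reports an average of just under 5 ms even though the link spent three seconds at 80 ms, so a chart that looks healthy does not rule out a burst.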

I hope you all found this helpful and informative. If you have any questions or concerns please leave a comment, ask me on Twitter @fincherjc, or reach out to me on LinkedIn.

