Why every Spectrum Virtualize system should be using NPIV

The IBM Spectrum Virtualize software stack operates in clusters based on the Paxos protocol.

Therefore, in order to service a host's I/O request, each node must have the current cluster state information. When a node comes back online after a code upgrade (or reboot), it must pull this information from the running cluster. To do that, the node must first bring its network interfaces online and log into the Fibre Channel (FC) fabric so that the data can be transferred.

Naturally, the consequence of this is the propagation of RSCNs (Registered State Change Notifications), which notify all the hosts that the ports are back online and prompt them to log in again. However, the node cannot service SCSI I/O until the current cluster state has been obtained from the other Spectrum Virtualize nodes in the cluster. Since the FC host logins happen in parallel with the inter-node data transfer, there is no way the Spectrum Virtualize code can be ready to respond to the hosts at the instant the FC ports log back into the fabric.

The expectation is that the hosts will abort and retry any commands that time out because the storage node doesn't respond fast enough. While this is normally what happens, and no outage occurs, there are circumstances where it can be problematic.

Let's take LUN discovery as an example. When the host completes a Fibre Channel login to the storage subsystem, the next thing it does is ask which LUNs it has access to (a SCSI REPORT LUNS command). Generally, the timeout for this query is significantly shorter than that of a standard SCSI read or write. If the LUN query times out and, for whatever reason, is not retried, the host will effectively fail to reclaim the path.
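
To make that failure mode concrete, here is a minimal Python sketch of the host-side logic. The timings, helper names, and retry policy are all hypothetical (the real behaviour lives in the HBA and multipath drivers), but the shape of the problem is the same: a short discovery timeout with no retry means the path is never reclaimed.

```python
# A minimal sketch (not real driver code) of why a non-retried LUN query
# loses the path. The timings and helper names are hypothetical.

NODE_READY_AT = 5.0        # seconds after fabric login until the node can answer
DISCOVERY_TIMEOUT = 2.0    # LUN query timeout: much shorter than a read/write timeout

def report_luns(elapsed_since_login):
    """Simulate a SCSI REPORT LUNS query against a node that just logged back in."""
    if elapsed_since_login < NODE_READY_AT:
        raise TimeoutError("node is logged in but still loading cluster state")
    return ["lun0", "lun1"]

def reclaim_path(retries):
    """Host-side path reclamation: query LUNs, optionally retrying on timeout."""
    elapsed = 0.0
    for _ in range(retries + 1):
        try:
            return report_luns(elapsed)       # success: path reclaimed
        except TimeoutError:
            elapsed += DISCOVERY_TIMEOUT      # query timed out; node still not ready
    return None                               # gave up: the path is never reclaimed

print(reclaim_path(retries=0))   # None -> host silently loses the path
print(reclaim_path(retries=3))   # ['lun0', 'lun1'] -> retries cover the gap
```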

Take a moment to consider the risk associated with that... If you are performing a code upgrade and a host fails to reclaim its paths after the first node in an I/O group updates, what happens? The second node going offline for its upgrade will remove that host's only remaining functional paths to storage.
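
One practical safeguard, independent of NPIV, is to confirm that every host has reclaimed its paths before the second node in the I/O group goes offline. Below is a minimal sketch assuming a Linux host running dm-multipath; the parsing of the multipath -ll output is deliberately simplified (it assumes user-friendly device aliases), and the minimum-path threshold is an assumption you should adapt to your environment.

```python
import re
import subprocess

# A minimal sketch, assuming a Linux host with dm-multipath: before the second
# node of an I/O group goes offline, confirm every mpath device still has
# redundant active paths. The parsing below is simplified and illustrative.

MIN_PATHS = 2   # assumed policy: at least one path to each node in the I/O group

def active_path_counts():
    """Count 'active ready' paths per multipath device from `multipath -ll`."""
    out = subprocess.run(["multipath", "-ll"],
                         capture_output=True, text=True, check=True).stdout
    counts, current = {}, None
    for line in out.splitlines():
        m = re.match(r"^(\S+)\s+\(", line)        # e.g. "mpatha (360050768...) dm-2 ..."
        if m:
            current = m.group(1)
            counts[current] = 0
        elif current and "active ready" in line:  # one line per healthy path
            counts[current] += 1
    return counts

degraded = {dev: n for dev, n in active_path_counts().items() if n < MIN_PATHS}
if degraded:
    print("Do NOT proceed with the next node; paths not reclaimed:", degraded)
else:
    print("All devices have redundant paths; safe to continue the upgrade.")
```

Running something like this between node updates gives you a chance to pause the upgrade and rescan the host before redundancy is lost.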

While, from a technical perspective, this situation is not the explicit fault of the storage array, most of you (I hope) don't care whose fault it is and simply want to know how to mitigate the risk. That is where NPIV comes in.

Why and how does using NPIV mitigate this problem? I'm glad you asked. Let's break it down into steps:
  1. Node 1 goes offline (upgrade, reboot, failure, etc.)
  2. The system fails over the host-facing (NPIV) WWPN to its partner, node 2
  3. The host handles the NPIV failover, and access is maintained
  4. Node 1 comes back online and logs into the fabric using its physical WWPN
  5. Node 1 pulls the cluster state data from node 2 using the physical WWPNs
  6. Node 1 finishes loading the cluster state
  7. Node 2 fails back the NPIV WWPNs owned by node 1
The trick is that the host keeps talking to the online node 2 until node 1 is back online AND ready to service I/O. While you might still notice a latency spike, all of the commands are expected to complete. This ultimately means hosts are expected to maintain access to their data, and the problem described at the start of this page is circumvented.
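
To tie the steps together, here is a small, purely illustrative Python sketch of the idea; the class names and structure are mine, not the product's, but it shows why the host never addresses a node that is online yet not ready.

```python
# A minimal sketch of the idea behind NPIV failover: the host only ever talks to
# a virtual (NPIV) WWPN, and the cluster keeps that WWPN on a node that is both
# online AND ready to serve I/O. Names and structure here are illustrative only.

class Node:
    def __init__(self, name):
        self.name = name
        self.online = True
        self.cluster_state_loaded = True

    @property
    def ready(self):
        return self.online and self.cluster_state_loaded

class IOGroup:
    """Owns the host-facing NPIV WWPN and decides which node presents it."""
    def __init__(self, node1, node2):
        self.node1, self.node2 = node1, node2
        self.npiv_owner = node1                 # normal operation: home node

    def fail_over_if_needed(self):
        # Steps 2 and 7 of the walkthrough: move the NPIV WWPN to the partner
        # while the home node is offline or still loading cluster state, and
        # fail it back only once the home node is ready again.
        self.npiv_owner = self.node1 if self.node1.ready else self.node2

    def host_io(self):
        assert self.npiv_owner.ready, "host I/O must always land on a ready node"
        return f"I/O serviced by {self.npiv_owner.name}"

# Walk through the upgrade sequence from the list above.
n1, n2 = Node("node1"), Node("node2")
iog = IOGroup(n1, n2)

n1.online = False                    # 1. node 1 goes offline for upgrade
iog.fail_over_if_needed()            # 2-3. NPIV WWPN fails over to node 2
print(iog.host_io())                 # host keeps access via node 2

n1.online = True                     # 4. node 1 logs back in (physical WWPN)
n1.cluster_state_loaded = False      # 5. still pulling cluster state
iog.fail_over_if_needed()
print(iog.host_io())                 # still node 2: node 1 isn't ready yet

n1.cluster_state_loaded = True       # 6. cluster state loaded
iog.fail_over_if_needed()            # 7. NPIV WWPN fails back to node 1
print(iog.host_io())                 # back to node 1, with no gap in service
```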

As always, if you have any questions or concerns, please feel free to comment, follow me on Twitter @fincherjc, or connect on LinkedIn.

