Spectrum Virtualize Software Overview

Hello friends. After last week's post I got a lot of questions about the software stack diagram, so instead of doing a troubleshooting post I want to take a step back and explain the diagram and, more generally, how IBM Spectrum Virtualize software works. To start, let's take a look at the diagram:

The software is composed of several 'layers'. From the top down we have a SCSI Target, Replication, Upper Cache, Flashcopy, Mirroring, Thin Provisioning, Compression, Data Reduction Pool, Lower Cache, Virtualization, RAID, and SCSI Initiator, which together make up most of our i/o stack.

It is important to note that this diagram is somewhat incomplete. The truth is that just after the SCSI Target, before the lower cache, and between every component under the lower cache there is a forwarding layer, which I will explain in more detail a bit later in this post.

SCSI Target

This section of the software decodes SCSI commands received from the hosts and manages the IT Nexus that is created to service the i/o request. It is also where we start and stop the timer for i/o response times in the performance metrics of the product.

Write i/o response time starts when we receive the write request and stops after we send the acknowledgement to the host that the write data has been received and destaged. Read i/o response time starts when we receive the read request and stops when we receive the acknowledgement from the host that it has received the data blocks requested in the read command.
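
To make those start and stop points concrete, here is a minimal sketch in Python of where the two timers would sit. The handler and helper names are invented purely for illustration; the real product does this inside its own i/o stack, not in Python.

```python
import time

def handle_write(write_request, process_write, send_ack_to_host):
    # The timer starts when the write request arrives at the SCSI target.
    start = time.monotonic()
    process_write(write_request)             # cache, forward to partner, etc.
    send_ack_to_host(write_request)          # host is told the write is complete
    # The timer stops once the acknowledgement has gone back to the host.
    return time.monotonic() - start          # recorded as write response time

def handle_read(read_request, fetch_data, return_data_to_host):
    # The timer starts when the read request arrives at the SCSI target.
    start = time.monotonic()
    data = fetch_data(read_request)          # cache hit or back-end read
    return_data_to_host(read_request, data)  # host acknowledges receipt of the blocks
    # The timer stops when the host has acknowledged the data.
    return time.monotonic() - start          # recorded as read response time
```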

Upper Forwarding

What!? Yes, this is not in the picture. Spectrum Virtualize is organized into node pairs known as i/o groups (iogrp). The upper forwarding layer is responsible for forwarding write datagrams from the node that received them to the partner node in the iogrp, so that there is a redundant copy of the in-flight i/o that can continue the i/o process in the event of a node failure.
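
A rough sketch of that idea is below, with hypothetical node objects standing in for the two nodes of an iogrp. This shows the concept only, not the actual forwarding code.

```python
class Node:
    """Hypothetical stand-in for one node of an i/o group."""
    def __init__(self, name):
        self.name = name
        self.write_cache = {}                # in-flight write data by (volume, lba)

    def receive_write(self, partner, volume, lba, data):
        # Keep a local copy of the in-flight write...
        self.write_cache[(volume, lba)] = data
        # ...and forward it to the partner node before acknowledging,
        # so either node can complete the i/o if the other fails.
        partner.store_forwarded_copy(volume, lba, data)
        return "ack"                         # only now is the host acknowledged

    def store_forwarded_copy(self, volume, lba, data):
        self.write_cache[(volume, lba)] = data


node_a, node_b = Node("node1"), Node("node2")
node_a.receive_write(node_b, volume="vol0", lba=42, data=b"\x00" * 512)
assert ("vol0", 42) in node_a.write_cache and ("vol0", 42) in node_b.write_cache
```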

Replication

The replication section manages remote copy relationships and any i/o processing that needs to be done for them. The product has three types of remote copy relationships:

Metro Mirror:
Metro Mirror is IBM's synchronous mirroring solution embedded in the product. This mode of replication forwards a write datagram in full to the replication partner (remote cluster) and waits to receive an acknowledgement that the i/o has been received on the replication partner before continuing to process the write i/o locally. This technology requires that the bandwidth available for replication be greater than or equal to the write data rate of all the replicated volumes in order to avoid backpressure that would increase write response time to the host.

Global Mirror:
Global Mirror is one of IBM's two asynchronous replication technologies built into the product. Global Mirror works almost exactly the same way as Metro Mirror, with one exception: when using Global Mirror, the system does not wait to receive an acknowledgement from the remote cluster before continuing to process the write. This means that slowdowns on the replication link don't have as severe an impact on the write response time seen by the host, but the system still has to send the write data to the remote cluster before continuing the i/o.

Global Mirror with Change Volumes:
This is where the conversation gets interesting. Because native Global Mirror is still susceptible to fabric conditions, such as running out of buffer credits or high packet loss, causing massive delays in response time to the host, IBM devised a more fully asynchronous technology. The general concept is that a flashcopy mapping (to be discussed later) is started, and then a Global Mirror style replication runs between that flashcopy target (the master change volume) and the aux change volume. Once these two volumes are synchronized (all the data has been moved), a flashcopy is started from the aux change volume to the aux volume to commit the changes. The benefit of this is that fabric conditions have no impact on write response time to the host.
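
The difference between the three modes really comes down to when the local write is allowed to complete relative to the remote copy. Here is a simplified sketch of that decision; the transport functions are placeholders I made up for the example, not real APIs.

```python
def replicate_write(data, mode, send_to_remote, wait_for_remote_ack):
    """Illustrative only -- send_to_remote / wait_for_remote_ack stand in
    for the real replication transport."""
    if mode == "metro_mirror":
        # Synchronous: the remote cluster must acknowledge the write before
        # local processing continues, so link latency is added directly to
        # the host's write response time.
        send_to_remote(data)
        wait_for_remote_ack()
    elif mode == "global_mirror":
        # Asynchronous: the write is still sent to the remote cluster in the
        # write path, but we do not wait for its acknowledgement.
        send_to_remote(data)
    elif mode == "global_mirror_change_volumes":
        # Nothing is sent in the host write path at all; changes are captured
        # by flashcopy change volumes and replicated in cycles, so fabric
        # conditions cannot affect host write response time.
        pass
    # ...continue processing the write locally...
```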



Upper Cache

This is a partition of memory in the node used to store write datagrams coming in from the host, and read datagrams after they have been processed by the other software components that sit between the two levels of cache.

Flashcopy

Flashcopy allows you to make a point in time copy of any given volume instantaneously. This is done by creating a relationship between a source and target volume. While data is copied in the background from the source to the target volume, reads to the target are redirected to the source for data that hasn't been copied over yet. A write to a region of the flashcopy source volume that has not yet been copied to the target results in a read of the existing data on the source, and a write of that existing data to the target, before the new write to the source can proceed. This is done to maintain the integrity of the point in time copy, and is referred to as copy on write.
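
Here is a minimal copy-on-write sketch, with volumes modeled as Python dicts of block number to data. It shows the concept only, not the product's actual grain and bitmap implementation.

```python
class FlashcopyMapping:
    """Toy copy-on-write between a source and target 'volume' (dicts)."""
    def __init__(self, source, target):
        self.source = source
        self.target = target
        self.copied = set()                  # blocks already copied to the target

    def read_target(self, block):
        # Reads to the target are redirected to the source until the
        # block has actually been copied across.
        if block in self.copied:
            return self.target.get(block)
        return self.source.get(block)

    def write_source(self, block, new_data):
        # Before overwriting a block on the source, preserve the old data
        # on the target so the point-in-time copy stays intact.
        if block not in self.copied:
            self.target[block] = self.source.get(block)
            self.copied.add(block)
        self.source[block] = new_data


src = {0: b"old"}
fc = FlashcopyMapping(src, target={})
fc.write_source(0, b"new")
assert fc.read_target(0) == b"old" and src[0] == b"new"
```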

Mirroring

This allows you to create a simple RAID 1 style mirror using two volume copies. It is typically used for critical workloads, either to span two different storage controllers or, as part of a stretched cluster, to keep a copy of a volume at each site.
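
Conceptually it is just two copies kept in lockstep: every write goes to both copies, and reads can be served from either. The toy illustration below assumes dict-backed copies and is not the real copy-synchronization logic.

```python
class MirroredVolume:
    """Toy RAID 1 style volume with two copies backed by dicts."""
    def __init__(self):
        self.copies = [{}, {}]               # e.g. two different storage controllers

    def write(self, block, data):
        # Every write is applied to both copies before completing.
        for copy in self.copies:
            copy[block] = data

    def read(self, block, preferred=0):
        # Reads can come from either copy; if one controller or site is
        # lost, the surviving copy keeps servicing i/o.
        return self.copies[preferred].get(block, self.copies[1 - preferred].get(block))
```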

Thin Provisioning

Thin provisioning manages the space allocation of thin provisioned volumes. If a write lands on space that has not yet been allocated, new extents are allocated to the volume before the write proceeds.
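
A rough allocate-on-write sketch; the extent size and data structures here are invented purely for illustration.

```python
EXTENT_SIZE = 1024 * 1024                    # 1 MiB extents, chosen only for the example

class ThinVolume:
    """Toy thin-provisioned volume: extents are only allocated when written."""
    def __init__(self, free_extents):
        self.free_extents = free_extents     # extents still available in the pool
        self.allocated = {}                  # extent index -> backing storage

    def write(self, offset, data):
        extent = offset // EXTENT_SIZE
        if extent not in self.allocated:
            # The write landed on unallocated space, so a new extent is taken
            # from the pool before the write is allowed to proceed.
            if self.free_extents == 0:
                raise RuntimeError("pool out of space")
            self.free_extents -= 1
            self.allocated[extent] = bytearray(EXTENT_SIZE)
        start = offset % EXTENT_SIZE
        self.allocated[extent][start:start + len(data)] = data


vol = ThinVolume(free_extents=100)
vol.write(5 * EXTENT_SIZE + 10, b"hello")    # allocates extent 5 on first touch
assert vol.free_extents == 99
```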

Real Time Compression (RtC)

This is exactly what it sounds like. It compresses writes to a fixed output size of 32KB and indexes segments at 32KB. Fixed output size write compression means that, in our case, sequential writes (>64kb/op transfer size) will consume more resources than random writes. Additionally, there are limited resources allocated to compression, so having lots of sequential workloads can be bad for Real Time Compression performance. From a read perspective, because all of the data is indexed at 32KB, the minimum read size to disk for a compressed volume is 32KB. Additionally, sequential reads (>64kb/op transfer size) result in multiple reads to disk, which can also impact performance.
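
The 32KB indexing is why transfer size matters so much on the read side. A back-of-the-envelope sketch, assuming the 32KB segment size described above:

```python
SEGMENT = 32 * 1024                          # data is indexed in 32KB segments

def segments_touched(offset, length):
    """How many 32KB index segments a single host read has to touch."""
    first = offset // SEGMENT
    last = (offset + length - 1) // SEGMENT
    return last - first + 1

# A small 4KB random read still has to fetch a full 32KB segment from disk...
print(segments_touched(0, 4 * 1024))         # -> 1 segment
# ...while a 256KB sequential read fans out into multiple segment reads.
print(segments_touched(0, 256 * 1024))       # -> 8 segments
```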

Data Reduction

The data reduction layer is part of IBM's new Data Reduction Pool feature that includes a redesigned compression method, deduplication, and a new architecture for how volumes are managed in a pool.

Lower Forwarding

Yes, another hidden forwarding layer. The idea here is that now that the data has been manipulated by the software (compressed, for example), we mirror the altered datagram to the partner node so that both nodes have a copy of the data in the event of a node failure.

Lower Cache

This level of cache serves two purposes. The first is that it stores the manipulated in-flight write data (compressed data, for example). The second is that it serves as a read cache, helping to avoid a read from the back-end disk for every request that comes in.
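
A trivial sketch of that second purpose, with the cache modeled as a dict and a placeholder back-end read function; the real cache has its own partitioning, eviction, and destage logic.

```python
class ReadCache:
    """Toy read cache: serve from memory when possible, else go to disk."""
    def __init__(self, read_from_backend):
        self.cache = {}
        self.read_from_backend = read_from_backend   # placeholder for an mdisk read

    def read(self, block):
        if block in self.cache:
            return self.cache[block]                 # cache hit: no back-end i/o needed
        data = self.read_from_backend(block)         # cache miss: read from disk
        self.cache[block] = data                     # keep it for the next request
        return data
```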

Virtualization

This component manages extents in pools. This includes EasyTier migrations, which allow frequently accessed data to reside on faster tiers of storage, AND load balancing across different mdisks of the same tier within the same pool.
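
At its simplest, this layer is a lookup table mapping each extent of a volume to an extent on some mdisk in the pool; a migration just moves the data and updates a row in that table. A toy version, with an example extent size chosen purely for illustration:

```python
EXTENT_SIZE = 1024 * 1024                    # example extent size only

class VirtualizedVolume:
    """Toy extent map: volume extent -> (mdisk, mdisk extent)."""
    def __init__(self, extent_map):
        self.extent_map = extent_map         # e.g. {0: ("mdisk1", 17), 1: ("mdisk2", 3)}

    def locate(self, volume_offset):
        # Translate a volume address into an mdisk and an offset on that mdisk.
        extent = volume_offset // EXTENT_SIZE
        mdisk, mdisk_extent = self.extent_map[extent]
        return mdisk, mdisk_extent * EXTENT_SIZE + volume_offset % EXTENT_SIZE

    def migrate_extent(self, extent, new_mdisk, new_mdisk_extent):
        # EasyTier or pool balancing moves the data, then updates the map;
        # the host-visible volume address never changes.
        self.extent_map[extent] = (new_mdisk, new_mdisk_extent)
```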

RAID

As the title implies, this manages internal RAID arrays for any disk drives that happen to be directly attached to the node. Supported RAID types are 0 (don't ever use it), 1, 5, 6, 10, DRAID 5, and (the one you should always use) DRAID 6.

SCSI Initiator

The SCSI Initiator encapsulates our altered write datagram into a SCSI frame and sends it off to the backing mdisk/array to be destaged to persistent storage. Additionally, it creates the SCSI read requests that we send to the mdisk/array to read in data that we had previously written.


I hope this overview helps you understand what Spectrum Virtualize does at a high level and can serve as a reference for troubleshooting in the future. If you have any questions, please leave a comment or follow me on Twitter and LinkedIn.

