Troubleshooting volume performance in IBM Storage Insights

Hello, my friends!
Many IBM Spectrum Virtualize users have been monitoring their storage systems with tools like Spectrum Control or Storage Insights for a while now. However, we see a lot of support cases about performance that arrive with very little information beyond 'a performance problem exists.' So this week, I want to share the methodology I use to troubleshoot performance for a specific volume (or the volumes tied to a specific host). Please note, this blog assumes you have Spectrum Control or Storage Insights Pro so that you can check the same detailed performance data support gets in a snap.

The first step:

The first step in troubleshooting any problem is acknowledging that there is one and, more importantly, defining what the problem is. The vital information to know is:
  • When did the problem surface? Explicitly note the date and time.
  • What was impacted? List off specific applications or hosts and then identify their volumes.
  • Why is it important? Is the backup running too long? Is a database struggling to update records? Is VDI struggling to provision desktops? This helps identify whether the problem at the application layer is read or write centered. A simple way to record these answers is sketched after this list.
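
If it helps to keep everyone honest, here is a minimal sketch (in Python) of the kind of problem statement worth writing down before opening a case. Every host name, volume name, and timestamp below is a hypothetical example:

    # Hypothetical problem statement -- every value here is an example.
    problem = {
        "surfaced_at": "2024-03-12 02:00 UTC",                   # when did it start?
        "impacted_hosts": ["esx-prod-01", "esx-prod-02"],        # what was impacted?
        "impacted_volumes": ["sql_data_01", "sql_log_01"],
        "symptom": "nightly backup window grew from 2h to 6h",   # why it matters
        "suspected_io_pattern": "write-heavy",                   # backups/DB updates point at writes
    }
    print(problem)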

The second step:

Now that the volume(s) have been identified, the second step is to gather some basic information about the specific volumes that are impacted (one way to pull these answers from the CLI is sketched after this list):
  • Is the volume in a Global Mirror (GM), Metro Mirror (MM), or Global Mirror with Change Volumes (GMCV) relationship?
  • Is the volume in any FlashCopy mappings (fcmap)?
  • Is the volume mirrored within the same cluster? Is it a HyperSwap volume?
  • Is the volume compressed, thin-provisioned, or in a data reduction pool?
  • What kind of storage is backing the pool that the volume is using?
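
One quick way to collect these answers is from the Spectrum Virtualize CLI. The sketch below is just that, a sketch: it assumes SSH key access to the cluster, the cluster address and volume name are placeholders, and the exact -filtervalue attributes can vary by code level, so treat the commands as a starting point.

    # A rough sketch: gather the step-two answers over SSH from the Spectrum
    # Virtualize CLI. Cluster address, user, and volume name are placeholders.
    import subprocess

    CLUSTER = "monitor@my-flashsystem.example.com"   # hypothetical
    VOLUME = "sql_data_01"                           # hypothetical

    def svc(cmd):
        """Run one CLI command on the cluster and return its output."""
        return subprocess.run(["ssh", CLUSTER, cmd],
                              capture_output=True, text=True, check=True).stdout

    # Detailed volume view: pool, copy count, FlashCopy map count, remote copy name, etc.
    print(svc(f"lsvdisk {VOLUME}"))
    # Remote copy (MM/GM/GMCV) relationships where this volume is the master.
    print(svc(f"lsrcrelationship -filtervalue master_vdisk_name={VOLUME}"))
    # FlashCopy mappings where this volume is the source.
    print(svc(f"lsfcmap -filtervalue source_vdisk_name={VOLUME}"))
    # Per-copy details: mirrored copies, thin/compressed attributes, owning pool.
    print(svc(f"lsvdiskcopy {VOLUME}"))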

The third step:

Now this is where things start to get interesting. Take what you know about the volumes that experienced a performance problem and break out your copy of the Spectrum Virtualize IO stack (shamelessly borrowed here from the FS9100 architecture Redbook). Immediately cross off all of the features the affected volumes are not using: if the volumes in question aren't using a feature, that feature can't be the cause of the performance issue. Note that SCSI Target, SCSI Initiator, RAID, Virtualization, and (usually) Cache are always possible contributors and should not be crossed out.
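
If you like to automate the bookkeeping, the elimination step can be expressed in a few lines of Python. The layer names below are paraphrased from the IO stack diagram rather than copied exactly, and the volume flags are hypothetical answers from step two:

    # Start from the layers that are always possible contributors, then add back
    # only the optional features the affected volume actually uses.
    ALWAYS_POSSIBLE = {"SCSI Initiator", "SCSI Target", "RAID", "Virtualization", "Cache"}
    OPTIONAL_LAYERS = {
        "Replication (MM/GM/GMCV)": "in_remote_copy",
        "FlashCopy": "in_fcmap",
        "Volume Mirroring / HyperSwap": "mirrored",
        "Compression / Data Reduction": "space_efficient",
    }

    volume = {  # hypothetical answers gathered in step two
        "in_remote_copy": True,
        "in_fcmap": False,
        "mirrored": False,
        "space_efficient": True,
    }

    suspects = ALWAYS_POSSIBLE | {layer for layer, flag in OPTIONAL_LAYERS.items() if volume[flag]}
    print(sorted(suspects))   # everything else gets crossed off the diagram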

The fourth step:

At this point we can start looking at the detailed performance data. In Storage Insights, go to Actions => View Performance for your storage device. Once in the performance view, click Actions again and select Volume Performance, as shown in the picture to the right.

Once here, filter on one of the volumes that you identified as suffering from the performance impact. Then filter on the date/time range when the problem was experienced (start and end time), adding an extra 30 minutes before and after so we can check for dramatic changes in the data. Finally (for write problems), change the graph metrics to Destage Response Time - VC and Destage Response Time - VCC. If your problem happens to be focused on reads, choose Stage instead of Destage. VC stands for Volume Cache, also known as the upper cache, and VCC stands for Volume Copy Cache, also known as the lower cache, in IBM Spectrum Virtualize. From there you will have built yourself a chart that resembles something like this:


The Destage Response Time measures how long it takes to move an I/O out of cache. At the VC level, the barriers to destaging an I/O are processing FlashCopy, volume mirroring, and space-efficient technologies (compression, deduplication, etc.), plus being able to write the data into the VCC level. At the VCC level, the barrier to destaging an I/O is allocating space to write to and saving it to the MDisks. In the graph here, there is a large gap between the VC and VCC levels, with the VC level showing considerably higher latency.
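
To put a number on that gap, you can export the chart data and compare the two series. The sketch below assumes a CSV export with one column per metric; the file name and column headers are guesses, so adjust them to whatever your export actually contains:

    # Compare upper-cache (VC) and lower-cache (VCC) destage response times.
    import pandas as pd

    df = pd.read_csv("volume_destage.csv")               # hypothetical export file
    vc = df["Destage Response Time - VC (ms/op)"]        # guessed column name
    vcc = df["Destage Response Time - VCC (ms/op)"]      # guessed column name

    print(f"avg VC latency:  {vc.mean():.2f} ms/op")
    print(f"avg VCC latency: {vcc.mean():.2f} ms/op")
    print(f"avg VC-VCC gap:  {(vc - vcc).mean():.2f} ms/op")
    # A large positive gap points at the layers between VC and VCC (FlashCopy,
    # mirroring, compression, replication) rather than the back-end storage.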

This indicates that I/O is NOT being delayed by the VCC level or anything under it in the stack, so we can cross those out. At this point in our example, our IO stack diagram looks something like this. Our data shows that the problem was caused by either replication, compression, a host issue (SCSI Target), or a defect in cache. From here, we have identified the probable causes of the performance problem and can do more detailed analysis on the specific features identified as possible contributors to latency; in the example used here, those would be Real-time Compression and Replication.

That is all for now. In future posts I will help with debugging some of the more advanced features of the product. In the meantime, if you have any questions please feel free to comment, find me on LinkedIn, or follow me on Twitter.



Comments

  1. Good start my friend! Do not forget the victim and perpetrator dilemma. Sometimes the victim is just a victim of a perpetrator that is exhausting resources; the perpetrator is not suffering, but it is the root cause. Happy hunting!!

    1. Yes my friend. Here we just identify where the i/o is being held up. I am hoping in the coming days or weeks to help give guidance on how to dig deeper into the individual software components and separate the victims from the causes.


