Introduction

Suppose you are facing a performance bottleneck in an on-premises or cloud infrastructure which, after troubleshooting, turns out to be related to storage performance. Some storage subsystems, disk appliances or virtual disks are misbehaving or under-performing. You need to monitor disk IOPS performance in detail and identify the root cause of the storage performance problem.

Storage infrastructure assessment

First off, you need to audit/assess your environment, in order to determine your storage components and their architecture. Storage components commonly include, but are not limited to, the following:

File storage (CIFS, NFS, SMB protocols)
Block storage (FC, FCoE, iSCSI protocols)
Object storage, usually found in cloud-based systems, such as Amazon S3 and S3-compatible services.
Various data stores, including SQL and NoSQL databases
NAS and SAN appliances (e.g. NetApp, Synology, IBM/Dell SAN)
Virtual SAN appliances (e.g. Starwind)
Backup appliances
In-memory database systems and disk caching systems
Various end-user storage devices on optical (Blu-ray), photonic, electromechanical (SATA hard disks) and purely electronic (SSD/flash disks) storage media.

Each storage component communicates with other storage components either directly, via an electronic circuit-based communication bus, or via a wired or wireless network. The various storage components also have interfaces, protocols (e.g. SATA, SCSI, iSCSI) and interface converters which allow them to interconnect with another storage appliance or communication link. It is essential to map your workloads to each of the connecting points in your storage infrastructure and to understand its idle/baseline performance capacity at any given point. This is why storage benchmarking is essential and should ideally be performed right after a storage infrastructure is deployed.
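
One way to put such a baseline to use is to record it right after deployment and compare later measurements against it. The following is a minimal Python sketch of that idea. The component names, the IOPS figures and the 20% tolerance are illustrative assumptions, not values taken from any real system.

```python
# Hypothetical baseline figures (IOPS) recorded right after deployment.
BASELINE_IOPS = {"san-pool-1": 50000, "nas-share-1": 8000}


def degraded_components(measured, baseline, tolerance=0.20):
    """Return components whose measured IOPS fall more than `tolerance` below baseline.

    The result maps each degraded component to its fractional drop (0.25 = 25% lower).
    """
    degraded = {}
    for name, base in baseline.items():
        current = measured.get(name)
        if current is None:
            continue  # no measurement available for this component
        drop = (base - current) / base
        if drop > tolerance:
            degraded[name] = round(drop, 3)
    return degraded


if __name__ == "__main__":
    # Assumed measurements taken during the incident.
    measured = {"san-pool-1": 47000, "nas-share-1": 5000}
    print(degraded_components(measured, BASELINE_IOPS))
```

With these assumed numbers only the NAS share is flagged, since it delivers 37.5% fewer IOPS than its baseline, while the SAN pool stays within tolerance.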

Storage traffic paths

It is also important to map your storage traffic and understand the storage components and the connectivity links involved (electronic or photonic buses and networking links). You need to account for the various traffic hops and for virtual as well as physical components, especially in the case of hypervisors, virtual machines and containers. If your architecture spans multiple physical on-premises locations and cloud locations, then at some level part of your data traffic will almost certainly traverse firewalls, routers, external WAN links, site-to-site VPN links and leased lines as well. If you have other special configurations, e.g. nested virtualization, these will also affect performance and should be taken into account. The following is a possible storage traffic path for an on-premises infrastructure.

Local disk on end-user appliance --- Access ethernet switch (wired) or access point (wireless) --- Distribution ethernet switch --- Core ethernet switch --- hypervisor or container physical host network physical interface cards (pNIC) --- hypervisor or container host virtual switch (vswitch) --- virtual machine or container virtual NICs (vNIC) --- storage ethernet switch --- storage system (NAS, SAN).

Each storage component has an expected performance capacity for read and write operations, which is measured in input/output operations per second (IOPS). Similarly, networking link throughput is measured in bits per second (bps). In modern network infrastructures with hyper-converged infrastructure (HCI) components, links typically run at 1 Gbps, 10 Gbps or 40 Gbps. With NIC bonding and NIC teaming, the aggregate bandwidth can reach a multiple of the bandwidth of a single interface. There are cases in which the storage IOPS capacity cannot keep up with what the network can deliver: a very low latency, high bandwidth network will allow large amounts of data to travel from the hypervisor hosts to the storage system, and the storage system may not be capable of handling that traffic demand without performance degradation.
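
To compare the two units, IOPS can be converted to throughput by multiplying by the I/O block size. The sketch below shows that arithmetic in Python. The 4 KiB block size, the 100,000 IOPS figure and the single 10 Gbps link are assumptions chosen for illustration, and protocol overhead is ignored.

```python
def iops_to_gbps(iops, block_size_bytes):
    """Throughput in Gbps generated by `iops` operations of `block_size_bytes` each."""
    return iops * block_size_bytes * 8 / 1e9


def max_iops_for_link(link_gbps, block_size_bytes):
    """Upper bound on the IOPS a link of `link_gbps` can carry at this block size."""
    return link_gbps * 1e9 / (block_size_bytes * 8)


if __name__ == "__main__":
    block = 4 * 1024        # 4 KiB per I/O (assumed)
    storage_iops = 100_000  # assumed storage system capacity
    link_gbps = 10          # one 10 Gbps link (assumed)

    print(f"Storage can generate {iops_to_gbps(storage_iops, block):.2f} Gbps")
    print(f"Link can carry about {max_iops_for_link(link_gbps, block):,.0f} IOPS")
```

Under these assumptions the storage system saturates at roughly 3.28 Gbps, well below what the 10 Gbps link can carry, so the storage system rather than the network would be the limiting component.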

Storage performance considerations

Monitoring, optimizing and troubleshooting your infrastructure for storage performance is therefore based on a combination of storage performance considerations, the most noteworthy of which are the following.

Make use of storage vendor tools and best practices (e.g. IBM or Dell SAN systems). Ensure that you optimize your storage system for performance, taking into account design decisions such as the following.
- Number and type of storage pools
- Number and type (SATA, SSD) of physical disks and RAID configuration in each storage pool
- Number and type of storage volumes (LUNs)
- Storage protocols used
- Usage of multipathing (e.g. MPIO in Windows)
- Usage of data compression, deduplication and encryption
- Usage of multi-tiering and auto-tiering or manual tiering
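
To illustrate how the number of disks and the RAID configuration interact, the following Python sketch applies the commonly quoted rule-of-thumb RAID write penalties to estimate the front-end IOPS of a storage pool. The penalty values, the per-disk IOPS figure and the read/write mix are assumptions for illustration only; controller caches and vendor-specific RAID implementations can change the result considerably, so vendor sizing tools should take precedence.

```python
# Commonly quoted rule-of-thumb write penalties
# (back-end I/Os needed per front-end write). Assumed values, not vendor data.
RAID_WRITE_PENALTY = {"raid0": 1, "raid1": 2, "raid10": 2, "raid5": 4, "raid6": 6}


def effective_iops(disks, iops_per_disk, raid_level, write_fraction):
    """Estimate front-end IOPS of a pool for a given read/write mix."""
    if not 0.0 <= write_fraction <= 1.0:
        raise ValueError("write_fraction must be between 0 and 1")
    penalty = RAID_WRITE_PENALTY[raid_level]
    raw = disks * iops_per_disk  # aggregate back-end IOPS of all disks
    read_fraction = 1.0 - write_fraction
    return raw / (read_fraction + write_fraction * penalty)


if __name__ == "__main__":
    # Assumed pool: 8 disks at 150 IOPS each, 30% writes.
    for level in ("raid10", "raid5", "raid6"):
        print(level, round(effective_iops(8, 150, level, 0.3)))
```

For the same eight assumed disks, the estimate drops as the write penalty grows, which is one reason write-heavy workloads are often placed on RAID 10 rather than RAID 5 or RAID 6.
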
Take into account the storage configuration best practices for all storage components which comprise the storage traffic path mentioned earlier. For example, you should consult the Microsoft Learn documentation for SQL Server storage best practices and Hyper-V failover cluster storage best practices. An example of such design considerations is provided in the following article: https://www.starwindsoftware.com/blog/dont-fear-but-respect-redirected-io-with-shared-vhdx. Also, a great analysis of SQL Server IOPS performance troubleshooting can be found at https://www.red-gate.com/products/redgate-monitor/resources/articles/monitor-sql-server-io. Each application which processes heavy data loads (such as DBMS and other data stores), as well as all major hypervisor, container and hyper-converged infrastructure platforms, has its own list of best practices regarding storage operations and performance.
In some cases, thorough examination of your infrastructure may reveal that the storage performance issues are in fact networking performance issues, or that the root cause of low storage performance is excessive traffic generated by malicious software, faulty hardware components or under-performing applications, such as applications connecting to SQL Server databases. In this latter case, performing SQL query optimization could largely improve storage performance.
Make use of IOPS monitoring and benchmarking tools such as Microsoft Diskspd, which replaced SQLIO. The full range of Diskspd command-line parameters is available at: https://github.com/Microsoft/diskspd/wiki/Command-line-and-parameters. Consult the Microsoft Learn article for detailed instructions on how to use Diskspd and how to interpret the IOPS test results: https://learn.microsoft.com/en-us/azure-stack/hci/manage/diskspd-overview.
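
As a starting point for scripting repeatable Diskspd runs, the Python sketch below assembles a Diskspd command line from a few test parameters. It relies only on documented Diskspd switches (-b block size, -d duration in seconds, -t threads per target, -o outstanding I/Os, -w write percentage, -r random I/O, -Sh disable software and hardware caching, -L latency statistics, -c create a test file of the given size). The helper name, the default values and the target path are hypothetical, and the switches should be checked against the Diskspd documentation for the version you have installed.

```python
def build_diskspd_args(target, block_size="4K", duration_s=60, threads=2,
                       outstanding=32, write_pct=0, random_io=True,
                       file_size="1G", exe="diskspd.exe"):
    """Build the argument list for a Diskspd test run (hypothetical helper)."""
    args = [
        exe,
        f"-b{block_size}",    # I/O block size
        f"-d{duration_s}",    # test duration in seconds
        f"-t{threads}",       # threads per target
        f"-o{outstanding}",   # outstanding I/Os per thread
        f"-w{write_pct}",     # percentage of writes (0 = read-only)
    ]
    if random_io:
        args.append("-r")     # random instead of sequential I/O
    args += [
        "-Sh",                # disable software and hardware caching
        "-L",                 # collect latency statistics
        f"-c{file_size}",     # create a test file of this size
        target,
    ]
    return args


if __name__ == "__main__":
    # Hypothetical dedicated test file; never point this at production data.
    print(" ".join(build_diskspd_args(r"D:\iotest\io.dat", write_pct=30)))
```

On a Windows host with Diskspd installed, the returned list could be passed to subprocess.run to execute the test. Always target a dedicated test file, since write tests overwrite the contents of the target.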

Last but not least, always ensure that you have some form of backup of your primary data. Even if the main SAN system provides duplicate and redundant power supplies, controllers and disk arrays, it is always critical to have your data replicated to an on-site or off-site location as well.