Across the board, users are moving to all flash for all their primary storage requirements. Years ago, using Flash was a great way to provide super low latency, wicked high IOPS, and better $/IOP for performance-sensitive apps. But since those early days, costs have come down, capacities have increased, and reliability has proven to be rock solid. And today, even the arrays themselves have become more efficient.
So the good news is more apps are taking advantage of modernized infrastructure leveraging the All Flash data center. But with all this Flash goodness, we are observing an interesting phenomenon. The performance of VMAX All Flash is so high that, in certain user deployments, the SAN infrastructure cannot handle the performance coming out of the array. This phenomenon, known as “slow drain”, is typically observed when SAN infrastructure becomes congested, and like rush hour traffic, everything backs up and slows to a crawl.
Dad, When Are We Gonna Be There?
The cause of slow drain in the network is not all that unlike the typical causes of traffic congestion that everyone has had to deal with, some of us on a daily basis. And just like your 9-year-old in the back seat, users can quickly become impatient and frustrated with why things are moving so slowly. And with both slow drain and traffic congestion, there seem to be three key contributors to this issue.
- Mixing older, slower network components in front of the high performance All Flash Array. This includes switches and host adapters with a mix of 4Gb, 8Gb or 16Gb network speeds. The slower paths limit the amount of data and IO the arrays can deliver to the host. Think of a highway designed for 70MPH, but some of the cars are only capable of 25MPH. Even though your car can go faster, you can’t get ahead of the slower traffic in front of you. And the busier the roads, like rush hour, the more traffic backs up.
- Using an insufficient number of paths to fan in or fan out between the hosts, switches, and storage array. Without enough paths, or highway lanes, traffic can back up quickly. More lanes mean traffic can flow more efficiently. Intelligent path managers, like PowerPath, can help by allowing traffic to use the right lane to either pass slower cars or switch away from lanes that are too congested. But this only works if there are enough lanes to switch between.
- Running inter-switch links (ISLs) that are oversubscribed. Switches are not only used to connect hosts and storage arrays, but they are also connected to other switches to create larger fabrics. Switch to switch data flows across dedicated paths (the ISLs), providing an increased number of hosts and storage ports that can be included in a single fabric. But if there are not enough ISLs or they are slower links, performance can be impacted. Think of an 8 lane highway with all traffic having to go through a single toll booth. Worse yet, imagine if that toll booth didn’t accept a “Fast Pass” and each transaction had to be handled manually (BRUTAL!!)
So just like your daily commute to the office, slow drain in the storage network is no Sunday drive in the country. It means the slowest components in the network can drag down everything else, slowing down or otherwise impacting the performance of hosts and apps that share the same fabric.
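To put rough numbers on the contributors above, here is a minimal Python sketch that estimates two ratios: how much faster an array port is than a host adapter, and how oversubscribed a set of ISLs is. The function names are hypothetical and the throughput table uses nominal, per-direction Fibre Channel figures as an approximation, so treat the results as back-of-the-envelope estimates rather than measurements.

```python
# Nominal per-direction Fibre Channel throughput in MB/s, keyed by link speed in Gb.
# These are approximate, commonly quoted figures and are an assumption of this sketch.
FC_MBPS = {4: 400, 8: 800, 16: 1600}


def speed_mismatch(array_port_gb, host_port_gb):
    """How many times faster the array port is than the host port."""
    return FC_MBPS[array_port_gb] / FC_MBPS[host_port_gb]


def isl_oversubscription(edge_port_speeds_gb, isl_speeds_gb):
    """Total edge-port bandwidth divided by total ISL bandwidth."""
    isl_total = sum(FC_MBPS[s] for s in isl_speeds_gb)
    if isl_total == 0:
        raise ValueError("at least one ISL is required")
    edge_total = sum(FC_MBPS[s] for s in edge_port_speeds_gb)
    return edge_total / isl_total


# A 16Gb array port feeding a 4Gb host adapter.
mismatch = speed_mismatch(16, 4)

# 24 edge ports at 8Gb sharing two 16Gb ISLs.
ratio = isl_oversubscription([8] * 24, [16] * 2)

print(f"array-to-host speed mismatch: {mismatch:.1f}x")
print(f"ISL oversubscription: {ratio:.1f}:1")
```

The higher either number climbs, the more likely the toll booth backs up during rush hour. What counts as an acceptable ratio depends on the workload and the fabric design.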
How Do We Get There From Here?
If your flash array is deployed in a SAN with slow drain issues, it is advisable, when practical, to fix the root cause of the problem, i.e., build a faster highway. However, it is not always practical to address all root causes. For example, a large number of servers may have slower paths, such as 4Gb FC host adapters, while the array runs at 16Gb FC speed. One effective option is to “throttle” the array so that it does not push more traffic than the SAN can carry. Some arrays offer QoS settings to cap IOPS and bandwidth to mitigate the issue.
For example, VMAX All Flash can help you alleviate the issue by leveraging the Host IO limits feature of HyperMAX OS. The setting can be turned on when creating new storage groups, or it can be enabled, disabled, or tweaked for existing storage groups. When enabled, Host IO limits allows IOs/s and MBs/s caps to be set that match the performance of the slowest component in the SAN and prevent traffic backups. It’s similar in concept to the traffic lights used on major highway on-ramps to stagger the flow of incoming traffic. They keep traffic flowing more efficiently by minimizing the backups at those on-ramps that can quickly ripple out and cause major delays. The result is a more balanced, less stressful commute, giving your data a more enjoyable ride to your users.
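As a rough illustration of how such caps might be sized, the Python sketch below derives a MBs/s cap from the host paths serving a storage group and converts it into an IOs/s cap for a given average IO size. It is only a sizing aid under stated assumptions: the function names are hypothetical, the throughput table uses nominal per-direction Fibre Channel figures, the 80% headroom is an arbitrary starting point, and 1 MB is taken as 1024 KB. It does not call any array API. The resulting values would still have to be applied through your usual management tools.

```python
# Nominal per-direction Fibre Channel throughput in MB/s, keyed by link speed in Gb.
# Approximate figures, assumed for this sketch.
FC_MBPS = {4: 400, 8: 800, 16: 1600}


def suggest_mbps_cap(host_path_speeds_gb, headroom_pct=80):
    """Sum the nominal bandwidth of the host paths and keep a percentage of it."""
    if not host_path_speeds_gb:
        raise ValueError("at least one host path is required")
    if not 0 < headroom_pct <= 100:
        raise ValueError("headroom_pct must be in the range 1..100")
    total = sum(FC_MBPS[s] for s in host_path_speeds_gb)
    return total * headroom_pct // 100


def suggest_iops_cap(mbps_cap, avg_io_kb):
    """Convert a MB/s cap into an IOPS cap for a given average IO size in KB."""
    if avg_io_kb <= 0:
        raise ValueError("avg_io_kb must be positive")
    return mbps_cap * 1024 // avg_io_kb


# A host with two 4Gb FC paths and a workload averaging 8 KB per IO.
mbps_cap = suggest_mbps_cap([4, 4])
iops_cap = suggest_iops_cap(mbps_cap, avg_io_kb=8)

print(f"suggested cap: {mbps_cap} MB/s, {iops_cap} IO/s")
```

Starting below the theoretical maximum leaves room for other traffic on the same paths. The caps can then be raised or lowered once you see how the fabric behaves.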
Within Dell EMC, we have extensive experience on this issue and can help address the problem. In fact, there are many online resources available to identify and troubleshoot slow drain issues.
Here’s a great deep dive from Dell EMC’s storage networking guru Erik Smith on slow drains and how they can be impacting your Storage Area Network:
And here’s a handy white paper on Host IO limits and how they can help put thresholds and policies in place to avoid the issue.