In EMC Proven Solution testing for a given use case, I came across a serious issue with iSCSI responses, which inherently causes slow storage response times and very slow cluster polling and enumeration.
Windows 2008 R2 with Hyper-V. 6-node Hyper-V cluster. 65 Disks. 2x iSCSI NIC per node.
What I saw was that the cluster was slow to bring a VM’s Virtual Machine Configuration cluster resource online when the VM had a large number of disks configured. These resources would time out as they passed their default pending and deadlock time-outs.
Firstly, when you online, refresh (or fail over) a VM configuration, Hyper-V performs a sanity check to ensure the underlying components of the VM (network, storage, etc.) are available. This means scanning all the disks.
So, for a VM with 25 disks (as in my case), the VM config took over 10 minutes to come online!
Why? Well, I was working with Microsoft OEM Support, and they asked me to try tightening TcpAckFrequency to 1, so that every received segment is acknowledged immediately. I said, OK, I’ll try it.
This brought the online time from 10 minutes to 19 seconds! Result…or maybe…
I needed to fully understand the issue, so out came Wireshark in order to run some Ethernet traces…
Let me explain what the actual issue is…
The problem is basically that iSCSI is a victim of Nagle’s Algorithm, an optimization that minimizes congestion on TCP networks caused by TCP overhead. iSCSI is essentially stung by send coalescing.
From Windows 2003 onwards, the default TCP delayed-acknowledgment time is 200ms.
This means that if a TCP segment (1460 bytes of payload with a standard 1500-byte MTU) is not full, it may need to wait up to 200ms before the data is actually sent from the Windows host.
This is a problem when using iSCSI for two reasons:
1) You need the fastest response time possible from storage for your application
2) iSCSI payloads can be very small, especially SCSI control OpCodes (CDBs).
Now, the iSCSI CDB (control/query) OpCode commands involved in enumerating the disks during the online action have a tiny payload (10 bytes).
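To put that 10-byte figure in perspective, here is a minimal sketch that builds a READ CAPACITY(10) CDB and compares its size to a full TCP segment. The 1460-byte segment size is my assumption (1500-byte MTU minus IPv4 and TCP headers), and the helper names are made up for illustration. On the wire the CDB travels inside an iSCSI PDU with its own header, so the real payload is somewhat larger, but still nowhere near a full segment.

```python
import struct

READ_CAPACITY_10 = 0x25  # SCSI READ CAPACITY(10) operation code
ASSUMED_MSS = 1460       # assumption: 1500-byte MTU minus 40 bytes of IPv4 + TCP headers


def build_read_capacity_10() -> bytes:
    """Return a 10-byte READ CAPACITY(10) CDB with all optional fields zeroed."""
    # Layout: opcode (1), reserved (1), LBA (4), reserved (2), PMI byte (1), control (1)
    return struct.pack(">BBIHBB", READ_CAPACITY_10, 0, 0, 0, 0, 0)


def segment_fill_ratio(payload_len: int, mss: int = ASSUMED_MSS) -> float:
    """Fraction of one TCP segment that a payload of this size would occupy."""
    return payload_len / mss


cdb = build_read_capacity_10()
print(f"CDB length: {len(cdb)} bytes")
print(f"Share of one segment: {segment_fill_ratio(len(cdb)):.2%}")
```

A command this small occupies well under 1% of a segment, which is exactly the kind of traffic that send coalescing and delayed ACKs penalize.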
Looking at the Ethernet trace, the cluster disk driver is performing SCSI(10) read commands (LUN inquiry, read capacity, etc.). It does this sequentially, at least twice for each disk involved in the virtual machine.
With the default delayed ACK time of 200ms, each SCSI Read command issued by the cluster node carries a 10-byte payload (66 bytes on the wire). The SCSI Read command does not fill a TCP segment, so although the TCP payload is sent to the storage controller and the controller responds, the node waits to send the ACK until a segment on that NIC is filled or the maximum ACK delay of 200ms is reached.
So, let’s say I am the cluster disk driver and want to read LUN metadata as part of onlining a cluster disk…
…I send the command via the iSCSI Initiator, the storage controller responds…wait….wait….wait…the delayed ACK timer fires after a max of 200ms….only then is my ACK sent to storage….and the TCP transmission completes and the data is returned to the cluster disk driver process.
This means that for each SCSI command we send to storage (target and target LUN), we will typically end up waiting for the 200ms timer! This happens regardless of how busy SCSI data I/O is, because the algorithm works at per-socket (Winsock) granularity.
In a Windows iSCSI cluster, this issue really comes to light because the cluster operates (controls/validates) on resources in the cluster in a sequential manner.
So, for say 65 LUNs, each LUN takes significantly longer to validate/control because the 200ms ACK delay is incurred multiple times per LUN.
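The arithmetic behind this can be sketched roughly as follows. This is a back-of-the-envelope model, not a measurement: it assumes every command is issued sequentially and every one stalls for the full 200ms, and the 5ms storage latency is an assumed figure in the sub-10ms range. The function names are mine.

```python
def estimated_online_seconds(disks, commands_per_disk,
                             ack_delay_ms=200.0, storage_latency_ms=5.0):
    """Total time if every command runs sequentially and waits for the ACK delay."""
    per_command_ms = storage_latency_ms + ack_delay_ms
    return disks * commands_per_disk * per_command_ms / 1000.0


def implied_stalls_per_disk(total_seconds, disks, ack_delay_ms=200.0):
    """How many full ACK-delay stalls per disk would add up to the observed time."""
    return total_seconds * 1000.0 / (disks * ack_delay_ms)


# 10 minutes for 25 disks works out to roughly this many 200ms stalls per disk.
stalls = implied_stalls_per_disk(total_seconds=600, disks=25)
print(f"Implied stalls per disk: {stalls:.0f}")

# Same workload with and without the ACK delay (the stall count is derived, not measured).
print(f"With 200ms delay: {estimated_online_seconds(25, stalls):.0f}s")
print(f"Without delay:    {estimated_online_seconds(25, stalls, ack_delay_ms=0):.0f}s")
```

Under those assumptions, the same number of commands drops from minutes to seconds once the stall is removed, which is in line with what I observed after the change.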
Storage response times should be in the sub-10ms range, not 200ms 🙂
This is also why I did not see this issue in Fibre Channel environments.
So, while setting TcpAckFrequency to 1 on each iSCSI NIC helps, it may have adverse performance implications in terms of wire and storage port congestion.
But…for iSCSI Networks, this should not really be of concern because by best practice, you should have isolated iSCSI networks with minimal hops between host and storage.
The real fix, I believe, is for the iSCSI Initiator to set the TCP_NODELAY socket option. This bypasses the Nagle Algorithm completely for that socket, so you don't even depend on an ACK trigger (a setting of 1 is tight, but it is still a trigger) – the SCSI command will fire immediately.
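For anyone unfamiliar with the option, this is what setting TCP_NODELAY looks like on an ordinary socket. It is only a generic illustration using Python's standard socket module, not the iSCSI Initiator's code (which I have no visibility into), and the helper names are mine.

```python
import socket


def make_nodelay_socket() -> socket.socket:
    """Create a TCP socket with Nagle's algorithm disabled via TCP_NODELAY."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # A non-zero value disables send coalescing for this socket only.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock


def nodelay_enabled(sock: socket.socket) -> bool:
    """Read the option back to confirm it took effect."""
    return sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0


sock = make_nodelay_socket()
print("TCP_NODELAY enabled:", nodelay_enabled(sock))
sock.close()
```

The point is that the option is scoped to a single socket, so an application can opt out of coalescing for its own small, latency-sensitive messages without changing behaviour for everything else on the NIC.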
I have a design change request in for Microsoft to consider this as the way forward. Probably won't see it until Windows 8, but hey!
How to set the TCPAckFrequency on your iSCSI NICs
Regedit – back up your registry before proceeding 🙂
Subkey: HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\Interfaces\<Interface GUID>
Value Name: TcpAckFrequency
Value Type: REG_DWORD, number
Valid Range: 0-255
Set to 1.
Do this only for your iSCSI NICs, unless directed otherwise.
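If you have several nodes and NICs to do, a small script can generate the registry text for review instead of typing it by hand. This is a sketch only: the function name is mine, the all-zero GUID is a placeholder (substitute the interface GUIDs of your own iSCSI NICs), and the script only builds the text, it does not touch the registry.

```python
KEY_TEMPLATE = (
    r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services"
    r"\Tcpip\Parameters\Interfaces\{guid}"
)


def build_reg_file(interface_guids, ack_frequency=1):
    """Return .reg-style text that sets TcpAckFrequency for each given interface."""
    if not 0 <= ack_frequency <= 255:
        raise ValueError("TcpAckFrequency must be in the range 0-255")
    lines = ["Windows Registry Editor Version 5.00", ""]
    for guid in interface_guids:
        lines.append("[" + KEY_TEMPLATE.format(guid=guid) + "]")
        # REG_DWORD values are written as eight hexadecimal digits.
        lines.append(f'"TcpAckFrequency"=dword:{ack_frequency:08x}')
        lines.append("")
    return "\r\n".join(lines)


# Placeholder GUID for illustration only; list only your iSCSI NICs here.
print(build_reg_file(["{00000000-0000-0000-0000-000000000000}"]))
```

Review the generated text against the subkey above before importing it anywhere.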
Hope this helps