Microsoft® iSCSI Target 3.3 availability and performance white paper (2010)

noname studio

published: December 2010


Abstract

In October 2010 noname studio independently conducted a set of iSCSI performance benchmarks using the recently released Microsoft® iSCSI Target 3.3, which is distributed to OEMs with Windows Storage Server 2008 R2 editions. The test series focused on gigabit Ethernet over copper (1000BASE-T) with a single link. The results were satisfactory in comparison with other software targets, but the product is still not recommended for high-performance applications.

Our objective

Our first goal was to test whether Microsoft® iSCSI Target 3.3 can offer good performance compared to other iSCSI Target software solutions. Furthermore, we wanted to check whether a single 1GbE link can be a bottleneck in an iSCSI SAN scenario. Finally, there is a long-running discussion about adjusting TCP registry values for higher bandwidth and lower latency under Microsoft® Windows products; we put the proposed solutions to a proof-of-concept trial.

Given these goals, we decided to keep the test matrix at this basic level. This is why we purposely refrained from further test scenarios concerning redundancy and high availability, such as MPIO, Clustered SAN and Cluster Shared Volumes.

Our observations

IPERF

The first important step in validating the test environment was to verify that the Ethernet connection was fast and stable enough to support iSCSI load. Taking 128 MB/s as the theoretical bandwidth of 1GbE (1024 Mbit/s divided by 8; the nominal line rate of 10^9 bit/s corresponds to 125 MB/s), we expected 110-124 MB/s in one direction. Remark 1: for MTU = 1500 bytes the payload is only 1460 bytes, a factor of about 0.973 of the bandwidth, which results in 124.5866 MB/s at zero latency. With jumbo frames the factor is around 0.996, which should lead to a theoretical bandwidth of 127.431 MB/s. The theoretical performance “boost” between the two values is then around 2.28 %. Remark 2: since hard disk transfer rates are generally measured in megabytes per second, and iSCSI is essentially a (virtual) disk provisioning protocol, this unit is used throughout the document.
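The arithmetic in Remark 1 can be reproduced with a short sketch. It follows the text's simplification that only the 40 bytes of IPv4 and TCP headers count as overhead, and it keeps the 128 MB/s link figure used above. The jumbo frame MTU of 9000 bytes is our assumption; the text only gives the resulting factor.

```python
# Payload share of a packet, counting only IPv4 + TCP headers as overhead.
HEADER_BYTES = 40   # 20 bytes IPv4 + 20 bytes TCP, no options (assumption)
LINK_MB_S = 128.0   # link figure used in the text (1024 Mbit/s / 8)
JUMBO_MTU = 9000    # assumed jumbo frame size, not stated in the text


def payload_factor(mtu: int) -> float:
    """Fraction of each packet that carries payload."""
    return (mtu - HEADER_BYTES) / mtu


def payload_bandwidth(mtu: int, link_mb_s: float = LINK_MB_S) -> float:
    """Payload bandwidth in MB/s at zero latency."""
    return payload_factor(mtu) * link_mb_s


standard = payload_bandwidth(1500)
jumbo = payload_bandwidth(JUMBO_MTU)
gain_percent = (jumbo / standard - 1) * 100

print(f"MTU 1500: factor {payload_factor(1500):.3f}, {standard:.3f} MB/s")
print(f"MTU {JUMBO_MTU}: factor {payload_factor(JUMBO_MTU):.3f}, {jumbo:.3f} MB/s")
print(f"theoretical gain: {gain_percent:.2f} %")
```

Ethernet framing overhead (preamble, frame header, inter-frame gap) is deliberately left out here, as it is in the text, so real-world ceilings are somewhat lower.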

To measure the capacity of the network cards, which used their latest stable driver releases and were connected directly over a copper cable, we used Iperf and visualized the results with Jperf (see tech specs in Appendix B). Enabling or disabling the LLTD protocol made no difference to the performance. Much more important was the TCP window size. When tested with the default 0.01 MB, the bandwidth averaged 41.60 MB/s.


With 0.06 MB (64 KB), the default TCP window size in the Microsoft® TCP/IP stack, the bandwidth was also very unstable and well below the maximum bandwidth expectations, normally staying somewhere between 94 and 98 MB/s.

Only after increasing the TCP window size to 0.13 MB (128 KB) could we produce reliable transfer rates of 113 MB/s. Further increasing the window size up to 0.5 MB did not bring any significant changes, so our recommendation is any TCP window size of 128 KB or higher.
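Why the window size matters can be illustrated with the usual rule of thumb that a single TCP stream cannot move more than one window per round trip. The sketch below is a simplified model, not a reproduction of our measurements: the round-trip time is a hypothetical value that was not measured in these tests, and 8 KB is our reading of the 0.01 MB default.

```python
def window_limited_mb_s(window_bytes: int, rtt_seconds: float) -> float:
    """Upper bound in MB/s (1 MB = 2**20 bytes) for one TCP stream."""
    if rtt_seconds <= 0:
        raise ValueError("round-trip time must be positive")
    return window_bytes / rtt_seconds / 2**20


# Hypothetical round-trip time for a direct cable connection (assumption).
HYPOTHETICAL_RTT_S = 0.0002

for window_kb in (8, 64, 128, 512):
    ceiling = window_limited_mb_s(window_kb * 1024, HYPOTHETICAL_RTT_S)
    print(f"{window_kb:>4} KB window: at most {ceiling:.1f} MB/s")
```

In this model a small window caps the throughput well below the link rate, while larger windows move the cap above what the link itself can carry, which matches the observation that going beyond 128 KB brought no further gain.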

IOMETER

The second step was to measure the transfer rates with IOMeter. Since this tool measures I/O operations, which depend heavily on the block size, the read/write ratio, the random/sequential ratio, etc., its output, especially the IOPS values, should be taken with reservations (for further explanations of the importance of IOPS in virtual and SQL servers see our article “SAN performance factors explained”). For our purpose we used it mainly as a reference model against some DELL test environments for which IOMeter profile specifications were published. On the iSCSI Target machine we created a single RAM-resident 2GB disk. The first results, using the “Simulated File Server Configuration” profile (see Appendix B), were very promising, showing values almost twice as high as the DELL MD3220i storage solution (its value being 6859 total IOPS). Further investigation raised some concerns though. On the one hand, our tests used the much faster and smaller RAMDISK, where neither RAID controller latency nor HDD head positioning latency degrades the performance. On the other hand, the total MBps showed values higher than the maximum of 124 MBps, which can only be explained by the fact that the profile consists of 80% reads and 20% writes on a full-duplex Ethernet link.

We then repeated the same profile with the recommended TCP and iSCSI optimizations (see Appendix C). Whereas iSCSI Burst Length and Data Segment Length did not influence the results to any noteworthy degree, the additional “TCP improvements”, which are well documented by iSCSI OEMs, had a measurably negative effect, reducing the transfer rate by about 8.5 %.

We tried to reproduce similar results using 50% reads and 50% writes, expecting that the full-duplex 1GbE link would be utilized in both directions, giving total MBps values of around 210-220. In fact we only reached values as high as 165 MBps.

The values in yellow represent the transfer bandwidth in one direction, which, considering the iperf results with a 64KB TCP window size (~94-98 MBps), is still far from optimal Ethernet utilization.

Finally, this table represents another NIC vendor as a cross-check (for specs see Appendix A, cross-check hardware). Here we can see that with Intel® network cards the IO performance is comparable to that of Broadcom network cards. For further details about the tested NIC teaming please refer to the “NIC teaming” subsection in this paper.

H2BENCHW

The third test sequence was conducted using the block-based disk benchmark h2benchw, available from heise.de. This tool is widely used to measure the performance of physical disks, with test profiles such as zone measurement, sustained read and access time. The results can therefore be used to compare the tech specs of current SATA/SAS hard disks with iSCSI disk benchmarks.

It can be argued that h2benchw is not a tool optimized for SAN performance measurement and that testing primarily sequential reads/writes does not represent real-life scenarios, where multiple SQL servers or VMs trigger random reads/writes on a large scale. We agree with that but also extend this argument to its logical consequence: sequential reads/writes, although not a critical indicator, are firstly very important for file services with large file transfers as well as for SQL log transactions or DB backup/restore procedures. Secondly, at least with spindle-based HDDs, sequential I/Os are performed faster than random I/Os.

With this argument in mind we expected h2benchw to deliver better or at least similar performance values compared to IOMeter. To our disappointment, the maximum transfer rates were somewhere around 74MB/s (read sustained) and 77MB/s (write no delay). The Cisco switch port probes also rarely registered bandwidth utilization higher than 55 percent (of 1000Mbit/s).

Further investigation and feedback from the h2benchw developer revealed that the tool works with a fixed block size of 128 sectors. Since the iSCSI Target reported the RAMDISK as a disk with 512-byte sectors, the transfer block size was 64 KB. Still, we cannot consider this profile a worst-case scenario, given that 64 KB is the recommended cluster size (aka NTFS allocation unit size) for SQL database volumes. Thus in our opinion the tests revealed realistic performance values. These were comparable to entry-level SATA HDDs and significantly slower than enterprise-level SAS HDDs. Only the random access latency proved to be very promising: 0.41 ms read latency can be achieved only with SSDs, and 0.13-0.16 ms write latency is not possible even with SSDs, except when using a hardware RAID controller with a large cache in write-back mode.
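The block size follows directly from the two figures above, and it also tells how many requests per second the measured rates correspond to. A minimal sketch, assuming that the reported MB/s values use 1 MB = 2**20 bytes (the tool's exact unit is not stated here):

```python
SECTORS_PER_REQUEST = 128   # fixed by h2benchw
SECTOR_BYTES = 512          # as reported by the iSCSI Target for the RAMDISK
BYTES_PER_MB = 2**20        # assumption about the unit of the reported rates

block_bytes = SECTORS_PER_REQUEST * SECTOR_BYTES


def requests_per_second(rate_mb_s: float) -> float:
    """Number of fixed-size requests needed to sustain the given rate."""
    return rate_mb_s * BYTES_PER_MB / block_bytes


print(f"block size: {block_bytes} bytes ({block_bytes // 1024} KB)")
for label, rate in (("read sustained", 74), ("write no delay", 77)):
    print(f"{label}: {rate} MB/s = {requests_per_second(rate):.0f} requests/s")
```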

The following table presents the summary of the most relevant tested values.

As can be seen here, no software or registry tweaks could improve the overall performance, and the fluctuations lie within statistical tolerance: once again, as with IOMeter, variations of iSCSI Burst Length, Data Segment Length or TCP window size (see Appendix C) did not measurably improve the overall performance.

A typical curve for the zone measurement read can be seen in the chart below. Note how the line appears to saturate at around 75MByte/s. We could not find out whether this was caused by the iSCSI Target itself or by the implementation of the RAMDISK virtual volume, but considering the better results with IOMeter we tend to think that it was caused by the iSCSI Target (or Initiator) as soon as the fixed 64 KB block size was used. Unfortunately we could not reproduce the same scenario with IOMeter: for that purpose we would have needed the exact profile specifications of h2benchw, which is closed source, and its developer was not prepared to cooperate.

Exactly the same measurements were then repeated on the cross-check hardware platform, and there too we did not observe any significant improvements, the average values also covering the spectrum between 70-70MB/s. On both hardware workbenches, however, we observed some irregularities that could not be related to any human action: we started h2benchw from a batch file in a loop of 3 iterations, with a pause of 30 seconds before each iteration, and occasionally one of the three iterations would show higher values in the write bandwidth. Such an atypical pattern follows below:

Since the “write boost” to ~83MB/s was only sporadic (two out of 12 iterations altogether), it could not be taken into consideration; it is nevertheless annoying not to be able to find its cause.

To recapitulate this subsection: the performance of the virtual disk over iSCSI is not sufficient for applications that use long sequential reads/writes, especially if they use a fixed data block size of 64KB.

User experience – robocopy, richcopy and file sharing

A nice way to illustrate the ambivalent perception of the test results was the “user experience” test phase, where two usage patterns were examined. The first pattern was copying large files directly on the iSCSI initiator; the second was copying large files over a network shared folder from the iSCSI virtual disk. Using robocopy on the iSCSI initiator with a LUN mounted under a drive letter felt very fast, but the reported performance was deceptive:

Looking closely at the figures, it is obviously not possible to push 650MB/s over a 112MB/s link. As usual, the system cache was involved in this behavior: the Task Manager revealed RAM peaks of 1GB with sustained NIC activity (at about 40% of the capacity). Evidently the file was first loaded into RAM and then copied “in the background” over the network.

Richcopy, although capable of switching off the system cache, was also not a reliable information source, reporting 128MB/s and 256MB/s for 1GB files and 1024x1MB files respectively. We refrained from further tests in this pattern.

In the second pattern the iSCSI drive, already mounted and NTFS-formatted on the iSCSI Initiator machine, was shared over the network, and another client accessed this share over CIFS/SMB. As expected, the transfer rates were never higher than ~45MB/s, so the bottleneck here was the SMB protocol itself: either SMBv2 was not working (although this should be the standard negotiation between Windows 7 and Windows Server 2008 R2) or its transfer rates are just as poor as those of SMBv1.
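A back-of-the-envelope calculation shows how far apart the reported and the actual transfer must have been. This is only a sketch: we assume the figure refers to the 1 GByte test file from Appendix B, and we take the 40% NIC utilization at face value.

```python
FILE_MB = 1024           # assumption: the 1 GByte test file
REPORTED_MB_S = 650      # rate reported by robocopy
LINK_MB_S = 112          # usable link rate quoted in the text
NIC_UTILIZATION = 0.40   # sustained NIC activity seen in Task Manager


def seconds_to_transfer(size_mb: float, rate_mb_s: float) -> float:
    """Time needed to move size_mb at a constant rate."""
    return size_mb / rate_mb_s


reported_seconds = seconds_to_transfer(FILE_MB, REPORTED_MB_S)
drain_seconds = seconds_to_transfer(FILE_MB, LINK_MB_S * NIC_UTILIZATION)

print(f"robocopy reports completion after about {reported_seconds:.1f} s")
print(f"writing the cached data over the wire takes about {drain_seconds:.1f} s")
```

Under these assumptions the data keeps flowing to the LUN for roughly twenty seconds after the copy is reported as finished.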

Supplement – NIC teaming

Although it was not within the scope of our objective, we decided to put one statement to the test. The OEM documentation for iSCSI Target V3.3 states: “You should not use network adapter teaming with Microsoft® iSCSI Software Target 3.3 for iSCSI communication.”

There are cases, though, where production servers already have NIC teaming activated, and our question was whether it is at all possible to use iSCSI Target V3.3 under such conditions. For the test bench we teamed two physical interfaces on each side in LACP mode and configured the corresponding ports on the switch in the same way. Since an iSCSI session is a single I/O flow from the switch’s perspective, we a) did not expect any bandwidth improvement and b) considered the load balancing configuration on the switch unimportant. The only benefit of NIC teaming (also known as bonding) in such a case could be redundancy.

As seen from the last IOMeter table, the achieved transfer rates were comparable to those of a single NIC. We could also unplug one of the cables (thereby degrading the logical aggregated link) and still proceed with further test iterations. So presumably Microsoft® simply does not support NIC teaming for iSCSI communication, but it is nevertheless possible to implement it. Still, we do not recommend NIC teaming as a logical link to a SAN appliance: MPIO is currently widely supported and provides redundancy as well as link aggregation.
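The reasoning behind point a) can be sketched as follows. Link aggregation typically picks the member link per flow from a hash over address fields, so all packets of one TCP connection take the same physical link. The hash below (CRC32 over addresses and ports) is only a stand-in for illustration; real switches and NIC drivers use their own vendor-specific functions, and the IP addresses are hypothetical. Port 3260 is the default iSCSI port.

```python
import zlib


def pick_member_link(src_ip: str, dst_ip: str, src_port: int,
                     dst_port: int, member_links: int = 2) -> int:
    """Map a flow to one member link (illustrative hash, not a vendor one)."""
    key = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}".encode()
    return zlib.crc32(key) % member_links


# One iSCSI session = one TCP connection (hypothetical addresses).
session = ("192.0.2.10", "192.0.2.20", 49152, 3260)

# However many packets are sent, the flow always hashes to the same link.
links_used = {pick_member_link(*session) for _ in range(1000)}
print(f"member links used by the single session: {sorted(links_used)}")
```

Only additional flows with different addresses or ports could land on the second link, which is why a single session gains redundancy from teaming but no bandwidth.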

Conclusions

To summarize the three different stress tests: iperf showed that high network utilization is possible and stable for TCP window sizes of 128 KB or higher. Still, setting “TcpWindowSize” in the registry could not improve the transfer rates and, in the case of the IOMeter tests, even degraded the performance slightly. Also, iSCSI configurations with a “Maximum Data Segment Length” of 128KB or higher did not have any notable effect. The answer to the question whether TCP and/or iSCSI “tweaks” can improve iSCSI performance is therefore definitely negative.

Similarly, the second question, whether a 1Gb network connection can be a bottleneck for Microsoft® iSCSI Target 3.3, can be answered negatively. Even IOMeter could not exploit the full technical bandwidth, and the user experience checks showed that the system cache is used as a buffer, with the data afterwards gradually synced to the LUN at approximately 40% of the link capacity. This is presumably by design, to spare the overall network utilization, but it is also a hazardous solution: imagine a switch failure during this background synchronization. Thus there is definitely no reason to purchase 10Gb equipment, and furthermore MPIO could hardly bring improvements in single point-to-point scenarios. Whether Microsoft® iSCSI Target 3.3 could be better utilized with multiple worker processes over MPIO is beyond the scope of this paper.

The third question, whether Microsoft® iSCSI Target 3.3 can offer good performance compared to other iSCSI Target software solutions, can be answered positively. Compared with the results from our previous paper “Microsoft® iSCSI Target 3.2 availability and performance white paper (2010)”, Microsoft® performed as well as (and in some cases slightly better than) Starwind’s software iSCSI Target, and perceptibly better than iStorage software solutions. Still, transfer rates between 70-85MB/s are not satisfactory for a heavily loaded production environment; in such cases hardware iSCSI appliances are highly recommended.

A final consideration: given the overall conclusions, what could be possible deployment scenarios for this software? It is conceivable to use Microsoft® iSCSI Target 3.3 for file and backup services. For example, Microsoft’s integrated backup solution in Windows Server 2008 (respectively Vista) and higher lacks the functionality to perform incremental backups to network shares. Here an iSCSI-advertised LUN can be used by the client as a raw disk, and the VHD file on the iSCSI Target side can be archived to tape or similar. Another example would be a storage drive for a file server, with the possibility to trigger VSS-consistent snapshots that can then be archived directly on the iSCSI Target server without causing network congestion.

Appendix A: test server specifications

For the tests we used two identical hardware systems, one configured as a Storage Server (iSCSI Target), the other as a Client Server (iSCSI Initiator).

Hardware

DELL PowerEdge 860

CPU – Intel® Xeon® X3220 (Quad Core @ 2.40 GHz)

RAM – 4GB (4x 1GB PC2-5300E-5-5-5-12)

NIC – Broadcom Dual NetXTreme Gigabit Ethernet (BCM5712 B1, Firmware v3.61, Drivers 14.2.0.5 from 7/30/2010)

No Jumbo Frames were used.

The connection was built over Cat6E double-shielded (F/FTP) cable; no switch was used

Only Protocol IPv4 was enabled (see Appendix C)

Cross-check hardware

iSCSI Target –

Identical with above except for NIC card

NIC – Intel® PRO/1000 PT Quad Port (Drivers 11.4.7.0 from 04.12.2009)

iSCSI Initiator –

Supermicro X8DT3

CPU – 2x Intel® Xeon® E5502 (Dual Core @ 1,87 GHz)

RAM – 6GB (6x 1GB PC3-6400E-6-6-6-14)

NIC – Intel® 82576 Gigabit Dual Port (Drivers 11.4.7.0 from 04.12.2009)

No Jumbo Frames were used.

The connection was built over Cat6E double-shielded (F/FTP) cable on dedicated Cisco C2960G switch

Only Protocol IPv4 was enabled (see Appendix C)

OS

Microsoft® Windows Server 2008 R2 x64 English

All Updates, included in the CD Windows_Storage_Server_2008_R2\OS Updates as recommended by Windows_Storage_Server_2008R2_OEM_Guide.doc

Software

(Only on the Storage Server)

iSCSI_Software_Target_33

(Only on the Client Server)

RAMDisk_Evaluation_x64_530212

From http://members.fortunecity.com/ramdisk/RAMDisk/ramdiskent.htm

Configured as a 2GB NTFS Disk with 2048 MB in resident memory

Used for the test iterations with robocopy and richcopy (see Appendix B)

Appendix B: workload and test procedures

The following synthetic and native workload programs were used during the test phase:

Jperf-2.0.0 + iperf-1.7.0

Tests included three identical iterations, each 10 seconds long, with a single thread. The median of the iterations was chosen. The following options were varied:

  1. TCP Window Size: 0.01MByte (default); 0.06MByte; 0.13MByte; 0.50MByte
  2. Link Layer Topology Discovery Mapper I/O: enabled on iSCSI interface; disabled on iSCSI interface

This resulted in a 2×4 test matrix.

IOMeter -2008-06-22

The tests included the “Simulated File Server Configuration”-Workload as proposed in http://www.dell.com/downloads/global/products/pvaul/en/powervault-md3220i-md3000i-mixed-workload-comparison.pdf :

10% 512 byte, 5% 1K, 5% 2K, 60% 4K, 2% 8K, 4% 16K, 4% 32K, and 10% 64K transfer request size

80% reads, 20% writes

100% random, 0% sequential

I/Os aligned on sector boundaries

100% seek range

60 outstanding I/Os per target

1 worker per target
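Because the profile mixes eight request sizes, IOPS and MB/s can only be related through the weighted average request size. The sketch below derives that average from the percentages listed above and converts between the two units; the assumption that 1 MB = 2**20 bytes is ours.

```python
# Request size in bytes -> share of requests in percent (from the profile above).
FILE_SERVER_PROFILE = {
    512: 10, 1024: 5, 2048: 5, 4096: 60,
    8192: 2, 16384: 4, 32768: 4, 65536: 10,
}
BYTES_PER_MB = 2**20  # assumption about the unit


def average_request_bytes(profile: dict[int, int]) -> float:
    """Weighted average request size; the shares must add up to 100 %."""
    if sum(profile.values()) != 100:
        raise ValueError("profile shares must add up to 100 percent")
    return sum(size * share for size, share in profile.items()) / 100


def iops_to_mb_s(iops: float, profile: dict[int, int] = FILE_SERVER_PROFILE) -> float:
    """Total throughput in MB/s for a given IOPS figure."""
    return iops * average_request_bytes(profile) / BYTES_PER_MB


def mb_s_to_iops(mb_s: float, profile: dict[int, int] = FILE_SERVER_PROFILE) -> float:
    """IOPS needed to reach a given total throughput."""
    return mb_s * BYTES_PER_MB / average_request_bytes(profile)


avg = average_request_bytes(FILE_SERVER_PROFILE)
print(f"average request size: {avg:.2f} bytes ({avg / 1024:.2f} KB)")
print(f"IOPS needed for 124 MB/s: {mb_s_to_iops(124):.0f}")
```

With an average request of roughly 11 KB, IOPS figures from this profile cannot be compared directly with those of profiles that use other block sizes.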

There were no iterations; additional profiles were run instead (e.g. IOMix from heise.de), whose total IOPS did not deviate significantly, so they are not presented here.

The same test profile was run for the three network optimization scenarios “notweaks” “onlyISCSItweaks” and “TCPandISCSItweaks” (see Appendix C for details)

H2benchw-3.16

This set of tests was executed directly after IOMeter without any system changes. Version 3.16 does not include Application Benchmarks anymore, so only sequential read/write and zone measurement tests (including latency measurement) are displayed.

The command used was

h2benchw 1 -a -!

The txt and ps outputs can be acquired on request

Robocopy

(Installed by default on Server 2008 R2)

During the scheduled tests the additional idea arose to measure the transfer speed on an already existing NTFS volume. The question was whether the synthetic benchmark results hold under real conditions, i.e. whether the NTFS driver and the copy/robocopy implementation could drastically reduce the file transfer speeds.

The command was run as

Robocopy <source> <destination> /E

Additional options such as /COPY:DATSOU or /MT were left out, since they would distort the picture of copying as the average user normally does it: with copy and paste in Windows Explorer.

The following file sizes were tested:

1x 1GByte file

1024x 1MByte files

64x Folders @ 64x 256Kbyte files

All files were created using dd if=/dev/random pseudo interface
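For readers without dd at hand, the three file sets can also be generated with a small script. This is a sketch of an equivalent approach, not the procedure we used: it swaps dd for Python's os.urandom, and the folder and file names are hypothetical.

```python
import os
from pathlib import Path


def write_random_file(path: Path, size_bytes: int, chunk_bytes: int = 1 << 20) -> None:
    """Fill a file with random bytes, written in chunks to limit memory use."""
    remaining = size_bytes
    with path.open("wb") as handle:
        while remaining > 0:
            step = min(chunk_bytes, remaining)
            handle.write(os.urandom(step))
            remaining -= step


def create_file_set(root: Path, folders: int, files_per_folder: int, size_bytes: int) -> int:
    """Create folders x files_per_folder random files; return the file count."""
    created = 0
    for folder_index in range(folders):
        folder = root / f"folder_{folder_index:03d}"  # hypothetical naming
        folder.mkdir(parents=True, exist_ok=True)
        for file_index in range(files_per_folder):
            write_random_file(folder / f"file_{file_index:04d}.bin", size_bytes)
            created += 1
    return created


# The three layouts listed above: (folders, files per folder, bytes per file).
LAYOUTS = {
    "1x1GB": (1, 1, 1 << 30),
    "1024x1MB": (1, 1024, 1 << 20),
    "64x64x256KB": (64, 64, 256 << 10),
}
```

A call such as `create_file_set(Path("testdata"), *LAYOUTS["1024x1MB"])` would then build one of the sets; note that each layout occupies about 1 GB on disk.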

Appendix C: software and registry optimization specifications

It was decided to test the reliability and performance deltas from two different services. The first is the TCP/IP (v4) stack and the other is the iSCSI stack.

We named three different scenarios: “notweaks”, “onlyISCSItweaks” and “TCPandISCSItweaks”. A scenario such as “onlyTCPtweaks” was not included since it had already been thoroughly tested earlier (see our internal document “Microsoft® iSCSI Target 3.2 availability and performance white paper (2010)”) and would not have brought any further improvements in the environment.

Lastly, the scenarios were implemented in parallel on both servers, to ensure that the systems were in a consistent state.

Notweaks

This configuration was based on a vanilla installation of Windows Server 2008 R2 Standard Edition. Still since IPv4 iSCSI configuration was deployed, all other protocols were disabled on the iSCSI interfaces. Sufficient preliminary checks with iperf have proven that there is no negative impact when disabling those additional protocols.

Furthermore netsh int tcp set global was reset to default, in case any of the Microsoft® OS Updates could have changed their values unattended

OnlyISCSItweaks

This configuration was based on the recognition that stable high bandwidth can be achieved only with packets of TCP Window Size equal to or larger than 256Kbyte.

The default settings were as shown next:

We could decrease the values of the advanced settings but could not increase them (or set them back) via the GUI. For the iSCSI Target we issued the following PowerShell command:

Set-IscsiServerTarget -TargetName target0 -FirstBurstLength 262144 -Force -MaxReceiveDataSegmentLength 262144 -PassThru -MaxBurstLength 524288

For the iSCSI Initiator we had to change the values directly in the registry:

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Class\{4D36E97B-E325-11CE-BFC1-08002BE10318}000\Parameters]

"MaxTransferLength"=dword:00080000

"MaxBurstLength"=dword:00080000

"FirstBurstLength"=dword:00040000

"MaxRecvDataSegmentLength"=dword:00040000
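The dword values are hexadecimal byte counts, which makes them hard to compare at a glance with the decimal values in the PowerShell command. A short sketch that decodes them (the dictionary simply repeats the four values above):

```python
# Registry values as listed above: name -> hexadecimal dword string.
ISCSI_INITIATOR_DWORDS = {
    "MaxTransferLength": "00080000",
    "MaxBurstLength": "00080000",
    "FirstBurstLength": "00040000",
    "MaxRecvDataSegmentLength": "00040000",
}


def dword_to_int(hex_text: str) -> int:
    """Decode a registry dword written as hexadecimal text."""
    value = int(hex_text, 16)
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError("value does not fit into a 32-bit dword")
    return value


decoded = {name: dword_to_int(text) for name, text in ISCSI_INITIATOR_DWORDS.items()}
for name, value in decoded.items():
    print(f"{name}: {value} bytes ({value // 1024} KB)")
```

The decoded burst and segment lengths are 524288 and 262144 bytes, the same values that were passed to the iSCSI Target.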

TCPandISCSItweaks

This configuration was based on the “onlyISCSItweaks” and additionally the TCP/IP stack was tweaked as follows:

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\services\Tcpip\Parameters\Interfaces\{<iSCSI NIC ID here>}]

"TcpDelAckTicks"=dword:00000001

"TcpWindowSize"=dword:00080000

"GlobalMaxTcpWindowSize"=dword:00080000

"Tcp1323Opts"=dword:00000003

"SackOpts"=dword:00000001

"TcpAckFrequency"=dword:00000001

Additionally the following command was issued

netsh interface tcp set global rss=disabled chimney=disabled autotuninglevel=disabled congestionprovider=none

As proposed in http://support.nexenta.com/index.php?_m=knowledgebase&_a=viewarticle&kbarticleid=85


3 thoughts on “Microsoft® iSCSI Target 3.3 availability and performance white paper (2010)

  1. Interesting article. Regarding the TCP window size, Windows 2008 and newer ignores those registry settings. To my knowledge, the only thing that you can do to change it is to use the various “autotuninglevel” settings. e.g.:

    netsh int tcp set global autotuninglevel=normal

    There’s no manual control, as far as I know, which is a pity.

  2. Thanks for this enhancement, cce. Strangely enough, nexenta removed their “quickfix” which I linked above, and which I now find totally wrong after reading http://technet.microsoft.com/en-us/library/cc731258(WS.10).aspx
    Should have been “set global rss=enabled chimney=enabled autotuninglevel=normal congestionprovider=ctcp”, but these are the default settings under Windows 7 and Server 2008, and as such the “notweak”-test-iteration which didn’t prove to be substantially better.
