Sql Server I/O performance with different disk alignment values and allocation unit sizes

by Pasi Oja-Nisula (pon@iki.fi), 2007-11-03

Preface

When an I/O operation uses the hard disk, the read or write operation is related to tracks and sectors on physical disk. Above this physical layer lies the file system. This layered structure creates a unique problem. If the file system allocation units are not aligned with physical disk, then it is possible that while fetching one allocation unit from the file system, the hard disk actually has to read two physical allocation units. This might also mean wasted cache space. Some terminology and an example can be found from Christian Bolton's blog.

Many articles recommend using 64 KB value for partition alignment when deploying Sql Server. One such article is Predeployment I/O Best Practices from Microsoft. However, almost none of the articles covering the subject present any results or measurements about the performance implications.

One piece where the difference is actually measured is Performance Impact of Disk Misalignment by Linchi Shea. While installing a new server I had the opportunity to test this myself. While I don't have the required expertise with this kind of performance tuning, the results seem to indicate that there is actually a measurable performance benefit that can be achived by aligning the partition to 64 KB.

While running the tests, I decided to test also the implication of NTFS allocation unit. Same documents that recommend aligning the partition to 64 KB usually also recommend using 64 KB as allocation unit size. My results showed performance increase only in specific workloads.

Testing environment

I tried to copy Linchi Shea's method as closely as I could. So I used sqlio.exe for testing the performance. Diskpar.exe (from Windows 2000 Server resource kit) was used for examining the partition offset and diskpart.exe (which comes with Windows 2003 Server) was used for actually creating the partition.

I copied the initial sqlio.exe parameters (80 GB test file, 8KB block size, 4 I/O queue depth) from Linchi Shea's article. While he only focused on random reads and random writes, I tested also sequential reads and writes. I had a Python script that run the tests with different amount of threads. The test set was then repeated and I chose the better values of the two runs. The differences were marginal here. A single test was run for 300 seconds, which according to some articles might not be enough for all disk controllers and cache sizes. The percieved differences between different alignments seem to by consistent anyway.

The server I used was a dual cpu quadcore with 8 GB of ram. It had a raid controller with two disk sets configured, RAID1 for the operating system (Windows Server 2003 Standard x64) and RAID10 for the data. All experiments were done with the RAID10 disks. The raid was created with 64 KB stripe size.

Creating the 64 KB aligned partition

The data drive was already partitioned when I got the server. The partition was created with Windows disk management graphical utility. While diskpart.exe (the one that comes with Windows 2003) can create the partition with specific alignment, it lacks the query feature from the earlier diskpar.exe. So I used diskpar.exe to find out the offset for data partition.

C:\temp>diskpar.exe -i 1
---- Drive 1 Geometry Infomation ----
Cylinders = 72809
TracksPerCylinder = 255
SectorsPerTrack = 63
BytesPerSector = 512
DiskSize = 598874411520 (Bytes) = 571131 (MB)

---- Drive Partition 0 Infomation ----
StatringOffset = 32256
PartitionLength = 598874379264
HiddenSectors = 63
PartitionNumber = 1
PartitionType = 7

End of partition information. Total existing partitions: 1

The offset for data partition was 32256 which means 31.5 KB. I first run the tests with this configuration. After that I used diskpart.exe to create a correctly aligned partition. This was done by simply deleting the partition with Windows disk management application and using diskpart.exe as follows.

C:\temp>diskpart.exe 
DISKPART> list disk

  Disk ###  Status      Size     Free     Dyn  Gpt
  --------  ----------  -------  -------  ---  ---
  Disk 0    Online        68 GB      0 B
* Disk 1    Online       558 GB   558 GB

DISKPART> select disk 1

Disk 1 is now the selected disk.

DISKPART> create partition primary align=64

DiskPart succeeded in creating the specified partition.

After this I formatted the partition with disk management application. At first I left the allocation unit size to default.

Results for 31.5 KB and 64 KB partition alignment

The following table presents the results measured with sqlio.exe. The first columns tell the operation (R for read, W for write), mode (random or sequential) and number of threads. The following columns show the number of i/os per second. Orig_ios is the value for 31.5 KB alignment and new_ios is the value for 64 KB alignment. Diff% tells the percentual difference to original value. The remaining three columns show the difference in megabytes per second. Since the difference is essentially the same, I have later concentrated only on number of I/Os.

 oper thr   orig_ios    new_ios   diff%   orig_mbs  new_mbs   diff%
R ran   1     400.45     589.11   47.11       3.12     4.60   47.44
R ran   2     696.59     842.85   21.00       5.44     6.58   20.96
R ran   4     929.99    1083.75   16.53       7.26     8.46   16.53
R ran   8    1071.10    1267.99   18.38       8.36     9.90   18.42
R ran  16    1202.49    1442.43   19.95       9.39    11.26   19.91
R ran  32    1322.13    1589.99   20.26      10.32    12.42   20.35
R ran  64    1331.94    1607.96   20.72      10.40    12.56   20.77
R ran 128    1336.41    1603.51   19.99      10.44    12.52   19.92
R ran 256    1336.83    1604.22   20.00      10.44    12.53   20.02
R seq   1   18275.00   18361.12    0.47     142.77   143.44    0.47
R seq   2   19616.58   21298.11    8.57     153.25   166.39    8.57
R seq   4   18046.04   19330.16    7.12     140.98   151.01    7.11
R seq   8   16355.16   17167.52    4.97     127.77   134.12    4.97
R seq  16   13587.78   13872.80    2.10     106.15   108.38    2.10
R seq  32   12628.85   13941.27   10.39      98.66   108.91   10.39
R seq  64   11639.13   12389.73    6.45      90.93    96.79    6.44
R seq 128   11535.52   12171.05    5.51      90.12    95.08    5.50
R seq 256   11633.98   12385.93    6.46      90.89    96.76    6.46
W ran   1     314.82     348.02   10.55       2.45     2.71   10.61
W ran   2     385.36     433.50   12.49       3.01     3.38   12.29
W ran   4     449.75     514.37   14.37       3.51     4.01   14.25
W ran   8     517.79     602.53   16.37       4.04     4.70   16.34
W ran  16     591.43     698.91   18.17       4.62     5.46   18.18
W ran  32     623.32     742.41   19.11       4.86     5.80   19.34
W ran  64     625.65     743.54   18.84       4.88     5.80   18.85
W ran 128     624.91     745.74   19.34       4.88     5.82   19.26
W ran 256     625.16     743.71   18.96       4.88     5.81   19.06
W seq   1    1095.90    1098.30    0.22       8.56     8.58    0.23
W seq   2    2537.11    2614.16    3.04      19.82    20.42    3.03
W seq   4    4323.43    4614.75    6.74      33.77    36.05    6.75
W seq   8    7986.73    8518.64    6.66      62.39    66.55    6.67
W seq  16   10568.54   11455.13    8.39      82.56    89.49    8.39
W seq  32    8751.45    9772.27   11.66      68.37    78.46   14.76
W seq  64    7179.18    7958.96   10.86      56.08    62.17   10.86
W seq 128    7464.06    8353.19   11.91      58.31    65.25   11.90
W seq 256    7711.36    8649.59   12.17      60.24    67.57   12.17

It seems that the 64 KB offset is better all the time. Random I/O benefits more than sequential I/O. This sounds ok to me, since sequential I/O probably uses the adjacent blocks anyway, while for random I/O the extra block is mostly useless. Average difference is is not a very good measure, but anyway it climbs over 13 percent. The following image shows the differences by I/O type. At least to me it says that there shouldn't be any question whether to use the 64 KB offset.

Results for 4 KB and 64 KB NTFS allocation unit size

After testing the performance with different offset values I found out that the 64 KB partition offset was better. But could the results be even better by using also 64 KB as NTFS allocation unit size?

For allocation unit size tests I left the 64 KB partition offset since it was now found to be better than the default 31.5 KB. First test was with default NTFS allocation unit size, which is 4 KB. After that I run the tests with 64 KB allocation unit. The results are here, formatted the same way as previously.

 oper thr    4KB_ios   64KB_ios   diff%    4KB_mbs 64KB_mbs   diff%
R ran   1     589.11     586.63   -0.42       4.60     4.58   -0.43
R ran   2     842.85     841.39   -0.17       6.58     6.57   -0.15
R ran   4    1083.75    1081.14   -0.24       8.46     8.44   -0.24
R ran   8    1267.99    1265.49   -0.20       9.90     9.88   -0.20
R ran  16    1442.43    1437.93   -0.31      11.26    11.23   -0.27
R ran  32    1589.99    1584.81   -0.33      12.42    12.38   -0.32
R ran  64    1607.96    1597.76   -0.63      12.56    12.48   -0.64
R ran 128    1603.51    1604.37    0.05      12.52    12.53    0.08
R ran 256    1604.22    1601.35   -0.18      12.53    12.51   -0.16
R seq   1   18361.12   18370.13    0.05     143.44   143.51    0.05
R seq   2   21298.11   20750.68   -2.57     166.39   162.11   -2.57
R seq   4   19330.16   19034.68   -1.53     151.01   148.70   -1.53
R seq   8   17167.52   16891.43   -1.61     134.12   131.96   -1.61
R seq  16   13872.80   13625.19   -1.78     108.38   106.44   -1.79
R seq  32   13941.27   13718.14   -1.60     108.91   107.17   -1.60
R seq  64   12389.73   12322.88   -0.54      96.79    96.27   -0.54
R seq 128   12171.05   12101.91   -0.57      95.08    94.54   -0.57
R seq 256   12385.93   12281.82   -0.84      96.76    95.95   -0.84
W ran   1     348.02     347.09   -0.27       2.71     2.71    0.00
W ran   2     433.50     430.96   -0.59       3.38     3.36   -0.59
W ran   4     514.37     511.66   -0.53       4.01     3.99   -0.50
W ran   8     602.53     597.60   -0.82       4.70     4.66   -0.85
W ran  16     698.91     692.42   -0.93       5.46     5.40   -1.10
W ran  32     742.41     735.10   -0.98       5.80     5.74   -1.03
W ran  64     743.54     735.79   -1.04       5.80     5.74   -1.03
W ran 128     745.74     738.57   -0.96       5.82     5.77   -0.86
W ran 256     743.71     738.35   -0.72       5.81     5.76   -0.86
W seq   1    1098.30    1094.25   -0.37       8.58     8.54   -0.47
W seq   2    2614.16    2559.92   -2.07      20.42    19.99   -2.11
W seq   4    4614.75    4594.38   -0.44      36.05    35.89   -0.44
W seq   8    8518.64    8535.73    0.20      66.55    66.68    0.20
W seq  16   11455.13   11497.78    0.37      89.49    89.82    0.37
W seq  32    9772.27    9730.27   -0.43      78.46    76.01   -3.12
W seq  64    7958.96    8342.12    4.81      62.17    65.17    4.83
W seq 128    8353.19    8509.16    1.87      65.25    66.47    1.87
W seq 256    8649.59    8815.85    1.92      67.57    68.87    1.92

There is no point in doing any graps of these. The difference is practically nonexisting. It seems that 64 KB allocation unit size is actually slower throughout. Only at the end with sequential writes and large number of threads does the 64 KB allocation unit size do better than the default.

The average difference for I/O per second is -0.4 percent, to advantage of the default size. This was bit strange, since the 64 KB allocation unit size recommendation can be found from the same documents where the (now measured and accepted) 64 KB partition offset recommendation was found. That's why I run the tests again with 8 KB allocation unit size. No luck there, since the performance was again almost identical. The average difference was -0.27 percent, and again the default size was better.

I can't really explain the results. I would have thought that for sure bigger allocation unit would have been better. But since the difference is so small, for some other configuration the numbers might have been the other way round and 64 KB allocation unit would have been better.

Summary

At least for my server the 64 KB partition offset was clearly better than the default. The best thing is that the performance benefit can be gained just by creating the partition with diskpart.exe instead of the graphical utility. Once the partition is created, the graphical utility can be used to format the partition. Since it isn't more difficult than this, changing the partition offset should really be a standard procedure for all new Sql Server installations.

I should point out that the mentioned average performance increase of 13 percent is just for I/O. The real world application performance is whole different matter and the increase might be marginal on application level. At least when full I/O capacity is not used. Changing the partition offset for existing Sql Server installations is probably overkill, since it would require deleting and recreating the partition and restoring the databases from backups. If the existing installation needs a speedup, most likely partition alignment is not going to make a huge difference and the time is better spent on tuning other parts of the system.

I didn't find any performance increase from changing the NTFS allocation unit size from the default value. I wouldn't change this from the default unless the specific configuration was first tested and the test proved some kind of performance benefit.