When an I/O operation uses the hard disk, the read or write operation is related to tracks and sectors on physical disk. Above this physical layer lies the file system. This layered structure creates a unique problem. If the file system allocation units are not aligned with physical disk, then it is possible that while fetching one allocation unit from the file system, the hard disk actually has to read two physical allocation units. This might also mean wasted cache space. Some terminology and an example can be found from Christian Bolton's blog.
Many articles recommend using 64 KB value for partition alignment when deploying Sql Server. One such article is Predeployment I/O Best Practices from Microsoft. However, almost none of the articles covering the subject present any results or measurements about the performance implications.
One piece where the difference is actually measured is Performance Impact of Disk Misalignment by Linchi Shea. While installing a new server I had the opportunity to test this myself. While I don't have the required expertise with this kind of performance tuning, the results seem to indicate that there is actually a measurable performance benefit that can be achived by aligning the partition to 64 KB.
While running the tests, I decided to test also the implication of NTFS allocation unit. Same documents that recommend aligning the partition to 64 KB usually also recommend using 64 KB as allocation unit size. My results showed performance increase only in specific workloads.
I tried to copy Linchi Shea's method as closely as I could. So I used sqlio.exe for testing the performance. Diskpar.exe (from Windows 2000 Server resource kit) was used for examining the partition offset and diskpart.exe (which comes with Windows 2003 Server) was used for actually creating the partition.
I copied the initial sqlio.exe parameters (80 GB test file, 8KB block size, 4 I/O queue depth) from Linchi Shea's article. While he only focused on random reads and random writes, I tested also sequential reads and writes. I had a Python script that run the tests with different amount of threads. The test set was then repeated and I chose the better values of the two runs. The differences were marginal here. A single test was run for 300 seconds, which according to some articles might not be enough for all disk controllers and cache sizes. The percieved differences between different alignments seem to by consistent anyway.
The server I used was a dual cpu quadcore with 8 GB of ram. It had a raid controller with two disk sets configured, RAID1 for the operating system (Windows Server 2003 Standard x64) and RAID10 for the data. All experiments were done with the RAID10 disks. The raid was created with 64 KB stripe size.
The data drive was already partitioned when I got the server. The partition was created with Windows disk management graphical utility. While diskpart.exe (the one that comes with Windows 2003) can create the partition with specific alignment, it lacks the query feature from the earlier diskpar.exe. So I used diskpar.exe to find out the offset for data partition.
C:\temp>diskpar.exe -i 1 ---- Drive 1 Geometry Infomation ---- Cylinders = 72809 TracksPerCylinder = 255 SectorsPerTrack = 63 BytesPerSector = 512 DiskSize = 598874411520 (Bytes) = 571131 (MB) ---- Drive Partition 0 Infomation ---- StatringOffset = 32256 PartitionLength = 598874379264 HiddenSectors = 63 PartitionNumber = 1 PartitionType = 7 End of partition information. Total existing partitions: 1
The offset for data partition was 32256 which means 31.5 KB. I first run the tests with this configuration. After that I used diskpart.exe to create a correctly aligned partition. This was done by simply deleting the partition with Windows disk management application and using diskpart.exe as follows.
C:\temp>diskpart.exe DISKPART> list disk Disk ### Status Size Free Dyn Gpt -------- ---------- ------- ------- --- --- Disk 0 Online 68 GB 0 B * Disk 1 Online 558 GB 558 GB DISKPART> select disk 1 Disk 1 is now the selected disk. DISKPART> create partition primary align=64 DiskPart succeeded in creating the specified partition.
After this I formatted the partition with disk management application. At first I left the allocation unit size to default.
The following table presents the results measured with sqlio.exe. The first columns tell the operation (R for read, W for write), mode (random or sequential) and number of threads. The following columns show the number of i/os per second. Orig_ios is the value for 31.5 KB alignment and new_ios is the value for 64 KB alignment. Diff% tells the percentual difference to original value. The remaining three columns show the difference in megabytes per second. Since the difference is essentially the same, I have later concentrated only on number of I/Os.
oper thr orig_ios new_ios diff% orig_mbs new_mbs diff% R ran 1 400.45 589.11 47.11 3.12 4.60 47.44 R ran 2 696.59 842.85 21.00 5.44 6.58 20.96 R ran 4 929.99 1083.75 16.53 7.26 8.46 16.53 R ran 8 1071.10 1267.99 18.38 8.36 9.90 18.42 R ran 16 1202.49 1442.43 19.95 9.39 11.26 19.91 R ran 32 1322.13 1589.99 20.26 10.32 12.42 20.35 R ran 64 1331.94 1607.96 20.72 10.40 12.56 20.77 R ran 128 1336.41 1603.51 19.99 10.44 12.52 19.92 R ran 256 1336.83 1604.22 20.00 10.44 12.53 20.02 R seq 1 18275.00 18361.12 0.47 142.77 143.44 0.47 R seq 2 19616.58 21298.11 8.57 153.25 166.39 8.57 R seq 4 18046.04 19330.16 7.12 140.98 151.01 7.11 R seq 8 16355.16 17167.52 4.97 127.77 134.12 4.97 R seq 16 13587.78 13872.80 2.10 106.15 108.38 2.10 R seq 32 12628.85 13941.27 10.39 98.66 108.91 10.39 R seq 64 11639.13 12389.73 6.45 90.93 96.79 6.44 R seq 128 11535.52 12171.05 5.51 90.12 95.08 5.50 R seq 256 11633.98 12385.93 6.46 90.89 96.76 6.46 W ran 1 314.82 348.02 10.55 2.45 2.71 10.61 W ran 2 385.36 433.50 12.49 3.01 3.38 12.29 W ran 4 449.75 514.37 14.37 3.51 4.01 14.25 W ran 8 517.79 602.53 16.37 4.04 4.70 16.34 W ran 16 591.43 698.91 18.17 4.62 5.46 18.18 W ran 32 623.32 742.41 19.11 4.86 5.80 19.34 W ran 64 625.65 743.54 18.84 4.88 5.80 18.85 W ran 128 624.91 745.74 19.34 4.88 5.82 19.26 W ran 256 625.16 743.71 18.96 4.88 5.81 19.06 W seq 1 1095.90 1098.30 0.22 8.56 8.58 0.23 W seq 2 2537.11 2614.16 3.04 19.82 20.42 3.03 W seq 4 4323.43 4614.75 6.74 33.77 36.05 6.75 W seq 8 7986.73 8518.64 6.66 62.39 66.55 6.67 W seq 16 10568.54 11455.13 8.39 82.56 89.49 8.39 W seq 32 8751.45 9772.27 11.66 68.37 78.46 14.76 W seq 64 7179.18 7958.96 10.86 56.08 62.17 10.86 W seq 128 7464.06 8353.19 11.91 58.31 65.25 11.90 W seq 256 7711.36 8649.59 12.17 60.24 67.57 12.17
It seems that the 64 KB offset is better all the time. Random I/O benefits more than sequential I/O. This sounds ok to me, since sequential I/O probably uses the adjacent blocks anyway, while for random I/O the extra block is mostly useless. Average difference is is not a very good measure, but anyway it climbs over 13 percent. The following image shows the differences by I/O type. At least to me it says that there shouldn't be any question whether to use the 64 KB offset.
After testing the performance with different offset values I found out that the 64 KB partition offset was better. But could the results be even better by using also 64 KB as NTFS allocation unit size?
For allocation unit size tests I left the 64 KB partition offset since it was now found to be better than the default 31.5 KB. First test was with default NTFS allocation unit size, which is 4 KB. After that I run the tests with 64 KB allocation unit. The results are here, formatted the same way as previously.
oper thr 4KB_ios 64KB_ios diff% 4KB_mbs 64KB_mbs diff% R ran 1 589.11 586.63 -0.42 4.60 4.58 -0.43 R ran 2 842.85 841.39 -0.17 6.58 6.57 -0.15 R ran 4 1083.75 1081.14 -0.24 8.46 8.44 -0.24 R ran 8 1267.99 1265.49 -0.20 9.90 9.88 -0.20 R ran 16 1442.43 1437.93 -0.31 11.26 11.23 -0.27 R ran 32 1589.99 1584.81 -0.33 12.42 12.38 -0.32 R ran 64 1607.96 1597.76 -0.63 12.56 12.48 -0.64 R ran 128 1603.51 1604.37 0.05 12.52 12.53 0.08 R ran 256 1604.22 1601.35 -0.18 12.53 12.51 -0.16 R seq 1 18361.12 18370.13 0.05 143.44 143.51 0.05 R seq 2 21298.11 20750.68 -2.57 166.39 162.11 -2.57 R seq 4 19330.16 19034.68 -1.53 151.01 148.70 -1.53 R seq 8 17167.52 16891.43 -1.61 134.12 131.96 -1.61 R seq 16 13872.80 13625.19 -1.78 108.38 106.44 -1.79 R seq 32 13941.27 13718.14 -1.60 108.91 107.17 -1.60 R seq 64 12389.73 12322.88 -0.54 96.79 96.27 -0.54 R seq 128 12171.05 12101.91 -0.57 95.08 94.54 -0.57 R seq 256 12385.93 12281.82 -0.84 96.76 95.95 -0.84 W ran 1 348.02 347.09 -0.27 2.71 2.71 0.00 W ran 2 433.50 430.96 -0.59 3.38 3.36 -0.59 W ran 4 514.37 511.66 -0.53 4.01 3.99 -0.50 W ran 8 602.53 597.60 -0.82 4.70 4.66 -0.85 W ran 16 698.91 692.42 -0.93 5.46 5.40 -1.10 W ran 32 742.41 735.10 -0.98 5.80 5.74 -1.03 W ran 64 743.54 735.79 -1.04 5.80 5.74 -1.03 W ran 128 745.74 738.57 -0.96 5.82 5.77 -0.86 W ran 256 743.71 738.35 -0.72 5.81 5.76 -0.86 W seq 1 1098.30 1094.25 -0.37 8.58 8.54 -0.47 W seq 2 2614.16 2559.92 -2.07 20.42 19.99 -2.11 W seq 4 4614.75 4594.38 -0.44 36.05 35.89 -0.44 W seq 8 8518.64 8535.73 0.20 66.55 66.68 0.20 W seq 16 11455.13 11497.78 0.37 89.49 89.82 0.37 W seq 32 9772.27 9730.27 -0.43 78.46 76.01 -3.12 W seq 64 7958.96 8342.12 4.81 62.17 65.17 4.83 W seq 128 8353.19 8509.16 1.87 65.25 66.47 1.87 W seq 256 8649.59 8815.85 1.92 67.57 68.87 1.92
There is no point in doing any graps of these. The difference is practically nonexisting. It seems that 64 KB allocation unit size is actually slower throughout. Only at the end with sequential writes and large number of threads does the 64 KB allocation unit size do better than the default.
The average difference for I/O per second is -0.4 percent, to advantage of the default size. This was bit strange, since the 64 KB allocation unit size recommendation can be found from the same documents where the (now measured and accepted) 64 KB partition offset recommendation was found. That's why I run the tests again with 8 KB allocation unit size. No luck there, since the performance was again almost identical. The average difference was -0.27 percent, and again the default size was better.
I can't really explain the results. I would have thought that for sure bigger allocation unit would have been better. But since the difference is so small, for some other configuration the numbers might have been the other way round and 64 KB allocation unit would have been better.
At least for my server the 64 KB partition offset was clearly better than the default. The best thing is that the performance benefit can be gained just by creating the partition with diskpart.exe instead of the graphical utility. Once the partition is created, the graphical utility can be used to format the partition. Since it isn't more difficult than this, changing the partition offset should really be a standard procedure for all new Sql Server installations.
I should point out that the mentioned average performance increase of 13 percent is just for I/O. The real world application performance is whole different matter and the increase might be marginal on application level. At least when full I/O capacity is not used. Changing the partition offset for existing Sql Server installations is probably overkill, since it would require deleting and recreating the partition and restoring the databases from backups. If the existing installation needs a speedup, most likely partition alignment is not going to make a huge difference and the time is better spent on tuning other parts of the system.
I didn't find any performance increase from changing the NTFS allocation unit size from the default value. I wouldn't change this from the default unless the specific configuration was first tested and the test proved some kind of performance benefit.