ATA 4 KiB sector issues

From ata Wiki
Revision as of 03:20, 8 March 2010 by Tj (Talk | contribs)

Minimally reformatted from the original text file. Once reviews on the mailing lists are complete, I'll update the formatting. Until then, please write to tj@kernel.org and cc linux-ide@vger.kernel.org for comments instead of editing this page directly. Thanks.

Background

Until recently, all ATA hard drives were organized in 512 byte sectors. For example, my 500 GB (about 466 GiB) hard drive is organized into 976773168 512 byte sectors numbered from 0 to 976773167. This is how a drive communicates with the driver. When the operating system wants to read 32 KiB of data at the 1 MiB position, the driver asks the drive to read 64 sectors from LBA (logical block address, i.e. sector number) 2048.
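
The arithmetic in the example above can be sketched as follows. This is only an illustration assuming 512 byte sectors; the function name byte_range_to_lba is made up for this page and is not part of any driver.

```python
SECTOR_SIZE = 512  # bytes per sector, as assumed in the text above
KIB = 1024
MIB = 1024 * KIB

def byte_range_to_lba(offset_bytes, length_bytes, sector_size=SECTOR_SIZE):
    """Return (starting LBA, sector count) for a sector-aligned byte range."""
    if offset_bytes % sector_size or length_bytes % sector_size:
        raise ValueError("byte range is not sector aligned")
    return offset_bytes // sector_size, length_bytes // sector_size

# 32 KiB of data at the 1 MiB position
start_lba, count = byte_range_to_lba(1 * MIB, 32 * KIB)
print(start_lba, count)
```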

Because each sector should be individually addressable, readable and writable, the physical medium is also organized in sectors of the same size. In addition to the area which stores the actual data, each sector requires extra space for bookkeeping - inter-sector space to enable locating and addressing each sector, and ECC data to detect and correct inevitable raw data errors.

As the densities and capacities of hard drives keep growing, stronger ECC becomes necessary to guarantee an acceptable level of data integrity, which increases the space overhead. In addition, in most applications, hard drives are now accessed in units of at least 8 sectors or 4096 bytes, so maintaining 512 byte granularity has become somewhat meaningless.

This has reached the point where enlarging the sector size to 4096 bytes yields measurably more usable space from the same raw storage, and hard drive manufacturers are transitioning to 4 KiB sectors.

Anandtech has a good article which illustrates the background and issues with pretty diagrams[1].

Physical vs. Logical

Because the 512 byte sector size has been around for a very long time and, up to ATA/ATAPI-7, the sector size was fixed at 512 bytes, the sector size assumption is scattered across all the layers - controllers or bridge chips snooping commands, BIOSes, boot code, drivers, partitioners and system utilities. This makes it very difficult to change the sector size from 512 bytes without massively breaking backward compatibility.

As a workaround, the concept of logical sector size was introduced. The physical medium is organized in 4 KiB sectors but the firmware on the drive presents it as if the drive were composed of 512 byte sectors, making the drive behave as before. If the driver asks the hard drive to read 64 sectors from LBA 2048, the firmware translates the request and reads eight 4 KiB sectors starting from hardware sector 256. As a result, the hard drive now has two sector sizes - the physical one in which the medium is actually organized, and the logical one which the firmware presents to the outside world.

A straightforward example mapping between physical sector and LBA would be

 LBA = 8 * phys_sect
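
As a rough sketch of the translation the firmware performs under this mapping, the following computes which physical sectors a logical request touches. The helper name physical_span is hypothetical and the 4096/512 ratio is the one assumed in the text; real firmware is of course not written like this.

```python
LOGICAL_PER_PHYSICAL = 4096 // 512  # 8 logical sectors per physical sector

def physical_span(lba, count, ratio=LOGICAL_PER_PHYSICAL):
    """Return (first physical sector, number of physical sectors) that a
    request for `count` logical sectors starting at `lba` touches."""
    first = lba // ratio
    last = (lba + count - 1) // ratio
    return first, last - first + 1

# The aligned read from the text: 64 logical sectors from LBA 2048
print(physical_span(2048, 64))
```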

Alignment problem on 4 KiB physical / 512 logical drives

This workaround keeps older hardware and software working while allowing the drive to use a larger sector size internally. However, the discrepancy between physical and logical sector sizes creates an alignment issue. For example, if the driver wants to read 7 sectors from LBA 2047, the firmware has to read hardware sectors 255 and 256 and trim the leading 7*512 bytes and the trailing 2*512 bytes.

For reads, this isn't an issue as drives read in larger chunks anyway. For writes, however, the drive has to do a read-modify-write cycle to achieve the requested action. It first has to read hardware sectors 255 and 256, update the requested parts and then write those sectors back, which can cause significant performance degradation[2].
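
To make the trimming and the read-modify-write condition concrete, here is a minimal sketch assuming the simple LBA = 8 * phys_sect mapping. The names trim_bytes and needs_rmw are illustrative only.

```python
def trim_bytes(lba, count, lss=512, pss=4096):
    """Return (leading, trailing) bytes the drive must discard on a read,
    or preserve on a write, to satisfy a request on whole physical sectors."""
    ratio = pss // lss
    leading = (lba % ratio) * lss
    # Logical sectors between the end of the request and the next boundary.
    trailing = (-(lba + count) % ratio) * lss
    return leading, trailing

def needs_rmw(lba, count, lss=512, pss=4096):
    """A write needs read-modify-write if either end is unaligned."""
    return trim_bytes(lba, count, lss, pss) != (0, 0)

print(trim_bytes(2047, 7), needs_rmw(2047, 7))
print(trim_bytes(2048, 64), needs_rmw(2048, 64))
```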

The problem is aggravated by the way DOS partitions[3] have traditionally been laid out. For reasons dating back more than two decades, they are laid out according to something called disk geometry, which nowadays consists of arbitrary values subject to a number of backward compatibility restrictions accumulated over the years. The end result is that, until recently (most Linux variants and Windows up to XP), the first partition ends up on sector 63 and later ones on cylinder boundaries, where each cylinder is usually composed of 255 * 63 sectors.

Most modern filesystems generate accesses that are 4 KiB aligned relative to the start of the partition they are in. If a drive maps 4 KiB physical sectors to 512 byte logical sectors starting from LBA 0, the filesystem in the first partition will always be misaligned and filesystems in later partitions are likely to be misaligned too.
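
A quick way to see why the traditional layout misaligns is to check the usual starting LBAs against the 8 sector boundary. The sketch below assumes the 255 * 63 cylinder size and the straight mapping from LBA 0; is_aligned is a hypothetical helper.

```python
SECTORS_PER_CYLINDER = 255 * 63  # 16065 logical sectors

def is_aligned(lba, ratio=8):
    """True if the LBA falls on a 4 KiB physical sector boundary."""
    return lba % ratio == 0

# Traditional start of the first partition
print("LBA 63 aligned:", is_aligned(63))

# A cylinder is 16065 sectors, which leaves a remainder of 1 when divided
# by 8, so only every 8th cylinder boundary happens to be aligned.
aligned_cylinders = [c for c in range(1, 33)
                     if is_aligned(c * SECTORS_PER_CYLINDER)]
print("aligned cylinder boundaries below 33:", aligned_cylinders)
```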

Solving the alignment problem on 4 KiB physical / 512 logical drives

There are multiple approaches to solving the problem.

S-1. Yet another workaround from the firmware - offset-by-one.

Another workaround which can be done by the firmware is to offset the physical to logical mapping by one logical sector such that LBA 63 ends up on a physical sector boundary. This aligns the first partition to physical sectors without requiring any software update. The example mapping between phys_sect and LBA becomes

 LBA = 8 * phys_sect - 1

The leading 512 bytes of phys_sect 0 are not used and LBA 0 starts right after that point. phys_sect 1 maps to LBA 7 and phys_sect 8 to LBA 63, making LBA 63 aligned on a hardware sector.

Although this aligns only the first partition, this workaround was deemed useful for many use cases, especially the ones involving older software, and some recent drives with 4 KiB physical sectors are equipped with a dip switch to turn offset-by-one mapping on or off.
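
The two mappings can be compared side by side with a small sketch. It assumes 4 KiB physical and 512 byte logical sectors; first_lba and is_aligned are names invented for this example.

```python
RATIO = 4096 // 512  # logical sectors per physical sector

def first_lba(phys_sect, offset_by_one=False):
    """First LBA that starts inside the given physical sector."""
    if offset_by_one:
        # LBA = 8 * phys_sect - 1; phys_sect 0 loses its leading 512 bytes,
        # so the first LBA inside it is 0.
        return max(RATIO * phys_sect - 1, 0)
    return RATIO * phys_sect

def is_aligned(lba, offset_by_one=False):
    """True if the LBA begins exactly on a physical sector boundary."""
    shift = 1 if offset_by_one else 0
    return (lba + shift) % RATIO == 0

for mode in (False, True):
    print("offset-by-one:", mode,
          "| phys_sect 8 starts at LBA", first_lba(8, mode),
          "| LBA 63 aligned:", is_aligned(63, mode))
```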

S-2. The proper solution.

Correct alignments for all partitions can't be achieved by the firmware alone. The system utilities should be informed about the alignment requirements and align partitions accordingly.

The above firmware workaround complicates the situation because the two different configurations require different offsets to achieve the correct alignments. ATA/ATAPI-8 specifies a way for a drive to export the physical and logical sector sizes and the LBA offset which is aligned to the physical sectors.

In Linux, these parameters are exported via the following sysfs nodes.

 physical sector size	: /sys/block/sdX/queue/physical_block_size
 logical sector size	: /sys/block/sdX/queue/logical_block_size
 alignment offset	: /sys/block/sdX/alignment_offset

Let the physical sector size be PSS, logical sector size LSS and alignment offset AOFF. The system software should place partitions such that the starting LBAs of all partitions are aligned on

 (n * PSS + AOFF) / LSS

For 4 KiB physical sector offset-by-one drives, PSS is 4096, LSS 512 and AOFF 3584 and with n of 7 the above becomes,

 (7 * 4096 + 3584) / 512 == 63

making sector 63 an aligned LBA where the first partition can be put, but without the offset-by-one mapping, AOFF is zero and LBA 63 is not aligned.
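
The following sketch shows how a partitioner might apply the formula. The sysfs paths are the ones listed above; the function names (read_topology, aligned_lba, is_aligned, next_aligned_lba) are assumptions made for this example and do not correspond to any existing tool. It also assumes PSS and AOFF are multiples of LSS.

```python
from pathlib import Path

def read_topology(dev):
    """Read (PSS, LSS, AOFF) in bytes for a block device such as "sda"."""
    base = Path("/sys/block") / dev
    pss = int((base / "queue" / "physical_block_size").read_text())
    lss = int((base / "queue" / "logical_block_size").read_text())
    aoff = int((base / "alignment_offset").read_text())
    return pss, lss, aoff

def aligned_lba(n, pss, lss, aoff):
    """The n-th aligned LBA: (n * PSS + AOFF) / LSS."""
    return (n * pss + aoff) // lss

def is_aligned(lba, pss, lss, aoff):
    """True if the LBA starts on a physical sector boundary."""
    return (lba * lss - aoff) % pss == 0

def next_aligned_lba(lba, pss, lss, aoff):
    """Smallest aligned LBA that is not below `lba`."""
    n = -(-(lba * lss - aoff) // pss)  # ceiling division
    return aligned_lba(max(n, 0), pss, lss, aoff)

# Offset-by-one drive versus a drive without the offset
print(next_aligned_lba(63, 4096, 512, 3584))
print(next_aligned_lba(63, 4096, 512, 0))
```

On a live system the parameters would come from something like read_topology("sda") instead of the hard coded values used here.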

With the above new alignment requirement in place, it becomes difficult to honor the legacy one - first partition on sector 63 and all other partitions on cylinder boundary (255 * 63 sectors) - as the two alignment requirements contradict each other. This might be worked around by adjusting how LBA and CHS addresses are mapped but the disk geometry parameters are hard coded everywhere and there is no reliable way to communicate custom geometry parameters.

Complications

Unfortunately, there are complications.

C-1. The standard is not and won't be followed as-is.

Some existing BIOSes and/or drivers can't cope with drives which report a 4 KiB physical sector size. To work around this, some drive models falsely report a physical sector size of 512 bytes when the actual configuration is 4 KiB without offsetting.

This nullifies the alignment provisions in the ATA standard but still results in the correct alignment under Windows Vista and 7. OS behavior is described in more detail below.

For these drives, which are likely to continue to be shipped for the foreseeable future, the traditional LBA 63 and cylinder based alignment results in misalignment.

C-2. Windows XP depends on the traditional partition layout.

Windows XP makes use of the CHS start/end addresses in the partition table and gets confused if partitions are not laid out traditionally. This means that XP can't be installed into a partition prepared by later versions of Windows[4]. This isn't a big problem for Windows because in most cases the later version is replacing the older one, not the other way around.

Unfortunately, the situation is more complex for Linux because Linux is often co-installed with various versions of Windows and XP is still quite popular. This means that when a Linux partitioner is used to prepare a partition which may be used by Windows, the partitioner might have to consider which version of Windows is going to be used and whether to align the partitions for the correct alignment or compatibility with older versions of Windows.

C-3. The 2 TiB barrier and the possibility for 4 KiB logical sector size.

The DOS partition format uses 32 bits for the starting LBA and the number of sectors and, reportedly, 32 bit Windows XP shares the limitation. With 32 bit addressing and a 512 byte logical sector size, the maximum addressable sector + 1 is at

 2^32 * 2^9 == 2^41 == 2 TiB

The DOS partition format allows a partition to reach beyond 2 TiB as long as the starting LBA is under 2 TiB; however, both Windows XP and the Linux kernel (at least up to v2.6.33) refuse such partition configurations.

With the right combination of host controller, BIOS and driver, this barrier can be overcome by enlarging the logical sector size to 4 KiB, which pushes the barrier out to 16 TiB. On the right configuration, Windows XP is reportedly able to address beyond the 2 TiB barrier with a DOS partition and 4 KiB logical sector size. The Linux kernel up to v2.6.33 doesn't work under such configurations but a patch to make it work is pending[5].

This might also be beneficial for operating systems which don't suffer from this limitation. Without it, a different partition format - GPT[6] - has to be used beyond 2^32 sectors, which could harm compatibility with older BIOSes or other operating systems which don't recognize the new format.

As mentioned previously, the 512 byte sector assumption has been around for a very long time, and changing it is likely to cause various compatibility problems at many different layers from hardware up to the system utilities.
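
The two limits follow directly from the 32 bit sector fields. A minimal sketch of the arithmetic, with addressing_limit being a name chosen for this example:

```python
TIB = 2**40

def addressing_limit(logical_sector_size, lba_bits=32):
    """Bytes addressable with `lba_bits` bit sector numbers."""
    return (2**lba_bits) * logical_sector_size

for lss in (512, 4096):
    print(lss, "byte logical sectors:", addressing_limit(lss) // TIB, "TiB")
```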

Windows

As hard drive vendors aim for performance and compatibility in modern Windows environments, it is worthwhile to investigate how Windows partitions drives with different alignment requirements. Up until Windows XP, it followed the traditional layout - the first partition on LBA 63 and the others on cylinder boundaries where a cylinder is defined as 255 tracks with 63 sectors each.

Windows Vista and 7 align partitions differently. As the two behave similarly, only 7's behavior is shown here. These partition tables were created by the Windows 7 RC installer on blank disks.

W-1. 512 byte physical and logical sector drive.

 ST FIRST  T  LAST   LBA      NBLKS
 80 202100 07 df130c 00080000 00200300
 00 df140c 07 feffff 00280300 00689e12
 00 000000 00 000000 00000000 00000000
 00 000000 00 000000 00000000 00000000
 
 Part0:        FIRST	C    0	H   32	S   33	: 2048	        (63 sec/trk)
               LAST	C   12	H  223	S   19	: 206847        (255 heads/cyl)
               LBA	2048 + 204800 = 206848
 
 Part1:        FIRST	C   12	H  223	S   20	: 206848
 		LAST	C 1023	H  254	S   63	: E
 		LBA	206848 + 312371200 = 312578048
 
 Both aligned at (2048 * n).  Part 1 not aligned to cylinder.

W-2. 4 KiB physical and 512 byte logical sector drive without offset-by-one.

 ST FIRST  T  LAST   LBA      NBLKS
 80 202100 07 df130c 00080000 00200300
 00 df140c 07 feffff 00280300 00b83f25
 00 000000 00 000000 00000000 00000000
 00 000000 00 000000 00000000 00000000
 
 Part0:        FIRST	C    0	H   32	S   33	: 2048	        (63 sec/trk)
               LAST	C   12	H  223	S   19	: 206847        (255 heads/cyl)
               LBA	2048 + 204800 = 206848
 
 Part1:        FIRST	C   12	H  223	S   20	: 206848
               LAST	C 1023	H  254	S   63	: E
               LBA	206848 + 624932864 = 625139712
 
 Both aligned at (2048 * n).  Part 1 not aligned to cylinder.

W-3. 4 KiB physical and 512 byte logical sector drive with offset-by-one.

 ST FIRST  T  LAST   LBA      NBLKS
 80 202800 07 df130c 07080000 f91f0300
 00 df1b0c 07 feffff 07280300 f9376d74
 00 000000 00 000000 00000000 00000000
 00 000000 00 000000 00000000 00000000
 
 Part0:        FIRST	C    0	H   32	S   40	: 2055          (63 sec/trk)
               LAST	C   12	H  223	S   19	: 206847        (255 heads/cyl)
               LBA	2055 + 204793 = 206848
 
 Part1:        FIRST	C   12	H  223	S   27	: 206855
               LAST	C 1023	H  254	S   63	: E
               LBA	206855 + 1953314809 = 1953521664
 
 Both aligned at (2048 * n + 7).  Part 1 not aligned to cylinder.

The partitioner seems to use 1 MiB as the basic alignment unit, offsetting from there only if explicitly requested by the drive. There is no difference between the handling of 512 byte and 4 KiB drives, which explains why C-1 works for hard drive vendors.

In all cases, the partitioner ignores both traditional requirements - the first partition on LBA 63 and the others on cylinder boundaries - while still using the same 255*63 cylinder size. Also, note that in W-3, both part 0 and 1 end up with an odd number of sectors. It seems that they simply decided to break away from the traditional layout completely, which is understandable given that no single solution covers all the cases and that the larger default alignment benefits earlier SSDs.
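
The raw entries in W-1 to W-3 can be decoded with a short script to double check the numbers. This is a sketch assuming the standard 16 byte DOS partition entry layout (status, first CHS, type, last CHS, little-endian start LBA and sector count) and the 255/63 geometry noted next to the tables; decode_chs, chs_to_lba and decode_entry are names made up for this example.

```python
import struct

HEADS, SECTORS = 255, 63  # geometry noted next to the tables

def decode_chs(raw):
    """Decode a 3 byte CHS field into (cylinder, head, sector)."""
    head = raw[0]
    sector = raw[1] & 0x3F                       # low 6 bits
    cylinder = ((raw[1] & 0xC0) << 2) | raw[2]   # 2 high bits + low byte
    return cylinder, head, sector

def chs_to_lba(cylinder, head, sector):
    """Convert a CHS address to an LBA (sectors are numbered from 1)."""
    return (cylinder * HEADS + head) * SECTORS + (sector - 1)

def decode_entry(hex_line):
    """Decode one 'ST FIRST T LAST LBA NBLKS' line from the tables."""
    raw = bytes.fromhex("".join(hex_line.split()))
    status, first, ptype, last, lba, nblks = struct.unpack("<B3sB3sII", raw)
    return {"bootable": status == 0x80, "type": ptype,
            "first_chs": decode_chs(first), "last_chs": decode_chs(last),
            "lba": lba, "nblks": nblks}

# First entry of W-1 and first entry of W-3
for line in ("80 202100 07 df130c 00080000 00200300",
             "80 202800 07 df130c 07080000 f91f0300"):
    e = decode_entry(line)
    print(e["first_chs"], e["lba"], e["nblks"], "lba % 2048 =", e["lba"] % 2048)
```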

Windows Vista basically shows the same behavior. Vista was tested by creating two partitions using the management tool. Test data is available at [7].

 *-alignment_offset    : alignment_offset reported by Linux kernel
 *-fdisk               : fdisk -l output
 *-fdisk-u             : fdisk -lu output
 *-hdparm              : hdparm -I output
 *-mbr                 : dump of mbr
 *-part                : decoded partition table from mbr

Please note that hdparm is misreporting the alignment offset. It should be reporting 512 instead of 256 for offset-by-one drives.

So, what now for Linux?

The situation is not easy. Considering all the factors, the only workable solution looks like doing what Windows is doing. Hard drive and SSD vendors are focusing on compatibility and performance on recent Windows releases and are happy to break the mechanism defined by the standard, as shown by C-1, so departing from what Windows does would be unnecessarily painful.

Unfortunately, while Windows can assume that newer releases won't share the hard drive with older releases including Windows XP, Linux distros can't do that. There will be many installations where a modern Linux distro shares a hard drive with older releases of Windows. At this point, I can't see a silver bullet solution.

Perhaps partitioners should only align partitions which will be used by Linux and default to the traditional layout for others while allowing explicit override. I think Windows XP wouldn't have a problem with differently aligned partitions as long as it doesn't actually use them, but I haven't tested it.

Reportedly, commonly used partitioners aren't ready to handle drives larger than 2 TiB in any configuration and alignment isn't done properly for drives with 4 KiB physical sectors. 4 KiB logical sector support is broken in both the kernel and partitioners. (need more details and probably a whole section on partitioner behaviors)

Unfortunately, the transition to 4 KiB sector size, physical only or logical too, is looking fairly ugly. Hopefully, a reasonable solution can be reached in the not too distant future, but even with all the software updated, it looks like it is going to cause a significant amount of confusion and frustration.

[1] http://www.anandtech.com/storage/showdoc.aspx?i=3691
[2] http://www.osnews.com/story/22872/Linux_Not_Fully_Prepared_for_4096-Byte_Sector_Hard_Drives
[3] http://en.wikipedia.org/wiki/Master_boot_record
[4] http://support.microsoft.com/kb/931760
[5] http://thread.gmane.org/gmane.linux.kernel/953981
[6] http://en.wikipedia.org/wiki/GUID_Partition_Table
[7] http://userweb.kernel.org/~tj/partalign/

  • Mar 04 2009

Initial draft, Tejun Heo <tj@kernel.org>

  • Mar 08 2009

Updated according to comments from Daniel Taylor <Daniel.Taylor@wdc.com>. Other minor updates.
