ATA 4 KiB sector issues

'''Minimally reformatted from the original text file.  Once reviews on mailing lists are complete, I'll update the formatting.  Till then, please write to tj@kernel.org and cc linux-ide@vger.kernel.org for comments instead of editing this page directly.  Thanks.'''

=Background=
Up until recently, all ATA hard drives have been organized in 512 byte
sectors.  For example, my 500 GB (466 GiB) hard drive is organized as
976773168 512 byte sectors numbered from 0 to 976773167.  This is how
a drive communicates with the driver.  When the operating system wants
to read 32 KiB of data at the 1 MiB position, the driver asks the drive to
read 64 sectors from LBA (Logical Block Address, sector number) 2048.
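
The translation in the example above is plain division by the sector size.  A minimal Python sketch, assuming 512 byte logical sectors and requests that are already sector aligned; the function name byte_range_to_lbas is made up for illustration:

```python
LSS = 512  # logical sector size in bytes (traditional ATA drive)


def byte_range_to_lbas(offset, length, lss=LSS):
    """Return (starting LBA, number of sectors) for a byte range.

    Both offset and length must be multiples of the logical sector
    size; this sketch rejects anything else instead of rounding.
    """
    if offset % lss or length % lss:
        raise ValueError("offset and length must be sector aligned")
    return offset // lss, length // lss


# Read 32 KiB of data at the 1 MiB position.
print(byte_range_to_lbas(1 << 20, 32 << 10))  # (2048, 64)
```

The same arithmetic gives the drive size: 976773168 sectors of 512 bytes are 500107862016 bytes.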
  
[...]

This reached a point where enlarging the sector size to 4096 bytes
would yield measurably more usable space given the same raw data
storage size, and hard drive manufacturers are transitioning to 4 KiB
sectors.

Anandtech has a good article which illustrates the background and
issues with pretty diagrams[1].

=Physical vs. Logical=

Because the 512 byte sector size has been around for a very long time
and up to ATA/ATAPI-7 the sector size was fixed at 512 bytes, the [...]

As a workaround, the concept of logical sector size was introduced.
The physical medium is organized in 4 KiB sectors but the firmware on
the drive presents it as if the drive were composed of 512 byte
sectors, thus making the drive behave as before.  If the driver asks
the hard drive to read 64 sectors from LBA 2048, the firmware will
translate it and read eight 4 KiB sectors starting from hardware sector
256.  As a result, the hard drive now has two sector sizes - the
physical one which the physical media is actually organized in, and the
logical one [...]

    LBA = 8 * phys_sect
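
Under this mapping every logical sector lives at a fixed byte offset inside one physical sector.  The sketch below assumes 4096 byte physical and 512 byte logical sectors with LBA 0 at the start of physical sector 0 (no offset-by-one); it reproduces the LBA 2048 example and shows where LBA 63, the traditional start of the first partition, ends up.  The helper names are hypothetical:

```python
PSS = 4096          # physical sector size in bytes
LSS = 512           # logical sector size in bytes
RATIO = PSS // LSS  # 8 logical sectors per physical sector


def lba_to_physical(lba):
    """Return (physical sector, byte offset inside it) for an LBA,
    assuming LBA = 8 * phys_sect with no firmware offset."""
    return lba // RATIO, (lba % RATIO) * LSS


def physical_to_lba(phys_sect):
    """Return the first LBA stored in the given physical sector."""
    return RATIO * phys_sect


print(lba_to_physical(2048))  # (256, 0): starts on a physical sector
print(lba_to_physical(63))    # (7, 3584): starts 3584 bytes into sector 7
```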
  
=Alignment problem on 4 KiB physical / 512 logical drives=

This workaround keeps older hardware and software working while
allowing the drive to use a larger sector size internally.  However, the
[...]
sectors.

Most modern filesystems generate 4 KiB aligned accesses relative to
the start of the partition they are in.  If a drive maps 4 KiB physical
sectors to 512 byte logical sectors from LBA 0, the filesystem in the
first partition, which traditionally starts at LBA 63, will always be
misaligned and filesystems in later partitions are likely to be
misaligned too.
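
To see what the misalignment means for the drive, count how many physical sectors a single 4 KiB filesystem block touches.  A minimal sketch, assuming the 4096/512 byte layout with no offset-by-one, comparing a partition starting at LBA 63 with one starting at LBA 2048 (the helper name is hypothetical):

```python
def physical_sectors_touched(start_lba, nr_sectors, pss=4096, lss=512):
    """Count the physical sectors overlapped by a logical sector range,
    assuming LBA 0 begins at the start of physical sector 0."""
    first_byte = start_lba * lss
    last_byte = (start_lba + nr_sectors) * lss - 1
    return last_byte // pss - first_byte // pss + 1


# The first 4 KiB block (8 logical sectors) of a partition:
print(physical_sectors_touched(63, 8))    # 2: straddles a physical boundary
print(physical_sectors_touched(2048, 8))  # 1: fits in one physical sector
```

With the LBA 63 start, every 4 KiB block of the filesystem straddles two physical sectors, so each block write becomes a partial update of two physical sectors, which generally forces the drive into a read-modify-write cycle.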

=Solving the alignment problem on 4 KiB physical / 512 logical drives=

There are multiple ways to attempt to solve the problem.

==S-1. Yet another workaround from the firmware - offset-by-one.==

Yet another workaround which can be done by the firmware is to offset
the physical to logical mapping by one logical sector such that LBA 63
ends up on a physical sector boundary, which aligns the first partition
to physical sectors without requiring any software update.  The
example mapping between phys_sect and LBA becomes

    LBA = 8 * phys_sect - 1

The leading 512 bytes of phys_sect 0 are not used and LBA 0 starts [...]

Although this aligns only the first partition, for many use cases,
especially the ones involving older software, this workaround was
deemed useful and some recent drives with 4 KiB physical sectors are
equipped with a DIP switch to turn offset-by-one mapping on or off.
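
The offset-by-one mapping only shifts the arithmetic by one logical sector.  A sketch, again assuming 4096 byte physical and 512 byte logical sectors and the example mapping LBA = 8 * phys_sect - 1; note that it aligns LBA 63 but misaligns LBA 2048 (the 1 MiB position):

```python
PSS, LSS = 4096, 512
RATIO = PSS // LSS  # 8 logical sectors per physical sector


def lba_to_physical_offset_by_one(lba):
    """Return (physical sector, byte offset) under LBA = 8 * phys_sect - 1,
    i.e. LBA 0 starts 512 bytes into physical sector 0."""
    shifted = lba + 1  # undo the one logical sector shift
    return shifted // RATIO, (shifted % RATIO) * LSS


print(lba_to_physical_offset_by_one(0))     # (0, 512): first 512 bytes unused
print(lba_to_physical_offset_by_one(63))    # (8, 0): first partition aligned
print(lba_to_physical_offset_by_one(2048))  # (256, 512): 1 MiB not aligned
```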
  
[...]

The above firmware workaround complicates the situation because the
two different configurations require different offsets to achieve the
correct alignment.  ATA/ATAPI-8 specifies a way for a drive to export
its physical and logical sector sizes and the LBA offset which is
aligned to the physical sectors.
  
In Linux, these parameters are exported via the following sysfs nodes.

    physical sector size : /sys/block/sdX/queue/physical_block_size
    logical sector size  : /sys/block/sdX/queue/logical_block_size
    alignment offset     : /sys/block/sdX/alignment_offset
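
Reading these nodes needs nothing beyond plain file I/O.  A minimal Python sketch; the device name is whatever you pass in, and the read parameter is a hypothetical convenience so the function can be exercised without real hardware, not part of any kernel interface.  All three values are in bytes:

```python
SYSFS_BLOCK = "/sys/block"


def _read_sysfs(path):
    """Return the raw contents of a sysfs attribute file."""
    with open(path) as f:
        return f.read()


def block_topology(dev, read=_read_sysfs):
    """Return the sector sizes and alignment offset the kernel exports
    for block device dev (e.g. "sda"), as integers in bytes."""
    base = f"{SYSFS_BLOCK}/{dev}"
    return {
        "physical_block_size": int(read(f"{base}/queue/physical_block_size")),
        "logical_block_size": int(read(f"{base}/queue/logical_block_size")),
        "alignment_offset": int(read(f"{base}/alignment_offset")),
    }
```

On a live system, block_topology("sda") reads the real nodes; the values depend on the drive and on whether it reports its 4 KiB physical sectors at all (see C-1).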
  
 
Let the physical sector size be PSS, logical sector size LSS and
[...]
such that the starting LBAs of all partitions are aligned on

    (n * PSS + AOFF) / LSS

For 4 KiB physical sector offset-by-one drives, PSS is 4096, LSS 512
and AOFF 3584, and with n of 7 the above becomes

    (7 * 4096 + 3584) / 512 == 63

making sector 63 an aligned LBA where the first partition can be put.
Without the offset-by-one mapping, however, AOFF is zero and LBA 63 is
not aligned.
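
The alignment rule translates directly into a check and a round-up helper.  A sketch under the definitions above (PSS, LSS and AOFF in bytes, AOFF smaller than PSS and a multiple of LSS); the function names are hypothetical:

```python
def is_aligned_lba(lba, pss, lss, aoff):
    """True if lba == (n * pss + aoff) / lss for some integer n >= 0."""
    pos = lba * lss
    return pos >= aoff and (pos - aoff) % pss == 0


def next_aligned_lba(lba, pss, lss, aoff):
    """Return the smallest aligned LBA that is >= lba."""
    pos = max(lba * lss, aoff)
    n = -(-(pos - aoff) // pss)  # ceiling division
    return (n * pss + aoff) // lss


print(is_aligned_lba(63, 4096, 512, 3584))  # True: offset-by-one drive
print(is_aligned_lba(63, 4096, 512, 0))     # False: no offset
print(next_aligned_lba(63, 4096, 512, 0))   # 64
```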
  
 
With the above new alignment requirement in place, it becomes
difficult to honor the legacy one - first partition on sector 63 and
all other partitions on cylinder boundaries (255 * 63 sectors) - as the
two alignment requirements contradict each other.  This might be
worked around by adjusting how LBA and CHS addresses are mapped, but
the disk geometry parameters are hard coded in some places and there
is no reliable way to communicate custom geometry parameters.
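
The conflict is easy to quantify: a cylinder is 255 * 63 = 16065 sectors, and 16065 leaves a remainder of 1 when divided by 8, so only one cylinder boundary in eight falls on a 4 KiB physical sector boundary.  A sketch, assuming 4096/512 byte sectors and the de-facto 255 * 63 geometry, that lists the aligned boundaries among the first few cylinders (the function name is hypothetical):

```python
CYLINDER = 255 * 63  # sectors per cylinder in the de-facto geometry


def aligned_cylinder_boundaries(count, pss=4096, lss=512, aoff=0):
    """Return the cylinder indices k < count whose starting LBA
    (k * 16065) is of the form (n * pss + aoff) / lss."""
    aligned = []
    for k in range(count):
        pos = k * CYLINDER * lss
        if pos >= aoff and (pos - aoff) % pss == 0:
            aligned.append(k)
    return aligned


print(aligned_cylinder_boundaries(20))             # [0, 8, 16]
print(aligned_cylinder_boundaries(20, aoff=3584))  # [7, 15]
```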

=Complications=

Unfortunately, there are complications.

==C-1. The standard is not and won't be followed as-is.==
  
Some of the existing BIOSes and/or drivers can't cope with drives which
report a 4 KiB physical sector size.  To work around this, some drive
models lie and report a 512 byte physical sector size when the actual
configuration is 4 KiB without offsetting.

This nullifies the provisions for alignment in the ATA standard but
[...]
results in misalignment.
  
==C-2. The 2 TiB barrier and the possibility for 4 KiB logical sector size.==
 
The DOS partition format uses 32 bits for the starting LBA and the
number of sectors and, reportedly, 32 bit Windows XP shares the
limitation.  With 32 bit addressing and a 512 byte logical sector size,
the maximum addressable sector + 1 is at

    2^32 * 2^9 == 2^41 == 2 TiB

The DOS partition format allows a partition to reach beyond 2 TiB as
long as the starting LBA is under 2 TiB; however, both Windows XP and
the Linux kernel (at least up to v2.6.33) refuse such partition
configurations.

With the right combination of host controller, BIOS and driver, this
barrier can be overcome by enlarging the logical sector size to 4 KiB,
which pushes the barrier out to 16 TiB.  On the right configuration,
Windows XP is reportedly able to address beyond the 2 TiB barrier with
a DOS partition and 4 KiB logical sector size.  The Linux kernel up to
v2.6.33 doesn't work under such configurations but a patch to make it
work is pending[4].
  
This might also be somewhat beneficial for operating systems which
don't suffer from this limitation, because otherwise a different
partition format - GPT[5] - has to be used beyond 2^32 sectors, which
could harm compatibility with other operating systems which don't
recognize the new format.

As mentioned previously, the 512 byte sector assumption has existed for
a very long time and changing it might cause various compatibility
problems at different layers.  It has been suggested that 4 KiB logical
sector size might be primarily useful for external (USB or otherwise)
drives.
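
The capacity limits quoted in this section follow from multiplying the number of addressable sectors by the logical sector size.  A quick check in Python, assuming a 32 bit sector number as in the DOS partition format (the function name is hypothetical):

```python
TIB = 1 << 40


def addressable_bytes(lss, lba_bits=32):
    """Return the bytes reachable with lba_bits-wide sector numbers
    and lss byte logical sectors."""
    return (1 << lba_bits) * lss


print(addressable_bytes(512) // TIB)   # 2
print(addressable_bytes(4096) // TIB)  # 16
```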
  
=Windows=
As hard drive vendors aim for performance and compatibility in modern
Windows environments, it is worthwhile to investigate how Windows
behaves with, and partitions drives for, different alignment
requirements.

Although there seem to be some issues with certain BIOS settings[6],
no release from Windows XP onward depends on traditional partition
alignment, and all of them can boot from partitions with any alignment.
The reported problem seems to be caused by the BIOS trying to guess the
geometry by reading the partition table instead of using the de-facto
geometry of 255 * 63, and it can be worked around by either changing
the BIOS configuration or applying a hotfix.

It is reported that Windows 2000 depends on the traditional partition
layout and will not work properly on partitions aligned differently.
When partitioning for Windows 2000, it will be necessary to follow the
traditional partition layout; however, given the largely diminished
Windows 2000 user base, this won't be a big problem.  Having a way to
manually choose traditional alignment should be enough.

When asked to partition hard drives, Windows up to and including XP
followed the traditional layout - the first partition on LBA 63 and
the others on cylinder boundaries where a cylinder is defined as 255
tracks with 63 sectors each.  Windows Vista and 7 align partitions
differently.  As the two behave similarly, only 7's behavior is shown
here.  These partition tables were created by the Windows 7 RC
installer on blank disks.
  
 
==W-1. 512 byte physical and logical sector drive.==

   [...]
   00 000000 00 000000 00000000 00000000
   
   Part0:        FIRST   C    0 H  32 S  33 : 2048          (63 sec/trk)
                 LAST    C   12 H 223 S  19 : 206847        (255 heads/cyl)
                 LBA     2048 + 204800 = 206848
   
   Part1:        FIRST   C   12 H 223 S  20 : 206848
                 LAST    C 1023 H 254 S  63 : E
                 LBA     206848 + 312371200 = 312578048
   
   Both aligned at (2048 * n).  Part 1 not aligned to cylinder.
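
The CHS values in these tables follow from the LBAs with the de-facto geometry of 255 heads and 63 sectors per track.  A sketch of that conversion; clamping LBAs past cylinder 1023 to C 1023 H 254 S 63 mirrors the LAST entries of Part1, where the LBA is beyond what CHS can express.  The function name is hypothetical:

```python
HEADS, SECTORS = 255, 63  # heads per cylinder, sectors per track


def lba_to_chs(lba):
    """Return the (C, H, S) triple for an LBA.  LBAs past cylinder 1023
    cannot be expressed in a DOS partition entry and are clamped to
    (1023, 254, 63)."""
    cylinder, rem = divmod(lba, HEADS * SECTORS)
    if cylinder > 1023:
        return 1023, HEADS - 1, SECTORS
    head, sector = divmod(rem, SECTORS)
    return cylinder, head, sector + 1  # CHS sector numbers start at 1


print(lba_to_chs(2048))    # (0, 32, 33): Part0 FIRST
print(lba_to_chs(206847))  # (12, 223, 19): Part0 LAST
print(lba_to_chs(206848))  # (12, 223, 20): Part1 FIRST
```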
  
==W-2. 4 KiB physical and 512 byte logical sector drive without offset-by-one.==

   ST FIRST  T  LAST  LBA      NBLKS
   [...]
   00 000000 00 000000 00000000 00000000
   
   Part0:        FIRST   C    0 H  32 S  33 : 2048          (63 sec/trk)
                 LAST    C   12 H 223 S  19 : 206847        (255 heads/cyl)
                 LBA     2048 + 204800 = 206848
   
   Part1:        FIRST   C   12 H 223 S  20 : 206848
                 LAST    C 1023 H 254 S  63 : E
                 LBA     206848 + 624932864 = 625139712
   
   Both aligned at (2048 * n).  Part 1 not aligned to cylinder.
  
==W-3. 4 KiB physical and 512 byte logical sector drive with offset-by-one.==

   ST FIRST  T  LAST  LBA      NBLKS
   [...]
   00 000000 00 000000 00000000 00000000
   
   Part0:        FIRST   C    0 H  32 S  40 : 2055          (63 sec/trk)
                 LAST    C   12 H 223 S  19 : 206847        (255 heads/cyl)
                 LBA     2055 + 204793 = 206848
   
   Part1:        FIRST   C   12 H 223 S  27 : 206855
                 LAST    C 1023 H 254 S  63 : E
                 LBA     206855 + 1953314809 = 1953521664
   
   Both aligned at (2048 * n + 7).  Part 1 not aligned to cylinder.

[...]

The partitioner seems to use 1 MiB as the basic alignment unit and to
offset from there only if the drive explicitly requests it.  There is
no difference between the handling of 512 byte and 4 KiB drives, which
explains why C-1 works for hard drive vendors: a drive that claims 512
byte physical sectors still gets partitions on 1 MiB boundaries, which
are also 4 KiB aligned.
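
The starting LBAs in W-1 to W-3 all fit one observed pattern: a multiple of 1 MiB shifted by the alignment offset the drive reports.  A sketch of that inferred rule, not Microsoft's actual algorithm, assuming 512 byte logical sectors and AOFF = 3584 for the offset-by-one drive (a shift of 3584 / 512 = 7 sectors); the function name is hypothetical:

```python
MIB = 1 << 20


def observed_start_lba(n, lss=512, aoff=0):
    """Return the LBA of the n-th 1 MiB boundary shifted by the reported
    alignment offset, matching the (2048 * n) and (2048 * n + 7) starts."""
    return (n * MIB + aoff) // lss


print(observed_start_lba(1))               # 2048: Part0 in W-1 and W-2
print(observed_start_lba(101))             # 206848: Part1 in W-1 and W-2
print(observed_start_lba(1, aoff=3584))    # 2055: Part0 in W-3
print(observed_start_lba(101, aoff=3584))  # 206855: Part1 in W-3
```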
  
 
In all cases, the partitioner ignores both legacy requirements - the
first partition on LBA 63 and the others on cylinder boundaries - while
still using the same 255 * 63 cylinder size.  Also, note that in W-3,
both part 0 and part 1 end up with an odd number of sectors.  It seems
that Microsoft simply decided to completely break away from the
traditional layout, which is
[...]
available at [7].
  
   *-alignment_offset : alignment_offset reported by Linux kernel
   *-fdisk            : fdisk -l output
   *-fdisk-u          : fdisk -lu output
   *-hdparm           : hdparm -I output
   *-mbr              : dump of MBR
   *-part             : decoded partition table from MBR

Please note that hdparm versions before 9.28 misreport the alignment
offset: they report 256 instead of 512 for offset-by-one drives.
  
=So, what now for Linux?=
 
  
The situation is not easy.  Considering all the factors, the only
+
=Where Linux stands=
workable solution looks like doing what Windows is doing.  Hard drive
+
and SSD vendors are focusing on compatibility and performance on
+
recent Windows releases and are happy to do things which break the
+
standard defined mechanism as shown by C-1, so parting away from what
+
Windows does would be unnecessarily painful.
+
  
Unfortunately, while Windows can assume that newer releases won't
+
Considering all the factors, the best workable solution seems to be
share the hard drive with older releases including Windows XP, Linux
+
doing what Windows is doingHard drive and SSD vendors are focusing
distros can't do thatThere will be many installations where a
+
on compatibility and performance on recent Windows releases and are
modern Linux distros share a hard drive with older releases of
+
happy to do things which break the standard defined mechanism as shown
Windows.  At this point, I can't see a silver bullet solution.
+
by C-1, so parting away from what Windows does would be unnecessarily
 +
painfulOther than giving an option to use traditional layout for
 +
Windows releases <= 2000, always using larger alignment will achieve
 +
properly aligned partitions and acceptable compatibility.
  
Partitioners maybe should only align partitions which will be used by
+
Most of information in this section comes from the discussion thread
Linux and default to the traditional layout for others while allowing
+
reviewing an early draft of this document[8] and the following two
explicit override.  I think Windows XP wouldn't have problem with
+
documents.
differently aligned partitions as long as it doesn't actually use them
+
but haven't tested it.
+
  
Reportedly, commonly used partitioners aren't ready to handle drives
+
* I/O Limits: block sizes, alignment and I/O hints - Mike Snitzer [9]
larger than 2 TiB in any configuration and alignment isn't done
+
properly for drives with 4 KiB physical sectors.  4 KiB logical sector
+
support is broken in both the kernel and partitioners.  (need more
+
details and probably a whole section on partitioner behaviors)
+
  
Unfortunately, the transition to 4 KiB sector size, physical only or
+
* Linux & Advanced Storage Interfaces - Martin K. Petersen [10]
logical too, is looking fairly ugly.  Hopefully, a reasonable solution
+
can be reached in not too distant future but even with all the
+
software side updated, it looks like it's gonna cause significant
+
amount of confusion and frustration.
+
  

==L-1. Kernel support==

Various storage parameters, including physical and logical sector sizes
and alignment requirements, are exported via the I/O limits and storage
topology support.  The kernel gathers all the relevant parameters,
combines them according to the storage organization and exports them
to userspace.  As of v2.6.33, the support covers most of the Linux I/O
stack, including but not limited to ATA and any mass storage device
driven by the SCSI disk driver, and complex devices composed using MD,
DM and LVM.  I/O topology support is being extended to cover
virtualized storage devices.

As of v2.6.33, Linux ATA drivers do not support drives with 4 KiB
logical sector size, although there is a development branch containing
experimental support[11].  ATA drives connected via bridges to other
buses, such as USB and IEEE 1394, can be handled by the SCSI disk
driver as long as the bridges support 4 KiB logical sector size
correctly.

There is currently a limitation in DOS partition handling which
prevents DOS partitions from growing over 2 TiB even with 4 KiB sector
size, but this is being worked on[4].

==L-2. Userspace tools status (thanks to Karel Zak[12])==

* libblkid provides a unified API to topology information; it supports:
** ioctls (kernel >= 2.6.32)
** sysfs (kernel >= 2.6.31)
** stripe chunk size and stripe width for DM, MD, LVM and EVMS on old kernels

* libparted and fdisk are linked against libblkid

* fdisk supports 4 KiB logical sector size (util-linux-ng >= 2.15)
* fdisk supports 4 KiB physical sector size (util-linux-ng >= 2.17)
* fdisk uses 1 MiB alignment (or more if the optimal I/O size is bigger) and alignment_offset for all partitions in non-DOS mode (util-linux-ng >= 2.17.1)

* parted supports 4 KiB physical sector size
* parted uses 1 MiB alignment for disks with unknown topology; disks with topology information are aligned to the optimal (or minimum) I/O size (parted >= 2.1)
* The latest news on parted status can be found here[13]

* EFI GPT code in the kernel has been updated to work properly with 4 KiB sectors (kernel >= 2.6.33)

* mkfs.{ext,xfs,gfs2,ocfs2} have been updated to work properly with topology information; mkfs.{ext,xfs} are linked against libblkid for compatibility with old kernels (for stripe chunk size / width)

* Fedora-13/RHEL6 installer uses libparted with 4 KiB support

* alignment_offset & 4 KiB support is planned for LUKS (cryptsetup)

Overall, distributions released after spring 2010 with the updated
tools shouldn't have much problem aligning and dealing with 4 KiB
physical sector drives.  If you are working on or testing a distro,
please make sure all storage related tools are up to date and align
disks properly.

==L-3. Booting and boot loaders==

On traditional PC configurations, Linux booting is done in several
stages.  The BIOS should be able to probe and access the drive.  It
reads the MBR off the drive and passes control to it.  The MBR contains
the initial chunk of the boot loader, which reads more data (often off
the same drive) necessary for booting - usually further stages of the
boot loader.  This process repeats as necessary until the kernel and
module images are loaded and control is passed to the kernel.  There
can be different issues at various layers.
 
 +
At the BIOS level, the following problems have been reported or are
 +
suspected.
 +
 
 +
* Some reportedly have issues accessing drives which report hardware sector size which is larger than 512 bytes even if the logical sector size remains 512 bytes (see C-1).
 +
* INT13h EDD uses 64bit LBA but some BIOSs might have problems with accessing drives which have higher capacity than 2TiB (32 bit limit).
 +
* Depending on the BIOS configuration, some read the partition table and solve CHS/LBA equations to figure out the geometry used during partitioning which seems to cause compatibility problems with partitions which don't consider geometry alignment at all[6].
 +
* It's reasonable to suspect that some (or rather, many) BIOSs wouldn't be able to access or boot off ATA drives with 4KiB logical sector size.
 +
 
 +
Despite the various problems, in general, all a BIOS needs to boot
 +
from a hard drive is reading the MBR off it and as long as logical
 +
block size remains at 512 bytes, most BIOSs should be able to boot off
 +
large and/or differently aligned drives.

On top of working BIOS access to the drives, boot loaders may have
additional dependencies.  For example, GRUB needs to understand the
partition table format and the filesystem itself to retrieve the
kernel image and modules, while LILO hard codes the LBAs of the needed
blocks and thus doesn't care about how the blocks are logically
organized.

* As long as the BIOS can access the hard drive, LILO should be able to boot regardless of partition table format or alignment.  However, it is as yet unknown whether there would be hidden issues with >2 TiB hard drives or 4 KiB logical sector size (if you know or have tested, please let me know).

* GRUB is not affected by partition alignment.  According to the GRUB2 wiki Current Status page, it supports GPT and presumably >2 TiB disks.  It is unclear how 4 KiB logical sector size would work (please let me know).  Support status for GRUB legacy (0.9.x) is rather unclear but it seems to require a patch to make GPT work.  >2 TiB support status is unclear (again...).

* H. Peter Anvin reports that syslinux should work fine with any alignment and with GPT when gptmbr.bin is installed[14].  4 KiB logical sector support has bit-rotted but he intends to update it[15].  >2 TiB support status is unclear (please let me know).
 
 +
=Random thoughts and comments (mostly for distros)=
 +
 
 +
* All upstream partitioning tools have been updated properly regarding alignment.  They either already default to larger alignment or are scheduled to switch to it.  For new releases, please make sure all the tools are up-to-date and larger alignment rules are in effect.
 +
 
 +
* Windows >= XP wouldn't have any problem sharing or booting from partition prepared with larger alignment, so compatibility implications will not be major.  Providing a mechanism to force legacy cylinder alignment or describing a way to manually create partitions with legacy layout should be enough.
 +
 
 +
* In newer releases of fdisk (util-linux-ng >= 2.17.1), traditional cylinder based alignment can be requested by turning on DOS Compatibility flag (the 'c' command).
 +
 
 +
* In case INT13h EDD has problems accessing sectors beyond 2TiB, it would be better to put data necessary for booting inside a boot partition which is contained inside 2TiB limit.
 +
 
 +
* GPT is unavoidable for 512 byte logical sector drives which is larger than 2TiB and there are clear advantages of GPT such as better protection against corruption, lack of artificial distinctions between primary and extended/logical partitions.  When compatibility with older software is not an issue, it could be better to default to GPT.
 +
 
 +
* Drives >2TiB and 4KiB logical sector size support status seems unclear.  It will be great if we can get proper prototype hardware into upstream developers' hands and make sure software side is ready before the actual products hit the market.
 +
 
 +
 
 +
=Document history=
 +
 
 +
* Mar 04 2010 Tejun Heo <tj@kernel.org>
 +
Initial draft.
 +
 
 +
* Mar 08 2010 Tejun Heo <tj@kernel.org>
 
Updated according to comments from Daniel Taylor
 
Updated according to comments from Daniel Taylor
 
<Daniel.Taylor@wdc.com>.  Other minor updates.
 
<Daniel.Taylor@wdc.com>.  Other minor updates.
 +
 +
* Mar 15 2010 Tejun Heo <tj@kernel.org>
 +
Updated according to various comments from discussions[8] on
 +
LKML and linux-ide.
 +
 +
 +
=References=
 +
 +
  [1] http://www.anandtech.com/storage/showdoc.aspx?i=3691
 +
  [2] http://www.osnews.com/story/22872/Linux_Not_Fully_Prepared_for_4096-Byte_Sector_Hard_Drives
 +
  [3] http://en.wikipedia.org/wiki/Master_boot_record
 +
  [4] http://thread.gmane.org/gmane.linux.kernel/953981
 +
  [5] http://en.wikipedia.org/wiki/GUID_Partition_Table
 +
  [6] http://support.microsoft.com/kb/931760
 +
  [7] http://userweb.kernel.org/~tj/partalign/
 +
  [8] http://thread.gmane.org/gmane.linux.ide/45211
 +
  [9] http://people.redhat.com/msnitzer/docs/io-limits.txt
 +
[10] http://oss.oracle.com/~mkp/docs/linux-advanced-storage.pdf
 +
[11] git://git.kernel.org/pub/scm/linux/kernel/git/jgarzik/libata-dev.git sectsize
 +
[12] http://article.gmane.org/gmane.linux.ide/45228
 +
[13] http://git.debian.org/?p=parted/parted.git;a=blob;f=NEWS
 +
[14] http://article.gmane.org/gmane.linux.ide/45293
 +
[15] http://article.gmane.org/gmane.linux.ide/45214

Revision as of 04:01, 16 March 2010

Minimally reformatted from the original text file. Once reviews on mailinglists are complete, I'll update the formatting. Till then, please write to tj@kernel.org and cc linux-ide@vger.kernel.org for comments instead of editing this page directly. Thanks.

ATA 4 KiB sector issues

Background

Up until recently, all ATA hard drives have been organized in 512 byte sectors. For example, my 500 GB (about 466 GiB) hard drive is composed of 976773168 512 byte sectors numbered from 0 to 976773167. The drive and the driver communicate in terms of these sectors. When the operating system wants to read 32 KiB of data at the 1 MiB position, the driver asks the drive to read 64 sectors from LBA (Logical Block Address, the sector number) 2048.
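
The byte-to-sector arithmetic above can be sketched in a few lines of Python. The helper name `byte_range_to_lba` is hypothetical. The sketch assumes a 512 byte logical sector size and a request that is already sector aligned. It also computes the capacity of the example drive.

```python
LOGICAL_SECTOR_SIZE = 512  # bytes per logical sector (assumed)


def byte_range_to_lba(offset, length, sector_size=LOGICAL_SECTOR_SIZE):
    """Convert a byte offset and length into (starting LBA, sector count).

    Drives transfer whole sectors only, so both values must be
    multiples of the sector size.
    """
    if offset % sector_size or length % sector_size:
        raise ValueError("offset and length must be sector aligned")
    return offset // sector_size, length // sector_size


# 32 KiB at the 1 MiB position -> 64 sectors starting at LBA 2048
print(byte_range_to_lba(1 << 20, 32 << 10))  # (2048, 64)

# Capacity of the example drive: 976773168 sectors of 512 bytes
capacity = 976773168 * LOGICAL_SECTOR_SIZE
print(capacity, round(capacity / 2**30))  # 500107862016 466
```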

Because each sector should be individually addressable, readable and writable, the physical medium is also organized in sectors of the same size. In addition to the area that stores the actual data, each sector requires extra space for bookkeeping. This includes inter-sector space, which enables locating and addressing each sector, and ECC data, which detects and corrects inevitable raw data errors.

As the densities and capacities of hard drives keep growing, stronger ECC becomes necessary to guarantee an acceptable level of data integrity, which increases the space overhead. In addition, most applications now access hard drives in units of at least 8 sectors (4096 bytes), so maintaining 512 byte granularity has become somewhat meaningless.

This has reached a point where enlarging the sector size to 4096 bytes would yield measurably more usable space from the same raw storage capacity, and hard drive manufacturers are transitioning to 4KiB sectors.

Anandtech has a good article which illustrates the background and issues with pretty diagrams[1].


Physical vs. Logical

The 512 byte sector size has been around for a very long time, and up to ATA/ATAPI-7 it was fixed by the standard. As a result, the assumption is scattered across all the layers: controllers or bridge chips snooping commands, BIOSs, boot code, drivers, partitioners and system utilities. This makes it very difficult to change the sector size from 512 bytes without massively breaking backward compatibility.

As a workaround, the concept of logical sector size was introduced. The physical medium is organized in 4KiB sectors, but the firmware on the drive presents it as if the drive were composed of 512 byte sectors, so the drive behaves as before. If the driver asks the hard drive to read 64 sectors from LBA 2048, the firmware translates the request and reads 8 4KiB sectors starting from hardware sector 256. As a result, the hard drive now has two sector sizes. The physical one is the size in which the media is actually organized. The logical one is the size that the firmware presents to the outside world.

A straightforward example mapping between physical sectors and LBAs would be

 LBA = 8 * phys_sect
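
The following is a minimal Python sketch of how firmware using this straightforward mapping might translate a logical request into physical sectors, using the LBA 2048 example above. `logical_to_physical` is a hypothetical helper for illustration and does not represent actual drive firmware.

```python
SECTORS_PER_PHYS = 4096 // 512  # 8 logical sectors per physical sector


def logical_to_physical(lba, count, per_phys=SECTORS_PER_PHYS):
    """Map a logical request onto the physical sectors it touches.

    Assumes the straightforward mapping LBA = 8 * phys_sect and returns
    (first physical sector, number of physical sectors).
    """
    first = lba // per_phys
    last = (lba + count - 1) // per_phys
    return first, last - first + 1


print(logical_to_physical(2048, 64))  # (256, 8)
```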


Alignment problem on 4KiB physical / 512 logical drives

This workaround keeps older hardware and software working while allowing the drive to use a larger sector size internally. However, the discrepancy between physical and logical sector sizes creates an alignment issue. For example, if the driver wants to read 7 sectors from LBA 2047, the firmware has to read hardware sectors 255 and 256. It then trims the leading 7 * 512 bytes and the trailing 2 * 512 bytes.

For reads, this isn't an issue because drives read in larger chunks anyway. For writes, however, the drive has to perform a read-modify-write. It first reads hardware sectors 255 and 256, then updates the requested parts, and finally writes those sectors back. This can cause significant performance degradation[2].
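
The next sketch checks whether a write triggers read-modify-write under the LBA = 8 * phys_sect mapping. It assumes that any partially covered physical sector forces read-modify-write. `write_plan` is a hypothetical name. For the LBA 2047 example, the trailing trim works out to 2 * 512 bytes.

```python
def write_plan(lba, count, per_phys=8, logical_size=512):
    """Describe how a drive with LBA = 8 * phys_sect mapping would have
    to service a write of `count` logical sectors starting at `lba`.

    Returns (first_phys, last_phys, leading_trim, trailing_trim, needs_rmw)
    with the trims given in bytes.
    """
    first = lba // per_phys
    last = (lba + count - 1) // per_phys
    leading = (lba - first * per_phys) * logical_size
    trailing = ((last + 1) * per_phys - (lba + count)) * logical_size
    # A partially covered physical sector must be read, patched and rewritten.
    needs_rmw = leading != 0 or trailing != 0
    return first, last, leading, trailing, needs_rmw


print(write_plan(2047, 7))   # (255, 256, 3584, 1024, True)
print(write_plan(2048, 64))  # (256, 263, 0, 0, False)
```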

The problem is aggravated by the way DOS partitions[3] have traditionally been laid out. For reasons dating back more than two decades, they are laid out around something called disk geometry. Nowadays, disk geometry consists of arbitrary values, subject to a number of backward-compatibility restrictions accumulated over the years. As a result, until recently (most Linux variants, and Windows up to XP), the first partition ends up on sector 63. Later partitions end up on cylinder boundaries, where each cylinder is usually composed of 255 * 63 sectors.

Most modern filesystems generate 4KiB aligned accesses relative to the start of the partition they are in. If a drive maps 4KiB physical sectors to 512 byte logical sectors from LBA 0, the filesystem in the first partition will always be misaligned, and filesystems in later partitions are likely to be misaligned too.
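
A short Python sketch can check why later partitions are usually misaligned as well. It assumes the de-facto 255 * 63 geometry and the straightforward LBA = 8 * phys_sect mapping. Because 16065 is one more than a multiple of 8, only every eighth cylinder boundary falls on a physical sector. The names are illustrative.

```python
SECTORS_PER_CYL = 255 * 63  # 16065 sectors per de-facto cylinder
PER_PHYS = 8                # 4096 / 512 with LBA = 8 * phys_sect


def legacy_starts(count):
    """Starting LBAs for `count` partitions in the legacy layout: the
    first on LBA 63, the rest on consecutive cylinder boundaries."""
    return [63] + [n * SECTORS_PER_CYL for n in range(1, count)]


for lba in legacy_starts(4):  # every line reports "misaligned"
    print(lba, "aligned" if lba % PER_PHYS == 0 else "misaligned")

# Which of the first 32 cylinder boundaries land on a physical sector?
aligned_cyls = [n for n in range(1, 33) if (n * SECTORS_PER_CYL) % PER_PHYS == 0]
print(aligned_cyls)  # [8, 16, 24, 32]
```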


Solving the alignment problem on 4KiB physical / 512 logical drives

There are multiple ways to attempt to solve the problem.

S-1. Yet another workaround from the firmware - offset-by-one.

Another workaround that the firmware can apply is to offset the physical to logical mapping by one logical sector, so that LBA 63 ends up on a physical sector boundary. This aligns the first partition to physical sectors without requiring any software update. The example mapping between phys_sect and LBA becomes

   LBA = 8 * phys_sect - 1

The leading 512 bytes of phys_sect 0 are not used, and LBA 0 starts after that point. phys_sect 1 maps to LBA 7 and phys_sect 8 maps to LBA 63, so LBA 63 is aligned on a hardware sector.
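
The two mappings can be compared with a small Python sketch. `phys_to_lba` and `lba_is_aligned` are hypothetical helpers. They assume 8 logical sectors per physical sector and ignore the unusable first 512 bytes of phys_sect 0 under offset-by-one.

```python
def phys_to_lba(phys_sect, per_phys=8, offset_by_one=False):
    """LBA at which physical sector `phys_sect` starts.

    Straightforward mapping: LBA = 8 * phys_sect
    Offset-by-one mapping:   LBA = 8 * phys_sect - 1
    """
    return per_phys * phys_sect - (1 if offset_by_one else 0)


def lba_is_aligned(lba, per_phys=8, offset_by_one=False):
    """True if `lba` starts on a physical sector boundary."""
    shift = 1 if offset_by_one else 0
    return (lba + shift) % per_phys == 0


print(phys_to_lba(1, offset_by_one=True), phys_to_lba(8, offset_by_one=True))  # 7 63
print(lba_is_aligned(63), lba_is_aligned(63, offset_by_one=True))  # False True
```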

Although this aligns only the first partition, the workaround was deemed useful for many use cases, especially those involving older software. Some recent drives with 4KiB physical sectors are equipped with a dip switch to turn offset-by-one mapping on or off.

S-2. The proper solution.

Correct alignment for all partitions can't be achieved by the firmware alone. System utilities must be informed about the alignment requirements so that they can align partitions accordingly.

The firmware workaround above complicates the situation because the two configurations require different offsets to achieve correct alignment. ATA/ATAPI-8 specifies a way for a drive to export its physical and logical sector sizes, along with the LBA offset that is aligned to the physical sectors.

In Linux, these parameters are exported via the following sysfs nodes.

   physical sector size	: /sys/block/sdX/queue/physical_block_size
   logical sector size		: /sys/block/sdX/queue/logical_block_size
   alignment offset		: /sys/block/sdX/alignment_offset
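
A userspace tool could read these values with a sketch like the following. It assumes that all three nodes contain a single integer in bytes. `read_topology` is a hypothetical helper. The `sysfs_root` parameter exists only so the function can be pointed at a test directory instead of the real `/sys/block`.

```python
from pathlib import Path


def read_topology(dev, sysfs_root="/sys/block"):
    """Read sector sizes and the alignment offset of block device `dev`
    (e.g. "sda") from sysfs. All values are returned in bytes."""
    base = Path(sysfs_root) / dev

    def read_int(rel):
        # Each sysfs attribute holds one decimal integer plus a newline.
        return int((base / rel).read_text().strip())

    return {
        "physical_block_size": read_int("queue/physical_block_size"),
        "logical_block_size": read_int("queue/logical_block_size"),
        "alignment_offset": read_int("alignment_offset"),
    }
```

On a live system, `read_topology("sda")` would return a dict such as `{"physical_block_size": 4096, "logical_block_size": 512, "alignment_offset": 0}`, depending on the drive.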

Let the physical sector size be PSS, the logical sector size LSS and the alignment offset AOFF. System software should place partitions such that the starting LBA of every partition is of the form

   (n * PSS + AOFF) / LSS

For 4KiB physical sector offset-by-one drives, PSS is 4096, LSS is 512 and AOFF is 3584. With n = 7, the formula above becomes

   (7 * 4096 + 3584) / 512 == 63

This makes sector 63 an aligned LBA where the first partition can be placed. Without the offset-by-one mapping, however, AOFF is zero and LBA 63 is not aligned.
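
The formula and its inverse can be expressed in a short Python sketch. The helper names are hypothetical, and all sizes are in bytes, as in the sysfs nodes.

```python
def aligned_lba(n, pss, lss, aoff):
    """n-th aligned LBA: (n * PSS + AOFF) / LSS."""
    byte_pos = n * pss + aoff
    if byte_pos % lss:
        raise ValueError("alignment point does not fall on a logical sector")
    return byte_pos // lss


def is_aligned(lba, pss, lss, aoff):
    """True if a partition starting at `lba` is aligned to physical sectors."""
    return (lba * lss - aoff) % pss == 0


print(aligned_lba(7, 4096, 512, 3584))  # 63
print(is_aligned(63, 4096, 512, 3584), is_aligned(63, 4096, 512, 0))  # True False
```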

With this new alignment requirement in place, it becomes difficult to honor the legacy one, which places the first partition on sector 63 and all other partitions on cylinder boundaries of 255 * 63 sectors, because the two requirements contradict each other. This might be worked around by adjusting how LBA and CHS addresses are mapped. However, the disk geometry parameters are hard coded in some places, and there is no reliable way to communicate custom geometry parameters.


Complications

Unfortunately, there are complications.

C-1. The standard is not and won't be followed as-is.

Some existing BIOSs and/or drivers can't cope with drives that report a 4KiB physical sector size. To work around this, some drive models falsely report a 512 byte physical sector size when the actual configuration is 4KiB without offsetting.

This nullifies the alignment provisions in the ATA standard but results in correct alignment for Windows Vista and 7. OS behaviors are described in more detail later.

For these drives, which are likely to continue shipping for the foreseeable future, traditional LBA 63 and cylinder based alignment results in misalignment.

C-2. The 2TiB barrier and the possibility for 4KiB logical sector size.

The DOS partition format uses 32 bits for the starting LBA and the number of sectors, and 32 bit Windows XP reportedly shares the limitation. With 32 bit addressing and a 512 byte logical sector size, the maximum addressable sector + 1 is at

   2^32 * 2^9 == 2^41 == 2TiB
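
The same arithmetic for both logical sector sizes discussed in this section can be written as a small Python sketch. `dos_limit_bytes` is a hypothetical helper.

```python
TiB = 2**40


def dos_limit_bytes(logical_sector_size, lba_bits=32):
    """Position of (maximum addressable sector + 1) in bytes when sector
    numbers are limited to `lba_bits` bits."""
    return 2**lba_bits * logical_sector_size


print(dos_limit_bytes(512) // TiB)   # 2
print(dos_limit_bytes(4096) // TiB)  # 16
```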

The DOS partition format allows a partition to extend beyond 2TiB as long as its starting LBA is under 2TiB. However, both Windows XP and the Linux kernel (at least up to v2.6.33) refuse such partition configurations.

With the right combination of host controller, BIOS and driver, this barrier can be overcome by enlarging the logical sector size to 4KiB, which pushes the barrier out to 16TiB. On the right configuration, Windows XP is reportedly able to address beyond the 2TiB barrier with a DOS partition and 4KiB logical sector size. The Linux kernel up to v2.6.33 doesn't work under such configurations, but a patch to make it work is pending[4].

This might also be somewhat beneficial for operating systems that don't suffer from this limitation. Beyond 2^32 sectors, a different partition format, GPT[5], must be used, which could harm compatibility with other operating systems that don't recognize the new format.

As mentioned previously, the 512 byte sector assumption has existed for a very long time, and changing it might cause various compatibility problems at different layers. It has been suggested that 4KiB logical sector size might be primarily useful for external (USB or otherwise) drives.


Windows

As hard drive vendors aim for performance and compatibility in modern Windows environments, it is worthwhile to investigate how Windows behaves and partitions with different alignment requirements.

Although there seem to be some issues with certain BIOS settings[6], all releases from Windows XP onward do not depend on traditional partition alignment and can boot from partitions with any alignment. The reported problem seems to be caused by the BIOS trying to guess the geometry from the partition table instead of using the de-facto geometry of 255 * 63. It can be worked around either by changing the BIOS configuration or by applying a hotfix.

Windows 2000 reportedly depends on the traditional partition layout and will not work properly on partitions aligned differently. When partitioning for Windows 2000, it will be necessary to follow the traditional layout. However, given the greatly diminished Windows 2000 user base, this won't be a big problem, and having a way to manually choose traditional alignment should be enough.

When asked to partition hard drives, Windows up to and including XP followed the traditional layout. The first partition went on LBA 63 and the others on cylinder boundaries, where a cylinder is defined as 255 tracks of 63 sectors each. Windows Vista and 7 align partitions differently. As the two behave similarly, only 7's behavior is shown here. These partition tables were created by the Windows 7 RC installer on blank disks.

W-1. 512 byte physical and logical sector drive.

 ST FIRST  T  LAST   LBA      NBLKS
 80 202100 07 df130c 00080000 00200300
 00 df140c 07 feffff 00280300 00689e12
 00 000000 00 000000 00000000 00000000
 00 000000 00 000000 00000000 00000000
 
 Part0:        FIRST   C    0  H   32  S   33  : 2048          (63 sec/trk)
               LAST    C   12  H  223  S   19  : 206847        (255 heads/cyl)
               LBA     2048 + 204800 = 206848
 
 Part1:        FIRST   C   12  H  223  S   20  : 206848
               LAST    C 1023  H  254  S   63  : E
               LBA     206848 + 312371200 = 312578048
 
 Both aligned at (2048 * n).  Part 1 not aligned to cylinder.
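
The hex columns in these tables can be decoded with a Python sketch that follows the standard MBR partition entry layout. In that layout, a CHS field is a head byte, then a sector byte that carries the two high cylinder bits, then the cylinder low byte. The LBA and NBLKS fields are little-endian 32 bit values. The CHS to LBA conversion assumes the 255 * 63 geometry used in the annotations. The helper names are hypothetical.

```python
def le32(hexstr):
    """Decode an 8 hex digit little-endian 32 bit field, e.g. "00080000"."""
    return int.from_bytes(bytes.fromhex(hexstr), "little")


def chs(hexstr):
    """Decode a 3 byte CHS field into (cylinder, head, sector)."""
    head, sec_byte, cyl_lo = bytes.fromhex(hexstr)
    sector = sec_byte & 0x3F                     # low 6 bits
    cylinder = ((sec_byte & 0xC0) << 2) | cyl_lo  # 10 bit cylinder
    return cylinder, head, sector


def chs_to_lba(c, h, s, heads=255, sectors=63):
    """Convert CHS to LBA with the de-facto 255 * 63 geometry."""
    return (c * heads + h) * sectors + (s - 1)


# Part0 of W-1: FIRST 202100, LAST df130c, LBA 00080000, NBLKS 00200300
start, nblks = le32("00080000"), le32("00200300")
print(start, nblks, start + nblks)                # 2048 204800 206848
print(chs("202100"), chs_to_lba(*chs("202100")))  # (0, 32, 33) 2048
print(chs("df130c"), chs_to_lba(*chs("df130c")))  # (12, 223, 19) 206847
```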

W-2. 4KiB physical and 512 byte logical sector drive without offset-by-one.

 ST FIRST  T  LAST   LBA      NBLKS
 80 202100 07 df130c 00080000 00200300
 00 df140c 07 feffff 00280300 00b83f25
 00 000000 00 000000 00000000 00000000
 00 000000 00 000000 00000000 00000000
 
 Part0:        FIRST   C    0  H   32  S   33  : 2048          (63 sec/trk)
               LAST    C   12  H  223  S   19  : 206847        (255 heads/cyl)
               LBA     2048 + 204800 = 206848
 
 Part1:        FIRST   C   12  H  223  S   20  : 206848
               LAST    C 1023  H  254  S   63  : E
               LBA     206848 + 624932864 = 625139712
 
 Both aligned at (2048 * n).  Part 1 not aligned to cylinder.

W-3. 4KiB physical and 512 byte logical sector drive with offset-by-one.

 ST FIRST  T  LAST   LBA      NBLKS
 80 202800 07 df130c 07080000 f91f0300
 00 df1b0c 07 feffff 07280300 f9376d74
 00 000000 00 000000 00000000 00000000
 00 000000 00 000000 00000000 00000000
 
 Part0:        FIRST   C    0  H   32  S   40  : 2055          (63 sec/trk)
               LAST    C   12  H  223  S   19  : 206847        (255 heads/cyl)
               LBA     2055 + 204793 = 206848
 
 Part1:        FIRST   C   12  H  223  S   27  : 206855
               LAST    C 1023  H  254  S   63  : E
               LBA     206855 + 1953314809 = 1953521664
 
 Both aligned at (2048 * n + 7).  Part 1 not aligned to cylinder.

The partitioner seems to use 1MiB as the basic alignment unit and to offset from there if the drive explicitly requests it. It handles 512 byte and 4KiB drives the same way, which explains why C-1 works for hard drive vendors.
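
The start LBAs in W-1 through W-3 can be reproduced with the S-2 formula by substituting 1MiB for PSS. This is a sketch inferred from the tables, not from Windows documentation. `windows_like_start` is a hypothetical helper.

```python
ONE_MIB = 1 << 20


def windows_like_start(n, lss=512, aoff=0):
    """LBA of the n-th 1MiB alignment point, shifted by the drive-reported
    alignment offset AOFF (in bytes): (n * 1MiB + AOFF) / LSS."""
    return (n * ONE_MIB + aoff) // lss


print(windows_like_start(1))               # 2048   (W-1/W-2 part 0)
print(windows_like_start(1, aoff=3584))    # 2055   (W-3 part 0)
print(windows_like_start(101, aoff=3584))  # 206855 (W-3 part 1)
```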

In all cases, the partitioner ignores both legacy requirements (the first partition on LBA 63 and the others on cylinder boundaries) while still using the same 255 * 63 cylinder size. Also note that in W-3, both part 0 and part 1 end up with an odd number of sectors. It seems that they simply decided to break away from the traditional layout completely. This is understandable, given that no single good solution covers all the cases and that the larger default alignment benefits earlier SSDs.

Windows Vista shows basically the same behavior. Vista was tested by creating two partitions using the management tool. Test data is available at [7] in the following files.

 *-alignment_offset	: alignment_offset reported by Linux kernel
 *-fdisk		: fdisk -l output
 *-fdisk-u		: fdisk -lu output
 *-hdparm		: hdparm -I output
 *-mbr			: dump of mbr
 *-part		: decoded partition table from mbr

Please note that hdparm misreports the alignment offset for offset-by-one drives, reporting 256 where it should report 512. This problem is fixed as of version 9.28.


Where Linux stands

Considering all the factors, the best workable solution seems to be doing what Windows does. Hard drive and SSD vendors focus on compatibility and performance on recent Windows releases. As C-1 shows, they are happy to do things that break the standard-defined mechanism, so departing from what Windows does would be unnecessarily painful. Apart from providing an option to use the traditional layout for Windows releases <= 2000, always using larger alignment will achieve properly aligned partitions and acceptable compatibility.

Most of the information in this section comes from the discussion thread reviewing an early draft of this document[8] and from the following two documents.

  • I/O Limits: block sizes, alignment and I/O hints - Mike Snitzer [9]
  • Linux & Advanced Storage Interfaces - Martin K. Petersen [10]

L-1. Kernel support

Various storage parameters, including physical and logical sector sizes and alignment requirements, are exported via the IO limits and storage topology support. The kernel gathers all the relevant parameters, combines them according to the storage organization and exports them to userspace. As of v2.6.33, the support covers most of the Linux I/O stack. This includes, but is not limited to, ATA, any mass storage device driven by the SCSI disk driver, and complex devices composed using MD, DM and LVM. IO topology support is being extended to cover virtualized storage devices.

As of v2.6.33, Linux ATA drivers do not support drives with 4KiB logical sector size, although there is a development branch containing experimental support[11]. ATA drives connected via bridges to other buses, such as USB and IEEE 1394, can be handled by the SCSI disk driver as long as the bridges support 4KiB logical sector size correctly.

There is currently a limitation in DOS partition handling that prevents DOS partitions from growing beyond 2TiB even with 4KiB sector size, but this is being worked on[4].

L-2. Userspace tools status (thanks to Karel Zak[12])

  • libblkid provides a unified API to topology information. It supports:
    • ioctls (kernel >= 2.6.32)
    • sysfs (kernel >= 2.6.31)
    • stripe chunk size and stripe width for DM, MD, LVM and EVMS on old kernels
  • libparted and fdisk are linked against libblkid
  • fdisk supports 4KiB logical sector size (util-linux-ng >= 2.15)
  • fdisk supports 4KiB physical sector size (util-linux-ng >= 2.17)
  • fdisk uses 1MiB alignment (or more if optimal I/O size is bigger) and alignment_offset for all partitions in non-DOS mode (util-linux-ng >= 2.17.1)
  • parted supports 4KiB physical sector size
  • parted uses 1MiB alignment for disks with unknown topology, disks with topology information are aligned to optimal (or minimum) I/O size (parted >= 2.1)
  • The latest news on parted status can be found here[13]
  • EFI GPT code in the kernel has been updated to work properly with 4KiB sectors (kernel >= 2.6.33)
  • mkfs.{ext,xfs,gfs2,ocfs2} have been updated to work properly with topology information; mkfs.{ext,xfs} are linked against libblkid for compatibility with old kernels (for stripe chunk size / width)
  • Fedora-13/RHEL6 installer uses libparted with 4KiB support
  • alignment_offset & 4KiB support is planned for LUKS (cryptsetup)

Overall, distributions released after Spring 2010 with the updated tools shouldn't have much trouble aligning partitions and dealing with 4KiB physical sector drives. If you are working on or testing a distro, please make sure all storage related tools are up-to-date and align disks properly.

L-3. Booting and boot loaders

On traditional PC configurations, Linux booting is done in several stages. First, the BIOS must be able to probe and access the drive. It reads the MBR off the drive and passes control to it. The MBR contains the initial chunk of the boot loader, which reads more data needed for booting (often off the same drive), usually further stages of the boot loader. This process repeats as necessary until the kernel and module images are loaded and control is passed to the kernel. Different issues can arise at each of these layers.

At the BIOS level, the following problems have been reported or are suspected.

  • Some reportedly have issues accessing drives that report a hardware sector size larger than 512 bytes, even if the logical sector size remains 512 bytes (see C-1).
  • INT13h EDD uses 64 bit LBAs, but some BIOSs might have problems accessing drives with a capacity higher than 2TiB (the 32 bit limit).
  • Depending on the BIOS configuration, some read the partition table and solve CHS/LBA equations to figure out the geometry used during partitioning which seems to cause compatibility problems with partitions which don't consider geometry alignment at all[6].
  • It's reasonable to suspect that some (or rather, many) BIOSs wouldn't be able to access or boot off ATA drives with 4KiB logical sector size.

Despite these problems, all a BIOS generally needs to do to boot from a hard drive is read the MBR off it. As long as the logical block size remains 512 bytes, most BIOSs should be able to boot off large and/or differently aligned drives.

On top of working BIOS access to the drives, boot loaders may have additional dependencies. For example, GRUB needs to understand the partition table format and the filesystem itself to retrieve the kernel image and modules. LILO, on the other hand, hard-codes the LBAs of the blocks it needs and thus doesn't care how those blocks are logically organized.

  • As long as the BIOS can access the hard drive, LILO should be able to boot regardless of partition table format or alignment. However, it is not yet known whether there are hidden issues with >2TiB hard drives or 4KiB logical sector size (if you know or have tested, please let me know).
  • GRUB is not affected by partition alignment. According to the GRUB2 wiki Current Status page, it supports GPT and presumably >2TiB disks. It is unclear how 4KiB logical sector size would work (please let me know). Support status for GRUB legacy (0.9.x) is rather unclear, but it seems to require a patch to make GPT work. >2TiB support status is also unclear.
  • H. Peter Anvin reports that syslinux should work fine with any alignment, and with GPT when gptmbr.bin is installed[14]. Its 4KiB logical sector support has bit-rotted, but he intends to update it[15]. >2TiB support status is unclear (please let me know).


Random thoughts and comments (mostly for distros)

  • All upstream partitioning tools have been updated properly regarding alignment. They either already default to larger alignment or are scheduled to switch to it. For new releases, please make sure all the tools are up-to-date and larger alignment rules are in effect.
  • Windows >= XP shouldn't have any problem sharing or booting from partitions prepared with larger alignment, so the compatibility implications will not be major. Providing a mechanism to force legacy cylinder alignment, or describing a way to manually create partitions with the legacy layout, should be enough.
  • In newer releases of fdisk (util-linux-ng >= 2.17.1), traditional cylinder based alignment can be requested by turning on DOS Compatibility flag (the 'c' command).
  • In case INT13h EDD has problems accessing sectors beyond 2TiB, it would be better to put the data necessary for booting in a boot partition contained within the 2TiB limit.
  • GPT is unavoidable for 512 byte logical sector drives larger than 2TiB. GPT also has clear advantages, such as better protection against corruption and no artificial distinction between primary and extended/logical partitions. When compatibility with older software is not an issue, it could be better to default to GPT.
  • Support status for drives larger than 2TiB and for 4KiB logical sector size seems unclear. It would be great if we could get proper prototype hardware into upstream developers' hands and make sure the software side is ready before the actual products hit the market.


Document history

  • Mar 04 2010 Tejun Heo <tj@kernel.org>

Initial draft.

  • Mar 08 2010 Tejun Heo <tj@kernel.org>

Updated according to comments from Daniel Taylor <Daniel.Taylor@wdc.com>. Other minor updates.

  • Mar 15 2010 Tejun Heo <tj@kernel.org>

Updated according to various comments from discussions[8] on LKML and linux-ide.


References

 [1] http://www.anandtech.com/storage/showdoc.aspx?i=3691
 [2] http://www.osnews.com/story/22872/Linux_Not_Fully_Prepared_for_4096-Byte_Sector_Hard_Drives
 [3] http://en.wikipedia.org/wiki/Master_boot_record
 [4] http://thread.gmane.org/gmane.linux.kernel/953981
 [5] http://en.wikipedia.org/wiki/GUID_Partition_Table
 [6] http://support.microsoft.com/kb/931760
 [7] http://userweb.kernel.org/~tj/partalign/
 [8] http://thread.gmane.org/gmane.linux.ide/45211
 [9] http://people.redhat.com/msnitzer/docs/io-limits.txt
[10] http://oss.oracle.com/~mkp/docs/linux-advanced-storage.pdf
[11] git://git.kernel.org/pub/scm/linux/kernel/git/jgarzik/libata-dev.git sectsize
[12] http://article.gmane.org/gmane.linux.ide/45228
[13] http://git.debian.org/?p=parted/parted.git;a=blob;f=NEWS
[14] http://article.gmane.org/gmane.linux.ide/45293
[15] http://article.gmane.org/gmane.linux.ide/45214