Add RAID10 and other stuff to md.4

Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>

commit 599e5a360b
parent 34163fc7cf

 md.4 | 145
@@ -14,13 +14,16 @@ redundancy, and hence the acronym RAID which stands for a Redundant
 Array of Independent Devices.
 .PP
 .B md
-supports RAID levels 1 (mirroring) 4 (striped array with parity
-device), 5 (striped array with distributed parity information) and 6
-(striped array with distributed dual redundancy information.) If
-some number of underlying devices fails while using one of these
+supports RAID levels
+1 (mirroring),
+4 (striped array with parity device),
+5 (striped array with distributed parity information),
+6 (striped array with distributed dual redundancy information), and
+10 (striped and mirrored).
+If some number of underlying devices fails while using one of these
 levels, the array will continue to function; this number is one for
 RAID levels 4 and 5, two for RAID level 6, and all but one (N-1) for
-RAID level 1.
+RAID level 1, and dependent on configuration for level 10.
 .PP
 .B md
 also supports a number of pseudo RAID (non-redundant) configurations
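
For context, a minimal sketch of selecting these levels with mdadm
(device names and counts here are illustrative, not from the patch):

    # --level accepts linear, 0, 1, 4, 5, 6 and, with this change, 10.
    mdadm --create /dev/md0 --level=10 --raid-devices=4 \
          /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1
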
@@ -65,7 +68,7 @@ The superblock contains, among other things:
 .TP
 LEVEL
 The manner in which the devices are arranged into the array
-(linear, raid0, raid1, raid4, raid5, multipath).
+(linear, raid0, raid1, raid4, raid5, raid10, multipath).
 .TP
 UUID
 a 128 bit Universally Unique Identifier that identifies the array that
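
These superblock fields can be read back from any member device; a
sketch, assuming /dev/sda1 is a member of an array:

    # Prints the md superblock, including the raid level and array UUID.
    mdadm --examine /dev/sda1
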
@@ -111,6 +114,8 @@ an extra drive and so the array is made bigger without disturbing the
 data that is on the array. However this cannot be done on a live
 array.
 
+If a chunksize is given with a LINEAR array, the usable space on each
+device is rounded down to a multiple of this chunksize.
 
 .SS RAID0
 
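
To make the rounding rule concrete (illustrative numbers, not from the
patch): with a 64 KiB chunksize, a 1000 KiB member contributes

    # usable = size - (size % chunk), in KiB
    echo $(( 1000 - 1000 % 64 ))    # -> 960

so the trailing 40 KiB of that device go unused.
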
@@ -188,6 +193,46 @@ The performance for RAID6 is slightly lower but comparable to RAID5 in
 normal mode and single disk failure mode. It is very slow in dual
 disk failure mode, however.
 
+.SS RAID10
+
+RAID10 provides a combination of RAID1 and RAID0, and is sometimes
+known as RAID1+0. Every datablock is duplicated some number of times,
+and the resulting collection of datablocks is distributed over
+multiple drives.
+
+When configuring a RAID10 array it is necessary to specify the number
+of replicas of each data block that are required (this will normally
+be 2) and whether the replicas should be 'near' or 'far'.
+
+When 'near' replicas are chosen, the multiple copies of a given chunk
+are laid out consecutively across the stripes of the array, so the two
+copies of a datablock will likely be at the same offset on two
+adjacent devices.
+
+When 'far' replicas are chosen, the multiple copies of a given chunk
+are laid out quite distant from each other. The first copy of all
+data blocks will be striped across the early part of all drives in
+RAID0 fashion, and then the next copy of all blocks will be striped
+across a later section of all drives, always ensuring that all copies
+of any given block are on different drives.
+
+The 'far' arrangement can give sequential read performance equal to
+that of a RAID0 array, but at the cost of degraded write performance.
+
+It should be noted that the number of devices in a RAID10 array need
+not be a multiple of the number of replicas of each data block, though
+there must be at least as many devices as replicas.
+
+If, for example, an array is created with 5 devices and 2 replicas,
+then space equivalent to 2.5 of the devices will be available, and
+every block will be stored on two different devices.
+
+Finally, it is possible to have an array with both 'near' and 'far'
+copies. If an array is configured with 2 near copies and 2 far
+copies, then there will be a total of 4 copies of each block, each on
+a different drive. This is an artifact of the implementation and is
+unlikely to be of real value.
+
 .SS MULTIPATH
 
 MULTIPATH is not really a RAID at all as there is only one real device
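
A sketch of how the replica count and the near/far choice are spelled
with mdadm (its n2/f2 layout notation; devices are hypothetical):

    # 2 'near' copies over 4 drives: the classic RAID1+0 arrangement.
    mdadm --create /dev/md0 --level=10 --layout=n2 --raid-devices=4 \
          /dev/sd[abcd]1

    # 2 'far' copies over 5 drives: usable space is 2.5 drives' worth.
    mdadm --create /dev/md1 --level=10 --layout=f2 --raid-devices=5 \
          /dev/sd[efghi]1
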
@@ -231,13 +276,13 @@ failure modes can be cleared.
 
 .SS UNCLEAN SHUTDOWN
 
-When changes are made to a RAID1, RAID4, RAID5 or RAID6 array there is a
-possibility of inconsistency for short periods of time as each update
-requires are least two block to be written to different devices, and
-these writes probably wont happen at exactly the same time.
-Thus if a system with one of these arrays is shutdown in the middle of
-a write operation (e.g. due to power failure), the array may not be
-consistent.
+When changes are made to a RAID1, RAID4, RAID5, RAID6, or RAID10 array
+there is a possibility of inconsistency for short periods of time as
+each update requires at least two blocks to be written to different
+devices, and these writes probably won't happen at exactly the same
+time. Thus if a system with one of these arrays is shut down in the
+middle of a write operation (e.g. due to power failure), the array may
+not be consistent.
 
 To handle this situation, the md driver marks an array as "dirty"
 before writing any data to it, and marks it as "clean" when the array
@@ -246,9 +291,10 @@ to be dirty at startup, it proceeds to correct any possibly
 inconsistency. For RAID1, this involves copying the contents of the
 first drive onto all other drives. For RAID4, RAID5 and RAID6 this
 involves recalculating the parity for each stripe and making sure that
-the parity block has the correct data. This process, known as
-"resynchronising" or "resync" is performed in the background. The
-array can still be used, though possibly with reduced performance.
+the parity block has the correct data. For RAID10 it involves copying
+one of the replicas of each block onto all the others. This process,
+known as "resynchronising" or "resync", is performed in the background.
+The array can still be used, though possibly with reduced performance.
 
 If a RAID4, RAID5 or RAID6 array is degraded (missing at least one
 drive) when it is restarted after an unclean shutdown, it cannot
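
While such a resync runs, its progress is reported in /proc/mdstat; a
minimal way to watch it:

    # Shows a progress bar, speed, and ETA for the background resync.
    watch -n 5 cat /proc/mdstat
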
@@ -261,12 +307,13 @@ start an array in this condition without manual intervention.
 .SS RECOVERY
 
 If the md driver detects any error on a device in a RAID1, RAID4,
-RAID5 or RAID6 array, it immediately disables that device (marking it
-as faulty) and continues operation on the remaining devices. If there
-is a spare drive, the driver will start recreating on one of the spare
-drives the data what was on that failed drive, either by copying a
-working drive in a RAID1 configuration, or by doing calculations with
-the parity block on RAID4, RAID5 or RAID6.
+RAID5, RAID6, or RAID10 array, it immediately disables that device
+(marking it as faulty) and continues operation on the remaining
+devices. If there is a spare drive, the driver will start recreating
+on one of the spare drives the data that was on that failed drive,
+either by copying a working drive in a RAID1 configuration, or by
+doing calculations with the parity block on RAID4, RAID5 or RAID6, or
+by finding and copying originals for RAID10.
 
 While this recovery process is happening, the md driver will monitor
 accesses to the array and will slow down the rate of recovery if other
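
The recovery throttle referred to here is tuned through two sysctl
files (values in KiB/sec per device; the numbers are only examples):

    # Floor and ceiling for background recovery/resync speed.
    echo 1000   > /proc/sys/dev/raid/speed_limit_min
    echo 200000 > /proc/sys/dev/raid/speed_limit_max
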
@@ -279,6 +326,60 @@ and
 .B speed_limit_max
 control files mentioned below.
 
+.SS BITMAP WRITE-INTENT LOGGING
+
+From Linux 2.6.13,
+.I md
+supports a bitmap based write-intent log. If configured, the bitmap
+is used to record which blocks of the array may be out of sync.
+Before any write request is honoured, md will make sure that the
+corresponding bit in the log is set. After a period of time with no
+writes to an area of the array, the corresponding bit will be cleared.
+
+This bitmap is used for two optimisations.
+
+Firstly, after an unclean shutdown, the resync process will consult
+the bitmap and only resync those blocks that correspond to bits in the
+bitmap that are set. This can dramatically reduce resync time.
+
+Secondly, when a drive fails and is removed from the array, md stops
+clearing bits in the intent log. If that same drive is re-added to
+the array, md will notice and will only recover the sections of the
+drive that are covered by bits in the intent log that are set. This
+can allow a device to be temporarily removed and reinserted without
+causing an enormous recovery cost.
+
+The intent log can be stored in a file on a separate device, or it can
+be stored near the superblocks of an array which has superblocks.
+
+Subsequent versions of Linux will support hot-adding of bitmaps to
+existing arrays.
+
+In 2.6.13, intent bitmaps are only supported with RAID1. Other levels
+will follow.
+
+.SS WRITE-BEHIND
+
+From Linux 2.6.14,
+.I md
+will support WRITE-BEHIND on RAID1 arrays.
+
+This allows certain devices in the array to be flagged as
+.IR write-mostly .
+MD will only read from such devices if there is no
+other option.
+
+If a write-intent bitmap is also provided, write requests to
+write-mostly devices will be treated as write-behind requests and md
+will not wait for those writes to complete before
+reporting the write as complete to the filesystem.
+
+This allows for a RAID1 with WRITE-BEHIND to be used to mirror data
+over a slow link to a remote computer (providing the link isn't too
+slow). The extra latency of the remote link will not slow down normal
+operations, but the remote system will still have a reasonably
+up-to-date copy of all data.
+
 .SS KERNEL PARAMETERS
 
 The md driver recognises three different kernel parameters.
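
A sketch of enabling the write-intent bitmap described above at array
creation time (an internal bitmap stored near the superblocks; mdadm
also accepts a file name here):

    # Internal write-intent bitmap next to each member's superblock.
    mdadm --create /dev/md0 --level=1 --raid-devices=2 \
          --bitmap=internal /dev/sda1 /dev/sdb1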
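
And a sketch of the write-mostly/write-behind combination (the limit
of 256 outstanding write-behind requests is an arbitrary example):

    # /dev/sdb1, the slow or remote leg, is read only as a last resort;
    # up to 256 writes to it may be outstanding ("behind") at a time.
    mdadm --create /dev/md0 --level=1 --raid-devices=2 \
          --bitmap=internal --write-behind=256 \
          /dev/sda1 --write-mostly /dev/sdb1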