Example output:
./mdadm --assemble /dev/md0 /dev/sd[c-f] /dev/sdb1
mdadm: /dev/md0 has been started with 4 drives and 1 journal.
mdadm checks superblock for journal devices. If the journal device
is missing or faulty, mdadm will show warning
./mdadm --assemble /dev/md0 /dev/sd[c-q] /dev/sdb1
mdadm: Not safe to assemble with missing or stale journal device, consider --force.
User can insist to start the array (read only) with --force
./mdadm --assemble /dev/md0 /dev/sd[c-q] /dev/sdb1 --force
mdadm: Journal is missing or stale, starting array read only.
mdadm: /dev/md0 has been started with 15 drives.
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Specify the write journal device with --write-journal DEVICE
./mdadm --create -f /dev/md0 --assume-clean -c 32 --raid-devices=4 --level=5 /dev/sd[c-f] --write-journal /dev/sdb1
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md0 started.
Only one journal device is allowed. If multiple --write-journal
are given, mdadm will use the first and ignore others
./mdadm --create -f /dev/md0 --assume-clean -c 32 --raid-devices=4 --level=5 /dev/sd[c-f] --write-journal /dev/sdb1 --write-journal /dev/sdx
mdadm: Please specify only one journal device for the array.
mdadm: Ignoring --write-journal /dev/sdx...
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md0 started.
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Replace special disk roles (0xffff, 0xfffe) with macros:
define MD_DISK_ROLE_SPARE 0xffff
define MD_DISK_ROLE_FAULTY 0xfffe
Will add macro for journal device in next patch:
define MD_DISK_ROLE_JOURNAL 0xfffd
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
We currently have no synchronization techniques for the bad
block log, so disable it for the cluster.
Reported-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Add BITMAP_MAJOR_CLUSTERED as 5, in order to prevent older kernels
to assemble a clustered device.
In order to maximize compatibility, the major version is set to
BITMAP_MAJOR_CLUSTERED *only* if the bitmap is clustered.
Also, added MD_FEATURE_CLUSTERED in order to return error
for older kernels which would assemble MD in case bitmap is
corrupted.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Left align is better for cluster with name less than 64. Also
make the output of cluster name is aligned with others.
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
We can use the new added calc_bitmap_size func to remove some
redundant lines.
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: NeilBrown <neilb@suse.de>
This extends nodes option for assemble mode, make the num of
cluster node could be change by user.
Before that, it is necessary to ensure there are enough space
for those nodes, calc_bitmap_size is introduced to calculate
the bitmap size of each node.
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: NeilBrown <neilb@suse.de>
To support change the cluster name, the commit do the followings:
1. extend original write_bitmap function for new scenario.
2. add the scenarion to handle the modification of cluster's name
in write_bitmap1.
3. let the cluster name also show in examine_super1 and detail_super1
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: NeilBrown <neilb@suse.de>
We want the clustered devices to be started exclusively by a cluster
resource-agent. So, avoid starting using the incremental option.
This also skips a clustered md from starting during boot in inactive mode.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: NeilBrown <neilb@suse.de>
The home-cluster is stored in the bitmap super block of the
array. The device can be assembled on a cluster with the
cluster name same as the one recorded in the bitmap.
If home-cluster is not specified, this is auto-detected using
dlopen corosync cmap library.
neilb: allow code to compile when corosync-devel is not installed.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Specifies the maximum number of nodes in the cluster that may use
this device simultaneously. This is equivalent to the number of
bitmaps created in the internal superblock (patches to follow).
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: NeilBrown <neilb@suse.de>
For a clustered MD, create bitmaps equal to number of nodes so
each node has an independent bitmap.
Only the first bitmap is has the bits set so that the first node
that assembles the device also performs the sync.
The bitmaps are aligned to 4k boundaries.
On-disk format:
0 4k 8k 12k
-------------------------------------------------------------------
| idle | md super | bm super [0] + bits |
| bm bits[0, contd] | bm super[1] + bits | bm bits[1, contd] |
| bm super[2] + bits | bm bits [2, contd] | bm super[3] + bits |
| bm bits [3, contd] | | |
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: NeilBrown <neilb@suse.de>
It is best to keep strings all together so that they
are easier to search for in the source code.
If a string is so long that it looks ugly one line,
them maybe it should be broken into multiple lines
for display too.
Only strings which contain a newline can be broken
into multiple lines:
"It is OK to\n"
"break this string\n"
Signed-off-by: NeilBrown <neilb@suse.de>
make dprintf() print program name and __func__, so that
this messaging is consistent.
Also remove all __func__ messages from pr_err(). We shouldn't
leak that internal data in error message.
If we really want function name there, we new pr_XXX might
be wanted.
Signed-off-by: NeilBrown <neilb@suse.de>
If the data is too close to the superblock there may be
no space for a bitmap.
If that happens, fail the adding of the bitmap rather than
corrupt data.
Reported-by: Lars Wijtemans <rhelbugzilla@lars.wijtemans.nl>
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=922944
CREATE bbl=no
in mdadm.conf will cause any devices added to an array
to not have a bad block list. By default they do for 1.x
metadata.
This is useful if you are suspicious of the bad-block-list
implementation.
Reported-by: Ethan Wilson <ethan.wilson@shiftmail.org>
Signed-off-by: NeilBrown <neilb@suse.de>
commit 23bf42cc79
super1: simplify setting of array size.
removed the setting for sb->data_offset for 1.0 metadata for some reason,
and messed up the size calculation for 1.0 metadata too.
Signed-off-by: NeilBrown <neilb@suse.de>
Currently the extra space to leave before the data in the array
is calculated in two separate places, and they can be inconsistent.
Instead, do it all in validate_geometry. This records the
'data_offset' chosen which all other devices then use.
'write_init_super' now just uses the value rather than doing all the
calculations again.
This results in more consistent numbers.
Also, load_super sets st->data_offset so that it is used by "--add",
so the new device has a data offset matching a pre-existing device.
Signed-off-by: NeilBrown <neilb@suse.de>
_avail_space1() is calls from both avail_space1() and validate_geometry1()
and does slightly different things.
The partial code sharing doesn't really help. In particularly the
responsibility for setting the size of the array is currently
confused.
So duplicate the code into the two locations - one where 'super' is
always NULL (validate_geometry1) and one where it is never NULL
(avail_space1), and simplify.
No behaviour change - just code re-organisation.
Signed-off-by: NeilBrown <neilb@suse.de>
This call to validate_geometry is really rather gratuitous.
It is purely about the fact that super0 cannot use more than 4TB.
So just make it an explicit test - less confusing that way.
With this, validate_geometry is only called from Create, which
makes it easier to reason about.
Also validate_geometry is now never passed NULL for the 'chunk'
parameter, so we can remove those annoying tests for NULL.
Signed-off-by: NeilBrown <neilb@suse.de>
The "data_size" is with respect to "data_offset". When the kernel
changes "data_offset" it modifies "data_size" to match - see
md_finish_reshape() in the kernel.
So when mdadm switches the data_offset for the new data_offset, it
must update data_size correspondingly.
Signed-off-by: NeilBrown <neilb@suse.de>
We can only revert a reshape if the reshape_position aligns
properly for the old geometry.
If it doesn't we just fail for now.
Also fix a +/- error with updating raid_disks for super1.c
Signed-off-by: NeilBrown <neilb@suse.de>
We need to check for a backup iff the data_offset has changed.
Testing against level==10 was an effective but short-sighted approach.
Signed-off-by: NeilBrown <neilb@suse.de>
This allows the smooth conversion of legacy 0.90 arrays
to 1.0 metadata.
Old metadata is likely to remain but will be ignored.
It can be removed with
mdadm --zero-superblock --metadata=0.90 /dev/whatever
Signed-off-by: NeilBrown <neilb@suse.de>
These need to be cast to uint32_t before being cast to 'long', else
sign extension doesn't happen on 64bit hosts.
And bitmap_offset is le32, not le64 !!
Signed-off-by: NeilBrown <neilb@suse.de>
It seems like a nice location, but it means that we cannot
decrease the data_offset during a reshape.
So put it just after the bitmap, leaving 32K.
Signed-off-by: NeilBrown <neilb@suse.de>
If space_after and space_before are zero (the default) then assume that
metadata doesn't support changing data_offset.
Signed-off-by: NeilBrown <neilb@suse.de>
1/ these must allow for bad-block-list
2/ they must match the kernel, which has a 32k buffer after the
superblock.
Signed-off-by: NeilBrown <neilb@suse.de>
This allows the metadata on a device to be saved and later restored.
This can be useful before experimenting on an array that is misbehaving.
Suggested-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
We widely use a "devnum" which is 0 or +ve for md%d devices
and -ve for md_d%d devices.
But I want to be able to use md_%s device names.
So get rid of devnum (a number) and use devnm (a 32char string).
eg.
md0
md_d2
md_home
Signed-off-by: NeilBrown <neilb@suse.de>
Commit 1e2b276535 (Report error in --update
string is not recognised) broke homehost updating functionality because it
depended on each string comparison being done even after we already found
a match. Make it work again by restructuring code.
Reported-by: (and original version by) Justin Maggard <jmaggard10@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Now that we use O_DIRECT for all device IO, BLKFLSBUF is not needed to
ensure we get current data, and it can impose a cost if any flush-out
is needed. So remove it.
To be safe, add O_DIRECT to one place where it isn't currently used:
when reading a bitmap.
Signed-off-by: NeilBrown <neilb@suse.de>
--detail needs to be read to report 2 devices in each slot,
and --examine need to report if the device is the original or
the replacement.
Signed-off-by: NeilBrown <neilb@suse.de>
mdadm --create /dev/md0 .... /dev/sda1:1024 /dev/sdb1:2048 ...
The size is in K unless a suffix: K M G is given.
The suffix 's' means sectors.
Signed-off-by: NeilBrown <neilb@suse.de>
sometimes 0.1% isn't enough, though mostly only in testing.
We need one chunk for a successful reshape, so reserve 2.
Signed-off-by: NeilBrown <neilb@suse.de>
Some arrays (raid10) never need a backup file, so during assembly
we can avoid the whole Grow_continue check in that case.
Achieve this using a flag set by the metadata handler.
Also get "mdadm -I" to fail if a backup process would be
needed. It currently does fail as the kernel rejects things,
but it is nicer to have this explicit.
Signed-off-by: NeilBrown <neilb@suse.de>
The 'new_offset' is used for reshaping to avoid the need
for a backup file.
For now we only report the value when it is set.
Signed-off-by: NeilBrown <neilb@suse.de>
This is currently only useful for 1.x metadata and will allow an
explicit --data-offset request on command line.
Signed-off-by: NeilBrown <neilb@suse.de>
1/ When printing the "name=" entry for --brief output,
enclose name in quotes if it contains spaces etc.
Quotes are already supported for reading mdadm.conf
2/ When a name is used as a device name, translate spaces
and tabs to '_', as well as the current translation of
'/' to '-'.
Signed-off-by: NeilBrown <neilb@suse.de>
--update=bbl will add a bad block list to each device.
--update=no-bblk will remove the bad block list providing that it
is empty.
Signed-off-by: NeilBrown <neilb@suse.de>
An additional pair of key=value for --examine --export.
Signed-off-by: Maciej Naruszewicz <maciej.naruszewicz@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
If we change some functions to accept 'verbose', where <0 means to be
quiet, in place of 'quiet', then we will be able to merge
'quiet' and 'verbose' together for simplicity.
Signed-off-by: NeilBrown <neilb@suse.de>
malloc should never fail, and if it does it is unlikely
that anything else useful can be done. Best approach is to
abort and let some super-daemon restart.
So define xmalloc, xcalloc, xrealloc, xstrdup which don't
fail but just print a message and exit. Then use those
removing all the tests for failure.
Also replace all "malloc;memset" sequences with 'xcalloc'.
Signed-off-by: NeilBrown <neilb@suse.de>
In function write_init_super1():
If "rv = store_super1(st, di->fd)" return error and the di is the last.
Then the di = NULL && rv > 0, so exec:
if (rv)
fprintf(stderr, Name ": Failed to write metadata to%s\n",
di->devname);
will be segmentation fault.
Signed-off-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
While it is nice to set a high data_offset to leave plenty of head
room it is much more important to leave enough space to allow
of the data of the array.
So after we check that sb->size is still available, only reduce the
'reserved', don't increase it.
This fixes a bug where --adding a spare fails because it does not have
enough space in it.
Reported-by: nowhere <nowhere@hakkenden.ath.cx>
Signed-off-by: NeilBrown <neilb@suse.de>
fbdef49811 incorrectly tried to fix sign
extension of the bitmap offset. However mdinfo->bitmap_offset is a u32
and needs to be converted to a 32 bit signed integer before the sign
extension.
Signed-off-by: Jes Sorensen <Jes.Sorensen@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
The kernel is growing the ability to avoid the need for a
backup file during reshape by being able to change the data offset.
For this to be useful we need plenty of free space before the
data so the data offset can be reduced.
So for v1.1 and v1.2 metadata make the default data_offset much
larger. Aim for 128Meg, but keep a power of 2 and don't use more
than 0.1% of each device.
Don't change v1.0 as that is used when the data_offset is required to
be zero.
Signed-off-by: NeilBrown <neilb@suse.de>
As the bitmap can be before the superblock, bitmap_offset is signed.
But some of the code didn't honour that :-(
Signed-off-by: NeilBrown <neilb@suse.de>
RAID10 arrays with an odd number of devices had the arraysize
reported wrongly by --examine due to a rounding error.
Reported-by: Chris Francy <zoredache@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
This uses a struct to cache the block size for aligned reads/writes,
to avoid repeated ioctl(BLKSSZGET) calls.
Signed-off-by: Jes Sorensen <Jes.Sorensen@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Avoid possibly using stale data in bitmap and misc area of superblock.
In addition, remove superfluous memsets already covered by memset of
full superblock.
Signed-off-by: Jes Sorensen <Jes.Sorensen@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Use a #define rather than calculate the size of the superblock buffer
on every allocation.
Signed-off-by: Jes Sorensen <Jes.Sorensen@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
We just calculated the pointer to the bitmap, so use it instead of
recalculating.
Signed-off-by: Jes Sorensen <Jes.Sorensen@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
The current 1024 byte limit on 1.x superblocks limits us to
384 devices. Sometimes people want more.
The kernel is already prepared for superblocks up to 4K,
so enable that in mdadm allowing up to
(4096-256)/2 == 1920
devices (active plus spare).
Signed-off-by: NeilBrown <neilb@suse.de>
In addition remove attempt to print an error message if
write_init_super() fails, as this is handled in the various
write_init_super() functions. This avoids a segfault on error.
Reported by Jim Meyering in
https://bugzilla.redhat.com/show_bug.cgi?id=795461
Signed-off-by: Jes Sorensen <Jes.Sorensen@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
This makes super[01].c properly align buffers used for the bitmap
using posix_memalign() to make sure the writes don't fail in case the
bitmap is opened using O_DIRECT.
This is based on https://bugzilla.redhat.com/show_bug.cgi?id=789898
and an initial patch by Alexander Murashkin.
Signed-off-by: Jes Sorensen <Jes.Sorensen@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
A recently change to write_bitmap1 meant awrite would sometimes
write from a non-aligned buffer which of course break.
So change awrite (and aread) to always use their own aligned
buffer to ensure safety.
Reported-by: Alexander Lyakas <alex.bolshoy@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
when deciding whether the array is clean or dirty, compare
sb->resync_offset against MaxSector and not against sb->size
With RAID6 resyncing and subsequent drive failures, it is possible to
reach the case, in which sb->resync_offset==sb->size. This happens
when resync is aborted due to drive failures, and immediately a
rebuild of a spare starts. In this case, mdadm was considered the
array as clean, while kernel was considering the array as dirty. It is
better for mdadm also to consider the array as dirty in this case.
Signed-off-by: NeilBrown <neilb@suse.de>
Adding a bitmap via ioctl can only add it at a fixed location.
That location is not suitable for 4K-block devices.
So allow setting the bitmap location via sysfs if kernel supports it
and aim to always use 4K alignments.
Signed-off-by: NeilBrown <neilb@suse.de>
When container is assembled while reshape is active on one of its member
whole container can be required to be blocked from monitoring.
For such purpose field recovery blocked is added to mdinfo structure.
When metadata handler finds active reshape in container it should set
recovery_blocked field to disable whole container monitoring during
reshape.
For arrays that doesn't use containers, recovery_blocked field
has the same value as reshape_active field e.g. super0/1.
In fact,recovery is blocked during reshape for such arrays.
For ddf, metadata handler doesn't set reshape_active field,
so recovery_blocked is not set also.
Signed-off-by: Adam Kwolek <adam.kwolek@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
If you create a two drive raid1 array with one device writemostly, then
fail the readwrite drive, when you add a new device, it will get the
writemostly bit copied out of the remaining device's superblock into
it's own. You can then remove the new drive and readd it as readwrite,
which will work for the readd, but it leaves the stale WriteMostly1 bit
in devflags resulting in the device going back to writemostly on the
next assembly.
The fix is to make sure that A) when we readd a device and we might have
filled the st->sb info from a running device instead of the device being
readded, then clear/set the WriteMostly1 bit in the super1 struct in
addition to setting the disk state (ditto for super0, but slightly
different mechanism) and B) when adding a clean device to an array (when
we most certainly did copy the superblock info from an existing device),
then clear any writemostly bits.
Signed-off-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Some code currently clears 'info' before calling getinfo_super,
some code doesn't.
To be consistent, change it so no caller ever clears 'info',
but ever getinfo_super function must clear it.
Note that ->raid_disk may be meaningful if that 'map' is passed
non-NULL. In that case it is copied out before the structure
is zeroed.
Signed-off-by: NeilBrown <neilb@suse.de>
As homehost defaults to the system name it is not possible to specify
a NULL homehost.
This patch restored this ability with either --homehost="" or
--homehost="<none>".
This allows the creation of v1.x arrays without a "hostname:"
prefix in the name.
Signed-off-by: NeilBrown <neilb@suse.de>