If all devices have the same event count and the first one is a spare,
then that spare will be the 'most_recent'.
However then other devices will think the 'most_recent' has failed
(for v0.90 metadata) and will be rejected from the array.
So never consider a 'spare' to be 'most recent'.
Reported-by: Andreas Baer <synthetic.gods@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
We only need 'hold' if we want to mdstat_wait for a change.
These two callers don't care about a change, so they shouldn't
use the 'hold' flag.
Signed-off-by: NeilBrown <neilb@suse.de>
mdadm will normally not include a device into an array if that device
reports that the "best" device has failed, as this normally implies
some sort of inconsistency.
However when --force is given it means that the given drives really
should be assembled if at all possible so in that case the test should
be avoided.
The particular case where this was a problem was a RAID5 were all
devices had the same event count but three of them reported that the
first two had failed.
As they all had the same event count the first was taken as the 'best'
and that caused the later ones to be excluded. Listing one of the
later ones first allowed the array to be assembled. So in this case
the test clearly just got in the way and did nothing useful.
Reported-by: "Marek Jaros" <mjaros1@nbox.cz>
Signed-off-by: NeilBrown <neilb@suse.de>
If the restarted reshape needs a backup file and we don't have one,
that should be reported before we try to start the array.
Also we shouldn't say the "Cannot grow" but "cannot complete".
Signed-off-by: NeilBrown <neilb@suse.de>
If "container=" is present, then we are going to assemble from the
given container where that container is made of those devices or not.
So in this case the "devices=" is purely documentation and is best
ignored.
As part of this, move the test on the "container=" value when that
start with "/" up before the device is opened. There sooner we test
things, the better.
Reported-by: Martin Wilck <mwilck@arcor.de>
Signed-off-by: NeilBrown <neilb@suse.de>
If the container metadata doesn't know how many device to expect (as
is the case with IMSM), don't fail an --assemble which over-specifies
the number of devices.
Signed-off-by: NeilBrown <neilb@suse.de>
When an active/degraded RAID6 array is force-started we clear the
'active' flag, but it is still possible that some parity is
no in sync. This is because there are two parity block.
It would be nice to be able to tell the kernel "P is OK, Q maybe not".
But that is not possible.
So when we force-assemble such an array, trigger a 'repair' to fix up
any errant Q blocks.
This is not ideal as a restart during the repair will not be continued
after the restart, but it is the best we can do without kernel help.
Signed-off-by: NeilBrown <neilb@suse.de>
Some failure scenarios can leave a spare with a higher event count
than an in-sync device. Assembling an array like this will confuse
the kernel.
So detect spares with event counts higher than the best non-spare
event count and exclude them from the array.
Reported-by: Alexander Lyakas <alex.bolshoy@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Some people want to create truely enormous arrays.
As we sometimes need to hold one file descriptor for each
device, this can hit the NOFILE limit.
So raise the limit if it ever looks like it might be a problem.
Signed-off-by: NeilBrown <neilb@suse.de>
This allows the smooth conversion of legacy 0.90 arrays
to 1.0 metadata.
Old metadata is likely to remain but will be ignored.
It can be removed with
mdadm --zero-superblock --metadata=0.90 /dev/whatever
Signed-off-by: NeilBrown <neilb@suse.de>
We widely use a "devnum" which is 0 or +ve for md%d devices
and -ve for md_d%d devices.
But I want to be able to use md_%s device names.
So get rid of devnum (a number) and use devnm (a 32char string).
eg.
md0
md_d2
md_home
Signed-off-by: NeilBrown <neilb@suse.de>
When auto-assembling we might find an array which appear in
mdadm.conf.
This can happen if the array (based on UUID) doesn't match what is
in mdadm.conf.
For consistency we should avoid auto-assembling such an array just as
we avoid regular-assembling of the array.
Reported-by: Ross Boylan <ross@biostat.ucsf.edu>
Signed-off-by: NeilBrown <neilb@suse.de>
It isn't enough to simply not assemble arrays found to be called
<ignore>, as the final stage of auto-assemble doesn't check for names
in mdadm.conf.
So add a check to Assemble, similar to the check in Incremental()
Reported-by: Mike Frysinger <vapier@gentoo.org>
Signed-off-by: NeilBrown <neilb@suse.de>
Recent patch closed 'mdfd' before calling wait_for, which means
it doesn't work.
Put the close back in the right place.
Signed-off-by: NeilBrown <neilb@suse.de>
commit aacb2f816a
Assemble: add support for replacement devices.
broke the restoring of the 'critical section' because it messed up the
list of file descriptors passed to Grow_restart. Put it back the way
it should be.
Signed-off-by: NeilBrown <neilb@suse.de>
It is important to check for compatibility with 'platform' or
Option ROM when creating or changing and array. However there is no
real need when simply assembling the array.
On some systems there are situations where the platform information is
not available. e.g. on some UEFI systems, UEFI is not available
during 'kdump' handling. This makes it impossible to assemble
an IMSM array to receive the dump.
So remove the requirements that the platform be visible to assemble
an IMSM array.
Signed-off-by: NeilBrown <neilb@suse.de>
Using "START_ARRAY" ioctl never really worked reliably,
was removed a decade ago, and just clutters the code.
So remove it.
Signed-off-by: NeilBrown <neilb@suse.de>
Apart from code movement, there is a small functional change here.
If the array is not successfully started, it is stopped.
Previously we would sometimes leave the array in a partially-assembled
but inactive state.
This just causes confusion.
"--incremental" can be used to partially assemble arrays.
Signed-off-by: NeilBrown <neilb@suse.de>
Once we have found the devices we want, we need to load the
metadata from them and store it. This new function extracts that
functionality out of Assemble()
Signed-off-by: NeilBrown <neilb@suse.de>
Assemble() is way too big.
This patch starts cleaning it up by pulling the 'select_devices()'
function. This examines the device to make sure they all belong to
one array, or select those that do (depending on exact use case).
Signed-off-by: NeilBrown <neilb@suse.de>
If --incremental has partly assembled an array and
--assemble is asked to assemble it, the just finds remaining
devices and makes a new array. Not good.
So:
1/ modify locking policy so that assemble can be sure that
no --incremental is running once it locks the map file
2/ Assemble() checks the map file for a duplicate and adds to
that array instead of creating a new one.
Signed-off-by: NeilBrown <neilb@suse.de>
Some arrays (raid10) never need a backup file, so during assembly
we can avoid the whole Grow_continue check in that case.
Achieve this using a flag set by the metadata handler.
Also get "mdadm -I" to fail if a backup process would be
needed. It currently does fail as the kernel rejects things,
but it is nicer to have this explicit.
Signed-off-by: NeilBrown <neilb@suse.de>
malloc should never fail, and if it does it is unlikely
that anything else useful can be done. Best approach is to
abort and let some super-daemon restart.
So define xmalloc, xcalloc, xrealloc, xstrdup which don't
fail but just print a message and exit. Then use those
removing all the tests for failure.
Also replace all "malloc;memset" sequences with 'xcalloc'.
Signed-off-by: NeilBrown <neilb@suse.de>
When we are looking for a candidate disk to bump up the event count,
we consider only disks that have recovery_start==MaxSector.
However, after we find one such disk, we agree to accept more disks
having same event count, regardless of their recovery_start.
Be consistent and don't accept disks with a valid recovery_start at all.
Signed-off-by: NeilBrown <neilb@suse.de>
When arrays using external metadata are assembled, and one of array
in container is under reshape, second array will remain in read only
state (not auto read only). It is caused by array fact that array
is frozen and mdmon doesn't has opportunity to switch array in r/w mode.
Freezing not reshaped array just after it is being assembled allows mdmon
to enable it for writing.
Signed-off-by: Adam Kwolek <adam.kwolek@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Reporting:
mdadm: added /dev/loop1 to /dev/md0 as 1
mdadm: added /dev/loop2 to /dev/md0 as 2
mdadm: added /dev/loop0 to /dev/md0 as 0
mdadm: /dev/md0 has been started with 2 drives (out of 3).
is confusing - why only 2? Code now reports:
mdadm: added /dev/loop1 to /dev/md0 as 1
mdadm: added /dev/loop2 to /dev/md0 as 2 (possibly out of date)
mdadm: added /dev/loop0 to /dev/md0 as 0
mdadm: /dev/md0 has been started with 2 drives (out of 3).
which is somewhat clearer.
Signed-off-by: NeilBrown <neilb@suse.de>
This is a bit of a hack and the code need to be made more
general. But this adds the special case of a RAID0 being
reshaped which looks like a RAID4 but doesn't need as many
devices.
Signed-off-by: NeilBrown <neilb@suse.de>
If we open with O_EXCL before checking that the device is one that
we really want, then that could cause some other process to think
the device is busy when it isn't really.
This particularly affects running "mdadm -A devname" in parallel for
different arrays. One might be looking at a device that it won't
end up using while another trys and fails to look at a device that
it needs.
So delay the O_EXCL until after all identity checks.
Multiple "mdadm -As" will still have races, but that is fundamentally
racy anyway.
Signed-off-by: NeilBrown <neilb@suse.de>
When added disk is disk added by expansion and this is last disk added
to array, assemble_container_content() will not even try to run such array.
Signed-off-by: Adam Kwolek <adam.kwolek@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
If we have to --force assembly during reshape, we need to
check by the 'before' and 'after' cases to make sure there
are enough devices.
Reported-by: Richard Herd <2001oddity@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
It can easily be calculated from 'avail' and 'raid_disks', and we
will soon have a case where we don't have it easily available to pass
in.
Signed-off-by: NeilBrown <neilb@suse.de>
This fields doesn't work any more as ->getinfo_super clears the info
structure at an awkward time. So get rid of it and do it differently.
The issue is that the metadata handler cannot tell if the uuid it has
was randomly generated or explicitly requested, except on the first
call.
And we don't want to accept explicit requests for IMSM.
So when it was auto-generated, make it look distinctive by having the
same int copied in all 4 positions. If someone requests a uuid like
that, I guess they get away with it.
Reported-by: Adam Kwolek <adam.kwolek@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
The problem occurs when array under migration is assembled incrementally.
st->update_tail is not initialized in function
assemble_container_content() and during reshape
the checkpoint information in metadata is not being updated.
The value of st->update_tail is now initialized in function
assemble_container_content() and during reshape the checkpoint
information in metadata is being updated correctly on all disks.
Signed-off-by: Lukasz Dorau <lukasz.dorau@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>