1. 31 Mar, 2009 40 commits
    • NeilBrown's avatar
      md: allow number of drives in raid5 to be reduced · ec32a2bd
      NeilBrown authored
      When reshaping a raid5 to have fewer devices, we work from the end of
      the array to the beginning.
      md_do_sync gives addresses to sync_request that go from the beginning
      to the end.  So largely ignore them and use the internal state variable
      "reshape_progress" to keep track of what to do next.
      
      Never allow the size to be reduced below the minimum (4 for raid6,
      3 otherwise).
      
      We require that the size of the array has already been reduced before
      the array is reshaped to a smaller size.  This is because simply
      reducing the size is an easily reversible operation, while the reshape
      is immediately destructive and so is not reversible for the blocks at
      the ends of the devices.
      Thus to reshape an array to have fewer devices, you must first write
      an appropriately small size to md/array_size.
      
      When the reshape finishes, we remove any drives that are no longer
      needed and fix up ->degraded.
      Signed-off-by: NeilBrown <neilb@suse.de>
      ec32a2bd
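The preconditions described in the commit above can be sketched as a small user-space model in Python. This is not the kernel code: `min_devices`, `can_shrink`, their parameters, and the capacity formula (one parity device for raid4/5, two for raid6) are assumptions made for illustration.

```python
def min_devices(level):
    """Minimum device count named in the commit message: 4 for raid6, 3 otherwise."""
    return 4 if level == 6 else 3


def can_shrink(level, new_disks, dev_sectors, array_sectors):
    """Return True if a reshape down to new_disks devices would be acceptable.

    Hypothetical model: capacity is (devices - parity devices) * dev_sectors.
    """
    if new_disks < min_devices(level):
        return False
    parity = 2 if level == 6 else 1
    new_capacity = (new_disks - parity) * dev_sectors
    # The array size must already have been reduced (by writing to
    # md/array_size) so that it fits in the smaller geometry.
    return array_sectors <= new_capacity
```

In this model, shrinking a raid5 to 3 devices of 1000 sectors each is only accepted once the array size is at most 2000 sectors.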
    • NeilBrown's avatar
      md/raid5: change reshape-progress measurement to cope with reshaping backwards. · fef9c61f
      NeilBrown authored
      When reducing the number of devices in a raid4/5/6, the reshape
      process has to start at the end of the array and work down to the
      beginning.  So we need to handle expand_progress and expand_lo
      differently.
      
      This patch renames "expand_progress" and "expand_lo" to avoid the
      implication that anything is getting bigger (expand->reshape) and,
      in every place they are used, makes sure that they are used the
      right way depending on whether delta_disks is positive or negative.
      Signed-off-by: NeilBrown <neilb@suse.de>
      fef9c61f
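A minimal Python sketch of how the direction of a reshape changes the meaning of the progress marker. The function name `uses_previous_geometry` and the exact boundary comparisons are assumptions for illustration; only `reshape_progress` and `delta_disks` are names taken from the commit messages.

```python
def uses_previous_geometry(sector, reshape_progress, delta_disks):
    """Return True if `sector` has not been reshaped yet (old layout applies)."""
    if delta_disks < 0:
        # Shrinking: the reshape works from the end of the array down,
        # so everything below reshape_progress is still in the old layout.
        return sector < reshape_progress
    # Growing: the reshape works from the start of the array up,
    # so everything at or above reshape_progress is still in the old layout.
    return sector >= reshape_progress
```

The same comparison flips depending on the sign of `delta_disks`, which is why every user of the progress marker has to be checked.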
    • NeilBrown's avatar
      md: add explicit method to signal the end of a reshape. · cea9c228
      NeilBrown authored
      Currently raid5 (the only module that supports restriping)
      notices that the reshape has finished by sync_request being
      given a large value, and handles any cleanup then.
      
      This patch changes it so md_check_recovery calls into an
      explicit finish_reshape method as well.
      
      The clean-up from sync_request can do things that need to be
      done promptly, typically things local to the raid5_conf_t
      structure.
      
      The "finish_reshape" method is called under the mddev_lock
      so it can do things involving reconfiguring the device.
      
      This allows us to get rid of md_set_array_sectors_locked, which
      would have caused a deadlock if you tried to stop an array
      while a reshape was happening.
      Signed-off-by: NeilBrown <neilb@suse.de>
      cea9c228
    • NeilBrown's avatar
      md/raid5: enhance raid5_size to work correctly with negative delta_disks · 7ec05478
      NeilBrown authored
      This is the first of four patches which combine to allow md/raid5 to
      reduce the number of devices in the array by restriping the data over
      a subset of the devices.
      
      If the number of disks in a raid4/5/6 is being reduced, then the
      default size must be based on the new number, not the old number
      of devices.
      In general, it should be based on the smaller of new and old.
      Signed-off-by: NeilBrown <neilb@suse.de>
      7ec05478
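The rule "base the default size on the smaller of new and old" can be sketched in Python as follows. This is a model under stated assumptions, not the kernel's raid5_size: the function name, the parameter list, and the rounding of each device down to a whole number of chunks are illustrative.

```python
def raid5_default_size(dev_sectors, chunk_sectors, raid_disks,
                       previous_raid_disks, max_degraded):
    """Default array size in sectors for a raid4/5/6 that may be mid-reshape."""
    # Use the smaller device count so the size is valid both before
    # and after the reshape.
    disks = min(raid_disks, previous_raid_disks)
    # Assumption: only whole chunks on each device are usable.
    usable = dev_sectors - (dev_sectors % chunk_sectors)
    # max_degraded devices' worth of space holds parity, not data.
    return usable * (disks - max_degraded)
```

With 1000-sector devices and 128-sector chunks, going from 4 devices to 3 (or from 3 to 4) in a raid5 gives the same default of 1792 sectors in this model.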
    • NeilBrown's avatar
      md/raid5: drop qd_idx from r6_state · 34e04e87
      NeilBrown authored
      We now have this value in stripe_head so we don't need to duplicate
      it.
      Signed-off-by: NeilBrown <neilb@suse.de>
      34e04e87
    • Dan Williams's avatar
      md/raid6: move raid6 data processing to raid6_pq.ko · f701d589
      Dan Williams authored
      Move the raid6 data processing routines into a standalone module
      (raid6_pq) to prepare them to be called from async_tx wrappers and other
      non-md drivers/modules.  This precludes a circular dependency of raid456
      needing the async modules for data processing while those modules in
      turn depend on raid456 for the base level synchronous raid6 routines.
      
      To support this move:
      1/ The exportable definitions in raid6.h move to include/linux/raid/pq.h
      2/ The raid6_call, recovery calls, and table symbols are exported
      3/ Extra #ifdef __KERNEL__ statements to enable the userspace raid6test to
         compile
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      f701d589
    • Andre Noll's avatar
      md: raid5 run(): Fix max_degraded for raid level 4. · 18b00334
      Andre Noll authored
      raid4 allows only one failed disk.
      Signed-off-by: Andre Noll <maan@systemlinux.org>
      Signed-off-by: NeilBrown <neilb@suse.de>
      18b00334
    • Dan Williams's avatar
      md: 'array_size' sysfs attribute · b522adcd
      Dan Williams authored
      Allow userspace to set the size of the array according to the following
      semantics:
      
      1/ size must be <= the size returned by mddev->pers->size(mddev, 0, 0)
         a) If size is set before the array is running, do_md_run will fail
            if size is greater than the default size
         b) A reshape attempt that reduces the default size to less than the set
            array size should be blocked
      2/ once userspace sets the size the kernel will not change it
      3/ writing 'default' to this attribute returns control of the size to the
         kernel and reverts to the size reported by the personality
      
      Also, convert locations that need to know the default size from directly
      reading ->array_sectors to <pers>_size.  Resync/reshape operations
      always follow the default size.
      
      Finally, fixup other locations that read a number of 1k-blocks from
      userspace to use strict_blocks_to_sectors() which checks for unsigned
      long long to sector_t overflow and blocks to sectors overflow.
      Reviewed-by: Andre Noll <maan@systemlinux.org>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      b522adcd
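The two overflow checks mentioned for strict_blocks_to_sectors() can be modelled in Python, where integers do not overflow and the limits have to be checked explicitly. This is a sketch only: the function name `blocks_to_sectors`, the `sector_bits` parameter, and the use of ValueError are assumptions, and the 64-bit width of unsigned long long is assumed as well.

```python
ULL_BITS = 64  # assumed width of unsigned long long


def blocks_to_sectors(text, sector_bits=64):
    """Convert a decimal count of 1K-blocks to 512-byte sectors.

    Raises ValueError on malformed input or on either overflow.
    """
    stripped = text.strip()
    if not (stripped.isascii() and stripped.isdigit()):
        raise ValueError("not an unsigned decimal number")
    blocks = int(stripped)
    if blocks >= 1 << ULL_BITS:
        raise ValueError("does not fit in unsigned long long")
    # Blocks-to-sectors overflow: doubling would lose the top bit.
    if blocks >> (ULL_BITS - 1):
        raise ValueError("blocks to sectors overflow")
    sectors = blocks * 2
    # unsigned long long to sector_t overflow (sector_t may be narrower).
    if sectors >= 1 << sector_bits:
        raise ValueError("sector_t overflow")
    return sectors
```

For example, `blocks_to_sectors("1024")` yields 2048, while a value with the top bit set, or one that does not fit a 32-bit `sector_t` after doubling, is rejected.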
    • Dan Williams's avatar
      md: centralize ->array_sectors modifications · 1f403624
      Dan Williams authored
      Get personalities out of the business of directly modifying
      ->array_sectors.  Lays groundwork to introduce policy on when
      ->array_sectors can be modified.
      Reviewed-by: Andre Noll <maan@systemlinux.org>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      1f403624
    • Dan Williams's avatar
      md: add 'size' as a personality method · 80c3a6ce
      Dan Williams authored
      In preparation for giving userspace control over ->array_sectors we need
      to be able to retrieve the 'default' size, and the 'anticipated' size
      when a reshape is requested.  For personalities that do not reshape emit
      a warning if anything but the default size is requested.
      
      In the raid5 case we need to update ->previous_raid_disks to make the
      new 'default' size available.
      Reviewed-by: Andre Noll <maan@systemlinux.org>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      80c3a6ce
    • Atsushi SAKAI's avatar
      md: fix typo in FSF address · 93ed05e2
      Atsushi SAKAI authored
      Hello,
      
       I found a typo Bosto"m" in FSF address.
      And I am checking around linux source code.
      Here is the only place which uses Bosto"m" (not Boston).
      Signed-off-by: Atsushi SAKAI <sakaia@jp.fujitsu.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      93ed05e2
    • NeilBrown's avatar
      md: add takeover support for converting raid6 back into raid5 · fc9739c6
      NeilBrown authored
      If a raid6 is still in the layout that comes from converting raid5
      into a raid6, this will allow us to convert it back again.
      Signed-off-by: NeilBrown <neilb@suse.de>
      fc9739c6
    • NeilBrown's avatar
      e9d4758f
    • NeilBrown's avatar
      md/raid5: allow layout/chunksize to be changed on an active 2-drive raid5. · b3546035
      NeilBrown authored
      2-drive raid5's aren't very interesting.  But if you are converting
      a raid1 into a raid5, you will at least temporarily have one.  And
      that is a good time to set the layout/chunksize for the new RAID5
      if you aren't happy with the defaults.
      
      layout and chunksize don't actually affect the placement of data
      on a 2-drive raid5, so we just do some internal book-keeping.
      Signed-off-by: NeilBrown <neilb@suse.de>
      b3546035
    • NeilBrown's avatar
      md: add ->takeover method for raid5 to be able to take over raid1 · d562b0c4
      NeilBrown authored
      The RAID1 must have two drives and be a suitable size to
      be a multiple of a chunksize that isn't too small.
      Signed-off-by: NeilBrown <neilb@suse.de>
      d562b0c4
    • NeilBrown's avatar
      md: add ->takeover method to support changing the personality managing an array · 245f46c2
      NeilBrown authored
      Implement this for RAID6 to be able to 'takeover' a RAID5 array.  The
      new RAID6 will use a layout which places Q on the last device, and
      that device will be missing.
      If there are any available spares, one will immediately have Q
      recovered onto it.
      Signed-off-by: NeilBrown <neilb@suse.de>
      245f46c2
    • NeilBrown's avatar
      md: enable suspend/resume of md devices. · 409c57f3
      NeilBrown authored
      To be able to change the 'level' of an md/raid array, we need to
      suspend the device so that no requests are active - then move some
      pointers around etc.
      
      The code already keeps counts of active requests and the ->quiesce
      function can be used to wait until those counts hit zero.
      However the quiesce function blocks new requests once they are
      already 'inside' the personality module, and that is too late if we want
      to replace the personality modules.
      
      So make all md requests come in through a common md_make_request
      function that keeps track of how many requests have entered the
      modules but may not yet be on the internal reference counts.
      Allow md_make_request to be blocked when we want to suspend the
      device, and make it possible to wait for all those in-transit requests
      to be added to internal lists so that ->quiesce can wait for them.
      
      There is still a problem that when a request completes, we drop the
      ref count inside the personality code so there is a short time between
      when the refcount hits zero, and when the personality code is no
      longer being used.
      The personality code never blocks (schedule or spinlock) between
      dropping the refcount and exiting the routine, so this should be safe
      (as put_module calls synchronize_sched() before unmapping the module
      code).
      Signed-off-by: NeilBrown <neilb@suse.de>
      409c57f3
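The bookkeeping behind the suspend mechanism described above can be illustrated with a single-threaded Python model. It does not block or use real locking, and the class `SuspendGate` and all of its method names are hypothetical; it only shows which counter is checked at which point.

```python
class SuspendGate:
    """Tracks requests between the common entry point and the personality."""

    def __init__(self):
        self.suspended = False
        # Requests that have entered but are not yet on the
        # personality's own internal reference counts.
        self.in_transit = 0

    def try_enter(self):
        """A new request arrives; refuse it while suspended."""
        if self.suspended:
            return False
        self.in_transit += 1
        return True

    def handed_over(self):
        """The personality has taken its own reference to the request."""
        self.in_transit -= 1

    def suspend(self):
        """Stop admitting new requests."""
        self.suspended = True

    def drained(self):
        """True once it is safe to go on and wait in ->quiesce."""
        return self.suspended and self.in_transit == 0

    def resume(self):
        self.suspended = False
```

In the real code a refused request would sleep until resume rather than fail; the model returns False only to keep the example deterministic.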
    • NeilBrown's avatar
      md: md_unregister_thread should cope with being passed NULL · e0cf8f04
      NeilBrown authored
      Mostly md_unregister_thread is only called when we know that the
      thread is not NULL, but sometimes we need to check first.  It is safer
      to put the check inside md_unregister_thread itself.
      Signed-off-by: NeilBrown <neilb@suse.de>
      e0cf8f04
    • NeilBrown's avatar
      md/raid5: refactor raid5 "run" · 91adb564
      NeilBrown authored
      .. so that the code to create the private data structures is separate.
      This will help with future code to change the level of an active
      array.
      Signed-off-by: NeilBrown <neilb@suse.de>
      91adb564
    • NeilBrown's avatar
      md: make sure new_level, new_chunksize, new_layout always have sensible values. · 34817e8c
      NeilBrown authored
      When an md array is undergoing a change, we have new_* fields that
      show the new values.
      When no change is happening, it is least confusing if these have
      the same value as the normal fields.
      This is true in most cases, but not when the values are set via sysfs.
      
      So fix this up.
      
      A subsequent patch will BUG_ON if these things aren't consistent.
      Signed-off-by: NeilBrown <neilb@suse.de>
      34817e8c
    • NeilBrown's avatar
      md/raid5: finish support for DDF/raid6 · 67cc2b81
      NeilBrown authored
      DDF requires RAID6 calculations over different devices in a different
      order.
      For md/raid6, we calculate over just the data devices, starting
      immediately after the 'Q' block.
      For ddf/raid6 we calculate over all devices, using zeros in place of
      the P and Q blocks.
      
      This requires unfortunately complex loops...
      Signed-off-by: NeilBrown <neilb@suse.de>
      67cc2b81
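The difference between the two orderings can be sketched in Python. The function `syndrome_sources` is hypothetical, and starting the DDF walk at device 0 is an assumption of this sketch; the commit message only states that DDF covers all devices with zeros standing in for P and Q.

```python
def syndrome_sources(disks, pd_idx, qd_idx, ddf_layout):
    """Return the ordered list of inputs to the P/Q calculation.

    Each entry is a device index, or None where a block of zeros is used.
    """
    if ddf_layout:
        # ddf/raid6: walk every device, substituting zeros for P and Q.
        return [None if i in (pd_idx, qd_idx) else i for i in range(disks)]
    # md/raid6: only the data devices, starting immediately after Q
    # and wrapping around the end of the stripe.
    order = []
    for step in range(1, disks + 1):
        i = (qd_idx + step) % disks
        if i not in (pd_idx, qd_idx):
            order.append(i)
    return order
```

For a 5-device stripe with P on device 1 and Q on device 2, the md ordering is `[3, 4, 0]` and the DDF ordering is `[0, None, None, 3, 4]`.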
    • NeilBrown's avatar
      md/raid5: Add support for new layouts for raid5 and raid6. · 99c0fb5f
      NeilBrown authored
      DDF uses different layouts for P and Q blocks than current md/raid6
      so add those that are missing.
      Also add support for RAID6 layouts that are identical to various
      raid5 layouts with the simple addition of one device to hold all of
      the 'Q' blocks.
      Finally add 'raid5' layouts to match raid4.
      These last two will allow online level conversion.
      
      Note that this does not provide correct support for DDF/raid6 yet
      as the order in which data blocks are summed to produce the Q block
      is significant and different between current md code and DDF
      requirements.
      Signed-off-by: NeilBrown <neilb@suse.de>
      99c0fb5f
    • NeilBrown's avatar
      md/raid5: simplify raid5_compute_sector interface · 911d4ee8
      NeilBrown authored
      Rather than passing 'pd_idx' and 'qd_idx' to be filled in, pass
      a 'struct stripe_head *' and fill in the relevant fields.  This is
      more extensible.
      Signed-off-by: NeilBrown <neilb@suse.de>
      911d4ee8
    • NeilBrown's avatar
      md/raid6: remove expectation that Q device is immediately after P device. · d0dabf7e
      NeilBrown authored
      
      Code currently assumes that the devices in a raid6 stripe are
        0 1 ... N-1 P Q
      in some rotated order.  We will shortly add new layouts in which
      this strict pattern is broken.
      So remove this expectation.  We still assume that the data disks
      are roughly in-order.  However P and Q can be inserted anywhere within
      that order.
      Signed-off-by: NeilBrown <neilb@suse.de>
      d0dabf7e
    • NeilBrown's avatar
      md/raid5: change raid5_compute_sector and stripe_to_pdidx to take a 'previous' argument · 112bf897
      NeilBrown authored
      This is similar to the recent change to get_active_stripe.
      There is no functional change, just some rearrangement to make
      future patches cleaner.
      Signed-off-by: NeilBrown <neilb@suse.de>
      112bf897
    • NeilBrown's avatar
      md/raid5: simplify interface for init_stripe and get_active_stripe · b5663ba4
      NeilBrown authored
      Rather than passing 'pd_idx' and 'disks' to these functions, just pass
      'previous' which tells whether to use the 'previous' or 'current'
      geometry during a reshape, and let init_stripe calculate
      disks and pd_idx and anything else it might need.
      
      This is not a substantial simplification and even adds a division.
      However we will shortly be adding more complexity to init_stripe
      to handle more interesting 'reshape' activities, and without this
      change, the interface to these functions would get very complex.
      Signed-off-by: NeilBrown <neilb@suse.de>
      b5663ba4
    • Andre Noll's avatar
      md: Represent raid device size in sectors. · dd8ac336
      Andre Noll authored
      This patch renames the "size" field of struct mdk_rdev_s to
      "sectors" and changes this field to store sectors instead of
      blocks.
      
      All users of this field, linear.c, raid0.c and md.c, are fixed up
      accordingly which gets rid of many multiplications and divisions.
      Signed-off-by: Andre Noll <maan@systemlinux.org>
      Signed-off-by: NeilBrown <neilb@suse.de>
      dd8ac336
    • Andre Noll's avatar
      md: Make mddev->size sector-based. · 58c0fed4
      Andre Noll authored
      This patch renames the "size" field of struct mddev_s to "dev_sectors"
      and stores the number of 512-byte sectors instead of the number of
      1K-blocks in it.
      
      All users of that field, including raid levels 1,4-6,10, are adjusted
      accordingly. This simplifies the code a bit because it allows us to get
      rid of a couple of divisions/multiplications by two.
      
      In order to make checkpatch happy, some minor coding style issues
      have also been addressed. In particular, size_store() now uses
      strict_strtoull() instead of simple_strtoull().
      Signed-off-by: Andre Noll <maan@systemlinux.org>
      Signed-off-by: NeilBrown <neilb@suse.de>
      58c0fed4
    • NeilBrown's avatar
      md: be more consistent about setting WriteMostly flag when adding a drive to an array · 575a80fa
      NeilBrown authored
      When a drive is added to an array using ADD_NEW_DISK, there are two
      places we can get certain flags from:  the metadata on the disk or the
      flags passed through the IOCTL.
      
      For the WriteMostly flag (aka MD_DISK_WRITEMOSTLY) we take the value
      from either of those sources depending on if it is set (i.e. we
      effectively 'or' the two sources together).
      
      This makes it awkward to clear, and is at best inconsistent.
      
      As documented code (in mdadm) requires that setting
      MD_DISK_WRITEMOSTLY in the ioctl will be effective, we resolve the
      inconsistency by always using the value for this flag from the ioctl,
      and ignoring the value on disk.
      Signed-off-by: NeilBrown <neilb@suse.de>
      575a80fa
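The change in how the WriteMostly flag is derived can be shown with two small Python functions, one for the old behaviour and one for the new. The function names are hypothetical, and the bit number used for MD_DISK_WRITEMOSTLY is an assumption made for this sketch.

```python
MD_DISK_WRITEMOSTLY = 9  # bit number in the ioctl state word (assumed here)


def write_mostly_old(on_disk_flag, ioctl_state):
    """Old behaviour: the two sources are effectively OR-ed together."""
    return on_disk_flag or bool(ioctl_state & (1 << MD_DISK_WRITEMOSTLY))


def write_mostly_new(on_disk_flag, ioctl_state):
    """New behaviour: only the ioctl value counts; the on-disk value is ignored."""
    return bool(ioctl_state & (1 << MD_DISK_WRITEMOSTLY))
```

With the old rule, a flag recorded in the metadata could not be cleared through the ioctl; with the new rule, passing a state word without the bit clears it.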
    • NeilBrown's avatar
      md: occasionally checkpoint drive recovery to reduce duplicate effort after a crash · 97e4f42d
      NeilBrown authored
      Version 1.x metadata has the ability to record the status of a
      partially completed drive recovery.
      However we only update that record on a clean shutdown.
      It would be nice to update it on unclean shutdowns too, particularly
      when using a bitmap that removes much of the 'sync' effort after an
      unclean shutdown.
      
      One complication with checkpointing recovery is that we only know
      where we are up to in terms of IO requests started, not which ones
      have completed.  And we need to know what has completed to record
      how much is recovered.  So occasionally pause the recovery until all
      submitted requests are completed, then update the record of where
      we are up to.
      
      When we have a bitmap, we already do that pause occasionally to keep
      the bitmap up-to-date.  So enhance that code to record the recovery
      offset and schedule a superblock update.
      And when there is no bitmap, just pause 16 times during the resync to
      do a checkpoint.
      '16' is a fairly arbitrary number.  But we don't really have any good
      way to judge how often is acceptable, and it seems like a reasonable
      number for now.
      Signed-off-by: NeilBrown <neilb@suse.de>
      97e4f42d
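The "pause 16 times" policy for arrays without a bitmap can be sketched as a Python simulation. The function `checkpoint_positions`, its parameters, and the exact comparison used to decide when to pause are assumptions for illustration, not the kernel's logic.

```python
def checkpoint_positions(max_sectors, step_sectors, checkpoints=16):
    """Return the positions at which the resync pauses to record progress."""
    interval = max_sectors // checkpoints
    completed = 0   # last position known to be fully written
    position = 0    # how far IO requests have been started
    pauses = []
    while position < max_sectors:
        position = min(position + step_sectors, max_sectors)
        if position - completed >= interval:
            # Here the real code would wait for all submitted requests
            # to complete before recording the recovery offset.
            completed = position
            pauses.append(position)
    return pauses
```

For a 1600-sector resync issued 10 sectors at a time, this model pauses every 100 sectors, 16 times in total.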
    • NeilBrown's avatar
      md: move md_k.h from include/linux/raid/ to drivers/md/ · 43b2e5d8
      NeilBrown authored
      It really is nicer to keep related code together..
      Signed-off-by: NeilBrown <neilb@suse.de>
      43b2e5d8
    • NeilBrown's avatar
      md: move lots of #include lines out of .h files and into .c · bff61975
      NeilBrown authored
      This makes the includes more explicit, and is preparation for moving
      md_k.h to drivers/md/md.h
      
      Remove include/raid/md.h as its only remaining use was to #include
      other files.
      Signed-off-by: NeilBrown <neilb@suse.de>
      bff61975
    • NeilBrown's avatar
      md: move most content from md.h to md_k.h · 92022950
      NeilBrown authored
      The extern function definitions are kernel-internal definitions, so
      they belong in md_k.h
      
      The MD_*_VERSION values could reasonably go in a number of places,
      but md_u.h seems most reasonable.
      
      This leaves almost nothing in md.h.  It will go soon.
      Signed-off-by: NeilBrown <neilb@suse.de>
      92022950
    • NeilBrown's avatar
      md: move LEVEL_* definition from md_k.h to md_u.h · 8b2b5c21
      NeilBrown authored
      .. as they are part of the user-space interface.
      Also move MdpMinorShift into there so we can remove duplication.
      
      Lastly move mdp_major in.  It is less obviously part of the user-space
      interface, but do_mounts_md.c uses it, and it is acting a bit like
      user-space.
      Signed-off-by: NeilBrown <neilb@suse.de>
      8b2b5c21
    • Christoph Hellwig's avatar
      md: move headers out of include/linux/raid/ · ef740c37
      Christoph Hellwig authored
      Move the headers with the local structures for the disciplines and
      bitmap.h into drivers/md/ so that they are more easily grepable for
      hacking and not far away.  md.h is left where it is for now as there
      are some uses from the outside.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: NeilBrown <neilb@suse.de>
      ef740c37
    • Christoph Hellwig's avatar
      cleanup drivers/md/Makefile · 2a40a8ae
      Christoph Hellwig authored
      Use the -y variables instead of the old -objs so we can easily add
      conditional objects to the modules.  Also always use += to add
      subobjects to avoid problems when placing additional objects in
      some place in the file.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: NeilBrown <neilb@suse.de>
      2a40a8ae
    • Christoph Hellwig's avatar
      md: stop defining MAJOR_NR · 3dbd8c2e
      Christoph Hellwig authored
      MAJOR_NR was only required for magic in linux/blk.h in 2.4 or earlier
      kernels, so no need to keep it around.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      3dbd8c2e
    • Martin K. Petersen's avatar
      MD data integrity support · 3f9d99c1
      Martin K. Petersen authored
      md: Add support for data integrity to MD
      
      If all subdevices support the same protection format the MD device is
      flagged as integrity capable.
      Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      3f9d99c1
    • NeilBrown's avatar
      md: write bitmap information to devices that are undergoing recovery. · 355a43e6
      NeilBrown authored
      When we add some spares to an array and start recovery, and we have
      a bitmap which is stored 'internally' on all devices, we call
      bitmap_write_all to make sure the bitmap is correct on the new
      device(s).
      However that doesn't work as write_sb_page only writes to
      'In_sync' devices, and devices undergoing recovery are not
      'In_sync' until recovery finishes.
      
      So extend write_sb_page (actually next_active_rdev) to include devices
      that are under recovery.
      Signed-off-by: NeilBrown <neilb@suse.de>
      355a43e6
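The effect of widening the device selection can be sketched in Python. The dictionary fields (`name`, `in_sync`, `recovering`) and both function names are hypothetical stand-ins for the per-device state the kernel tracks; the sketch only contrasts the two selection rules.

```python
def targets_before(devices):
    """Old rule: bitmap pages go only to devices that are In_sync."""
    return [d["name"] for d in devices if d["in_sync"]]


def targets_after(devices):
    """New rule: devices undergoing recovery receive bitmap pages too."""
    return [d["name"] for d in devices if d["in_sync"] or d["recovering"]]


# Two healthy members, one spare being recovered, one failed device.
devices = [
    {"name": "sda", "in_sync": True, "recovering": False},
    {"name": "sdb", "in_sync": True, "recovering": False},
    {"name": "sdc", "in_sync": False, "recovering": True},
    {"name": "sdd", "in_sync": False, "recovering": False},
]
```

Under the old rule the recovering spare `sdc` never receives the bitmap; under the new rule it does, while the failed `sdd` is still skipped.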
    • NeilBrown's avatar
      md: never clear bit from the write-intent bitmap when the array is degraded. · d0a4bb49
      NeilBrown authored
      
      It is safe to clear a bit from the write-intent bitmap for a raid1
      if we know the data has been written to all devices, which is
      what the current test does.
      
      But it is not always safe to update the 'events_cleared' counter in
      that case.  This is because one request could complete successfully
      after some other request has partially failed.
      
      So simply disable the clearing and updating of events_cleared whenever
      the array is degraded.  This might end up not clearing some bits that
      could safely be cleared, but it is the safest approach.
      
      Note that the bug fixed here did not risk corrupting data by letting
      the array get out-of-sync.  Rather it meant that when a device is
      removed and re-added to the array, it might incorrectly require a full
      recovery rather than just recovering based on the bitmap.
      Signed-off-by: NeilBrown <neilb@suse.de>
      d0a4bb49