1. 03 Dec, 2009 1 commit
      block: Allow devices to indicate whether discarded blocks are zeroed · 98262f27
      Martin K. Petersen authored
      The discard ioctl is used by mkfs utilities to clear a block device
      prior to putting metadata down.  However, not all devices return zeroed
      blocks after a discard.  Some drives return stale data, potentially
      containing old superblocks.  It is therefore important to know whether
      discarded blocks are properly zeroed.
      
      Both ATA and SCSI drives have configuration bits that indicate whether
      zeroes are returned after a discard operation.  Implement a block level
      interface that allows this information to be bubbled up the stack and
      queried via a new block device ioctl.
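
A minimal user-space sketch of querying this information. It assumes the ioctl is the one exposed as BLKDISCARDZEROES in <linux/fs.h>; that name comes from the kernel headers, not from the message above, and the helper name discard_zeroes() is made up for illustration. Recent kernels may always report 0 here, so treat the result as a hint only.

```c
#include <assert.h>
#include <fcntl.h>
#include <linux/fs.h>   /* BLKDISCARDZEROES (assumed to be the ioctl in question) */
#include <sys/ioctl.h>
#include <unistd.h>

/*
 * Returns 1 if the device reports that discarded blocks read back as
 * zeroes, 0 if it does not, and -1 if the device cannot be opened or
 * queried (for example, when the path is not a block device).
 */
int discard_zeroes(const char *path)
{
    unsigned int zeroes = 0;
    int fd, rc;

    assert(path != NULL);

    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    rc = ioctl(fd, BLKDISCARDZEROES, &zeroes);
    close(fd);

    if (rc < 0)
        return -1;
    return zeroes ? 1 : 0;
}
```

A mkfs-style tool could call discard_zeroes("/dev/sdX") after a discard and fall back to writing zeroes explicitly when the result is not 1.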
      Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
  2. 30 Nov, 2009 1 commit
  3. 26 Nov, 2009 9 commits
      cfq-iosched: fix corner cases in idling logic · 8e550632
      Corrado Zoccolo authored
      Idling logic was disabled in some corner cases, leading to an unfair
      share for no-idle queues.
       * The idle timer was not armed if there were other requests in the
         driver. Unfortunately, those requests could come from other workloads,
         or from queues for which we don't enable idling. So we now check only
         pending requests from the active queue.
       * The rq_noidle check on a no-idle queue could disable the end-of-tree
         idle if the last completed request was rq_noidle. Now we disable that
         idle only if all the queues served in the no-idle tree had rq_noidle
         requests.
      Reported-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Corrado Zoccolo <czoccolo@gmail.com>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
      cfq-iosched: idling on deep seeky sync queues · 76280aff
      Corrado Zoccolo authored
      Seeky sync queues with large depth can gain an unfairly large share of
      disk time, at the expense of other seeky queues. This patch ensures that
      idling will be enabled for queues with an I/O depth of at least 4 and a
      small think time. The decision to enable idling is sticky until an idle
      window times out without seeing a new request.
      
      The reasoning behind the decision is that, if an application is using
      large I/O depth, it is already optimized to make full utilization of
      the hardware, and therefore we reserve a slice of exclusive use for it.
      Reported-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Corrado Zoccolo <czoccolo@gmail.com>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
      cfq-iosched: fix no-idle preemption logic · e4a22919
      Corrado Zoccolo authored
      An incoming no-idle queue should preempt the active no-idle queue
      only if the active queue is idling because its service tree is empty.
      The previous code was buggy in two ways:
       * it relied on the service_tree field being set on the active queue,
         while it is not set when the code is idling for a new request
       * it didn't check for the empty-service-tree condition, so it could lead
         to LIFO behaviour if multiple queues with depth > 1 were preempting
         each other on a non-NCQ device.
      Reported-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Corrado Zoccolo <czoccolo@gmail.com>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
      cfq-iosched: fix ncq detection code · e459dd08
      Corrado Zoccolo authored
      CFQ's detection of queueing devices initially assumes a queuing device
      and detects if the queue depth reaches a certain threshold.
      However, it will reconsider this choice periodically.
      
      Unfortunately, if the device is considered not queuing, CFQ will force a
      unit queue depth for some workloads, thus defeating the detection logic.
      This leads to poor performance on queuing hardware,
      since the idle window remains enabled.
      
      Given this premise, switching to hw_tag = 0 after we have proved at
      least once that the device is NCQ capable is not a good choice.
      
      The new detection code starts in an indeterminate state, in which CFQ behaves
      as if hw_tag = 1, and then, if for a long observation period we never saw
      large depth, we switch to hw_tag = 0, otherwise we stick to hw_tag = 1,
      without reconsidering it again.
      Signed-off-by: Corrado Zoccolo <czoccolo@gmail.com>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
      Fix regression in direct writes performance due to WRITE_ODIRECT flag removal · d9449ce3
      Vivek Goyal authored
      There seems to be a regression in direct write path due to following
      commit in for-2.6.33 branch of block tree.
      
      commit 1af60fbd
      Author: Jeff Moyer <jmoyer@redhat.com>
      Date:   Fri Oct 2 18:56:53 2009 -0400
      
          block: get rid of the WRITE_ODIRECT flag
      
      Marking direct writes as WRITE_SYNC_PLUG instead of WRITE_ODIRECT sets
      the NOIDLE flag in the bio and hence in the request. This tells CFQ not
      to expect more requests from the queue and not to idle on it (despite
      the fact that the queue's think time is low and it is not seeky).
      
      So direct writers lose big time when competing with sequential readers.
      
      Using fio, I have run one direct writer and two sequential readers and
      following are the results with 2.6.32-rc7 kernel and with for-2.6.33
      branch.
      
      Test
      ====
      1 direct writer and 2 sequential readers running simultaneously.
      
      [global]
      directory=/mnt/sdc/fio/
      runtime=10
      
      [seqwrite]
      rw=write
      size=4G
      direct=1
      
      [seqread]
      rw=read
      size=2G
      numjobs=2
      
      2.6.32-rc7
      ==========
      direct writes: aggrb=2,968KB/s
      readers:       aggrb=101MB/s
      
      for-2.6.33 branch
      =================
      direct writes: aggrb=19KB/s
      readers:       aggrb=137MB/s
      
      This patch brings back the WRITE_ODIRECT flag, with the difference that we
      don't set the BIO_RW_UNPLUG flag, so that the device is not unplugged after
      submission of a request and an explicit unplug from the submitter is required.
      
      That way we fix Jeff's issue of not enough merging taking place in the aio
      path, and also make sure direct writes get their fair share.
      
      After the fix
      =============
      for-2.6.33 + fix
      ----------------
      direct writes: aggrb=2,728KB/s
      reads: aggrb=103MB/s
      
      Thanks
      Vivek
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
      cfq-iosched: cleanup unreachable code · c16632ba
      Corrado Zoccolo authored
      cfq_should_idle returns false for no-idle queues that are not the last,
      so the control flow will never reach the removed code in a state that
      satisfies the if condition.
      The unreachable code was added to emulate previous cfq behaviour for
      non-NCQ rotational devices. My tests show that even without it,
      performance and fairness are comparable with previous cfq, thanks to
      the fact that all seeky queues are grouped together, and that we idle at
      the end of the tree.
      Signed-off-by: Corrado Zoccolo <czoccolo@gmail.com>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
      block: add helpers to run flush_dcache_page() against a bio and a request's pages · 2d4dc890
      Ilya Loginov authored
      The mtdblock driver doesn't call flush_dcache_page() for pages in a
      request.  This causes problems on architectures where the icache doesn't
      fill from the dcache, or with dcache aliases.  The patch fixes this.
      
      The ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE symbol was introduced to avoid
      pointless empty cache-thrashing loops on architectures for which
      flush_dcache_page() is a no-op.  Every architecture now defines this
      symbol; the helpers flush pages on architectures where
      ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE equals 1 and do nothing otherwise.
      
      See "fix mtd_blkdevs problem with caches on some architectures" discussion
      on LKML for more information.
      Signed-off-by: Ilya Loginov <isloginov@gmail.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Peter Horton <phorton@bitbox.co.uk>
      Cc: "Ed L. Cashin" <ecashin@coraid.com>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
      cfq: Make use of service count to estimate the rb_key offset · 3586e917
      Gui Jianfeng authored
      For the moment, different workload cfq queues are put into different
      service trees. But CFQ still uses "busy_queues" to estimate rb_key
      offset when inserting a cfq queue into a service tree. I think this
      isn't appropriate, and it should make use of service tree count to do
      this estimation. This patch is for for-2.6.33 branch.
      Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
  4. 25 Nov, 2009 1 commit
  5. 24 Nov, 2009 4 commits
  6. 23 Nov, 2009 4 commits
      cciss: change Cmd_sg_list.sg_chain_dma type to dma_addr_t · 32a87c01
      Alex Chiang authored
      A recent commit broke the ia64 build:
      
      	Author: Don Brace <brace@beardog.cce.hp.com>
      	Date:   Thu Nov 12 12:50:01 2009 -0600
      
      	cciss: Add enhanced scatter-gather support.
      
      because of this hunk:
      
      	--- a/drivers/block/cciss.h
      	+++ b/drivers/block/cciss.h
      	+struct Cmd_sg_list {
      	+       SGDescriptor_struct     *sgchain;
      	+       dma64_addr_t            sg_chain_dma;
      	+       int                     chain_block_size;
      	+};
      
      The issue is that dma64_addr_t isn't #define'd on ia64.
      
      The way that we're using Cmd_sg_list.sg_chain_dma is to hold an
      address returned from pci_map_single().
      
      	+               temp64.val = pci_map_single(h->pdev,
      	+                                 h->cmd_sg_list[c->cmdindex]->sgchain,
      	+                                 len, dir);
      	+
      	+               h->cmd_sg_list[c->cmdindex]->sg_chain_dma = temp64.val;
      
      pci_map_single() returns a dma_addr_t too.
      
      This code will still work even on a 32-bit x86 build, where
      dma_addr_t is defined to be a u32 because it will simply be
      promoted to the __u64 that temp64.val is defined as.
      
      Thus, declaring Cmd_sg_list.sg_chain_dma as dma_addr_t is safe.
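
The width argument above can be checked with a small stand-alone sketch. The type and struct names below are hypothetical stand-ins for a 32-bit build where dma_addr_t is a u32; this is not the cciss code itself.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for dma_addr_t on a 32-bit x86 build. */
typedef uint32_t dma_addr32_t;

/* Hypothetical stand-in for the field discussed above. */
struct sg_list_sketch {
    dma_addr32_t sg_chain_dma;
};

/*
 * Mimic the flow in the hunk: a 32-bit mapping result is first held in a
 * 64-bit temporary (like temp64.val) and then stored in the narrower field.
 * Returns the 64-bit temporary so callers can inspect it.
 */
uint64_t round_trip(dma_addr32_t mapped)
{
    uint64_t temp64 = mapped;                /* promotion zero-extends, lossless */
    struct sg_list_sketch sg;

    sg.sg_chain_dma = (dma_addr32_t)temp64;  /* narrowing back: the value fits */
    assert(sg.sg_chain_dma == mapped);
    return temp64;
}
```

Because the value originates as a 32-bit quantity, passing it through the 64-bit temporary and back cannot lose bits, which is the property the message relies on.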
      
      Cc: Don Brace <brace@beardog.cce.hp.com>
      Cc: Stephen M. Cameron <scameron@beardog.cce.hp.com>
      Signed-off-by: Alex Chiang <achiang@hp.com>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
      cciss: fix scatter gather cleanup problems · d61c4269
      Stephen M. Cameron authored
      On driver unload, only free up the extra scatter gather data if they were
      allocated in the first place (the controller supports it) and don't forget
      to free up the sg_cmd_list array of pointers.
      Signed-off-by: Don Brace <brace@beardog.cce.hp.com>
      Signed-off-by: Stephen M. Cameron <scameron@beardog.cce.hp.com>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
      partitions: read whole sector with EFI GPT header · 87038c2d
      Karel Zak authored
      The size of the EFI GPT header is not static, but a whole sector is
      allocated for the header. The HeaderSize field must be greater
      than 92 (= sizeof(struct gpt_header)) and must be less than or
      equal to the logical block size.
      
      It means we have to read the whole sector containing the header, because
      the header crc32 checksum is calculated over HeaderSize bytes.
      
      For more details see UEFI standard (version 2.3, May 2009):
        - 5.3.1 GUID Format overview, page 93
        - Table 13. GUID Partition Table Header, page 96
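
A minimal user-space sketch of why the whole sector is needed: the checksum covers HeaderSize bytes, which may exceed 92. The function names are made up for illustration, and the field offsets (HeaderSize at byte 12, HeaderCRC32 at byte 16, both little-endian) are taken from the GPT header layout rather than from the message above. The sketch accepts a HeaderSize of exactly 92, the usual value in practice, which is an assumption on the lower bound.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define GPT_MIN_HEADER_SIZE 92u  /* sizeof(struct gpt_header) per the text above */
#define GPT_OFF_HEADER_SIZE 12u  /* HeaderSize field, little-endian u32 */
#define GPT_OFF_HEADER_CRC  16u  /* HeaderCRC32 field, little-endian u32 */

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320), the variant GPT uses. */
uint32_t crc32_le(const uint8_t *buf, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;

    for (size_t i = 0; i < len; i++) {
        crc ^= buf[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

static uint32_t get_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* sector holds one whole logical block; returns 1 if the header CRC matches. */
int gpt_header_crc_ok(const uint8_t *sector, size_t block_size)
{
    uint8_t copy[4096];          /* sketch limit: blocks up to 4096 bytes */
    uint32_t header_size, stored;

    assert(sector != NULL);
    if (block_size < GPT_MIN_HEADER_SIZE || block_size > sizeof(copy))
        return 0;

    header_size = get_le32(sector + GPT_OFF_HEADER_SIZE);
    if (header_size < GPT_MIN_HEADER_SIZE || header_size > block_size)
        return 0;

    stored = get_le32(sector + GPT_OFF_HEADER_CRC);
    memcpy(copy, sector, header_size);
    memset(copy + GPT_OFF_HEADER_CRC, 0, 4);  /* CRC field counts as zero */
    return crc32_le(copy, header_size) == stored;
}
```

Reading only the first 92 bytes would make it impossible to verify a header whose HeaderSize is larger, since the bytes beyond 92 also feed the checksum.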
      Signed-off-by: Karel Zak <kzak@redhat.com>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
      partitions: use sector size for EFI GPT · 7d13af32
      Karel Zak authored
      Currently, the kernel strictly uses 512-byte sectors for EFI GPT parsing.
      That's wrong.
      
      UEFI standard (version 2.3, May 2009, 5.3.1 GUID Format overview, page
      95) defines that LBA is always based on the logical block size. It
      means bdev_logical_block_size() (aka BLKSSZGET) for Linux.
      
      This patch removes static sector size from EFI GPT parser.
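
The effect of the sector size on where the parser looks can be sketched in a few lines. The helper names are hypothetical; the only fact used beyond the message is that the primary GPT header sits at LBA 1.

```c
#include <assert.h>
#include <stdint.h>

#define GPT_PRIMARY_HEADER_LBA 1u  /* the primary GPT header lives at LBA 1 */

/*
 * Convert an LBA to a byte offset using the device's logical block size
 * (the value BLKSSZGET / bdev_logical_block_size() reports), rather than
 * a hard-coded 512.
 */
uint64_t lba_to_bytes(uint64_t lba, uint32_t logical_block_size)
{
    assert(logical_block_size != 0);
    return lba * (uint64_t)logical_block_size;
}

/* Byte offset of the primary GPT header for a given logical block size. */
uint64_t gpt_header_offset(uint32_t logical_block_size)
{
    return lba_to_bytes(GPT_PRIMARY_HEADER_LBA, logical_block_size);
}
```

With 512-byte blocks the header is at byte 512, but with 4096-byte blocks it is at byte 4096, so a parser that assumes 512 looks in the wrong place and finds no valid table.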
      
      The problem is reproducible with the latest GNU Parted:
      
        # modprobe scsi_debug dev_size_mb=50 sector_size=4096
      
        # ./parted /dev/sdb print
        Model: Linux scsi_debug (scsi)
        Disk /dev/sdb: 52.4MB
        Sector size (logical/physical): 4096B/4096B
        Partition Table: gpt
      
        Number  Start   End     Size    File system  Name     Flags
         1      24.6kB  3002kB  2978kB               primary
         2      3002kB  6001kB  2998kB               primary
         3      6001kB  9003kB  3002kB               primary
      
        # blockdev --rereadpt /dev/sdb
        # dmesg | tail -1
         sdb: unknown partition table      <---- !!!
      
      with this patch:
      
        # blockdev --rereadpt /dev/sdb
        # dmesg | tail -1
         sdb: sdb1 sdb2 sdb3
      Signed-off-by: Karel Zak <kzak@redhat.com>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
  7. 13 Nov, 2009 11 commits
  8. 11 Nov, 2009 1 commit
  9. 10 Nov, 2009 1 commit
  10. 08 Nov, 2009 1 commit
      cfq-iosched: fix next_rq computation · cf7c25cf
      Corrado Zoccolo authored
      CFQ has a bug in the computation of next_rq that affects the transition
      between multiple sequential request streams in a single queue
      (e.g. two sequential buffered writers of the same priority),
      causing alternation between the two streams for a transient period.
      
        8,0    1    18737     0.260400660  5312  D   W 141653311 + 256
        8,0    1    20839     0.273239461  5400  D   W 141653567 + 256
        8,0    1    20841     0.276343885  5394  D   W 142803919 + 256
        8,0    1    20843     0.279490878  5394  D   W 141668927 + 256
        8,0    1    20845     0.292459993  5400  D   W 142804175 + 256
        8,0    1    20847     0.295537247  5400  D   W 141668671 + 256
        8,0    1    20849     0.298656337  5400  D   W 142804431 + 256
        8,0    1    20851     0.311481148  5394  D   W 141668415 + 256
        8,0    1    20853     0.314421305  5394  D   W 142804687 + 256
        8,0    1    20855     0.318960112  5400  D   W 142804943 + 256
      
      The fix makes sure that the next_rq is computed from the last
      dispatched request, and not affected by merging.
      
        8,0    1    37776     4.305161306     0  D   W 141738087 + 256
        8,0    1    37778     4.308298091     0  D   W 141738343 + 256
        8,0    1    37780     4.312885190     0  D   W 141738599 + 256
        8,0    1    37782     4.315933291     0  D   W 141738855 + 256
        8,0    1    37784     4.319064459     0  D   W 141739111 + 256
        8,0    1    37786     4.331918431  5672  D   W 142803007 + 256
        8,0    1    37788     4.334930332  5672  D   W 142803263 + 256
        8,0    1    37790     4.337902723  5672  D   W 142803519 + 256
        8,0    1    37792     4.342359774  5672  D   W 142803775 + 256
        8,0    1    37794     4.345318286     0  D   W 142804031 + 256
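
The idea of choosing the next request relative to the last dispatched one can be sketched as follows. This is a deliberately simplified model with a hypothetical function name: it only prefers the nearest request at or ahead of the last dispatched position and wraps around otherwise, whereas the real scheduler's selection logic is more involved.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Return the index of the pending request to serve next, given the sector
 * just past the last dispatched request. Prefers the nearest request at or
 * ahead of that position; wraps to the lowest start sector if none is ahead.
 */
size_t pick_next_rq(const uint64_t *start, size_t n, uint64_t last_end)
{
    size_t best_ahead = n, lowest = 0;

    assert(start != NULL && n > 0);
    for (size_t i = 0; i < n; i++) {
        if (start[i] < start[lowest])
            lowest = i;
        if (start[i] >= last_end &&
            (best_ahead == n || start[i] < start[best_ahead]))
            best_ahead = i;
    }
    return best_ahead != n ? best_ahead : lowest;
}
```

Using sector numbers from the first trace: after dispatching 141653311 + 256, a pending request at 141653567 continues the same stream and is chosen over one at 142803919, so the stream is not interrupted.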
      Signed-off-by: Corrado Zoccolo <czoccolo@gmail.com>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
  11. 04 Nov, 2009 6 commits