    readahead: on-demand readahead logic · 122a21d1
    Fengguang Wu authored
    This is a minimal readahead algorithm that aims to replace the current one.
    It is more flexible and reliable, while maintaining almost the same behavior
    and performance.  It is also fully integrated with adaptive readahead.
    
    It is designed to be called on demand:
    	- on a missing page, to do synchronous readahead
    	- on a lookahead page, to do asynchronous readahead
    
    In this way it eliminates the awkward workarounds for cache hit/miss,
    readahead thrashing, retried read, and unaligned read.  It also adopts the
    data structure introduced by adaptive readahead, parameterizes readahead
    pipelining with `lookahead_index', and reduces the current/ahead windows to
    one single window.
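
A minimal userspace sketch of the single-window idea, for illustration only.
The struct and function names here (ra_window, ra_lookahead_index,
ra_hits_lookahead) are hypothetical and are not taken from the patch; the
sketch assumes the lookahead page sits lookahead_size pages before the end of
the window, so that reading it is the cue to start the next asynchronous
readahead.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical single readahead window, in units of pages. */
struct ra_window {
	unsigned long start;          /* first page index of the window */
	unsigned long size;           /* number of pages in the window */
	unsigned long lookahead_size; /* pages still unread when the next readahead starts */
};

/* Page index whose access should trigger the next (asynchronous) readahead. */
unsigned long ra_lookahead_index(const struct ra_window *w)
{
	assert(w->lookahead_size <= w->size);
	return w->start + w->size - w->lookahead_size;
}

/* True if a read at page `offset` lands exactly on the lookahead page. */
bool ra_hits_lookahead(const struct ra_window *w, unsigned long offset)
{
	return w->size > 0 && offset == ra_lookahead_index(w);
}
```

With a window of 32 pages starting at page 100 and a lookahead size of 16,
the lookahead page in this sketch is page 116: the reader still has 16 cached
pages in hand while the next chunk is being fetched, which is the pipelining
effect described above.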
    
    HEURISTICS
    
    The logic deals with four cases:
    
    	- sequential-next
    		found a consistent readahead window, so push it forward
    
    	- random
    		standalone small read, so read as is
    
    	- sequential-first
    		create a new readahead window for a sequential/oversize request
    
    	- lookahead-clueless
    		hit a lookahead page not associated with the readahead window,
    		so create a new readahead window and ramp it up
    
    In each case, three parameters are determined:
    
    	- readahead index: where the next readahead begins
    	- readahead size:  how much to readahead
    	- lookahead size:  when to do the next readahead (for pipelining)
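
The four cases can be pictured as a small decision function. The following is
a simplified userspace sketch, not the code in readahead.c: the names
ra_state, ra_case and ra_classify are hypothetical, and the exact tests
(including treating "within one page of the previous read, or larger than the
maximum window" as sequential) are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>

enum ra_case {
	RA_SEQUENTIAL_NEXT,
	RA_RANDOM,
	RA_SEQUENTIAL_FIRST,
	RA_LOOKAHEAD_CLUELESS
};

/* Hypothetical per-file readahead state, in units of pages. */
struct ra_state {
	unsigned long prev_index;      /* last page read by the previous request */
	unsigned long readahead_index; /* where the next readahead begins */
	unsigned long lookahead_index; /* page that triggers the next readahead */
};

enum ra_case ra_classify(const struct ra_state *ra, unsigned long offset,
			 unsigned long req_size, unsigned long max_pages,
			 bool hit_lookahead_page)
{
	assert(max_pages > 0);

	/* Assumed test: adjacent to the previous read, or an oversize request.
	 * Unsigned wraparound makes a backward seek look like a large gap. */
	bool sequential = (offset - ra->prev_index <= 1UL) ||
			  (req_size > max_pages);

	/* The request matches the recorded window: push the window forward. */
	if (offset != 0 && (offset == ra->lookahead_index ||
			    offset == ra->readahead_index))
		return RA_SEQUENTIAL_NEXT;

	/* Standalone small read: read as is. */
	if (!hit_lookahead_page && !sequential)
		return RA_RANDOM;

	/* A lookahead page was hit, but it does not belong to our window. */
	if (hit_lookahead_page)
		return RA_LOOKAHEAD_CLUELESS;

	/* Sequential or oversize request with no usable window yet. */
	return RA_SEQUENTIAL_FIRST;
}
```

The order of the checks matters in this sketch: the consistent-window test
comes first so that an ordinary sequential stream never falls through to the
more speculative lookahead-clueless path.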
    
    BEHAVIORS
    
    The old behaviors are maximally preserved for trivial sequential/random reads.
    Notable changes are:
    
    	- It no longer imposes strict sequential checks.
    	  It might help some interleaved cases, and clustered random reads.
    	  It does introduce risks of a random lookahead hit triggering an
    	  unexpected readahead. But in general it is more likely to do good
    	  than to do evil.
    
    	- Interleaved reads are supported in a minimal way.
    	  Their chances of being detected and properly handled are still low.
    
    	- Readahead thrashing is better handled.
    	  The current readahead leads to tiny average I/O sizes, because it
    	  never turns back for the thrashed pages.  They have to be faulted in
    	  by do_generic_mapping_read() one by one, whereas the on-demand
    	  readahead will redo readahead for them.
    
    OVERHEADS
    
    The new code reduces the overhead of
    
    	- excessively calling the readahead routine on small sized reads
    	  (the current readahead code insists on seeing all requests)
    
    	- doing a lot of pointless page-cache lookups for small cached files
    	  (the current readahead only turns itself off after 256 cache hits;
    	  unfortunately most files are < 1MB, so they never reach that point)
    
    That accounts for speedups of
    	- 0.3% on 1-page sequential reads on a sparse file
    	- 1.2% on 1-page cache hot sequential reads
    	- 3.2% on 256-page cache hot sequential reads
    	- 1.3% on cache hot `tar /lib`
    
    However, it does introduce one extra page-cache lookup per cache miss, which
    impacts random reads slightly. That is a 1% overhead for 1-page random reads
    on a sparse file.
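
The remark about the 256-cache-hit threshold can be checked with simple
arithmetic. The sketch below assumes 4KB pages and one cache hit per page in a
single sequential pass; both are assumptions for illustration, and the helper
names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* From the text: the old readahead turns itself off after 256 cache hits. */
#define CACHE_HIT_THRESHOLD 256UL

/* Number of pages needed to hold file_bytes, rounding up. */
unsigned long pages_in_file(unsigned long file_bytes, unsigned long page_bytes)
{
	assert(page_bytes > 0);
	return (file_bytes + page_bytes - 1) / page_bytes;
}

/* True if one pass over a fully cached file yields enough hits to reach
 * the threshold, assuming one hit per page. */
bool can_reach_threshold(unsigned long file_bytes, unsigned long page_bytes)
{
	return pages_in_file(file_bytes, page_bytes) >= CACHE_HIT_THRESHOLD;
}
```

Under these assumptions a 512KB file spans only 128 pages, so a single pass
over it ends well before the old code would have switched itself off.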
    
    PERFORMANCE
    
    The basic benchmark setup is
    	- 2.6.20 kernel with on-demand readahead
    	- 1MB max readahead size
    	- 2.9GHz Intel Core 2 CPU
    	- 2GB memory
    	- 160G/8M Hitachi SATA II 7200 RPM disk
    
    The benchmarks show that
    	- it maintains the same performance for trivial sequential/random reads
    	- sysbench/OLTP performance on MySQL gains up to 8%
    	- performance on readahead thrashing gains up to 3 times
    
    iozone throughput (KB/s): roughly the same
    ==========================================
    iozone -c -t1 -s 4096m -r 64k
    
    			       2.6.20          on-demand      gain
    first run
    	  "  Initial write "   61437.27        64521.53      +5.0%
    	  "        Rewrite "   47893.02        48335.20      +0.9%
    	  "           Read "   62111.84        62141.49      +0.0%
    	  "        Re-read "   62242.66        62193.17      -0.1%
    	  "   Reverse Read "   50031.46        49989.79      -0.1%
    	  "    Stride read "    8657.61         8652.81      -0.1%
    	  "    Random read "   13914.28        13898.23      -0.1%
    	  " Mixed workload "   19069.27        19033.32      -0.2%
    	  "   Random write "   14849.80        14104.38      -5.0%
    	  "         Pwrite "   62955.30        65701.57      +4.4%
    	  "          Pread "   62209.99        62256.26      +0.1%
    
    second run
    	  "  Initial write "   60810.31        66258.69      +9.0%
    	  "        Rewrite "   49373.89        57833.66     +17.1%
    	  "           Read "   62059.39        62251.28      +0.3%
    	  "        Re-read "   62264.32        62256.82      -0.0%
    	  "   Reverse Read "   49970.96        50565.72      +1.2%
    	  "    Stride read "    8654.81         8638.45      -0.2%
    	  "    Random read "   13901.44        13949.91      +0.3%
    	  " Mixed workload "   19041.32        19092.04      +0.3%
    	  "   Random write "   14019.99        14161.72      +1.0%
    	  "         Pwrite "   64121.67        68224.17      +6.4%
    	  "          Pread "   62225.08        62274.28      +0.1%
    
    In summary, writes are unstable, while reads are pretty close on average:
    
    			  access pattern  2.6.20  on-demand   gain
    				   Read  62085.61  62196.38  +0.2%
    				Re-read  62253.49  62224.99  -0.0%
    			   Reverse Read  50001.21  50277.75  +0.6%
    			    Stride read   8656.21   8645.63  -0.1%
    			    Random read  13907.86  13924.07  +0.1%
    	 		 Mixed workload  19055.29  19062.68  +0.0%
    				  Pread  62217.53  62265.27  +0.1%
    
    aio-stress: roughly the same
    ============================
    aio-stress -l -s4096 -r128 -t1 -o1 knoppix511-dvd-cn.iso
    aio-stress -l -s4096 -r128 -t1 -o3 knoppix511-dvd-cn.iso
    
    					2.6.20      on-demand  delta
    			sequential	 92.57s      92.54s    -0.0%
    			random		311.87s     312.15s    +0.1%
    
    sysbench fileio: roughly the same
    =================================
    sysbench --test=fileio --file-io-mode=async --file-test-mode=rndrw \
    	 --file-total-size=4G --file-block-size=64K \
    	 --num-threads=001 --max-requests=10000 --max-time=900 run
    
    				threads    2.6.20   on-demand    delta
    		first run
    				      1   59.1974s    59.2262s  +0.0%
    				      2   58.0575s    58.2269s  +0.3%
    				      4   48.0545s    47.1164s  -2.0%
    				      8   41.0684s    41.2229s  +0.4%
    				     16   35.8817s    36.4448s  +1.6%
    				     32   32.6614s    32.8240s  +0.5%
    				     64   23.7601s    24.1481s  +1.6%
    				    128   24.3719s    23.8225s  -2.3%
    				    256   23.2366s    22.0488s  -5.1%
    
    		second run
    				      1   59.6720s    59.5671s  -0.2%
    				      8   41.5158s    41.9541s  +1.1%
    				     64   25.0200s    23.9634s  -4.2%
    				    256   22.5491s    20.9486s  -7.1%
    
    Note that the numbers are not very stable because of the writes.
    The overall performance is close when we sum all seconds up:
    
                    sum all up               495.046s    491.514s   -0.7%
    
    sysbench oltp (trans/sec): up to 8% gain
    ========================================
    sysbench --test=oltp --oltp-table-size=10000000 --oltp-read-only \
    	 --mysql-socket=/var/run/mysqld/mysqld.sock \
    	 --mysql-user=root --mysql-password=readahead \
    	 --num-threads=064 --max-requests=10000 --max-time=900 run
    
    	10000-transactions run
    				threads    2.6.20   on-demand    gain
    				      1     62.81       64.56   +2.8%
    				      2     67.97       70.93   +4.4%
    				      4     81.81       85.87   +5.0%
    				      8     94.60       97.89   +3.5%
    				     16     99.07      104.68   +5.7%
    				     32     95.93      104.28   +8.7%
    				     64     96.48      103.68   +7.5%
    	5000-transactions run
    				      1     48.21       48.65   +0.9%
    				      8     68.60       70.19   +2.3%
    				     64     70.57       74.72   +5.9%
    	2000-transactions run
    				      1     37.57       38.04   +1.3%
    				      2     38.43       38.99   +1.5%
    				      4     45.39       46.45   +2.3%
    				      8     51.64       52.36   +1.4%
    				     16     54.39       55.18   +1.5%
    				     32     52.13       54.49   +4.5%
    				     64     54.13       54.61   +0.9%
    
    These are interesting results. Some investigation shows that
    	- MySQL is accessing the db file non-uniformly: some parts are
    	  hotter than others
    	- It is mostly doing 4-page random reads, and sometimes doing two
    	  reads in a row, the latter of which triggers a 16-page readahead.
    	- The on-demand readahead leaves many lookahead pages (flagged
    	  PG_readahead) there. Many of them will be hit and trigger
    	  more readahead pages, which might save more seeks.
    	- Naturally, the readahead windows tend to lie in hot areas,
    	  and the lookahead pages in hot areas are more likely to be hit.
    	- The higher the overall read density, the larger the possible gain.
    
    That also explains the adaptive readahead tricks for clustered random reads.
    
    readahead thrashing: 3 times better
    ===================================
    We boot the kernel with "mem=128m single", and start one new 100KB/s stream
    every second, until reaching 200 streams.
    
    			      max throughput     min avg I/O size
    		2.6.20:            5MB/s               16KB
    		on-demand:        15MB/s              140KB
    
    Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
    Cc: Steven Pratt <slpratt@austin.ibm.com>
    Cc: Ram Pai <linuxram@us.ibm.com>
    Cc: Rusty Russell <rusty@rustcorp.com.au>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>