  1. 30 Oct, 2009 2 commits
    • Cc: Jeff Moyer <jmoyer@redhat.com> · f0a477b3
      Andrew Morton authored
      Cc: Zach Brown <zach.brown@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • Intel reported a performance regression caused by the following commit: · b7f890ec
      Jeff Moyer authored
      commit 848c4dd5
      Author: Zach Brown <zach.brown@oracle.com>
      Date:   Mon Aug 20 17:12:01 2007 -0700
      
          dio: zero struct dio with kzalloc instead of manually
      
          This patch uses kzalloc to zero all of struct dio rather than
          manually trying to track which fields we rely on being zero.  It
          passed aio+dio stress testing and some bug regression testing on
          ext3.
      
          This patch was introduced by Linus in the conversation that led up
          to Badari's minimal fix to manually zero .map_bh.b_state in commit:
      
            6a648fa7
      
          It makes the code a bit smaller.  Maybe a couple fewer cachelines to
          load, if we're lucky:
      
             text    data     bss     dec     hex filename
          3285925  568506 1304616 5159047  4eb887 vmlinux
          3285797  568506 1304616 5158919  4eb807 vmlinux.patched
      
          I was unable to measure a stable difference in the number of cpu
          cycles spent in blockdev_direct_IO() when pushing aio+dio 256K reads
          at ~340MB/s.
      
          So the resulting intent of the patch isn't a performance gain but to
          avoid exposing ourselves to the risk of finding another field like
          .map_bh.b_state where we rely on zeroing but don't enforce it in the
          code.
      
      Zach surmised that zeroing out the page array was what caused most of
      the problem, and suggested the approach taken in the attached patch for
      resolving the issue.  Intel re-tested with this patch and saw a 0.6%
      performance gain (the original regression was 0.5%).
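      One way to avoid zeroing a large trailing array is to allocate without
      zeroing and clear only the bytes that precede it. The userspace sketch
      below illustrates that idea under stated assumptions: struct dio_sketch,
      its fields, DIO_SKETCH_PAGES, and both helper functions are hypothetical
      stand-ins, not the kernel's struct dio, and the array is assumed to be
      the last member so that offsetof() marks where zeroing can stop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define DIO_SKETCH_PAGES 64	/* hypothetical array size */

/* Hypothetical stand-in for struct dio: small state first, big array last. */
struct dio_sketch {
	unsigned long flags;		/* must start out as zero */
	int head;			/* next entry in pages[] to consume */
	int tail;			/* one past the last valid entry */
	void *pages[DIO_SKETCH_PAGES];	/* only [head, tail) is ever read */
};

/* Allocate without zeroing, then clear only the bytes before pages[]. */
static struct dio_sketch *dio_sketch_alloc(void)
{
	struct dio_sketch *d = malloc(sizeof(*d));

	if (!d)
		return NULL;
	memset(d, 0, offsetof(struct dio_sketch, pages));
	return d;
}

/* Hand out the next queued page; entries outside [head, tail) are never read. */
static void *dio_sketch_next_page(struct dio_sketch *d)
{
	assert(d->head < d->tail);
	return d->pages[d->head++];
}
```

      Because head and tail both start at zero, no slot of pages[] is read
      before it has been written, which is why leaving the array
      uninitialized is safe in this sketch.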
      Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
      Acked-by: Zach Brown <zach.brown@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  2. 13 Nov, 2009 2 commits
  3. 14 Oct, 2009 1 commit
  4. 11 Nov, 2009 1 commit
  5. 30 Sep, 2009 3 commits
  6. 13 Nov, 2009 2 commits
    • The size of the EFI GPT header is not static, but a whole sector is allocated for the header. · dbd508a2
      Karel Zak authored
      The HeaderSize field must be greater than 92 (= sizeof(struct gpt_header))
      and must be less than or equal to the logical block size.
      
      This means we have to read the whole sector containing the header,
      because the header crc32 checksum is calculated over HeaderSize bytes.
      
      For more details see the UEFI standard (version 2.3, May 2009):
        - 5.3.1 GUID Format overview, page 93
        - Table 13. GUID Partition Table Header, page 96
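      The check described above can be sketched in userspace C. This is an
      illustration, not the kernel code: crc32_sketch(), get_le32(), and
      gpt_header_ok() are hypothetical names, the 4096-byte scratch buffer is
      a limitation of the sketch, and the lower bound is read as "at least
      92" on the assumption that a 92-byte header is valid. The field offsets
      (HeaderSize at byte 12, HeaderCRC32 at byte 16) and the rule that the
      CRC field is treated as zero while the checksum is computed follow the
      GPT header layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define GPT_MIN_HEADER_SIZE 92u	/* size of the fixed header fields */
#define GPT_HEADER_SIZE_OFF 12u	/* byte offset of HeaderSize */
#define GPT_HEADER_CRC_OFF  16u	/* byte offset of HeaderCRC32 */

/* Bitwise CRC-32 with the reflected polynomial 0xEDB88320. */
static uint32_t crc32_sketch(const uint8_t *buf, size_t len)
{
	uint32_t crc = 0xFFFFFFFFu;

	for (size_t i = 0; i < len; i++) {
		crc ^= buf[i];
		for (int bit = 0; bit < 8; bit++)
			crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
	}
	return ~crc;
}

/* Read a little-endian 32-bit value regardless of host byte order. */
static uint32_t get_le32(const uint8_t *p)
{
	return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
	       (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

/*
 * sector: one whole logical block read from the header LBA.
 * Returns 1 if HeaderSize is plausible and the CRC matches, else 0.
 */
static int gpt_header_ok(const uint8_t *sector, size_t block_size)
{
	uint8_t copy[4096];
	uint32_t size, stored;

	assert(sector != NULL);
	if (block_size < GPT_MIN_HEADER_SIZE || block_size > sizeof(copy))
		return 0;	/* upper limit is a sketch limitation only */

	size = get_le32(sector + GPT_HEADER_SIZE_OFF);
	if (size < GPT_MIN_HEADER_SIZE || size > block_size)
		return 0;

	stored = get_le32(sector + GPT_HEADER_CRC_OFF);
	memcpy(copy, sector, size);
	memset(copy + GPT_HEADER_CRC_OFF, 0, 4);	/* CRC field counts as zero */
	return crc32_sketch(copy, size) == stored;
}
```

      The checksum covers HeaderSize bytes rather than a fixed 92, so a
      reader that fetched only sizeof(struct gpt_header) bytes could not
      verify a header that declares a larger size.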
      Signed-off-by: Karel Zak <kzak@redhat.com>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • Currently, the kernel uses strictly 512-byte sectors for EFI GPT parsing. · 58213974
      Karel Zak authored
      That's wrong.
      
      The UEFI standard (version 2.3, May 2009, 5.3.1 GUID Format overview,
      page 95) defines LBAs as always being based on the logical block size.
      On Linux that means bdev_logical_block_size() (aka BLKSSZGET).
      
      This patch removes the static sector size from the EFI GPT parser.
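      To make the consequence concrete: the primary GPT header sits at LBA 1,
      so its byte offset depends on the logical block size. A minimal sketch,
      where lba_to_byte_offset() and GPT_PRIMARY_HEADER_LBA are hypothetical
      names introduced only for illustration:

```c
#include <assert.h>
#include <stdint.h>

#define GPT_PRIMARY_HEADER_LBA 1u	/* the primary GPT header lives at LBA 1 */

/* Byte offset of an LBA on a device with the given logical block size. */
static uint64_t lba_to_byte_offset(uint64_t lba, uint32_t logical_block_size)
{
	assert(logical_block_size != 0);
	return lba * (uint64_t)logical_block_size;
}
```

      With a hard-coded 512-byte sector the header would be looked up at byte
      512, while on a 4096-byte device it actually starts at byte 4096.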
      
      The problem is reproducible with the latest GNU Parted:
      
        # modprobe scsi_debug dev_size_mb=50 sector_size=4096
      
        # ./parted /dev/sdb print
        Model: Linux scsi_debug (scsi)
        Disk /dev/sdb: 52.4MB
        Sector size (logical/physical): 4096B/4096B
        Partition Table: gpt
      
        Number  Start   End     Size    File system  Name     Flags
         1      24.6kB  3002kB  2978kB               primary
         2      3002kB  6001kB  2998kB               primary
         3      6001kB  9003kB  3002kB               primary
      
        # blockdev --rereadpt /dev/sdb
        # dmesg | tail -1
         sdb: unknown partition table      <---- !!!
      
      With this patch:
      
        # blockdev --rereadpt /dev/sdb
        # dmesg | tail -1
         sdb: sdb1 sdb2 sdb3
      Signed-off-by: Karel Zak <kzak@redhat.com>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  7. 12 Nov, 2009 1 commit
  8. 16 Oct, 2009 1 commit
  9. 09 Oct, 2009 1 commit
  10. 30 Sep, 2009 1 commit
  11. 24 Aug, 2009 1 commit
  12. 11 Nov, 2009 1 commit
  13. 29 Sep, 2009 1 commit
  14. 11 Nov, 2009 7 commits
  15. 13 Oct, 2009 1 commit
  16. 11 Nov, 2009 1 commit
    • Thanks to Roland who pointed out de_thread() issues. · 17ea1f8b
      Oleg Nesterov authored
      Currently we add sub-threads to ->real_parent->children list.  This buys
      nothing but slows down do_wait().
      
      With this patch ->children contains only main threads (group leaders). 
      The only complication is that forget_original_parent() should iterate over
      sub-threads by hand, and de_thread() needs another list_replace() when it
      changes ->group_leader.
      
      Henceforth do_wait_thread() can never see task_detached() && !EXIT_DEAD
      tasks, so we can remove this check (and unify do_wait_thread() and
      ptrace_do_wait()).
      
      This change can confuse the optimistic search in mm_update_next_owner(),
      but this is fixable and minor.
      
      Perhaps badness() and oom_kill_process() should be updated, but they
      should be fixed in any case.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Roland McGrath <roland@redhat.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Ratan Nalumasu <rnalumasu@gmail.com>
      Cc: Vitaly Mayatskikh <vmayatsk@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  17. 30 Oct, 2009 1 commit
  18. 16 Oct, 2009 3 commits
  19. 11 Nov, 2009 1 commit
  20. 12 Nov, 2009 5 commits
    • Suggested by Roland. · 5c97a8fc
      Oleg Nesterov authored
      Unlike powerpc, x86 always calls tracehook_report_syscall_exit(step) with
      step = 0, and sends the trap by hand.
      
      This results in an unnecessary SIGTRAP when PTRACE_SINGLESTEP follows the
      syscall-exit stop.
      
      Change syscall_trace_leave() to pass the correct "step" argument to
      tracehook and remove the send_sigtrap() logic.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Roland McGrath <roland@redhat.com>
      Cc: <linux-arch@vger.kernel.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • Suggested by Roland. · 7a5524be
      Oleg Nesterov authored
      Implement user_single_step_siginfo() for x86.  Extract this code from
      send_sigtrap().
      
      Since x86 calls tracehook_report_syscall_exit(step => 0) the new helper is
      not used yet.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Roland McGrath <roland@redhat.com>
      Cc: <linux-arch@vger.kernel.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • Suggested by Roland. · d5dda5fa
      Oleg Nesterov authored
      Change tracehook_report_syscall_exit() to look at the step flag and send
      the trap signal if needed.
      
      This change affects ia64, microblaze, parisc, powerpc, and sh.  They pass
      a nonzero "step" argument to tracehook, but since it was ignored the
      tracee reports via ptrace_notify().  This is not right and not consistent:
      
      	- PTRACE_SETSIGINFO doesn't work
      
      	- if the tracer resumes the tracee with signr != 0, a new signal
      	  is generated instead of signr being delivered
      
      	- if PT_TRACESYSGOOD is set, the tracee reports the wrong exit_code
      
      I don't have a powerpc machine, but I think this test-case should see the
      difference:
      
      	#include <unistd.h>
      	#include <signal.h>
      	#include <sys/ptrace.h>
      	#include <sys/wait.h>
      	#include <assert.h>
      	#include <stdio.h>
      
      	int main(void)
      	{
      		int pid, status;
      
      		if (!(pid = fork())) {
      			assert(ptrace(PTRACE_TRACEME) == 0);
      			kill(getpid(), SIGSTOP);
      
      			getppid();
      
      			return 0;
      		}
      
      		assert(pid == wait(&status));
      		assert(ptrace(PTRACE_SETOPTIONS, pid, 0, PTRACE_O_TRACESYSGOOD) == 0);
      
      		assert(ptrace(PTRACE_SYSCALL, pid, 0,0) == 0);
      		assert(pid == wait(&status));
      
      		assert(ptrace(PTRACE_SINGLESTEP, pid, 0,0) == 0);
      		assert(pid == wait(&status));
      
      		if (status == 0x57F)
      			return 0;
      
      		printf("kernel bug: status=%X shouldn't have 0x80\n", status);
      		return 1;
      	}
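      The constant checked in the test-case is a packed wait status: the low
      byte 0x7F marks a stop and the next byte holds the stop signal, so 0x57F
      is a plain SIGTRAP stop (SIGTRAP is 5 on Linux), whereas 0x857F would
      carry the extra 0x80 bit that PTRACE_O_TRACESYSGOOD adds to syscall
      stops. A small sketch that decodes this with the standard <sys/wait.h>
      macros; stop_has_sysgood_bit() is a hypothetical helper name and the
      status layout assumed is the Linux one:

```c
#include <assert.h>
#include <signal.h>
#include <sys/wait.h>

/*
 * Returns 1 if the wait status is a stop whose signal number has the
 * 0x80 bit set, 0 for any other stop, and -1 if it is not a stop at all.
 */
static int stop_has_sysgood_bit(int status)
{
	if (!WIFSTOPPED(status))
		return -1;
	return (WSTOPSIG(status) & 0x80) ? 1 : 0;
}
```

      In terms of this helper, the test-case expects the single-step stop to
      yield 0, and treats a result of 1 as the kernel bug.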
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Roland McGrath <roland@redhat.com>
      Cc: <linux-arch@vger.kernel.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • Suggested by Roland. · 666f20b2
      Oleg Nesterov authored
      Implement user_single_step_siginfo() for powerpc.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Roland McGrath <roland@redhat.com>
      Cc: <linux-arch@vger.kernel.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • Suggested by Roland. · 43e0ae02
      Oleg Nesterov authored
      Currently there is no way to synthesize a single-stepping trap in an
      arch-independent manner.  This patch adds a default helper which fills
      siginfo_t; arch/ code can override it.
      
      Architectures which implement user_enable_single_step() should add
      user_single_step_siginfo() also.
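      The commit text does not show the helper's body, so the following is
      only a userspace analogue of what a default "fill siginfo_t for a
      single-step trap" helper could look like. single_step_siginfo_sketch()
      is a hypothetical name, and clearing the structure before setting
      si_signo to SIGTRAP is an assumption about a reasonable default, not a
      quote of the kernel code:

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <signal.h>
#include <string.h>

/*
 * Hypothetical default: clear every field, then mark the signal as
 * SIGTRAP. An architecture-specific version could fill in more detail.
 */
static void single_step_siginfo_sketch(siginfo_t *info)
{
	assert(info != NULL);
	memset(info, 0, sizeof(*info));
	info->si_signo = SIGTRAP;
}
```

      Starting from an all-zero structure keeps stale stack contents from
      reaching the tracer through fields the default does not set.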
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Roland McGrath <roland@redhat.com>
      Cc: <linux-arch@vger.kernel.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  21. 11 Nov, 2009 1 commit
    • If the tracee calls fork() after PTRACE_SINGLESTEP, the forked child starts with TIF_SINGLESTEP/X86_EFLAGS_TF bits copied from the ptraced parent. · 17c18328
      Oleg Nesterov authored
      This is not right, especially when the new child is not auto-attached: in
      this case it is killed by SIGTRAP.
      
      Change copy_process() to call user_disable_single_step(). Tested on x86.
      
      Test-case:
      
      	#include <stdio.h>
      	#include <unistd.h>
      	#include <signal.h>
      	#include <sys/ptrace.h>
      	#include <sys/wait.h>
      	#include <assert.h>
      
      	int main(void)
      	{
      		int pid, status;
      
      		if (!(pid = fork())) {
      			assert(ptrace(PTRACE_TRACEME) == 0);
      			kill(getpid(), SIGSTOP);
      
      			if (!fork()) {
      				/* kernel bug: this child will be killed by SIGTRAP */
      				printf("Hello world\n");
      				return 43;
      			}
      
      			wait(&status);
      			return WEXITSTATUS(status);
      		}
      
      		for (;;) {
      			assert(pid == wait(&status));
      			if (WIFEXITED(status))
      				break;
      			assert(ptrace(PTRACE_SINGLESTEP, pid, 0,0) == 0);
      		}
      
      		assert(WEXITSTATUS(status) == 43);
      		return 0;
      	}
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Roland McGrath <roland@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  22. 30 Oct, 2009 1 commit
  23. 12 Nov, 2009 1 commit
    • memcg_tasklist was introduced at commit 7f4d454d (memcg: avoid deadlock caused by race between oom and cpuset_attach) instead of cgroup_mutex to fix a deadlock problem. · 8c23762d
      Daisuke Nishimura authored
      The cgroup_mutex in mem_cgroup_out_of_memory(), which was removed by that
      commit, was originally introduced at commit c7ba5c9e (Memory controller:
      OOM handling).
      
      IIUC, the intention of this cgroup_mutex was to prevent task moves during
      select_bad_process() so that situations like the ones below can be avoided.
      
        Assume cgroup "foo" has exceeded its limit and is about to trigger oom.
        1. Process A, which has been in cgroup "baa" and uses a lot of memory, has
           just been moved to cgroup "foo". Process A can become a candidate for
           being killed.
        2. Process B, which has been in cgroup "foo" and uses a lot of memory, has
           just been moved out of cgroup "foo". Process B can be excluded from the
           candidates for being killed.
      
      But these race windows exist anyway even if we hold a lock, because
      __mem_cgroup_try_charge() decides whether it should trigger oom or not
      outside of the lock.  So the original cgroup_mutex in
      mem_cgroup_out_of_memory() and thus the current memcg_tasklist have no use.
      And IMHO, those races are not so critical for users.
      
      This patch removes it and makes the code simpler.
      Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>