1. 13 Nov, 2009 1 commit
    • Currently, kernel uses strictly 512-byte sectors for EFI GPT parsing. · 58213974
      Karel Zak authored
      That's wrong.
      
      UEFI standard (version 2.3, May 2009, 5.3.1 GUID Format overview, page
      95) defines that LBA is always based on the logical block size. It
      means bdev_logical_block_size() (aka BLKSSZGET) for Linux.
      
      This patch removes static sector size from EFI GPT parser.
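
The arithmetic behind the problem can be sketched in a few lines of C. This is only an illustration: the helper names `gpt_lba_to_bytes()` and `gpt_primary_header_offset()` are hypothetical and are not taken from the patch. The sketch shows why a hard-coded 512-byte sector looks for the primary GPT header (LBA 1) at the wrong byte offset on a device with 4096-byte logical blocks.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: convert an LBA to a byte offset using the
 * device's logical block size (the value BLKSSZGET reports), rather
 * than a fixed 512.
 */
static uint64_t gpt_lba_to_bytes(uint64_t lba, uint32_t logical_block_size)
{
	assert(logical_block_size != 0);
	return lba * (uint64_t)logical_block_size;
}

/* The primary GPT header lives at LBA 1. */
static uint64_t gpt_primary_header_offset(uint32_t logical_block_size)
{
	return gpt_lba_to_bytes(1, logical_block_size);
}
```

With 4096-byte blocks the header sits at byte 4096, not 512, and LBA 6 maps to byte 24576, which is consistent with the 24.6kB start of the first partition in the parted output of this commit message.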
      
      The problem is reproducible with the latest GNU Parted:
      
       # modprobe scsi_debug dev_size_mb=50 sector_size=4096
      
        # ./parted /dev/sdb print
        Model: Linux scsi_debug (scsi)
        Disk /dev/sdb: 52.4MB
        Sector size (logical/physical): 4096B/4096B
        Partition Table: gpt
      
        Number  Start   End     Size    File system  Name     Flags
         1      24.6kB  3002kB  2978kB               primary
         2      3002kB  6001kB  2998kB               primary
         3      6001kB  9003kB  3002kB               primary
      
        # blockdev --rereadpt /dev/sdb
        # dmesg | tail -1
         sdb: unknown partition table      <---- !!!
      
      with this patch:
      
        # blockdev --rereadpt /dev/sdb
        # dmesg | tail -1
         sdb: sdb1 sdb2 sdb3
      Signed-off-by: Karel Zak <kzak@redhat.com>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  2. 12 Nov, 2009 1 commit
  3. 16 Oct, 2009 1 commit
  4. 09 Oct, 2009 1 commit
  5. 30 Sep, 2009 1 commit
  6. 24 Aug, 2009 1 commit
  7. 11 Nov, 2009 1 commit
  8. 29 Sep, 2009 1 commit
  9. 11 Nov, 2009 7 commits
  10. 13 Oct, 2009 1 commit
  11. 11 Nov, 2009 1 commit
    • Thanks to Roland who pointed out de_thread() issues. · 17ea1f8b
      Oleg Nesterov authored
      Currently we add sub-threads to ->real_parent->children list.  This buys
      nothing but slows down do_wait().
      
      With this patch ->children contains only main threads (group leaders). 
      The only complication is that forget_original_parent() should iterate over
      sub-threads by hand, and de_thread() needs another list_replace() when it
      changes ->group_leader.
      
      Henceforth do_wait_thread() can never see task_detached() && !EXIT_DEAD
      tasks, we can remove this check (and we can unify do_wait_thread() and
      ptrace_do_wait()).
      
      This change can confuse the optimistic search in mm_update_next_owner(),
      but this is fixable and minor.
      
      Perhaps badness() and oom_kill_process() should be updated, but they
      should be fixed in any case.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Roland McGrath <roland@redhat.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Ratan Nalumasu <rnalumasu@gmail.com>
      Cc: Vitaly Mayatskikh <vmayatsk@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  12. 30 Oct, 2009 1 commit
  13. 16 Oct, 2009 3 commits
  14. 11 Nov, 2009 1 commit
  15. 12 Nov, 2009 5 commits
    • Suggested by Roland. · 5c97a8fc
      Oleg Nesterov authored
      Unlike powerpc, x86 always calls tracehook_report_syscall_exit(step) with
      step = 0, and sends the trap by hand.
      
      This results in unnecessary SIGTRAP when PTRACE_SINGLESTEP follows the
      syscall-exit stop.
      
      Change syscall_trace_leave() to pass the correct "step" argument to
      tracehook and remove the send_sigtrap() logic.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Roland McGrath <roland@redhat.com>
      Cc: <linux-arch@vger.kernel.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • Suggested by Roland. · 7a5524be
      Oleg Nesterov authored
      Implement user_single_step_siginfo() for x86.  Extract this code from
      send_sigtrap().
      
      Since x86 calls tracehook_report_syscall_exit(step => 0) the new helper is
      not used yet.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Roland McGrath <roland@redhat.com>
      Cc: <linux-arch@vger.kernel.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • Suggested by Roland. · d5dda5fa
      Oleg Nesterov authored
      Change tracehook_report_syscall_exit() to look at step flag and send the
      trap signal if needed.
      
      This change affects ia64, microblaze, parisc, powerpc, sh.  They pass a
      nonzero "step" argument to tracehook, but since it was ignored the tracee
      reports via ptrace_notify().  This is not right and not consistent:
      
      	- PTRACE_SETSIGINFO doesn't work
      
      	- if the tracer resumes the tracee with signr != 0 the new signal
      	  is generated rather than delivering it
      
      	- If PT_TRACESYSGOOD is set the tracee reports the wrong exit_code
      
      I don't have a powerpc machine, but I think this test-case should see the
      difference:
      
      	#include <unistd.h>
      	#include <signal.h>
      	#include <sys/ptrace.h>
      	#include <sys/wait.h>
      	#include <assert.h>
      	#include <stdio.h>
      
      	int main(void)
      	{
      		int pid, status;
      
      		if (!(pid = fork())) {
      			/* tracee: stop so the parent can set options */
      			assert(ptrace(PTRACE_TRACEME) == 0);
      			kill(getpid(), SIGSTOP);
      
      			getppid();
      
      			return 0;
      		}
      
      		assert(pid == wait(&status));
      		assert(ptrace(PTRACE_SETOPTIONS, pid, 0, PTRACE_O_TRACESYSGOOD) == 0);
      
      		/* run the tracee to its next syscall stop */
      		assert(ptrace(PTRACE_SYSCALL, pid, 0,0) == 0);
      		assert(pid == wait(&status));
      
      		/* single-step from that stop */
      		assert(ptrace(PTRACE_SINGLESTEP, pid, 0,0) == 0);
      		assert(pid == wait(&status));
      
      		/* expect a plain SIGTRAP stop, without the 0x80 bit */
      		if (status == 0x57F)
      			return 0;
      
      		printf("kernel bug: status=%X shouldn't have 0x80\n", status);
      		return 1;
      	}
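
The test-case compares the wait status against 0x57F. As a reading aid, here is a minimal sketch of how such a status decomposes; the enum and the function name `classify_stop()` are hypothetical and exist only for this illustration. A stopped tracee reports 0x7F in the low byte and the stop signal in the next byte. With PTRACE_O_TRACESYSGOOD set, a syscall stop reports SIGTRAP | 0x80 as the stop signal, whereas a single-step trap should report plain SIGTRAP.

```c
#include <signal.h>
#include <sys/wait.h>

/* Hypothetical classification of a stop reported by wait(). */
enum stop_kind {
	STOP_NONE,		/* not a stop at all (exit, kill by signal, ...) */
	STOP_PLAIN_SIGTRAP,	/* e.g. status 0x57F */
	STOP_SYSCALL,		/* e.g. status 0x857F with TRACESYSGOOD */
	STOP_OTHER		/* stopped by some other signal */
};

static enum stop_kind classify_stop(int status)
{
	if (!WIFSTOPPED(status))
		return STOP_NONE;
	if (WSTOPSIG(status) == SIGTRAP)
		return STOP_PLAIN_SIGTRAP;
	if (WSTOPSIG(status) == (SIGTRAP | 0x80))
		return STOP_SYSCALL;
	return STOP_OTHER;
}
```

Under this reading, the "kernel bug" branch of the test-case corresponds to the STOP_SYSCALL case being reported for a single-step.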
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Roland McGrath <roland@redhat.com>
      Cc: <linux-arch@vger.kernel.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • Suggested by Roland. · 666f20b2
      Oleg Nesterov authored
      Implement user_single_step_siginfo() for powerpc.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Roland McGrath <roland@redhat.com>
      Cc: <linux-arch@vger.kernel.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • Suggested by Roland. · 43e0ae02
      Oleg Nesterov authored
      Currently there is no way to synthesize a single-stepping trap in an
      arch-independent manner.  This patch adds the default helper which fills
      siginfo_t; arch/ can override it.
      
      Architectures which implement user_enable_single_step() should add
      user_single_step_siginfo() also.
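
As a rough userspace illustration of what "a default helper which fills siginfo_t" could look like, consider the sketch below. The function name `sketch_single_step_siginfo()` is hypothetical, and the body is an assumption about the general shape of such a helper, not a copy of the kernel code: it zeroes the structure and marks it as SIGTRAP, leaving any further fields to an architecture-specific version.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/*
 * Hypothetical stand-in for a generic default: clear the siginfo and
 * set only the signal number. An architecture override could fill in
 * more detail (for example si_code or si_addr).
 */
static void sketch_single_step_siginfo(siginfo_t *info)
{
	memset(info, 0, sizeof(*info));
	info->si_signo = SIGTRAP;
}
```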
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Roland McGrath <roland@redhat.com>
      Cc: <linux-arch@vger.kernel.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  16. 11 Nov, 2009 1 commit
    • If the tracee calls fork() after PTRACE_SINGLESTEP, the forked child · 17c18328
      Oleg Nesterov authored
      starts with TIF_SINGLESTEP/X86_EFLAGS_TF bits copied from ptraced parent. 
      This is not right, especially when the new child is not auto-attached: in
      this case it is killed by SIGTRAP.
      
      Change copy_process() to call user_disable_single_step(). Tested on x86.
      
      Test-case:
      
      	#include <stdio.h>
      	#include <unistd.h>
      	#include <signal.h>
      	#include <sys/ptrace.h>
      	#include <sys/wait.h>
      	#include <assert.h>
      
      	int main(void)
      	{
      		int pid, status;
      
      		if (!(pid = fork())) {
      			assert(ptrace(PTRACE_TRACEME) == 0);
      			kill(getpid(), SIGSTOP);
      
      			if (!fork()) {
      				/* kernel bug: this child will be killed by SIGTRAP */
      				printf("Hello world\n");
      				return 43;
      			}
      
      			wait(&status);
      			return WEXITSTATUS(status);
      		}
      
      		for (;;) {
      			assert(pid == wait(&status));
      			if (WIFEXITED(status))
      				break;
      			assert(ptrace(PTRACE_SINGLESTEP, pid, 0,0) == 0);
      		}
      
      		assert(WEXITSTATUS(status) == 43);
      		return 0;
      	}
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Roland McGrath <roland@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  17. 30 Oct, 2009 1 commit
  18. 12 Nov, 2009 3 commits
  19. 11 Nov, 2009 1 commit
  20. 16 Oct, 2009 1 commit
  21. 10 Oct, 2009 2 commits
    • tweak comments · 8dd9b428
      Andrew Morton authored
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • This is a patch for coalescing access to res_counter at charging by percpu · a2a2551f
      KAMEZAWA Hiroyuki authored
      caching.  At charge, memcg charges 64 pages and remembers them in a percpu
      cache.  Because it is a cache, it is drained/flushed when necessary.
      
      This version uses the public percpu area, which has two benefits:
       1. The sum of stocked charges in the system is bounded by the number of
          CPUs, not by the number of memcgs. This gives better synchronization.
       2. The drain code for flush/cpuhotplug is very easy (and quick).
      
      The most important point of this patch is that we never touch res_counter
      in the fast path. The res_counter is a system-wide shared counter which is
      modified very frequently. We should avoid touching it as far as possible
      to avoid false sharing.
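
The idea of stocking charges can be shown with a toy, single-threaded model. All names below (`stock_counter`, `cpu_stock`, `charge_one_page()`, `drain_stock()`) are hypothetical stand-ins and not the kernel's structures; the `touches` field exists only to count how often the shared counter is accessed. The model takes 64 pages from the shared counter at once and then serves single-page charges from the local stock.

```c
#include <stdbool.h>

#define CHARGE_BATCH 64	/* pages taken from the shared counter at once */

/* Toy shared counter; "touches" counts accesses for illustration only. */
struct stock_counter {
	unsigned long usage;
	unsigned long limit;
	unsigned long touches;
};

/* Toy per-CPU stock of pre-charged pages. */
struct cpu_stock {
	unsigned long cached;
};

/* Slow path: every call here touches the shared counter. */
static bool counter_charge(struct stock_counter *c, unsigned long pages)
{
	c->touches++;
	if (c->usage + pages > c->limit)
		return false;
	c->usage += pages;
	return true;
}

/* Charge one page, refilling the local stock only when it is empty. */
static bool charge_one_page(struct stock_counter *c, struct cpu_stock *s)
{
	if (s->cached == 0) {
		if (!counter_charge(c, CHARGE_BATCH))
			return counter_charge(c, 1);	/* no room for a full batch */
		s->cached = CHARGE_BATCH;
	}
	s->cached--;	/* fast path: no access to the shared counter */
	return true;
}

/* Drain: hand the unused stock back to the shared counter. */
static void drain_stock(struct stock_counter *c, struct cpu_stock *s)
{
	c->touches++;
	c->usage -= s->cached;
	s->cached = 0;
}
```

In this model, 128 single-page charges touch the shared counter twice instead of 128 times. The price is that the counter temporarily overstates usage by whatever sits in the stocks, which is why a drain step is needed.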
      
      On an x86-64 8-CPU server, I measured the overhead of memcg at page fault
      by running a program which does map/fault/unmap in a loop. One task was
      run per CPU, pinned with taskset, and the page faults over 60 seconds
      were summed.
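
The benchmark program itself is not included in this commit message. A loop of the kind described might look like the sketch below; the function name `map_fault_unmap()` and its parameters are assumptions made for illustration. Each iteration maps an anonymous region, writes one byte to every page so that each page is faulted in, and unmaps the region again.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map, touch and unmap `pages` pages, `iterations` times.
 * Returns the number of pages touched, or -1 on error.
 */
static long map_fault_unmap(int iterations, size_t pages)
{
	long page_size = sysconf(_SC_PAGESIZE);
	long touched = 0;

	if (page_size <= 0)
		return -1;

	for (int i = 0; i < iterations; i++) {
		size_t len = pages * (size_t)page_size;
		volatile char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
					MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		if (p == MAP_FAILED)
			return -1;

		/* The first write to each fresh page triggers a page fault. */
		for (size_t off = 0; off < len; off += (size_t)page_size) {
			p[off] = 1;
			touched++;
		}

		if (munmap((void *)p, len) != 0)
			return -1;
	}
	return touched;
}
```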
      
      [without memcg config]
        40156968  page-faults              #      0.085 M/sec   ( +-   0.046% )
        27.67 cache-miss/faults
      
      [root cgroup]
        36659599  page-faults              #      0.077 M/sec   ( +-   0.247% )
        31.58 cache miss/faults
      
      [in a child cgroup]
        18444157  page-faults              #      0.039 M/sec   ( +-   0.133% )
        69.96 cache miss/faults
      
      [ + coalescing uncharge patch]
        27133719  page-faults              #      0.057 M/sec   ( +-   0.155% )
        47.16 cache miss/faults
      
      [ + coalescing uncharge patch + this patch ]
        34224709  page-faults              #      0.072 M/sec   ( +-   0.173% )
        34.69 cache miss/faults
      
      Changelog (since Oct/2):
        - updated comments
        - replaced get_cpu_var() with __get_cpu_var() if possible.
        - removed mutex for system-wide drain. adds a counter instead of it.
        - removed CONFIG_HOTPLUG_CPU
      
      Changelog (old):
        - rebased onto the latest mmotm
        - moved charge size check before __GFP_WAIT check for avoiding unnecessary
        - added asynchronous flush routine.
        - fixed bugs pointed out by Nishimura-san.
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  22. 11 Nov, 2009 1 commit
    • In a massively parallel environment, res_counter can be a performance · 92501285
      KAMEZAWA Hiroyuki authored
      bottleneck.  One strong technique to reduce lock contention is reducing
      calls by coalescing some amount of calls into one.
      
      Considering charge/uncharge characteristics,
      	- charge is done one by one via demand-paging.
      	- uncharge is done either
      		- in chunks at munmap, truncate, exit, execve...
      		- one by one via vmscan/paging.
      
      It seems we have a chance to coalesce uncharges for improving scalability
      at unmap/truncation.
      
      This patch is for coalescing uncharges.  To avoid scattering memcg's
      structure to functions under /mm, this patch adds memcg batch uncharge
      information to the task.  The reason for per-task batching is to make use
      of the caller's context information.  We do batched uncharge (delayed
      uncharge) when truncation/unmap occurs but do direct uncharge when
      uncharge is called by memory reclaim (vmscan.c).
      
      The degree of coalescing depends on callers
        - at invalidate/truncate... pagevec size
        - at unmap ....ZAP_BLOCK_SIZE
      (memory itself will be freed in this degree.)
      So we will not coalesce too much.
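
The batching scheme described above can be modelled in a few lines. This is a toy, single-threaded sketch: `batch_counter`, `uncharge_batch` and the functions below are hypothetical names (only the `do_batch` flag is mentioned in this commit message), and the model counts pages rather than bytes for simplicity. While a batch is open, uncharges are only accumulated in per-task state; the shared counter is updated once when the batch ends. Outside a batch, as in the memory reclaim case, each uncharge goes directly to the counter.

```c
#include <stdbool.h>

/* Toy shared counter; "touches" counts accesses for illustration only. */
struct batch_counter {
	unsigned long usage;
	unsigned long touches;
};

/* Hypothetical per-task batch state. */
struct uncharge_batch {
	bool do_batch;		/* true while the caller is inside a batch */
	unsigned long pages;	/* uncharges accumulated so far */
};

static void counter_uncharge(struct batch_counter *c, unsigned long pages)
{
	c->touches++;
	c->usage -= pages;
}

/* Called by a bulk path (e.g. unmap or truncate) before it starts. */
static void batch_start(struct uncharge_batch *b)
{
	b->do_batch = true;
	b->pages = 0;
}

/* Uncharge one page: delayed inside a batch, direct otherwise. */
static void uncharge_page(struct batch_counter *c, struct uncharge_batch *b)
{
	if (b->do_batch)
		b->pages++;
	else
		counter_uncharge(c, 1);
}

/* Flush everything accumulated with a single counter update. */
static void batch_end(struct batch_counter *c, struct uncharge_batch *b)
{
	if (b->pages)
		counter_uncharge(c, b->pages);
	b->do_batch = false;
	b->pages = 0;
}
```

Because the batch is bounded by what the caller frees in one go, the amount of delayed uncharge stays small, which matches the remark above about not coalescing too much.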
      
      On an x86-64 8-CPU server, I measured the overhead of memcg at page fault
      by running a program which does map/fault/unmap in a loop. One task was
      run per CPU, pinned with taskset, and the page faults over 60 seconds
      were summed.
      
      [without memcg config]
        40156968  page-faults              #      0.085 M/sec   ( +-   0.046% )
        27.67 cache-miss/faults
      [root cgroup]
        36659599  page-faults              #      0.077 M/sec   ( +-   0.247% )
        31.58 miss/faults
      [in a child cgroup]
        18444157  page-faults              #      0.039 M/sec   ( +-   0.133% )
        69.96 miss/faults
      [child with this patch]
        27133719  page-faults              #      0.057 M/sec   ( +-   0.155% )
        47.16 miss/faults
      
      We can see some amount of improvement.
      (The root cgroup isn't affected by this patch.)
      Another patch for "charge" will follow this one and improve the numbers above further.
      
      Changelog(since 2009/10/02):
       - renamed fields of memcg_batch (pages to bytes, memsw to memsw_bytes)
       - some clean up and commentary/description updates.
       - added initialize code to copy_process(). (possible bug fix)
      
      Changelog(old):
       - fixed !CONFIG_MEM_CGROUP case.
       - rebased onto the latest mmotm + softlimit fix patches.
       - unified patch for callers
       - added comments.
       - make ->do_batch as bool.
       - removed css_get() et al. We don't need it.
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  23. 10 Nov, 2009 1 commit
  24. 03 Nov, 2009 1 commit
  25. 25 Sep, 2009 1 commit