- 01 Jul, 2009 2 commits
-
-
Andrew Morton authored
Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Amerigo Wang <amwang@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Roland McGrath <roland@redhat.com> Cc: WANG Cong <amwang@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Amerigo Wang authored
to do initializations, also fix the potential memory leaks. Signed-off-by: WANG Cong <amwang@redhat.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: David Howells <dhowells@redhat.com> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
- 23 Jul, 2009 1 commit
-
-
Lai Jiangshan authored
the end. It has two problems: 1) The recovery of the current tasks's cpus_allowed will fail under some conditions. # grep Cpus_allowed_list /proc/$$/status Cpus_allowed_list: 0-3 # taskset -pc 2 $$ pid 29075's current affinity list: 0-3 pid 29075's new affinity list: 2 # grep Cpus_allowed_list /proc/$$/status Cpus_allowed_list: 2 # echo 0 > /sys/devices/system/cpu/cpu2/online # grep Cpus_allowed_list /proc/$$/status Cpus_allowed_list: 0 Here, the Cpus_allowed_list was originally "2" and has become "0-1,3" after cpu #2 is offlined. This "Cpus_allowed_list: 0" is incorrect. 2) If the current task is a userspace task, the user may change its cpu-affinity during the CPU hot-unplugging. This change can be overwritten when _cpu_down() changes the current task's affinity. Fix all this by not changing the current tasks's affinity. Instead we create a kernel thread to do the work. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
- 28 Jul, 2009 1 commit
-
-
Oleg Nesterov authored
search_binary_handler() does try_module_get(). In this case set_binfmt()->try_module_get() fails but since none of the callers check the returned error, the task will run with the wrong old ->binfmt. The proper fix should change all ->load_binary() methods, but we can rely on fact that the caller must hold a reference to binfmt->module and use __module_get() which never fails. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
- 22 Jul, 2009 2 commits
-
-
Neil Horman authored
One of the things that user space processes like to do is look at metadata for a crashing process in their /proc/<pid> directory. this is racy however, since do_coredump in the kernel doesn't wait for the user space process to complete before it reaps the crashing process. This patch corrects that. Allowing the kernel to wait for the user space process to complete before cleaning up the crashing process. This is a bit tricky to do for a few reasons: 1) The user space process isn't our child, so we can't sys_wait4 on it 2) We need to close the pipe before waiting for the user process to complete, since the user process may rely on an EOF condition I've discussed several solutions with Oleg Nesterov off-list about this, and this is the one we've come up with. We add ourselves as a pipe reader (to prevent premature cleanup of the pipe_inode_info), and remove ourselves as a writer (to provide an EOF condition to the writer in user space), then we iterate until the user space process exits (which we detect by pipe->readers == 1, hence the > 1 check in the loop). When we exit the loop, we restore the proper reader/writer values, then we return and let filp_close in do_coredump clean up the pipe data properly. Signed-off-by: Neil Horman <nhorman@tuxdriver.com> Reported-by: Earl Chew <earl_chew@agilent.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Andi Kleen <andi@firstfloor.org> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Andrew Morton authored
#115: FILE: fs/exec.c:1838: + ^I^Iif (call_usermodehelper_pipe(helper_argv[0], helper_argv, NULL,$ ERROR: code indent should use tabs where possible #120: FILE: fs/exec.c:1842: + ^I^I^Igoto fail_dropcount;$ WARNING: externs should be avoided in .c files #149: FILE: kernel/sysctl.c:80: +extern unsigned int core_pipe_limit; total: 2 errors, 1 warnings, 120 lines checked ./patches/exec-let-do_coredump-limit-the-number-of-concurrent-dumps-to-pipes-v9.patch has style problems, please review. If any of these errors are false positives report them to the maintainer, see CHECKPATCH in MAINTAINERS. Please run checkpatch prior to sending patches Cc: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
- 24 Aug, 2009 1 commit
-
-
Neil Horman authored
Since we can dump cores to pipe, rather than directly to the filesystem, we create a condition in which a user can create a very high load on the system simply by running bad applications. If the pipe reader specified in core_pattern is poorly written, we can have lots of ourstandig resources and processes in the system. This sysctl introduces an ability to limit that resource consumption. core_pipe_limit defines how many in-flight dumps may be run in parallel, dumps beyond this value are skipped and a note is made in the kernel log. A special value of 0 in core_pipe_limit denotes unlimited core dumps may be handled (this is the default value). Signed-off-by: Neil Horman <nhorman@tuxdriver.com> Reported-by: Earl Chew <earl_chew@agilent.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Andi Kleen <andi@firstfloor.org> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
- 22 Jul, 2009 2 commits
-
-
Andrew Morton authored
#48: FILE: fs/exec.c:1796: + if (core_limit == 0) { + /* WARNING: line over 80 characters #57: FILE: fs/exec.c:1805: + * but it runs as root, and can do lots of stupid things WARNING: line over 80 characters #58: FILE: fs/exec.c:1806: + * Note that we use task_tgid_vnr here to grab the pid of the WARNING: line over 80 characters #59: FILE: fs/exec.c:1807: + * process group leader. That way we get the right pid if a thread total: 0 errors, 4 warnings, 59 lines checked ./patches/exec-make-do_coredump-more-resilient-to-recursive-crashes-v9.patch has style problems, please review. If any of these errors are false positives report them to the maintainer, see CHECKPATCH in MAINTAINERS. Please run checkpatch prior to sending patches Cc: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Neil Horman authored
Currently we have a mechanism by which we try to compare pathnames of the crashing process to the core_pattern path. This is broken for a dozen reasons, and just doesn't work in any sort of robust way. I'm replacing it with the use of a 0 RLIMIT_CORE value. Since helper apps set RLIMIT_CORE to zero, we don't write out core files for any process with that particular limit set. It the core_pattern is a pipe, any non-zero limit is translated to RLIM_INFINITY. This allows complete dumps to be captured, but prevents infinite recursion in the event that the core_pattern process itself crashes. Signed-off-by: Neil Horman <nhorman@tuxdriver.com> Reported-by: Earl Chew <earl_chew@agilent.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Andi Kleen <andi@firstfloor.org> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
- 24 Aug, 2009 5 commits
-
-
Roland McGrath authored
implementing user thread tracing and debugging. This fits on top of the tracehook_* layer, so the new code is well-isolated. The new interface is in <linux/utrace.h> and the DocBook utrace book describes it. It allows for multiple separate tracing engines to work in parallel without interfering with each other. Higher-level tracing facilities can be implemented as loadable kernel modules using this layer. The new facility is made optional under CONFIG_UTRACE. When this is not enabled, no new code is added. It can only be enabled on machines that have all the prerequisites and select CONFIG_HAVE_ARCH_TRACEHOOK. In this initial version, utrace and ptrace do not play together at all. If ptrace is attached to a thread, the attach calls in the utrace kernel API return -EBUSY. If utrace is attached to a thread, the PTRACE_ATTACH or PTRACE_TRACEME request will return EBUSY to userland. The old ptrace code is otherwise unchanged and nothing using ptrace should be affected by this patch as long as utrace is not used at the same time. In the future we can clean up the ptrace implementation and rework it to use the utrace API. [oleg@redhat.com: kill exclude_xtrace logic] Signed-off-by: Roland McGrath <roland@redhat.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Oleg Nesterov authored
do_signal_stop() after wakeup. Currently we lack the ability to report this state change. Also fix the comment, it should be placed before schedule(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Oleg Nesterov authored
->group_stop_count and setting TASK_STOPPED/SIGNAL_STOP_STOPPED. This way the tracing hooks can drop and reacquire the siglock freely and do any blocking hooks without potential SIGCONT races. With this patch TASK_STOPPED/SIGNAL_STOP_STOPPED is set only when we know for sure we are going to schedule() after unlock(siglock). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Roland McGrath authored
and changes its argument and return value definition. These clean-ups make it a better fit for what new tracing hooks need to check. Tracing needs the siglock here, held from the time TASK_STOPPED was set, to avoid potential SIGCONT races if it wants to allow any blocking in its tracing hooks. This also folds the finish_stop() function into its caller do_signal_stop(). The function is short, called only once and only unconditionally. It aids readability to fold it in. Signed-off-by: Roland McGrath <roland@redhat.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Roland McGrath authored
instructions on other machines. It takes two longer x86 instructions just to call it and test its return value, not to mention the function itself. On my random x86_64 config, this saved 70 bytes of text (59 of those being __fatal_signal_pending itself). Signed-off-by: Roland McGrath <roland@redhat.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
- 17 Aug, 2009 1 commit
-
-
Peter Zijlstra authored
multi-threaded application we cannot, like suggested by the manpage, put a TID into the regular fcntl(F_SETOWN) call. It will still be send to the whole process of which that thread is part. Since people do want to properly direct SIGIO we introduce F_SETOWN_EX. The need to direct SIGIO comes from self-monitoring profiling such as with perf-counters. Perf-counters uses SIGIO to notify that new sample data is available. If the signal is delivered to the same task that generated the new sample it can augment that data by inspecting the task's user-space state right after it returns from the kernel. This is esp. convenient for interpreted or virtual machine driven environments. Both F_SETOWN_EX and F_GETOWN_EX take a pointer to a struct f_owner_ex as argument: struct f_owner_ex { int type; pid_t pid; }; Where type is one of F_OWNER_TID, F_OWNER_PID or F_OWNER_GID. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Tested-by: stephane eranian <eranian@googlemail.com> Cc: Michael Kerrisk <mtk.manpages@googlemail.com> Cc: Roland McGrath <roland@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
- 04 Aug, 2009 2 commits
-
-
Oleg Nesterov authored
sender and uses current_cred(). This is not true in send_sigio_to_task() case. From the security pov the sender is not current, but the task which did fcntl(F_SETOWN), that is why we have sigio_perm() which uses the right creds to check. Fortunately, send_sigio() always sends either SEND_SIG_PRIV or SI_FROMKERNEL() signal, so check_kill_permission() does nothing. But still it would be tidier to avoid this bogus security check and save a couple of cycles. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: stephane eranian <eranian@googlemail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Oleg Nesterov authored
send_sig_info(), do_send_specific() to use this helper. Hopefully it will have more users soon, it allows to specify specific/group behaviour via "bool group" argument. Shaves 80 bytes from .text. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: stephane eranian <eranian@googlemail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
- 24 Aug, 2009 2 commits
-
-
Vitaly Mayatskikh authored
NULL, sys_waitid() returns success. When user additionally specifies flag WNOWAIT, sys_waitid() returns -EFAULT on the same conditions. When user combines WNOWAIT with WCONTINUED, sys_waitid() again returns success. This patch adds check for ->wo_info in wait_noreap_copyout(). User-visible change: starting from this commit, sys_waitid() always checks infop != NULL and does not fail if it is NULL. Signed-off-by: Vitaly Mayatskikh <v.mayatskih@gmail.com> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Vitaly Mayatskikh authored
NULL the caller should be sys_waitid(), in that case do_wait() fixes up the retval or zeros ->wo_info, depending on retval from underlying function. This is bug: user can pass ->wo_info == NULL and sys_waitid() will return incorrect value. man 2 waitid says: waitid(): returns 0 on success Test-case: int main(void) { if (fork()) assert(waitid(P_ALL, 0, NULL, WEXITED) == 0); return 0; } Result: Assertion `waitid(P_ALL, 0, ((void *)0), 4) == 0' failed. Move that code to sys_waitid(). User-visible change: sys_waitid() will return 0 on success, either infop is set or not. Note, there's another bug in wait_noreap_copyout() which affects return value of sys_waitid(). It will be fixed in next patch. Signed-off-by: Vitaly Mayatskikh <v.mayatskih@gmail.com> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
- 13 Jul, 2009 1 commit
-
-
Oleg Nesterov authored
Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Roland McGrath <roland@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Ratan Nalumasu <rnalumasu@gmail.com> Cc: Vitaly Mayatskikh <vmayatsk@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
- 24 Aug, 2009 1 commit
-
-
Oleg Nesterov authored
Currently we add sub-threads to ->real_parent->children list. This buys nothing but slows down do_wait(). With this patch ->children contains only main threads (group leaders). The only complication is that forget_original_parent() should iterate over sub-threads by hand, and de_thread() needs another list_replace() when it changes ->group_leader. Henceforth do_wait_thread() can never see task_detached() && !EXIT_DEAD tasks, we can remove this check (and we can unify do_wait_thread() and ptrace_do_wait()). This change can confuse the optimistic search in mm_update_next_owner(), but this is fixable and minor. Perhaps badness() and oom_kill_process() should be updated, but they should be fixed in any case. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Roland McGrath <roland@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Ratan Nalumasu <rnalumasu@gmail.com> Cc: Vitaly Mayatskikh <vmayatsk@redhat.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
- 02 Sep, 2009 2 commits
-
-
Oleg Nesterov authored
!= PIDTYPE_MAX anyway. Remove this check from task_pid_type() and factor out ->pids[type] access, this shrinks .text a bit and simplifies the code. The matches the behaviour of other similar helpers, say get_task_pid(). The caller must ensure that pid_type is valid, not the callee. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Roland McGrath <roland@redhat.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Oleg Nesterov authored
wakeup if the task was detached before __wake_up_parent() and the caller of do_wait() didn't use __WALL. Move ->wo_pid checks from eligible_child() to the new helper, eligible_pid(), and change child_wait_callback() to use it instead of eligible_child(). Note: actually I think it would be better to fix the __WCLONE check in eligible_child(), it doesn't look exactly right. But it is not clear what is the supposed behaviour, and any change is user-visible. Reported-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
- 13 Jul, 2009 1 commit
-
-
Oleg Nesterov authored
do_wait(__WNOTHREAD) can only succeed if the caller is either ptracer, or it is ->real_parent and the child is not traced. IOW, caller == p->parent otherwise we should not wake up. Change child_wait_callback() to check this. Ratan reports the workload with CPU load >99% caused by unnecessary wakeups, should be fixed by this patch. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Roland McGrath <roland@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Ratan Nalumasu <rnalumasu@gmail.com> Cc: Vitaly Mayatskikh <vmayatsk@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
- 24 Aug, 2009 1 commit
-
-
Oleg Nesterov authored
which exports __wake_up_parent) Spotted by Valdis.Kletnieks@vt.edu. selinux_bprm_committed_creds() should not play with ->wait_chldexit, now that __wake_up_parent() is exported change the code to use this helper. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Roland McGrath <roland@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Ratan Nalumasu <rnalumasu@gmail.com> Cc: Vitaly Mayatskikh <vmayatsk@redhat.com> Acked-by: James Morris <jmorris@namei.org> Tested-by: Valdis Kletnieks <valdis.kletnieks@vt.edu> Reported-by: Valdis Kletnieks <valdis.kletnieks@vt.edu> Acked-by: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
- 13 Jul, 2009 2 commits
-
-
Oleg Nesterov authored
unnecessary wakeups. Every waiting thread in the process wakes up to loop through the children and see that the only ones it cares about are still not ready. Now that we have struct wait_opts we can change do_wait/__wake_up_parent to use filtered wakeups. We can make child_wait_callback() more clever later, right now it only checks eligible_child(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Roland McGrath <roland@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Ratan Nalumasu <rnalumasu@gmail.com> Cc: Vitaly Mayatskikh <vmayatsk@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Oleg Nesterov authored
eligible_child() has a single caller, wait_consider_task(). We can move security_task_wait() out from eligible_child(), this allows us to use it for filtered wake_up(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Roland McGrath <roland@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Ratan Nalumasu <rnalumasu@gmail.com> Cc: Vitaly Mayatskikh <vmayatsk@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
- 24 Aug, 2009 1 commit
-
-
Oleg Nesterov authored
Test case: static void *tfunc(void *arg) { int pid = (long)arg; assert(ptrace(PTRACE_ATTACH, pid, NULL, NULL) == 0); kill(pid, SIGKILL); sleep(1); return NULL; } int main(void) { pthread_t th; long pid = fork(); if (!pid) pause(); signal(SIGCHLD, SIG_IGN); assert(pthread_create(&th, NULL, tfunc, (void*)pid) == 0); int r = waitpid(-1, NULL, __WNOTHREAD); printf("waitpid: %d %m\n", r); return 0; } Before the patch this program hangs, after this patch waitpid() correctly fails with errno == -ECHILD. The problem is, __ptrace_detach() reaps the EXIT_ZOMBIE tracee if its ->real_parent is our sub-thread and we ignore SIGCHLD. But in this case we should wake up other threads which can sleep in do_wait(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Roland McGrath <roland@redhat.com> Cc: Vitaly Mayatskikh <vmayatsk@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
- 04 Sep, 2009 1 commit
-
-
Daisuke Nishimura authored
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
- 24 Aug, 2009 1 commit
-
-
Daisuke Nishimura authored
be useful for users to show swap usage in memory.stat file, because they don't need calculate memsw.usage - res.usage to know swap usage. Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
- 14 Aug, 2009 1 commit
-
-
Balbir Singh authored
Cc: Prarit Bhargava <prarit@redhat.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
- 11 Aug, 2009 1 commit
-
-
Andrew Morton authored
#50: FILE: mm/memcontrol.c:485: + int val = (charge)? 1 : -1; ^ total: 1 errors, 0 warnings, 171 lines checked ./patches/memcg-improve-resource-counter-scalability.patch has style problems, please review. If any of these errors are false positives report them to the maintainer, see CHECKPATCH in MAINTAINERS. Please run checkpatch prior to sending patches Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
- 14 Aug, 2009 1 commit
-
-
Balbir Singh authored
root cgroup. This is a part of the several patches to reduce mem cgroup overhead. I had posted other approaches earlier (including using percpu counters). Those patches will be a natural addition and will be added iteratively on top of these. The patch stops resource counter accounting for the root cgroup. The data for display is derived from the statisitcs we maintain via mem_cgroup_charge_statistics (which is more scalable). What happens today is that, we do double accounting, once using res_counter_charge() and once using memory_cgroup_charge_statistics(). For the root, since we don't implement limits any more, we don't need to track every charge via res_counter_charge() and check for limit being exceeded and reclaim. The main mem->res usage_in_bytes can be derived by summing the cache and rss usage data from memory statistics (MEM_CGROUP_STAT_RSS and MEM_CGROUP_STAT_CACHE). However, for memsw->res usage_in_bytes, we need additional data about swapped out memory. This patch adds a MEM_CGROUP_STAT_SWAPOUT and uses that along with MEM_CGROUP_STAT_RSS and MEM_CGROUP_STAT_CACHE to derive the memsw data. This data is computed recursively when hierarchy is enabled. The tests results I see on a 24 way show that 1. The lock contention disappears from /proc/lock_stats 2. The results of the test are comparable to running with cgroup_disable=memory. Here is a sample of my program runs Without Patch Performance counter stats for '/home/balbir/parallel_pagefault': 7192804.124144 task-clock-msecs # 23.937 CPUs 424691 context-switches # 0.000 M/sec 267 CPU-migrations # 0.000 M/sec 28498113 page-faults # 0.004 M/sec 5826093739340 cycles # 809.989 M/sec 408883496292 instructions # 0.070 IPC 7057079452 cache-references # 0.981 M/sec 3036086243 cache-misses # 0.422 M/sec 300.485365680 seconds time elapsed With cgroup_disable=memory Performance counter stats for '/home/balbir/parallel_pagefault': 7182183.546587 task-clock-msecs # 23.915 CPUs 425458 context-switches # 0.000 M/sec 203 CPU-migrations # 0.000 M/sec 92545093 page-faults # 0.013 M/sec 6034363609986 cycles # 840.185 M/sec 437204346785 instructions # 0.072 IPC 6636073192 cache-references # 0.924 M/sec 2358117732 cache-misses # 0.328 M/sec 300.320905827 seconds time elapsed With this patch applied Performance counter stats for '/home/balbir/parallel_pagefault': 7191619.223977 task-clock-msecs # 23.955 CPUs 422579 context-switches # 0.000 M/sec 88 CPU-migrations # 0.000 M/sec 91946060 page-faults # 0.013 M/sec 5957054385619 cycles # 828.333 M/sec 1058117350365 instructions # 0.178 IPC 9161776218 cache-references # 1.274 M/sec 1920494280 cache-misses # 0.267 M/sec 300.218764862 seconds time elapsed Data from Prarit (kernel compile with make -j64 on a 64 CPU/32G machine) For a single run Without patch real 27m8.988s user 87m24.916s sys 382m6.037s With patch real 4m18.607s user 84m58.943s sys 50m52.682s With config turned off real 4m54.972s user 90m13.456s sys 50m19.711s NOTE: The data looks counterintuitive due to the increased performance with the patch, even over the config being turned off. We probably need more runs, but so far all testing has shown that the patches definitely help. Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Andi Kleen <andi@firstfloor.org> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
- 09 Sep, 2009 1 commit
-
-
Daisuke Nishimura authored
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
- 05 Sep, 2009 1 commit
-
-
KAMEZAWA Hiroyuki authored
be visited if reclaimd==0. But css'refcnt handling for next_mz is not enough and it makes css->refcnt leak. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
- 31 Jul, 2009 1 commit
-
-
Andrew Morton authored
Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
- 24 Aug, 2009 2 commits
-
-
Balbir Singh authored
Permit reclaim from memory cgroups on contention (via the direct reclaim path). memory cgroup soft limit reclaim finds the group that exceeds its soft limit by the largest number of pages and reclaims pages from it and then reinserts the cgroup into its correct place in the rbtree. Add additional checks to mem_cgroup_hierarchical_reclaim() to detect long loops in case all swap is turned off. The code has been refactored and the loop check (loop < 2) has been enhanced for soft limits. For soft limits, we try to do more targetted reclaim. Instead of bailing out after two loops, the routine now reclaims memory proportional to the size by which the soft limit is exceeded. The proportion has been empirically determined. Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
Balbir Singh authored
Refactor the arguments passed to mem_cgroup_hierarchical_reclaim() into flags, so that new parameters don't have to be passed as we make the reclaim routine more flexible Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
- 04 Aug, 2009 1 commit
-
-
Hugh Dickins authored
Kernel panic - not syncing: No init found; after lots of scheduling while atomics, starting from when async_thread does sd_probe_async. mem_cgroup_soft_limit_check() was doing an unbalanced get_cpu(): don't get_cpu if we won't need it, and put_cpu if we did get_cpu. And fix the silliness of passing it an "over_soft_limit" argument that just tells it to return false when false. Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Tested-by: Balbir Singh <balbir@linux.vnet.ibm.com> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Reviewed-by: Balbir Singh <balbir@linux.vnet.ibm.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reported-by: Jiri Slaby <jirislaby@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-
- 21 Jul, 2009 1 commit
-
-
Balbir Singh authored
Introduce an RB-Tree for storing memory cgroups that are over their soft limit. The overall goal is to 1. Add a memory cgroup to the RB-Tree when the soft limit is exceeded. We are careful about updates, updates take place only after a particular time interval has passed 2. We remove the node from the RB-Tree when the usage goes below the soft limit The next set of patches will exploit the RB-Tree to get the group that is over its soft limit by the largest amount and reclaim from it, when we face memory contention. Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-