- 30 Jul, 2009 1 commit
Bartlomiej Zolnierkiewicz authored
Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
- 10 Aug, 2009 1 commit
Dave Young authored
Remove the unused char *s.

Signed-off-by: Dave Young <hidave.darkstar@gmail.com>
Cc: Renzo Davoli <renzo@cs.unibo.it>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
- 18 Aug, 2009 1 commit
Renzo Davoli authored
In register_chrdev() there is a loop that changes every '/' in the kernel
object name into '!'. This code is useless, as the same substitution is
already done by kobject_set_name_vargs() in lib/kobject.c:

    /* ewww... some of these buggers have '/' in the name ... */
    while ((s = strchr(kobj->name, '/')))
        s[0] = '!';

kobject_set_name_vargs() is called by kobject_set_name(), and
kobject_set_name() is called just above the useless loop.

Signed-off-by: Renzo Davoli <renzo@cs.unibo.it>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
- 21 Jul, 2009 1 commit
Nikanth Karthikesan authored
The access_ok() check has already been performed by the caller, so it is
unnecessarily checked again in clear_user(). Use __clear_user(), which
does not check for access_ok().

Signed-off-by: Nikanth Karthikesan <knikanth@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
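A sketch of the resulting pattern (zero_user_range() is a hypothetical helper for illustration, not code from the patch):

    #include <linux/uaccess.h>

    /* Validate the destination once, then use the unchecked variant. */
    static int zero_user_range(void __user *addr, unsigned long len)
    {
        if (!access_ok(VERIFY_WRITE, addr, len))
            return -EFAULT;
        /* Range is known good; __clear_user() skips the re-check. */
        return __clear_user(addr, len) ? -EFAULT : 0;
    }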
- 02 Jul, 2009 1 commit
Mike Frysinger authored
There is a common macro now for testing mixed pointer/errno values, so use
that rather than handling the casts ourselves.

Signed-off-by: Mike Frysinger <vapier@gentoo.org>
Acked-by: David McCullough <david_mccullough@securecomputing.com>
Acked-by: Greg Ungerer <gerg@uclinux.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
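Assuming the macro in question is IS_ERR_VALUE() from <linux/err.h>, a minimal sketch of the pattern:

    #include <linux/err.h>

    /* A value that is either a valid address or a negative errno in
     * disguise; IS_ERR_VALUE() does the range test without manual casts. */
    static int check_map_result(unsigned long addr)
    {
        if (IS_ERR_VALUE(addr))
            return (int)addr;   /* propagate the errno */
        return 0;
    }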
- 20 Jul, 2009 1 commit
David Howells authored
When calculating the stack size, ignore the loader's PT_GNU_STACK and
instead consider the executable's PT_GNU_STACK, assuming the executable
has one. Currently the behaviour is to take the largest stack size and
use that, but that means you can't reduce the stack size in the
executable. The loader's stack size should probably only be used when
executing the loader directly.

WARNING: This patch is slightly dangerous - it may render a system
inoperable if the loader's stack size is larger than that of important
executables, and the system relies unknowingly on this increasing the
size of the stack.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Mike Frysinger <vapier@gentoo.org>
Acked-by: Paul Mundt <lethal@linux-sh.org>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
- 01 Jul, 2009 2 commits
Andrew Morton authored
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Amerigo Wang <amwang@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Roland McGrath <roland@redhat.com>
Cc: WANG Cong <amwang@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Amerigo Wang authored
Introduce a helper function to do the initializations, and also fix the
potential memory leaks.

Signed-off-by: WANG Cong <amwang@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: David Howells <dhowells@redhat.com>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
- 23 Jul, 2009 1 commit
Lai Jiangshan authored
_cpu_down() changes the current task's affinity and then recovers it at
the end. This has two problems:

1) The recovery of the current task's cpus_allowed will fail under some
   conditions:

    # grep Cpus_allowed_list /proc/$$/status
    Cpus_allowed_list:  0-3
    # taskset -pc 2 $$
    pid 29075's current affinity list: 0-3
    pid 29075's new affinity list: 2
    # grep Cpus_allowed_list /proc/$$/status
    Cpus_allowed_list:  2
    # echo 0 > /sys/devices/system/cpu/cpu2/online
    # grep Cpus_allowed_list /proc/$$/status
    Cpus_allowed_list:  0

   Here, the Cpus_allowed_list was originally "2" and should have become
   "0-1,3" after cpu #2 was offlined; the resulting "Cpus_allowed_list: 0"
   is incorrect.

2) If the current task is a userspace task, the user may change its
   cpu-affinity during the CPU hot-unplugging. This change can be
   overwritten when _cpu_down() changes the current task's affinity.

Fix all this by not changing the current task's affinity. Instead we
create a kernel thread to do the work.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
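A rough sketch of the kernel-thread approach, with purely illustrative names (the real patch differs in detail):

    #include <linux/kthread.h>
    #include <linux/completion.h>
    #include <linux/cpumask.h>
    #include <linux/cpu.h>
    #include <linux/err.h>

    struct take_cpu_down_ctx {
        unsigned int cpu;
        int ret;
        struct completion done;
    };

    static int take_cpu_down_thread(void *arg)
    {
        struct take_cpu_down_ctx *ctx = arg;

        /* ... stop_machine(), migrate tasks off ctx->cpu, etc. ... */
        ctx->ret = 0;
        complete(&ctx->done);
        return 0;
    }

    static int cpu_down_via_kthread(unsigned int cpu)
    {
        struct take_cpu_down_ctx ctx = { .cpu = cpu };
        struct task_struct *k;

        init_completion(&ctx.done);
        k = kthread_create(take_cpu_down_thread, &ctx, "cpudown/%u", cpu);
        if (IS_ERR(k))
            return PTR_ERR(k);

        /* Bind the worker to any CPU other than the one going down,
         * instead of rewriting current->cpus_allowed. */
        kthread_bind(k, cpumask_any_but(cpu_online_mask, cpu));
        wake_up_process(k);
        wait_for_completion(&ctx.done);
        return ctx.ret;
    }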
- 28 Jul, 2009 1 commit
Oleg Nesterov authored
sys_delete_module() can set MODULE_STATE_GOING after
search_binary_handler() does try_module_get(). In this case
set_binfmt()->try_module_get() fails, but since none of the callers check
the returned error, the task will run with the wrong old ->binfmt.

The proper fix should change all ->load_binary() methods, but we can rely
on the fact that the caller must hold a reference to binfmt->module and
use __module_get(), which never fails.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
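As a sketch, the fixed set_binfmt() then takes roughly this shape (whether ->binfmt hung off the task or the mm varied around this time, so treat the field access as approximate):

    void set_binfmt(struct linux_binfmt *new)
    {
        struct mm_struct *mm = current->mm;

        if (mm->binfmt)
            module_put(mm->binfmt->module);

        mm->binfmt = new;
        if (new)
            __module_get(new->module); /* cannot fail: caller holds a ref */
    }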
- 22 Jul, 2009 5 commits
Neil Horman authored
One of the things that user space processes like to do is look at metadata
for a crashing process in their /proc/<pid> directory. This is racy,
however, since do_coredump() in the kernel doesn't wait for the user space
process to complete before it reaps the crashing process. This patch
corrects that, allowing the kernel to wait for the user space process to
complete before cleaning up the crashing process.

This is a bit tricky to do for a few reasons:

1) The user space process isn't our child, so we can't sys_wait4() on it
2) We need to close the pipe before waiting for the user process to
   complete, since the user process may rely on an EOF condition

I've discussed several solutions with Oleg Nesterov off-list about this,
and this is the one we've come up with. We add ourselves as a pipe reader
(to prevent premature cleanup of the pipe_inode_info), and remove
ourselves as a writer (to provide an EOF condition to the writer in user
space), then we iterate until the user space process exits (which we
detect by pipe->readers == 1, hence the > 1 check in the loop). When we
exit the loop, we restore the proper reader/writer values, then we return
and let filp_close() in do_coredump() clean up the pipe data properly.

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Reported-by: Earl Chew <earl_chew@agilent.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
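A sketch of that loop (close to the wait_for_dump_helpers() that eventually appeared in fs/exec.c, though the details here are approximate):

    #include <linux/fs.h>
    #include <linux/pipe_fs_i.h>
    #include <linux/sched.h>

    static void wait_for_dump_helpers(struct file *file)
    {
        struct pipe_inode_info *pipe =
            file->f_path.dentry->d_inode->i_pipe;

        pipe_lock(pipe);
        pipe->readers++;    /* keep pipe_inode_info alive */
        pipe->writers--;    /* give the helper its EOF */

        /* readers > 1 means the helper still holds its end open */
        while ((pipe->readers > 1) && (!signal_pending(current))) {
            wake_up_interruptible_sync(&pipe->wait);
            kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
            pipe_wait(pipe);
        }

        pipe->readers--;
        pipe->writers++;
        pipe_unlock(pipe);
    }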
Andrew Morton authored
#115: FILE: fs/exec.c:1838:
+^I^Iif (call_usermodehelper_pipe(helper_argv[0], helper_argv, NULL,$

ERROR: code indent should use tabs where possible
#120: FILE: fs/exec.c:1842:
+^I^I^Igoto fail_dropcount;$

WARNING: externs should be avoided in .c files
#149: FILE: kernel/sysctl.c:80:
+extern unsigned int core_pipe_limit;

total: 2 errors, 1 warnings, 120 lines checked

./patches/exec-let-do_coredump-limit-the-number-of-concurrent-dumps-to-pipes-v9.patch has style problems, please review.

If any of these errors are false positives report them to the maintainer, see CHECKPATCH in MAINTAINERS.

Please run checkpatch prior to sending patches

Cc: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Neil Horman authored
Since we can dump cores to pipe, rather than directly to the filesystem,
we create a condition in which a user can create a very high load on the
system simply by running bad applications. If the pipe reader specified
in core_pattern is poorly written, we can have lots of outstanding
resources and processes in the system.

This sysctl introduces an ability to limit that resource consumption.
core_pipe_limit defines how many in-flight dumps may be run in parallel;
dumps beyond this value are skipped and a note is made in the kernel log.
A special value of 0 in core_pipe_limit denotes that an unlimited number
of core dumps may be handled (this is the default value).

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Reported-by: Earl Chew <earl_chew@agilent.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
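A sketch of how such a limiter can be structured (core_dump_may_proceed() is a hypothetical helper; judging by the checkpatch excerpt above, the patch open-codes the equivalent logic in do_coredump()):

    #include <linux/sched.h>
    #include <asm/atomic.h>

    unsigned int core_pipe_limit;   /* 0 == unlimited (default) */
    static atomic_t core_dump_count = ATOMIC_INIT(0);

    static int core_dump_may_proceed(void)
    {
        int dump_count = atomic_inc_return(&core_dump_count);

        if (core_pipe_limit && dump_count > core_pipe_limit) {
            printk(KERN_WARNING
                   "Pid %d(%s) over core_pipe_limit, skipping core dump\n",
                   task_tgid_vnr(current), current->comm);
            atomic_dec(&core_dump_count);   /* the drop-count path */
            return 0;
        }
        return 1;
    }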
Andrew Morton authored
#48: FILE: fs/exec.c:1796:
+	if (core_limit == 0) {
+		/*

WARNING: line over 80 characters
#57: FILE: fs/exec.c:1805:
+		 * but it runs as root, and can do lots of stupid things

WARNING: line over 80 characters
#58: FILE: fs/exec.c:1806:
+		 * Note that we use task_tgid_vnr here to grab the pid of the

WARNING: line over 80 characters
#59: FILE: fs/exec.c:1807:
+		 * process group leader. That way we get the right pid if a thread

total: 0 errors, 4 warnings, 59 lines checked

./patches/exec-make-do_coredump-more-resilient-to-recursive-crashes-v9.patch has style problems, please review.

If any of these errors are false positives report them to the maintainer, see CHECKPATCH in MAINTAINERS.

Please run checkpatch prior to sending patches

Cc: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Neil Horman authored
Currently we have a mechanism by which we try to compare pathnames of the
crashing process to the core_pattern path. This is broken for a dozen
reasons, and just doesn't work in any sort of robust way.

I'm replacing it with the use of a 0 RLIMIT_CORE value. Since helper apps
set RLIMIT_CORE to zero, we don't write out core files for any process
with that particular limit set. If the core_pattern is a pipe, any
non-zero limit is translated to RLIM_INFINITY. This allows complete dumps
to be captured, but prevents infinite recursion in the event that the
core_pattern process itself crashes.

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Reported-by: Earl Chew <earl_chew@agilent.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
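A sketch of the rule (effective_core_limit() is a hypothetical helper, not the patch's code):

    #include <linux/sched.h>
    #include <linux/resource.h>

    static unsigned long effective_core_limit(int ispipe)
    {
        unsigned long core_limit =
            current->signal->rlim[RLIMIT_CORE].rlim_cur;

        /* A zero RLIMIT_CORE marks a crashed dump helper: dumping
         * nothing is what breaks the recursion. */
        if (ispipe && core_limit != 0)
            return RLIM_INFINITY;   /* pipes get complete dumps */

        return core_limit;
    }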
- 16 Jul, 2009 1 commit
Roland McGrath authored
This adds the utrace facility, a new modular interface in the kernel for
implementing user thread tracing and debugging. This fits on top of the
tracehook_* layer, so the new code is well-isolated.

The new interface is in <linux/utrace.h> and the DocBook utrace book
describes it. It allows for multiple separate tracing engines to work in
parallel without interfering with each other. Higher-level tracing
facilities can be implemented as loadable kernel modules using this layer.

The new facility is made optional under CONFIG_UTRACE. When this is not
enabled, no new code is added. It can only be enabled on machines that
have all the prerequisites and select CONFIG_HAVE_ARCH_TRACEHOOK.

In this initial version, utrace and ptrace do not play together at all.
If ptrace is attached to a thread, the attach calls in the utrace kernel
API return -EBUSY. If utrace is attached to a thread, the PTRACE_ATTACH
or PTRACE_TRACEME request will return EBUSY to userland. The old ptrace
code is otherwise unchanged, and nothing using ptrace should be affected
by this patch as long as utrace is not used at the same time. In the
future we can clean up the ptrace implementation and rework it to use the
utrace API.

[oleg@redhat.com: kill exclude_xtrace logic]
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
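As a rough sketch only — callback signatures changed between utrace revisions, so treat every detail below as an approximation of the documented API rather than the real interface:

    #include <linux/utrace.h>
    #include <linux/err.h>

    /* Approximate callback shape; real signatures vary by utrace version. */
    static u32 my_report_exit(u32 action, struct utrace_engine *engine,
                              struct task_struct *task,
                              long orig_code, long *code)
    {
        pr_info("traced task %d exiting, code %ld\n", task->pid, *code);
        return UTRACE_DETACH;
    }

    static const struct utrace_engine_ops my_ops = {
        .report_exit = my_report_exit,
    };

    static int attach_engine(struct task_struct *target)
    {
        struct utrace_engine *engine;

        engine = utrace_attach_task(target, UTRACE_ATTACH_CREATE,
                                    &my_ops, NULL);
        if (IS_ERR(engine))
            return PTR_ERR(engine); /* -EBUSY if ptrace already owns it */

        return utrace_set_events(target, engine, UTRACE_EVENT(EXIT));
    }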
- 30 Jul, 2009 2 commits
Oleg Nesterov authored
do_signal_stop() after wakeup. Currently we lack the ability to report
this state change.

Also fix the comment: it should be placed before schedule().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Oleg Nesterov authored
->group_stop_count and setting TASK_STOPPED/SIGNAL_STOP_STOPPED. This way
the tracing hooks can drop and reacquire the siglock freely and do any
blocking hooks without potential SIGCONT races.

With this patch, TASK_STOPPED/SIGNAL_STOP_STOPPED is set only when we know
for sure we are going to schedule() after unlock(siglock).

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
- 13 Jul, 2009 1 commit
Roland McGrath authored
and changes its argument and return value definition. These clean-ups
make it a better fit for what new tracing hooks need to check.

Tracing needs the siglock here, held from the time TASK_STOPPED was set,
to avoid potential SIGCONT races if it wants to allow any blocking in its
tracing hooks.

This also folds the finish_stop() function into its caller,
do_signal_stop(). The function is short, called only once, and
unconditionally. Folding it in aids readability.

Signed-off-by: Roland McGrath <roland@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
- 19 Aug, 2009 1 commit
Roland McGrath authored
instructions on other machines. It takes two longer x86 instructions just
to call it and test its return value, not to mention the function itself.
On my random x86_64 config, this saved 70 bytes of text (59 of those being
__fatal_signal_pending itself).

Signed-off-by: Roland McGrath <roland@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
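For reference, the helper reduces to a single signal-set test, which is why inlining pays off; mainline ended up with essentially this definition:

    static inline int __fatal_signal_pending(struct task_struct *p)
    {
        return unlikely(sigismember(&p->pending.signal, SIGKILL));
    }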
- 17 Aug, 2009 1 commit
Peter Zijlstra authored
To direct the SIGIO signal to a particular thread of a multi-threaded
application we cannot, as suggested by the manpage, put a TID into the
regular fcntl(F_SETOWN) call. It will still be sent to the whole process
of which that thread is part.

Since people do want to properly direct SIGIO, we introduce F_SETOWN_EX.

The need to direct SIGIO comes from self-monitoring profiling such as with
perf-counters. Perf-counters uses SIGIO to notify that new sample data is
available. If the signal is delivered to the same task that generated the
new sample, it can augment that data by inspecting the task's user-space
state right after it returns from the kernel. This is especially
convenient for interpreted or virtual machine driven environments.

Both F_SETOWN_EX and F_GETOWN_EX take a pointer to a struct f_owner_ex as
argument:

    struct f_owner_ex {
        int   type;
        pid_t pid;
    };

where type is one of F_OWNER_TID, F_OWNER_PID or F_OWNER_GID.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Tested-by: stephane eranian <eranian@googlemail.com>
Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
Cc: Roland McGrath <roland@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
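A minimal userspace sketch of the new command, assuming the updated <fcntl.h> definitions are visible (glibc had no gettid() wrapper at the time, hence the raw syscall):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/syscall.h>

    /* Direct SIGIO for fd at the calling thread only. */
    static int sigio_to_this_thread(int fd)
    {
        struct f_owner_ex owner = {
            .type = F_OWNER_TID,
            .pid  = syscall(SYS_gettid),    /* kernel TID, not pthread_t */
        };

        if (fcntl(fd, F_SETOWN_EX, &owner) == -1)
            return -1;
        /* O_ASYNC makes readiness generate the signal in the first place. */
        return fcntl(fd, F_SETFL, fcntl(fd, F_GETFL) | O_ASYNC);
    }

With F_OWNER_TID the signal lands on exactly the thread that armed it, rather than on the process as a whole.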
- 04 Aug, 2009 2 commits
Oleg Nesterov authored
check_kill_permission() assumes that current is the signal sender and uses
current_cred(). This is not true in the send_sigio_to_task() case. From
the security point of view the sender is not current, but the task which
did fcntl(F_SETOWN); that is why we have sigio_perm(), which uses the
right creds to check.

Fortunately, send_sigio() always sends either a SEND_SIG_PRIV or an
SI_FROMKERNEL() signal, so check_kill_permission() does nothing. But
still it would be tidier to avoid this bogus security check and save a
couple of cycles.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: stephane eranian <eranian@googlemail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Oleg Nesterov authored
Introduce the do_send_sig_info() helper and convert group_send_sig_info(),
send_sig_info(), do_send_specific() to use this helper. Hopefully it will
have more users soon; it allows callers to choose specific/group behaviour
via a "bool group" argument.

Shaves 80 bytes from .text.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: stephane eranian <eranian@googlemail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
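A sketch of the helper's shape (modeled on kernel/signal.c, where send_signal() and lock_task_sighand() already exist; details approximate):

    int do_send_sig_info(int sig, struct siginfo *info,
                         struct task_struct *p, bool group)
    {
        unsigned long flags;
        int ret = -ESRCH;

        if (lock_task_sighand(p, &flags)) {
            ret = send_signal(sig, info, p, group);
            unlock_task_sighand(p, &flags);
        }
        return ret;
    }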
- 21 Jul, 2009 2 commits
Vitaly Mayatskikh authored
When the user passes infop == NULL, sys_waitid() returns success. When
the user additionally specifies the flag WNOWAIT, sys_waitid() returns
-EFAULT under the same conditions. When the user combines WNOWAIT with
WCONTINUED, sys_waitid() again returns success.

This patch adds a check for ->wo_info in wait_noreap_copyout().

User-visible change: starting from this commit, sys_waitid() always checks
infop != NULL and does not fail if it is NULL.

Signed-off-by: Vitaly Mayatskikh <v.mayatskih@gmail.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Vitaly Mayatskikh authored
When ->wo_info is not NULL, the caller should be sys_waitid(); in that
case do_wait() fixes up the retval or zeroes ->wo_info, depending on the
retval from the underlying function.

This is a bug: the user can pass ->wo_info == NULL and sys_waitid() will
return an incorrect value.

man 2 waitid says:

    waitid(): returns 0 on success

Test-case:

    int main(void)
    {
        if (fork())
            assert(waitid(P_ALL, 0, NULL, WEXITED) == 0);
        return 0;
    }

Result:

    Assertion `waitid(P_ALL, 0, ((void *)0), 4) == 0' failed.

Move that code to sys_waitid().

User-visible change: sys_waitid() will return 0 on success, whether infop
is set or not.

Note: there is another bug in wait_noreap_copyout() which affects the
return value of sys_waitid(). It will be fixed in the next patch.

Signed-off-by: Vitaly Mayatskikh <v.mayatskih@gmail.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
- 13 Jul, 2009 3 commits
Oleg Nesterov authored
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Roland McGrath <roland@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Ratan Nalumasu <rnalumasu@gmail.com>
Cc: Vitaly Mayatskikh <vmayatsk@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Oleg Nesterov authored
Currently we add sub-threads to the ->real_parent->children list. This
buys nothing but slows down do_wait().

With this patch ->children contains only main threads (group leaders).
The only complication is that forget_original_parent() should iterate over
sub-threads by hand, and de_thread() needs another list_replace() when it
changes ->group_leader.

Henceforth do_wait_thread() can never see task_detached() && !EXIT_DEAD
tasks, so we can remove this check (and we can unify do_wait_thread() and
ptrace_do_wait()).

This change can confuse the optimistic search in mm_update_next_owner(),
but this is fixable and minor. Perhaps badness() and oom_kill_process()
should be updated, but they should be fixed in any case.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Roland McGrath <roland@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Ratan Nalumasu <rnalumasu@gmail.com>
Cc: Vitaly Mayatskikh <vmayatsk@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Oleg Nesterov authored
do_wait(__WNOTHREAD) can only succeed if the caller is either the ptracer,
or it is ->real_parent and the child is not traced. IOW, caller ==
p->parent, otherwise we should not wake up.

Change child_wait_callback() to check this. Ratan reports a workload with
CPU load >99% caused by unnecessary wakeups; it should be fixed by this
patch.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Roland McGrath <roland@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Ratan Nalumasu <rnalumasu@gmail.com>
Cc: Vitaly Mayatskikh <vmayatsk@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
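Modeled on the resulting kernel/exit.c, the callback with the new check looks roughly like this:

    static int child_wait_callback(wait_queue_t *wait, unsigned mode,
                                   int sync, void *key)
    {
        struct wait_opts *wo = container_of(wait, struct wait_opts,
                                            child_wait);
        struct task_struct *p = key;

        if (!eligible_child(wo, p))
            return 0;

        /* __WNOTHREAD: only the ptracer or the real parent may wake. */
        if ((wo->wo_flags & __WNOTHREAD) && wait->private != p->parent)
            return 0;

        return default_wake_function(wait, mode, sync, key);
    }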
- 18 Jul, 2009 1 commit
Oleg Nesterov authored
(This depends on the previous patch, which exports __wake_up_parent.)

Spotted by Valdis.Kletnieks@vt.edu.

selinux_bprm_committed_creds() should not play with ->wait_chldexit; now
that __wake_up_parent() is exported, change the code to use this helper.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Roland McGrath <roland@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Ratan Nalumasu <rnalumasu@gmail.com>
Cc: Vitaly Mayatskikh <vmayatsk@redhat.com>
Acked-by: James Morris <jmorris@namei.org>
Tested-by: Valdis Kletnieks <valdis.kletnieks@vt.edu>
Reported-by: Valdis Kletnieks <valdis.kletnieks@vt.edu>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
- 13 Jul, 2009 3 commits
Oleg Nesterov authored
unnecessary wakeups. Every waiting thread in the process wakes up to loop
through the children and see that the only ones it cares about are still
not ready.

Now that we have struct wait_opts we can change do_wait/__wake_up_parent
to use filtered wakeups. We can make child_wait_callback() more clever
later; right now it only checks eligible_child().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Roland McGrath <roland@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Ratan Nalumasu <rnalumasu@gmail.com>
Cc: Vitaly Mayatskikh <vmayatsk@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Oleg Nesterov authored
eligible_child() has a single caller, wait_consider_task(). We can move
security_task_wait() out from eligible_child(); this allows us to use it
for the filtered wake_up().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Roland McGrath <roland@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Ratan Nalumasu <rnalumasu@gmail.com>
Cc: Vitaly Mayatskikh <vmayatsk@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Oleg Nesterov authored
Test case:

    static void *tfunc(void *arg)
    {
        int pid = (long)arg;

        assert(ptrace(PTRACE_ATTACH, pid, NULL, NULL) == 0);
        kill(pid, SIGKILL);
        sleep(1);
        return NULL;
    }

    int main(void)
    {
        pthread_t th;
        long pid = fork();

        if (!pid)
            pause();

        signal(SIGCHLD, SIG_IGN);
        assert(pthread_create(&th, NULL, tfunc, (void *)pid) == 0);

        int r = waitpid(-1, NULL, __WNOTHREAD);
        printf("waitpid: %d %m\n", r);

        return 0;
    }

Before the patch this program hangs; after this patch waitpid() correctly
fails with errno == ECHILD.

The problem is that __ptrace_detach() reaps the EXIT_ZOMBIE tracee if its
->real_parent is our sub-thread and we ignore SIGCHLD. But in this case
we should wake up other threads which can sleep in do_wait().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Roland McGrath <roland@redhat.com>
Cc: Vitaly Mayatskikh <vmayatsk@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
- 14 Aug, 2009 1 commit
Balbir Singh authored
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
- 11 Aug, 2009 1 commit
Andrew Morton authored
#50: FILE: mm/memcontrol.c:485:
+	int val = (charge)? 1 : -1;
	                  ^

total: 1 errors, 0 warnings, 171 lines checked

./patches/memcg-improve-resource-counter-scalability.patch has style problems, please review.

If any of these errors are false positives report them to the maintainer, see CHECKPATCH in MAINTAINERS.

Please run checkpatch prior to sending patches

Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
- 14 Aug, 2009 1 commit
Balbir Singh authored
Reduce the resource counter overhead (mostly spinlock) associated with the
root cgroup. This is a part of the several patches to reduce mem cgroup
overhead. I had posted other approaches earlier (including using percpu
counters). Those patches will be a natural addition and will be added
iteratively on top of these.

The patch stops resource counter accounting for the root cgroup. The data
for display is derived from the statistics we maintain via
mem_cgroup_charge_statistics() (which is more scalable). What happens
today is that we do double accounting, once using res_counter_charge() and
once using mem_cgroup_charge_statistics(). For the root, since we don't
implement limits any more, we don't need to track every charge via
res_counter_charge() and check for limit being exceeded and reclaim.

The main mem->res usage_in_bytes can be derived by summing the cache and
rss usage data from memory statistics (MEM_CGROUP_STAT_RSS and
MEM_CGROUP_STAT_CACHE). However, for memsw->res usage_in_bytes, we need
additional data about swapped-out memory. This patch adds a
MEM_CGROUP_STAT_SWAPOUT and uses that along with MEM_CGROUP_STAT_RSS and
MEM_CGROUP_STAT_CACHE to derive the memsw data. This data is computed
recursively when hierarchy is enabled.

The test results I see on a 24-way machine show that:

1. The lock contention disappears from /proc/lock_stats
2. The results of the test are comparable to running with
   cgroup_disable=memory

Here is a sample of my program runs.

Without patch:

    Performance counter stats for '/home/balbir/parallel_pagefault':

    7192804.124144  task-clock-msecs    #     23.937 CPUs
            424691  context-switches    #      0.000 M/sec
               267  CPU-migrations      #      0.000 M/sec
          28498113  page-faults         #      0.004 M/sec
     5826093739340  cycles              #    809.989 M/sec
      408883496292  instructions        #      0.070 IPC
        7057079452  cache-references    #      0.981 M/sec
        3036086243  cache-misses        #      0.422 M/sec

     300.485365680  seconds time elapsed

With cgroup_disable=memory:

    Performance counter stats for '/home/balbir/parallel_pagefault':

    7182183.546587  task-clock-msecs    #     23.915 CPUs
            425458  context-switches    #      0.000 M/sec
               203  CPU-migrations      #      0.000 M/sec
          92545093  page-faults         #      0.013 M/sec
     6034363609986  cycles              #    840.185 M/sec
      437204346785  instructions        #      0.072 IPC
        6636073192  cache-references    #      0.924 M/sec
        2358117732  cache-misses        #      0.328 M/sec

     300.320905827  seconds time elapsed

With this patch applied:

    Performance counter stats for '/home/balbir/parallel_pagefault':

    7191619.223977  task-clock-msecs    #     23.955 CPUs
            422579  context-switches    #      0.000 M/sec
                88  CPU-migrations      #      0.000 M/sec
          91946060  page-faults         #      0.013 M/sec
     5957054385619  cycles              #    828.333 M/sec
     1058117350365  instructions        #      0.178 IPC
        9161776218  cache-references    #      1.274 M/sec
        1920494280  cache-misses        #      0.267 M/sec

     300.218764862  seconds time elapsed

Data from Prarit (kernel compile with make -j64 on a 64 CPU/32G machine),
for a single run:

Without patch

    real 27m8.988s
    user 87m24.916s
    sys 382m6.037s

With patch

    real 4m18.607s
    user 84m58.943s
    sys 50m52.682s

With config turned off

    real 4m54.972s
    user 90m13.456s
    sys 50m19.711s

NOTE: The data looks counterintuitive due to the increased performance
with the patch, even over the config being turned off. We probably need
more runs, but so far all testing has shown that the patches definitely
help.

Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
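Illustratively, written as if inside mm/memcontrol.c, the derivation might look like this (root_mem_cgroup_usage() is a hypothetical helper; the patch folds the logic into the existing read paths):

    /* Sum the maintained statistics instead of reading res_counter. */
    static u64 root_mem_cgroup_usage(struct mem_cgroup *mem, bool swap)
    {
        s64 pages;

        pages  = mem_cgroup_read_stat(&mem->stat, MEM_CGROUP_STAT_CACHE);
        pages += mem_cgroup_read_stat(&mem->stat, MEM_CGROUP_STAT_RSS);
        if (swap)   /* memsw additionally counts swapped-out pages */
            pages += mem_cgroup_read_stat(&mem->stat,
                                          MEM_CGROUP_STAT_SWAPOUT);

        return (u64)pages << PAGE_SHIFT;
    }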
- 31 Jul, 2009 1 commit
Andrew Morton authored
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
- 21 Jul, 2009 2 commits
Balbir Singh authored
Permit reclaim from memory cgroups on contention (via the direct reclaim
path). Memory cgroup soft limit reclaim finds the group that exceeds its
soft limit by the largest number of pages, reclaims pages from it, and
then reinserts the cgroup into its correct place in the rbtree.

Add additional checks to mem_cgroup_hierarchical_reclaim() to detect long
loops in case all swap is turned off. The code has been refactored and
the loop check (loop < 2) has been enhanced for soft limits. For soft
limits, we try to do more targeted reclaim. Instead of bailing out after
two loops, the routine now reclaims memory proportional to the size by
which the soft limit is exceeded. The proportion has been empirically
determined.

Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Balbir Singh authored
Refactor the arguments passed to mem_cgroup_hierarchical_reclaim() into
flags, so that new parameters don't have to be passed as we make the
reclaim routine more flexible.

Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
- 04 Aug, 2009 1 commit
Hugh Dickins authored
Kernel panic - not syncing: No init found; after lots of "scheduling while
atomic" warnings, starting from when async_thread does sd_probe_async.

mem_cgroup_soft_limit_check() was doing an unbalanced get_cpu(): don't
get_cpu if we won't need it, and put_cpu if we did get_cpu. And fix the
silliness of passing it an "over_soft_limit" argument that just tells it
to return false when false.

Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Tested-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Reviewed-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reported-by: Jiri Slaby <jirislaby@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
- 21 Jul, 2009 1 commit
Balbir Singh authored
Introduce an RB-Tree for storing memory cgroups that are over their soft
limit. The overall goal is to:

1. Add a memory cgroup to the RB-Tree when the soft limit is exceeded.
   We are careful about updates; updates take place only after a
   particular time interval has passed.
2. Remove the node from the RB-Tree when the usage goes below the soft
   limit.

The next set of patches will exploit the RB-Tree to get the group that is
over its soft limit by the largest amount and reclaim from it, when we
face memory contention.

Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
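A simplified sketch of the sorted insert, with structure names invented for illustration (the patch keys the tree by how far usage exceeds the soft limit):

    #include <linux/rbtree.h>
    #include <linux/spinlock.h>

    /* Simplified: one tree of memcg nodes sorted by soft-limit excess. */
    struct soft_limit_tree {
        struct rb_root root;
        spinlock_t lock;
    };

    struct memcg_node {
        struct rb_node node;
        unsigned long long usage_in_excess;   /* usage - soft_limit */
    };

    static void soft_limit_insert(struct soft_limit_tree *tree,
                                  struct memcg_node *mz)
    {
        struct rb_node **p = &tree->root.rb_node;
        struct rb_node *parent = NULL;
        struct memcg_node *mz_node;

        spin_lock(&tree->lock);
        while (*p) {
            parent = *p;
            mz_node = rb_entry(parent, struct memcg_node, node);
            /* Keep the tree ordered by how far over the limit we are. */
            if (mz->usage_in_excess < mz_node->usage_in_excess)
                p = &(*p)->rb_left;
            else
                p = &(*p)->rb_right;
        }
        rb_link_node(&mz->node, parent, p);
        rb_insert_color(&mz->node, &tree->root);
        spin_unlock(&tree->lock);
    }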