Commits · 985a83a72539662d96193bb5477ebf94db2c6e1c · linux / linux-davinci

06 Nov, 2009 10 commits

v2.6.31.5-rt17 · 985a83a7

Thomas Gleixner authored Nov 06, 2009

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

985a83a7

Merge branch 'rt/head' into rt/2.6.31 · f846ed7a
Thomas Gleixner authored Nov 06, 2009

f846ed7a

ftrace: Add latency histograms of missed timer offsets · d2ac742d

Carsten Emde authored Nov 02, 2009

Delayed timer interrupts are a source of system latency not yet
considered in the histograms of effective latencies. Such latencies
are mainly due to disabled interrupts. Recording effective latencies
makes it possible to continuously monitor a system's real-time
capabilities under real-world conditions.

This patch adds latency histograms of missed timer offsets. If the
timer belongs to a sleeper that is about to wake up a task and the
latency is higher than previous latencies of such timers, some data
of this task are recorded as well.
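
A minimal user-space sketch of the idea, not the kernel implementation: a missed timer offset is the time between a timer's programmed expiry and the moment it actually fired, counted into one-microsecond buckets. The struct layout, the bucket count and the name `hist_record_missed_offset` are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HIST_BUCKETS 1024	/* hypothetical: one bucket per microsecond */

struct latency_hist {
	uint64_t bucket[HIST_BUCKETS];	/* samples with that offset in us */
	uint64_t overflow;		/* samples of HIST_BUCKETS us or more */
	int64_t max_us;			/* largest offset that had a wakee */
	int max_pid;			/* pid of that wakee, -1 if none yet */
};

/*
 * Record how late a timer fired. wakee_pid is the task the timer is
 * about to wake, or -1 if the timer does not belong to a sleeper.
 * Returns 1 when the sample set a new maximum and the task was stored.
 */
static int hist_record_missed_offset(struct latency_hist *h,
				     int64_t expected_ns, int64_t actual_ns,
				     int wakee_pid)
{
	int64_t late_us;

	assert(h != NULL);
	late_us = (actual_ns - expected_ns) / 1000;
	if (late_us < 0)
		late_us = 0;	/* fired early or on time: nothing was missed */

	if (late_us < HIST_BUCKETS)
		h->bucket[late_us]++;
	else
		h->overflow++;

	if (wakee_pid >= 0 && late_us > h->max_us) {
		h->max_us = late_us;
		h->max_pid = wakee_pid;
		return 1;
	}
	return 0;
}
```

Initialising `max_us` and `max_pid` to -1 lets the first sleeper-owned timer set the maximum, even when its offset is zero.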

Adapted and expanded Documentation/trace/histograms.txt.
Signed-off-by: Carsten Emde <C.Emde@osadl.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

d2ac742d

ftrace: Consider shared max priority in latency histograms · 16731e6f

Carsten Emde authored Oct 26, 2009

The algorithm used so far to trace the process with the highest priority
requires that no other processes with the same priority are being woken
up simultaneously. Otherwise, a process with a lower priority may be
picked up for tracing which leads to an erroneously high latency value.

Generally, the wakeup latency of a process that exclusively uses the
highest priority of the system is due to software or hardware issues we
would like to solve or, at least, keep as small as possible. This is
what latency measurements are made for, after all. The wakeup latency of
a process that shares the highest priority of the system with other
processes is quite another story. It may contain the worst-case runtime
durations of the other processes; thus, it is the result of the priority
design of a given system and nothing a kernel developer or hardware
engineer may want to fix.

That said, we need to separately record latencies i) of processes that
exclusively use the highest priority of the system and ii) of processes
that share the highest priority of the system with other processes.
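
The split described above can be sketched as two separate maxima. This is an illustrative model only: `struct wakeup_stats`, `record_wakeup` and the `same_prio_waking` input (how many other tasks of equal priority are being woken at the same time) are hypothetical names, and how the kernel actually detects sharing is not shown.

```c
#include <assert.h>
#include <stddef.h>

struct wakeup_stats {
	long max_exclusive_us;	/* wakee was alone at the highest priority */
	long max_shared_us;	/* wakee shared the highest priority */
};

/* Account one wakeup latency of a highest-priority task. */
static void record_wakeup(struct wakeup_stats *s, long latency_us,
			  int same_prio_waking)
{
	long *slot;

	assert(s != NULL && same_prio_waking >= 0);
	/* Pick the maximum that matches the priority situation. */
	slot = same_prio_waking ? &s->max_shared_us : &s->max_exclusive_us;
	if (latency_us > *slot)
		*slot = latency_us;
}
```

With this separation, a large value in the shared slot points at the priority design of the system, while the exclusive slot keeps reflecting kernel and hardware latency.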

The above-mentioned shortcoming of the tracing algorithm also applies to
the variable tracing_max_latency that the wakeup latency tracer uses,
since it is based on the same procedure as the original version of the
latency histogram. In consequence, if several processes share the
highest priority of the system, the variable tracing_max_latency may
contain erroneously high values. We could now patch the wakeup latency
tracer as well and separately record the various latencies, but it is
better to document this behavior and recommend the latency histograms to
reliably determine a system's worst-case wakeup latency.

Simplified and cleaned up a bit. Added some more help info to Kconfig.
Signed-off-by: Carsten Emde <C.Emde@osadl.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

16731e6f

hwlat: Fix Kconfig and check kthread_stop result · d9a4a1d0

Jon Masters authored Oct 22, 2009

Signed-off-by: Jon Masters <jcm@jonmasters.org>
Signed-off-by: John Kacur <jkacur@redhat.com>
Cc: Jon Masters <jcm@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

d9a4a1d0

netlink: fix typo in initialization · 55443288

Jiri Pirko authored Oct 08, 2009

Commit 9ef1d4c7 ("[NETLINK]: Missing
initializations in dumped data") introduced a typo in
initialization. This patch fixes this.
Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

55443288

drm/r128: Add test for initialisation to all ioctls that require it · e8b444f5

Ben Hutchings authored Aug 23, 2009

Almost all of r128's private ioctls require that the CCE state has
already been initialised.  However, most do not test that this has
been done, and will proceed to dereference a null pointer.  This may
result in a security vulnerability, since some ioctls are
unprivileged.

This adds a macro for the common initialisation test and changes all
ioctl implementations that require prior initialisation to use that
macro.
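
A sketch of what such a common test macro could look like, written as a user-space model. The macro name `INIT_TEST_WITH_RETURN`, the stand-in structs and the choice of `-EINVAL` are assumptions for illustration and are not taken from the driver.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct dev_priv { int ring_ready; };		/* stand-in for the private state */
struct device { struct dev_priv *priv; };	/* priv is NULL until initialised */

/* Hypothetical macro: leave the calling ioctl early if init never ran. */
#define INIT_TEST_WITH_RETURN(dev)		\
	do {					\
		if ((dev)->priv == NULL)	\
			return -EINVAL;		\
	} while (0)

static int example_ioctl(struct device *dev)
{
	assert(dev != NULL);
	INIT_TEST_WITH_RETURN(dev);
	return dev->priv->ring_ready;	/* priv is known to be non-NULL here */
}
```

Wrapping the test in `do { ... } while (0)` keeps the macro safe to use as a single statement, for example inside an unbraced `if`.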

Also, r128_do_init_cce() does not test that the CCE state has not
been initialised already.  Repeated initialisation may lead to a crash
or resource leak.  This adds that test.
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Dave Airlie <airlied@redhat.com>

e8b444f5

AF_UNIX: Fix deadlock on connecting to shutdown socket · 287c11d0

Tomoki Sekiyama authored Oct 18, 2009

I found a deadlock bug in UNIX domain sockets, which allows non-root
users to mount a DoS attack against the local machine.

How to reproduce:
 1. Make a listening AF_UNIX/SOCK_STREAM socket with an abstract
    namespace(*), and shutdown(2) it.
 2. Repeat connect(2)ing to the listening socket from the other sockets
    until the connection backlog is full.
 3. connect(2) takes the CPU forever. If every core is taken, the
    system hangs.

PoC code: (Run as many times as cores on SMP machines.)

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <sys/un.h>

int main(void)
{
	int ret;
	int csd;
	int lsd;
	struct sockaddr_un sun;

	/* make an abstract name address (*) */
	memset(&sun, 0, sizeof(sun));
	sun.sun_family = AF_UNIX;
	sprintf(&sun.sun_path[1], "%d", getpid());

	/* create the listening socket and shut it down */
	lsd = socket(AF_UNIX, SOCK_STREAM, 0);
	bind(lsd, (struct sockaddr *)&sun, sizeof(sun));
	listen(lsd, 1);
	shutdown(lsd, SHUT_RDWR);

	/* connect loop */
	alarm(15); /* forcibly exit the loop after 15 sec */
	for (;;) {
		csd = socket(AF_UNIX, SOCK_STREAM, 0);
		ret = connect(csd, (struct sockaddr *)&sun, sizeof(sun));
		if (-1 == ret) {
			perror("connect()");
			break;
		}
		puts("Connection OK");
	}
	return 0;
}

(*) Make sun_path[0] = 0 to use the abstract namespace.
    If a file-based socket is used, the system doesn't deadlock because
    of context switches in the file system layer.

Why this happens:
 Error checks between unix_socket_connect() and unix_wait_for_peer() are
 inconsistent. The former calls the latter to wait until the backlog is
 processed. Although the latter returns without doing anything when the
 socket is shutdown, the former doesn't check the shutdown state and
 just retries calling the latter forever.
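
The retry loop can be modelled in user space as follows. This is a simplified sketch, not the kernel code: `struct listener`, `wait_for_peer`, `model_connect` and the `max_spins` bound (which stands in for "spins forever") are hypothetical, and the `check_shutdown` flag switches on a shutdown test in the connect path.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct listener {
	bool shutdown;		/* shutdown(2) was called on the listener */
	int backlog;		/* pending connections */
	int max_backlog;
};

/* Model of the wait helper: returns at once if the socket is shut down. */
static void wait_for_peer(struct listener *l)
{
	if (l->shutdown)
		return;
	l->backlog = 0;		/* pretend the listener accepted everything */
}

/* Model of the connect path; -EAGAIN here means the spin bound was hit. */
static int model_connect(struct listener *l, bool check_shutdown, int max_spins)
{
	assert(l != NULL);
	for (int spins = 0; spins < max_spins; spins++) {
		if (check_shutdown && l->shutdown)
			return -ECONNREFUSED;	/* shutdown test in the caller */
		if (l->backlog < l->max_backlog) {
			l->backlog++;
			return 0;
		}
		wait_for_peer(l);	/* backlog full: wait, then retry */
	}
	return -EAGAIN;
}
```

Without the check, a shut-down listener with a full backlog never leaves the loop, because the wait helper returns immediately and nothing ever drains the backlog.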

Patch:
 The patch below adds shutdown check into unix_socket_connect(), so
 connect(2) to the shutdown socket will return -ECONNREFUSED.
Signed-off-by: Tomoki Sekiyama <tomoki.sekiyama.qu@hitachi.com>
Signed-off-by: Masanori Yoshida <masanori.yoshida.tv@hitachi.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

287c11d0

KEYS: get_instantiation_keyring() should inc the keyring refcount in all cases · 83001da0

David Howells authored Oct 15, 2009

The destination keyring specified to request_key() and co. is made available to
the process that instantiates the key (the slave process started by
/sbin/request-key typically).  This is passed in the request_key_auth struct as
the dest_keyring member.

keyctl_instantiate_key() and keyctl_negate_key() call get_instantiation_keyring()
to get the keyring to attach the newly constructed key to at the end of
instantiation.  This may be given a specific keyring into which a link will be
made later, or it may be asked to find the keyring passed to request_key().  In
the former case, it returns a keyring with the refcount incremented by
lookup_user_key(); in the latter case, it returns the keyring from the
request_key_auth struct - and does _not_ increment the refcount.

The latter case will eventually result in an oops when the keyring prematurely
runs out of references and gets destroyed.  The effect may take some time to
show up as the key is destroyed lazily.

To fix this, the keyring returned by get_instantiation_keyring() must always
have its refcount incremented, no matter where it comes from.
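
The rule "always return with a reference held" can be illustrated with a small user-space model. All names here (`struct keyring`, `lookup_keyring`, `get_dest_keyring`, `put_keyring`) are hypothetical stand-ins and not the kernel's functions.

```c
#include <assert.h>
#include <stddef.h>

struct keyring { int refcount; };

/* Stand-in for an explicit lookup, which already takes a reference. */
static struct keyring *lookup_keyring(struct keyring *k)
{
	k->refcount++;
	return k;
}

/*
 * Return the destination keyring with a reference held in both cases:
 * an explicitly named keyring, or the one stored with the request.
 */
static struct keyring *get_dest_keyring(struct keyring *explicit_ring,
					struct keyring *auth_dest)
{
	if (explicit_ring != NULL)
		return lookup_keyring(explicit_ring);
	assert(auth_dest != NULL);
	auth_dest->refcount++;	/* the increment that was missing */
	return auth_dest;
}

/* The caller can now drop its reference unconditionally. */
static void put_keyring(struct keyring *k)
{
	assert(k->refcount > 0);
	k->refcount--;
}
```

Because both paths hand out a counted reference, the caller's single `put_keyring()` is balanced regardless of where the keyring came from.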

This can be tested by setting /etc/request-key.conf to:

#OP	TYPE	DESCRIPTION	CALLOUT INFO	PROGRAM ARG1 ARG2 ARG3 ...
#======	=======	===============	===============	===============================
create  *	test:*		*		|/bin/false %u %g %d %{user:_display}
negate	*	*		*		/bin/keyctl negate %k 10 @u

and then doing:

	keyctl add user _display aaaaaaaa @u
        while keyctl request2 user test:x test:x @u &&
        keyctl list @u;
        do
                keyctl request2 user test:x test:x @u;
                sleep 31;
                keyctl list @u;
        done

which will oops eventually.  Changing the negate line to have @u rather than
%S at the end is important as that forces the latter case by passing a special
keyring ID rather than an actual keyring ID.
Reported-by: Alexander Zangerl <az@bond.edu.au>
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Alexander Zangerl <az@bond.edu.au>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

83001da0

fs: pipe.c null pointer dereference · 070609e1

Earl Chew authored Oct 19, 2009

This patch fixes a NULL pointer dereference in pipe_rdwr_open() which
generates the stack trace:

> Unable to handle kernel NULL pointer dereference at 0000000000000028 RIP:
>  [<ffffffff802899a5>] pipe_rdwr_open+0x35/0x70
>  [<ffffffff8028125c>] __dentry_open+0x13c/0x230
>  [<ffffffff8028143d>] do_filp_open+0x2d/0x40
>  [<ffffffff802814aa>] do_sys_open+0x5a/0x100
>  [<ffffffff8021faf3>] sysenter_do_call+0x1b/0x67

The failure mode is triggered by an attempt to open an anonymous
pipe via /proc/pid/fd/* as exemplified by this script:

=============================================================
while : ; do
   { echo y ; sleep 1 ; } | { while read ; do echo z$REPLY; done ; } &
   PID=$!
   OUT=$(ps -efl | grep 'sleep 1' | grep -v grep |
        { read PID REST ; echo $PID; } )
   OUT="${OUT%% *}"
   DELAY=$((RANDOM * 1000 / 32768))
   usleep $((DELAY * 1000 + RANDOM % 1000 ))
   echo n > /proc/$OUT/fd/1                 # Trigger defect
done
=============================================================

Note that the failure window is quite small and I could only
reliably reproduce the defect by inserting a small delay
in pipe_rdwr_open(). For example:

 static int
 pipe_rdwr_open(struct inode *inode, struct file *filp)
 {
       msleep(100);
       mutex_lock(&inode->i_mutex);

Although the defect was observed in pipe_rdwr_open(), I think it
makes sense to replicate the change through all the pipe_*_open()
functions.

The core of the change is to verify that inode->i_pipe has not
been released before attempting to manipulate it. If inode->i_pipe
is no longer present, return ENOENT to indicate so.
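
In outline, the check looks like the following user-space model. The `locked` flag merely stands in for inode->i_mutex, and `struct inode_model` and `model_pipe_rdwr_open` are hypothetical names; only the shape (test the pointer under the lock, return -ENOENT if it is gone) follows the description above.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct pipe_info { int readers, writers; };

struct inode_model {
	int locked;			/* stand-in for inode->i_mutex */
	struct pipe_info *i_pipe;	/* NULL once the pipe was released */
};

static int model_pipe_rdwr_open(struct inode_model *inode)
{
	int ret = -ENOENT;

	assert(inode != NULL && !inode->locked);
	inode->locked = 1;		/* take the lock before looking at i_pipe */
	if (inode->i_pipe != NULL) {	/* pipe still present: safe to touch */
		inode->i_pipe->readers++;
		inode->i_pipe->writers++;
		ret = 0;
	}
	inode->locked = 0;		/* release the lock on every path */
	return ret;
}
```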

The comment about potentially using atomic_t for i_pipe->readers
and i_pipe->writers has also been removed because it is no longer
relevant in this context. The inode->i_mutex lock must be used so
that inode->i_pipe can be dealt with correctly.
Signed-off-by: Earl Chew <earl_chew@agilent.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

070609e1

29 Oct, 2009 19 commits

v2.6.31.5-rt16 · 7b724e6c

Thomas Gleixner authored Oct 29, 2009

Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-2.6.31.y into rt/2.6.31

Conflicts:
	Makefile
	kernel/futex.c
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

7b724e6c

Merge branch 'rt/head' into rt/2.6.31 · 346bf891
Thomas Gleixner authored Oct 29, 2009

346bf891

sched: Fix dynamic power-balancing crash · 465a3c40

Dinakar Guniguntala authored Oct 22, 2009

This crash:
    
[ 1774.088275] divide error: 0000 [#1] SMP
[ 1774.100355] CPU 13
[ 1774.102498] Modules linked in:
[ 1774.105631] Pid: 30881, comm: hackbench Not tainted 2.6.31-rc8-tip-01308-g484d664-dirty #1629 X8DTN
[ 1774.114807] RIP: 0010:[<ffffffff81041c38>]  [<ffffffff81041c38>] sched_balance_self+0x19b/0x2d4
    
Triggers because update_group_power() modifies the sd tree and does
temporary calculations there - not considering that other CPUs
could observe intermediate values, such as the zero initial value.
    
Calculate it in a temporary variable instead. (we need no memory
barrier as these are all statistical values anyway)
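
The difference between the two shapes can be sketched in plain C. This is a single-threaded illustration with hypothetical names (`struct group`, `update_power_in_place`, `update_power_via_temp`); it shows where the intermediate zero appears, not the race itself.

```c
#include <assert.h>
#include <stddef.h>

/* The field is read by other CPUs without taking a lock. */
struct group { volatile unsigned long power; };

/* Racy shape: the shared field passes through 0 while it is summed. */
static void update_power_in_place(struct group *g,
				  const unsigned long *cpu, size_t n)
{
	g->power = 0;
	for (size_t i = 0; i < n; i++)
		g->power += cpu[i];
}

/* Fixed shape: sum in a local variable, publish with a single store. */
static void update_power_via_temp(struct group *g,
				  const unsigned long *cpu, size_t n)
{
	unsigned long power = 0;

	assert(g != NULL && cpu != NULL);
	for (size_t i = 0; i < n; i++)
		power += cpu[i];
	g->power = power;
}
```

Both functions end with the same value. Only the first one lets a concurrent reader see the zero, and then divide by it.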
    
Got the same oops with the backport to -rt
Signed-off-by: Dinakar Guniguntala <dino@in.ibm.com>
Cc: John Stultz <johnstul@us.ibm.com>
Cc: Darren Hart <dvhltc@us.ibm.com>
Cc: John Kacur <jkacur@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

465a3c40

sched: Deal with low-load in wake_affine() · 1e2ea8a1

Peter Zijlstra authored Oct 22, 2009

wake_affine() would always fail under low-load situations where
both prev and this were idle, because adding a single task will
always be a significant imbalance, even if there's nothing
around that could balance it.
    
Deal with this by allowing imbalance when there's nothing you
can do about it.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Dinakar Guniguntala <dino@in.ibm.com>
Cc: John Stultz <johnstul@us.ibm.com>
Cc: Darren Hart <dvhltc@us.ibm.com>
Cc: John Kacur <jkacur@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

1e2ea8a1

sched: Add a missing = · cddfcb89

Dinakar Guniguntala authored Oct 22, 2009

Signed-off-by: Dinakar Guniguntala <dino@in.ibm.com>
Cc: John Stultz <johnstul@us.ibm.com>
Cc: Darren Hart <dvhltc@us.ibm.com>
Cc: John Kacur <jkacur@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

cddfcb89

sched: cleanup wake_idle · 7332e524

Peter Zijlstra authored Oct 22, 2009

A more readable version, with a few differences:

 - don't check against the root domain, but instead check
   SD_LOAD_BALANCE

 - don't re-iterate the cpus already iterated on the previous SD

 - use rcu_read_lock() around the sd iteration
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Dinakar Guniguntala <dino@in.ibm.com>
Cc: John Stultz <johnstul@us.ibm.com>
Cc: Darren Hart <dvhltc@us.ibm.com>
Cc: John Kacur <jkacur@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

7332e524

sched: cleanup wake_idle power saving · 674134b9

Peter Zijlstra authored Oct 22, 2009

Hopefully a more readable version of the same.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Dinakar Guniguntala <dino@in.ibm.com>
Cc: John Stultz <johnstul@us.ibm.com>
Cc: Darren Hart <dvhltc@us.ibm.com>
Cc: John Kacur <jkacur@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

674134b9

x86: sched: provide arch implementations using aperf/mperf · d65d153b

Peter Zijlstra authored Oct 22, 2009

APERF/MPERF support for cpu_power.

APERF/MPERF is arch defined to be a relative scale of work capacity
per logical cpu, this is assumed to include SMT and Turbo mode.
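
As a rough sketch of how such a relative scale can be applied: the ratio of the APERF delta to the MPERF delta between two samples is turned into a fixed-point factor and multiplied into a nominal power value. The function name `scale_power` and the 10-bit shift are assumptions for illustration, and counter wrap or reset is not handled here.

```c
#include <assert.h>
#include <stdint.h>

#define SCALE_SHIFT 10	/* hypothetical fixed point: 1024 means ratio 1.0 */

/* Scale power by delta(APERF) / delta(MPERF) between two samples. */
static uint64_t scale_power(uint64_t power,
			    uint64_t aperf_old, uint64_t aperf_new,
			    uint64_t mperf_old, uint64_t mperf_new)
{
	uint64_t d_aperf = aperf_new - aperf_old;
	uint64_t d_mperf = mperf_new - mperf_old;
	uint64_t ratio;

	if (d_mperf == 0)
		return power;	/* no reference cycles elapsed: leave unscaled */
	assert(d_aperf < (UINT64_C(1) << 54));	/* the shift must not overflow */
	ratio = (d_aperf << SCALE_SHIFT) / d_mperf;
	return (power * ratio) >> SCALE_SHIFT;
}
```

For example, a CPU that delivered 500 actual cycles per 1000 reference cycles would have a nominal power of 1024 scaled to 512.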

APERF/MPERF are specified to both reset to 0 when either counter
wraps, which is highly inconvenient, since that'll give a blip when
that happens. The manual specifies writing 0 to the counters after
each read, but that's 1) too expensive, and 2) destroys the
possibility of sharing these counters with other users, so we live
with the blip - the other existing user does too.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Dinakar Guniguntala <dino@in.ibm.com>
Cc: John Stultz <johnstul@us.ibm.com>
Cc: Darren Hart <dvhltc@us.ibm.com>
Cc: John Kacur <jkacur@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

d65d153b

Provide an arch specific hook for cpufreq based scaling of cpu_power. · 02644174

Peter Zijlstra authored Oct 22, 2009

[ dino: backport to 31-rt ]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Dinakar Guniguntala <dino@in.ibm.com>
Cc: John Stultz <johnstul@us.ibm.com>
Cc: Darren Hart <dvhltc@us.ibm.com>
Cc: John Kacur <jkacur@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

02644174

x86: Add generic aperf/mperf code · 9c8871b3

Dinakar Guniguntala authored Oct 22, 2009

Move some of the aperf/mperf code out from the cpufreq driver
thingy so that other people can enjoy it too.
Signed-off-by: Dinakar Guniguntala <dino@in.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

9c8871b3

x86: move APERF/MPERF into a X86_FEATURE · 1755a782

Peter Zijlstra authored Oct 22, 2009

Move the APERFMPERF capability into a X86_FEATURE flag so that it can
be used outside of the acpi cpufreq driver.

[ dino: backport to 31-rt ]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Dinakar Guniguntala <dino@in.ibm.com>
Cc: John Stultz <johnstul@us.ibm.com>
Cc: Darren Hart <dvhltc@us.ibm.com>
Cc: John Kacur <jkacur@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

1755a782

sched: remove reciprocal for cpu_power · f7cf1cdd

Peter Zijlstra authored Oct 22, 2009

It's a source of failure; also, now that cpu_power is dynamic, it's a
waste of time.

before:
<idle>-0   [000]   132.877936: find_busiest_group: avg_load: 0 group_load: 8241 power: 1 

after:
bash-1689  [001]   137.862151: find_busiest_group: avg_load: 10636288 group_load: 10387 power: 1
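
One plausible reading of the trace above, offered as a sketch under assumptions: if the reciprocal is a 32-bit value of the form ceil(2^32 / k), then for k = 1 it truncates to 0 and every reciprocal division yields 0, while a plain division gives group_load * 1024 / power. The helper names, the 32-bit reciprocal form and the load scale of 1024 are assumptions and are not taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

#define LOAD_SCALE 1024u	/* assumed scheduler load scale */

/* 32-bit reciprocal: ceil(2^32 / k), truncated to 32 bits. */
static uint32_t reciprocal_value32(uint32_t k)
{
	assert(k != 0);
	return (uint32_t)(((UINT64_C(1) << 32) + k - 1) / k);
}

/* a / k approximated as (a * r) >> 32, with r = reciprocal_value32(k). */
static uint32_t reciprocal_divide32(uint32_t a, uint32_t r)
{
	return (uint32_t)(((uint64_t)a * r) >> 32);
}

static uint32_t avg_load_reciprocal(uint32_t group_load, uint32_t power)
{
	return reciprocal_divide32(group_load * LOAD_SCALE,
				   reciprocal_value32(power));
}

static uint32_t avg_load_plain(uint32_t group_load, uint32_t power)
{
	return group_load * LOAD_SCALE / power;
}
```

Under these assumptions, `avg_load_reciprocal(8241, 1)` is 0 and `avg_load_plain(10387, 1)` is 10636288, which matches the "before" and "after" lines of the trace.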

[ dino: backport to 31-rt ]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: John Stultz <johnstul@us.ibm.com>
Cc: Darren Hart <dvhltc@us.ibm.com>
Cc: John Kacur <jkacur@redhat.com>
[andreas.herrmann3@amd.com: remove include]
Signed-off-by: Dinakar Guniguntala <dino@in.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

f7cf1cdd

sched: try to deal with low capacity · e9265e74

Peter Zijlstra authored Oct 22, 2009

When the capacity drops low, we want to migrate load away. Allow the
load-balancer to remove all tasks when we hit rock bottom.

[ dino: backport to 31-rt ]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: John Stultz <johnstul@us.ibm.com>
Cc: Darren Hart <dvhltc@us.ibm.com>
Cc: John Kacur <jkacur@redhat.com>
[ego@in.ibm.com: fix to update_sd_power_savings_stats]
Signed-off-by: Dinakar Guniguntala <dino@in.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

e9265e74

sched: scale down cpu_power due to RT tasks · 82e5a1bb

Peter Zijlstra authored Oct 22, 2009

Keep an average on the amount of time spent on RT tasks and use that
fraction to scale down the cpu_power for regular tasks.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Dinakar Guniguntala <dino@in.ibm.com>
Cc: John Stultz <johnstul@us.ibm.com>
Cc: Darren Hart <dvhltc@us.ibm.com>
Cc: John Kacur <jkacur@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

82e5a1bb

sched: dynamic cpu_power · 56e91477

Peter Zijlstra authored Oct 22, 2009

Recompute the cpu_power for each cpu during load-balance.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Dinakar Guniguntala <dino@in.ibm.com>
Cc: John Stultz <johnstul@us.ibm.com>
Cc: Darren Hart <dvhltc@us.ibm.com>
Cc: John Kacur <jkacur@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

56e91477

sched: add smt_gain · 581d1c64

Peter Zijlstra authored Oct 22, 2009

The idea is that multi-threading a core yields more work capacity than
a single thread; provide a way to express a static gain for threads.
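
A small sketch of what a static gain could mean numerically: the capacity of a core is multiplied by a fixed-point gain and then split evenly across its hardware threads. The function `thread_power`, the scale of 1024 and the sample gain of 1178 (about 15% above 1.0) are illustrative assumptions.

```c
#include <assert.h>

#define LOAD_SCALE 1024UL	/* assumed fixed point: 1024 means 1.0 */

/*
 * Capacity of one hardware thread: the core's capacity times a static
 * gain (in LOAD_SCALE units), divided among the sibling threads.
 */
static unsigned long thread_power(unsigned long core_power,
				  unsigned long smt_gain,
				  unsigned int siblings)
{
	assert(siblings > 0);
	return core_power * smt_gain / LOAD_SCALE / siblings;
}
```

With a gain of 1178 and two siblings, each thread gets 589, so the two threads together represent slightly more capacity than a single-threaded core at 1024.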

[ dino: backport to 31-rt ]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Dinakar Guniguntala <dino@in.ibm.com>
Cc: John Stultz <johnstul@us.ibm.com>
Cc: Darren Hart <dvhltc@us.ibm.com>
Cc: John Kacur <jkacur@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

581d1c64

sched: update the cpu_power sum during load-balance · 337946f8

Peter Zijlstra authored Oct 22, 2009

In order to prepare for a more dynamic cpu_power, update the group sum
while walking the sched domains during load-balance.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Dinakar Guniguntala <dino@in.ibm.com>
Cc: John Stultz <johnstul@us.ibm.com>
Cc: Darren Hart <dvhltc@us.ibm.com>
Cc: John Kacur <jkacur@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

337946f8

sched: SD_PREFER_SIBLING · 1724d42a

Peter Zijlstra authored Oct 22, 2009

Do the placement thing using SD flags

XXX: consider degenerate bits

[ dino: backport to 31-rt ]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Dinakar Guniguntala <dino@in.ibm.com>
Cc: John Stultz <johnstul@us.ibm.com>
Cc: Darren Hart <dvhltc@us.ibm.com>
Cc: John Kacur <jkacur@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

1724d42a

sched: restore __cpu_power to a straight sum of power · dac3518c

Peter Zijlstra authored Oct 22, 2009

cpu_power is supposed to be a representation of the process capacity
of the cpu, not a value to randomly tweak in order to affect
placement.

Remove the placement hacks.

[ dino: backport to 31-rt ]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Dinakar Guniguntala <dino@in.ibm.com>
Cc: John Stultz <johnstul@us.ibm.com>
Cc: Darren Hart <dvhltc@us.ibm.com>
Cc: John Kacur <jkacur@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

dac3518c

28 Oct, 2009 11 commits

futex: Correct queue_me and unqueue_me commentary · cd18252d

Darren Hart authored Sep 21, 2009

The queue_me/unqueue_me commentary is oddly placed and out of date.
Clean it up and correct the inaccurate bits.
Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Dinakar Guniguntala <dino@in.ibm.com>
Cc: John Stultz <johnstul@us.ibm.com>
LKML-Reference: <20090922053015.8717.71713.stgit@Aeon>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

cd18252d

futex: Move drop_futex_key_refs out of spinlock'ed region · b9e40b50

Darren Hart authored Oct 15, 2009

When requeuing tasks from one futex to another, the reference held
by the requeued task to the original futex location needs to be
dropped eventually.

Dropping the reference may ultimately lead to a call to
"iput_final" and subsequently call into filesystem-specific code -
which may be non-atomic.

It is therefore safer to defer this drop operation until after the
futex_hash_bucket spinlock has been dropped.
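
The deferral pattern can be sketched as: count the references to drop while the lock is held, and perform the drops only after unlocking. This is a user-space model with hypothetical names (`struct bucket`, `struct key`, `drop_ref`, `requeue_deferred`); the `locked` flag stands in for the hash-bucket spinlock.

```c
#include <assert.h>
#include <stddef.h>

struct bucket { int locked; };	/* stand-in for the hash-bucket spinlock */
struct key { int refs; int dropped_under_lock; };

static void drop_ref(struct key *k, const struct bucket *b)
{
	assert(k->refs > 0);
	if (b->locked)
		k->dropped_under_lock = 1;	/* unsafe if the drop may sleep */
	k->refs--;
}

/* Requeue n waiters: count drops while locked, perform them afterwards. */
static void requeue_deferred(struct bucket *b, struct key *k, int n)
{
	int pending = 0;

	b->locked = 1;
	for (int i = 0; i < n; i++)
		pending++;	/* waiter moved: one reference to drop later */
	b->locked = 0;

	while (pending-- > 0)
		drop_ref(k, b);	/* now outside the locked region */
}
```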

Originally-From: Helge Bahmann <hcb@chaoticmind.net>
Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
Cc: <stable@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Dinakar Guniguntala <dino@in.ibm.com>
Cc: John Stultz <johnstul@linux.vnet.ibm.com>
Cc: Sven-Thorsten Dietrich <sdietrich@novell.com>
Cc: John Kacur <jkacur@redhat.com>
LKML-Reference: <4AD7A298.5040802@us.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

b9e40b50

futex: Add memory barrier commentary to futex_wait_queue_me() · d6617954

Darren Hart authored Sep 24, 2009

The memory barrier semantics of futex_wait_queue_me() are
non-obvious. Add some commentary to try and clarify it.
Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Dinakar Guniguntala <dino@in.ibm.com>
Cc: John Stultz <johnstul@us.ibm.com>
LKML-Reference: <20090924185447.694.38948.stgit@Aeon>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

d6617954

futex: Correct futex_wait_requeue_pi() commentary · eb78fc39

Darren Hart authored Jul 31, 2009

The state machine described in the comments wasn't updated with
a follow-on fix.  Address that and clean up the corresponding
commentary in the function.
Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
LKML-Reference: <4A737C2A.9090001@us.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

eb78fc39

futex: Fix locking imbalance · 11bc48db

Thomas Gleixner authored Oct 04, 2009

Rich reported a lock imbalance in the futex code:

   http://bugzilla.kernel.org/show_bug.cgi?id=14288

It's caused by the displacement of the retry_private label in
futex_wake_op(). The code unlocks the hash bucket locks in the
error handling path and retries without locking them again which
makes the next unlock fail.
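
The imbalance can be reproduced in a tiny model that tracks lock depth. Everything here is hypothetical (`struct locks`, `wake_op_model`, the `relock_on_retry` switch); it only mirrors the control flow described above, where the error path unlocks and then retries.

```c
#include <assert.h>
#include <stdbool.h>

struct locks { int depth; int min_depth; };

static void lock_buckets(struct locks *l)
{
	l->depth++;
}

static void unlock_buckets(struct locks *l)
{
	l->depth--;
	if (l->depth < l->min_depth)
		l->min_depth = l->depth;
}

/*
 * Model of an operation that faults once and retries. Returns the lowest
 * lock depth seen; a negative value means an unlock without a lock.
 */
static int wake_op_model(bool relock_on_retry)
{
	struct locks l = { 0, 0 };
	bool faulted = false;

	lock_buckets(&l);
	assert(l.depth == 1);
retry:
	if (!faulted) {
		faulted = true;
		unlock_buckets(&l);		/* error path drops the locks */
		if (relock_on_retry)
			lock_buckets(&l);	/* retry passes the lock call again */
		goto retry;
	}
	unlock_buckets(&l);			/* normal exit */
	return l.min_depth;
}
```

Here `wake_op_model(false)` returns -1, the unbalanced unlock, while `wake_op_model(true)` returns 0.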

Move retry_private so we lock the hash bucket locks when we retry.
Reported-by: Rich Ercolany <rercola@acm.jhu.edu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Darren Hart <dvhltc@us.ibm.com>
Cc: stable-2.6.31 <stable@kernel.org>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

11bc48db

futex: Correct futex_wait_requeue_pi() commentary · 9231abe1

Darren Hart authored Sep 21, 2009

Correct various typos and formatting inconsistencies in the
commentary of futex_wait_requeue_pi().
Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Dinakar Guniguntala <dino@in.ibm.com>
Cc: John Stultz <johnstul@us.ibm.com>
LKML-Reference: <20090922052958.8717.21932.stgit@Aeon>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

9231abe1

futex: Make function kernel-doc commentary consistent · 6ca0f2a0

Darren Hart authored Sep 21, 2009

Make the existing function kernel-doc consistent throughout
futex.c, following Documentation/kernel-doc-nano-howto.txt as
closely as possible.

When unsure, at least be consistent within futex.c.
Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Dinakar Guniguntala <dino@in.ibm.com>
Cc: John Stultz <johnstul@us.ibm.com>
LKML-Reference: <20090922053022.8717.13339.stgit@Aeon>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

6ca0f2a0

futex: Correct futex_q woken state commentary · 0699fd94

Darren Hart authored Sep 21, 2009

Use kernel-doc format to describe struct futex_q.

Correct the wakeup definition to eliminate the statement about
waking the waiter between the plist_del() and the q->lock_ptr = 0.

Note in the comment that PI futexes have a different definition of
the woken state.
Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Dinakar Guniguntala <dino@in.ibm.com>
Cc: John Stultz <johnstul@us.ibm.com>
LKML-Reference: <20090922053029.8717.62798.stgit@Aeon>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

0699fd94

futex: Check for NULL keys in match_futex · 29b33bb7

Darren Hart authored Oct 14, 2009

If userspace tries to perform a requeue_pi on a non-requeue_pi waiter,
it will find the futex_q->requeue_pi_key to be NULL and OOPS.

Check for NULL in match_futex() instead of doing explicit NULL pointer
checks on all call sites.  While match_futex(NULL, NULL) returning
false is a little odd, it's still correct as we expect valid key
references.
Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@elte.hu>
CC: Eric Dumazet <eric.dumazet@gmail.com>
CC: Dinakar Guniguntala <dino@in.ibm.com>
CC: John Stultz <johnstul@us.ibm.com>
Cc: stable@kernel.org
LKML-Reference: <4AD60687.10306@us.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

29b33bb7
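
A NULL-tolerant key comparison of the kind described in the commit above might look like this sketch. The simplified `struct key` and the name `keys_match` are assumptions and do not reproduce the kernel's futex key layout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified key: the real structure is a union of several layouts. */
struct key { unsigned long word; void *ptr; int offset; };

/* Two keys match only if both exist and every field is equal. */
static bool keys_match(const struct key *a, const struct key *b)
{
	return a != NULL && b != NULL &&
	       a->word == b->word && a->ptr == b->ptr &&
	       a->offset == b->offset;
}
```

Putting the NULL test inside the comparison means no call site has to remember it, at the cost of `keys_match(NULL, NULL)` being false.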

futex: Fix spurious wakeup for requeue_pi really · 43746940

Thomas Gleixner authored Oct 28, 2009

The requeue_pi path doesn't use unqueue_me() (and the racy lock_ptr ==
NULL test) nor does it use the wake_list of futex_wake(), which were
the reason for commit 41890f24 (futex: Handle spurious wake up).

See the debugging discussion on LKML, Message-ID: <4AD4080C.20703@us.ibm.com>

The changes in this fix to the wait_requeue_pi path were considered to
be a likely unnecessary, but harmless safety net. But it turns out that
because, for unknown $@#!*( reasons, EWOULDBLOCK is defined as EAGAIN,
we built an endless loop in the code path which correctly returns
EWOULDBLOCK.
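
The aliasing is easy to demonstrate in user space. In the sketch below, `wait_once` and `wait_with_retry` are hypothetical helpers: the first reports -EWOULDBLOCK a given number of times, the second retries whenever it sees -EAGAIN. On Linux the two constants have the same value, so the retry branch also catches -EWOULDBLOCK.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Returns -EWOULDBLOCK while *left is positive, then 0. */
static int wait_once(int *left)
{
	if (*left > 0) {
		(*left)--;
		return -EWOULDBLOCK;
	}
	return 0;
}

/* Retries on -EAGAIN and counts how often it did so. */
static int wait_with_retry(int would_block_count, int *retries)
{
	int ret;

	assert(retries != NULL);
	*retries = 0;
	while ((ret = wait_once(&would_block_count)) == -EAGAIN)
		(*retries)++;
	return ret;
}
```

If the callee kept returning -EWOULDBLOCK for a legitimate reason, a loop of this shape would never terminate, which is the endless loop the message describes.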

Spurious wakeups in wait_requeue_pi code path are unlikely so we do
the easy solution and return EWOULDBLOCK^WEAGAIN to user space and let
it deal with the spurious wakeup.

Cc: Darren Hart <dvhltc@us.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: John Stultz <johnstul@linux.vnet.ibm.com>
Cc: Dinakar Guniguntala <dino@in.ibm.com>
LKML-Reference: <4AE23C74.1090502@us.ibm.com>
Cc: stable@kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

43746940

futex: Detect mismatched requeue targets · e814515d

Thomas Gleixner authored Oct 28, 2009

There is currently no check to ensure that userspace uses the same
futex requeue target (uaddr2) in futex_requeue() that the waiter used
in futex_wait_requeue_pi().  A mismatch here could cause very unexpected
results as the waiter assumes it either wakes on uaddr1 or uaddr2. We
could detect this on wakeup in the waiter, but the cleanup is more
intense after the improper requeue has occurred.

This patch stores the waiter's expected requeue target in a new
requeue_pi_key pointer in the futex_q which futex_requeue() checks
prior to attempting to do a proxy lock acquisition or a requeue when
requeue_pi=1. If they don't match, return -EINVAL from futex_requeue,
aborting the requeue of any remaining waiters.
Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: John Kacur <jkacur@redhat.com>
Cc: Dinakar Guniguntala <dino@in.ibm.com>
Cc: John Stultz <johnstul@us.ibm.com>
LKML-Reference: <20090814003650.14634.63916.stgit@Aeon>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

Conflicts:

	kernel/futex.c
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

e814515d