1. 10 Nov, 2009 1 commit
    • softirqs: Add missing preemption point in ksoftirqd · ca0b4bfa
      Jupyung Lee authored
      In its current implementation, ksoftirqd() includes a series of primitives
      related to kernel preemption and irq on/off, in the following order:
      
      preempt_disable()		... (1)
      local_irq_disable()		... (2)
      __preempt_enable_no_resched()	... (3)
      local_irq_enable()		... (4)
      
      A problem arises if a task is woken up between (1) and (2) because it
      is not given a chance to preempt the currently running process until
      interrupts are enabled at (4). At this point the kernel is
      preemptible, but there is no explicit reschedule point.
      
      This is only true for a preempt-rt enabled kernel as !preempt-rt has
      preemption disabled at that point via local_bh_disable().
      
      A simple suggestion to resolve the problem is to add a reschedule point,
      preempt_check_resched(), just after (4).
      
      [ tglx: Modified: delete __preempt_enable_no_resched() and add
        	preempt_enable() after local_irq_enable() ]
      Signed-off-by: Jupyung Lee <jupyung@gmail.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      ca0b4bfa
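The ordering problem described in this commit can be illustrated with a toy user-space model. This is only a sketch: the `cpu_model` struct and the `model_*` helpers are hypothetical stand-ins, not the kernel primitives, and "rescheduling" is reduced to a counter.

```c
#include <stdbool.h>

/* Toy user-space model; none of this is kernel code. */
struct cpu_model {
    int  preempt_count;  /* > 0 means preemption is disabled */
    bool irqs_enabled;
    bool need_resched;   /* set when a task was woken and should run */
    int  reschedules;    /* how often the model hit a reschedule point */
};

static void model_preempt_disable(struct cpu_model *c) { c->preempt_count++; }
static void model_irq_disable(struct cpu_model *c)     { c->irqs_enabled = false; }
static void model_irq_enable(struct cpu_model *c)      { c->irqs_enabled = true; }

/* Drops the count but never looks at need_resched. */
static void model_preempt_enable_no_resched(struct cpu_model *c)
{
    c->preempt_count--;
}

/* Drops the count and acts as an explicit reschedule point. */
static void model_preempt_enable(struct cpu_model *c)
{
    c->preempt_count--;
    if (c->preempt_count == 0 && c->irqs_enabled && c->need_resched) {
        c->need_resched = false;
        c->reschedules++;
    }
}

/* Original order (1) (2) (3) (4), with a wakeup between (1) and (2). */
static int run_old_order(void)
{
    struct cpu_model c = { 0, true, false, 0 };

    model_preempt_disable(&c);            /* (1) */
    c.need_resched = true;                /* task woken here */
    model_irq_disable(&c);                /* (2) */
    model_preempt_enable_no_resched(&c);  /* (3) */
    model_irq_enable(&c);                 /* (4): no reschedule point follows */
    return c.reschedules;
}

/* Modified order: irqs are enabled first, then preempt_enable() runs. */
static int run_new_order(void)
{
    struct cpu_model c = { 0, true, false, 0 };

    model_preempt_disable(&c);
    c.need_resched = true;
    model_irq_disable(&c);
    model_irq_enable(&c);
    model_preempt_enable(&c);             /* reschedule point */
    return c.reschedules;
}
```

In this model `run_old_order()` leaves the woken task waiting, while `run_new_order()` reaches a reschedule point as soon as interrupts are back on.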
  2. 06 Nov, 2009 9 commits
    • x86: Fix UP compile · 3c8167f9
      Thomas Gleixner authored
      The power balancing hackery is only relevant for SMP and stupidly
      enough breaks the UP build on x86
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      3c8167f9
    • ftrace: Add latency histograms of missed timer offsets · d2ac742d
      Carsten Emde authored
      A source of system latencies not yet considered in the histograms
      of effective latencies are delayed timer interrupts. Such latencies
      are mainly due to disabled interrupts. Recording effective latencies
      makes it possible to continuously monitor a system's real-time
      capabilities under real-world conditions.
      
      This patch adds latency histograms of missed timer offsets. If the
      timer belongs to a sleeper that is about to wake up a task and the
      latency is higher than previous latencies of such timers, some data
      of this task are recorded as well.
      
      Adapted and expanded Documentation/trace/histograms.txt.
      Signed-off-by: Carsten Emde <C.Emde@osadl.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      d2ac742d
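A minimal sketch of the recording rule described above, written as plain user-space C. The bucket layout, the `latency_hist` struct and the function names are assumptions made for illustration and are not taken from the patch.

```c
#include <stddef.h>

/* Hypothetical layout: one bucket per microsecond, last bucket is overflow. */
#define HIST_BUCKETS 8

struct latency_hist {
    unsigned long bucket[HIST_BUCKETS];
    unsigned long max_us;  /* largest latency seen for a sleeper's timer */
    int max_pid;           /* task that timer was about to wake, or -1 */
};

static void hist_init(struct latency_hist *h)
{
    for (size_t i = 0; i < HIST_BUCKETS; i++)
        h->bucket[i] = 0;
    h->max_us = 0;
    h->max_pid = -1;
}

/*
 * Record one missed timer offset. sleeper_pid is -1 when the timer does
 * not belong to a sleeper that is about to wake up a task.
 */
static void hist_record(struct latency_hist *h, unsigned long latency_us,
                        int sleeper_pid)
{
    size_t idx = latency_us < HIST_BUCKETS - 1 ? latency_us
                                               : HIST_BUCKETS - 1;

    h->bucket[idx]++;

    /* Task data is kept only for a new maximum among sleeper timers. */
    if (sleeper_pid >= 0 && latency_us > h->max_us) {
        h->max_us = latency_us;
        h->max_pid = sleeper_pid;
    }
}
```

Every offset lands in a bucket, but task data is stored only when a sleeper's timer sets a new maximum.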
    • ftrace: Consider shared max priority in latency histograms · 16731e6f
      Carsten Emde authored
      The algorithm used so far to trace the process with the highest priority
      requires that no other processes with the same priority are being woken
      up simultaneously. Otherwise, a process with a lower priority may be
      picked up for tracing which leads to an erroneously high latency value.
      
      Generally, the wakeup latency of a process that exclusively uses the
      highest priority of the system is due to software or hardware issues we
      would like to solve or, at least, keep as small as possible. This is
      what latency measurements are made for, after all. The wakeup latency of
      a process that shares the highest priority of the system with other
      processes, is quite another story. It may contain the worst-case runtime
      durations of the other processes; thus, it is the result of the priority
      design of a given system and nothing a kernel developer or hardware
      engineer may want to fix.
      
      This said, we need to separately record latencies i) of processes that
      exclusively use the highest priority of the system and ii) of processes
      that share the highest priority of the system with other processes.
      
      The above mentioned shortcoming of the tracing algorithm also applies to
      the variable tracing_max_latency that the wakeup latency tracer uses,
      since it is based on the same procedure as the original version of the
      latency histogram. In consequence, if several processes share the
      highest priority of the system, the variable tracing_max_latency may
      contain erroneously high values. We could now patch the wakeup latency
      tracer as well and separately record the various latencies, but it is
      better to document this behavior and recommend the latency histograms
      to reliably determine a system's worst-case wakeup latency.
      
      Simplified and cleaned up a bit. Added some more help info to Kconfig.
      Signed-off-by: Carsten Emde <C.Emde@osadl.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      16731e6f
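The separation of cases i) and ii) could look roughly like the sketch below. The struct, the field names and the `shares_max_prio` flag are hypothetical; the patch itself keeps histograms rather than two maxima.

```c
/* Hypothetical record: one maximum per case described in the message. */
struct wakeup_stats {
    unsigned long max_exclusive_us;  /* i) task is alone at the top priority */
    unsigned long max_shared_us;     /* ii) task shares the top priority */
};

/* Route a wakeup latency sample to the slot that matches its case. */
static void record_wakeup(struct wakeup_stats *s, unsigned long latency_us,
                          int shares_max_prio)
{
    unsigned long *slot = shares_max_prio ? &s->max_shared_us
                                          : &s->max_exclusive_us;

    if (latency_us > *slot)
        *slot = latency_us;
}
```

Keeping the two cases apart means a long run of a same-priority neighbour cannot inflate the value that is meant to reflect kernel or hardware latency.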
    • hwlat: Fix Kconfig and check kthread_stop result · d9a4a1d0
      Jon Masters authored
      Signed-off-by: Jon Masters <jcm@jonmasters.org>
      Signed-off-by: John Kacur <jkacur@redhat.com>
      Cc: Jon Masters <jcm@redhat.com>
      Cc: Clark Williams <williams@redhat.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      d9a4a1d0
    • netlink: fix typo in initialization · 55443288
      Jiri Pirko authored
      Commit 9ef1d4c7 ("[NETLINK]: Missing
      initializations in dumped data") introduced a typo in
      initialization. This patch fixes this.
      Signed-off-by: Jiri Pirko <jpirko@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      55443288
    • drm/r128: Add test for initialisation to all ioctls that require it · e8b444f5
      Ben Hutchings authored
      Almost all of r128's private ioctls require that the CCE state has
      already been initialised.  However, most do not test that this has
      been done, and will proceed to dereference a null pointer.  This may
      result in a security vulnerability, since some ioctls are
      unprivileged.
      
      This adds a macro for the common initialisation test and changes all
      ioctl implementations that require prior initialisation to use that
      macro.
      
      Also, r128_do_init_cce() does not test that the CCE state has not
      been initialised already.  Repeated initialisation may lead to a crash
      or resource leak.  This adds that test.
      Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
      Signed-off-by: Dave Airlie <airlied@redhat.com>
      e8b444f5
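The pattern of a shared initialisation test can be sketched in user-space C as follows. The macro name, the `toy_*` types and the choice of -EINVAL as the error code are assumptions for illustration; the commit message does not state them.

```c
#include <errno.h>
#include <stddef.h>

/* Toy stand-ins for the driver's private state; not the r128 structures. */
struct toy_cce_state { int running; };
struct toy_device { struct toy_cce_state *dev_private; };

/* Hypothetical macro: bail out of the calling ioctl if init never ran. */
#define TOY_REQUIRE_INIT(dev)                 \
    do {                                      \
        if ((dev)->dev_private == NULL)       \
            return -EINVAL;                   \
    } while (0)

/* An ioctl that would otherwise dereference a null pointer. */
static int toy_ioctl_start(struct toy_device *dev)
{
    TOY_REQUIRE_INIT(dev);
    dev->dev_private->running = 1;
    return 0;
}

/* Init refuses to run twice, avoiding a leak of the first state. */
static int toy_init_cce(struct toy_device *dev, struct toy_cce_state *state)
{
    if (dev->dev_private != NULL)
        return -EINVAL;
    dev->dev_private = state;
    return 0;
}
```

Putting the test in one macro keeps every ioctl entry point to a single extra line and makes a missing check easy to spot in review.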
    • AF_UNIX: Fix deadlock on connecting to shutdown socket · 287c11d0
      Tomoki Sekiyama authored
      I found a deadlock bug in UNIX domain sockets that allows non-root
      users to mount a DoS attack against the local machine.
      
      How to reproduce:
       1. Make a listening AF_UNIX/SOCK_STREAM socket with an abstract
          namespace(*), and shutdown(2) it.
       2. Repeat connect(2)ing to the listening socket from the other sockets
          until the connection backlog is full.
       3. connect(2) takes the CPU forever. If every core is taken, the
          system hangs.
      
      PoC code: (Run as many times as cores on SMP machines.)
      
      #include <stdio.h>
      #include <string.h>
      #include <unistd.h>
      #include <sys/socket.h>
      #include <sys/un.h>
      
      int main(void)
      {
      	int ret;
      	int csd;
      	int lsd;
      	struct sockaddr_un sun;
      
      	/* make an abstract name address (*) */
      	memset(&sun, 0, sizeof(sun));
      	sun.sun_family = PF_UNIX;
      	sprintf(&sun.sun_path[1], "%d", getpid());
      
      	/* create the listening socket and shutdown */
      	lsd = socket(AF_UNIX, SOCK_STREAM, 0);
      	bind(lsd, (struct sockaddr *)&sun, sizeof(sun));
      	listen(lsd, 1);
      	shutdown(lsd, SHUT_RDWR);
      
      	/* connect loop */
      	alarm(15); /* forcibly exit the loop after 15 sec */
      	for (;;) {
      		csd = socket(AF_UNIX, SOCK_STREAM, 0);
      		ret = connect(csd, (struct sockaddr *)&sun, sizeof(sun));
      		if (-1 == ret) {
      			perror("connect()");
      			break;
      		}
      		puts("Connection OK");
      	}
      	return 0;
      }
      
      (*) Make sun_path[0] = 0 to use the abstract namespace.
          If a file-based socket is used, the system doesn't deadlock because
          of context switches in the file system layer.
      
      Why this happens:
       Error checks between unix_socket_connect() and unix_wait_for_peer() are
       inconsistent. The former calls the latter to wait until the backlog is
       processed. Although the latter returns without doing anything when the
       socket is shut down, the former doesn't check the shutdown state and
       just retries calling the latter forever.
      
      Patch:
       The patch below adds a shutdown check to unix_socket_connect(), so
       connect(2) to the shutdown socket will return -ECONNREFUSED.
      Signed-off-by: Tomoki Sekiyama <tomoki.sekiyama.qu@hitachi.com>
      Signed-off-by: Masanori Yoshida <masanori.yoshida.tv@hitachi.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      287c11d0
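The retry logic described under "Why this happens" can be modelled in a few lines of user-space C. This is a sketch only: the `toy_*` names, the spin cap and the `TOY_SPUN_OUT` sentinel are hypothetical, and the cap merely stands in for "spins forever".

```c
#include <errno.h>
#include <stdbool.h>

#define TOY_SPUN_OUT 1  /* model-only sentinel: the caller never got out */

struct toy_listener {
    bool shutdown;
    int  backlog_len;
    int  backlog_max;
};

/* Models the wait helper: it returns at once on a shutdown socket. */
static void toy_wait_for_peer(struct toy_listener *l)
{
    if (l->shutdown)
        return;
    if (l->backlog_len > 0)
        l->backlog_len--;  /* pretend the server accepted one connection */
}

/* Models the connect path, with or without the added shutdown check. */
static int toy_connect(struct toy_listener *l, bool check_shutdown,
                       int max_spins)
{
    for (int spins = 0; spins < max_spins; spins++) {
        if (check_shutdown && l->shutdown)
            return -ECONNREFUSED;
        if (l->backlog_len < l->backlog_max) {
            l->backlog_len++;
            return 0;
        }
        toy_wait_for_peer(l);  /* backlog full: wait, then retry */
    }
    return TOY_SPUN_OUT;
}
```

With a full backlog on a shutdown listener, the unchecked variant burns through every spin, while the checked variant returns -ECONNREFUSED on the first pass.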
    • KEYS: get_instantiation_keyring() should inc the keyring refcount in all cases · 83001da0
      David Howells authored
      The destination keyring specified to request_key() and co. is made available to
      the process that instantiates the key (the slave process started by
      /sbin/request-key typically).  This is passed in the request_key_auth struct as
      the dest_keyring member.
      
      keyctl_instantiate_key() and keyctl_negate_key() call get_instantiation_keyring()
      to get the keyring to attach the newly constructed key to at the end of
      instantiation.  This may be given a specific keyring into which a link will be
      made later, or it may be asked to find the keyring passed to request_key().  In
      the former case, it returns a keyring with the refcount incremented by
      lookup_user_key(); in the latter case, it returns the keyring from the
      request_key_auth struct - and does _not_ increment the refcount.
      
      The latter case will eventually result in an oops when the keyring prematurely
      runs out of references and gets destroyed.  The effect may take some time to
      show up as the key is destroyed lazily.
      
      To fix this, the keyring returned by get_instantiation_keyring() must always
      have its refcount incremented, no matter where it comes from.
      
      This can be tested by setting /etc/request-key.conf to:
      
      #OP	TYPE	DESCRIPTION	CALLOUT INFO	PROGRAM ARG1 ARG2 ARG3 ...
      #======	=======	===============	===============	===============================
      create  *	test:*		*		|/bin/false %u %g %d %{user:_display}
      negate	*	*		*		/bin/keyctl negate %k 10 @u
      
      and then doing:
      
      	keyctl add user _display aaaaaaaa @u
      	while keyctl request2 user test:x test:x @u &&
      	keyctl list @u;
      	do
      		keyctl request2 user test:x test:x @u;
      		sleep 31;
      		keyctl list @u;
      	done
      
      which will oops eventually.  Changing the negate line to have @u rather than
      %S at the end is important as that forces the latter case by passing a special
      keyring ID rather than an actual keyring ID.
      Reported-by: Alexander Zangerl <az@bond.edu.au>
      Signed-off-by: David Howells <dhowells@redhat.com>
      Tested-by: Alexander Zangerl <az@bond.edu.au>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      83001da0
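The rule "always return the keyring with its refcount incremented" can be shown with a small user-space sketch. All `toy_*` names are hypothetical and the refcount is a plain int rather than the kernel's atomic type.

```c
#include <stddef.h>

struct toy_keyring { int refcount; };
struct toy_auth { struct toy_keyring *dest_keyring; };

/* Models the lookup used in the former case: it takes a reference. */
static struct toy_keyring *toy_lookup(struct toy_keyring *k)
{
    k->refcount++;
    return k;
}

/*
 * Returns a referenced keyring in both cases, so the caller can always
 * pair the call with toy_keyring_put().
 */
static struct toy_keyring *
toy_get_instantiation_keyring(struct toy_keyring *explicit_ring,
                              struct toy_auth *auth)
{
    if (explicit_ring)
        return toy_lookup(explicit_ring);  /* former case */

    if (auth->dest_keyring) {
        auth->dest_keyring->refcount++;    /* latter case: take a ref too */
        return auth->dest_keyring;
    }
    return NULL;
}

static void toy_keyring_put(struct toy_keyring *k)
{
    if (k)
        k->refcount--;
}
```

Because both branches hand back a reference, the caller's unconditional put can no longer drive the count below what the original owner holds.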
    • fs: pipe.c null pointer dereference · 070609e1
      Earl Chew authored
      This patch fixes a null pointer exception in pipe_rdwr_open() which
      generates the stack trace:
      
      > Unable to handle kernel NULL pointer dereference at 0000000000000028 RIP:
      >  [<ffffffff802899a5>] pipe_rdwr_open+0x35/0x70
      >  [<ffffffff8028125c>] __dentry_open+0x13c/0x230
      >  [<ffffffff8028143d>] do_filp_open+0x2d/0x40
      >  [<ffffffff802814aa>] do_sys_open+0x5a/0x100
      >  [<ffffffff8021faf3>] sysenter_do_call+0x1b/0x67
      
      The failure mode is triggered by an attempt to open an anonymous
      pipe via /proc/pid/fd/* as exemplified by this script:
      
      =============================================================
      while : ; do
         { echo y ; sleep 1 ; } | { while read ; do echo z$REPLY; done ; } &
         PID=$!
         OUT=$(ps -efl | grep 'sleep 1' | grep -v grep |
              { read PID REST ; echo $PID; } )
         OUT="${OUT%% *}"
         DELAY=$((RANDOM * 1000 / 32768))
         usleep $((DELAY * 1000 + RANDOM % 1000 ))
         echo n > /proc/$OUT/fd/1                 # Trigger defect
      done
      =============================================================
      
      Note that the failure window is quite small and I could only
      reliably reproduce the defect by inserting a small delay
      in pipe_rdwr_open(). For example:
      
       static int
       pipe_rdwr_open(struct inode *inode, struct file *filp)
       {
             msleep(100);
             mutex_lock(&inode->i_mutex);
      
      Although the defect was observed in pipe_rdwr_open(), I think it
      makes sense to replicate the change through all the pipe_*_open()
      functions.
      
      The core of the change is to verify that inode->i_pipe has not
      been released before attempting to manipulate it. If inode->i_pipe
      is no longer present, return ENOENT to indicate so.
      
      The comment about potentially using atomic_t for i_pipe->readers
      and i_pipe->writers has also been removed because it is no longer
      relevant in this context. The inode->i_mutex lock must be used so
      that inode->i_pipe can be dealt with correctly.
      Signed-off-by: Earl Chew <earl_chew@agilent.com>
      Cc: stable@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      070609e1
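The core of the change, checking `i_pipe` while holding the lock and returning ENOENT when it is gone, can be sketched in user-space C. The `toy_*` types and the flag-based lock are stand-ins chosen for illustration, not the kernel's inode or mutex.

```c
#include <errno.h>
#include <stddef.h>

struct toy_pipe { int readers; int writers; };

struct toy_inode {
    int i_mutex_held;         /* stand-in for inode->i_mutex */
    struct toy_pipe *i_pipe;  /* NULL once the pipe has been released */
};

static void toy_mutex_lock(struct toy_inode *inode)   { inode->i_mutex_held = 1; }
static void toy_mutex_unlock(struct toy_inode *inode) { inode->i_mutex_held = 0; }

/* Open for read and write, but only if the pipe still exists. */
static int toy_pipe_rdwr_open(struct toy_inode *inode)
{
    int ret = -ENOENT;

    toy_mutex_lock(inode);
    if (inode->i_pipe) {  /* verified under the lock */
        ret = 0;
        inode->i_pipe->readers++;
        inode->i_pipe->writers++;
    }
    toy_mutex_unlock(inode);

    return ret;
}
```

The check has to sit inside the locked region; testing the pointer before taking the lock would leave the same window open.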
  3. 29 Oct, 2009 17 commits
  4. 28 Oct, 2009 11 commits
    • futex: Correct queue_me and unqueue_me commentary · cd18252d
      Darren Hart authored
      The queue_me/unqueue_me commentary is oddly placed and out of date.
      Clean it up and correct the inaccurate bits.
      Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Dinakar Guniguntala <dino@in.ibm.com>
      Cc: John Stultz <johnstul@us.ibm.com>
      LKML-Reference: <20090922053015.8717.71713.stgit@Aeon>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      cd18252d
    • futex: Move drop_futex_key_refs out of spinlock'ed region · b9e40b50
      Darren Hart authored
      When requeuing tasks from one futex to another, the reference held
      by the requeued task to the original futex location needs to be
      dropped eventually.
      
      Dropping the reference may ultimately lead to a call to
      "iput_final" and subsequently call into filesystem-specific code,
      which may be non-atomic.
      
      It is therefore safer to defer this drop operation until after the
      futex_hash_bucket spinlock has been dropped.
      
      Originally-From: Helge Bahmann <hcb@chaoticmind.net>
      Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
      Cc: <stable@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Dinakar Guniguntala <dino@in.ibm.com>
      Cc: John Stultz <johnstul@linux.vnet.ibm.com>
      Cc: Sven-Thorsten Dietrich <sdietrich@novell.com>
      Cc: John Kacur <jkacur@redhat.com>
      LKML-Reference: <4AD7A298.5040802@us.ibm.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      b9e40b50
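Deferring the reference drop until the bucket lock is released might be sketched as below. The `toy_*` names are hypothetical, the lock is a plain flag, and the counter of pending drops is one possible way to defer the work, not necessarily how the patch does it.

```c
#include <stdbool.h>

struct toy_bucket { bool locked; };  /* stand-in for the hash bucket lock */

struct toy_futex_ref {
    int  refs;
    bool dropped_under_lock;  /* set if a drop ever ran with the lock held */
};

/* The drop may end up in code that must not run in atomic context. */
static void toy_drop_key_ref(struct toy_futex_ref *k,
                             const struct toy_bucket *hb)
{
    k->refs--;
    if (hb->locked)
        k->dropped_under_lock = true;
}

/* Requeue nr waiters; count the drops and perform them after unlock. */
static void toy_requeue(struct toy_bucket *hb, struct toy_futex_ref *old_key,
                        int nr_requeued)
{
    int pending_drops = 0;

    hb->locked = true;
    for (int i = 0; i < nr_requeued; i++)
        pending_drops++;  /* waiter moved; remember its reference */
    hb->locked = false;

    while (pending_drops-- > 0)
        toy_drop_key_ref(old_key, hb);
}
```

The lock is held only for the queue manipulation, and anything that could sleep runs after it has been released.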
    • futex: Add memory barrier commentary to futex_wait_queue_me() · d6617954
      Darren Hart authored
      The memory barrier semantics of futex_wait_queue_me() are
      non-obvious. Add some commentary to try and clarify it.
      Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Dinakar Guniguntala <dino@in.ibm.com>
      Cc: John Stultz <johnstul@us.ibm.com>
      LKML-Reference: <20090924185447.694.38948.stgit@Aeon>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      d6617954
    • futex: Correct futex_wait_requeue_pi() commentary · eb78fc39
      Darren Hart authored
      The state machine described in the comments wasn't updated with
      a follow-on fix.  Address that and clean up the corresponding
      commentary in the function.
      Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      LKML-Reference: <4A737C2A.9090001@us.ibm.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      eb78fc39
    • futex: Fix locking imbalance · 11bc48db
      Thomas Gleixner authored
      Rich reported a lock imbalance in the futex code:
      
         http://bugzilla.kernel.org/show_bug.cgi?id=14288
      
      It's caused by the displacement of the retry_private label in
      futex_wake_op(). The code unlocks the hash bucket locks in the
      error handling path and retries without locking them again which
      makes the next unlock fail.
      
      Move retry_private so we lock the hash bucket locks when we retry.
      Reported-by: Rich Ercolany <rercola@acm.jhu.edu>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Darren Hart <dvhltc@us.ibm.com>
      Cc: stable-2.6.31 <stable@kernel.org>
      LKML-Reference: <new-submission>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      11bc48db
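The effect of the label placement can be reproduced with a toy lock that counts unbalanced unlocks. Only the label name `retry_private` comes from the commit message; the `toy_*` helpers and the `faults` parameter (how many times the error path is taken) are assumptions for this sketch.

```c
struct toy_locks {
    int depth;       /* how many times the bucket locks are held */
    int underflows;  /* unlocks issued while nothing was locked */
};

static void toy_lock(struct toy_locks *l) { l->depth++; }

static void toy_unlock(struct toy_locks *l)
{
    if (l->depth == 0)
        l->underflows++;
    else
        l->depth--;
}

/* Label after the lock: a retry skips re-locking. */
static int toy_wake_op_buggy(struct toy_locks *l, int faults)
{
    toy_lock(l);
retry_private:
    if (faults-- > 0) {
        toy_unlock(l);       /* error path drops the locks ... */
        goto retry_private;  /* ... and retries without taking them */
    }
    toy_unlock(l);
    return l->underflows;
}

/* Label before the lock: every retry takes the locks again. */
static int toy_wake_op_fixed(struct toy_locks *l, int faults)
{
retry_private:
    toy_lock(l);
    if (faults-- > 0) {
        toy_unlock(l);
        goto retry_private;
    }
    toy_unlock(l);
    return l->underflows;
}
```

With a single fault, the buggy variant ends with one unlock too many, while the fixed variant stays balanced.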
    • futex: Correct futex_wait_requeue_pi() commentary · 9231abe1
      Darren Hart authored
      Correct various typos and formatting inconsistencies in the
      commentary of futex_wait_requeue_pi().
      Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Dinakar Guniguntala <dino@in.ibm.com>
      Cc: John Stultz <johnstul@us.ibm.com>
      LKML-Reference: <20090922052958.8717.21932.stgit@Aeon>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      9231abe1
    • futex: Make function kernel-doc commentary consistent · 6ca0f2a0
      Darren Hart authored
      Make the existing function kernel-doc consistent throughout
      futex.c, following Documentation/kernel-doc-nano-howto.txt as
      closely as possible.
      
      When unsure, at least be consistent within futex.c.
      Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Dinakar Guniguntala <dino@in.ibm.com>
      Cc: John Stultz <johnstul@us.ibm.com>
      LKML-Reference: <20090922053022.8717.13339.stgit@Aeon>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      6ca0f2a0
    • futex: Correct futex_q woken state commentary · 0699fd94
      Darren Hart authored
      Use kernel-doc format to describe struct futex_q.
      
      Correct the wakeup definition to eliminate the statement about
      waking the waiter between the plist_del() and the q->lock_ptr = 0.
      
      Note in the comment that PI futexes have a different definition of
      the woken state.
      Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Dinakar Guniguntala <dino@in.ibm.com>
      Cc: John Stultz <johnstul@us.ibm.com>
      LKML-Reference: <20090922053029.8717.62798.stgit@Aeon>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      0699fd94
    • futex: Check for NULL keys in match_futex · 29b33bb7
      Darren Hart authored
      If userspace tries to perform a requeue_pi on a non-requeue_pi waiter,
      it will find the futex_q->requeue_pi_key to be NULL and OOPS.
      
      Check for NULL in match_futex() instead of doing explicit NULL pointer
      checks on all call sites.  While match_futex(NULL, NULL) returning
      false is a little odd, it's still correct as we expect valid key
      references.
      Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      CC: Eric Dumazet <eric.dumazet@gmail.com>
      CC: Dinakar Guniguntala <dino@in.ibm.com>
      CC: John Stultz <johnstul@us.ibm.com>
      Cc: stable@kernel.org
      LKML-Reference: <4AD60687.10306@us.ibm.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      29b33bb7
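A comparison helper that tolerates NULL keys could look like the following user-space sketch. The `toy_futex_key` layout is a simplified assumption and does not mirror the kernel's key union.

```c
#include <stddef.h>

/* Simplified stand-in for a futex key. */
struct toy_futex_key {
    unsigned long word;
    void *ptr;
    int offset;
};

/* Returns 1 only if both keys are present and all fields are equal. */
static int toy_match_futex(const struct toy_futex_key *k1,
                           const struct toy_futex_key *k2)
{
    return k1 && k2 &&
           k1->word == k2->word &&
           k1->ptr == k2->ptr &&
           k1->offset == k2->offset;
}
```

Callers can then pass a possibly-NULL `requeue_pi_key` straight through, and two NULL keys compare as not matching, as the commit message notes.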
    • futex: Fix spurious wakeup for requeue_pi really · 43746940
      Thomas Gleixner authored
      The requeue_pi path doesn't use unqueue_me() (and the racy lock_ptr ==
      NULL test) nor does it use the wake_list of futex_wake() which were
      the reason for commit 41890f24 (futex: Handle spurious wake up).
      
      See the debugging discussion on LKML, Message-ID: <4AD4080C.20703@us.ibm.com>
      
      The changes in this fix to the wait_requeue_pi path were considered to
      be a likely unnecessary, but harmless safety net. But it turns out that
      due to the fact that for unknown $@#!*( reasons EWOULDBLOCK is defined
      as EAGAIN we built an endless loop in the code path which correctly
      returns EWOULDBLOCK.
      
      Spurious wakeups in wait_requeue_pi code path are unlikely so we do
      the easy solution and return EWOULDBLOCK^WEAGAIN to user space and let
      it deal with the spurious wakeup.
      
      Cc: Darren Hart <dvhltc@us.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: John Stultz <johnstul@linux.vnet.ibm.com>
      Cc: Dinakar Guniguntala <dino@in.ibm.com>
      LKML-Reference: <4AE23C74.1090502@us.ibm.com>
      Cc: stable@kernel.org
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      43746940
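The aliasing that caused the loop can be demonstrated in user space. The sketch below assumes a Linux system, where `EWOULDBLOCK` and `EAGAIN` expand to the same value; the loop shape and the iteration cap are illustrative and not copied from futex.c.

```c
#include <errno.h>

/* Nonzero on platforms where the two error names share one value. */
static int toy_errnos_alias(void)
{
    return EWOULDBLOCK == EAGAIN;
}

/*
 * Models a wait path that retries on -EAGAIN while the inner step keeps
 * reporting -EWOULDBLOCK. The cap stands in for "loops forever".
 * Returns the number of iterations performed.
 */
static int toy_wait_loop(int max_iters)
{
    int iters = 0;
    int ret;

    do {
        ret = -EWOULDBLOCK;  /* what the inner step correctly returns */
        iters++;
    } while (ret == -EAGAIN && iters < max_iters);

    return iters;
}
```

Where the two names alias, the retry condition cannot tell "try again" apart from "would block", so the loop only stops at the cap.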
    • futex: Detect mismatched requeue targets · e814515d
      Thomas Gleixner authored
      There is currently no check to ensure that userspace uses the same
      futex requeue target (uaddr2) in futex_requeue() that the waiter used
      in futex_wait_requeue_pi().  A mismatch here could cause very unexpected
      results as the waiter assumes it either wakes on uaddr1 or uaddr2. We
      could detect this on wakeup in the waiter, but the cleanup is more
      intense after the improper requeue has occurred.
      
      This patch stores the waiter's expected requeue target in a new
      requeue_pi_key pointer in the futex_q which futex_requeue() checks
      prior to attempting to do a proxy lock acquisition or a requeue when
      requeue_pi=1. If they don't match, return -EINVAL from futex_requeue,
      aborting the requeue of any remaining waiters.
      Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: John Kacur <jkacur@redhat.com>
      Cc: Dinakar Guniguntala <dino@in.ibm.com>
      Cc: John Stultz <johnstul@us.ibm.com>
      LKML-Reference: <20090814003650.14634.63916.stgit@Aeon>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      
      Conflicts:
      
      	kernel/futex.c
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      e814515d
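The target check described in this commit can be sketched as a loop over waiters. The `toy_*` types are hypothetical, keys are reduced to a single address, and returning the number of requeued waiters on success is an assumption of the sketch.

```c
#include <errno.h>
#include <stddef.h>

struct toy_target_key { unsigned long addr; };

struct toy_waiter {
    const struct toy_target_key *requeue_pi_key;  /* target the waiter expects */
    int requeued;
};

/*
 * Requeue waiters onto key2. Stops with -EINVAL at the first waiter whose
 * stored target differs, leaving the remaining waiters untouched.
 */
static int toy_requeue_pi(struct toy_waiter *w, size_t n,
                          const struct toy_target_key *key2)
{
    for (size_t i = 0; i < n; i++) {
        const struct toy_target_key *want = w[i].requeue_pi_key;

        if (want == NULL || want->addr != key2->addr)
            return -EINVAL;
        w[i].requeued = 1;
    }
    return (int)n;
}
```

Checking before each requeue means a mismatch is caught while the waiter is still on its original queue, which avoids the heavier cleanup the message mentions.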
  5. 20 Oct, 2009 2 commits