Commits · 600dfd7feabab7b88e29441bb0a029ca9421dfe5 · linux / linux-davinci

28 Oct, 2009 5 commits

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> · 600dfd7f

Akinobu Mita authored Oct 29, 2009

Acked-by: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

600dfd7f

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> · e7c03d17

Akinobu Mita authored Oct 29, 2009

Reviewed-by: Roland Dreier <rolandd@cisco.com>
Cc: Yevgeny Petrilin <yevgenyp@mellanox.co.il>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

e7c03d17

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> · 9412f216

Akinobu Mita authored Oct 29, 2009

Cc: Greg Kroah-Hartman <gregkh@suse.de>
Cc: Lothar Wassmann <LW@KARO-electronics.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

9412f216

Use bitmap library and kill some unused iommu helper functions. · fb5afcee

Akinobu Mita authored Oct 29, 2009

1. s/iommu_area_free/bitmap_clear/

2. s/iommu_area_reserve/bitmap_set/

3. Use bitmap_find_next_zero_area instead of find_next_zero_area

  This cannot be simple substitution because find_next_zero_area
  doesn't check the last bit of the limit in bitmap

4. Remove iommu_area_free, iommu_area_reserve, and find_next_zero_area
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

fb5afcee

This introduces new bitmap functions: · ba1e3792

Akinobu Mita authored Oct 29, 2009

bitmap_set: Set specified bit area
bitmap_clear: Clear specified bit area
bitmap_find_next_zero_area: Find free bit area

These are mostly stolen from iommu helper. The differences are:

- Use find_next_bit instead of doing test_bit for each bit

- Rewrite bitmap_set and bitmap_clear

  Instead of setting or clearing for each bit.

- Check the last bit of the limit

  iommu-helper doesn't want to find such area

- The return value if there is no zero area

  find_next_zero_area in iommu helper: returns -1
  bitmap_find_next_zero_area: return >= bitmap size
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Greg Kroah-Hartman <gregkh@suse.de>
Cc: Lothar Wassmann <LW@KARO-electronics.de>
Cc: Roland Dreier <rolandd@cisco.com>
Cc: Yevgeny Petrilin <yevgenyp@mellanox.co.il>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

ba1e3792
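
A rough userspace sketch of the three helpers described in ba1e3792 may make the semantics easier to follow. It is not the kernel implementation: the `sk_` names are hypothetical, bits are set and cleared one at a time for brevity, and no alignment argument is modelled. It does follow the two behaviours the message calls out: the last bit below the limit is usable, and "no area" is reported as a value >= the bitmap size.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define SK_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Test a single bit in the map. */
static int sk_test_bit(const unsigned long *map, size_t bit)
{
	return (map[bit / SK_BITS_PER_LONG] >> (bit % SK_BITS_PER_LONG)) & 1UL;
}

/* Set bits [start, start + nr). */
static void sk_bitmap_set(unsigned long *map, size_t start, size_t nr)
{
	for (size_t i = start; i < start + nr; i++)
		map[i / SK_BITS_PER_LONG] |= 1UL << (i % SK_BITS_PER_LONG);
}

/* Clear bits [start, start + nr). */
static void sk_bitmap_clear(unsigned long *map, size_t start, size_t nr)
{
	for (size_t i = start; i < start + nr; i++)
		map[i / SK_BITS_PER_LONG] &= ~(1UL << (i % SK_BITS_PER_LONG));
}

/*
 * Find nr contiguous zero bits at or after start, entirely below size.
 * Returns the index of the first bit of the area, or size if there is none.
 */
static size_t sk_find_next_zero_area(const unsigned long *map, size_t size,
				     size_t start, size_t nr)
{
	size_t index = start;

	assert(map != NULL);
	while (index + nr <= size) {
		size_t i;

		for (i = 0; i < nr; i++)
			if (sk_test_bit(map, index + i))
				break;
		if (i == nr)
			return index;	/* the whole window is free */
		index += i + 1;		/* restart just past the set bit */
	}
	return size;
}
```

A caller would typically check the result against the bitmap size and then reserve the area with `sk_bitmap_set()`.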

30 Oct, 2009 2 commits

Cc: Jeff Moyer <jmoyer@redhat.com> · 33b60a5f

Andrew Morton authored Oct 30, 2009

Cc: Zach Brown <zach.brown@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

33b60a5f

Intel reported a performance regression caused by the following commit: · 621687ce

Jeff Moyer authored Oct 30, 2009

commit 848c4dd5
Author: Zach Brown <zach.brown@oracle.com>
Date:   Mon Aug 20 17:12:01 2007 -0700

    dio: zero struct dio with kzalloc instead of manually

    This patch uses kzalloc to zero all of struct dio rather than
    manually trying to track which fields we rely on being zero.  It
    passed aio+dio stress testing and some bug regression testing on
    ext3.

    This patch was introduced by Linus in the conversation that lead up
    to Badari's minimal fix to manually zero .map_bh.b_state in commit:

      6a648fa7

    It makes the code a bit smaller.  Maybe a couple fewer cachelines to
    load, if we're lucky:

       text    data     bss     dec     hex filename
    3285925  568506 1304616 5159047  4eb887 vmlinux
    3285797  568506 1304616 5158919  4eb807 vmlinux.patched

    I was unable to measure a stable difference in the number of cpu
    cycles spent in blockdev_direct_IO() when pushing aio+dio 256K reads
    at ~340MB/s.

    So the resulting intent of the patch isn't a performance gain but to
    avoid exposing ourselves to the risk of finding another field like
    .map_bh.b_state where we rely on zeroing but don't enforce it in the
    code.

Zach surmised that zeroing out the page array was what caused most of
the problem, and suggested the approach taken in the attached patch for
resolving the issue.  Intel re-tested with this patch and saw a 0.6%
performance gain (the original regression was 0.5%).
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Acked-by: Zach Brown <zach.brown@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

621687ce
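
The message does not spell out the approach taken, so the following is only one plausible way to avoid zeroing a large array: keep it as the last member and clear everything before it. The struct layout, the array size and all `sk_` names are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define SK_NR_PAGES 64	/* hypothetical array size */

struct sk_dio {
	int flags;
	size_t head;
	size_t tail;
	/* Large scratch array kept last so that zeroing can stop before it. */
	void *pages[SK_NR_PAGES];
};

/* Allocate a sk_dio and zero only the fields in front of pages[]. */
static struct sk_dio *sk_dio_alloc(void)
{
	struct sk_dio *dio = malloc(sizeof(*dio));

	if (!dio)
		return NULL;
	memset(dio, 0, offsetof(struct sk_dio, pages));
	return dio;
}
```

With this layout the code must never read a `pages[]` slot it has not written first, which is the invariant that `head` and `tail` would have to guard.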

14 Oct, 2009 1 commit

Not makes it a bool before the comparison. · 5f6af72a

Roel Kluin authored Oct 14, 2009

Signed-off-by: Roel Kluin <roel.kluin@gmail.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

5f6af72a
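
The class of bug described in 5f6af72a can be sketched as follows. The constant and function names are hypothetical; the buggy variant is written with explicit parentheses to show how the compiler parses `!value == SK_EXPECTED`.

```c
#include <assert.h>
#include <stdbool.h>

#define SK_EXPECTED 2	/* hypothetical constant */

/*
 * Buggy: `!` binds tighter than `==`, so the value is first reduced to
 * 0 or 1 and then compared with 2, which can never match.
 */
static bool sk_differs_buggy(int value)
{
	return (!value) == SK_EXPECTED;
}

/* Fixed: compare the value itself. */
static bool sk_differs_fixed(int value)
{
	return value != SK_EXPECTED;
}
```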

16 Oct, 2009 1 commit

dma_mask is, when interpreted as address, the last valid byte, and hence · ff29c603

Jan Beulich authored Oct 16, 2009

comparison must also be done using the last valid byte of the buffer in
question.

Also fix the open-coded instances in lib/swiotlb.c.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Cc: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: Becky Bruce <beckyb@kernel.crashing.org>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

ff29c603
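
The off-by-one described in ff29c603 can be illustrated with a small sketch. The function names are hypothetical and overflow of `addr + size` is ignored; the point is only that the mask is the last valid byte, so the buffer's last byte is what has to be compared.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Old-style check: compares one past the end of the buffer. */
static bool sk_dma_capable_old(uint64_t mask, uint64_t addr, size_t size)
{
	return addr + size <= mask;
}

/* Corrected check: compares the last valid byte of the buffer. */
static bool sk_dma_capable(uint64_t mask, uint64_t addr, size_t size)
{
	assert(size > 0);
	return addr + size - 1 <= mask;
}
```

For a 32-bit mask, a 4 KiB buffer ending exactly at 0xffffffff is rejected by the first form and accepted by the second.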

30 Sep, 2009 3 commits

Add support for 6 ranks per channel to the i5100 chipset. I have tested · 04ca382c

Nils Carlson authored Oct 01, 2009

the patch as far as possible with correctable errors and things appear
good.  The DIMM mapping is correct for our board, but boards may differ.
Signed-off-by: Nils Carlson <nils.carlson@ludd.ltu.se>
Acked-by: Arthur Jones <ajones@riverbed.com>
Signed-off-by: Doug Thompson <dougthompson@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

04ca382c

Add scrubbing to the i5100 chipset. The i5100 chipset only supports one · 3b94c702

Nils Carlson authored Oct 01, 2009

scrubbing rate, which is not constant but dependent on memory load. The
rate returned by this driver is an estimate based on some experimentation,
but is substantially closer to the truth than the speed supplied in the
documentation.

Also, scrubbing is done once, and then a done-bit is set. This means that
to accomplish continuous scrubbing a re-enabling mechanism must be used.
I have created the simplest possible such mechanism in the form of a
work-queue which will check every five minutes. This interval is quite
arbitrary but should be sufficient for all sizes of system memory.
Signed-off-by: Nils Carlson <nils.carlson@ludd.ltu.se>
Signed-off-by: Doug Thompson <dougthompson@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

3b94c702

The i5100 driver uses the word controller instead of channel in a lot of · db4f4046

Nils Carlson authored Oct 01, 2009

places, this is simply a cleanup of the patch.
Signed-off-by: Nils Carlson <nils.carlson@ludd.ltu.se>
Signed-off-by: Doug Thompson <dougthompson@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

db4f4046

16 Oct, 2009 1 commit

vtermnos[] is unsigned, so this test was wrong. · f0a4034c

Roel Kluin authored Oct 17, 2009

Signed-off-by: Roel Kluin <roel.kluin@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

f0a4034c
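
The message for f0a4034c does not show the test itself, so this is only a generic sketch of why a sign test on an unsigned value is wrong. The sentinel and the function names are assumptions; a compiler may warn about the buggy comparison, which is the point.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SK_NO_TERM UINT32_MAX	/* hypothetical "unused slot" marker, i.e. (uint32_t)-1 */

/* Buggy: an unsigned value is never negative, so this is always false. */
static bool sk_is_unused_buggy(uint32_t termno)
{
	return termno < 0;
}

/* Fixed: compare against the sentinel explicitly. */
static bool sk_is_unused_fixed(uint32_t termno)
{
	return termno == SK_NO_TERM;
}
```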

09 Oct, 2009 1 commit

Add #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt · 9d895ad7

Joe Perches authored Oct 09, 2009

Convert printks to pr_<level>
Convert some embedded function names to %s...__func__
Remove a period after exclamation points.
Remove #define pr_dbg which could be used by future kernel.h includes
Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: Jiri Slaby <jirislaby@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

9d895ad7
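
A userspace approximation of the `pr_fmt` convention from 9d895ad7 is sketched below. `SK_MODNAME` stands in for KBUILD_MODNAME, and `sk_pr_info()` is a hypothetical stand-in for pr_info() that formats into a buffer instead of logging; `##__VA_ARGS__` is a GNU extension.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define SK_MODNAME "mydrv"	/* stand-in for KBUILD_MODNAME */
#define pr_fmt(fmt) SK_MODNAME ": " fmt

/* Every message passes through pr_fmt(), so the prefix is added once. */
#define sk_pr_info(buf, len, fmt, ...) \
	snprintf(buf, len, pr_fmt(fmt), ##__VA_ARGS__)

/* Function name passed as %s rather than embedded in the format string. */
static int sk_report(char *buf, size_t len, const char *func, int port)
{
	return sk_pr_info(buf, len, "%s: port %d ready\n", func, port);
}
```

Because the prefix comes from string-literal concatenation, it costs nothing at run time and every message in the file gets it consistently.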

30 Sep, 2009 1 commit

Stanse found unnecessary test in mxser_startup. · fccb685a

Jiri Slaby authored Oct 01, 2009

tty is dereferenced earlier, the test is superfluous. Remove it.
Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Cc: Greg KH <greg@kroah.com>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

fccb685a

24 Aug, 2009 1 commit

Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com> · d8785eb2

Bartlomiej Zolnierkiewicz authored Aug 25, 2009

Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

d8785eb2

15 Oct, 2009 1 commit

Currently all architectures but microblaze unconditionally define · 7f79b363

Christoph Hellwig authored Oct 15, 2009

USE_ELF_CORE_DUMP.  The microblaze omission seems like an error to me, so
let's kill this ifdef and make sure we are the same everywhere.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: <linux-arch@vger.kernel.org>
Cc: Michal Simek <michal.simek@petalogix.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

7f79b363

29 Sep, 2009 1 commit

KCS_IDLE and KCS_IDLE_STATE have the same value, but in this function the · df2a9c57

Julia Lawall authored Sep 30, 2009

constants ending in _STATE are compared to the state variable.
Signed-off-by: Julia Lawall <julia@diku.dk>
Cc: Corey Minyard <minyard@acm.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

df2a9c57

06 Oct, 2009 7 commits

If multiple simple decrements on the same semaphore are pending, then the · 4f4b72ae

Manfred Spraul authored Oct 06, 2009

current code scans all decrement operations, even if the semaphore value
is already 0.

The patch optimizes that: if the semaphore value is 0, then there is no
need to scan the q->alter entries.

Note that this is a common case: It happens if 100 decrements by one are
pending and now an increment by one increases the semaphore value from 0
to 1.  Without this patch, all 100 entries are scanned.  With the patch,
only one entry is scanned, then woken up.  Then the new rule triggers and
the scanning is aborted, without looking at the remaining 99 tasks.

With this patch, single sop increment/decrement by 1 are now O(1).
(same as with Nick's patch)
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Pierre Peiffer <peifferp@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

4f4b72ae
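
The early exit described in 4f4b72ae can be sketched with a simplified queue of pending decrements. The structure and names are hypothetical and much simpler than the real q->alter handling; the function returns how many entries it examined so the effect is visible.

```c
#include <assert.h>
#include <stddef.h>

/* A pending operation that wants to decrement the semaphore by `want`. */
struct sk_pending {
	int want;
	int woken;
};

/* Scan pending decrements after the value changed; return entries examined. */
static size_t sk_update_queue(int *semval, struct sk_pending *q, size_t n)
{
	size_t scanned = 0;

	assert(semval != NULL);
	for (size_t i = 0; i < n; i++) {
		/* New rule: no decrement can succeed once the value is 0. */
		if (*semval == 0)
			break;
		scanned++;
		if (!q[i].woken && q[i].want <= *semval) {
			*semval -= q[i].want;
			q[i].woken = 1;
		}
	}
	return scanned;
}
```

With 100 pending decrements by one and a value that has just gone from 0 to 1, the scan examines one entry, wakes it, and stops.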

sysv sem has the concept of semaphore arrays that consist of multiple · 64686513

Manfred Spraul authored Oct 06, 2009

semaphores.  Atomic operations that affect multiple semaphores are
supported.

The patch optimizes single semaphore operation calls that affect only one
semaphore: It's not necessary to scan all pending operations, it is
sufficient to scan the per-semaphore list.

The idea is from Nick Piggin's version of an ipc sem improvement, but the
implementation is different: the code tries to keep as much common code as
possible.

As a result, the patch is simpler, but optimizes fewer cases.
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Pierre Peiffer <peifferp@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

64686513

Based on Nick's findings: · cdc420dd

Manfred Spraul authored Oct 06, 2009

sysv sem has the concept of semaphore arrays that consist of multiple
semaphores.  Atomic operations that affect multiple semaphores are
supported.

The patch is the first step for optimizing simple, single semaphore
operations: In addition to the global list of all pending operations, a
2nd, per-semaphore list with the simple operations is added.

Note: this patch does not make sense by itself, the new list is used
nowhere.
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Pierre Peiffer <peifferp@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

cdc420dd

Reduce the amount of scanning of the list of pending semaphore operations: · e53da5d6

Manfred Spraul authored Oct 06, 2009

If try_atomic_semop failed, then no changes were applied.  Thus no need to
restart.

Additionally, this patch corrects an incorrect comment: It's possible to
wait for arbitrary semaphore values (do a dec by <x>, wait-for-zero, inc
by <x> in one atomic operation)

Both changes are from Nick Piggin, the patch is the result of a different
split of the individual changes.
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Pierre Peiffer <peifferp@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

e53da5d6

The strange sysv semaphore wakeup scheme has a kind of busy-wait lock · 04331304

Nick Piggin authored Oct 06, 2009

involved, which could deadlock if preemption is enabled during the "lock".

It is an implementation detail (due to a spinlock being held) that this is
actually the case. However if "spinlocks" are made preemptible, or if the
sem lock is changed to a sleeping lock for example, then the wakeup would
become buggy. So this might be a bugfix for -rt kernels.

Imagine waker being preempted by wakee and never clearing IN_WAKEUP -- if
wakee has higher RT priority then there is a priority inversion deadlock.
Even if there is not a priority inversion to cause a deadlock, then there
is still time wasted spinning.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Pierre Peiffer <peifferp@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

04331304

Replace the handcoded list operations in update_queue() with the standard · 7d11c559

Nick Piggin authored Oct 06, 2009

list_for_each_entry macros.

list_for_each_entry_safe() must be used, because list entries can
disappear immediately upon the wakeup event.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Pierre Peiffer <peifferp@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

7d11c559
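
The reason for the _safe variant in 7d11c559 can be shown with a plain singly linked list in userspace. This is not the kernel list API: the `sk_` names are hypothetical and "wakeup" is modelled as freeing the node. The essential step is saving the next pointer before the current entry can disappear.

```c
#include <assert.h>
#include <stdlib.h>

struct sk_node {
	int id;
	struct sk_node *next;
};

/* Push a new node at the head; returns the new head (old head on failure). */
static struct sk_node *sk_push(struct sk_node *head, int id)
{
	struct sk_node *n = malloc(sizeof(*n));

	if (!n)
		return head;
	n->id = id;
	n->next = head;
	return n;
}

/* "Wake" (here: unlink and free) every node with an even id. */
static struct sk_node *sk_wake_even(struct sk_node *head, int *woken)
{
	struct sk_node **link = &head;
	struct sk_node *cur, *next;

	assert(woken != NULL);
	for (cur = head; cur; cur = next) {
		next = cur->next;	/* saved before cur can disappear */
		if (cur->id % 2 == 0) {
			*link = next;	/* unlink the entry */
			free(cur);
			(*woken)++;
		} else {
			link = &cur->next;
		}
	}
	return head;
}
```

Reading `cur->next` in the loop increment instead would touch freed memory, which is the failure the safe iterator is meant to prevent.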

Around a month ago, there was some discussion about an improvement of the · ebba9872

Nick Piggin authored Oct 06, 2009

sysv sem algorithm: Most (at least: some important) users only use simple
semaphore operations, therefore it's worthwhile to optimize this use case.

This patch:

Move last looked up sem_undo struct to the head of the task's undo list. 
Attempt to move common entries to the front of the list so search time is
reduced.  This reduces lookup_undo on oprofile of problematic SAP workload
by 30% (see patch 4 for a description of SAP workload).
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Pierre Peiffer <peifferp@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

ebba9872
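
The move-to-front heuristic from ebba9872 can be sketched on a simple singly linked list. The struct and function names are hypothetical and locking is left out; the sketch only shows how a hit is relinked at the head so that repeated lookups of the same entry get shorter.

```c
#include <assert.h>
#include <stddef.h>

struct sk_undo {
	int semid;
	struct sk_undo *next;
};

/* Look up semid; on a hit, move the entry to the front of the list. */
static struct sk_undo *sk_lookup_undo(struct sk_undo **head, int semid)
{
	struct sk_undo **link = head;
	struct sk_undo *un;

	assert(head != NULL);
	while ((un = *link) != NULL) {
		if (un->semid == semid) {
			*link = un->next;	/* unlink from current position */
			un->next = *head;	/* relink at the front */
			*head = un;
			return un;
		}
		link = &un->next;
	}
	return NULL;
}
```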

13 Oct, 2009 1 commit

We have apparently had a memory leak since · 47276ee9

Serge E. Hallyn authored Oct 13, 2009

7ca7e564 "ipc: store ipcs into IDRs" in
2007.  The idrs, of which 3 exist for each ipc namespace, are never freed.

This patch simply frees them when the ipcns is freed.  I don't believe any
idr_remove() are done from rcu (and could therefore be delayed until after
this idr_destroy()), so the patch should be safe.  Some quick testing
showed no harm, and the memory leak fixed.

Caught by kmemleak.
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

47276ee9

24 Sep, 2009 1 commit

Thanks to Roland who pointed out de_thread() issues. · 7b9f1838

Oleg Nesterov authored Sep 24, 2009

Currently we add sub-threads to ->real_parent->children list.  This buys
nothing but slows down do_wait().

With this patch ->children contains only main threads (group leaders). 
The only complication is that forget_original_parent() should iterate over
sub-threads by hand, and de_thread() needs another list_replace() when it
changes ->group_leader.

Henceforth do_wait_thread() can never see task_detached() && !EXIT_DEAD
tasks, we can remove this check (and we can unify do_wait_thread() and
ptrace_do_wait()).

This change can confuse the optimistic search in mm_update_next_owner(),
but this is fixable and minor.

Perhaps badness() and oom_kill_process() should be updated, but they
should be fixed in any case.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Roland McGrath <roland@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Ratan Nalumasu <rnalumasu@gmail.com>
Cc: Vitaly Mayatskikh <vmayatsk@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

7b9f1838

25 Sep, 2009 1 commit

This adds the utrace facility, a new modular interface in the kernel for · 582156fe

Roland McGrath authored Sep 25, 2009

implementing user thread tracing and debugging. This fits on top of the
tracehook_* layer, so the new code is well-isolated.

The new interface is in <linux/utrace.h> and the DocBook utrace book
describes it. It allows for multiple separate tracing engines to work in
parallel without interfering with each other. Higher-level tracing
facilities can be implemented as loadable kernel modules using this layer.

The new facility is made optional under CONFIG_UTRACE. When this is not
enabled, no new code is added. It can only be enabled on machines that
have all the prerequisites and select CONFIG_HAVE_ARCH_TRACEHOOK.

In this initial version, utrace and ptrace do not play together at all.
If ptrace is attached to a thread, the attach calls in the utrace kernel
API return -EBUSY. If utrace is attached to a thread, the PTRACE_ATTACH
or PTRACE_TRACEME request will return EBUSY to userland. The old ptrace
code is otherwise unchanged and nothing using ptrace should be affected by
this patch as long as utrace is not used at the same time. In the future
we can clean up the ptrace implementation and rework it to use the utrace
API.

[oleg@redhat.com: kill exclude_xtrace logic]
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

582156fe

30 Oct, 2009 1 commit

Move the call to do_signal_stop() down, after tracehook call. This makes · d26be99a

Oleg Nesterov authored Oct 31, 2009

->group_stop_count condition visible to tracers before do_signal_stop()
will participate in this group-stop.

Currently the patch has no effect, tracehook_get_signal() always returns 0.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

d26be99a

16 Oct, 2009 4 commits

Kill force_sig_specific(), this trivial wrapper has no callers. · b6b0d16c

Oleg Nesterov authored Oct 17, 2009

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Roland McGrath <roland@redhat.com>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

b6b0d16c

Trivial, s/0/SI_USER/ in collect_signal() for grep. · c3ce3611

Oleg Nesterov authored Oct 17, 2009

This is a bit confusing, we don't know the source of this signal.
But we don't care, and "info->si_code = 0" is imho worse.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Roland McGrath <roland@redhat.com>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

c3ce3611

Change send_signal() to use si_fromuser(). From now SEND_SIG_NOINFO · 3791609a

Oleg Nesterov authored Oct 17, 2009

triggers the "from_ancestor_ns" check.

This fixes reparent_thread()->group_send_sig_info(pdeath_signal)
behaviour, before this patch send_signal() does not detect the
cross-namespace case when the child of the dying parent belongs to the
sub-namespace.

This patch can affect the behaviour of send_sig(), kill_pgrp() and
kill_pid() when the caller sends the signal to the sub-namespace with
"priv == 0" but surprisingly all callers seem to use them correctly,
including disassociate_ctty(on_exit).

Except: drivers/staging/comedi/drivers/addi-data/*.c incorrectly use
send_sig(priv => 0).  But this is minor and should be fixed anyway.
Reported-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Roland McGrath <roland@redhat.com>
Reviewed-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

3791609a

No changes in compiled code. The patch adds the new helper, si_fromuser() · 3fa06216

Oleg Nesterov authored Oct 17, 2009

and changes check_kill_permission() to use this helper.

The real effect of this patch is that from now on we "officially" consider
SEND_SIG_NOINFO signals as "from user-space" signals. This is already true
if we look at the code which uses SEND_SIG_NOINFO, except __send_signal()
has another opinion - see the next patch.

The naming of these special SEND_SIG_XXX siginfo's is really bad
imho.  From __send_signal()'s pov they mean

	SEND_SIG_NOINFO		from user
	SEND_SIG_PRIV		from kernel
	SEND_SIG_FORCED		no info
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Roland McGrath <roland@redhat.com>
Reviewed-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

3fa06216
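
A userspace sketch of what a helper like si_fromuser() might look like is given below. The sentinel pointer values, the `si_code <= 0` test and all `sk_` names are assumptions made for illustration; the source only gives the from user / from kernel / no info mapping.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct sk_siginfo {
	int si_code;
};

/* Assumed sentinel "pointers" standing in for the special siginfo values. */
#define SK_SEND_SIG_NOINFO ((struct sk_siginfo *)0)	/* from user */
#define SK_SEND_SIG_PRIV   ((struct sk_siginfo *)1)	/* from kernel */
#define SK_SEND_SIG_FORCED ((struct sk_siginfo *)2)	/* no info */

/* True for any of the three sentinel values. */
static bool sk_is_si_special(const struct sk_siginfo *info)
{
	return (uintptr_t)info <= (uintptr_t)SK_SEND_SIG_FORCED;
}

/* NOINFO counts as user-originated; real siginfo is judged by si_code. */
static bool sk_si_fromuser(const struct sk_siginfo *info)
{
	return info == SK_SEND_SIG_NOINFO ||
	       (!sk_is_si_special(info) && info->si_code <= 0);
}
```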

30 Oct, 2009 1 commit

No functional changes. · c74a1705

Oleg Nesterov authored Oct 31, 2009

ptrace_init_task() looks confusing, as if we always auto-attach when "bool
ptrace" argument is true, while in fact we attach only if current is
traced.

Make the code more explicit and kill now unused ptrace_link().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

c74a1705

28 Oct, 2009 1 commit

In global VM, FILE_MAPPED is used but memcg uses MAPPED_FILE. This makes · 5b675426

KAMEZAWA Hiroyuki authored Oct 29, 2009

grep difficult.  Replace memcg's MAPPED_FILE with FILE_MAPPED.

Also, in global VM, mapped shared memory is accounted into FILE_MAPPED,
but memcg doesn't do that.  Fix it.
Note:
  page_is_file_cache() just checks SwapBacked or not.
  So, we need to check PageAnon.

Cc: Balbir Singh <balbir@in.ibm.com>
Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

5b675426

16 Oct, 2009 1 commit

Don't do INIT_WORK() repeatedly against the same work_struct. It can · 72476ce7

Daisuke Nishimura authored Oct 16, 2009

actually lead to a BUG.

Just do it once in initialization.
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

72476ce7

10 Oct, 2009 3 commits

tweak comments · 9c63e80b

Andrew Morton authored Oct 10, 2009

Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

9c63e80b

This is a patch for coalescing access to res_counter at charging by percpu · 8c20eaa6

KAMEZAWA Hiroyuki authored Oct 10, 2009

caching.  At charge time, memcg charges 64 pages and remembers them in a
percpu cache.  Because it is a cache, it is drained/flushed when necessary.

This version uses a public percpu area, which has two benefits:
 1. The sum of stocked charges in the system is limited by the number of
    cpus, not by the number of memcgs. This gives better synchronization.
 2. The drain code for flush/cpuhotplug is very easy (and quick).

The most important point of this patch is that we never touch res_counter
in the fast path. The res_counter is a system-wide shared counter which is
modified very frequently. We should avoid touching it as far as possible to
avoid false sharing.

On an x86-64 8-cpu server, I tested the overhead of memcg at page fault by
running a program which does map/fault/unmap in a loop, running one task
per cpu with taskset and summing the number of page faults over 60 secs.

[without memcg config]
  40156968  page-faults              #      0.085 M/sec   ( +-   0.046% )
  27.67 cache-miss/faults

[root cgroup]
  36659599  page-faults              #      0.077 M/sec   ( +-   0.247% )
  31.58 cache miss/faults

[in a child cgroup]
  18444157  page-faults              #      0.039 M/sec   ( +-   0.133% )
  69.96 cache miss/faults

[ + coalescing uncharge patch]
  27133719  page-faults              #      0.057 M/sec   ( +-   0.155% )
  47.16 cache miss/faults

[ + coalescing uncharge patch + this patch ]
  34224709  page-faults              #      0.072 M/sec   ( +-   0.173% )
  34.69 cache miss/faults

Changelog (since Oct/2):
  - updated comments
  - replaced get_cpu_var() with __get_cpu_var() if possible.
  - removed mutex for system-wide drain. adds a counter instead of it.
  - removed CONFIG_HOTPLUG_CPU

Changelog (old):
  - rebased onto the latest mmotm
  - moved charge size check before __GFP_WAIT check for avoiding unnecessary
  - added asynchronous flush routine.
  - fixed bugs pointed out by Nishimura-san.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

8c20eaa6

In a massively parallel environment, res_counter can be a performance · ec1d6cb0

KAMEZAWA Hiroyuki authored Oct 10, 2009

bottleneck.  One strong technique to reduce lock contention is to reduce
the number of calls by coalescing several calls into one.

Considering the charge/uncharge characteristics,
	- charge is done one by one via demand-paging.
	- uncharge is done
		- in chunks at munmap, truncate, exit, execve...
		- one by one via vmscan/paging.

It seems we have a chance to coalesce uncharges for improving scalability
at unmap/truncation.

This patch is for coalescing uncharges.  To avoid scattering memcg's
structure over functions under /mm, this patch adds memcg batch uncharge
information to the task.  The reason for per-task batching is to make use
of the caller's context information.  We do batched uncharge (delayed
uncharge) when truncation/unmap occurs but do direct uncharge when
uncharge is called by memory reclaim (vmscan.c).

The degree of coalescing depends on the caller:
  - at invalidate/truncate... pagevec size
  - at unmap ....ZAP_BLOCK_SIZE
(memory itself will be freed in this degree.)
So we will not coalesce too much.

On an x86-64 8-cpu server, I tested the overhead of memcg at page fault by
running a program which does map/fault/unmap in a loop, running one task
per cpu with taskset and summing the number of page faults over 60 secs.

[without memcg config]
  40156968  page-faults              #      0.085 M/sec   ( +-   0.046% )
  27.67 cache-miss/faults
[root cgroup]
  36659599  page-faults              #      0.077 M/sec   ( +-   0.247% )
  31.58 miss/faults
[in a child cgroup]
  18444157  page-faults              #      0.039 M/sec   ( +-   0.133% )
  69.96 miss/faults
[child with this patch]
  27133719  page-faults              #      0.057 M/sec   ( +-   0.155% )
  47.16 miss/faults

We can see some improvement.
(The root cgroup isn't affected by this patch.)
Another patch for "charge" will follow this one, and the above will improve further.

Changelog(since 2009/10/02):
 - renamed fields of memcg_batch (pages to bytes, memsw to memsw_bytes)
 - some clean up and commentary/description updates.
 - added initialize code to copy_process(). (possible bug fix)

Changelog(old):
 - fixed !CONFIG_MEM_CGROUP case.
 - rebased onto the latest mmotm + softlimit fix patches.
 - unified patch for callers
 - added comments.
 - made ->do_batch a bool.
 - removed css_get() et al. We don't need it.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

ec1d6cb0
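
The per-task uncharge batching from ec1d6cb0 can be sketched as a start/end bracket around a run of uncharges. The structures and `sk_` names are hypothetical and there is no locking; `updates` counts how often the shared counter is modified.

```c
#include <assert.h>
#include <stdbool.h>

/* Shared counter; `updates` counts modifications. */
struct sk_rescnt {
	long usage;
	long updates;
};

/* Per-task batch state, active only between start and end. */
struct sk_ubatch {
	bool do_batch;
	long bytes;
};

/* Begin batching, e.g. before an unmap or truncate loop. */
static void sk_uncharge_start(struct sk_ubatch *b)
{
	b->do_batch = true;
	b->bytes = 0;
}

/* Uncharge: accumulate while batching, otherwise update directly. */
static void sk_uncharge(struct sk_rescnt *c, struct sk_ubatch *b, long bytes)
{
	assert(bytes >= 0);
	if (b->do_batch) {
		b->bytes += bytes;
		return;
	}
	c->usage -= bytes;	/* direct path, e.g. from reclaim */
	c->updates++;
}

/* End batching: apply the accumulated amount in one update. */
static void sk_uncharge_end(struct sk_rescnt *c, struct sk_ubatch *b)
{
	b->do_batch = false;
	if (b->bytes) {
		c->usage -= b->bytes;
		c->updates++;
		b->bytes = 0;
	}
}
```

Eight page-sized uncharges inside one bracket cost a single counter update, while an uncharge outside the bracket still goes straight to the counter.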

25 Sep, 2009 1 commit

It is not necessary to write custom code to convert calendar time to · 625be630

Zhaolei authored Sep 25, 2009

broken-down time.  time_to_tm() is more generic to do that.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

625be630
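
As a userspace analogue of the conversion mentioned in 625be630, the standard C library can do the same job that custom code would otherwise duplicate. `sk_time_to_tm()` is a hypothetical wrapper around gmtime(); its argument order (seconds, offset in seconds, result) is an assumption, not taken from the message.

```c
#include <assert.h>
#include <time.h>

/* Convert seconds since the epoch, plus an offset, to broken-down UTC time. */
static int sk_time_to_tm(time_t totalsecs, int offset, struct tm *result)
{
	time_t t = totalsecs + offset;
	struct tm *tmp = gmtime(&t);	/* generic library conversion */

	if (!tmp || !result)
		return -1;
	*result = *tmp;
	return 0;
}
```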