- 08 Jan, 2009 40 commits
-
-
roel kluin authored
romfs_strnlen() returns int unsigned X >= 0 is always true [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: roel kluin <roel.kluin@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Magnus Damm authored
Remove the saved_max_pfn check from the /proc/vmcore function read_from_oldmem(). No need to verify, we should be able to just trust that "elfcorehdr=" is correctly passed to the crash kernel on the kernel command line like we do with other parameters. The read_from_oldmem() function in fs/proc/vmcore.c is quite similar to read_from_oldmem() in drivers/char/mem.c, but only in the latter it makes sense to use saved_max_pfn. For oldmem it is used to determine when to stop reading. For vmcore we already have the elf header info pointing out the physical memory regions, no need to pass the end-of- old-memory twice. Removing the saved_max_pfn check from vmcore makes it possible for architectures to skip oldmem but still support crash dump through vmcore - without the need for the old saved_max_pfn cruft. Architectures that want to play safe can do the saved_max_pfn check in copy_oldmem_page(). Not sure why anyone would want to do that, but that's even safer than today - the saved_max_pfn check in vmcore removed by this patch only checks the first page. Signed-off-by: Magnus Damm <damm@igel.co.jp> Acked-by: Vivek Goyal <vgoyal@redhat.com> Acked-by: Simon Horman <horms@verge.net.au> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Evgeniy Polyakov authored
Send completion status of the commands to the userspace. Message and protocol are described in the documentation. Signed-off-by: Evgeniy Polyakov <zbr@ioremap.net> Cc: Paul Alfille <paul.alfille@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Evgeniy Polyakov authored
Command which allows to reset the bus. Signed-off-by: Evgeniy Polyakov <zbr@ioremap.net> Cc: Paul Alfille <paul.alfille@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Evgeniy Polyakov authored
Signed-off-by: Evgeniy Polyakov <zbr@ioremap.net> Cc: Paul Alfille <paul.alfille@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Evgeniy Polyakov authored
This small patchset extendes existing commands with reset, master IO and status messages. Reset is used to reset the bus for given master device, master IO command allows to initiate IO against bus itself not selecting slave device first, which can be used to probe the device for example. And status messages carry command completion status back to the userspace (namely very useful to get -ENODEV from when requested device was not found). Great thanks to Paul Alfille of OWFS for testing and commands suggestions. This patch: Allow starting of IO not against already found slave devices, but against the bus itself, which can be used for example to probe devices. [akpm@linux-foundation.org: reindent switch statements] Signed-off-by: Evgeniy Polyakov <zbr@ioremap.net> Cc: Paul Alfille <paul.alfille@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Evgeniy Polyakov authored
Signed-off-by: Evgeniy Polyakov <zbr@ioremap.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Evgeniy Polyakov authored
Initiates search (or alarm search) and returns all found devices to userspace. Found devices are not added into the system (i.e. they are not attached to family devices or bus masters), it will be done via (if was not done yet) usual timed searching. Signed-off-by: Evgeniy Polyakov <zbr@ioremap.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Evgeniy Polyakov authored
Writes and returns sampled data back to userspace. Signed-off-by: Evgeniy Polyakov <zbr@ioremap.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Evgeniy Polyakov authored
This patch series introduces and extends several userspace commands used with netlink protocol. Touch block command allows to write data and return sampled data to the userspace. Extended search and alarm seach commands to return list of slave devices found during given search. List masters command allows to send all registered master IDs to the userspace. Great thanks to Paul Alfille (owfs) who tested this implementation and wrote w1-to-network daemon http://sourceforge.net/projects/w1repeater/ and Frederik Deweerdt and Randy Dunlap for review. This patch: Returns list of registered bus master devices. Signed-off-by: Evgeniy Polyakov <zbr@ioremap.net> Cc: Paul Alfille <paul.alfille@gmail.com> Cc: Frederik Deweerdt <frederik.deweerdt@xprog.eu> Cc: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Sascha Hauer authored
This patch adds support for the 1-wire master interface for i.MX27 and i.MX31. Signed-off-by: Luotao Fu <l.fu@pengutronix.de> Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de> Signed-off-by: Evgeniy Polyakov <zbr@ioremap.net> Cc: Russell King <rmk@arm.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
David Smith authored
Signed-off-by: Marcel Selhorst <tpm@selhorst.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Matthew Garrett authored
Add a driver for controlling Dell-specific backlight and rfkill interfaces. This driver makes use of the dcdbas interface to the Dell firmware to allow the backlight and rfkill interfaces on Dell systems to be driven through the standardised sysfs interfaces. Signed-off-by: Matthew Garrett <mjg@redhat.com> Cc: Matt Domsch <Matt_Domsch@dell.com> Cc: Ivo van Doorn <ivdoorn@gmail.com> Cc: Len Brown <lenb@kernel.org> Cc: Richard Purdie <rpurdie@rpsys.net> Cc: Henrique de Moraes Holschuh <hmh@hmh.eng.br> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Matthew Garrett authored
The dcdbas code allows calls to be made into the firmware on Dell systems. Exporting this to other drivers allows them to implement Dell-specific functionality in a safe way. Signed-off-by: Matthew Garrett <mjg@redhat.com> Cc: Matt Domsch <Matt_Domsch@dell.com> Cc: Greg KH <greg@kroah.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Kees Cook authored
While discussing[1] the need for glibc to have access to random bytes during program load, it seems that an earlier attempt to implement AT_RANDOM got stalled. This implements a random 16 byte string, available to every ELF program via a new auxv AT_RANDOM vector. [1] http://sourceware.org/ml/libc-alpha/2008-10/msg00006.html Ulrich said: glibc needs right after startup a bit of random data for internal protections (stack canary etc). What is now in upstream glibc is that we always unconditionally open /dev/urandom, read some data, and use it. For every process startup. That's slow. ... The solution is to provide a limited amount of random data to the starting process in the aux vector. I suggested 16 bytes and this is what the patch implements. If we need only 16 bytes or less we use the data directly. If we need more we'll use the 16 bytes to see a PRNG. This avoids the costly /dev/urandom use and it allows the kernel to use the most adequate source of random data for this purpose. It might not be the same pool as that for /dev/urandom. Concerns were expressed about the depletion of the randomness pool. But this patch doesn't make the situation worse, it doesn't deplete entropy more than happens now. Signed-off-by: Kees Cook <kees.cook@canonical.com> Cc: Jakub Jelinek <jakub@redhat.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Ulrich Drepper <drepper@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Sukadev Bhattiprolu authored
If a process registers for asynchronous notification on a POSIX message queue, it gets a signal and a siginfo_t structure when a message arrives on the message queue. The si_pid in the siginfo_t structure is set to the PID of the process that sent the message to the message queue. The principle is the following: . when mq_notify(SIGEV_SIGNAL) is called, the caller registers for notification when a msg arrives. The associated pid structure is stroed into inode_info->notify_owner. Let's call this process P1. . when mq_send() is called by say P2, P2 sends a signal to P1 to notify him about msg arrival. The way .si_pid is set today is not correct, since it doesn't take into account the fact that the process that is sending the message might not be in the same namespace as the notified one. This patch proposes to set si_pid to the sender's pid into the notify_owner namespace. Signed-off-by: Nadia Derbey <Nadia.Derbey@bull.net> Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Roland McGrath <roland@redhat.com> Cc: Bastian Blank <bastian@waldi.eu.org> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Eric W. Biederman <ebiederm@xmission.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Eric W. Biederman authored
Currently task_active_pid_ns is not safe to call after a task becomes a zombie and exit_task_namespaces is called, as nsproxy becomes NULL. By reading the pid namespace from the pid of the task we can trivially solve this problem at the cost of one extra memory read in what should be the same cacheline as we read the namespace from. When moving things around I have made task_active_pid_ns out of line because keeping it in pid_namespace.h would require adding includes of pid.h and sched.h that I don't think we want. This change does make task_active_pid_ns unsafe to call during copy_process until we attach a pid on the task_struct which seems to be a reasonable trade off. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Roland McGrath <roland@redhat.com> Cc: Bastian Blank <bastian@waldi.eu.org> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Nadia Derbey <Nadia.Derbey@bull.net> Acked-by: Serge Hallyn <serue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Eric W. Biederman authored
A current problem with the pid namespace is that it is easy to do pid related work after exit_task_namespaces which drops the nsproxy pointer. However if we are doing pid namespace related work we are always operating on some struct pid which retains the pid_namespace pointer of the pid namespace it was allocated in. So provide ns_of_pid which allows us to find the pid namespace a pid was allocated in. Using this we have the needed infrastructure to do pid namespace related work at anytime we have a struct pid, removing the chance of accidentally having a NULL pointer dereference when accessing current->nsproxy. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Roland McGrath <roland@redhat.com> Cc: Bastian Blank <bastian@waldi.eu.org> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Nadia Derbey <Nadia.Derbey@bull.net> Acked-by: Serge Hallyn <serue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Li Zefan authored
Impact: cleanups, use new cpumask API Final trivial cleanups: mainly s/cpumask_t/struct cpumask Note there is a FIXME in generate_sched_domains(). A future patch will change struct cpumask *doms to struct cpumask *doms[]. (I suppose Rusty will do this.) Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Rusty Russell <rusty@rustcorp.com.au> Acked-by: Mike Travis <travis@sgi.com> Cc: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Li Zefan authored
Impact: use new cpumask API This patch mainly does the following things: - change cs->cpus_allowed from cpumask_t to cpumask_var_t - call alloc_bootmem_cpumask_var() for top_cpuset in cpuset_init_early() - call alloc_cpumask_var() for other cpusets - replace cpus_xxx() to cpumask_xxx() Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Rusty Russell <rusty@rustcorp.com.au> Acked-by: Mike Travis <travis@sgi.com> Cc: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Li Zefan authored
Impact: cleanups, reduce stack usage This patch prepares for the next patch. When we convert cpuset.cpus_allowed to cpumask_var_t, (trialcs = *cs) no longer works. Another result of this patch is reducing stack usage of trialcs. sizeof(*cs) can be as large as 148 bytes on x86_64, so it's really not good to have it on stack. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Rusty Russell <rusty@rustcorp.com.au> Acked-by: Mike Travis <travis@sgi.com> Cc: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Li Zefan authored
Impact: reduce stack usage Allocate a global cpumask_var_t at boot, and use it in cpuset_attach(), so we won't fail cpuset_attach(). Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Rusty Russell <rusty@rustcorp.com.au> Acked-by: Mike Travis <travis@sgi.com> Cc: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Li Zefan authored
Impact: reduce stack usage Just use cs->cpus_allowed, and no need to allocate a cpumask_var_t. Signed-off-by: Li Zefan <lizf@cn.fujistu.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Rusty Russell <rusty@rustcorp.com.au> Acked-by: Mike Travis <travis@sgi.com> Cc: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Li Zefan authored
This patchset converts cpuset to use new cpumask API, and thus remove on stack cpumask_t to reduce stack usage. Before: # cat kernel/cpuset.c include/linux/cpuset.h | grep -c cpumask_t 21 After: # cat kernel/cpuset.c include/linux/cpuset.h | grep -c cpumask_t 0 This patch: Impact: reduce stack usage It's safe to call cpulist_scnprintf inside callback_mutex, and thus we can just remove the cpumask_t and no need to allocate a cpumask_var_t. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Rusty Russell <rusty@rustcorp.com.au> Acked-by: Mike Travis <travis@sgi.com> Cc: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Miao Xie authored
I found a bug on my dual-cpu box. I created a sub cpuset in top cpuset and assign 1 to its cpus. And then we attach some tasks into this sub cpuset. After this, we offline CPU1. Now, the tasks in this new cpuset are moved into top cpuset automatically because there is no cpu in sub cpuset. Then we online CPU1, we find all the tasks which doesn't belong to top cpuset originally just run on CPU0. We fix this bug by setting task's cpu_allowed to cpu_possible_map when attaching it into top cpuset. This method needn't modify the current behavior of cpusets on CPU hotplug, and all of tasks in top cpuset use cpu_possible_map to initialize their cpu_allowed. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Cc: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Lai Jiangshan authored
task_cs() calls task_subsys_state(). We must use rcu_read_lock() to protect cgroup_subsys_state(). It's correct that top_cpuset is never freed, but cgroup_subsys_state() accesses css_set, this css_set maybe freed when task_cs() called. We use use rcu_read_lock() to protect it. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Acked-by: Paul Menage <menage@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Paul Menage authored
Add css_tryget(), that obtains a counted reference on a CSS. It is used in situations where the caller has a "weak" reference to the CSS, i.e. one that does not protect the cgroup from removal via a reference count, but would instead be cleaned up by a destroy() callback. css_tryget() will return true on success, or false if the cgroup is being removed. This is similar to Kamezawa Hiroyuki's patch from a week or two ago, but with the difference that in the event of css_tryget() racing with a cgroup_rmdir(), css_tryget() will only return false if the cgroup really does get removed. This implementation is done by biasing css->refcnt, so that a refcnt of 1 means "releasable" and 0 means "released or releasing". In the event of a race, css_tryget() distinguishes between "released" and "releasing" by checking for the CSS_REMOVED flag in css->flags. Signed-off-by: Paul Menage <menage@google.com> Tested-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Paul Menage authored
Update the memory controller to use its hierarchy_mutex rather than calling cgroup_lock() to protected against cgroup_mkdir()/cgroup_rmdir() from occurring in its hierarchy. Signed-off-by: Paul Menage <menage@google.com> Tested-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Paul Menage authored
These patches introduce new locking/refcount support for cgroups to reduce the need for subsystems to call cgroup_lock(). This will ultimately allow the atomicity of cgroup_rmdir() (which was removed recently) to be restored. These three patches give: 1/3 - introduce a per-subsystem hierarchy_mutex which a subsystem can use to prevent changes to its own cgroup tree 2/3 - use hierarchy_mutex in place of calling cgroup_lock() in the memory controller 3/3 - introduce a css_tryget() function similar to the one recently proposed by Kamezawa, but avoiding spurious refcount failures in the event of a race between a css_tryget() and an unsuccessful cgroup_rmdir() Future patches will likely involve: - using hierarchy mutex in place of cgroup_lock() in more subsystems where appropriate - restoring the atomicity of cgroup_rmdir() with respect to cgroup_create() This patch: Add a hierarchy_mutex to the cgroup_subsys object that protects changes to the hierarchy observed by that subsystem. It is taken by the cgroup subsystem (in addition to cgroup_mutex) for the following operations: - linking a cgroup into that subsystem's cgroup tree - unlinking a cgroup from that subsystem's cgroup tree - moving the subsystem to/from a hierarchy (including across the bind() callback) Thus if the subsystem holds its own hierarchy_mutex, it can safely traverse its own hierarchy. Signed-off-by: Paul Menage <menage@google.com> Tested-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
KAMEZAWA Hiroyuki authored
Now, you can see following even when swap accounting is enabled. 1. Create Group 01, and 02. 2. allocate a "file" on tmpfs by a task under 01. 3. swap out the "file" (by memory pressure) 4. Read "file" from a task in group 02. 5. the charge of "file" is moved to group 02. This is not ideal behavior. This is because SwapCache which was loaded by read-ahead is not taken into account.. This is a patch to fix shmem's swapcache behavior. - remove mem_cgroup_cache_charge_swapin(). - Add SwapCache handler routine to mem_cgroup_cache_charge(). By this, shmem's file cache is charged at add_to_page_cache() with GFP_NOWAIT. - pass the page of swapcache to shrink_mem_cgroup. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Paul Menage <menage@google.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
KAMEZAWA Hiroyuki authored
Now, a page can be deleted from SwapCache while do_swap_page(). memcg-fix-swap-accounting-leak-v3.patch handles that, but, LRU handling is still broken. (above behavior broke assumption of memcg-synchronized-lru patch.) This patch is a fix for LRU handling (especially for per-zone counters). At charging SwapCache, - Remove page_cgroup from LRU if it's not used. - Add page cgroup to LRU if it's not linked to. Reported-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Paul Menage <menage@google.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
KAMEZAWA Hiroyuki authored
From:KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> css_tryget() newly is added and we can know css is alive or not and get refcnt of css in very safe way. ("alive" here means "rmdir/destroy" is not called.) This patch replaces css_get() to css_tryget(), where I cannot explain why css_get() is safe. And removes memcg->obsolete flag. Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Paul Menage <menage@google.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
KAMEZAWA Hiroyuki authored
1. Fix double-free BUG in error route of mem_cgroup_create(). mem_cgroup_free() itself frees per-zone-info. 2. Making refcnt of memcg simple. Add 1 refcnt at creation and call free when refcnt goes down to 0. Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Paul Menage <menage@google.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
KAMEZAWA Hiroyuki authored
Fix swapin charge operation of memcg. Now, memcg has hooks to swap-out operation and checks SwapCache is really unused or not. That check depends on contents of struct page. I.e. If PageAnon(page) && page_mapped(page), the page is recoginized as still-in-use. Now, reuse_swap_page() calles delete_from_swap_cache() before establishment of any rmap. Then, in followinig sequence (Page fault with WRITE) try_charge() (charge += PAGESIZE) commit_charge() (Check page_cgroup is used or not..) reuse_swap_page() -> delete_from_swapcache() -> mem_cgroup_uncharge_swapcache() (charge -= PAGESIZE) ...... New charge is uncharged soon.... To avoid this, move commit_charge() after page_mapcount() goes up to 1. By this, try_charge() (usage += PAGESIZE) reuse_swap_page() (may usage -= PAGESIZE if PCG_USED is set) commit_charge() (If page_cgroup is not marked as PCG_USED, add new charge.) Accounting will be correct. Changelog (v2) -> (v3) - fixed invalid charge to swp_entry==0. - updated documentation. Changelog (v1) -> (v2) - fixed comment. [nishimura@mxp.nes.nec.co.jp: swap accounting leak doc fix] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Tested-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Daisuke Nishimura authored
mem_cgroup_hierarchicl_reclaim() works properly even when !use_hierarchy now (by memcg-hierarchy-avoid-unnecessary-reclaim.patch), so, instead of try_to_free_mem_cgroup_pages(), it should be used in many cases. The only exception is force_empty. The group has no children in this case. Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Daisuke Nishimura authored
mpol_rebind_mm(), which can be called from cpuset_attach(), does down_write(mm->mmap_sem). This means down_write(mm->mmap_sem) can be called under cgroup_mutex. OTOH, page fault path does down_read(mm->mmap_sem) and calls mem_cgroup_try_charge_xxx(), which may eventually calls mem_cgroup_out_of_memory(). And mem_cgroup_out_of_memory() calls cgroup_lock(). This means cgroup_lock() can be called under down_read(mm->mmap_sem). If those two paths race, deadlock can happen. This patch avoid this deadlock by: - remove cgroup_lock() from mem_cgroup_out_of_memory(). - define new mutex (memcg_tasklist) and serialize mem_cgroup_move_task() (->attach handler of memory cgroup) and mem_cgroup_out_of_memory. Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Daisuke Nishimura authored
After previous patch, mem_cgroup_try_charge is not used by anyone, so we can remove it. Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Daisuke Nishimura authored
I think triggering OOM at mem_cgroup_prepare_migration would be just a bit overkill. Returning -ENOMEM would be enough for mem_cgroup_prepare_migration. The caller would handle the case anyway. Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
KAMEZAWA Hiroyuki authored
Documentation for implementation details and how to test. Just an example. feel free to modify, add, remove lines. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Hugh Dickins <hugh@veritas.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
KAMEZAWA Hiroyuki authored
Show "real" limit of memcg. This helps my debugging and maybe useful for users. While testing hierarchy like this mount -t cgroup none /cgroup -t memory mkdir /cgroup/A set use_hierarchy==1 to "A" mkdir /cgroup/A/01 mkdir /cgroup/A/01/02 mkdir /cgroup/A/01/03 mkdir /cgroup/A/01/03/04 mkdir /cgroup/A/08 mkdir /cgroup/A/08/01 .... and set each own limit to them, "real" limit of each memcg is unclear. This patch shows real limit by checking all ancestors. Changelog: (v1) -> (v2) - remove "if" and use "min(a,b)" Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-