Commit bea904d5 authored by Lee Schermerhorn, committed by Linus Torvalds

mempolicy: use MPOL_PREFERRED for system-wide default policy

Currently, when one specifies MPOL_DEFAULT via a NUMA memory policy API
[set_mempolicy(), mbind() and internal versions], the kernel simply installs a
NULL struct mempolicy pointer in the appropriate context: task policy, vma
policy, or shared policy.  This causes any use of that policy to "fall back"
to the next most specific policy scope.
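
For illustration only (not part of this patch), the fallback order can be
sketched as plain C.  effective_policy() is a hypothetical helper, not a
kernel function:

    /*
     * Illustrative sketch, not kernel code: the most specific
     * non-NULL policy wins; only the system default is never NULL.
     */
    struct mempolicy;    /* opaque here; defined in mm/mempolicy.c */

    struct mempolicy *effective_policy(struct mempolicy *vma_pol,
                                       struct mempolicy *task_pol,
                                       struct mempolicy *system_default)
    {
        if (vma_pol)                /* vma or shared policy, if any */
            return vma_pol;
        if (task_pol)               /* installed via set_mempolicy() */
            return task_pol;
        return system_default;      /* hard-coded "local allocation" */
    }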

The only use of MPOL_DEFAULT to mean "local allocation" is in the system
default policy.  This requires extra checks/cases for MPOL_DEFAULT in many
mempolicy.c functions.

There is another, "preferred" way to specify local allocation via the APIs.
That is using the MPOL_PREFERRED policy mode with an empty nodemask.
Internally, the empty nodemask gets converted to a preferred_node id of '-1'.
All internal usage of MPOL_PREFERRED will convert the '-1' to the id of the
node local to the cpu where the allocation occurs.
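
For example, a minimal userspace sketch of requesting local allocation this
way (assumes the set_mempolicy() wrapper from libnuma's <numaif.h>; link
with -lnuma).  A NULL nodemask with maxnode == 0 is treated as an empty
node set:

    #include <numaif.h>
    #include <stdio.h>

    int main(void)
    {
        /* MPOL_PREFERRED + empty nodemask => local allocation */
        if (set_mempolicy(MPOL_PREFERRED, NULL, 0) != 0) {
            perror("set_mempolicy");
            return 1;
        }
        /* subsequent allocations now prefer the local node */
        return 0;
    }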

System default policy, except during boot, is hard-coded to "local
allocation".  By using the MPOL_PREFERRED mode with a negative value of
preferred node for system default policy, MPOL_DEFAULT will never occur in the
'policy' member of a struct mempolicy.  Thus, we can remove all checks for
MPOL_DEFAULT when converting policy to a node id/zonelist in the allocation
paths.
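
For illustration, the '-1' sentinel resolves at allocation time roughly as
follows (a sketch assuming a numa_node_id() helper that returns the node of
the executing cpu, as in the kernel; compare the slab_node() hunk below):

    extern int numa_node_id(void);      /* node of the executing cpu */

    static int resolve_preferred_node(int preferred_node)
    {
        if (preferred_node >= 0)
            return preferred_node;      /* explicit preferred node */
        return numa_node_id();          /* '-1' means local allocation */
    }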

In slab_node() return local node id when policy pointer is NULL.  No need to
set a pol value to take the switch default.  Replace switch default with
BUG()--i.e., shouldn't happen.

With this patch MPOL_DEFAULT is only used in the APIs, including internal
calls to do_set_mempolicy() and in the display of policy in
/proc/<pid>/numa_maps.  It always means "fall back" to the next most
specific policy scope.  This simplifies the description of memory policies
quite a bit, with no visible change in behavior.

get_mempolicy() continues to return MPOL_DEFAULT and an empty nodemask when
the requested policy [task or vma/shared] is NULL.  These are the values one
would supply via set_mempolicy() or mbind() to achieve that condition--default
behavior.
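
A minimal userspace sketch of that query (again assuming <numaif.h> and
-lnuma; the nodemask size here is an illustrative guess, adjust to the
system's node count):

    #include <numaif.h>
    #include <stdio.h>

    int main(void)
    {
        int mode;
        unsigned long mask[16] = { 0 };     /* 1024 node bits */

        /* addr == NULL, flags == 0: query the task policy */
        if (get_mempolicy(&mode, mask, sizeof(mask) * 8, NULL, 0) != 0) {
            perror("get_mempolicy");
            return 1;
        }
        printf("mode=%d (MPOL_DEFAULT=%d)\n", mode, MPOL_DEFAULT);
        return 0;
    }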

This patch updates Documentation to reflect this change.
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
parent 52cd3b07
@@ -147,35 +147,18 @@ Components of Memory Policies
 Linux memory policy supports the following 4 behavioral modes:
 
-    Default Mode--MPOL_DEFAULT:  The behavior specified by this mode is
-    context or scope dependent.
-
-    As mentioned in the Policy Scope section above, during normal
-    system operation, the System Default Policy is hard coded to
-    contain the Default mode.
-
-    In this context, default mode means "local" allocation--that is
-    attempt to allocate the page from the node associated with the cpu
-    where the fault occurs.  If the "local" node has no memory, or the
-    node's memory can be exhausted [no free pages available], local
-    allocation will "fallback to"--attempt to allocate pages from--
-    "nearby" nodes, in order of increasing "distance".
-
-        Implementation detail -- subject to change:  "Fallback" uses
-        a per node list of sibling nodes--called zonelists--built at
-        boot time, or when nodes or memory are added or removed from
-        the system [memory hotplug].  These per node zonelist are
-        constructed with nodes in order of increasing distance based
-        on information provided by the platform firmware.
-
-    When a task/process policy or a shared policy contains the Default
-    mode, this also means "local allocation", as described above.
-
-    In the context of a VMA, Default mode means "fall back to task
-    policy"--which may or may not specify Default mode.  Thus, Default
-    mode can not be counted on to mean local allocation when used
-    on a non-shared region of the address space.  However, see
-    MPOL_PREFERRED below.
+    Default Mode--MPOL_DEFAULT:  This mode is only used in the memory
+    policy APIs.  Internally, MPOL_DEFAULT is converted to the NULL
+    memory policy in all policy scopes.  Any existing non-default policy
+    will simply be removed when MPOL_DEFAULT is specified.  As a result,
+    MPOL_DEFAULT means "fall back to the next most specific policy scope."
+
+        For example, a NULL or default task policy will fall back to the
+        system default policy.  A NULL or default vma policy will fall
+        back to the task policy.
+
+        When specified in one of the memory policy APIs, the Default mode
+        does not use the optional set of nodes.
 
     It is an error for the set of nodes specified for this policy to
     be non-empty.
@@ -187,19 +170,18 @@ Components of Memory Policies
     MPOL_PREFERRED:  This mode specifies that the allocation should be
     attempted from the single node specified in the policy.  If that
-    allocation fails, the kernel will search other nodes, exactly as
-    it would for a local allocation that started at the preferred node
-    in increasing distance from the preferred node.  "Local" allocation
-    policy can be viewed as a Preferred policy that starts at the node
-    containing the cpu where the allocation takes place.
+    allocation fails, the kernel will search other nodes, in order of
+    increasing distance from the preferred node based on information
+    provided by the platform firmware.
 
     Internally, the Preferred policy uses a single node--the
     preferred_node member of struct mempolicy.  A "distinguished
     value of this preferred_node, currently '-1', is interpreted
     as "the node containing the cpu where the allocation takes
-    place"--local allocation.  This is the way to specify
-    local allocation for a specific range of addresses--i.e. for
-    VMA policies.
+    place"--local allocation.  "Local" allocation policy can be
+    viewed as a Preferred policy that starts at the node containing
+    the cpu where the allocation takes place.
 
     It is possible for the user to specify that local allocation is
     always preferred by passing an empty nodemask with this mode.
@@ -104,9 +104,13 @@ static struct kmem_cache *sn_cache;
    policied. */
 enum zone_type policy_zone = 0;
 
+/*
+ * run-time system-wide default policy => local allocation
+ */
 struct mempolicy default_policy = {
     .refcnt = ATOMIC_INIT(1), /* never free it */
-    .mode = MPOL_DEFAULT,
+    .mode = MPOL_PREFERRED,
+    .v = { .preferred_node = -1 },
 };
 
 static const struct mempolicy_operations {
@@ -189,7 +193,7 @@ static struct mempolicy *mpol_new(unsigned short mode, unsigned short flags,
     if (mode == MPOL_DEFAULT) {
         if (nodes && !nodes_empty(*nodes))
             return ERR_PTR(-EINVAL);
-        return NULL;
+        return NULL;    /* simply delete any existing policy */
     }
     VM_BUG_ON(!nodes);
@@ -246,7 +250,6 @@ void __mpol_put(struct mempolicy *p)
 {
     if (!atomic_dec_and_test(&p->refcnt))
         return;
-    p->mode = MPOL_DEFAULT;
     kmem_cache_free(policy_cache, p);
 }
@@ -626,13 +629,16 @@ static long do_set_mempolicy(unsigned short mode, unsigned short flags,
     return 0;
 }
 
-/* Fill a zone bitmap for a policy */
-static void get_zonemask(struct mempolicy *p, nodemask_t *nodes)
+/*
+ * Return nodemask for policy for get_mempolicy() query
+ */
+static void get_policy_nodemask(struct mempolicy *p, nodemask_t *nodes)
 {
     nodes_clear(*nodes);
+    if (p == &default_policy)
+        return;
+
     switch (p->mode) {
-    case MPOL_DEFAULT:
-        break;
     case MPOL_BIND:
         /* Fall through */
     case MPOL_INTERLEAVE:
@@ -686,6 +692,11 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask,
     }
 
     if (flags & MPOL_F_ADDR) {
+        /*
+         * Do NOT fall back to task policy if the
+         * vma/shared policy at addr is NULL.  We
+         * want to return MPOL_DEFAULT in this case.
+         */
         down_read(&mm->mmap_sem);
         vma = find_vma_intersection(mm, addr, addr+1);
         if (!vma) {
@@ -700,7 +711,7 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask,
         return -EINVAL;
 
     if (!pol)
-        pol = &default_policy;
+        pol = &default_policy;    /* indicates default behavior */
 
     if (flags & MPOL_F_NODE) {
         if (flags & MPOL_F_ADDR) {
@@ -715,8 +726,11 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask,
             err = -EINVAL;
             goto out;
         }
-    } else
-        *policy = pol->mode | pol->flags;
+    } else {
+        *policy = pol == &default_policy ? MPOL_DEFAULT :
+                    pol->mode;
+        *policy |= pol->flags;
+    }
 
     if (vma) {
         up_read(&current->mm->mmap_sem);
@@ -725,7 +739,7 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask,
     err = 0;
     if (nmask)
-        get_zonemask(pol, nmask);
+        get_policy_nodemask(pol, nmask);
 
  out:
     mpol_cond_put(pol);
@@ -1286,8 +1300,7 @@ static struct mempolicy *get_vma_policy(struct task_struct *task,
                                 addr);
             if (vpol)
                 pol = vpol;
-        } else if (vma->vm_policy &&
-                vma->vm_policy->mode != MPOL_DEFAULT)
+        } else if (vma->vm_policy)
             pol = vma->vm_policy;
     }
     if (!pol)
@@ -1334,7 +1347,6 @@ static struct zonelist *policy_zonelist(gfp_t gfp, struct mempolicy *policy)
             nd = first_node(policy->v.nodes);
         break;
     case MPOL_INTERLEAVE: /* should not happen */
-    case MPOL_DEFAULT:
         nd = numa_node_id();
         break;
     default:
@@ -1369,9 +1381,15 @@ static unsigned interleave_nodes(struct mempolicy *policy)
  */
 unsigned slab_node(struct mempolicy *policy)
 {
-    unsigned short pol = policy ? policy->mode : MPOL_DEFAULT;
+    if (!policy)
+        return numa_node_id();
+
+    switch (policy->mode) {
+    case MPOL_PREFERRED:
+        if (unlikely(policy->v.preferred_node >= 0))
+            return policy->v.preferred_node;
+        return numa_node_id();
 
-    switch (pol) {
     case MPOL_INTERLEAVE:
         return interleave_nodes(policy);
@@ -1390,13 +1408,8 @@ unsigned slab_node(struct mempolicy *policy)
             return zone->node;
     }
 
-    case MPOL_PREFERRED:
-        if (policy->v.preferred_node >= 0)
-            return policy->v.preferred_node;
-        /* Fall through */
-
     default:
-        return numa_node_id();
+        BUG();
     }
 }
@@ -1650,8 +1663,6 @@ int __mpol_equal(struct mempolicy *a, struct mempolicy *b)
     if (a->mode != MPOL_DEFAULT && !mpol_match_intent(a, b))
         return 0;
     switch (a->mode) {
-    case MPOL_DEFAULT:
-        return 1;
     case MPOL_BIND:
         /* Fall through */
     case MPOL_INTERLEAVE:
@@ -1828,7 +1839,7 @@ void mpol_shared_policy_init(struct shared_policy *info, unsigned short policy,
     if (policy != MPOL_DEFAULT) {
         struct mempolicy *newpol;
 
-        /* Falls back to MPOL_DEFAULT on any error */
+        /* Falls back to NULL policy [MPOL_DEFAULT] on any error */
         newpol = mpol_new(policy, flags, policy_nodes);
         if (!IS_ERR(newpol)) {
             /* Create pseudo-vma that contains just the policy */
@@ -1952,9 +1963,14 @@ static inline int mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol)
     char *p = buffer;
     int l;
     nodemask_t nodes;
-    unsigned short mode = pol ? pol->mode : MPOL_DEFAULT;
+    unsigned short mode;
     unsigned short flags = pol ? pol->flags : 0;
 
+    if (!pol || pol == &default_policy)
+        mode = MPOL_DEFAULT;
+    else
+        mode = pol->mode;
+
     switch (mode) {
     case MPOL_DEFAULT:
         nodes_clear(nodes);