Hugepage migration is now enabled, although it is restricted to pmd-based
hugepages for now (due to lack of testing). So we should allocate
migratable hugepages from ZONE_MOVABLE if possible.
This patch makes GFP flags in hugepage allocation dependent on migration
support, not only on the value of hugepages_treat_as_movable. It does not
change the behavior for architectures which do not support hugepage
migration.
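The selection rule described above can be modelled in userspace roughly as follows. This is a minimal sketch, not the kernel code: the mask values, struct hstate_model and every model_* name are hypothetical stand-ins. The only thing taken from the message is that either hugepages_treat_as_movable or migration support (pmd-based hugepages only, for now) selects a movable allocation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's GFP masks; the values are arbitrary. */
#define MODEL_GFP_HIGHUSER          0x1u
#define MODEL_GFP_HIGHUSER_MOVABLE  0x3u

/* Hypothetical model of a hugepage size class. */
struct hstate_model {
	bool pmd_based;         /* hugepage is mapped at pmd level */
	bool arch_can_migrate;  /* architecture supports hugepage migration */
};

/* Models the sysctl named in the commit message. */
bool hugepages_treat_as_movable;

/* Migration is only assumed for pmd-based hugepages on supporting archs. */
bool model_migration_support(const struct hstate_model *h)
{
	assert(h != NULL);
	return h->arch_can_migrate && h->pmd_based;
}

/* Pick the mask: movable if the sysctl or migration support allows it. */
unsigned int model_alloc_mask(const struct hstate_model *h)
{
	if (hugepages_treat_as_movable || model_migration_support(h))
		return MODEL_GFP_HIGHUSER_MOVABLE;
	return MODEL_GFP_HIGHUSER;
}
```

With the sysctl left at zero, an architecture without migration support still gets the non-movable mask in this model, which is the "no change in behavior" case the message mentions.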
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Simple cleanup. Every user of vma_set_policy() does the same work, which
looks a bit annoying imho. Add a new trivial helper which does
mpol_dup() + vma_set_policy() to simplify the callers.
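The shape of such a helper can be sketched in userspace as below. The structures and the model_* names are simplified, hypothetical stand-ins; in particular, the real mpol_dup() error convention is not modelled here, and a plain NULL / -ENOMEM scheme is assumed instead.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct mempolicy_model { int mode; };
struct vma_model { struct mempolicy_model *policy; };

/* Model of the duplication step: a private copy, or NULL on failure. */
struct mempolicy_model *model_mpol_dup(const struct mempolicy_model *old)
{
	struct mempolicy_model *new = malloc(sizeof(*new));

	if (new)
		*new = *old;
	return new;
}

/* The combined helper: duplicate src's policy and install it in dst. */
int model_vma_dup_policy(const struct vma_model *src, struct vma_model *dst)
{
	struct mempolicy_model *pol = NULL;

	assert(src && dst);
	if (src->policy) {
		pol = model_mpol_dup(src->policy);
		if (!pol)
			return -ENOMEM;
	}
	dst->policy = pol;	/* the step callers used to repeat by hand */
	return 0;
}
```

Callers then make a single call and check a single return value instead of open-coding the duplicate, error check and assignment each time.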
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
do_fork() denies CLONE_THREAD | CLONE_PARENT if CLONE_NEWUSER | CLONE_NEWPID.
Then later copy_process() denies CLONE_SIGHAND if the new process will
be in a different pid namespace (task_active_pid_ns() doesn't match
current->nsproxy->pid_ns).
This looks confusing and inconsistent. CLONE_NEWPID is very similar to
the case when ->pid_ns was already unshared; we want the same
restrictions, so copy_process() should also nack CLONE_PARENT.
And it would be better to deny CLONE_NEWUSER && CLONE_SIGHAND as well
just for consistency.
Kill the "CLONE_NEWUSER | CLONE_NEWPID" check in do_fork() and change
copy_process() to do the same check along with ->pid_ns check we already
have.
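A userspace model of the combined check might look like the sketch below. It assumes, as the text suggests, that the test is on CLONE_SIGHAND and CLONE_PARENT. The M_CLONE_* constants are local copies that mirror the Linux uapi flag values, and pid_ns_differs is a hypothetical parameter standing in for the task_active_pid_ns() vs current->nsproxy->pid_ns comparison.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Local copies of the Linux clone flag values, for illustration only. */
#define M_CLONE_VM       0x00000100UL
#define M_CLONE_SIGHAND  0x00000800UL
#define M_CLONE_PARENT   0x00008000UL
#define M_CLONE_NEWUSER  0x10000000UL
#define M_CLONE_NEWPID   0x20000000UL

/*
 * Model of the single check in copy_process(): sharing signal handlers or
 * the parent is refused both when a new user/pid namespace is requested
 * and when the pid namespace was already unshared.
 */
int model_check_clone_flags(unsigned long flags, bool pid_ns_differs)
{
	if (flags & (M_CLONE_SIGHAND | M_CLONE_PARENT)) {
		if ((flags & (M_CLONE_NEWUSER | M_CLONE_NEWPID)) || pid_ns_differs)
			return -EINVAL;
	}
	return 0;
}
```

In this model CLONE_NEWPID on its own is still accepted, while CLONE_NEWPID | CLONE_PARENT and CLONE_PARENT after an earlier unshare are both rejected, which is the consistency the message asks for.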
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Andy Lutomirski <luto@amacapital.net>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Colin Walters <walters@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 8382fcac1b ("pidns: Outlaw thread creation after
unshare(CLONE_NEWPID)") nacks CLONE_NEWPID if the forking process
unshared pid_ns. This is correct but unnecessary: copy_pid_ns() does
the same check.
Remove the CLONE_NEWPID check to clean up the code and prepare for the
next change.
Test-case:
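	#define _GNU_SOURCE
	#include <assert.h>
	#include <errno.h>
	#include <sched.h>
	#include <signal.h>

	/* Trivial child; it is not expected to run in this test. */
	static int child(void *arg)
	{
		return 0;
	}

	static char stack[16 * 1024];

	int main(void)
	{
		pid_t pid;

		assert(unshare(CLONE_NEWUSER | CLONE_NEWPID) == 0);
		/* pid_ns is already unshared, so CLONE_NEWPID must be refused. */
		pid = clone(child, stack + sizeof(stack) / 2,
			    CLONE_NEWPID | SIGCHLD, NULL);
		assert(pid < 0 && errno == EINVAL);
		return 0;
	}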
clone(CLONE_NEWPID) correctly fails with or without this change.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Andy Lutomirski <luto@amacapital.net>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Colin Walters <walters@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 8382fcac1b ("pidns: Outlaw thread creation after
unshare(CLONE_NEWPID)") nacks CLONE_VM if the forking process unshared
pid_ns, this obviously breaks vfork:
	#define _GNU_SOURCE
	#include <assert.h>
	#include <sched.h>
	#include <unistd.h>

	int main(void)
	{
		assert(unshare(CLONE_NEWUSER | CLONE_NEWPID) == 0);
		assert(vfork() >= 0);
		_exit(0);
		return 0;
	}
fails without this patch.
Change this check to use CLONE_SIGHAND instead. This also forbids
CLONE_THREAD automatically, and this is what the comment implies.
We could probably even drop CLONE_SIGHAND and use CLONE_THREAD, but it
would be safer not to do this. The current check denies CLONE_SIGHAND
implicitly and there is no reason to change this.
Eric said "CLONE_SIGHAND is fine. CLONE_THREAD would be even better.
Having shared signal handling between two different pid namespaces is
the case that we are fundamentally guarding against."
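The effect of switching the tested flag can be illustrated with a small userspace model. This is a sketch under stated assumptions: the M_CLONE_* constants are local copies mirroring the Linux uapi values, pid_ns_unshared is a hypothetical parameter standing in for the pid namespace comparison, and vfork() is assumed to be issued as CLONE_VFORK | CLONE_VM | SIGCHLD.

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stdbool.h>

/* Local copies of the Linux clone flag values, for illustration only. */
#define M_CLONE_VM       0x00000100UL
#define M_CLONE_SIGHAND  0x00000800UL
#define M_CLONE_VFORK    0x00004000UL
#define M_CLONE_THREAD   0x00010000UL

/* Check before the change: any CLONE_VM user is refused. */
int model_old_check(unsigned long flags, bool pid_ns_unshared)
{
	return ((flags & M_CLONE_VM) && pid_ns_unshared) ? -EINVAL : 0;
}

/* Check after the change: only shared signal handling is refused. */
int model_new_check(unsigned long flags, bool pid_ns_unshared)
{
	return ((flags & M_CLONE_SIGHAND) && pid_ns_unshared) ? -EINVAL : 0;
}
```

Because thread creation has to pass CLONE_SIGHAND together with CLONE_THREAD, the new check still rejects threads in this model, while a vfork()-style flag set is no longer caught.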
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reported-by: Colin Walters <walters@redhat.com>
Acked-by: Andy Lutomirski <luto@amacapital.net>
Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull vfs pile 3 (of many) from Al Viro:
"Waiman's conversion of d_path() and bits related to it,
kern_path_mountpoint(), several cleanups and fixes (exportfs
one is -stable fodder, IMO).
There definitely will be more... ;-/"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
split read_seqretry_or_unlock(), convert d_walk() to resulting primitives
dcache: Translating dentry into pathname without taking rename_lock
autofs4 - fix device ioctl mount lookup
introduce kern_path_mountpoint()
rename user_path_umountat() to user_path_mountpoint_at()
take unlazy_walk() into umount_lookup_last()
Kill indirect include of file.h from eventfd.h, use fdget() in cgroup.c
prune_super(): sb->s_op is never NULL
exportfs: don't assume that ->iterate() won't feed us too long entries
afs: get rid of redundant ->d_name.len checks
Merge tag 'trace-3.12' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
"Not many changes for the 3.12 merge window. The major tracing changes
are still in flux, and will have to wait for 3.13.
The changes for 3.12 are mostly clean ups and minor fixes.
H. Peter Anvin added a check to x86_32 static function tracing that
helps a small segment of the kernel community.
Oleg Nesterov had a few changes from 3.11, but they were mostly clean ups
and not worth pushing in the -rc time frame.
Li Zefan had a small clean up, annotating a raw_init with __init.
I fixed a slight race in updating function callbacks, but the race is
so small and the bug that happens when it occurs is so minor it's not
even worth pushing to stable.
The only real enhancement is from Alexander Z Lam that made the
tracing_cpumask work for trace buffer instances, instead of them all
sharing a global cpumask"
* tag 'trace-3.12' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
ftrace/rcu: Do not trace debug_lockdep_rcu_enabled()
x86-32, ftrace: Fix static ftrace when early microcode is enabled
ftrace: Fix a slight race in modifying what function callback gets traced
tracing: Make tracing_cpumask available for all instances
tracing: Kill the !CONFIG_MODULES code in trace_events.c
tracing: Don't pass file_operations array to event_create_dir()
tracing: Kill trace_create_file_ops() and friends
tracing/syscalls: Annotate raw_init function with __init
Merge tag 'xfs-for-linus-v3.12-rc1' of git://oss.sgi.com/xfs/xfs
Pull xfs updates from Ben Myers:
"For 3.12-rc1 there are a number of bugfixes in addition to work to
ease usage of shared code between libxfs and the kernel, the rest of
the work to enable project and group quotas to be used simultaneously,
performance optimisations in the log and the CIL, directory entry file
type support, fixes for log space reservations, some spelling/grammar
cleanups, and the addition of user namespace support.
- introduce readahead to log recovery
- add directory entry file type support
- fix a number of spelling errors in comments
- introduce new Q_XGETQSTATV quotactl for project quotas
- add USER_NS support
- log space reservation rework
- CIL optimisations
- kernel/userspace libxfs rework"
* tag 'xfs-for-linus-v3.12-rc1' of git://oss.sgi.com/xfs/xfs: (112 commits)
xfs: XFS_MOUNT_QUOTA_ALL needed by userspace
xfs: dtype changed xfs_dir2_sfe_put_ino to xfs_dir3_sfe_put_ino
Fix wrong flag ASSERT in xfs_attr_shortform_getvalue
xfs: finish removing IOP_* macros.
xfs: inode log reservations are too small
xfs: check correct status variable for xfs_inobt_get_rec() call
xfs: inode buffers may not be valid during recovery readahead
xfs: check LSN ordering for v5 superblocks during recovery
xfs: btree block LSN escaping to disk uninitialised
XFS: Assertion failed: first <= last && last < BBTOB(bp->b_length), file: fs/xfs/xfs_trans_buf.c, line: 568
xfs: fix bad dquot buffer size in log recovery readahead
xfs: don't account buffer cancellation during log recovery readahead
xfs: check for underflow in xfs_iformat_fork()
xfs: xfs_dir3_sfe_put_ino can be static
xfs: introduce object readahead to log recovery
xfs: Simplify xfs_ail_min() with list_first_entry_or_null()
xfs: Register hotcpu notifier after initialization
xfs: add xfs sb v4 support for dirent filetype field
xfs: Add write support for dirent filetype field
xfs: Add read-only support for dirent filetype field
...
kernel/cgroup.c is the only place in the tree that relies on eventfd.h
pulling file.h; move that include there. Switch from eventfd_fget()/fput()
to fdget()/fdput(), while we are at it - eventfd_ctx_fileget() will fail
on non-eventfd descriptors just fine, no need to do that check twice...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull namespace changes from Eric Biederman:
"This is an assorted mishmash of small cleanups, enhancements and bug
fixes.
The major theme is user namespace mount restrictions. nsown_capable
is killed as it encourages not thinking about details that need to be
considered. A very hard to hit pid namespace exiting bug was finally
tracked and fixed. A couple of cleanups to the basic namespace
infrastructure.
Finally there is an enhancement that makes per user namespace
capabilities usable as capabilities, and an enhancement that allows
the per userns root to nice other processes in the user namespace"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
userns: Kill nsown_capable it makes the wrong thing easy
capabilities: allow nice if we are privileged
pidns: Don't have unshare(CLONE_NEWPID) imply CLONE_THREAD
userns: Allow PR_CAPBSET_DROP in a user namespace.
namespaces: Simplify copy_namespaces so it is clear what is going on.
pidns: Fix hang in zap_pid_ns_processes by sending a potentially extra wakeup
sysfs: Restrict mounting sysfs
userns: Better restrictions on when proc and sysfs can be mounted
vfs: Don't copy mount bind mounts of /proc/<pid>/ns/mnt between namespaces
kernel/nsproxy.c: Improving a snippet of code.
proc: Restrict mounting the proc filesystem
vfs: Lock in place mounts from more privileged users
Pull crypto update from Herbert Xu:
"Here is the crypto update for 3.12:
- Added MODULE_SOFTDEP to allow pre-loading of modules.
- Reinstated crct10dif driver using the module softdep feature.
- Allow via rng driver to be auto-loaded.
- Split large input data when necessary in nx.
- Handle zero length messages correctly for GCM/XCBC in nx.
- Handle SHA-2 chunks bigger than block size properly in nx.
- Handle unaligned lengths in omap-aes.
- Added SHA384/SHA512 to omap-sham.
- Added OMAP5/AM43XX SHAM support.
- Added OMAP4 TRNG support.
- Misc fixes"
* git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (66 commits)
Reinstate "crypto: crct10dif - Wrap crc_t10dif function all to use crypto transform framework"
hwrng: via - Add MODULE_DEVICE_TABLE
crypto: fcrypt - Fix bitoperation for compilation with clang
crypto: nx - fix SHA-2 for chunks bigger than block size
crypto: nx - fix GCM for zero length messages
crypto: nx - fix XCBC for zero length messages
crypto: nx - fix limits to sg lists for AES-CCM
crypto: nx - fix limits to sg lists for AES-XCBC
crypto: nx - fix limits to sg lists for AES-GCM
crypto: nx - fix limits to sg lists for AES-CTR
crypto: nx - fix limits to sg lists for AES-CBC
crypto: nx - fix limits to sg lists for AES-ECB
crypto: nx - add offset to nx_build_sg_lists()
padata - Register hotcpu notifier after initialization
padata - share code between CPU_ONLINE and CPU_DOWN_FAILED, same to CPU_DOWN_PREPARE and CPU_UP_CANCELED
hwrng: omap - reorder OMAP TRNG driver code
crypto: omap-sham - correct dma burst size
crypto: omap-sham - Enable Polling mode if DMA fails
crypto: tegra-aes - bitwise vs logical and
crypto: sahara - checking the wrong variable
...
Pull trivial tree from Jiri Kosina:
"The usual trivial updates all over the tree -- mostly typo fixes and
documentation updates"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (52 commits)
doc: Documentation/cputopology.txt fix typo
treewide: Convert retrun typos to return
Fix comment typo for init_cma_reserved_pageblock
Documentation/trace: Correcting and extending tracepoint documentation
mm/hotplug: fix a typo in Documentation/memory-hotplug.txt
power: Documentation: Update s2ram link
doc: fix a typo in Documentation/00-INDEX
Documentation/printk-formats.txt: No casts needed for u64/s64
doc: Fix typo "is is" in Documentations
treewide: Fix printks with 0x%#
zram: doc fixes
Documentation/kmemcheck: update kmemcheck documentation
doc: documentation/hwspinlock.txt fix typo
PM / Hibernate: add section for resume options
doc: filesystems : Fix typo in Documentations/filesystems
scsi/megaraid fixed several typos in comments
ppc: init_32: Fix error typo "CONFIG_START_KERNEL"
treewide: Add __GFP_NOWARN to k.alloc calls with v.alloc fallbacks
page_isolation: Fix a comment typo in test_pages_isolated()
doc: fix a typo about irq affinity
...
Pull cputime fix from Ingo Molnar:
"This fixes a longer-standing cputime accounting bug that Stanislaw
Gruszka finally managed to track down"
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/cputime: Do not scale when utime == 0
Pull vfs pile 1 from Al Viro:
"Unfortunately, this merge window it'll have to be a lot of small piles -
my fault, actually, for not keeping #for-next in anything that would
resemble a sane shape ;-/
This pile: assorted fixes (the first 3 are -stable fodder, IMO) and
cleanups + %pd/%pD formats (dentry/file pathname, up to 4 last
components) + several long-standing patches from various folks.
There definitely will be a lot more (starting with Miklos'
check_submount_and_drop() series)"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (26 commits)
direct-io: Handle O_(D)SYNC AIO
direct-io: Implement generic deferred AIO completions
add formats for dentry/file pathnames
kvm eventfd: switch to fdget
powerpc kvm: use fdget
switch fchmod() to fdget
switch epoll_ctl() to fdget
switch copy_module_from_fd() to fdget
git simplify nilfs check for busy subtree
ibmasmfs: don't bother passing superblock when not needed
don't pass superblock to hypfs_{mkdir,create*}
don't pass superblock to hypfs_diag_create_files
don't pass superblock to hypfs_vm_create_files()
oprofile: get rid of pointless forward declarations of struct super_block
oprofilefs_create_...() do not need superblock argument
oprofilefs_mkdir() doesn't need superblock argument
don't bother with passing superblock to oprofile_create_stats_files()
oprofile: don't bother with passing superblock to ->create_files()
don't bother passing sb to oprofile_create_files()
coh901318: don't open-code simple_read_from_buffer()
...
The function debug_lockdep_rcu_enabled() is part of the RCU lockdep
debugging, and is called very frequently. I found that if I enable
a lot of debugging and run the function graph tracer, this
function can cause a live lock of the system.
We don't usually trace lockdep infrastructure, no need to trace
this either.
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Pull KVM updates from Gleb Natapov:
"The highlights of the release are nested EPT and pv-ticketlocks
support (hypervisor part, guest part, which is most of the code, goes
through tip tree). Apart from that there are many fixes for all arches"
Fix up semantic conflicts as discussed in the pull request thread.
* 'next' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (88 commits)
ARM: KVM: Add newlines to panic strings
ARM: KVM: Work around older compiler bug
ARM: KVM: Simplify tracepoint text
ARM: KVM: Fix kvm_set_pte assignment
ARM: KVM: vgic: Bump VGIC_NR_IRQS to 256
ARM: KVM: Bugfix: vgic_bytemap_get_reg per cpu regs
ARM: KVM: vgic: fix GICD_ICFGRn access
ARM: KVM: vgic: simplify vgic_get_target_reg
KVM: MMU: remove unused parameter
KVM: PPC: Book3S PR: Rework kvmppc_mmu_book3s_64_xlate()
KVM: PPC: Book3S PR: Make instruction fetch fallback work for system calls
KVM: PPC: Book3S PR: Don't corrupt guest state when kernel uses VMX
KVM: x86: update masterclock when kvmclock_offset is calculated (v2)
KVM: PPC: Book3S: Fix compile error in XICS emulation
KVM: PPC: Book3S PR: return appropriate error when allocation fails
arch: powerpc: kvm: add signed type cast for comparation
KVM: x86: add comments where MMIO does not return to the emulator
KVM: vmx: count exits to userspace during invalid guest emulation
KVM: rename __kvm_io_bus_sort_cmp to kvm_io_bus_cmp
kvm: optimize away THP checks in kvm_is_mmio_pfn()
...
Merge tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux
Pull module updates from Rusty Russell:
"Minor fixes mainly, including a potential use-after-free on remove
found by CONFIG_DEBUG_KOBJECT_RELEASE which may be theoretical"
* tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux:
module: Fix mod->mkobj.kobj potentially freed too early
kernel/params.c: use scnprintf() instead of sprintf()
kernel/module.c: use scnprintf() instead of sprintf()
module/lsm: Have apparmor module parameters work with no args
module: Add NOARG flag for ops with param_set_bool_enable_only() set function
module: Add flag to allow mod params to have no arguments
modules: add support for soft module dependencies
scripts/mod/modpost.c: permit '.cranges' secton for sh64 architecture.
module: fix sprintf format specifier in param_get_byte()
Pull x86 spinlock changes from Ingo Molnar:
"The biggest change here is paravirtualized ticket spinlocks (PV
spinlocks), which bring a nice speedup on various benchmarks.
The KVM host side will come to you via the KVM tree"
* 'x86-spinlocks-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/kvm/guest: Fix sparse warning: "symbol 'klock_waiting' was not declared as static"
kvm: Paravirtual ticketlocks support for linux guests running on KVM hypervisor
kvm guest: Add configuration support to enable debug information for KVM Guests
kvm uapi: Add KICK_CPU and PV_UNHALT definition to uapi
xen, pvticketlock: Allow interrupts to be enabled while blocking
x86, ticketlock: Add slowpath logic
jump_label: Split jumplabel ratelimit
x86, pvticketlock: When paravirtualizing ticket locks, increment by 2
x86, pvticketlock: Use callee-save for lock_spinning
xen, pvticketlocks: Add xen_nopvspin parameter to disable xen pv ticketlocks
xen, pvticketlock: Xen implementation for PV ticket locks
xen: Defer spinlock setup until boot CPU setup
x86, ticketlock: Collapse a layer of functions
x86, ticketlock: Don't inline _spin_unlock when using paravirt spinlocks
x86, spinlock: Replace pv spinlocks with pv ticketlocks
Pull timers/nohz changes from Ingo Molnar:
"It mostly contains fixes and full dynticks off-case optimizations, by
Frederic Weisbecker"
* 'timers-nohz-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
nohz: Include local CPU in full dynticks global kick
nohz: Optimize full dynticks's sched hooks with static keys
nohz: Optimize full dynticks state checks with static keys
nohz: Rename a few state variables
vtime: Always debug check snapshot source _before_ updating it
vtime: Always scale generic vtime accounting results
vtime: Optimize full dynticks accounting off case with static keys
vtime: Describe overriden functions in dedicated arch headers
m68k: hardirq_count() only need preempt_mask.h
hardirq: Split preempt count mask definitions
context_tracking: Split low level state headers
vtime: Fix racy cputime delta update
vtime: Remove a few unneeded generic vtime state checks
context_tracking: User/kernel broundary cross trace events
context_tracking: Optimize context switch off case with static keys
context_tracking: Optimize guest APIs off case with static key
context_tracking: Optimize main APIs off case with static key
context_tracking: Ground setup for static key use
context_tracking: Remove full dynticks' hacky dependency on wide context tracking
nohz: Only enable context tracking on full dynticks CPUs
...
Pull x86/asmlinkage changes from Ingo Molnar:
"As a preparation for Andi Kleen's LTO patchset (link time
optimizations using GCC's -flto, a build time optimization that has
steadily increased in quality over the past few years and might
eventually be usable for the kernel too) this tree includes a handful
of preparatory patches that make function calling convention
annotations consistent again:
- Mark every function without arguments (or 64bit only) that is used
by assembly code with asmlinkage()
- Mark every function with parameters or variables that is used by
assembly code as __visible.
For the vanilla kernel this has documentation, consistency and
debuggability advantages, for the time being"
* 'x86-asmlinkage-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/asmlinkage: Fix warning in xen asmlinkage change
x86, asmlinkage, vdso: Mark vdso variables __visible
x86, asmlinkage, power: Make various symbols used by the suspend asm code visible
x86, asmlinkage: Make dump_stack visible
x86, asmlinkage: Make 64bit checksum functions visible
x86, asmlinkage, paravirt: Add __visible/asmlinkage to xen paravirt ops
x86, asmlinkage, apm: Make APM data structure used from assembler visible
x86, asmlinkage: Make syscall tables visible
x86, asmlinkage: Make several variables used from assembler/linker script visible
x86, asmlinkage: Make kprobes code visible and fix assembler code
x86, asmlinkage: Make various syscalls asmlinkage
x86, asmlinkage: Make 32bit/64bit __switch_to visible
x86, asmlinkage: Make _*_start_kernel visible
x86, asmlinkage: Make all interrupt handlers asmlinkage / __visible
x86, asmlinkage: Change dotraplinkage into __visible on 32bit
x86: Fix sys_call_table type in asm/syscall.h
Pull scheduler changes from Ingo Molnar:
"Various optimizations, cleanups and smaller fixes - no major changes
in scheduler behavior"
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/fair: Fix the sd_parent_degenerate() code
sched/fair: Rework and comment the group_imb code
sched/fair: Optimize find_busiest_queue()
sched/fair: Make group power more consistent
sched/fair: Remove duplicate load_per_task computations
sched/fair: Shrink sg_lb_stats and play memset games
sched: Clean-up struct sd_lb_stat
sched: Factor out code to should_we_balance()
sched: Remove one division operation in find_busiest_queue()
sched/cputime: Use this_cpu_add() in task_group_account_field()
cpumask: Fix cpumask leak in partition_sched_domains()
sched/x86: Optimize switch_mm() for multi-threaded workloads
generic-ipi: Kill unnecessary variable - csd_flags
numa: Mark __node_set() as __always_inline
sched/fair: Cleanup: remove duplicate variable declaration
sched/__wake_up_sync_key(): Fix nr_exclusive tasks which lead to WF_SYNC clearing
Pull perf changes from Ingo Molnar:
"As a first remark I'd like to point out that the obsolete '-f'
(--force) option, which has not done anything for several releases,
has been removed from 'perf record' and related utilities. Everyone
please update muscle memory accordingly! :-)
Main changes on the perf kernel side:
- Performance optimizations:
. for trace events, by Steve Rostedt.
. for time values, by Peter Zijlstra
- New hardware support:
. for Intel Silvermont (22nm Atom) CPUs, by Zheng Yan
. for Intel SNB-EP uncore PMUs, by Zheng Yan
- Enhanced hardware support:
. for Intel uncore PMUs: add filter support for QPI boxes, by Zheng Yan
- Core perf events code enhancements and fixes:
. for full-nohz feature handling, by Frederic Weisbecker
. for group events, by Jiri Olsa
. for call chains, by Frederic Weisbecker
. for event stream parsing, by Adrian Hunter
- New ABI details:
. Add attr->mmap2 attribute, by Stephane Eranian
. Add PERF_EVENT_IOC_ID ioctl to return event ID, by Jiri Olsa
. Export u64 time_zero on the mmap header page to allow TSC
calculation, by Adrian Hunter
. Add dummy software event, by Adrian Hunter.
. Add a new PERF_SAMPLE_IDENTIFIER to make samples always
parseable, by Adrian Hunter.
. Make Power7 events available via sysfs, by Runzhen Wang.
- Code cleanups and refactorings:
. for nohz-full, by Frederic Weisbecker
. for group events, by Jiri Olsa
- Documentation updates:
. for perf_event_type, by Peter Zijlstra
Main changes on the perf tooling side (some of these tooling changes
utilize the above kernel side changes):
- Lots of 'perf trace' enhancements:
. Make 'perf trace' command line arguments consistent with
'perf record', by David Ahern.
. Allow specifying syscalls a la strace, by Arnaldo Carvalho de Melo.
. Add --verbose and -o/--output options, by Arnaldo Carvalho de Melo.
. Support ! in -e expressions, to filter a list of syscalls,
by Arnaldo Carvalho de Melo.
. Arg formatting improvements to allow masking arguments in
syscalls such as futex and open, where some arguments are
ignored and thus should not be printed depending on other args,
by Arnaldo Carvalho de Melo.
. Beautify open, openat, open_by_handle_at, lseek and futex
syscalls, by Arnaldo Carvalho de Melo.
. Add option to analyze events in a file versus live, so that
one can do:
[root@zoo ~]# perf record -a -e raw_syscalls:* sleep 1
[ perf record: Woken up 0 times to write data ]
[ perf record: Captured and wrote 25.150 MB perf.data (~1098836 samples) ]
[root@zoo ~]# perf trace -i perf.data -e futex --duration 1
17.799 ( 1.020 ms): 7127 futex(uaddr: 0x7fff3f6c6674, op: 393, val: 1, utime: 0x7fff3f6c6470, ua
113.344 (95.429 ms): 7127 futex(uaddr: 0x7fff3f6c6674, op: 393, val: 1, utime: 0x7fff3f6c6470, uaddr2: 0x7fff3f6c6648, val3: 4294967
133.778 ( 1.042 ms): 18004 futex(uaddr: 0x7fff3f6c6674, op: 393, val: 1, utime: 0x7fff3f6c6470, uaddr2: 0x7fff3f6c6648, val3: 429496
[root@zoo ~]#
By David Ahern.
. Honor target pid / tid options when analyzing a file, by David Ahern.
. Introduce better formatting of syscall arguments, including so
far beautifiers for mmap, madvise, syscall return values,
by Arnaldo Carvalho de Melo.
. Handle HUGEPAGE defines in the mmap beautifier, by David Ahern.
- 'perf report/top' enhancements:
. Do annotation using /proc/kcore and /proc/kallsyms when
available, removing the forced need for a vmlinux file for kernel
assembly annotation. This also improves this use case because
vmlinux has just the initial kernel image, not what is actually
in use after various code patchings by things like alternatives.
By Adrian Hunter.
. Add --ignore-callees=<regex> option to collapse undesired parts
of call graphs, by Greg Price.
. Simplify symbol filtering by doing it at machine class level,
by Adrian Hunter.
. Add support for callchains in the gtk UI, by Namhyung Kim.
. Add --objdump option to 'perf top', by Sukadev Bhattiprolu.
- 'perf kvm' enhancements:
. Add option to print only events that exceed a specified time
duration, by David Ahern.
. Improve stack trace printing, by David Ahern.
. Update documentation of the live command, by David Ahern
. Add perf kvm stat live mode that combines aspects of 'perf kvm
stat' record and report, by David Ahern.
. Add option to analyze specific VM in perf kvm stat report, by
David Ahern.
. Do not require /lib/modules/* on a guest, by Jason Wessel.
- 'perf script' enhancements:
. Fix symbol offset computation for some dsos, by David Ahern.
. Fix named threads support, by David Ahern.
. Don't install scripting files when perl/python support
is disabled, by Arnaldo Carvalho de Melo.
- 'perf test' enhancements:
. Add various improvements and fixes to the "vmlinux matches
kallsyms" 'perf test' entry, related to the /proc/kcore
annotation feature. By Adrian Hunter.
. Add sample parsing test, by Adrian Hunter.
. Add test for reading object code, by Adrian Hunter.
. Add attr record group sampling test, by Jiri Olsa.
. Misc testing infrastructure improvements and other details,
by Jiri Olsa.
- 'perf list' enhancements:
. Skip unsupported hardware events, by Namhyung Kim.
. List pmu events, by Andi Kleen.
- 'perf diff' enhancements:
. Add support for more than two files comparison, by Jiri Olsa.
- 'perf sched' enhancements:
. Various improvements, including removing reliance on some
scheduler tracepoints that provide the same information as the
PERF_RECORD_{FORK,EXIT} events. By David Ahern.
. Remove odd build stall by moving a large struct initialization
from a local variable to a global one, by Namhyung Kim.
- 'perf stat' enhancements:
. Add --initial-delay option to skip measuring for a defined
startup phase, by Andi Kleen.
- Generic perf tooling infrastructure/plumbing changes:
. Tidy up sample parsing validation, by Adrian Hunter.
. Fix up jobserver setup in libtraceevent Makefile,
by Arnaldo Carvalho de Melo.
. Debug improvements, by Adrian Hunter.
. Fix correlation of samples coming after PERF_RECORD_EXIT event,
by David Ahern.
. Improve robustness of the topology parsing code,
by Stephane Eranian.
. Add group leader sampling, which allows just one event in a group
to sample while the other events only have their values read,
by Jiri Olsa.
. Add support for a new modifier "D", which requests that the
event, or group of events, be pinned to the PMU.
By Michael Ellerman.
. Support callchain sorting based on addresses, by Andi Kleen.
. Prep work for multi perf data file storage, by Jiri Olsa.
. libtraceevent cleanups, by Namhyung Kim.
And lots and lots of other fixes and code reorganizations that did not
make it into the list, see the shortlog, diffstat and the Git log for
details!"
[ Also merge a leftover from the 3.11 cycle ]
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf: Prevent race in unthrottling code
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (237 commits)
perf trace: Tell arg formatters the arg index
perf trace: Add beautifier for open's flags arg
perf trace: Add beautifier for lseek's whence arg
perf tools: Fix symbol offset computation for some dsos
perf list: Skip unsupported events
perf tests: Add 'keep tracking' test
perf tools: Add support for PERF_COUNT_SW_DUMMY
perf: Add a dummy software event to keep tracking
perf trace: Add beautifier for futex 'operation' parm
perf trace: Allow syscall arg formatters to mask args
perf: Convert kmalloc_node(...GFP_ZERO...) to kzalloc_node()
perf: Export struct perf_branch_entry to userspace
perf: Add attr->mmap2 attribute to an event
perf/x86: Add Silvermont (22nm Atom) support
perf/x86: use INTEL_UEVENT_EXTRA_REG to define MSR_OFFCORE_RSP_X
perf trace: Handle missing HUGEPAGE defines
perf trace: Honor target pid / tid options when analyzing a file
perf trace: Add option to analyze events in a file versus live
perf evlist: Add tracepoint lookup by name
perf tests: Add a sample parsing test
...
Pull core/locking changes from Ingo Molnar:
"Main changes:
- another mutex optimization, from Davidlohr Bueso
- improved lglock lockdep tracking, from Michel Lespinasse
- [ assorted smaller updates, improvements, cleanups. ]"
* 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
generic-ipi/locking: Fix misleading smp_call_function_any() description
hung_task debugging: Print more info when reporting the problem
mutex: Avoid label warning when !CONFIG_MUTEX_SPIN_ON_OWNER
mutex: Do not unnecessarily deal with waiters
mutex: Fix/document access-once assumption in mutex_can_spin_on_owner()
lglock: Update lockdep annotations to report recursive local locks
lockdep: Introduce lock_acquire_exclusive()/shared() helper macros
Pull RCU updates from Ingo Molnar:
"Main RCU changes this cycle were:
- Full-system idle detection. This is for use by Frederic
Weisbecker's adaptive-ticks mechanism. Its purpose is to allow the
timekeeping CPU to shut off its tick when all other CPUs are idle.
- Miscellaneous fixes.
- Improved rcutorture test coverage.
- Updated RCU documentation"
* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (30 commits)
nohz_full: Force RCU's grace-period kthreads onto timekeeping CPU
nohz_full: Add full-system-idle state machine
jiffies: Avoid undefined behavior from signed overflow
rcu: Simplify _rcu_barrier() processing
rcu: Make rcutorture emit online failures if verbose
rcu: Remove unused variable from rcu_torture_writer()
rcu: Sort rcutorture module parameters
rcu: Increase rcutorture test coverage
rcu: Add duplicate-callback tests to rcutorture
doc: Fix memory-barrier control-dependency example
rcu: Update RTFP documentation
nohz_full: Add full-system-idle arguments to API
nohz_full: Add full-system idle states and variables
nohz_full: Add per-CPU idle-state tracking
nohz_full: Add rcu_dyntick data for scalable detection of all-idle state
nohz_full: Add Kconfig parameter for scalable detection of all-idle state
nohz_full: Add testing information to documentation
rcu: Eliminate unused APIs intended for adaptive ticks
rcu: Select IRQ_WORK from TREE_PREEMPT_RCU
rculist: list_first_or_null_rcu() should use list_entry_rcu()
...
scale_stime() silently assumes that stime < rtime. Otherwise,
when stime == rtime and both values are big enough (operations
on them do not fit in 32 bits), the resulting scaled stime can
be bigger than rtime. As a consequence, utime = rtime - stime
results in a negative value.
User space visible symptoms of the bug are overflowed TIME
values on ps/top, for example:
$ ps aux | grep rcu
root 8 0.0 0.0 0 0 ? S 12:42 0:00 [rcuc/0]
root 9 0.0 0.0 0 0 ? S 12:42 0:00 [rcub/0]
root 10 62422329 0.0 0 0 ? R 12:42 21114581:37 [rcu_preempt]
root 11 0.1 0.0 0 0 ? S 12:42 0:02 [rcuop/0]
root 12 62422329 0.0 0 0 ? S 12:42 21114581:35 [rcuop/1]
root 10 62422329 0.0 0 0 ? R 12:42 21114581:37 [rcu_preempt]
or overflowed utime values read directly from /proc/$PID/stat
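The failure mode can be illustrated with a minimal Python model of the
unsigned arithmetic. This is only a sketch under the assumption that the
cputime values behave as unsigned 64-bit integers; utime_from() and
utime_clamped() are hypothetical helper names, and the clamp is shown for
illustration only, not as the change made by this patch.

```python
U64_MASK = (1 << 64) - 1  # assumption: cputime values are unsigned 64-bit


def utime_from(rtime, scaled_stime):
    """Model 'utime = rtime - stime' with u64 wraparound."""
    return (rtime - scaled_stime) & U64_MASK


def utime_clamped(rtime, scaled_stime):
    """Illustrative guard: never let stime exceed rtime before subtracting."""
    return rtime - min(scaled_stime, rtime)
```

When the scaled stime exceeds rtime by even one unit, the modelled utime
wraps to a value close to 2^64, which matches the kind of huge TIME values
shown in the ps output above.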
Reference:
https://lkml.org/lkml/2013/8/20/259
Reported-and-tested-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: stable@vger.kernel.org
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Borislav Petkov <bp@alien8.de>
Link: http://lkml.kernel.org/r/20130904131602.GC2564@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull cgroup updates from Tejun Heo:
"A lot of activities on the cgroup front. Most changes aren't visible
to userland at all at this point and are laying foundation for the
planned unified hierarchy.
- The biggest change is decoupling the lifetime management of css
(cgroup_subsys_state) from that of cgroup's. Because controllers
(cpu, memory, block and so on) will need to be dynamically enabled
and disabled, css which is the association point between a cgroup
and a controller may come and go dynamically across the lifetime of
a cgroup. Till now, css's were created when the associated cgroup
was created and stayed till the cgroup got destroyed.
Assumptions around this tight coupling permeated through cgroup
core and controllers. These assumptions are gradually removed,
which constitutes the bulk of the patches, and the css destruction path
is completely decoupled from the cgroup destruction path. Note that
decoupling of creation path is relatively easy on top of these
changes and the patchset is pending for the next window.
- cgroup has its own event mechanism cgroup.event_control, which is
only used by memcg. It is overly complex trying to achieve high
flexibility whose benefits seem dubious at best. Going forward,
new events will simply generate file modified event and the
existing mechanism is being made specific to memcg. This pull
request contains preparatory patches for such a change.
- Various fixes and cleanups"
Fixed up conflict in kernel/cgroup.c as per Tejun.
* 'for-3.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (69 commits)
cgroup: fix cgroup_css() invocation in css_from_id()
cgroup: make cgroup_write_event_control() use css_from_dir() instead of __d_cgrp()
cgroup: make cgroup_event hold onto cgroup_subsys_state instead of cgroup
cgroup: implement CFTYPE_NO_PREFIX
cgroup: make cgroup_css() take cgroup_subsys * instead and allow NULL subsys
cgroup: rename cgroup_css_from_dir() to css_from_dir() and update its syntax
cgroup: fix cgroup_write_event_control()
cgroup: fix subsystem file accesses on the root cgroup
cgroup: change cgroup_from_id() to css_from_id()
cgroup: use css_get() in cgroup_create() to check CSS_ROOT
cpuset: remove an unncessary forward declaration
cgroup: RCU protect each cgroup_subsys_state release
cgroup: move subsys file removal to kill_css()
cgroup: factor out kill_css()
cgroup: decouple cgroup_subsys_state destruction from cgroup destruction
cgroup: replace cgroup->css_kill_cnt with ->nr_css
cgroup: bounce cgroup_subsys_state ref kill confirmation to a work item
cgroup: move cgroup->subsys[] assignment to online_css()
cgroup: reorganize css init / exit paths
cgroup: add __rcu modifier to cgroup->subsys[]
...
Pull workqueue updates from Tejun Heo:
"Nothing interesting. All are doc / comment updates"
* 'for-3.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: Correct/Drop references to gcwq in Documentation
workqueue: Fix manage_workers() RETURNS description
workqueue: Comment correction in file header
workqueue: mark WQ_NON_REENTRANT deprecated
There's a slight race when going from a list function to a non list
function. That is, when only one callback is registered to the function
tracer, it gets called directly by the mcount trampoline. But if this
function has filters, it may be called by the wrong functions.
Because the list ops callback handles multiple callbacks that are
registered to ftrace, it also checks which functions each of them should
be called for. While
the transaction is taking place, use the list function always, and
after all the updates are finished (only the functions that should be
traced are being traced), then we can update the trampoline to call
the function directly.
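The difference between the two dispatch paths can be sketched in Python.
This is a toy model, not ftrace code: the Ops class, list_func() and
direct_call() are hypothetical names, and filters are modelled as plain
sets of function names.

```python
class Ops:
    """A registered callback with an optional function filter."""

    def __init__(self, name, filter_ips=None):
        self.name = name
        self.filter_ips = filter_ips  # None means "trace every function"
        self.hits = []

    def wants(self, ip):
        return self.filter_ips is None or ip in self.filter_ips

    def callback(self, ip):
        self.hits.append(ip)


def list_func(ops_list, ip):
    """List function: consult each ops' filter before calling it."""
    for ops in ops_list:
        if ops.wants(ip):
            ops.callback(ip)


def direct_call(ops, ip):
    """Direct trampoline call: the filter is not checked here."""
    ops.callback(ip)
```

In this model, a direct call from a function that is still patched but no
longer wanted records a bogus hit, while the list function drops it. That
is why the list function is the safe choice while call sites are being
updated.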
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Merge tag 'pm+acpi-3.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull ACPI and power management updates from Rafael Wysocki:
1) ACPI-based PCI hotplug (ACPIPHP) subsystem rework and introduction
of Intel Thunderbolt support on systems that use ACPI for signalling
Thunderbolt hotplug events. This also should make ACPIPHP work in
some cases in which it was known to have problems. From
Rafael J Wysocki, Mika Westerberg and Kirill A Shutemov.
2) ACPI core code cleanups and dock station support cleanups from
Jiang Liu and Rafael J Wysocki.
3) Fixes for locking problems related to ACPI device hotplug from
Rafael J Wysocki.
4) ACPICA update to version 20130725 including fixes, cleanups, support
for more than 256 GPEs per GPE block and a change to make the ACPI
PM Timer optional (we've seen systems without the PM Timer in the
field already). One of the fixes, related to the DeRefOf operator,
is necessary to prevent some Windows 8 oriented AML from causing
problems. From Bob Moore, Lv Zheng, and Jung-uk Kim.
5) Removal of the old and long deprecated /proc/acpi/event interface
and related driver changes from Thomas Renninger.
6) ACPI and Xen changes to make the reduced hardware sleep work with
the latter from Ben Guthro.
7) ACPI video driver cleanups and a blacklist of systems that should
not tell the BIOS that they are compatible with Windows 8 (or ACPI
backlight and possibly other things will not work on them). From
Felipe Contreras.
8) Assorted ACPI fixes and cleanups from Aaron Lu, Hanjun Guo,
Kuppuswamy Sathyanarayanan, Lan Tianyu, Sachin Kamat, Tang Chen,
Toshi Kani, and Wei Yongjun.
9) cpufreq ondemand governor target frequency selection change to
reduce oscillations between min and max frequencies (essentially,
it causes the governor to choose target frequencies proportional
to load) from Stratos Karafotis.
10) cpufreq fixes allowing sysfs attributes file permissions to be
preserved over suspend/resume cycles from Srivatsa S Bhat.
11) Removal of Device Tree parsing for CPU device nodes from multiple
cpufreq drivers that required some changes related to
of_get_cpu_node() to be made in a few architectures and in the
driver core. From Sudeep KarkadaNagesha.
12) cpufreq core fixes and cleanups related to mutual exclusion and
driver module references from Viresh Kumar, Lukasz Majewski and
Rafael J Wysocki.
13) Assorted cpufreq fixes and cleanups from Amit Daniel Kachhap,
Bartlomiej Zolnierkiewicz, Hanjun Guo, Jingoo Han, Joseph Lo,
Julia Lawall, Li Zhong, Mark Brown, Sascha Hauer, Stephen Boyd,
Stratos Karafotis, and Viresh Kumar.
14) Fixes to prevent race conditions in coupled cpuidle from
Colin Cross.
15) cpuidle core fixes and cleanups from Daniel Lezcano and
Tuukka Tikkanen.
16) Assorted cpuidle fixes and cleanups from Daniel Lezcano,
Geert Uytterhoeven, Jingoo Han, Julia Lawall, Linus Walleij,
and Sahara.
17) System sleep tracing changes from Todd E Brandt and Shuah Khan.
18) PNP subsystem conversion to using struct dev_pm_ops for power
management from Shuah Khan.
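As an illustration of item 9 above (target frequencies proportional to
load), here is a minimal Python sketch. The function name, the kHz units
and the clamping to a minimum frequency are assumptions made for the
example, not details taken from the cpufreq patches themselves.

```python
def ondemand_target_khz(load_percent, max_freq_khz, min_freq_khz):
    """Pick a target frequency proportional to load, kept in the policy range."""
    load_percent = max(0, min(100, load_percent))
    target = load_percent * max_freq_khz // 100
    return max(min_freq_khz, min(max_freq_khz, target))
```

With a proportional rule like this, a moderate load maps to a moderate
frequency instead of bouncing between the minimum and maximum.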
* tag 'pm+acpi-3.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (217 commits)
cpufreq: Don't use smp_processor_id() in preemptible context
cpuidle: coupled: fix race condition between pokes and safe state
cpuidle: coupled: abort idle if pokes are pending
cpuidle: coupled: disable interrupts after entering safe state
ACPI / hotplug: Remove containers synchronously
driver core / ACPI: Avoid device hot remove locking issues
cpufreq: governor: Fix typos in comments
cpufreq: governors: Remove duplicate check of target freq in supported range
cpufreq: Fix timer/workqueue corruption due to double queueing
ACPI / EC: Add ASUSTEK L4R to quirk list in order to validate ECDT
ACPI / thermal: Add check of "_TZD" availability and evaluating result
cpufreq: imx6q: Fix clock enable balance
ACPI: blacklist win8 OSI for buggy laptops
cpufreq: tegra: fix the wrong clock name
cpuidle: Change struct menu_device field types
cpuidle: Add a comment warning about possible overflow
cpuidle: Fix variable domains in get_typical_interval()
cpuidle: Fix menu_device->intervals type
cpuidle: CodingStyle: Break up multiple assignments on single line
cpuidle: Check called function parameter in get_typical_interval()
...
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Merge tag 'tty-3.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
Pull tty/serial driver patches from Greg KH:
"Here's the big tty/serial driver pull request for 3.12-rc1.
Lots of n_tty reworks to resolve some very long-standing issues,
removing the 3-4 different locks that were taken for every character.
This code has been beaten on for a long time in linux-next with no
reported regressions.
Other than that, a range of serial and tty driver updates and
revisions. Full details in the shortlog"
* tag 'tty-3.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (226 commits)
hvc_xen: Remove unnecessary __GFP_ZERO from kzalloc
serial: imx: initialize the local variable
tty: ar933x_uart: add device tree support and binding documentation
tty: ar933x_uart: allow to build the driver as a module
ARM: dts: msm: Update uartdm compatible strings
devicetree: serial: Document msm_serial bindings
serial: unify serial bindings into a single dir
serial: fsl-imx-uart: Cleanup duplicate device tree binding
tty: ar933x_uart: use config_enabled() macro to clean up ifdefs
tty: ar933x_uart: remove superfluous assignment of ar933x_uart_driver.nr
tty: ar933x_uart: use the clk API to get the uart clock
tty: serial: cpm_uart: Adding proper request of GPIO used by cpm_uart driver
serial: sirf: fix the amount of serial ports
serial: sirf: define macro for some magic numbers of USP
serial: icom: move array overflow checks earlier
TTY: amiserial, remove unnecessary platform_set_drvdata()
serial: st-asc: remove unnecessary platform_set_drvdata()
msm_serial: Send more than 1 character on the console w/ UARTDM
msm_serial: Add support for non-GSBI UARTDM devices
msm_serial: Switch clock consumer strings and simplify code
...
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Merge tag 'driver-core-3.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core patches from Greg KH:
"Here's the big driver core pull request for 3.12-rc1.
Lots of tiny changes here fixing up the way sysfs attributes are
created, to try to make drivers simpler, and fix a whole class of race
conditions with creations of device attributes after the device was
announced to userspace.
All the various pieces are acked by the different subsystem
maintainers"
* tag 'driver-core-3.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (119 commits)
firmware loader: fix pending_fw_head list corruption
drivers/base/memory.c: introduce help macro to_memory_block
dynamic debug: line queries failing due to uninitialized local variable
sysfs: sysfs_create_groups returns a value.
debugfs: provide debugfs_create_x64() when disabled
rbd: convert bus code to use bus_groups
firmware: dcdbas: use binary attribute groups
sysfs: add sysfs_create/remove_groups for when SYSFS is not enabled
driver core: add #include <linux/sysfs.h> to core files.
HID: convert bus code to use dev_groups
Input: serio: convert bus code to use drv_groups
Input: gameport: convert bus code to use drv_groups
driver core: firmware: use __ATTR_RW()
driver core: core: use DEVICE_ATTR_RO
driver core: bus: use DRIVER_ATTR_WO()
driver core: create write-only attribute macros for devices and drivers
sysfs: create __ATTR_WO()
driver-core: platform: convert bus code to use dev_groups
workqueue: convert bus code to use dev_groups
MEI: convert bus code to use dev_groups
...
Pull RCU updates from Paul E. McKenney:
"
* Update RCU documentation. These were posted to LKML at
https://lkml.org/lkml/2013/8/19/611.
* Miscellaneous fixes. These were posted to LKML at
https://lkml.org/lkml/2013/8/19/619.
* Full-system idle detection. This is for use by Frederic
Weisbecker's adaptive-ticks mechanism. Its purpose is
to allow the timekeeping CPU to shut off its tick when
all other CPUs are idle. These were posted to LKML at
https://lkml.org/lkml/2013/8/19/648.
* Improve rcutorture test coverage. These were posted to LKML at
https://lkml.org/lkml/2013/8/19/675.
"
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Adds a new PERF_RECORD_MMAP2 record type which is in essence
an expanded version of PERF_RECORD_MMAP.
Used to request mmap records with more information about
the mapping, including device major, minor and the inode
number and generation for mappings associated with files
or shared memory segments. Works for code and data
(with attr->mmap_data set).
Existing PERF_RECORD_MMAP record is unmodified by this patch.
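A consumer could decode the fixed-size part of such a record roughly as
sketched below in Python. The field order and widths used here (header,
pid, tid, addr, len, pgoff, maj, min, ino, ino_generation, followed by a
NUL-terminated filename) are an assumption based on the description above;
check them against include/uapi/linux/perf_event.h before relying on them.
parse_mmap2() is a hypothetical helper name.

```python
import struct

# Assumed layout: header (u32 type, u16 misc, u16 size), u32 pid, u32 tid,
# u64 addr, u64 len, u64 pgoff, u32 maj, u32 min, u64 ino, u64 ino_generation.
MMAP2_FIXED = struct.Struct("<IHHIIQQQIIQQ")


def parse_mmap2(buf):
    """Decode one MMAP2-style record from a bytes object."""
    (rtype, misc, size, pid, tid, addr, length, pgoff,
     maj, minor, ino, ino_gen) = MMAP2_FIXED.unpack_from(buf, 0)
    # The filename follows the fixed part as a NUL-terminated string.
    raw = buf[MMAP2_FIXED.size:size]
    filename = raw.split(b"\0", 1)[0].decode()
    return {
        "type": rtype, "misc": misc, "size": size,
        "pid": pid, "tid": tid,
        "addr": addr, "len": length, "pgoff": pgoff,
        "maj": maj, "min": minor,
        "ino": ino, "ino_generation": ino_gen,
        "filename": filename,
    }
```

The device numbers together with the inode number and generation let a
tool tell apart two mappings that share a path but refer to different
files.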
Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Link: http://lkml.kernel.org/r/1377079825-19057-2-git-send-email-eranian@google.com
[ Added Al to the Cc:. Are the ino, maj/min exports of vma->vm_file OK? ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
I found that on my WSM box I had a redundant domain:
[ 0.949769] CPU0 attaching sched-domain:
[ 0.953765] domain 0: span 0,12 level SIBLING
[ 0.958335] groups: 0 (cpu_power = 587) 12 (cpu_power = 588)
[ 0.964548] domain 1: span 0-5,12-17 level MC
[ 0.969206] groups: 0,12 (cpu_power = 1175) 1,13 (cpu_power = 1176) 2,14 (cpu_power = 1176) 3,15 (cpu_power = 1176) 4,16 (cpu_power = 1176) 5,17 (cpu_power = 1176)
[ 0.984993] domain 2: span 0-5,12-17 level CPU
[ 0.989822] groups: 0-5,12-17 (cpu_power = 7055)
[ 0.995049] domain 3: span 0-23 level NUMA
[ 0.999620] groups: 0-5,12-17 (cpu_power = 7055) 6-11,18-23 (cpu_power = 7056)
Note how domain 2 has only a single group and spans the same CPUs as
domain 1. We should not keep such domains and do in fact have code to
prune these.
It turns out that the 'new' SD_PREFER_SIBLING flag causes this: it
makes sd_parent_degenerate() fail on the CPU domain. We can easily
fix this by 'ignoring' the SD_PREFER_SIBLING bit and transferring it
to whatever domain ends up covering the span.
With this patch the domains now look like this:
[ 0.950419] CPU0 attaching sched-domain:
[ 0.954454] domain 0: span 0,12 level SIBLING
[ 0.959039] groups: 0 (cpu_power = 587) 12 (cpu_power = 588)
[ 0.965271] domain 1: span 0-5,12-17 level MC
[ 0.969936] groups: 0,12 (cpu_power = 1175) 1,13 (cpu_power = 1176) 2,14 (cpu_power = 1176) 3,15 (cpu_power = 1176) 4,16 (cpu_power = 1176) 5,17 (cpu_power = 1176)
[ 0.985737] domain 2: span 0-23 level NUMA
[ 0.990231] groups: 0-5,12-17 (cpu_power = 7055) 6-11,18-23 (cpu_power = 7056)
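The pruning idea can be modelled in a few lines of Python. This is a
simplified sketch, not the scheduler code: the flag values are made up,
parent_degenerate() only compares spans and flags (the real
sd_parent_degenerate() considers more), and prune() is a hypothetical
helper.

```python
SD_LOAD_BALANCE = 0x1     # flag values are illustrative, not the kernel's
SD_PREFER_SIBLING = 0x2


def parent_degenerate(child, parent):
    """Return True if 'parent' adds nothing over 'child' and can be pruned."""
    if child["span"] != parent["span"]:
        return False
    # Ignore SD_PREFER_SIBLING: it alone should not keep a redundant domain.
    extra = parent["flags"] & ~child["flags"] & ~SD_PREFER_SIBLING
    return extra == 0


def prune(domains):
    """Walk bottom-up, dropping degenerate parents and handing their
    SD_PREFER_SIBLING bit to the domain that ends up covering the span."""
    kept = [dict(domains[0])]
    for parent in domains[1:]:
        if parent_degenerate(kept[-1], parent):
            kept[-1]["flags"] |= parent["flags"] & SD_PREFER_SIBLING
        else:
            kept.append(dict(parent))
    return kept
```

Fed the four levels from the first listing (SIBLING, MC, CPU, NUMA), the
model drops the CPU level and leaves the MC level carrying the
SD_PREFER_SIBLING bit, mirroring the second listing.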
Reviewed-by: Paul Turner <pjt@google.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/n/tip-ys201g4jwukj0h8xcamakxq1@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Rik reported some weirdness due to the group_imb code. As a start to
looking at it, clean it up a little and add a few explanatory
comments.
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/n/tip-caeeqttnla4wrrmhp5uf89gp@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Use for_each_cpu_and() and thereby avoid computing the capacity for
CPUs we know we're not interested in.
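The effect of iterating over the intersection of two CPU masks can be
sketched in Python. The generator below only borrows the name of the
kernel macro; representing the masks as plain integers is an assumption
made for this example.

```python
def for_each_cpu_and(mask_a, mask_b):
    """Yield CPU numbers set in both bitmasks, lowest first."""
    both = mask_a & mask_b
    cpu = 0
    while both:
        if both & 1:
            yield cpu
        both >>= 1
        cpu += 1
```

Any per-CPU work placed in the loop body then runs only for CPUs present
in both masks, so nothing is computed for CPUs that would be discarded
anyway.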
Reviewed-by: Paul Turner <pjt@google.com>
Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/n/tip-lppceyv6kb3a19g8spmrn20b@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
For easier access, fewer dereferences and a more consistent value, store
the group power in update_sg_lb_stats() and use it thereafter. The
actual value in sched_group::sched_group_power::power can change
throughout the load-balance pass if we're unlucky.
Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/n/tip-739xxqkyvftrhnh9ncudutc7@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Since we already compute (but don't store) the sgs load_per_task value
in update_sg_lb_stats() we might as well store it and not re-compute
it later on.
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/n/tip-ym1vmljiwbzgdnnrwp9azftq@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We can shrink sg_lb_stats because rq::nr_running is an unsigned int
and cpu numbers are 'int'.
Before:
sgs: /* size: 72, cachelines: 2, members: 10 */
sds: /* size: 184, cachelines: 3, members: 7 */
After:
sgs: /* size: 56, cachelines: 1, members: 10 */
sds: /* size: 152, cachelines: 3, members: 7 */
Further we can avoid clearing all of sds since we do a total
clear/assignment of sg_stats in update_sg_lb_stats() with exception of
busiest_stat.avg_load which is referenced in update_sd_pick_busiest().
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/n/tip-0klzmz9okll8wc0nsudguc9p@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
There is no reason to maintain separate variables for this_group
and busiest_group in sd_lb_stat, except saving some space.
But this structure is always allocated on the stack, so this saving
isn't really beneficial [peterz: reducing stack space is good; in this
case readability increases enough that I think it's still beneficial].
This patch unifies these variables, so IMO readability may be improved.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
[ Rename this to local -- avoids confusion between this_cpu and the C++ this pointer. ]
Reviewed-by: Paul Turner <pjt@google.com>
[ Lots of style edits, a few fixes and a rename. ]
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1375778203-31343-4-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Currently, the check for whether this cpu is appropriate to balance
is embedded in update_sg_lb_stats(), although the check has no direct
relationship to that function. There is no strong reason to place
the check in update_sg_lb_stats(), except to save one iteration
over sched_group_cpus.
In this patch, I factor this check out into a should_we_balance() function.
Before doing the actual load-balancing work, check whether this cpu is
appropriate to balance via should_we_balance(). If this cpu is not
a candidate for balancing, it quits the work immediately.
With this change, we can save the cost of two memsets and can expect better
compiler optimization.
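The control flow described above can be sketched in Python. This is a toy
model rather than the scheduler code: the selection rule used here (the
first idle cpu in the group, else the lowest-numbered one, with a newly
idle cpu always allowed) is an assumption for illustration, and
load_balance() with its do_balance callback is a hypothetical stand-in
for the real work.

```python
def should_we_balance(dst_cpu, group_cpus, idle_cpus, newly_idle=False):
    """Decide whether dst_cpu is the one that should balance its group."""
    if newly_idle:
        return True
    # Assumed rule: prefer the first idle cpu of the group ...
    balance_cpu = next((c for c in sorted(group_cpus) if c in idle_cpus), None)
    if balance_cpu is None:
        # ... otherwise fall back to the lowest-numbered cpu.
        balance_cpu = min(group_cpus)
    return balance_cpu == dst_cpu


def load_balance(dst_cpu, group_cpus, idle_cpus, do_balance):
    """Bail out early, before any statistics are set up, if not a candidate."""
    if not should_we_balance(dst_cpu, group_cpus, idle_cpus):
        return False
    do_balance()
    return True
```

Because the check runs before anything else, a cpu that is not a
candidate returns without initializing or touching the balancing
statistics at all.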
Below is the result of this patch.
* Vanilla *
text data bss dec hex filename
34499 1136 116 35751 8ba7 kernel/sched/fair.o
* Patched *
text data bss dec hex filename
34243 1136 116 35495 8aa7 kernel/sched/fair.o
In addition, rename @balance to @continue_balancing in order to represent
its purpose more clearly.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
[ s/should_balance/continue_balancing/g ]
Reviewed-by: Paul Turner <pjt@google.com>
[ Made style changes and a fix in should_we_balance(). ]
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1375778203-31343-3-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The current throttling code triggers the WARN below with the following
workload (only hit on an AMD machine with 48 CPUs):
# while [ 1 ]; do perf record perf bench sched messaging; done
WARNING: at arch/x86/kernel/cpu/perf_event.c:1054 x86_pmu_start+0xc6/0x100()
SNIP
Call Trace:
<IRQ> [<ffffffff815f62d6>] dump_stack+0x19/0x1b
[<ffffffff8105f531>] warn_slowpath_common+0x61/0x80
[<ffffffff8105f60a>] warn_slowpath_null+0x1a/0x20
[<ffffffff810213a6>] x86_pmu_start+0xc6/0x100
[<ffffffff81129dd2>] perf_adjust_freq_unthr_context.part.75+0x182/0x1a0
[<ffffffff8112a058>] perf_event_task_tick+0xc8/0xf0
[<ffffffff81093221>] scheduler_tick+0xd1/0x140
[<ffffffff81070176>] update_process_times+0x66/0x80
[<ffffffff810b9565>] tick_sched_handle.isra.15+0x25/0x60
[<ffffffff810b95e1>] tick_sched_timer+0x41/0x60
[<ffffffff81087c24>] __run_hrtimer+0x74/0x1d0
[<ffffffff810b95a0>] ? tick_sched_handle.isra.15+0x60/0x60
[<ffffffff81088407>] hrtimer_interrupt+0xf7/0x240
[<ffffffff81606829>] smp_apic_timer_interrupt+0x69/0x9c
[<ffffffff8160569d>] apic_timer_interrupt+0x6d/0x80
<EOI> [<ffffffff81129f74>] ? __perf_event_task_sched_in+0x184/0x1a0
[<ffffffff814dd937>] ? kfree_skbmem+0x37/0x90
[<ffffffff815f2c47>] ? __slab_free+0x1ac/0x30f
[<ffffffff8118143d>] ? kfree+0xfd/0x130
[<ffffffff81181622>] kmem_cache_free+0x1b2/0x1d0
[<ffffffff814dd937>] kfree_skbmem+0x37/0x90
[<ffffffff814e03c4>] consume_skb+0x34/0x80
[<ffffffff8158b057>] unix_stream_recvmsg+0x4e7/0x820
[<ffffffff814d5546>] sock_aio_read.part.7+0x116/0x130
[<ffffffff8112c10c>] ? __perf_sw_event+0x19c/0x1e0
[<ffffffff814d5581>] sock_aio_read+0x21/0x30
[<ffffffff8119a5d0>] do_sync_read+0x80/0xb0
[<ffffffff8119ac85>] vfs_read+0x145/0x170
[<ffffffff8119b699>] SyS_read+0x49/0xa0
[<ffffffff810df516>] ? __audit_syscall_exit+0x1f6/0x2a0
[<ffffffff81604a19>] system_call_fastpath+0x16/0x1b
---[ end trace 622b7e226c4a766a ]---
The reason is a race in the perf_event_task_tick() throttling code.
The race flow (simplified code):
- perf_throttled_count is a per-CPU variable that serves as
the CPU throttling flag; here it starts at 0
- perf_throttled_seq is the sequence/domain for the allowed
count of interrupts within the tick; it is increased
each tick
on a single CPU (CPU-bound event):
... workload
perf_event_task_tick:
|
| T0 inc(perf_throttled_seq)
| T1 needs_unthr = xchg(perf_throttled_count, 0) == 0
tick gets interrupted:
... event gets throttled under new seq ...
T2 last NMI comes, event is throttled - inc(perf_throttled_count)
back to tick:
| perf_adjust_freq_unthr_context:
|
| T3 unthrottling is skipped for event (needs_unthr == 0)
| T4 event is stopped and started via freq adjustment
|
tick ends
... workload
... no sample is hit for event ...
perf_event_task_tick:
|
| T5 needs_unthr = xchg(perf_throttled_count, 0) != 0 (from T2)
| T6 unthrottling is done on event (interrupts == MAX_INTERRUPTS)
| event is already started (from T4) -> WARN
Fix this by no longer checking needs_unthr, so that all events
are checked for unthrottling on every tick.
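The race flow above can be replayed in a small userspace model. This is
only a sketch under stated assumptions, not the kernel code: struct event,
ev_start(), ev_stop(), nmi_throttle(), tick() and simulate() are
hypothetical names, the "gate" flag stands in for the old needs_unthr
check, and a double start is counted instead of triggering a WARN.

```c
#include <stdbool.h>

#define MAX_INTERRUPTS (~0u)

/* Hypothetical stand-in for a perf event; not the kernel structure. */
struct event {
	unsigned int interrupts;
	bool started;
	int warns;		/* counts "start of an already started event" */
};

static int throttled_count;	/* models per-CPU perf_throttled_count */

static void ev_start(struct event *e)
{
	if (e->started)
		e->warns++;	/* models the WARN in x86_pmu_start() */
	e->started = true;
}

static void ev_stop(struct event *e)
{
	e->started = false;
}

/* T2: the last NMI throttles the event and stops it. */
static void nmi_throttle(struct event *e)
{
	e->interrupts = MAX_INTERRUPTS;
	throttled_count++;
	ev_stop(e);
}

/* One tick; "gate" selects the old behaviour (honour needs_unthr). */
static void tick(struct event *e, bool gate, bool nmi_in_window)
{
	/* T1/T5: models needs_unthr = xchg(perf_throttled_count, 0) */
	bool needs_unthr = throttled_count != 0;

	throttled_count = 0;
	if (nmi_in_window)
		nmi_throttle(e);

	/* T3/T6: unthrottle, gated on needs_unthr only in the old code */
	if ((!gate || needs_unthr) && e->interrupts == MAX_INTERRUPTS) {
		e->interrupts = 0;
		ev_start(e);
	}

	/* T4: frequency adjustment stops and restarts the event */
	ev_stop(e);
	ev_start(e);
}

/* Replays two ticks and returns how many double starts were seen. */
static int simulate(bool gate)
{
	struct event e = { .interrupts = 0, .started = true, .warns = 0 };

	throttled_count = 0;
	tick(&e, gate, true);	/* tick interrupted by the throttling NMI */
	tick(&e, gate, false);	/* next tick, no sample hit in between */
	return e.warns;
}
```

In this model simulate(true) sees one double start, matching T6 above,
while simulate(false) sees none because the throttled event is unthrottled
in the same tick, before the frequency adjustment restarts it.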
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Reported-by: Jan Stancek <jstancek@redhat.com>
Suggested-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1377355554-8934-1-git-send-email-jolsa@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Because RCU's quiescent-state-forcing mechanism is used to drive the
full-system-idle state machine, and because this mechanism is executed
by RCU's grace-period kthreads, this commit forces these kthreads to
run on the timekeeping CPU (tick_do_timer_cpu). To do otherwise would
mean that the RCU grace-period kthreads would force the system into
non-idle state every time they drove the state machine, which would
be just a bit on the futile side.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
This commit adds the state machine that takes the per-CPU idle data
as input and produces a full-system-idle indication as output. This
state machine is driven out of RCU's quiescent-state-forcing
mechanism, which invokes rcu_sysidle_check_cpu() to collect per-CPU
idle state and then rcu_sysidle_report() to drive the state machine.
The full-system-idle state is sampled using rcu_sys_is_idle(), which
also drives the state machine if RCU is idle (and does so by forcing
RCU to become non-idle). This function returns true if all but the
timekeeping CPU (tick_do_timer_cpu) are idle and have been idle long
enough to avoid memory contention on the full_sysidle_state state
variable. The rcu_sysidle_force_exit() function may be called externally
to reset the state machine back into non-idle state.
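As a rough illustration of the shape of such a state machine, here is a
userspace sketch. Everything in it is an assumption made for illustration:
the state names, the sysidle_report(), sysidle_force_exit() and
sys_is_idle() helpers, and the use of "number of consecutive all-idle
scans" as a stand-in for "idle long enough". The only name taken from the
text above is full_sysidle_state.

```c
#include <stdbool.h>

/* Hypothetical state names for this sketch only. */
enum sysidle_state {
	SYSIDLE_NOT,	/* some non-timekeeping CPU is busy */
	SYSIDLE_SHORT,	/* all idle for one scan */
	SYSIDLE_LONG,	/* all idle for two scans */
	SYSIDLE_FULL	/* full-system idle */
};

static enum sysidle_state full_sysidle_state = SYSIDLE_NOT;

/*
 * Consume one scan of per-CPU idle data: reset if any CPU other than
 * the timekeeping CPU is busy, otherwise advance by one state.
 */
static void sysidle_report(const bool *cpu_idle, int nr_cpus,
			   int timekeeping_cpu)
{
	for (int cpu = 0; cpu < nr_cpus; cpu++) {
		if (cpu != timekeeping_cpu && !cpu_idle[cpu]) {
			full_sysidle_state = SYSIDLE_NOT;
			return;
		}
	}
	if (full_sysidle_state < SYSIDLE_FULL)
		full_sysidle_state =
			(enum sysidle_state)(full_sysidle_state + 1);
}

/* External reset back into the non-idle state. */
static void sysidle_force_exit(void)
{
	full_sysidle_state = SYSIDLE_NOT;
}

/* Sample the full-system-idle indication. */
static bool sys_is_idle(void)
{
	return full_sysidle_state == SYSIDLE_FULL;
}
```

The intermediate states mean that a single all-idle scan is not enough to
report full-system idle, and any busy CPU or forced exit drops the machine
straight back to the non-idle state.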
For large systems the state machine is driven out of RCU's
force-quiescent-state logic, which provides good scalability at the price
of millisecond-scale latencies on the transition to full-system-idle
state. This is not so good for battery-powered systems, which are usually
small enough that they don't need to care about scalability, but which
do care deeply about energy efficiency. Small systems therefore drive
the state machine directly out of the idle-entry code. The number of
CPUs in a "small" system is defined by a new NO_HZ_FULL_SYSIDLE_SMALL
Kconfig parameter, which defaults to 8. Note that this is a build-time
definition.
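The small-versus-large choice can be pictured as a simple build-time
threshold check. The sketch below is illustrative only: the
sysidle_driver enum and sysidle_pick_driver() are hypothetical names, and
the "<=" comparison against the CPU count is an assumption about how the
threshold is applied. Only the NO_HZ_FULL_SYSIDLE_SMALL parameter and its
default of 8 come from the text above.

```c
/* Default taken from the text above; a real build would get this
 * value from Kconfig. */
#ifndef CONFIG_NO_HZ_FULL_SYSIDLE_SMALL
#define CONFIG_NO_HZ_FULL_SYSIDLE_SMALL 8
#endif

/* Hypothetical labels for who drives the state machine. */
enum sysidle_driver {
	SYSIDLE_DRIVEN_BY_IDLE_ENTRY,	/* small system: low latency */
	SYSIDLE_DRIVEN_BY_FQS		/* large system: scalable */
};

/* Assumed rule: systems at or below the threshold count as small. */
static enum sysidle_driver sysidle_pick_driver(int nr_cpus)
{
	if (nr_cpus <= CONFIG_NO_HZ_FULL_SYSIDLE_SMALL)
		return SYSIDLE_DRIVEN_BY_IDLE_ENTRY;
	return SYSIDLE_DRIVEN_BY_FQS;
}
```

Because the threshold is a preprocessor constant, changing it requires a
rebuild, which is the build-time property noted above.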
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
[ paulmck: Use true and false for boolean constants per Lai Jiangshan. ]
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
[ paulmck: Simplify logic and provide better comments for memory barriers,
based on review comments and questions by Lai Jiangshan. ]