Merge tag 'sched-core-2021-11-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler updates from Thomas Gleixner:

 - Revert the printk format-based wchan() symbol resolution as it can
   leak the raw value in case the symbol is not resolvable.

 - Make wchan() more robust and work with all kinds of unwinders by
   enforcing that the task stays blocked while unwinding is in progress
   (a sketch of such a wrapper follows this list).

 - Prevent sched_fork() from accessing an invalid sched_task_group

 - Improve asymmetric packing logic

 - Extend scheduler statistics to RT and DL scheduling classes and add
   statistics for bandwidth burst to the SCHED_FAIR class.

 - Properly account SCHED_IDLE entities

 - Prevent a potential deadlock when initial priority is assigned to a
   newly created kthread. A recent change to plug a race between cpuset
   and __sched_setscheduler() introduced a new lock dependency which is
   now triggered. Break the lock dependency chain by moving the priority
   assignment to the thread function.

 - Fix the idle time reporting in /proc/uptime for NOHZ enabled systems.

 - Improve idle balancing in general and especially for NOHZ enabled
   systems.

 - Provide proper interfaces for live patching so it does not have to
   fiddle with scheduler internals.

 - Add cluster aware scheduling support.

 - A small set of tweaks for RT (irqwork, wait_task_inactive(), various
   scheduler options and delaying mmdrop)

 - The usual small tweaks and improvements all over the place
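
The wchan() hardening mentioned above is implemented as a generic wrapper around the per-architecture unwind helpers ("sched: Add wrapper for get_wchan() to keep task blocked" in the shortlog below). A minimal sketch of the idea, not the exact mainline code:

    unsigned long get_wchan(struct task_struct *p)
    {
        unsigned long ip = 0;
        unsigned int state;

        if (!p || p == current)
            return 0;

        /* pi_lock is taken by the wakeup path, so the task stays blocked. */
        raw_spin_lock_irq(&p->pi_lock);
        state = READ_ONCE(p->__state);
        smp_rmb(); /* pairs with the wakeup side */
        if (state != TASK_RUNNING && state != TASK_WAKING && !p->on_rq)
            ip = __get_wchan(p); /* arch-specific stack walk */
        raw_spin_unlock_irq(&p->pi_lock);

        return ip;
    }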

* tag 'sched-core-2021-11-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (69 commits)
  sched/fair: Cleanup newidle_balance
  sched/fair: Remove sysctl_sched_migration_cost condition
  sched/fair: Wait before decaying max_newidle_lb_cost
  sched/fair: Skip update_blocked_averages if we are defering load balance
  sched/fair: Account update_blocked_averages in newidle_balance cost
  x86: Fix __get_wchan() for !STACKTRACE
  sched,x86: Fix L2 cache mask
  sched/core: Remove rq_relock()
  sched: Improve wake_up_all_idle_cpus() take #2
  irq_work: Also rcuwait for !IRQ_WORK_HARD_IRQ on PREEMPT_RT
  irq_work: Handle some irq_work in a per-CPU thread on PREEMPT_RT
  irq_work: Allow irq_work_sync() to sleep if irq_work() no IRQ support.
  sched/rt: Annotate the RT balancing logic irqwork as IRQ_WORK_HARD_IRQ
  sched: Add cluster scheduler level for x86
  sched: Add cluster scheduler level in core and related Kconfig for ARM64
  topology: Represent clusters of CPUs within a die
  sched: Disable -Wunused-but-set-variable
  sched: Add wrapper for get_wchan() to keep task blocked
  x86: Fix get_wchan() to support the ORC unwinder
  proc: Use task_is_running() for wchan in /proc/$pid/stat
  ...
Merged by Linus Torvalds on 2021-11-01 13:48:52 -07:00 (commit 9a7e0a90a4): 105 changed files with 1683 additions and 790 deletions.

@ -42,6 +42,12 @@ Description: the CPU core ID of cpuX. Typically it is the hardware platform's
architecture and platform dependent.
Values: integer
What: /sys/devices/system/cpu/cpuX/topology/cluster_id
Description: the cluster ID of cpuX. Typically it is the hardware platform's
identifier (rather than the kernel's). The actual value is
architecture and platform dependent.
Values: integer
What: /sys/devices/system/cpu/cpuX/topology/book_id
Description: the book ID of cpuX. Typically it is the hardware platform's
identifier (rather than the kernel's). The actual value is
@ -85,6 +91,15 @@ Description: human-readable list of CPUs within the same die.
The format is like 0-3, 8-11, 14,17.
Values: decimal list.
What: /sys/devices/system/cpu/cpuX/topology/cluster_cpus
Description: internal kernel map of CPUs within the same cluster.
Values: hexadecimal bitmask.
What: /sys/devices/system/cpu/cpuX/topology/cluster_cpus_list
Description: human-readable list of CPUs within the same cluster.
The format is like 0-3, 8-11, 14,17.
Values: decimal list.
What: /sys/devices/system/cpu/cpuX/topology/book_siblings
Description: internal kernel map of cpuX's hardware threads within the same
book_id. it's only used on s390.
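
For illustration, a minimal user-space sketch that reads the new cluster_id attribute described above (hypothetical example; it assumes cpu0 exists and the running kernel exposes the attribute):

    #include <stdio.h>

    int main(void)
    {
        /* New sysfs attribute added by this series. */
        FILE *f = fopen("/sys/devices/system/cpu/cpu0/topology/cluster_id", "r");
        int id = -1;

        if (!f) {
            perror("cluster_id");
            return 1;
        }
        if (fscanf(f, "%d", &id) != 1)
            id = -1;
        fclose(f);
        printf("cpu0 cluster_id: %d\n", id);
        return 0;
    }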


@ -1016,6 +1016,8 @@ All time durations are in microseconds.
- nr_periods
- nr_throttled
- throttled_usec
- nr_bursts
- burst_usec
cpu.weight
A read-write single value file which exists on non-root
@ -1047,6 +1049,12 @@ All time durations are in microseconds.
$PERIOD duration. "max" for $MAX indicates no limit. If only
one number is written, $MAX is updated.
cpu.max.burst
A read-write single value file which exists on non-root
cgroups. The default is "0".
The burst in the range [0, $MAX].
cpu.pressure
A read-write nested-keyed file.


@ -19,11 +19,13 @@ these macros in include/asm-XXX/topology.h::
#define topology_physical_package_id(cpu)
#define topology_die_id(cpu)
#define topology_cluster_id(cpu)
#define topology_core_id(cpu)
#define topology_book_id(cpu)
#define topology_drawer_id(cpu)
#define topology_sibling_cpumask(cpu)
#define topology_core_cpumask(cpu)
#define topology_cluster_cpumask(cpu)
#define topology_die_cpumask(cpu)
#define topology_book_cpumask(cpu)
#define topology_drawer_cpumask(cpu)
@ -39,10 +41,12 @@ not defined by include/asm-XXX/topology.h:
1) topology_physical_package_id: -1
2) topology_die_id: -1
3) topology_core_id: 0
4) topology_sibling_cpumask: just the given CPU
5) topology_core_cpumask: just the given CPU
6) topology_die_cpumask: just the given CPU
3) topology_cluster_id: -1
4) topology_core_id: 0
5) topology_sibling_cpumask: just the given CPU
6) topology_core_cpumask: just the given CPU
7) topology_cluster_cpumask: just the given CPU
8) topology_die_cpumask: just the given CPU
For architectures that don't support books (CONFIG_SCHED_BOOK) there are no
default definitions for topology_book_id() and topology_book_cpumask().


@ -22,9 +22,52 @@ cfs_quota units at each period boundary. As threads consume this bandwidth it
is transferred to cpu-local "silos" on a demand basis. The amount transferred
within each of these updates is tunable and described as the "slice".
Burst feature
-------------
This feature borrows time now against our future underrun, at the cost of
increased interference against the other system users. All nicely bounded.
Traditional (UP-EDF) bandwidth control is something like:
(U = \Sum u_i) <= 1
This guarantees both that every deadline is met and that the system is
stable. After all, if U were > 1, then for every second of walltime,
we'd have to run more than a second of program time, and obviously miss
our deadline, but the next deadline will be further out still, there is
never time to catch up, unbounded fail.
The burst feature observes that a workload doesn't always execute the full
quota; this enables one to describe u_i as a statistical distribution.
For example, have u_i = {x,e}_i, where x is the p(95) and x+e p(100)
(the traditional WCET). This effectively allows u to be smaller,
increasing the efficiency (we can pack more tasks in the system), but at
the cost of missing deadlines when all the odds line up. However, it
does maintain stability, since every overrun must be paired with an
underrun as long as our x is above the average.
That is, suppose we have 2 tasks, both specify a p(95) value, then we
have a p(95)*p(95) = 90.25% chance both tasks are within their quota and
everything is good. At the same time we have a p(5)p(5) = 0.25% chance
both tasks will exceed their quota at the same time (guaranteed deadline
fail). Somewhere in between there's a threshold where one exceeds and
the other doesn't underrun enough to compensate; this depends on the
specific CDFs.
At the same time, we can say that the worst case deadline miss will be
\Sum e_i; that is, there is a bounded tardiness (under the assumption
that x+e is indeed WCET).
The interference when using burst is assessed by the likelihood of
missing the deadline and by the average WCET. Test results showed that when
there are many cgroups or the CPU is under-utilized, the interference is
limited. More details are shown in:
https://lore.kernel.org/lkml/5371BD36-55AE-4F71-B9D7-B86DC32E3D2B@linux.alibaba.com/
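
Spelled out as a worked calculation of the numbers quoted above (editorial illustration using the same p(95)/p(5) figures):

    P(\text{both tasks stay within quota}) = 0.95 \times 0.95 = 0.9025 = 90.25\%
    P(\text{both tasks overrun at once})   = 0.05 \times 0.05 = 0.0025 = 0.25\%
    \text{worst-case tardiness} \le \sum_i e_i \quad (\text{assuming } x_i + e_i \text{ is the true WCET})
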
Management
----------
Quota and period are managed within the cpu subsystem via cgroupfs.
Quota, period and burst are managed within the cpu subsystem via cgroupfs.
.. note::
The cgroupfs files described in this section are only applicable
@ -32,29 +75,37 @@ Quota and period are managed within the cpu subsystem via cgroupfs.
:ref:`Documentation/admin-guide/cgroup-v2.rst <cgroup-v2-cpu>`.
- cpu.cfs_quota_us: the total available run-time within a period (in
microseconds)
- cpu.cfs_quota_us: run-time replenished within a period (in microseconds)
- cpu.cfs_period_us: the length of a period (in microseconds)
- cpu.stat: exports throttling statistics [explained further below]
- cpu.cfs_burst_us: the maximum accumulated run-time (in microseconds)
The default values are::
cpu.cfs_period_us=100ms
cpu.cfs_quota=-1
cpu.cfs_quota_us=-1
cpu.cfs_burst_us=0
A value of -1 for cpu.cfs_quota_us indicates that the group does not have any
bandwidth restriction in place, such a group is described as an unconstrained
bandwidth group. This represents the traditional work-conserving behavior for
CFS.
Writing any (valid) positive value(s) will enact the specified bandwidth limit.
The minimum quota allowed for the quota or period is 1ms. There is also an
upper bound on the period length of 1s. Additional restrictions exist when
bandwidth limits are used in a hierarchical fashion, these are explained in
more detail below.
Writing any (valid) positive value(s) no smaller than cpu.cfs_burst_us will
enact the specified bandwidth limit. The minimum quota allowed for the quota or
period is 1ms. There is also an upper bound on the period length of 1s.
Additional restrictions exist when bandwidth limits are used in a hierarchical
fashion, these are explained in more detail below.
Writing any negative value to cpu.cfs_quota_us will remove the bandwidth limit
and return the group to an unconstrained state once more.
A value of 0 for cpu.cfs_burst_us indicates that the group cannot accumulate
any unused bandwidth. This leaves the traditional bandwidth control behavior for
CFS unchanged. Writing any (valid) positive value(s) no larger than
cpu.cfs_quota_us into cpu.cfs_burst_us will enact the cap on unused bandwidth
accumulation.
Any updates to a group's bandwidth specification will result in it becoming
unthrottled if it is in a constrained state.
@ -74,7 +125,7 @@ for more fine-grained consumption.
Statistics
----------
A group's bandwidth statistics are exported via 3 fields in cpu.stat.
A group's bandwidth statistics are exported via 5 fields in cpu.stat.
cpu.stat:
@ -82,6 +133,9 @@ cpu.stat:
- nr_throttled: Number of times the group has been throttled/limited.
- throttled_time: The total time duration (in nanoseconds) for which entities
of the group have been throttled.
- nr_bursts: Number of periods in which a burst occurred.
- burst_time: Cumulative wall-time (in nanoseconds) that any CPUs have used
above quota in their respective periods.
This interface is read-only.
@ -179,3 +233,15 @@ Examples
By using a small period here we are ensuring a consistent latency
response at the expense of burst capacity.
4. Limit a group to 40% of 1 CPU, and allow it to additionally accumulate
up to 20% of 1 CPU, in case accumulation has been done.
With 50ms period, 20ms quota will be equivalent to 40% of 1 CPU.
And 10ms burst will be equivalent to 20% of 1 CPU.
# echo 20000 > cpu.cfs_quota_us /* quota = 20ms */
# echo 50000 > cpu.cfs_period_us /* period = 50ms */
# echo 10000 > cpu.cfs_burst_us /* burst = 10ms */
A larger buffer setting (no larger than quota) allows greater burst capacity.


@ -42,7 +42,7 @@ extern void start_thread(struct pt_regs *, unsigned long, unsigned long);
struct task_struct;
extern void release_thread(struct task_struct *);
unsigned long get_wchan(struct task_struct *p);
unsigned long __get_wchan(struct task_struct *p);
#define KSTK_EIP(tsk) (task_pt_regs(tsk)->pc)


@ -376,12 +376,11 @@ thread_saved_pc(struct task_struct *t)
}
unsigned long
get_wchan(struct task_struct *p)
__get_wchan(struct task_struct *p)
{
unsigned long schedule_frame;
unsigned long pc;
if (!p || p == current || task_is_running(p))
return 0;
/*
* This one depends on the frame size of schedule(). Do a
* "disass schedule" in gdb to find the frame size. Also, the


@ -70,7 +70,7 @@ struct task_struct;
extern void start_thread(struct pt_regs * regs, unsigned long pc,
unsigned long usp);
extern unsigned int get_wchan(struct task_struct *p);
extern unsigned int __get_wchan(struct task_struct *p);
#endif /* !__ASSEMBLY__ */


@ -15,7 +15,7 @@
* = specifics of data structs where trace is saved(CONFIG_STACKTRACE etc)
*
* vineetg: March 2009
* -Implemented correct versions of thread_saved_pc() and get_wchan()
* -Implemented correct versions of thread_saved_pc() and __get_wchan()
*
* rajeshwarr: 2008
* -Initial implementation
@ -248,7 +248,7 @@ void show_stack(struct task_struct *tsk, unsigned long *sp, const char *loglvl)
* Of course just returning schedule( ) would be pointless so unwind until
* the function is not in scheduler code
*/
unsigned int get_wchan(struct task_struct *tsk)
unsigned int __get_wchan(struct task_struct *tsk)
{
return arc_unwind_core(tsk, NULL, __get_first_nonsched, NULL);
}


@ -84,7 +84,7 @@ struct task_struct;
/* Free all resources held by a thread. */
extern void release_thread(struct task_struct *);
unsigned long get_wchan(struct task_struct *p);
unsigned long __get_wchan(struct task_struct *p);
#define task_pt_regs(p) \
((struct pt_regs *)(THREAD_START_SP + task_stack_page(p)) - 1)


@ -276,13 +276,11 @@ int copy_thread(unsigned long clone_flags, unsigned long stack_start,
return 0;
}
unsigned long get_wchan(struct task_struct *p)
unsigned long __get_wchan(struct task_struct *p)
{
struct stackframe frame;
unsigned long stack_page;
int count = 0;
if (!p || p == current || task_is_running(p))
return 0;
frame.fp = thread_saved_fp(p);
frame.sp = thread_saved_sp(p);


@ -988,6 +988,15 @@ config SCHED_MC
making when dealing with multi-core CPU chips at a cost of slightly
increased overhead in some places. If unsure say N here.
config SCHED_CLUSTER
bool "Cluster scheduler support"
help
Cluster scheduler support improves the CPU scheduler's decision
making when dealing with machines that have clusters of CPUs.
Cluster usually means a couple of CPUs which are placed closely
by sharing mid-level caches, last-level cache tags or internal
busses.
config SCHED_SMT
bool "SMT scheduler support"
help


@ -257,7 +257,7 @@ struct task_struct;
/* Free all resources held by a thread. */
extern void release_thread(struct task_struct *);
unsigned long get_wchan(struct task_struct *p);
unsigned long __get_wchan(struct task_struct *p);
void update_sctlr_el1(u64 sctlr);


@ -528,13 +528,11 @@ __notrace_funcgraph struct task_struct *__switch_to(struct task_struct *prev,
return last;
}
unsigned long get_wchan(struct task_struct *p)
unsigned long __get_wchan(struct task_struct *p)
{
struct stackframe frame;
unsigned long stack_page, ret = 0;
int count = 0;
if (!p || p == current || task_is_running(p))
return 0;
stack_page = (unsigned long)try_get_task_stack(p);
if (!stack_page)


@ -103,6 +103,8 @@ int __init parse_acpi_topology(void)
cpu_topology[cpu].thread_id = -1;
cpu_topology[cpu].core_id = topology_id;
}
topology_id = find_acpi_cpu_topology_cluster(cpu);
cpu_topology[cpu].cluster_id = topology_id;
topology_id = find_acpi_cpu_topology_package(cpu);
cpu_topology[cpu].package_id = topology_id;


@ -81,7 +81,7 @@ static inline void release_thread(struct task_struct *dead_task)
extern int kernel_thread(int (*fn)(void *), void *arg, unsigned long flags);
unsigned long get_wchan(struct task_struct *p);
unsigned long __get_wchan(struct task_struct *p);
#define KSTK_EIP(tsk) (task_pt_regs(tsk)->pc)
#define KSTK_ESP(tsk) (task_pt_regs(tsk)->usp)


@ -111,12 +111,11 @@ static bool save_wchan(unsigned long pc, void *arg)
return false;
}
unsigned long get_wchan(struct task_struct *task)
unsigned long __get_wchan(struct task_struct *task)
{
unsigned long pc = 0;
if (likely(task && task != current && !task_is_running(task)))
walk_stackframe(task, NULL, save_wchan, &pc);
walk_stackframe(task, NULL, save_wchan, &pc);
return pc;
}


@ -105,7 +105,7 @@ static inline void release_thread(struct task_struct *dead_task)
{
}
unsigned long get_wchan(struct task_struct *p);
unsigned long __get_wchan(struct task_struct *p);
#define KSTK_EIP(tsk) \
({ \


@ -128,15 +128,12 @@ int copy_thread(unsigned long clone_flags, unsigned long usp,
return 0;
}
unsigned long get_wchan(struct task_struct *p)
unsigned long __get_wchan(struct task_struct *p)
{
unsigned long fp, pc;
unsigned long stack_page;
int count = 0;
if (!p || p == current || task_is_running(p))
return 0;
stack_page = (unsigned long)p;
fp = ((struct pt_regs *)p->thread.ksp)->er6;
do {


@ -64,7 +64,7 @@ struct thread_struct {
extern void release_thread(struct task_struct *dead_task);
/* Get wait channel for task P. */
extern unsigned long get_wchan(struct task_struct *p);
extern unsigned long __get_wchan(struct task_struct *p);
/* The following stuff is pretty HEXAGON specific. */


@ -130,13 +130,11 @@ void flush_thread(void)
* is an identification of the point at which the scheduler
* was invoked by a blocked thread.
*/
unsigned long get_wchan(struct task_struct *p)
unsigned long __get_wchan(struct task_struct *p)
{
unsigned long fp, pc;
unsigned long stack_page;
int count = 0;
if (!p || p == current || task_is_running(p))
return 0;
stack_page = (unsigned long)task_stack_page(p);
fp = ((struct hexagon_switch_stack *)p->thread.switch_sp)->fp;


@ -330,7 +330,7 @@ struct task_struct;
#define release_thread(dead_task)
/* Get wait channel for task P. */
extern unsigned long get_wchan (struct task_struct *p);
extern unsigned long __get_wchan (struct task_struct *p);
/* Return instruction pointer of blocked task TSK. */
#define KSTK_EIP(tsk) \


@ -523,15 +523,12 @@ exit_thread (struct task_struct *tsk)
}
unsigned long
get_wchan (struct task_struct *p)
__get_wchan (struct task_struct *p)
{
struct unw_frame_info info;
unsigned long ip;
int count = 0;
if (!p || p == current || task_is_running(p))
return 0;
/*
* Note: p may not be a blocked task (it could be current or
* another process running on some other CPU. Rather than


@ -150,7 +150,7 @@ static inline void release_thread(struct task_struct *dead_task)
{
}
unsigned long get_wchan(struct task_struct *p);
unsigned long __get_wchan(struct task_struct *p);
#define KSTK_EIP(tsk) \
({ \


@ -263,13 +263,11 @@ int dump_fpu (struct pt_regs *regs, struct user_m68kfp_struct *fpu)
}
EXPORT_SYMBOL(dump_fpu);
unsigned long get_wchan(struct task_struct *p)
unsigned long __get_wchan(struct task_struct *p)
{
unsigned long fp, pc;
unsigned long stack_page;
int count = 0;
if (!p || p == current || task_is_running(p))
return 0;
stack_page = (unsigned long)task_stack_page(p);
fp = ((struct switch_stack *)p->thread.ksp)->a6;


@ -68,7 +68,7 @@ static inline void release_thread(struct task_struct *dead_task)
{
}
unsigned long get_wchan(struct task_struct *p);
unsigned long __get_wchan(struct task_struct *p);
/* The size allocated for kernel stacks. This _must_ be a power of two! */
# define KERNEL_STACK_SIZE 0x2000


@ -112,7 +112,7 @@ int copy_thread(unsigned long clone_flags, unsigned long usp, unsigned long arg,
return 0;
}
unsigned long get_wchan(struct task_struct *p)
unsigned long __get_wchan(struct task_struct *p)
{
/* TBD (used by procfs) */
return 0;


@ -369,7 +369,7 @@ static inline void flush_thread(void)
{
}
unsigned long get_wchan(struct task_struct *p);
unsigned long __get_wchan(struct task_struct *p);
#define __KSTK_TOS(tsk) ((unsigned long)task_stack_page(tsk) + \
THREAD_SIZE - 32 - sizeof(struct pt_regs))


@ -511,7 +511,7 @@ static int __init frame_info_init(void)
/*
* Without schedule() frame info, result given by
* thread_saved_pc() and get_wchan() are not reliable.
* thread_saved_pc() and __get_wchan() are not reliable.
*/
if (schedule_mfi.pc_offset < 0)
printk("Can't analyze schedule() prologue at %p\n", schedule);
@ -652,9 +652,9 @@ unsigned long unwind_stack(struct task_struct *task, unsigned long *sp,
#endif
/*
* get_wchan - a maintenance nightmare^W^Wpain in the ass ...
* __get_wchan - a maintenance nightmare^W^Wpain in the ass ...
*/
unsigned long get_wchan(struct task_struct *task)
unsigned long __get_wchan(struct task_struct *task)
{
unsigned long pc = 0;
#ifdef CONFIG_KALLSYMS
@ -662,8 +662,6 @@ unsigned long get_wchan(struct task_struct *task)
unsigned long ra = 0;
#endif
if (!task || task == current || task_is_running(task))
goto out;
if (!task_stack_page(task))
goto out;


@ -83,7 +83,7 @@ extern struct task_struct *last_task_used_math;
/* Prepare to copy thread state - unlazy all lazy status */
#define prepare_to_copy(tsk) do { } while (0)
unsigned long get_wchan(struct task_struct *p);
unsigned long __get_wchan(struct task_struct *p);
#define cpu_relax() barrier()


@ -233,15 +233,12 @@ int dump_fpu(struct pt_regs *regs, elf_fpregset_t * fpu)
EXPORT_SYMBOL(dump_fpu);
unsigned long get_wchan(struct task_struct *p)
unsigned long __get_wchan(struct task_struct *p)
{
unsigned long fp, lr;
unsigned long stack_start, stack_end;
int count = 0;
if (!p || p == current || task_is_running(p))
return 0;
if (IS_ENABLED(CONFIG_FRAME_POINTER)) {
stack_start = (unsigned long)end_of_stack(p);
stack_end = (unsigned long)task_stack_page(p) + THREAD_SIZE;
@ -258,5 +255,3 @@ unsigned long get_wchan(struct task_struct *p)
}
return 0;
}
EXPORT_SYMBOL(get_wchan);


@ -69,7 +69,7 @@ static inline void release_thread(struct task_struct *dead_task)
{
}
extern unsigned long get_wchan(struct task_struct *p);
extern unsigned long __get_wchan(struct task_struct *p);
#define task_pt_regs(p) \
((struct pt_regs *)(THREAD_SIZE + task_stack_page(p)) - 1)


@ -217,15 +217,12 @@ void dump(struct pt_regs *fp)
pr_emerg("\n\n");
}
unsigned long get_wchan(struct task_struct *p)
unsigned long __get_wchan(struct task_struct *p)
{
unsigned long fp, pc;
unsigned long stack_page;
int count = 0;
if (!p || p == current || task_is_running(p))
return 0;
stack_page = (unsigned long)p;
fp = ((struct switch_stack *)p->thread.ksp)->fp; /* ;dgt2 */
do {


@ -73,7 +73,7 @@ struct thread_struct {
void start_thread(struct pt_regs *regs, unsigned long nip, unsigned long sp);
void release_thread(struct task_struct *);
unsigned long get_wchan(struct task_struct *p);
unsigned long __get_wchan(struct task_struct *p);
#define cpu_relax() barrier()


@ -263,7 +263,7 @@ void dump_elf_thread(elf_greg_t *dest, struct pt_regs* regs)
dest[35] = 0;
}
unsigned long get_wchan(struct task_struct *p)
unsigned long __get_wchan(struct task_struct *p)
{
/* TODO */


@ -273,7 +273,7 @@ struct mm_struct;
/* Free all resources held by a thread. */
extern void release_thread(struct task_struct *);
extern unsigned long get_wchan(struct task_struct *p);
extern unsigned long __get_wchan(struct task_struct *p);
#define KSTK_EIP(tsk) ((tsk)->thread.regs.iaoq[0])
#define KSTK_ESP(tsk) ((tsk)->thread.regs.gr[30])


@ -240,15 +240,12 @@ copy_thread(unsigned long clone_flags, unsigned long usp,
}
unsigned long
get_wchan(struct task_struct *p)
__get_wchan(struct task_struct *p)
{
struct unwind_frame_info info;
unsigned long ip;
int count = 0;
if (!p || p == current || task_is_running(p))
return 0;
/*
* These bracket the sleeping functions..
*/


@ -300,7 +300,7 @@ struct thread_struct {
#define task_pt_regs(tsk) ((tsk)->thread.regs)
unsigned long get_wchan(struct task_struct *p);
unsigned long __get_wchan(struct task_struct *p);
#define KSTK_EIP(tsk) ((tsk)->thread.regs? (tsk)->thread.regs->nip: 0)
#define KSTK_ESP(tsk) ((tsk)->thread.regs? (tsk)->thread.regs->gpr[1]: 0)


@ -2111,14 +2111,11 @@ int validate_sp(unsigned long sp, struct task_struct *p,
EXPORT_SYMBOL(validate_sp);
static unsigned long __get_wchan(struct task_struct *p)
static unsigned long ___get_wchan(struct task_struct *p)
{
unsigned long ip, sp;
int count = 0;
if (!p || p == current || task_is_running(p))
return 0;
sp = p->thread.ksp;
if (!validate_sp(sp, p, STACK_FRAME_OVERHEAD))
return 0;
@ -2137,14 +2134,14 @@ static unsigned long __get_wchan(struct task_struct *p)
return 0;
}
unsigned long get_wchan(struct task_struct *p)
unsigned long __get_wchan(struct task_struct *p)
{
unsigned long ret;
if (!try_get_task_stack(p))
return 0;
ret = __get_wchan(p);
ret = ___get_wchan(p);
put_task_stack(p);


@ -66,7 +66,7 @@ static inline void release_thread(struct task_struct *dead_task)
{
}
extern unsigned long get_wchan(struct task_struct *p);
extern unsigned long __get_wchan(struct task_struct *p);
static inline void wait_for_interrupt(void)


@ -128,16 +128,14 @@ static bool save_wchan(void *arg, unsigned long pc)
return true;
}
unsigned long get_wchan(struct task_struct *task)
unsigned long __get_wchan(struct task_struct *task)
{
unsigned long pc = 0;
if (likely(task && task != current && !task_is_running(task))) {
if (!try_get_task_stack(task))
return 0;
walk_stackframe(task, NULL, save_wchan, &pc);
put_task_stack(task);
}
if (!try_get_task_stack(task))
return 0;
walk_stackframe(task, NULL, save_wchan, &pc);
put_task_stack(task);
return pc;
}


@ -192,7 +192,7 @@ static inline void release_thread(struct task_struct *tsk) { }
void guarded_storage_release(struct task_struct *tsk);
void gs_load_bc_cb(struct pt_regs *regs);
unsigned long get_wchan(struct task_struct *p);
unsigned long __get_wchan(struct task_struct *p);
#define task_pt_regs(tsk) ((struct pt_regs *) \
(task_stack_page(tsk) + THREAD_SIZE) - 1)
#define KSTK_EIP(tsk) (task_pt_regs(tsk)->psw.addr)


@ -181,12 +181,12 @@ void execve_tail(void)
asm volatile("sfpc %0" : : "d" (0));
}
unsigned long get_wchan(struct task_struct *p)
unsigned long __get_wchan(struct task_struct *p)
{
struct unwind_state state;
unsigned long ip = 0;
if (!p || p == current || task_is_running(p) || !task_stack_page(p))
if (!task_stack_page(p))
return 0;
if (!try_get_task_stack(p))


@ -180,7 +180,7 @@ static inline void show_code(struct pt_regs *regs)
}
#endif
extern unsigned long get_wchan(struct task_struct *p);
extern unsigned long __get_wchan(struct task_struct *p);
#define KSTK_EIP(tsk) (task_pt_regs(tsk)->pc)
#define KSTK_ESP(tsk) (task_pt_regs(tsk)->regs[15])


@ -182,13 +182,10 @@ __switch_to(struct task_struct *prev, struct task_struct *next)
return prev;
}
unsigned long get_wchan(struct task_struct *p)
unsigned long __get_wchan(struct task_struct *p)
{
unsigned long pc;
if (!p || p == current || task_is_running(p))
return 0;
/*
* The same comment as on the Alpha applies here, too ...
*/


@ -89,7 +89,7 @@ static inline void start_thread(struct pt_regs * regs, unsigned long pc,
/* Free all resources held by a thread. */
#define release_thread(tsk) do { } while(0)
unsigned long get_wchan(struct task_struct *);
unsigned long __get_wchan(struct task_struct *);
#define task_pt_regs(tsk) ((tsk)->thread.kregs)
#define KSTK_EIP(tsk) ((tsk)->thread.kregs->pc)


@ -183,7 +183,7 @@ do { \
/* Free all resources held by a thread. */
#define release_thread(tsk) do { } while (0)
unsigned long get_wchan(struct task_struct *task);
unsigned long __get_wchan(struct task_struct *task);
#define task_pt_regs(tsk) (task_thread_info(tsk)->kregs)
#define KSTK_EIP(tsk) (task_pt_regs(tsk)->tpc)


@ -365,7 +365,7 @@ int copy_thread(unsigned long clone_flags, unsigned long sp, unsigned long arg,
return 0;
}
unsigned long get_wchan(struct task_struct *task)
unsigned long __get_wchan(struct task_struct *task)
{
unsigned long pc, fp, bias = 0;
unsigned long task_base = (unsigned long) task;
@ -373,9 +373,6 @@ unsigned long get_wchan(struct task_struct *task)
struct reg_window32 *rw;
int count = 0;
if (!task || task == current || task_is_running(task))
goto out;
fp = task_thread_info(task)->ksp + bias;
do {
/* Bogus frame pointer? */


@ -663,7 +663,7 @@ int arch_dup_task_struct(struct task_struct *dst, struct task_struct *src)
return 0;
}
unsigned long get_wchan(struct task_struct *task)
unsigned long __get_wchan(struct task_struct *task)
{
unsigned long pc, fp, bias = 0;
struct thread_info *tp;
@ -671,9 +671,6 @@ unsigned long get_wchan(struct task_struct *task)
unsigned long ret = 0;
int count = 0;
if (!task || task == current || task_is_running(task))
goto out;
tp = task_thread_info(task);
bias = STACK_BIAS;
fp = task_thread_info(task)->ksp + bias;


@ -106,6 +106,6 @@ extern struct cpuinfo_um boot_cpu_data;
#define cache_line_size() (boot_cpu_data.cache_alignment)
#define KSTK_REG(tsk, reg) get_thread_reg(reg, &tsk->thread.switch_buf)
extern unsigned long get_wchan(struct task_struct *p);
extern unsigned long __get_wchan(struct task_struct *p);
#endif


@ -364,14 +364,11 @@ unsigned long arch_align_stack(unsigned long sp)
}
#endif
unsigned long get_wchan(struct task_struct *p)
unsigned long __get_wchan(struct task_struct *p)
{
unsigned long stack_page, sp, ip;
bool seen_sched = 0;
if ((p == NULL) || (p == current) || task_is_running(p))
return 0;
stack_page = (unsigned long) task_stack_page(p);
/* Bail if the process has no kernel stack for some reason */
if (stack_page == 0)


@ -1001,6 +1001,17 @@ config NR_CPUS
This is purely to save memory: each supported CPU adds about 8KB
to the kernel image.
config SCHED_CLUSTER
bool "Cluster scheduler support"
depends on SMP
default y
help
Cluster scheduler support improves the CPU scheduler's decision
making when dealing with machines that have clusters of CPUs.
Cluster usually means a couple of CPUs which are placed closely
by sharing mid-level caches, last-level cache tags or internal
busses.
config SCHED_SMT
def_bool y if SMP


@ -589,7 +589,7 @@ static inline void load_sp0(unsigned long sp0)
/* Free all resources held by a thread. */
extern void release_thread(struct task_struct *);
unsigned long get_wchan(struct task_struct *p);
unsigned long __get_wchan(struct task_struct *p);
/*
* Generic CPUID function


@ -16,7 +16,9 @@ DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_core_map);
DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_die_map);
/* cpus sharing the last level cache: */
DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_llc_shared_map);
DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_l2c_shared_map);
DECLARE_PER_CPU_READ_MOSTLY(u16, cpu_llc_id);
DECLARE_PER_CPU_READ_MOSTLY(u16, cpu_l2c_id);
DECLARE_PER_CPU_READ_MOSTLY(int, cpu_number);
static inline struct cpumask *cpu_llc_shared_mask(int cpu)
@ -24,6 +26,11 @@ static inline struct cpumask *cpu_llc_shared_mask(int cpu)
return per_cpu(cpu_llc_shared_map, cpu);
}
static inline struct cpumask *cpu_l2c_shared_mask(int cpu)
{
return per_cpu(cpu_l2c_shared_map, cpu);
}
DECLARE_EARLY_PER_CPU_READ_MOSTLY(u16, x86_cpu_to_apicid);
DECLARE_EARLY_PER_CPU_READ_MOSTLY(u32, x86_cpu_to_acpiid);
DECLARE_EARLY_PER_CPU_READ_MOSTLY(u16, x86_bios_cpu_apicid);


@ -103,6 +103,7 @@ static inline void setup_node_to_cpumask_map(void) { }
#include <asm-generic/topology.h>
extern const struct cpumask *cpu_coregroup_mask(int cpu);
extern const struct cpumask *cpu_clustergroup_mask(int cpu);
#define topology_logical_package_id(cpu) (cpu_data(cpu).logical_proc_id)
#define topology_physical_package_id(cpu) (cpu_data(cpu).phys_proc_id)
@ -113,7 +114,9 @@ extern const struct cpumask *cpu_coregroup_mask(int cpu);
extern unsigned int __max_die_per_package;
#ifdef CONFIG_SMP
#define topology_cluster_id(cpu) (per_cpu(cpu_l2c_id, cpu))
#define topology_die_cpumask(cpu) (per_cpu(cpu_die_map, cpu))
#define topology_cluster_cpumask(cpu) (cpu_clustergroup_mask(cpu))
#define topology_core_cpumask(cpu) (per_cpu(cpu_core_map, cpu))
#define topology_sibling_cpumask(cpu) (per_cpu(cpu_sibling_map, cpu))


@ -846,6 +846,7 @@ void init_intel_cacheinfo(struct cpuinfo_x86 *c)
l2 = new_l2;
#ifdef CONFIG_SMP
per_cpu(cpu_llc_id, cpu) = l2_id;
per_cpu(cpu_l2c_id, cpu) = l2_id;
#endif
}


@ -85,6 +85,9 @@ u16 get_llc_id(unsigned int cpu)
}
EXPORT_SYMBOL_GPL(get_llc_id);
/* L2 cache ID of each logical CPU */
DEFINE_PER_CPU_READ_MOSTLY(u16, cpu_l2c_id) = BAD_APICID;
/* correctly size the local cpu masks */
void __init setup_cpu_local_masks(void)
{


@ -198,7 +198,7 @@ void sched_set_itmt_core_prio(int prio, int core_cpu)
* of the priority chain and only used when
* all other high priority cpus are out of capacity.
*/
smt_prio = prio * smp_num_siblings / i;
smt_prio = prio * smp_num_siblings / (i * i);
per_cpu(sched_core_priority, cpu) = smt_prio;
i++;
}
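
The hunk above changes how ITMT priorities are scaled across SMT siblings. A quick worked example with illustrative numbers (core priority prio = 100, smp_num_siblings = 2, i counting siblings from 1):

    \text{old: } smt\_prio_i = \frac{prio \cdot siblings}{i}   \Rightarrow 200,\ 100
    \text{new: } smt\_prio_i = \frac{prio \cdot siblings}{i^2} \Rightarrow 200,\ 50

Dividing by i^2 pushes the secondary sibling's priority down much faster, so asymmetric packing favors a separate physical core over the SMT sibling of a high-priority core.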


@ -43,6 +43,7 @@
#include <asm/io_bitmap.h>
#include <asm/proto.h>
#include <asm/frame.h>
#include <asm/unwind.h>
#include "process.h"
@ -942,60 +943,22 @@ unsigned long arch_randomize_brk(struct mm_struct *mm)
* because the task might wake up and we might look at a stack
* changing under us.
*/
unsigned long get_wchan(struct task_struct *p)
unsigned long __get_wchan(struct task_struct *p)
{
unsigned long start, bottom, top, sp, fp, ip, ret = 0;
int count = 0;
struct unwind_state state;
unsigned long addr = 0;
if (p == current || task_is_running(p))
return 0;
for (unwind_start(&state, p, NULL, NULL); !unwind_done(&state);
unwind_next_frame(&state)) {
addr = unwind_get_return_address(&state);
if (!addr)
break;
if (in_sched_functions(addr))
continue;
break;
}
if (!try_get_task_stack(p))
return 0;
start = (unsigned long)task_stack_page(p);
if (!start)
goto out;
/*
* Layout of the stack page:
*
* ----------- topmax = start + THREAD_SIZE - sizeof(unsigned long)
* PADDING
* ----------- top = topmax - TOP_OF_KERNEL_STACK_PADDING
* stack
* ----------- bottom = start
*
* The tasks stack pointer points at the location where the
* framepointer is stored. The data on the stack is:
* ... IP FP ... IP FP
*
* We need to read FP and IP, so we need to adjust the upper
* bound by another unsigned long.
*/
top = start + THREAD_SIZE - TOP_OF_KERNEL_STACK_PADDING;
top -= 2 * sizeof(unsigned long);
bottom = start;
sp = READ_ONCE(p->thread.sp);
if (sp < bottom || sp > top)
goto out;
fp = READ_ONCE_NOCHECK(((struct inactive_task_frame *)sp)->bp);
do {
if (fp < bottom || fp > top)
goto out;
ip = READ_ONCE_NOCHECK(*(unsigned long *)(fp + sizeof(unsigned long)));
if (!in_sched_functions(ip)) {
ret = ip;
goto out;
}
fp = READ_ONCE_NOCHECK(*(unsigned long *)fp);
} while (count++ < 16 && !task_is_running(p));
out:
put_task_stack(p);
return ret;
return addr;
}
long do_arch_prctl_common(struct task_struct *task, int option,


@ -101,6 +101,8 @@ EXPORT_PER_CPU_SYMBOL(cpu_die_map);
DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_llc_shared_map);
DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_l2c_shared_map);
/* Per CPU bogomips and other parameters */
DEFINE_PER_CPU_READ_MOSTLY(struct cpuinfo_x86, cpu_info);
EXPORT_PER_CPU_SYMBOL(cpu_info);
@ -464,6 +466,21 @@ static bool match_die(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
return false;
}
static bool match_l2c(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
{
int cpu1 = c->cpu_index, cpu2 = o->cpu_index;
/* If the arch didn't set up l2c_id, fall back to SMT */
if (per_cpu(cpu_l2c_id, cpu1) == BAD_APICID)
return match_smt(c, o);
/* Do not match if L2 cache id does not match: */
if (per_cpu(cpu_l2c_id, cpu1) != per_cpu(cpu_l2c_id, cpu2))
return false;
return topology_sane(c, o, "l2c");
}
/*
* Unlike the other levels, we do not enforce keeping a
* multicore group inside a NUMA node. If this happens, we will
@ -523,7 +540,7 @@ static bool match_llc(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
}
#if defined(CONFIG_SCHED_SMT) || defined(CONFIG_SCHED_MC)
#if defined(CONFIG_SCHED_SMT) || defined(CONFIG_SCHED_CLUSTER) || defined(CONFIG_SCHED_MC)
static inline int x86_sched_itmt_flags(void)
{
return sysctl_sched_itmt_enabled ? SD_ASYM_PACKING : 0;
@ -541,12 +558,21 @@ static int x86_smt_flags(void)
return cpu_smt_flags() | x86_sched_itmt_flags();
}
#endif
#ifdef CONFIG_SCHED_CLUSTER
static int x86_cluster_flags(void)
{
return cpu_cluster_flags() | x86_sched_itmt_flags();
}
#endif
#endif
static struct sched_domain_topology_level x86_numa_in_package_topology[] = {
#ifdef CONFIG_SCHED_SMT
{ cpu_smt_mask, x86_smt_flags, SD_INIT_NAME(SMT) },
#endif
#ifdef CONFIG_SCHED_CLUSTER
{ cpu_clustergroup_mask, x86_cluster_flags, SD_INIT_NAME(CLS) },
#endif
#ifdef CONFIG_SCHED_MC
{ cpu_coregroup_mask, x86_core_flags, SD_INIT_NAME(MC) },
#endif
@ -557,6 +583,9 @@ static struct sched_domain_topology_level x86_topology[] = {
#ifdef CONFIG_SCHED_SMT
{ cpu_smt_mask, x86_smt_flags, SD_INIT_NAME(SMT) },
#endif
#ifdef CONFIG_SCHED_CLUSTER
{ cpu_clustergroup_mask, x86_cluster_flags, SD_INIT_NAME(CLS) },
#endif
#ifdef CONFIG_SCHED_MC
{ cpu_coregroup_mask, x86_core_flags, SD_INIT_NAME(MC) },
#endif
@ -584,6 +613,7 @@ void set_cpu_sibling_map(int cpu)
if (!has_mp) {
cpumask_set_cpu(cpu, topology_sibling_cpumask(cpu));
cpumask_set_cpu(cpu, cpu_llc_shared_mask(cpu));
cpumask_set_cpu(cpu, cpu_l2c_shared_mask(cpu));
cpumask_set_cpu(cpu, topology_core_cpumask(cpu));
cpumask_set_cpu(cpu, topology_die_cpumask(cpu));
c->booted_cores = 1;
@ -602,6 +632,9 @@ void set_cpu_sibling_map(int cpu)
if ((i == cpu) || (has_mp && match_llc(c, o)))
link_mask(cpu_llc_shared_mask, cpu, i);
if ((i == cpu) || (has_mp && match_l2c(c, o)))
link_mask(cpu_l2c_shared_mask, cpu, i);
if ((i == cpu) || (has_mp && match_die(c, o)))
link_mask(topology_die_cpumask, cpu, i);
}
@ -652,6 +685,11 @@ const struct cpumask *cpu_coregroup_mask(int cpu)
return cpu_llc_shared_mask(cpu);
}
const struct cpumask *cpu_clustergroup_mask(int cpu)
{
return cpu_l2c_shared_mask(cpu);
}
static void impress_friends(void)
{
int cpu;
@ -1335,6 +1373,7 @@ void __init native_smp_prepare_cpus(unsigned int max_cpus)
zalloc_cpumask_var(&per_cpu(cpu_core_map, i), GFP_KERNEL);
zalloc_cpumask_var(&per_cpu(cpu_die_map, i), GFP_KERNEL);
zalloc_cpumask_var(&per_cpu(cpu_llc_shared_map, i), GFP_KERNEL);
zalloc_cpumask_var(&per_cpu(cpu_l2c_shared_map, i), GFP_KERNEL);
}
/*
@ -1564,7 +1603,10 @@ static void remove_siblinginfo(int cpu)
for_each_cpu(sibling, cpu_llc_shared_mask(cpu))
cpumask_clear_cpu(cpu, cpu_llc_shared_mask(sibling));
for_each_cpu(sibling, cpu_l2c_shared_mask(cpu))
cpumask_clear_cpu(cpu, cpu_l2c_shared_mask(sibling));
cpumask_clear(cpu_llc_shared_mask(cpu));
cpumask_clear(cpu_l2c_shared_mask(cpu));
cpumask_clear(topology_sibling_cpumask(cpu));
cpumask_clear(topology_core_cpumask(cpu));
cpumask_clear(topology_die_cpumask(cpu));


@ -215,7 +215,7 @@ struct mm_struct;
/* Free all resources held by a thread. */
#define release_thread(thread) do { } while(0)
extern unsigned long get_wchan(struct task_struct *p);
extern unsigned long __get_wchan(struct task_struct *p);
#define KSTK_EIP(tsk) (task_pt_regs(tsk)->pc)
#define KSTK_ESP(tsk) (task_pt_regs(tsk)->areg[1])


@ -298,15 +298,12 @@ int copy_thread(unsigned long clone_flags, unsigned long usp_thread_fn,
* These bracket the sleeping functions..
*/
unsigned long get_wchan(struct task_struct *p)
unsigned long __get_wchan(struct task_struct *p)
{
unsigned long sp, pc;
unsigned long stack_page = (unsigned long) task_stack_page(p);
int count = 0;
if (!p || p == current || task_is_running(p))
return 0;
sp = p->thread.sp;
pc = MAKE_PC_FROM_RA(p->thread.ra, p->thread.sp);


@ -746,6 +746,73 @@ int find_acpi_cpu_topology_package(unsigned int cpu)
ACPI_PPTT_PHYSICAL_PACKAGE);
}
/**
* find_acpi_cpu_topology_cluster() - Determine a unique CPU cluster value
* @cpu: Kernel logical CPU number
*
* Determine a topology unique cluster ID for the given CPU/thread.
* This ID can then be used to group peers, which will have matching ids.
*
* The cluster, if present, is the level of topology above CPUs. In a
* multi-thread CPU, it will be the level above the CPU, not the thread.
* It may not exist in single CPU systems. In simple multi-CPU systems,
* it may be equal to the package topology level.
*
* Return: -ENOENT if the PPTT doesn't exist, the CPU cannot be found,
* or there is no topology level above the CPU.
* Otherwise returns a value which represents the package for this CPU.
*/
int find_acpi_cpu_topology_cluster(unsigned int cpu)
{
struct acpi_table_header *table;
acpi_status status;
struct acpi_pptt_processor *cpu_node, *cluster_node;
u32 acpi_cpu_id;
int retval;
int is_thread;
status = acpi_get_table(ACPI_SIG_PPTT, 0, &table);
if (ACPI_FAILURE(status)) {
acpi_pptt_warn_missing();
return -ENOENT;
}
acpi_cpu_id = get_acpi_id_for_cpu(cpu);
cpu_node = acpi_find_processor_node(table, acpi_cpu_id);
if (cpu_node == NULL || !cpu_node->parent) {
retval = -ENOENT;
goto put_table;
}
is_thread = cpu_node->flags & ACPI_PPTT_ACPI_PROCESSOR_IS_THREAD;
cluster_node = fetch_pptt_node(table, cpu_node->parent);
if (cluster_node == NULL) {
retval = -ENOENT;
goto put_table;
}
if (is_thread) {
if (!cluster_node->parent) {
retval = -ENOENT;
goto put_table;
}
cluster_node = fetch_pptt_node(table, cluster_node->parent);
if (cluster_node == NULL) {
retval = -ENOENT;
goto put_table;
}
}
if (cluster_node->flags & ACPI_PPTT_ACPI_PROCESSOR_ID_VALID)
retval = cluster_node->acpi_processor_id;
else
retval = ACPI_PTR_DIFF(cluster_node, table);
put_table:
acpi_put_table(table);
return retval;
}
/**
* find_acpi_cpu_topology_hetero_id() - Get a core architecture tag
* @cpu: Kernel logical CPU number


@ -600,6 +600,11 @@ const struct cpumask *cpu_coregroup_mask(int cpu)
return core_mask;
}
const struct cpumask *cpu_clustergroup_mask(int cpu)
{
return &cpu_topology[cpu].cluster_sibling;
}
void update_siblings_masks(unsigned int cpuid)
{
struct cpu_topology *cpu_topo, *cpuid_topo = &cpu_topology[cpuid];
@ -617,6 +622,12 @@ void update_siblings_masks(unsigned int cpuid)
if (cpuid_topo->package_id != cpu_topo->package_id)
continue;
if (cpuid_topo->cluster_id == cpu_topo->cluster_id &&
cpuid_topo->cluster_id != -1) {
cpumask_set_cpu(cpu, &cpuid_topo->cluster_sibling);
cpumask_set_cpu(cpuid, &cpu_topo->cluster_sibling);
}
cpumask_set_cpu(cpuid, &cpu_topo->core_sibling);
cpumask_set_cpu(cpu, &cpuid_topo->core_sibling);
@ -635,6 +646,9 @@ static void clear_cpu_topology(int cpu)
cpumask_clear(&cpu_topo->llc_sibling);
cpumask_set_cpu(cpu, &cpu_topo->llc_sibling);
cpumask_clear(&cpu_topo->cluster_sibling);
cpumask_set_cpu(cpu, &cpu_topo->cluster_sibling);
cpumask_clear(&cpu_topo->core_sibling);
cpumask_set_cpu(cpu, &cpu_topo->core_sibling);
cpumask_clear(&cpu_topo->thread_sibling);
@ -650,6 +664,7 @@ void __init reset_cpu_topology(void)
cpu_topo->thread_id = -1;
cpu_topo->core_id = -1;
cpu_topo->cluster_id = -1;
cpu_topo->package_id = -1;
cpu_topo->llc_id = -1;


@ -48,6 +48,9 @@ static DEVICE_ATTR_RO(physical_package_id);
define_id_show_func(die_id);
static DEVICE_ATTR_RO(die_id);
define_id_show_func(cluster_id);
static DEVICE_ATTR_RO(cluster_id);
define_id_show_func(core_id);
static DEVICE_ATTR_RO(core_id);
@ -63,6 +66,10 @@ define_siblings_read_func(core_siblings, core_cpumask);
static BIN_ATTR_RO(core_siblings, 0);
static BIN_ATTR_RO(core_siblings_list, 0);
define_siblings_read_func(cluster_cpus, cluster_cpumask);
static BIN_ATTR_RO(cluster_cpus, 0);
static BIN_ATTR_RO(cluster_cpus_list, 0);
define_siblings_read_func(die_cpus, die_cpumask);
static BIN_ATTR_RO(die_cpus, 0);
static BIN_ATTR_RO(die_cpus_list, 0);
@ -94,6 +101,8 @@ static struct bin_attribute *bin_attrs[] = {
&bin_attr_thread_siblings_list,
&bin_attr_core_siblings,
&bin_attr_core_siblings_list,
&bin_attr_cluster_cpus,
&bin_attr_cluster_cpus_list,
&bin_attr_die_cpus,
&bin_attr_die_cpus_list,
&bin_attr_package_cpus,
@ -112,6 +121,7 @@ static struct bin_attribute *bin_attrs[] = {
static struct attribute *default_attrs[] = {
&dev_attr_physical_package_id.attr,
&dev_attr_die_id.attr,
&dev_attr_cluster_id.attr,
&dev_attr_core_id.attr,
#ifdef CONFIG_SCHED_BOOK
&dev_attr_book_id.attr,


@ -541,7 +541,7 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
}
if (permitted && (!whole || num_threads < 2))
wchan = get_wchan(task);
wchan = !task_is_running(task);
if (!whole) {
min_flt = task->min_flt;
maj_flt = task->maj_flt;
@ -606,10 +606,7 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
*
* This works with older implementations of procps as well.
*/
if (wchan)
seq_puts(m, " 1");
else
seq_puts(m, " 0");
seq_put_decimal_ull(m, " ", wchan);
seq_put_decimal_ull(m, " ", 0);
seq_put_decimal_ull(m, " ", 0);


@ -67,6 +67,7 @@
#include <linux/mm.h>
#include <linux/swap.h>
#include <linux/rcupdate.h>
#include <linux/kallsyms.h>
#include <linux/stacktrace.h>
#include <linux/resource.h>
#include <linux/module.h>
@ -386,17 +387,19 @@ static int proc_pid_wchan(struct seq_file *m, struct pid_namespace *ns,
struct pid *pid, struct task_struct *task)
{
unsigned long wchan;
char symname[KSYM_NAME_LEN];
if (ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS))
wchan = get_wchan(task);
else
wchan = 0;
if (!ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS))
goto print0;
if (wchan)
seq_printf(m, "%ps", (void *) wchan);
else
seq_putc(m, '0');
wchan = get_wchan(task);
if (wchan && !lookup_symbol_name(wchan, symname)) {
seq_puts(m, symname);
return 0;
}
print0:
seq_putc(m, '0');
return 0;
}
#endif /* CONFIG_KALLSYMS */


@ -24,7 +24,7 @@
#ifdef arch_idle_time
static u64 get_idle_time(struct kernel_cpustat *kcs, int cpu)
u64 get_idle_time(struct kernel_cpustat *kcs, int cpu)
{
u64 idle;
@ -46,7 +46,7 @@ static u64 get_iowait_time(struct kernel_cpustat *kcs, int cpu)
#else
static u64 get_idle_time(struct kernel_cpustat *kcs, int cpu)
u64 get_idle_time(struct kernel_cpustat *kcs, int cpu)
{
u64 idle, idle_usecs = -1ULL;


@ -12,18 +12,22 @@ static int uptime_proc_show(struct seq_file *m, void *v)
{
struct timespec64 uptime;
struct timespec64 idle;
u64 nsec;
u64 idle_nsec;
u32 rem;
int i;
nsec = 0;
for_each_possible_cpu(i)
nsec += (__force u64) kcpustat_cpu(i).cpustat[CPUTIME_IDLE];
idle_nsec = 0;
for_each_possible_cpu(i) {
struct kernel_cpustat kcs;
kcpustat_cpu_fetch(&kcs, i);
idle_nsec += get_idle_time(&kcs, i);
}
ktime_get_boottime_ts64(&uptime);
timens_add_boottime(&uptime);
idle.tv_sec = div_u64_rem(nsec, NSEC_PER_SEC, &rem);
idle.tv_sec = div_u64_rem(idle_nsec, NSEC_PER_SEC, &rem);
idle.tv_nsec = rem;
seq_printf(m, "%lu.%02lu %lu.%02lu\n",
(unsigned long) uptime.tv_sec,


@ -1353,6 +1353,7 @@ static inline int lpit_read_residency_count_address(u64 *address)
#ifdef CONFIG_ACPI_PPTT
int acpi_pptt_cpu_is_thread(unsigned int cpu);
int find_acpi_cpu_topology(unsigned int cpu, int level);
int find_acpi_cpu_topology_cluster(unsigned int cpu);
int find_acpi_cpu_topology_package(unsigned int cpu);
int find_acpi_cpu_topology_hetero_id(unsigned int cpu);
int find_acpi_cpu_cache_topology(unsigned int cpu, int level);
@ -1365,6 +1366,10 @@ static inline int find_acpi_cpu_topology(unsigned int cpu, int level)
{
return -EINVAL;
}
static inline int find_acpi_cpu_topology_cluster(unsigned int cpu)
{
return -EINVAL;
}
static inline int find_acpi_cpu_topology_package(unsigned int cpu)
{
return -EINVAL;


@ -62,10 +62,12 @@ void topology_set_thermal_pressure(const struct cpumask *cpus,
struct cpu_topology {
int thread_id;
int core_id;
int cluster_id;
int package_id;
int llc_id;
cpumask_t thread_sibling;
cpumask_t core_sibling;
cpumask_t cluster_sibling;
cpumask_t llc_sibling;
};
@ -73,13 +75,16 @@ struct cpu_topology {
extern struct cpu_topology cpu_topology[NR_CPUS];
#define topology_physical_package_id(cpu) (cpu_topology[cpu].package_id)
#define topology_cluster_id(cpu) (cpu_topology[cpu].cluster_id)
#define topology_core_id(cpu) (cpu_topology[cpu].core_id)
#define topology_core_cpumask(cpu) (&cpu_topology[cpu].core_sibling)
#define topology_sibling_cpumask(cpu) (&cpu_topology[cpu].thread_sibling)
#define topology_cluster_cpumask(cpu) (&cpu_topology[cpu].cluster_sibling)
#define topology_llc_cpumask(cpu) (&cpu_topology[cpu].llc_sibling)
void init_cpu_topology(void);
void store_cpu_topology(unsigned int cpuid);
const struct cpumask *cpu_coregroup_mask(int cpu);
const struct cpumask *cpu_clustergroup_mask(int cpu);
void update_siblings_masks(unsigned int cpu);
void remove_cpu_topology(unsigned int cpuid);
void reset_cpu_topology(void);


@ -3,6 +3,7 @@
#define _LINUX_IRQ_WORK_H
#include <linux/smp_types.h>
#include <linux/rcuwait.h>
/*
* An entry can be in one of four states:
@ -16,11 +17,13 @@
struct irq_work {
struct __call_single_node node;
void (*func)(struct irq_work *);
struct rcuwait irqwait;
};
#define __IRQ_WORK_INIT(_func, _flags) (struct irq_work){ \
.node = { .u_flags = (_flags), }, \
.func = (_func), \
.irqwait = __RCUWAIT_INITIALIZER(irqwait), \
}
#define IRQ_WORK_INIT(_func) __IRQ_WORK_INIT(_func, 0)
@ -46,6 +49,11 @@ static inline bool irq_work_is_busy(struct irq_work *work)
return atomic_read(&work->node.a_flags) & IRQ_WORK_BUSY;
}
static inline bool irq_work_is_hard(struct irq_work *work)
{
return atomic_read(&work->node.a_flags) & IRQ_WORK_HARD_IRQ;
}
bool irq_work_queue(struct irq_work *work);
bool irq_work_queue_on(struct irq_work *work, int cpu);


@ -102,6 +102,7 @@ extern void account_system_index_time(struct task_struct *, u64,
enum cpu_usage_stat);
extern void account_steal_time(u64);
extern void account_idle_time(u64);
extern u64 get_idle_time(struct kernel_cpustat *kcs, int cpu);
#ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
static inline void account_process_tick(struct task_struct *tsk, int user)


@ -12,6 +12,7 @@
#include <linux/completion.h>
#include <linux/cpumask.h>
#include <linux/uprobes.h>
#include <linux/rcupdate.h>
#include <linux/page-flags-layout.h>
#include <linux/workqueue.h>
#include <linux/seqlock.h>
@ -649,6 +650,9 @@ struct mm_struct {
bool tlb_flush_batched;
#endif
struct uprobes_state uprobes_state;
#ifdef CONFIG_PREEMPT_RT
struct rcu_head delayed_drop;
#endif
#ifdef CONFIG_HUGETLB_PAGE
atomic_long_t hugetlb_usage;
#endif


@ -503,6 +503,8 @@ struct sched_statistics {
u64 block_start;
u64 block_max;
s64 sum_block_runtime;
u64 exec_max;
u64 slice_max;
@ -522,7 +524,7 @@ struct sched_statistics {
u64 nr_wakeups_passive;
u64 nr_wakeups_idle;
#endif
};
} ____cacheline_aligned;
struct sched_entity {
/* For load-balancing: */
@ -538,8 +540,6 @@ struct sched_entity {
u64 nr_migrations;
struct sched_statistics statistics;
#ifdef CONFIG_FAIR_GROUP_SCHED
int depth;
struct sched_entity *parent;
@ -775,10 +775,10 @@ struct task_struct {
int normal_prio;
unsigned int rt_priority;
const struct sched_class *sched_class;
struct sched_entity se;
struct sched_rt_entity rt;
struct sched_dl_entity dl;
const struct sched_class *sched_class;
#ifdef CONFIG_SCHED_CORE
struct rb_node core_node;
@ -803,6 +803,8 @@ struct task_struct {
struct uclamp_se uclamp[UCLAMP_CNT];
#endif
struct sched_statistics stats;
#ifdef CONFIG_PREEMPT_NOTIFIERS
/* List of struct preempt_notifier: */
struct hlist_head preempt_notifiers;
@ -2154,6 +2156,7 @@ static inline void set_task_cpu(struct task_struct *p, unsigned int cpu)
#endif /* CONFIG_SMP */
extern bool sched_task_on_rq(struct task_struct *p);
extern unsigned long get_wchan(struct task_struct *p);
/*
* In order to reduce various lock holder preemption latencies provide an


@ -11,7 +11,11 @@ enum cpu_idle_type {
CPU_MAX_IDLE_TYPES
};
#ifdef CONFIG_SMP
extern void wake_up_if_idle(int cpu);
#else
static inline void wake_up_if_idle(int cpu) { }
#endif
/*
* Idle thread specific functions to determine the need_resched


@ -49,6 +49,35 @@ static inline void mmdrop(struct mm_struct *mm)
__mmdrop(mm);
}
#ifdef CONFIG_PREEMPT_RT
/*
* RCU callback for delayed mm drop. Not strictly RCU, but call_rcu() is
* by far the least expensive way to do that.
*/
static inline void __mmdrop_delayed(struct rcu_head *rhp)
{
struct mm_struct *mm = container_of(rhp, struct mm_struct, delayed_drop);
__mmdrop(mm);
}
/*
* Invoked from finish_task_switch(). Delegates the heavy lifting on RT
* kernels via RCU.
*/
static inline void mmdrop_sched(struct mm_struct *mm)
{
/* Provides a full memory barrier. See mmdrop() */
if (atomic_dec_and_test(&mm->mm_count))
call_rcu(&mm->delayed_drop, __mmdrop_delayed);
}
#else
static inline void mmdrop_sched(struct mm_struct *mm)
{
mmdrop(mm);
}
#endif
/**
* mmget() - Pin the address space associated with a &struct mm_struct.
* @mm: The address space to pin.


@ -54,7 +54,8 @@ extern asmlinkage void schedule_tail(struct task_struct *prev);
extern void init_idle(struct task_struct *idle, int cpu);
extern int sched_fork(unsigned long clone_flags, struct task_struct *p);
extern void sched_post_fork(struct task_struct *p);
extern void sched_post_fork(struct task_struct *p,
struct kernel_clone_args *kargs);
extern void sched_dead(struct task_struct *p);
void __noreturn do_task_dead(void);


@ -42,6 +42,13 @@ static inline int cpu_smt_flags(void)
}
#endif
#ifdef CONFIG_SCHED_CLUSTER
static inline int cpu_cluster_flags(void)
{
return SD_SHARE_PKG_RESOURCES;
}
#endif
#ifdef CONFIG_SCHED_MC
static inline int cpu_core_flags(void)
{
@ -98,7 +105,7 @@ struct sched_domain {
/* idle_balance() stats */
u64 max_newidle_lb_cost;
unsigned long next_decay_max_lb_cost;
unsigned long last_decay_max_lb_cost;
u64 avg_scan_cost; /* select_idle_sibling */


@ -186,6 +186,9 @@ static inline int cpu_to_mem(int cpu)
#ifndef topology_die_id
#define topology_die_id(cpu) ((void)(cpu), -1)
#endif
#ifndef topology_cluster_id
#define topology_cluster_id(cpu) ((void)(cpu), -1)
#endif
#ifndef topology_core_id
#define topology_core_id(cpu) ((void)(cpu), 0)
#endif
@ -195,6 +198,9 @@ static inline int cpu_to_mem(int cpu)
#ifndef topology_core_cpumask
#define topology_core_cpumask(cpu) cpumask_of(cpu)
#endif
#ifndef topology_cluster_cpumask
#define topology_cluster_cpumask(cpu) cpumask_of(cpu)
#endif
#ifndef topology_die_cpumask
#define topology_die_cpumask(cpu) cpumask_of(cpu)
#endif
@ -206,6 +212,13 @@ static inline const struct cpumask *cpu_smt_mask(int cpu)
}
#endif
#if defined(CONFIG_SCHED_CLUSTER) && !defined(cpu_cluster_mask)
static inline const struct cpumask *cpu_cluster_mask(int cpu)
{
return topology_cluster_cpumask(cpu);
}
#endif
static inline const struct cpumask *cpu_cpu_mask(int cpu)
{
return cpumask_of_node(cpu_to_node(cpu));


@ -1160,6 +1160,7 @@ int autoremove_wake_function(struct wait_queue_entry *wq_entry, unsigned mode, i
(wait)->flags = 0; \
} while (0)
bool try_invoke_on_locked_down_task(struct task_struct *p, bool (*func)(struct task_struct *t, void *arg), void *arg);
typedef int (*task_call_f)(struct task_struct *p, void *arg);
extern int task_call_func(struct task_struct *p, task_call_f func, void *arg);
#endif /* _LINUX_WAIT_H */


@ -2,10 +2,11 @@
choice
prompt "Preemption Model"
default PREEMPT_NONE
default PREEMPT_NONE_BEHAVIOUR
config PREEMPT_NONE
config PREEMPT_NONE_BEHAVIOUR
bool "No Forced Preemption (Server)"
select PREEMPT_NONE if !PREEMPT_DYNAMIC
help
This is the traditional Linux preemption model, geared towards
throughput. It will still provide good latencies most of the
@ -17,9 +18,10 @@ config PREEMPT_NONE
raw processing power of the kernel, irrespective of scheduling
latencies.
config PREEMPT_VOLUNTARY
config PREEMPT_VOLUNTARY_BEHAVIOUR
bool "Voluntary Kernel Preemption (Desktop)"
depends on !ARCH_NO_PREEMPT
select PREEMPT_VOLUNTARY if !PREEMPT_DYNAMIC
help
This option reduces the latency of the kernel by adding more
"explicit preemption points" to the kernel code. These new
@ -35,12 +37,10 @@ config PREEMPT_VOLUNTARY
Select this if you are building a kernel for a desktop system.
config PREEMPT
config PREEMPT_BEHAVIOUR
bool "Preemptible Kernel (Low-Latency Desktop)"
depends on !ARCH_NO_PREEMPT
select PREEMPTION
select UNINLINE_SPIN_UNLOCK if !ARCH_INLINE_SPIN_UNLOCK
select PREEMPT_DYNAMIC if HAVE_PREEMPT_DYNAMIC
select PREEMPT
help
This option reduces the latency of the kernel by making
all kernel code (that is not executing in a critical section)
@ -58,7 +58,7 @@ config PREEMPT
config PREEMPT_RT
bool "Fully Preemptible Kernel (Real-Time)"
depends on EXPERT && ARCH_SUPPORTS_RT
depends on EXPERT && ARCH_SUPPORTS_RT && !PREEMPT_DYNAMIC
select PREEMPTION
help
This option turns the kernel into a real-time kernel by replacing
@ -75,6 +75,17 @@ config PREEMPT_RT
endchoice
config PREEMPT_NONE
bool
config PREEMPT_VOLUNTARY
bool
config PREEMPT
bool
select PREEMPTION
select UNINLINE_SPIN_UNLOCK if !ARCH_INLINE_SPIN_UNLOCK
config PREEMPT_COUNT
bool
@ -83,7 +94,10 @@ config PREEMPTION
select PREEMPT_COUNT
config PREEMPT_DYNAMIC
bool
bool "Preemption behaviour defined on boot"
depends on HAVE_PREEMPT_DYNAMIC
select PREEMPT
default y
help
This option allows defining the preemption model on the kernel
command line and thus overriding the default preemption
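In practice (illustrative, not part of the diff) a PREEMPT_DYNAMIC=y kernel treats the *_BEHAVIOUR choice above only as its default; the effective model is chosen at boot through the existing preempt= parameter, whose __setup("preempt=", ...) handler is visible in the scheduler core hunks further down:

        preempt=none            # no forced preemption (server)
        preempt=voluntary       # voluntary preemption points (desktop)
        preempt=full            # fully preemptible kernel, except critical sections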


@ -63,6 +63,7 @@
#include <linux/rcuwait.h>
#include <linux/compat.h>
#include <linux/io_uring.h>
#include <linux/kprobes.h>
#include <linux/uaccess.h>
#include <asm/unistd.h>
@ -167,6 +168,7 @@ static void delayed_put_task_struct(struct rcu_head *rhp)
{
struct task_struct *tsk = container_of(rhp, struct task_struct, rcu);
kprobe_flush_task(tsk);
perf_event_delayed_put(tsk);
trace_sched_process_free(tsk);
put_task_struct(tsk);


@ -2404,7 +2404,7 @@ static __latent_entropy struct task_struct *copy_process(
write_unlock_irq(&tasklist_lock);
proc_fork_connector(p);
sched_post_fork(p);
sched_post_fork(p, args);
cgroup_post_fork(p, args);
perf_event_fork(p);


@ -18,11 +18,36 @@
#include <linux/cpu.h>
#include <linux/notifier.h>
#include <linux/smp.h>
#include <linux/smpboot.h>
#include <asm/processor.h>
#include <linux/kasan.h>
static DEFINE_PER_CPU(struct llist_head, raised_list);
static DEFINE_PER_CPU(struct llist_head, lazy_list);
static DEFINE_PER_CPU(struct task_struct *, irq_workd);
static void wake_irq_workd(void)
{
struct task_struct *tsk = __this_cpu_read(irq_workd);
if (!llist_empty(this_cpu_ptr(&lazy_list)) && tsk)
wake_up_process(tsk);
}
#ifdef CONFIG_SMP
static void irq_work_wake(struct irq_work *entry)
{
wake_irq_workd();
}
static DEFINE_PER_CPU(struct irq_work, irq_work_wakeup) =
IRQ_WORK_INIT_HARD(irq_work_wake);
#endif
static int irq_workd_should_run(unsigned int cpu)
{
return !llist_empty(this_cpu_ptr(&lazy_list));
}
/*
* Claim the entry so that no one else will poke at it.
@ -52,15 +77,29 @@ void __weak arch_irq_work_raise(void)
/* Enqueue on current CPU, work must already be claimed and preempt disabled */
static void __irq_work_queue_local(struct irq_work *work)
{
struct llist_head *list;
bool rt_lazy_work = false;
bool lazy_work = false;
int work_flags;
work_flags = atomic_read(&work->node.a_flags);
if (work_flags & IRQ_WORK_LAZY)
lazy_work = true;
else if (IS_ENABLED(CONFIG_PREEMPT_RT) &&
!(work_flags & IRQ_WORK_HARD_IRQ))
rt_lazy_work = true;
if (lazy_work || rt_lazy_work)
list = this_cpu_ptr(&lazy_list);
else
list = this_cpu_ptr(&raised_list);
if (!llist_add(&work->node.llist, list))
return;
/* If the work is "lazy", handle it from next tick if any */
if (atomic_read(&work->node.a_flags) & IRQ_WORK_LAZY) {
if (llist_add(&work->node.llist, this_cpu_ptr(&lazy_list)) &&
tick_nohz_tick_stopped())
arch_irq_work_raise();
} else {
if (llist_add(&work->node.llist, this_cpu_ptr(&raised_list)))
arch_irq_work_raise();
}
if (!lazy_work || tick_nohz_tick_stopped())
arch_irq_work_raise();
}
/* Enqueue the irq work @work on the current CPU */
@ -104,17 +143,34 @@ bool irq_work_queue_on(struct irq_work *work, int cpu)
if (cpu != smp_processor_id()) {
/* Arch remote IPI send/receive backend aren't NMI safe */
WARN_ON_ONCE(in_nmi());
/*
* On PREEMPT_RT the items which are not marked as
* IRQ_WORK_HARD_IRQ are added to the lazy list and a HARD work
* item is used on the remote CPU to wake the thread.
*/
if (IS_ENABLED(CONFIG_PREEMPT_RT) &&
!(atomic_read(&work->node.a_flags) & IRQ_WORK_HARD_IRQ)) {
if (!llist_add(&work->node.llist, &per_cpu(lazy_list, cpu)))
goto out;
work = &per_cpu(irq_work_wakeup, cpu);
if (!irq_work_claim(work))
goto out;
}
__smp_call_single_queue(cpu, &work->node.llist);
} else {
__irq_work_queue_local(work);
}
out:
preempt_enable();
return true;
#endif /* CONFIG_SMP */
}
bool irq_work_needs_cpu(void)
{
struct llist_head *raised, *lazy;
@ -160,6 +216,10 @@ void irq_work_single(void *arg)
* else claimed it meanwhile.
*/
(void)atomic_cmpxchg(&work->node.a_flags, flags, flags & ~IRQ_WORK_BUSY);
if ((IS_ENABLED(CONFIG_PREEMPT_RT) && !irq_work_is_hard(work)) ||
!arch_irq_work_has_interrupt())
rcuwait_wake_up(&work->irqwait);
}
static void irq_work_run_list(struct llist_head *list)
@ -167,7 +227,12 @@ static void irq_work_run_list(struct llist_head *list)
struct irq_work *work, *tmp;
struct llist_node *llnode;
BUG_ON(!irqs_disabled());
/*
* On PREEMPT_RT IRQ-work which is not marked as HARD will be processed
* in a per-CPU thread in preemptible context. Only the items which are
* marked as IRQ_WORK_HARD_IRQ will be processed in hardirq context.
*/
BUG_ON(!irqs_disabled() && !IS_ENABLED(CONFIG_PREEMPT_RT));
if (llist_empty(list))
return;
@ -184,7 +249,10 @@ static void irq_work_run_list(struct llist_head *list)
void irq_work_run(void)
{
irq_work_run_list(this_cpu_ptr(&raised_list));
irq_work_run_list(this_cpu_ptr(&lazy_list));
if (!IS_ENABLED(CONFIG_PREEMPT_RT))
irq_work_run_list(this_cpu_ptr(&lazy_list));
else
wake_irq_workd();
}
EXPORT_SYMBOL_GPL(irq_work_run);
@ -194,7 +262,11 @@ void irq_work_tick(void)
if (!llist_empty(raised) && !arch_irq_work_has_interrupt())
irq_work_run_list(raised);
irq_work_run_list(this_cpu_ptr(&lazy_list));
if (!IS_ENABLED(CONFIG_PREEMPT_RT))
irq_work_run_list(this_cpu_ptr(&lazy_list));
else
wake_irq_workd();
}
/*
@ -204,8 +276,42 @@ void irq_work_tick(void)
void irq_work_sync(struct irq_work *work)
{
lockdep_assert_irqs_enabled();
might_sleep();
if ((IS_ENABLED(CONFIG_PREEMPT_RT) && !irq_work_is_hard(work)) ||
!arch_irq_work_has_interrupt()) {
rcuwait_wait_event(&work->irqwait, !irq_work_is_busy(work),
TASK_UNINTERRUPTIBLE);
return;
}
while (irq_work_is_busy(work))
cpu_relax();
}
EXPORT_SYMBOL_GPL(irq_work_sync);
static void run_irq_workd(unsigned int cpu)
{
irq_work_run_list(this_cpu_ptr(&lazy_list));
}
static void irq_workd_setup(unsigned int cpu)
{
sched_set_fifo_low(current);
}
static struct smp_hotplug_thread irqwork_threads = {
.store = &irq_workd,
.setup = irq_workd_setup,
.thread_should_run = irq_workd_should_run,
.thread_fn = run_irq_workd,
.thread_comm = "irq_work/%u",
};
static __init int irq_work_init_threads(void)
{
if (IS_ENABLED(CONFIG_PREEMPT_RT))
BUG_ON(smpboot_register_percpu_thread(&irqwork_threads));
return 0;
}
early_initcall(irq_work_init_threads);
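To make the split concrete (a sketch, not part of the diff; the callback and variable names are hypothetical): on PREEMPT_RT only items initialized with IRQ_WORK_INIT_HARD() keep running from hard interrupt context, everything else is funnelled onto the lazy list and executed by the irq_work/%u kthread registered above.

        /* Sketch: an irq_work item that must stay in hardirq context on RT. */
        static void my_callback(struct irq_work *work)
        {
                /* runs in hardirq context, even with CONFIG_PREEMPT_RT=y */
        }

        static DEFINE_PER_CPU(struct irq_work, my_work) =
                IRQ_WORK_INIT_HARD(my_callback);

        /* raise it from the event source (typically hardirq/NMI context): */
        irq_work_queue(this_cpu_ptr(&my_work));

An item declared with the plain IRQ_WORK_INIT() initializer would instead be deferred to the per-CPU thread on RT kernels, as implemented in __irq_work_queue_local() above.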


@ -1250,10 +1250,10 @@ void kprobe_busy_end(void)
}
/*
* This function is called from finish_task_switch when task tk becomes dead,
* so that we can recycle any function-return probe instances associated
* with this task. These left over instances represent probed functions
* that have been called but will never return.
* This function is called from delayed_put_task_struct() when a task is
* dead and cleaned up to recycle any function-return probe instances
* associated with this task. These left over instances represent probed
* functions that have been called but will never return.
*/
void kprobe_flush_task(struct task_struct *tk)
{


@ -270,6 +270,7 @@ EXPORT_SYMBOL_GPL(kthread_parkme);
static int kthread(void *_create)
{
static const struct sched_param param = { .sched_priority = 0 };
/* Copy data: it's on kthread's stack */
struct kthread_create_info *create = _create;
int (*threadfn)(void *data) = create->threadfn;
@ -300,6 +301,13 @@ static int kthread(void *_create)
init_completion(&self->parked);
current->vfork_done = &self->exited;
/*
* The new thread inherited kthreadd's priority and CPU mask. Reset
* back to default in case they have been changed.
*/
sched_setscheduler_nocheck(current, SCHED_NORMAL, &param);
set_cpus_allowed_ptr(current, housekeeping_cpumask(HK_FLAG_KTHREAD));
/* OK, tell user we're spawned, wait for stop or wakeup */
__set_current_state(TASK_UNINTERRUPTIBLE);
create->result = current;
@ -397,7 +405,6 @@ struct task_struct *__kthread_create_on_node(int (*threadfn)(void *data),
}
task = create->result;
if (!IS_ERR(task)) {
static const struct sched_param param = { .sched_priority = 0 };
char name[TASK_COMM_LEN];
/*
@ -406,13 +413,6 @@ struct task_struct *__kthread_create_on_node(int (*threadfn)(void *data),
*/
vsnprintf(name, sizeof(name), namefmt, args);
set_task_comm(task, name);
/*
* root may have changed our (kthreadd's) priority or CPU mask.
* The kernel thread should not inherit these properties.
*/
sched_setscheduler_nocheck(task, SCHED_NORMAL, &param);
set_cpus_allowed_ptr(task,
housekeeping_cpumask(HK_FLAG_KTHREAD));
}
kfree(create);
return task;


@ -13,7 +13,6 @@
#include "core.h"
#include "patch.h"
#include "transition.h"
#include "../sched/sched.h"
#define MAX_STACK_ENTRIES 100
#define STACK_ERR_BUF_SIZE 128
@ -240,7 +239,7 @@ static int klp_check_stack_func(struct klp_func *func, unsigned long *entries,
* Determine whether it's safe to transition the task to the target patch state
* by looking for any to-be-patched or to-be-unpatched functions on its stack.
*/
static int klp_check_stack(struct task_struct *task, char *err_buf)
static int klp_check_stack(struct task_struct *task, const char **oldname)
{
static unsigned long entries[MAX_STACK_ENTRIES];
struct klp_object *obj;
@ -248,12 +247,8 @@ static int klp_check_stack(struct task_struct *task, char *err_buf)
int ret, nr_entries;
ret = stack_trace_save_tsk_reliable(task, entries, ARRAY_SIZE(entries));
if (ret < 0) {
snprintf(err_buf, STACK_ERR_BUF_SIZE,
"%s: %s:%d has an unreliable stack\n",
__func__, task->comm, task->pid);
return ret;
}
if (ret < 0)
return -EINVAL;
nr_entries = ret;
klp_for_each_object(klp_transition_patch, obj) {
@ -262,11 +257,8 @@ static int klp_check_stack(struct task_struct *task, char *err_buf)
klp_for_each_func(obj, func) {
ret = klp_check_stack_func(func, entries, nr_entries);
if (ret) {
snprintf(err_buf, STACK_ERR_BUF_SIZE,
"%s: %s:%d is sleeping on function %s\n",
__func__, task->comm, task->pid,
func->old_name);
return ret;
*oldname = func->old_name;
return -EADDRINUSE;
}
}
}
@ -274,6 +266,22 @@ static int klp_check_stack(struct task_struct *task, char *err_buf)
return 0;
}
static int klp_check_and_switch_task(struct task_struct *task, void *arg)
{
int ret;
if (task_curr(task) && task != current)
return -EBUSY;
ret = klp_check_stack(task, arg);
if (ret)
return ret;
clear_tsk_thread_flag(task, TIF_PATCH_PENDING);
task->patch_state = klp_target_state;
return 0;
}
/*
* Try to safely switch a task to the target patch state. If it's currently
* running, or it's sleeping on a to-be-patched or to-be-unpatched function, or
@ -281,13 +289,8 @@ static int klp_check_stack(struct task_struct *task, char *err_buf)
*/
static bool klp_try_switch_task(struct task_struct *task)
{
static char err_buf[STACK_ERR_BUF_SIZE];
struct rq *rq;
struct rq_flags flags;
const char *old_name;
int ret;
bool success = false;
err_buf[0] = '\0';
/* check if this task has already switched over */
if (task->patch_state == klp_target_state)
@ -305,36 +308,31 @@ static bool klp_try_switch_task(struct task_struct *task)
* functions. If all goes well, switch the task to the target patch
* state.
*/
rq = task_rq_lock(task, &flags);
ret = task_call_func(task, klp_check_and_switch_task, &old_name);
switch (ret) {
case 0: /* success */
break;
if (task_running(rq, task) && task != current) {
snprintf(err_buf, STACK_ERR_BUF_SIZE,
"%s: %s:%d is running\n", __func__, task->comm,
task->pid);
goto done;
case -EBUSY: /* klp_check_and_switch_task() */
pr_debug("%s: %s:%d is running\n",
__func__, task->comm, task->pid);
break;
case -EINVAL: /* klp_check_and_switch_task() */
pr_debug("%s: %s:%d has an unreliable stack\n",
__func__, task->comm, task->pid);
break;
case -EADDRINUSE: /* klp_check_and_switch_task() */
pr_debug("%s: %s:%d is sleeping on function %s\n",
__func__, task->comm, task->pid, old_name);
break;
default:
pr_debug("%s: Unknown error code (%d) when trying to switch %s:%d\n",
__func__, ret, task->comm, task->pid);
break;
}
ret = klp_check_stack(task, err_buf);
if (ret)
goto done;
success = true;
clear_tsk_thread_flag(task, TIF_PATCH_PENDING);
task->patch_state = klp_target_state;
done:
task_rq_unlock(rq, task, &flags);
/*
* Due to console deadlock issues, pr_debug() can't be used while
* holding the task rq lock. Instead we have to use a temporary buffer
* and print the debug message after releasing the lock.
*/
if (err_buf[0] != '\0')
pr_debug("%s", err_buf);
return success;
return !ret;
}
/*
@ -415,8 +413,11 @@ void klp_try_complete_transition(void)
for_each_possible_cpu(cpu) {
task = idle_task(cpu);
if (cpu_online(cpu)) {
if (!klp_try_switch_task(task))
if (!klp_try_switch_task(task)) {
complete = false;
/* Make idle task go through the main loop. */
wake_up_if_idle(cpu);
}
} else if (task->patch_state != klp_target_state) {
/* offline idle tasks can be switched immediately */
clear_tsk_thread_flag(task, TIF_PATCH_PENDING);


@ -928,7 +928,7 @@ reset_ipi:
}
/* Callback function for scheduler to check locked-down task. */
static bool trc_inspect_reader(struct task_struct *t, void *arg)
static int trc_inspect_reader(struct task_struct *t, void *arg)
{
int cpu = task_cpu(t);
bool in_qs = false;
@ -939,7 +939,7 @@ static bool trc_inspect_reader(struct task_struct *t, void *arg)
// If no chance of heavyweight readers, do it the hard way.
if (!ofl && !IS_ENABLED(CONFIG_TASKS_TRACE_RCU_READ_MB))
return false;
return -EINVAL;
// If heavyweight readers are enabled on the remote task,
// we can inspect its state despite its currently running.
@ -947,7 +947,7 @@ static bool trc_inspect_reader(struct task_struct *t, void *arg)
n_heavy_reader_attempts++;
if (!ofl && // Check for "running" idle tasks on offline CPUs.
!rcu_dynticks_zero_in_eqs(cpu, &t->trc_reader_nesting))
return false; // No quiescent state, do it the hard way.
return -EINVAL; // No quiescent state, do it the hard way.
n_heavy_reader_updates++;
if (ofl)
n_heavy_reader_ofl_updates++;
@ -962,7 +962,7 @@ static bool trc_inspect_reader(struct task_struct *t, void *arg)
t->trc_reader_checked = true;
if (in_qs)
return true; // Already in quiescent state, done!!!
return 0; // Already in quiescent state, done!!!
// The task is in a read-side critical section, so set up its
// state so that it will awaken the grace-period kthread upon exit
@ -970,7 +970,7 @@ static bool trc_inspect_reader(struct task_struct *t, void *arg)
atomic_inc(&trc_n_readers_need_end); // One more to wait on.
WARN_ON_ONCE(READ_ONCE(t->trc_reader_special.b.need_qs));
WRITE_ONCE(t->trc_reader_special.b.need_qs, true);
return true;
return 0;
}
/* Attempt to extract the state for the specified task. */
@ -992,7 +992,7 @@ static void trc_wait_for_one_reader(struct task_struct *t,
// Attempt to nail down the task for inspection.
get_task_struct(t);
if (try_invoke_on_locked_down_task(t, trc_inspect_reader, NULL)) {
if (!task_call_func(t, trc_inspect_reader, NULL)) {
put_task_struct(t);
return;
}


@ -240,16 +240,16 @@ struct rcu_stall_chk_rdr {
* Report out the state of a not-running task that is stalling the
* current RCU grace period.
*/
static bool check_slow_task(struct task_struct *t, void *arg)
static int check_slow_task(struct task_struct *t, void *arg)
{
struct rcu_stall_chk_rdr *rscrp = arg;
if (task_curr(t))
return false; // It is running, so decline to inspect it.
return -EBUSY; // It is running, so decline to inspect it.
rscrp->nesting = t->rcu_read_lock_nesting;
rscrp->rs = t->rcu_read_unlock_special;
rscrp->on_blkd_list = !list_empty(&t->rcu_node_entry);
return true;
return 0;
}
/*
@ -283,7 +283,7 @@ static int rcu_print_task_stall(struct rcu_node *rnp, unsigned long flags)
raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
while (i) {
t = ts[--i];
if (!try_invoke_on_locked_down_task(t, check_slow_task, &rscr))
if (task_call_func(t, check_slow_task, &rscr))
pr_cont(" P%d", t->pid);
else
pr_cont(" P%d/%d:%c%c%c%c",


@ -3,6 +3,10 @@ ifdef CONFIG_FUNCTION_TRACER
CFLAGS_REMOVE_clock.o = $(CC_FLAGS_FTRACE)
endif
# The compilers are complaining about unused variables inside an if(0) scope
# block. This is daft, shut them up.
ccflags-y += $(call cc-disable-warning, unused-but-set-variable)
# These files are disabled because they produce non-interesting flaky coverage
# that is not a function of syscall inputs. E.g. involuntary context switches.
KCOV_INSTRUMENT := n


@ -74,7 +74,11 @@ __read_mostly int sysctl_resched_latency_warn_once = 1;
* Number of tasks to iterate in a single balance run.
* Limited because this is done with IRQs disabled.
*/
#ifdef CONFIG_PREEMPT_RT
const_debug unsigned int sysctl_sched_nr_migrate = 8;
#else
const_debug unsigned int sysctl_sched_nr_migrate = 32;
#endif
/*
* period over which we measure -rt task CPU usage in us.
@ -1962,6 +1966,25 @@ bool sched_task_on_rq(struct task_struct *p)
return task_on_rq_queued(p);
}
unsigned long get_wchan(struct task_struct *p)
{
unsigned long ip = 0;
unsigned int state;
if (!p || p == current)
return 0;
/* Only get wchan if task is blocked and we can keep it that way. */
raw_spin_lock_irq(&p->pi_lock);
state = READ_ONCE(p->__state);
smp_rmb(); /* see try_to_wake_up() */
if (state != TASK_RUNNING && state != TASK_WAKING && !p->on_rq)
ip = __get_wchan(p);
raw_spin_unlock_irq(&p->pi_lock);
return ip;
}
static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
{
if (!(flags & ENQUEUE_NOCLOCK))
@ -3251,7 +3274,7 @@ unsigned long wait_task_inactive(struct task_struct *p, unsigned int match_state
ktime_t to = NSEC_PER_SEC / HZ;
set_current_state(TASK_UNINTERRUPTIBLE);
schedule_hrtimeout(&to, HRTIMER_MODE_REL);
schedule_hrtimeout(&to, HRTIMER_MODE_REL_HARD);
continue;
}
@ -3489,11 +3512,11 @@ ttwu_stat(struct task_struct *p, int cpu, int wake_flags)
#ifdef CONFIG_SMP
if (cpu == rq->cpu) {
__schedstat_inc(rq->ttwu_local);
__schedstat_inc(p->se.statistics.nr_wakeups_local);
__schedstat_inc(p->stats.nr_wakeups_local);
} else {
struct sched_domain *sd;
__schedstat_inc(p->se.statistics.nr_wakeups_remote);
__schedstat_inc(p->stats.nr_wakeups_remote);
rcu_read_lock();
for_each_domain(rq->cpu, sd) {
if (cpumask_test_cpu(cpu, sched_domain_span(sd))) {
@ -3505,14 +3528,14 @@ ttwu_stat(struct task_struct *p, int cpu, int wake_flags)
}
if (wake_flags & WF_MIGRATED)
__schedstat_inc(p->se.statistics.nr_wakeups_migrate);
__schedstat_inc(p->stats.nr_wakeups_migrate);
#endif /* CONFIG_SMP */
__schedstat_inc(rq->ttwu_count);
__schedstat_inc(p->se.statistics.nr_wakeups);
__schedstat_inc(p->stats.nr_wakeups);
if (wake_flags & WF_SYNC)
__schedstat_inc(p->se.statistics.nr_wakeups_sync);
__schedstat_inc(p->stats.nr_wakeups_sync);
}
/*
@ -3691,15 +3714,11 @@ void wake_up_if_idle(int cpu)
if (!is_idle_task(rcu_dereference(rq->curr)))
goto out;
if (set_nr_if_polling(rq->idle)) {
trace_sched_wake_idle_without_ipi(cpu);
} else {
rq_lock_irqsave(rq, &rf);
if (is_idle_task(rq->curr))
smp_send_reschedule(cpu);
/* Else CPU is not idle, do nothing here: */
rq_unlock_irqrestore(rq, &rf);
}
rq_lock_irqsave(rq, &rf);
if (is_idle_task(rq->curr))
resched_curr(rq);
/* Else CPU is not idle, do nothing here: */
rq_unlock_irqrestore(rq, &rf);
out:
rcu_read_unlock();
@ -4106,46 +4125,61 @@ out:
}
/**
* try_invoke_on_locked_down_task - Invoke a function on task in fixed state
* task_call_func - Invoke a function on task in fixed state
* @p: Process for which the function is to be invoked, can be @current.
* @func: Function to invoke.
* @arg: Argument to function.
*
* If the specified task can be quickly locked into a definite state
* (either sleeping or on a given runqueue), arrange to keep it in that
* state while invoking @func(@arg). This function can use ->on_rq and
* task_curr() to work out what the state is, if required. Given that
* @func can be invoked with a runqueue lock held, it had better be quite
* lightweight.
* Fix the task in its current state by avoiding wakeups and/or rq operations
* and call @func(@arg) on it. This function can use ->on_rq and task_curr()
* to work out what the state is, if required. Given that @func can be invoked
* with a runqueue lock held, it had better be quite lightweight.
*
* Returns:
* @false if the task slipped out from under the locks.
* @true if the task was locked onto a runqueue or is sleeping.
* However, @func can override this by returning @false.
* Whatever @func returns
*/
bool try_invoke_on_locked_down_task(struct task_struct *p, bool (*func)(struct task_struct *t, void *arg), void *arg)
int task_call_func(struct task_struct *p, task_call_f func, void *arg)
{
struct rq *rq = NULL;
unsigned int state;
struct rq_flags rf;
bool ret = false;
struct rq *rq;
int ret;
raw_spin_lock_irqsave(&p->pi_lock, rf.flags);
if (p->on_rq) {
state = READ_ONCE(p->__state);
/*
* Ensure we load p->on_rq after p->__state, otherwise it would be
* possible to, falsely, observe p->on_rq == 0.
*
* See try_to_wake_up() for a longer comment.
*/
smp_rmb();
/*
* Since pi->lock blocks try_to_wake_up(), we don't need rq->lock when
* the task is blocked. Make sure to check @state since ttwu() can drop
* locks at the end, see ttwu_queue_wakelist().
*/
if (state == TASK_RUNNING || state == TASK_WAKING || p->on_rq)
rq = __task_rq_lock(p, &rf);
if (task_rq(p) == rq)
ret = func(p, arg);
/*
* At this point the task is pinned; either:
* - blocked and we're holding off wakeups (pi->lock)
* - woken, and we're holding off enqueue (rq->lock)
* - queued, and we're holding off schedule (rq->lock)
* - running, and we're holding off de-schedule (rq->lock)
*
* The called function (@func) can use: task_curr(), p->on_rq and
* p->__state to differentiate between these states.
*/
ret = func(p, arg);
if (rq)
rq_unlock(rq, &rf);
} else {
switch (READ_ONCE(p->__state)) {
case TASK_RUNNING:
case TASK_WAKING:
break;
default:
smp_rmb(); // See smp_rmb() comment in try_to_wake_up().
if (!p->on_rq)
ret = func(p, arg);
}
}
raw_spin_unlock_irqrestore(&p->pi_lock, rf.flags);
return ret;
}
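A usage sketch for the new interface (illustrative only; the callback name and return policy are hypothetical). The callers converted earlier in this diff (livepatch and the RCU tasks/stall code) follow the same shape: the callback returns 0 or a negative errno-style value and task_call_func() simply propagates it.

        /* Sketch: a lightweight, non-sleeping inspection callback. */
        static int my_inspect(struct task_struct *p, void *arg)
        {
                /* @p is pinned for the duration of the call (see the comment
                 * above), so this must stay cheap and must not block. */
                return task_curr(p) ? -EBUSY : 0;
        }

        ret = task_call_func(task, my_inspect, NULL);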
@ -4196,7 +4230,7 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
#ifdef CONFIG_SCHEDSTATS
/* Even if schedstat is disabled, there should not be garbage */
memset(&p->se.statistics, 0, sizeof(p->se.statistics));
memset(&p->stats, 0, sizeof(p->stats));
#endif
RB_CLEAR_NODE(&p->dl.rb_node);
@ -4328,8 +4362,6 @@ int sysctl_schedstats(struct ctl_table *table, int write, void *buffer,
*/
int sched_fork(unsigned long clone_flags, struct task_struct *p)
{
unsigned long flags;
__sched_fork(clone_flags, p);
/*
* We mark the process as NEW here. This guarantees that
@ -4375,24 +4407,6 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
init_entity_runnable_average(&p->se);
/*
* The child is not yet in the pid-hash so no cgroup attach races,
* and the cgroup is pinned to this child due to cgroup_fork()
* is ran before sched_fork().
*
* Silence PROVE_RCU.
*/
raw_spin_lock_irqsave(&p->pi_lock, flags);
rseq_migrate(p);
/*
* We're setting the CPU for the first time, we don't migrate,
* so use __set_task_cpu().
*/
__set_task_cpu(p, smp_processor_id());
if (p->sched_class->task_fork)
p->sched_class->task_fork(p);
raw_spin_unlock_irqrestore(&p->pi_lock, flags);
#ifdef CONFIG_SCHED_INFO
if (likely(sched_info_on()))
memset(&p->sched_info, 0, sizeof(p->sched_info));
@ -4408,8 +4422,29 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
return 0;
}
void sched_post_fork(struct task_struct *p)
void sched_post_fork(struct task_struct *p, struct kernel_clone_args *kargs)
{
unsigned long flags;
#ifdef CONFIG_CGROUP_SCHED
struct task_group *tg;
#endif
raw_spin_lock_irqsave(&p->pi_lock, flags);
#ifdef CONFIG_CGROUP_SCHED
tg = container_of(kargs->cset->subsys[cpu_cgrp_id],
struct task_group, css);
p->sched_task_group = autogroup_task_group(p, tg);
#endif
rseq_migrate(p);
/*
* We're setting the CPU for the first time, we don't migrate,
* so use __set_task_cpu().
*/
__set_task_cpu(p, smp_processor_id());
if (p->sched_class->task_fork)
p->sched_class->task_fork(p);
raw_spin_unlock_irqrestore(&p->pi_lock, flags);
uclamp_post_fork(p);
}
@ -4836,18 +4871,12 @@ static struct rq *finish_task_switch(struct task_struct *prev)
*/
if (mm) {
membarrier_mm_sync_core_before_usermode(mm);
mmdrop(mm);
mmdrop_sched(mm);
}
if (unlikely(prev_state == TASK_DEAD)) {
if (prev->sched_class->task_dead)
prev->sched_class->task_dead(prev);
/*
* Remove function-return probe instances associated with this
* task and put them back on the free list.
*/
kprobe_flush_task(prev);
/* Task is done with its stack. */
put_task_stack(prev);
@ -5580,8 +5609,7 @@ restart:
return p;
}
/* The idle class should always have a runnable task: */
BUG();
BUG(); /* The idle class should always have a runnable task. */
}
#ifdef CONFIG_SCHED_CORE
@ -5603,54 +5631,18 @@ static inline bool cookie_match(struct task_struct *a, struct task_struct *b)
return a->core_cookie == b->core_cookie;
}
// XXX fairness/fwd progress conditions
/*
* Returns
* - NULL if there is no runnable task for this class.
* - the highest priority task for this runqueue if it matches
* rq->core->core_cookie or its priority is greater than max.
* - Else returns idle_task.
*/
static struct task_struct *
pick_task(struct rq *rq, const struct sched_class *class, struct task_struct *max, bool in_fi)
static inline struct task_struct *pick_task(struct rq *rq)
{
struct task_struct *class_pick, *cookie_pick;
unsigned long cookie = rq->core->core_cookie;
const struct sched_class *class;
struct task_struct *p;
class_pick = class->pick_task(rq);
if (!class_pick)
return NULL;
if (!cookie) {
/*
* If class_pick is tagged, return it only if it has
* higher priority than max.
*/
if (max && class_pick->core_cookie &&
prio_less(class_pick, max, in_fi))
return idle_sched_class.pick_task(rq);
return class_pick;
for_each_class(class) {
p = class->pick_task(rq);
if (p)
return p;
}
/*
* If class_pick is idle or matches cookie, return early.
*/
if (cookie_equals(class_pick, cookie))
return class_pick;
cookie_pick = sched_core_find(rq, cookie);
/*
* If class > max && class > cookie, it is the highest priority task on
* the core (so far) and it must be selected, otherwise we must go with
* the cookie pick in order to satisfy the constraint.
*/
if (prio_less(cookie_pick, class_pick, in_fi) &&
(!max || prio_less(max, class_pick, in_fi)))
return class_pick;
return cookie_pick;
BUG(); /* The idle class should always have a runnable task. */
}
extern void task_vruntime_update(struct rq *rq, struct task_struct *p, bool in_fi);
@ -5658,11 +5650,12 @@ extern void task_vruntime_update(struct rq *rq, struct task_struct *p, bool in_f
static struct task_struct *
pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
{
struct task_struct *next, *max = NULL;
const struct sched_class *class;
struct task_struct *next, *p, *max = NULL;
const struct cpumask *smt_mask;
bool fi_before = false;
int i, j, cpu, occ = 0;
unsigned long cookie;
int i, cpu, occ = 0;
struct rq *rq_i;
bool need_sync;
if (!sched_core_enabled(rq))
@ -5735,12 +5728,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
* and there are no cookied tasks running on siblings.
*/
if (!need_sync) {
for_each_class(class) {
next = class->pick_task(rq);
if (next)
break;
}
next = pick_task(rq);
if (!next->core_cookie) {
rq->core_pick = NULL;
/*
@ -5753,76 +5741,51 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
}
}
for_each_cpu(i, smt_mask) {
struct rq *rq_i = cpu_rq(i);
rq_i->core_pick = NULL;
/*
* For each thread: do the regular task pick and find the max prio task
* amongst them.
*
* Tie-break prio towards the current CPU
*/
for_each_cpu_wrap(i, smt_mask, cpu) {
rq_i = cpu_rq(i);
if (i != cpu)
update_rq_clock(rq_i);
p = rq_i->core_pick = pick_task(rq_i);
if (!max || prio_less(max, p, fi_before))
max = p;
}
cookie = rq->core->core_cookie = max->core_cookie;
/*
* Try and select tasks for each sibling in descending sched_class
* order.
* For each thread: try and find a runnable task that matches @max or
* force idle.
*/
for_each_class(class) {
again:
for_each_cpu_wrap(i, smt_mask, cpu) {
struct rq *rq_i = cpu_rq(i);
struct task_struct *p;
for_each_cpu(i, smt_mask) {
rq_i = cpu_rq(i);
p = rq_i->core_pick;
if (rq_i->core_pick)
continue;
/*
* If this sibling doesn't yet have a suitable task to
* run; ask for the most eligible task, given the
* highest priority task already selected for this
* core.
*/
p = pick_task(rq_i, class, max, fi_before);
if (!cookie_equals(p, cookie)) {
p = NULL;
if (cookie)
p = sched_core_find(rq_i, cookie);
if (!p)
continue;
p = idle_sched_class.pick_task(rq_i);
}
if (!is_task_rq_idle(p))
occ++;
rq_i->core_pick = p;
rq_i->core_pick = p;
if (rq_i->idle == p && rq_i->nr_running) {
if (p == rq_i->idle) {
if (rq_i->nr_running) {
rq->core->core_forceidle = true;
if (!fi_before)
rq->core->core_forceidle_seq++;
}
/*
* If this new candidate is of higher priority than the
* previous; and they're incompatible; we need to wipe
* the slate and start over. pick_task makes sure that
* p's priority is more than max if it doesn't match
* max's cookie.
*
* NOTE: this is a linear max-filter and is thus bounded
* in execution time.
*/
if (!max || !cookie_match(max, p)) {
struct task_struct *old_max = max;
rq->core->core_cookie = p->core_cookie;
max = p;
if (old_max) {
rq->core->core_forceidle = false;
for_each_cpu(j, smt_mask) {
if (j == i)
continue;
cpu_rq(j)->core_pick = NULL;
}
occ = 1;
goto again;
}
}
} else {
occ++;
}
}
@ -5842,7 +5805,7 @@ again:
* non-matching user state.
*/
for_each_cpu(i, smt_mask) {
struct rq *rq_i = cpu_rq(i);
rq_i = cpu_rq(i);
/*
* An online sibling might have gone offline before a task
@ -6319,20 +6282,14 @@ static inline void sched_submit_work(struct task_struct *tsk)
task_flags = tsk->flags;
/*
* If a worker went to sleep, notify and ask workqueue whether
* it wants to wake up a task to maintain concurrency.
* As this function is called inside the schedule() context,
* we disable preemption to avoid it calling schedule() again
* in the possible wakeup of a kworker and because wq_worker_sleeping()
* requires it.
* If a worker goes to sleep, notify and ask workqueue whether it
* wants to wake up a task to maintain concurrency.
*/
if (task_flags & (PF_WQ_WORKER | PF_IO_WORKER)) {
preempt_disable();
if (task_flags & PF_WQ_WORKER)
wq_worker_sleeping(tsk);
else
io_wq_worker_sleeping(tsk);
preempt_enable_no_resched();
}
if (tsk_is_pi_blocked(tsk))
@ -6586,12 +6543,13 @@ EXPORT_STATIC_CALL_TRAMP(preempt_schedule_notrace);
*/
enum {
preempt_dynamic_none = 0,
preempt_dynamic_undefined = -1,
preempt_dynamic_none,
preempt_dynamic_voluntary,
preempt_dynamic_full,
};
int preempt_dynamic_mode = preempt_dynamic_full;
int preempt_dynamic_mode = preempt_dynamic_undefined;
int sched_dynamic_mode(const char *str)
{
@ -6664,7 +6622,27 @@ static int __init setup_preempt_mode(char *str)
}
__setup("preempt=", setup_preempt_mode);
#endif /* CONFIG_PREEMPT_DYNAMIC */
static void __init preempt_dynamic_init(void)
{
if (preempt_dynamic_mode == preempt_dynamic_undefined) {
if (IS_ENABLED(CONFIG_PREEMPT_NONE_BEHAVIOUR)) {
sched_dynamic_update(preempt_dynamic_none);
} else if (IS_ENABLED(CONFIG_PREEMPT_VOLUNTARY_BEHAVIOUR)) {
sched_dynamic_update(preempt_dynamic_voluntary);
} else {
/* Default static call setting, nothing to do */
WARN_ON_ONCE(!IS_ENABLED(CONFIG_PREEMPT_BEHAVIOUR));
preempt_dynamic_mode = preempt_dynamic_full;
pr_info("Dynamic Preempt: full\n");
}
}
}
#else /* !CONFIG_PREEMPT_DYNAMIC */
static inline void preempt_dynamic_init(void) { }
#endif /* #ifdef CONFIG_PREEMPT_DYNAMIC */
/*
* This is the entry point to schedule() from kernel preemption
@ -9466,6 +9444,8 @@ void __init sched_init(void)
init_uclamp();
preempt_dynamic_init();
scheduler_running = 1;
}
@ -9640,9 +9620,9 @@ void normalize_rt_tasks(void)
continue;
p->se.exec_start = 0;
schedstat_set(p->se.statistics.wait_start, 0);
schedstat_set(p->se.statistics.sleep_start, 0);
schedstat_set(p->se.statistics.block_start, 0);
schedstat_set(p->stats.wait_start, 0);
schedstat_set(p->stats.sleep_start, 0);
schedstat_set(p->stats.block_start, 0);
if (!dl_task(p) && !rt_task(p)) {
/*
@ -10484,15 +10464,21 @@ static int cpu_cfs_stat_show(struct seq_file *sf, void *v)
seq_printf(sf, "throttled_time %llu\n", cfs_b->throttled_time);
if (schedstat_enabled() && tg != &root_task_group) {
struct sched_statistics *stats;
u64 ws = 0;
int i;
for_each_possible_cpu(i)
ws += schedstat_val(tg->se[i]->statistics.wait_sum);
for_each_possible_cpu(i) {
stats = __schedstats_from_se(tg->se[i]);
ws += schedstat_val(stats->wait_sum);
}
seq_printf(sf, "wait_sum %llu\n", ws);
}
seq_printf(sf, "nr_bursts %d\n", cfs_b->nr_burst);
seq_printf(sf, "burst_time %llu\n", cfs_b->burst_time);
return 0;
}
#endif /* CONFIG_CFS_BANDWIDTH */
@ -10608,16 +10594,20 @@ static int cpu_extra_stat_show(struct seq_file *sf,
{
struct task_group *tg = css_tg(css);
struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;
u64 throttled_usec;
u64 throttled_usec, burst_usec;
throttled_usec = cfs_b->throttled_time;
do_div(throttled_usec, NSEC_PER_USEC);
burst_usec = cfs_b->burst_time;
do_div(burst_usec, NSEC_PER_USEC);
seq_printf(sf, "nr_periods %d\n"
"nr_throttled %d\n"
"throttled_usec %llu\n",
"throttled_usec %llu\n"
"nr_bursts %d\n"
"burst_usec %llu\n",
cfs_b->nr_periods, cfs_b->nr_throttled,
throttled_usec);
throttled_usec, cfs_b->nr_burst, burst_usec);
}
#endif
return 0;
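For illustration (values invented, not taken from the diff): with the two hunks above the burst statistics become visible next to the existing bandwidth counters, so a cgroup v2 cpu.stat would gain the last two lines of something like:

        nr_periods 100
        nr_throttled 4
        throttled_usec 12000
        nr_bursts 2
        burst_usec 3000

while the cgroup v1 side reports nr_bursts and burst_time as added in the preceding hunk.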


@ -11,7 +11,7 @@ struct sched_core_cookie {
refcount_t refcnt;
};
unsigned long sched_core_alloc_cookie(void)
static unsigned long sched_core_alloc_cookie(void)
{
struct sched_core_cookie *ck = kmalloc(sizeof(*ck), GFP_KERNEL);
if (!ck)
@ -23,7 +23,7 @@ unsigned long sched_core_alloc_cookie(void)
return (unsigned long)ck;
}
void sched_core_put_cookie(unsigned long cookie)
static void sched_core_put_cookie(unsigned long cookie)
{
struct sched_core_cookie *ptr = (void *)cookie;
@ -33,7 +33,7 @@ void sched_core_put_cookie(unsigned long cookie)
}
}
unsigned long sched_core_get_cookie(unsigned long cookie)
static unsigned long sched_core_get_cookie(unsigned long cookie)
{
struct sched_core_cookie *ptr = (void *)cookie;
@ -53,7 +53,8 @@ unsigned long sched_core_get_cookie(unsigned long cookie)
*
* Returns: the old cookie
*/
unsigned long sched_core_update_cookie(struct task_struct *p, unsigned long cookie)
static unsigned long sched_core_update_cookie(struct task_struct *p,
unsigned long cookie)
{
unsigned long old_cookie;
struct rq_flags rf;


@ -1265,8 +1265,10 @@ static void update_curr_dl(struct rq *rq)
return;
}
schedstat_set(curr->se.statistics.exec_max,
max(curr->se.statistics.exec_max, delta_exec));
schedstat_set(curr->stats.exec_max,
max(curr->stats.exec_max, delta_exec));
trace_sched_stat_runtime(curr, delta_exec, 0);
curr->se.sum_exec_runtime += delta_exec;
account_group_exec_runtime(curr, delta_exec);
@ -1472,6 +1474,82 @@ static inline bool __dl_less(struct rb_node *a, const struct rb_node *b)
return dl_time_before(__node_2_dle(a)->deadline, __node_2_dle(b)->deadline);
}
static inline struct sched_statistics *
__schedstats_from_dl_se(struct sched_dl_entity *dl_se)
{
return &dl_task_of(dl_se)->stats;
}
static inline void
update_stats_wait_start_dl(struct dl_rq *dl_rq, struct sched_dl_entity *dl_se)
{
struct sched_statistics *stats;
if (!schedstat_enabled())
return;
stats = __schedstats_from_dl_se(dl_se);
__update_stats_wait_start(rq_of_dl_rq(dl_rq), dl_task_of(dl_se), stats);
}
static inline void
update_stats_wait_end_dl(struct dl_rq *dl_rq, struct sched_dl_entity *dl_se)
{
struct sched_statistics *stats;
if (!schedstat_enabled())
return;
stats = __schedstats_from_dl_se(dl_se);
__update_stats_wait_end(rq_of_dl_rq(dl_rq), dl_task_of(dl_se), stats);
}
static inline void
update_stats_enqueue_sleeper_dl(struct dl_rq *dl_rq, struct sched_dl_entity *dl_se)
{
struct sched_statistics *stats;
if (!schedstat_enabled())
return;
stats = __schedstats_from_dl_se(dl_se);
__update_stats_enqueue_sleeper(rq_of_dl_rq(dl_rq), dl_task_of(dl_se), stats);
}
static inline void
update_stats_enqueue_dl(struct dl_rq *dl_rq, struct sched_dl_entity *dl_se,
int flags)
{
if (!schedstat_enabled())
return;
if (flags & ENQUEUE_WAKEUP)
update_stats_enqueue_sleeper_dl(dl_rq, dl_se);
}
static inline void
update_stats_dequeue_dl(struct dl_rq *dl_rq, struct sched_dl_entity *dl_se,
int flags)
{
struct task_struct *p = dl_task_of(dl_se);
if (!schedstat_enabled())
return;
if ((flags & DEQUEUE_SLEEP)) {
unsigned int state;
state = READ_ONCE(p->__state);
if (state & TASK_INTERRUPTIBLE)
__schedstat_set(p->stats.sleep_start,
rq_clock(rq_of_dl_rq(dl_rq)));
if (state & TASK_UNINTERRUPTIBLE)
__schedstat_set(p->stats.block_start,
rq_clock(rq_of_dl_rq(dl_rq)));
}
}
static void __enqueue_dl_entity(struct sched_dl_entity *dl_se)
{
struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
@ -1502,6 +1580,8 @@ enqueue_dl_entity(struct sched_dl_entity *dl_se, int flags)
{
BUG_ON(on_dl_rq(dl_se));
update_stats_enqueue_dl(dl_rq_of_se(dl_se), dl_se, flags);
/*
* If this is a wakeup or a new instance, the scheduling
* parameters of the task might need updating. Otherwise,
@ -1598,6 +1678,9 @@ static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
return;
}
check_schedstat_required();
update_stats_wait_start_dl(dl_rq_of_se(&p->dl), &p->dl);
enqueue_dl_entity(&p->dl, flags);
if (!task_current(rq, p) && p->nr_cpus_allowed > 1)
@ -1606,6 +1689,7 @@ static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
static void __dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
{
update_stats_dequeue_dl(&rq->dl, &p->dl, flags);
dequeue_dl_entity(&p->dl);
dequeue_pushable_dl_task(rq, p);
}
@ -1825,7 +1909,12 @@ static void start_hrtick_dl(struct rq *rq, struct task_struct *p)
static void set_next_task_dl(struct rq *rq, struct task_struct *p, bool first)
{
struct sched_dl_entity *dl_se = &p->dl;
struct dl_rq *dl_rq = &rq->dl;
p->se.exec_start = rq_clock_task(rq);
if (on_dl_rq(&p->dl))
update_stats_wait_end_dl(dl_rq, dl_se);
/* You can't push away the running task */
dequeue_pushable_dl_task(rq, p);
@ -1882,6 +1971,12 @@ static struct task_struct *pick_next_task_dl(struct rq *rq)
static void put_prev_task_dl(struct rq *rq, struct task_struct *p)
{
struct sched_dl_entity *dl_se = &p->dl;
struct dl_rq *dl_rq = &rq->dl;
if (on_dl_rq(&p->dl))
update_stats_wait_start_dl(dl_rq, dl_se);
update_curr_dl(rq);
update_dl_rq_load_avg(rq_clock_pelt(rq), rq, 1);


@ -311,6 +311,7 @@ static __init int sched_init_debug(void)
debugfs_create_u32("latency_ns", 0644, debugfs_sched, &sysctl_sched_latency);
debugfs_create_u32("min_granularity_ns", 0644, debugfs_sched, &sysctl_sched_min_granularity);
debugfs_create_u32("idle_min_granularity_ns", 0644, debugfs_sched, &sysctl_sched_idle_min_granularity);
debugfs_create_u32("wakeup_granularity_ns", 0644, debugfs_sched, &sysctl_sched_wakeup_granularity);
debugfs_create_u32("latency_warn_ms", 0644, debugfs_sched, &sysctl_resched_latency_warn_ms);
@ -448,9 +449,11 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group
struct sched_entity *se = tg->se[cpu];
#define P(F) SEQ_printf(m, " .%-30s: %lld\n", #F, (long long)F)
#define P_SCHEDSTAT(F) SEQ_printf(m, " .%-30s: %lld\n", #F, (long long)schedstat_val(F))
#define P_SCHEDSTAT(F) SEQ_printf(m, " .%-30s: %lld\n", \
#F, (long long)schedstat_val(stats->F))
#define PN(F) SEQ_printf(m, " .%-30s: %lld.%06ld\n", #F, SPLIT_NS((long long)F))
#define PN_SCHEDSTAT(F) SEQ_printf(m, " .%-30s: %lld.%06ld\n", #F, SPLIT_NS((long long)schedstat_val(F)))
#define PN_SCHEDSTAT(F) SEQ_printf(m, " .%-30s: %lld.%06ld\n", \
#F, SPLIT_NS((long long)schedstat_val(stats->F)))
if (!se)
return;
@ -460,16 +463,19 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group
PN(se->sum_exec_runtime);
if (schedstat_enabled()) {
PN_SCHEDSTAT(se->statistics.wait_start);
PN_SCHEDSTAT(se->statistics.sleep_start);
PN_SCHEDSTAT(se->statistics.block_start);
PN_SCHEDSTAT(se->statistics.sleep_max);
PN_SCHEDSTAT(se->statistics.block_max);
PN_SCHEDSTAT(se->statistics.exec_max);
PN_SCHEDSTAT(se->statistics.slice_max);
PN_SCHEDSTAT(se->statistics.wait_max);
PN_SCHEDSTAT(se->statistics.wait_sum);
P_SCHEDSTAT(se->statistics.wait_count);
struct sched_statistics *stats;
stats = __schedstats_from_se(se);
PN_SCHEDSTAT(wait_start);
PN_SCHEDSTAT(sleep_start);
PN_SCHEDSTAT(block_start);
PN_SCHEDSTAT(sleep_max);
PN_SCHEDSTAT(block_max);
PN_SCHEDSTAT(exec_max);
PN_SCHEDSTAT(slice_max);
PN_SCHEDSTAT(wait_max);
PN_SCHEDSTAT(wait_sum);
P_SCHEDSTAT(wait_count);
}
P(se->load.weight);
@ -535,10 +541,11 @@ print_task(struct seq_file *m, struct rq *rq, struct task_struct *p)
(long long)(p->nvcsw + p->nivcsw),
p->prio);
SEQ_printf(m, "%9Ld.%06ld %9Ld.%06ld %9Ld.%06ld",
SPLIT_NS(schedstat_val_or_zero(p->se.statistics.wait_sum)),
SEQ_printf(m, "%9lld.%06ld %9lld.%06ld %9lld.%06ld %9lld.%06ld",
SPLIT_NS(schedstat_val_or_zero(p->stats.wait_sum)),
SPLIT_NS(p->se.sum_exec_runtime),
SPLIT_NS(schedstat_val_or_zero(p->se.statistics.sum_sleep_runtime)));
SPLIT_NS(schedstat_val_or_zero(p->stats.sum_sleep_runtime)),
SPLIT_NS(schedstat_val_or_zero(p->stats.sum_block_runtime)));
#ifdef CONFIG_NUMA_BALANCING
SEQ_printf(m, " %d %d", task_node(p), task_numa_group_id(p));
@ -614,6 +621,8 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
cfs_rq->nr_spread_over);
SEQ_printf(m, " .%-30s: %d\n", "nr_running", cfs_rq->nr_running);
SEQ_printf(m, " .%-30s: %d\n", "h_nr_running", cfs_rq->h_nr_running);
SEQ_printf(m, " .%-30s: %d\n", "idle_nr_running",
cfs_rq->idle_nr_running);
SEQ_printf(m, " .%-30s: %d\n", "idle_h_nr_running",
cfs_rq->idle_h_nr_running);
SEQ_printf(m, " .%-30s: %ld\n", "load", cfs_rq->load.weight);
@ -810,6 +819,7 @@ static void sched_debug_header(struct seq_file *m)
SEQ_printf(m, " .%-40s: %Ld.%06ld\n", #x, SPLIT_NS(x))
PN(sysctl_sched_latency);
PN(sysctl_sched_min_granularity);
PN(sysctl_sched_idle_min_granularity);
PN(sysctl_sched_wakeup_granularity);
P(sysctl_sched_child_runs_first);
P(sysctl_sched_features);
@ -954,8 +964,8 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
"---------------------------------------------------------"
"----------\n");
#define P_SCHEDSTAT(F) __PS(#F, schedstat_val(p->F))
#define PN_SCHEDSTAT(F) __PSN(#F, schedstat_val(p->F))
#define P_SCHEDSTAT(F) __PS(#F, schedstat_val(p->stats.F))
#define PN_SCHEDSTAT(F) __PSN(#F, schedstat_val(p->stats.F))
PN(se.exec_start);
PN(se.vruntime);
@ -968,33 +978,34 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
if (schedstat_enabled()) {
u64 avg_atom, avg_per_cpu;
PN_SCHEDSTAT(se.statistics.sum_sleep_runtime);
PN_SCHEDSTAT(se.statistics.wait_start);
PN_SCHEDSTAT(se.statistics.sleep_start);
PN_SCHEDSTAT(se.statistics.block_start);
PN_SCHEDSTAT(se.statistics.sleep_max);
PN_SCHEDSTAT(se.statistics.block_max);
PN_SCHEDSTAT(se.statistics.exec_max);
PN_SCHEDSTAT(se.statistics.slice_max);
PN_SCHEDSTAT(se.statistics.wait_max);
PN_SCHEDSTAT(se.statistics.wait_sum);
P_SCHEDSTAT(se.statistics.wait_count);
PN_SCHEDSTAT(se.statistics.iowait_sum);
P_SCHEDSTAT(se.statistics.iowait_count);
P_SCHEDSTAT(se.statistics.nr_migrations_cold);
P_SCHEDSTAT(se.statistics.nr_failed_migrations_affine);
P_SCHEDSTAT(se.statistics.nr_failed_migrations_running);
P_SCHEDSTAT(se.statistics.nr_failed_migrations_hot);
P_SCHEDSTAT(se.statistics.nr_forced_migrations);
P_SCHEDSTAT(se.statistics.nr_wakeups);
P_SCHEDSTAT(se.statistics.nr_wakeups_sync);
P_SCHEDSTAT(se.statistics.nr_wakeups_migrate);
P_SCHEDSTAT(se.statistics.nr_wakeups_local);
P_SCHEDSTAT(se.statistics.nr_wakeups_remote);
P_SCHEDSTAT(se.statistics.nr_wakeups_affine);
P_SCHEDSTAT(se.statistics.nr_wakeups_affine_attempts);
P_SCHEDSTAT(se.statistics.nr_wakeups_passive);
P_SCHEDSTAT(se.statistics.nr_wakeups_idle);
PN_SCHEDSTAT(sum_sleep_runtime);
PN_SCHEDSTAT(sum_block_runtime);
PN_SCHEDSTAT(wait_start);
PN_SCHEDSTAT(sleep_start);
PN_SCHEDSTAT(block_start);
PN_SCHEDSTAT(sleep_max);
PN_SCHEDSTAT(block_max);
PN_SCHEDSTAT(exec_max);
PN_SCHEDSTAT(slice_max);
PN_SCHEDSTAT(wait_max);
PN_SCHEDSTAT(wait_sum);
P_SCHEDSTAT(wait_count);
PN_SCHEDSTAT(iowait_sum);
P_SCHEDSTAT(iowait_count);
P_SCHEDSTAT(nr_migrations_cold);
P_SCHEDSTAT(nr_failed_migrations_affine);
P_SCHEDSTAT(nr_failed_migrations_running);
P_SCHEDSTAT(nr_failed_migrations_hot);
P_SCHEDSTAT(nr_forced_migrations);
P_SCHEDSTAT(nr_wakeups);
P_SCHEDSTAT(nr_wakeups_sync);
P_SCHEDSTAT(nr_wakeups_migrate);
P_SCHEDSTAT(nr_wakeups_local);
P_SCHEDSTAT(nr_wakeups_remote);
P_SCHEDSTAT(nr_wakeups_affine);
P_SCHEDSTAT(nr_wakeups_affine_attempts);
P_SCHEDSTAT(nr_wakeups_passive);
P_SCHEDSTAT(nr_wakeups_idle);
avg_atom = p->se.sum_exec_runtime;
if (nr_switches)
@ -1060,7 +1071,7 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
void proc_sched_set_task(struct task_struct *p)
{
#ifdef CONFIG_SCHEDSTATS
memset(&p->se.statistics, 0, sizeof(p->se.statistics));
memset(&p->stats, 0, sizeof(p->stats));
#endif
}


@ -59,6 +59,14 @@ unsigned int sysctl_sched_tunable_scaling = SCHED_TUNABLESCALING_LOG;
unsigned int sysctl_sched_min_granularity = 750000ULL;
static unsigned int normalized_sysctl_sched_min_granularity = 750000ULL;
/*
* Minimal preemption granularity for CPU-bound SCHED_IDLE tasks.
* Applies only when SCHED_IDLE tasks compete with normal tasks.
*
* (default: 0.75 msec)
*/
unsigned int sysctl_sched_idle_min_granularity = 750000ULL;
/*
* This value is kept at sysctl_sched_latency/sysctl_sched_min_granularity
*/
@ -665,6 +673,8 @@ static u64 __sched_period(unsigned long nr_running)
return sysctl_sched_latency;
}
static bool sched_idle_cfs_rq(struct cfs_rq *cfs_rq);
/*
* We calculate the wall-time slice from the period by taking a part
* proportional to the weight.
@ -674,6 +684,8 @@ static u64 __sched_period(unsigned long nr_running)
static u64 sched_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
unsigned int nr_running = cfs_rq->nr_running;
struct sched_entity *init_se = se;
unsigned int min_gran;
u64 slice;
if (sched_feat(ALT_PERIOD))
@ -684,12 +696,13 @@ static u64 sched_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
for_each_sched_entity(se) {
struct load_weight *load;
struct load_weight lw;
struct cfs_rq *qcfs_rq;
cfs_rq = cfs_rq_of(se);
load = &cfs_rq->load;
qcfs_rq = cfs_rq_of(se);
load = &qcfs_rq->load;
if (unlikely(!se->on_rq)) {
lw = cfs_rq->load;
lw = qcfs_rq->load;
update_load_add(&lw, se->load.weight);
load = &lw;
@ -697,8 +710,14 @@ static u64 sched_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
slice = __calc_delta(slice, se->load.weight, load);
}
if (sched_feat(BASE_SLICE))
slice = max(slice, (u64)sysctl_sched_min_granularity);
if (sched_feat(BASE_SLICE)) {
if (se_is_idle(init_se) && !sched_idle_cfs_rq(cfs_rq))
min_gran = sysctl_sched_idle_min_granularity;
else
min_gran = sysctl_sched_min_granularity;
slice = max_t(u64, slice, min_gran);
}
return slice;
}
@ -837,8 +856,13 @@ static void update_curr(struct cfs_rq *cfs_rq)
curr->exec_start = now;
schedstat_set(curr->statistics.exec_max,
max(delta_exec, curr->statistics.exec_max));
if (schedstat_enabled()) {
struct sched_statistics *stats;
stats = __schedstats_from_se(curr);
__schedstat_set(stats->exec_max,
max(delta_exec, stats->exec_max));
}
curr->sum_exec_runtime += delta_exec;
schedstat_add(cfs_rq->exec_clock, delta_exec);
@ -863,137 +887,70 @@ static void update_curr_fair(struct rq *rq)
}
static inline void
update_stats_wait_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
update_stats_wait_start_fair(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
u64 wait_start, prev_wait_start;
struct sched_statistics *stats;
struct task_struct *p = NULL;
if (!schedstat_enabled())
return;
wait_start = rq_clock(rq_of(cfs_rq));
prev_wait_start = schedstat_val(se->statistics.wait_start);
stats = __schedstats_from_se(se);
if (entity_is_task(se) && task_on_rq_migrating(task_of(se)) &&
likely(wait_start > prev_wait_start))
wait_start -= prev_wait_start;
if (entity_is_task(se))
p = task_of(se);
__schedstat_set(se->statistics.wait_start, wait_start);
__update_stats_wait_start(rq_of(cfs_rq), p, stats);
}
static inline void
update_stats_wait_end(struct cfs_rq *cfs_rq, struct sched_entity *se)
update_stats_wait_end_fair(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
struct task_struct *p;
u64 delta;
struct sched_statistics *stats;
struct task_struct *p = NULL;
if (!schedstat_enabled())
return;
stats = __schedstats_from_se(se);
/*
* When sched_schedstat changes from 0 to 1, some sched entities may
* already be on the runqueue with a wait_start of 0, which would make
* the delta wrong. We need to avoid this scenario.
*/
if (unlikely(!schedstat_val(se->statistics.wait_start)))
if (unlikely(!schedstat_val(stats->wait_start)))
return;
delta = rq_clock(rq_of(cfs_rq)) - schedstat_val(se->statistics.wait_start);
if (entity_is_task(se)) {
if (entity_is_task(se))
p = task_of(se);
if (task_on_rq_migrating(p)) {
/*
* Preserve migrating task's wait time so wait_start
* time stamp can be adjusted to accumulate wait time
* prior to migration.
*/
__schedstat_set(se->statistics.wait_start, delta);
return;
}
trace_sched_stat_wait(p, delta);
}
__schedstat_set(se->statistics.wait_max,
max(schedstat_val(se->statistics.wait_max), delta));
__schedstat_inc(se->statistics.wait_count);
__schedstat_add(se->statistics.wait_sum, delta);
__schedstat_set(se->statistics.wait_start, 0);
__update_stats_wait_end(rq_of(cfs_rq), p, stats);
}
static inline void
update_stats_enqueue_sleeper(struct cfs_rq *cfs_rq, struct sched_entity *se)
update_stats_enqueue_sleeper_fair(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
struct sched_statistics *stats;
struct task_struct *tsk = NULL;
u64 sleep_start, block_start;
if (!schedstat_enabled())
return;
sleep_start = schedstat_val(se->statistics.sleep_start);
block_start = schedstat_val(se->statistics.block_start);
stats = __schedstats_from_se(se);
if (entity_is_task(se))
tsk = task_of(se);
if (sleep_start) {
u64 delta = rq_clock(rq_of(cfs_rq)) - sleep_start;
if ((s64)delta < 0)
delta = 0;
if (unlikely(delta > schedstat_val(se->statistics.sleep_max)))
__schedstat_set(se->statistics.sleep_max, delta);
__schedstat_set(se->statistics.sleep_start, 0);
__schedstat_add(se->statistics.sum_sleep_runtime, delta);
if (tsk) {
account_scheduler_latency(tsk, delta >> 10, 1);
trace_sched_stat_sleep(tsk, delta);
}
}
if (block_start) {
u64 delta = rq_clock(rq_of(cfs_rq)) - block_start;
if ((s64)delta < 0)
delta = 0;
if (unlikely(delta > schedstat_val(se->statistics.block_max)))
__schedstat_set(se->statistics.block_max, delta);
__schedstat_set(se->statistics.block_start, 0);
__schedstat_add(se->statistics.sum_sleep_runtime, delta);
if (tsk) {
if (tsk->in_iowait) {
__schedstat_add(se->statistics.iowait_sum, delta);
__schedstat_inc(se->statistics.iowait_count);
trace_sched_stat_iowait(tsk, delta);
}
trace_sched_stat_blocked(tsk, delta);
/*
* Blocking time is in units of nanosecs, so shift by
* 20 to get a milliseconds-range estimation of the
* amount of time that the task spent sleeping:
*/
if (unlikely(prof_on == SLEEP_PROFILING)) {
profile_hits(SLEEP_PROFILING,
(void *)get_wchan(tsk),
delta >> 20);
}
account_scheduler_latency(tsk, delta >> 10, 0);
}
}
__update_stats_enqueue_sleeper(rq_of(cfs_rq), tsk, stats);
}
/*
* Task is being enqueued - update stats:
*/
static inline void
update_stats_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
update_stats_enqueue_fair(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
if (!schedstat_enabled())
return;
@ -1003,14 +960,14 @@ update_stats_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
* a dequeue/enqueue event is a NOP)
*/
if (se != cfs_rq->curr)
update_stats_wait_start(cfs_rq, se);
update_stats_wait_start_fair(cfs_rq, se);
if (flags & ENQUEUE_WAKEUP)
update_stats_enqueue_sleeper(cfs_rq, se);
update_stats_enqueue_sleeper_fair(cfs_rq, se);
}
static inline void
update_stats_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
update_stats_dequeue_fair(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
if (!schedstat_enabled())
@ -1021,7 +978,7 @@ update_stats_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
* waiting task:
*/
if (se != cfs_rq->curr)
update_stats_wait_end(cfs_rq, se);
update_stats_wait_end_fair(cfs_rq, se);
if ((flags & DEQUEUE_SLEEP) && entity_is_task(se)) {
struct task_struct *tsk = task_of(se);
@ -1030,10 +987,10 @@ update_stats_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
/* XXX racy against TTWU */
state = READ_ONCE(tsk->__state);
if (state & TASK_INTERRUPTIBLE)
__schedstat_set(se->statistics.sleep_start,
__schedstat_set(tsk->stats.sleep_start,
rq_clock(rq_of(cfs_rq)));
if (state & TASK_UNINTERRUPTIBLE)
__schedstat_set(se->statistics.block_start,
__schedstat_set(tsk->stats.block_start,
rq_clock(rq_of(cfs_rq)));
}
}
@ -1081,11 +1038,12 @@ struct numa_group {
unsigned long total_faults;
unsigned long max_faults_cpu;
/*
* faults[] array is split into two regions: faults_mem and faults_cpu.
*
* Faults_cpu is used to decide whether memory should move
* towards the CPU. As a consequence, these stats are weighted
* more by CPU use than by memory faults.
*/
unsigned long *faults_cpu;
unsigned long faults[];
};
@ -1259,8 +1217,8 @@ static inline unsigned long group_faults(struct task_struct *p, int nid)
static inline unsigned long group_faults_cpu(struct numa_group *group, int nid)
{
return group->faults_cpu[task_faults_idx(NUMA_MEM, nid, 0)] +
group->faults_cpu[task_faults_idx(NUMA_MEM, nid, 1)];
return group->faults[task_faults_idx(NUMA_CPU, nid, 0)] +
group->faults[task_faults_idx(NUMA_CPU, nid, 1)];
}
static inline unsigned long group_faults_priv(struct numa_group *ng)
@ -2116,7 +2074,7 @@ static void numa_migrate_preferred(struct task_struct *p)
}
/*
* Find out how many nodes on the workload is actively running on. Do this by
* Find out how many nodes the workload is actively running on. Do this by
* tracking the nodes from which NUMA hinting faults are triggered. This can
* be different from the set of nodes where the workload's memory is currently
* located.
@ -2170,7 +2128,7 @@ static void update_task_scan_period(struct task_struct *p,
/*
* If there were no record hinting faults then either the task is
* completely idle or all activity is areas that are not of interest
* completely idle or all activity is in areas that are not of interest
* to automatic numa balancing. Related to that, if there were failed
* migration then it implies we are migrating too quickly or the local
* node is overloaded. In either case, scan slower
@ -2427,7 +2385,7 @@ static void task_numa_placement(struct task_struct *p)
* is at the beginning of the numa_faults array.
*/
ng->faults[mem_idx] += diff;
ng->faults_cpu[mem_idx] += f_diff;
ng->faults[cpu_idx] += f_diff;
ng->total_faults += diff;
group_faults += ng->faults[mem_idx];
}
@ -2481,7 +2439,8 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
if (unlikely(!deref_curr_numa_group(p))) {
unsigned int size = sizeof(struct numa_group) +
4*nr_node_ids*sizeof(unsigned long);
NR_NUMA_HINT_FAULT_STATS *
nr_node_ids * sizeof(unsigned long);
grp = kzalloc(size, GFP_KERNEL | __GFP_NOWARN);
if (!grp)
@ -2492,9 +2451,6 @@ static void task_numa_group(struct task_struct *p, int cpupid, int flags,
grp->max_faults_cpu = 0;
spin_lock_init(&grp->lock);
grp->gid = p->pid;
/* Second half of the array tracks nids where faults happen */
grp->faults_cpu = grp->faults + NR_NUMA_HINT_FAULT_TYPES *
nr_node_ids;
for (i = 0; i < NR_NUMA_HINT_FAULT_STATS * nr_node_ids; i++)
grp->faults[i] = p->numa_faults[i];
@ -2995,6 +2951,8 @@ account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
}
#endif
cfs_rq->nr_running++;
if (se_is_idle(se))
cfs_rq->idle_nr_running++;
}
static void
@ -3008,6 +2966,8 @@ account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
}
#endif
cfs_rq->nr_running--;
if (se_is_idle(se))
cfs_rq->idle_nr_running--;
}
/*
@ -4207,7 +4167,12 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
/* sleeps up to a single latency don't count. */
if (!initial) {
unsigned long thresh = sysctl_sched_latency;
unsigned long thresh;
if (se_is_idle(se))
thresh = sysctl_sched_min_granularity;
else
thresh = sysctl_sched_latency;
/*
* Halve their sleep time's effect, to allow
@ -4225,26 +4190,6 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
static void check_enqueue_throttle(struct cfs_rq *cfs_rq);
static inline void check_schedstat_required(void)
{
#ifdef CONFIG_SCHEDSTATS
if (schedstat_enabled())
return;
/* Force schedstat enabled if a dependent tracepoint is active */
if (trace_sched_stat_wait_enabled() ||
trace_sched_stat_sleep_enabled() ||
trace_sched_stat_iowait_enabled() ||
trace_sched_stat_blocked_enabled() ||
trace_sched_stat_runtime_enabled()) {
printk_deferred_once("Scheduler tracepoints stat_sleep, stat_iowait, "
"stat_blocked and stat_runtime require the "
"kernel parameter schedstats=enable or "
"kernel.sched_schedstats=1\n");
}
#endif
}
static inline bool cfs_bandwidth_used(void);
/*
@ -4318,7 +4263,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
place_entity(cfs_rq, se, 0);
check_schedstat_required();
update_stats_enqueue(cfs_rq, se, flags);
update_stats_enqueue_fair(cfs_rq, se, flags);
check_spread(cfs_rq, se);
if (!curr)
__enqueue_entity(cfs_rq, se);
@ -4402,7 +4347,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
update_load_avg(cfs_rq, se, UPDATE_TG);
se_update_runnable(se);
update_stats_dequeue(cfs_rq, se, flags);
update_stats_dequeue_fair(cfs_rq, se, flags);
clear_buddies(cfs_rq, se);
@ -4487,7 +4432,7 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
* a CPU. So account for the time it spent waiting on the
* runqueue.
*/
update_stats_wait_end(cfs_rq, se);
update_stats_wait_end_fair(cfs_rq, se);
__dequeue_entity(cfs_rq, se);
update_load_avg(cfs_rq, se, UPDATE_TG);
}
@ -4502,9 +4447,12 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
*/
if (schedstat_enabled() &&
rq_of(cfs_rq)->cfs.load.weight >= 2*se->load.weight) {
schedstat_set(se->statistics.slice_max,
max((u64)schedstat_val(se->statistics.slice_max),
se->sum_exec_runtime - se->prev_sum_exec_runtime));
struct sched_statistics *stats;
stats = __schedstats_from_se(se);
__schedstat_set(stats->slice_max,
max((u64)stats->slice_max,
se->sum_exec_runtime - se->prev_sum_exec_runtime));
}
se->prev_sum_exec_runtime = se->sum_exec_runtime;
@ -4586,7 +4534,7 @@ static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
check_spread(cfs_rq, prev);
if (prev->on_rq) {
update_stats_wait_start(cfs_rq, prev);
update_stats_wait_start_fair(cfs_rq, prev);
/* Put 'current' back into the tree. */
__enqueue_entity(cfs_rq, prev);
/* in !on_rq case, update occurred at dequeue */
@ -4687,11 +4635,20 @@ static inline u64 sched_cfs_bandwidth_slice(void)
*/
void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)
{
s64 runtime;
if (unlikely(cfs_b->quota == RUNTIME_INF))
return;
cfs_b->runtime += cfs_b->quota;
runtime = cfs_b->runtime_snap - cfs_b->runtime;
if (runtime > 0) {
cfs_b->burst_time += runtime;
cfs_b->nr_burst++;
}
cfs_b->runtime = min(cfs_b->runtime, cfs_b->quota + cfs_b->burst);
cfs_b->runtime_snap = cfs_b->runtime;
}
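
__refill_cfs_bandwidth_runtime() now tops the pool up by one quota, records any consumption beyond plain quota from the previous period as burst time, and clamps the result to quota + burst. The standalone sketch below reproduces that arithmetic with a toy struct; the field names follow the patch, while the quota and consumption numbers are made up.

#include <stdio.h>
#include <stdint.h>

/* Toy stand-in for the few cfs_bandwidth fields the refill path touches. */
struct toy_bandwidth {
	int64_t quota;        /* runtime granted per period */
	int64_t burst;        /* extra runtime that may be borrowed */
	int64_t runtime;      /* runtime remaining in the current period */
	int64_t runtime_snap; /* runtime right after the previous refill */
	int64_t burst_time;   /* accumulated time run beyond plain quota */
	int     nr_burst;     /* periods in which burst was used */
};

static int64_t min64(int64_t a, int64_t b) { return a < b ? a : b; }

static void refill(struct toy_bandwidth *b)
{
	int64_t over;

	b->runtime += b->quota;

	/*
	 * runtime_snap - (leftover + quota) > 0 means the previous period
	 * consumed more than one quota, i.e. it dipped into the burst.
	 */
	over = b->runtime_snap - b->runtime;
	if (over > 0) {
		b->burst_time += over;
		b->nr_burst++;
	}

	b->runtime = min64(b->runtime, b->quota + b->burst);
	b->runtime_snap = b->runtime;
}

int main(void)
{
	struct toy_bandwidth b = { .quota = 100, .burst = 50,
				   .runtime = 150, .runtime_snap = 150 };

	b.runtime -= 130;	/* consume 130 > quota, i.e. burst by 30 */
	refill(&b);
	printf("runtime=%lld burst_time=%lld nr_burst=%d\n",
	       (long long)b.runtime, (long long)b.burst_time, b.nr_burst);
	return 0;
}

Running this prints runtime=120 burst_time=30 nr_burst=1: the 30 units consumed above quota show up in the new burst statistics, and the refilled pool never exceeds quota + burst.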
static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
@ -5577,6 +5534,17 @@ static int sched_idle_rq(struct rq *rq)
rq->nr_running);
}
/*
* Returns true if cfs_rq only has SCHED_IDLE entities enqueued. Note the use
* of idle_nr_running, which does not consider idle descendants of normal
* entities.
*/
static bool sched_idle_cfs_rq(struct cfs_rq *cfs_rq)
{
return cfs_rq->nr_running &&
cfs_rq->nr_running == cfs_rq->idle_nr_running;
}
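
account_entity_enqueue()/dequeue() keep the new idle_nr_running counter in lock-step with nr_running, and sched_idle_cfs_rq() is just a comparison of the two. A tiny model of that bookkeeping, with invented struct and helper names:

#include <stdbool.h>
#include <stdio.h>

/* Minimal queue model: count all entities plus the SCHED_IDLE subset. */
struct toy_rq {
	unsigned int nr_running;
	unsigned int idle_nr_running;
};

static void enqueue(struct toy_rq *rq, bool is_idle)
{
	rq->nr_running++;
	if (is_idle)
		rq->idle_nr_running++;
}

static void dequeue(struct toy_rq *rq, bool is_idle)
{
	rq->nr_running--;
	if (is_idle)
		rq->idle_nr_running--;
}

/* True only when the queue is non-empty and every entity is idle-class. */
static bool only_idle(const struct toy_rq *rq)
{
	return rq->nr_running && rq->nr_running == rq->idle_nr_running;
}

int main(void)
{
	struct toy_rq rq = { 0 };

	enqueue(&rq, true);
	printf("only idle: %d\n", only_idle(&rq));	/* 1 */
	enqueue(&rq, false);
	printf("only idle: %d\n", only_idle(&rq));	/* 0 */
	dequeue(&rq, false);
	printf("only idle: %d\n", only_idle(&rq));	/* 1 */
	return 0;
}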
#ifdef CONFIG_SMP
static int sched_idle_cpu(int cpu)
{
@ -5787,6 +5755,7 @@ static struct {
cpumask_var_t idle_cpus_mask;
atomic_t nr_cpus;
int has_blocked; /* Idle CPUS has blocked load */
int needs_update; /* Newly idle CPUs need their next_balance collated */
unsigned long next_balance; /* in jiffy units */
unsigned long next_blocked; /* Next update of blocked load in jiffies */
} nohz ____cacheline_aligned;
@ -5997,12 +5966,12 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
if (sched_feat(WA_WEIGHT) && target == nr_cpumask_bits)
target = wake_affine_weight(sd, p, this_cpu, prev_cpu, sync);
schedstat_inc(p->se.statistics.nr_wakeups_affine_attempts);
schedstat_inc(p->stats.nr_wakeups_affine_attempts);
if (target == nr_cpumask_bits)
return prev_cpu;
schedstat_inc(sd->ttwu_move_affine);
schedstat_inc(p->se.statistics.nr_wakeups_affine);
schedstat_inc(p->stats.nr_wakeups_affine);
return target;
}
@ -6443,11 +6412,6 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
(available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cpu)) &&
cpumask_test_cpu(p->recent_used_cpu, p->cpus_ptr) &&
asym_fits_capacity(task_util, recent_used_cpu)) {
/*
* Replace recent_used_cpu with prev as it is a potential
* candidate for the next wake:
*/
p->recent_used_cpu = prev;
return recent_used_cpu;
}
@ -7806,7 +7770,7 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
if (!cpumask_test_cpu(env->dst_cpu, p->cpus_ptr)) {
int cpu;
schedstat_inc(p->se.statistics.nr_failed_migrations_affine);
schedstat_inc(p->stats.nr_failed_migrations_affine);
env->flags |= LBF_SOME_PINNED;
@ -7840,7 +7804,7 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
env->flags &= ~LBF_ALL_PINNED;
if (task_running(env->src_rq, p)) {
schedstat_inc(p->se.statistics.nr_failed_migrations_running);
schedstat_inc(p->stats.nr_failed_migrations_running);
return 0;
}
@ -7862,12 +7826,12 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
env->sd->nr_balance_failed > env->sd->cache_nice_tries) {
if (tsk_cache_hot == 1) {
schedstat_inc(env->sd->lb_hot_gained[env->idle]);
schedstat_inc(p->se.statistics.nr_forced_migrations);
schedstat_inc(p->stats.nr_forced_migrations);
}
return 1;
}
schedstat_inc(p->se.statistics.nr_failed_migrations_hot);
schedstat_inc(p->stats.nr_failed_migrations_hot);
return 0;
}
@ -8601,6 +8565,99 @@ group_type group_classify(unsigned int imbalance_pct,
return group_has_spare;
}
/**
* asym_smt_can_pull_tasks - Check whether the load balancing CPU can pull tasks
* @dst_cpu: Destination CPU of the load balancing
* @sds: Load-balancing data with statistics of the local group
* @sgs: Load-balancing statistics of the candidate busiest group
* @sg: The candidate busiest group
*
* Check the state of the SMT siblings of both @sds::local and @sg and decide
* if @dst_cpu can pull tasks.
*
* If @dst_cpu does not have SMT siblings, it can pull tasks if two or more of
* the SMT siblings of @sg are busy. If only one CPU in @sg is busy, pull tasks
* only if @dst_cpu has higher priority.
*
* If both @dst_cpu and @sg have SMT siblings, and @sg has exactly one more
* busy CPU than @sds::local, let @dst_cpu pull tasks if it has higher priority.
* Bigger imbalances in the number of busy CPUs will be dealt with in
* update_sd_pick_busiest().
*
* If @sg does not have SMT siblings, only pull tasks if all of the SMT siblings
* of @dst_cpu are idle and @sg has lower priority.
*/
static bool asym_smt_can_pull_tasks(int dst_cpu, struct sd_lb_stats *sds,
struct sg_lb_stats *sgs,
struct sched_group *sg)
{
#ifdef CONFIG_SCHED_SMT
bool local_is_smt, sg_is_smt;
int sg_busy_cpus;
local_is_smt = sds->local->flags & SD_SHARE_CPUCAPACITY;
sg_is_smt = sg->flags & SD_SHARE_CPUCAPACITY;
sg_busy_cpus = sgs->group_weight - sgs->idle_cpus;
if (!local_is_smt) {
/*
* If we are here, @dst_cpu is idle and does not have SMT
* siblings. Pull tasks if candidate group has two or more
* busy CPUs.
*/
if (sg_busy_cpus >= 2) /* implies sg_is_smt */
return true;
/*
* @dst_cpu does not have SMT siblings. @sg may have SMT
* siblings and only one is busy. In such case, @dst_cpu
* can help if it has higher priority and is idle (i.e.,
* it has no running tasks).
*/
return sched_asym_prefer(dst_cpu, sg->asym_prefer_cpu);
}
/* @dst_cpu has SMT siblings. */
if (sg_is_smt) {
int local_busy_cpus = sds->local->group_weight -
sds->local_stat.idle_cpus;
int busy_cpus_delta = sg_busy_cpus - local_busy_cpus;
if (busy_cpus_delta == 1)
return sched_asym_prefer(dst_cpu, sg->asym_prefer_cpu);
return false;
}
/*
* @sg does not have SMT siblings. Ensure that @sds::local does not end
* up with more than one busy SMT sibling and only pull tasks if there
* are not busy CPUs (i.e., no CPU has running tasks).
*/
if (!sds->local_stat.sum_nr_running)
return sched_asym_prefer(dst_cpu, sg->asym_prefer_cpu);
return false;
#else
/* Always return false so that callers deal with non-SMT cases. */
return false;
#endif
}
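
asym_smt_can_pull_tasks() reduces to a small decision table over four inputs: whether the local and candidate groups have SMT siblings, how many CPUs on each side are busy, and which side wins on priority. The userspace restatement below mirrors that table; the priority comparison is a stub and all names are illustrative, not the kernel's.

#include <stdbool.h>
#include <stdio.h>

/* Stand-in for sched_asym_prefer(): true when 'a' outranks 'b'. */
static bool prefer(int prio_a, int prio_b)
{
	return prio_a > prio_b;
}

/* Decide whether an idle destination CPU may pull from a candidate group. */
static bool can_pull(bool local_is_smt, bool sg_is_smt,
		     int local_busy, int sg_busy,
		     int dst_prio, int sg_prio)
{
	if (!local_is_smt) {
		/*
		 * Non-SMT destination: pull whenever the candidate has two or
		 * more busy siblings, otherwise only on higher priority.
		 */
		if (sg_busy >= 2)
			return true;
		return prefer(dst_prio, sg_prio);
	}

	if (sg_is_smt) {
		/*
		 * Both sides are SMT: act only on a one-CPU imbalance; larger
		 * imbalances are left to the regular busiest-group selection.
		 */
		if (sg_busy - local_busy == 1)
			return prefer(dst_prio, sg_prio);
		return false;
	}

	/*
	 * SMT destination, non-SMT candidate: pull only if every local
	 * sibling is idle and the destination still wins on priority.
	 */
	if (local_busy == 0)
		return prefer(dst_prio, sg_prio);
	return false;
}

int main(void)
{
	/* Non-SMT dst vs. SMT group with two busy siblings: always pull. */
	printf("%d\n", can_pull(false, true, 0, 2, 1, 5));	/* 1 */
	/* SMT vs. SMT, one extra busy CPU on the candidate, dst outranks it. */
	printf("%d\n", can_pull(true, true, 1, 2, 7, 5));	/* 1 */
	/* SMT dst with a busy sibling vs. non-SMT group: never pull. */
	printf("%d\n", can_pull(true, false, 1, 1, 7, 5));	/* 0 */
	return 0;
}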
static inline bool
sched_asym(struct lb_env *env, struct sd_lb_stats *sds, struct sg_lb_stats *sgs,
struct sched_group *group)
{
/* Only do SMT checks if either local or candidate have SMT siblings */
if ((sds->local->flags & SD_SHARE_CPUCAPACITY) ||
(group->flags & SD_SHARE_CPUCAPACITY))
return asym_smt_can_pull_tasks(env->dst_cpu, sds, sgs, group);
return sched_asym_prefer(env->dst_cpu, group->asym_prefer_cpu);
}
/**
* update_sg_lb_stats - Update sched_group's statistics for load balancing.
* @env: The load balancing environment.
@ -8609,6 +8666,7 @@ group_type group_classify(unsigned int imbalance_pct,
* @sg_status: Holds flag indicating the status of the sched_group
*/
static inline void update_sg_lb_stats(struct lb_env *env,
struct sd_lb_stats *sds,
struct sched_group *group,
struct sg_lb_stats *sgs,
int *sg_status)
@ -8617,7 +8675,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
memset(sgs, 0, sizeof(*sgs));
local_group = cpumask_test_cpu(env->dst_cpu, sched_group_span(group));
local_group = group == sds->local;
for_each_cpu_and(i, sched_group_span(group), env->cpus) {
struct rq *rq = cpu_rq(i);
@ -8660,18 +8718,17 @@ static inline void update_sg_lb_stats(struct lb_env *env,
}
}
/* Check if dst CPU is idle and preferred to this group */
if (env->sd->flags & SD_ASYM_PACKING &&
env->idle != CPU_NOT_IDLE &&
sgs->sum_h_nr_running &&
sched_asym_prefer(env->dst_cpu, group->asym_prefer_cpu)) {
sgs->group_asym_packing = 1;
}
sgs->group_capacity = group->sgc->capacity;
sgs->group_weight = group->group_weight;
/* Check if dst CPU is idle and preferred to this group */
if (!local_group && env->sd->flags & SD_ASYM_PACKING &&
env->idle != CPU_NOT_IDLE && sgs->sum_h_nr_running &&
sched_asym(env, sds, sgs, group)) {
sgs->group_asym_packing = 1;
}
sgs->group_type = group_classify(env->sd->imbalance_pct, group, sgs);
/* Computing avg_load makes sense only when group is overloaded */
@ -9180,7 +9237,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
update_group_capacity(env->sd, env->dst_cpu);
}
update_sg_lb_stats(env, sg, sgs, &sg_status);
update_sg_lb_stats(env, sds, sg, sgs, &sg_status);
if (local_group)
goto next_group;
@ -9603,6 +9660,12 @@ static struct rq *find_busiest_queue(struct lb_env *env,
nr_running == 1)
continue;
/* Make sure we only pull tasks from a CPU of lower priority */
if ((env->sd->flags & SD_ASYM_PACKING) &&
sched_asym_prefer(i, env->dst_cpu) &&
nr_running == 1)
continue;
switch (env->migration_type) {
case migrate_load:
/*
@ -10176,6 +10239,30 @@ void update_max_interval(void)
max_load_balance_interval = HZ*num_online_cpus()/10;
}
static inline bool update_newidle_cost(struct sched_domain *sd, u64 cost)
{
if (cost > sd->max_newidle_lb_cost) {
/*
* Track max cost of a domain to make sure to not delay the
* next wakeup on the CPU.
*/
sd->max_newidle_lb_cost = cost;
sd->last_decay_max_lb_cost = jiffies;
} else if (time_after(jiffies, sd->last_decay_max_lb_cost + HZ)) {
/*
* Decay the newidle max times by ~1% per second to ensure that
* it is not outdated and the current max cost is actually
* shorter.
*/
sd->max_newidle_lb_cost = (sd->max_newidle_lb_cost * 253) / 256;
sd->last_decay_max_lb_cost = jiffies;
return true;
}
return false;
}
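
update_newidle_cost() either raises the tracked maximum immediately or, at most once per second, decays it by the factor 253/256 (roughly 1%). The quick standalone loop below shows how fast that decay erodes a stale maximum; jiffies and HZ are replaced by a plain seconds counter, and the starting cost is an assumed value.

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint64_t cost = 500000;		/* assumed stale max: 500us, in ns */

	for (int sec = 1; sec <= 60; sec++) {
		cost = cost * 253 / 256;	/* once-per-second decay */
		if (sec % 15 == 0)
			printf("after %2ds: %llu ns\n",
			       sec, (unsigned long long)cost);
	}
	return 0;
}

After about a minute the stale value has dropped to roughly half, so an outdated max_newidle_lb_cost stops suppressing newidle balancing fairly quickly without ever being reset abruptly.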
/*
* It checks each scheduling domain to see if it is due to be balanced,
* and initiates a balancing operation if so.
@ -10199,14 +10286,9 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
for_each_domain(cpu, sd) {
/*
* Decay the newidle max times here because this is a regular
* visit to all the domains. Decay ~1% per second.
* visit to all the domains.
*/
if (time_after(jiffies, sd->next_decay_max_lb_cost)) {
sd->max_newidle_lb_cost =
(sd->max_newidle_lb_cost * 253) / 256;
sd->next_decay_max_lb_cost = jiffies + HZ;
need_decay = 1;
}
need_decay = update_newidle_cost(sd, 0);
max_cost += sd->max_newidle_lb_cost;
/*
@ -10375,7 +10457,7 @@ static void nohz_balancer_kick(struct rq *rq)
goto out;
if (rq->nr_running >= 2) {
flags = NOHZ_KICK_MASK;
flags = NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
goto out;
}
@ -10389,7 +10471,7 @@ static void nohz_balancer_kick(struct rq *rq)
* on.
*/
if (rq->cfs.h_nr_running >= 1 && check_cpu_capacity(rq, sd)) {
flags = NOHZ_KICK_MASK;
flags = NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
goto unlock;
}
}
@ -10403,7 +10485,7 @@ static void nohz_balancer_kick(struct rq *rq)
*/
for_each_cpu_and(i, sched_domain_span(sd), nohz.idle_cpus_mask) {
if (sched_asym_prefer(i, cpu)) {
flags = NOHZ_KICK_MASK;
flags = NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
goto unlock;
}
}
@ -10416,7 +10498,7 @@ static void nohz_balancer_kick(struct rq *rq)
* to run the misfit task on.
*/
if (check_misfit_status(rq, sd)) {
flags = NOHZ_KICK_MASK;
flags = NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
goto unlock;
}
@ -10443,13 +10525,16 @@ static void nohz_balancer_kick(struct rq *rq)
*/
nr_busy = atomic_read(&sds->nr_busy_cpus);
if (nr_busy > 1) {
flags = NOHZ_KICK_MASK;
flags = NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
goto unlock;
}
}
unlock:
rcu_read_unlock();
out:
if (READ_ONCE(nohz.needs_update))
flags |= NOHZ_NEXT_KICK;
if (flags)
kick_ilb(flags);
}
@ -10546,12 +10631,13 @@ void nohz_balance_enter_idle(int cpu)
/*
* Ensures that if nohz_idle_balance() fails to observe our
* @idle_cpus_mask store, it must observe the @has_blocked
* store.
* and @needs_update stores.
*/
smp_mb__after_atomic();
set_cpu_sd_state_idle(cpu);
WRITE_ONCE(nohz.needs_update, 1);
out:
/*
* Each time a cpu enter idle, we assume that it has blocked load and
@ -10600,12 +10686,17 @@ static void _nohz_idle_balance(struct rq *this_rq, unsigned int flags,
/*
* We assume there will be no idle load after this update and clear
* the has_blocked flag. If a cpu enters idle in the mean time, it will
* set the has_blocked flag and trig another update of idle load.
* set the has_blocked flag and trigger another update of idle load.
* Because a cpu that becomes idle, is added to idle_cpus_mask before
* setting the flag, we are sure to not clear the state and not
* check the load of an idle cpu.
*
* Same applies to idle_cpus_mask vs needs_update.
*/
WRITE_ONCE(nohz.has_blocked, 0);
if (flags & NOHZ_STATS_KICK)
WRITE_ONCE(nohz.has_blocked, 0);
if (flags & NOHZ_NEXT_KICK)
WRITE_ONCE(nohz.needs_update, 0);
/*
* Ensures that if we miss the CPU, we must see the has_blocked
@ -10627,13 +10718,17 @@ static void _nohz_idle_balance(struct rq *this_rq, unsigned int flags,
* balancing owner will pick it up.
*/
if (need_resched()) {
has_blocked_load = true;
if (flags & NOHZ_STATS_KICK)
has_blocked_load = true;
if (flags & NOHZ_NEXT_KICK)
WRITE_ONCE(nohz.needs_update, 1);
goto abort;
}
rq = cpu_rq(balance_cpu);
has_blocked_load |= update_nohz_stats(rq);
if (flags & NOHZ_STATS_KICK)
has_blocked_load |= update_nohz_stats(rq);
/*
* If time for next balance is due,
@ -10664,8 +10759,9 @@ static void _nohz_idle_balance(struct rq *this_rq, unsigned int flags,
if (likely(update_next_balance))
nohz.next_balance = next_balance;
WRITE_ONCE(nohz.next_blocked,
now + msecs_to_jiffies(LOAD_AVG_PERIOD));
if (flags & NOHZ_STATS_KICK)
WRITE_ONCE(nohz.next_blocked,
now + msecs_to_jiffies(LOAD_AVG_PERIOD));
abort:
/* There is still blocked load, enable periodic update */
@ -10763,9 +10859,9 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
{
unsigned long next_balance = jiffies + HZ;
int this_cpu = this_rq->cpu;
u64 t0, t1, curr_cost = 0;
struct sched_domain *sd;
int pulled_task = 0;
u64 curr_cost = 0;
update_misfit_status(NULL, this_rq);
@ -10796,47 +10892,49 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
*/
rq_unpin_lock(this_rq, rf);
if (this_rq->avg_idle < sysctl_sched_migration_cost ||
!READ_ONCE(this_rq->rd->overload)) {
rcu_read_lock();
sd = rcu_dereference_check_sched_domain(this_rq->sd);
if (!READ_ONCE(this_rq->rd->overload) ||
(sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
rcu_read_lock();
sd = rcu_dereference_check_sched_domain(this_rq->sd);
if (sd)
update_next_balance(sd, &next_balance);
rcu_read_unlock();
goto out;
}
rcu_read_unlock();
raw_spin_rq_unlock(this_rq);
t0 = sched_clock_cpu(this_cpu);
update_blocked_averages(this_cpu);
rcu_read_lock();
for_each_domain(this_cpu, sd) {
int continue_balancing = 1;
u64 t0, domain_cost;
u64 domain_cost;
if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost) {
update_next_balance(sd, &next_balance);
update_next_balance(sd, &next_balance);
if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost)
break;
}
if (sd->flags & SD_BALANCE_NEWIDLE) {
t0 = sched_clock_cpu(this_cpu);
pulled_task = load_balance(this_cpu, this_rq,
sd, CPU_NEWLY_IDLE,
&continue_balancing);
domain_cost = sched_clock_cpu(this_cpu) - t0;
if (domain_cost > sd->max_newidle_lb_cost)
sd->max_newidle_lb_cost = domain_cost;
t1 = sched_clock_cpu(this_cpu);
domain_cost = t1 - t0;
update_newidle_cost(sd, domain_cost);
curr_cost += domain_cost;
t0 = t1;
}
update_next_balance(sd, &next_balance);
/*
* Stop searching for tasks to pull if there are
* now runnable tasks on this rq.
@ -11394,7 +11492,7 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
if (!cfs_rq)
goto err;
se = kzalloc_node(sizeof(struct sched_entity),
se = kzalloc_node(sizeof(struct sched_entity_stats),
GFP_KERNEL, cpu_to_node(i));
if (!se)
goto err_free_rq;
@ -11560,7 +11658,7 @@ int sched_group_set_idle(struct task_group *tg, long idle)
for_each_possible_cpu(i) {
struct rq *rq = cpu_rq(i);
struct sched_entity *se = tg->se[i];
struct cfs_rq *grp_cfs_rq = tg->cfs_rq[i];
struct cfs_rq *parent_cfs_rq, *grp_cfs_rq = tg->cfs_rq[i];
bool was_idle = cfs_rq_is_idle(grp_cfs_rq);
long idle_task_delta;
struct rq_flags rf;
@ -11571,6 +11669,14 @@ int sched_group_set_idle(struct task_group *tg, long idle)
if (WARN_ON_ONCE(was_idle == cfs_rq_is_idle(grp_cfs_rq)))
goto next_cpu;
if (se->on_rq) {
parent_cfs_rq = cfs_rq_of(se);
if (cfs_rq_is_idle(grp_cfs_rq))
parent_cfs_rq->idle_nr_running++;
else
parent_cfs_rq->idle_nr_running--;
}
idle_task_delta = grp_cfs_rq->h_nr_running -
grp_cfs_rq->idle_h_nr_running;
if (!cfs_rq_is_idle(grp_cfs_rq))



@ -46,11 +46,16 @@ SCHED_FEAT(DOUBLE_TICK, false)
*/
SCHED_FEAT(NONTASK_CAPACITY, true)
#ifdef CONFIG_PREEMPT_RT
SCHED_FEAT(TTWU_QUEUE, false)
#else
/*
* Queue remote wakeups on the target CPU and process them
* using the scheduler IPI. Reduces rq->lock contention/bounces.
*/
SCHED_FEAT(TTWU_QUEUE, true)
#endif
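
TTWU_QUEUE now defaults to off under CONFIG_PREEMPT_RT. SCHED_FEAT() itself is an X-macro: features.h is included several times with different definitions of the macro to generate the feature enum, the name table and the default values. Below is a reduced, self-contained sketch of that pattern; the list macro, names and single-file layout are inventions for illustration, the kernel spreads this across kernel/sched/features.h and its users.

#include <stdbool.h>
#include <stdio.h>

/*
 * One list of (name, default) pairs, expanded three times below: once for an
 * enum of feature indices, once for the defaults, once for printable names.
 */
#define TOY_FEATURES(X)			\
	X(GENTLE_FAIR_SLEEPERS, true)	\
	X(TTWU_QUEUE,           true)	\
	X(DOUBLE_TICK,          false)

#define AS_ENUM(name, dflt)	TOY_FEAT_##name,
enum toy_feature { TOY_FEATURES(AS_ENUM) TOY_NR_FEATURES };

#define AS_DEFAULT(name, dflt)	[TOY_FEAT_##name] = dflt,
static const bool toy_feat_default[TOY_NR_FEATURES] = {
	TOY_FEATURES(AS_DEFAULT)
};

#define AS_NAME(name, dflt)	[TOY_FEAT_##name] = #name,
static const char *toy_feat_name[TOY_NR_FEATURES] = {
	TOY_FEATURES(AS_NAME)
};

int main(void)
{
	for (int i = 0; i < TOY_NR_FEATURES; i++)
		printf("%-22s default=%d\n",
		       toy_feat_name[i], toy_feat_default[i]);
	return 0;
}

Keeping the default next to the name in the list is what lets a single #ifdef around one SCHED_FEAT() line, as in the hunk above, flip a feature's default per configuration.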
/*
* When doing wakeups, attempt to limit superfluous scans of the LLC domain.


@ -1009,8 +1009,10 @@ static void update_curr_rt(struct rq *rq)
if (unlikely((s64)delta_exec <= 0))
return;
schedstat_set(curr->se.statistics.exec_max,
max(curr->se.statistics.exec_max, delta_exec));
schedstat_set(curr->stats.exec_max,
max(curr->stats.exec_max, delta_exec));
trace_sched_stat_runtime(curr, delta_exec, 0);
curr->se.sum_exec_runtime += delta_exec;
account_group_exec_runtime(curr, delta_exec);
@ -1271,6 +1273,112 @@ static void __delist_rt_entity(struct sched_rt_entity *rt_se, struct rt_prio_arr
rt_se->on_list = 0;
}
static inline struct sched_statistics *
__schedstats_from_rt_se(struct sched_rt_entity *rt_se)
{
#ifdef CONFIG_RT_GROUP_SCHED
/* schedstats is not supported for rt group. */
if (!rt_entity_is_task(rt_se))
return NULL;
#endif
return &rt_task_of(rt_se)->stats;
}
static inline void
update_stats_wait_start_rt(struct rt_rq *rt_rq, struct sched_rt_entity *rt_se)
{
struct sched_statistics *stats;
struct task_struct *p = NULL;
if (!schedstat_enabled())
return;
if (rt_entity_is_task(rt_se))
p = rt_task_of(rt_se);
stats = __schedstats_from_rt_se(rt_se);
if (!stats)
return;
__update_stats_wait_start(rq_of_rt_rq(rt_rq), p, stats);
}
static inline void
update_stats_enqueue_sleeper_rt(struct rt_rq *rt_rq, struct sched_rt_entity *rt_se)
{
struct sched_statistics *stats;
struct task_struct *p = NULL;
if (!schedstat_enabled())
return;
if (rt_entity_is_task(rt_se))
p = rt_task_of(rt_se);
stats = __schedstats_from_rt_se(rt_se);
if (!stats)
return;
__update_stats_enqueue_sleeper(rq_of_rt_rq(rt_rq), p, stats);
}
static inline void
update_stats_enqueue_rt(struct rt_rq *rt_rq, struct sched_rt_entity *rt_se,
int flags)
{
if (!schedstat_enabled())
return;
if (flags & ENQUEUE_WAKEUP)
update_stats_enqueue_sleeper_rt(rt_rq, rt_se);
}
static inline void
update_stats_wait_end_rt(struct rt_rq *rt_rq, struct sched_rt_entity *rt_se)
{
struct sched_statistics *stats;
struct task_struct *p = NULL;
if (!schedstat_enabled())
return;
if (rt_entity_is_task(rt_se))
p = rt_task_of(rt_se);
stats = __schedstats_from_rt_se(rt_se);
if (!stats)
return;
__update_stats_wait_end(rq_of_rt_rq(rt_rq), p, stats);
}
static inline void
update_stats_dequeue_rt(struct rt_rq *rt_rq, struct sched_rt_entity *rt_se,
int flags)
{
struct task_struct *p = NULL;
if (!schedstat_enabled())
return;
if (rt_entity_is_task(rt_se))
p = rt_task_of(rt_se);
if ((flags & DEQUEUE_SLEEP) && p) {
unsigned int state;
state = READ_ONCE(p->__state);
if (state & TASK_INTERRUPTIBLE)
__schedstat_set(p->stats.sleep_start,
rq_clock(rq_of_rt_rq(rt_rq)));
if (state & TASK_UNINTERRUPTIBLE)
__schedstat_set(p->stats.block_start,
rq_clock(rq_of_rt_rq(rt_rq)));
}
}
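
The RT-class wrappers above all share one shape: return early when schedstats is disabled, resolve a sched_statistics pointer only for real tasks (group RT entities have none), then delegate to the shared __update_stats_*() helpers. A compressed userspace sketch of that guard-then-delegate shape follows; the boolean switch stands in for the schedstat static key and every name here is invented.

#include <stdbool.h>
#include <stdio.h>

struct toy_stats { unsigned long wait_count; };

struct toy_task {
	const char	*name;
	struct toy_stats stats;
};

/* Stand-in for the schedstat static key: a plain runtime switch. */
static bool stats_enabled = true;

/* Shared low-level helper, analogous to __update_stats_wait_start(). */
static void __record_wait(struct toy_stats *stats)
{
	stats->wait_count++;
}

/*
 * Class-level wrapper: guard first, resolve the stats, then delegate.
 * A NULL task models group entities that carry no per-task statistics.
 */
static void record_wait(struct toy_task *task)
{
	if (!stats_enabled)
		return;
	if (!task)
		return;
	__record_wait(&task->stats);
}

int main(void)
{
	struct toy_task t = { .name = "rt-worker" };

	record_wait(&t);
	record_wait(NULL);	/* group entity without stats: skipped */

	stats_enabled = false;
	record_wait(&t);	/* schedstats off: no accounting */

	printf("%s wait_count=%lu\n", t.name, t.stats.wait_count);
	return 0;
}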
static void __enqueue_rt_entity(struct sched_rt_entity *rt_se, unsigned int flags)
{
struct rt_rq *rt_rq = rt_rq_of_se(rt_se);
@ -1344,6 +1452,8 @@ static void enqueue_rt_entity(struct sched_rt_entity *rt_se, unsigned int flags)
{
struct rq *rq = rq_of_rt_se(rt_se);
update_stats_enqueue_rt(rt_rq_of_se(rt_se), rt_se, flags);
dequeue_rt_stack(rt_se, flags);
for_each_sched_rt_entity(rt_se)
__enqueue_rt_entity(rt_se, flags);
@ -1354,6 +1464,8 @@ static void dequeue_rt_entity(struct sched_rt_entity *rt_se, unsigned int flags)
{
struct rq *rq = rq_of_rt_se(rt_se);
update_stats_dequeue_rt(rt_rq_of_se(rt_se), rt_se, flags);
dequeue_rt_stack(rt_se, flags);
for_each_sched_rt_entity(rt_se) {
@ -1376,6 +1488,9 @@ enqueue_task_rt(struct rq *rq, struct task_struct *p, int flags)
if (flags & ENQUEUE_WAKEUP)
rt_se->timeout = 0;
check_schedstat_required();
update_stats_wait_start_rt(rt_rq_of_se(rt_se), rt_se);
enqueue_rt_entity(rt_se, flags);
if (!task_current(rq, p) && p->nr_cpus_allowed > 1)
@ -1576,7 +1691,12 @@ static void check_preempt_curr_rt(struct rq *rq, struct task_struct *p, int flag
static inline void set_next_task_rt(struct rq *rq, struct task_struct *p, bool first)
{
struct sched_rt_entity *rt_se = &p->rt;
struct rt_rq *rt_rq = &rq->rt;
p->se.exec_start = rq_clock_task(rq);
if (on_rt_rq(&p->rt))
update_stats_wait_end_rt(rt_rq, rt_se);
/* The running task is never eligible for pushing */
dequeue_pushable_task(rq, p);
@ -1650,6 +1770,12 @@ static struct task_struct *pick_next_task_rt(struct rq *rq)
static void put_prev_task_rt(struct rq *rq, struct task_struct *p)
{
struct sched_rt_entity *rt_se = &p->rt;
struct rt_rq *rt_rq = &rq->rt;
if (on_rt_rq(&p->rt))
update_stats_wait_start_rt(rt_rq, rt_se);
update_curr_rt(rq);
update_rt_rq_load_avg(rq_clock_pelt(rq), rq, 1);


@ -368,6 +368,7 @@ struct cfs_bandwidth {
u64 quota;
u64 runtime;
u64 burst;
u64 runtime_snap;
s64 hierarchical_quota;
u8 idle;
@ -380,7 +381,9 @@ struct cfs_bandwidth {
/* Statistics: */
int nr_periods;
int nr_throttled;
int nr_burst;
u64 throttled_time;
u64 burst_time;
#endif
};
@ -529,6 +532,7 @@ struct cfs_rq {
struct load_weight load;
unsigned int nr_running;
unsigned int h_nr_running; /* SCHED_{NORMAL,BATCH,IDLE} */
unsigned int idle_nr_running; /* SCHED_IDLE */
unsigned int idle_h_nr_running; /* SCHED_IDLE */
u64 exec_clock;
@ -1253,11 +1257,6 @@ extern void sched_core_dequeue(struct rq *rq, struct task_struct *p);
extern void sched_core_get(void);
extern void sched_core_put(void);
extern unsigned long sched_core_alloc_cookie(void);
extern void sched_core_put_cookie(unsigned long cookie);
extern unsigned long sched_core_get_cookie(unsigned long cookie);
extern unsigned long sched_core_update_cookie(struct task_struct *p, unsigned long cookie);
#else /* !CONFIG_SCHED_CORE */
static inline bool sched_core_enabled(struct rq *rq)
@ -1421,11 +1420,6 @@ static inline struct cfs_rq *group_cfs_rq(struct sched_entity *grp)
extern void update_rq_clock(struct rq *rq);
static inline u64 __rq_clock_broken(struct rq *rq)
{
return READ_ONCE(rq->clock);
}
/*
* rq::clock_update_flags bits
*
@ -1620,14 +1614,6 @@ rq_lock(struct rq *rq, struct rq_flags *rf)
rq_pin_lock(rq, rf);
}
static inline void
rq_relock(struct rq *rq, struct rq_flags *rf)
__acquires(rq->lock)
{
raw_spin_rq_lock(rq);
rq_repin_lock(rq, rf);
}
static inline void
rq_unlock_irqrestore(struct rq *rq, struct rq_flags *rf)
__releases(rq->lock)
@ -1808,6 +1794,7 @@ struct sched_group {
unsigned int group_weight;
struct sched_group_capacity *sgc;
int asym_prefer_cpu; /* CPU of highest priority in group */
int flags;
/*
* The CPUs this group covers.
@ -2401,6 +2388,7 @@ extern const_debug unsigned int sysctl_sched_migration_cost;
#ifdef CONFIG_SCHED_DEBUG
extern unsigned int sysctl_sched_latency;
extern unsigned int sysctl_sched_min_granularity;
extern unsigned int sysctl_sched_idle_min_granularity;
extern unsigned int sysctl_sched_wakeup_granularity;
extern int sysctl_resched_latency_warn_ms;
extern int sysctl_resched_latency_warn_once;
@ -2708,12 +2696,18 @@ extern void cfs_bandwidth_usage_dec(void);
#define NOHZ_BALANCE_KICK_BIT 0
#define NOHZ_STATS_KICK_BIT 1
#define NOHZ_NEWILB_KICK_BIT 2
#define NOHZ_NEXT_KICK_BIT 3
/* Run rebalance_domains() */
#define NOHZ_BALANCE_KICK BIT(NOHZ_BALANCE_KICK_BIT)
/* Update blocked load */
#define NOHZ_STATS_KICK BIT(NOHZ_STATS_KICK_BIT)
/* Update blocked load when entering idle */
#define NOHZ_NEWILB_KICK BIT(NOHZ_NEWILB_KICK_BIT)
/* Update nohz.next_balance */
#define NOHZ_NEXT_KICK BIT(NOHZ_NEXT_KICK_BIT)
#define NOHZ_KICK_MASK (NOHZ_BALANCE_KICK | NOHZ_STATS_KICK)
#define NOHZ_KICK_MASK (NOHZ_BALANCE_KICK | NOHZ_STATS_KICK | NOHZ_NEXT_KICK)
#define nohz_flags(cpu) (&cpu_rq(cpu)->nohz_flags)
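
The new NOHZ_NEXT_KICK bit is folded into NOHZ_KICK_MASK, and nohz_balancer_kick() now builds its kick out of individual bits (NOHZ_STATS_KICK | NOHZ_BALANCE_KICK, plus NOHZ_NEXT_KICK when nohz.needs_update is set) rather than always sending the whole mask. A small sketch of composing and testing such flags; the dispatch function and variable names are invented, only the bit layout mirrors the defines above.

#include <stdio.h>

#define BIT(n)			(1U << (n))

#define KICK_BALANCE		BIT(0)	/* run a full idle rebalance */
#define KICK_STATS		BIT(1)	/* update blocked load */
#define KICK_NEXT		BIT(3)	/* refresh nohz.next_balance */

static void dispatch(unsigned int flags)
{
	if (flags & KICK_STATS)
		printf("update blocked load\n");
	if (flags & KICK_BALANCE)
		printf("run idle balance\n");
	if (flags & KICK_NEXT)
		printf("collate next_balance from newly idle CPUs\n");
}

int main(void)
{
	unsigned int flags = 0;
	int needs_update = 1;	/* pretend nohz.needs_update was observed */

	/* A busy rq asks for both the stats update and the balance pass... */
	flags = KICK_STATS | KICK_BALANCE;

	/* ...and the next_balance refresh piggybacks on whatever kick goes out. */
	if (needs_update)
		flags |= KICK_NEXT;

	dispatch(flags);
	return 0;
}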


@ -4,6 +4,110 @@
*/
#include "sched.h"
void __update_stats_wait_start(struct rq *rq, struct task_struct *p,
struct sched_statistics *stats)
{
u64 wait_start, prev_wait_start;
wait_start = rq_clock(rq);
prev_wait_start = schedstat_val(stats->wait_start);
if (p && likely(wait_start > prev_wait_start))
wait_start -= prev_wait_start;
__schedstat_set(stats->wait_start, wait_start);
}
void __update_stats_wait_end(struct rq *rq, struct task_struct *p,
struct sched_statistics *stats)
{
u64 delta = rq_clock(rq) - schedstat_val(stats->wait_start);
if (p) {
if (task_on_rq_migrating(p)) {
/*
* Preserve migrating task's wait time so wait_start
* time stamp can be adjusted to accumulate wait time
* prior to migration.
*/
__schedstat_set(stats->wait_start, delta);
return;
}
trace_sched_stat_wait(p, delta);
}
__schedstat_set(stats->wait_max,
max(schedstat_val(stats->wait_max), delta));
__schedstat_inc(stats->wait_count);
__schedstat_add(stats->wait_sum, delta);
__schedstat_set(stats->wait_start, 0);
}
void __update_stats_enqueue_sleeper(struct rq *rq, struct task_struct *p,
struct sched_statistics *stats)
{
u64 sleep_start, block_start;
sleep_start = schedstat_val(stats->sleep_start);
block_start = schedstat_val(stats->block_start);
if (sleep_start) {
u64 delta = rq_clock(rq) - sleep_start;
if ((s64)delta < 0)
delta = 0;
if (unlikely(delta > schedstat_val(stats->sleep_max)))
__schedstat_set(stats->sleep_max, delta);
__schedstat_set(stats->sleep_start, 0);
__schedstat_add(stats->sum_sleep_runtime, delta);
if (p) {
account_scheduler_latency(p, delta >> 10, 1);
trace_sched_stat_sleep(p, delta);
}
}
if (block_start) {
u64 delta = rq_clock(rq) - block_start;
if ((s64)delta < 0)
delta = 0;
if (unlikely(delta > schedstat_val(stats->block_max)))
__schedstat_set(stats->block_max, delta);
__schedstat_set(stats->block_start, 0);
__schedstat_add(stats->sum_sleep_runtime, delta);
__schedstat_add(stats->sum_block_runtime, delta);
if (p) {
if (p->in_iowait) {
__schedstat_add(stats->iowait_sum, delta);
__schedstat_inc(stats->iowait_count);
trace_sched_stat_iowait(p, delta);
}
trace_sched_stat_blocked(p, delta);
/*
* Blocking time is in units of nanosecs, so shift by
* 20 to get a milliseconds-range estimation of the
* amount of time that the task spent sleeping:
*/
if (unlikely(prof_on == SLEEP_PROFILING)) {
profile_hits(SLEEP_PROFILING,
(void *)get_wchan(p),
delta >> 20);
}
account_scheduler_latency(p, delta >> 10, 0);
}
}
}
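
The dequeue paths stamp sleep_start/block_start, and this helper later turns those stamps into per-event deltas, maxima and running sums. The compact userspace model below follows the same record-then-account shape; a CLOCK_MONOTONIC read stands in for rq_clock(), and the struct is a loose subset of sched_statistics.

#include <stdio.h>
#include <stdint.h>
#include <time.h>

struct toy_stats {
	uint64_t sleep_start;		/* ns timestamp taken when going to sleep */
	uint64_t sleep_max;		/* longest single sleep observed */
	uint64_t sum_sleep_runtime;	/* total time spent sleeping */
};

static uint64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
}

/* Called when the task goes to sleep. */
static void mark_sleep_start(struct toy_stats *st)
{
	st->sleep_start = now_ns();
}

/* Called when the task is enqueued again after a wakeup. */
static void account_sleep(struct toy_stats *st)
{
	uint64_t delta;

	if (!st->sleep_start)
		return;

	delta = now_ns() - st->sleep_start;
	if (delta > st->sleep_max)
		st->sleep_max = delta;
	st->sum_sleep_runtime += delta;
	st->sleep_start = 0;		/* consumed: avoid double accounting */
}

int main(void)
{
	struct toy_stats st = { 0 };
	struct timespec nap = { .tv_sec = 0, .tv_nsec = 50 * 1000 * 1000 };

	mark_sleep_start(&st);
	nanosleep(&nap, NULL);		/* the "task" blocks for ~50ms */
	account_sleep(&st);

	printf("sleep_max=%llu ns, total=%llu ns\n",
	       (unsigned long long)st.sleep_max,
	       (unsigned long long)st.sum_sleep_runtime);
	return 0;
}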
/*
* Current schedstat API version.
*


@ -2,6 +2,8 @@
#ifdef CONFIG_SCHEDSTATS
extern struct static_key_false sched_schedstats;
/*
* Expects runqueue lock to be held for atomicity of update
*/
@ -40,7 +42,31 @@ rq_sched_info_dequeue(struct rq *rq, unsigned long long delta)
#define schedstat_val(var) (var)
#define schedstat_val_or_zero(var) ((schedstat_enabled()) ? (var) : 0)
void __update_stats_wait_start(struct rq *rq, struct task_struct *p,
struct sched_statistics *stats);
void __update_stats_wait_end(struct rq *rq, struct task_struct *p,
struct sched_statistics *stats);
void __update_stats_enqueue_sleeper(struct rq *rq, struct task_struct *p,
struct sched_statistics *stats);
static inline void
check_schedstat_required(void)
{
if (schedstat_enabled())
return;
/* Force schedstat enabled if a dependent tracepoint is active */
if (trace_sched_stat_wait_enabled() ||
trace_sched_stat_sleep_enabled() ||
trace_sched_stat_iowait_enabled() ||
trace_sched_stat_blocked_enabled() ||
trace_sched_stat_runtime_enabled())
printk_deferred_once("Scheduler tracepoints stat_sleep, stat_iowait, stat_blocked and stat_runtime require the kernel parameter schedstats=enable or kernel.sched_schedstats=1\n");
}
#else /* !CONFIG_SCHEDSTATS: */
static inline void rq_sched_info_arrive (struct rq *rq, unsigned long long delta) { }
static inline void rq_sched_info_dequeue(struct rq *rq, unsigned long long delta) { }
static inline void rq_sched_info_depart (struct rq *rq, unsigned long long delta) { }
@ -53,8 +79,31 @@ static inline void rq_sched_info_depart (struct rq *rq, unsigned long long delt
# define schedstat_set(var, val) do { } while (0)
# define schedstat_val(var) 0
# define schedstat_val_or_zero(var) 0
# define __update_stats_wait_start(rq, p, stats) do { } while (0)
# define __update_stats_wait_end(rq, p, stats) do { } while (0)
# define __update_stats_enqueue_sleeper(rq, p, stats) do { } while (0)
# define check_schedstat_required() do { } while (0)
#endif /* CONFIG_SCHEDSTATS */
#ifdef CONFIG_FAIR_GROUP_SCHED
struct sched_entity_stats {
struct sched_entity se;
struct sched_statistics stats;
} __no_randomize_layout;
#endif
static inline struct sched_statistics *
__schedstats_from_se(struct sched_entity *se)
{
#ifdef CONFIG_FAIR_GROUP_SCHED
if (!entity_is_task(se))
return &container_of(se, struct sched_entity_stats, se)->stats;
#endif
return &task_of(se)->stats;
}
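
Group scheduling entities are now allocated as struct sched_entity_stats (see the kzalloc_node() change in alloc_fair_sched_group() above), so __schedstats_from_se() can use container_of() to hop from the embedded entity to its neighbouring statistics, while plain tasks keep theirs in task_struct. A standalone illustration of that container_of() step, with stand-in struct names and a simplified macro:

#include <stddef.h>
#include <stdio.h>

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct toy_stats  { unsigned long wait_sum; };
struct toy_entity { int weight; };

/*
 * Group entities are allocated as this wrapper so the stats sit right next
 * to the entity without growing struct toy_entity for every user.
 */
struct toy_entity_stats {
	struct toy_entity se;
	struct toy_stats  stats;
};

static struct toy_stats *stats_from_entity(struct toy_entity *se)
{
	return &container_of(se, struct toy_entity_stats, se)->stats;
}

int main(void)
{
	struct toy_entity_stats group = { .se = { .weight = 1024 } };
	struct toy_entity *se = &group.se;	/* callers only see the entity */

	stats_from_entity(se)->wait_sum += 42;
	printf("wait_sum=%lu\n", group.stats.wait_sum);
	return 0;
}

The trick only works because the allocation side and the lookup side agree on the wrapper type, which is why alloc_fair_sched_group() switches to sizeof(struct sched_entity_stats) in the same series.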
#ifdef CONFIG_PSI
/*
* PSI tracks state that persists across sleeps, such as iowaits and
