Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RCU updates from Ingo Molnar:
 "The main changes in this cycle were:

   - Dynamic tick (nohz) updates, perhaps most notably changes to force
     the tick on when needed due to lengthy in-kernel execution on CPUs
     on which RCU is waiting.

   - Linux-kernel memory consistency model updates.

   - Replace rcu_swap_protected() with rcu_replace_pointer().

   - Torture-test updates.

   - Documentation updates.

   - Miscellaneous fixes"

* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (51 commits)
  security/safesetid: Replace rcu_swap_protected() with rcu_replace_pointer()
  net/sched: Replace rcu_swap_protected() with rcu_replace_pointer()
  net/netfilter: Replace rcu_swap_protected() with rcu_replace_pointer()
  net/core: Replace rcu_swap_protected() with rcu_replace_pointer()
  bpf/cgroup: Replace rcu_swap_protected() with rcu_replace_pointer()
  fs/afs: Replace rcu_swap_protected() with rcu_replace_pointer()
  drivers/scsi: Replace rcu_swap_protected() with rcu_replace_pointer()
  drm/i915: Replace rcu_swap_protected() with rcu_replace_pointer()
  x86/kvm/pmu: Replace rcu_swap_protected() with rcu_replace_pointer()
  rcu: Upgrade rcu_swap_protected() to rcu_replace_pointer()
  rcu: Suppress levelspread uninitialized messages
  rcu: Fix uninitialized variable in nocb_gp_wait()
  rcu: Update descriptions for rcu_future_grace_period tracepoint
  rcu: Update descriptions for rcu_nocb_wake tracepoint
  rcu: Remove obsolete descriptions for rcu_barrier tracepoint
  rcu: Ensure that ->rcu_urgent_qs is set before resched IPI
  workqueue: Convert for_each_wq to use built-in list check
  rcu: Several rcu_segcblist functions can be static
  rcu: Remove unused function hlist_bl_del_init_rcu()
  Documentation: Rename rcu_node_context_switch() to rcu_note_context_switch()
  ...
This commit is contained in:
commit
1ae78780ed
File diff suppressed because it is too large
1163
Documentation/RCU/Design/Data-Structures/Data-Structures.rst
Normal file
File diff suppressed because it is too large
@@ -1,668 +0,0 @@
|
||||
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
|
||||
"http://www.w3.org/TR/html4/loose.dtd">
|
||||
<html>
|
||||
<head><title>A Tour Through TREE_RCU's Expedited Grace Periods</title>
|
||||
<meta HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
|
||||
|
||||
<h2>Introduction</h2>
|
||||
|
||||
This document describes RCU's expedited grace periods.
|
||||
Unlike RCU's normal grace periods, which accept long latencies to attain
|
||||
high efficiency and minimal disturbance, expedited grace periods accept
|
||||
lower efficiency and significant disturbance to attain shorter latencies.
|
||||
|
||||
<p>
|
||||
There are two flavors of RCU (RCU-preempt and RCU-sched), with an earlier
|
||||
third RCU-bh flavor having been implemented in terms of the other two.
|
||||
Each of the two implementations is covered in its own section.
|
||||
|
||||
<ol>
|
||||
<li> <a href="#Expedited Grace Period Design">
|
||||
Expedited Grace Period Design</a>
|
||||
<li> <a href="#RCU-preempt Expedited Grace Periods">
|
||||
RCU-preempt Expedited Grace Periods</a>
|
||||
<li> <a href="#RCU-sched Expedited Grace Periods">
|
||||
RCU-sched Expedited Grace Periods</a>
|
||||
<li> <a href="#Expedited Grace Period and CPU Hotplug">
|
||||
Expedited Grace Period and CPU Hotplug</a>
|
||||
<li> <a href="#Expedited Grace Period Refinements">
|
||||
Expedited Grace Period Refinements</a>
|
||||
</ol>
|
||||
|
||||
<h2><a name="Expedited Grace Period Design">
|
||||
Expedited Grace Period Design</a></h2>
|
||||
|
||||
<p>
|
||||
The expedited RCU grace periods cannot be accused of being subtle,
|
||||
given that they for all intents and purposes hammer every CPU that
|
||||
has not yet provided a quiescent state for the current expedited
|
||||
grace period.
|
||||
The one saving grace is that the hammer has grown a bit smaller
|
||||
over time: The old call to <tt>try_stop_cpus()</tt> has been
|
||||
replaced with a set of calls to <tt>smp_call_function_single()</tt>,
|
||||
each of which results in an IPI to the target CPU.
|
||||
The corresponding handler function checks the CPU's state, motivating
|
||||
a faster quiescent state where possible, and triggering a report
|
||||
of that quiescent state.
|
||||
As always for RCU, once everything has spent some time in a quiescent
|
||||
state, the expedited grace period has completed.
|
||||
|
||||
<p>
|
||||
The details of the <tt>smp_call_function_single()</tt> handler's
|
||||
operation depend on the RCU flavor, as described in the following
|
||||
sections.
|
||||
|
||||
<h2><a name="RCU-preempt Expedited Grace Periods">
|
||||
RCU-preempt Expedited Grace Periods</a></h2>
|
||||
|
||||
<p>
|
||||
<tt>CONFIG_PREEMPT=y</tt> kernels implement RCU-preempt.
|
||||
The overall flow of the handling of a given CPU by an RCU-preempt
|
||||
expedited grace period is shown in the following diagram:
|
||||
|
||||
<p><img src="ExpRCUFlow.svg" alt="ExpRCUFlow.svg" width="55%">
|
||||
|
||||
<p>
|
||||
The solid arrows denote direct action, for example, a function call.
|
||||
The dotted arrows denote indirect action, for example, an IPI
|
||||
or a state that is reached after some time.
|
||||
|
||||
<p>
|
||||
If a given CPU is offline or idle, <tt>synchronize_rcu_expedited()</tt>
|
||||
will ignore it because idle and offline CPUs are already residing
|
||||
in quiescent states.
|
||||
Otherwise, the expedited grace period will use
|
||||
<tt>smp_call_function_single()</tt> to send the CPU an IPI, which
|
||||
is handled by <tt>rcu_exp_handler()</tt>.
|
||||
|
||||
<p>
|
||||
However, because this is preemptible RCU, <tt>rcu_exp_handler()</tt>
|
||||
can check to see if the CPU is currently running in an RCU read-side
|
||||
critical section.
|
||||
If not, the handler can immediately report a quiescent state.
|
||||
Otherwise, it sets flags so that the outermost <tt>rcu_read_unlock()</tt>
|
||||
invocation will provide the needed quiescent-state report.
|
||||
This flag-setting avoids the previous forced preemption of all
|
||||
CPUs that might have RCU read-side critical sections.
|
||||
In addition, this flag-setting is done so as to avoid increasing
|
||||
the overhead of the common-case fastpath through the scheduler.
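The handler's decision can be illustrated with a small single-threaded C model. This is only a sketch under stated assumptions, not the kernel's code: `struct cpu_model`, `model_exp_handler()`, `model_read_lock()`, and `model_read_unlock()` are hypothetical names, and the model ignores interrupts, preempted readers, and races with idle or offline transitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-CPU model; none of these names are kernel identifiers. */
struct cpu_model {
	int nesting;          /* depth of nested RCU read-side critical sections */
	bool exp_qs_pending;  /* set by the handler: outermost unlock must report */
	bool qs_reported;     /* quiescent state reported for this grace period */
};

/* Models the IPI handler running on the target CPU. */
static void model_exp_handler(struct cpu_model *cpu)
{
	if (cpu->nesting == 0)
		cpu->qs_reported = true;     /* not in a reader: report immediately */
	else
		cpu->exp_qs_pending = true;  /* defer to the outermost unlock */
}

static void model_read_lock(struct cpu_model *cpu)
{
	cpu->nesting++;
}

static void model_read_unlock(struct cpu_model *cpu)
{
	assert(cpu->nesting > 0);            /* unlock must pair with a lock */
	if (--cpu->nesting == 0 && cpu->exp_qs_pending) {
		cpu->exp_qs_pending = false;
		cpu->qs_reported = true;     /* deferred quiescent-state report */
	}
}
```

In this model, a handler that arrives inside a nested reader leaves `qs_reported` clear until the nesting depth returns to zero, which mirrors the flag-setting described above.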
|
||||
|
||||
<p>
|
||||
Again because this is preemptible RCU, an RCU read-side critical section
|
||||
can be preempted.
|
||||
When that happens, RCU will enqueue the task, which will then continue to
|
||||
block the current expedited grace period until it resumes and finds its
|
||||
outermost <tt>rcu_read_unlock()</tt>.
|
||||
The CPU will report a quiescent state just after enqueuing the task because
|
||||
the CPU is no longer blocking the grace period.
|
||||
It is instead the preempted task doing the blocking.
|
||||
The list of blocked tasks is managed by <tt>rcu_preempt_ctxt_queue()</tt>,
|
||||
which is called from <tt>rcu_preempt_note_context_switch()</tt>, which
|
||||
in turn is called from <tt>rcu_note_context_switch()</tt>, which in
|
||||
turn is called from the scheduler.
|
||||
|
||||
<table>
|
||||
<tr><th> </th></tr>
|
||||
<tr><th align="left">Quick Quiz:</th></tr>
|
||||
<tr><td>
|
||||
Why not just have the expedited grace period check the
|
||||
state of all the CPUs?
|
||||
After all, that would avoid all those real-time-unfriendly IPIs.
|
||||
</td></tr>
|
||||
<tr><th align="left">Answer:</th></tr>
|
||||
<tr><td bgcolor="#ffffff"><font color="ffffff">
|
||||
Because we want the RCU read-side critical sections to run fast,
|
||||
which means no memory barriers.
|
||||
Therefore, it is not possible to safely check the state from some
|
||||
other CPU.
|
||||
And even if it were possible to safely check the state, it would
|
||||
still be necessary to IPI the CPU to safely interact with the
|
||||
upcoming <tt>rcu_read_unlock()</tt> invocation, which means that
|
||||
the remote state testing would not help the worst-case
|
||||
latency that real-time applications care about.
|
||||
|
||||
<p><font color="ffffff">One way to prevent your real-time
|
||||
application from getting hit with these IPIs is to
|
||||
build your kernel with <tt>CONFIG_NO_HZ_FULL=y</tt>.
|
||||
RCU would then perceive the CPU running your application
|
||||
as being idle, and it would be able to safely detect that
|
||||
state without needing to IPI the CPU.
|
||||
</font></td></tr>
|
||||
<tr><td> </td></tr>
|
||||
</table>
|
||||
|
||||
<p>
|
||||
Please note that this is just the overall flow:
|
||||
Additional complications can arise due to races with CPUs going idle
|
||||
or offline, among other things.
|
||||
|
||||
<h2><a name="RCU-sched Expedited Grace Periods">
|
||||
RCU-sched Expedited Grace Periods</a></h2>
|
||||
|
||||
<p>
|
||||
<tt>CONFIG_PREEMPT=n</tt> kernels implement RCU-sched.
|
||||
The overall flow of the handling of a given CPU by an RCU-sched
|
||||
expedited grace period is shown in the following diagram:
|
||||
|
||||
<p><img src="ExpSchedFlow.svg" alt="ExpSchedFlow.svg" width="55%">
|
||||
|
||||
<p>
|
||||
As with RCU-preempt, RCU-sched's
|
||||
<tt>synchronize_rcu_expedited()</tt> ignores offline and
|
||||
idle CPUs, again because they are in remotely detectable
|
||||
quiescent states.
|
||||
However, because the
|
||||
<tt>rcu_read_lock_sched()</tt> and <tt>rcu_read_unlock_sched()</tt>
|
||||
leave no trace of their invocation, in general it is not possible to tell
|
||||
whether or not the current CPU is in an RCU read-side critical section.
|
||||
The best that RCU-sched's <tt>rcu_exp_handler()</tt> can do is to check
|
||||
for idle, on the off-chance that the CPU went idle while the IPI
|
||||
was in flight.
|
||||
If the CPU is idle, then <tt>rcu_exp_handler()</tt> reports
|
||||
the quiescent state.
|
||||
|
||||
<p> Otherwise, the handler forces a future context switch by setting the
|
||||
NEED_RESCHED flag in both the current task's thread flags and the CPU's preempt
|
||||
counter.
|
||||
At the time of the context switch, the CPU reports the quiescent state.
|
||||
Should the CPU go offline first, it will report the quiescent state
|
||||
at that time.
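The RCU-sched handler has less information to work with, which a minimal C model makes visible. This is a sketch only: `struct sched_cpu_model`, `model_sched_exp_handler()`, and `model_context_switch()` are hypothetical names, and the CPU-offline path is left out.

```c
#include <stdbool.h>

/* Hypothetical per-CPU model for the RCU-sched case. */
struct sched_cpu_model {
	bool idle;          /* CPU is in the idle loop */
	bool need_resched;  /* a context switch has been requested */
	bool qs_reported;   /* quiescent state reported for this grace period */
};

/* Models the IPI handler: it can only distinguish idle from non-idle. */
static void model_sched_exp_handler(struct sched_cpu_model *cpu)
{
	if (cpu->idle) {
		cpu->qs_reported = true;   /* went idle while the IPI was in flight */
		return;
	}
	cpu->need_resched = true;          /* force a future context switch */
}

/* Models the context switch that the handler asked for. */
static void model_context_switch(struct sched_cpu_model *cpu)
{
	cpu->need_resched = false;
	cpu->qs_reported = true;           /* a context switch is a quiescent state */
}
```

Unlike the RCU-preempt case, the non-idle path never reports from the handler itself, because the handler has no way to tell whether a reader is running.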
|
||||
|
||||
<h2><a name="Expedited Grace Period and CPU Hotplug">
|
||||
Expedited Grace Period and CPU Hotplug</a></h2>
|
||||
|
||||
<p>
|
||||
The expedited nature of expedited grace periods requires a much tighter
|
||||
interaction with CPU hotplug operations than is required for normal
|
||||
grace periods.
|
||||
In addition, attempting to IPI offline CPUs will result in splats, but
|
||||
failing to IPI online CPUs can result in too-short grace periods.
|
||||
Neither option is acceptable in production kernels.
|
||||
|
||||
<p>
|
||||
The interaction between expedited grace periods and CPU hotplug operations
|
||||
is carried out at several levels:
|
||||
|
||||
<ol>
|
||||
<li> The number of CPUs that have ever been online is tracked
|
||||
by the <tt>rcu_state</tt> structure's <tt>->ncpus</tt>
|
||||
field.
|
||||
The <tt>rcu_state</tt> structure's <tt>->ncpus_snap</tt>
|
||||
field tracks the number of CPUs that have ever been online
|
||||
at the beginning of an RCU expedited grace period.
|
||||
Note that this number never decreases, at least in the absence
|
||||
of a time machine.
|
||||
<li> The identities of the CPUs that have ever been online are
|
||||
tracked by the <tt>rcu_node</tt> structure's
|
||||
<tt>->expmaskinitnext</tt> field.
|
||||
The <tt>rcu_node</tt> structure's <tt>->expmaskinit</tt>
|
||||
field tracks the identities of the CPUs that were online
|
||||
at least once at the beginning of the most recent RCU
|
||||
expedited grace period.
|
||||
The <tt>rcu_state</tt> structure's <tt>->ncpus</tt> and
|
||||
<tt>->ncpus_snap</tt> fields are used to detect when
|
||||
new CPUs have come online for the first time, that is,
|
||||
when the <tt>rcu_node</tt> structure's <tt>->expmaskinitnext</tt>
|
||||
field has changed since the beginning of the last RCU
|
||||
expedited grace period, which triggers an update of each
|
||||
<tt>rcu_node</tt> structure's <tt>->expmaskinit</tt>
|
||||
field from its <tt>->expmaskinitnext</tt> field.
|
||||
<li> Each <tt>rcu_node</tt> structure's <tt>->expmaskinit</tt>
|
||||
field is used to initialize that structure's
|
||||
<tt>->expmask</tt> at the beginning of each RCU
|
||||
expedited grace period.
|
||||
This means that only those CPUs that have been online at least
|
||||
once will be considered for a given grace period.
|
||||
<li> Any CPU that goes offline will clear its bit in its leaf
|
||||
<tt>rcu_node</tt> structure's <tt>->qsmaskinitnext</tt>
|
||||
field, so any CPU with that bit clear can safely be ignored.
|
||||
However, it is possible for a CPU coming online or going offline
|
||||
to have this bit set for some time while <tt>cpu_online</tt>
|
||||
returns <tt>false</tt>.
|
||||
<li> For each non-idle CPU that RCU believes is currently online, the grace
|
||||
period invokes <tt>smp_call_function_single()</tt>.
|
||||
If this succeeds, the CPU was fully online.
|
||||
Failure indicates that the CPU is in the process of coming online
|
||||
or going offline, in which case it is necessary to wait for a
|
||||
short time period and try again.
|
||||
The purpose of this wait (or series of waits, as the case may be)
|
||||
is to permit a concurrent CPU-hotplug operation to complete.
|
||||
<li> In the case of RCU-sched, one of the last acts of an outgoing CPU
|
||||
is to invoke <tt>rcu_report_dead()</tt>, which
|
||||
reports a quiescent state for that CPU.
|
||||
However, this is likely paranoia-induced redundancy. <!-- @@@ -->
|
||||
</ol>
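The lazy snapshotting in the first three items of the list above can be sketched in C with a single node standing in for the whole tree. The field names follow the text, but the structures and functions (`struct node_model`, `struct state_model`, `model_cpu_online()`, `model_exp_gp_init()`) are hypothetical, locking is omitted, and CPU numbers are assumed to be smaller than the number of bits in an `unsigned long`.

```c
/* Hypothetical one-node stand-in for the rcu_node tree. */
struct node_model {
	unsigned long expmaskinitnext; /* CPUs that have ever been online */
	unsigned long expmaskinit;     /* snapshot taken at grace-period start */
	unsigned long expmask;         /* CPUs the current grace period waits on */
};

/* Hypothetical stand-in for the rcu_state counters. */
struct state_model {
	int ncpus;       /* number of CPUs that have ever been online */
	int ncpus_snap;  /* value of ncpus when the masks were last copied */
};

/* A CPU comes online; it is counted only the first time. */
static void model_cpu_online(struct state_model *st, struct node_model *np,
			     int cpu)
{
	unsigned long bit = 1UL << cpu;

	if (!(np->expmaskinitnext & bit)) {
		np->expmaskinitnext |= bit;
		st->ncpus++;
	}
}

/* Start of an expedited grace period: copy masks only if something changed. */
static void model_exp_gp_init(struct state_model *st, struct node_model *np)
{
	if (st->ncpus != st->ncpus_snap) {
		st->ncpus_snap = st->ncpus;
		np->expmaskinit = np->expmaskinitnext;
	}
	np->expmask = np->expmaskinit;
}
```

A CPU that comes online in the middle of a grace period changes only `expmaskinitnext`, so the grace period already in flight keeps a stable view and the new CPU is picked up at the next initialization.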
|
||||
|
||||
<table>
|
||||
<tr><th> </th></tr>
|
||||
<tr><th align="left">Quick Quiz:</th></tr>
|
||||
<tr><td>
|
||||
Why all the dancing around with multiple counters and masks
|
||||
tracking CPUs that were once online?
|
||||
Why not just have a single set of masks tracking the currently
|
||||
online CPUs and be done with it?
|
||||
</td></tr>
|
||||
<tr><th align="left">Answer:</th></tr>
|
||||
<tr><td bgcolor="#ffffff"><font color="ffffff">
|
||||
Maintaining a single set of masks tracking the online CPUs <i>sounds</i>
|
||||
easier, at least until you try working out all the race conditions
|
||||
between grace-period initialization and CPU-hotplug operations.
|
||||
For example, suppose initialization is progressing down the
|
||||
tree while a CPU-offline operation is progressing up the tree.
|
||||
This situation can result in bits set at the top of the tree
|
||||
that have no counterparts at the bottom of the tree.
|
||||
Those bits will never be cleared, which will result in
|
||||
grace-period hangs.
|
||||
In short, that way lies madness, to say nothing of a great many
|
||||
bugs, hangs, and deadlocks.
|
||||
|
||||
<p><font color="ffffff">
|
||||
In contrast, the current multi-mask multi-counter scheme ensures
|
||||
that grace-period initialization will always see consistent masks
|
||||
up and down the tree, which brings significant simplifications
|
||||
over the single-mask method.
|
||||
|
||||
<p><font color="ffffff">
|
||||
This is an instance of
|
||||
<a href="http://www.cs.columbia.edu/~library/TR-repository/reports/reports-1992/cucs-039-92.ps.gz"><font color="ffffff">
|
||||
deferring work in order to avoid synchronization</a>.
|
||||
Lazily recording CPU-hotplug events at the beginning of the next
|
||||
grace period greatly simplifies maintenance of the CPU-tracking
|
||||
bitmasks in the <tt>rcu_node</tt> tree.
|
||||
</font></td></tr>
|
||||
<tr><td> </td></tr>
|
||||
</table>
|
||||
|
||||
<h2><a name="Expedited Grace Period Refinements">
|
||||
Expedited Grace Period Refinements</a></h2>
|
||||
|
||||
<ol>
|
||||
<li> <a href="#Idle-CPU Checks">Idle-CPU checks</a>.
|
||||
<li> <a href="#Batching via Sequence Counter">
|
||||
Batching via sequence counter</a>.
|
||||
<li> <a href="#Funnel Locking and Wait/Wakeup">
|
||||
Funnel locking and wait/wakeup</a>.
|
||||
<li> <a href="#Use of Workqueues">Use of Workqueues</a>.
|
||||
<li> <a href="#Stall Warnings">Stall warnings</a>.
|
||||
<li> <a href="#Mid-Boot Operation">Mid-boot operation</a>.
|
||||
</ol>
|
||||
|
||||
<h3><a name="Idle-CPU Checks">Idle-CPU Checks</a></h3>
|
||||
|
||||
<p>
|
||||
Each expedited grace period checks for idle CPUs when initially forming
|
||||
the mask of CPUs to be IPIed and again just before IPIing a CPU
|
||||
(both checks are carried out by <tt>sync_rcu_exp_select_cpus()</tt>).
|
||||
If the CPU is idle at any time between those two times, the CPU will
|
||||
not be IPIed.
|
||||
Instead, the task pushing the grace period forward will include the
|
||||
idle CPUs in the mask passed to <tt>rcu_report_exp_cpu_mult()</tt>.
|
||||
|
||||
<p>
|
||||
For RCU-sched, there is an additional check:
|
||||
If the IPI has interrupted the idle loop, then
|
||||
<tt>rcu_exp_handler()</tt> invokes <tt>rcu_report_exp_rdp()</tt>
|
||||
to report the corresponding quiescent state.
|
||||
|
||||
<p>
|
||||
For RCU-preempt, there is no specific check for idle in the
|
||||
IPI handler (<tt>rcu_exp_handler()</tt>), but because
|
||||
RCU read-side critical sections are not permitted within the
|
||||
idle loop, if <tt>rcu_exp_handler()</tt> sees that the CPU is within
|
||||
an RCU read-side critical section, the CPU cannot possibly be idle.
|
||||
Otherwise, <tt>rcu_exp_handler()</tt> invokes
|
||||
<tt>rcu_report_exp_rdp()</tt> to report the corresponding quiescent
|
||||
state, regardless of whether or not that quiescent state was due to
|
||||
the CPU being idle.
|
||||
|
||||
<p>
|
||||
In summary, RCU expedited grace periods check for idle when building
|
||||
the bitmask of CPUs that must be IPIed, just before sending each IPI,
|
||||
and (either explicitly or implicitly) within the IPI handler.
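The first two checks amount to simple mask arithmetic, sketched here in C. The function name `model_select_cpus()` and the idea of passing the idle CPUs in as a ready-made bitmask are assumptions made for illustration; how idleness is actually detected is not modeled.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical model: bit i of each mask stands for CPU i.
 * Returns the CPUs that still need an IPI.  Idle CPUs are placed in
 * *report_mask so the caller can report their quiescent states in bulk.
 */
static unsigned long model_select_cpus(unsigned long expmask,
				       unsigned long idle_mask,
				       unsigned long *report_mask)
{
	assert(report_mask != NULL);
	*report_mask = expmask & idle_mask;
	return expmask & ~idle_mask;
}
```

Calling the function a second time with a fresh idle mask, just before sending the IPIs, corresponds to the second check: any CPU that went idle in the meantime drops out of the IPI set.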
|
||||
|
||||
<h3><a name="Batching via Sequence Counter">
|
||||
Batching via Sequence Counter</a></h3>
|
||||
|
||||
<p>
|
||||
If each grace-period request were carried out separately, expedited
|
||||
grace periods would have abysmal scalability and
|
||||
problematic high-load characteristics.
|
||||
Because each grace-period operation can serve an unlimited number of
|
||||
updates, it is important to <i>batch</i> requests, so that a single
|
||||
expedited grace-period operation will cover all requests in the
|
||||
corresponding batch.
|
||||
|
||||
<p>
|
||||
This batching is controlled by a sequence counter named
|
||||
<tt>->expedited_sequence</tt> in the <tt>rcu_state</tt> structure.
|
||||
This counter has an odd value when there is an expedited grace period
|
||||
in progress and an even value otherwise, so that dividing the counter
|
||||
value by two gives the number of completed grace periods.
|
||||
During any given update request, the counter must transition from
|
||||
even to odd and then back to even, thus indicating that a grace
|
||||
period has elapsed.
|
||||
Therefore, if the initial value of the counter is <tt>s</tt>,
|
||||
the updater must wait until the counter reaches at least the
|
||||
value <tt>(s+3)&~0x1</tt>.
|
||||
This counter is managed by the following access functions:
|
||||
|
||||
<ol>
|
||||
<li> <tt>rcu_exp_gp_seq_start()</tt>, which marks the start of
|
||||
an expedited grace period.
|
||||
<li> <tt>rcu_exp_gp_seq_end()</tt>, which marks the end of an
|
||||
expedited grace period.
|
||||
<li> <tt>rcu_exp_gp_seq_snap()</tt>, which obtains a snapshot of
|
||||
the counter.
|
||||
<li> <tt>rcu_exp_gp_seq_done()</tt>, which returns <tt>true</tt>
|
||||
if a full expedited grace period has elapsed since the
|
||||
corresponding call to <tt>rcu_exp_gp_seq_snap()</tt>.
|
||||
</ol>
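The counter arithmetic described above can be checked with a few lines of C. This is a single-threaded sketch that follows the text's one-bit "in progress" encoding; the `model_` names are hypothetical, and memory ordering, which matters a great deal in the real implementation, is not modeled.

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical stand-in for ->expedited_sequence; odd means "in progress". */
static unsigned long model_exp_seq;

static void model_seq_start(void)
{
	model_exp_seq++;                 /* even -> odd */
}

static void model_seq_end(void)
{
	model_exp_seq++;                 /* odd -> even */
}

/* Smallest counter value that guarantees a full grace period from now. */
static unsigned long model_seq_snap(void)
{
	return (model_exp_seq + 3) & ~0x1UL;
}

/* True once the counter has reached the snapshot, tolerating wraparound. */
static bool model_seq_done(unsigned long snap)
{
	return model_exp_seq - snap <= ULONG_MAX / 2;
}
```

With the counter at zero the snapshot is two, and with a grace period already in progress (counter at one) the snapshot is four, because the in-progress grace period might have started too early to cover the new request.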
|
||||
|
||||
<p>
|
||||
Again, only one request in a given batch need actually carry out
|
||||
a grace-period operation, which means there must be an efficient
|
||||
way to identify which of many concurrent requests will initiate
|
||||
the grace period, and that there be an efficient way for the
|
||||
remaining requests to wait for that grace period to complete.
|
||||
However, that is the topic of the next section.
|
||||
|
||||
<h3><a name="Funnel Locking and Wait/Wakeup">
|
||||
Funnel Locking and Wait/Wakeup</a></h3>
|
||||
|
||||
<p>
|
||||
The natural way to sort out which of a batch of updaters will initiate
|
||||
the expedited grace period is to use the <tt>rcu_node</tt> combining
|
||||
tree, as implemented by the <tt>exp_funnel_lock()</tt> function.
|
||||
The first updater corresponding to a given grace period arriving
|
||||
at a given <tt>rcu_node</tt> structure records its desired grace-period
|
||||
sequence number in the <tt>->exp_seq_rq</tt> field and moves up
|
||||
to the next level in the tree.
|
||||
Otherwise, if the <tt>->exp_seq_rq</tt> field already contains
|
||||
the sequence number for the desired grace period or some later one,
|
||||
the updater blocks on one of four wait queues in the
|
||||
<tt>->exp_wq[]</tt> array, using the second-from-bottom
|
||||
and third-from-bottom bits as an index.
|
||||
An <tt>->exp_lock</tt> field in the <tt>rcu_node</tt> structure
|
||||
synchronizes access to these fields.
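The walk up the combining tree can be sketched as follows. This is a single-threaded model under stated assumptions: `struct funnel_node`, `seq_ge()`, and `model_funnel_lock()` are hypothetical names, the per-node lock and the actual blocking are omitted, and the function merely returns the node at which the caller would wait.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an rcu_node structure; locking is omitted. */
struct funnel_node {
	struct funnel_node *parent;   /* NULL at the root */
	unsigned long exp_seq_rq;     /* latest grace period requested here */
};

/* Wrap-tolerant "a >= b" for sequence numbers. */
static bool seq_ge(unsigned long a, unsigned long b)
{
	return a - b <= ULONG_MAX / 2;
}

/*
 * Walk from a leaf toward the root on behalf of a request for sequence
 * number s.  Returns NULL if the caller passed the root and must start
 * the grace period, otherwise the node on whose wait queue it would block.
 */
static struct funnel_node *model_funnel_lock(struct funnel_node *leaf,
					     unsigned long s)
{
	struct funnel_node *n;

	assert(leaf != NULL);
	for (n = leaf; n != NULL; n = n->parent) {
		if (seq_ge(n->exp_seq_rq, s))
			return n;          /* already requested: wait here */
		n->exp_seq_rq = s;         /* first arrival: record and climb */
	}
	return NULL;
}
```

With a two-leaf tree, the first request for sequence number two climbs past the root, a second request from the other leaf stops at the root, and later requests stop at their own leaves, so contention is spread across the tree rather than concentrated on one lock.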
|
||||
|
||||
<p>
|
||||
An empty <tt>rcu_node</tt> tree is shown in the following diagram,
|
||||
with the white cells representing the <tt>->exp_seq_rq</tt> field
|
||||
and the red cells representing the elements of the
|
||||
<tt>->exp_wq[]</tt> array.
|
||||
|
||||
<p><img src="Funnel0.svg" alt="Funnel0.svg" width="75%">
|
||||
|
||||
<p>
|
||||
The next diagram shows the situation after the arrival of Task A
|
||||
and Task B at the leftmost and rightmost leaf <tt>rcu_node</tt>
|
||||
structures, respectively.
|
||||
The current value of the <tt>rcu_state</tt> structure's
|
||||
<tt>->expedited_sequence</tt> field is zero, so adding three and
|
||||
clearing the bottom bit results in the value two, which both tasks
|
||||
record in the <tt>->exp_seq_rq</tt> field of their respective
|
||||
<tt>rcu_node</tt> structures:
|
||||
|
||||
<p><img src="Funnel1.svg" alt="Funnel1.svg" width="75%">
|
||||
|
||||
<p>
|
||||
Each of Tasks A and B will move up to the root
|
||||
<tt>rcu_node</tt> structure.
|
||||
Suppose that Task A wins, recording its desired grace-period sequence
|
||||
number and resulting in the state shown below:
|
||||
|
||||
<p><img src="Funnel2.svg" alt="Funnel2.svg" width="75%">
|
||||
|
||||
<p>
|
||||
Task A now advances to initiate a new grace period, while Task B
|
||||
moves up to the root <tt>rcu_node</tt> structure, and, seeing that
|
||||
its desired sequence number is already recorded, blocks on
|
||||
<tt>->exp_wq[1]</tt>.
|
||||
|
||||
<table>
|
||||
<tr><th> </th></tr>
|
||||
<tr><th align="left">Quick Quiz:</th></tr>
|
||||
<tr><td>
|
||||
Why <tt>->exp_wq[1]</tt>?
|
||||
Given that the value of these tasks' desired sequence number is
|
||||
two, shouldn't they instead block on <tt>->exp_wq[2]</tt>?
|
||||
</td></tr>
|
||||
<tr><th align="left">Answer:</th></tr>
|
||||
<tr><td bgcolor="#ffffff"><font color="ffffff">
|
||||
No.
|
||||
|
||||
<p><font color="ffffff">
|
||||
Recall that the bottom bit of the desired sequence number indicates
|
||||
whether or not a grace period is currently in progress.
|
||||
It is therefore necessary to shift the sequence number right one
|
||||
bit position to obtain the number of the grace period.
|
||||
This results in <tt>->exp_wq[1]</tt>.
|
||||
</font></td></tr>
|
||||
<tr><td> </td></tr>
|
||||
</table>
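The index computation from the quiz answer fits in one line of C. The helper name `model_exp_wq_index()` is hypothetical, and the one-bit shift follows this document's description of the counter rather than any particular kernel version.

```c
/*
 * Hypothetical helper: drop the bottom "in progress" bit, then use the
 * next two bits to choose one of the four wait queues.
 */
static unsigned int model_exp_wq_index(unsigned long s)
{
	return (unsigned int)((s >> 1) & 0x3);
}
```

Desired sequence numbers 2, 4, and 6 map to queues 1, 2, and 3, and 8 wraps around to queue 0, so up to four consecutive grace periods use distinct queues.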
|
||||
|
||||
<p>
|
||||
If Tasks C and D also arrive at this point, they will compute the
|
||||
same desired grace-period sequence number, and see that both leaf
|
||||
<tt>rcu_node</tt> structures already have that value recorded.
|
||||
They will therefore block on their respective <tt>rcu_node</tt>
|
||||
structures' <tt>->exp_wq[1]</tt> fields, as shown below:
|
||||
|
||||
<p><img src="Funnel3.svg" alt="Funnel3.svg" width="75%">
|
||||
|
||||
<p>
|
||||
Task A now acquires the <tt>rcu_state</tt> structure's
|
||||
<tt>->exp_mutex</tt> and initiates the grace period, which
|
||||
increments <tt>->expedited_sequence</tt>.
|
||||
Therefore, if Tasks E and F arrive, they will compute
|
||||
a desired sequence number of 4 and will record this value as
|
||||
shown below:
|
||||
|
||||
<p><img src="Funnel4.svg" alt="Funnel4.svg" width="75%">
|
||||
|
||||
<p>
|
||||
Tasks E and F will propagate up the <tt>rcu_node</tt>
|
||||
combining tree, with Task F blocking on the root <tt>rcu_node</tt>
|
||||
structure and Task E waiting for Task A to finish so that
|
||||
it can start the next grace period.
|
||||
The resulting state is as shown below:
|
||||
|
||||
<p><img src="Funnel5.svg" alt="Funnel5.svg" width="75%">
|
||||
|
||||
<p>
|
||||
Once the grace period completes, Task A
|
||||
starts waking up the tasks waiting for this grace period to complete,
|
||||
increments the <tt>->expedited_sequence</tt>,
|
||||
acquires the <tt>->exp_wake_mutex</tt> and then releases the
|
||||
<tt>->exp_mutex</tt>.
|
||||
This results in the following state:
|
||||
|
||||
<p><img src="Funnel6.svg" alt="Funnel6.svg" width="75%">
|
||||
|
||||
<p>
|
||||
Task E can then acquire <tt>->exp_mutex</tt> and increment
|
||||
<tt>->expedited_sequence</tt> to the value three.
|
||||
If new tasks G and H arrive and move up the combining tree at the
|
||||
same time, the state will be as follows:
|
||||
|
||||
<p><img src="Funnel7.svg" alt="Funnel7.svg" width="75%">
|
||||
|
||||
<p>
|
||||
Note that three of the root <tt>rcu_node</tt> structure's
|
||||
waitqueues are now occupied.
|
||||
However, at some point, Task A will wake up the
|
||||
tasks blocked on the <tt>->exp_wq</tt> waitqueues, resulting
|
||||
in the following state:
|
||||
|
||||
<p><img src="Funnel8.svg" alt="Funnel8.svg" width="75%">
|
||||
|
||||
<p>
|
||||
Execution will continue with Tasks E and H completing
|
||||
their grace periods and carrying out their wakeups.
|
||||
|
||||
<table>
|
||||
<tr><th> </th></tr>
|
||||
<tr><th align="left">Quick Quiz:</th></tr>
|
||||
<tr><td>
|
||||
What happens if Task A takes so long to do its wakeups
|
||||
that Task E's grace period completes?
|
||||
</td></tr>
|
||||
<tr><th align="left">Answer:</th></tr>
|
||||
<tr><td bgcolor="#ffffff"><font color="ffffff">
|
||||
Then Task E will block on the <tt>->exp_wake_mutex</tt>,
|
||||
which will also prevent it from releasing <tt>->exp_mutex</tt>,
|
||||
which in turn will prevent the next grace period from starting.
|
||||
This last is important in preventing overflow of the
|
||||
<tt>->exp_wq[]</tt> array.
|
||||
</font></td></tr>
|
||||
<tr><td> </td></tr>
|
||||
</table>
|
||||
|
||||
<h3><a name="Use of Workqueues">Use of Workqueues</a></h3>
|
||||
|
||||
<p>
|
||||
In earlier implementations, the task requesting the expedited
|
||||
grace period also drove it to completion.
|
||||
This straightforward approach had the disadvantage of needing to
|
||||
account for POSIX signals sent to user tasks,
|
||||
so more recent implementations use the Linux kernel's
|
||||
<a href="https://www.kernel.org/doc/Documentation/core-api/workqueue.rst">workqueues</a>.
|
||||
|
||||
<p>
|
||||
The requesting task still does counter snapshotting and funnel-lock
|
||||
processing, but the task reaching the top of the funnel lock
|
||||
does a <tt>schedule_work()</tt> (from <tt>_synchronize_rcu_expedited()</tt>)
|
||||
so that a workqueue kthread does the actual grace-period processing.
|
||||
Because workqueue kthreads do not accept POSIX signals, grace-period-wait
|
||||
processing need not allow for POSIX signals.
|
||||
|
||||
In addition, this approach allows wakeups for the previous expedited
|
||||
grace period to be overlapped with processing for the next expedited
|
||||
grace period.
|
||||
Because there are only four sets of waitqueues, it is necessary to
|
||||
ensure that the previous grace period's wakeups complete before the
|
||||
next grace period's wakeups start.
|
||||
This is handled by having the <tt>->exp_mutex</tt>
|
||||
guard expedited grace-period processing and the
|
||||
<tt>->exp_wake_mutex</tt> guard wakeups.
|
||||
The key point is that the <tt>->exp_mutex</tt> is not released
|
||||
until the first wakeup is complete, which means that the
|
||||
<tt>->exp_wake_mutex</tt> has already been acquired at that point.
|
||||
This approach ensures that the previous grace period's wakeups can
|
||||
be carried out while the current grace period is in progress, but
|
||||
that these wakeups will complete before the next grace period starts.
|
||||
This means that only three waitqueues are required, guaranteeing that
|
||||
the four that are provided are sufficient.
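The hand-over-hand use of the two mutexes can be sketched in userspace C with POSIX mutexes. This is an illustration of the lock ordering only: the mutex names echo the fields in the text, but `model_exp_grace_period()` and the two counters are hypothetical stand-ins for the real grace-period and wakeup processing.

```c
#include <pthread.h>

static pthread_mutex_t exp_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t exp_wake_mutex = PTHREAD_MUTEX_INITIALIZER;
static int model_gps_done;      /* grace periods processed */
static int model_wakeups_done;  /* wakeup passes completed */

static void model_exp_grace_period(void)
{
	pthread_mutex_lock(&exp_mutex);
	model_gps_done++;                   /* stand-in for grace-period processing */

	/* Blocks here until the previous grace period's wakeups have finished. */
	pthread_mutex_lock(&exp_wake_mutex);
	pthread_mutex_unlock(&exp_mutex);   /* the next grace period may now start */

	model_wakeups_done++;               /* stand-in for waking the waiters */
	pthread_mutex_unlock(&exp_wake_mutex);
}
```

Because the wakeup mutex is acquired before the grace-period mutex is released, a second caller can overlap its grace-period processing with the first caller's wakeups, but cannot begin its own wakeups until the first caller's are complete.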
|
||||
|
||||
<h3><a name="Stall Warnings">Stall Warnings</a></h3>
|
||||
|
||||
<p>
|
||||
Expediting grace periods does nothing to speed things up when RCU
|
||||
readers take too long, and therefore expedited grace periods check
|
||||
for stalls just as normal grace periods do.
|
||||
|
||||
<table>
|
||||
<tr><th> </th></tr>
|
||||
<tr><th align="left">Quick Quiz:</th></tr>
|
||||
<tr><td>
|
||||
But why not just let the normal grace-period machinery
|
||||
detect the stalls, given that a given reader must block
|
||||
both normal and expedited grace periods?
|
||||
</td></tr>
|
||||
<tr><th align="left">Answer:</th></tr>
|
||||
<tr><td bgcolor="#ffffff"><font color="ffffff">
|
||||
Because it is quite possible that at a given time there
|
||||
is no normal grace period in progress, in which case the
|
||||
normal grace period cannot emit a stall warning.
|
||||
</font></td></tr>
|
||||
<tr><td> </td></tr>
|
||||
</table>
|
||||
|
||||
The <tt>synchronize_sched_expedited_wait()</tt> function loops waiting
|
||||
for the expedited grace period to end, but with a timeout set to the
|
||||
current RCU CPU stall-warning time.
|
||||
If this time is exceeded, any CPUs or <tt>rcu_node</tt> structures
|
||||
blocking the current grace period are printed.
|
||||
Each stall warning results in another pass through the loop, but the
|
||||
second and subsequent passes use longer stall times.
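The shape of this wait loop can be modeled with plain arithmetic. In the sketch below, `model_exp_stall_warnings()` is a hypothetical function, time is measured in abstract ticks, and the factor of three used for the later passes is an arbitrary assumption standing in for "longer stall times".

```c
#include <assert.h>

/*
 * Hypothetical model: returns how many stall warnings would be printed
 * for a grace period lasting gp_duration ticks, given an initial
 * timeout of stall_timeout ticks.
 */
static int model_exp_stall_warnings(unsigned long gp_duration,
				    unsigned long stall_timeout)
{
	unsigned long waited = 0;
	unsigned long timeout = stall_timeout;
	int warnings = 0;

	assert(stall_timeout > 0);          /* otherwise the loop cannot advance */
	while (gp_duration - waited > timeout) {
		waited += timeout;          /* the wait timed out */
		warnings++;                 /* stand-in for printing the blockers */
		timeout = 3 * stall_timeout; /* assumed longer timeout for later passes */
	}
	return warnings;
}
```

A grace period that finishes within the first timeout produces no warnings, while one lasting ten times the initial timeout produces three in this model.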
|
||||
|
||||
<h3><a name="Mid-Boot Operation">Mid-Boot Operation</a></h3>
|
||||
|
||||
<p>
|
||||
The use of workqueues has the advantage that the expedited
|
||||
grace-period code need not worry about POSIX signals.
|
||||
Unfortunately, it has the
|
||||
corresponding disadvantage that workqueues cannot be used until
|
||||
they are initialized, which does not happen until some time after
|
||||
the scheduler spawns the first task.
|
||||
Given that there are parts of the kernel that really do want to
|
||||
execute grace periods during this mid-boot “dead zone”,
|
||||
expedited grace periods must do something else during this time.
|
||||
|
||||
<p>
|
||||
What they do is to fall back to the old practice of requiring that the
|
||||
requesting task drive the expedited grace period, as was the case
|
||||
before the use of workqueues.
|
||||
However, the requesting task is only required to drive the grace period
|
||||
during the mid-boot dead zone.
|
||||
Before mid-boot, a synchronous grace period is a no-op.
|
||||
Some time after mid-boot, workqueues are used.
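The three boot phases lead to a simple dispatch, sketched here in C. The enumerations and `model_exp_gp_path()` are hypothetical names chosen for illustration, and the sketch does not show how the kernel determines which phase it is in.

```c
/* Hypothetical boot phases, in the order the kernel passes through them. */
enum model_boot_phase {
	MODEL_EARLY_BOOT,   /* before the mid-boot dead zone */
	MODEL_MID_BOOT,     /* the dead zone: workqueues not yet usable */
	MODEL_RUNNING       /* workqueues available */
};

/* Hypothetical ways of carrying out a synchronous expedited grace period. */
enum model_gp_path {
	MODEL_GP_NOOP,      /* nothing to wait for */
	MODEL_GP_DIRECT,    /* requesting task drives the grace period itself */
	MODEL_GP_WORKQUEUE  /* a workqueue kthread drives the grace period */
};

static enum model_gp_path model_exp_gp_path(enum model_boot_phase phase)
{
	switch (phase) {
	case MODEL_EARLY_BOOT:
		return MODEL_GP_NOOP;
	case MODEL_MID_BOOT:
		return MODEL_GP_DIRECT;
	default:
		return MODEL_GP_WORKQUEUE;
	}
}
```

Keeping the direct path confined to the middle phase is what lets the steady-state code ignore POSIX signals.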
|
||||
|
||||
<p>
|
||||
Non-expedited non-SRCU synchronous grace periods must also operate
|
||||
normally during mid-boot.
|
||||
This is handled by causing non-expedited grace periods to take the
|
||||
expedited code path during mid-boot.
|
||||
|
||||
<p>
|
||||
The current code assumes that there are no POSIX signals during
|
||||
the mid-boot dead zone.
|
||||
However, if an overwhelming need for POSIX signals somehow arises,
|
||||
appropriate adjustments can be made to the expedited stall-warning code.
|
||||
One such adjustment would reinstate the pre-workqueue stall-warning
|
||||
checks, but only during the mid-boot dead zone.
|
||||
|
||||
<p>
|
||||
With this refinement, synchronous grace periods can now be used from
|
||||
task context pretty much any time during the life of the kernel.
|
||||
That is, aside from some points in the suspend, hibernate, or shutdown
|
||||
code path.
|
||||
|
||||
<h3><a name="Summary">
|
||||
Summary</a></h3>
|
||||
|
||||
<p>
|
||||
Expedited grace periods use a sequence-number approach to promote
|
||||
batching, so that a single grace-period operation can serve numerous
|
||||
requests.
|
||||
A funnel lock is used to efficiently identify the one task out of
|
||||
a concurrent group that will request the grace period.
|
||||
All members of the group will block on waitqueues provided in
|
||||
the <tt>rcu_node</tt> structure.
|
||||
The actual grace-period processing is carried out by a workqueue.
|
||||
|
||||
<p>
|
||||
CPU-hotplug operations are noted lazily in order to prevent the need
|
||||
for tight synchronization between expedited grace periods and
|
||||
CPU-hotplug operations.
|
||||
The dyntick-idle counters are used to avoid sending IPIs to idle CPUs,
|
||||
at least in the common case.
|
||||
RCU-preempt and RCU-sched use different IPI handlers and different
|
||||
code to respond to the state changes carried out by those handlers,
|
||||
but otherwise use common code.
|
||||
|
||||
<p>
|
||||
Quiescent states are tracked using the <tt>rcu_node</tt> tree,
|
||||
and once all necessary quiescent states have been reported,
|
||||
all tasks waiting on this expedited grace period are awakened.
|
||||
A pair of mutexes are used to allow one grace period's wakeups
|
||||
to proceed concurrently with the next grace period's processing.
|
||||
|
||||
<p>
|
||||
This combination of mechanisms allows expedited grace periods to
|
||||
run reasonably efficiently.
|
||||
However, for non-time-critical tasks, normal grace periods should be
|
||||
used instead because their longer duration permits much higher
|
||||
degrees of batching, and thus much lower per-request overheads.
|
||||
|
||||
</body></html>
|
@ -0,0 +1,521 @@
|
||||
=================================================
|
||||
A Tour Through TREE_RCU's Expedited Grace Periods
|
||||
=================================================
|
||||
|
||||
Introduction
|
||||
============
|
||||
|
||||
This document describes RCU's expedited grace periods.
|
||||
Unlike RCU's normal grace periods, which accept long latencies to attain
|
||||
high efficiency and minimal disturbance, expedited grace periods accept
|
||||
lower efficiency and significant disturbance to attain shorter latencies.
|
||||
|
||||
There are two flavors of RCU (RCU-preempt and RCU-sched), with an earlier
|
||||
third RCU-bh flavor having been implemented in terms of the other two.
|
||||
Each of the two implementations is covered in its own section.
|
||||
|
||||
Expedited Grace Period Design
|
||||
=============================
|
||||
|
||||
The expedited RCU grace periods cannot be accused of being subtle,
|
||||
given that they for all intents and purposes hammer every CPU that
|
||||
has not yet provided a quiescent state for the current expedited
|
||||
grace period.
|
||||
The one saving grace is that the hammer has grown a bit smaller
|
||||
over time: The old call to ``try_stop_cpus()`` has been
|
||||
replaced with a set of calls to ``smp_call_function_single()``,
|
||||
each of which results in an IPI to the target CPU.
|
||||
The corresponding handler function checks the CPU's state, motivating
|
||||
a faster quiescent state where possible, and triggering a report
|
||||
of that quiescent state.
|
||||
As always for RCU, once everything has spent some time in a quiescent
|
||||
state, the expedited grace period has completed.
|
||||
|
||||
The details of the ``smp_call_function_single()`` handler's
|
||||
operation depend on the RCU flavor, as described in the following
|
||||
sections.
|
||||
|
||||
RCU-preempt Expedited Grace Periods
|
||||
===================================
|
||||
|
||||
``CONFIG_PREEMPT=y`` kernels implement RCU-preempt.
|
||||
The overall flow of the handling of a given CPU by an RCU-preempt
|
||||
expedited grace period is shown in the following diagram:
|
||||
|
||||
.. kernel-figure:: ExpRCUFlow.svg
|
||||
|
||||
The solid arrows denote direct action, for example, a function call.
|
||||
The dotted arrows denote indirect action, for example, an IPI
|
||||
or a state that is reached after some time.
|
||||
|
||||
If a given CPU is offline or idle, ``synchronize_rcu_expedited()``
|
||||
will ignore it because idle and offline CPUs are already residing
|
||||
in quiescent states.
|
||||
Otherwise, the expedited grace period will use
|
||||
``smp_call_function_single()`` to send the CPU an IPI, which
|
||||
is handled by ``rcu_exp_handler()``.
|
||||
|
||||
However, because this is preemptible RCU, ``rcu_exp_handler()``
|
||||
can check to see if the CPU is currently running in an RCU read-side
|
||||
critical section.
|
||||
If not, the handler can immediately report a quiescent state.
|
||||
Otherwise, it sets flags so that the outermost ``rcu_read_unlock()``
|
||||
invocation will provide the needed quiescent-state report.
|
||||
This flag-setting avoids the previous forced preemption of all
|
||||
CPUs that might have RCU read-side critical sections.
|
||||
In addition, this flag-setting is done so as to avoid increasing
|
||||
the overhead of the common-case fastpath through the scheduler.
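
The handler's decision described above can be sketched as a small userspace model. This is only an illustration under stated assumptions: ``struct cpu_state``, its fields, and the ``model_*`` functions are hypothetical names, not the kernel's data structures, and all locking and memory ordering is omitted.

```c
#include <stdbool.h>

/* Hypothetical per-CPU state; field names are illustrative only. */
struct cpu_state {
	int  read_nesting;     /* depth of nested read-side critical sections */
	bool exp_qs_deferred;  /* handler asked the outermost unlock to report */
	bool exp_qs_reported;  /* quiescent state reported for this grace period */
};

/* Model of the IPI handler's decision for preemptible RCU. */
static void model_exp_handler(struct cpu_state *cpu)
{
	if (cpu->read_nesting == 0)
		cpu->exp_qs_reported = true;  /* not in a reader: report now */
	else
		cpu->exp_qs_deferred = true;  /* in a reader: defer to unlock */
}

static void model_read_lock(struct cpu_state *cpu)
{
	cpu->read_nesting++;
}

/* Only the outermost unlock acts on the deferred-report flag. */
static void model_read_unlock(struct cpu_state *cpu)
{
	if (--cpu->read_nesting == 0 && cpu->exp_qs_deferred) {
		cpu->exp_qs_deferred = false;
		cpu->exp_qs_reported = true;
	}
}
```

In this model, a CPU interrupted at nesting depth two reports nothing until its second ``model_read_unlock()`` call, at which point the deferred report is made.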
|
||||
|
||||
Again because this is preemptible RCU, an RCU read-side critical section
|
||||
can be preempted.
|
||||
When that happens, RCU will enqueue the task, which will then continue to
|
||||
block the current expedited grace period until it resumes and finds its
|
||||
outermost ``rcu_read_unlock()``.
|
||||
The CPU will report a quiescent state just after enqueuing the task because
|
||||
the CPU is no longer blocking the grace period.
|
||||
It is instead the preempted task doing the blocking.
|
||||
The list of blocked tasks is managed by ``rcu_preempt_ctxt_queue()``,
|
||||
which is called from ``rcu_preempt_note_context_switch()``, which
|
||||
in turn is called from ``rcu_note_context_switch()``, which in
|
||||
turn is called from the scheduler.
|
||||
|
||||
|
||||
+-----------------------------------------------------------------------+
|
||||
| **Quick Quiz**: |
|
||||
+-----------------------------------------------------------------------+
|
||||
| Why not just have the expedited grace period check the state of all |
|
||||
| the CPUs? After all, that would avoid all those real-time-unfriendly |
|
||||
| IPIs. |
|
||||
+-----------------------------------------------------------------------+
|
||||
| **Answer**: |
|
||||
+-----------------------------------------------------------------------+
|
||||
| Because we want the RCU read-side critical sections to run fast, |
|
||||
| which means no memory barriers. Therefore, it is not possible to |
|
||||
| safely check the state from some other CPU. And even if it was |
|
||||
| possible to safely check the state, it would still be necessary to |
|
||||
| IPI the CPU to safely interact with the upcoming |
|
||||
| ``rcu_read_unlock()`` invocation, which means that the remote state |
|
||||
| testing would not help the worst-case latency that real-time |
|
||||
| applications care about. |
|
||||
| |
|
||||
| One way to prevent your real-time application from getting hit with |
|
||||
| these IPIs is to build your kernel with ``CONFIG_NO_HZ_FULL=y``. RCU |
|
||||
| would then perceive the CPU running your application as being idle, |
|
||||
| and it would be able to safely detect that state without needing to |
|
||||
| IPI the CPU. |
|
||||
+-----------------------------------------------------------------------+
|
||||
|
||||
Please note that this is just the overall flow: Additional complications
|
||||
can arise due to races with CPUs going idle or offline, among other
|
||||
things.
|
||||
|
||||
RCU-sched Expedited Grace Periods
|
||||
---------------------------------
|
||||
|
||||
``CONFIG_PREEMPT=n`` kernels implement RCU-sched. The overall flow of
|
||||
the handling of a given CPU by an RCU-sched expedited grace period is
|
||||
shown in the following diagram:
|
||||
|
||||
.. kernel-figure:: ExpSchedFlow.svg
|
||||
|
||||
As with RCU-preempt, RCU-sched's ``synchronize_rcu_expedited()`` ignores
|
||||
offline and idle CPUs, again because they are in remotely detectable
|
||||
quiescent states. However, because the ``rcu_read_lock_sched()`` and
|
||||
``rcu_read_unlock_sched()`` leave no trace of their invocation, in
|
||||
general it is not possible to tell whether or not the current CPU is in
|
||||
an RCU read-side critical section. The best that RCU-sched's
|
||||
``rcu_exp_handler()`` can do is to check for idle, on the off-chance
|
||||
that the CPU went idle while the IPI was in flight. If the CPU is idle,
|
||||
then ``rcu_exp_handler()`` reports the quiescent state.
|
||||
|
||||
Otherwise, the handler forces a future context switch by setting the
|
||||
NEED_RESCHED flag of the current task's thread flag and the CPU preempt
|
||||
counter. At the time of the context switch, the CPU reports the
|
||||
quiescent state. Should the CPU go offline first, it will report the
|
||||
quiescent state at that time.
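
A minimal sketch of the RCU-sched handler logic just described, assuming a hypothetical ``struct sched_cpu`` whose boolean fields stand in for the real idle detection and thread flags. The ``model_*`` names are illustrative and do not appear in the kernel.

```c
#include <stdbool.h>

/* Hypothetical per-CPU state for a non-preemptible kernel. */
struct sched_cpu {
	bool idle;            /* CPU is in the idle loop */
	bool need_resched;    /* stands in for the NEED_RESCHED flag */
	bool exp_qs_pending;  /* CPU still owes a quiescent state */
};

/* Model of the IPI handler: report if idle, else force a context switch. */
static void model_sched_exp_handler(struct sched_cpu *cpu)
{
	if (cpu->idle) {
		cpu->exp_qs_pending = false;  /* idle is a quiescent state */
		return;
	}
	cpu->need_resched = true;             /* report at next context switch */
}

/* The context switch itself is the quiescent state. */
static void model_context_switch(struct sched_cpu *cpu)
{
	cpu->need_resched = false;
	cpu->exp_qs_pending = false;
}

/* A CPU going offline before it context-switches reports at that time. */
static void model_cpu_offline(struct sched_cpu *cpu)
{
	cpu->exp_qs_pending = false;
}
```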
|
||||
|
||||
Expedited Grace Period and CPU Hotplug
|
||||
--------------------------------------
|
||||
|
||||
The expedited nature of expedited grace periods requires a much tighter
|
||||
interaction with CPU hotplug operations than is required for normal
|
||||
grace periods. In addition, attempting to IPI offline CPUs will result
|
||||
in splats, but failing to IPI online CPUs can result in too-short grace
|
||||
periods. Neither option is acceptable in production kernels.
|
||||
|
||||
The interaction between expedited grace periods and CPU hotplug
|
||||
operations is carried out at several levels:
|
||||
|
||||
#. The number of CPUs that have ever been online is tracked by the
|
||||
``rcu_state`` structure's ``->ncpus`` field. The ``rcu_state``
|
||||
structure's ``->ncpus_snap`` field tracks the number of CPUs that
|
||||
have ever been online at the beginning of an RCU expedited grace
|
||||
period. Note that this number never decreases, at least in the
|
||||
absence of a time machine.
|
||||
#. The identities of the CPUs that have ever been online are tracked by
|
||||
the ``rcu_node`` structure's ``->expmaskinitnext`` field. The
|
||||
``rcu_node`` structure's ``->expmaskinit`` field tracks the
|
||||
identities of the CPUs that were online at least once at the
|
||||
beginning of the most recent RCU expedited grace period. The
|
||||
``rcu_state`` structure's ``->ncpus`` and ``->ncpus_snap`` fields are
|
||||
used to detect when new CPUs have come online for the first time,
|
||||
that is, when the ``rcu_node`` structure's ``->expmaskinitnext``
|
||||
field has changed since the beginning of the last RCU expedited grace
|
||||
period, which triggers an update of each ``rcu_node`` structure's
|
||||
``->expmaskinit`` field from its ``->expmaskinitnext`` field.
|
||||
#. Each ``rcu_node`` structure's ``->expmaskinit`` field is used to
|
||||
initialize that structure's ``->expmask`` at the beginning of each
|
||||
RCU expedited grace period. This means that only those CPUs that have
|
||||
been online at least once will be considered for a given grace
|
||||
period.
|
||||
#. Any CPU that goes offline will clear its bit in its leaf ``rcu_node``
|
||||
structure's ``->qsmaskinitnext`` field, so any CPU with that bit
|
||||
clear can safely be ignored. However, it is possible for a CPU coming
|
||||
online or going offline to have this bit set for some time while
|
||||
``cpu_online`` returns ``false``.
|
||||
#. For each non-idle CPU that RCU believes is currently online, the
|
||||
grace period invokes ``smp_call_function_single()``. If this
|
||||
succeeds, the CPU was fully online. Failure indicates that the CPU is
|
||||
in the process of coming online or going offline, in which case it is
|
||||
necessary to wait for a short time period and try again. The purpose
|
||||
of this wait (or series of waits, as the case may be) is to permit a
|
||||
concurrent CPU-hotplug operation to complete.
|
||||
#. In the case of RCU-sched, one of the last acts of an outgoing CPU is
|
||||
to invoke ``rcu_report_dead()``, which reports a quiescent state for
|
||||
that CPU. However, this is likely paranoia-induced redundancy.
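
The lazy mask update in the first three items above can be sketched as follows. This is a simplified userspace model, not the kernel's code: it assumes a flat array of two leaf nodes instead of a tree, omits all locking, and uses hypothetical ``model_*`` function names. The field names mirror those in the text.

```c
#include <stdbool.h>

#define MODEL_NLEAVES 2

struct model_leaf {
	unsigned long expmaskinit;      /* snapshot used to start grace periods */
	unsigned long expmaskinitnext;  /* CPUs that have ever been online */
	unsigned long expmask;          /* CPUs blocking the current grace period */
};

struct model_state {
	int ncpus;       /* number of CPUs that have ever been online */
	int ncpus_snap;  /* value of ncpus at the last mask refresh */
	struct model_leaf leaf[MODEL_NLEAVES];
};

/* A CPU comes online; only a first-time arrival changes the counters. */
static void model_cpu_online(struct model_state *s, int leaf, int bit)
{
	unsigned long mask = 1UL << bit;

	if (!(s->leaf[leaf].expmaskinitnext & mask)) {
		s->leaf[leaf].expmaskinitnext |= mask;
		s->ncpus++;
	}
}

/* Start of an expedited grace period: refresh lazily, then set ->expmask. */
static bool model_exp_gp_init(struct model_state *s)
{
	bool refreshed = false;
	int i;

	if (s->ncpus != s->ncpus_snap) {
		s->ncpus_snap = s->ncpus;
		for (i = 0; i < MODEL_NLEAVES; i++)
			s->leaf[i].expmaskinit = s->leaf[i].expmaskinitnext;
		refreshed = true;
	}
	for (i = 0; i < MODEL_NLEAVES; i++)
		s->leaf[i].expmask = s->leaf[i].expmaskinit;
	return refreshed;
}
```

The point of the sketch is that a CPU's first arrival is recorded immediately in ``->expmaskinitnext`` but becomes visible to grace periods only at the next initialization, so initialization never observes a half-updated set of masks.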
|
||||
|
||||
+-----------------------------------------------------------------------+
|
||||
| **Quick Quiz**: |
|
||||
+-----------------------------------------------------------------------+
|
||||
| Why all the dancing around with multiple counters and masks tracking |
|
||||
| CPUs that were once online? Why not just have a single set of masks |
|
||||
| tracking the currently online CPUs and be done with it? |
|
||||
+-----------------------------------------------------------------------+
|
||||
| **Answer**: |
|
||||
+-----------------------------------------------------------------------+
|
||||
| Maintaining a single set of masks tracking the online CPUs *sounds* |
|
||||
| easier, at least until you try working out all the race conditions |
|
||||
| between grace-period initialization and CPU-hotplug operations. For |
|
||||
| example, suppose initialization is progressing down the tree while a |
|
||||
| CPU-offline operation is progressing up the tree. This situation can |
|
||||
| result in bits set at the top of the tree that have no counterparts |
|
||||
| at the bottom of the tree. Those bits will never be cleared, which |
|
||||
| will result in grace-period hangs. In short, that way lies madness, |
|
||||
| to say nothing of a great many bugs, hangs, and deadlocks. |
|
||||
| In contrast, the current multi-mask multi-counter scheme ensures that |
|
||||
| grace-period initialization will always see consistent masks up and |
|
||||
| down the tree, which brings significant simplifications over the |
|
||||
| single-mask method. |
|
||||
| |
|
||||
| This is an instance of `deferring work in order to avoid |
|
||||
| synchronization <http://www.cs.columbia.edu/~library/TR-repository/re |
|
||||
| ports/reports-1992/cucs-039-92.ps.gz>`__. |
|
||||
| Lazily recording CPU-hotplug events at the beginning of the next |
|
||||
| grace period greatly simplifies maintenance of the CPU-tracking |
|
||||
| bitmasks in the ``rcu_node`` tree. |
|
||||
+-----------------------------------------------------------------------+
|
||||
|
||||
Expedited Grace Period Refinements
|
||||
----------------------------------
|
||||
|
||||
Idle-CPU Checks
|
||||
~~~~~~~~~~~~~~~
|
||||
|
||||
Each expedited grace period checks for idle CPUs when initially forming
|
||||
the mask of CPUs to be IPIed and again just before IPIing a CPU (both
|
||||
checks are carried out by ``sync_rcu_exp_select_cpus()``). If the CPU is
|
||||
idle at any time between those two times, the CPU will not be IPIed.
|
||||
Instead, the task pushing the grace period forward will include the idle
|
||||
CPUs in the mask passed to ``rcu_report_exp_cpu_mult()``.
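
As a rough sketch of the two checks, the following userspace model takes one idle observation per CPU for each check and splits the CPUs into those that must be IPIed and those reported on their behalf. The function name, the four-CPU limit, and the use of plain boolean arrays in place of the dyntick-idle counters are all assumptions made for illustration.

```c
#include <stdbool.h>

#define MODEL_NCPUS 4

/*
 * Returns the mask of CPUs that must be IPIed.  CPUs seen idle at either
 * check are instead collected in *report_mask, to be reported by the task
 * driving the grace period.
 */
static unsigned long model_select_cpus(const bool idle_at_mask_build[MODEL_NCPUS],
				       const bool idle_before_ipi[MODEL_NCPUS],
				       unsigned long *report_mask)
{
	unsigned long ipi_mask = 0;
	int cpu;

	*report_mask = 0;
	for (cpu = 0; cpu < MODEL_NCPUS; cpu++) {
		if (idle_at_mask_build[cpu] || idle_before_ipi[cpu])
			*report_mask |= 1UL << cpu;  /* no IPI needed */
		else
			ipi_mask |= 1UL << cpu;      /* must interrupt this CPU */
	}
	return ipi_mask;
}
```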
|
||||
|
||||
For RCU-sched, there is an additional check: If the IPI has interrupted
|
||||
the idle loop, then ``rcu_exp_handler()`` invokes
|
||||
``rcu_report_exp_rdp()`` to report the corresponding quiescent state.
|
||||
|
||||
For RCU-preempt, there is no specific check for idle in the IPI handler
|
||||
(``rcu_exp_handler()``), but because RCU read-side critical sections are
|
||||
not permitted within the idle loop, if ``rcu_exp_handler()`` sees that
|
||||
the CPU is within an RCU read-side critical section, the CPU cannot
|
||||
possibly be idle. Otherwise, ``rcu_exp_handler()`` invokes
|
||||
``rcu_report_exp_rdp()`` to report the corresponding quiescent state,
|
||||
regardless of whether or not that quiescent state was due to the CPU
|
||||
being idle.
|
||||
|
||||
In summary, RCU expedited grace periods check for idle when building the
|
||||
bitmask of CPUs that must be IPIed, just before sending each IPI, and
|
||||
(either explicitly or implicitly) within the IPI handler.
|
||||
|
||||
Batching via Sequence Counter
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
If each grace-period request was carried out separately, expedited grace
|
||||
periods would have abysmal scalability and problematic high-load
|
||||
characteristics. Because each grace-period operation can serve an
|
||||
unlimited number of updates, it is important to *batch* requests, so
|
||||
that a single expedited grace-period operation will cover all requests
|
||||
in the corresponding batch.
|
||||
|
||||
This batching is controlled by a sequence counter named
|
||||
``->expedited_sequence`` in the ``rcu_state`` structure. This counter
|
||||
has an odd value when there is an expedited grace period in progress and
|
||||
an even value otherwise, so that dividing the counter value by two gives
|
||||
the number of completed grace periods. During any given update request,
|
||||
the counter must transition from even to odd and then back to even, thus
|
||||
indicating that a grace period has elapsed. Therefore, if the initial
|
||||
value of the counter is ``s``, the updater must wait until the counter
|
||||
reaches at least the value ``(s+3)&~0x1``. This counter is managed by
|
||||
the following access functions:
|
||||
|
||||
#. ``rcu_exp_gp_seq_start()``, which marks the start of an expedited
|
||||
grace period.
|
||||
#. ``rcu_exp_gp_seq_end()``, which marks the end of an expedited grace
|
||||
period.
|
||||
#. ``rcu_exp_gp_seq_snap()``, which obtains a snapshot of the counter.
|
||||
#. ``rcu_exp_gp_seq_done()``, which returns ``true`` if a full expedited
|
||||
grace period has elapsed since the corresponding call to
|
||||
``rcu_exp_gp_seq_snap()``.
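
The counter arithmetic can be illustrated with a small userspace model. It follows the odd/even scheme exactly as described in this section. The ``model_*`` names are hypothetical, the counter is passed explicitly rather than living in ``rcu_state``, and the wraparound-tolerant comparison is an assumption of this sketch.

```c
#include <limits.h>
#include <stdbool.h>

/* Even -> odd: an expedited grace period is now in progress. */
static void model_exp_gp_seq_start(unsigned long *seq)
{
	(*seq)++;
}

/* Odd -> even: the expedited grace period has ended. */
static void model_exp_gp_seq_end(unsigned long *seq)
{
	(*seq)++;
}

/* Value the counter must reach before a full grace period has elapsed. */
static unsigned long model_exp_gp_seq_snap(const unsigned long *seq)
{
	return (*seq + 3) & ~0x1UL;
}

/* True once the counter has reached the snapshot (wraparound-tolerant). */
static bool model_exp_gp_seq_done(const unsigned long *seq, unsigned long snap)
{
	return *seq - snap <= ULONG_MAX / 2;
}
```

A snapshot taken while the counter is 0 (idle) yields 2, so one full grace period suffices. A snapshot taken while the counter is 1 (grace period already in progress) yields 4, because the in-progress grace period might have started too early to cover the new request.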
|
||||
|
||||
Again, only one request in a given batch need actually carry out a
|
||||
grace-period operation, which means there must be an efficient way to
|
||||
identify which of many concurrent requests will initiate the grace
|
||||
period, and that there be an efficient way for the remaining requests to
|
||||
wait for that grace period to complete. However, that is the topic of
|
||||
the next section.
|
||||
|
||||
Funnel Locking and Wait/Wakeup
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
The natural way to sort out which of a batch of updaters will initiate
|
||||
the expedited grace period is to use the ``rcu_node`` combining tree, as
|
||||
implemented by the ``exp_funnel_lock()`` function. The first updater
|
||||
corresponding to a given grace period arriving at a given ``rcu_node``
|
||||
structure records its desired grace-period sequence number in the
|
||||
``->exp_seq_rq`` field and moves up to the next level in the tree.
|
||||
Otherwise, if the ``->exp_seq_rq`` field already contains the sequence
|
||||
number for the desired grace period or some later one, the updater
|
||||
blocks on one of four wait queues in the ``->exp_wq[]`` array, using the
|
||||
second-from-bottom and third-from-bottom bits as an index. An
|
||||
``->exp_lock`` field in the ``rcu_node`` structure synchronizes access
|
||||
to these fields.
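
The walk up the combining tree can be sketched as a single-threaded model. It is a sketch under stated assumptions: ``struct fnode`` and ``model_exp_funnel_lock()`` are hypothetical names, the ``->exp_lock`` synchronization and the actual blocking are omitted, and sequence-number wraparound is ignored.

```c
#include <stddef.h>

struct fnode {
	unsigned long exp_seq_rq;  /* latest requested sequence number */
	struct fnode *parent;      /* NULL at the root */
};

/*
 * Returns NULL if the caller passed the root and must initiate the grace
 * period, otherwise the node on whose wait queue the caller would block.
 */
static struct fnode *model_exp_funnel_lock(struct fnode *leaf, unsigned long s)
{
	struct fnode *n;

	for (n = leaf; n != NULL; n = n->parent) {
		if (n->exp_seq_rq >= s)
			return n;       /* someone already asked: wait here */
		n->exp_seq_rq = s;      /* first arrival: record and move up */
	}
	return NULL;
}
```

With a root and two leaves all starting at zero, the first request for sequence number 2 from the left leaf returns ``NULL``, a later request from the right leaf returns the root, and any further request for 2 from either leaf returns that leaf.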
|
||||
|
||||
An empty ``rcu_node`` tree is shown in the following diagram, with the
|
||||
white cells representing the ``->exp_seq_rq`` field and the red cells
|
||||
representing the elements of the ``->exp_wq[]`` array.
|
||||
|
||||
.. kernel-figure:: Funnel0.svg
|
||||
|
||||
The next diagram shows the situation after the arrival of Task A and
|
||||
Task B at the leftmost and rightmost leaf ``rcu_node`` structures,
|
||||
respectively. The current value of the ``rcu_state`` structure's
|
||||
``->expedited_sequence`` field is zero, so adding three and clearing the
|
||||
bottom bit results in the value two, which both tasks record in the
|
||||
``->exp_seq_rq`` field of their respective ``rcu_node`` structures:
|
||||
|
||||
.. kernel-figure:: Funnel1.svg
|
||||
|
||||
Each of Tasks A and B will move up to the root ``rcu_node`` structure.
|
||||
Suppose that Task A wins, recording its desired grace-period sequence
|
||||
number and resulting in the state shown below:
|
||||
|
||||
.. kernel-figure:: Funnel2.svg
|
||||
|
||||
Task A now advances to initiate a new grace period, while Task B moves
|
||||
up to the root ``rcu_node`` structure, and, seeing that its desired
|
||||
sequence number is already recorded, blocks on ``->exp_wq[1]``.
|
||||
|
||||
+-----------------------------------------------------------------------+
|
||||
| **Quick Quiz**: |
|
||||
+-----------------------------------------------------------------------+
|
||||
| Why ``->exp_wq[1]``? Given that the value of these tasks' desired |
|
||||
| sequence number is two, shouldn't they instead block on |
|
||||
| ``->exp_wq[2]``? |
|
||||
+-----------------------------------------------------------------------+
|
||||
| **Answer**: |
|
||||
+-----------------------------------------------------------------------+
|
||||
| No. |
|
||||
| Recall that the bottom bit of the desired sequence number indicates |
|
||||
| whether or not a grace period is currently in progress. It is |
|
||||
| therefore necessary to shift the sequence number right one bit |
|
||||
| position to obtain the number of the grace period. This results in |
|
||||
| ``->exp_wq[1]``. |
|
||||
+-----------------------------------------------------------------------+
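
The index computation from the Quick Quiz answer can be written out directly. The helper name below is hypothetical, and the one-bit shift follows the odd/even counter scheme as described in this document.

```c
/*
 * Drop the bottom "in progress" bit, then use the next two bits to pick
 * one of the four wait queues.
 */
static unsigned int model_exp_wq_index(unsigned long s)
{
	return (s >> 1) & 0x3;
}
```

Desired sequence numbers 2, 4, 6, and 8 therefore map to wait queues 1, 2, 3, and 0, respectively.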
|
||||
|
||||
If Tasks C and D also arrive at this point, they will compute the same
|
||||
desired grace-period sequence number, and see that both leaf
|
||||
``rcu_node`` structures already have that value recorded. They will
|
||||
therefore block on their respective ``rcu_node`` structures'
|
||||
``->exp_wq[1]`` fields, as shown below:
|
||||
|
||||
.. kernel-figure:: Funnel3.svg
|
||||
|
||||
Task A now acquires the ``rcu_state`` structure's ``->exp_mutex`` and
|
||||
initiates the grace period, which increments ``->expedited_sequence``.
|
||||
Therefore, if Tasks E and F arrive, they will compute a desired sequence
|
||||
number of 4 and will record this value as shown below:
|
||||
|
||||
.. kernel-figure:: Funnel4.svg
|
||||
|
||||
Tasks E and F will propagate up the ``rcu_node`` combining tree, with
|
||||
Task F blocking on the root ``rcu_node`` structure and Task E waiting for
|
||||
Task A to finish so that it can start the next grace period. The
|
||||
resulting state is as shown below:
|
||||
|
||||
.. kernel-figure:: Funnel5.svg
|
||||
|
||||
Once the grace period completes, Task A starts waking up the tasks
|
||||
waiting for this grace period to complete, increments the
|
||||
``->expedited_sequence``, acquires the ``->exp_wake_mutex`` and then
|
||||
releases the ``->exp_mutex``. This results in the following state:
|
||||
|
||||
.. kernel-figure:: Funnel6.svg
|
||||
|
||||
Task E can then acquire ``->exp_mutex`` and increment
|
||||
``->expedited_sequence`` to the value three. If new tasks G and H arrive
|
||||
and move up the combining tree at the same time, the state will be as
|
||||
follows:
|
||||
|
||||
.. kernel-figure:: Funnel7.svg
|
||||
|
||||
Note that three of the root ``rcu_node`` structure's waitqueues are now
|
||||
occupied. However, at some point, Task A will wake up the tasks blocked
|
||||
on the ``->exp_wq`` waitqueues, resulting in the following state:
|
||||
|
||||
.. kernel-figure:: Funnel8.svg
|
||||
|
||||
Execution will continue with Tasks E and H completing their grace
|
||||
periods and carrying out their wakeups.
|
||||
|
||||
+-----------------------------------------------------------------------+
|
||||
| **Quick Quiz**: |
|
||||
+-----------------------------------------------------------------------+
|
||||
| What happens if Task A takes so long to do its wakeups that Task E's |
|
||||
| grace period completes? |
|
||||
+-----------------------------------------------------------------------+
|
||||
| **Answer**: |
|
||||
+-----------------------------------------------------------------------+
|
||||
| Then Task E will block on the ``->exp_wake_mutex``, which will also |
|
||||
| prevent it from releasing ``->exp_mutex``, which in turn will prevent |
|
||||
| the next grace period from starting. This last is important in |
|
||||
| preventing overflow of the ``->exp_wq[]`` array. |
|
||||
+-----------------------------------------------------------------------+
|
||||
|
||||
Use of Workqueues
|
||||
~~~~~~~~~~~~~~~~~
|
||||
|
||||
In earlier implementations, the task requesting the expedited grace
|
||||
period also drove it to completion. This straightforward approach had
|
||||
the disadvantage of needing to account for POSIX signals sent to user
|
||||
tasks, so more recent implementations use the Linux kernel's
|
||||
`workqueues <https://www.kernel.org/doc/Documentation/core-api/workqueue.rst>`__.
|
||||
|
||||
The requesting task still does counter snapshotting and funnel-lock
|
||||
processing, but the task reaching the top of the funnel lock does a
|
||||
``schedule_work()`` (from ``_synchronize_rcu_expedited()``) so that a
|
||||
workqueue kthread does the actual grace-period processing. Because
|
||||
workqueue kthreads do not accept POSIX signals, grace-period-wait
|
||||
processing need not allow for POSIX signals. In addition, this approach
|
||||
allows wakeups for the previous expedited grace period to be overlapped
|
||||
with processing for the next expedited grace period. Because there are
|
||||
only four sets of waitqueues, it is necessary to ensure that the
|
||||
previous grace period's wakeups complete before the next grace period's
|
||||
wakeups start. This is handled by having the ``->exp_mutex`` guard
|
||||
expedited grace-period processing and the ``->exp_wake_mutex`` guard
|
||||
wakeups. The key point is that the ``->exp_mutex`` is not released until
|
||||
the first wakeup is complete, which means that the ``->exp_wake_mutex``
|
||||
has already been acquired at that point. This approach ensures that the
|
||||
previous grace period's wakeups can be carried out while the current
|
||||
grace period is in progress, but that these wakeups will complete before
|
||||
the next grace period starts. This means that only three waitqueues are
|
||||
required, guaranteeing that the four that are provided are sufficient.
|
||||
|
||||
Stall Warnings
|
||||
~~~~~~~~~~~~~~
|
||||
|
||||
Expediting grace periods does nothing to speed things up when RCU
|
||||
readers take too long, and therefore expedited grace periods check for
|
||||
stalls just as normal grace periods do.
|
||||
|
||||
+-----------------------------------------------------------------------+
|
||||
| **Quick Quiz**: |
|
||||
+-----------------------------------------------------------------------+
|
||||
| But why not just let the normal grace-period machinery detect the |
|
||||
| stalls, given that a given reader must block both normal and |
|
||||
| expedited grace periods? |
|
||||
+-----------------------------------------------------------------------+
|
||||
| **Answer**: |
|
||||
+-----------------------------------------------------------------------+
|
||||
| Because it is quite possible that at a given time there is no normal |
|
||||
| grace period in progress, in which case the normal grace period |
|
||||
| cannot emit a stall warning. |
|
||||
+-----------------------------------------------------------------------+
|
||||
|
||||
The ``synchronize_sched_expedited_wait()`` function loops waiting for
|
||||
the expedited grace period to end, but with a timeout set to the current
|
||||
RCU CPU stall-warning time. If this time is exceeded, any CPUs or
|
||||
``rcu_node`` structures blocking the current grace period are printed.
|
||||
Each stall warning results in another pass through the loop, but the
|
||||
second and subsequent passes use longer stall times.
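
The shape of this wait loop can be sketched with simulated time in place of real sleeping. The function name is hypothetical, and the specific lengthened timeout used for later passes (three times the base value plus three) is an assumption of this sketch, since the text above says only that later passes use longer stall times.

```c
/*
 * Counts the stall warnings that would be printed for a grace period
 * lasting gp_duration time units, given an initial timeout of
 * base_timeout time units.
 */
static int model_count_stall_warnings(unsigned long gp_duration,
				      unsigned long base_timeout)
{
	unsigned long now = 0;
	unsigned long timeout = base_timeout;
	int warnings = 0;

	for (;;) {
		now += timeout;                  /* wait for timeout or GP end */
		if (now >= gp_duration)
			return warnings;         /* GP ended during this wait */
		warnings++;                      /* timed out: print a warning */
		timeout = 3 * base_timeout + 3;  /* assumed longer later passes */
	}
}
```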
|
||||
|
||||
Mid-boot operation
|
||||
~~~~~~~~~~~~~~~~~~
|
||||
|
||||
The use of workqueues has the advantage that the expedited grace-period
|
||||
code need not worry about POSIX signals. Unfortunately, it has the
|
||||
corresponding disadvantage that workqueues cannot be used until they are
|
||||
initialized, which does not happen until some time after the scheduler
|
||||
spawns the first task. Given that there are parts of the kernel that
|
||||
really do want to execute grace periods during this mid-boot “dead
|
||||
zone”, expedited grace periods must do something else during this time.
|
||||
|
||||
What they do is to fall back to the old practice of requiring that the
|
||||
requesting task drive the expedited grace period, as was the case before
|
||||
the use of workqueues. However, the requesting task is only required to
|
||||
drive the grace period during the mid-boot dead zone. Before mid-boot, a
|
||||
synchronous grace period is a no-op. Some time after mid-boot,
|
||||
workqueues are used.
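
The three boot phases and the corresponding behavior can be summarized in a small dispatch sketch. The enumerations and the function below are hypothetical names chosen for illustration. They are not the kernel's own state variable or constants.

```c
/* Hypothetical boot phases, in the order they occur. */
enum model_boot_phase {
	MODEL_EARLY_BOOT,  /* before the scheduler spawns the first task */
	MODEL_MID_BOOT,    /* the "dead zone": tasks exist, workqueues do not */
	MODEL_RUNTIME,     /* workqueues are initialized */
};

/* How an expedited grace period is carried out in each phase. */
enum model_exp_path {
	MODEL_PATH_NOOP,              /* synchronous grace period is a no-op */
	MODEL_PATH_REQUESTER_DRIVES,  /* requesting task drives the grace period */
	MODEL_PATH_WORKQUEUE,         /* a workqueue kthread drives it */
};

static enum model_exp_path model_exp_gp_path(enum model_boot_phase phase)
{
	switch (phase) {
	case MODEL_EARLY_BOOT:
		return MODEL_PATH_NOOP;
	case MODEL_MID_BOOT:
		return MODEL_PATH_REQUESTER_DRIVES;
	case MODEL_RUNTIME:
		return MODEL_PATH_WORKQUEUE;
	}
	return MODEL_PATH_WORKQUEUE;  /* not reached for valid phases */
}
```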
|
||||
|
||||
Non-expedited non-SRCU synchronous grace periods must also operate
|
||||
normally during mid-boot. This is handled by causing non-expedited grace
|
||||
periods to take the expedited code path during mid-boot.
|
||||
|
||||
The current code assumes that there are no POSIX signals during the
|
||||
mid-boot dead zone. However, if an overwhelming need for POSIX signals
|
||||
somehow arises, appropriate adjustments can be made to the expedited
|
||||
stall-warning code. One such adjustment would reinstate the
|
||||
pre-workqueue stall-warning checks, but only during the mid-boot dead
|
||||
zone.
|
||||
|
||||
With this refinement, synchronous grace periods can now be used from
|
||||
task context pretty much any time during the life of the kernel. That
|
||||
is, aside from some points in the suspend, hibernate, or shutdown code
|
||||
path.
|
||||
|
||||
Summary
|
||||
~~~~~~~
|
||||
|
||||
Expedited grace periods use a sequence-number approach to promote
|
||||
batching, so that a single grace-period operation can serve numerous
|
||||
requests. A funnel lock is used to efficiently identify the one task out
|
||||
of a concurrent group that will request the grace period. All members of
|
||||
the group will block on waitqueues provided in the ``rcu_node``
|
||||
structure. The actual grace-period processing is carried out by a
|
||||
workqueue.
|
||||
|
||||
CPU-hotplug operations are noted lazily in order to prevent the need for
|
||||
tight synchronization between expedited grace periods and CPU-hotplug
|
||||
operations. The dyntick-idle counters are used to avoid sending IPIs to
|
||||
idle CPUs, at least in the common case. RCU-preempt and RCU-sched use
|
||||
different IPI handlers and different code to respond to the state
|
||||
changes carried out by those handlers, but otherwise use common code.
|
||||
|
||||
Quiescent states are tracked using the ``rcu_node`` tree, and once all
|
||||
necessary quiescent states have been reported, all tasks waiting on this
|
||||
expedited grace period are awakened. A pair of mutexes are used to allow
|
||||
one grace period's wakeups to proceed concurrently with the next grace
|
||||
period's processing.
|
||||
|
||||
This combination of mechanisms allows expedited grace periods to run
|
||||
reasonably efficiently. However, for non-time-critical tasks, normal
|
||||
grace periods should be used instead because their longer duration
|
||||
permits much higher degrees of batching, and thus much lower per-request
|
||||
overheads.
|
@ -1,9 +0,0 @@
|
||||
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
|
||||
"http://www.w3.org/TR/html4/loose.dtd">
|
||||
<html>
|
||||
<head><title>A Diagram of TREE_RCU's Grace-Period Memory Ordering</title>
|
||||
<meta HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
|
||||
|
||||
<p><img src="TreeRCU-gp.svg" alt="TreeRCU-gp.svg">
|
||||
|
||||
</body></html>
|
@ -1,704 +0,0 @@
|
||||
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
|
||||
"http://www.w3.org/TR/html4/loose.dtd">
|
||||
<html>
|
||||
<head><title>A Tour Through TREE_RCU's Grace-Period Memory Ordering</title>
|
||||
<meta HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
|
||||
|
||||
<p>August 8, 2017</p>
|
||||
<p>This article was contributed by Paul E. McKenney</p>
|
||||
|
||||
<h3>Introduction</h3>
|
||||
|
||||
<p>This document gives a rough visual overview of how Tree RCU's
|
||||
grace-period memory ordering guarantee is provided.
|
||||
|
||||
<ol>
|
||||
<li> <a href="#What Is Tree RCU's Grace Period Memory Ordering Guarantee?">
|
||||
What Is Tree RCU's Grace Period Memory Ordering Guarantee?</a>
|
||||
<li> <a href="#Tree RCU Grace Period Memory Ordering Building Blocks">
|
||||
Tree RCU Grace Period Memory Ordering Building Blocks</a>
|
||||
<li> <a href="#Tree RCU Grace Period Memory Ordering Components">
|
||||
Tree RCU Grace Period Memory Ordering Components</a>
|
||||
<li> <a href="#Putting It All Together">Putting It All Together</a>
|
||||
</ol>
|
||||
|
||||
<h3><a name="What Is Tree RCU's Grace Period Memory Ordering Guarantee?">
|
||||
What Is Tree RCU's Grace Period Memory Ordering Guarantee?</a></h3>
|
||||
|
||||
<p>RCU grace periods provide extremely strong memory-ordering guarantees
|
||||
for non-idle non-offline code.
|
||||
Any code that happens after the end of a given RCU grace period is guaranteed
|
||||
to see the effects of all accesses prior to the beginning of that grace
|
||||
period that are within RCU read-side critical sections.
|
||||
Similarly, any code that happens before the beginning of a given RCU grace
|
||||
period is guaranteed to see the effects of all accesses following the end
|
||||
of that grace period that are within RCU read-side critical sections.
|
||||
|
||||
<p>Note well that RCU-sched read-side critical sections include any region
|
||||
of code for which preemption is disabled.
|
||||
Given that each individual machine instruction can be thought of as
|
||||
an extremely small region of preemption-disabled code, one can think of
|
||||
<tt>synchronize_rcu()</tt> as <tt>smp_mb()</tt> on steroids.
|
||||
|
||||
<p>RCU updaters use this guarantee by splitting their updates into
|
||||
two phases, one of which is executed before the grace period and
|
||||
the other of which is executed after the grace period.
|
||||
In the most common use case, phase one removes an element from
|
||||
a linked RCU-protected data structure, and phase two frees that element.
|
||||
For this to work, any readers that have witnessed state prior to the
|
||||
phase-one update (in the common case, removal) must not witness state
|
||||
following the phase-two update (in the common case, freeing).
|
||||
|
||||
<p>The RCU implementation provides this guarantee using a network
|
||||
of lock-based critical sections, memory barriers, and per-CPU
|
||||
processing, as is described in the following sections.
|
||||
|
||||
<h3><a name="Tree RCU Grace Period Memory Ordering Building Blocks">
|
||||
Tree RCU Grace Period Memory Ordering Building Blocks</a></h3>
|
||||
|
||||
<p>The workhorse for RCU's grace-period memory ordering is the
|
||||
critical section for the <tt>rcu_node</tt> structure's
|
||||
<tt>->lock</tt>.
|
||||
These critical sections use helper functions for lock acquisition, including
|
||||
<tt>raw_spin_lock_rcu_node()</tt>,
|
||||
<tt>raw_spin_lock_irq_rcu_node()</tt>, and
|
||||
<tt>raw_spin_lock_irqsave_rcu_node()</tt>.
|
||||
Their lock-release counterparts are
|
||||
<tt>raw_spin_unlock_rcu_node()</tt>,
|
||||
<tt>raw_spin_unlock_irq_rcu_node()</tt>, and
|
||||
<tt>raw_spin_unlock_irqrestore_rcu_node()</tt>,
|
||||
respectively.
|
||||
For completeness, a
|
||||
<tt>raw_spin_trylock_rcu_node()</tt>
|
||||
is also provided.
|
||||
The key point is that the lock-acquisition functions, including
|
||||
<tt>raw_spin_trylock_rcu_node()</tt>, all invoke
|
||||
<tt>smp_mb__after_unlock_lock()</tt> immediately after successful
|
||||
acquisition of the lock.
|
||||
|
||||
<p>Therefore, for any given <tt>rcu_node</tt> structure, any access
|
||||
happening before one of the above lock-release functions will be seen
|
||||
by all CPUs as happening before any access happening after a later
|
||||
one of the above lock-acquisition functions.
|
||||
Furthermore, any access happening before one of the
|
||||
above lock-release functions on any given CPU will be seen by all
|
||||
CPUs as happening before any access happening after a later one
|
||||
of the above lock-acquisition functions executing on that same CPU,
|
||||
even if the lock-release and lock-acquisition functions are operating
|
||||
on different <tt>rcu_node</tt> structures.
|
||||
Tree RCU uses these two ordering guarantees to form an ordering
|
||||
network among all CPUs that were in any way involved in the grace
|
||||
period, including any CPUs that came online or went offline during
|
||||
the grace period in question.
|
||||
|
||||
<p>The following litmus test exhibits the ordering effects of these
|
||||
lock-acquisition and lock-release functions:
|
||||
|
||||
<pre>
|
||||
int x, y, z;

void task0(void)
{
	raw_spin_lock_rcu_node(rnp);
	WRITE_ONCE(x, 1);
	r1 = READ_ONCE(y);
	raw_spin_unlock_rcu_node(rnp);
}

void task1(void)
{
	raw_spin_lock_rcu_node(rnp);
	WRITE_ONCE(y, 1);
	r2 = READ_ONCE(z);
	raw_spin_unlock_rcu_node(rnp);
}

void task2(void)
{
	WRITE_ONCE(z, 1);
	smp_mb();
	r3 = READ_ONCE(x);
}

WARN_ON(r1 == 0 && r2 == 0 && r3 == 0);
|
||||
</pre>
|
||||
|
||||
<p>The <tt>WARN_ON()</tt> is evaluated at “the end of time”,
|
||||
after all changes have propagated throughout the system.
|
||||
Without the <tt>smp_mb__after_unlock_lock()</tt> provided by the
|
||||
acquisition functions, this <tt>WARN_ON()</tt> could trigger, for example
|
||||
on PowerPC.
|
||||
The <tt>smp_mb__after_unlock_lock()</tt> invocations prevent this
|
||||
<tt>WARN_ON()</tt> from triggering.
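<p>The acquisition-side pattern can be sketched in user space with C11
atomics.
This is only an illustration under stated assumptions: the
<tt>toy_</tt> names are hypothetical, the lock is a bare spinlock rather
than the kernel's <tt>raw_spinlock_t</tt>, and a sequentially consistent
fence stands in for <tt>smp_mb__after_unlock_lock()</tt>.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for rcu_node; only the lock is modeled. */
struct toy_rcu_node {
	atomic_flag lock;
};

/* Assumption: a seq_cst fence plays the role of smp_mb__after_unlock_lock(). */
static void toy_mb_after_unlock_lock(void)
{
	atomic_thread_fence(memory_order_seq_cst);
}

static void toy_lock_rcu_node(struct toy_rcu_node *rnp)
{
	while (atomic_flag_test_and_set_explicit(&rnp->lock, memory_order_acquire))
		;	/* spin until the previous holder releases the lock */
	toy_mb_after_unlock_lock();	/* full barrier right after acquisition */
}

static bool toy_trylock_rcu_node(struct toy_rcu_node *rnp)
{
	bool locked = !atomic_flag_test_and_set_explicit(&rnp->lock,
							 memory_order_acquire);

	if (locked)
		toy_mb_after_unlock_lock();	/* only on successful acquisition */
	return locked;
}

static void toy_unlock_rcu_node(struct toy_rcu_node *rnp)
{
	atomic_flag_clear_explicit(&rnp->lock, memory_order_release);
}
```

<p>The point of the sketch is the placement of the fence: it follows
every successful acquisition, including the trylock case, and is skipped
when the trylock fails.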
|
||||
|
||||
<p>This approach must be extended to include idle CPUs, which need
|
||||
RCU's grace-period memory ordering guarantee to extend to any
|
||||
RCU read-side critical sections preceding and following the current
|
||||
idle sojourn.
|
||||
This case is handled by calls to the strongly ordered
|
||||
<tt>atomic_add_return()</tt> read-modify-write atomic operation that
|
||||
is invoked within <tt>rcu_dynticks_eqs_enter()</tt> at idle-entry
|
||||
time and within <tt>rcu_dynticks_eqs_exit()</tt> at idle-exit time.
|
||||
The grace-period kthread invokes <tt>rcu_dynticks_snap()</tt> and
|
||||
<tt>rcu_dynticks_in_eqs_since()</tt> (both of which invoke
|
||||
an <tt>atomic_add_return()</tt> of zero) to detect idle CPUs.
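<p>A much-simplified user-space model of this idle-tracking counter is
sketched here.
It assumes a counter whose low bit is set while the CPU is non-idle,
which is a simplification and not the kernel's actual encoding, and all
<tt>toy_</tt> names are hypothetical.
C11's default sequentially consistent read-modify-write operations stand
in for <tt>atomic_add_return()</tt>.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Toy per-CPU counter: odd while the CPU is non-idle, even while idle. */
struct toy_dynticks {
	atomic_int ctr;
};

static void toy_dynticks_init(struct toy_dynticks *dt)
{
	atomic_init(&dt->ctr, 1);	/* CPUs start out non-idle */
}

/* Idle entry: fully ordered read-modify-write (seq_cst by default). */
static void toy_eqs_enter(struct toy_dynticks *dt)
{
	int old = atomic_fetch_add(&dt->ctr, 1);

	assert(old & 1);	/* must have been non-idle */
}

/* Idle exit: the same fully ordered increment. */
static void toy_eqs_exit(struct toy_dynticks *dt)
{
	int old = atomic_fetch_add(&dt->ctr, 1);

	assert(!(old & 1));	/* must have been idle */
}

/* Snapshot via an atomic add of zero, following the description above. */
static int toy_dynticks_snap(struct toy_dynticks *dt)
{
	return atomic_fetch_add(&dt->ctr, 0);
}

/* Was the CPU idle when the snapshot was taken? */
static bool toy_in_eqs(int snap)
{
	return !(snap & 1);
}

/* Has the CPU entered or left idle since the snapshot was taken? */
static bool toy_in_eqs_since(struct toy_dynticks *dt, int snap)
{
	return snap != toy_dynticks_snap(dt);
}
```

<p>A grace-period thread using this model would take a snapshot, and
later treat the CPU as quiescent if either the snapshot showed idle or
the counter has since changed.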
|
||||
|
||||
<table>
|
||||
<tr><th> </th></tr>
|
||||
<tr><th align="left">Quick Quiz:</th></tr>
|
||||
<tr><td>
|
||||
But what about CPUs that remain offline for the entire
|
||||
grace period?
|
||||
</td></tr>
|
||||
<tr><th align="left">Answer:</th></tr>
|
||||
<tr><td bgcolor="#ffffff"><font color="ffffff">
|
||||
Such CPUs will be offline at the beginning of the grace period,
|
||||
so the grace period won't expect quiescent states from them.
|
||||
Races between grace-period start and CPU-hotplug operations
|
||||
are mediated by the CPU's leaf <tt>rcu_node</tt> structure's
|
||||
<tt>->lock</tt> as described above.
|
||||
</font></td></tr>
|
||||
<tr><td> </td></tr>
|
||||
</table>
|
||||
|
||||
<p>The approach must be extended to handle one final case, that
|
||||
of waking a task blocked in <tt>synchronize_rcu()</tt>.
|
||||
This task might be affinitied to a CPU that is not yet aware that
|
||||
the grace period has ended, and thus might not yet be subject to
|
||||
the grace period's memory ordering.
|
||||
Therefore, there is an <tt>smp_mb()</tt> after the return from
|
||||
<tt>wait_for_completion()</tt> in the <tt>synchronize_rcu()</tt>
|
||||
code path.
|
||||
|
||||
<table>
|
||||
<tr><th> </th></tr>
|
||||
<tr><th align="left">Quick Quiz:</th></tr>
|
||||
<tr><td>
|
||||
What? Where???
|
||||
I don't see any <tt>smp_mb()</tt> after the return from
|
||||
<tt>wait_for_completion()</tt>!!!
|
||||
</td></tr>
|
||||
<tr><th align="left">Answer:</th></tr>
|
||||
<tr><td bgcolor="#ffffff"><font color="ffffff">
|
||||
That would be because I spotted the need for that
|
||||
<tt>smp_mb()</tt> during the creation of this documentation,
|
||||
and it is therefore unlikely to hit mainline before v4.14.
|
||||
Kudos to Lance Roy, Will Deacon, Peter Zijlstra, and
|
||||
Jonathan Cameron for asking questions that sensitized me
|
||||
to the rather elaborate sequence of events that demonstrate
|
||||
the need for this memory barrier.
|
||||
</font></td></tr>
|
||||
<tr><td> </td></tr>
|
||||
</table>
|
||||
|
||||
<p>Tree RCU's grace-period memory-ordering guarantees rely most
|
||||
heavily on the <tt>rcu_node</tt> structure's <tt>->lock</tt>
|
||||
field, so much so that it is necessary to abbreviate this pattern
|
||||
in the diagrams in the next section.
|
||||
For example, consider the <tt>rcu_prepare_for_idle()</tt> function
|
||||
shown below, which is one of several functions that enforce ordering
|
||||
of newly arrived RCU callbacks against future grace periods:
|
||||
|
||||
<pre>
|
||||
1 static void rcu_prepare_for_idle(void)
|
||||
2 {
|
||||
3 bool needwake;
|
||||
4 struct rcu_data *rdp;
|
||||
5 struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
|
||||
6 struct rcu_node *rnp;
|
||||
7 struct rcu_state *rsp;
|
||||
8 int tne;
|
||||
9
|
||||
10 if (IS_ENABLED(CONFIG_RCU_NOCB_CPU_ALL) ||
|
||||
11 rcu_is_nocb_cpu(smp_processor_id()))
|
||||
12 return;
|
||||
13 tne = READ_ONCE(tick_nohz_active);
|
||||
14 if (tne != rdtp->tick_nohz_enabled_snap) {
|
||||
15 if (rcu_cpu_has_callbacks(NULL))
|
||||
16 invoke_rcu_core();
|
||||
17 rdtp->tick_nohz_enabled_snap = tne;
|
||||
18 return;
|
||||
19 }
|
||||
20 if (!tne)
|
||||
21 return;
|
||||
22 if (rdtp->all_lazy &&
|
||||
23 rdtp->nonlazy_posted != rdtp->nonlazy_posted_snap) {
|
||||
24 rdtp->all_lazy = false;
|
||||
25 rdtp->nonlazy_posted_snap = rdtp->nonlazy_posted;
|
||||
26 invoke_rcu_core();
|
||||
27 return;
|
||||
28 }
|
||||
29 if (rdtp->last_accelerate == jiffies)
|
||||
30 return;
|
||||
31 rdtp->last_accelerate = jiffies;
|
||||
32 for_each_rcu_flavor(rsp) {
|
||||
33 rdp = this_cpu_ptr(rsp->rda);
|
||||
34 if (!rcu_segcblist_pend_cbs(&rdp->cblist))
|
||||
35 continue;
|
||||
36 rnp = rdp->mynode;
|
||||
37 raw_spin_lock_rcu_node(rnp);
|
||||
38 needwake = rcu_accelerate_cbs(rsp, rnp, rdp);
|
||||
39 raw_spin_unlock_rcu_node(rnp);
|
||||
40 if (needwake)
|
||||
41 rcu_gp_kthread_wake(rsp);
|
||||
42 }
|
||||
43 }
|
||||
</pre>
|
||||
|
||||
<p>But the only part of <tt>rcu_prepare_for_idle()</tt> that really
matters for this discussion is lines 37–39.
|
||||
We will therefore abbreviate this function as follows:
|
||||
|
||||
</p><p><img src="rcu_node-lock.svg" alt="rcu_node-lock.svg">
|
||||
|
||||
<p>The box represents the <tt>rcu_node</tt> structure's <tt>->lock</tt>
|
||||
critical section, with the double line on top representing the additional
|
||||
<tt>smp_mb__after_unlock_lock()</tt>.
|
||||
|
||||
<h3><a name="Tree RCU Grace Period Memory Ordering Components">
|
||||
Tree RCU Grace Period Memory Ordering Components</a></h3>
|
||||
|
||||
<p>Tree RCU's grace-period memory-ordering guarantee is provided by
|
||||
a number of RCU components:
|
||||
|
||||
<ol>
|
||||
<li> <a href="#Callback Registry">Callback Registry</a>
|
||||
<li> <a href="#Grace-Period Initialization">Grace-Period Initialization</a>
|
||||
<li> <a href="#Self-Reported Quiescent States">
|
||||
Self-Reported Quiescent States</a>
|
||||
<li> <a href="#Dynamic Tick Interface">Dynamic Tick Interface</a>
|
||||
<li> <a href="#CPU-Hotplug Interface">CPU-Hotplug Interface</a>
|
||||
<li> <a href="#Forcing Quiescent States">Forcing Quiescent States</a>
<li> <a href="#Grace-Period Cleanup">Grace-Period Cleanup</a>
<li> <a href="#Callback Invocation">Callback Invocation</a>
|
||||
</ol>
|
||||
|
||||
<p>Each of the following sections looks at the corresponding component
|
||||
in detail.
|
||||
|
||||
<h4><a name="Callback Registry">Callback Registry</a></h4>
|
||||
|
||||
<p>If RCU's grace-period guarantee is to mean anything at all, any
|
||||
access that happens before a given invocation of <tt>call_rcu()</tt>
|
||||
must also happen before the corresponding grace period.
|
||||
The implementation of this portion of RCU's grace period guarantee
|
||||
is shown in the following figure:
|
||||
|
||||
</p><p><img src="TreeRCU-callback-registry.svg" alt="TreeRCU-callback-registry.svg">
|
||||
|
||||
<p>Because <tt>call_rcu()</tt> normally acts only on CPU-local state,
|
||||
it provides no ordering guarantees, either for itself or for
|
||||
phase one of the update (which again will usually be removal of
|
||||
an element from an RCU-protected data structure).
|
||||
It simply enqueues the <tt>rcu_head</tt> structure on a per-CPU list,
|
||||
which cannot become associated with a grace period until a later
|
||||
call to <tt>rcu_accelerate_cbs()</tt>, as shown in the diagram above.
|
||||
|
||||
<p>One set of code paths shown on the left invokes
|
||||
<tt>rcu_accelerate_cbs()</tt> via
|
||||
<tt>note_gp_changes()</tt>, either directly from <tt>call_rcu()</tt> (if
|
||||
the current CPU is inundated with queued <tt>rcu_head</tt> structures)
|
||||
or more likely from an <tt>RCU_SOFTIRQ</tt> handler.
|
||||
Another code path in the middle is taken only in kernels built with
|
||||
<tt>CONFIG_RCU_FAST_NO_HZ=y</tt>, which invokes
|
||||
<tt>rcu_accelerate_cbs()</tt> via <tt>rcu_prepare_for_idle()</tt>.
|
||||
The final code path on the right is taken only in kernels built with
|
||||
<tt>CONFIG_HOTPLUG_CPU=y</tt>, which invokes
|
||||
<tt>rcu_accelerate_cbs()</tt> via
|
||||
<tt>rcu_advance_cbs()</tt>, <tt>rcu_migrate_callbacks()</tt>,
|
||||
<tt>rcutree_migrate_callbacks()</tt>, and <tt>takedown_cpu()</tt>,
|
||||
which in turn is invoked on a surviving CPU after the outgoing
|
||||
CPU has been completely offlined.
|
||||
|
||||
<p>There are a few other code paths within grace-period processing
|
||||
that opportunistically invoke <tt>rcu_accelerate_cbs()</tt>.
|
||||
However, either way, all of the CPU's recently queued <tt>rcu_head</tt>
|
||||
structures are associated with a future grace-period number under
|
||||
the protection of the CPU's leaf <tt>rcu_node</tt> structure's
|
||||
<tt>->lock</tt>.
|
||||
In all cases, there is full ordering against any prior critical section
|
||||
for that same <tt>rcu_node</tt> structure's <tt>->lock</tt>, and
|
||||
also full ordering against any of the current task's or CPU's prior critical
|
||||
sections for any <tt>rcu_node</tt> structure's <tt>->lock</tt>.
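<p>The split between enqueueing and grace-period association can be
sketched as follows.
This is a toy model, not the kernel's segmented callback list: the
fixed-size array, the <tt>toy_</tt> names, and the convention that a
grace-period number of zero means "unassigned" are all assumptions made
for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_CBS 8

/* Hypothetical callback record; gp == 0 means "not yet assigned". */
struct toy_cb {
	void (*func)(void *arg);
	void *arg;
	unsigned long gp;
};

struct toy_cpu_cbs {
	struct toy_cb cbs[TOY_MAX_CBS];
	size_t n;
};

/* Like call_rcu() in the text: touches only CPU-local state, no ordering. */
static void toy_call_rcu(struct toy_cpu_cbs *q, void (*func)(void *), void *arg)
{
	assert(q->n < TOY_MAX_CBS);
	q->cbs[q->n++] = (struct toy_cb){ .func = func, .arg = arg, .gp = 0 };
}

/*
 * Like rcu_accelerate_cbs() in the text: tag every unassigned callback
 * with a future grace-period number. The real code does this while
 * holding the leaf rcu_node ->lock; the toy assumes the caller holds it.
 * Returns the number of callbacks newly assigned.
 */
static size_t toy_accelerate_cbs(struct toy_cpu_cbs *q, unsigned long future_gp)
{
	size_t assigned = 0;

	assert(future_gp != 0);
	for (size_t i = 0; i < q->n; i++) {
		if (q->cbs[i].gp == 0) {
			q->cbs[i].gp = future_gp;
			assigned++;
		}
	}
	return assigned;
}
```

<p>In this model the ordering guarantee attaches at
<tt>toy_accelerate_cbs()</tt>, which is the only step that would run
under the lock, and not at <tt>toy_call_rcu()</tt>.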
|
||||
|
||||
<p>The next section will show how this ordering ensures that any
|
||||
accesses prior to the <tt>call_rcu()</tt> (particularly including phase
|
||||
one of the update)
|
||||
happen before the start of the corresponding grace period.
|
||||
|
||||
<table>
|
||||
<tr><th> </th></tr>
|
||||
<tr><th align="left">Quick Quiz:</th></tr>
|
||||
<tr><td>
|
||||
But what about <tt>synchronize_rcu()</tt>?
|
||||
</td></tr>
|
||||
<tr><th align="left">Answer:</th></tr>
|
||||
<tr><td bgcolor="#ffffff"><font color="ffffff">
|
||||
The <tt>synchronize_rcu()</tt> passes <tt>call_rcu()</tt>
|
||||
to <tt>wait_rcu_gp()</tt>, which invokes it.
|
||||
So either way, it eventually comes down to <tt>call_rcu()</tt>.
|
||||
</font></td></tr>
|
||||
<tr><td> </td></tr>
|
||||
</table>
|
||||
|
||||
<h4><a name="Grace-Period Initialization">Grace-Period Initialization</a></h4>
|
||||
|
||||
<p>Grace-period initialization is carried out by
|
||||
the grace-period kernel thread, which makes several passes over the
|
||||
<tt>rcu_node</tt> tree within the <tt>rcu_gp_init()</tt> function.
|
||||
This means that showing the full flow of ordering through the
|
||||
grace-period computation will require duplicating this tree.
|
||||
If you find this confusing, please note that the state of the
|
||||
<tt>rcu_node</tt> changes over time, just like Heraclitus's river.
|
||||
However, to keep the <tt>rcu_node</tt> river tractable, the
|
||||
grace-period kernel thread's traversals are presented in multiple
|
||||
parts, starting in this section with the various phases of
|
||||
grace-period initialization.
|
||||
|
||||
<p>The first ordering-related grace-period initialization action is to
|
||||
advance the <tt>rcu_state</tt> structure's <tt>->gp_seq</tt>
|
||||
grace-period-number counter, as shown below:
|
||||
|
||||
</p><p><img src="TreeRCU-gp-init-1.svg" alt="TreeRCU-gp-init-1.svg" width="75%">
|
||||
|
||||
<p>The actual increment is carried out using <tt>smp_store_release()</tt>,
|
||||
which helps reject false-positive RCU CPU stall detection.
|
||||
Note that only the root <tt>rcu_node</tt> structure is touched.
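<p>A rough user-space model of such a grace-period sequence counter is
given here.
The layout is an assumption made for illustration (the low two bits
flag a grace period in progress and the remaining bits count grace
periods), and the <tt>toy_</tt> names are hypothetical.
The release stores mirror the <tt>smp_store_release()</tt> mentioned
above.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Assumed layout: low two bits hold in-progress state, the rest count GPs. */
#define TOY_SEQ_STATE_MASK 3UL

/* Wrap-tolerant "a >= b" for free-running unsigned counters. */
static bool toy_ulong_ge(unsigned long a, unsigned long b)
{
	return ULONG_MAX / 2 >= a - b;
}

/* Mark a grace period as started, publishing with a release store. */
static void toy_seq_start(atomic_ulong *sp)
{
	unsigned long s = atomic_load_explicit(sp, memory_order_relaxed);

	assert((s & TOY_SEQ_STATE_MASK) == 0);	/* no GP already in progress */
	atomic_store_explicit(sp, s + 1, memory_order_release);
}

/* Mark the current grace period as ended: clear state, bump the count. */
static void toy_seq_end(atomic_ulong *sp)
{
	unsigned long s = atomic_load_explicit(sp, memory_order_relaxed);

	atomic_store_explicit(sp, (s | TOY_SEQ_STATE_MASK) + 1,
			      memory_order_release);
}

/* Value the counter reaches once a full grace period has elapsed from now. */
static unsigned long toy_seq_snap(atomic_ulong *sp)
{
	unsigned long s = atomic_load_explicit(sp, memory_order_acquire);

	return (s + 2 * TOY_SEQ_STATE_MASK + 1) & ~TOY_SEQ_STATE_MASK;
}

/* Has the grace period identified by @snap completed? */
static bool toy_seq_done(atomic_ulong *sp, unsigned long snap)
{
	return toy_ulong_ge(atomic_load_explicit(sp, memory_order_acquire), snap);
}
```

<p>Note that a snapshot taken while a grace period is already in
progress asks for the end of the following one, because the in-progress
grace period might have started too early to cover the requester.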
|
||||
|
||||
<p>The first pass through the <tt>rcu_node</tt> tree updates bitmasks
|
||||
based on CPUs having come online or gone offline since the start of
|
||||
the previous grace period.
|
||||
In the common case where the number of online CPUs for a given <tt>rcu_node</tt>
|
||||
structure has not transitioned to or from zero,
|
||||
this pass will scan only the leaf <tt>rcu_node</tt> structures.
|
||||
However, if the number of online CPUs for a given leaf <tt>rcu_node</tt>
|
||||
structure has transitioned from zero,
|
||||
<tt>rcu_init_new_rnp()</tt> will be invoked for the first incoming CPU.
|
||||
Similarly, if the number of online CPUs for a given leaf <tt>rcu_node</tt>
|
||||
structure has transitioned to zero,
|
||||
<tt>rcu_cleanup_dead_rnp()</tt> will be invoked for the last outgoing CPU.
|
||||
The diagram below shows the path of ordering if the leftmost
|
||||
<tt>rcu_node</tt> structure onlines its first CPU and if the next
|
||||
<tt>rcu_node</tt> structure has no online CPUs
|
||||
(or, alternatively, if the leftmost <tt>rcu_node</tt> structure offlines
|
||||
its last CPU and if the next <tt>rcu_node</tt> structure has no online CPUs).
|
||||
|
||||
</p><p><img src="TreeRCU-gp-init-2.svg" alt="TreeRCU-gp-init-2.svg" width="75%">
|
||||
|
||||
<p>The final <tt>rcu_gp_init()</tt> pass through the <tt>rcu_node</tt>
|
||||
tree traverses breadth-first, setting each <tt>rcu_node</tt> structure's
|
||||
<tt>->gp_seq</tt> field to the newly advanced value from the
|
||||
<tt>rcu_state</tt> structure, as shown in the following diagram.
|
||||
|
||||
</p><p><img src="TreeRCU-gp-init-3.svg" alt="TreeRCU-gp-init-3.svg" width="75%">
|
||||
|
||||
<p>This change will also cause each CPU's next call to
|
||||
<tt>__note_gp_changes()</tt>
|
||||
to notice that a new grace period has started, as described in the next
|
||||
section.
|
||||
But because the grace-period kthread started the grace period at the
|
||||
root (with the advancing of the <tt>rcu_state</tt> structure's
|
||||
<tt>->gp_seq</tt> field) before setting each leaf <tt>rcu_node</tt>
|
||||
structure's <tt>->gp_seq</tt> field, each CPU's observation of
|
||||
the start of the grace period will happen after the actual start
|
||||
of the grace period.
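<p>The root-before-leaves property can be illustrated with a toy tree
stored in level order, so that an index-order walk is breadth-first.
The three-node tree, the <tt>nvisit</tt> parameter (which lets the pass
be stopped partway for inspection), and the <tt>toy_</tt> names are
assumptions of this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NUM_NODES 3	/* one root (index 0) followed by two leaves */

struct toy_rcu_node {
	unsigned long gp_seq;
};

/*
 * Visit the first @nvisit nodes in level order, setting each node's
 * ->gp_seq to the newly advanced value. The real code would hold each
 * node's ->lock while updating it.
 */
static void toy_gp_init_pass(struct toy_rcu_node *nodes, size_t nvisit,
			     unsigned long new_gp_seq)
{
	assert(nvisit <= TOY_NUM_NODES);
	for (size_t i = 0; i < nvisit; i++)
		nodes[i].gp_seq = new_gp_seq;
}

/* A CPU notices the new grace period only by comparing against its leaf. */
static bool toy_cpu_notices_new_gp(const struct toy_rcu_node *leaf,
				   unsigned long cpu_gp_seq)
{
	return leaf->gp_seq != cpu_gp_seq;
}
```

<p>Stopping the pass after the root shows the window in which the grace
period has started but a CPU comparing against its leaf has not yet
noticed.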
|
||||
|
||||
<table>
|
||||
<tr><th> </th></tr>
|
||||
<tr><th align="left">Quick Quiz:</th></tr>
|
||||
<tr><td>
|
||||
But what about the CPU that started the grace period?
|
||||
Why wouldn't it see the start of the grace period right when
|
||||
it started that grace period?
|
||||
</td></tr>
|
||||
<tr><th align="left">Answer:</th></tr>
|
||||
<tr><td bgcolor="#ffffff"><font color="ffffff">
|
||||
In some deep philosophical and overly anthropomorphized
|
||||
sense, yes, the CPU starting the grace period is immediately
|
||||
aware of having done so.
|
||||
However, if we instead assume that RCU is not self-aware,
|
||||
then even the CPU starting the grace period does not really
|
||||
become aware of the start of this grace period until its
|
||||
first call to <tt>__note_gp_changes()</tt>.
|
||||
On the other hand, this CPU potentially gets early notification
|
||||
because it invokes <tt>__note_gp_changes()</tt> during its
|
||||
last <tt>rcu_gp_init()</tt> pass through its leaf
|
||||
<tt>rcu_node</tt> structure.
|
||||
</font></td></tr>
|
||||
<tr><td> </td></tr>
|
||||
</table>
|
||||
|
||||
<h4><a name="Self-Reported Quiescent States">
|
||||
Self-Reported Quiescent States</a></h4>
|
||||
|
||||
<p>When all entities that might block the grace period have reported
|
||||
quiescent states (or as described in a later section, had quiescent
|
||||
states reported on their behalf), the grace period can end.
|
||||
Online non-idle CPUs report their own quiescent states, as shown
|
||||
in the following diagram:
|
||||
|
||||
</p><p><img src="TreeRCU-qs.svg" alt="TreeRCU-qs.svg" width="75%">
|
||||
|
||||
<p>This diagram shows the path taken by the last CPU to report a quiescent state, which signals
|
||||
the end of the grace period.
|
||||
Earlier quiescent states would push up the <tt>rcu_node</tt> tree
|
||||
only until they encountered an <tt>rcu_node</tt> structure that
|
||||
is waiting for additional quiescent states.
|
||||
However, ordering is nevertheless preserved because some later quiescent
|
||||
state will acquire that <tt>rcu_node</tt> structure's <tt>->lock</tt>.
|
||||
|
||||
<p>Any number of events can lead up to a CPU invoking
|
||||
<tt>note_gp_changes()</tt> (or alternatively, directly invoking
|
||||
<tt>__note_gp_changes()</tt>), at which point that CPU will notice
|
||||
the start of a new grace period while holding its leaf
|
||||
<tt>rcu_node</tt> lock.
|
||||
Therefore, all execution shown in this diagram happens after the
|
||||
start of the grace period.
|
||||
In addition, this CPU will consider any RCU read-side critical
|
||||
section that started before the invocation of <tt>__note_gp_changes()</tt>
|
||||
to have started before the grace period, and thus a critical
|
||||
section that the grace period must wait on.
|
||||
|
||||
<table>
|
||||
<tr><th> </th></tr>
|
||||
<tr><th align="left">Quick Quiz:</th></tr>
|
||||
<tr><td>
|
||||
But an RCU read-side critical section might have started
|
||||
after the beginning of the grace period
|
||||
(the advancing of <tt>->gp_seq</tt> from earlier), so why should
|
||||
the grace period wait on such a critical section?
|
||||
</td></tr>
|
||||
<tr><th align="left">Answer:</th></tr>
|
||||
<tr><td bgcolor="#ffffff"><font color="ffffff">
|
||||
It is indeed not necessary for the grace period to wait on such
|
||||
a critical section.
|
||||
However, it is permissible to wait on it.
|
||||
And it is furthermore important to wait on it, as this
|
||||
lazy approach is far more scalable than a “big bang”
|
||||
all-at-once grace-period start could possibly be.
|
||||
</font></td></tr>
|
||||
<tr><td> </td></tr>
|
||||
</table>
|
||||
|
||||
<p>If the CPU does a context switch, a quiescent state will be
|
||||
noted by <tt>rcu_note_context_switch()</tt> on the left.
|
||||
On the other hand, if the CPU takes a scheduler-clock interrupt
|
||||
while executing in usermode, a quiescent state will be noted by
|
||||
<tt>rcu_sched_clock_irq()</tt> on the right.
|
||||
Either way, the passage through a quiescent state will be noted
|
||||
in a per-CPU variable.
|
||||
|
||||
<p>The next time an <tt>RCU_SOFTIRQ</tt> handler executes on
|
||||
this CPU (for example, after the next scheduler-clock
|
||||
interrupt), <tt>rcu_core()</tt> will invoke
|
||||
<tt>rcu_check_quiescent_state()</tt>, which will notice the
|
||||
recorded quiescent state, and invoke
|
||||
<tt>rcu_report_qs_rdp()</tt>.
|
||||
If <tt>rcu_report_qs_rdp()</tt> verifies that the quiescent state
|
||||
really does apply to the current grace period, it invokes
|
||||
<tt>rcu_report_qs_rnp()</tt>, which traverses up the <tt>rcu_node</tt>
|
||||
tree as shown at the bottom of the diagram, clearing bits from
|
||||
each <tt>rcu_node</tt> structure's <tt>->qsmask</tt> field,
|
||||
and propagating up the tree when the result is zero.
|
||||
|
||||
<p>Note that traversal passes upwards out of a given <tt>rcu_node</tt>
|
||||
structure only if the current CPU is reporting the last quiescent
|
||||
state for the subtree headed by that <tt>rcu_node</tt> structure.
|
||||
A key point is that if a CPU's traversal stops at a given <tt>rcu_node</tt>
|
||||
structure, then there will be a later traversal by another CPU
|
||||
(or perhaps the same one) that proceeds upwards
|
||||
from that point, and the <tt>rcu_node</tt> <tt>->lock</tt>
|
||||
guarantees that the first CPU's quiescent state happens before the
|
||||
remainder of the second CPU's traversal.
|
||||
Applying this line of thought repeatedly shows that all CPUs'
|
||||
quiescent states happen before the last CPU traverses through
|
||||
the root <tt>rcu_node</tt> structure, the “last CPU”
|
||||
being the one that clears the last bit in the root <tt>rcu_node</tt>
|
||||
structure's <tt>->qsmask</tt> field.
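<p>The upward propagation described in this section can be sketched
with a reduced <tt>rcu_node</tt> containing only a mask and a parent
pointer.
The <tt>toy_</tt> names and the <tt>grpmask</tt> field (this node's bit
in its parent's mask) are assumptions of the sketch, and locking is
indicated only by a comment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, much-reduced rcu_node: a mask of children still awaited. */
struct toy_rcu_node {
	unsigned long qsmask;		/* CPUs/children not yet quiescent */
	unsigned long grpmask;		/* this node's bit in parent's qsmask */
	struct toy_rcu_node *parent;	/* NULL for the root */
};

/*
 * Report a quiescent state for the child identified by @mask, starting
 * at @rnp. Returns true if this report cleared the last bit in the
 * root, meaning that the grace period may end.
 */
static bool toy_report_qs(struct toy_rcu_node *rnp, unsigned long mask)
{
	for (;;) {
		/* The real code holds rnp->lock across this update. */
		assert(rnp->qsmask & mask);
		rnp->qsmask &= ~mask;
		if (rnp->qsmask != 0)
			return false;	/* others outstanding: stop here */
		if (rnp->parent == NULL)
			return true;	/* root is clear */
		mask = rnp->grpmask;	/* report this subtree to the parent */
		rnp = rnp->parent;
	}
}
```

<p>With one root and two leaves of two CPUs each, the first three
reports stop at a leaf or at the root, and only the fourth returns true.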
|
||||
|
||||
<h4><a name="Dynamic Tick Interface">Dynamic Tick Interface</a></h4>
|
||||
|
||||
<p>Due to energy-efficiency considerations, RCU is forbidden from
|
||||
disturbing idle CPUs.
|
||||
CPUs are therefore required to notify RCU when entering or leaving idle
|
||||
state, which they do via fully ordered value-returning atomic operations
|
||||
on a per-CPU variable.
|
||||
The ordering effects are as shown below:
|
||||
|
||||
</p><p><img src="TreeRCU-dyntick.svg" alt="TreeRCU-dyntick.svg" width="50%">
|
||||
|
||||
<p>The RCU grace-period kernel thread samples the per-CPU idleness
|
||||
variable while holding the corresponding CPU's leaf <tt>rcu_node</tt>
|
||||
structure's <tt>->lock</tt>.
|
||||
This means that any RCU read-side critical sections that precede the
|
||||
idle period (the oval near the top of the diagram above) will happen
|
||||
before the end of the current grace period.
|
||||
Similarly, the beginning of the current grace period will happen before
|
||||
any RCU read-side critical sections that follow the
|
||||
idle period (the oval near the bottom of the diagram above).
|
||||
|
||||
<p>Plumbing this into the full grace-period execution is described
|
||||
<a href="#Forcing Quiescent States">below</a>.
|
||||
|
||||
<h4><a name="CPU-Hotplug Interface">CPU-Hotplug Interface</a></h4>
|
||||
|
||||
<p>RCU is also forbidden from disturbing offline CPUs, which might well
|
||||
be powered off and removed from the system completely.
|
||||
CPUs are therefore required to notify RCU of their comings and goings
|
||||
as part of the corresponding CPU hotplug operations.
|
||||
The ordering effects are shown below:
|
||||
|
||||
</p><p><img src="TreeRCU-hotplug.svg" alt="TreeRCU-hotplug.svg" width="50%">
|
||||
|
||||
<p>Because CPU hotplug operations are much less frequent than idle transitions,
|
||||
they are heavier weight, and thus acquire the CPU's leaf <tt>rcu_node</tt>
|
||||
structure's <tt>->lock</tt> and update this structure's
|
||||
<tt>->qsmaskinitnext</tt>.
|
||||
The RCU grace-period kernel thread samples this mask to detect CPUs
|
||||
having gone offline since the beginning of this grace period.
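<p>A toy model of this bookkeeping follows.
It assumes a single leaf, models <tt>->qsmaskinitnext</tt> as a plain
bitmask, and copies it directly into <tt>->qsmask</tt> at grace-period
start, which skips intermediate steps in the real code.
The <tt>toy_</tt> names are hypothetical.

```c
#include <assert.h>
#include <limits.h>

#define TOY_MAX_CPUS (sizeof(unsigned long) * CHAR_BIT)

struct toy_leaf {
	unsigned long qsmaskinitnext;	/* CPUs online as of the latest hotplug event */
	unsigned long qsmask;		/* CPUs the current grace period waits on */
};

/* Hotplug paths; the real code holds the leaf's ->lock around these. */
static void toy_cpu_online(struct toy_leaf *rnp, unsigned int cpu)
{
	assert(cpu < TOY_MAX_CPUS);
	rnp->qsmaskinitnext |= 1UL << cpu;
}

static void toy_cpu_offline(struct toy_leaf *rnp, unsigned int cpu)
{
	assert(cpu < TOY_MAX_CPUS);
	rnp->qsmaskinitnext &= ~(1UL << cpu);
}

/* Grace-period start: wait only on CPUs that are online right now. */
static void toy_gp_start(struct toy_leaf *rnp)
{
	rnp->qsmask = rnp->qsmaskinitnext;
}

/* CPUs awaited by the grace period that have since gone offline. */
static unsigned long toy_offlined_since_gp_start(const struct toy_leaf *rnp)
{
	return rnp->qsmask & ~rnp->qsmaskinitnext;
}
```

<p>A CPU that comes online after <tt>toy_gp_start()</tt> does not
appear in <tt>qsmask</tt>, so the current grace period does not wait on
it.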
|
||||
|
||||
<p>Plumbing this into the full grace-period execution is described
|
||||
<a href="#Forcing Quiescent States">below</a>.
|
||||
|
||||
<h4><a name="Forcing Quiescent States">Forcing Quiescent States</a></h4>
|
||||
|
||||
<p>As noted above, idle and offline CPUs cannot report their own
|
||||
quiescent states, and therefore the grace-period kernel thread
|
||||
must do the reporting on their behalf.
|
||||
This process is called “forcing quiescent states”; it is
|
||||
repeated every few jiffies, and its ordering effects are shown below:
|
||||
|
||||
</p><p><img src="TreeRCU-gp-fqs.svg" alt="TreeRCU-gp-fqs.svg" width="100%">
|
||||
|
||||
<p>Each pass of quiescent state forcing is guaranteed to traverse the
|
||||
leaf <tt>rcu_node</tt> structures, and if there are no new quiescent
|
||||
states due to recently idled and/or offlined CPUs, then only the
|
||||
leaves are traversed.
|
||||
However, if there is a newly offlined CPU as illustrated on the left
|
||||
or a newly idled CPU as illustrated on the right, the corresponding
|
||||
quiescent state will be driven up towards the root.
|
||||
As with self-reported quiescent states, the upwards driving stops
|
||||
once it reaches an <tt>rcu_node</tt> structure that has quiescent
|
||||
states outstanding from other CPUs.
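<p>One forcing pass over a single leaf can be sketched as follows.
The per-CPU <tt>idle</tt> and <tt>offline</tt> flags are stand-ins for
the dyntick and hotplug checks described earlier, and the
<tt>toy_</tt> names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_CPUS_PER_LEAF 4

struct toy_leaf {
	unsigned long qsmask;			/* CPUs still awaited */
	bool idle[TOY_CPUS_PER_LEAF];		/* stand-in for the dyntick check */
	bool offline[TOY_CPUS_PER_LEAF];	/* stand-in for the hotplug check */
};

/*
 * One forcing pass over a single leaf: report a quiescent state on
 * behalf of every awaited CPU that is idle or offline. Returns true
 * if the leaf has no CPUs left to wait on, in which case the real code
 * would continue reporting up towards the root.
 */
static bool toy_force_qs_leaf(struct toy_leaf *rnp)
{
	for (unsigned int cpu = 0; cpu < TOY_CPUS_PER_LEAF; cpu++) {
		unsigned long bit = 1UL << cpu;

		if ((rnp->qsmask & bit) && (rnp->idle[cpu] || rnp->offline[cpu]))
			rnp->qsmask &= ~bit;
	}
	return rnp->qsmask == 0;
}
```

<p>CPUs that are neither idle nor offline keep their bits set, and are
expected to report their own quiescent states.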
|
||||
|
||||
<table>
|
||||
<tr><th> </th></tr>
|
||||
<tr><th align="left">Quick Quiz:</th></tr>
|
||||
<tr><td>
|
||||
The leftmost drive to root stopped before it reached
|
||||
the root <tt>rcu_node</tt> structure, which means that
|
||||
there are still CPUs subordinate to that structure on
|
||||
which the current grace period is waiting.
|
||||
Given that, how is it possible that the rightmost drive
|
||||
to root ended the grace period?
|
||||
</td></tr>
|
||||
<tr><th align="left">Answer:</th></tr>
|
||||
<tr><td bgcolor="#ffffff"><font color="ffffff">
|
||||
Good analysis!
|
||||
It is in fact impossible in the absence of bugs in RCU.
|
||||
But this diagram is complex enough as it is, so simplicity
|
||||
overrode accuracy.
|
||||
You can think of it as poetic license, or you can think of
|
||||
it as misdirection that is resolved in the
|
||||
<a href="#Putting It All Together">stitched-together diagram</a>.
|
||||
</font></td></tr>
|
||||
<tr><td> </td></tr>
|
||||
</table>
|
||||
|
||||
<h4><a name="Grace-Period Cleanup">Grace-Period Cleanup</a></h4>
|
||||
|
||||
<p>Grace-period cleanup first scans the <tt>rcu_node</tt> tree
|
||||
breadth-first advancing all the <tt>->gp_seq</tt> fields, then it
|
||||
advances the <tt>rcu_state</tt> structure's <tt>->gp_seq</tt> field.
|
||||
The ordering effects are shown below:
|
||||
|
||||
</p><p><img src="TreeRCU-gp-cleanup.svg" alt="TreeRCU-gp-cleanup.svg" width="75%">
|
||||
|
||||
<p>As indicated by the oval at the bottom of the diagram, once
|
||||
grace-period cleanup is complete, the next grace period can begin.
|
||||
|
||||
<table>
|
||||
<tr><th> </th></tr>
|
||||
<tr><th align="left">Quick Quiz:</th></tr>
|
||||
<tr><td>
|
||||
But when precisely does the grace period end?
|
||||
</td></tr>
|
||||
<tr><th align="left">Answer:</th></tr>
|
||||
<tr><td bgcolor="#ffffff"><font color="ffffff">
|
||||
There is no useful single point at which the grace period
|
||||
can be said to end.
|
||||
The earliest reasonable candidate is as soon as the last
|
||||
CPU has reported its quiescent state, but it may be some
|
||||
milliseconds before RCU becomes aware of this.
|
||||
The latest reasonable candidate is once the <tt>rcu_state</tt>
|
||||
structure's <tt>->gp_seq</tt> field has been updated,
|
||||
but it is quite possible that some CPUs have already completed
|
||||
phase two of their updates by that time.
|
||||
In short, if you are going to work with RCU, you need to
|
||||
learn to embrace uncertainty.
|
||||
</font></td></tr>
|
||||
<tr><td> </td></tr>
|
||||
</table>
|
||||
|
||||
|
||||
<h4><a name="Callback Invocation">Callback Invocation</a></h4>
|
||||
|
||||
<p>Once a given CPU's leaf <tt>rcu_node</tt> structure's
|
||||
<tt>->gp_seq</tt> field has been updated, that CPU can begin
|
||||
invoking its RCU callbacks that were waiting for this grace period
|
||||
to end.
|
||||
These callbacks are identified by <tt>rcu_advance_cbs()</tt>,
|
||||
which is usually invoked by <tt>__note_gp_changes()</tt>.
|
||||
As shown in the diagram below, this invocation can be triggered by
|
||||
the scheduling-clock interrupt (<tt>rcu_sched_clock_irq()</tt> on
|
||||
the left) or by idle entry (<tt>rcu_cleanup_after_idle()</tt> on
|
||||
the right, but only for kernels built with
|
||||
<tt>CONFIG_RCU_FAST_NO_HZ=y</tt>).
|
||||
Either way, <tt>RCU_SOFTIRQ</tt> is raised, which results in
|
||||
<tt>rcu_do_batch()</tt> invoking the callbacks, which in turn
|
||||
allows those callbacks to carry out (either directly or indirectly
|
||||
via wakeup) the needed phase-two processing for each update.
|
||||
|
||||
</p><p><img src="TreeRCU-callback-invocation.svg" alt="TreeRCU-callback-invocation.svg" width="60%">
|
||||
|
||||
<p>Please note that callback invocation can also be prompted by any
|
||||
number of corner-case code paths, for example, when a CPU notes that
|
||||
it has excessive numbers of callbacks queued.
|
||||
In all cases, the CPU acquires its leaf <tt>rcu_node</tt> structure's
|
||||
<tt>->lock</tt> before invoking callbacks, which preserves the
|
||||
required ordering against the newly completed grace period.
|
||||
|
||||
<p>However, if the callback function communicates to other CPUs,
|
||||
for example, doing a wakeup, then it is that function's responsibility
|
||||
to maintain ordering.
|
||||
For example, if the callback function wakes up a task that runs on
|
||||
some other CPU, proper ordering must be in place in both the callback
|
||||
function and the task being awakened.
|
||||
To see why this is important, consider the top half of the
|
||||
<a href="#Grace-Period Cleanup">grace-period cleanup</a> diagram.
|
||||
The callback might be running on a CPU corresponding to the leftmost
|
||||
leaf <tt>rcu_node</tt> structure, and awaken a task that is to run on
|
||||
a CPU corresponding to the rightmost leaf <tt>rcu_node</tt> structure,
|
||||
and the grace-period kernel thread might not yet have reached the
|
||||
rightmost leaf.
|
||||
In this case, the grace period's memory ordering might not yet have
|
||||
reached that CPU, so again the callback function and the awakened
|
||||
task must supply proper ordering.
|
||||
|
||||
<h3><a name="Putting It All Together">Putting It All Together</a></h3>
|
||||
|
||||
<p>A stitched-together diagram is
|
||||
<a href="Tree-RCU-Diagram.html">here</a>.
|
||||
|
||||
<h3><a name="Legal Statement">
|
||||
Legal Statement</a></h3>
|
||||
|
||||
<p>This work represents the view of the author and does not necessarily
|
||||
represent the view of IBM.
|
||||
|
||||
</p><p>Linux is a registered trademark of Linus Torvalds.
|
||||
|
||||
</p><p>Other company, product, and service names may be trademarks or
|
||||
service marks of others.
|
||||
|
||||
</body></html>
|
@ -0,0 +1,624 @@
|
||||
======================================================
|
||||
A Tour Through TREE_RCU's Grace-Period Memory Ordering
|
||||
======================================================
|
||||
|
||||
August 8, 2017
|
||||
|
||||
This article was contributed by Paul E. McKenney
|
||||
|
||||
Introduction
|
||||
============
|
||||
|
||||
This document gives a rough visual overview of how Tree RCU's
|
||||
grace-period memory ordering guarantee is provided.
|
||||
|
||||
What Is Tree RCU's Grace Period Memory Ordering Guarantee?
|
||||
==========================================================
|
||||
|
||||
RCU grace periods provide extremely strong memory-ordering guarantees
|
||||
for non-idle non-offline code.
|
||||
Any code that happens after the end of a given RCU grace period is guaranteed
|
||||
to see the effects of all accesses prior to the beginning of that grace
|
||||
period that are within RCU read-side critical sections.
|
||||
Similarly, any code that happens before the beginning of a given RCU grace
|
||||
period is guaranteed not to see the effects of all accesses following the end
|
||||
of that grace period that are within RCU read-side critical sections.
|
||||
|
||||
Note well that RCU-sched read-side critical sections include any region
|
||||
of code for which preemption is disabled.
|
||||
Given that each individual machine instruction can be thought of as
|
||||
an extremely small region of preemption-disabled code, one can think of
|
||||
``synchronize_rcu()`` as ``smp_mb()`` on steroids.
|
||||
|
||||
RCU updaters use this guarantee by splitting their updates into
|
||||
two phases, one of which is executed before the grace period and
|
||||
the other of which is executed after the grace period.
|
||||
In the most common use case, phase one removes an element from
|
||||
a linked RCU-protected data structure, and phase two frees that element.
|
||||
For this to work, any readers that have witnessed state prior to the
|
||||
phase-one update (in the common case, removal) must not witness state
|
||||
following the phase-two update (in the common case, freeing).
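
The two-phase pattern described above can be sketched in userspace C.
This is a single-threaded toy model, not kernel code: ``toy_call_rcu()``
and ``toy_grace_period_elapsed()`` are hypothetical stand-ins for
``call_rcu()`` and the passage of a grace period, and no real readers or
memory ordering are involved. It only illustrates that removal (phase one)
happens immediately, while freeing (phase two) is deferred.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy stand-in for the kernel's rcu_head: one deferred phase-two action. */
struct toy_rcu_head {
	struct toy_rcu_head *next;
	void (*func)(struct toy_rcu_head *head);
};

struct toy_node {
	int key;
	struct toy_node *next;
	struct toy_rcu_head rh;
};

struct toy_rcu_head *toy_pending;	/* callbacks awaiting a grace period */
int toy_freed_count;			/* number of phase-two frees performed */

/* Hypothetical analog of call_rcu(): queue the callback, nothing more. */
void toy_call_rcu(struct toy_rcu_head *head,
		  void (*func)(struct toy_rcu_head *head))
{
	head->func = func;
	head->next = toy_pending;
	toy_pending = head;
}

/* Pretend that a grace period has elapsed, then run every phase two. */
void toy_grace_period_elapsed(void)
{
	struct toy_rcu_head *head = toy_pending;

	toy_pending = NULL;
	while (head) {
		struct toy_rcu_head *next = head->next;

		head->func(head);
		head = next;
	}
}

/* Phase two: free the element that phase one unlinked. */
void toy_free_node(struct toy_rcu_head *head)
{
	struct toy_node *n = (struct toy_node *)
		((char *)head - offsetof(struct toy_node, rh));

	free(n);
	toy_freed_count++;
}

/* Insert a new element at the head of the list. */
struct toy_node *toy_insert(struct toy_node **listp, int key)
{
	struct toy_node *n = malloc(sizeof(*n));

	assert(n);
	n->key = key;
	n->next = *listp;
	*listp = n;
	return n;
}

/* Phase one: unlink the element; its free is deferred. Returns 1 if found. */
int toy_remove(struct toy_node **listp, int key)
{
	for (; *listp; listp = &(*listp)->next) {
		if ((*listp)->key == key) {
			struct toy_node *victim = *listp;

			*listp = victim->next;
			toy_call_rcu(&victim->rh, toy_free_node);
			return 1;
		}
	}
	return 0;
}
```

After ``toy_remove()`` returns, the element is unreachable from the list
but still allocated, so a (hypothetical) pre-existing reader could keep
using it until ``toy_grace_period_elapsed()`` runs the deferred free.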
|
||||
|
||||
The RCU implementation provides this guarantee using a network
|
||||
of lock-based critical sections, memory barriers, and per-CPU
|
||||
processing, as is described in the following sections.
|
||||
|
||||
Tree RCU Grace Period Memory Ordering Building Blocks
|
||||
=====================================================
|
||||
|
||||
The workhorse for RCU's grace-period memory ordering is the
|
||||
critical section for the ``rcu_node`` structure's
|
||||
``->lock``. These critical sections use helper functions for lock
|
||||
acquisition, including ``raw_spin_lock_rcu_node()``,
|
||||
``raw_spin_lock_irq_rcu_node()``, and ``raw_spin_lock_irqsave_rcu_node()``.
|
||||
Their lock-release counterparts are ``raw_spin_unlock_rcu_node()``,
|
||||
``raw_spin_unlock_irq_rcu_node()``, and
|
||||
``raw_spin_unlock_irqrestore_rcu_node()``, respectively.
|
||||
For completeness, a ``raw_spin_trylock_rcu_node()`` is also provided.
|
||||
The key point is that the lock-acquisition functions, including
|
||||
``raw_spin_trylock_rcu_node()``, all invoke ``smp_mb__after_unlock_lock()``
|
||||
immediately after successful acquisition of the lock.
|
||||
|
||||
Therefore, for any given ``rcu_node`` structure, any access
|
||||
happening before one of the above lock-release functions will be seen
|
||||
by all CPUs as happening before any access happening after a later
|
||||
one of the above lock-acquisition functions.
|
||||
Furthermore, any access happening before one of the
|
||||
above lock-release functions on any given CPU will be seen by all
|
||||
CPUs as happening before any access happening after a later one
|
||||
of the above lock-acquisition functions executing on that same CPU,
|
||||
even if the lock-release and lock-acquisition functions are operating
|
||||
on different ``rcu_node`` structures.
|
||||
Tree RCU uses these two ordering guarantees to form an ordering
|
||||
network among all CPUs that were in any way involved in the grace
|
||||
period, including any CPUs that came online or went offline during
|
||||
the grace period in question.
|
||||
|
||||
The following litmus test exhibits the ordering effects of these
|
||||
lock-acquisition and lock-release functions::
|
||||
|
||||
    int x, y, z;

    void task0(void)
    {
      raw_spin_lock_rcu_node(rnp);
      WRITE_ONCE(x, 1);
      r1 = READ_ONCE(y);
      raw_spin_unlock_rcu_node(rnp);
    }

    void task1(void)
    {
      raw_spin_lock_rcu_node(rnp);
      WRITE_ONCE(y, 1);
      r2 = READ_ONCE(z);
      raw_spin_unlock_rcu_node(rnp);
    }

    void task2(void)
    {
      WRITE_ONCE(z, 1);
      smp_mb();
      r3 = READ_ONCE(x);
    }

    WARN_ON(r1 == 0 && r2 == 0 && r3 == 0);
|
||||
|
||||
The ``WARN_ON()`` is evaluated at “the end of time”,
|
||||
after all changes have propagated throughout the system.
|
||||
Without the ``smp_mb__after_unlock_lock()`` provided by the
|
||||
acquisition functions, this ``WARN_ON()`` could trigger, for example
|
||||
on PowerPC.
|
||||
The ``smp_mb__after_unlock_lock()`` invocations prevent this
|
||||
``WARN_ON()`` from triggering.
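
One way to see why full ordering forbids this outcome is to enumerate
every interleaving of the three tasks by brute force. The sketch below
is a userspace toy and all of its names are hypothetical. It models only
sequentially consistent executions in which the two lock-based critical
sections exclude each other, which is the behavior that the
``smp_mb__after_unlock_lock()`` invocations are intended to provide. It
says nothing about what weakly ordered hardware does without them.

```c
#include <assert.h>

/* Shared variables, result registers, and scheduling state. */
struct litmus_state {
	int x, y, z;
	int r1, r2, r3;
	int pc[3];	/* next operation of each task; 2 means finished */
	int holder;	/* task holding the modeled ->lock, or -1 if free */
};

/* Can task t execute its next operation in state s? */
static int litmus_can_run(const struct litmus_state *s, int t)
{
	if (s->pc[t] >= 2)
		return 0;
	/* task0 and task1 acquire the lock before their first access. */
	if (t < 2 && s->pc[t] == 0 && s->holder != -1)
		return 0;
	return 1;
}

/* Execute the next operation of task t. */
static void litmus_step(struct litmus_state *s, int t)
{
	int op = s->pc[t]++;

	switch (t) {
	case 0:
		if (op == 0) {
			s->holder = 0;		/* lock */
			s->x = 1;
		} else {
			s->r1 = s->y;
			s->holder = -1;		/* unlock */
		}
		break;
	case 1:
		if (op == 0) {
			s->holder = 1;		/* lock */
			s->y = 1;
		} else {
			s->r2 = s->z;
			s->holder = -1;		/* unlock */
		}
		break;
	default:
		if (op == 0)
			s->z = 1;		/* followed by smp_mb() */
		else
			s->r3 = s->x;
		break;
	}
}

/* Depth-first walk over all interleavings, counting the outcomes. */
static void litmus_explore(struct litmus_state s, long *total,
			   long *forbidden)
{
	int t, progressed = 0;

	for (t = 0; t < 3; t++) {
		struct litmus_state next = s;

		if (!litmus_can_run(&s, t))
			continue;
		progressed = 1;
		litmus_step(&next, t);
		litmus_explore(next, total, forbidden);
	}
	if (!progressed) {
		assert(s.pc[0] == 2 && s.pc[1] == 2 && s.pc[2] == 2);
		(*total)++;
		if (s.r1 == 0 && s.r2 == 0 && s.r3 == 0)
			(*forbidden)++;
	}
}

/* Returns how many complete executions would trigger the WARN_ON(). */
long litmus_count_forbidden(long *total)
{
	struct litmus_state init = { .holder = -1 };
	long forbidden = 0;

	*total = 0;
	litmus_explore(init, total, &forbidden);
	return forbidden;
}
```

The reasoning the enumeration confirms is short: ``r1 == 0`` forces
``task0()``'s critical section to precede ``task1()``'s, ``r2 == 0`` forces
``task1()``'s read of ``z`` to precede ``task2()``'s write, and so
``task2()``'s later read of ``x`` must observe the value 1.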
|
||||
|
||||
This approach must be extended to include idle CPUs, which need
|
||||
RCU's grace-period memory ordering guarantee to extend to any
|
||||
RCU read-side critical sections preceding and following the current
|
||||
idle sojourn.
|
||||
This case is handled by calls to the strongly ordered
|
||||
``atomic_add_return()`` read-modify-write atomic operation that
|
||||
is invoked within ``rcu_dynticks_eqs_enter()`` at idle-entry
|
||||
time and within ``rcu_dynticks_eqs_exit()`` at idle-exit time.
|
||||
The grace-period kthread invokes ``rcu_dynticks_snap()`` and
|
||||
``rcu_dynticks_in_eqs_since()`` (both of which invoke
|
||||
an ``atomic_add_return()`` of zero) to detect idle CPUs.
|
||||
|
||||
+-----------------------------------------------------------------------+
|
||||
| **Quick Quiz**: |
|
||||
+-----------------------------------------------------------------------+
|
||||
| But what about CPUs that remain offline for the entire grace period? |
|
||||
+-----------------------------------------------------------------------+
|
||||
| **Answer**: |
|
||||
+-----------------------------------------------------------------------+
|
||||
| Such CPUs will be offline at the beginning of the grace period, so |
|
||||
| the grace period won't expect quiescent states from them. Races |
|
||||
| between grace-period start and CPU-hotplug operations are mediated |
|
||||
| by the CPU's leaf ``rcu_node`` structure's ``->lock`` as described |
|
||||
| above. |
|
||||
+-----------------------------------------------------------------------+
|
||||
|
||||
The approach must be extended to handle one final case, that of waking a
|
||||
task blocked in ``synchronize_rcu()``. This task might be affinitied to
|
||||
a CPU that is not yet aware that the grace period has ended, and thus
|
||||
might not yet be subject to the grace period's memory ordering.
|
||||
Therefore, there is an ``smp_mb()`` after the return from
|
||||
``wait_for_completion()`` in the ``synchronize_rcu()`` code path.
|
||||
|
||||
+-----------------------------------------------------------------------+
|
||||
| **Quick Quiz**: |
|
||||
+-----------------------------------------------------------------------+
|
||||
| What? Where??? I don't see any ``smp_mb()`` after the return from |
|
||||
| ``wait_for_completion()``!!! |
|
||||
+-----------------------------------------------------------------------+
|
||||
| **Answer**: |
|
||||
+-----------------------------------------------------------------------+
|
||||
| That would be because I spotted the need for that ``smp_mb()`` during |
|
||||
| the creation of this documentation, and it is therefore unlikely to |
|
||||
| hit mainline before v4.14. Kudos to Lance Roy, Will Deacon, Peter |
|
||||
| Zijlstra, and Jonathan Cameron for asking questions that sensitized |
|
||||
| me to the rather elaborate sequence of events that demonstrate the |
|
||||
| need for this memory barrier. |
|
||||
+-----------------------------------------------------------------------+
|
||||
|
||||
Tree RCU's grace-period memory-ordering guarantees rely most heavily on
|
||||
the ``rcu_node`` structure's ``->lock`` field, so much so that it is
|
||||
necessary to abbreviate this pattern in the diagrams in the next
|
||||
section. For example, consider the ``rcu_prepare_for_idle()`` function
|
||||
shown below, which is one of several functions that enforce ordering of
|
||||
newly arrived RCU callbacks against future grace periods:
|
||||
|
||||
::
|
||||
|
||||
    1 static void rcu_prepare_for_idle(void)
    2 {
    3   bool needwake;
    4   struct rcu_data *rdp;
    5   struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
    6   struct rcu_node *rnp;
    7   struct rcu_state *rsp;
    8   int tne;
    9
   10   if (IS_ENABLED(CONFIG_RCU_NOCB_CPU_ALL) ||
   11       rcu_is_nocb_cpu(smp_processor_id()))
   12     return;
   13   tne = READ_ONCE(tick_nohz_active);
   14   if (tne != rdtp->tick_nohz_enabled_snap) {
   15     if (rcu_cpu_has_callbacks(NULL))
   16       invoke_rcu_core();
   17     rdtp->tick_nohz_enabled_snap = tne;
   18     return;
   19   }
   20   if (!tne)
   21     return;
   22   if (rdtp->all_lazy &&
   23       rdtp->nonlazy_posted != rdtp->nonlazy_posted_snap) {
   24     rdtp->all_lazy = false;
   25     rdtp->nonlazy_posted_snap = rdtp->nonlazy_posted;
   26     invoke_rcu_core();
   27     return;
   28   }
   29   if (rdtp->last_accelerate == jiffies)
   30     return;
   31   rdtp->last_accelerate = jiffies;
   32   for_each_rcu_flavor(rsp) {
   33     rdp = this_cpu_ptr(rsp->rda);
   34     if (!rcu_segcblist_pend_cbs(&rdp->cblist))
   35       continue;
   36     rnp = rdp->mynode;
   37     raw_spin_lock_rcu_node(rnp);
   38     needwake = rcu_accelerate_cbs(rsp, rnp, rdp);
   39     raw_spin_unlock_rcu_node(rnp);
   40     if (needwake)
   41       rcu_gp_kthread_wake(rsp);
   42   }
   43 }
|
||||
|
||||
But the only part of ``rcu_prepare_for_idle()`` that really matters for
this discussion is lines 37–39. We will therefore abbreviate this
|
||||
function as follows:
|
||||
|
||||
.. kernel-figure:: rcu_node-lock.svg
|
||||
|
||||
The box represents the ``rcu_node`` structure's ``->lock`` critical
|
||||
section, with the double line on top representing the additional
|
||||
``smp_mb__after_unlock_lock()``.
|
||||
|
||||
Tree RCU Grace Period Memory Ordering Components
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
Tree RCU's grace-period memory-ordering guarantee is provided by a
|
||||
number of RCU components:
|
||||
|
||||
#. `Callback Registry`_
|
||||
#. `Grace-Period Initialization`_
|
||||
#. `Self-Reported Quiescent States`_
|
||||
#. `Dynamic Tick Interface`_
|
||||
#. `CPU-Hotplug Interface`_
|
||||
#. `Forcing Quiescent States`_
|
||||
#. `Grace-Period Cleanup`_
|
||||
#. `Callback Invocation`_
|
||||
|
||||
Each of the following sections looks at the corresponding component in
|
||||
detail.
|
||||
|
||||
Callback Registry
|
||||
^^^^^^^^^^^^^^^^^
|
||||
|
||||
If RCU's grace-period guarantee is to mean anything at all, any access
|
||||
that happens before a given invocation of ``call_rcu()`` must also
|
||||
happen before the corresponding grace period. The implementation of this
|
||||
portion of RCU's grace period guarantee is shown in the following
|
||||
figure:
|
||||
|
||||
.. kernel-figure:: TreeRCU-callback-registry.svg
|
||||
|
||||
Because ``call_rcu()`` normally acts only on CPU-local state, it
|
||||
provides no ordering guarantees, either for itself or for phase one of
|
||||
the update (which again will usually be removal of an element from an
|
||||
RCU-protected data structure). It simply enqueues the ``rcu_head``
|
||||
structure on a per-CPU list, which cannot become associated with a grace
|
||||
period until a later call to ``rcu_accelerate_cbs()``, as shown in the
|
||||
diagram above.
|
||||
|
||||
One set of code paths shown on the left invokes ``rcu_accelerate_cbs()``
|
||||
via ``note_gp_changes()``, either directly from ``call_rcu()`` (if the
|
||||
current CPU is inundated with queued ``rcu_head`` structures) or more
|
||||
likely from an ``RCU_SOFTIRQ`` handler. Another code path in the middle
|
||||
is taken only in kernels built with ``CONFIG_RCU_FAST_NO_HZ=y``, which
|
||||
invokes ``rcu_accelerate_cbs()`` via ``rcu_prepare_for_idle()``. The
|
||||
final code path on the right is taken only in kernels built with
|
||||
``CONFIG_HOTPLUG_CPU=y``, which invokes ``rcu_accelerate_cbs()`` via
|
||||
``rcu_advance_cbs()``, ``rcu_migrate_callbacks()``,
|
||||
``rcutree_migrate_callbacks()``, and ``takedown_cpu()``, which in turn
|
||||
is invoked on a surviving CPU after the outgoing CPU has been completely
|
||||
offlined.
|
||||
|
||||
There are a few other code paths within grace-period processing that
|
||||
opportunistically invoke ``rcu_accelerate_cbs()``. However, either way,
|
||||
all of the CPU's recently queued ``rcu_head`` structures are associated
|
||||
with a future grace-period number under the protection of the CPU's leaf
|
||||
``rcu_node`` structure's ``->lock``. In all cases, there is full
|
||||
ordering against any prior critical section for that same ``rcu_node``
|
||||
structure's ``->lock``, and also full ordering against any of the
|
||||
current task's or CPU's prior critical sections for any ``rcu_node``
|
||||
structure's ``->lock``.
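
The split between enqueueing a callback and associating it with a grace
period can be sketched as follows. This is a deliberately simplified toy:
the types, the ``toy_`` functions, and the plain "started plus one"
numbering are assumptions made for illustration, and the kernel's real
segmented callback list and ``->gp_seq`` encoding are more involved. The
comment marks where the leaf ``rcu_node`` structure's ``->lock`` would be
held.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_UNASSIGNED 0UL	/* no grace-period number assigned yet */

struct toy_cb {
	struct toy_cb *next;
	unsigned long gp_needed;	/* grace period this callback waits for */
};

struct toy_gp_state {
	unsigned long started;		/* most recently started grace period */
	unsigned long completed;	/* most recently completed grace period */
};

struct toy_cpu {
	struct toy_cb *head;
	struct toy_cb **tail;
};

void toy_cpu_init(struct toy_cpu *cpu)
{
	cpu->head = NULL;
	cpu->tail = &cpu->head;
}

/* Analog of call_rcu(): CPU-local enqueue, no grace-period association. */
void toy_enqueue(struct toy_cpu *cpu, struct toy_cb *cb)
{
	cb->next = NULL;
	cb->gp_needed = TOY_UNASSIGNED;
	*cpu->tail = cb;
	cpu->tail = &cb->next;
}

/*
 * Analog of rcu_accelerate_cbs(): in the kernel this runs with the leaf
 * rcu_node ->lock held. A callback queued while grace period N is in
 * progress (or after it ended) cannot rely on N, so it waits for N + 1.
 * Returns the number of callbacks newly assigned a number.
 */
int toy_accelerate(struct toy_cpu *cpu, const struct toy_gp_state *gp)
{
	struct toy_cb *cb;
	int n = 0;

	for (cb = cpu->head; cb; cb = cb->next) {
		if (cb->gp_needed == TOY_UNASSIGNED) {
			cb->gp_needed = gp->started + 1;
			n++;
		}
	}
	return n;
}

/* A callback may be invoked once its grace period has completed. */
int toy_cb_ready(const struct toy_cb *cb, const struct toy_gp_state *gp)
{
	return cb->gp_needed != TOY_UNASSIGNED &&
	       gp->completed >= cb->gp_needed;
}
```

Until ``toy_accelerate()`` runs, a queued callback has no grace-period
number at all, which mirrors the point above that ``call_rcu()`` by itself
provides no ordering.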
|
||||
|
||||
The next section will show how this ordering ensures that any accesses
|
||||
prior to the ``call_rcu()`` (particularly including phase one of the
|
||||
update) happen before the start of the corresponding grace period.
|
||||
|
||||
+-----------------------------------------------------------------------+
|
||||
| **Quick Quiz**: |
|
||||
+-----------------------------------------------------------------------+
|
||||
| But what about ``synchronize_rcu()``? |
|
||||
+-----------------------------------------------------------------------+
|
||||
| **Answer**: |
|
||||
+-----------------------------------------------------------------------+
|
||||
| The ``synchronize_rcu()`` passes ``call_rcu()`` to ``wait_rcu_gp()``, |
|
||||
| which invokes it. So either way, it eventually comes down to |
|
||||
| ``call_rcu()``. |
|
||||
+-----------------------------------------------------------------------+
|
||||
|
||||
Grace-Period Initialization
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
Grace-period initialization is carried out by the grace-period kernel
|
||||
thread, which makes several passes over the ``rcu_node`` tree within the
|
||||
``rcu_gp_init()`` function. This means that showing the full flow of
|
||||
ordering through the grace-period computation will require duplicating
|
||||
this tree. If you find this confusing, please note that the state of the
|
||||
``rcu_node`` changes over time, just like Heraclitus's river. However,
|
||||
to keep the ``rcu_node`` river tractable, the grace-period kernel
|
||||
thread's traversals are presented in multiple parts, starting in this
|
||||
section with the various phases of grace-period initialization.
|
||||
|
||||
The first ordering-related grace-period initialization action is to
|
||||
advance the ``rcu_state`` structure's ``->gp_seq`` grace-period-number
|
||||
counter, as shown below:
|
||||
|
||||
.. kernel-figure:: TreeRCU-gp-init-1.svg
|
||||
|
||||
The actual increment is carried out using ``smp_store_release()``, which
|
||||
helps reject false-positive RCU CPU stall detection. Note that only the
|
||||
root ``rcu_node`` structure is touched.
|
||||
|
||||
The first pass through the ``rcu_node`` tree updates bitmasks based on
|
||||
CPUs having come online or gone offline since the start of the previous
|
||||
grace period. In the common case where the number of online CPUs for
|
||||
this ``rcu_node`` structure has not transitioned to or from zero, this
|
||||
pass will scan only the leaf ``rcu_node`` structures. However, if the
|
||||
number of online CPUs for a given leaf ``rcu_node`` structure has
|
||||
transitioned from zero, ``rcu_init_new_rnp()`` will be invoked for the
|
||||
first incoming CPU. Similarly, if the number of online CPUs for a given
|
||||
leaf ``rcu_node`` structure has transitioned to zero,
|
||||
``rcu_cleanup_dead_rnp()`` will be invoked for the last outgoing CPU.
|
||||
The diagram below shows the path of ordering if the leftmost
|
||||
``rcu_node`` structure onlines its first CPU and if the next
|
||||
``rcu_node`` structure has no online CPUs (or, alternatively, if the
|
||||
leftmost ``rcu_node`` structure offlines its last CPU and if the next
|
||||
``rcu_node`` structure has no online CPUs).
|
||||
|
||||
.. kernel-figure:: TreeRCU-gp-init-2.svg
|
||||
|
||||
The final ``rcu_gp_init()`` pass through the ``rcu_node`` tree traverses
|
||||
breadth-first, setting each ``rcu_node`` structure's ``->gp_seq`` field
|
||||
to the newly advanced value from the ``rcu_state`` structure, as shown
|
||||
in the following diagram.
|
||||
|
||||
.. kernel-figure:: TreeRCU-gp-init-3.svg
|
||||
|
||||
This change will also cause each CPU's next call to
|
||||
``__note_gp_changes()`` to notice that a new grace period has started,
|
||||
as described in the next section. But because the grace-period kthread
|
||||
started the grace period at the root (with the advancing of the
|
||||
``rcu_state`` structure's ``->gp_seq`` field) before setting each leaf
|
||||
``rcu_node`` structure's ``->gp_seq`` field, each CPU's observation of
|
||||
the start of the grace period will happen after the actual start of the
|
||||
grace period.
|
||||
|
||||
+-----------------------------------------------------------------------+
| **Quick Quiz**:                                                       |
+-----------------------------------------------------------------------+
| But what about the CPU that started the grace period? Why wouldn't it |
| see the start of the grace period right when it started that grace    |
| period?                                                               |
+-----------------------------------------------------------------------+
| **Answer**:                                                           |
+-----------------------------------------------------------------------+
| In some deep philosophical and overly anthropomorphized sense, yes,   |
| the CPU starting the grace period is immediately aware of having done |
| so. However, if we instead assume that RCU is not self-aware, then    |
| even the CPU starting the grace period does not really become aware   |
| of the start of this grace period until its first call to             |
| ``__note_gp_changes()``. On the other hand, this CPU potentially gets |
| early notification because it invokes ``__note_gp_changes()`` during  |
| its last ``rcu_gp_init()`` pass through its leaf ``rcu_node``         |
| structure.                                                            |
+-----------------------------------------------------------------------+
|
||||
|
||||
Self-Reported Quiescent States
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
When all entities that might block the grace period have reported
|
||||
quiescent states (or as described in a later section, had quiescent
|
||||
states reported on their behalf), the grace period can end. Online
|
||||
non-idle CPUs report their own quiescent states, as shown in the
|
||||
following diagram:
|
||||
|
||||
.. kernel-figure:: TreeRCU-qs.svg
|
||||
|
||||
This is for the last CPU to report a quiescent state, which signals the
|
||||
end of the grace period. Earlier quiescent states would push up the
|
||||
``rcu_node`` tree only until they encountered an ``rcu_node`` structure
|
||||
that is waiting for additional quiescent states. However, ordering is
|
||||
nevertheless preserved because some later quiescent state will acquire
|
||||
that ``rcu_node`` structure's ``->lock``.
|
||||
|
||||
Any number of events can lead up to a CPU invoking ``note_gp_changes()``
|
||||
(or alternatively, directly invoking ``__note_gp_changes()``), at which
|
||||
point that CPU will notice the start of a new grace period while holding
|
||||
its leaf ``rcu_node`` lock. Therefore, all execution shown in this
|
||||
diagram happens after the start of the grace period. In addition, this
|
||||
CPU will consider any RCU read-side critical section that started before
|
||||
the invocation of ``__note_gp_changes()`` to have started before the
|
||||
grace period, and thus a critical section that the grace period must
|
||||
wait on.
|
||||
|
||||
+-----------------------------------------------------------------------+
|
||||
| **Quick Quiz**: |
|
||||
+-----------------------------------------------------------------------+
|
||||
| But an RCU read-side critical section might have started after the |
|
||||
| beginning of the grace period (the advancing of ``->gp_seq`` from |
|
||||
| earlier), so why should the grace period wait on such a critical |
|
||||
| section? |
|
||||
+-----------------------------------------------------------------------+
|
||||
| **Answer**: |
|
||||
+-----------------------------------------------------------------------+
|
||||
| It is indeed not necessary for the grace period to wait on such a |
|
||||
| critical section. However, it is permissible to wait on it. And it is |
|
||||
| furthermore important to wait on it, as this lazy approach is far |
|
||||
| more scalable than a “big bang” all-at-once grace-period start could |
|
||||
| possibly be. |
|
||||
+-----------------------------------------------------------------------+
|
||||
|
||||
If the CPU does a context switch, a quiescent state will be noted by
|
||||
``rcu_note_context_switch()`` on the left. On the other hand, if the CPU
|
||||
takes a scheduler-clock interrupt while executing in usermode, a
|
||||
quiescent state will be noted by ``rcu_sched_clock_irq()`` on the right.
|
||||
Either way, the passage through a quiescent state will be noted in a
|
||||
per-CPU variable.
|
||||
|
||||
The next time an ``RCU_SOFTIRQ`` handler executes on this CPU (for
|
||||
example, after the next scheduler-clock interrupt), ``rcu_core()`` will
|
||||
invoke ``rcu_check_quiescent_state()``, which will notice the recorded
|
||||
quiescent state, and invoke ``rcu_report_qs_rdp()``. If
|
||||
``rcu_report_qs_rdp()`` verifies that the quiescent state really does
|
||||
apply to the current grace period, it invokes ``rcu_report_qs_rnp()``, which
|
||||
traverses up the ``rcu_node`` tree as shown at the bottom of the
|
||||
diagram, clearing bits from each ``rcu_node`` structure's ``->qsmask``
|
||||
field, and propagating up the tree when the result is zero.
|
||||
|
||||
Note that traversal passes upwards out of a given ``rcu_node`` structure
|
||||
only if the current CPU is reporting the last quiescent state for the
|
||||
subtree headed by that ``rcu_node`` structure. A key point is that if a
|
||||
CPU's traversal stops at a given ``rcu_node`` structure, then there will
|
||||
be a later traversal by another CPU (or perhaps the same one) that
|
||||
proceeds upwards from that point, and the ``rcu_node`` ``->lock``
|
||||
guarantees that the first CPU's quiescent state happens before the
|
||||
remainder of the second CPU's traversal. Applying this line of thought
|
||||
repeatedly shows that all CPUs' quiescent states happen before the last
|
||||
CPU traverses through the root ``rcu_node`` structure, the “last CPU”
|
||||
being the one that clears the last bit in the root ``rcu_node``
|
||||
structure's ``->qsmask`` field.
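
The upward traversal described above can be sketched as a small loop.
This is a single-threaded toy under stated assumptions: ``struct toy_rnp``
and ``toy_report_qs()`` are hypothetical names, only the ``->qsmask`` field
comes from the text, and the per-node ``->lock`` acquisitions that supply
the actual ordering are indicated only by a comment.

```c
#include <assert.h>
#include <stddef.h>

/* Toy rcu_node: just enough state to show quiescent-state propagation. */
struct toy_rnp {
	struct toy_rnp *parent;	/* NULL for the root */
	unsigned long grpmask;	/* this node's bit in parent->qsmask */
	unsigned long qsmask;	/* CPUs or children still owing a report */
};

/*
 * Report the quiescent state(s) in mask at node rnp, moving up the tree
 * only while the node just updated has nothing left outstanding.
 * Returns 1 if this report cleared the last bit in the root, else 0.
 */
int toy_report_qs(struct toy_rnp *rnp, unsigned long mask)
{
	for (;;) {
		/* The kernel holds this node's ->lock at this point. */
		if (!(rnp->qsmask & mask))
			return 0;	/* already reported earlier */
		rnp->qsmask &= ~mask;
		if (rnp->qsmask != 0)
			return 0;	/* others outstanding: stop here */
		if (rnp->parent == NULL)
			return 1;	/* root is clear: last reporter */
		mask = rnp->grpmask;
		rnp = rnp->parent;
	}
}
```

With two leaves of two CPUs each, the first three reports stop at a leaf
or at the root with bits still set, and only the fourth returns 1. That
fourth caller corresponds to the "last CPU" in the text.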
|
||||
|
||||
Dynamic Tick Interface
|
||||
^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
Due to energy-efficiency considerations, RCU is forbidden from
|
||||
disturbing idle CPUs. CPUs are therefore required to notify RCU when
|
||||
entering or leaving idle state, which they do via fully ordered
|
||||
value-returning atomic operations on a per-CPU variable. The ordering
|
||||
effects are as shown below:
|
||||
|
||||
.. kernel-figure:: TreeRCU-dyntick.svg
|
||||
|
||||
The RCU grace-period kernel thread samples the per-CPU idleness variable
|
||||
while holding the corresponding CPU's leaf ``rcu_node`` structure's
|
||||
``->lock``. This means that any RCU read-side critical sections that
|
||||
precede the idle period (the oval near the top of the diagram above)
|
||||
will happen before the end of the current grace period. Similarly, the
|
||||
beginning of the current grace period will happen before any RCU
|
||||
read-side critical sections that follow the idle period (the oval near
|
||||
the bottom of the diagram above).
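
The idle-entry and idle-exit notifications, and the grace-period kernel
thread's sampling of them, can be sketched with C11 atomics. Everything
here is an assumption made for illustration: the ``toy_`` names, the
convention that the counter is odd while the CPU is non-idle, and the use
of sequentially consistent ``atomic_fetch_add()`` as a rough userspace
analog of the kernel's fully ordered value-returning atomics.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical per-CPU counter: odd while non-idle, even while idle. */
struct toy_dynticks {
	atomic_ulong ctr;
};

void toy_dynticks_init(struct toy_dynticks *d)
{
	atomic_init(&d->ctr, 1UL);	/* CPUs start out non-idle */
}

/* Idle entry: fully ordered increment from odd to even. */
void toy_eqs_enter(struct toy_dynticks *d)
{
	unsigned long old = atomic_fetch_add(&d->ctr, 1UL);

	assert(old & 1UL);	/* must have been non-idle */
}

/* Idle exit: fully ordered increment from even to odd. */
void toy_eqs_exit(struct toy_dynticks *d)
{
	unsigned long old = atomic_fetch_add(&d->ctr, 1UL);

	assert(!(old & 1UL));	/* must have been idle */
}

/* Grace-period thread: fully ordered sample, here an add of zero. */
unsigned long toy_dynticks_snap(struct toy_dynticks *d)
{
	return atomic_fetch_add(&d->ctr, 0UL);
}

/* Was the CPU idle when the snapshot was taken? */
int toy_in_eqs(unsigned long snap)
{
	return !(snap & 1UL);
}

/* Has the CPU passed through idle entry or exit since the snapshot? */
int toy_in_eqs_since(struct toy_dynticks *d, unsigned long snap)
{
	return snap != toy_dynticks_snap(d);
}
```

If either ``toy_in_eqs()`` or ``toy_in_eqs_since()`` is true for a
snapshot taken after the grace period started, the CPU has been in an
idle period since then, so a quiescent state can be reported on its
behalf without disturbing it.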
|
||||
|
||||
Plumbing this into the full grace-period execution is described
|
||||
`below <Forcing Quiescent States_>`__.
|
||||
|
||||
CPU-Hotplug Interface
|
||||
^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
RCU is also forbidden from disturbing offline CPUs, which might well be
|
||||
powered off and removed from the system completely. CPUs are therefore
|
||||
required to notify RCU of their comings and goings as part of the
|
||||
corresponding CPU hotplug operations. The ordering effects are shown
|
||||
below:
|
||||
|
||||
.. kernel-figure:: TreeRCU-hotplug.svg
|
||||
|
||||
Because CPU hotplug operations are much less frequent than idle
|
||||
transitions, they are heavier weight, and thus acquire the CPU's leaf
|
||||
``rcu_node`` structure's ``->lock`` and update this structure's
|
||||
``->qsmaskinitnext``. The RCU grace-period kernel thread samples this
|
||||
mask to detect CPUs having gone offline since the beginning of this
|
||||
grace period.
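
The interplay between hotplug updates and grace-period sampling can be
sketched as bitmask bookkeeping on one leaf. This is a toy under stated
assumptions: only ``->qsmaskinitnext`` is named in the text, while the
``qsmaskinit`` snapshot field, the ``toy_`` function names, and the absence
of any locking are simplifications made for illustration.

```c
#include <assert.h>
#include <limits.h>

/* Toy leaf rcu_node: one bit per CPU. */
struct toy_leaf {
	unsigned long qsmaskinitnext;	/* CPUs online right now */
	unsigned long qsmaskinit;	/* CPUs online when this GP started */
	unsigned long qsmask;		/* CPUs this GP still waits on */
};

static unsigned long toy_cpu_bit(int cpu)
{
	assert(cpu >= 0 && cpu < (int)(sizeof(unsigned long) * CHAR_BIT));
	return 1UL << cpu;
}

/* Hotplug paths: the kernel updates the mask under the leaf ->lock. */
void toy_cpu_online(struct toy_leaf *rnp, int cpu)
{
	rnp->qsmaskinitnext |= toy_cpu_bit(cpu);
}

void toy_cpu_offline(struct toy_leaf *rnp, int cpu)
{
	rnp->qsmaskinitnext &= ~toy_cpu_bit(cpu);
}

/* Grace-period start: snapshot the online CPUs and wait on all of them. */
void toy_gp_init(struct toy_leaf *rnp)
{
	rnp->qsmaskinit = rnp->qsmaskinitnext;
	rnp->qsmask = rnp->qsmaskinit;
}

/* CPUs the current GP waits on that have gone offline since it began. */
unsigned long toy_offline_since_gp_start(const struct toy_leaf *rnp)
{
	return rnp->qsmask & ~rnp->qsmaskinitnext;
}

/* Report quiescent states on behalf of those CPUs; returns their mask. */
unsigned long toy_report_offline_cpus(struct toy_leaf *rnp)
{
	unsigned long mask = toy_offline_since_gp_start(rnp);

	rnp->qsmask &= ~mask;
	return mask;
}
```

A CPU that goes offline in the middle of a grace period stays in
``qsmask`` until something reports on its behalf, and it simply drops out
of the snapshot taken for the next grace period.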
|
||||
|
||||
Plumbing this into the full grace-period execution is described
|
||||
`below <Forcing Quiescent States_>`__.
|
||||
|
||||
Forcing Quiescent States
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
As noted above, idle and offline CPUs cannot report their own quiescent
|
||||
states, and therefore the grace-period kernel thread must do the
|
||||
reporting on their behalf. This process is called “forcing quiescent
|
||||
states”; it is repeated every few jiffies, and its ordering effects are
|
||||
shown below:
|
||||
|
||||
.. kernel-figure:: TreeRCU-gp-fqs.svg
|
||||
|
||||
Each pass of quiescent state forcing is guaranteed to traverse the leaf
|
||||
``rcu_node`` structures, and if there are no new quiescent states due to
|
||||
recently idled and/or offlined CPUs, then only the leaves are traversed.
|
||||
However, if there is a newly offlined CPU as illustrated on the left or
|
||||
a newly idled CPU as illustrated on the right, the corresponding
|
||||
quiescent state will be driven up towards the root. As with
|
||||
self-reported quiescent states, the upwards driving stops once it
|
||||
reaches an ``rcu_node`` structure that has quiescent states outstanding
|
||||
from other CPUs.
|
||||
|
||||
+-----------------------------------------------------------------------+
|
||||
| **Quick Quiz**: |
|
||||
+-----------------------------------------------------------------------+
|
||||
| The leftmost drive to root stopped before it reached the root |
|
||||
| ``rcu_node`` structure, which means that there are still CPUs |
|
||||
| subordinate to that structure on which the current grace period is |
|
||||
| waiting. Given that, how is it possible that the rightmost drive to |
|
||||
| root ended the grace period? |
|
||||
+-----------------------------------------------------------------------+
|
||||
| **Answer**: |
|
||||
+-----------------------------------------------------------------------+
|
||||
| Good analysis! It is in fact impossible in the absence of bugs in |
|
||||
| RCU. But this diagram is complex enough as it is, so simplicity |
|
||||
| overrode accuracy. You can think of it as poetic license, or you can |
|
||||
| think of it as misdirection that is resolved in the |
|
||||
| `stitched-together diagram <Putting It All Together_>`__. |
|
||||
+-----------------------------------------------------------------------+
|
||||
|
||||
Grace-Period Cleanup
|
||||
^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
Grace-period cleanup first scans the ``rcu_node`` tree breadth-first
|
||||
advancing all the ``->gp_seq`` fields, then it advances the
|
||||
``rcu_state`` structure's ``->gp_seq`` field. The ordering effects are
|
||||
shown below:
|
||||
|
||||
.. kernel-figure:: TreeRCU-gp-cleanup.svg
|
||||
|
||||
As indicated by the oval at the bottom of the diagram, once grace-period
|
||||
cleanup is complete, the next grace period can begin.
|
||||
|
||||
+-----------------------------------------------------------------------+
| **Quick Quiz**:                                                       |
+-----------------------------------------------------------------------+
| But when precisely does the grace period end?                         |
+-----------------------------------------------------------------------+
| **Answer**:                                                           |
+-----------------------------------------------------------------------+
| There is no useful single point at which the grace period can be said |
| to end. The earliest reasonable candidate is as soon as the last CPU  |
| has reported its quiescent state, but it may be some milliseconds     |
| before RCU becomes aware of this. The latest reasonable candidate is  |
| once the ``rcu_state`` structure's ``->gp_seq`` field has been        |
| updated, but it is quite possible that some CPUs have already         |
| completed phase two of their updates by that time. In short, if you   |
| are going to work with RCU, you need to learn to embrace uncertainty. |
+-----------------------------------------------------------------------+

Callback Invocation
^^^^^^^^^^^^^^^^^^^

Once a given CPU's leaf ``rcu_node`` structure's ``->gp_seq`` field has
been updated, that CPU can begin invoking its RCU callbacks that were
waiting for this grace period to end. These callbacks are identified by
``rcu_advance_cbs()``, which is usually invoked by
``__note_gp_changes()``. As shown in the diagram below, this invocation
can be triggered by the scheduling-clock interrupt
(``rcu_sched_clock_irq()`` on the left) or by idle entry
(``rcu_cleanup_after_idle()`` on the right, but only for kernels built
with ``CONFIG_RCU_FAST_NO_HZ=y``). Either way, ``RCU_SOFTIRQ`` is
raised, which results in ``rcu_do_batch()`` invoking the callbacks,
which in turn allows those callbacks to carry out (either directly or
indirectly via wakeup) the needed phase-two processing for each update.

.. kernel-figure:: TreeRCU-callback-invocation.svg

Please note that callback invocation can also be prompted by any number
of corner-case code paths, for example, when a CPU notes that it has
excessive numbers of callbacks queued. In all cases, the CPU acquires
its leaf ``rcu_node`` structure's ``->lock`` before invoking callbacks,
which preserves the required ordering against the newly completed grace
period.

However, if the callback function communicates to other CPUs, for
example, doing a wakeup, then it is that function's responsibility to
maintain ordering. For example, if the callback function wakes up a task
that runs on some other CPU, proper ordering must be in place in both the
callback function and the task being awakened. To see why this is
important, consider the top half of the `grace-period
cleanup <#Grace-Period%20Cleanup>`__ diagram. The callback might be
running on a CPU corresponding to the leftmost leaf ``rcu_node``
structure, and awaken a task that is to run on a CPU corresponding to
the rightmost leaf ``rcu_node`` structure, and the grace-period kernel
thread might not yet have reached the rightmost leaf. In this case, the
grace period's memory ordering might not yet have reached that CPU, so
again the callback function and the awakened task must supply proper
ordering.

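One way a callback and the task it wakes might supply their own ordering is a release/acquire pairing. The following is a userspace sketch using C11 atomics. The `model_` names are hypothetical, and kernel code would instead rely on primitives such as `smp_store_release()` and `smp_load_acquire()`, or on the ordering provided by the wakeup and wait primitives it uses.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical state shared by an RCU callback and the task it wakes. */
struct model_wait {
	int result;		/* payload written by the callback */
	atomic_bool done;	/* flag checked by the awakened task */
};

/* Callback side: write the payload, then set the flag with release semantics. */
static void model_callback(struct model_wait *w, int value)
{
	assert(w != NULL);
	w->result = value;
	atomic_store_explicit(&w->done, true, memory_order_release);
}

/* Awakened-task side: acquire-load the flag before reading the payload. */
static bool model_try_consume(struct model_wait *w, int *out)
{
	assert(w != NULL && out != NULL);
	if (!atomic_load_explicit(&w->done, memory_order_acquire))
		return false;
	*out = w->result;
	return true;
}
```

The acquire load pairs with the release store, so a task that observes `done` also observes `result`, without depending on the grace-period kthread having reached that task's leaf `rcu_node` structure.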
Putting It All Together
~~~~~~~~~~~~~~~~~~~~~~~

A stitched-together diagram is here:

.. kernel-figure:: TreeRCU-gp.svg

Legal Statement
~~~~~~~~~~~~~~~

This work represents the view of the author and does not necessarily
represent the view of IBM.

Linux is a registered trademark of Linus Torvalds.

Other company, product, and service names may be trademarks or service
marks of others.
@ -3880,7 +3880,7 @@
|
||||
font-style="normal"
|
||||
y="-4418.6582"
|
||||
x="3745.7725"
|
||||
xml:space="preserve">rcu_node_context_switch()</text>
|
||||
xml:space="preserve">rcu_note_context_switch()</text>
|
||||
</g>
|
||||
<g
|
||||
transform="translate(1881.1886,54048.57)"
|
||||
|
Before Width: | Height: | Size: 209 KiB After Width: | Height: | Size: 209 KiB |
@ -753,7 +753,7 @@
|
||||
font-style="normal"
|
||||
y="-4418.6582"
|
||||
x="3745.7725"
|
||||
xml:space="preserve">rcu_node_context_switch()</text>
|
||||
xml:space="preserve">rcu_note_context_switch()</text>
|
||||
</g>
|
||||
<g
|
||||
transform="translate(3131.2648,-585.6713)"
|
||||
|
Before Width: | Height: | Size: 43 KiB After Width: | Height: | Size: 43 KiB |
File diff suppressed because it is too large
Load Diff
2704
Documentation/RCU/Design/Requirements/Requirements.rst
Normal file
File diff suppressed because it is too large
Load Diff
@ -5,12 +5,17 @@ RCU concepts
|
||||
============
|
||||
|
||||
.. toctree::
|
||||
:maxdepth: 1
|
||||
:maxdepth: 3
|
||||
|
||||
rcu
|
||||
listRCU
|
||||
UP
|
||||
|
||||
Design/Memory-Ordering/Tree-RCU-Memory-Ordering
|
||||
Design/Expedited-Grace-Periods/Expedited-Grace-Periods
|
||||
Design/Requirements/Requirements
|
||||
Design/Data-Structures/Data-Structures
|
||||
|
||||
.. only:: subproject and html
|
||||
|
||||
Indices
|
||||
|
@ -96,7 +96,17 @@ other flavors of rcu_dereference(). On the other hand, it is illegal
 to use rcu_dereference_protected() if either the RCU-protected pointer
 or the RCU-protected data that it points to can change concurrently.
 
-There are currently only "universal" versions of the rcu_assign_pointer()
-and RCU list-/tree-traversal primitives, which do not (yet) check for
-being in an RCU read-side critical section. In the future, separate
-versions of these primitives might be created.
+Like rcu_dereference(), when lockdep is enabled, RCU list and hlist
+traversal primitives check for being called from within an RCU read-side
+critical section. However, a lockdep expression can be passed to them
+as an additional optional argument. With this lockdep expression, these
+traversal primitives will complain only if the lockdep expression is
+false and they are called from outside any RCU read-side critical section.
+
+For example, the workqueue for_each_pwq() macro is intended to be used
+either within an RCU read-side critical section or with wq->mutex held.
+It is thus implemented as follows:
+
+	#define for_each_pwq(pwq, wq)
+		list_for_each_entry_rcu((pwq), &(wq)->pwqs, pwqs_node,
+					lock_is_held(&(wq->mutex).dep_map))
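The rule stated in the added text reduces to a small truth table. The sketch below is only a model of that rule. The function name is hypothetical and it does not reproduce the kernel's actual lockdep machinery.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * A traversal is acceptable if the caller is inside an RCU read-side
 * critical section OR the optional lockdep expression is true.
 * A complaint is warranted only when both are false.
 */
static bool model_traversal_should_warn(bool in_rcu_read_section,
					bool lockdep_expr)
{
	return !lockdep_expr && !in_rcu_read_section;
}
```

For `for_each_pwq()`, the lockdep expression corresponds to "wq->mutex is held", so an updater holding the mutex can traverse the list without entering an RCU read-side critical section.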
@ -290,7 +290,7 @@ rcu_dereference()
 at any time, including immediately after the rcu_dereference().
 And, again like rcu_assign_pointer(), rcu_dereference() is
 typically used indirectly, via the _rcu list-manipulation
-primitives, such as list_for_each_entry_rcu().
+primitives, such as list_for_each_entry_rcu() [2].
 
 [1] The variant rcu_dereference_protected() can be used outside
 of an RCU read-side critical section as long as the usage is
@ -302,9 +302,17 @@ rcu_dereference()
 must prohibit. The rcu_dereference_protected() variant takes
 a lockdep expression to indicate which locks must be acquired
 by the caller. If the indicated protection is not provided,
-a lockdep splat is emitted. See RCU/Design/Requirements/Requirements.html
+a lockdep splat is emitted. See Documentation/RCU/Design/Requirements/Requirements.rst
 and the API's code comments for more details and example usage.
 
+[2] If the list_for_each_entry_rcu() instance might be used by
+update-side code as well as by RCU readers, then an additional
+lockdep expression can be added to its list of arguments.
+For example, given an additional "lock_is_held(&mylock)" argument,
+the RCU lockdep code would complain only if this instance was
+invoked outside of an RCU read-side critical section and without
+the protection of mylock.
+
 The following diagram shows how each API communicates among the
 reader, updater, and reclaimer.
 
@ -630,7 +638,7 @@ been able to write-acquire the lock otherwise. The smp_mb__after_spinlock()
 promotes synchronize_rcu() to a full memory barrier in compliance with
 the "Memory-Barrier Guarantees" listed in:
 
-Documentation/RCU/Design/Requirements/Requirements.html.
+Documentation/RCU/Design/Requirements/Requirements.rst
 
 It is possible to nest rcu_read_lock(), since reader-writer locks may
 be recursively acquired. Note also that rcu_read_lock() is immune
@ -508,8 +508,8 @@ int kvm_vm_ioctl_set_pmu_event_filter(struct kvm *kvm, void __user *argp)
 	*filter = tmp;
 
 	mutex_lock(&kvm->lock);
-	rcu_swap_protected(kvm->arch.pmu_event_filter, filter,
-			   mutex_is_locked(&kvm->lock));
+	filter = rcu_replace_pointer(kvm->arch.pmu_event_filter, filter,
+				     mutex_is_locked(&kvm->lock));
 	mutex_unlock(&kvm->lock);
 
 	synchronize_srcu_expedited(&kvm->srcu);
@ -1634,7 +1634,7 @@ replace:
|
||||
i915_gem_context_set_user_engines(ctx);
|
||||
else
|
||||
i915_gem_context_clear_user_engines(ctx);
|
||||
rcu_swap_protected(ctx->engines, set.engines, 1);
|
||||
set.engines = rcu_replace_pointer(ctx->engines, set.engines, 1);
|
||||
mutex_unlock(&ctx->engines_mutex);
|
||||
|
||||
call_rcu(&set.engines->rcu, free_engines_rcu);
|
||||
|
@ -434,8 +434,8 @@ static void scsi_update_vpd_page(struct scsi_device *sdev, u8 page,
|
||||
return;
|
||||
|
||||
mutex_lock(&sdev->inquiry_mutex);
|
||||
rcu_swap_protected(*sdev_vpd_buf, vpd_buf,
|
||||
lockdep_is_held(&sdev->inquiry_mutex));
|
||||
vpd_buf = rcu_replace_pointer(*sdev_vpd_buf, vpd_buf,
|
||||
lockdep_is_held(&sdev->inquiry_mutex));
|
||||
mutex_unlock(&sdev->inquiry_mutex);
|
||||
|
||||
if (vpd_buf)
|
||||
|
@ -466,10 +466,10 @@ static void scsi_device_dev_release_usercontext(struct work_struct *work)
|
||||
sdev->request_queue = NULL;
|
||||
|
||||
mutex_lock(&sdev->inquiry_mutex);
|
||||
rcu_swap_protected(sdev->vpd_pg80, vpd_pg80,
|
||||
lockdep_is_held(&sdev->inquiry_mutex));
|
||||
rcu_swap_protected(sdev->vpd_pg83, vpd_pg83,
|
||||
lockdep_is_held(&sdev->inquiry_mutex));
|
||||
vpd_pg80 = rcu_replace_pointer(sdev->vpd_pg80, vpd_pg80,
|
||||
lockdep_is_held(&sdev->inquiry_mutex));
|
||||
vpd_pg83 = rcu_replace_pointer(sdev->vpd_pg83, vpd_pg83,
|
||||
lockdep_is_held(&sdev->inquiry_mutex));
|
||||
mutex_unlock(&sdev->inquiry_mutex);
|
||||
|
||||
if (vpd_pg83)
|
||||
|
@ -279,8 +279,8 @@ struct afs_vlserver_list *afs_extract_vlserver_list(struct afs_cell *cell,
|
||||
struct afs_addr_list *old = addrs;
|
||||
|
||||
write_lock(&server->lock);
|
||||
rcu_swap_protected(server->addresses, old,
|
||||
lockdep_is_held(&server->lock));
|
||||
old = rcu_replace_pointer(server->addresses, old,
|
||||
lockdep_is_held(&server->lock));
|
||||
write_unlock(&server->lock);
|
||||
afs_put_addrlist(old);
|
||||
}
|
||||
|
@ -24,34 +24,6 @@ static inline struct hlist_bl_node *hlist_bl_first_rcu(struct hlist_bl_head *h)
|
||||
((unsigned long)rcu_dereference_check(h->first, hlist_bl_is_locked(h)) & ~LIST_BL_LOCKMASK);
|
||||
}
|
||||
|
||||
/**
|
||||
* hlist_bl_del_init_rcu - deletes entry from hash list with re-initialization
|
||||
* @n: the element to delete from the hash list.
|
||||
*
|
||||
* Note: hlist_bl_unhashed() on the node returns true after this. It is
|
||||
* useful for RCU based read lockfree traversal if the writer side
|
||||
* must know if the list entry is still hashed or already unhashed.
|
||||
*
|
||||
* In particular, it means that we can not poison the forward pointers
|
||||
* that may still be used for walking the hash list and we can only
|
||||
* zero the pprev pointer so list_unhashed() will return true after
|
||||
* this.
|
||||
*
|
||||
* The caller must take whatever precautions are necessary (such as
|
||||
* holding appropriate locks) to avoid racing with another
|
||||
* list-mutation primitive, such as hlist_bl_add_head_rcu() or
|
||||
* hlist_bl_del_rcu(), running on this same list. However, it is
|
||||
* perfectly legal to run concurrently with the _rcu list-traversal
|
||||
* primitives, such as hlist_bl_for_each_entry_rcu().
|
||||
*/
|
||||
static inline void hlist_bl_del_init_rcu(struct hlist_bl_node *n)
|
||||
{
|
||||
if (!hlist_bl_unhashed(n)) {
|
||||
__hlist_bl_del(n);
|
||||
n->pprev = NULL;
|
||||
}
|
||||
}
|
||||
|
||||
/**
|
||||
* hlist_bl_del_rcu - deletes entry from hash list without re-initialization
|
||||
* @n: the element to delete from the hash list.
|
||||
|
@ -382,6 +382,24 @@ do { \
 	smp_store_release(&p, RCU_INITIALIZER((typeof(p))_r_a_p__v)); \
 } while (0)
 
+/**
+ * rcu_replace_pointer() - replace an RCU pointer, returning its old value
+ * @rcu_ptr: RCU pointer, whose old value is returned
+ * @ptr: regular pointer
+ * @c: the lockdep conditions under which the dereference will take place
+ *
+ * Perform a replacement, where @rcu_ptr is an RCU-annotated
+ * pointer and @c is the lockdep argument that is passed to the
+ * rcu_dereference_protected() call used to read that pointer. The old
+ * value of @rcu_ptr is returned, and @rcu_ptr is set to @ptr.
+ */
+#define rcu_replace_pointer(rcu_ptr, ptr, c)				\
+({									\
+	typeof(ptr) __tmp = rcu_dereference_protected((rcu_ptr), (c));	\
+	rcu_assign_pointer((rcu_ptr), (ptr));				\
+	__tmp;								\
+})
+
 /**
  * rcu_swap_protected() - swap an RCU and a regular pointer
  * @rcu_ptr: RCU pointer
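The behavior documented for `rcu_replace_pointer()` can be modeled in userspace with C11 atomics. This is a sketch under stated assumptions, not the kernel macro: `struct filter`, `active_filter`, and `model_replace_pointer()` are hypothetical names, an `assert()` stands in for the lockdep condition, and a release store stands in for `rcu_assign_pointer()`.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical RCU-protected object. */
struct filter {
	int nevents;
};

/* Userspace stand-in for an __rcu-annotated pointer. */
static _Atomic(struct filter *) active_filter;

/*
 * Read the old value, publish the new one with release semantics,
 * and return the old value to the caller.
 */
static struct filter *model_replace_pointer(_Atomic(struct filter *) *slot,
					    struct filter *newp, int cond)
{
	struct filter *old;

	assert(cond);	/* models the lockdep check: update-side lock held */
	old = atomic_load_explicit(slot, memory_order_relaxed);
	atomic_store_explicit(slot, newp, memory_order_release);
	return old;
}
```

The conversions in this merge follow the same shape: where `rcu_swap_protected()` overwrote its second argument in place, callers now assign the return value explicitly, as in `filter = rcu_replace_pointer(...)`. The old pointer can then be freed once a grace period has elapsed.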
@ -84,6 +84,7 @@ static inline void rcu_scheduler_starting(void) { }
|
||||
#endif /* #else #ifndef CONFIG_SRCU */
|
||||
static inline void rcu_end_inkernel_boot(void) { }
|
||||
static inline bool rcu_is_watching(void) { return true; }
|
||||
static inline void rcu_momentary_dyntick_idle(void) { }
|
||||
|
||||
/* Avoid RCU read-side critical sections leaking across. */
|
||||
static inline void rcu_all_qs(void) { barrier(); }
|
||||
|
@ -37,6 +37,7 @@ void kfree_call_rcu(struct rcu_head *head, rcu_callback_t func);
|
||||
|
||||
void rcu_barrier(void);
|
||||
bool rcu_eqs_special_set(int cpu);
|
||||
void rcu_momentary_dyntick_idle(void);
|
||||
unsigned long get_state_synchronize_rcu(void);
|
||||
void cond_synchronize_rcu(unsigned long oldstate);
|
||||
|
||||
|
@ -108,7 +108,8 @@ enum tick_dep_bits {
 	TICK_DEP_BIT_POSIX_TIMER	= 0,
 	TICK_DEP_BIT_PERF_EVENTS	= 1,
 	TICK_DEP_BIT_SCHED		= 2,
-	TICK_DEP_BIT_CLOCK_UNSTABLE	= 3
+	TICK_DEP_BIT_CLOCK_UNSTABLE	= 3,
+	TICK_DEP_BIT_RCU		= 4
 };
 
 #define TICK_DEP_MASK_NONE		0
@ -116,6 +117,7 @@ enum tick_dep_bits {
 #define TICK_DEP_MASK_PERF_EVENTS	(1 << TICK_DEP_BIT_PERF_EVENTS)
 #define TICK_DEP_MASK_SCHED		(1 << TICK_DEP_BIT_SCHED)
 #define TICK_DEP_MASK_CLOCK_UNSTABLE	(1 << TICK_DEP_BIT_CLOCK_UNSTABLE)
+#define TICK_DEP_MASK_RCU		(1 << TICK_DEP_BIT_RCU)
 
 #ifdef CONFIG_NO_HZ_COMMON
 extern bool tick_nohz_enabled;
@ -268,6 +270,9 @@ static inline bool tick_nohz_full_enabled(void) { return false; }
 static inline bool tick_nohz_full_cpu(int cpu) { return false; }
 static inline void tick_nohz_full_add_cpus_to(struct cpumask *mask) { }
 
+static inline void tick_nohz_dep_set_cpu(int cpu, enum tick_dep_bits bit) { }
+static inline void tick_nohz_dep_clear_cpu(int cpu, enum tick_dep_bits bit) { }
+
 static inline void tick_dep_set(enum tick_dep_bits bit) { }
 static inline void tick_dep_clear(enum tick_dep_bits bit) { }
 static inline void tick_dep_set_cpu(int cpu, enum tick_dep_bits bit) { }
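The relationship between the bit numbers and the masks can be checked with a small userspace sketch. The enum and mask macros mirror the header above. The `model_tick_dep` structure and its helpers are hypothetical, and use a plain word where the kernel presumably needs an atomic one.

```c
#include <assert.h>

/* Mirrors the enum above, including the new RCU bit. */
enum tick_dep_bits {
	TICK_DEP_BIT_POSIX_TIMER	= 0,
	TICK_DEP_BIT_PERF_EVENTS	= 1,
	TICK_DEP_BIT_SCHED		= 2,
	TICK_DEP_BIT_CLOCK_UNSTABLE	= 3,
	TICK_DEP_BIT_RCU		= 4
};

#define TICK_DEP_MASK_NONE	0
#define TICK_DEP_MASK_SCHED	(1 << TICK_DEP_BIT_SCHED)
#define TICK_DEP_MASK_RCU	(1 << TICK_DEP_BIT_RCU)

/* Hypothetical dependency word for one CPU or task. */
struct model_tick_dep {
	unsigned int mask;
};

static void model_tick_dep_set(struct model_tick_dep *d,
			       enum tick_dep_bits bit)
{
	d->mask |= 1U << bit;
}

static void model_tick_dep_clear(struct model_tick_dep *d,
				 enum tick_dep_bits bit)
{
	d->mask &= ~(1U << bit);
}

/* The tick may stop only when no subsystem holds a dependency. */
static int model_tick_can_stop(const struct model_tick_dep *d)
{
	return d->mask == TICK_DEP_MASK_NONE;
}
```

Because each subsystem owns one bit, RCU can set and clear its dependency without disturbing the others. In the model, the tick stays on until the last remaining bit is cleared.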
@ -93,16 +93,16 @@ TRACE_EVENT_RCU(rcu_grace_period,
|
||||
* the data from the rcu_node structure, other than rcuname, which comes
|
||||
* from the rcu_state structure, and event, which is one of the following:
|
||||
*
|
||||
* "Startleaf": Request a grace period based on leaf-node data.
|
||||
* "Cleanup": Clean up rcu_node structure after previous GP.
|
||||
* "CleanupMore": Clean up, and another GP is needed.
|
||||
* "EndWait": Complete wait.
|
||||
* "NoGPkthread": The RCU grace-period kthread has not yet started.
|
||||
* "Prestarted": Someone beat us to the request
|
||||
* "Startedleaf": Leaf node marked for future GP.
|
||||
* "Startedleafroot": All nodes from leaf to root marked for future GP.
|
||||
* "Startedroot": Requested a nocb grace period based on root-node data.
|
||||
* "NoGPkthread": The RCU grace-period kthread has not yet started.
|
||||
* "Startleaf": Request a grace period based on leaf-node data.
|
||||
* "StartWait": Start waiting for the requested grace period.
|
||||
* "EndWait": Complete wait.
|
||||
* "Cleanup": Clean up rcu_node structure after previous GP.
|
||||
* "CleanupMore": Clean up, and another GP is needed.
|
||||
*/
|
||||
TRACE_EVENT_RCU(rcu_future_grace_period,
|
||||
|
||||
@ -258,20 +258,27 @@ TRACE_EVENT_RCU(rcu_exp_funnel_lock,
|
||||
* the number of the offloaded CPU are extracted. The third and final
|
||||
* argument is a string as follows:
|
||||
*
|
||||
* "WakeEmpty": Wake rcuo kthread, first CB to empty list.
|
||||
* "WakeEmptyIsDeferred": Wake rcuo kthread later, first CB to empty list.
|
||||
* "WakeOvf": Wake rcuo kthread, CB list is huge.
|
||||
* "WakeOvfIsDeferred": Wake rcuo kthread later, CB list is huge.
|
||||
* "WakeNot": Don't wake rcuo kthread.
|
||||
* "WakeNotPoll": Don't wake rcuo kthread because it is polling.
|
||||
* "DeferredWake": Carried out the "IsDeferred" wakeup.
|
||||
* "Poll": Start of new polling cycle for rcu_nocb_poll.
|
||||
* "Sleep": Sleep waiting for GP for !rcu_nocb_poll.
|
||||
* "CBSleep": Sleep waiting for CBs for !rcu_nocb_poll.
|
||||
* "WokeEmpty": rcuo kthread woke to find empty list.
|
||||
* "WokeNonEmpty": rcuo kthread woke to find non-empty list.
|
||||
* "WaitQueue": Enqueue partially done, timed wait for it to complete.
|
||||
* "WokeQueue": Partial enqueue now complete.
|
||||
* "AlreadyAwake": The to-be-awakened rcuo kthread is already awake.
|
||||
* "Bypass": rcuo GP kthread sees non-empty ->nocb_bypass.
|
||||
* "CBSleep": rcuo CB kthread sleeping waiting for CBs.
|
||||
* "Check": rcuo GP kthread checking specified CPU for work.
|
||||
* "DeferredWake": Timer expired or polled check, time to wake.
|
||||
* "DoWake": The to-be-awakened rcuo kthread needs to be awakened.
|
||||
* "EndSleep": Done waiting for GP for !rcu_nocb_poll.
|
||||
* "FirstBQ": New CB to empty ->nocb_bypass (->cblist maybe non-empty).
|
||||
* "FirstBQnoWake": FirstBQ plus rcuo kthread need not be awakened.
|
||||
* "FirstBQwake": FirstBQ plus rcuo kthread must be awakened.
|
||||
* "FirstQ": New CB to empty ->cblist (->nocb_bypass maybe non-empty).
|
||||
* "NeedWaitGP": rcuo GP kthread must wait on a grace period.
|
||||
* "Poll": Start of new polling cycle for rcu_nocb_poll.
|
||||
* "Sleep": Sleep waiting for GP for !rcu_nocb_poll.
|
||||
* "Timer": Deferred-wake timer expired.
|
||||
* "WakeEmptyIsDeferred": Wake rcuo kthread later, first CB to empty list.
|
||||
* "WakeEmpty": Wake rcuo kthread, first CB to empty list.
|
||||
* "WakeNot": Don't wake rcuo kthread.
|
||||
* "WakeNotPoll": Don't wake rcuo kthread because it is polling.
|
||||
* "WakeOvfIsDeferred": Wake rcuo kthread later, CB list is huge.
|
||||
* "WokeEmpty": rcuo CB kthread woke to find empty list.
|
||||
*/
|
||||
TRACE_EVENT_RCU(rcu_nocb_wake,
|
||||
|
||||
@ -713,8 +720,6 @@ TRACE_EVENT_RCU(rcu_torture_read,
|
||||
* "Begin": rcu_barrier() started.
|
||||
* "EarlyExit": rcu_barrier() piggybacked, thus early exit.
|
||||
* "Inc1": rcu_barrier() piggyback check counter incremented.
|
||||
* "OfflineNoCB": rcu_barrier() found callback on never-online CPU
|
||||
* "OnlineNoCB": rcu_barrier() found online no-CBs CPU.
|
||||
* "OnlineQ": rcu_barrier() found online CPU with callbacks.
|
||||
* "OnlineNQ": rcu_barrier() found online CPU, no callbacks.
|
||||
* "IRQ": An rcu_barrier_callback() callback posted on remote CPU.
|
||||
|
@ -367,7 +367,8 @@ TRACE_EVENT(itimer_expire,
|
||||
tick_dep_name(POSIX_TIMER) \
|
||||
tick_dep_name(PERF_EVENTS) \
|
||||
tick_dep_name(SCHED) \
|
||||
tick_dep_name_end(CLOCK_UNSTABLE)
|
||||
tick_dep_name(CLOCK_UNSTABLE) \
|
||||
tick_dep_name_end(RCU)
|
||||
|
||||
#undef tick_dep_name
|
||||
#undef tick_dep_mask_name
|
||||
|
@ -180,8 +180,8 @@ static void activate_effective_progs(struct cgroup *cgrp,
|
||||
enum bpf_attach_type type,
|
||||
struct bpf_prog_array *old_array)
|
||||
{
|
||||
rcu_swap_protected(cgrp->bpf.effective[type], old_array,
|
||||
lockdep_is_held(&cgroup_mutex));
|
||||
old_array = rcu_replace_pointer(cgrp->bpf.effective[type], old_array,
|
||||
lockdep_is_held(&cgroup_mutex));
|
||||
/* free prog array after grace period, since __cgroup_bpf_run_*()
|
||||
* might be still walking the array
|
||||
*/
|
||||
|
@ -16,7 +16,6 @@
 #include <linux/kthread.h>
 #include <linux/sched/rt.h>
 #include <linux/spinlock.h>
-#include <linux/rwlock.h>
 #include <linux/mutex.h>
 #include <linux/rwsem.h>
 #include <linux/smp.h>
@ -889,16 +888,16 @@ static int __init lock_torture_init(void)
 		cxt.nrealwriters_stress = 2 * num_online_cpus();
 
 #ifdef CONFIG_DEBUG_MUTEXES
-	if (strncmp(torture_type, "mutex", 5) == 0)
+	if (str_has_prefix(torture_type, "mutex"))
 		cxt.debug_lock = true;
 #endif
 #ifdef CONFIG_DEBUG_RT_MUTEXES
-	if (strncmp(torture_type, "rtmutex", 7) == 0)
+	if (str_has_prefix(torture_type, "rtmutex"))
 		cxt.debug_lock = true;
 #endif
 #ifdef CONFIG_DEBUG_SPINLOCK
-	if ((strncmp(torture_type, "spin", 4) == 0) ||
-	    (strncmp(torture_type, "rw_lock", 7) == 0))
+	if ((str_has_prefix(torture_type, "spin")) ||
+	    (str_has_prefix(torture_type, "rw_lock")))
 		cxt.debug_lock = true;
 #endif
 
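The switch from `strncmp()` to `str_has_prefix()` removes the hand-counted length argument, which is easy to get wrong when a prefix string changes. A userspace model of the helper is sketched below. The `model_` names are hypothetical, and the body assumes the kernel helper returns the prefix length on a match and zero otherwise.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns strlen(prefix) if str starts with prefix, otherwise 0. */
static size_t model_str_has_prefix(const char *str, const char *prefix)
{
	size_t len = strlen(prefix);

	return strncmp(str, prefix, len) == 0 ? len : 0;
}

/* Hypothetical helper mirroring the CONFIG_DEBUG_SPINLOCK test above. */
static int model_is_spin_or_rw_lock(const char *torture_type)
{
	return model_str_has_prefix(torture_type, "spin") ||
	       model_str_has_prefix(torture_type, "rw_lock");
}
```

Since the length is derived from the prefix itself, a mismatch such as `strncmp(s, "rtmutex", 5)` cannot occur.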
@ -299,6 +299,8 @@ static inline void rcu_init_levelspread(int *levelspread, const int *levelcnt)
|
||||
{
|
||||
int i;
|
||||
|
||||
for (i = 0; i < RCU_NUM_LVLS; i++)
|
||||
levelspread[i] = INT_MIN;
|
||||
if (rcu_fanout_exact) {
|
||||
levelspread[rcu_num_lvls - 1] = rcu_fanout_leaf;
|
||||
for (i = rcu_num_lvls - 2; i >= 0; i--)
|
||||
@ -455,7 +457,6 @@ enum rcutorture_type {
|
||||
#if defined(CONFIG_TREE_RCU) || defined(CONFIG_PREEMPT_RCU)
|
||||
void rcutorture_get_gp_data(enum rcutorture_type test_type, int *flags,
|
||||
unsigned long *gp_seq);
|
||||
void rcutorture_record_progress(unsigned long vernum);
|
||||
void do_trace_rcu_torture_read(const char *rcutorturename,
|
||||
struct rcu_head *rhp,
|
||||
unsigned long secs,
|
||||
@ -468,7 +469,6 @@ static inline void rcutorture_get_gp_data(enum rcutorture_type test_type,
|
||||
*flags = 0;
|
||||
*gp_seq = 0;
|
||||
}
|
||||
static inline void rcutorture_record_progress(unsigned long vernum) { }
|
||||
#ifdef CONFIG_RCU_TRACE
|
||||
void do_trace_rcu_torture_read(const char *rcutorturename,
|
||||
struct rcu_head *rhp,
|
||||
|
@ -88,7 +88,7 @@ struct rcu_head *rcu_cblist_dequeue(struct rcu_cblist *rclp)
|
||||
}
|
||||
|
||||
/* Set the length of an rcu_segcblist structure. */
|
||||
void rcu_segcblist_set_len(struct rcu_segcblist *rsclp, long v)
|
||||
static void rcu_segcblist_set_len(struct rcu_segcblist *rsclp, long v)
|
||||
{
|
||||
#ifdef CONFIG_RCU_NOCB_CPU
|
||||
atomic_long_set(&rsclp->len, v);
|
||||
@ -104,7 +104,7 @@ void rcu_segcblist_set_len(struct rcu_segcblist *rsclp, long v)
|
||||
* This increase is fully ordered with respect to the callers accesses
|
||||
* both before and after.
|
||||
*/
|
||||
void rcu_segcblist_add_len(struct rcu_segcblist *rsclp, long v)
|
||||
static void rcu_segcblist_add_len(struct rcu_segcblist *rsclp, long v)
|
||||
{
|
||||
#ifdef CONFIG_RCU_NOCB_CPU
|
||||
smp_mb__before_atomic(); /* Up to the caller! */
|
||||
@ -134,7 +134,7 @@ void rcu_segcblist_inc_len(struct rcu_segcblist *rsclp)
|
||||
* with the actual number of callbacks on the structure. This exchange is
|
||||
* fully ordered with respect to the callers accesses both before and after.
|
||||
*/
|
||||
long rcu_segcblist_xchg_len(struct rcu_segcblist *rsclp, long v)
|
||||
static long rcu_segcblist_xchg_len(struct rcu_segcblist *rsclp, long v)
|
||||
{
|
||||
#ifdef CONFIG_RCU_NOCB_CPU
|
||||
return atomic_long_xchg(&rsclp->len, v);
|
||||
|
@ -109,15 +109,6 @@ static unsigned long b_rcu_perf_writer_started;
|
||||
static unsigned long b_rcu_perf_writer_finished;
|
||||
static DEFINE_PER_CPU(atomic_t, n_async_inflight);
|
||||
|
||||
static int rcu_perf_writer_state;
|
||||
#define RTWS_INIT 0
|
||||
#define RTWS_ASYNC 1
|
||||
#define RTWS_BARRIER 2
|
||||
#define RTWS_EXP_SYNC 3
|
||||
#define RTWS_SYNC 4
|
||||
#define RTWS_IDLE 5
|
||||
#define RTWS_STOPPING 6
|
||||
|
||||
#define MAX_MEAS 10000
|
||||
#define MIN_MEAS 100
|
||||
|
||||
@ -404,25 +395,20 @@ retry:
|
||||
if (!rhp)
|
||||
rhp = kmalloc(sizeof(*rhp), GFP_KERNEL);
|
||||
if (rhp && atomic_read(this_cpu_ptr(&n_async_inflight)) < gp_async_max) {
|
||||
rcu_perf_writer_state = RTWS_ASYNC;
|
||||
atomic_inc(this_cpu_ptr(&n_async_inflight));
|
||||
cur_ops->async(rhp, rcu_perf_async_cb);
|
||||
rhp = NULL;
|
||||
} else if (!kthread_should_stop()) {
|
||||
rcu_perf_writer_state = RTWS_BARRIER;
|
||||
cur_ops->gp_barrier();
|
||||
goto retry;
|
||||
} else {
|
||||
kfree(rhp); /* Because we are stopping. */
|
||||
}
|
||||
} else if (gp_exp) {
|
||||
rcu_perf_writer_state = RTWS_EXP_SYNC;
|
||||
cur_ops->exp_sync();
|
||||
} else {
|
||||
rcu_perf_writer_state = RTWS_SYNC;
|
||||
cur_ops->sync();
|
||||
}
|
||||
rcu_perf_writer_state = RTWS_IDLE;
|
||||
t = ktime_get_mono_fast_ns();
|
||||
*wdp = t - *wdp;
|
||||
i_max = i;
|
||||
@ -463,10 +449,8 @@ retry:
|
||||
rcu_perf_wait_shutdown();
|
||||
} while (!torture_must_stop());
|
||||
if (gp_async) {
|
||||
rcu_perf_writer_state = RTWS_BARRIER;
|
||||
cur_ops->gp_barrier();
|
||||
}
|
||||
rcu_perf_writer_state = RTWS_STOPPING;
|
||||
writer_n_durations[me] = i_max;
|
||||
torture_kthread_stopping("rcu_perf_writer");
|
||||
return 0;
|
||||
|
@ -44,6 +44,7 @@
|
||||
#include <linux/sched/debug.h>
|
||||
#include <linux/sched/sysctl.h>
|
||||
#include <linux/oom.h>
|
||||
#include <linux/tick.h>
|
||||
|
||||
#include "rcu.h"
|
||||
|
||||
@ -1363,15 +1364,15 @@ rcu_torture_reader(void *arg)
|
||||
set_user_nice(current, MAX_NICE);
|
||||
if (irqreader && cur_ops->irq_capable)
|
||||
timer_setup_on_stack(&t, rcu_torture_timer, 0);
|
||||
|
||||
tick_dep_set_task(current, TICK_DEP_BIT_RCU);
|
||||
do {
|
||||
if (irqreader && cur_ops->irq_capable) {
|
||||
if (!timer_pending(&t))
|
||||
mod_timer(&t, jiffies + 1);
|
||||
}
|
||||
if (!rcu_torture_one_read(&rand))
|
||||
if (!rcu_torture_one_read(&rand) && !torture_must_stop())
|
||||
schedule_timeout_interruptible(HZ);
|
||||
if (time_after(jiffies, lastsleep)) {
|
||||
if (time_after(jiffies, lastsleep) && !torture_must_stop()) {
|
||||
schedule_timeout_interruptible(1);
|
||||
lastsleep = jiffies + 10;
|
||||
}
|
||||
@ -1383,6 +1384,7 @@ rcu_torture_reader(void *arg)
|
||||
del_timer_sync(&t);
|
||||
destroy_timer_on_stack(&t);
|
||||
}
|
||||
tick_dep_clear_task(current, TICK_DEP_BIT_RCU);
|
||||
torture_kthread_stopping("rcu_torture_reader");
|
||||
return 0;
|
||||
}
|
||||
@ -1442,15 +1444,18 @@ rcu_torture_stats_print(void)
|
||||
n_rcu_torture_barrier_error);
|
||||
|
||||
pr_alert("%s%s ", torture_type, TORTURE_FLAG);
|
||||
if (atomic_read(&n_rcu_torture_mberror) != 0 ||
|
||||
n_rcu_torture_barrier_error != 0 ||
|
||||
n_rcu_torture_boost_ktrerror != 0 ||
|
||||
n_rcu_torture_boost_rterror != 0 ||
|
||||
n_rcu_torture_boost_failure != 0 ||
|
||||
if (atomic_read(&n_rcu_torture_mberror) ||
|
||||
n_rcu_torture_barrier_error || n_rcu_torture_boost_ktrerror ||
|
||||
n_rcu_torture_boost_rterror || n_rcu_torture_boost_failure ||
|
||||
i > 1) {
|
||||
pr_cont("%s", "!!! ");
|
||||
atomic_inc(&n_rcu_torture_error);
|
||||
WARN_ON_ONCE(1);
|
||||
WARN_ON_ONCE(atomic_read(&n_rcu_torture_mberror));
|
||||
WARN_ON_ONCE(n_rcu_torture_barrier_error); // rcu_barrier()
|
||||
WARN_ON_ONCE(n_rcu_torture_boost_ktrerror); // no boost kthread
|
||||
WARN_ON_ONCE(n_rcu_torture_boost_rterror); // can't set RT prio
|
||||
WARN_ON_ONCE(n_rcu_torture_boost_failure); // RCU boost failed
|
||||
WARN_ON_ONCE(i > 1); // Too-short grace period
|
||||
}
|
||||
pr_cont("Reader Pipe: ");
|
||||
for (i = 0; i < RCU_TORTURE_PIPE_LEN + 1; i++)
|
||||
@ -1729,10 +1734,10 @@ static void rcu_torture_fwd_prog_cond_resched(unsigned long iter)
|
||||
// Real call_rcu() floods hit userspace, so emulate that.
|
||||
if (need_resched() || (iter & 0xfff))
|
||||
schedule();
|
||||
} else {
|
||||
// No userspace emulation: CB invocation throttles call_rcu()
|
||||
cond_resched();
|
||||
return;
|
||||
}
|
||||
// No userspace emulation: CB invocation throttles call_rcu()
|
||||
cond_resched();
|
||||
}
|
||||
|
||||
/*
|
||||
@ -1759,6 +1764,11 @@ static unsigned long rcu_torture_fwd_prog_cbfree(void)
|
||||
kfree(rfcp);
|
||||
freed++;
|
||||
rcu_torture_fwd_prog_cond_resched(freed);
|
||||
if (tick_nohz_full_enabled()) {
|
||||
local_irq_save(flags);
|
||||
rcu_momentary_dyntick_idle();
|
||||
local_irq_restore(flags);
|
||||
}
|
||||
}
|
||||
return freed;
|
||||
}
|
||||
@ -1803,7 +1813,7 @@ static void rcu_torture_fwd_prog_nr(int *tested, int *tested_tries)
|
||||
udelay(10);
|
||||
cur_ops->readunlock(idx);
|
||||
if (!fwd_progress_need_resched || need_resched())
|
||||
rcu_torture_fwd_prog_cond_resched(1);
|
||||
cond_resched();
|
||||
}
|
||||
(*tested_tries)++;
|
||||
if (!time_before(jiffies, stopat) &&
|
||||
@ -1833,6 +1843,7 @@ static void rcu_torture_fwd_prog_nr(int *tested, int *tested_tries)
|
||||
static void rcu_torture_fwd_prog_cr(void)
|
||||
{
|
||||
unsigned long cver;
|
||||
unsigned long flags;
|
||||
unsigned long gps;
|
||||
int i;
|
||||
long n_launders;
|
||||
@ -1865,6 +1876,7 @@ static void rcu_torture_fwd_prog_cr(void)
|
||||
cver = READ_ONCE(rcu_torture_current_version);
|
||||
gps = cur_ops->get_gp_seq();
|
||||
rcu_launder_gp_seq_start = gps;
|
||||
tick_dep_set_task(current, TICK_DEP_BIT_RCU);
|
||||
while (time_before(jiffies, stopat) &&
|
||||
!shutdown_time_arrived() &&
|
||||
!READ_ONCE(rcu_fwd_emergency_stop) && !torture_must_stop()) {
|
||||
@ -1891,6 +1903,11 @@ static void rcu_torture_fwd_prog_cr(void)
|
||||
}
|
||||
cur_ops->call(&rfcp->rh, rcu_torture_fwd_cb_cr);
|
||||
rcu_torture_fwd_prog_cond_resched(n_launders + n_max_cbs);
|
||||
if (tick_nohz_full_enabled()) {
|
||||
local_irq_save(flags);
|
||||
rcu_momentary_dyntick_idle();
|
||||
local_irq_restore(flags);
|
||||
}
|
||||
}
|
||||
stoppedat = jiffies;
|
||||
n_launders_cb_snap = READ_ONCE(n_launders_cb);
|
||||
@ -1911,6 +1928,7 @@ static void rcu_torture_fwd_prog_cr(void)
|
||||
rcu_torture_fwd_cb_hist();
|
||||
}
|
||||
schedule_timeout_uninterruptible(HZ); /* Let CBs drain. */
|
||||
tick_dep_clear_task(current, TICK_DEP_BIT_RCU);
|
||||
WRITE_ONCE(rcu_fwd_cb_nodelay, false);
|
||||
}
|
||||
|
||||
|
@ -364,7 +364,7 @@ bool rcu_eqs_special_set(int cpu)
|
||||
*
|
||||
* The caller must have disabled interrupts and must not be idle.
|
||||
*/
|
||||
static void __maybe_unused rcu_momentary_dyntick_idle(void)
|
||||
void rcu_momentary_dyntick_idle(void)
|
||||
{
|
||||
int special;
|
||||
|
||||
@ -375,6 +375,7 @@ static void __maybe_unused rcu_momentary_dyntick_idle(void)
|
||||
WARN_ON_ONCE(!(special & RCU_DYNTICK_CTRL_CTR));
|
||||
rcu_preempt_deferred_qs(current);
|
||||
}
|
||||
EXPORT_SYMBOL_GPL(rcu_momentary_dyntick_idle);
|
||||
|
||||
/**
|
||||
* rcu_is_cpu_rrupt_from_idle - see if interrupted from idle
|
||||
@ -496,7 +497,7 @@ module_param_cb(jiffies_till_next_fqs, &next_fqs_jiffies_ops, &jiffies_till_next
|
||||
module_param(rcu_kick_kthreads, bool, 0644);
|
||||
|
||||
static void force_qs_rnp(int (*f)(struct rcu_data *rdp));
|
||||
static int rcu_pending(void);
|
||||
static int rcu_pending(int user);
|
||||
|
||||
/*
|
||||
* Return the number of RCU GPs completed thus far for debug & stats.
|
||||
@@ -824,6 +825,11 @@ static __always_inline void rcu_nmi_enter_common(bool irq)
 			rcu_cleanup_after_idle();

 		incby = 1;
+	} else if (tick_nohz_full_cpu(rdp->cpu) &&
+		   rdp->dynticks_nmi_nesting == DYNTICK_IRQ_NONIDLE &&
+		   READ_ONCE(rdp->rcu_urgent_qs) && !rdp->rcu_forced_tick) {
+		rdp->rcu_forced_tick = true;
+		tick_dep_set_cpu(rdp->cpu, TICK_DEP_BIT_RCU);
 	}
 	trace_rcu_dyntick(incby == 1 ? TPS("Endirq") : TPS("++="),
 			  rdp->dynticks_nmi_nesting,
@@ -885,6 +891,21 @@ void rcu_irq_enter_irqson(void)
 	local_irq_restore(flags);
 }

+/*
+ * If any sort of urgency was applied to the current CPU (for example,
+ * the scheduler-clock interrupt was enabled on a nohz_full CPU) in order
+ * to get to a quiescent state, disable it.
+ */
+static void rcu_disable_urgency_upon_qs(struct rcu_data *rdp)
+{
+	WRITE_ONCE(rdp->rcu_urgent_qs, false);
+	WRITE_ONCE(rdp->rcu_need_heavy_qs, false);
+	if (tick_nohz_full_cpu(rdp->cpu) && rdp->rcu_forced_tick) {
+		tick_dep_clear_cpu(rdp->cpu, TICK_DEP_BIT_RCU);
+		rdp->rcu_forced_tick = false;
+	}
+}
+
 /**
  * rcu_is_watching - see if RCU thinks that the current CPU is not idle
  *
@@ -1073,6 +1094,7 @@ static int rcu_implicit_dynticks_qs(struct rcu_data *rdp)
 	if (tick_nohz_full_cpu(rdp->cpu) &&
 		   time_after(jiffies,
 			      READ_ONCE(rdp->last_fqs_resched) + jtsq * 3)) {
+		WRITE_ONCE(*ruqp, true);
 		resched_cpu(rdp->cpu);
 		WRITE_ONCE(rdp->last_fqs_resched, jiffies);
 	}
@@ -1968,7 +1990,6 @@ rcu_report_qs_rdp(int cpu, struct rcu_data *rdp)
 		return;
 	}
 	mask = rdp->grpmask;
-	rdp->core_needs_qs = false;
 	if ((rnp->qsmask & mask) == 0) {
 		raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
 	} else {
@@ -1979,6 +2000,7 @@ rcu_report_qs_rdp(int cpu, struct rcu_data *rdp)
 		if (!offloaded)
 			needwake = rcu_accelerate_cbs(rnp, rdp);

+		rcu_disable_urgency_upon_qs(rdp);
 		rcu_report_qs_rnp(mask, rnp, rnp->gp_seq, flags);
 		/* ^^^ Released rnp->lock */
 		if (needwake)
@@ -2101,6 +2123,9 @@ int rcutree_dead_cpu(unsigned int cpu)
 	rcu_boost_kthread_setaffinity(rnp, -1);
 	/* Do any needed no-CB deferred wakeups from this CPU. */
 	do_nocb_deferred_wakeup(per_cpu_ptr(&rcu_data, cpu));
+
+	// Stop-machine done, so allow nohz_full to disable tick.
+	tick_dep_clear(TICK_DEP_BIT_RCU);
 	return 0;
 }

@@ -2151,6 +2176,7 @@ static void rcu_do_batch(struct rcu_data *rdp)
 	rcu_nocb_unlock_irqrestore(rdp, flags);

 	/* Invoke callbacks. */
+	tick_dep_set_task(current, TICK_DEP_BIT_RCU);
 	rhp = rcu_cblist_dequeue(&rcl);
 	for (; rhp; rhp = rcu_cblist_dequeue(&rcl)) {
 		debug_rcu_head_unqueue(rhp);
@@ -2217,6 +2243,7 @@ static void rcu_do_batch(struct rcu_data *rdp)
 	/* Re-invoke RCU core processing if there are callbacks remaining. */
 	if (!offloaded && rcu_segcblist_ready_cbs(&rdp->cblist))
 		invoke_rcu_core();
+	tick_dep_clear_task(current, TICK_DEP_BIT_RCU);
 }

 /*
@@ -2241,7 +2268,7 @@ void rcu_sched_clock_irq(int user)
 		__this_cpu_write(rcu_data.rcu_urgent_qs, false);
 	}
 	rcu_flavor_sched_clock_irq(user);
-	if (rcu_pending())
+	if (rcu_pending(user))
 		invoke_rcu_core();

 	trace_rcu_utilization(TPS("End scheduler-tick"));
@@ -2259,6 +2286,7 @@ static void force_qs_rnp(int (*f)(struct rcu_data *rdp))
 	int cpu;
 	unsigned long flags;
 	unsigned long mask;
+	struct rcu_data *rdp;
 	struct rcu_node *rnp;

 	rcu_for_each_leaf_node(rnp) {
@@ -2283,8 +2311,11 @@ static void force_qs_rnp(int (*f)(struct rcu_data *rdp))
 		for_each_leaf_node_possible_cpu(rnp, cpu) {
 			unsigned long bit = leaf_node_cpu_bit(rnp, cpu);
 			if ((rnp->qsmask & bit) != 0) {
-				if (f(per_cpu_ptr(&rcu_data, cpu)))
+				rdp = per_cpu_ptr(&rcu_data, cpu);
+				if (f(rdp)) {
 					mask |= bit;
+					rcu_disable_urgency_upon_qs(rdp);
+				}
 			}
 		}
 		if (mask != 0) {
@@ -2312,7 +2343,7 @@ void rcu_force_quiescent_state(void)
 	rnp = __this_cpu_read(rcu_data.mynode);
 	for (; rnp != NULL; rnp = rnp->parent) {
 		ret = (READ_ONCE(rcu_state.gp_flags) & RCU_GP_FLAG_FQS) ||
-		      !raw_spin_trylock(&rnp->fqslock);
+		      !raw_spin_trylock(&rnp->fqslock);
 		if (rnp_old != NULL)
 			raw_spin_unlock(&rnp_old->fqslock);
 		if (ret)
@@ -2786,8 +2817,9 @@ EXPORT_SYMBOL_GPL(cond_synchronize_rcu);
  * CPU-local state are performed first.  However, we must check for CPU
  * stalls first, else we might not get a chance.
  */
-static int rcu_pending(void)
+static int rcu_pending(int user)
 {
+	bool gp_in_progress;
 	struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
 	struct rcu_node *rnp = rdp->mynode;

@@ -2798,12 +2830,13 @@ static int rcu_pending(void)
 	if (rcu_nocb_need_deferred_wakeup(rdp))
 		return 1;

-	/* Is this CPU a NO_HZ_FULL CPU that should ignore RCU? */
-	if (rcu_nohz_full_cpu())
+	/* Is this a nohz_full CPU in userspace or idle?  (Ignore RCU if so.) */
+	if ((user || rcu_is_cpu_rrupt_from_idle()) && rcu_nohz_full_cpu())
 		return 0;

 	/* Is the RCU core waiting for a quiescent state from this CPU? */
-	if (rdp->core_needs_qs && !rdp->cpu_no_qs.b.norm)
+	gp_in_progress = rcu_gp_in_progress();
+	if (rdp->core_needs_qs && !rdp->cpu_no_qs.b.norm && gp_in_progress)
 		return 1;

 	/* Does this CPU have callbacks ready to invoke? */
@@ -2811,8 +2844,7 @@ static int rcu_pending(void)
 		return 1;

 	/* Has RCU gone idle with this CPU needing another grace period? */
-	if (!rcu_gp_in_progress() &&
-	    rcu_segcblist_is_enabled(&rdp->cblist) &&
+	if (!gp_in_progress && rcu_segcblist_is_enabled(&rdp->cblist) &&
 	    (!IS_ENABLED(CONFIG_RCU_NOCB_CPU) ||
 	     !rcu_segcblist_is_offloaded(&rdp->cblist)) &&
 	    !rcu_segcblist_restempty(&rdp->cblist, RCU_NEXT_READY_TAIL))
@@ -2845,7 +2877,7 @@ static void rcu_barrier_callback(struct rcu_head *rhp)
 {
 	if (atomic_dec_and_test(&rcu_state.barrier_cpu_count)) {
 		rcu_barrier_trace(TPS("LastCB"), -1,
-				  rcu_state.barrier_sequence);
+				  rcu_state.barrier_sequence);
 		complete(&rcu_state.barrier_completion);
 	} else {
 		rcu_barrier_trace(TPS("CB"), -1, rcu_state.barrier_sequence);
@@ -2869,7 +2901,7 @@ static void rcu_barrier_func(void *unused)
 	} else {
 		debug_rcu_head_unqueue(&rdp->barrier_head);
 		rcu_barrier_trace(TPS("IRQNQ"), -1,
-				  rcu_state.barrier_sequence);
+				  rcu_state.barrier_sequence);
 	}
 	rcu_nocb_unlock(rdp);
 }
@@ -2896,7 +2928,7 @@ void rcu_barrier(void)
 	/* Did someone else do our work for us? */
 	if (rcu_seq_done(&rcu_state.barrier_sequence, s)) {
 		rcu_barrier_trace(TPS("EarlyExit"), -1,
-				  rcu_state.barrier_sequence);
+				  rcu_state.barrier_sequence);
 		smp_mb(); /* caller's subsequent code after above check. */
 		mutex_unlock(&rcu_state.barrier_mutex);
 		return;
@@ -2928,11 +2960,11 @@ void rcu_barrier(void)
 			continue;
 		if (rcu_segcblist_n_cbs(&rdp->cblist)) {
 			rcu_barrier_trace(TPS("OnlineQ"), cpu,
-					  rcu_state.barrier_sequence);
+					  rcu_state.barrier_sequence);
 			smp_call_function_single(cpu, rcu_barrier_func, NULL, 1);
 		} else {
 			rcu_barrier_trace(TPS("OnlineNQ"), cpu,
-					  rcu_state.barrier_sequence);
+					  rcu_state.barrier_sequence);
 		}
 	}
 	put_online_cpus();
@@ -3083,6 +3115,9 @@ int rcutree_online_cpu(unsigned int cpu)
 		return 0; /* Too early in boot for scheduler work. */
 	sync_sched_exp_online_cleanup(cpu);
 	rcutree_affinity_setting(cpu, -1);
+
+	// Stop-machine done, so allow nohz_full to disable tick.
+	tick_dep_clear(TICK_DEP_BIT_RCU);
 	return 0;
 }

@@ -3103,6 +3138,9 @@ int rcutree_offline_cpu(unsigned int cpu)
 	raw_spin_unlock_irqrestore_rcu_node(rnp, flags);

 	rcutree_affinity_setting(cpu, cpu);
+
+	// nohz_full CPUs need the tick for stop-machine to work quickly
+	tick_dep_set(TICK_DEP_BIT_RCU);
 	return 0;
 }

@@ -3148,6 +3186,7 @@ void rcu_cpu_starting(unsigned int cpu)
 	rdp->rcu_onl_gp_seq = READ_ONCE(rcu_state.gp_seq);
 	rdp->rcu_onl_gp_flags = READ_ONCE(rcu_state.gp_flags);
 	if (rnp->qsmask & mask) { /* RCU waiting on incoming CPU? */
+		rcu_disable_urgency_upon_qs(rdp);
 		/* Report QS -after- changing ->qsmaskinitnext! */
 		rcu_report_qs_rnp(mask, rnp, rnp->gp_seq, flags);
 	} else {

@@ -181,6 +181,7 @@ struct rcu_data {
 	atomic_t dynticks;		/* Even value for idle, else odd. */
 	bool rcu_need_heavy_qs;		/* GP old, so heavy quiescent state! */
 	bool rcu_urgent_qs;		/* GP old need light quiescent state. */
+	bool rcu_forced_tick;		/* Forced tick to provide QS. */
 #ifdef CONFIG_RCU_FAST_NO_HZ
 	bool all_lazy;			/* All CPU's CBs lazy at idle start? */
 	unsigned long last_accelerate;	/* Last jiffy CBs were accelerated. */

@@ -1946,7 +1946,7 @@ static void nocb_gp_wait(struct rcu_data *my_rdp)
 	int __maybe_unused cpu = my_rdp->cpu;
 	unsigned long cur_gp_seq;
 	unsigned long flags;
-	bool gotcbs;
+	bool gotcbs = false;
 	unsigned long j = jiffies;
 	bool needwait_gp = false; // This prevents actual uninitialized use.
 	bool needwake;

@@ -235,6 +235,7 @@ static int multi_cpu_stop(void *data)
 			 */
 			touch_nmi_watchdog();
 		}
+		rcu_momentary_dyntick_idle();
 	} while (curstate != MULTI_STOP_EXIT);

 	local_irq_restore(flags);

@@ -172,6 +172,7 @@ static void tick_sched_handle(struct tick_sched *ts, struct pt_regs *regs)
 #ifdef CONFIG_NO_HZ_FULL
 cpumask_var_t tick_nohz_full_mask;
 bool tick_nohz_full_running;
+EXPORT_SYMBOL_GPL(tick_nohz_full_running);
 static atomic_t tick_dep_mask;

 static bool check_tick_dependency(atomic_t *dep)
@@ -198,6 +199,11 @@ static bool check_tick_dependency(atomic_t *dep)
 		return true;
 	}

+	if (val & TICK_DEP_MASK_RCU) {
+		trace_tick_stop(0, TICK_DEP_MASK_RCU);
+		return true;
+	}
+
 	return false;
 }

@@ -324,6 +330,7 @@ void tick_nohz_dep_set_cpu(int cpu, enum tick_dep_bits bit)
 		preempt_enable();
 	}
 }
+EXPORT_SYMBOL_GPL(tick_nohz_dep_set_cpu);

 void tick_nohz_dep_clear_cpu(int cpu, enum tick_dep_bits bit)
 {
@@ -331,6 +338,7 @@ void tick_nohz_dep_clear_cpu(int cpu, enum tick_dep_bits bit)

 	atomic_andnot(BIT(bit), &ts->tick_dep_mask);
 }
+EXPORT_SYMBOL_GPL(tick_nohz_dep_clear_cpu);

 /*
  * Set a per-task tick dependency. Posix CPU timers need this in order to elapse
@@ -344,11 +352,13 @@ void tick_nohz_dep_set_task(struct task_struct *tsk, enum tick_dep_bits bit)
 	 */
 	tick_nohz_dep_set_all(&tsk->tick_dep_mask, bit);
 }
+EXPORT_SYMBOL_GPL(tick_nohz_dep_set_task);

 void tick_nohz_dep_clear_task(struct task_struct *tsk, enum tick_dep_bits bit)
 {
 	atomic_andnot(BIT(bit), &tsk->tick_dep_mask);
 }
+EXPORT_SYMBOL_GPL(tick_nohz_dep_clear_task);

 /*
  * Set a per-taskgroup tick dependency. Posix CPU timers need this in order to elapse
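The tick-dependency helpers exported above treat each dependency as one bit in an atomic mask: setting a dependency ORs the bit in, and clearing it is an and-not, as in the atomic_andnot() calls shown. The following is a minimal userspace sketch of that pattern using C11 atomics. The names dep_set(), dep_clear(), dep_any() and the DEP_BIT_* values are hypothetical stand-ins for illustration, not kernel API, and the "mask was previously empty" return value is only an assumption about where a caller might want to react.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bit numbers for illustration; not the kernel's enum tick_dep_bits. */
enum dep_bits { DEP_BIT_TIMER = 0, DEP_BIT_RCU = 1 };

/* Set one dependency bit; returns true if the mask was empty beforehand. */
static bool dep_set(atomic_uint *mask, enum dep_bits bit)
{
	unsigned int prev;

	assert(mask != NULL);
	prev = atomic_fetch_or(mask, 1u << bit);
	return prev == 0;
}

/* Clear one dependency bit (the and-not step); other bits are untouched. */
static void dep_clear(atomic_uint *mask, enum dep_bits bit)
{
	assert(mask != NULL);
	atomic_fetch_and(mask, ~(1u << bit));
}

/* The tick may only stop when no dependency bit remains set. */
static bool dep_any(atomic_uint *mask)
{
	assert(mask != NULL);
	return atomic_load(mask) != 0;
}
```

In this sketch a caller would pair dep_set() and dep_clear() around the region that needs the tick, much as rcu_do_batch() brackets callback invocation with tick_dep_set_task() and tick_dep_clear_task() in the hunks above.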
@@ -397,6 +407,7 @@ void __init tick_nohz_full_setup(cpumask_var_t cpumask)
 	cpumask_copy(tick_nohz_full_mask, cpumask);
 	tick_nohz_full_running = true;
 }
+EXPORT_SYMBOL_GPL(tick_nohz_full_setup);

 static int tick_nohz_cpu_down(unsigned int cpu)
 {

@@ -365,11 +365,6 @@ static void show_pwq(struct pool_workqueue *pwq);
 			 !lockdep_is_held(&wq_pool_mutex),		\
 			 "RCU or wq_pool_mutex should be held")

-#define assert_rcu_or_wq_mutex(wq)					\
-	RCU_LOCKDEP_WARN(!rcu_read_lock_held() &&			\
-			 !lockdep_is_held(&wq->mutex),			\
-			 "RCU or wq->mutex should be held")
-
 #define assert_rcu_or_wq_mutex_or_pool_mutex(wq)			\
 	RCU_LOCKDEP_WARN(!rcu_read_lock_held() &&			\
 			 !lockdep_is_held(&wq->mutex) &&		\
@@ -427,9 +422,7 @@ static void show_pwq(struct pool_workqueue *pwq);
  */
 #define for_each_pwq(pwq, wq)						\
 	list_for_each_entry_rcu((pwq), &(wq)->pwqs, pwqs_node,		\
-				lockdep_is_held(&wq->mutex))		\
-		if (({ assert_rcu_or_wq_mutex(wq); false; })) { }	\
-		else
+				lockdep_is_held(&(wq->mutex)))

 #ifdef CONFIG_DEBUG_OBJECTS_WORK


@@ -1314,8 +1314,8 @@ int dev_set_alias(struct net_device *dev, const char *alias, size_t len)
 	}

 	mutex_lock(&ifalias_mutex);
-	rcu_swap_protected(dev->ifalias, new_alias,
-			   mutex_is_locked(&ifalias_mutex));
+	new_alias = rcu_replace_pointer(dev->ifalias, new_alias,
+					mutex_is_locked(&ifalias_mutex));
 	mutex_unlock(&ifalias_mutex);

 	if (new_alias)

@@ -356,8 +356,8 @@ int reuseport_detach_prog(struct sock *sk)
 	spin_lock_bh(&reuseport_lock);
 	reuse = rcu_dereference_protected(sk->sk_reuseport_cb,
 					  lockdep_is_held(&reuseport_lock));
-	rcu_swap_protected(reuse->prog, old_prog,
-			   lockdep_is_held(&reuseport_lock));
+	old_prog = rcu_replace_pointer(reuse->prog, old_prog,
+				       lockdep_is_held(&reuseport_lock));
 	spin_unlock_bh(&reuseport_lock);

 	if (!old_prog)

@@ -1557,8 +1557,9 @@ static void nft_chain_stats_replace(struct nft_trans *trans)
 	if (!nft_trans_chain_stats(trans))
 		return;

-	rcu_swap_protected(chain->stats, nft_trans_chain_stats(trans),
-			   lockdep_commit_lock_is_held(trans->ctx.net));
+	nft_trans_chain_stats(trans) =
+		rcu_replace_pointer(chain->stats, nft_trans_chain_stats(trans),
+				    lockdep_commit_lock_is_held(trans->ctx.net));

 	if (!nft_trans_chain_stats(trans))
 		static_branch_inc(&nft_counters_enabled);

@@ -88,7 +88,7 @@ struct tcf_chain *tcf_action_set_ctrlact(struct tc_action *a, int action,
 					 struct tcf_chain *goto_chain)
 {
 	a->tcfa_action = action;
-	rcu_swap_protected(a->goto_chain, goto_chain, 1);
+	goto_chain = rcu_replace_pointer(a->goto_chain, goto_chain, 1);
 	return goto_chain;
 }
 EXPORT_SYMBOL(tcf_action_set_ctrlact);

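Every conversion in this series has the same shape: rcu_swap_protected(p, v, c) exchanged p and v in place, whereas rcu_replace_pointer(p, v, c) publishes v and returns the old pointer, which the caller assigns back explicitly. The sketch below models that calling convention in userspace. It is only an illustration under stated assumptions: replace_pointer() and struct params are hypothetical, a C11 release store stands in for rcu_assign_pointer(), and the lockdep condition is reduced to a plain bool that is asserted.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical payload type for illustration. */
struct params { int value; };

/*
 * Hypothetical userspace stand-in for the "replace and return old" pattern:
 * publish new_p in *slot and hand back the previous pointer.
 * update_side_held models the lockdep condition passed by kernel callers.
 */
static struct params *replace_pointer(struct params *_Atomic *slot,
				      struct params *new_p,
				      bool update_side_held)
{
	struct params *old;

	assert(slot != NULL);
	assert(update_side_held);
	/* The updater holds the lock, so a relaxed load of the old value suffices. */
	old = atomic_load_explicit(slot, memory_order_relaxed);
	/* Release ordering makes the new object's contents visible before the pointer. */
	atomic_store_explicit(slot, new_p, memory_order_release);
	return old;
}
```

A caller then writes `p = replace_pointer(&slot, p, true);` and frees the returned pointer once readers are done with it, mirroring the `x = rcu_replace_pointer(..., x, ...)` lines in the surrounding hunks.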
@@ -101,8 +101,8 @@ static int tcf_csum_init(struct net *net, struct nlattr *nla,

 	spin_lock_bh(&p->tcf_lock);
 	goto_ch = tcf_action_set_ctrlact(*a, parm->action, goto_ch);
-	rcu_swap_protected(p->params, params_new,
-			   lockdep_is_held(&p->tcf_lock));
+	params_new = rcu_replace_pointer(p->params, params_new,
+					 lockdep_is_held(&p->tcf_lock));
 	spin_unlock_bh(&p->tcf_lock);

 	if (goto_ch)

@@ -721,7 +721,8 @@ static int tcf_ct_init(struct net *net, struct nlattr *nla,

 	spin_lock_bh(&c->tcf_lock);
 	goto_ch = tcf_action_set_ctrlact(*a, parm->action, goto_ch);
-	rcu_swap_protected(c->params, params, lockdep_is_held(&c->tcf_lock));
+	params = rcu_replace_pointer(c->params, params,
+				     lockdep_is_held(&c->tcf_lock));
 	spin_unlock_bh(&c->tcf_lock);

 	if (goto_ch)

@@ -257,8 +257,8 @@ static int tcf_ctinfo_init(struct net *net, struct nlattr *nla,

 	spin_lock_bh(&ci->tcf_lock);
 	goto_ch = tcf_action_set_ctrlact(*a, actparm->action, goto_ch);
-	rcu_swap_protected(ci->params, cp_new,
-			   lockdep_is_held(&ci->tcf_lock));
+	cp_new = rcu_replace_pointer(ci->params, cp_new,
+				     lockdep_is_held(&ci->tcf_lock));
 	spin_unlock_bh(&ci->tcf_lock);

 	if (goto_ch)

@@ -595,7 +595,7 @@ static int tcf_ife_init(struct net *net, struct nlattr *nla,
 		spin_lock_bh(&ife->tcf_lock);
 	/* protected by tcf_lock when modifying existing action */
 	goto_ch = tcf_action_set_ctrlact(*a, parm->action, goto_ch);
-	rcu_swap_protected(ife->params, p, 1);
+	p = rcu_replace_pointer(ife->params, p, 1);

 	if (exists)
 		spin_unlock_bh(&ife->tcf_lock);

@@ -178,8 +178,8 @@ static int tcf_mirred_init(struct net *net, struct nlattr *nla,
 			goto put_chain;
 		}
 		mac_header_xmit = dev_is_mac_header_xmit(dev);
-		rcu_swap_protected(m->tcfm_dev, dev,
-				   lockdep_is_held(&m->tcf_lock));
+		dev = rcu_replace_pointer(m->tcfm_dev, dev,
+					  lockdep_is_held(&m->tcf_lock));
 		if (dev)
 			dev_put(dev);
 		m->tcfm_mac_header_xmit = mac_header_xmit;

@@ -262,7 +262,7 @@ static int tcf_mpls_init(struct net *net, struct nlattr *nla,

 	spin_lock_bh(&m->tcf_lock);
 	goto_ch = tcf_action_set_ctrlact(*a, parm->action, goto_ch);
-	rcu_swap_protected(m->mpls_p, p, lockdep_is_held(&m->tcf_lock));
+	p = rcu_replace_pointer(m->mpls_p, p, lockdep_is_held(&m->tcf_lock));
 	spin_unlock_bh(&m->tcf_lock);

 	if (goto_ch)

@@ -191,9 +191,9 @@ static int tcf_police_init(struct net *net, struct nlattr *nla,
 	police->tcfp_ptoks = new->tcfp_mtu_ptoks;
 	spin_unlock_bh(&police->tcfp_lock);
 	goto_ch = tcf_action_set_ctrlact(*a, parm->action, goto_ch);
-	rcu_swap_protected(police->params,
-			   new,
-			   lockdep_is_held(&police->tcf_lock));
+	new = rcu_replace_pointer(police->params,
+				  new,
+				  lockdep_is_held(&police->tcf_lock));
 	spin_unlock_bh(&police->tcf_lock);

 	if (goto_ch)

@@ -102,8 +102,8 @@ static int tcf_sample_init(struct net *net, struct nlattr *nla,
 	goto_ch = tcf_action_set_ctrlact(*a, parm->action, goto_ch);
 	s->rate = rate;
 	s->psample_group_num = psample_group_num;
-	rcu_swap_protected(s->psample_group, psample_group,
-			   lockdep_is_held(&s->tcf_lock));
+	psample_group = rcu_replace_pointer(s->psample_group, psample_group,
+					    lockdep_is_held(&s->tcf_lock));

 	if (tb[TCA_SAMPLE_TRUNC_SIZE]) {
 		s->truncate = true;

@@ -206,8 +206,8 @@ static int tcf_skbedit_init(struct net *net, struct nlattr *nla,

 	spin_lock_bh(&d->tcf_lock);
 	goto_ch = tcf_action_set_ctrlact(*a, parm->action, goto_ch);
-	rcu_swap_protected(d->params, params_new,
-			   lockdep_is_held(&d->tcf_lock));
+	params_new = rcu_replace_pointer(d->params, params_new,
+					 lockdep_is_held(&d->tcf_lock));
 	spin_unlock_bh(&d->tcf_lock);
 	if (params_new)
 		kfree_rcu(params_new, rcu);

@@ -529,8 +529,8 @@ static int tunnel_key_init(struct net *net, struct nlattr *nla,

 	spin_lock_bh(&t->tcf_lock);
 	goto_ch = tcf_action_set_ctrlact(*a, parm->action, goto_ch);
-	rcu_swap_protected(t->params, params_new,
-			   lockdep_is_held(&t->tcf_lock));
+	params_new = rcu_replace_pointer(t->params, params_new,
+					 lockdep_is_held(&t->tcf_lock));
 	spin_unlock_bh(&t->tcf_lock);
 	tunnel_key_release_params(params_new);
 	if (goto_ch)

@@ -221,7 +221,7 @@ static int tcf_vlan_init(struct net *net, struct nlattr *nla,

 	spin_lock_bh(&v->tcf_lock);
 	goto_ch = tcf_action_set_ctrlact(*a, parm->action, goto_ch);
-	rcu_swap_protected(v->vlan_p, p, lockdep_is_held(&v->tcf_lock));
+	p = rcu_replace_pointer(v->vlan_p, p, lockdep_is_held(&v->tcf_lock));
 	spin_unlock_bh(&v->tcf_lock);

 	if (goto_ch)

@@ -179,8 +179,8 @@ out_free_rule:
 	 * doesn't currently exist, just use a spinlock for now.
 	 */
 	mutex_lock(&policy_update_lock);
-	rcu_swap_protected(safesetid_setuid_rules, pol,
-			   lockdep_is_held(&policy_update_lock));
+	pol = rcu_replace_pointer(safesetid_setuid_rules, pol,
+				  lockdep_is_held(&policy_update_lock));
 	mutex_unlock(&policy_update_lock);
 	err = len;


@@ -27,9 +27,10 @@ Explanation of the Linux-Kernel Memory Consistency Model
   19. AND THEN THERE WAS ALPHA
   20. THE HAPPENS-BEFORE RELATION: hb
   21. THE PROPAGATES-BEFORE RELATION: pb
-  22. RCU RELATIONS: rcu-link, rcu-gp, rcu-rscsi, rcu-fence, and rb
+  22. RCU RELATIONS: rcu-link, rcu-gp, rcu-rscsi, rcu-order, rcu-fence, and rb
   23. LOCKING
-  24. ODDS AND ENDS
+  24. PLAIN ACCESSES AND DATA RACES
+  25. ODDS AND ENDS



@@ -42,8 +43,7 @@ linux-kernel.bell and linux-kernel.cat files that make up the formal
 version of the model; they are extremely terse and their meanings are
 far from clear.

-This document describes the ideas underlying the LKMM, but excluding
-the modeling of bare C (or plain) shared memory accesses.  It is meant
+This document describes the ideas underlying the LKMM.  It is meant
 for people who want to understand how the model was designed.  It does
 not go into the details of the code in the .bell and .cat files;
 rather, it explains in English what the code expresses symbolically.
@@ -206,7 +206,7 @@ goes like this:
 	P0 stores 1 to buf before storing 1 to flag, since it executes
 	its instructions in order.

-	Since an instruction (in this case, P1's store to flag) cannot
+	Since an instruction (in this case, P0's store to flag) cannot
 	execute before itself, the specified outcome is impossible.

 However, real computer hardware almost never follows the Sequential
@@ -419,7 +419,7 @@ example:

 The object code might call f(5) either before or after g(6); the
 memory model cannot assume there is a fixed program order relation
-between them.  (In fact, if the functions are inlined then the
+between them.  (In fact, if the function calls are inlined then the
 compiler might even interleave their object code.)


@@ -499,7 +499,7 @@ different CPUs (external reads-from, or rfe).

 For our purposes, a memory location's initial value is treated as
 though it had been written there by an imaginary initial store that
-executes on a separate CPU before the program runs.
+executes on a separate CPU before the main program runs.

 Usage of the rf relation implicitly assumes that loads will always
 read from a single store.  It doesn't apply properly in the presence
@@ -857,7 +857,7 @@ outlined above.  These restrictions involve the necessity of
 maintaining cache coherence and the fact that a CPU can't operate on a
 value before it knows what that value is, among other things.

-The formal version of the LKMM is defined by five requirements, or
+The formal version of the LKMM is defined by six requirements, or
 axioms:

 	Sequential consistency per variable: This requires that the
@@ -877,10 +877,14 @@ axioms:
 	grace periods obey the rules of RCU, in particular, the
 	Grace-Period Guarantee.

+	Plain-coherence: This requires that plain memory accesses
+	(those not using READ_ONCE(), WRITE_ONCE(), etc.) must obey
+	the operational model's rules regarding cache coherence.
+
 The first and second are quite common; they can be found in many
 memory models (such as those for C11/C++11).  The "happens-before" and
 "propagation" axioms have analogs in other memory models as well.  The
-"rcu" axiom is specific to the LKMM.
+"rcu" and "plain-coherence" axioms are specific to the LKMM.

 Each of these axioms is discussed below.

@@ -955,7 +959,7 @@ atomic update.  This is what the LKMM's "atomic" axiom says.
 THE PRESERVED PROGRAM ORDER RELATION: ppo
 -----------------------------------------

-There are many situations where a CPU is obligated to execute two
+There are many situations where a CPU is obliged to execute two
 instructions in program order.  We amalgamate them into the ppo (for
 "preserved program order") relation, which links the po-earlier
 instruction to the po-later instruction and is thus a sub-relation of
@@ -1425,8 +1429,8 @@ they execute means that it cannot have cycles.  This requirement is
 the content of the LKMM's "propagation" axiom.


-RCU RELATIONS: rcu-link, rcu-gp, rcu-rscsi, rcu-fence, and rb
--------------------------------------------------------------
+RCU RELATIONS: rcu-link, rcu-gp, rcu-rscsi, rcu-order, rcu-fence, and rb
+------------------------------------------------------------------------

 RCU (Read-Copy-Update) is a powerful synchronization mechanism.  It
 rests on two concepts: grace periods and read-side critical sections.
@@ -1536,29 +1540,29 @@ Z's CPU before Z begins but doesn't propagate to some other CPU until
 after X ends.)  Similarly, X ->rcu-rscsi Y ->rcu-link Z says that X is
 the end of a critical section which starts before Z begins.

-The LKMM goes on to define the rcu-fence relation as a sequence of
+The LKMM goes on to define the rcu-order relation as a sequence of
 rcu-gp and rcu-rscsi links separated by rcu-link links, in which the
 number of rcu-gp links is >= the number of rcu-rscsi links.  For
 example:

 	X ->rcu-gp Y ->rcu-link Z ->rcu-rscsi T ->rcu-link U ->rcu-gp V

-would imply that X ->rcu-fence V, because this sequence contains two
+would imply that X ->rcu-order V, because this sequence contains two
 rcu-gp links and one rcu-rscsi link.  (It also implies that
-X ->rcu-fence T and Z ->rcu-fence V.)  On the other hand:
+X ->rcu-order T and Z ->rcu-order V.)  On the other hand:

 	X ->rcu-rscsi Y ->rcu-link Z ->rcu-rscsi T ->rcu-link U ->rcu-gp V

-does not imply X ->rcu-fence V, because the sequence contains only
+does not imply X ->rcu-order V, because the sequence contains only
 one rcu-gp link but two rcu-rscsi links.

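The counting rule just stated for rcu-order can be written down as a small predicate. This is only a sketch of the rule as the text phrases it, not of the recursive definition in the .cat file; the names enum rcu_link_kind and seq_is_rcu_order() are hypothetical, and the rcu-link separators between consecutive links are left implicit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* The two kinds of link that are counted; rcu-link separators are implicit. */
enum rcu_link_kind { RCU_GP, RCU_RSCSI };

/*
 * links[] lists the rcu-gp / rcu-rscsi links of a sequence in order.
 * Following the prose, the sequence qualifies as rcu-order when it is
 * non-empty and has at least as many rcu-gp links as rcu-rscsi links.
 */
static bool seq_is_rcu_order(const enum rcu_link_kind *links, size_t n)
{
	size_t gp = 0, rscsi = 0;

	assert(links != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (links[i] == RCU_GP)
			gp++;
		else
			rscsi++;
	}
	return n > 0 && gp >= rscsi;
}
```

For the two examples above, the sequence gp, rscsi, gp qualifies (two against one), while rscsi, rscsi, gp does not (one against two).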
-The rcu-fence relation is important because the Grace Period Guarantee
-means that rcu-fence acts kind of like a strong fence.  In particular,
-E ->rcu-fence F implies not only that E begins before F ends, but also
-that any write po-before E will propagate to every CPU before any
-instruction po-after F can execute.  (However, it does not imply that
-E must execute before F; in fact, each synchronize_rcu() fence event
-is linked to itself by rcu-fence as a degenerate case.)
+The rcu-order relation is important because the Grace Period Guarantee
+means that rcu-order links act kind of like strong fences.  In
+particular, E ->rcu-order F implies not only that E begins before F
+ends, but also that any write po-before E will propagate to every CPU
+before any instruction po-after F can execute.  (However, it does not
+imply that E must execute before F; in fact, each synchronize_rcu()
+fence event is linked to itself by rcu-order as a degenerate case.)

 To prove this in full generality requires some intellectual effort.
 We'll consider just a very simple case:
@@ -1572,7 +1576,7 @@ and there are events X, Y and a read-side critical section C such that:

 	2. X comes "before" Y in some sense (including rfe, co and fr);

-	2. Y is po-before Z;
+	3. Y is po-before Z;

 	4. Z is the rcu_read_unlock() event marking the end of C;

@@ -1585,7 +1589,26 @@ G's CPU before G starts must propagate to every CPU before C starts.
 In particular, the write propagates to every CPU before F finishes
 executing and hence before any instruction po-after F can execute.
 This sort of reasoning can be extended to handle all the situations
-covered by rcu-fence.
+covered by rcu-order.
+
+The rcu-fence relation is a simple extension of rcu-order.  While
+rcu-order only links certain fence events (calls to synchronize_rcu(),
+rcu_read_lock(), or rcu_read_unlock()), rcu-fence links any events
+that are separated by an rcu-order link.  This is analogous to the way
+the strong-fence relation links events that are separated by an
+smp_mb() fence event (as mentioned above, rcu-order links act kind of
+like strong fences).  Written symbolically, X ->rcu-fence Y means
+there are fence events E and F such that:
+
+	X ->po E ->rcu-order F ->po Y.
+
+From the discussion above, we see this implies not only that X
+executes before Y, but also (if X is a store) that X propagates to
+every CPU before Y executes.  Thus rcu-fence is sort of a
+"super-strong" fence: Unlike the original strong fences (smp_mb() and
+synchronize_rcu()), rcu-fence is able to link events on different
+CPUs.  (Perhaps this fact should lead us to say that rcu-fence isn't
+really a fence at all!)

 Finally, the LKMM defines the RCU-before (rb) relation in terms of
 rcu-fence.  This is done in essentially the same way as the pb
@@ -1596,7 +1619,7 @@ before F, just as E ->pb F does (and for much the same reasons).
 Putting this all together, the LKMM expresses the Grace Period
 Guarantee by requiring that the rb relation does not contain a cycle.
 Equivalently, this "rcu" axiom requires that there are no events E
-and F with E ->rcu-link F ->rcu-fence E.  Or to put it a third way,
+and F with E ->rcu-link F ->rcu-order E.  Or to put it a third way,
 the axiom requires that there are no cycles consisting of rcu-gp and
 rcu-rscsi alternating with rcu-link, where the number of rcu-gp links
 is >= the number of rcu-rscsi links.
@@ -1750,7 +1773,7 @@ addition to normal RCU.  The ideas involved are much the same as
 above, with new relations srcu-gp and srcu-rscsi added to represent
 SRCU grace periods and read-side critical sections.  There is a
 restriction on the srcu-gp and srcu-rscsi links that can appear in an
-rcu-fence sequence (the srcu-rscsi links must be paired with srcu-gp
+rcu-order sequence (the srcu-rscsi links must be paired with srcu-gp
 links having the same SRCU domain with proper nesting); the details
 are relatively unimportant.

@ -1896,6 +1919,521 @@ architectures supported by the Linux kernel, albeit for various
|
||||
differing reasons.
|
||||
|
||||
|
||||
PLAIN ACCESSES AND DATA RACES
|
||||
-----------------------------
|
||||
|
||||
In the LKMM, memory accesses such as READ_ONCE(x), atomic_inc(&y),
|
||||
smp_load_acquire(&z), and so on are collectively referred to as
|
||||
"marked" accesses, because they are all annotated with special
|
||||
operations of one kind or another. Ordinary C-language memory
|
||||
accesses such as x or y = 0 are simply called "plain" accesses.
|
||||
|
||||
Early versions of the LKMM had nothing to say about plain accesses.
|
||||
The C standard allows compilers to assume that the variables affected
|
||||
by plain accesses are not concurrently read or written by any other
|
||||
threads or CPUs. This leaves compilers free to implement all manner
|
||||
of transformations or optimizations of code containing plain accesses,
|
||||
making such code very difficult for a memory model to handle.
|
||||
|
||||
Here is just one example of a possible pitfall:
|
||||
|
||||
	int a = 6;
	int *x = &a;

	P0()
	{
		int *r1;
		int r2 = 0;

		r1 = x;
		if (r1 != NULL)
			r2 = READ_ONCE(*r1);
	}

	P1()
	{
		WRITE_ONCE(x, NULL);
	}
|
||||
|
||||
On the face of it, one would expect that when this code runs, the only
|
||||
possible final values for r2 are 6 and 0, depending on whether or not
|
||||
P1's store to x propagates to P0 before P0's load from x executes.
|
||||
But since P0's load from x is a plain access, the compiler may decide
|
||||
to carry out the load twice (for the comparison against NULL, then again
|
||||
for the READ_ONCE()) and eliminate the temporary variable r1. The
|
||||
object code generated for P0 could therefore end up looking rather
|
||||
like this:
|
||||
|
||||
	P0()
	{
		int r2 = 0;

		if (x != NULL)
			r2 = READ_ONCE(*x);
	}
|
||||
|
||||
And now it is obvious that this code runs the risk of dereferencing a
|
||||
NULL pointer, because P1's store to x might propagate to P0 after the
|
||||
test against NULL has been made but before the READ_ONCE() executes.
|
||||
If the original code had said "r1 = READ_ONCE(x)" instead of "r1 = x",
|
||||
the compiler would not have performed this optimization and there
|
||||
would be no possibility of a NULL-pointer dereference.
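
The corrected version of P0 can be sketched in user-space C as follows.
This is only an illustration: the READ_ONCE() and WRITE_ONCE() macros
below are simplified volatile-cast stand-ins (an assumption made for
this sketch, not the kernel's actual definitions), __typeof__ is a
GCC/Clang extension, and null_deref_demo() is a hypothetical helper
that runs P0 and P1 one after the other instead of concurrently.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Simplified stand-ins for the kernel macros (assumption for this sketch).
 * The volatile cast obliges the compiler to emit exactly one access.
 */
#define READ_ONCE(v)     (*(volatile __typeof__(v) *)&(v))
#define WRITE_ONCE(v, n) (*(volatile __typeof__(v) *)&(v) = (n))

static int a = 6;
static int *x = &a;

/* P0 with the load of x marked: the pointer is fetched exactly once. */
static int p0_marked(void)
{
	int *r1;
	int r2 = 0;

	r1 = READ_ONCE(x);	/* one load; the NULL test and the dereference share it */
	if (r1 != NULL)
		r2 = READ_ONCE(*r1);
	return r2;
}

/* P1: clear the shared pointer with a marked store. */
static void p1(void)
{
	WRITE_ONCE(x, NULL);
}

/* Hypothetical sequential walk-through: r2 is 6 before P1 runs, 0 after. */
static void null_deref_demo(void)
{
	assert(p0_marked() == 6);
	p1();
	assert(p0_marked() == 0);
}
```

Because r1 is loaded once, the test against NULL and the dereference
always see the same pointer value, whichever way the two functions
interleave.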
|
||||
|
||||
Given the possibility of transformations like this one, the LKMM
|
||||
doesn't try to predict all possible outcomes of code containing plain
|
||||
accesses. It is instead content to determine whether the code
|
||||
violates the compiler's assumptions, which would render the ultimate
|
||||
outcome undefined.
|
||||
|
||||
In technical terms, the compiler is allowed to assume that when the
|
||||
program executes, there will not be any data races. A "data race"
|
||||
occurs when two conflicting memory accesses execute concurrently;
|
||||
two memory accesses "conflict" if:
|
||||
|
||||
they access the same location,
|
||||
|
||||
they occur on different CPUs (or in different threads on the
|
||||
same CPU),
|
||||
|
||||
at least one of them is a plain access,
|
||||
|
||||
and at least one of them is a store.
|
||||
|
||||
The LKMM tries to determine whether a program contains two conflicting
|
||||
accesses which may execute concurrently; if it does then the LKMM says
|
||||
there is a potential data race and makes no predictions about the
|
||||
program's outcome.
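
The four conditions for a conflict translate directly into a predicate.
The sketch below is illustrative only; struct access and its field
names are hypothetical and are not part of the LKMM or the kernel.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical description of one memory access, for illustration only. */
struct access {
	const void *loc;	/* location accessed */
	int thread;		/* CPU or thread performing the access */
	bool is_store;		/* true for a store, false for a load */
	bool is_plain;		/* true for a plain access, false for a marked one */
};

/* Two accesses conflict only if all four conditions from the text hold. */
static bool accesses_conflict(struct access a, struct access b)
{
	return a.loc == b.loc &&
	       a.thread != b.thread &&
	       (a.is_plain || b.is_plain) &&
	       (a.is_store || b.is_store);
}

/* Hypothetical walk-through of a few cases. */
static void conflict_demo(void)
{
	int v;
	struct access plain_store  = { &v, 0, true,  true  };
	struct access marked_load  = { &v, 1, false, false };
	struct access plain_load   = { &v, 1, false, true  };
	struct access marked_store = { &v, 0, true,  false };

	assert(accesses_conflict(plain_store, marked_load));	/* plain store vs. load */
	assert(!accesses_conflict(marked_store, marked_load));	/* both marked */
	assert(!accesses_conflict(plain_load, marked_load));	/* same thread, no store */
}
```

Note that the predicate says nothing about concurrency; a conflict
becomes a data race only if the two accesses may also execute
concurrently.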
|
||||
|
||||
Determining whether two accesses conflict is easy; you can see that
|
||||
all the concepts involved in the definition above are already part of
|
||||
the memory model. The hard part is telling whether they may execute
|
||||
concurrently. The LKMM takes a conservative attitude, assuming that
|
||||
accesses may be concurrent unless it can prove they cannot.
|
||||
|
||||
If two memory accesses aren't concurrent then one must execute before
|
||||
the other. Therefore the LKMM decides two accesses aren't concurrent
|
||||
if they can be connected by a sequence of hb, pb, and rb links
|
||||
(together referred to as xb, for "executes before"). However, there
|
||||
are two complicating factors.
|
||||
|
||||
If X is a load and X executes before a store Y, then indeed there is
|
||||
no danger of X and Y being concurrent. After all, Y can't have any
|
||||
effect on the value obtained by X until the memory subsystem has
|
||||
propagated Y from its own CPU to X's CPU, which won't happen until
|
||||
some time after Y executes and thus after X executes. But if X is a
|
||||
store, then even if X executes before Y it is still possible that X
|
||||
will propagate to Y's CPU just as Y is executing. In such a case X
|
||||
could very well interfere somehow with Y, and we would have to
|
||||
consider X and Y to be concurrent.
|
||||
|
||||
Therefore when X is a store, for X and Y to be non-concurrent the LKMM
|
||||
requires not only that X must execute before Y but also that X must
|
||||
propagate to Y's CPU before Y executes. (Or vice versa, of course, if
|
||||
Y executes before X -- then Y must propagate to X's CPU before X
|
||||
executes if Y is a store.) This is expressed by the visibility
|
||||
relation (vis), where X ->vis Y is defined to hold if there is an
|
||||
intermediate event Z such that:
|
||||
|
||||
X is connected to Z by a possibly empty sequence of
|
||||
cumul-fence links followed by an optional rfe link (if none of
|
||||
these links are present, X and Z are the same event),
|
||||
|
||||
and either:
|
||||
|
||||
Z is connected to Y by a strong-fence link followed by a
|
||||
possibly empty sequence of xb links,
|
||||
|
||||
or:
|
||||
|
||||
Z is on the same CPU as Y and is connected to Y by a possibly
|
||||
empty sequence of xb links (again, if the sequence is empty it
|
||||
means Z and Y are the same event).
|
||||
|
||||
The motivations behind this definition are straightforward:
|
||||
|
||||
cumul-fence memory barriers force stores that are po-before
|
||||
the barrier to propagate to other CPUs before stores that are
|
||||
po-after the barrier.
|
||||
|
||||
An rfe link from an event W to an event R says that R reads
|
||||
from W, which certainly means that W must have propagated to
|
||||
R's CPU before R executed.
|
||||
|
||||
strong-fence memory barriers force stores that are po-before
|
||||
the barrier, or that propagate to the barrier's CPU before the
|
||||
barrier executes, to propagate to all CPUs before any events
|
||||
po-after the barrier can execute.
|
||||
|
||||
To see how this works out in practice, consider our old friend, the MP
|
||||
pattern (with fences and statement labels, but without the conditional
|
||||
test):
|
||||
|
||||
	int buf = 0, flag = 0;

	P0()
	{
		X: WRITE_ONCE(buf, 1);
		   smp_wmb();
		W: WRITE_ONCE(flag, 1);
	}

	P1()
	{
		int r1;
		int r2 = 0;

		Z: r1 = READ_ONCE(flag);
		   smp_rmb();
		Y: r2 = READ_ONCE(buf);
	}
|
||||
|
||||
The smp_wmb() memory barrier gives a cumul-fence link from X to W, and
|
||||
assuming r1 = 1 at the end, there is an rfe link from W to Z. This
|
||||
means that the store to buf must propagate from P0 to P1 before Z
|
||||
executes. Next, Z and Y are on the same CPU and the smp_rmb() fence
|
||||
provides an xb link from Z to Y (i.e., it forces Z to execute before
|
||||
Y). Therefore we have X ->vis Y: X must propagate to Y's CPU before Y
|
||||
executes.
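
The definition of vis can be checked mechanically on this example.  The
sketch below is a toy model, not the LKMM's actual implementation: it
encodes only the links named in the text as 4x4 boolean matrices, and
the names rel, compose(), star(), compute_vis() and mp_x_vis_y() are
all hypothetical.  It computes

	cumul-fence* ; rfe? ; ((strong-fence ; xb*) | (xb* on the same CPU))

and tests whether X ->vis Y holds.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

enum { X, W, Z, Y, N };		/* the four labelled events of the MP example */
typedef bool rel[N][N];		/* r[a][b] is true when a ->r b */

/* out = a ; b (relational composition) */
static void compose(rel a, rel b, rel out)
{
	rel tmp = {{false}};
	int i, j, k;

	for (i = 0; i < N; i++)
		for (j = 0; j < N; j++)
			for (k = 0; k < N; k++)
				if (a[i][k] && b[k][j])
					tmp[i][j] = true;
	memcpy(out, tmp, sizeof(rel));
}

/* out = r* (reflexive-transitive closure, Warshall's algorithm) */
static void star(rel r, rel out)
{
	rel c;
	int i, j, k;

	memcpy(c, r, sizeof(rel));
	for (i = 0; i < N; i++)
		c[i][i] = true;
	for (k = 0; k < N; k++)
		for (i = 0; i < N; i++)
			for (j = 0; j < N; j++)
				if (c[i][k] && c[k][j])
					c[i][j] = true;
	memcpy(out, c, sizeof(rel));
}

/* vis = cumul-fence* ; rfe? ; ((strong-fence ; xb*) | (xb* & same CPU)) */
static void compute_vis(rel cumul_fence, rel rfe, rel strong_fence, rel xb,
			const int cpu[N], rel vis)
{
	rel cf_star, rfe_opt, xb_star, to_z, via_strong, z_to_y;
	int i, j;

	star(cumul_fence, cf_star);
	memcpy(rfe_opt, rfe, sizeof(rel));
	for (i = 0; i < N; i++)
		rfe_opt[i][i] = true;		/* the rfe link is optional */
	star(xb, xb_star);
	compose(cf_star, rfe_opt, to_z);	/* X to the intermediate event */
	compose(strong_fence, xb_star, via_strong);
	for (i = 0; i < N; i++)
		for (j = 0; j < N; j++)
			z_to_y[i][j] = via_strong[i][j] ||
				       (xb_star[i][j] && cpu[i] == cpu[j]);
	compose(to_z, z_to_y, vis);
}

/* Encode the MP example; r1_is_1 selects whether Z reads from W. */
static bool mp_x_vis_y(bool r1_is_1)
{
	rel cumul_fence = {{false}}, rfe = {{false}};
	rel strong_fence = {{false}}, xb = {{false}}, vis;
	const int cpu[N] = { [X] = 0, [W] = 0, [Z] = 1, [Y] = 1 };

	cumul_fence[X][W] = true;	/* smp_wmb() between X and W */
	if (r1_is_1)
		rfe[W][Z] = true;	/* Z reads the value stored by W */
	xb[Z][Y] = true;		/* smp_rmb() forces Z before Y */
	compute_vis(cumul_fence, rfe, strong_fence, xb, cpu, vis);
	return vis[X][Y];
}

/* Hypothetical check: vis holds exactly when the rfe link is present. */
static void vis_demo(void)
{
	assert(mp_x_vis_y(true));
	assert(!mp_x_vis_y(false));
}
```

With r1 = 1 the chain X ->cumul-fence W ->rfe Z ->xb Y exists and
X ->vis Y holds; without the rfe link the chain is broken and the model
gives no visibility guarantee.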
|
||||
|
||||
The second complicating factor mentioned above arises from the fact
|
||||
that when we are considering data races, some of the memory accesses
|
||||
are plain. Now, although we have not said so explicitly, up to this
|
||||
point most of the relations defined by the LKMM (ppo, hb, prop,
|
||||
cumul-fence, pb, and so on -- including vis) apply only to marked
|
||||
accesses.
|
||||
|
||||
There are good reasons for this restriction. The compiler is not
|
||||
allowed to apply fancy transformations to marked accesses, and
|
||||
consequently each such access in the source code corresponds more or
|
||||
less directly to a single machine instruction in the object code. But
|
||||
plain accesses are a different story; the compiler may combine them,
|
||||
split them up, duplicate them, eliminate them, invent new ones, and
|
||||
who knows what else. Seeing a plain access in the source code tells
|
||||
you almost nothing about what machine instructions will end up in the
|
||||
object code.
|
||||
|
||||
Fortunately, the compiler isn't completely free; it is subject to some
|
||||
limitations. For one, it is not allowed to introduce a data race into
|
||||
the object code if the source code does not already contain a data
|
||||
race (if it could, memory models would be useless and no multithreaded
|
||||
code would be safe!). For another, it cannot move a plain access past
|
||||
a compiler barrier.
|
||||
|
||||
A compiler barrier is a kind of fence, but as the name implies, it
|
||||
only affects the compiler; it does not necessarily have any effect on
|
||||
how instructions are executed by the CPU. In Linux kernel source
|
||||
code, the barrier() function is a compiler barrier. It doesn't give
|
||||
rise directly to any machine instructions in the object code; rather,
|
||||
it affects how the compiler generates the rest of the object code.
|
||||
Given source code like this:
|
||||
|
||||
	... some memory accesses ...
	barrier();
	... some other memory accesses ...
|
||||
|
||||
the barrier() function ensures that the machine instructions
|
||||
corresponding to the first group of accesses will all end up po-before
|
||||
any machine instructions corresponding to the second group of accesses
|
||||
-- even if some of the accesses are plain. (Of course, the CPU may
|
||||
then execute some of those accesses out of program order, but we
|
||||
already know how to deal with such issues.) Without the barrier()
|
||||
there would be no such guarantee; the two groups of accesses could be
|
||||
intermingled or even reversed in the object code.
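
As a concrete illustration, a compiler barrier for GCC-compatible
compilers is conventionally written as an empty asm statement with a
"memory" clobber.  The sketch below assumes GCC or Clang; publish(),
data and ready are hypothetical names used only for this example.

```c
#include <assert.h>

/*
 * Compiler barrier for GCC-compatible compilers: emits no instructions,
 * but the "memory" clobber prevents the compiler from moving memory
 * accesses across it.  It does NOT constrain the CPU.
 */
#define barrier() __asm__ __volatile__("" : : : "memory")

static int data;
static int ready;

static void publish(int value)
{
	data = value;	/* first group of accesses */
	barrier();	/* the compiler may not reorder across this point */
	ready = 1;	/* second group of accesses */
}

/* Hypothetical single-threaded check that both stores take effect. */
static void barrier_demo(void)
{
	publish(42);
	assert(data == 42 && ready == 1);
}
```

On its own this orders only the generated instructions.  Code that also
needs the CPU to respect the order must use a real fence such as
smp_wmb().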
|
||||
|
||||
The LKMM doesn't say much about the barrier() function, but it does
|
||||
require that all fences are also compiler barriers. In addition, it
|
||||
requires that the ordering properties of memory barriers such as
|
||||
smp_rmb() or smp_store_release() apply to plain accesses as well as to
|
||||
marked accesses.
|
||||
|
||||
This is the key to analyzing data races. Consider the MP pattern
|
||||
again, now using plain accesses for buf:
|
||||
|
||||
	int buf = 0, flag = 0;

	P0()
	{
		U: buf = 1;
		   smp_wmb();
		X: WRITE_ONCE(flag, 1);
	}

	P1()
	{
		int r1;
		int r2 = 0;

		Y: r1 = READ_ONCE(flag);
		   if (r1) {
			   smp_rmb();
			V: r2 = buf;
		   }
	}
|
||||
|
||||
This program does not contain a data race. Although the U and V
|
||||
accesses conflict, the LKMM can prove they are not concurrent as
|
||||
follows:
|
||||
|
||||
The smp_wmb() fence in P0 is both a compiler barrier and a
|
||||
cumul-fence. It guarantees that no matter what hash of
|
||||
machine instructions the compiler generates for the plain
|
||||
access U, all those instructions will be po-before the fence.
|
||||
Consequently U's store to buf, no matter how it is carried out
|
||||
at the machine level, must propagate to P1 before X's store to
|
||||
flag does.
|
||||
|
||||
X and Y are both marked accesses. Hence an rfe link from X to
|
||||
Y is a valid indicator that X propagated to P1 before Y
|
||||
executed, i.e., X ->vis Y. (And if there is no rfe link then
|
||||
r1 will be 0, so V will not be executed and ipso facto won't
|
||||
race with U.)
|
||||
|
||||
The smp_rmb() fence in P1 is a compiler barrier as well as a
|
||||
fence. It guarantees that all the machine-level instructions
|
||||
corresponding to the access V will be po-after the fence, and
|
||||
therefore any loads among those instructions will execute
|
||||
after the fence does and hence after Y does.
|
||||
|
||||
Thus U's store to buf is forced to propagate to P1 before V's load
|
||||
executes (assuming V does execute), ruling out the possibility of a
|
||||
data race between them.
|
||||
|
||||
This analysis illustrates how the LKMM deals with plain accesses in
|
||||
general. Suppose R is a plain load and we want to show that R
|
||||
executes before some marked access E. We can do this by finding a
|
||||
marked access X such that R and X are ordered by a suitable fence and
|
||||
X ->xb* E. If E was also a plain access, we would also look for a
|
||||
marked access Y such that X ->xb* Y, and Y and E are ordered by a
|
||||
fence. We describe this arrangement by saying that R is
|
||||
"post-bounded" by X and E is "pre-bounded" by Y.
|
||||
|
||||
In fact, we go one step further: Since R is a read, we say that R is
"r-post-bounded" by X.  Similarly, E would be "r-pre-bounded" or
"w-pre-bounded" by Y, depending on whether E was a load or a store.
|
||||
This distinction is needed because some fences affect only loads
|
||||
(i.e., smp_rmb()) and some affect only stores (smp_wmb()); otherwise
|
||||
the two types of bounds are the same. And as a degenerate case, we
|
||||
say that a marked access pre-bounds and post-bounds itself (e.g., if R
|
||||
above were a marked load then X could simply be taken to be R itself).
|
||||
|
||||
The need to distinguish between r- and w-bounding raises yet another
|
||||
issue. When the source code contains a plain store, the compiler is
|
||||
allowed to put plain loads of the same location into the object code.
|
||||
For example, given the source code:
|
||||
|
||||
x = 1;
|
||||
|
||||
the compiler is theoretically allowed to generate object code that
|
||||
looks like:
|
||||
|
||||
	if (x != 1)
		x = 1;
|
||||
|
||||
thereby adding a load (and possibly replacing the store entirely).
|
||||
For this reason, whenever the LKMM requires a plain store to be
|
||||
w-pre-bounded or w-post-bounded by a marked access, it also requires
|
||||
the store to be r-pre-bounded or r-post-bounded, so as to handle cases
|
||||
where the compiler adds a load.
|
||||
|
||||
(This may be overly cautious. We don't know of any examples where a
|
||||
compiler has augmented a store with a load in this fashion, and the
|
||||
Linux kernel developers would probably fight pretty hard to change a
|
||||
compiler if it ever did this. Still, better safe than sorry.)
|
||||
|
||||
Incidentally, the other transformation -- augmenting a plain load by
|
||||
adding in a store to the same location -- is not allowed. This is
|
||||
because the compiler cannot know whether any other CPUs might perform
|
||||
a concurrent load from that location. Two concurrent loads don't
|
||||
constitute a race (they can't interfere with each other), but a store
|
||||
does race with a concurrent load. Thus adding a store might create a
|
||||
data race where one was not already present in the source code,
|
||||
something the compiler is forbidden to do. Augmenting a store with a
|
||||
load, on the other hand, is acceptable because doing so won't create a
|
||||
data race unless one already existed.
|
||||
|
||||
The LKMM includes a second way to pre-bound plain accesses, in
|
||||
addition to fences: an address dependency from a marked load. That
|
||||
is, in the sequence:
|
||||
|
||||
	p = READ_ONCE(ptr);
	r = *p;
|
||||
|
||||
the LKMM says that the marked load of ptr pre-bounds the plain load of
|
||||
*p; the marked load must execute before any of the machine
|
||||
instructions corresponding to the plain load. This is a reasonable
|
||||
stipulation, since after all, the CPU can't perform the load of *p
|
||||
until it knows what value p will hold. Furthermore, without some
|
||||
assumption like this one, some usages typical of RCU would count as
|
||||
data races. For example:
|
||||
|
||||
	int a = 1, b;
	int *ptr = &a;

	P0()
	{
		b = 2;
		rcu_assign_pointer(ptr, &b);
	}

	P1()
	{
		int *p;
		int r;

		rcu_read_lock();
		p = rcu_dereference(ptr);
		r = *p;
		rcu_read_unlock();
	}
|
||||
|
||||
(In this example the rcu_read_lock() and rcu_read_unlock() calls don't
|
||||
really do anything, because there aren't any grace periods. They are
|
||||
included merely for the sake of good form; typically P0 would call
|
||||
synchronize_rcu() somewhere after the rcu_assign_pointer().)
|
||||
|
||||
rcu_assign_pointer() performs a store-release, so the plain store to b
|
||||
is definitely w-post-bounded before the store to ptr, and the two
|
||||
stores will propagate to P1 in that order. However, rcu_dereference()
|
||||
is only equivalent to READ_ONCE(). While it is a marked access, it is
|
||||
not a fence or compiler barrier. Hence the only guarantee we have
|
||||
that the load of ptr in P1 is r-pre-bounded before the load of *p
|
||||
(thus avoiding a race) is the assumption about address dependencies.
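
The publish/subscribe pattern above can be approximated in user-space
C11.  This is a sketch under stated assumptions, not the kernel's
implementation: a release store stands in for rcu_assign_pointer(), a
memory_order_consume load stands in for rcu_dereference() (current
compilers typically treat consume as acquire), and publish_b(),
read_current() and rcu_publish_demo() are hypothetical names.  The
rcu_read_lock()/rcu_read_unlock() calls are omitted because, as noted
above, they do nothing in this example.

```c
#include <assert.h>
#include <stdatomic.h>

static int a = 1, b;
static _Atomic(int *) ptr = &a;

/* P0: initialize b with a plain store, then publish it. */
static void publish_b(void)
{
	b = 2;	/* plain store, ordered before the release store below */
	atomic_store_explicit(&ptr, &b, memory_order_release);
}

/* P1: marked load of the pointer, then a dependent plain load. */
static int read_current(void)
{
	int *p = atomic_load_explicit(&ptr, memory_order_consume);

	return *p;	/* address dependency on the load of ptr */
}

/* Hypothetical sequential walk-through of the two outcomes. */
static void rcu_publish_demo(void)
{
	assert(read_current() == 1);	/* before publication: reads a */
	publish_b();
	assert(read_current() == 2);	/* after publication: reads b */
}
```

A reader sees either the old object (value 1) or the fully initialized
new one (value 2), never a partially initialized b.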
|
||||
|
||||
This is a situation where the compiler can undermine the memory model,
|
||||
and a certain amount of care is required when programming constructs
|
||||
like this one. In particular, comparisons between the pointer and
|
||||
other known addresses can cause trouble. If you have something like:
|
||||
|
||||
	p = rcu_dereference(ptr);
	if (p == &x)
		r = *p;
|
||||
|
||||
then the compiler just might generate object code resembling:
|
||||
|
||||
	p = rcu_dereference(ptr);
	if (p == &x)
		r = x;
|
||||
|
||||
or even:
|
||||
|
||||
	rtemp = x;
	p = rcu_dereference(ptr);
	if (p == &x)
		r = rtemp;
|
||||
|
||||
which would invalidate the memory model's assumption, since the CPU
|
||||
could now perform the load of x before the load of ptr (there might be
|
||||
a control dependency but no address dependency at the machine level).
|
||||
|
||||
Finally, it turns out there is a situation in which a plain write does
|
||||
not need to be w-post-bounded: when it is separated from the
|
||||
conflicting access by a fence. At first glance this may seem
|
||||
impossible. After all, to be conflicting the second access has to be
|
||||
on a different CPU from the first, and fences don't link events on
|
||||
different CPUs. Well, normal fences don't -- but rcu-fence can!
|
||||
Here's an example:
|
||||
|
||||
	int x, y;

	P0()
	{
		WRITE_ONCE(x, 1);
		synchronize_rcu();
		y = 3;
	}

	P1()
	{
		rcu_read_lock();
		if (READ_ONCE(x) == 0)
			y = 2;
		rcu_read_unlock();
	}
|
||||
|
||||
Do the plain stores to y race? Clearly not if P1 reads a non-zero
|
||||
value for x, so let's assume the READ_ONCE(x) does obtain 0. This
|
||||
means that the read-side critical section in P1 must finish executing
|
||||
before the grace period in P0 does, because RCU's Grace-Period
|
||||
Guarantee says that otherwise P0's store to x would have propagated to
|
||||
P1 before the critical section started and so would have been visible
|
||||
to the READ_ONCE(). (Another way of putting it is that the fre link
|
||||
from the READ_ONCE() to the WRITE_ONCE() gives rise to an rcu-link
|
||||
between those two events.)
|
||||
|
||||
This means there is an rcu-fence link from P1's "y = 2" store to P0's
|
||||
"y = 3" store, and consequently the first must propagate from P1 to P0
|
||||
before the second can execute. Therefore the two stores cannot be
|
||||
concurrent and there is no race, even though P1's plain store to y
|
||||
isn't w-post-bounded by any marked accesses.
|
||||
|
||||
Putting all this material together yields the following picture. For
|
||||
two conflicting stores W and W', where W ->co W', the LKMM says the
|
||||
stores don't race if W can be linked to W' by a
|
||||
|
||||
w-post-bounded ; vis ; w-pre-bounded
|
||||
|
||||
sequence. If W is plain then they also have to be linked by an
|
||||
|
||||
r-post-bounded ; xb* ; w-pre-bounded
|
||||
|
||||
sequence, and if W' is plain then they also have to be linked by a
|
||||
|
||||
w-post-bounded ; vis ; r-pre-bounded
|
||||
|
||||
sequence. For a conflicting load R and store W, the LKMM says the two
|
||||
accesses don't race if R can be linked to W by an
|
||||
|
||||
r-post-bounded ; xb* ; w-pre-bounded
|
||||
|
||||
sequence or if W can be linked to R by a
|
||||
|
||||
w-post-bounded ; vis ; r-pre-bounded
|
||||
|
||||
sequence. For the cases involving a vis link, the LKMM also accepts
|
||||
sequences in which W is linked to W' or R by a
|
||||
|
||||
strong-fence ; xb* ; {w and/or r}-pre-bounded
|
||||
|
||||
sequence with no post-bounding, and in every case the LKMM also allows
|
||||
the link simply to be a fence with no bounding at all. If no sequence
|
||||
of the appropriate sort exists, the LKMM says that the accesses race.
|
||||
|
||||
There is one more part of the LKMM related to plain accesses (although
|
||||
not to data races) we should discuss. Recall that many relations such
|
||||
as hb are limited to marked accesses only. As a result, the
|
||||
happens-before, propagates-before, and rcu axioms (which state that
|
||||
various relations must not contain a cycle) don't apply to plain
|
||||
accesses. Nevertheless, we do want to rule out such cycles, because
|
||||
they don't make sense even for plain accesses.
|
||||
|
||||
To this end, the LKMM imposes three extra restrictions, together
|
||||
called the "plain-coherence" axiom because of their resemblance to the
|
||||
rules used by the operational model to ensure cache coherence (that
|
||||
is, the rules governing the memory subsystem's choice of a store to
|
||||
satisfy a load request and its determination of where a store will
|
||||
fall in the coherence order):
|
||||
|
||||
If R and W conflict and it is possible to link R to W by one
|
||||
of the xb* sequences listed above, then W ->rfe R is not
|
||||
allowed (i.e., a load cannot read from a store that it
|
||||
executes before, even if one or both is plain).
|
||||
|
||||
If W and R conflict and it is possible to link W to R by one
|
||||
of the vis sequences listed above, then R ->fre W is not
|
||||
allowed (i.e., if a store is visible to a load then the load
|
||||
must read from that store or one coherence-after it).
|
||||
|
||||
If W and W' conflict and it is possible to link W to W' by one
|
||||
of the vis sequences listed above, then W' ->co W is not
|
||||
allowed (i.e., if one store is visible to a second then the
|
||||
second must come after the first in the coherence order).
|
||||
|
||||
This is the extent to which the LKMM deals with plain accesses.
|
||||
Perhaps it could say more (for example, plain accesses might
|
||||
contribute to the ppo relation), but at the moment it seems that this
|
||||
minimal, conservative approach is good enough.
|
||||
|
||||
|
||||
ODDS AND ENDS
|
||||
-------------
|
||||
|
||||
@ -1943,6 +2481,16 @@ treated as READ_ONCE() and rcu_assign_pointer() is treated as
|
||||
smp_store_release() -- which is basically how the Linux kernel treats
|
||||
them.
|
||||
|
||||
Although we said that plain accesses are not linked by the ppo
|
||||
relation, they do contribute to it indirectly. Namely, when there is
|
||||
an address dependency from a marked load R to a plain store W,
|
||||
followed by smp_wmb() and then a marked store W', the LKMM creates a
|
||||
ppo link from R to W'. The reasoning behind this is perhaps a little
|
||||
shaky, but essentially it says there is no way to generate object code
|
||||
for this source code in which W' could execute before R. Just as with
|
||||
pre-bounding by address dependencies, it is possible for the compiler
|
||||
to undermine this relation if sufficient care is not taken.
|
||||
|
||||
There are a few oddball fences which need special treatment:
|
||||
smp_mb__before_atomic(), smp_mb__after_atomic(), and
|
||||
smp_mb__after_spinlock(). The LKMM uses fence events with special
|
||||
|
@ -197,7 +197,7 @@ empty (wr-incoh | rw-incoh | ww-incoh) as plain-coherence
|
||||
(* Actual races *)
|
||||
let ww-nonrace = ww-vis & ((Marked * W) | rw-xbstar) & ((W * Marked) | wr-vis)
|
||||
let ww-race = (pre-race & co) \ ww-nonrace
|
||||
let wr-race = (pre-race & (co? ; rf)) \ wr-vis
|
||||
let wr-race = (pre-race & (co? ; rf)) \ wr-vis \ rw-xbstar^-1
|
||||
let rw-race = (pre-race & fr) \ rw-xbstar
|
||||
|
||||
flag ~empty (ww-race | wr-race | rw-race) as data-race
|
||||
|
@ -1,8 +1,5 @@
|
||||
CONFIG_SMP=y
|
||||
CONFIG_NR_CPUS=2
|
||||
CONFIG_HOTPLUG_CPU=n
|
||||
CONFIG_SUSPEND=n
|
||||
CONFIG_HIBERNATION=n
|
||||
CONFIG_PREEMPT_NONE=n
|
||||
CONFIG_PREEMPT_VOLUNTARY=n
|
||||
CONFIG_PREEMPT=y
|
||||
|
@ -9,9 +9,6 @@ CONFIG_NO_HZ_IDLE=y
|
||||
CONFIG_NO_HZ_FULL=n
|
||||
CONFIG_RCU_FAST_NO_HZ=n
|
||||
CONFIG_RCU_TRACE=n
|
||||
CONFIG_HOTPLUG_CPU=n
|
||||
CONFIG_SUSPEND=n
|
||||
CONFIG_HIBERNATION=n
|
||||
CONFIG_RCU_FANOUT=3
|
||||
CONFIG_RCU_FANOUT_LEAF=3
|
||||
CONFIG_RCU_NOCB_CPU=n
|
||||
|
@ -9,9 +9,6 @@ CONFIG_NO_HZ_IDLE=n
|
||||
CONFIG_NO_HZ_FULL=y
|
||||
CONFIG_RCU_FAST_NO_HZ=y
|
||||
CONFIG_RCU_TRACE=y
|
||||
CONFIG_HOTPLUG_CPU=n
|
||||
CONFIG_SUSPEND=n
|
||||
CONFIG_HIBERNATION=n
|
||||
CONFIG_RCU_FANOUT=4
|
||||
CONFIG_RCU_FANOUT_LEAF=3
|
||||
CONFIG_DEBUG_LOCK_ALLOC=n
|
||||
|
@ -9,9 +9,6 @@ CONFIG_NO_HZ_IDLE=y
|
||||
CONFIG_NO_HZ_FULL=n
|
||||
CONFIG_RCU_FAST_NO_HZ=n
|
||||
CONFIG_RCU_TRACE=n
|
||||
CONFIG_HOTPLUG_CPU=n
|
||||
CONFIG_SUSPEND=n
|
||||
CONFIG_HIBERNATION=n
|
||||
CONFIG_RCU_FANOUT=6
|
||||
CONFIG_RCU_FANOUT_LEAF=6
|
||||
CONFIG_RCU_NOCB_CPU=n
|
||||
|
@ -9,9 +9,6 @@ CONFIG_NO_HZ_IDLE=y
|
||||
CONFIG_NO_HZ_FULL=n
|
||||
CONFIG_RCU_FAST_NO_HZ=n
|
||||
CONFIG_RCU_TRACE=n
|
||||
CONFIG_HOTPLUG_CPU=n
|
||||
CONFIG_SUSPEND=n
|
||||
CONFIG_HIBERNATION=n
|
||||
CONFIG_RCU_FANOUT=3
|
||||
CONFIG_RCU_FANOUT_LEAF=2
|
||||
CONFIG_RCU_NOCB_CPU=y
|
||||
|
@ -8,9 +8,6 @@ CONFIG_HZ_PERIODIC=n
|
||||
CONFIG_NO_HZ_IDLE=y
|
||||
CONFIG_NO_HZ_FULL=n
|
||||
CONFIG_RCU_TRACE=n
|
||||
CONFIG_HOTPLUG_CPU=n
|
||||
CONFIG_SUSPEND=n
|
||||
CONFIG_HIBERNATION=n
|
||||
CONFIG_RCU_NOCB_CPU=n
|
||||
CONFIG_DEBUG_LOCK_ALLOC=n
|
||||
CONFIG_RCU_BOOST=n
|
||||
|
@ -6,9 +6,6 @@ CONFIG_PREEMPT=n
|
||||
CONFIG_HZ_PERIODIC=n
|
||||
CONFIG_NO_HZ_IDLE=y
|
||||
CONFIG_NO_HZ_FULL=n
|
||||
CONFIG_HOTPLUG_CPU=n
|
||||
CONFIG_SUSPEND=n
|
||||
CONFIG_HIBERNATION=n
|
||||
CONFIG_DEBUG_LOCK_ALLOC=n
|
||||
CONFIG_DEBUG_OBJECTS_RCU_HEAD=n
|
||||
CONFIG_RCU_EXPERT=y
|
||||
|
@ -6,7 +6,6 @@ Kconfig Parameters:
|
||||
|
||||
CONFIG_DEBUG_LOCK_ALLOC -- Do three, covering CONFIG_PROVE_LOCKING & not.
|
||||
CONFIG_DEBUG_OBJECTS_RCU_HEAD -- Do one.
|
||||
CONFIG_HOTPLUG_CPU -- Do half. (Every second.)
|
||||
CONFIG_HZ_PERIODIC -- Do one.
|
||||
CONFIG_NO_HZ_IDLE -- Do those not otherwise specified. (Groups of two.)
|
||||
CONFIG_NO_HZ_FULL -- Do two, one with partial CPU enablement.
|
||||
|