Merge 'master' into 'os-build'
commit
ba00cd785c
@ -100,6 +100,17 @@ Description:
|
||||
This attribute indicates the mode that the irq vector named by
|
||||
the file is in (msi vs. msix)
|
||||
|
||||
What: /sys/bus/pci/devices/.../irq
|
||||
Date: August 2021
|
||||
Contact: Linux PCI developers <linux-pci@vger.kernel.org>
|
||||
Description:
|
||||
If a driver has enabled MSI (not MSI-X), "irq" contains the
|
||||
IRQ of the first MSI vector. Otherwise "irq" contains the
|
||||
IRQ of the legacy INTx interrupt.
|
||||
|
||||
"irq" being set to 0 indicates that the device isn't
|
||||
capable of generating legacy INTx interrupts.
|
||||
|
||||
What: /sys/bus/pci/devices/.../remove
|
||||
Date: January 2009
|
||||
Contact: Linux PCI developers <linux-pci@vger.kernel.org>
|
||||
|
@ -328,6 +328,14 @@ as idle::
|
||||
From now on, any pages on zram are idle pages. The idle mark
|
||||
will remain until someone requests access to the block.
|
||||
IOW, unless there is access request, those pages are still idle pages.
|
||||
Additionally, when CONFIG_ZRAM_MEMORY_TRACKING is enabled pages can be
|
||||
marked as idle based on how long (in seconds) it's been since they were
|
||||
last accessed::
|
||||
|
||||
echo 86400 > /sys/block/zramX/idle
|
||||
|
||||
In this example all pages which haven't been accessed in more than 86400
|
||||
seconds (one day) will be marked idle.
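As a small illustration of the age-based form, the sketch below converts a human-friendly duration into the number of seconds that would be written to the ``idle`` file. The helper name ``idle_age_seconds`` is hypothetical and is not part of zram.

```python
def idle_age_seconds(days: int = 0, hours: int = 0, minutes: int = 0) -> int:
    """Return an age threshold in seconds, suitable for /sys/block/zramX/idle."""
    return ((days * 24 + hours) * 60 + minutes) * 60


# One day, matching the value used in the example above.
print(idle_age_seconds(days=1))
```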
|
||||
|
||||
Admin can request writeback of those idle pages at right timing via::
|
||||
|
||||
|
@ -87,10 +87,8 @@ Brief summary of control files.
|
||||
memory.oom_control set/show oom controls.
|
||||
memory.numa_stat show the number of memory usage per numa
|
||||
node
|
||||
memory.kmem.limit_in_bytes set/show hard limit for kernel memory
|
||||
This knob is deprecated and shouldn't be
|
||||
used. It is planned that this be removed in
|
||||
the foreseeable future.
|
||||
memory.kmem.limit_in_bytes This knob is deprecated and writing to
|
||||
it will return -ENOTSUPP.
|
||||
memory.kmem.usage_in_bytes show current kernel memory allocation
|
||||
memory.kmem.failcnt show the number of kernel memory usage
|
||||
hits limits
|
||||
@ -518,11 +516,6 @@ will be charged as a new owner of it.
|
||||
charged file caches. Some out-of-use page caches may keep charged until
|
||||
memory pressure happens. If you want to avoid that, force_empty will be useful.
|
||||
|
||||
Also, note that when memory.kmem.limit_in_bytes is set the charges due to
|
||||
kernel pages will still be seen. This is not considered a failure and the
|
||||
write will still return success. In this case, it is expected that
|
||||
memory.kmem.usage_in_bytes == memory.usage_in_bytes.
|
||||
|
||||
5.2 stat file
|
||||
-------------
|
||||
|
||||
|
78
Documentation/admin-guide/filesystem-monitoring.rst
Normal file
@ -0,0 +1,78 @@
|
||||
.. SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
====================================
|
||||
File system Monitoring with fanotify
|
||||
====================================
|
||||
|
||||
File system Error Reporting
|
||||
===========================
|
||||
|
||||
Fanotify supports the FAN_FS_ERROR event type for file system-wide error
|
||||
reporting. It is meant to be used by file system health monitoring
|
||||
daemons, which listen for these events and take actions (notify
|
||||
sysadmin, start recovery) when a file system problem is detected.
|
||||
|
||||
By design, a FAN_FS_ERROR notification exposes sufficient information
|
||||
for a monitoring tool to know a problem in the file system has happened.
|
||||
It doesn't necessarily provide a user space application with semantics
|
||||
to verify an IO operation was successfully executed. That is out of
|
||||
scope for this feature. Instead, it is only meant as a framework for
|
||||
early file system problem detection and reporting recovery tools.
|
||||
|
||||
When a file system operation fails, it is common for dozens of kernel
|
||||
errors to cascade after the initial failure, hiding the original failure
|
||||
log, which is usually the most useful debug data to troubleshoot the
|
||||
problem. For this reason, FAN_FS_ERROR tries to report only the first
|
||||
error that occurred for a file system since the last notification, and
|
||||
it simply counts additional errors. This ensures that the most
|
||||
important pieces of information are never lost.
|
||||
|
||||
FAN_FS_ERROR requires the fanotify group to be set up with the
|
||||
FAN_REPORT_FID flag.
|
||||
|
||||
At the time of this writing, the only file system that emits FAN_FS_ERROR
|
||||
notifications is Ext4.
|
||||
|
||||
A FAN_FS_ERROR notification has the following format::
|
||||
|
||||
[ Notification Metadata (Mandatory) ]
|
||||
[ Generic Error Record (Mandatory) ]
|
||||
[ FID record (Mandatory) ]
|
||||
|
||||
The order of records is not guaranteed, and new records might be added
|
||||
in the future. Therefore, applications must not rely on the order and
|
||||
must be prepared to skip over unknown records. Please refer to
|
||||
``samples/fanotify/fs-monitor.c`` for an example parser.
|
||||
|
||||
Generic error record
|
||||
--------------------
|
||||
|
||||
The generic error record provides enough information for a file system
|
||||
agnostic tool to learn about a problem in the file system, without
|
||||
providing any additional details about the problem. This record is
|
||||
identified by ``struct fanotify_event_info_header.info_type`` being set
|
||||
to FAN_EVENT_INFO_TYPE_ERROR.
|
||||
|
||||
::
|
||||
|
||||
struct fanotify_event_info_error {
|
||||
struct fanotify_event_info_header hdr;
|
||||
__s32 error;
|
||||
__u32 error_count;
|
||||
};
|
||||
|
||||
The `error` field identifies the type of error using errno values.
|
||||
`error_count` tracks the number of errors that occurred and were
|
||||
suppressed to preserve the original error information, since the last
|
||||
notification.
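The record walk described above can be sketched as follows. This is only an illustration, under these assumptions: the function receives the bytes that follow the notification metadata, the header layout is ``__u8 info_type; __u8 pad; __u16 len`` as in ``<linux/fanotify.h>``, and ``FAN_EVENT_INFO_TYPE_ERROR`` has the value 5. The function name ``find_error_record`` is hypothetical; ``samples/fanotify/fs-monitor.c`` remains the reference parser.

```python
import struct

FAN_EVENT_INFO_TYPE_ERROR = 5  # assumed value, taken from <linux/fanotify.h>

# struct fanotify_event_info_header: __u8 info_type; __u8 pad; __u16 len;
HEADER = struct.Struct("=BBH")
# Remainder of struct fanotify_event_info_error: __s32 error; __u32 error_count;
ERROR_BODY = struct.Struct("=iI")


def find_error_record(records: bytes):
    """Return (error, error_count) from the generic error record, or None.

    Records are visited in whatever order they arrive, and records of an
    unknown type are skipped using the length stored in their header.
    """
    offset = 0
    while offset + HEADER.size <= len(records):
        info_type, _pad, length = HEADER.unpack_from(records, offset)
        if length < HEADER.size or offset + length > len(records):
            break  # malformed record: stop instead of looping forever
        if (info_type == FAN_EVENT_INFO_TYPE_ERROR
                and length >= HEADER.size + ERROR_BODY.size):
            return ERROR_BODY.unpack_from(records, offset + HEADER.size)
        offset += length
    return None
```

Skipping by ``len`` rather than by a fixed size is what keeps such a parser working when new record types are added.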
|
||||
|
||||
FID record
|
||||
----------
|
||||
|
||||
The FID record can be used to uniquely identify the inode that triggered
|
||||
the error through the combination of fsid and file handle. A file system
|
||||
specific application can use that information to attempt a recovery
|
||||
procedure. Errors that are not related to an inode are reported with an
|
||||
empty file handle of type FILEID_INVALID.
|
@ -82,6 +82,7 @@ configure specific aspects of kernel behavior to your liking.
|
||||
edid
|
||||
efi-stub
|
||||
ext4
|
||||
filesystem-monitoring
|
||||
nfs/index
|
||||
gpio/index
|
||||
highuid
|
||||
|
@ -1582,8 +1582,10 @@
|
||||
registers. Default set by CONFIG_HPET_MMAP_DEFAULT.
|
||||
|
||||
hugetlb_cma= [HW,CMA] The size of a CMA area used for allocation
|
||||
of gigantic hugepages.
|
||||
Format: nn[KMGTPE]
|
||||
of gigantic hugepages. Or using node format, the size
|
||||
of a CMA area per node can be specified.
|
||||
Format: nn[KMGTPE] or (node format)
|
||||
<node>:nn[KMGTPE][,<node>:nn[KMGTPE]]
|
||||
|
||||
Reserve a CMA area of given size and allocate gigantic
|
||||
hugepages using the CMA allocator. If enabled, the
|
||||
@ -1594,9 +1596,11 @@
|
||||
the number of pages of hugepagesz to be allocated.
|
||||
If this is the first HugeTLB parameter on the command
|
||||
line, it specifies the number of pages to allocate for
|
||||
the default huge page size. See also
|
||||
Documentation/admin-guide/mm/hugetlbpage.rst.
|
||||
Format: <integer>
|
||||
the default huge page size. If using node format, the
|
||||
number of pages to allocate per-node can be specified.
|
||||
See also Documentation/admin-guide/mm/hugetlbpage.rst.
|
||||
Format: <integer> or (node format)
|
||||
<node>:<integer>[,<node>:<integer>]
|
||||
|
||||
hugepagesz=
|
||||
[HW] The size of the HugeTLB pages. This is used in
|
||||
@ -4988,6 +4992,18 @@
|
||||
an IOTLB flush. Default is lazy flushing before reuse,
|
||||
which is faster.
|
||||
|
||||
s390_iommu_aperture= [KNL,S390]
|
||||
Specifies the size of the per device DMA address space
|
||||
accessible through the DMA and IOMMU APIs as a decimal
|
||||
factor of the size of main memory.
|
||||
The default is 1 meaning that one can concurrently use
|
||||
as many DMA addresses as physical memory is installed,
|
||||
if supported by hardware, and thus map all of memory
|
||||
once. With a value of 2 one can map all of memory twice
|
||||
and so on. As a special case a factor of 0 imposes no
|
||||
restrictions other than those given by hardware at the
|
||||
cost of significant additional memory use for tables.
|
||||
|
||||
sa1100ir [NET]
|
||||
See drivers/net/irda/sa1100_ir.c.
|
||||
|
||||
|
@ -13,3 +13,4 @@ optimize those.
|
||||
|
||||
start
|
||||
usage
|
||||
reclaim
|
||||
|
235
Documentation/admin-guide/mm/damon/reclaim.rst
Normal file
@ -0,0 +1,235 @@
|
||||
.. SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
=======================
|
||||
DAMON-based Reclamation
|
||||
=======================
|
||||
|
||||
DAMON-based Reclamation (DAMON_RECLAIM) is a static kernel module that is
intended for proactive and lightweight reclamation under light memory pressure.
It doesn't aim to replace the LRU-list based page granularity reclamation, but
to be selectively used for different levels of memory pressure and requirements.
|
||||
|
||||
Where Is Proactive Reclamation Required?
|
||||
========================================
|
||||
|
||||
On general memory over-committed systems, proactively reclaiming cold pages
helps save memory and reduce latency spikes that are incurred by the direct
reclaim of the process or the CPU consumption of kswapd, while incurring only
minimal performance degradation [1]_ [2]_ .
|
||||
|
||||
Free Pages Reporting [3]_ based memory over-commit virtualization systems are
a good example of such cases. In such systems, the guest VMs report their free
memory to the host, and the host reallocates the reported memory to other guests.
As a result, the memory of the systems is fully utilized. However, the
guests could be not so memory-frugal, mainly because some kernel subsystems and
user-space applications are designed to use as much memory as is available. Then,
guests could report only a small amount of memory as free to the host, resulting
in a memory utilization drop of the systems. Running the proactive reclamation in
guests could mitigate this problem.
|
||||
|
||||
How It Works?
|
||||
=============
|
||||
|
||||
DAMON_RECLAIM finds memory regions that have not been accessed for a specific
time duration and pages them out. To avoid consuming too much CPU for the
paging out operation, a speed limit can be configured. Under the speed limit, it
pages out the memory regions that have not been accessed for the longest time
first. System administrators can also configure under what situation this scheme
should be automatically activated and deactivated, using three memory pressure
watermarks.
|
||||
|
||||
Interface: Module Parameters
|
||||
============================
|
||||
|
||||
To use this feature, you should first ensure your system is running on a kernel
|
||||
that is built with ``CONFIG_DAMON_RECLAIM=y``.
|
||||
|
||||
To let sysadmins enable or disable it and tune for the given system,
|
||||
DAMON_RECLAIM utilizes module parameters. That is, you can put
|
||||
``damon_reclaim.<parameter>=<value>`` on the kernel boot command line or write
|
||||
proper values to ``/sys/module/damon_reclaim/parameters/<parameter>`` files.
|
||||
|
||||
Note that the parameter values except ``enabled`` are applied only when
|
||||
DAMON_RECLAIM starts. Therefore, if you want to apply new parameter values in
|
||||
runtime and DAMON_RECLAIM is already enabled, you should disable and re-enable
|
||||
it via the ``enabled`` parameter file. The new values should be written to the
proper parameter files before the re-enablement.
|
||||
|
||||
Below are the descriptions of each parameter.
|
||||
|
||||
enabled
|
||||
-------
|
||||
|
||||
Enable or disable DAMON_RECLAIM.
|
||||
|
||||
You can enable DAMON_RECLAIM by setting the value of this parameter as ``Y``.
Setting it as ``N`` disables DAMON_RECLAIM. Note that, even when enabled,
DAMON_RECLAIM could do no real monitoring and reclamation due to the
watermarks-based activation condition. Refer to the descriptions of the
watermarks parameters below for this.
|
||||
|
||||
min_age
|
||||
-------
|
||||
|
||||
Time threshold for cold memory regions identification in microseconds.
|
||||
|
||||
If a memory region is not accessed for this long or longer, DAMON_RECLAIM
|
||||
identifies the region as cold, and reclaims it.
|
||||
|
||||
120 seconds by default.
|
||||
|
||||
quota_ms
|
||||
--------
|
||||
|
||||
Limit of time for the reclamation in milliseconds.
|
||||
|
||||
DAMON_RECLAIM tries to use only up to this time within a time window
|
||||
(quota_reset_interval_ms) for trying reclamation of cold pages. This can be
|
||||
used for limiting CPU consumption of DAMON_RECLAIM. If the value is zero, the
|
||||
limit is disabled.
|
||||
|
||||
10 ms by default.
|
||||
|
||||
quota_sz
|
||||
--------
|
||||
|
||||
Limit of size of memory for the reclamation in bytes.
|
||||
|
||||
DAMON_RECLAIM charges the amount of memory which it tried to reclaim within a time
window (quota_reset_interval_ms) and ensures that no more than this limit is tried.
|
||||
This can be used for limiting consumption of CPU and IO. If this value is
|
||||
zero, the limit is disabled.
|
||||
|
||||
128 MiB by default.
|
||||
|
||||
quota_reset_interval_ms
|
||||
-----------------------
|
||||
|
||||
The time/size quota charge reset interval in milliseconds.
|
||||
|
||||
The charge reset interval for the quota of time (quota_ms) and size
|
||||
(quota_sz). That is, DAMON_RECLAIM does not try reclamation for more than
|
||||
quota_ms milliseconds or quota_sz bytes within quota_reset_interval_ms
|
||||
milliseconds.
|
||||
|
||||
1 second by default.
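To make the relation between ``quota_sz`` and ``quota_reset_interval_ms`` concrete, the sketch below computes the upper bound on attempted reclamation in bytes per second. The function name is hypothetical, and the sketch covers only the size quota; the time quota (``quota_ms``) limits the work independently.

```python
def max_reclaim_rate(quota_sz: int, quota_reset_interval_ms: int) -> float:
    """Upper bound on attempted reclamation, in bytes per second."""
    if quota_sz == 0:
        return float("inf")  # a zero size quota disables the limit
    return quota_sz * 1000 / quota_reset_interval_ms


# Defaults from the text: 128 MiB per 1 second window.
print(max_reclaim_rate(128 * 1024 * 1024, 1000))
```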
|
||||
|
||||
wmarks_interval
|
||||
---------------
|
||||
|
||||
Minimal time to wait before checking the watermarks, when DAMON_RECLAIM is
|
||||
enabled but inactive due to its watermarks rule.
|
||||
|
||||
wmarks_high
|
||||
-----------
|
||||
|
||||
Free memory rate (per thousand) for the high watermark.
|
||||
|
||||
If free memory of the system in bytes per thousand bytes is higher than this,
|
||||
DAMON_RECLAIM becomes inactive, so it does nothing but only periodically checks
|
||||
the watermarks.
|
||||
|
||||
wmarks_mid
|
||||
----------
|
||||
|
||||
Free memory rate (per thousand) for the middle watermark.
|
||||
|
||||
If free memory of the system in bytes per thousand bytes is between this and
|
||||
the low watermark, DAMON_RECLAIM becomes active, so starts the monitoring and
|
||||
the reclaiming.
|
||||
|
||||
wmarks_low
|
||||
----------
|
||||
|
||||
Free memory rate (per thousand) for the low watermark.
|
||||
|
||||
If free memory of the system in bytes per thousand bytes is lower than this,
|
||||
DAMON_RECLAIM becomes inactive, so it does nothing but periodically checks the
|
||||
watermarks. In the case, the system falls back to the LRU-list based page
|
||||
granularity reclamation logic.
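The three watermark rules can be summarized in a short sketch. It is an illustration of the rules as described above, not the kernel implementation. The function names are hypothetical, and the behavior between the middle and the high watermark (keeping the current state) is an assumption that the text above does not spell out.

```python
def free_mem_rate(mem_free: int, mem_total: int) -> int:
    """Free memory per thousand; for example, 250 means 25% free."""
    return mem_free * 1000 // mem_total


def next_state(active: bool, rate: int, high: int, mid: int, low: int) -> bool:
    """Return whether DAMON_RECLAIM would be active for the given free rate."""
    if rate > high or rate < low:
        return False  # above high or below low: inactive
    if rate <= mid:
        return True  # between the low and middle watermarks: active
    return active  # assumption: between mid and high the state is kept


print(next_state(False, free_mem_rate(3, 10), high=500, mid=400, low=200))
```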
|
||||
|
||||
sample_interval
|
||||
---------------
|
||||
|
||||
Sampling interval for the monitoring in microseconds.
|
||||
|
||||
The sampling interval of DAMON for the cold memory monitoring. Please refer to
|
||||
the DAMON documentation (:doc:`usage`) for more detail.
|
||||
|
||||
aggr_interval
|
||||
-------------
|
||||
|
||||
Aggregation interval for the monitoring in microseconds.
|
||||
|
||||
The aggregation interval of DAMON for the cold memory monitoring. Please
|
||||
refer to the DAMON documentation (:doc:`usage`) for more detail.
|
||||
|
||||
min_nr_regions
|
||||
--------------
|
||||
|
||||
Minimum number of monitoring regions.
|
||||
|
||||
The minimal number of monitoring regions of DAMON for the cold memory
|
||||
monitoring. This can be used to set lower-bound of the monitoring quality.
|
||||
But, setting this too high could result in increased monitoring overhead.
|
||||
Please refer to the DAMON documentation (:doc:`usage`) for more detail.
|
||||
|
||||
max_nr_regions
|
||||
--------------
|
||||
|
||||
Maximum number of monitoring regions.
|
||||
|
||||
The maximum number of monitoring regions of DAMON for the cold memory
|
||||
monitoring. This can be used to set upper-bound of the monitoring overhead.
|
||||
However, setting this too low could result in bad monitoring quality. Please
|
||||
refer to the DAMON documentation (:doc:`usage`) for more detail.
|
||||
|
||||
monitor_region_start
|
||||
--------------------
|
||||
|
||||
Start of target memory region in physical address.
|
||||
|
||||
The start physical address of the memory region that DAMON_RECLAIM will work
against. That is, DAMON_RECLAIM will find cold memory regions in this region
and reclaim them. By default, the biggest System RAM region is used.
|
||||
|
||||
monitor_region_end
|
||||
------------------
|
||||
|
||||
End of target memory region in physical address.
|
||||
|
||||
The end physical address of the memory region that DAMON_RECLAIM will work
against. That is, DAMON_RECLAIM will find cold memory regions in this region
and reclaim them. By default, the biggest System RAM region is used.
|
||||
|
||||
kdamond_pid
|
||||
-----------
|
||||
|
||||
PID of the DAMON thread.
|
||||
|
||||
If DAMON_RECLAIM is enabled, this becomes the PID of the worker thread. Else,
|
||||
-1.
|
||||
|
||||
Example
|
||||
=======
|
||||
|
||||
The runtime example commands below make DAMON_RECLAIM find memory regions that
have not been accessed for 30 seconds or more and page them out. The reclamation
is limited to 1 GiB per second to avoid DAMON_RECLAIM consuming too
much CPU time for the paging out operation. The commands also ask DAMON_RECLAIM
to do nothing if the system's free memory rate is more than 50%, but to start the
real work if it becomes lower than 40%. If DAMON_RECLAIM doesn't make progress and
the free memory rate therefore becomes lower than 20%, they ask DAMON_RECLAIM to
do nothing again, so that the system can fall back to the LRU-list based page
granularity reclamation. ::
|
||||
|
||||
# cd /sys/module/damon_reclaim/parameters
|
||||
# echo 30000000 > min_age
|
||||
# echo $((1 * 1024 * 1024 * 1024)) > quota_sz
|
||||
# echo 1000 > quota_reset_interval_ms
|
||||
# echo 500 > wmarks_high
|
||||
# echo 400 > wmarks_mid
|
||||
# echo 200 > wmarks_low
|
||||
# echo Y > enabled
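The same settings could also be applied from a script. The sketch below is one possible way to do it; the helper ``apply_params`` is hypothetical. It takes the parameter directory as an argument (module parameters normally appear under ``/sys/module/<module>/parameters``) and writes ``enabled`` last, because the other values are applied only when DAMON_RECLAIM starts.

```python
from pathlib import Path


def apply_params(param_dir, params: dict) -> list:
    """Write each parameter file, then enable DAMON_RECLAIM last.

    Returns the parameter names in the order they were written.
    """
    order = [name for name in params if name != "enabled"] + ["enabled"]
    values = dict(params, enabled="Y")
    for name in order:
        (Path(param_dir) / name).write_text(f"{values[name]}\n")
    return order


settings = {
    "min_age": 30 * 1000 * 1000,  # 30 seconds, in microseconds
    "quota_sz": 1 * 1024 ** 3,  # 1 GiB per quota window
    "quota_reset_interval_ms": 1000,
    "wmarks_high": 500,
    "wmarks_mid": 400,
    "wmarks_low": 200,
}
# Needs root on a kernel built with CONFIG_DAMON_RECLAIM=y:
# apply_params("/sys/module/damon_reclaim/parameters", settings)
```

If DAMON_RECLAIM is already running, it would first have to be disabled so that the new values take effect on re-enablement.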
|
||||
|
||||
.. [1] https://research.google/pubs/pub48551/
|
||||
.. [2] https://lwn.net/Articles/787611/
|
||||
.. [3] https://www.kernel.org/doc/html/latest/vm/free_page_reporting.html
|
@ -6,39 +6,9 @@ Getting Started
|
||||
|
||||
This document briefly describes how you can use DAMON by demonstrating its
|
||||
default user space tool. Please note that this document describes only a part
|
||||
of its features for brevity. Please refer to :doc:`usage` for more details.
|
||||
|
||||
|
||||
TL; DR
|
||||
======
|
||||
|
||||
Follow the commands below to monitor and visualize the memory access pattern of
|
||||
your workload. ::
|
||||
|
||||
# # build the kernel with CONFIG_DAMON_*=y, install it, and reboot
|
||||
# mount -t debugfs none /sys/kernel/debug/
|
||||
# git clone https://github.com/awslabs/damo
|
||||
# ./damo/damo record $(pidof <your workload>)
|
||||
# ./damo/damo report heat --plot_ascii
|
||||
|
||||
The final command draws the access heatmap of ``<your workload>``. The heatmap
|
||||
shows which memory region (x-axis) is accessed when (y-axis) and how frequently
|
||||
(number; the higher the more accesses have been observed). ::
|
||||
|
||||
111111111111111111111111111111111111111111111111111111110000
|
||||
111121111111111111111111111111211111111111111111111111110000
|
||||
000000000000000000000000000000000000000000000000001555552000
|
||||
000000000000000000000000000000000000000000000222223555552000
|
||||
000000000000000000000000000000000000000011111677775000000000
|
||||
000000000000000000000000000000000000000488888000000000000000
|
||||
000000000000000000000000000000000177888400000000000000000000
|
||||
000000000000000000000000000046666522222100000000000000000000
|
||||
000000000000000000000014444344444300000000000000000000000000
|
||||
000000000000000002222245555510000000000000000000000000000000
|
||||
# access_frequency: 0 1 2 3 4 5 6 7 8 9
|
||||
# x-axis: space (140286319947776-140286426374096: 101.496 MiB)
|
||||
# y-axis: time (605442256436361-605479951866441: 37.695430s)
|
||||
# resolution: 60x10 (1.692 MiB and 3.770s for each character)
|
||||
of its features for brevity. Please refer to the usage `doc
|
||||
<https://github.com/awslabs/damo/blob/next/USAGE.md>`_ of the tool for more
|
||||
details.
|
||||
|
||||
|
||||
Prerequisites
|
||||
@ -91,24 +61,74 @@ pattern in the ``damon.data`` file.
|
||||
Visualizing Recorded Patterns
|
||||
=============================
|
||||
|
||||
The following three commands visualize the recorded access patterns and save
|
||||
the results as separate image files. ::
|
||||
You can visualize the pattern in a heatmap, showing which memory region
|
||||
(x-axis) got accessed when (y-axis) and how frequently (number).::
|
||||
|
||||
$ damo report heats --heatmap access_pattern_heatmap.png
|
||||
$ damo report wss --range 0 101 1 --plot wss_dist.png
|
||||
$ damo report wss --range 0 101 1 --sortby time --plot wss_chron_change.png
|
||||
$ sudo damo report heats --heatmap stdout
|
||||
22222222222222222222222222222222222222211111111111111111111111111111111111111100
|
||||
44444444444444444444444444444444444444434444444444444444444444444444444444443200
|
||||
44444444444444444444444444444444444444433444444444444444444444444444444444444200
|
||||
33333333333333333333333333333333333333344555555555555555555555555555555555555200
|
||||
33333333333333333333333333333333333344444444444444444444444444444444444444444200
|
||||
22222222222222222222222222222222222223355555555555555555555555555555555555555200
|
||||
00000000000000000000000000000000000000288888888888888888888888888888888888888400
|
||||
00000000000000000000000000000000000000288888888888888888888888888888888888888400
|
||||
33333333333333333333333333333333333333355555555555555555555555555555555555555200
|
||||
88888888888888888888888888888888888888600000000000000000000000000000000000000000
|
||||
88888888888888888888888888888888888888600000000000000000000000000000000000000000
|
||||
33333333333333333333333333333333333333444444444444444444444444444444444444443200
|
||||
00000000000000000000000000000000000000288888888888888888888888888888888888888400
|
||||
[...]
|
||||
# access_frequency: 0 1 2 3 4 5 6 7 8 9
|
||||
# x-axis: space (139728247021568-139728453431248: 196.848 MiB)
|
||||
# y-axis: time (15256597248362-15326899978162: 1 m 10.303 s)
|
||||
# resolution: 80x40 (2.461 MiB and 1.758 s for each character)
|
||||
|
||||
- ``access_pattern_heatmap.png`` will visualize the data access pattern in a
|
||||
heatmap, showing which memory region (y-axis) got accessed when (x-axis)
|
||||
and how frequently (color).
|
||||
- ``wss_dist.png`` will show the distribution of the working set size.
|
||||
- ``wss_chron_change.png`` will show how the working set size has
|
||||
chronologically changed.
|
||||
You can also visualize the distribution of the working set size, sorted by the
|
||||
size.::
|
||||
|
||||
You can view the visualizations of this example workload at [1]_.
|
||||
Visualizations of other realistic workloads are available at [2]_ [3]_ [4]_.
|
||||
$ sudo damo report wss --range 0 101 10
|
||||
# <percentile> <wss>
|
||||
# target_id 18446632103789443072
|
||||
# avr: 107.708 MiB
|
||||
0 0 B | |
|
||||
10 95.328 MiB |**************************** |
|
||||
20 95.332 MiB |**************************** |
|
||||
30 95.340 MiB |**************************** |
|
||||
40 95.387 MiB |**************************** |
|
||||
50 95.387 MiB |**************************** |
|
||||
60 95.398 MiB |**************************** |
|
||||
70 95.398 MiB |**************************** |
|
||||
80 95.504 MiB |**************************** |
|
||||
90 190.703 MiB |********************************************************* |
|
||||
100 196.875 MiB |***********************************************************|
|
||||
|
||||
.. [1] https://damonitor.github.io/doc/html/v17/admin-guide/mm/damon/start.html#visualizing-recorded-patterns
|
||||
.. [2] https://damonitor.github.io/test/result/visual/latest/rec.heatmap.1.png.html
|
||||
.. [3] https://damonitor.github.io/test/result/visual/latest/rec.wss_sz.png.html
|
||||
.. [4] https://damonitor.github.io/test/result/visual/latest/rec.wss_time.png.html
|
||||
Using ``--sortby`` option with the above command, you can show how the working
|
||||
set size has chronologically changed.::
|
||||
|
||||
$ sudo damo report wss --range 0 101 10 --sortby time
|
||||
# <percentile> <wss>
|
||||
# target_id 18446632103789443072
|
||||
# avr: 107.708 MiB
|
||||
0 3.051 MiB | |
|
||||
10 190.703 MiB |***********************************************************|
|
||||
20 95.336 MiB |***************************** |
|
||||
30 95.328 MiB |***************************** |
|
||||
40 95.387 MiB |***************************** |
|
||||
50 95.332 MiB |***************************** |
|
||||
60 95.320 MiB |***************************** |
|
||||
70 95.398 MiB |***************************** |
|
||||
80 95.398 MiB |***************************** |
|
||||
90 95.340 MiB |***************************** |
|
||||
100 95.398 MiB |***************************** |
|
||||
|
||||
|
||||
Data Access Pattern Aware Memory Management
|
||||
===========================================
|
||||
|
||||
The three commands below make every memory region of size >=4K that has not
been accessed for >=60 seconds in your workload be swapped out. ::
|
||||
|
||||
$ echo "#min-size max-size min-acc max-acc min-age max-age action" > test_scheme
|
||||
$ echo "4K max 0 0 60s max pageout" >> test_scheme
|
||||
$ damo schemes -c test_scheme <pid of your workload>
|
||||
|
@ -10,15 +10,16 @@ DAMON provides below three interfaces for different users.
|
||||
This is for privileged people such as system administrators who want a
|
||||
just-working human-friendly interface. Using this, users can use the DAMON’s
|
||||
major features in a human-friendly way. It may not be highly tuned for
|
||||
special cases, though. It supports only virtual address spaces monitoring.
|
||||
special cases, though. It supports both virtual and physical address spaces
|
||||
monitoring.
|
||||
- *debugfs interface.*
|
||||
This is for privileged user space programmers who want more optimized use of
|
||||
DAMON. Using this, users can use DAMON’s major features by reading
|
||||
from and writing to special debugfs files. Therefore, you can write and use
|
||||
your personalized DAMON debugfs wrapper programs that read/write the
debugfs files on your behalf. The DAMON user space tool is also a reference
|
||||
implementation of such programs. It supports only virtual address spaces
|
||||
monitoring.
|
||||
implementation of such programs. It supports both virtual and physical
|
||||
address spaces monitoring.
|
||||
- *Kernel Space Programming Interface.*
|
||||
This is for kernel space programmers. Using this, users can utilize every
|
||||
feature of DAMON most flexibly and efficiently by writing kernel space
|
||||
@ -34,8 +35,9 @@ the reason, this document describes only the debugfs interface
|
||||
debugfs Interface
|
||||
=================
|
||||
|
||||
DAMON exports three files, ``attrs``, ``target_ids``, and ``monitor_on`` under
|
||||
its debugfs directory, ``<debugfs>/damon/``.
|
||||
DAMON exports five files, ``attrs``, ``target_ids``, ``init_regions``,
|
||||
``schemes`` and ``monitor_on`` under its debugfs directory,
|
||||
``<debugfs>/damon/``.
|
||||
|
||||
|
||||
Attributes
|
||||
@ -71,9 +73,106 @@ check it again::
|
||||
# cat target_ids
|
||||
42 4242
|
||||
|
||||
Users can also monitor the physical memory address space of the system by
|
||||
writing a special keyword, "``paddr\n``" to the file. Because physical address
|
||||
space monitoring doesn't support multiple targets, reading the file will show a
|
||||
fake value, ``42``, as below::
|
||||
|
||||
# cd <debugfs>/damon
|
||||
# echo paddr > target_ids
|
||||
# cat target_ids
|
||||
42
|
||||
|
||||
Note that setting the target ids doesn't start the monitoring.
|
||||
|
||||
|
||||
Initial Monitoring Target Regions
|
||||
---------------------------------
|
||||
|
||||
In case of the virtual address space monitoring, DAMON automatically sets and
updates the monitoring target regions so that entire memory mappings of target
processes can be covered. However, users may want to limit the monitoring
region to specific address ranges, such as the heap, the stack, or a specific
file-mapped area. Or, some users may know the initial access pattern of their
workloads and therefore want to set optimal initial regions for the 'adaptive
regions adjustment'.
|
||||
|
||||
In contrast, DAMON does not automatically set and update the monitoring target
regions in case of physical memory monitoring. Therefore, users should set the
monitoring target regions by themselves.
|
||||
|
||||
In such cases, users can explicitly set the initial monitoring target regions
|
||||
as they want, by writing proper values to the ``init_regions`` file. Each line
|
||||
of the input should represent one region in the form below::
|
||||
|
||||
<target id> <start address> <end address>
|
||||
|
||||
The ``target id`` should already be in the ``target_ids`` file, and the regions
should be passed in address order. For example, the commands below will set a
couple of address ranges, ``1-100`` and ``100-200``, as the initial monitoring
target regions of process 42, and another couple of address ranges, ``20-40`` and
``50-100``, as those of process 4242.::
|
||||
|
||||
# cd <debugfs>/damon
|
||||
# echo "42 1 100
|
||||
42 100 200
|
||||
4242 20 40
|
||||
4242 50 100" > init_regions
|
||||
|
||||
Note that this sets the initial monitoring target regions only. In case of
|
||||
virtual memory monitoring, DAMON will automatically update the boundary of the
|
||||
regions after one ``regions update interval``. Therefore, users should set the
|
||||
``regions update interval`` large enough in this case, if they don't want the
|
||||
update.
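The ``init_regions`` input rules can be checked before writing the file. The sketch below is only an illustration; the function ``format_init_regions`` is hypothetical, and it interprets "address order" as non-overlapping, ascending regions per target, which is an assumption.

```python
def format_init_regions(regions, target_ids):
    """Render (target_id, start, end) tuples as init_regions input text.

    Raises ValueError when a target id is unknown or when the regions of a
    target are not given in ascending address order.
    """
    lines = []
    last_end = {}
    for target_id, start, end in regions:
        if target_id not in target_ids:
            raise ValueError(f"target id {target_id} is not in target_ids")
        if start >= end or start < last_end.get(target_id, 0):
            raise ValueError(f"region {start}-{end} is out of order")
        last_end[target_id] = end
        lines.append(f"{target_id} {start} {end}")
    return "\n".join(lines)


text = format_init_regions(
    [(42, 1, 100), (42, 100, 200), (4242, 20, 40), (4242, 50, 100)],
    target_ids={42, 4242},
)
print(text)  # this text would then be written to <debugfs>/damon/init_regions
```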
|
||||
|
||||
|
||||
Schemes
|
||||
-------
|
||||
|
||||
For usual DAMON-based data access aware memory management optimizations, users
|
||||
would simply want the system to apply a memory management action to a memory
|
||||
region of a specific size having a specific access frequency for a specific
|
||||
time. DAMON receives such formalized operation schemes from the user and
|
||||
applies those to the target processes. It also counts the total number and
size of regions that each scheme is applied to. These statistics can be used for
|
||||
online analysis or tuning of the schemes.
|
||||
|
||||
Users can get and set the schemes by reading from and writing to ``schemes``
|
||||
debugfs file. Reading the file also shows the statistics of each scheme. To
|
||||
the file, each scheme should be written on its own line in the form below::
|
||||
|
||||
min-size max-size min-acc max-acc min-age max-age action
|
||||
|
||||
Note that the ranges are closed intervals. Bytes for the size of regions
|
||||
(``min-size`` and ``max-size``), number of monitored accesses per aggregate
|
||||
interval for access frequency (``min-acc`` and ``max-acc``), number of
|
||||
aggregate intervals for the age of regions (``min-age`` and ``max-age``), and a
|
||||
predefined integer for memory management actions should be used. The supported
|
||||
numbers and their meanings are as below.
|
||||
|
||||
- 0: Call ``madvise()`` for the region with ``MADV_WILLNEED``
|
||||
- 1: Call ``madvise()`` for the region with ``MADV_COLD``
|
||||
- 2: Call ``madvise()`` for the region with ``MADV_PAGEOUT``
|
||||
- 3: Call ``madvise()`` for the region with ``MADV_HUGEPAGE``
|
||||
- 4: Call ``madvise()`` for the region with ``MADV_NOHUGEPAGE``
|
||||
- 5: Do nothing but count the statistics
|
||||
|
||||
You can disable schemes by simply writing an empty string to the file. For
|
||||
example, the commands below apply a scheme saying "If a memory region of size in
|
||||
[4KiB, 8KiB] is showing accesses per aggregate interval in [0, 5] for aggregate
|
||||
interval in [10, 20], page out the region", check the entered scheme again, and
|
||||
finally remove the scheme. ::
|
||||
|
||||
# cd <debugfs>/damon
|
||||
# echo "4096 8192 0 5 10 20 2" > schemes
|
||||
# cat schemes
|
||||
4096 8192 0 5 10 20 2 0 0
|
||||
# echo > schemes
|
||||
|
||||
The last two integers in the 4th line of the above example are the total number
and the total size of the regions that the scheme is applied to.
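As a rough illustration of the input format described above, the following
shell sketch composes one scheme line from its seven fields. The helper names
``damon_action_id`` and ``damon_scheme_line`` and the action names (``pageout``,
``stat``, and so on) are hypothetical conveniences and not part of DAMON; only
the field order and the action numbers are taken from the text above.

```shell
# Hypothetical helper: map an action name to the integer listed above.
damon_action_id() {
    case "$1" in
        willneed)   echo 0 ;;
        cold)       echo 1 ;;
        pageout)    echo 2 ;;
        hugepage)   echo 3 ;;
        nohugepage) echo 4 ;;
        stat)       echo 5 ;;
        *)          echo "unknown action: $1" >&2; return 1 ;;
    esac
}

# Hypothetical helper: print one scheme line in the documented field order.
# Usage: damon_scheme_line MIN_SZ MAX_SZ MIN_ACC MAX_ACC MIN_AGE MAX_AGE ACTION
damon_scheme_line() {
    action_id=$(damon_action_id "$7") || return 1
    echo "$1 $2 $3 $4 $5 $6 $action_id"
}

# Reproduces the scheme line written in the example above.
damon_scheme_line 4096 8192 0 5 10 20 pageout
```

The printed line could then be redirected to the ``schemes`` file in place of
the hand-written string.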
|
||||
|
||||
|
||||
Turning On/Off
|
||||
--------------
|
||||
|
||||
|
@ -128,7 +128,9 @@ hugepages
|
||||
implicitly specifies the number of huge pages of default size to
|
||||
allocate. If the number of huge pages of default size is implicitly
|
||||
specified, it can not be overwritten by a hugepagesz,hugepages
|
||||
parameter pair for the default size.
|
||||
parameter pair for the default size. This parameter also has a
|
||||
node format. The node format specifies the number of huge pages
|
||||
to allocate on specific nodes.
|
||||
|
||||
For example, on an architecture with 2M default huge page size::
|
||||
|
||||
@ -138,6 +140,14 @@ hugepages
|
||||
indicating that the hugepages=512 parameter is ignored. If a hugepages
|
||||
parameter is preceded by an invalid hugepagesz parameter, it will
|
||||
be ignored.
|
||||
|
||||
Node format example::
|
||||
|
||||
hugepagesz=2M hugepages=0:1,1:2
|
||||
|
||||
It will allocate 1 2M hugepage on node0 and 2 2M hugepages on node1.
|
||||
If the node number is invalid, the parameter will be ignored.
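To make the node format easier to read, the sketch below expands a value such
as ``0:1,1:2`` into one line per node. The helper name ``hugepages_per_node`` is
hypothetical, and the sketch only mirrors the ``<node>:<count>`` syntax shown
above; it does not reproduce the kernel's own parsing or validation.

```shell
# Hypothetical helper: expand a node-format value such as "0:1,1:2" into
# one "node<N> <count>" line per comma-separated entry.
hugepages_per_node() {
    echo "$1" | tr ',' '\n' | while IFS=: read -r node count; do
        echo "node$node $count"
    done
}

# The value from the node format example above.
hugepages_per_node "0:1,1:2"
```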
|
||||
|
||||
default_hugepagesz
|
||||
Specify the default huge page size. This parameter can
|
||||
only be specified once on the command line. default_hugepagesz can
|
||||
@ -234,8 +244,12 @@ will exist, of the form::
|
||||
|
||||
hugepages-${size}kB
|
||||
|
||||
Inside each of these directories, the same set of files will exist::
|
||||
Inside each of these directories, the set of files contained in ``/proc``
|
||||
will exist.  In addition, two interfaces for demoting huge
|
||||
pages may exist::
|
||||
|
||||
demote
|
||||
demote_size
|
||||
nr_hugepages
|
||||
nr_hugepages_mempolicy
|
||||
nr_overcommit_hugepages
|
||||
@ -243,7 +257,29 @@ Inside each of these directories, the same set of files will exist::
|
||||
resv_hugepages
|
||||
surplus_hugepages
|
||||
|
||||
which function as described above for the default huge page-sized case.
|
||||
The demote interfaces provide the ability to split a huge page into
|
||||
smaller huge pages. For example, the x86 architecture supports both
|
||||
1GB and 2MB huge page sizes.  A 1GB huge page can be split into 512
|
||||
2MB huge pages. Demote interfaces are not available for the smallest
|
||||
huge page size. The demote interfaces are:
|
||||
|
||||
demote_size
|
||||
is the size of demoted pages. When a page is demoted a corresponding
|
||||
number of huge pages of demote_size will be created. By default,
|
||||
demote_size is set to the next smaller huge page size. If there are
|
||||
multiple smaller huge page sizes, demote_size can be set to any of
|
||||
these smaller sizes. Only huge page sizes less than the current huge
|
||||
pages size are allowed.
|
||||
|
||||
demote
|
||||
is used to demote a number of huge pages. A user with root privileges
|
||||
can write to this file. It may not be possible to demote the
|
||||
requested number of huge pages. To determine how many pages were
|
||||
actually demoted, compare the value of nr_hugepages before and after
|
||||
writing to the demote interface. demote is a write only interface.
|
||||
|
||||
The interfaces which are the same as in ``/proc`` (all except demote and
|
||||
demote_size) function as described above for the default huge page-sized case.
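The arithmetic and the before/after comparison described for the demote
interfaces can be sketched in shell as follows. Both helper names are
hypothetical. ``demote_and_report`` takes the per-size directory as an argument
(assumed here to be something like
``/sys/kernel/mm/hugepages/hugepages-1048576kB``) and must be run as root on a
kernel that provides the demote files.

```shell
# Hypothetical helper: number of demote_size pages created when COUNT huge
# pages of SIZE_KB are demoted to pages of DEMOTE_KB.
# Usage: demoted_pages SIZE_KB DEMOTE_KB COUNT
demoted_pages() {
    echo $(( $1 / $2 * $3 ))
}

# Hypothetical helper: request COUNT demotions in directory DIR and print
# how many pages were actually demoted, by comparing nr_hugepages before
# and after the write, as the text above suggests.
# Usage: demote_and_report DIR COUNT
demote_and_report() {
    before=$(cat "$1/nr_hugepages")
    echo "$2" > "$1/demote"
    after=$(cat "$1/nr_hugepages")
    echo $(( before - after ))
}

# One 1GB page (1048576 kB) split into 2MB pages (2048 kB).
demoted_pages 1048576 2048 1
```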
|
||||
|
||||
.. _mem_policy_and_hp_alloc:
|
||||
|
||||
|
@ -37,5 +37,7 @@ the Linux memory management.
|
||||
numaperf
|
||||
pagemap
|
||||
soft-dirty
|
||||
swap_numa
|
||||
transhuge
|
||||
userfaultfd
|
||||
zswap
|
||||
|
@ -165,9 +165,8 @@ Or alternatively::
|
||||
|
||||
% echo 1 > /sys/devices/system/memory/memoryXXX/online
|
||||
|
||||
The kernel will select the target zone automatically, usually defaulting to
|
||||
``ZONE_NORMAL`` unless ``movablecore=1`` has been specified on the kernel
|
||||
command line or if the memory block would intersect the ZONE_MOVABLE already.
|
||||
The kernel will select the target zone automatically, depending on the
|
||||
configured ``online_policy``.
|
||||
|
||||
One can explicitly request to associate an offline memory block with
|
||||
ZONE_MOVABLE by::
|
||||
@ -198,6 +197,9 @@ Auto-onlining can be enabled by writing ``online``, ``online_kernel`` or
|
||||
|
||||
% echo online > /sys/devices/system/memory/auto_online_blocks
|
||||
|
||||
Similarly to manual onlining, with ``online`` the kernel will select the
|
||||
target zone automatically, depending on the configured ``online_policy``.
|
||||
|
||||
Modifying the auto-online behavior will affect only subsequently added
memory blocks.
|
||||
|
||||
@ -393,11 +395,16 @@ command line parameters are relevant:
|
||||
======================== =======================================================
|
||||
``memhp_default_state`` configure auto-onlining by essentially setting
|
||||
``/sys/devices/system/memory/auto_online_blocks``.
|
||||
``movablecore`` configure automatic zone selection of the kernel. When
|
||||
set, the kernel will default to ZONE_MOVABLE, unless
|
||||
other zones can be kept contiguous.
|
||||
``movable_node`` configure automatic zone selection in the kernel when
|
||||
using the ``contig-zones`` online policy. When
|
||||
set, the kernel will default to ZONE_MOVABLE when
|
||||
onlining a memory block, unless other zones can be kept
|
||||
contiguous.
|
||||
======================== =======================================================
|
||||
|
||||
See Documentation/admin-guide/kernel-parameters.txt for a more generic
|
||||
description of these command line parameters.
|
||||
|
||||
Module Parameters
|
||||
------------------
|
||||
|
||||
@ -410,24 +417,118 @@ them with ``memory_hotplug.`` such as::
|
||||
|
||||
and they can be observed (and some even modified at runtime) via::
|
||||
|
||||
/sys/modules/memory_hotplug/parameters/
|
||||
/sys/module/memory_hotplug/parameters/
|
||||
|
||||
The following module parameters are currently defined:
|
||||
|
||||
======================== =======================================================
|
||||
``memmap_on_memory`` read-write: Allocate memory for the memmap from the
|
||||
added memory block itself. Even if enabled, actual
|
||||
support depends on various other system properties and
|
||||
should only be regarded as a hint whether the behavior
|
||||
would be desired.
|
||||
================================ ===============================================
|
||||
``memmap_on_memory`` read-write: Allocate memory for the memmap from
|
||||
the added memory block itself. Even if enabled,
|
||||
actual support depends on various other system
|
||||
properties and should only be regarded as a
|
||||
hint whether the behavior would be desired.
|
||||
|
||||
While allocating the memmap from the memory block
|
||||
itself makes memory hotplug less likely to fail and
|
||||
keeps the memmap on the same NUMA node in any case, it
|
||||
can fragment physical memory in a way that huge pages
|
||||
in bigger granularity cannot be formed on hotplugged
|
||||
memory.
|
||||
======================== =======================================================
|
||||
While allocating the memmap from the memory
|
||||
block itself makes memory hotplug less likely
|
||||
to fail and keeps the memmap on the same NUMA
|
||||
node in any case, it can fragment physical
|
||||
memory in a way that huge pages in bigger
|
||||
granularity cannot be formed on hotplugged
|
||||
memory.
|
||||
``online_policy`` read-write: Set the basic policy used for
|
||||
automatic zone selection when onlining memory
|
||||
blocks without specifying a target zone.
|
||||
``contig-zones`` has been the kernel default
|
||||
before this parameter was added. After an
|
||||
online policy was configured and memory was
|
||||
online, the policy should not be changed
|
||||
anymore.
|
||||
|
||||
When set to ``contig-zones``, the kernel will
|
||||
try keeping zones contiguous. If a memory block
|
||||
intersects multiple zones or no zone, the
|
||||
behavior depends on the ``movable_node`` kernel
|
||||
command line parameter: default to ZONE_MOVABLE
|
||||
if set, default to the applicable kernel zone
|
||||
(usually ZONE_NORMAL) if not set.
|
||||
|
||||
When set to ``auto-movable``, the kernel will
|
||||
try onlining memory blocks to ZONE_MOVABLE if
|
||||
possible according to the configuration and
|
||||
memory device details. With this policy, one
|
||||
can avoid zone imbalances when eventually
|
||||
hotplugging a lot of memory later and still
|
||||
wanting to be able to hotunplug as much as
|
||||
possible reliably, very desirable in
|
||||
virtualized environments. This policy ignores
|
||||
the ``movable_node`` kernel command line
|
||||
parameter and isn't really applicable in
|
||||
environments that require it (e.g., bare metal
|
||||
with hotunpluggable nodes) where hotplugged
|
||||
memory might be exposed via the
|
||||
firmware-provided memory map early during boot
|
||||
to the system instead of getting detected,
|
||||
added and onlined later during boot (such as
|
||||
done by virtio-mem or by some hypervisors
|
||||
implementing emulated DIMMs). As one example, a
|
||||
hotplugged DIMM will be onlined either
|
||||
completely to ZONE_MOVABLE or completely to
|
||||
ZONE_NORMAL, not a mixture.
|
||||
As another example, as many memory blocks
|
||||
belonging to a virtio-mem device will be
|
||||
onlined to ZONE_MOVABLE as possible,
|
||||
special-casing units of memory blocks that can
|
||||
only get hotunplugged together. *This policy
|
||||
does not protect from setups that are
|
||||
problematic with ZONE_MOVABLE and does not
|
||||
change the zone of memory blocks dynamically
|
||||
after they were onlined.*
|
||||
``auto_movable_ratio`` read-write: Set the maximum MOVABLE:KERNEL
|
||||
memory ratio in % for the ``auto-movable``
|
||||
online policy. Whether the ratio applies only
|
||||
for the system across all NUMA nodes or also
|
||||
per NUMA nodes depends on the
|
||||
``auto_movable_numa_aware`` configuration.
|
||||
|
||||
All accounting is based on present memory pages
|
||||
in the zones combined with accounting per
|
||||
memory device. Memory dedicated to the CMA
|
||||
allocator is accounted as MOVABLE, although
|
||||
residing on one of the kernel zones. The
|
||||
possible ratio depends on the actual workload.
|
||||
The kernel default is "301" %, for example,
|
||||
allowing for hotplugging 24 GiB to a 8 GiB VM
|
||||
and automatically onlining all hotplugged
|
||||
memory to ZONE_MOVABLE in many setups. The
|
||||
additional 1% deals with some pages being not
|
||||
present, for example, because of some firmware
|
||||
allocations.
|
||||
|
||||
Note that ZONE_NORMAL memory provided by one
|
||||
memory device does not allow for more
|
||||
ZONE_MOVABLE memory for a different memory
|
||||
device. As one example, onlining memory of a
|
||||
hotplugged DIMM to ZONE_NORMAL will not allow
|
||||
for another hotplugged DIMM to get onlined to
|
||||
ZONE_MOVABLE automatically. In contrast, memory
|
||||
hotplugged by a virtio-mem device that got
|
||||
onlined to ZONE_NORMAL will allow for more
|
||||
ZONE_MOVABLE memory within *the same*
|
||||
virtio-mem device.
|
||||
``auto_movable_numa_aware`` read-write: Configure whether the
|
||||
``auto_movable_ratio`` in the ``auto-movable``
|
||||
online policy also applies per NUMA
|
||||
node in addition to the whole system across all
|
||||
NUMA nodes. The kernel default is "Y".
|
||||
|
||||
Disabling NUMA awareness can be helpful when
|
||||
dealing with NUMA nodes that should be
|
||||
completely hotunpluggable, onlining the memory
|
||||
completely to ZONE_MOVABLE automatically if
|
||||
possible.
|
||||
|
||||
Parameter availability depends on CONFIG_NUMA.
|
||||
================================ ===============================================
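The ``auto_movable_ratio`` example in the table above can be checked with a
small calculation. This is a simplified sketch: the helper name
``max_movable_mib`` is hypothetical, and it ignores pages that are not present,
CMA accounting, per-device accounting and NUMA awareness, all of which the
kernel takes into account.

```shell
# Hypothetical helper: largest amount of MOVABLE memory (in MiB) that a
# MOVABLE:KERNEL ratio (in %) permits for KERNEL_MIB of kernel-zone memory.
# Usage: max_movable_mib KERNEL_MIB RATIO_PERCENT
max_movable_mib() {
    echo $(( $1 * $2 / 100 ))
}

# 8 GiB (8192 MiB) of kernel-zone memory with the default ratio of 301 %.
max_movable_mib 8192 301
```

Under these assumptions the result is slightly above 24 GiB (24576 MiB), which
matches the statement that 24 GiB can be hotplugged to an 8 GiB VM.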
|
||||
|
||||
ZONE_MOVABLE
|
||||
============
|
||||
|
@ -90,13 +90,14 @@ Short descriptions to the page flags
|
||||
====================================
|
||||
|
||||
0 - LOCKED
|
||||
page is being locked for exclusive access, e.g. by undergoing read/write IO
|
||||
The page is being locked for exclusive access, e.g. by undergoing read/write
|
||||
IO.
|
||||
7 - SLAB
|
||||
page is managed by the SLAB/SLOB/SLUB/SLQB kernel memory allocator
|
||||
The page is managed by the SLAB/SLOB/SLUB/SLQB kernel memory allocator.
|
||||
When compound page is used, SLUB/SLQB will only set this flag on the head
|
||||
page; SLOB will not flag it at all.
|
||||
10 - BUDDY
|
||||
a free memory block managed by the buddy system allocator
|
||||
A free memory block managed by the buddy system allocator.
|
||||
The buddy system organizes free memory in blocks of various orders.
|
||||
An order N block has 2^N physically contiguous pages, with the BUDDY flag
|
||||
set for and _only_ for the first page.
|
||||
@ -112,65 +113,65 @@ Short descriptions to the page flags
|
||||
16 - COMPOUND_TAIL
|
||||
A compound page tail (see description above).
|
||||
17 - HUGE
|
||||
this is an integral part of a HugeTLB page
|
||||
This is an integral part of a HugeTLB page.
|
||||
19 - HWPOISON
|
||||
hardware detected memory corruption on this page: don't touch the data!
|
||||
Hardware detected memory corruption on this page: don't touch the data!
|
||||
20 - NOPAGE
|
||||
no page frame exists at the requested address
|
||||
No page frame exists at the requested address.
|
||||
21 - KSM
|
||||
identical memory pages dynamically shared between one or more processes
|
||||
Identical memory pages dynamically shared between one or more processes.
|
||||
22 - THP
|
||||
contiguous pages which construct transparent hugepages
|
||||
Contiguous pages which construct transparent hugepages.
|
||||
23 - OFFLINE
|
||||
page is logically offline
|
||||
The page is logically offline.
|
||||
24 - ZERO_PAGE
|
||||
zero page for pfn_zero or huge_zero page
|
||||
Zero page for pfn_zero or huge_zero page.
|
||||
25 - IDLE
|
||||
page has not been accessed since it was marked idle (see
|
||||
The page has not been accessed since it was marked idle (see
|
||||
:ref:`Documentation/admin-guide/mm/idle_page_tracking.rst <idle_page_tracking>`).
|
||||
Note that this flag may be stale in case the page was accessed via
|
||||
a PTE. To make sure the flag is up-to-date one has to read
|
||||
``/sys/kernel/mm/page_idle/bitmap`` first.
|
||||
26 - PGTABLE
|
||||
page is in use as a page table
|
||||
The page is in use as a page table.
|
||||
|
||||
IO related page flags
|
||||
---------------------
|
||||
|
||||
1 - ERROR
|
||||
IO error occurred
|
||||
IO error occurred.
|
||||
3 - UPTODATE
|
||||
page has up-to-date data
|
||||
The page has up-to-date data.
|
||||
i.e. for file backed page: (in-memory data revision >= on-disk one)
|
||||
4 - DIRTY
|
||||
page has been written to, hence contains new data
|
||||
The page has been written to, hence contains new data.
|
||||
i.e. for file backed page: (in-memory data revision > on-disk one)
|
||||
8 - WRITEBACK
|
||||
page is being synced to disk
|
||||
The page is being synced to disk.
|
||||
|
||||
LRU related page flags
|
||||
----------------------
|
||||
|
||||
5 - LRU
|
||||
page is in one of the LRU lists
|
||||
The page is in one of the LRU lists.
|
||||
6 - ACTIVE
|
||||
page is in the active LRU list
|
||||
The page is in the active LRU list.
|
||||
18 - UNEVICTABLE
|
||||
page is in the unevictable (non-)LRU list It is somehow pinned and
|
||||
The page is in the unevictable (non-)LRU list. It is somehow pinned and
|
||||
not a candidate for LRU page reclaims, e.g. ramfs pages,
|
||||
shmctl(SHM_LOCK) and mlock() memory segments
|
||||
shmctl(SHM_LOCK) and mlock() memory segments.
|
||||
2 - REFERENCED
|
||||
page has been referenced since last LRU list enqueue/requeue
|
||||
The page has been referenced since last LRU list enqueue/requeue.
|
||||
9 - RECLAIM
|
||||
page will be reclaimed soon after its pageout IO completed
|
||||
The page will be reclaimed soon after its pageout IO completed.
|
||||
11 - MMAP
|
||||
a memory mapped page
|
||||
A memory mapped page.
|
||||
12 - ANON
|
||||
a memory mapped page that is not part of a file
|
||||
A memory mapped page that is not part of a file.
|
||||
13 - SWAPCACHE
|
||||
page is mapped to swap space, i.e. has an associated swap entry
|
||||
The page is mapped to swap space, i.e. has an associated swap entry.
|
||||
14 - SWAPBACKED
|
||||
page is backed by swap/RAM
|
||||
The page is backed by swap/RAM.
|
||||
|
||||
The page-types tool in the tools/vm directory can be used to query the
|
||||
above flags.
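For quick experiments, the bit numbers listed above can also be decoded by
hand. The following shell sketch prints the names of the bits set in a flags
value; the helper name ``decode_page_flags`` is hypothetical, the table covers
only the flags whose numbers appear above, and the sketch does not read any
kernel interface itself.

```shell
# Hypothetical helper: print the names of the set bits in a page-flags
# value, using the bit numbers from the descriptions above.
decode_page_flags() {
    flags=$(( $1 ))
    names=""
    for entry in 0:LOCKED 1:ERROR 2:REFERENCED 3:UPTODATE 4:DIRTY 5:LRU \
                 6:ACTIVE 7:SLAB 8:WRITEBACK 9:RECLAIM 10:BUDDY 11:MMAP \
                 12:ANON 13:SWAPCACHE 14:SWAPBACKED 16:COMPOUND_TAIL \
                 17:HUGE 18:UNEVICTABLE 19:HWPOISON 20:NOPAGE 21:KSM \
                 22:THP 23:OFFLINE 24:ZERO_PAGE 25:IDLE 26:PGTABLE; do
        bit=${entry%%:*}     # text before the colon: the bit number
        name=${entry#*:}     # text after the colon: the flag name
        if [ $(( (flags >> bit) & 1 )) -eq 1 ]; then
            names="${names:+$names }$name"
        fi
    done
    echo "$names"
}

# Bits 5 (LRU) and 6 (ACTIVE) set: 0x60.
decode_page_flags 0x60
```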
|
||||
|
@ -57,7 +57,6 @@ The third argument (arg) passes a pointer of struct memory_notify::
|
||||
unsigned long start_pfn;
|
||||
unsigned long nr_pages;
|
||||
int status_change_nid_normal;
|
||||
int status_change_nid_high;
|
||||
int status_change_nid;
|
||||
}
|
||||
|
||||
@ -65,8 +64,6 @@ The third argument (arg) passes a pointer of struct memory_notify::
|
||||
- nr_pages is # of pages of online/offline memory.
|
||||
- status_change_nid_normal is set node id when N_NORMAL_MEMORY of nodemask
|
||||
is (will be) set/clear, if this is -1, then nodemask status is not changed.
|
||||
- status_change_nid_high is set node id when N_HIGH_MEMORY of nodemask
|
||||
is (will be) set/clear, if this is -1, then nodemask status is not changed.
|
||||
- status_change_nid is set node id when N_MEMORY of nodemask is (will be)
|
||||
set/clear. It means a new(memoryless) node gets new memory by online and a
|
||||
node loses all memory. If this is -1, then nodemask status is not changed.
|
||||
|
@ -231,10 +231,14 @@ Guarded allocations are set up based on the sample interval. After expiration
|
||||
of the sample interval, the next allocation through the main allocator (SLAB or
|
||||
SLUB) returns a guarded allocation from the KFENCE object pool (allocation
|
||||
sizes up to PAGE_SIZE are supported). At this point, the timer is reset, and
|
||||
the next allocation is set up after the expiration of the interval. To "gate" a
|
||||
KFENCE allocation through the main allocator's fast-path without overhead,
|
||||
KFENCE relies on static branches via the static keys infrastructure. The static
|
||||
branch is toggled to redirect the allocation to KFENCE.
|
||||
the next allocation is set up after the expiration of the interval.
|
||||
|
||||
When using ``CONFIG_KFENCE_STATIC_KEYS=y``, KFENCE allocations are "gated"
|
||||
through the main allocator's fast-path by relying on static branches via the
|
||||
static keys infrastructure. The static branch is toggled to redirect the
|
||||
allocation to KFENCE. Depending on sample interval, target workloads, and
|
||||
system architecture, this may perform better than the simple dynamic branch.
|
||||
Careful benchmarking is recommended.
|
||||
|
||||
KFENCE objects each reside on a dedicated page, at either the left or right
|
||||
page boundaries selected at random. The pages to the left and right of the
|
||||
@ -269,6 +273,17 @@ tail of KFENCE's freelist, so that the least recently freed objects are reused
|
||||
first, and the chances of detecting use-after-frees of recently freed objects
|
||||
is increased.
|
||||
|
||||
If pool utilization reaches 75% (default) or above, to reduce the risk of the
|
||||
pool eventually being fully occupied by allocated objects yet ensure diverse
|
||||
coverage of allocations, KFENCE limits currently covered allocations of the
|
||||
same source from further filling up the pool. The "source" of an allocation is
|
||||
based on its partial allocation stack trace. A side-effect is that this also
|
||||
limits frequent long-lived allocations (e.g. pagecache) of the same source
|
||||
filling up the pool permanently, which is the most common risk for the pool
|
||||
becoming full and the sampled allocation rate dropping to zero. The threshold
|
||||
at which to start limiting currently covered allocations can be configured via
|
||||
the boot parameter ``kfence.skip_covered_thresh`` (pool usage%).
|
||||
|
||||
Interface
|
||||
---------
|
||||
|
||||
|
142
Documentation/devicetree/bindings/pci/mediatek,mt7621-pcie.yaml
Normal file
@ -0,0 +1,142 @@
|
||||
# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
|
||||
%YAML 1.2
|
||||
---
|
||||
$id: http://devicetree.org/schemas/pci/mediatek,mt7621-pcie.yaml#
|
||||
$schema: http://devicetree.org/meta-schemas/core.yaml#
|
||||
|
||||
title: MediaTek MT7621 PCIe controller
|
||||
|
||||
maintainers:
|
||||
- Sergio Paracuellos <sergio.paracuellos@gmail.com>
|
||||
|
||||
description: |+
|
||||
MediaTek MT7621 PCIe subsys supports a single Root Complex (RC)
|
||||
with 3 Root Ports. Each Root Port supports a Gen1 1-lane Link.
|
||||
|
||||
allOf:
|
||||
- $ref: /schemas/pci/pci-bus.yaml#
|
||||
|
||||
properties:
|
||||
compatible:
|
||||
const: mediatek,mt7621-pci
|
||||
|
||||
reg:
|
||||
items:
|
||||
- description: host-pci bridge registers
|
||||
- description: pcie port 0 RC control registers
|
||||
- description: pcie port 1 RC control registers
|
||||
- description: pcie port 2 RC control registers
|
||||
|
||||
ranges:
|
||||
maxItems: 2
|
||||
|
||||
patternProperties:
|
||||
'pcie@[0-2],0':
|
||||
type: object
|
||||
$ref: /schemas/pci/pci-bus.yaml#
|
||||
|
||||
properties:
|
||||
resets:
|
||||
maxItems: 1
|
||||
|
||||
clocks:
|
||||
maxItems: 1
|
||||
|
||||
phys:
|
||||
maxItems: 1
|
||||
|
||||
required:
|
||||
- "#interrupt-cells"
|
||||
- interrupt-map-mask
|
||||
- interrupt-map
|
||||
- resets
|
||||
- clocks
|
||||
- phys
|
||||
- phy-names
|
||||
- ranges
|
||||
|
||||
unevaluatedProperties: false
|
||||
|
||||
required:
|
||||
- compatible
|
||||
- reg
|
||||
- ranges
|
||||
- "#interrupt-cells"
|
||||
- interrupt-map-mask
|
||||
- interrupt-map
|
||||
- reset-gpios
|
||||
|
||||
unevaluatedProperties: false
|
||||
|
||||
examples:
|
||||
- |
|
||||
#include <dt-bindings/gpio/gpio.h>
|
||||
#include <dt-bindings/interrupt-controller/mips-gic.h>
|
||||
|
||||
pcie: pcie@1e140000 {
|
||||
compatible = "mediatek,mt7621-pci";
|
||||
reg = <0x1e140000 0x100>,
|
||||
<0x1e142000 0x100>,
|
||||
<0x1e143000 0x100>,
|
||||
<0x1e144000 0x100>;
|
||||
|
||||
#address-cells = <3>;
|
||||
#size-cells = <2>;
|
||||
pinctrl-names = "default";
|
||||
pinctrl-0 = <&pcie_pins>;
|
||||
device_type = "pci";
|
||||
ranges = <0x02000000 0 0x60000000 0x60000000 0 0x10000000>, /* pci memory */
|
||||
<0x01000000 0 0x1e160000 0x1e160000 0 0x00010000>; /* io space */
|
||||
#interrupt-cells = <1>;
|
||||
interrupt-map-mask = <0xF800 0 0 0>;
|
||||
interrupt-map = <0x0000 0 0 0 &gic GIC_SHARED 4 IRQ_TYPE_LEVEL_HIGH>,
|
||||
<0x0800 0 0 0 &gic GIC_SHARED 24 IRQ_TYPE_LEVEL_HIGH>,
|
||||
<0x1000 0 0 0 &gic GIC_SHARED 25 IRQ_TYPE_LEVEL_HIGH>;
|
||||
reset-gpios = <&gpio 19 GPIO_ACTIVE_LOW>;
|
||||
|
||||
pcie@0,0 {
|
||||
reg = <0x0000 0 0 0 0>;
|
||||
#address-cells = <3>;
|
||||
#size-cells = <2>;
|
||||
device_type = "pci";
|
||||
#interrupt-cells = <1>;
|
||||
interrupt-map-mask = <0 0 0 0>;
|
||||
interrupt-map = <0 0 0 0 &gic GIC_SHARED 4 IRQ_TYPE_LEVEL_HIGH>;
|
||||
resets = <&rstctrl 24>;
|
||||
clocks = <&clkctrl 24>;
|
||||
phys = <&pcie0_phy 1>;
|
||||
phy-names = "pcie-phy0";
|
||||
ranges;
|
||||
};
|
||||
|
||||
pcie@1,0 {
|
||||
reg = <0x0800 0 0 0 0>;
|
||||
#address-cells = <3>;
|
||||
#size-cells = <2>;
|
||||
device_type = "pci";
|
||||
#interrupt-cells = <1>;
|
||||
interrupt-map-mask = <0 0 0 0>;
|
||||
interrupt-map = <0 0 0 0 &gic GIC_SHARED 24 IRQ_TYPE_LEVEL_HIGH>;
|
||||
resets = <&rstctrl 25>;
|
||||
clocks = <&clkctrl 25>;
|
||||
phys = <&pcie0_phy 1>;
|
||||
phy-names = "pcie-phy1";
|
||||
ranges;
|
||||
};
|
||||
|
||||
pcie@2,0 {
|
||||
reg = <0x1000 0 0 0 0>;
|
||||
#address-cells = <3>;
|
||||
#size-cells = <2>;
|
||||
device_type = "pci";
|
||||
#interrupt-cells = <1>;
|
||||
interrupt-map-mask = <0 0 0 0>;
|
||||
interrupt-map = <0 0 0 0 &gic GIC_SHARED 25 IRQ_TYPE_LEVEL_HIGH>;
|
||||
resets = <&rstctrl 26>;
|
||||
clocks = <&clkctrl 26>;
|
||||
phys = <&pcie2_phy 0>;
|
||||
phy-names = "pcie-phy2";
|
||||
ranges;
|
||||
};
|
||||
};
|
||||
...
|
158
Documentation/devicetree/bindings/pci/qcom,pcie-ep.yaml
Normal file
@ -0,0 +1,158 @@
|
||||
# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
|
||||
%YAML 1.2
|
||||
---
|
||||
$id: http://devicetree.org/schemas/pci/qcom,pcie-ep.yaml#
|
||||
$schema: http://devicetree.org/meta-schemas/core.yaml#
|
||||
|
||||
title: Qualcomm PCIe Endpoint Controller binding
|
||||
|
||||
maintainers:
|
||||
- Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
|
||||
|
||||
allOf:
|
||||
- $ref: "pci-ep.yaml#"
|
||||
|
||||
properties:
|
||||
compatible:
|
||||
const: qcom,sdx55-pcie-ep
|
||||
|
||||
reg:
|
||||
items:
|
||||
- description: Qualcomm-specific PARF configuration registers
|
||||
- description: DesignWare PCIe registers
|
||||
- description: External local bus interface registers
|
||||
- description: Address Translation Unit (ATU) registers
|
||||
- description: Memory region used to map remote RC address space
|
||||
- description: BAR memory region
|
||||
|
||||
reg-names:
|
||||
items:
|
||||
- const: parf
|
||||
- const: dbi
|
||||
- const: elbi
|
||||
- const: atu
|
||||
- const: addr_space
|
||||
- const: mmio
|
||||
|
||||
clocks:
|
||||
items:
|
||||
- description: PCIe Auxiliary clock
|
||||
- description: PCIe CFG AHB clock
|
||||
- description: PCIe Master AXI clock
|
||||
- description: PCIe Slave AXI clock
|
||||
- description: PCIe Slave Q2A AXI clock
|
||||
- description: PCIe Sleep clock
|
||||
- description: PCIe Reference clock
|
||||
|
||||
clock-names:
|
||||
items:
|
||||
- const: aux
|
||||
- const: cfg
|
||||
- const: bus_master
|
||||
- const: bus_slave
|
||||
- const: slave_q2a
|
||||
- const: sleep
|
||||
- const: ref
|
||||
|
||||
qcom,perst-regs:
|
||||
description: Reference to a syscon representing TCSR followed by the two
|
||||
offsets within syscon for Perst enable and Perst separation
|
||||
enable registers
|
||||
$ref: "/schemas/types.yaml#/definitions/phandle-array"
|
||||
items:
|
||||
minItems: 3
|
||||
maxItems: 3
|
||||
|
||||
interrupts:
|
||||
items:
|
||||
- description: PCIe Global interrupt
|
||||
- description: PCIe Doorbell interrupt
|
||||
|
||||
interrupt-names:
|
||||
items:
|
||||
- const: global
|
||||
- const: doorbell
|
||||
|
||||
reset-gpios:
|
||||
description: GPIO used as PERST# input signal
|
||||
maxItems: 1
|
||||
|
||||
wake-gpios:
|
||||
description: GPIO used as WAKE# output signal
|
||||
maxItems: 1
|
||||
|
||||
resets:
|
||||
maxItems: 1
|
||||
|
||||
reset-names:
|
||||
const: core
|
||||
|
||||
power-domains:
|
||||
maxItems: 1
|
||||
|
||||
phys:
|
||||
maxItems: 1
|
||||
|
||||
phy-names:
|
||||
const: pciephy
|
||||
|
||||
num-lanes:
|
||||
default: 2
|
||||
|
||||
required:
|
||||
- compatible
|
||||
- reg
|
||||
- reg-names
|
||||
- clocks
|
||||
- clock-names
|
||||
- qcom,perst-regs
|
||||
- interrupts
|
||||
- interrupt-names
|
||||
- reset-gpios
|
||||
- resets
|
||||
- reset-names
|
||||
- power-domains
|
||||
|
||||
unevaluatedProperties: false
|
||||
|
||||
examples:
|
||||
- |
|
||||
#include <dt-bindings/clock/qcom,gcc-sdx55.h>
|
||||
#include <dt-bindings/gpio/gpio.h>
|
||||
#include <dt-bindings/interrupt-controller/arm-gic.h>
|
||||
pcie_ep: pcie-ep@40000000 {
|
||||
compatible = "qcom,sdx55-pcie-ep";
|
||||
reg = <0x01c00000 0x3000>,
|
||||
<0x40000000 0xf1d>,
|
||||
<0x40000f20 0xc8>,
|
||||
<0x40001000 0x1000>,
|
||||
<0x40002000 0x1000>,
|
||||
<0x01c03000 0x3000>;
|
||||
reg-names = "parf", "dbi", "elbi", "atu", "addr_space",
|
||||
"mmio";
|
||||
|
||||
clocks = <&gcc GCC_PCIE_AUX_CLK>,
|
||||
<&gcc GCC_PCIE_CFG_AHB_CLK>,
|
||||
<&gcc GCC_PCIE_MSTR_AXI_CLK>,
|
||||
<&gcc GCC_PCIE_SLV_AXI_CLK>,
|
||||
<&gcc GCC_PCIE_SLV_Q2A_AXI_CLK>,
|
||||
<&gcc GCC_PCIE_SLEEP_CLK>,
|
||||
<&gcc GCC_PCIE_0_CLKREF_CLK>;
|
||||
clock-names = "aux", "cfg", "bus_master", "bus_slave",
|
||||
"slave_q2a", "sleep", "ref";
|
||||
|
||||
qcom,perst-regs = <&tcsr 0xb258 0xb270>;
|
||||
|
||||
interrupts = <GIC_SPI 140 IRQ_TYPE_LEVEL_HIGH>,
|
||||
<GIC_SPI 145 IRQ_TYPE_LEVEL_HIGH>;
|
||||
interrupt-names = "global", "doorbell";
|
||||
reset-gpios = <&tlmm 57 GPIO_ACTIVE_LOW>;
|
||||
wake-gpios = <&tlmm 53 GPIO_ACTIVE_LOW>;
|
||||
resets = <&gcc GCC_PCIE_BCR>;
|
||||
reset-names = "core";
|
||||
power-domains = <&gcc PCIE_GDSC>;
|
||||
phys = <&pcie0_lane>;
|
||||
phy-names = "pciephy";
|
||||
max-link-speed = <3>;
|
||||
num-lanes = <2>;
|
||||
};
|
@ -12,6 +12,7 @@
|
||||
- "qcom,pcie-ipq4019" for ipq4019
|
||||
- "qcom,pcie-ipq8074" for ipq8074
|
||||
- "qcom,pcie-qcs404" for qcs404
|
||||
- "qcom,pcie-sc8180x" for sc8180x
|
||||
- "qcom,pcie-sdm845" for sdm845
|
||||
- "qcom,pcie-sm8250" for sm8250
|
||||
- "qcom,pcie-ipq6018" for ipq6018
|
||||
@ -156,7 +157,7 @@
|
||||
- "pipe" PIPE clock
|
||||
|
||||
- clock-names:
|
||||
Usage: required for sm8250
|
||||
Usage: required for sc8180x and sm8250
|
||||
Value type: <stringlist>
|
||||
Definition: Should contain the following entries
|
||||
- "aux" Auxiliary clock
|
||||
@ -245,7 +246,7 @@
|
||||
- "ahb" AHB reset
|
||||
|
||||
- reset-names:
|
||||
Usage: required for sdm845 and sm8250
|
||||
Usage: required for sc8180x, sdm845 and sm8250
|
||||
Value type: <stringlist>
|
||||
Definition: Should contain the following entries
|
||||
- "pci" PCIe core reset
|
||||
|
141
Documentation/devicetree/bindings/pci/rockchip-dw-pcie.yaml
Normal file
@ -0,0 +1,141 @@
# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
%YAML 1.2
---
$id: http://devicetree.org/schemas/pci/rockchip-dw-pcie.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#

title: DesignWare based PCIe controller on Rockchip SoCs

maintainers:
  - Shawn Lin <shawn.lin@rock-chips.com>
  - Simon Xue <xxm@rock-chips.com>
  - Heiko Stuebner <heiko@sntech.de>

description: |+
  RK3568 SoC PCIe host controller is based on the Synopsys DesignWare
  PCIe IP and thus inherits all the common properties defined in
  designware-pcie.txt.

allOf:
  - $ref: /schemas/pci/pci-bus.yaml#

# We need a select here so we don't match all nodes with 'snps,dw-pcie'
select:
  properties:
    compatible:
      contains:
        const: rockchip,rk3568-pcie
  required:
    - compatible

properties:
  compatible:
    items:
      - const: rockchip,rk3568-pcie
      - const: snps,dw-pcie

  reg:
    items:
      - description: Data Bus Interface (DBI) registers
      - description: Rockchip designed configuration registers
      - description: Config registers

  reg-names:
    items:
      - const: dbi
      - const: apb
      - const: config

  clocks:
    items:
      - description: AHB clock for PCIe master
      - description: AHB clock for PCIe slave
      - description: AHB clock for PCIe dbi
      - description: APB clock for PCIe
      - description: Auxiliary clock for PCIe

  clock-names:
    items:
      - const: aclk_mst
      - const: aclk_slv
      - const: aclk_dbi
      - const: pclk
      - const: aux

  msi-map: true

  num-lanes: true

  phys:
    maxItems: 1

  phy-names:
    const: pcie-phy

  power-domains:
    maxItems: 1

  ranges:
    maxItems: 2

  resets:
    maxItems: 1

  reset-names:
    const: pipe

  vpcie3v3-supply: true

required:
  - compatible
  - reg
  - reg-names
  - clocks
  - clock-names
  - msi-map
  - num-lanes
  - phys
  - phy-names
  - power-domains
  - resets
  - reset-names

unevaluatedProperties: false

examples:
  - |

    bus {
        #address-cells = <2>;
        #size-cells = <2>;

        pcie3x2: pcie@fe280000 {
            compatible = "rockchip,rk3568-pcie", "snps,dw-pcie";
            reg = <0x3 0xc0800000 0x0 0x390000>,
                  <0x0 0xfe280000 0x0 0x10000>,
                  <0x3 0x80000000 0x0 0x100000>;
            reg-names = "dbi", "apb", "config";
            bus-range = <0x20 0x2f>;
            clocks = <&cru 143>, <&cru 144>,
                     <&cru 145>, <&cru 146>,
                     <&cru 147>;
            clock-names = "aclk_mst", "aclk_slv",
                          "aclk_dbi", "pclk",
                          "aux";
            device_type = "pci";
            linux,pci-domain = <2>;
            max-link-speed = <2>;
            msi-map = <0x2000 &its 0x2000 0x1000>;
            num-lanes = <2>;
            phys = <&pcie30phy>;
            phy-names = "pcie-phy";
            power-domains = <&power 15>;
            ranges = <0x81000000 0x0 0x80800000 0x3 0x80800000 0x0 0x100000>,
                     <0x83000000 0x0 0x80900000 0x3 0x80900000 0x0 0x3f700000>;
            resets = <&cru 193>;
            reset-names = "pipe";
            #address-cells = <3>;
            #size-cells = <2>;
        };
    };
...
@ -63,7 +63,6 @@ memory_notify结构体的指针::
|
||||
unsigned long start_pfn;
|
||||
unsigned long nr_pages;
|
||||
int status_change_nid_normal;
|
||||
int status_change_nid_high;
|
||||
int status_change_nid;
|
||||
}
|
||||
|
||||
@ -74,9 +73,6 @@ memory_notify结构体的指针::
|
||||
- status_change_nid_normal是当nodemask的N_NORMAL_MEMORY被设置/清除时设置节
|
||||
点id,如果是-1,则nodemask状态不改变。
|
||||
|
||||
- status_change_nid_high是当nodemask的N_HIGH_MEMORY被设置/清除时设置的节点
|
||||
id,如果这个值为-1,那么nodemask状态不会改变。
|
||||
|
||||
- status_change_nid是当nodemask的N_MEMORY被(将)设置/清除时设置的节点id。这
|
||||
意味着一个新的(没上线的)节点通过联机获得新的内存,而一个节点失去了所有的内
|
||||
存。如果这个值为-1,那么nodemask的状态就不会改变。
|
||||
|
@ -35,13 +35,17 @@ two parts:
|
||||
1. Identification of the monitoring target address range for the address space.
|
||||
2. Access check of specific address range in the target space.
|
||||
|
||||
DAMON currently provides the implementation of the primitives for only the
|
||||
virtual address spaces. Below two subsections describe how it works.
|
||||
DAMON currently provides the implementations of the primitives for the physical
|
||||
and virtual address spaces. Below two subsections describe how those work.
|
||||
|
||||
|
||||
VMA-based Target Address Range Construction
|
||||
-------------------------------------------
|
||||
|
||||
This is only for the virtual address space primitives implementation. That for
|
||||
the physical address space simply asks users to manually set the monitoring
|
||||
target address ranges.
|
||||
|
||||
Only small parts in the super-huge virtual address space of the processes are
|
||||
mapped to the physical memory and accessed. Thus, tracking the unmapped
|
||||
address regions is just wasteful. However, because DAMON can deal with some
|
||||
@ -71,15 +75,18 @@ to make a reasonable trade-off. Below shows this in detail::
|
||||
PTE Accessed-bit Based Access Check
|
||||
-----------------------------------
|
||||
|
||||
The implementation for the virtual address space uses PTE Accessed-bit for
|
||||
basic access checks. It finds the relevant PTE Accessed bit from the address
|
||||
by walking the page table for the target task of the address. In this way, the
|
||||
implementation finds and clears the bit for next sampling target address and
|
||||
checks whether the bit set again after one sampling period. This could disturb
|
||||
other kernel subsystems using the Accessed bits, namely Idle page tracking and
|
||||
the reclaim logic. To avoid such disturbances, DAMON makes it mutually
|
||||
exclusive with Idle page tracking and uses ``PG_idle`` and ``PG_young`` page
|
||||
flags to solve the conflict with the reclaim logic, as Idle page tracking does.
|
||||
Both of the implementations for physical and virtual address spaces use PTE
|
||||
Accessed-bit for basic access checks. Only one difference is the way of
|
||||
finding the relevant PTE Accessed bit(s) from the address. While the
|
||||
implementation for the virtual address walks the page table for the target task
|
||||
of the address, the implementation for the physical address walks every page
|
||||
table having a mapping to the address. In this way, the implementations find
|
||||
and clear the bit(s) for next sampling target address and checks whether the
|
||||
bit(s) set again after one sampling period. This could disturb other kernel
|
||||
subsystems using the Accessed bits, namely Idle page tracking and the reclaim
|
||||
logic. To avoid such disturbances, DAMON makes it mutually exclusive with Idle
|
||||
page tracking and uses ``PG_idle`` and ``PG_young`` page flags to solve the
|
||||
conflict with the reclaim logic, as Idle page tracking does.
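
Review note: the sampling step described in this paragraph (clear the Accessed bit, wait one sampling period, check whether the bit was set again) can be pictured with a toy user-space simulation. This is only a sketch and not DAMON code: `ToyPage`, `sample_once` and `count_accesses` are hypothetical names, and a plain boolean stands in for the PTE Accessed bit.

```python
from dataclasses import dataclass


@dataclass
class ToyPage:
    """Hypothetical stand-in for a page; `accessed` models the PTE Accessed bit."""
    accessed: bool = False

    def touch(self) -> None:
        # Real hardware sets the Accessed bit on a load or store to the page.
        self.accessed = True


def sample_once(page: ToyPage, workload) -> bool:
    """Clear the bit, let one sampling period elapse, report whether it was set again."""
    page.accessed = False   # prepare: clear the bit for the sampling target
    workload(page)          # one sampling period of simulated activity
    return page.accessed    # check: was the page touched during the period?


def count_accesses(page: ToyPage, workloads) -> int:
    """Count the sampling periods in which the page was observed as accessed."""
    return sum(sample_once(page, w) for w in workloads)


if __name__ == "__main__":
    busy = lambda p: p.touch()
    idle = lambda p: None
    print(count_accesses(ToyPage(), [busy, idle, busy]))
```

The toy model also shows why the text mentions interference with other users of the Accessed bit: `sample_once` overwrites whatever state the bit had before the sample started.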
|
||||
|
||||
|
||||
Address Space Independent Core Mechanisms
|
||||
|
@ -36,10 +36,9 @@ constructions and actual access checks can be implemented and configured on the
|
||||
DAMON core by the users. In this way, DAMON users can monitor any address
|
||||
space with any access check technique.
|
||||
|
||||
Nonetheless, DAMON provides vma tracking and PTE Accessed bit check based
|
||||
Nonetheless, DAMON provides vma/rmap tracking and PTE Accessed bit check based
|
||||
implementations of the address space dependent functions for the virtual memory
|
||||
by default, for a reference and convenient use. In near future, we will
|
||||
provide those for physical memory address space.
|
||||
and the physical memory by default, for a reference and convenient use.
|
||||
|
||||
|
||||
Can I simply monitor page granularity?
|
||||
|
@ -27,4 +27,3 @@ workloads and systems.
|
||||
faq
|
||||
design
|
||||
api
|
||||
plans
|
||||
|
@ -3,27 +3,11 @@ Linux Memory Management Documentation
|
||||
=====================================
|
||||
|
||||
This is a collection of documents about the Linux memory management (mm)
|
||||
subsystem. If you are looking for advice on simply allocating memory,
|
||||
see the :ref:`memory_allocation`.
|
||||
|
||||
User guides for MM features
|
||||
===========================
|
||||
|
||||
The following documents provide guides for controlling and tuning
|
||||
various features of the Linux memory management
|
||||
|
||||
.. toctree::
|
||||
:maxdepth: 1
|
||||
|
||||
swap_numa
|
||||
zswap
|
||||
|
||||
Kernel developers MM documentation
|
||||
==================================
|
||||
|
||||
The below documents describe MM internals with different level of
|
||||
details ranging from notes and mailing list responses to elaborate
|
||||
descriptions of data structures and algorithms.
|
||||
subsystem internals with different level of details ranging from notes and
|
||||
mailing list responses for elaborating descriptions of data structures and
|
||||
algorithms. If you are looking for advice on simply allocating memory, see the
|
||||
:ref:`memory_allocation`. For controlling and tuning guides, see the
|
||||
:doc:`admin guide <../admin-guide/mm/index>`.
|
||||
|
||||
.. toctree::
|
||||
:maxdepth: 1
|
||||
|
@ -85,5 +85,26 @@ Usage
|
||||
cat /sys/kernel/debug/page_owner > page_owner_full.txt
|
||||
./page_owner_sort page_owner_full.txt sorted_page_owner.txt
|
||||
|
||||
The general output of ``page_owner_full.txt`` is as follows:
|
||||
|
||||
Page allocated via order XXX, ...
|
||||
PFN XXX ...
|
||||
// Detailed stack
|
||||
|
||||
Page allocated via order XXX, ...
|
||||
PFN XXX ...
|
||||
// Detailed stack
|
||||
|
||||
The ``page_owner_sort`` tool ignores ``PFN`` rows, puts the remaining rows
|
||||
in buf, uses regexp to extract the page order value, counts the times
|
||||
and pages of buf, and finally sorts them according to the times.
|
||||
|
||||
See the result about who allocated each page
|
||||
in the ``sorted_page_owner.txt``.
|
||||
in the ``sorted_page_owner.txt``. General output:
|
||||
|
||||
XXX times, XXX pages:
|
||||
Page allocated via order XXX, ...
|
||||
// Detailed stack
|
||||
|
||||
By default, ``page_owner_sort`` is sorted according to the times of buf.
|
||||
If you want to sort by the pages nums of buf, use the ``-m`` parameter.
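
Review note: the processing that the new text attributes to ``page_owner_sort`` can be sketched in a few lines of Python. This follows only the description in this hunk, not the tool's C source; the function name `summarize`, the blank-line record separator, and the rule that an order-N allocation counts as 2**N pages are assumptions made for the illustration.

```python
import re
from collections import defaultdict

# Extracts the page order from a row such as "Page allocated via order 2, ...".
ORDER_RE = re.compile(r"order\s*(\d+),")


def summarize(text: str, by_pages: bool = False):
    """Group identical records; return (times, pages, record) tuples, largest first."""
    times = defaultdict(int)
    pages = defaultdict(int)
    for block in re.split(r"\n\s*\n", text.strip()):
        # Drop "PFN ..." rows so records that differ only in PFN compare equal.
        rows = [r for r in block.splitlines() if not r.startswith("PFN")]
        if not rows:
            continue
        buf = "\n".join(rows)
        match = ORDER_RE.search(buf)
        order = int(match.group(1)) if match else 0
        times[buf] += 1
        pages[buf] += 1 << order  # assumption: order N means 2**N pages
    key = (lambda b: pages[b]) if by_pages else (lambda b: times[b])
    return [(times[b], pages[b], b) for b in sorted(times, key=key, reverse=True)]
```

Calling `summarize(text)` mirrors the default ordering by times, while `summarize(text, by_pages=True)` mirrors what the text describes for the ``-m`` parameter.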
|
||||
|
42
MAINTAINERS
@ -1297,6 +1297,13 @@ S: Maintained
|
||||
F: Documentation/devicetree/bindings/iommu/apple,dart.yaml
|
||||
F: drivers/iommu/apple-dart.c
|
||||
|
||||
APPLE PCIE CONTROLLER DRIVER
|
||||
M: Alyssa Rosenzweig <alyssa@rosenzweig.io>
|
||||
M: Marc Zyngier <maz@kernel.org>
|
||||
L: linux-pci@vger.kernel.org
|
||||
S: Maintained
|
||||
F: drivers/pci/controller/pcie-apple.c
|
||||
|
||||
APPLE SMC DRIVER
|
||||
M: Henrik Rydberg <rydberg@bitmath.org>
|
||||
L: linux-hwmon@vger.kernel.org
|
||||
@ -5220,7 +5227,7 @@ F: net/ax25/ax25_timer.c
|
||||
F: net/ax25/sysctl_net_ax25.c
|
||||
|
||||
DATA ACCESS MONITOR
|
||||
M: SeongJae Park <sjpark@amazon.de>
|
||||
M: SeongJae Park <sj@kernel.org>
|
||||
L: linux-mm@kvack.org
|
||||
S: Maintained
|
||||
F: Documentation/admin-guide/mm/damon/
|
||||
@ -12005,6 +12012,12 @@ S: Maintained
|
||||
F: Documentation/devicetree/bindings/i2c/i2c-mt7621.txt
|
||||
F: drivers/i2c/busses/i2c-mt7621.c
|
||||
|
||||
MEDIATEK MT7621 PCIE CONTROLLER DRIVER
|
||||
M: Sergio Paracuellos <sergio.paracuellos@gmail.com>
|
||||
S: Maintained
|
||||
F: Documentation/devicetree/bindings/pci/mediatek,mt7621-pcie.yaml
|
||||
F: drivers/pci/controller/pcie-mt7621.c
|
||||
|
||||
MEDIATEK MT7621 PHY PCI DRIVER
|
||||
M: Sergio Paracuellos <sergio.paracuellos@gmail.com>
|
||||
S: Maintained
|
||||
@ -14647,9 +14660,12 @@ M: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
|
||||
R: Krzysztof Wilczyński <kw@linux.com>
|
||||
L: linux-pci@vger.kernel.org
|
||||
S: Supported
|
||||
Q: https://patchwork.kernel.org/project/linux-pci/list/
|
||||
B: https://bugzilla.kernel.org
|
||||
C: irc://irc.oftc.net/linux-pci
|
||||
T: git git://git.kernel.org/pub/scm/linux/kernel/git/lpieralisi/pci.git
|
||||
F: Documentation/PCI/endpoint/*
|
||||
F: Documentation/misc-devices/pci-endpoint-test.rst
|
||||
T: git git://git.kernel.org/pub/scm/linux/kernel/git/kishon/pci-endpoint.git
|
||||
F: drivers/misc/pci_endpoint_test.c
|
||||
F: drivers/pci/endpoint/
|
||||
F: tools/pci/
|
||||
@ -14695,15 +14711,21 @@ R: Rob Herring <robh@kernel.org>
|
||||
R: Krzysztof Wilczyński <kw@linux.com>
|
||||
L: linux-pci@vger.kernel.org
|
||||
S: Supported
|
||||
Q: http://patchwork.ozlabs.org/project/linux-pci/list/
|
||||
T: git git://git.kernel.org/pub/scm/linux/kernel/git/lpieralisi/pci.git/
|
||||
Q: https://patchwork.kernel.org/project/linux-pci/list/
|
||||
B: https://bugzilla.kernel.org
|
||||
C: irc://irc.oftc.net/linux-pci
|
||||
T: git git://git.kernel.org/pub/scm/linux/kernel/git/lpieralisi/pci.git
|
||||
F: drivers/pci/controller/
|
||||
F: drivers/pci/pci-bridge-emul.c
|
||||
F: drivers/pci/pci-bridge-emul.h
|
||||
|
||||
PCI SUBSYSTEM
|
||||
M: Bjorn Helgaas <bhelgaas@google.com>
|
||||
L: linux-pci@vger.kernel.org
|
||||
S: Supported
|
||||
Q: http://patchwork.ozlabs.org/project/linux-pci/list/
|
||||
Q: https://patchwork.kernel.org/project/linux-pci/list/
|
||||
B: https://bugzilla.kernel.org
|
||||
C: irc://irc.oftc.net/linux-pci
|
||||
T: git git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci.git
|
||||
F: Documentation/PCI/
|
||||
F: Documentation/devicetree/bindings/pci/
|
||||
@ -14803,7 +14825,15 @@ M: Stanimir Varbanov <svarbanov@mm-sol.com>
|
||||
L: linux-pci@vger.kernel.org
|
||||
L: linux-arm-msm@vger.kernel.org
|
||||
S: Maintained
|
||||
F: drivers/pci/controller/dwc/*qcom*
|
||||
F: drivers/pci/controller/dwc/pcie-qcom.c
|
||||
|
||||
PCIE ENDPOINT DRIVER FOR QUALCOMM
|
||||
M: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
|
||||
L: linux-pci@vger.kernel.org
|
||||
L: linux-arm-msm@vger.kernel.org
|
||||
S: Maintained
|
||||
F: Documentation/devicetree/bindings/pci/qcom,pcie-ep.yaml
|
||||
F: drivers/pci/controller/dwc/pcie-qcom-ep.c
|
||||
|
||||
PCIE DRIVER FOR ROCKCHIP
|
||||
M: Shawn Lin <shawn.lin@rock-chips.com>
|
||||
|
15
Makefile
@ -1015,6 +1015,21 @@ ifdef CONFIG_CC_IS_GCC
|
||||
KBUILD_CFLAGS += -Wno-maybe-uninitialized
|
||||
endif
|
||||
|
||||
ifdef CONFIG_CC_IS_GCC
|
||||
# The allocators already balk at large sizes, so silence the compiler
|
||||
# warnings for bounds checks involving those possible values. While
|
||||
# -Wno-alloc-size-larger-than would normally be used here, earlier versions
|
||||
# of gcc (<9.1) weirdly don't handle the option correctly when _other_
|
||||
# warnings are produced (?!). Using -Walloc-size-larger-than=SIZE_MAX
|
||||
# doesn't work (as it is documented to), silently resolving to "0" prior to
|
||||
# version 9.1 (and producing an error more recently). Numeric values larger
|
||||
# than PTRDIFF_MAX also don't work prior to version 9.1, which are silently
|
||||
# ignored, continuing to default to PTRDIFF_MAX. So, left with no other
|
||||
# choice, we must perform a versioned check to disable this warning.
|
||||
# https://lore.kernel.org/lkml/20210824115859.187f272f@canb.auug.org.au
|
||||
KBUILD_CFLAGS += $(call cc-ifversion, -ge, 0901, -Wno-alloc-size-larger-than)
|
||||
endif
|
||||
|
||||
# disable invalid "can't wrap" optimizations for signed / pointers
|
||||
KBUILD_CFLAGS += -fno-strict-overflow
|
||||
|
||||
|
@ -233,7 +233,7 @@ albacore_init_arch(void)
|
||||
unsigned long size;
|
||||
|
||||
size = initrd_end - initrd_start;
|
||||
memblock_free(__pa(initrd_start), PAGE_ALIGN(size));
|
||||
memblock_free((void *)initrd_start, PAGE_ALIGN(size));
|
||||
if (!move_initrd(pci_mem))
|
||||
printk("irongate_init_arch: initrd too big "
|
||||
"(%ldK)\ndisabling initrd\n",
|
||||
|
@ -59,13 +59,13 @@ void __init early_init_dt_add_memory_arch(u64 base, u64 size)
|
||||
|
||||
low_mem_sz = size;
|
||||
in_use = 1;
|
||||
memblock_add_node(base, size, 0);
|
||||
memblock_add_node(base, size, 0, MEMBLOCK_NONE);
|
||||
} else {
|
||||
#ifdef CONFIG_HIGHMEM
|
||||
high_mem_start = base;
|
||||
high_mem_sz = size;
|
||||
in_use = 1;
|
||||
memblock_add_node(base, size, 1);
|
||||
memblock_add_node(base, size, 1, MEMBLOCK_NONE);
|
||||
memblock_reserve(base, size);
|
||||
#endif
|
||||
}
|
||||
@ -173,7 +173,7 @@ static void __init highmem_init(void)
|
||||
#ifdef CONFIG_HIGHMEM
|
||||
unsigned long tmp;
|
||||
|
||||
memblock_free(high_mem_start, high_mem_sz);
|
||||
memblock_phys_free(high_mem_start, high_mem_sz);
|
||||
for (tmp = min_high_pfn; tmp < max_high_pfn; tmp++)
|
||||
free_highmem_page(pfn_to_page(tmp));
|
||||
#endif
|
||||
|
@ -339,7 +339,7 @@ err_fabric:
|
||||
err_sysctrl:
|
||||
iounmap(relocation);
|
||||
err_reloc:
|
||||
memblock_free(hip04_boot_method[0], hip04_boot_method[1]);
|
||||
memblock_phys_free(hip04_boot_method[0], hip04_boot_method[1]);
|
||||
err:
|
||||
return ret;
|
||||
}
|
||||
|
@ -158,7 +158,7 @@ phys_addr_t __init arm_memblock_steal(phys_addr_t size, phys_addr_t align)
|
||||
panic("Failed to steal %pa bytes at %pS\n",
|
||||
&size, (void *)_RET_IP_);
|
||||
|
||||
memblock_free(phys, size);
|
||||
memblock_phys_free(phys, size);
|
||||
memblock_remove(phys, size);
|
||||
|
||||
return phys;
|
||||
|
@ -1163,6 +1163,10 @@ config NEED_PER_CPU_EMBED_FIRST_CHUNK
|
||||
def_bool y
|
||||
depends on NUMA
|
||||
|
||||
config NEED_PER_CPU_PAGE_FIRST_CHUNK
|
||||
def_bool y
|
||||
depends on NUMA
|
||||
|
||||
source "kernel/Kconfig.hz"
|
||||
|
||||
config ARCH_SPARSEMEM_ENABLE
|
||||
|
@ -287,6 +287,22 @@ static void __init kasan_init_depth(void)
|
||||
init_task.kasan_depth = 0;
|
||||
}
|
||||
|
||||
#ifdef CONFIG_KASAN_VMALLOC
|
||||
void __init kasan_populate_early_vm_area_shadow(void *start, unsigned long size)
|
||||
{
|
||||
unsigned long shadow_start, shadow_end;
|
||||
|
||||
if (!is_vmalloc_or_module_addr(start))
|
||||
return;
|
||||
|
||||
shadow_start = (unsigned long)kasan_mem_to_shadow(start);
|
||||
shadow_start = ALIGN_DOWN(shadow_start, PAGE_SIZE);
|
||||
shadow_end = (unsigned long)kasan_mem_to_shadow(start + size);
|
||||
shadow_end = ALIGN(shadow_end, PAGE_SIZE);
|
||||
kasan_map_populate(shadow_start, shadow_end, NUMA_NO_NODE);
|
||||
}
|
||||
#endif
|
||||
|
||||
void __init kasan_init(void)
|
||||
{
|
||||
kasan_init_shadow();
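
Review note: the address arithmetic in `kasan_populate_early_vm_area_shadow()` above (translate both ends of the region to shadow addresses, round the start down and the end up to page boundaries) can be checked with a small Python model. The constants below are assumptions for illustration only: 4 KiB pages, a scale shift of 3, and a hypothetical shadow offset; real values depend on the architecture and kernel configuration.

```python
PAGE_SIZE = 4096                          # assumption: 4 KiB pages
KASAN_SHADOW_SCALE_SHIFT = 3              # assumption: one shadow byte per 8 bytes
KASAN_SHADOW_OFFSET = 0xDFFF800000000000  # hypothetical, configuration dependent
U64_MASK = (1 << 64) - 1                  # model 64-bit unsigned long arithmetic


def mem_to_shadow(addr, offset=KASAN_SHADOW_OFFSET):
    """Model of kasan_mem_to_shadow(): scale the address, then add the offset."""
    return ((addr >> KASAN_SHADOW_SCALE_SHIFT) + offset) & U64_MASK


def align_down(value, alignment):
    """Round down to a power-of-two alignment, like ALIGN_DOWN()."""
    return value & ~(alignment - 1)


def align_up(value, alignment):
    """Round up to a power-of-two alignment, like ALIGN()."""
    return (value + alignment - 1) & ~(alignment - 1)


def shadow_range(start, size, offset=KASAN_SHADOW_OFFSET):
    """Return the page-aligned shadow range covering [start, start + size)."""
    shadow_start = align_down(mem_to_shadow(start, offset), PAGE_SIZE)
    shadow_end = align_up(mem_to_shadow(start + size, offset), PAGE_SIZE)
    return shadow_start, shadow_end
```

Rounding outward in both directions means the populated shadow always covers the whole region, at the cost of mapping up to one extra partial page on each side.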
|
||||
|
@ -738,8 +738,8 @@ void __init paging_init(void)
|
||||
cpu_replace_ttbr1(lm_alias(swapper_pg_dir));
|
||||
init_mm.pgd = swapper_pg_dir;
|
||||
|
||||
memblock_free(__pa_symbol(init_pg_dir),
|
||||
__pa_symbol(init_pg_end) - __pa_symbol(init_pg_dir));
|
||||
memblock_phys_free(__pa_symbol(init_pg_dir),
|
||||
__pa_symbol(init_pg_end) - __pa_symbol(init_pg_dir));
|
||||
|
||||
memblock_allow_resize();
|
||||
}
|
||||
|
@ -153,7 +153,7 @@ find_memory (void)
|
||||
efi_memmap_walk(find_max_min_low_pfn, NULL);
|
||||
max_pfn = max_low_pfn;
|
||||
|
||||
memblock_add_node(0, PFN_PHYS(max_low_pfn), 0);
|
||||
memblock_add_node(0, PFN_PHYS(max_low_pfn), 0, MEMBLOCK_NONE);
|
||||
|
||||
find_initrd();
|
||||
|
||||
|
@ -378,7 +378,7 @@ int __init register_active_ranges(u64 start, u64 len, int nid)
|
||||
#endif
|
||||
|
||||
if (start < end)
|
||||
memblock_add_node(__pa(start), end - start, nid);
|
||||
memblock_add_node(__pa(start), end - start, nid, MEMBLOCK_NONE);
|
||||
return 0;
|
||||
}
|
||||
|
||||
|
@ -174,7 +174,8 @@ void __init cf_bootmem_alloc(void)
|
||||
m68k_memory[0].addr = _rambase;
|
||||
m68k_memory[0].size = _ramend - _rambase;
|
||||
|
||||
memblock_add_node(m68k_memory[0].addr, m68k_memory[0].size, 0);
|
||||
memblock_add_node(m68k_memory[0].addr, m68k_memory[0].size, 0,
|
||||
MEMBLOCK_NONE);
|
||||
|
||||
/* compute total pages in system */
|
||||
num_pages = PFN_DOWN(_ramend - _rambase);
|
||||
|
@ -410,7 +410,8 @@ void __init paging_init(void)
|
||||
|
||||
min_addr = m68k_memory[0].addr;
|
||||
max_addr = min_addr + m68k_memory[0].size;
|
||||
memblock_add_node(m68k_memory[0].addr, m68k_memory[0].size, 0);
|
||||
memblock_add_node(m68k_memory[0].addr, m68k_memory[0].size, 0,
|
||||
MEMBLOCK_NONE);
|
||||
for (i = 1; i < m68k_num_memory;) {
|
||||
if (m68k_memory[i].addr < min_addr) {
|
||||
printk("Ignoring memory chunk at 0x%lx:0x%lx before the first chunk\n",
|
||||
@ -421,7 +422,8 @@ void __init paging_init(void)
|
||||
(m68k_num_memory - i) * sizeof(struct m68k_mem_info));
|
||||
continue;
|
||||
}
|
||||
memblock_add_node(m68k_memory[i].addr, m68k_memory[i].size, i);
|
||||
memblock_add_node(m68k_memory[i].addr, m68k_memory[i].size, i,
|
||||
MEMBLOCK_NONE);
|
||||
addr = m68k_memory[i].addr + m68k_memory[i].size;
|
||||
if (addr > max_addr)
|
||||
max_addr = addr;
|
||||
|
@ -587,13 +587,12 @@ static void pcibios_fixup_resources(struct pci_dev *dev)
|
||||
}
|
||||
DECLARE_PCI_FIXUP_HEADER(PCI_ANY_ID, PCI_ANY_ID, pcibios_fixup_resources);
|
||||
|
||||
int pcibios_add_device(struct pci_dev *dev)
|
||||
int pcibios_device_add(struct pci_dev *dev)
|
||||
{
|
||||
dev->irq = of_irq_parse_and_map_pci(dev, 0, 0);
|
||||
|
||||
return 0;
|
||||
}
|
||||
EXPORT_SYMBOL(pcibios_add_device);
|
||||
|
||||
/*
|
||||
* Reparent resource children of pr that conflict with res
|
||||
|
@ -77,7 +77,9 @@ void __init szmem(unsigned int node)
|
||||
(u32)node_id, mem_type, mem_start, mem_size);
|
||||
pr_info(" start_pfn:0x%llx, end_pfn:0x%llx, num_physpages:0x%lx\n",
|
||||
start_pfn, end_pfn, num_physpages);
|
||||
memblock_add_node(PFN_PHYS(start_pfn), PFN_PHYS(node_psize), node);
|
||||
memblock_add_node(PFN_PHYS(start_pfn),
|
||||
PFN_PHYS(node_psize), node,
|
||||
MEMBLOCK_NONE);
|
||||
break;
|
||||
case SYSTEM_RAM_RESERVED:
|
||||
pr_info("Node%d: mem_type:%d, mem_start:0x%llx, mem_size:0x%llx MB\n",
|
||||
|
@ -529,7 +529,7 @@ static void * __init pcpu_fc_alloc(unsigned int cpu, size_t size,
|
||||
|
||||
static void __init pcpu_fc_free(void *ptr, size_t size)
|
||||
{
|
||||
memblock_free_early(__pa(ptr), size);
|
||||
memblock_free(ptr, size);
|
||||
}
|
||||
|
||||
void __init setup_per_cpu_areas(void)
|
||||
|
@ -51,7 +51,8 @@ choice
|
||||
select SYS_SUPPORTS_HIGHMEM
|
||||
select MIPS_GIC
|
||||
select CLKSRC_MIPS_GIC
|
||||
select HAVE_PCI if PCI_MT7621
|
||||
select HAVE_PCI
|
||||
select PCI_DRIVERS_GENERIC
|
||||
select SOC_BUS
|
||||
endchoice
|
||||
|
||||
|
@ -341,7 +341,8 @@ static void __init szmem(void)
|
||||
continue;
|
||||
}
|
||||
memblock_add_node(PFN_PHYS(slot_getbasepfn(node, slot)),
|
||||
PFN_PHYS(slot_psize), node);
|
||||
PFN_PHYS(slot_psize), node,
|
||||
MEMBLOCK_NONE);
|
||||
}
|
||||
}
|
||||
}
|
||||
|
@ -69,10 +69,10 @@ static void __init ip30_mem_init(void)
|
||||
total_mem += size;
|
||||
|
||||
if (addr >= IP30_REAL_MEMORY_START)
|
||||
memblock_free(addr, size);
|
||||
memblock_phys_free(addr, size);
|
||||
else if ((addr + size) > IP30_REAL_MEMORY_START)
|
||||
memblock_free(IP30_REAL_MEMORY_START,
|
||||
size - IP30_MAX_PROM_MEMORY);
|
||||
memblock_phys_free(IP30_REAL_MEMORY_START,
|
||||
size - IP30_MAX_PROM_MEMORY);
|
||||
}
|
||||
pr_info("Detected %luMB of physical memory.\n", MEM_SHIFT(total_mem));
|
||||
}
|
||||
|
@ -274,7 +274,6 @@ CONFIG_NLS_UTF8=y
|
||||
CONFIG_ENCRYPTED_KEYS=y
|
||||
CONFIG_SECURITY=y
|
||||
CONFIG_HARDENED_USERCOPY=y
|
||||
# CONFIG_HARDENED_USERCOPY_FALLBACK is not set
|
||||
CONFIG_HARDENED_USERCOPY_PAGESPAN=y
|
||||
CONFIG_FORTIFY_SOURCE=y
|
||||
CONFIG_SECURITY_LOCKDOWN_LSM=y
|
||||
|
@ -31,7 +31,7 @@ struct machdep_calls {
|
||||
#ifdef CONFIG_PM
|
||||
void (*iommu_restore)(void);
|
||||
#endif
|
||||
#ifdef CONFIG_MEMORY_HOTPLUG_SPARSE
|
||||
#ifdef CONFIG_MEMORY_HOTPLUG
|
||||
unsigned long (*memory_block_size)(void);
|
||||
#endif
|
||||
#endif /* CONFIG_PPC64 */
|
||||
|
@ -55,11 +55,6 @@ void eeh_pe_dev_mode_mark(struct eeh_pe *pe, int mode);
|
||||
void eeh_sysfs_add_device(struct pci_dev *pdev);
|
||||
void eeh_sysfs_remove_device(struct pci_dev *pdev);
|
||||
|
||||
static inline const char *eeh_driver_name(struct pci_dev *pdev)
|
||||
{
|
||||
return (pdev && pdev->driver) ? pdev->driver->name : "<null>";
|
||||
}
|
||||
|
||||
#endif /* CONFIG_EEH */
|
||||
|
||||
#define PCI_BUSNO(bdfn) ((bdfn >> 8) & 0xff)
|
||||
|
@ -6,21 +6,8 @@
|
||||
#include <linux/elf.h>
|
||||
#include <linux/uaccess.h>
|
||||
|
||||
#define arch_is_kernel_initmem_freed arch_is_kernel_initmem_freed
|
||||
|
||||
#include <asm-generic/sections.h>
|
||||
|
||||
extern bool init_mem_is_free;
|
||||
|
||||
static inline int arch_is_kernel_initmem_freed(unsigned long addr)
|
||||
{
|
||||
if (!init_mem_is_free)
|
||||
return 0;
|
||||
|
||||
return addr >= (unsigned long)__init_begin &&
|
||||
addr < (unsigned long)__init_end;
|
||||
}
|
||||
|
||||
extern char __head_end[];
|
||||
|
||||
#ifdef __powerpc64__
|
||||
|
@ -1095,8 +1095,8 @@ static int __init dt_cpu_ftrs_scan_callback(unsigned long node, const char
|
||||
|
||||
cpufeatures_setup_finished();
|
||||
|
||||
memblock_free(__pa(dt_cpu_features),
|
||||
sizeof(struct dt_cpu_feature)*nr_dt_cpu_features);
|
||||
memblock_free(dt_cpu_features,
|
||||
sizeof(struct dt_cpu_feature) * nr_dt_cpu_features);
|
||||
|
||||
return 0;
|
||||
}
|
||||
|
@ -399,6 +399,14 @@ out:
|
||||
return ret;
|
||||
}
|
||||
|
||||
static inline const char *eeh_driver_name(struct pci_dev *pdev)
|
||||
{
|
||||
if (pdev)
|
||||
return dev_driver_string(&pdev->dev);
|
||||
|
||||
return "<null>";
|
||||
}
|
||||
|
||||
/**
|
||||
* eeh_dev_check_failure - Check if all 1's data is due to EEH slot freeze
|
||||
* @edev: eeh device
|
||||
|
@ -104,13 +104,13 @@ static bool eeh_edev_actionable(struct eeh_dev *edev)
|
||||
*/
|
||||
static inline struct pci_driver *eeh_pcid_get(struct pci_dev *pdev)
|
||||
{
|
||||
if (!pdev || !pdev->driver)
|
||||
if (!pdev || !pdev->dev.driver)
|
||||
return NULL;
|
||||
|
||||
if (!try_module_get(pdev->driver->driver.owner))
|
||||
if (!try_module_get(pdev->dev.driver->owner))
|
||||
return NULL;
|
||||
|
||||
return pdev->driver;
|
||||
return to_pci_driver(pdev->dev.driver);
|
||||
}
|
||||
|
||||
/**
|
||||
@ -122,10 +122,10 @@ static inline struct pci_driver *eeh_pcid_get(struct pci_dev *pdev)
|
||||
*/
|
||||
static inline void eeh_pcid_put(struct pci_dev *pdev)
|
||||
{
|
||||
if (!pdev || !pdev->driver)
|
||||
if (!pdev || !pdev->dev.driver)
|
||||
return;
|
||||
|
||||
module_put(pdev->driver->driver.owner);
|
||||
module_put(pdev->dev.driver->owner);
|
||||
}
|
||||
|
||||
/**
|
||||
|
@ -322,8 +322,8 @@ void __init free_unused_pacas(void)
|
||||
|
||||
new_ptrs_size = sizeof(struct paca_struct *) * nr_cpu_ids;
|
||||
if (new_ptrs_size < paca_ptrs_size)
|
||||
memblock_free(__pa(paca_ptrs) + new_ptrs_size,
|
||||
paca_ptrs_size - new_ptrs_size);
|
||||
memblock_phys_free(__pa(paca_ptrs) + new_ptrs_size,
|
||||
paca_ptrs_size - new_ptrs_size);
|
||||
|
||||
paca_nr_cpu_ids = nr_cpu_ids;
|
||||
paca_ptrs_size = new_ptrs_size;
|
||||
@ -331,8 +331,8 @@ void __init free_unused_pacas(void)
|
||||
#ifdef CONFIG_PPC_BOOK3S_64
|
||||
if (early_radix_enabled()) {
|
||||
/* Ugly fixup, see new_slb_shadow() */
|
||||
memblock_free(__pa(paca_ptrs[boot_cpuid]->slb_shadow_ptr),
|
||||
sizeof(struct slb_shadow));
|
||||
memblock_phys_free(__pa(paca_ptrs[boot_cpuid]->slb_shadow_ptr),
|
||||
sizeof(struct slb_shadow));
|
||||
paca_ptrs[boot_cpuid]->slb_shadow_ptr = NULL;
|
||||
}
|
||||
#endif
|
||||
|
@ -1059,7 +1059,7 @@ void pcibios_bus_add_device(struct pci_dev *dev)
|
||||
ppc_md.pcibios_bus_add_device(dev);
|
||||
}
|
||||
|
||||
int pcibios_add_device(struct pci_dev *dev)
|
||||
int pcibios_device_add(struct pci_dev *dev)
|
||||
{
|
||||
struct irq_domain *d;
|
||||
|
||||
|
@ -822,7 +822,7 @@ static void __init smp_setup_pacas(void)
|
||||
set_hard_smp_processor_id(cpu, cpu_to_phys_id[cpu]);
|
||||
}
|
||||
|
||||
memblock_free(__pa(cpu_to_phys_id), nr_cpu_ids * sizeof(u32));
|
||||
memblock_free(cpu_to_phys_id, nr_cpu_ids * sizeof(u32));
|
||||
cpu_to_phys_id = NULL;
|
||||
}
|
||||
#endif
|
||||
|
@ -812,7 +812,7 @@ static void * __init pcpu_alloc_bootmem(unsigned int cpu, size_t size,
|
||||
|
||||
static void __init pcpu_free_bootmem(void *ptr, size_t size)
|
||||
{
|
||||
memblock_free(__pa(ptr), size);
|
||||
memblock_free(ptr, size);
|
||||
}
|
||||
|
||||
static int pcpu_cpu_distance(unsigned int from, unsigned int to)
|
||||
@ -912,7 +912,7 @@ void __init setup_per_cpu_areas(void)
|
||||
}
|
||||
#endif
|
||||
|
||||
#ifdef CONFIG_MEMORY_HOTPLUG_SPARSE
|
||||
#ifdef CONFIG_MEMORY_HOTPLUG
|
||||
unsigned long memory_block_size_bytes(void)
|
||||
{
|
||||
if (ppc_md.memory_block_size)
|
||||
|
@ -229,17 +229,22 @@ static int __init pseries_alloc_bootmem_huge_page(struct hstate *hstate)
|
||||
m->hstate = hstate;
|
||||
return 1;
|
||||
}
|
||||
|
||||
bool __init hugetlb_node_alloc_supported(void)
|
||||
{
|
||||
return false;
|
||||
}
|
||||
#endif
|
||||
|
||||
|
||||
int __init alloc_bootmem_huge_page(struct hstate *h)
|
||||
int __init alloc_bootmem_huge_page(struct hstate *h, int nid)
|
||||
{
|
||||
|
||||
#ifdef CONFIG_PPC_BOOK3S_64
|
||||
if (firmware_has_feature(FW_FEATURE_LPAR) && !radix_enabled())
|
||||
return pseries_alloc_bootmem_huge_page(h);
|
||||
#endif
|
||||
return __alloc_bootmem_huge_page(h);
|
||||
return __alloc_bootmem_huge_page(h, nid);
|
||||
}
|
||||
|
||||
#ifndef CONFIG_PPC_BOOK3S_64
|
||||
|
@ -2981,7 +2981,7 @@ static void __init pnv_pci_init_ioda_phb(struct device_node *np,
|
||||
if (!phb->hose) {
|
||||
pr_err(" Can't allocate PCI controller for %pOF\n",
|
||||
np);
|
||||
memblock_free(__pa(phb), sizeof(struct pnv_phb));
|
||||
memblock_free(phb, sizeof(struct pnv_phb));
|
||||
return;
|
||||
}
|
||||
|
||||
|
@ -51,7 +51,7 @@
|
||||
* to "new_size", calculated above. Implementing this is a convoluted process
|
||||
* which requires several hooks in the PCI core:
|
||||
*
|
||||
* 1. In pcibios_add_device() we call pnv_pci_ioda_fixup_iov().
|
||||
* 1. In pcibios_device_add() we call pnv_pci_ioda_fixup_iov().
|
||||
*
|
||||
* At this point the device has been probed and the device's BARs are sized,
|
||||
* but no resource allocations have been done. The SR-IOV BARs are sized
|
||||
|
@ -440,7 +440,7 @@ static void pnv_kexec_cpu_down(int crash_shutdown, int secondary)
|
||||
}
|
||||
#endif /* CONFIG_KEXEC_CORE */
|
||||
|
||||
#ifdef CONFIG_MEMORY_HOTPLUG_SPARSE
|
||||
#ifdef CONFIG_MEMORY_HOTPLUG
|
||||
static unsigned long pnv_memory_block_size(void)
|
||||
{
|
||||
/*
|
||||
@ -553,7 +553,7 @@ define_machine(powernv) {
|
||||
#ifdef CONFIG_KEXEC_CORE
|
||||
.kexec_cpu_down = pnv_kexec_cpu_down,
|
||||
#endif
|
||||
#ifdef CONFIG_MEMORY_HOTPLUG_SPARSE
|
||||
#ifdef CONFIG_MEMORY_HOTPLUG
|
||||
.memory_block_size = pnv_memory_block_size,
|
||||
#endif
|
||||
};
|
||||
|
@ -1088,7 +1088,7 @@ define_machine(pseries) {
|
||||
.machine_kexec = pSeries_machine_kexec,
|
||||
.kexec_cpu_down = pseries_kexec_cpu_down,
|
||||
#endif
|
||||
#ifdef CONFIG_MEMORY_HOTPLUG_SPARSE
|
||||
#ifdef CONFIG_MEMORY_HOTPLUG
|
||||
.memory_block_size = pseries_memory_block_size,
|
||||
#endif
|
||||
};
|
||||
|
@ -57,8 +57,7 @@ void __init svm_swiotlb_init(void)
|
||||
return;
|
||||
|
||||
|
||||
memblock_free_early(__pa(vstart),
|
||||
PAGE_ALIGN(io_tlb_nslabs << IO_TLB_SHIFT));
|
||||
memblock_free(vstart, PAGE_ALIGN(io_tlb_nslabs << IO_TLB_SHIFT));
|
||||
panic("SVM: Cannot allocate SWIOTLB buffer");
|
||||
}
|
||||
|
||||
|
@ -230,13 +230,13 @@ static void __init init_resources(void)
|
||||
|
||||
/* Clean-up any unused pre-allocated resources */
|
||||
if (res_idx >= 0)
|
||||
memblock_free(__pa(mem_res), (res_idx + 1) * sizeof(*mem_res));
|
||||
memblock_free(mem_res, (res_idx + 1) * sizeof(*mem_res));
|
||||
return;
|
||||
|
||||
error:
|
||||
/* Better an empty resource tree than an inconsistent one */
|
||||
release_child_resources(&iomem_resource);
|
||||
memblock_free(__pa(mem_res), mem_res_sz);
|
||||
memblock_free(mem_res, mem_res_sz);
|
||||
}
|
||||
|
||||
|
||||
|
@ -153,12 +153,15 @@ config S390
|
||||
select HAVE_DEBUG_KMEMLEAK
|
||||
select HAVE_DMA_CONTIGUOUS
|
||||
select HAVE_DYNAMIC_FTRACE
|
||||
select HAVE_DYNAMIC_FTRACE_WITH_ARGS
|
||||
select HAVE_DYNAMIC_FTRACE_WITH_DIRECT_CALLS
|
||||
select HAVE_DYNAMIC_FTRACE_WITH_REGS
|
||||
select HAVE_EBPF_JIT if PACK_STACK && HAVE_MARCH_Z196_FEATURES
|
||||
select HAVE_EFFICIENT_UNALIGNED_ACCESS
|
||||
select HAVE_FAST_GUP
|
||||
select HAVE_FENTRY
|
||||
select HAVE_FTRACE_MCOUNT_RECORD
|
||||
select HAVE_FUNCTION_ARG_ACCESS_API
|
||||
select HAVE_FUNCTION_ERROR_INJECTION
|
||||
select HAVE_FUNCTION_GRAPH_TRACER
|
||||
select HAVE_FUNCTION_TRACER
|
||||
@ -190,6 +193,7 @@ config S390
|
||||
select HAVE_REGS_AND_STACK_ACCESS_API
|
||||
select HAVE_RELIABLE_STACKTRACE
|
||||
select HAVE_RSEQ
|
||||
select HAVE_SAMPLE_FTRACE_DIRECT
|
||||
select HAVE_SOFTIRQ_ON_OWN_STACK
|
||||
select HAVE_SYSCALL_TRACEPOINTS
|
||||
select HAVE_VIRT_CPU_ACCOUNTING
|
||||
@ -434,6 +438,14 @@ endchoice
|
||||
config 64BIT
|
||||
def_bool y
|
||||
|
||||
config COMMAND_LINE_SIZE
|
||||
int "Maximum size of kernel command line"
|
||||
default 4096
|
||||
range 896 1048576
|
||||
help
|
||||
This allows you to specify the maximum length of the kernel command
|
||||
line.
|
||||
|
||||
config COMPAT
|
||||
def_bool y
|
||||
prompt "Kernel support for 31 bit emulation"
|
||||
@ -938,6 +950,8 @@ menu "Selftests"
|
||||
|
||||
config S390_UNWIND_SELFTEST
|
||||
def_tristate n
|
||||
depends on KUNIT
|
||||
default KUNIT_ALL_TESTS
|
||||
prompt "Test unwind functions"
|
||||
help
|
||||
This option enables s390 specific stack unwinder testing kernel
|
||||
@ -946,4 +960,16 @@ config S390_UNWIND_SELFTEST
|
||||
|
||||
Say N if you are unsure.
|
||||
|
||||
config S390_KPROBES_SANITY_TEST
|
||||
def_tristate n
|
||||
prompt "Enable s390 specific kprobes tests"
|
||||
depends on KPROBES
|
||||
depends on KUNIT
|
||||
help
|
||||
This option enables an s390 specific kprobes test module. This option
|
||||
is not useful for distributions or general kernels, but only for kernel
|
||||
developers working on architecture code.
|
||||
|
||||
Say N if you are unsure.
|
||||
|
||||
endmenu
|
||||
|
@ -24,6 +24,7 @@ struct vmlinux_info {
|
||||
unsigned long dynsym_start;
|
||||
unsigned long rela_dyn_start;
|
||||
unsigned long rela_dyn_end;
|
||||
unsigned long amode31_size;
|
||||
};
|
||||
|
||||
/* Symbols defined by linker scripts */
|
||||
|
@ -184,35 +184,23 @@ iplstart:
|
||||
bas %r14,.Lloader # load parameter file
|
||||
ltr %r2,%r2 # got anything ?
|
||||
bz .Lnopf
|
||||
chi %r2,895
|
||||
bnh .Lnotrunc
|
||||
la %r2,895
|
||||
l %r3,MAX_COMMAND_LINE_SIZE+ARCH_OFFSET-PARMAREA(%r12)
|
||||
ahi %r3,-1
|
||||
clr %r2,%r3
|
||||
bl .Lnotrunc
|
||||
lr %r2,%r3
|
||||
.Lnotrunc:
|
||||
l %r4,.Linitrd
|
||||
clc 0(3,%r4),.L_hdr # if it is HDRx
|
||||
bz .Lagain1 # skip dataset header
|
||||
clc 0(3,%r4),.L_eof # if it is EOFx
|
||||
bz .Lagain1 # skip dataset trailer
|
||||
la %r5,0(%r4,%r2)
|
||||
lr %r3,%r2
|
||||
la %r3,COMMAND_LINE-PARMAREA(%r12) # load adr. of command line
|
||||
mvc 0(256,%r3),0(%r4)
|
||||
mvc 256(256,%r3),256(%r4)
|
||||
mvc 512(256,%r3),512(%r4)
|
||||
mvc 768(122,%r3),768(%r4)
|
||||
slr %r0,%r0
|
||||
b .Lcntlp
|
||||
.Ldelspc:
|
||||
ic %r0,0(%r2,%r3)
|
||||
chi %r0,0x20 # is it a space ?
|
||||
be .Lcntlp
|
||||
ahi %r2,1
|
||||
b .Leolp
|
||||
.Lcntlp:
|
||||
brct %r2,.Ldelspc
|
||||
.Leolp:
|
||||
slr %r0,%r0
|
||||
stc %r0,0(%r2,%r3) # terminate buffer
|
||||
|
||||
lr %r5,%r2
|
||||
la %r6,COMMAND_LINE-PARMAREA(%r12)
|
||||
lr %r7,%r2
|
||||
ahi %r7,1
|
||||
mvcl %r6,%r4
|
||||
.Lnopf:
|
||||
|
||||
#
|
||||
@ -317,6 +305,7 @@ SYM_CODE_START_LOCAL(startup_normal)
|
||||
xc 0x300(256),0x300
|
||||
xc 0xe00(256),0xe00
|
||||
xc 0xf00(256),0xf00
|
||||
lctlg %c0,%c15,.Lctl-.LPG0(%r13) # load control registers
|
||||
stcke __LC_BOOT_CLOCK
|
||||
mvc __LC_LAST_UPDATE_CLOCK(8),__LC_BOOT_CLOCK+1
|
||||
spt 6f-.LPG0(%r13)
|
||||
@ -335,6 +324,22 @@ SYM_CODE_END(startup_normal)
|
||||
.quad 0x0000000180000000,startup_pgm_check_handler
|
||||
.Lio_new_psw:
|
||||
.quad 0x0002000180000000,0x1f0 # disabled wait
|
||||
.Lctl: .quad 0x04040000 # cr0: AFP registers & secondary space
|
||||
.quad 0 # cr1: primary space segment table
|
||||
.quad 0 # cr2: dispatchable unit control table
|
||||
.quad 0 # cr3: instruction authorization
|
||||
.quad 0xffff # cr4: instruction authorization
|
||||
.quad 0 # cr5: primary-aste origin
|
||||
.quad 0 # cr6: I/O interrupts
|
||||
.quad 0 # cr7: secondary space segment table
|
||||
.quad 0x0000000000008000 # cr8: access registers translation
|
||||
.quad 0 # cr9: tracing off
|
||||
.quad 0 # cr10: tracing off
|
||||
.quad 0 # cr11: tracing off
|
||||
.quad 0 # cr12: tracing off
|
||||
.quad 0 # cr13: home space segment table
|
||||
.quad 0xc0000000 # cr14: machine check handling off
|
||||
.quad 0 # cr15: linkage stack operations
|
||||
|
||||
#include "head_kdump.S"
|
||||
|
||||
@ -377,11 +382,10 @@ SYM_DATA_START(parmarea)
|
||||
.quad 0 # OLDMEM_BASE
|
||||
.quad 0 # OLDMEM_SIZE
|
||||
.quad kernel_version # points to kernel version string
|
||||
.quad COMMAND_LINE_SIZE
|
||||
|
||||
.org COMMAND_LINE
|
||||
.byte "root=/dev/ram0 ro"
|
||||
.byte 0
|
||||
.org PARMAREA+__PARMAREA_SIZE
|
||||
SYM_DATA_END(parmarea)
|
||||
|
||||
.org HEAD_END
|
||||
|
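Note on the iplstart hunk above: the parameter-file length is no longer capped at the fixed value 895. It appears to be capped at the value stored in the parmarea's max_command_line_size field minus one, and the copy then writes one extra zero byte as the terminator. A minimal C sketch of that clamp-and-terminate logic follows; the helper names clamp_cmdline_len() and copy_cmdline() are hypothetical and are not kernel functions.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: limit a command line length so that one byte
 * always remains for the terminating NUL (mirrors "max size - 1").
 */
static size_t clamp_cmdline_len(size_t len, size_t max_size)
{
	if (max_size == 0)
		return 0;
	return len < max_size - 1 ? len : max_size - 1;
}

/*
 * Hypothetical helper: copy at most max_size - 1 bytes from src into dst
 * and terminate the result. Returns the number of bytes copied, not
 * counting the terminator.
 */
static size_t copy_cmdline(char *dst, size_t max_size,
			   const char *src, size_t len)
{
	size_t n;

	if (max_size == 0)
		return 0;
	n = clamp_cmdline_len(len, max_size);
	memcpy(dst, src, n);
	dst[n] = '\0';
	return n;
}
```

With an 8-byte buffer, a 14-byte input such as "root=/dev/ram0" would be cut to 7 bytes plus the terminator, while a shorter input is copied unchanged.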
@ -170,10 +170,10 @@ static inline int has_ebcdic_char(const char *str)
|
||||
|
||||
void setup_boot_command_line(void)
|
||||
{
|
||||
parmarea.command_line[ARCH_COMMAND_LINE_SIZE - 1] = 0;
|
||||
parmarea.command_line[COMMAND_LINE_SIZE - 1] = 0;
|
||||
/* convert arch command line to ascii if necessary */
|
||||
if (has_ebcdic_char(parmarea.command_line))
|
||||
EBCASC(parmarea.command_line, ARCH_COMMAND_LINE_SIZE);
|
||||
EBCASC(parmarea.command_line, COMMAND_LINE_SIZE);
|
||||
/* copy arch command line */
|
||||
strcpy(early_command_line, strim(parmarea.command_line));
|
||||
|
||||
|
@ -175,6 +175,6 @@ void print_pgm_check_info(void)
|
||||
gpregs[12], gpregs[13], gpregs[14], gpregs[15]);
|
||||
print_stacktrace();
|
||||
decompressor_printk("Last Breaking-Event-Address:\n");
|
||||
decompressor_printk(" [<%016lx>] %pS\n", (unsigned long)S390_lowcore.breaking_event_addr,
|
||||
(void *)S390_lowcore.breaking_event_addr);
|
||||
decompressor_printk(" [<%016lx>] %pS\n", (unsigned long)S390_lowcore.pgm_last_break,
|
||||
(void *)S390_lowcore.pgm_last_break);
|
||||
}
|
||||
|
@ -15,6 +15,7 @@
|
||||
#include "uv.h"
|
||||
|
||||
unsigned long __bootdata_preserved(__kaslr_offset);
|
||||
unsigned long __bootdata(__amode31_base);
|
||||
unsigned long __bootdata_preserved(VMALLOC_START);
|
||||
unsigned long __bootdata_preserved(VMALLOC_END);
|
||||
struct page *__bootdata_preserved(vmemmap);
|
||||
@ -259,6 +260,12 @@ static void offset_vmlinux_info(unsigned long offset)
|
||||
vmlinux.dynsym_start += offset;
|
||||
}
|
||||
|
||||
static unsigned long reserve_amode31(unsigned long safe_addr)
|
||||
{
|
||||
__amode31_base = PAGE_ALIGN(safe_addr);
|
||||
return safe_addr + vmlinux.amode31_size;
|
||||
}
|
||||
|
||||
void startup_kernel(void)
|
||||
{
|
||||
unsigned long random_lma;
|
||||
@ -273,6 +280,7 @@ void startup_kernel(void)
|
||||
setup_lpp();
|
||||
store_ipl_parmblock();
|
||||
safe_addr = mem_safe_offset();
|
||||
safe_addr = reserve_amode31(safe_addr);
|
||||
safe_addr = read_ipl_report(safe_addr);
|
||||
uv_query_info();
|
||||
rescue_initrd(safe_addr);
|
||||
|
@ -61,7 +61,8 @@ CONFIG_PROTECTED_VIRTUALIZATION_GUEST=y
|
||||
CONFIG_CMM=m
|
||||
CONFIG_APPLDATA_BASE=y
|
||||
CONFIG_KVM=m
|
||||
CONFIG_S390_UNWIND_SELFTEST=y
|
||||
CONFIG_S390_UNWIND_SELFTEST=m
|
||||
CONFIG_S390_KPROBES_SANITY_TEST=m
|
||||
CONFIG_KPROBES=y
|
||||
CONFIG_JUMP_LABEL=y
|
||||
CONFIG_STATIC_KEYS_SELFTEST=y
|
||||
@ -776,7 +777,6 @@ CONFIG_CRC8=m
|
||||
CONFIG_RANDOM32_SELFTEST=y
|
||||
CONFIG_DMA_CMA=y
|
||||
CONFIG_CMA_SIZE_MBYTES=0
|
||||
CONFIG_DMA_API_DEBUG=y
|
||||
CONFIG_PRINTK_TIME=y
|
||||
CONFIG_DYNAMIC_DEBUG=y
|
||||
CONFIG_DEBUG_INFO=y
|
||||
@ -839,8 +839,13 @@ CONFIG_BPF_KPROBE_OVERRIDE=y
|
||||
CONFIG_HIST_TRIGGERS=y
|
||||
CONFIG_FTRACE_STARTUP_TEST=y
|
||||
# CONFIG_EVENT_TRACE_STARTUP_TEST is not set
|
||||
CONFIG_SAMPLES=y
|
||||
CONFIG_SAMPLE_TRACE_PRINTK=m
|
||||
CONFIG_SAMPLE_FTRACE_DIRECT=m
|
||||
CONFIG_DEBUG_ENTRY=y
|
||||
CONFIG_CIO_INJECT=y
|
||||
CONFIG_KUNIT=m
|
||||
CONFIG_KUNIT_DEBUGFS=y
|
||||
CONFIG_NOTIFIER_ERROR_INJECTION=m
|
||||
CONFIG_NETDEV_NOTIFIER_ERROR_INJECT=m
|
||||
CONFIG_FAULT_INJECTION=y
|
||||
|
@ -60,6 +60,7 @@ CONFIG_CMM=m
|
||||
CONFIG_APPLDATA_BASE=y
|
||||
CONFIG_KVM=m
|
||||
CONFIG_S390_UNWIND_SELFTEST=m
|
||||
CONFIG_S390_KPROBES_SANITY_TEST=m
|
||||
CONFIG_KPROBES=y
|
||||
CONFIG_JUMP_LABEL=y
|
||||
# CONFIG_GCC_PLUGINS is not set
|
||||
@ -788,6 +789,11 @@ CONFIG_FTRACE_SYSCALLS=y
|
||||
CONFIG_BLK_DEV_IO_TRACE=y
|
||||
CONFIG_BPF_KPROBE_OVERRIDE=y
|
||||
CONFIG_HIST_TRIGGERS=y
|
||||
CONFIG_SAMPLES=y
|
||||
CONFIG_SAMPLE_TRACE_PRINTK=m
|
||||
CONFIG_SAMPLE_FTRACE_DIRECT=m
|
||||
CONFIG_KUNIT=m
|
||||
CONFIG_KUNIT_DEBUGFS=y
|
||||
CONFIG_LKDTM=m
|
||||
CONFIG_PERCPU_TEST=m
|
||||
CONFIG_ATOMIC64_SELFTEST=y
|
||||
|
@ -16,20 +16,24 @@
|
||||
|
||||
#ifdef CONFIG_HAVE_MARCH_Z196_FEATURES
|
||||
/* Fast-BCR without checkpoint synchronization */
|
||||
#define __ASM_BARRIER "bcr 14,0\n"
|
||||
#define __ASM_BCR_SERIALIZE "bcr 14,0\n"
|
||||
#else
|
||||
#define __ASM_BARRIER "bcr 15,0\n"
|
||||
#define __ASM_BCR_SERIALIZE "bcr 15,0\n"
|
||||
#endif
|
||||
|
||||
#define mb() do { asm volatile(__ASM_BARRIER : : : "memory"); } while (0)
|
||||
static __always_inline void bcr_serialize(void)
|
||||
{
|
||||
asm volatile(__ASM_BCR_SERIALIZE : : : "memory");
|
||||
}
|
||||
|
||||
#define rmb() barrier()
|
||||
#define wmb() barrier()
|
||||
#define dma_rmb() mb()
|
||||
#define dma_wmb() mb()
|
||||
#define __smp_mb() mb()
|
||||
#define __smp_rmb() rmb()
|
||||
#define __smp_wmb() wmb()
|
||||
#define mb() bcr_serialize()
|
||||
#define rmb() barrier()
|
||||
#define wmb() barrier()
|
||||
#define dma_rmb() mb()
|
||||
#define dma_wmb() mb()
|
||||
#define __smp_mb() mb()
|
||||
#define __smp_rmb() rmb()
|
||||
#define __smp_wmb() wmb()
|
||||
|
||||
#define __smp_store_release(p, v) \
|
||||
do { \
|
||||
|
@ -188,7 +188,7 @@ static inline bool arch_test_and_set_bit_lock(unsigned long nr,
|
||||
volatile unsigned long *ptr)
|
||||
{
|
||||
if (arch_test_bit(nr, ptr))
|
||||
return 1;
|
||||
return true;
|
||||
return arch_test_and_set_bit(nr, ptr);
|
||||
}
|
||||
|
||||
|
@ -12,6 +12,7 @@
|
||||
#ifndef __ASSEMBLY__
|
||||
|
||||
#include <linux/types.h>
|
||||
#include <linux/jump_label.h>
|
||||
|
||||
struct cpuid
|
||||
{
|
||||
@ -21,5 +22,7 @@ struct cpuid
|
||||
unsigned int unused : 16;
|
||||
} __attribute__ ((packed, aligned(8)));
|
||||
|
||||
DECLARE_STATIC_KEY_FALSE(cpu_has_bear);
|
||||
|
||||
#endif /* __ASSEMBLY__ */
|
||||
#endif /* _ASM_S390_CPU_H */
|
||||
|
@ -462,7 +462,7 @@ arch_initcall(VNAME(var, reg))
|
||||
*
|
||||
* @var: Name of debug_info_t variable
|
||||
* @name: Name of debug log (e.g. used for debugfs entry)
|
||||
* @pages_per_area: Number of pages per area
|
||||
* @pages: Number of pages per area
|
||||
* @nr_areas: Number of debug areas
|
||||
* @buf_size: Size of data area in each debug entry
|
||||
* @view: Pointer to debug view struct
|
||||
|
@ -17,7 +17,6 @@
|
||||
|
||||
void ftrace_caller(void);
|
||||
|
||||
extern char ftrace_graph_caller_end;
|
||||
extern void *ftrace_func;
|
||||
|
||||
struct dyn_arch_ftrace { };
|
||||
@ -42,6 +41,35 @@ static inline unsigned long ftrace_call_adjust(unsigned long addr)
|
||||
return addr;
|
||||
}
|
||||
|
||||
struct ftrace_regs {
|
||||
struct pt_regs regs;
|
||||
};
|
||||
|
||||
static __always_inline struct pt_regs *arch_ftrace_get_regs(struct ftrace_regs *fregs)
|
||||
{
|
||||
return &fregs->regs;
|
||||
}
|
||||
|
||||
static __always_inline void ftrace_instruction_pointer_set(struct ftrace_regs *fregs,
|
||||
unsigned long ip)
|
||||
{
|
||||
struct pt_regs *regs = arch_ftrace_get_regs(fregs);
|
||||
|
||||
regs->psw.addr = ip;
|
||||
}
|
||||
|
||||
/*
|
||||
* When an ftrace registered caller is tracing a function that is
|
||||
* also set by a register_ftrace_direct() call, it needs to be
|
||||
* differentiated in the ftrace_caller trampoline. To do this,
|
||||
* place the direct caller in the ORIG_GPR2 part of pt_regs. This
|
||||
* tells the ftrace_caller that there's a direct caller.
|
||||
*/
|
||||
static inline void arch_ftrace_set_direct_caller(struct pt_regs *regs, unsigned long addr)
|
||||
{
|
||||
regs->orig_gpr2 = addr;
|
||||
}
|
||||
|
||||
/*
|
||||
* Even though the system call numbers are identical for s390/s390x a
|
||||
* different system call table is used for compat tasks. This may lead
|
||||
@ -68,4 +96,32 @@ static inline bool arch_syscall_match_sym_name(const char *sym,
|
||||
}
|
||||
|
||||
#endif /* __ASSEMBLY__ */
|
||||
|
||||
#ifdef CONFIG_FUNCTION_TRACER
|
||||
|
||||
#define FTRACE_NOP_INSN .word 0xc004, 0x0000, 0x0000 /* brcl 0,0 */
|
||||
|
||||
#ifndef CC_USING_HOTPATCH
|
||||
|
||||
#define FTRACE_GEN_MCOUNT_RECORD(name) \
|
||||
.section __mcount_loc, "a", @progbits; \
|
||||
.quad name; \
|
||||
.previous;
|
||||
|
||||
#else /* !CC_USING_HOTPATCH */
|
||||
|
||||
#define FTRACE_GEN_MCOUNT_RECORD(name)
|
||||
|
||||
#endif /* !CC_USING_HOTPATCH */
|
||||
|
||||
#define FTRACE_GEN_NOP_ASM(name) \
|
||||
FTRACE_GEN_MCOUNT_RECORD(name) \
|
||||
FTRACE_NOP_INSN
|
||||
|
||||
#else /* CONFIG_FUNCTION_TRACER */
|
||||
|
||||
#define FTRACE_GEN_NOP_ASM(name)
|
||||
|
||||
#endif /* CONFIG_FUNCTION_TRACER */
|
||||
|
||||
#endif /* _ASM_S390_FTRACE_H */
|
||||
|
@ -2,6 +2,8 @@
|
||||
#ifndef _ASM_S390_JUMP_LABEL_H
|
||||
#define _ASM_S390_JUMP_LABEL_H
|
||||
|
||||
#define HAVE_JUMP_LABEL_BATCH
|
||||
|
||||
#ifndef __ASSEMBLY__
|
||||
|
||||
#include <linux/types.h>
|
||||
|
@ -16,9 +16,7 @@
|
||||
|
||||
static inline void klp_arch_set_pc(struct ftrace_regs *fregs, unsigned long ip)
|
||||
{
|
||||
struct pt_regs *regs = ftrace_get_regs(fregs);
|
||||
|
||||
regs->psw.addr = ip;
|
||||
ftrace_instruction_pointer_set(fregs, ip);
|
||||
}
|
||||
|
||||
#endif
|
||||
|
@ -65,7 +65,7 @@ struct lowcore {
|
||||
__u32 external_damage_code; /* 0x00f4 */
|
||||
__u64 failing_storage_address; /* 0x00f8 */
|
||||
__u8 pad_0x0100[0x0110-0x0100]; /* 0x0100 */
|
||||
__u64 breaking_event_addr; /* 0x0110 */
|
||||
__u64 pgm_last_break; /* 0x0110 */
|
||||
__u8 pad_0x0118[0x0120-0x0118]; /* 0x0118 */
|
||||
psw_t restart_old_psw; /* 0x0120 */
|
||||
psw_t external_old_psw; /* 0x0130 */
|
||||
@ -93,9 +93,10 @@ struct lowcore {
|
||||
psw_t return_psw; /* 0x0290 */
|
||||
psw_t return_mcck_psw; /* 0x02a0 */
|
||||
|
||||
__u64 last_break; /* 0x02b0 */
|
||||
|
||||
/* CPU accounting and timing values. */
|
||||
__u64 sys_enter_timer; /* 0x02b0 */
|
||||
__u8 pad_0x02b8[0x02c0-0x02b8]; /* 0x02b8 */
|
||||
__u64 sys_enter_timer; /* 0x02b8 */
|
||||
__u64 mcck_enter_timer; /* 0x02c0 */
|
||||
__u64 exit_timer; /* 0x02c8 */
|
||||
__u64 user_timer; /* 0x02d0 */
|
||||
@ -188,7 +189,7 @@ struct lowcore {
|
||||
__u32 tod_progreg_save_area; /* 0x1324 */
|
||||
__u32 cpu_timer_save_area[2]; /* 0x1328 */
|
||||
__u32 clock_comp_save_area[2]; /* 0x1330 */
|
||||
__u8 pad_0x1338[0x1340-0x1338]; /* 0x1338 */
|
||||
__u64 last_break_save_area; /* 0x1338 */
|
||||
__u32 access_regs_save_area[16]; /* 0x1340 */
|
||||
__u64 cregs_save_area[16]; /* 0x1380 */
|
||||
__u8 pad_0x1400[0x1800-0x1400]; /* 0x1400 */
|
||||
|
@ -12,6 +12,11 @@ void nospec_init_branches(void);
|
||||
void nospec_auto_detect(void);
|
||||
void nospec_revert(s32 *start, s32 *end);
|
||||
|
||||
static inline bool nospec_uses_trampoline(void)
|
||||
{
|
||||
return __is_defined(CC_USING_EXPOLINE) && !nospec_disable;
|
||||
}
|
||||
|
||||
#endif /* __ASSEMBLY__ */
|
||||
|
||||
#endif /* _ASM_S390_EXPOLINE_H */
|
||||
|
@ -583,11 +583,11 @@ static inline void cspg(unsigned long *ptr, unsigned long old, unsigned long new
|
||||
#define CRDTE_DTT_REGION1 0x1cUL
|
||||
|
||||
static inline void crdte(unsigned long old, unsigned long new,
|
||||
unsigned long table, unsigned long dtt,
|
||||
unsigned long *table, unsigned long dtt,
|
||||
unsigned long address, unsigned long asce)
|
||||
{
|
||||
union register_pair r1 = { .even = old, .odd = new, };
|
||||
union register_pair r2 = { .even = table | dtt, .odd = address, };
|
||||
union register_pair r2 = { .even = __pa(table) | dtt, .odd = address, };
|
||||
|
||||
asm volatile(".insn rrf,0xb98f0000,%[r1],%[r2],%[asce],0"
|
||||
: [r1] "+&d" (r1.pair)
|
||||
@ -1001,7 +1001,7 @@ static __always_inline void __ptep_ipte(unsigned long address, pte_t *ptep,
|
||||
unsigned long opt, unsigned long asce,
|
||||
int local)
|
||||
{
|
||||
unsigned long pto = (unsigned long) ptep;
|
||||
unsigned long pto = __pa(ptep);
|
||||
|
||||
if (__builtin_constant_p(opt) && opt == 0) {
|
||||
/* Invalidation + TLB flush for the pte */
|
||||
@ -1023,7 +1023,7 @@ static __always_inline void __ptep_ipte(unsigned long address, pte_t *ptep,
|
||||
static __always_inline void __ptep_ipte_range(unsigned long address, int nr,
|
||||
pte_t *ptep, int local)
|
||||
{
|
||||
unsigned long pto = (unsigned long) ptep;
|
||||
unsigned long pto = __pa(ptep);
|
||||
|
||||
/* Invalidate a range of ptes + TLB flush of the ptes */
|
||||
do {
|
||||
@ -1487,7 +1487,7 @@ static __always_inline void __pmdp_idte(unsigned long addr, pmd_t *pmdp,
|
||||
{
|
||||
unsigned long sto;
|
||||
|
||||
sto = (unsigned long) pmdp - pmd_index(addr) * sizeof(pmd_t);
|
||||
sto = __pa(pmdp) - pmd_index(addr) * sizeof(pmd_t);
|
||||
if (__builtin_constant_p(opt) && opt == 0) {
|
||||
/* flush without guest asce */
|
||||
asm volatile(
|
||||
@ -1513,7 +1513,7 @@ static __always_inline void __pudp_idte(unsigned long addr, pud_t *pudp,
|
||||
{
|
||||
unsigned long r3o;
|
||||
|
||||
r3o = (unsigned long) pudp - pud_index(addr) * sizeof(pud_t);
|
||||
r3o = __pa(pudp) - pud_index(addr) * sizeof(pud_t);
|
||||
r3o |= _ASCE_TYPE_REGION3;
|
||||
if (__builtin_constant_p(opt) && opt == 0) {
|
||||
/* flush without guest asce */
|
||||
|
@ -76,8 +76,7 @@ enum {
|
||||
* The pt_regs struct defines the way the registers are stored on
|
||||
* the stack during a system call.
|
||||
*/
|
||||
struct pt_regs
|
||||
{
|
||||
struct pt_regs {
|
||||
union {
|
||||
user_pt_regs user_regs;
|
||||
struct {
|
||||
@ -97,6 +96,7 @@ struct pt_regs
|
||||
};
|
||||
unsigned long flags;
|
||||
unsigned long cr1;
|
||||
unsigned long last_break;
|
||||
};
|
||||
|
||||
/*
|
||||
@ -197,6 +197,25 @@ const char *regs_query_register_name(unsigned int offset);
|
||||
unsigned long regs_get_register(struct pt_regs *regs, unsigned int offset);
|
||||
unsigned long regs_get_kernel_stack_nth(struct pt_regs *regs, unsigned int n);
|
||||
|
||||
/**
|
||||
* regs_get_kernel_argument() - get Nth function argument in kernel
|
||||
* @regs: pt_regs of that context
|
||||
* @n: function argument number (start from 0)
|
||||
*
|
||||
* regs_get_kernel_argument() returns @n th argument of the function call.
|
||||
*/
|
||||
static inline unsigned long regs_get_kernel_argument(struct pt_regs *regs,
|
||||
unsigned int n)
|
||||
{
|
||||
unsigned int argoffset = STACK_FRAME_OVERHEAD / sizeof(long);
|
||||
|
||||
#define NR_REG_ARGUMENTS 5
|
||||
if (n < NR_REG_ARGUMENTS)
|
||||
return regs_get_register(regs, 2 + n);
|
||||
n -= NR_REG_ARGUMENTS;
|
||||
return regs_get_kernel_stack_nth(regs, argoffset + n);
|
||||
}
|
||||
|
||||
static inline unsigned long kernel_stack_pointer(struct pt_regs *regs)
|
||||
{
|
||||
return regs->gprs[15];
|
||||
|
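Note on regs_get_kernel_argument() in the hunk above: the first five arguments are read from general registers 2 to 6, and later arguments are read from the stack just past the standard frame overhead. The user-space model below is only a sketch of that lookup. struct mock_regs and mock_get_argument() are hypothetical stand-ins, and the 20-slot frame overhead assumes a 160-byte stack frame with 8-byte longs.

```c
#include <stddef.h>

#define NR_REG_ARGUMENTS	5	/* arguments passed in gprs 2..6 */
#define FRAME_OVERHEAD_SLOTS	20	/* assumption: 160-byte frame / 8 */

/* Hypothetical stand-in for pt_regs plus the kernel stack it points to. */
struct mock_regs {
	unsigned long gprs[16];
	const unsigned long *stack;	/* words starting at the stack pointer */
	size_t stack_words;		/* number of valid words in stack[] */
};

/*
 * Return the n-th (0-based) function argument: register arguments first,
 * then stack slots located after the frame overhead. Out-of-range stack
 * slots yield 0 in this model.
 */
static unsigned long mock_get_argument(const struct mock_regs *regs,
				       unsigned int n)
{
	size_t slot;

	if (n < NR_REG_ARGUMENTS)
		return regs->gprs[2 + n];
	n -= NR_REG_ARGUMENTS;
	slot = FRAME_OVERHEAD_SLOTS + n;
	return slot < regs->stack_words ? regs->stack[slot] : 0;
}
```

In this model, argument 0 comes from gprs[2], argument 4 from gprs[6], and argument 5 from the first stack slot after the frame overhead.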
@ -117,6 +117,7 @@ struct zpci_report_error_header {
|
||||
|
||||
extern char *sclp_early_sccb;
|
||||
|
||||
void sclp_early_adjust_va(void);
|
||||
void sclp_early_set_buffer(void *sccb);
|
||||
int sclp_early_read_info(void);
|
||||
int sclp_early_read_storage_info(void);
|
||||
|
@ -2,20 +2,8 @@
|
||||
#ifndef _S390_SECTIONS_H
|
||||
#define _S390_SECTIONS_H
|
||||
|
||||
#define arch_is_kernel_initmem_freed arch_is_kernel_initmem_freed
|
||||
|
||||
#include <asm-generic/sections.h>
|
||||
|
||||
extern bool initmem_freed;
|
||||
|
||||
static inline int arch_is_kernel_initmem_freed(unsigned long addr)
|
||||
{
|
||||
if (!initmem_freed)
|
||||
return 0;
|
||||
return addr >= (unsigned long)__init_begin &&
|
||||
addr < (unsigned long)__init_end;
|
||||
}
|
||||
|
||||
/*
|
||||
* .boot.data section contains variables "shared" between the decompressor and
|
||||
* the decompressed kernel. The decompressor will store values in them, and
|
||||
|
@ -11,8 +11,8 @@
|
||||
#include <linux/build_bug.h>
|
||||
|
||||
#define PARMAREA 0x10400
|
||||
#define HEAD_END 0x11000
|
||||
|
||||
#define COMMAND_LINE_SIZE CONFIG_COMMAND_LINE_SIZE
|
||||
/*
|
||||
* Machine features detected in early.c
|
||||
*/
|
||||
@ -43,6 +43,8 @@
|
||||
#define STARTUP_NORMAL_OFFSET 0x10000
|
||||
#define STARTUP_KDUMP_OFFSET 0x10010
|
||||
|
||||
#define LEGACY_COMMAND_LINE_SIZE 896
|
||||
|
||||
#ifndef __ASSEMBLY__
|
||||
|
||||
#include <asm/lowcore.h>
|
||||
@ -55,8 +57,9 @@ struct parmarea {
|
||||
unsigned long oldmem_base; /* 0x10418 */
|
||||
unsigned long oldmem_size; /* 0x10420 */
|
||||
unsigned long kernel_version; /* 0x10428 */
|
||||
char pad1[0x10480 - 0x10430]; /* 0x10430 - 0x10480 */
|
||||
char command_line[ARCH_COMMAND_LINE_SIZE]; /* 0x10480 */
|
||||
unsigned long max_command_line_size; /* 0x10430 */
|
||||
char pad1[0x10480-0x10438]; /* 0x10438 - 0x10480 */
|
||||
char command_line[COMMAND_LINE_SIZE]; /* 0x10480 */
|
||||
};
|
||||
|
||||
extern struct parmarea parmarea;
|
||||
|
@ -31,22 +31,18 @@ void *memmove(void *dest, const void *src, size_t n);
|
||||
#define __HAVE_ARCH_STRCMP /* arch function */
|
||||
#define __HAVE_ARCH_STRCPY /* inline & arch function */
|
||||
#define __HAVE_ARCH_STRLCAT /* arch function */
|
||||
#define __HAVE_ARCH_STRLCPY /* arch function */
|
||||
#define __HAVE_ARCH_STRLEN /* inline & arch function */
|
||||
#define __HAVE_ARCH_STRNCAT /* arch function */
|
||||
#define __HAVE_ARCH_STRNCPY /* arch function */
|
||||
#define __HAVE_ARCH_STRNLEN /* inline & arch function */
|
||||
#define __HAVE_ARCH_STRRCHR /* arch function */
|
||||
#define __HAVE_ARCH_STRSTR /* arch function */
|
||||
|
||||
/* Prototypes for non-inlined arch strings functions. */
|
||||
int memcmp(const void *s1, const void *s2, size_t n);
|
||||
int strcmp(const char *s1, const char *s2);
|
||||
size_t strlcat(char *dest, const char *src, size_t n);
|
||||
size_t strlcpy(char *dest, const char *src, size_t size);
|
||||
char *strncat(char *dest, const char *src, size_t n);
|
||||
char *strncpy(char *dest, const char *src, size_t n);
|
||||
char *strrchr(const char *s, int c);
|
||||
char *strstr(const char *s1, const char *s2);
|
||||
#endif /* !CONFIG_KASAN */
|
||||
|
||||
|
16
arch/s390/include/asm/text-patching.h
Normal file
@ -0,0 +1,16 @@
|
||||
/* SPDX-License-Identifier: GPL-2.0 */
|
||||
|
||||
#ifndef _ASM_S390_TEXT_PATCHING_H
|
||||
#define _ASM_S390_TEXT_PATCHING_H
|
||||
|
||||
#include <asm/barrier.h>
|
||||
|
||||
static __always_inline void sync_core(void)
|
||||
{
|
||||
bcr_serialize();
|
||||
}
|
||||
|
||||
void text_poke_sync(void);
|
||||
void text_poke_sync_lock(void);
|
||||
|
||||
#endif /* _ASM_S390_TEXT_PATCHING_H */
|
@ -1,14 +1 @@
|
||||
/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
|
||||
/*
|
||||
* S390 version
|
||||
* Copyright IBM Corp. 1999, 2010
|
||||
*/
|
||||
|
||||
#ifndef _UAPI_ASM_S390_SETUP_H
|
||||
#define _UAPI_ASM_S390_SETUP_H
|
||||
|
||||
#define COMMAND_LINE_SIZE 4096
|
||||
|
||||
#define ARCH_COMMAND_LINE_SIZE 896
|
||||
|
||||
#endif /* _UAPI_ASM_S390_SETUP_H */
|
||||
|
@ -1,5 +1,8 @@
|
||||
// SPDX-License-Identifier: GPL-2.0
|
||||
#include <linux/module.h>
|
||||
#include <linux/cpu.h>
|
||||
#include <linux/smp.h>
|
||||
#include <asm/text-patching.h>
|
||||
#include <asm/alternative.h>
|
||||
#include <asm/facility.h>
|
||||
#include <asm/nospec-branch.h>
|
||||
@ -110,3 +113,20 @@ void __init apply_alternative_instructions(void)
|
||||
{
|
||||
apply_alternatives(__alt_instructions, __alt_instructions_end);
|
||||
}
|
||||
|
||||
static void do_sync_core(void *info)
|
||||
{
|
||||
sync_core();
|
||||
}
|
||||
|
||||
void text_poke_sync(void)
|
||||
{
|
||||
on_each_cpu(do_sync_core, NULL, 1);
|
||||
}
|
||||
|
||||
void text_poke_sync_lock(void)
|
||||
{
|
||||
cpus_read_lock();
|
||||
text_poke_sync();
|
||||
cpus_read_unlock();
|
||||
}
|
||||
|
@ -35,6 +35,7 @@ int main(void)
|
||||
OFFSET(__PT_ORIG_GPR2, pt_regs, orig_gpr2);
|
||||
OFFSET(__PT_FLAGS, pt_regs, flags);
|
||||
OFFSET(__PT_CR1, pt_regs, cr1);
|
||||
OFFSET(__PT_LAST_BREAK, pt_regs, last_break);
|
||||
DEFINE(__PT_SIZE, sizeof(struct pt_regs));
|
||||
BLANK();
|
||||
/* stack_frame offsets */
|
||||
@ -45,6 +46,7 @@ int main(void)
|
||||
OFFSET(__SF_SIE_SAVEAREA, stack_frame, empty1[2]);
|
||||
OFFSET(__SF_SIE_REASON, stack_frame, empty1[3]);
|
||||
OFFSET(__SF_SIE_FLAGS, stack_frame, empty1[4]);
|
||||
DEFINE(STACK_FRAME_OVERHEAD, sizeof(struct stack_frame));
|
||||
BLANK();
|
||||
/* idle data offsets */
|
||||
OFFSET(__CLOCK_IDLE_ENTER, s390_idle_data, clock_idle_enter);
|
||||
@ -77,7 +79,7 @@ int main(void)
|
||||
OFFSET(__LC_MCCK_CODE, lowcore, mcck_interruption_code);
|
||||
OFFSET(__LC_EXT_DAMAGE_CODE, lowcore, external_damage_code);
|
||||
OFFSET(__LC_MCCK_FAIL_STOR_ADDR, lowcore, failing_storage_address);
|
||||
OFFSET(__LC_LAST_BREAK, lowcore, breaking_event_addr);
|
||||
OFFSET(__LC_PGM_LAST_BREAK, lowcore, pgm_last_break);
|
||||
OFFSET(__LC_RETURN_LPSWE, lowcore, return_lpswe);
|
||||
OFFSET(__LC_RETURN_MCCK_LPSWE, lowcore, return_mcck_lpswe);
|
||||
OFFSET(__LC_RST_OLD_PSW, lowcore, restart_old_psw);
|
||||
@ -126,6 +128,7 @@ int main(void)
|
||||
OFFSET(__LC_PREEMPT_COUNT, lowcore, preempt_count);
|
||||
OFFSET(__LC_GMAP, lowcore, gmap);
|
||||
OFFSET(__LC_BR_R1, lowcore, br_r1_trampoline);
|
||||
OFFSET(__LC_LAST_BREAK, lowcore, last_break);
|
||||
/* software defined ABI-relevant lowcore locations 0xe00 - 0xe20 */
|
||||
OFFSET(__LC_DUMP_REIPL, lowcore, ipib);
|
||||
/* hardware defined lowcore locations 0x1000 - 0x18ff */
|
||||
@ -139,6 +142,7 @@ int main(void)
|
||||
OFFSET(__LC_TOD_PROGREG_SAVE_AREA, lowcore, tod_progreg_save_area);
|
||||
OFFSET(__LC_CPU_TIMER_SAVE_AREA, lowcore, cpu_timer_save_area);
|
||||
OFFSET(__LC_CLOCK_COMP_SAVE_AREA, lowcore, clock_comp_save_area);
|
||||
OFFSET(__LC_LAST_BREAK_SAVE_AREA, lowcore, last_break_save_area);
|
||||
OFFSET(__LC_AREGS_SAVE_AREA, lowcore, access_regs_save_area);
|
||||
OFFSET(__LC_CREGS_SAVE_AREA, lowcore, cregs_save_area);
|
||||
OFFSET(__LC_PGM_TDB, lowcore, pgm_tdb);
|
||||
@ -160,5 +164,6 @@ int main(void)
|
||||
DEFINE(OLDMEM_BASE, PARMAREA + offsetof(struct parmarea, oldmem_base));
|
||||
DEFINE(OLDMEM_SIZE, PARMAREA + offsetof(struct parmarea, oldmem_size));
|
||||
DEFINE(COMMAND_LINE, PARMAREA + offsetof(struct parmarea, command_line));
|
||||
DEFINE(MAX_COMMAND_LINE_SIZE, PARMAREA + offsetof(struct parmarea, max_command_line_size));
|
||||
return 0;
|
||||
}
|
||||
|
@ -29,7 +29,7 @@ static int diag8_noresponse(int cmdlen)
|
||||
asm volatile(
|
||||
" diag %[rx],%[ry],0x8\n"
|
||||
: [ry] "+&d" (cmdlen)
|
||||
: [rx] "d" ((addr_t) cpcmd_buf)
|
||||
: [rx] "d" (__pa(cpcmd_buf))
|
||||
: "cc");
|
||||
return cmdlen;
|
||||
}
|
||||
@ -39,8 +39,8 @@ static int diag8_response(int cmdlen, char *response, int *rlen)
|
||||
union register_pair rx, ry;
|
||||
int cc;
|
||||
|
||||
rx.even = (addr_t) cpcmd_buf;
|
||||
rx.odd = (addr_t) response;
|
||||
rx.even = __pa(cpcmd_buf);
|
||||
rx.odd = __pa(response);
|
||||
ry.even = cmdlen | 0x40000000L;
|
||||
ry.odd = *rlen;
|
||||
asm volatile(
|
||||
|
@ -152,7 +152,7 @@ void show_stack(struct task_struct *task, unsigned long *stack,
|
||||
static void show_last_breaking_event(struct pt_regs *regs)
|
||||
{
|
||||
printk("Last Breaking-Event-Address:\n");
|
||||
printk(" [<%016lx>] %pSR\n", regs->args[0], (void *)regs->args[0]);
|
||||
printk(" [<%016lx>] %pSR\n", regs->last_break, (void *)regs->last_break);
|
||||
}
|
||||
|
||||
void show_registers(struct pt_regs *regs)
|
||||
|
@ -280,7 +280,7 @@ char __bootdata(early_command_line)[COMMAND_LINE_SIZE];
|
||||
static void __init setup_boot_command_line(void)
|
||||
{
|
||||
/* copy arch command line */
|
||||
strlcpy(boot_command_line, early_command_line, ARCH_COMMAND_LINE_SIZE);
|
||||
strlcpy(boot_command_line, early_command_line, COMMAND_LINE_SIZE);
|
||||
}
|
||||
|
||||
static void __init check_image_bootable(void)
|
||||
@ -296,6 +296,7 @@ static void __init check_image_bootable(void)
|
||||
|
||||
void __init startup_init(void)
|
||||
{
|
||||
sclp_early_adjust_va();
|
||||
reset_tod_clock();
|
||||
check_image_bootable();
|
||||
time_early_init();
|
||||
|
@ -52,6 +52,22 @@ STACK_INIT = STACK_SIZE - STACK_FRAME_OVERHEAD - __PT_SIZE
|
||||
|
||||
_LPP_OFFSET = __LC_LPP
|
||||
|
||||
.macro STBEAR address
|
||||
ALTERNATIVE "", ".insn s,0xb2010000,\address", 193
|
||||
.endm
|
||||
|
||||
.macro LBEAR address
|
||||
ALTERNATIVE "", ".insn s,0xb2000000,\address", 193
|
||||
.endm
|
||||
|
||||
.macro LPSWEY address,lpswe
|
||||
ALTERNATIVE "b \lpswe", ".insn siy,0xeb0000000071,\address,0", 193
|
||||
.endm
|
||||
|
||||
.macro MBEAR reg
|
||||
ALTERNATIVE "", __stringify(mvc __PT_LAST_BREAK(8,\reg),__LC_LAST_BREAK), 193
|
||||
.endm
|
||||
|
||||
.macro CHECK_STACK savearea
|
||||
#ifdef CONFIG_CHECK_STACK
|
||||
tml %r15,STACK_SIZE - CONFIG_STACK_GUARD
|
||||
@ -302,6 +318,7 @@ ENTRY(system_call)
|
||||
BPOFF
|
||||
lghi %r14,0
|
||||
.Lsysc_per:
|
||||
STBEAR __LC_LAST_BREAK
|
||||
lctlg %c1,%c1,__LC_KERNEL_ASCE
|
||||
lg %r12,__LC_CURRENT
|
||||
lg %r15,__LC_KERNEL_STACK
|
||||
@ -321,14 +338,16 @@ ENTRY(system_call)
|
||||
xgr %r11,%r11
|
||||
la %r2,STACK_FRAME_OVERHEAD(%r15) # pointer to pt_regs
|
||||
mvc __PT_R8(64,%r2),__LC_SAVE_AREA_SYNC
|
||||
MBEAR %r2
|
||||
lgr %r3,%r14
|
||||
brasl %r14,__do_syscall
|
||||
lctlg %c1,%c1,__LC_USER_ASCE
|
||||
mvc __LC_RETURN_PSW(16),STACK_FRAME_OVERHEAD+__PT_PSW(%r15)
|
||||
BPEXIT __TI_flags(%r12),_TIF_ISOLATE_BP
|
||||
LBEAR STACK_FRAME_OVERHEAD+__PT_LAST_BREAK(%r15)
|
||||
lmg %r0,%r15,STACK_FRAME_OVERHEAD+__PT_R0(%r15)
|
||||
stpt __LC_EXIT_TIMER
|
||||
b __LC_RETURN_LPSWE
|
||||
LPSWEY __LC_RETURN_PSW,__LC_RETURN_LPSWE
|
||||
ENDPROC(system_call)
|
||||
|
||||
#
|
||||
@ -340,9 +359,10 @@ ENTRY(ret_from_fork)
|
||||
lctlg %c1,%c1,__LC_USER_ASCE
|
||||
mvc __LC_RETURN_PSW(16),STACK_FRAME_OVERHEAD+__PT_PSW(%r15)
|
||||
BPEXIT __TI_flags(%r12),_TIF_ISOLATE_BP
|
||||
LBEAR STACK_FRAME_OVERHEAD+__PT_LAST_BREAK(%r15)
|
||||
lmg %r0,%r15,STACK_FRAME_OVERHEAD+__PT_R0(%r15)
|
||||
stpt __LC_EXIT_TIMER
|
||||
b __LC_RETURN_LPSWE
|
||||
LPSWEY __LC_RETURN_PSW,__LC_RETURN_LPSWE
|
||||
ENDPROC(ret_from_fork)
|
||||
|
||||
/*
|
||||
@ -382,6 +402,7 @@ ENTRY(pgm_check_handler)
|
||||
xc __SF_BACKCHAIN(8,%r15),__SF_BACKCHAIN(%r15)
|
||||
stmg %r0,%r7,__PT_R0(%r11)
|
||||
mvc __PT_R8(64,%r11),__LC_SAVE_AREA_SYNC
|
||||
mvc __PT_LAST_BREAK(8,%r11),__LC_PGM_LAST_BREAK
|
||||
stmg %r8,%r9,__PT_PSW(%r11)
|
||||
|
||||
# clear user controlled registers to prevent speculative use
|
||||
@ -401,8 +422,9 @@ ENTRY(pgm_check_handler)
|
||||
stpt __LC_EXIT_TIMER
|
||||
.Lpgm_exit_kernel:
|
||||
mvc __LC_RETURN_PSW(16),STACK_FRAME_OVERHEAD+__PT_PSW(%r15)
|
||||
LBEAR STACK_FRAME_OVERHEAD+__PT_LAST_BREAK(%r15)
|
||||
lmg %r0,%r15,STACK_FRAME_OVERHEAD+__PT_R0(%r15)
|
||||
b __LC_RETURN_LPSWE
|
||||
LPSWEY __LC_RETURN_PSW,__LC_RETURN_LPSWE
|
||||
|
||||
#
|
||||
# single stepped system call
|
||||
@ -412,7 +434,8 @@ ENTRY(pgm_check_handler)
|
||||
larl %r14,.Lsysc_per
|
||||
stg %r14,__LC_RETURN_PSW+8
|
||||
lghi %r14,1
|
||||
lpswe __LC_RETURN_PSW # branch to .Lsysc_per
|
||||
LBEAR __LC_PGM_LAST_BREAK
|
||||
LPSWEY __LC_RETURN_PSW,__LC_RETURN_LPSWE # branch to .Lsysc_per
|
||||
ENDPROC(pgm_check_handler)
|
||||
|
||||
/*
|
||||
@ -422,6 +445,7 @@ ENDPROC(pgm_check_handler)
|
||||
ENTRY(\name)
|
||||
STCK __LC_INT_CLOCK
|
||||
stpt __LC_SYS_ENTER_TIMER
|
||||
STBEAR __LC_LAST_BREAK
|
||||
BPOFF
|
||||
stmg %r8,%r15,__LC_SAVE_AREA_ASYNC
|
||||
lg %r12,__LC_CURRENT
|
||||
@ -453,6 +477,7 @@ ENTRY(\name)
|
||||
xgr %r10,%r10
|
||||
xc __PT_FLAGS(8,%r11),__PT_FLAGS(%r11)
|
||||
mvc __PT_R8(64,%r11),__LC_SAVE_AREA_ASYNC
|
||||
MBEAR %r11
|
||||
stmg %r8,%r9,__PT_PSW(%r11)
|
||||
tm %r8,0x0001 # coming from user space?
|
||||
jno 1f
|
||||
@ -465,8 +490,9 @@ ENTRY(\name)
|
||||
lctlg %c1,%c1,__LC_USER_ASCE
|
||||
BPEXIT __TI_flags(%r12),_TIF_ISOLATE_BP
|
||||
stpt __LC_EXIT_TIMER
|
||||
2: lmg %r0,%r15,__PT_R0(%r11)
|
||||
b __LC_RETURN_LPSWE
|
||||
2: LBEAR __PT_LAST_BREAK(%r11)
|
||||
lmg %r0,%r15,__PT_R0(%r11)
|
||||
LPSWEY __LC_RETURN_PSW,__LC_RETURN_LPSWE
|
||||
ENDPROC(\name)
|
||||
.endm
|
||||
|
||||
@ -505,6 +531,7 @@ ENTRY(mcck_int_handler)
|
||||
BPOFF
|
||||
la %r1,4095 # validate r1
|
||||
spt __LC_CPU_TIMER_SAVE_AREA-4095(%r1) # validate cpu timer
|
||||
LBEAR __LC_LAST_BREAK_SAVE_AREA-4095(%r1) # validate bear
|
||||
lmg %r0,%r15,__LC_GPREGS_SAVE_AREA-4095(%r1)# validate gprs
|
||||
lg %r12,__LC_CURRENT
|
||||
lmg %r8,%r9,__LC_MCK_OLD_PSW
|
||||
@ -591,8 +618,10 @@ ENTRY(mcck_int_handler)
|
||||
jno 0f
|
||||
BPEXIT __TI_flags(%r12),_TIF_ISOLATE_BP
|
||||
stpt __LC_EXIT_TIMER
|
||||
0: lmg %r11,%r15,__PT_R11(%r11)
|
||||
b __LC_RETURN_MCCK_LPSWE
|
||||
0: ALTERNATIVE "", __stringify(lghi %r12,__LC_LAST_BREAK_SAVE_AREA),193
|
||||
LBEAR 0(%r12)
|
||||
lmg %r11,%r15,__PT_R11(%r11)
|
||||
LPSWEY __LC_RETURN_MCCK_PSW,__LC_RETURN_MCCK_LPSWE
|
||||
|
||||
.Lmcck_panic:
|
||||
/*
|
||||
|
@ -70,5 +70,6 @@ extern struct exception_table_entry _stop_amode31_ex_table[];
|
||||
#define __amode31_data __section(".amode31.data")
|
||||
#define __amode31_ref __section(".amode31.refs")
|
||||
extern long _start_amode31_refs[], _end_amode31_refs[];
|
||||
extern unsigned long __amode31_base;
|
||||
|
||||
#endif /* _ENTRY_H */
|
||||
|
@ -17,6 +17,7 @@
|
||||
#include <linux/kprobes.h>
|
||||
#include <trace/syscall.h>
|
||||
#include <asm/asm-offsets.h>
|
||||
#include <asm/text-patching.h>
|
||||
#include <asm/cacheflush.h>
|
||||
#include <asm/ftrace.lds.h>
|
||||
#include <asm/nospec-branch.h>
|
||||
@ -80,17 +81,6 @@ asm(
|
||||
|
||||
#ifdef CONFIG_MODULES
|
||||
static char *ftrace_plt;
|
||||
|
||||
asm(
|
||||
" .data\n"
|
||||
"ftrace_plt_template:\n"
|
||||
" basr %r1,%r0\n"
|
||||
" lg %r1,0f-.(%r1)\n"
|
||||
" br %r1\n"
|
||||
"0: .quad ftrace_caller\n"
|
||||
"ftrace_plt_template_end:\n"
|
||||
" .previous\n"
|
||||
);
|
||||
#endif /* CONFIG_MODULES */
|
||||
|
||||
static const char *ftrace_shared_hotpatch_trampoline(const char **end)
|
||||
@ -116,7 +106,7 @@ static const char *ftrace_shared_hotpatch_trampoline(const char **end)
|
||||
|
||||
bool ftrace_need_init_nop(void)
|
||||
{
|
||||
return ftrace_shared_hotpatch_trampoline(NULL);
|
||||
return true;
|
||||
}
|
||||
|
||||
int ftrace_init_nop(struct module *mod, struct dyn_ftrace *rec)
|
||||
@ -175,28 +165,6 @@ int ftrace_modify_call(struct dyn_ftrace *rec, unsigned long old_addr,
|
||||
return 0;
|
||||
}
|
||||
|
||||
static void ftrace_generate_nop_insn(struct ftrace_insn *insn)
|
||||
{
|
||||
/* brcl 0,0 */
|
||||
insn->opc = 0xc004;
|
||||
insn->disp = 0;
|
||||
}
|
||||
|
||||
static void ftrace_generate_call_insn(struct ftrace_insn *insn,
|
||||
unsigned long ip)
|
||||
{
|
||||
unsigned long target;
|
||||
|
||||
/* brasl r0,ftrace_caller */
|
||||
target = FTRACE_ADDR;
|
||||
#ifdef CONFIG_MODULES
|
||||
if (is_module_addr((void *)ip))
|
||||
target = (unsigned long)ftrace_plt;
|
||||
#endif /* CONFIG_MODULES */
|
||||
insn->opc = 0xc005;
|
||||
insn->disp = (target - ip) / 2;
|
||||
}
|
||||
|
||||
static void brcl_disable(void *brcl)
|
||||
{
|
||||
u8 op = 0x04; /* set mask field to zero */
|
||||
@ -207,23 +175,7 @@ static void brcl_disable(void *brcl)
|
||||
int ftrace_make_nop(struct module *mod, struct dyn_ftrace *rec,
|
||||
unsigned long addr)
|
||||
{
|
||||
struct ftrace_insn orig, new, old;
|
||||
|
||||
if (ftrace_shared_hotpatch_trampoline(NULL)) {
|
||||
brcl_disable((void *)rec->ip);
|
||||
return 0;
|
||||
}
|
||||
|
||||
if (copy_from_kernel_nofault(&old, (void *) rec->ip, sizeof(old)))
|
||||
return -EFAULT;
|
||||
/* Replace ftrace call with a nop. */
|
||||
ftrace_generate_call_insn(&orig, rec->ip);
|
||||
ftrace_generate_nop_insn(&new);
|
||||
|
||||
/* Verify that the to be replaced code matches what we expect. */
|
||||
if (memcmp(&orig, &old, sizeof(old)))
|
||||
return -EINVAL;
|
||||
s390_kernel_write((void *) rec->ip, &new, sizeof(new));
|
||||
brcl_disable((void *)rec->ip);
|
||||
return 0;
|
||||
}
|
||||
|
||||
@ -236,23 +188,7 @@ static void brcl_enable(void *brcl)
|
||||
|
||||
int ftrace_make_call(struct dyn_ftrace *rec, unsigned long addr)
|
||||
{
|
||||
struct ftrace_insn orig, new, old;
|
||||
|
||||
if (ftrace_shared_hotpatch_trampoline(NULL)) {
|
||||
brcl_enable((void *)rec->ip);
|
||||
return 0;
|
||||
}
|
||||
|
||||
if (copy_from_kernel_nofault(&old, (void *) rec->ip, sizeof(old)))
|
||||
return -EFAULT;
|
||||
/* Replace nop with an ftrace call. */
|
||||
ftrace_generate_nop_insn(&orig);
|
||||
ftrace_generate_call_insn(&new, rec->ip);
|
||||
|
||||
/* Verify that the to be replaced code matches what we expect. */
|
||||
if (memcmp(&orig, &old, sizeof(old)))
|
||||
return -EINVAL;
|
||||
s390_kernel_write((void *) rec->ip, &new, sizeof(new));
|
||||
brcl_enable((void *)rec->ip);
|
||||
return 0;
|
||||
}
|
||||
|
||||
@ -264,22 +200,16 @@ int ftrace_update_ftrace_func(ftrace_func_t func)
|
||||
|
||||
void arch_ftrace_update_code(int command)
|
||||
{
|
||||
if (ftrace_shared_hotpatch_trampoline(NULL))
|
||||
ftrace_modify_all_code(command);
|
||||
else
|
||||
ftrace_run_stop_machine(command);
|
||||
}
|
||||
|
||||
static void __ftrace_sync(void *dummy)
|
||||
{
|
||||
ftrace_modify_all_code(command);
|
||||
}
|
||||
|
||||
int ftrace_arch_code_modify_post_process(void)
|
||||
{
|
||||
if (ftrace_shared_hotpatch_trampoline(NULL)) {
|
||||
/* Send SIGP to the other CPUs, so they see the new code. */
|
||||
smp_call_function(__ftrace_sync, NULL, 1);
|
||||
}
|
||||
/*
|
||||
* Flush any pre-fetched instructions on all
|
||||
* CPUs to make the new code visible.
|
||||
*/
|
||||
text_poke_sync_lock();
|
||||
return 0;
|
||||
}
|
||||
|
||||
@ -294,10 +224,6 @@ static int __init ftrace_plt_init(void)
|
||||
panic("cannot allocate ftrace plt\n");
|
||||
|
||||
start = ftrace_shared_hotpatch_trampoline(&end);
|
||||
if (!start) {
|
||||
start = ftrace_plt_template;
|
||||
end = ftrace_plt_template_end;
|
||||
}
|
||||
memcpy(ftrace_plt, start, end - start);
|
||||
set_memory_ro((unsigned long)ftrace_plt, 1);
|
||||
return 0;
|
||||
@ -337,12 +263,14 @@ NOKPROBE_SYMBOL(prepare_ftrace_return);
|
||||
int ftrace_enable_ftrace_graph_caller(void)
|
||||
{
|
||||
brcl_disable(ftrace_graph_caller);
|
||||
text_poke_sync_lock();
|
||||
return 0;
|
||||
}
|
||||
|
||||
int ftrace_disable_ftrace_graph_caller(void)
|
||||
{
|
||||
brcl_enable(ftrace_graph_caller);
|
||||
text_poke_sync_lock();
|
||||
return 0;
|
||||
}
|
||||
|
||||
|
@ -20,8 +20,6 @@ __HEAD
|
||||
ENTRY(startup_continue)
|
||||
larl %r1,tod_clock_base
|
||||
mvc 0(16,%r1),__LC_BOOT_CLOCK
|
||||
larl %r13,.LPG1 # get base
|
||||
lctlg %c0,%c15,.Lctl-.LPG1(%r13) # load control registers
|
||||
#
|
||||
# Setup stack
|
||||
#
|
||||
@ -42,19 +40,3 @@ ENTRY(startup_continue)
|
||||
.align 16
|
||||
.LPG1:
|
||||
.Ldw: .quad 0x0002000180000000,0x0000000000000000
|
||||
.Lctl: .quad 0x04040000 # cr0: AFP registers & secondary space
|
||||
.quad 0 # cr1: primary space segment table
|
||||
.quad 0 # cr2: dispatchable unit control table
|
||||
.quad 0 # cr3: instruction authorization
|
||||
.quad 0xffff # cr4: instruction authorization
|
||||
.quad 0 # cr5: primary-aste origin
|
||||
.quad 0 # cr6: I/O interrupts
|
||||
.quad 0 # cr7: secondary space segment table
|
||||
.quad 0x0000000000008000 # cr8: access registers translation
|
||||
.quad 0 # cr9: tracing off
|
||||
.quad 0 # cr10: tracing off
|
||||
.quad 0 # cr11: tracing off
|
||||
.quad 0 # cr12: tracing off
|
||||
.quad 0 # cr13: home space segment table
|
||||
.quad 0xc0000000 # cr14: machine check handling off
|
||||
.quad 0 # cr15: linkage stack operations
|
||||
|
@ -140,8 +140,11 @@ void noinstr do_io_irq(struct pt_regs *regs)
|
||||
|
||||
irq_enter();
|
||||
|
||||
if (user_mode(regs))
|
||||
if (user_mode(regs)) {
|
||||
update_timer_sys();
|
||||
if (static_branch_likely(&cpu_has_bear))
|
||||
current->thread.last_break = regs->last_break;
|
||||
}
|
||||
|
||||
from_idle = !user_mode(regs) && regs->psw.addr == (unsigned long)psw_idle_exit;
|
||||
if (from_idle)
|
||||
@ -171,8 +174,11 @@ void noinstr do_ext_irq(struct pt_regs *regs)
|
||||
|
||||
irq_enter();
|
||||
|
||||
if (user_mode(regs))
|
||||
if (user_mode(regs)) {
|
||||
update_timer_sys();
|
||||
if (static_branch_likely(&cpu_has_bear))
|
||||
current->thread.last_break = regs->last_break;
|
||||
}
|
||||
|
||||
regs->int_code = S390_lowcore.ext_int_code_addr;
|
||||
regs->int_parm = S390_lowcore.ext_params;
|
||||
|
Some files were not shown because too many files have changed in this diff