From 947d2c2cd3340d11a1737d3f4781170cc3a4b74f Mon Sep 17 00:00:00 2001 From: "Rafael J. Wysocki" Date: Sat, 13 Aug 2016 02:54:04 +0200 Subject: [PATCH 1/3] PM / sleep: Update some system sleep documentation Update some documentation related to system sleep to document new features and remove outdated information from it. Signed-off-by: Rafael J. Wysocki Acked-by: Pavel Machek Reviewed-by: Chen Yu --- Documentation/power/basic-pm-debugging.txt | 27 ++++- Documentation/power/interface.txt | 117 +++++++++++---------- 2 files changed, 85 insertions(+), 59 deletions(-) diff --git a/Documentation/power/basic-pm-debugging.txt b/Documentation/power/basic-pm-debugging.txt index b96098ccfe69..708f87f78a75 100644 --- a/Documentation/power/basic-pm-debugging.txt +++ b/Documentation/power/basic-pm-debugging.txt @@ -164,7 +164,32 @@ load n/2 modules more and try again. Again, if you find the offending module(s), it(they) must be unloaded every time before hibernation, and please report the problem with it(them). -c) Advanced debugging +c) Using the "test_resume" hibernation option + +/sys/power/disk generally tells the kernel what to do after creating a +hibernation image. One of the available options is "test_resume" which +causes the just created image to be used for immediate restoration. Namely, +after doing: + +# echo test_resume > /sys/power/disk +# echo disk > /sys/power/state + +a hibernation image will be created and a resume from it will be triggered +immediately without involving the platform firmware in any way. + +That test can be used to check if failures to resume from hibernation are +related to bad interactions with the platform firmware. That is, if the above +works every time, but resume from actual hibernation does not work or is +unreliable, the platform firmware may be responsible for the failures. + +On architectures and platforms that support using different kernels to restore +hibernation images (that is, the kernel used to read the image from storage and +load it into memory is different from the one included in the image) or support +kernel address space randomization, it also can be used to check if failures +to resume may be related to the differences between the restore and image +kernels. + +d) Advanced debugging In case that hibernation does not work on your system even in the minimal configuration and compiling more drivers as modules is not practical or some diff --git a/Documentation/power/interface.txt b/Documentation/power/interface.txt index f1f0f59a7c47..974916ff6608 100644 --- a/Documentation/power/interface.txt +++ b/Documentation/power/interface.txt @@ -1,75 +1,76 @@ -Power Management Interface +Power Management Interface for System Sleep +Copyright (c) 2016 Intel Corp., Rafael J. Wysocki -The power management subsystem provides a unified sysfs interface to -userspace, regardless of what architecture or platform one is -running. The interface exists in /sys/power/ directory (assuming sysfs -is mounted at /sys). +The power management subsystem provides userspace with a unified sysfs interface +for system sleep regardless of the underlying system architecture or platform. +The interface is located in the /sys/power/ directory (assuming that sysfs is +mounted at /sys). -/sys/power/state controls system power state. Reading from this file -returns what states are supported, which is hard-coded to 'freeze', -'standby' (Power-On Suspend), 'mem' (Suspend-to-RAM), and 'disk' -(Suspend-to-Disk). +/sys/power/state is the system sleep state control file. -Writing to this file one of those strings causes the system to -transition into that state. Please see the file -Documentation/power/states.txt for a description of each of those -states. +Reading from it returns a list of supported sleep states, encoded as: +'freeze' (Suspend-to-Idle) +'standby' (Power-On Suspend) +'mem' (Suspend-to-RAM) +'disk' (Suspend-to-Disk) -/sys/power/disk controls the operating mode of the suspend-to-disk -mechanism. Suspend-to-disk can be handled in several ways. We have a -few options for putting the system to sleep - using the platform driver -(e.g. ACPI or other suspend_ops), powering off the system or rebooting the -system (for testing). +Suspend-to-Idle is always supported. Suspend-to-Disk is always supported +too as long the kernel has been configured to support hibernation at all +(ie. CONFIG_HIBERNATION is set in the kernel configuration file). Support +for Suspend-to-RAM and Power-On Suspend depends on the capabilities of the +platform. -Additionally, /sys/power/disk can be used to turn on one of the two testing -modes of the suspend-to-disk mechanism: 'testproc' or 'test'. If the -suspend-to-disk mechanism is in the 'testproc' mode, writing 'disk' to -/sys/power/state will cause the kernel to disable nonboot CPUs and freeze -tasks, wait for 5 seconds, unfreeze tasks and enable nonboot CPUs. If it is -in the 'test' mode, writing 'disk' to /sys/power/state will cause the kernel -to disable nonboot CPUs and freeze tasks, shrink memory, suspend devices, wait -for 5 seconds, resume devices, unfreeze tasks and enable nonboot CPUs. Then, -we are able to look in the log messages and work out, for example, which code -is being slow and which device drivers are misbehaving. +If one of the strings listed in /sys/power/state is written to it, the system +will attempt to transition into the corresponding sleep state. Refer to +Documentation/power/states.txt for a description of each of those states. -Reading from this file will display all supported modes and the currently -selected one in brackets, for example +/sys/power/disk controls the operating mode of hibernation (Suspend-to-Disk). +Specifically, it tells the kernel what to do after creating a hibernation image. - [shutdown] reboot test testproc +Reading from it returns a list of supported options encoded as: -Writing to this file will accept one of +'platform' (put the system into sleep using a platform-provided method) +'shutdown' (shut the system down) +'reboot' (reboot the system) +'suspend' (trigger a Suspend-to-RAM transition) +'test_resume' (resume-after-hibernation test mode) - 'platform' (only if the platform supports it) - 'shutdown' - 'reboot' - 'testproc' - 'test' +The currently selected option is printed in square brackets. -/sys/power/image_size controls the size of the image created by -the suspend-to-disk mechanism. It can be written a string -representing a non-negative integer that will be used as an upper -limit of the image size, in bytes. The suspend-to-disk mechanism will -do its best to ensure the image size will not exceed that number. However, -if this turns out to be impossible, it will try to suspend anyway using the -smallest image possible. In particular, if "0" is written to this file, the -suspend image will be as small as possible. +The 'platform' option is only available if the platform provides a special +mechanism to put the system to sleep after creating a hibernation image (ACPI +does that, for example). The 'suspend' option is available if Suspend-to-RAM +is supported. Refer to Documentation/power/basic_pm_debugging.txt for the +description of the 'test_resume' option. -Reading from this file will display the current image size limit, which -is set to 2/5 of available RAM by default. +To select an option, write the string representing it to /sys/power/disk. -/sys/power/pm_trace controls the code which saves the last PM event point in -the RTC across reboots, so that you can debug a machine that just hangs -during suspend (or more commonly, during resume). Namely, the RTC is only -used to save the last PM event point if this file contains '1'. Initially it -contains '0' which may be changed to '1' by writing a string representing a -nonzero integer into it. +/sys/power/image_size controls the size of hibernation images. -To use this debugging feature you should attempt to suspend the machine, then -reboot it and run +It can be written a string representing a non-negative integer that will be +used as a best-effort upper limit of the image size, in bytes. The hibernation +core will do its best to ensure that the image size will not exceed that number. +However, if that turns out to be impossible to achieve, a hibernation image will +still be created and its size will be as small as possible. In particular, +writing '0' to this file will enforce hibernation images to be as small as +possible. - dmesg -s 1000000 | grep 'hash matches' +Reading from this file returns the current image size limit, which is set to +around 2/5 of available RAM by default. -CAUTION: Using it will cause your machine's real-time (CMOS) clock to be -set to a random invalid time after a resume. +/sys/power/pm_trace controls the PM trace mechanism saving the last suspend +or resume event point in the RTC across reboots. + +It helps to debug hard lockups or reboots due to device driver failures that +occur during system suspend or resume (which is more common) more effectively. + +If /sys/power/pm_trace contains '1', the fingerprint of each suspend/resume +event point in turn will be stored in the RTC memory (overwriting the actual +RTC information), so it will survive a system crash if one occurs right after +storing it and it can be used later to identify the driver that caused the crash +to happen (see Documentation/power/s2ram.txt for more information). + +Initially it contains '0' which may be changed to '1' by writing a string +representing a nonzero integer into it. From 5d87f493ddb1b86a0569fa3c4037fa9efc0c7183 Mon Sep 17 00:00:00 2001 From: "Rafael J. Wysocki" Date: Sun, 14 Aug 2016 04:07:32 +0200 Subject: [PATCH 2/3] x86/power/64: Use __pa() for physical address computation The value of temp_level4_pgt is the physical address of the top-level page directory, so use __pa() to compute it. Signed-off-by: Rafael J. Wysocki Acked-by: Ingo Molnar --- arch/x86/power/hibernate_64.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/x86/power/hibernate_64.c b/arch/x86/power/hibernate_64.c index a3e3ccc87138..9634557a5444 100644 --- a/arch/x86/power/hibernate_64.c +++ b/arch/x86/power/hibernate_64.c @@ -113,7 +113,7 @@ static int set_up_temporary_mappings(void) return result; } - temp_level4_pgt = (unsigned long)pgd - __PAGE_OFFSET; + temp_level4_pgt = __pa(pgd); return 0; } From 924d8696751c4b9e58263bc82efdafcf875596a6 Mon Sep 17 00:00:00 2001 From: James Morse Date: Tue, 16 Aug 2016 10:46:38 +0100 Subject: [PATCH 3/3] PM / hibernate: Fix rtree_next_node() to avoid walking off list ends rtree_next_node() walks the linked list of leaf nodes to find the next block of pages in the struct memory_bitmap. If it walks off the end of the list of nodes, it walks the list of memory zones to find the next region of memory. If it walks off the end of the list of zones, it returns false. This leaves the struct bm_position's node and zone pointers pointing at their respective struct list_heads in struct mem_zone_bm_rtree. memory_bm_find_bit() uses struct bm_position's node and zone pointers to avoid walking lists and trees if the next bit appears in the same node/zone. It handles these values being stale. Swap rtree_next_node()s 'step then test' to 'test-next then step', this means if we reach the end of memory we return false and leave the node and zone pointers as they were. This fixes a panic on resume using AMD Seattle with 64K pages: [ 6.868732] Freezing user space processes ... (elapsed 0.000 seconds) done. [ 6.875753] Double checking all user space processes after OOM killer disable... (elapsed 0.000 seconds) [ 6.896453] PM: Using 3 thread(s) for decompression. [ 6.896453] PM: Loading and decompressing image data (5339 pages)... [ 7.318890] PM: Image loading progress: 0% [ 7.323395] Unable to handle kernel paging request at virtual address 00800040 [ 7.330611] pgd = ffff000008df0000 [ 7.334003] [00800040] *pgd=00000083fffe0003, *pud=00000083fffe0003, *pmd=00000083fffd0003, *pte=0000000000000000 [ 7.344266] Internal error: Oops: 96000005 [#1] PREEMPT SMP [ 7.349825] Modules linked in: [ 7.352871] CPU: 2 PID: 1 Comm: swapper/0 Tainted: G W I 4.8.0-rc1 #4737 [ 7.360512] Hardware name: AMD Overdrive/Supercharger/Default string, BIOS ROD1002C 04/08/2016 [ 7.369109] task: ffff8003c0220000 task.stack: ffff8003c0280000 [ 7.375020] PC is at set_bit+0x18/0x30 [ 7.378758] LR is at memory_bm_set_bit+0x24/0x30 [ 7.383362] pc : [] lr : [] pstate: 60000045 [ 7.390743] sp : ffff8003c0283b00 [ 7.473551] [ 7.475031] Process swapper/0 (pid: 1, stack limit = 0xffff8003c0280020) [ 7.481718] Stack: (0xffff8003c0283b00 to 0xffff8003c0284000) [ 7.800075] Call trace: [ 7.887097] [] set_bit+0x18/0x30 [ 7.891876] [] duplicate_memory_bitmap.constprop.38+0x54/0x70 [ 7.899172] [] snapshot_write_next+0x22c/0x47c [ 7.905166] [] load_image_lzo+0x754/0xa88 [ 7.910725] [] swsusp_read+0x144/0x230 [ 7.916025] [] load_image_and_restore+0x58/0x90 [ 7.922105] [] software_resume+0x2f0/0x338 [ 7.927752] [] do_one_initcall+0x38/0x11c [ 7.933314] [] kernel_init_freeable+0x14c/0x1ec [ 7.939395] [] kernel_init+0x10/0xfc [ 7.944520] [] ret_from_fork+0x10/0x40 [ 7.949820] Code: d2800022 8b400c21 f9800031 9ac32043 (c85f7c22) [ 7.955909] ---[ end trace 0024a5986e6ff323 ]--- [ 7.960529] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b Here struct mem_zone_bm_rtree's start_pfn has been returned instead of struct rtree_node's addr as the node/zone pointers are corrupt after we walked off the end of the lists during mark_unsafe_pages(). This behaviour was exposed by commit 6dbecfd345a6 ("PM / hibernate: Simplify mark_unsafe_pages()"), which caused mark_unsafe_pages() to call duplicate_memory_bitmap(), which uses memory_bm_find_bit() after walking off the end of the memory bitmap. Fixes: 3a20cb177961 (PM / Hibernate: Implement position keeping in radix tree) Signed-off-by: James Morse [ rjw: Subject ] Signed-off-by: Rafael J. Wysocki --- kernel/power/snapshot.c | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c index d90df926b59f..2ba18691fe76 100644 --- a/kernel/power/snapshot.c +++ b/kernel/power/snapshot.c @@ -835,9 +835,9 @@ static bool memory_bm_pfn_present(struct memory_bitmap *bm, unsigned long pfn) */ static bool rtree_next_node(struct memory_bitmap *bm) { - bm->cur.node = list_entry(bm->cur.node->list.next, - struct rtree_node, list); - if (&bm->cur.node->list != &bm->cur.zone->leaves) { + if (!list_is_last(&bm->cur.node->list, &bm->cur.zone->leaves)) { + bm->cur.node = list_entry(bm->cur.node->list.next, + struct rtree_node, list); bm->cur.node_pfn += BM_BITS_PER_BLOCK; bm->cur.node_bit = 0; touch_softlockup_watchdog(); @@ -845,9 +845,9 @@ static bool rtree_next_node(struct memory_bitmap *bm) } /* No more nodes, goto next zone */ - bm->cur.zone = list_entry(bm->cur.zone->list.next, + if (!list_is_last(&bm->cur.zone->list, &bm->zones)) { + bm->cur.zone = list_entry(bm->cur.zone->list.next, struct mem_zone_bm_rtree, list); - if (&bm->cur.zone->list != &bm->zones) { bm->cur.node = list_entry(bm->cur.zone->leaves.next, struct rtree_node, list); bm->cur.node_pfn = 0;