Tuesday, September 27, 2016

LINUX SUSPEND / RESUME DEBUGGING TECHNIQUES


initcall_debug : Adding the initcall_debug boot option to the kernel cmdline traces initcalls and driver PM callbacks during boot, suspend, and resume.
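With initcall_debug on, each initcall and device PM callback is logged along with how long it took. Sorting those messages by duration is a quick way to spot the slow or stuck ones. The helper below is a sketch. slowest_calls is an illustrative name, and it assumes the timing messages contain "after <N> usecs". The exact wording differs between kernel versions, so adjust the pattern if it does not match your logs.

```shell
#!/bin/sh
# Sketch: print the N slowest initcalls / PM callbacks from kernel log text on stdin.
# ASSUMPTION: timing lines contain "after <N> usecs" (wording varies by kernel version).
# slowest_calls is a hypothetical helper name.
slowest_calls() {
    n="${1:-10}"
    # Keep only timing lines, prefixing each with its duration as a sort key,
    # sort numerically (largest first), keep the top N, then drop the key again.
    sed -n 's/.* after \([0-9][0-9]*\) usecs.*/\1 &/p' |
        sort -k1,1nr |
        head -n "$n" |
        cut -d' ' -f2-
}
```

For example, after a suspend/resume cycle: dmesg | slowest_calls 20 (reading dmesg may require root).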

no_console_suspend : Adding the no_console_suspend boot option to the kernel cmdline disables suspending of consoles during suspend/hibernate. Debug messages printed while drivers are suspending and resuming can then still reach the console.

ignore_loglevel : Adding the ignore_loglevel boot option to the kernel cmdline prints all kernel messages to the console regardless of the current loglevel, which is useful for debugging.

Serial console : To enable a serial console, add console=ttyS0,115200 and no_console_suspend to the kernel cmdline.

Refer to http://ayyappa-ch.blogspot.in/2015/07/serial-console-logging.html
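After rebooting, it is worth confirming that the options actually reached the kernel, since boot loader edits are easy to get wrong or to forget to regenerate. On GRUB-based distributions, they usually go into GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, followed by update-grub (Debian/Ubuntu) or grub2-mkconfig (Fedora and others). Check your distribution's documentation for the exact steps. The sketch below reads /proc/cmdline and reports which of the options discussed above are present. check_debug_opts is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: report which suspend/resume debug options are on the kernel command line.
# Reads /proc/cmdline unless a command line string is passed as $1.
# check_debug_opts is a hypothetical helper; edit the option list to taste.
check_debug_opts() {
    cmdline="${1:-$(cat /proc/cmdline)}"
    for opt in initcall_debug no_console_suspend ignore_loglevel console=ttyS0; do
        # Pad with spaces so an option matches only as a whole word;
        # the "$opt," form also accepts console=ttyS0,115200.
        case " $cmdline " in
            *" $opt "* | *" $opt,"*) echo "$opt: present" ;;
            *) echo "$opt: MISSING" ;;
        esac
    done
}
```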

Dynamic debug : Dynamic debug lets you enable and disable kernel debug messages at runtime to obtain additional information. If CONFIG_DYNAMIC_DEBUG is set, every pr_debug()/dev_dbg() call can be enabled individually, per call site.

Refer to https://lwn.net/Articles/434856/
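In practice, dynamic debug is driven by writing queries such as "file <path> +p" (enable) or "-p" (disable) to the control file in debugfs. The sketch below assumes you are root and that debugfs is mounted at /sys/kernel/debug. ddebug_file is a hypothetical helper, and DDEBUG_CTRL exists only so the path can be pointed at a scratch file.

```shell
#!/bin/sh
# Sketch: turn pr_debug()/dev_dbg() output on or off for one source file.
# ASSUMPTIONS: root, CONFIG_DYNAMIC_DEBUG=y, debugfs mounted at /sys/kernel/debug.
# ddebug_file and DDEBUG_CTRL are hypothetical names; override DDEBUG_CTRL for a dry run.
DDEBUG_CTRL="${DDEBUG_CTRL:-/sys/kernel/debug/dynamic_debug/control}"

ddebug_file() {
    # $1: source path as it appears in the control file, e.g. drivers/base/power/main.c
    # $2: "+p" to enable printing, "-p" to disable it (default +p)
    echo "file $1 ${2:-+p}" > "$DDEBUG_CTRL"
}
```

For example, run ddebug_file drivers/base/power/main.c before suspending, then read dmesg after resume. grep drivers/base/power /sys/kernel/debug/dynamic_debug/control lists the matching call sites, with =p shown on the enabled ones.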

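pm_async, pm_test : /sys/power/pm_async controls whether devices are suspended and resumed asynchronously. /sys/power/pm_test (available with CONFIG_PM_DEBUG) runs the suspend sequence only up to a chosen stage. Refer to https://01.org/blogs/rzhang/2015/best-practice-debug-linux-suspend/hibernate-issues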
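With pm_test set to a level, the kernel goes only that far into suspend, waits a few seconds, and then resumes by itself. Stepping through the levels therefore shows how far suspend gets before something breaks. Writing 0 to pm_async makes devices suspend and resume one after another. This helps tell whether a problem is related to asynchronous suspend, and it makes the log easier to follow. The script below is a sketch, run as root, and pm_test_walk and SYS_POWER are hypothetical names. If the machine hangs, note the last level printed.

```shell
#!/bin/sh
# Sketch: step through /sys/power/pm_test levels to see how far suspend gets.
# ASSUMPTIONS: root, CONFIG_PM_DEBUG=y. pm_test_walk and SYS_POWER are hypothetical
# names; SYS_POWER can point at a scratch directory for a dry run.
SYS_POWER="${SYS_POWER:-/sys/power}"

pm_test_walk() {
    # Suspend/resume devices sequentially instead of asynchronously.
    echo 0 > "$SYS_POWER/pm_async"
    # Levels ordered from the smallest to the largest part of the suspend path.
    for level in freezer devices platform processors core; do
        echo "pm_test: $level"
        echo "$level" > "$SYS_POWER/pm_test"
        # The write to state fails (non-zero status) if this stage of suspend fails.
        if ! echo mem > "$SYS_POWER/state"; then
            echo "suspend failed at level: $level" >&2
            break
        fi
    done
    # Restore normal suspend behaviour.
    echo none > "$SYS_POWER/pm_test"
    echo 1 > "$SYS_POWER/pm_async"
}
```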

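Enable PM_DEBUG and PM_TRACE : Build the kernel with CONFIG_PM_DEBUG and CONFIG_PM_TRACE enabled (on x86, CONFIG_PM_TRACE is selected by CONFIG_PM_TRACE_RTC), then suspend with a script like this:

#!/bin/sh
sync
echo 1 > /sys/power/pm_trace
echo mem > /sys/power/state

Instead of the last line, you can also use the suspend option from the GUI.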

If it doesn't come back up (which is usually the problem), reboot by holding the power button down, then look at the dmesg output for lines like:

Magic number: 4:156:725
hash matches drivers/base/power/resume.c:28
hash matches device 0000:01:00.0

This means the last trace event happened just before the kernel tried to resume device 0000:01:00.0. Then figure out which driver controls that device (lspci and /sys/devices/pci* are your friends), and see if you can fix it, disable it, or trace into its resume function.
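This lookup can be scripted. The sketch below greps the pm_trace result out of the kernel log after the forced reboot. It then resolves the driver bound to a PCI device through the device's driver symlink under /sys/bus/pci/devices, which links into the same /sys/devices/pci* tree. pm_trace_lines, pci_driver_of, and SYSFS_ROOT are hypothetical names.

```shell
#!/bin/sh
# Sketch: extract the pm_trace result and find the driver bound to a PCI device.
# pm_trace_lines, pci_driver_of and SYSFS_ROOT are hypothetical names;
# SYSFS_ROOT defaults to /sys and can be overridden for a dry run.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# Print the pm_trace lines ("Magic number", "hash matches") from a log on stdin.
pm_trace_lines() {
    grep -E 'Magic number|hash matches'
}

# Print the name of the driver bound to a PCI device such as 0000:01:00.0.
pci_driver_of() {
    link="$SYSFS_ROOT/bus/pci/devices/$1/driver"
    if [ -L "$link" ]; then
        # The symlink target ends in .../drivers/<name>.
        basename "$(readlink "$link")"
    else
        echo "no driver bound to $1" >&2
        return 1
    fi
}
```

Typical use: dmesg | pm_trace_lines, then pci_driver_of 0000:01:00.0. lspci -k -s 01:00.0 shows the same information under "Kernel driver in use".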

If no device matches the hash (or any matches appear to be false positives), the culprit may be a device from a loadable kernel module that is not loaded until after the hash is checked. You can check the hash against the current devices again after more modules are loaded using sysfs:

cat /sys/power/pm_trace_dev_match
echo 1 > /sys/power/pm_trace

In one of my issues, pm_trace_dev_match showed acpi, which pointed to a problem in the BIOS.

Refer to https://www.kernel.org/doc/Documentation/power/s2ram.txt


analyze_suspend : The analyze_suspend tool lets system developers visualize the activity between suspend and resume, so they can identify inefficiencies and bottlenecks. For example, you can start it with the following command:

./analyze_suspend.py -rtcwake 30 -f -m mem
Thirty seconds later, the system resumes automatically, and the tool generates three files in the ./suspend-yymmdd-hhmmss directory:

mem_dmesg.txt  mem_ftrace.txt  mem.html

Open the mem.html file in a browser first, and then dig into mem_ftrace.txt for the detailed trace data. You can get the analyze_suspend tool via git:

git clone https://github.com/01org/suspendresume.git

For more details, go to the homepage: https://01.org/suspendresume.
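A small wrapper around the command above saves hunting for the output directory. It is a sketch that assumes you run it as root from the top of the suspendresume clone. It also assumes the output directories are named suspend-<date>-<time>, so the lexically last one is the newest. latest_suspend_dir and run_analyze_suspend are hypothetical helper names.

```shell
#!/bin/sh
# Sketch: run analyze_suspend and show where its results were written.
# ASSUMPTIONS: run as root from the suspendresume clone; output directories are
# named suspend-<date>-<time>. Both function names are hypothetical.

# Print the lexically last suspend-* directory under $1 (default: .).
latest_suspend_dir() {
    latest=""
    # Pathname expansion is sorted, so the last existing match is the newest.
    for d in "${1:-.}"/suspend-*/; do
        [ -d "$d" ] && latest="$d"
    done
    [ -n "$latest" ] && echo "${latest%/}"
}

run_analyze_suspend() {
    ./analyze_suspend.py -rtcwake 30 -f -m mem || return 1
    dir=$(latest_suspend_dir .) || return 1
    echo "results in: $dir"
    ls "$dir"
}
```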

Test result: mem_dmesg.txt mem_ftrace.txt mem.html


Log Files: https://drive.google.com/drive/folders/0B_UViXaGblZQcHA1Mk1UUzB0V1E

[Figure: Suspend/Resume Flow]

References:
1) https://01.org/blogs/rzhang/2015/best-practice-debug-linux-suspend/hibernate-issues
2) https://www.kernel.org/doc/Documentation/power/s2ram.txt
3) https://lwn.net/Articles/434856/
4) https://github.com/01org/suspendresume
5) https://wiki.ubuntu.com/DebuggingKernelSuspend