
SWAT Rotation

Alexandre Belloni
 

Hello Leonardo,

You are the next one on the SWAT Team members list
(https://wiki.yoctoproject.org/wiki/Yocto_Build_Failure_Swat_Team#Members)
and SWAT duty will rotate to you at EOD 2021-07-02.

Please reply to let me know whether you will be able to work on this
task.

Thanks!

--
Alexandre Belloni, Bootlin
Embedded Linux and Kernel engineering
https://bootlin.com


Failed qemu bugs - notes for SWAT

Richard Purdie
 

I have some good news. Now that we have merged the rcu stall crash fix:

http://git.yoctoproject.org/cgit.cgi/poky/commit/?id=1f802cdf857475eefb9f3ae5032a6f1ce3e01d97

I think that the cause of some of our qemu hangs under higher IO load
should be fixed. As such, I've gone through and closed out a number of qemu
AB-INT bugs that look like they're related to this.

If we do see anything that looks like one of the closed issues recurring, I'd
suggest opening a new bug, as the older ones are becoming unwieldy and it is
hard to tell what may have been the rcu stall and what wasn't.

For any new testimage/qemu failures, please do try to capture and include
the qemu boot log from WORKDIR/testimage/qemu_*.log if it isn't printed in the
console log. If you don't have access, ping someone who does. Getting those logs
is proving to be of key importance. We have improved some of the code paths,
but we're not quite there yet for some failure cases.
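
For reference, a rough sketch of pulling that log out of a local build (this
assumes the default poky TMPDIR layout and a qemux86-64 core-image-sato run;
the exact work directory varies with machine, image and distro):

  # newest boot log from the last testimage run
  ls -t tmp/work/qemux86_64-poky-linux/core-image-sato/1.0-r0/testimage/qemu_boot_log.* | head -1

  # quick scan for the usual stall/crash markers before attaching the log to a bug
  grep -nE 'rcu: INFO|BUG:' tmp/work/qemux86_64-poky-linux/core-image-sato/1.0-r0/testimage/qemu_boot_log.*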

I also closed out the ltp bugs since we do have a fix there at least for the 
most common hang. I suspect there is still an arm hang lurking somewhere so
if we see that, it should have its own new bug (ping Alex or me too please as 
we'd like to poke at the hung build).

Cheers,

Richard


Re: Autobuilder "rcu stall" issue summary

Richard Purdie
 

On Fri, 2021-06-25 at 16:34 +0100, Richard Purdie via lists.yoctoproject.org wrote:
On Thu, 2021-06-24 at 17:31 +0100, Richard Purdie via lists.yoctoproject.org wrote:
I'll stop here but will follow up on the mail if I remember more info. Anyone else
feel free to add and if anyone has any insight into what is happening or how to better
debug, I'm very open to it!
I formed a new plan and set the stall detector to 3 seconds instead of 
21 seconds with this hack in master-next:

http://git.yoctoproject.org/cgit.cgi/poky/commit/?h=master-next&id=71b6bc157d39e09e8f76a15a049168eb72bbd3d9

i.e. adding CONFIG_RCU_CPU_STALL_TIMEOUT=3
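
For anyone wanting to try the same setting locally without picking up the
master-next hack, a sketch of carrying it as a config fragment in your own
layer (the bbappend path and fragment name below are made up for illustration):

  # recipes-kernel/linux/linux-yocto_%.bbappend
  FILESEXTRAPATHS_prepend := "${THISDIR}/files:"
  SRC_URI += "file://rcu-stall-timeout.cfg"

  # files/rcu-stall-timeout.cfg
  CONFIG_RCU_CPU_STALL_TIMEOUT=3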

I ran two builds on the autobuilder, one of which was a heavy rebuild, so it
should trigger a lot of IO. I've been looking over the build results; we have
a number of builds which failed or are in the process of doing so, and I've
dumped four below (three arm, one x86-64).

My thinking was that if we reduced the stall detector limit and only saw
occasional hangs at the same rate as before, it was likely qemu. If we saw
an increase in hangs (which we definitely have, even on the preliminary,
incomplete results), it is more likely that something in the kernel RCU stall
code is taking out the system. The BUG: entries in occasional logs also hint
at the latter.

My own next step is probably to hack a "stall report" trigger into sysrq and
try it in some of my own images locally, to see if I can break this at will.
I realised there was a potential lock issue in the rcu stall code in the
kernel. In looking into it with Paul we found that upstream have a fix
queued for the same issue:

https://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git/commit/?id=406a2f008f2e

This would account for the BUG: entries in the previous email. I plan to
test this with the short rcu stall timeout and see if we still see the hangs.
Adding this fix to linux-yocto would seem helpful, to rule it out if nothing
else, but it does look like the kind of thing that would cause the issues
we've been tracking.

Cheers,

Richard


Re: Autobuilder "rcu stall" issue summary

Richard Purdie
 

On Thu, 2021-06-24 at 17:31 +0100, Richard Purdie via lists.yoctoproject.org wrote:
I'll stop here but will follow up on the mail if I remember more info. Anyone else
feel free to add and if anyone has any insight into what is happening or how to better
debug, I'm very open to it!
I formed a new plan and set the stall detector to 3 seconds instead of 
21 seconds with this hack in master-next:

http://git.yoctoproject.org/cgit.cgi/poky/commit/?h=master-next&id=71b6bc157d39e09e8f76a15a049168eb72bbd3d9

i.e. adding CONFIG_RCU_CPU_STALL_TIMEOUT=3

I ran two builds on the autobuilder, one of which was a heavy rebuild, so it
should trigger a lot of IO. I've been looking over the build results; we have
a number of builds which failed or are in the process of doing so, and I've
dumped four below (three arm, one x86-64).

My thinking was that if we reduced the stall detector limit and only saw
occasional hangs at the same rate as before, it was likely qemu. If we saw
an increase in hangs (which we definitely have, even on the preliminary,
incomplete results), it is more likely that something in the kernel RCU stall
code is taking out the system. The BUG: entries in occasional logs also hint
at the latter.

My own next step is probably to hack a "stall report" trigger into sysrq and
try it in some of my own images locally, to see if I can break this at will.

Cheers,

Richard


qemuarm:

https://autobuilder.yoctoproject.org/typhoon/#/builders/53/builds/3608

/home/pokybuild/yocto-worker/qemuarm/build/build/tmp/work/qemuarm-poky-linux-gnueabi/core-image-sato-sdk/1.0-r0/testimage/qemu_boot_log.20210625121601

** (matchbox-desktop:715): WARNING **: 12:17:35.454: Error loading icon: Icon 'terminal' not present in theme Sato
[ 2173.684752] hellomod: loading out-of-tree module taints kernel.
[ 2173.726821] Hello world!
[ 2178.269807] Cleaning up hellomod.
[ 2426.747471] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 2426.753269] rcu: 2-....: (32 ticks this GP) idle=21e/1/0x40000000 softirq=95871/95871 fqs=19
[ 2426.823824] (detected by 0, t=302 jiffies, g=214353, q=107)


qemuarm-oecore:

https://autobuilder.yoctoproject.org/typhoon/#/builders/47/builds/3601


OpenEmbedded nodistro.0 qemuarm /dev/hvc0

/home/pokybuild/yocto-worker/qemuarm-oecore/build/build/tmp-glibc/work/qemuarm-oe-linux-gnueabi/core-image-sato/1.0-r0/testimage/qemu_boot_log.20210625114847

qemuarm login: [ 53.248661] rcu: INFO: rcu_preempt self-detected stall on CPU
[ 53.250587] rcu: 0-...!: (1 ticks this GP) idle=4aa/0/0x1 softirq=1385/1385 fqs=0
[ 53.251571] (t=1019 jiffies g=1293 q=4)
[ 53.252192] rcu: rcu_preempt kthread starved for 1019 jiffies! g1293 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=1
[ 53.253026] rcu: Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.
[ 53.253697] rcu: RCU grace-period kthread stack dump:
[ 53.254348] task:rcu_preempt state:R running task stack: 0 pid: 13 ppid: 2
flags:0x00000000
[ 53.258129] [<c0a73808>] (__schedule) from [<c0a73a74>] (schedule+0xc8/0x118)
[ 53.258925] [<c0a73a74>] (schedule) from [<c0a75ef8>] (schedule_timeout+0xcc/0xfc)
[ 53.259247] [<c0a75ef8>] (schedule_timeout) from [<c028b8b0>] (rcu_gp_kthread+0x2ac/0x92c)
[ 53.259619] [<c028b8b0>] (rcu_gp_kthread) from [<c024f314>] (kthread+0x158/0x164)
[ 53.259865] [<c024f314>] (kthread) from [<c0200138>] (ret_from_fork+0x14/0x3c)
[ 53.260207] Exception stack(0xc1539fb0 to 0xc1539ff8)
[ 53.260691] 9fa0: 00000000 00000000 00000000 00000000
[ 53.260973] 9fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[ 53.261236] 9fe0: 00000000 00000000 00000000 00000000 00000013 00000000
[ 53.261718] NMI backtrace for cpu 0
[ 53.262040] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.10.43-yocto-standard #1
[ 53.262324] Hardware name: Generic DT based system
[ 53.262584] [<c020fe28>] (unwind_backtrace) from [<c020ab54>] (show_stack+0x18/0x1c)
[ 53.262847] [<c020ab54>] (show_stack) from [<c0a6e304>] (dump_stack+0x88/0xa4)
[ 53.263081] [<c0a6e304>] (dump_stack) from [<c06dc0e8>] (nmi_cpu_backtrace+0xb8/0xcc)
[ 53.263340] [<c06dc0e8>] (nmi_cpu_backtrace) from [<c06dc1b8>] (nmi_trigger_cpumask_backtrace+0xbc/0x164)
[ 53.263615] [<c06dc1b8>] (nmi_trigger_cpumask_backtrace) from [<c0a69248>] (rcu_dump_cpu_stacks+0xe8/0x114)
[ 53.264190] [<c0a69248>] (rcu_dump_cpu_stacks) from [<c028d9d8>] (rcu_sched_clock_irq+0x35c/0xa78)
[ 53.264463] [<c028d9d8>] (rcu_sched_clock_irq) from [<c0296f8c>] (update_process_times+0x5c/0x80)
[ 53.264757] [<c0296f8c>] (update_process_times) from [<c02a8220>] (tick_sched_timer+0x17c/0x1fc)
[ 53.265047] [<c02a8220>] (tick_sched_timer) from [<c0297fc8>] (__hrtimer_run_queues+0x19c/0x2c4)
[ 53.265317] [<c0297fc8>] (__hrtimer_run_queues) from [<c0298528>] (hrtimer_interrupt+0xc8/0x204)
[ 53.265559] [<c0298528>] (hrtimer_interrupt) from [<c08a112c>] (arch_timer_handler_virt+0x30/0x38)
[ 53.265802] [<c08a112c>] (arch_timer_handler_virt) from [<c027f844>] (handle_percpu_devid_irq+0x60/0x118)
[ 53.266079] [<c027f844>] (handle_percpu_devid_irq) from [<c0279bf0>] (generic_handle_irq+0x30/0x40)
[ 53.266320] [<c0279bf0>] (generic_handle_irq) from [<c027a218>] (__handle_domain_irq+0x98/0xa8)
[ 53.266554] [<c027a218>] (__handle_domain_irq) from [<c06eb518>] (gic_handle_irq+0x74/0x84)
[ 53.266778] [<c06eb518>] (gic_handle_irq) from [<c0201278>] (__irq_svc+0x58/0x8c)
[ 53.266998] Exception stack(0xc1201f20 to 0xc1201f68)
[ 53.267171] 1f20: 0000a4a8 cfd6de84 00000000 c0218860 c1200000 00000000 c1200000 c1201f80
[ 53.267390] 1f40: c1204f94 00000001 c1204fd4 00000000 c1684040 c1201f70 c0207930 c0207920
[ 53.267611] 1f60: 60010113 ffffffff
[ 53.267744] [<c0201278>] (__irq_svc) from [<c0207920>] (arch_cpu_idle+0x24/0x40)
[ 53.267946] [<c0207920>] (arch_cpu_idle) from [<c0a76898>] (default_idle_call+0x40/0x5c)
[ 53.268167] [<c0a76898>] (default_idle_call) from [<c025e110>] (do_idle+0x108/0x254)
[ 53.268384] [<c025e110>] (do_idle) from [<c025e534>] (cpu_startup_entry+0x20/0x24)
[ 53.268624] [<c025e534>] (cpu_startup_entry) from [<c1000e80>] (start_kernel+0x4a0/0x57c)
[ 73.311040] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 73.313047] (detected by 2, t=428 jiffies, g=1965, q=113)
[ 73.313886] rcu: All QSes seen, last rcu_preempt kthread activity 425 (-22941--23366),
jiffies_till_next_fqs=1, root ->qsmask 0x0
[ 73.315413] rcu: rcu_preempt kthread starved for 425 jiffies! g1965 f0x2 RCU_GP_WAIT_FQS(5) ->state=0x200 ->cpu=2
[ 73.316823] rcu: Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.
[ 73.317775] rcu: RCU grace-period kthread stack dump:
[ 73.318304] task:rcu_preempt state:R stack: 0 pid: 13 ppid: 2 flags:0x00000000
[ 73.318759] [<c0a73808>] (__schedule) from [<c0a73a74>] (schedule+0xc8/0x118)
[ 73.319123] [<c0a73a74>] (schedule) from [<c0a75ef8>] (schedule_timeout+0xcc/0xfc)
[ 73.319500] [<c0a75ef8>] (schedule_timeout) from [<c028b8b0>] (rcu_gp_kthread+0x2ac/0x92c)
[ 73.319886] [<c028b8b0>] (rcu_gp_kthread) from [<c024f314>] (kthread+0x158/0x164)
[ 73.320201] [<c024f314>] (kthread) from [<c0200138>] (ret_from_fork+0x14/0x3c)
[ 73.320776] Exception stack(0xc1539fb0 to 0xc1539ff8)
[ 73.321088] 9fa0: 00000000 00000000 00000000 00000000
[ 73.321403] 9fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[ 73.321698] 9fe0: 00000000 00000000 00000000 00000000 00000013 00000000
[ 73.322376] BUG: scheduling while atomic: swapper/2/0/0x00000002
[ 73.322697] Modules linked in:
[ 73.323092] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 5.10.43-yocto-standard #1
[ 73.323383] Hardware name: Generic DT based system
[ 73.323621] [<c020fe28>] (unwind_backtrace) from [<c020ab54>] (show_stack+0x18/0x1c)
[ 73.323908] [<c020ab54>] (show_stack) from [<c0a6e304>] (dump_stack+0x88/0xa4)
[ 73.324184] [<c0a6e304>] (dump_stack) from [<c02575e4>] (__schedule_bug+0x68/0x8c)
[ 73.324454] [<c02575e4>] (__schedule_bug) from [<c0a73578>] (__schedule+0x58/0x48c)
[ 73.324741] [<c0a73578>] (__schedule) from [<c0a73e00>] (schedule_idle+0x68/0x80)
[ 73.325046] [<c0a73e00>] (schedule_idle) from [<c025e534>] (cpu_startup_entry+0x20/0x24)
[ 73.325576] [<c025e534>] (cpu_startup_entry) from [<40201c2c>] (0x40201c2c)

qemuarm64-ltp (on arm64 host ubuntu1804-arm-1)

https://autobuilder.yoctoproject.org/typhoon/#/builders/96/builds/1888

/home/pokybuild/yocto-worker/qemuarm64-ltp/build/build/tmp/work/qemuarm64-poky-linux/core-image-sato/1.0-r0/testimage/qemu_boot_log.20210625121516

qemuarm64 login: [ 21.445581] Process accounting resumed
[ 22.705496] Process accounting resumed
[ 331.583598] capability: warning: `capget01' uses 32-bit capabilities (legacy support in use)
[ 331.585764] capability: warning: `capget01' uses deprecated v2 capabilities in a way that may be insecure
[ 345.996555] Clock: inserting leap second 23:59:60 UTC
[ 713.924549] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 714.133632] rcu: Tasks blocked on level-0 rcu_node (CPUs 0-3): P9691/1:b..l
[ 714.255745] (detected by 1, t=822 jiffies, g=195937, q=3443)
[ 714.256974] task:msgstress03 state:R running task stack: 0 pid: 9589 ppid: 9543
flags:0x00000000
[ 714.258971] Call trace:
[ 714.259520] __switch_to+0x114/0x19c
[ 714.260274] __schedule+0x274/0x610
[ 714.261008] preempt_schedule_common+0x28/0x50
[ 714.261928] preempt_schedule+0x30/0x40
[ 714.262711] _raw_spin_unlock+0x38/0x50
[ 714.263507] do_msgrcv+0x284/0x4f0
[ 714.264230] __arm64_sys_msgrcv+0x38/0x44
[ 714.265042] el0_svc_common.constprop.0+0x9c/0x1c0
[ 714.266014] do_el0_svc+0x7c/0xa0
[ 714.266718] el0_svc+0x20/0x30
[ 714.267391] el0_sync_handler+0xa4/0x130
[ 714.268193] el0_sync+0x180/0x1c0
[ 714.268900] task:msgstress03 state:R running task stack: 0 pid: 9691 ppid: 32178
flags:0x00000000
[ 714.270830] Call trace:
[ 714.271369] __switch_to+0x114/0x19c
[ 714.313241] __schedule+0x274/0x610
[ 714.333040] preempt_schedule_irq+0x68/0x100
[ 714.333944] arm64_preempt_schedule_irq+0x2c/0x40
[ 714.334910] el1_irq+0xe4/0x180
[ 714.335606] wake_up_q+0xa8/0xd0
[ 714.336308] do_msgsnd+0x3f8/0x4e0
[ 714.337029] ksys_msgsnd+0x90/0xb0
[ 714.337750] __arm64_sys_msgsnd+0x2c/0x40
[ 714.338564] el0_svc_common.constprop.0+0x9c/0x1c0
[ 714.339552] do_el0_svc+0x7c/0xa0
[ 714.340266] el0_svc+0x20/0x30
[ 714.340936] el0_sync_handler+0xa4/0x130
[ 714.341735] el0_sync+0x180/0x1c0
[ 720.932547] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 721.016226] rcu: Tasks blocked on level-0 rcu_node (CPUs 0-3):
[ 721.072646] (detected by 0, t=752 jiffies, g=195941, q=7890)
[ 721.073841] rcu: All QSes seen, last rcu_preempt kthread activity 1 (4295072518-4295072517),
jiffies_till_next_fqs=1, root ->qsmask 0x0
[ 750.072043] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 750.073887] rcu: Tasks blocked on level-0 rcu_node (CPUs 0-3): P1738/1:b..l
[ 750.116802] (detected by 0, t=773 jiffies, g=196493, q=15176)
[ 750.118095] task:msgstress04 state:R running task stack: 0 pid: 1733 ppid: 1482
flags:0x00000000
[ 750.120144] Call trace:
[ 750.120730] __switch_to+0x114/0x19c
[ 750.144388] __schedule+0x274/0x610
[ 750.145215] preempt_schedule_common+0x28/0x50
[ 750.146187] preempt_schedule+0x30/0x40
[ 750.147020] _raw_spin_unlock+0x38/0x50
[ 750.147853] do_msgsnd+0x3ec/0x4e0
[ 750.148610] ksys_msgsnd+0x90/0xb0
[ 750.149365] __arm64_sys_msgsnd+0x2c/0x40
[ 750.150219] el0_svc_common.constprop.0+0x9c/0x1c0
[ 750.151244] do_el0_svc+0x7c/0xa0
[ 750.151995] el0_svc+0x20/0x30
[ 750.152698] el0_sync_handler+0xa4/0x130
[ 750.153532] el0_sync+0x180/0x1c0
[ 750.154273] task:msgstress04 state:R running task stack: 0 pid: 1738 ppid: 1482
flags:0x00000000
[ 750.156271] Call trace:
[ 750.156842] __switch_to+0x114/0x19c
[ 750.157626] __schedule+0x274/0x610
[ 750.158395] preempt_schedule_common+0x28/0x50
[ 750.159363] preempt_schedule+0x30/0x40
[ 750.160196] _raw_spin_unlock+0x38/0x50
[ 750.161020] do_msgrcv+0x284/0x4f0
[ 750.161774] __arm64_sys_msgrcv+0x38/0x44
[ 750.162620] el0_svc_common.constprop.0+0x9c/0x1c0
[ 750.163643] do_el0_svc+0x7c/0xa0
[ 750.164388] el0_svc+0x20/0x30
[ 750.165086] el0_sync_handler+0xa4/0x130
[ 750.165918] el0_sync+0x180/0x1c0
[ 763.280549] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 763.282336] (detected by 0, t=752 jiffies, g=196741, q=7494)
[ 763.283694] rcu: All QSes seen, last rcu_preempt kthread activity 703 (4295083105-4295082402),
jiffies_till_next_fqs=1, root ->qsmask 0x0
[ 763.286260] rcu: rcu_preempt kthread starved for 703 jiffies! g196741 f0x2 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=3
[ 763.288363] rcu: Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.
[ 763.290242] rcu: RCU grace-period kthread stack dump:
[ 763.291334] task:rcu_preempt state:R running task stack: 0 pid: 13 ppid: 2
flags:0x00000028
[ 763.293387] Call trace:
[ 763.293976] __switch_to+0x114/0x19c
[ 763.294779] __schedule+0x274/0x610
[ 763.295582] schedule+0x7c/0x110
[ 763.296343] schedule_timeout+0xb0/0x1c0
[ 763.297216] rcu_gp_kthread+0x4e0/0xa90
[ 763.298073] kthread+0x168/0x16c
[ 763.298831] ret_from_fork+0x10/0x3c
[ 763.299741] BUG: scheduling while atomic: msgstress04/18860/0x00000002
[ 763.301093] Modules linked in:
[ 763.301808] CPU: 0 PID: 18860 Comm: msgstress04 Not tainted 5.10.43-yocto-standard #1
[ 763.303502] Hardware name: linux,dummy-virt (DT)
[ 763.304491] Call trace:
[ 763.305056] dump_backtrace+0x0/0x1a0
[ 763.305852] show_stack+0x24/0x30
[ 763.306595] dump_stack+0xd4/0x110
[ 763.307357] __schedule_bug+0x68/0x80
[ 763.308160] __schedule+0x454/0x610
[ 763.308925] schedule+0x7c/0x110
[ 763.309655] do_msgsnd+0x294/0x4e0
[ 763.310406] ksys_msgsnd+0x90/0xb0
[ 763.311167] __arm64_sys_msgsnd+0x2c/0x40
[ 763.312024] el0_svc_common.constprop.0+0x9c/0x1c0
[ 763.313039] do_el0_svc+0x7c/0xa0
[ 763.313780] el0_svc+0x20/0x30
[ 763.314479] el0_sync_handler+0xa4/0x130
[ 763.315326] el0_sync+0x180/0x1c0

multilib core-image-sato ubuntu1804-ty-1

https://autobuilder.yoctoproject.org/typhoon/#/builders/44/builds/3622

/home/pokybuild/yocto-worker/multilib/build/build/tmp/work/qemux86_64-poky-linux/core-image-sato/1.0-r0/testimage/qemu_boot_log.20210625121139

libGL error: MESA-LOADER: failed to open bochs-drm: /usr/lib/dri/bochs-drm_dri.so: cannot open shared object
file: No such file or directory (search paths /usr/lib/dri)
libGL error: failed to load driver: bochs-drm
libGL error: pci id for fd 4: 1234:1111, driver (null)
libGL error: MESA-LOADER: failed to open bochs-drm: /usr/lib64/dri/bochs-drm_dri.so: cannot open shared object
file: No such file or directory (search paths /usr/lib64/dri)
libGL error: failed to load driver: bochs-drm
[settings daemon] Forking. run with -n to prevent fork
dbus-daemon[652]: Activating service name='org.a11y.atspi.Registry' requested by ':1.1' (uid=0 pid=669
comm="matchbox-panel --start-applets showdesktop,windows")
dbus-daemon[652]: Successfully activated service 'org.a11y.atspi.Registry'
SpiRegistry daemon is running with well-known name - org.a11y.atspi.Registry

** (matchbox-desktop:668): WARNING **: 12:12:24.699: Error loading icon: Icon 'terminal' not present in theme Sato

** (matchbox-desktop:668): WARNING **: 12:12:24.713: Error loading icon: Icon 'net-48d24' not present in theme Sato
[ 21.640654] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 21.642263] rcu: 3-...0: (0 ticks this GP) idle=18a/1/0x4000000000000000 softirq=1618/1618 fqs=425
[ 21.644170] (detected by 1, t=3155 jiffies, g=2653, q=16)
[ 30.951654] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:


Re: Autobuilder "rcu stall" issue summary

Denys Dmytriyenko
 

On Thu, Jun 24, 2021 at 01:57:43PM -0400, Bruce Ashfield wrote:
On Thu, Jun 24, 2021 at 12:31 PM Richard Purdie
<richard.purdie@...> wrote:

We have info all over the place relating to our "rcu stall" issue on the
autobuilder. This is an attempt to document what we know and what we don't.

We see builds fail periodically where there is an "rcu stall" mentioned
by the kernel and the image "locks up". An easier to read example is:

https://autobuilder.yoctoproject.org/typhoon/#/builders/56/builds/1488/steps/14/logs/stdio

A core-image-minimal testimage run was sitting at 394% CPU, having used 97
minutes of CPU time. A minimal image doesn't execute much at all and should
be done in a minute at most.

An example we were able to "catch" and debug a bit at the expense of breaking
the output is:

https://autobuilder.yoctoproject.org/typhoon/#/builders/45/builds/3576/steps/12/logs/stdio

where a musl qemux86-64 core-image-sato-sdk stopped mid "configure" on target
during testimage.

We were able to look at the kernel log, the tail end of which showed:

[settings daemon] Forking. run with -n to prevent fork
[ 102.082744] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 102.093213] rcu: 1-...!: (1 GPs behind) idle=3f6/0/0x1 softirq=1374/1375 fqs=0
[ 102.096448] rcu: 3-...!: (12 ticks this GP) idle=e34/0/0x0 softirq=1658/1658 fqs=0
[ 102.099326] (detected by 1, t=21002 jiffies, g=1937, q=278)

The kernel was:

[ 0.000000] Linux version 5.10.43-yocto-standard (oe-user@oe-host) (x86_64-poky-linux-musl-gcc (GCC) 11.1.1 20210523, GNU ld (GNU Binutils) 2.36.1.20210209) #1 SMP PREEMPT Wed Jun 16 16:00:47 UTC 2021
[ 0.000000] Command line: root=/dev/vda rw mem=512M ip=192.168.7.4::192.168.7.3:255.255.255.0 console=ttyS0 console=ttyS1 oprofile.timer=1 tsc=reliable no_timer_check rcupdate.rcu_expedited=1 printk.time=1

Full log: https://www.rpsys.net/wp/rp/qemuboot-autobuild.log

In the above state, qemu was seen with 303% and 403% CPU usage. I do
have strace output from the hung qemu: https://www.rpsys.net/wp/rp/qemu-strace.log

It was interesting to note that pinging the image did work and you can see that in
the strace log so something was still alive. You couldn't ssh in though and existing
ssh connections failed with timeouts.

We do also have the VM core dump for this which Alexandre is looking into
(from qmp with:
{"execute": "qmp_capabilities"}
{"execute":"dump-guest-memory","arguments":{"paging":false,"protocol":"file:/tmp/vmcore.img"}}
)

I couldn't see any extra kernel log messages in the vmcore file.

We've seen this before in core-image-sato, core-image-sato-sdk and also with
qemuarm, the latter seeming to imply that KVM isn't the issue and that it
isn't arch specific, which points at qemu itself. The load itself doesn't appear
to matter; above it was a configure script, but we've seen it on images without
toolchains in a variety of states/loads. We've seen it on a variety of host workers,
ubuntu1804 being the most common, but that is also the most common worker OS.

I suspect that this is the same issue that causes images to lock up in the firmware
boot stage as in: https://bugzilla.yoctoproject.org/show_bug.cgi?id=14018
There may also be other related bugs; without looking at the kernel traces it is hard
to say for sure. https://bugzilla.yoctoproject.org/show_bug.cgi?id=14273 lists a load more.

We've not been able to replicate the issue at will. I have tried various stress-ng --all
loads locally with no success at replicating. I did compare the boot output from qemu
on my local system with the above kernel boot and it matches apart from timings and
kvm-clock msr value.

The suspicion is that some high IO load event somehow triggers qemu to fail. I
also suspect that the "rcu stall" is just a symptom of a wider issue. The stall is
suspicious as it looks incomplete.

I'll stop here but will follow up on the mail if I remember more info. Anyone else
feel free to add and if anyone has any insight into what is happening or how to better
debug, I'm very open to it!
Richard,

As mentioned earlier, we do have the kernel debug fragments available (as well
as a 'developer' kernel type). We can't globally enable the developer ktype,
since we haven't defined BSP entry points for all of our reference boards
(but I can do that as a TODO item).

But we can get the same options enabled by adding:

KERNEL_FEATURES_append = " ktypes/developer/developer.cfg features/debug/debug-kernel.scc"
Can this be keyed off of the DEBUG_BUILD variable?
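
Nothing wires that up today as far as I know, but a sketch of keying it off
DEBUG_BUILD in local.conf or a distro conf (assuming the fragments are
available for the kernel being built) could look like:

  # pull in the developer ktype and debug fragments only when DEBUG_BUILD is enabled
  KERNEL_FEATURES_append = "${@' ktypes/developer/developer.cfg features/debug/debug-kernel.scc' if d.getVar('DEBUG_BUILD') == '1' else ''}"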


They turn on the prime suspects in the kernel to get more information. If there
are options we are missing, let me know and I'll add them to those fragments
so we can rely on them in the future.
--
Regards,
Denys Dmytriyenko <denis@...>
PGP: 0x420902729A92C964 - https://denix.org/0x420902729A92C964
Fingerprint: 25FC E4A5 8A72 2F69 1186 6D76 4209 0272 9A92 C964


Re: Autobuilder "rcu stall" issue summary

Bruce Ashfield <bruce.ashfield@...>
 

On Thu, Jun 24, 2021 at 12:31 PM Richard Purdie
<richard.purdie@...> wrote:

We have info all over the place relating to our "rcu stall" issue on the
autobuilder. This is an attempt to document what we know and what we don't.

We see builds fail periodically where there is an "rcu stall" mentioned
by the kernel and the image "locks up". An easier to read example is:

https://autobuilder.yoctoproject.org/typhoon/#/builders/56/builds/1488/steps/14/logs/stdio

A core-image-minimal testimage run was sitting at 394% CPU, having used 97
minutes of CPU time. A minimal image doesn't execute much at all and should
be done in a minute at most.

An example we were able to "catch" and debug a bit at the expense of breaking
the output is:

https://autobuilder.yoctoproject.org/typhoon/#/builders/45/builds/3576/steps/12/logs/stdio

where a musl qemux86-64 core-image-sato-sdk stopped mid "configure" on target
during testimage.

We were able to look at the kernel log, the tail end of which showed:

[settings daemon] Forking. run with -n to prevent fork
[ 102.082744] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 102.093213] rcu: 1-...!: (1 GPs behind) idle=3f6/0/0x1 softirq=1374/1375 fqs=0
[ 102.096448] rcu: 3-...!: (12 ticks this GP) idle=e34/0/0x0 softirq=1658/1658 fqs=0
[ 102.099326] (detected by 1, t=21002 jiffies, g=1937, q=278)

The kernel was:

[ 0.000000] Linux version 5.10.43-yocto-standard (oe-user@oe-host) (x86_64-poky-linux-musl-gcc (GCC) 11.1.1 20210523, GNU ld (GNU Binutils) 2.36.1.20210209) #1 SMP PREEMPT Wed Jun 16 16:00:47 UTC 2021
[ 0.000000] Command line: root=/dev/vda rw mem=512M ip=192.168.7.4::192.168.7.3:255.255.255.0 console=ttyS0 console=ttyS1 oprofile.timer=1 tsc=reliable no_timer_check rcupdate.rcu_expedited=1 printk.time=1

Full log: https://www.rpsys.net/wp/rp/qemuboot-autobuild.log

In the above state, qemu was seen with 303% and 403% CPU usage. I do
have strace output from the hung qemu: https://www.rpsys.net/wp/rp/qemu-strace.log

It was interesting to note that pinging the image did work and you can see that in
the strace log so something was still alive. You couldn't ssh in though and existing
ssh connections failed with timeouts.

We do also have the VM core dump for this which Alexandre is looking into
(from qmp with:
{"execute": "qmp_capabilities"}
{"execute":"dump-guest-memory","arguments":{"paging":false,"protocol":"file:/tmp/vmcore.img"}}
)

I couldn't see any extra kernel log messages in the vmcore file.

We've seen this before in core-image-sato, core-image-sato-sdk and also with
qemuarm, the latter seeming to imply that KVM isn't the issue and that it
isn't arch specific, which points at qemu itself. The load itself doesn't appear
to matter; above it was a configure script, but we've seen it on images without
toolchains in a variety of states/loads. We've seen it on a variety of host workers,
ubuntu1804 being the most common, but that is also the most common worker OS.

I suspect that this is the same issue that causes images to lock up in the firmware
boot stage as in: https://bugzilla.yoctoproject.org/show_bug.cgi?id=14018
There may also be other related bugs; without looking at the kernel traces it is hard
to say for sure. https://bugzilla.yoctoproject.org/show_bug.cgi?id=14273 lists a load more.

We've not been able to replicate the issue at will. I have tried various stress-ng --all
loads locally with no success at replicating. I did compare the boot output from qemu
on my local system with the above kernel boot and it matches apart from timings and
kvm-clock msr value.

The suspicion is that some high IO load event somehow triggers qemu to fail. I
also suspect that the "rcu stall" is just a symptom of a wider issue. The stall is
suspicious as it looks incomplete.

I'll stop here but will follow up on the mail if I remember more info. Anyone else
feel free to add and if anyone has any insight into what is happening or how to better
debug, I'm very open to it!
Richard,

As mentioned earlier, we do have the kernel debug fragments available (as well
as a 'developer' kernel type). We can't globally enable the developer ktype,
since we haven't defined BSP entry points for all of our reference boards
(but I can do that as a TODO item).

But we can get the same options enabled by adding:

KERNEL_FEATURES_append = " ktypes/developer/developer.cfg features/debug/debug-kernel.scc"

They turn on the prime suspects in the kernel to get more information. If there
are options we are missing, let me know and I'll add them to those fragments
so we can rely on them in the future.

Bruce


Cheers,

Richard



--
- Thou shalt not follow the NULL pointer, for chaos and madness await
thee at its end
- "Use the force Harry" - Gandalf, Star Trek II


Autobuilder "rcu stall" issue summary

Richard Purdie
 

We have info all over the place relating to our "rcu stall" issue on the 
autobuilder. This is an attempt to document what we know and what we don't.

We see builds fail periodically where there is an "rcu stall" mentioned
by the kernel and the image "locks up". An easier to read example is:

https://autobuilder.yoctoproject.org/typhoon/#/builders/56/builds/1488/steps/14/logs/stdio

A core-image-minimal testimage run was sitting at 394% CPU, having used 97
minutes of CPU time. A minimal image doesn't execute much at all and should
be done in a minute at most.

An example we were able to "catch" and debug a bit at the expense of breaking
the output is:

https://autobuilder.yoctoproject.org/typhoon/#/builders/45/builds/3576/steps/12/logs/stdio

where a musl qemux86-64 core-image-sato-sdk stopped mid "configure" on target
during testimage.

We were able to look at the kernel log, the tail end of which showed:

[settings daemon] Forking. run with -n to prevent fork
[ 102.082744] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 102.093213] rcu: 1-...!: (1 GPs behind) idle=3f6/0/0x1 softirq=1374/1375 fqs=0
[ 102.096448] rcu: 3-...!: (12 ticks this GP) idle=e34/0/0x0 softirq=1658/1658 fqs=0
[ 102.099326] (detected by 1, t=21002 jiffies, g=1937, q=278)

The kernel was:

[ 0.000000] Linux version 5.10.43-yocto-standard (oe-user@oe-host) (x86_64-poky-linux-musl-gcc (GCC) 11.1.1 20210523, GNU ld (GNU Binutils) 2.36.1.20210209) #1 SMP PREEMPT Wed Jun 16 16:00:47 UTC 2021
[ 0.000000] Command line: root=/dev/vda rw mem=512M ip=192.168.7.4::192.168.7.3:255.255.255.0 console=ttyS0 console=ttyS1 oprofile.timer=1 tsc=reliable no_timer_check rcupdate.rcu_expedited=1 printk.time=1

Full log: https://www.rpsys.net/wp/rp/qemuboot-autobuild.log

In the above state, qemu was seen with 303% and 403% CPU usage. I do
have strace output from the hung qemu: https://www.rpsys.net/wp/rp/qemu-strace.log

It was interesting to note that pinging the image did work and you can see that in
the strace log so something was still alive. You couldn't ssh in though and existing
ssh connections failed with timeouts.

We do also have the VM core dump for this which Alexandre is looking into 
(from qmp with:
{"execute": "qmp_capabilities"}
{"execute":"dump-guest-memory","arguments":{"paging":false,"protocol":"file:/tmp/vmcore.img"}}
)
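
For local digging, a sketch of driving the same QMP commands against a
hand-started qemu (the -qmp argument and socket path are illustrative; how
runqemu and the autobuilder expose QMP may differ):

  # start qemu with a QMP unix socket, e.g. add to the qemu command line:
  #   -qmp unix:/tmp/qmp.sock,server,nowait

  # then replay the commands above through socat
  socat - unix-connect:/tmp/qmp.sock <<'EOF'
  {"execute": "qmp_capabilities"}
  {"execute":"dump-guest-memory","arguments":{"paging":false,"protocol":"file:/tmp/vmcore.img"}}
  EOF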

I couldn't see any extra kernel log messages in the vmcore file.
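
One way to dig further into such a dump, as a sketch (this assumes the crash
utility is installed and a vmlinux matching the image kernel is to hand; the
kernel would likely need rebuilding with CONFIG_DEBUG_INFO for this to be
useful):

  # dump-guest-memory with paging=false writes an ELF core that crash can load
  crash path/to/vmlinux /tmp/vmcore.img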

We've seen this before in core-image-sato, core-image-sato-sdk and also with
qemuarm, the latter seeming to imply that KVM isn't the issue and that it
isn't arch specific, which points at qemu itself. The load itself doesn't appear
to matter; above it was a configure script, but we've seen it on images without
toolchains in a variety of states/loads. We've seen it on a variety of host workers,
ubuntu1804 being the most common, but that is also the most common worker OS.

I suspect that this is the same issue that causes images to lock up in the firmware
boot stage as in: https://bugzilla.yoctoproject.org/show_bug.cgi?id=14018
There may also be other related bugs; without looking at the kernel traces it is hard
to say for sure. https://bugzilla.yoctoproject.org/show_bug.cgi?id=14273 lists a load more.

We've not been able to replicate the issue at will. I have tried various stress-ng --all
loads locally with no success at replicating. I did compare the boot output from qemu
on my local system with the above kernel boot and it matches apart from timings and 
kvm-clock msr value.

The suspicion is that some high IO load event somehow triggers qemu to fail. I
also suspect that the "rcu stall" is just a symptom of a wider issue. The stall is
suspicious as it looks incomplete.

I'll stop here but will follow up on the mail if I remember more info. Anyone else
feel free to add and if anyone has any insight into what is happening or how to better
debug, I'm very open to it!

Cheers,

Richard


Re: ltp failures on autobuilder

Randy MacLeod
 

On 2021-06-16 10:17 a.m., Richard Purdie wrote:
On Wed, 2021-06-16 at 08:56 -0400, Paul Gortmaker wrote:
[Re: [swat] ltp failures on autobuilder] On 11/06/2021 (Fri 14:19) Richard Purdie wrote:

On Fri, 2021-06-11 at 12:36 +0100, Richard Purdie via lists.yoctoproject.org wrote:
as a .cfg to the kernel and that still reproduced the crash. However:

CONFIG_DEBUG_KERNEL=y
CONFIG_CGROUP_DEBUG=y
CONFIG_SCHED_DEBUG=y
CONFIG_DEBUG_PREEMPT=y
# CONFIG_RCU_TRACE is not set
# CONFIG_X86_DEBUG_FPU is not set
# CONFIG_CONSOLE_POLL is not set
# CONFIG_DEBUG_INFO is not set
# CONFIG_KGDB is not set
# CONFIG_KGDB_HONOUR_BLOCKLIST is not set
# CONFIG_KGDB_SERIAL_CONSOLE is not set
# CONFIG_KGDB_LOW_LEVEL_TRAP is not set
# CONFIG_KGDB_KDB is not set
# CONFIG_KDB_KEYBOARD is not set
# CONFIG_DEBUG_MISC is not set
Isolated down to CONFIG_SCHED_DEBUG=y being the line which somehow "fixes"
the crash. I can enable all the above apart from that and we can reproduce
it.

Also, I changed gatesgarth to use qemu 5.2.0 copied in from hardknott and that
breaks it. Dropping the 27 CVE patches "fixes" it again. It is possible it
is one of the CVE fixes. Continuing to try and isolate.
For the mail archive trail, and for those not following the ongoing
research on IRC, we are hopeful that this fixes it.

https://lore.kernel.org/lkml/20210616125157.438837-1-paul.gortmaker@windriver.com/
Awesome work in tracking that down, much appreciated, thanks!

Curious what upstream will make of it now...

Dropped the wider list but as I said to Richard, I wanted to
confirm that this bug is really gone.

I ran 100 tests overnight and it seems to be dead Jim.


$ bitbake core-image-sato && \
for i in `seq 100`; do \
  echo "--- " $i " ---"; \
  timeout --kill-after=2m 10m bitbake core-image-sato -c testimage \
   && echo GOOD || echo BAD; \
done

$ ls -l tmp/work/qemux86_64-poky-linux/core-image-sato/1.0-r0/testimage/qemu_boot_log.202* | wc -l
100

$ grep -m1 BUG: tmp/work/qemux86_64-poky-linux/core-image-sato/1.0-r0/testimage/qemu_boot_log.2021* | wc -l
0

All the log files are there and roughly the same size, with similar but not
identical contents due to differing timestamps and non-deterministic ordering
of some of the output.


../Randy



Cheers,

Richard
--
# Randy MacLeod
# Wind River Linux


Re: SWAT Rotation

Ross Burton <ross@...>
 

Yes, that's fine.

Ross

On Thu, 17 Jun 2021 at 16:25, Alexandre Belloni
<alexandre.belloni@...> wrote:

Hello Ross,

SWAT duty should rotate to you next week. Will you have some time to
work on that?

Thanks!

--
Alexandre Belloni, Bootlin
Embedded Linux and Kernel engineering
https://bootlin.com


SWAT Rotation

Alexandre Belloni
 

Hello Ross,

SWAT duty should rotate to you next week. Will you have some time to
work on that?

Thanks!

--
Alexandre Belloni, Bootlin
Embedded Linux and Kernel engineering
https://bootlin.com


Swatbot failed, filed bug for oe-selftest issue

Richard Purdie
 

As a note for swat, I filed 14438 for:
https://autobuilder.yoctoproject.org/typhoon/#/builders/56/builds/1484
since I could see why it failed.

Not sure why swatbot has failed but the web interface isn't loading 
and we'll have to sort that :/

Cheers,

Richard


Re: ltp failures on autobuilder

Richard Purdie
 

On Wed, 2021-06-16 at 08:56 -0400, Paul Gortmaker wrote:
[Re: [swat] ltp failures on autobuilder] On 11/06/2021 (Fri 14:19) Richard Purdie wrote:

On Fri, 2021-06-11 at 12:36 +0100, Richard Purdie via lists.yoctoproject.org wrote:
as a .cfg to the kernel and that still reproduced the crash. However:

CONFIG_DEBUG_KERNEL=y
CONFIG_CGROUP_DEBUG=y
CONFIG_SCHED_DEBUG=y
CONFIG_DEBUG_PREEMPT=y
# CONFIG_RCU_TRACE is not set
# CONFIG_X86_DEBUG_FPU is not set
# CONFIG_CONSOLE_POLL is not set
# CONFIG_DEBUG_INFO is not set
# CONFIG_KGDB is not set
# CONFIG_KGDB_HONOUR_BLOCKLIST is not set
# CONFIG_KGDB_SERIAL_CONSOLE is not set
# CONFIG_KGDB_LOW_LEVEL_TRAP is not set
# CONFIG_KGDB_KDB is not set
# CONFIG_KDB_KEYBOARD is not set
# CONFIG_DEBUG_MISC is not set
Isolated down to CONFIG_SCHED_DEBUG=y being the line which somehow "fixes"
the crash. I can enable all the above apart from that and we can reproduce
it.

Also, I changed gatesgarth to use qemu 5.2.0 copied in from hardknott and that
breaks it. Dropping the 27 CVE patches "fixes" it again. It is possible it
is one of the CVE fixes. Continuing to try and isolate.
For the mail archive trail, and for those not following the ongoing
research on IRC, we are hopeful that this fixes it.

https://lore.kernel.org/lkml/20210616125157.438837-1-paul.gortmaker@windriver.com/
Awesome work in tracking that down, much appreciated, thanks!

Curious what upstream will make of it now...

Cheers,

Richard


Re: ltp failures on autobuilder

Paul Gortmaker <paul.gortmaker@...>
 

[Re: [swat] ltp failures on autobuilder] On 11/06/2021 (Fri 14:19) Richard Purdie wrote:

On Fri, 2021-06-11 at 12:36 +0100, Richard Purdie via lists.yoctoproject.org wrote:
as a .cfg to the kernel and that still reproduced the crash. However:

CONFIG_DEBUG_KERNEL=y
CONFIG_CGROUP_DEBUG=y
CONFIG_SCHED_DEBUG=y
CONFIG_DEBUG_PREEMPT=y
# CONFIG_RCU_TRACE is not set
# CONFIG_X86_DEBUG_FPU is not set
# CONFIG_CONSOLE_POLL is not set
# CONFIG_DEBUG_INFO is not set
# CONFIG_KGDB is not set
# CONFIG_KGDB_HONOUR_BLOCKLIST is not set
# CONFIG_KGDB_SERIAL_CONSOLE is not set
# CONFIG_KGDB_LOW_LEVEL_TRAP is not set
# CONFIG_KGDB_KDB is not set
# CONFIG_KDB_KEYBOARD is not set
# CONFIG_DEBUG_MISC is not set
Isolated down to CONFIG_SCHED_DEBUG=y being the line which somehow "fixes"
the crash. I can enable all the above apart from that and we can reproduce
it.

Also, I changed gatesgarth to use qemu 5.2.0 copied in from hardknott and that
breaks it. Dropping the 27 CVE patches "fixes" it again. It is possible it
is one of the CVE fixes. Continuing to try and isolate.
For the mail archive trail, and for those not following the ongoing
research on IRC, we are hopeful that this fixes it.

https://lore.kernel.org/lkml/20210616125157.438837-1-paul.gortmaker@windriver.com/

Paul.
--


Cheers,

Richard


Re: ltp failures on autobuilder

Richard Purdie
 

On Fri, 2021-06-11 at 12:36 +0100, Richard Purdie via lists.yoctoproject.org wrote:
as a .cfg to the kernel and that still reproduced the crash. However:

CONFIG_DEBUG_KERNEL=y
CONFIG_CGROUP_DEBUG=y
CONFIG_SCHED_DEBUG=y
CONFIG_DEBUG_PREEMPT=y
# CONFIG_RCU_TRACE is not set
# CONFIG_X86_DEBUG_FPU is not set
# CONFIG_CONSOLE_POLL is not set
# CONFIG_DEBUG_INFO is not set
# CONFIG_KGDB is not set
# CONFIG_KGDB_HONOUR_BLOCKLIST is not set
# CONFIG_KGDB_SERIAL_CONSOLE is not set
# CONFIG_KGDB_LOW_LEVEL_TRAP is not set
# CONFIG_KGDB_KDB is not set
# CONFIG_KDB_KEYBOARD is not set
# CONFIG_DEBUG_MISC is not set
Isolated down to CONFIG_SCHED_DEBUG=y being the line which somehow "fixes"
the crash. I can enable all the above apart from that and we can reproduce
it.

Also, I changed gatesgarth to use qemu 5.2.0 copied in from hardknott and that
breaks it. Dropping the 27 CVE patches "fixes" it again. It is possible it
is one of the CVE fixes. Continuing to try and isolate.

Cheers,

Richard


Re: ltp failures on autobuilder

Richard Purdie
 

On Thu, 2021-06-10 at 18:02 +0100, Richard Purdie via lists.yoctoproject.org wrote:
Noting down what we know about the ltp issue:

We've seen intermittent issues on the autobuilder where some ltp tests fail or 
hang. I've been trying to figure out how to reproduce the issue and narrow down
the cause.

I was able to isolate a patch which reproduces the issue for me:

http://git.yoctoproject.org/cgit.cgi/poky-contrib/commit/?h=rpurdie/t222&id=d7d65aae104caa03afc28837b0abe0b486d5a8b8

with master-next, setting:

IMAGE_INSTALL_append = ' ltp' 
TEST_SUITES = 'ping ssh ltp' 
also:

IMAGE_CLASSES += "testimage"
QEMU_USE_KVM_qemux86-64 = "True"


then 

bitbake core-image-sato; bitbake core-image-sato -c testimage

where the issue shows up as a kernel "BUG:" in the logs in WORKDIR/testimage/qemu_*

The above patch runs the minimum of ltp tests I could find which replicate the issue.

I've reproduced this on 5.10.1 -> 5.10.42, 5.4.123 and 5.13-rc5.
(and we've ruled out linux-yocto with plain kernels)
Also reproduced on both qemu 6.0.0 and 5.2.0.

My build machine is an Ubuntu 20.04.2 LTS with:
Linux version 5.4.0-74-generic (buildd@lgw01-amd64-038) (gcc version 9.3.0 (Ubuntu 9.3.0-17ubuntu1~20.04)) #83-Ubuntu SMP Sat May 8 02:35:39 UTC 2021
Good news (for me) is that Randy and Paul can now reproduce this with the above 
additional key pieces of config.

We have confirmed that the issue is present:

* with gcc 11.1.1 and 10.3
* in hardknott
* if QB_SMP is disabled (i.e. in a single processor qemu)
* on 18.04, 20.04 and 21.04 Ubuntu host distros which have varying 5.4 and 5.11 
host kernels

I was not able to make the bug appear in gatesgarth as yet
(gcc 10.2, 5.8 kernel, qemu 5.1.0) (I had to hack -b /dev/null onto the ltp command line)

I did backport the qemu platform, smp and qemu commandline changes back to
gatesgarth and it still doesn't crash.

I also found that setting CONFIG_DEBUG_KERNEL makes the issue 'go away'. 
Since that is a large hammer, I tried:

CONFIG_DEBUG_KERNEL=y
# CONFIG_CGROUP_DEBUG is not set
# CONFIG_SCHED_DEBUG is not set
# CONFIG_DEBUG_PREEMPT is not set
# CONFIG_RCU_TRACE is not set
# CONFIG_X86_DEBUG_FPU is not set
# CONFIG_CONSOLE_POLL is not set
# CONFIG_DEBUG_INFO is not set
# CONFIG_KGDB is not set
# CONFIG_KGDB_HONOUR_BLOCKLIST is not set
# CONFIG_KGDB_SERIAL_CONSOLE is not set
# CONFIG_KGDB_LOW_LEVEL_TRAP is not set
# CONFIG_KGDB_KDB is not set
# CONFIG_KDB_KEYBOARD is not set
# CONFIG_DEBUG_MISC is not set

as a .cfg to the kernel and that still reproduced the crash. However:

CONFIG_DEBUG_KERNEL=y
CONFIG_CGROUP_DEBUG=y
CONFIG_SCHED_DEBUG=y
CONFIG_DEBUG_PREEMPT=y
# CONFIG_RCU_TRACE is not set
# CONFIG_X86_DEBUG_FPU is not set
# CONFIG_CONSOLE_POLL is not set
# CONFIG_DEBUG_INFO is not set
# CONFIG_KGDB is not set
# CONFIG_KGDB_HONOUR_BLOCKLIST is not set
# CONFIG_KGDB_SERIAL_CONSOLE is not set
# CONFIG_KGDB_LOW_LEVEL_TRAP is not set
# CONFIG_KGDB_KDB is not set
# CONFIG_KDB_KEYBOARD is not set
# CONFIG_DEBUG_MISC is not set

doesn't seem to want to reproduce the crash so something about
those three options seems to make things 'work'.

What does that all mean? No idea.

Cheers,

Richard


Re: ltp failures on autobuilder

Richard Purdie
 

On Thu, 2021-06-10 at 14:52 -0400, Randy MacLeod wrote:
On 2021-06-10 1:02 p.m., Richard Purdie wrote:
Noting down what we know about the ltp issue:

We've seen intermittent issues on the autobuilder where some ltp tests fail or
hang. I've been trying to figure out how to reproduce the issue and narrow down
the cause.

I was able to isolate a patch which reproduces the issue for me:

http://git.yoctoproject.org/cgit.cgi/poky-contrib/commit/?h=rpurdie/t222&id=d7d65aae104caa03afc28837b0abe0b486d5a8b8

with master-next, setting:

IMAGE_INSTALL_append = ' ltp'
TEST_SUITES = 'ping ssh ltp'
It became clear there is also an:

IMAGE_CLASSES += "testimage"

but the INHERIT += form people are using shouldn't make any real difference to
the test.


then

bitbake core-image-sato; bitbake core-image-sato -c testimage

where the issue shows up as a kernel "BUG:" in the logs in WORKDIR/testimage/qemu_*

The above patch runs the minimum of ltp tests I could find which replicate the issue.

I've reproduced this on 5.10.1 -> 5.10.42, 5.4.123 and 5.13-rc5.
(and we've ruled out linux-yocto with plain kernels)
Also reproduced on both qemu 6.0.0 and 5.2.0.

My build machine is an Ubuntu 20.04.2 LTS with:
Linux version 5.4.0-74-generic (buildd@lgw01-amd64-038) (gcc version 9.3.0 (Ubuntu 9.3.0-17ubuntu1~20.04)) #83-Ubuntu SMP Sat May 8 02:35:39 UTC 2021
I tried to reproduce this on a Ubuntu-18.04.3 system with:
   Linux ala-lpggp3 5.4.0-72-generic #80~18.04.1-Ubuntu SMP
     Mon Apr 12 23:26:25 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

Using poky-contrib:

$ git status
On branch rpurdie/t222
Your branch is up to date with 'origin/rpurdie/t222'.

nothing to commit, working tree clean

$ git log --oneline -3
d7d65aae10 (HEAD -> rpurdie/t222, origin/rpurdie/t222)
            ltp: Simplify for kernel crash reproducer
e175e2855d linx-yocto/5.10: re-import aufs to v5.10
753ae7dcd5 linux-yocto: test-only. override LINUX_VERSION for qemux86-64

---

My local.conf was generated, then I added:

IMAGE_INSTALL_append = ' ltp'
TEST_SUITES = 'ping ssh ltp'
INHERIT += "testimage"


and with 11 runs of:

$ bitbake core-image-sato -c testimage

I did not see the error in any of the qemu logs.

(cd tmp/work/qemux86_64-poky-linux/core-image-sato/1.0-r0/testimage/; ls
qemu_boot_log.202106101*)
qemu_boot_log.20210610155947 qemu_boot_log.20210610170535
qemu_boot_log.20210610172026 qemu_boot_log.20210610173508
qemu_boot_log.20210610174959 qemu_boot_log.20210610180444
qemu_boot_log.20210610165758 qemu_boot_log.20210610171302
qemu_boot_log.20210610172743 qemu_boot_log.20210610174235
qemu_boot_log.20210610175720

$ rgrep BUG: tmp/work/qemux86_64-poky-linux/core-image-sato/1.0-r0/testimage/*


There is an OOM run as RP sees as well and the typical lead-up to that is:

[ 225.248350] hrtimer: interrupt took 4935186 ns
I've never seen that hrtimer message...

[ 249.250283] option changes via remount are deprecated (pid=3001 comm=mount)
[ 250.200586] option changes via remount are deprecated (pid=3019 comm=mount)
[ 283.695208] memcg_test_1 invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
[ 283.702108] CPU: 1 PID: 3798 Comm: memcg_test_1 Not tainted 5.10.42-yocto-standard #1
but I do see the above.

Note that I am running this on a server without x11 forwarding.
Is your testing done on a local machine Richard? I doubt it matters
but I just want to be sure we understand how you are testing.
Mine is a local machine and I am using X11 forwarding over ssh to my laptop.

I am going to try on another server running ubu-21.04.
You mean 20.04?

Paul can't reproduce either which makes me wonder what detail we're missing...

In the interests of full disclosure, my local.conf also has:

BB_DISKMON_DIRS (set to default I think)
BUILD_REPRODUCIBLE_BINARIES_pn-nativesdk-python3 = '0'
CONF_VERSION = "1"
DISTRO ?= "poky"
EXTRA_IMAGE_FEATURES ?= "debug-tweaks"
IMAGE_INSTALL_append = " ssh-pregen-hostkeys"
PACKAGE_CLASSES = "package_rpm package_ipk package_deb"
PACKAGECONFIG_append_pn-nativesdk-qemu = " sdl"
PACKAGECONFIG_append_pn-qemu-system-native = " gtk+"
PACKAGECONFIG_append_pn-qemu-system-native = " sdl"
PATCHRESOLVE = "noop"
QEMU_USE_KVM_qemux86-64 = "True"
QEMU_USE_KVM_qemux86 = "True"
SANITY_TESTED_DISTROS = ""
USER_CLASSES ?= "buildstats image-prelink"

and I have SSTATE_DIR and DL_DIR set.

Cheers,

Richard


Re: ltp failures on autobuilder

Randy MacLeod
 

On 2021-06-10 1:02 p.m., Richard Purdie wrote:
Noting down what we know about the ltp issue:
We've seen intermittent issues on the autobuilder where some ltp tests fail or
hang. I've been trying to figure out how to reproduce the issue and narrow down
the cause.
I was able to isolate a patch which reproduces the issue for me:
http://git.yoctoproject.org/cgit.cgi/poky-contrib/commit/?h=rpurdie/t222&id=d7d65aae104caa03afc28837b0abe0b486d5a8b8
with master-next, setting:
IMAGE_INSTALL_append = ' ltp'
TEST_SUITES = 'ping ssh ltp'
then
bitbake core-image-sato; bitbake core-image-sato -c testimage
where the issue shows up as a kernel "BUG:" in the logs in WORKDIR/testimage/qemu_*
The above patch runs the minimum of ltp tests I could find which replicate the issue.
I've reproduced this on 5.10.1 -> 5.10.42, 5.4.123 and 5.13-rc5.
(and we've ruled out linux-yocto with plain kernels)
Also reproduced on both qemu 6.0.0 and 5.2.0.
My build machine is an Ubuntu 20.04.2 LTS with:
Linux version 5.4.0-74-generic (buildd@lgw01-amd64-038) (gcc version 9.3.0 (Ubuntu 9.3.0-17ubuntu1~20.04)) #83-Ubuntu SMP Sat May 8 02:35:39 UTC 2021
I tried to reproduce this on a Ubuntu-18.04.3 system with:
Linux ala-lpggp3 5.4.0-72-generic #80~18.04.1-Ubuntu SMP
Mon Apr 12 23:26:25 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

Using poky-contrib:

$ git status
On branch rpurdie/t222
Your branch is up to date with 'origin/rpurdie/t222'.

nothing to commit, working tree clean

$ git log --oneline -3
d7d65aae10 (HEAD -> rpurdie/t222, origin/rpurdie/t222)
ltp: Simplify for kernel crash reproducer
e175e2855d linx-yocto/5.10: re-import aufs to v5.10
753ae7dcd5 linux-yocto: test-only. override LINUX_VERSION for qemux86-64

---

My local.conf was generated, then I added:

IMAGE_INSTALL_append = ' ltp'
TEST_SUITES = 'ping ssh ltp'
INHERIT += "testimage"


and with 11 runs of:

$ bitbake core-image-sato -c testimage

I did not see the error in any of the qemu logs.

(cd tmp/work/qemux86_64-poky-linux/core-image-sato/1.0-r0/testimage/; ls qemu_boot_log.202106101*)
qemu_boot_log.20210610155947 qemu_boot_log.20210610170535 qemu_boot_log.20210610172026 qemu_boot_log.20210610173508 qemu_boot_log.20210610174959 qemu_boot_log.20210610180444
qemu_boot_log.20210610165758 qemu_boot_log.20210610171302 qemu_boot_log.20210610172743 qemu_boot_log.20210610174235 qemu_boot_log.20210610175720

$ rgrep BUG: tmp/work/qemux86_64-poky-linux/core-image-sato/1.0-r0/testimage/*


There is an OOM run as RP sees as well and the typical lead-up to that is:

[ 225.248350] hrtimer: interrupt took 4935186 ns
[ 249.250283] option changes via remount are deprecated (pid=3001 comm=mount)
[ 250.200586] option changes via remount are deprecated (pid=3019 comm=mount)
[ 283.695208] memcg_test_1 invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
[ 283.702108] CPU: 1 PID: 3798 Comm: memcg_test_1 Not tainted 5.10.42-yocto-standard #1


Note that I am running this on a server without x11 forwarding.
Is your testing done on a local machine Richard? I doubt it matters
but I just want to be sure we understand how you are testing.


I am going to try on another server running ubu-21.04.

../Randy


Cheers,
Richard

--
# Randy MacLeod
# Wind River Linux


ltp failures on autobuilder

Richard Purdie
 

Noting down what we know about the ltp issue:

We've seen intermittent issues on the autobuilder where some ltp tests fail or 
hang. I've been trying to figure out how to reproduce the issue and narrow down
the cause.

I was able to isolate a patch which reproduces the issue for me:

http://git.yoctoproject.org/cgit.cgi/poky-contrib/commit/?h=rpurdie/t222&id=d7d65aae104caa03afc28837b0abe0b486d5a8b8

with master-next, setting:

IMAGE_INSTALL_append = ' ltp' 
TEST_SUITES = 'ping ssh ltp' 

then 

bitbake core-image-sato; bitbake core-image-sato -c testimage

where the issue shows up as a kernel "BUG:" in the logs in WORKDIR/testimage/qemu_*

The above patch runs the minimum of ltp tests I could find which replicate the issue.

I've reproduced this on 5.10.1 -> 5.10.42, 5.4.123 and 5.13-rc5.
(and we've ruled out linux-yocto with plain kernels)
Also reproduced on both qemu 6.0.0 and 5.2.0.

My build machine is an Ubuntu 20.04.2 LTS with:
Linux version 5.4.0-74-generic (buildd@lgw01-amd64-038) (gcc version 9.3.0 (Ubuntu 9.3.0-17ubuntu1~20.04)) #83-Ubuntu SMP Sat May 8 02:35:39 UTC 2021

Cheers,

Richard


Re: SWAT Rotation

Larson, Chris
 

I would be willing, but would appreciate any links to the process, as it'll be my first time doing it, and it's not entirely clear to me at this time. Thanks.

-Chris

-----Original Message-----
From: Alexandre Belloni <alexandre.belloni@...>
Sent: Friday, June 4, 2021 2:44 PM
To: Larson, Chris <Chris_Larson@...>
Cc: swat@...; Christopher Larson <kergoth@...>; Stephen Jolley <sjolley.yp.pm@...>
Subject: Re: [swat] SWAT Rotation

Hello Chris,

Would you be available next week for SWAT duty? Unfortunately, issues did pile up this week... :/

Regards,
Alex

On 03/05/2021 19:34:55+0000, Larson, Chris wrote:
Apologies, I'm swamped this week and am dealing with an injured kid, I'll need to shift my position to a future week again.

-Chris

-----Original Message-----
From: swat@... <swat@...> On
Behalf Of Alexandre Belloni
Sent: Thursday, April 29, 2021 3:09 PM
To: Christopher Larson <kergoth@...>
Cc: swat@...; Stephen Jolley
<sjolley.yp.pm@...>
Subject: [swat] SWAT Rotation

Hello Christopher,

SWAT duty will rotate from me to you at the end of this Friday.

Please reply to let me know whether you will be able to work on this task next week.

I'll be available to walk you through the swatbot process on Monday, don't hesitate to contact me by email or on IRC.
I'll create your swatbot account soon.

--
Alexandre Belloni, co-owner and COO, Bootlin Embedded Linux and Kernel
engineering https://bootlin.com









--
Alexandre Belloni, co-owner and COO, Bootlin Embedded Linux and Kernel engineering https://bootlin.com


Re: SWAT Rotation

Alexandre Belloni
 

Hello Chris,

Would you be available next week for SWAT duty? Unfortunately, issues
did pile up this week... :/

Regards,
Alex

On 03/05/2021 19:34:55+0000, Larson, Chris wrote:
Apologies, I'm swamped this week and am dealing with an injured kid, I'll need to shift my position to a future week again.

-Chris

-----Original Message-----
From: swat@... <swat@...> On Behalf Of Alexandre Belloni
Sent: Thursday, April 29, 2021 3:09 PM
To: Christopher Larson <kergoth@...>
Cc: swat@...; Stephen Jolley <sjolley.yp.pm@...>
Subject: [swat] SWAT Rotation

Hello Christopher,

SWAT duty will rotate from me to you at the end of this Friday.

Please reply to let me know whether you will be able to work on this task next week.

I'll be available to walk you through the swatbot process on Monday, don't hesitate to contact me by email or on IRC.
I'll create your swatbot account soon.

--
Alexandre Belloni, co-owner and COO, Bootlin Embedded Linux and Kernel engineering https://bootlin.com









--
Alexandre Belloni, co-owner and COO, Bootlin
Embedded Linux and Kernel engineering
https://bootlin.com