SWAT Rotation schedule
Alexandre Belloni
Hello,
Following the summer break, I would like to organize the SWAT duty rotation. This time, I've prepared a semi-randomized schedule so it is easier for each of you to plan for SWAT duty. Please check the table and let me know if you are not available for the selected week. SWAT duty will run from Friday to Thursday and the goal is to triage all the failures on swatbot before the weekly triage call at 2:30pm UTC.

┌───────────────────────────────┬──────┬────────────┐
│ Name                          │ Week │ Start      │
├───────────────────────────────┼──────┼────────────┤
│ Alejandro Hernandez Samaniego │ 37   │ 17/09/2021 │
│ Oleksiy Obitotskyy            │ 38   │ 24/09/2021 │
│ Naveen Saini                  │ 39   │ 01/10/2021 │
│ Thomas Perrot                 │ 40   │ 08/10/2021 │
│ Paul Eggleton                 │ 41   │ 15/10/2021 │
│ Christopher Larson            │ 42   │ 22/10/2021 │
│ Jon Mason                     │ 43   │ 29/10/2021 │
│ Lee Chee Yang                 │ 44   │ 05/11/2021 │
│ Minjae Kim                    │ 45   │ 12/11/2021 │
│ Jaga                          │ 46   │ 19/11/2021 │
│ Leo Sandoval                  │ 47   │ 26/11/2021 │
│ Ross Burton                   │ 48   │ 03/12/2021 │
│ Köry Maincent                 │ 49   │ 10/12/2021 │
│ Anibal Limon                  │ 50   │ 17/12/2021 │
│ Saul Wold                     │ 51   │ 24/12/2021 │
└───────────────────────────────┴──────┴────────────┘

Alejandro, you would be the next one on the list, starting this Friday. Can you confirm you are available?

There are currently 8 failures to triage on swatbot; I'm going to take care of those.

Thanks!
-- Alexandre Belloni, Bootlin Embedded Linux and Kernel engineering https://bootlin.com
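For illustration, here is a minimal Python sketch of how a semi-randomized schedule like the one above could be generated. The member list is abbreviated and the seed and helper name are made up for the example; this is not swatbot code:

import random
from datetime import date, timedelta

# Abbreviated; in practice this would be the full member list above.
members = ["Alejandro Hernandez Samaniego", "Oleksiy Obitotskyy", "Naveen Saini"]

def make_schedule(members, first_friday, seed=2021):
    """Shuffle once with a fixed seed, then hand out consecutive
    Friday-to-Thursday weeks, matching the table layout above."""
    rng = random.Random(seed)   # fixed seed keeps the draw reproducible
    order = members[:]
    rng.shuffle(order)
    schedule = []
    for i, name in enumerate(order):
        start = first_friday + timedelta(weeks=i)
        week = start.isocalendar()[1]   # ISO week number, as in the table
        schedule.append((name, week, start))
    return schedule

for name, week, start in make_schedule(members, date(2021, 9, 17)):
    print(f"{name:32s} week {week:2d} starts {start:%d/%m/%Y}")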
|
|
Re: SWAT Rotation
Saul Wold
On 8/30/21 1:26 AM, Alexandre Belloni wrote:
Hello Saul,
I'll take care of 211545 and earlier when I'll be back so you don't...

ACK, I will look at the newer ones later this morning. Thanks!
-- Sau!
|
|
SWAT Rotation
Alexandre Belloni
Hello Saul,
As we discussed last Thursday, I am counting on you to take care of SWAT this week. I'll take care of 211545 and earlier when I'm back, so you don't necessarily have to spend time on those. Thanks! -- Alexandre Belloni, Bootlin Embedded Linux and Kernel engineering https://bootlin.com
|
|
[swatbot] [PATCH] models/serializers: triage exception automatically
Alexandre Belloni
Most of the cancelled builds are actually exceptions; triage them automatically.

Signed-off-by: Alexandre Belloni <alexandre.belloni@...>
---
 swatapp/serializers.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/swatapp/serializers.py b/swatapp/serializers.py
index cba4fa482079..e5ca25e4442f 100644
--- a/swatapp/serializers.py
+++ b/swatapp/serializers.py
@@ -28,7 +28,7 @@ class StepFailureSerializer(serializers.ModelSerializer):
             attrs['triage'] = TriageStatus.NOTFORSWAT
 
         # Cancelled builds with only stdio (no errors/warnings) aren't interesting
-        if attrs['status'] == Status.CANCELLED:
+        if attrs['status'] == Status.CANCELLED or attrs['status'] == Status.EXCEPTION:
             if "/errors" not in attrs['urls'] and "/warnings" not in attrs['urls']:
                 attrs['triage'] = TriageStatus.CANCELLEDNOERR
         return attrs
-- 
2.31.1
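As a standalone illustration of the rule this patch implements: the Status and TriageStatus enums below are simplified stand-ins for swatbot's real models, not its actual code.

from enum import Enum

class Status(Enum):          # stand-in for swatbot's build status model
    CANCELLED = 1
    EXCEPTION = 2
    FAILURE = 3

class TriageStatus(Enum):    # stand-in for swatbot's triage states
    PENDING = 0
    CANCELLEDNOERR = 1

def auto_triage(status, urls):
    """Cancelled or exception builds whose only log URL is stdio
    (no /errors or /warnings URL) are auto-triaged away from SWAT."""
    if status in (Status.CANCELLED, Status.EXCEPTION):
        if "/errors" not in urls and "/warnings" not in urls:
            return TriageStatus.CANCELLEDNOERR
    return TriageStatus.PENDING

assert auto_triage(Status.EXCEPTION, ".../steps/8/logs/stdio") == TriageStatus.CANCELLEDNOERR
assert auto_triage(Status.CANCELLED, ".../steps/8/logs/errors") == TriageStatus.PENDING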
|
|
Re: SWAT Rotation
Anibal Limon
Hi Alexandre, Unfortunately I will be out for the next two weeks on summer vacation. Sorry for the inconvenience. Best regards, Anibal
On Fri, 13 Aug 2021 at 09:14, Alexandre Belloni <alexandre.belloni@...> wrote: Hello Anibal,
|
|
SWAT Rotation
Alexandre Belloni
Hello Anibal,
You are the next one on the SWAT Team members list (https://wiki.yoctoproject.org/wiki/Yocto_Build_Failure_Swat_Team#Members). Are you available next week for SWAT duty? Thanks! -- Alexandre Belloni, Bootlin Embedded Linux and Kernel engineering https://bootlin.com
|
|
Re: glibc 2.34 tests
Randy MacLeod
Add Tony, who has been working on valgrind, so he and Khem can keep in sync.
../Randy
On 2021-07-06 5:01 a.m., Khem Raj wrote:
On Sat, Jul 3, 2021 at 2:29 PM Alexandre Belloni <alexandre.belloni@...> wrote:
# Randy MacLeod # Wind River Linux
|
|
Re: glibc 2.34 tests
Khem Raj <raj.khem@...>
On Sat, Jul 3, 2021 at 2:29 PM Alexandre Belloni
<alexandre.belloni@...> wrote: Thanks. The good thing is that these packages are quite tied to glibc internals, so an upgrade, if available for them, will usually sort things out. glibc is merging libpthread/librt etc. into libc, so a lot of the valgrind failures could be due to that. I will see what I can find.
|
|
glibc 2.34 tests
Alexandre Belloni
Hello,
I've been testing multiple versions of Khem's branch for glibc 2.34. Everything apart from non-gpl3 builds fine. There is a patch from Khem to solve this issue, which I didn't apply/test yet: [yocto] [meta-gplv2] [PATCH] coreutils_6.9.bb: Fix build with glibc 2.34

On the tests side, systemd-journal was crashing; the last patch on Khem's branch solves the issue, thanks!

The remaining issues are on the ptests side:
https://autobuilder.yoctoproject.org/typhoon/#/builders/81/builds/2233/steps/12/logs/stdio
https://autobuilder.yoctoproject.org/typhoon/#/builders/82/builds/1946/steps/12/logs/stdio
https://autobuilder.yoctoproject.org/typhoon/#/builders/81/builds/2236/steps/12/logs/stdio
https://autobuilder.yoctoproject.org/typhoon/#/builders/82/builds/1949/steps/12/logs/stdio
https://autobuilder.yoctoproject.org/typhoon/#/builders/81/builds/2244/steps/12/logs/stdio
https://autobuilder.yoctoproject.org/typhoon/#/builders/82/builds/1957/steps/12/logs/stdio

Issues happen both on qemux86-64 and qemuarm64 and they reproduce every time:

'elfutils': ['run-backtrace-native.sh', 'run-backtrace-native-core.sh']
'strace': ['prlimit64.gen.test', 'xet_robust_list.gen.test', 'xet_robust_list--pidns-translation.gen.test'],

For valgrind, it is all over the place. ARM has:

'valgrind': ['gdbserver_tests/hginfo', 'helgrind/tests/hg02_deadlock', 'drd/tests/boost_thread', 'drd/tests/pth_cond_race3', 'drd/tests/pth_create_chain', 'drd/tests/pth_detached', 'drd/tests/std_list', 'drd/tests/std_mutex', 'drd/tests/swapcontext', 'drd/tests/tc24_nonzero_sem', 'drd/tests/thread_name', 'drd/tests/thread_name_xml', 'drd/tests/threaded-fork-vcs', 'drd/tests/threaded-fork', 'drd/tests/tls_threads', 'massif/tests/deep-D']

x86 has:

'valgrind': ['gdbserver_tests/hginfo', 'memcheck/tests/supp_unknown', 'helgrind/tests/annotate_smart_pointer', 'helgrind/tests/bar_bad', 'helgrind/tests/hg01_all_ok', 'helgrind/tests/hg02_deadlock', 'helgrind/tests/pth_destroy_cond', 'helgrind/tests/rwlock_test', 'helgrind/tests/tc05_simple_race', 'helgrind/tests/tc06_two_races', 'helgrind/tests/tc06_two_races_xml', 'helgrind/tests/tc14_laog_dinphils', 'helgrind/tests/tc17_sembar', 'helgrind/tests/tc19_shadowmem', 'helgrind/tests/tc21_pthonce', 'helgrind/tests/tc22_exit_w_lock', 'helgrind/tests/tc23_bogus_condwait', 'helgrind/tests/tls_threads', 'drd/tests/annotate_barrier', 'drd/tests/annotate_barrier_xml', 'drd/tests/annotate_hb_race', 'drd/tests/annotate_hbefore', 'drd/tests/annotate_ignore_read', 'drd/tests/annotate_ignore_rw', 'drd/tests/annotate_ignore_rw2', 'drd/tests/annotate_ignore_write', 'drd/tests/annotate_ignore_write2', 'drd/tests/annotate_order_1', 'drd/tests/annotate_order_2', 'drd/tests/annotate_order_3', 'drd/tests/annotate_rwlock', 'drd/tests/annotate_sem', 'drd/tests/annotate_smart_pointer', 'drd/tests/annotate_smart_pointer2', 'drd/tests/annotate_spinlock', 'drd/tests/annotate_static', 'drd/tests/atomic_var', 'drd/tests/bar_bad', 'drd/tests/bar_bad_xml', 'drd/tests/bar_trivial', 'drd/tests/boost_thread', 'drd/tests/bug-235681', 'drd/tests/bug322621', 'drd/tests/circular_buffer', 'drd/tests/concurrent_close', 'drd/tests/dlopen', 'drd/tests/fork-parallel', 'drd/tests/fork-serial', 'drd/tests/fp_race', 'drd/tests/fp_race2', 'drd/tests/fp_race_xml', 'drd/tests/free_is_write', 'drd/tests/free_is_write2', 'drd/tests/hg01_all_ok', 'drd/tests/hg02_deadlock', 'drd/tests/hg03_inherit', 'drd/tests/hg04_race', 'drd/tests/hg05_race2', 'drd/tests/hg06_readshared', 'drd/tests/linuxthreads_det', 'drd/tests/matinv', 'drd/tests/monitor_example', 'drd/tests/pth_barrier', 'drd/tests/pth_barrier2', 'drd/tests/pth_barrier3', 'drd/tests/pth_barrier_race', 'drd/tests/pth_barrier_thr_cr', 'drd/tests/pth_broadcast', 'drd/tests/pth_cancel_locked', 'drd/tests/pth_cleanup_handler', 'drd/tests/pth_cond_race', 'drd/tests/pth_cond_race2', 'drd/tests/pth_cond_race3', 'drd/tests/pth_create_chain', 'drd/tests/pth_detached', 'drd/tests/pth_detached2', 'drd/tests/pth_detached3', 'drd/tests/pth_detached_sem', 'drd/tests/pth_inconsistent_cond_wait', 'drd/tests/pth_once', 'drd/tests/pth_spinlock', 'drd/tests/read_and_free_race', 'drd/tests/rwlock_race', 'drd/tests/rwlock_test', 'drd/tests/sem_as_mutex', 'drd/tests/sem_as_mutex2', 'drd/tests/sem_as_mutex3', 'drd/tests/sem_open', 'drd/tests/sem_open2', 'drd/tests/sem_open3', 'drd/tests/sem_open_traced', 'drd/tests/sigalrm', 'drd/tests/std_atomic', 'drd/tests/std_list', 'drd/tests/std_mutex', 'drd/tests/std_string', 'drd/tests/std_thread', 'drd/tests/swapcontext', 'drd/tests/tc01_simple_race', 'drd/tests/tc02_simple_tls', 'drd/tests/tc03_re_excl', 'drd/tests/tc05_simple_race', 'drd/tests/tc06_two_races', 'drd/tests/tc07_hbl1', 'drd/tests/tc08_hbl2', 'drd/tests/tc09_bad_unlock', 'drd/tests/tc11_XCHG', 'drd/tests/tc16_byterace', 'drd/tests/tc17_sembar', 'drd/tests/tc18_semabuse', 'drd/tests/tc19_shadowmem', 'drd/tests/tc21_pthonce', 'drd/tests/tc22_exit_w_lock', 'drd/tests/tc23_bogus_condwait', 'drd/tests/tc24_nonzero_sem', 'drd/tests/thread_name', 'drd/tests/thread_name_xml', 'drd/tests/threaded-fork-vcs', 'drd/tests/threaded-fork', 'drd/tests/tls_threads', 'massif/tests/deep-D']

It seems that https://autobuilder.yoctoproject.org/typhoon/#/builders/81/builds/2233 was under more load, as it had python3 fail and more strace failures. I'll triage the remaining intermittent ptest failures.

-- Alexandre Belloni, co-owner and COO, Bootlin Embedded Linux and Kernel engineering https://bootlin.com
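One quick way to separate common glibc 2.34 fallout from arch-specific noise is to intersect the two valgrind failure sets. A sketch, with the lists abbreviated (in practice, paste the full lists above):

# Abbreviated from the failure lists above.
arm_valgrind = {'gdbserver_tests/hginfo', 'helgrind/tests/hg02_deadlock',
                'drd/tests/boost_thread', 'drd/tests/tls_threads',
                'massif/tests/deep-D'}
x86_valgrind = {'gdbserver_tests/hginfo', 'memcheck/tests/supp_unknown',
                'helgrind/tests/hg02_deadlock', 'drd/tests/tls_threads',
                'massif/tests/deep-D'}

common = sorted(arm_valgrind & x86_valgrind)    # fails on both arches
arm_only = sorted(arm_valgrind - x86_valgrind)  # arch-specific or load-dependent
print("common to both arches:", common)
print("arm only:", arm_only)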
|
|
Re: SWAT Rotation
Leonardo Sandoval
Hi Alexandre and team, Sure, I can 'swat' next week. lsg
On Thu, 1 Jul 2021 at 17:28, Alexandre Belloni <alexandre.belloni@...> wrote: Hello Leonardo,
|
|
SWAT Rotation
Alexandre Belloni
Hello Leonardo,
You are the next one on the SWAT Team members list (https://wiki.yoctoproject.org/wiki/Yocto_Build_Failure_Swat_Team#Members) and SWAT duty will rotate to you at EOD 2021-07-02. Please reply to let me know whether you will be able to work on this task. Thanks! -- Alexandre Belloni, Bootlin Embedded Linux and Kernel engineering https://bootlin.com
|
|
Failed qemu bugs - notes for SWAT
Richard Purdie
I have some good news. Now that we have merged the rcu stall crash fix:
http://git.yoctoproject.org/cgit.cgi/poky/commit/?id=1f802cdf857475eefb9f3ae5032a6f1ce3e01d97

I think that the cause of some of our qemu hangs under higher IO load should be fixed. As such, I've gone through and closed out a number of qemu AB-INT bugs that look like they're related to this. If we do see anything that looks like some of the closed issues recur, I'd suggest opening new bugs, as the older ones are becoming unwieldy and it is hard to tell what may have been the rcu stall and what wasn't.

For any new testimage/qemu failures, please do try and capture and include the qemu boot log from WORKDIR/testimage/qemu_*.log if it isn't printed on the console log. If you don't have access, ping someone who does. It is looking to be of key importance that we get those logs. We have improved some of the code paths but we're not quite there yet for some failure paths.

I also closed out the ltp bugs since we do have a fix there, at least for the most common hang. I suspect there is still an arm hang lurking somewhere, so if we see that, it should have its own new bug (ping Alex or me too please, as we'd like to poke at the hung build).

Cheers, Richard
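For anyone scripting this up, a small hypothetical helper (the WORKDIR/testimage layout is as described above; the function itself is not part of any existing tool) could grab the newest boot log for a bug report:

import glob, os

def latest_qemu_boot_log(workdir):
    """Return the most recent qemu_*.log under WORKDIR/testimage/, or None."""
    logs = glob.glob(os.path.join(workdir, "testimage", "qemu_*.log"))
    return max(logs, key=os.path.getmtime) if logs else None

log = latest_qemu_boot_log(
    "tmp/work/qemux86_64-poky-linux/core-image-sato/1.0-r0")
if log:
    with open(log, errors="replace") as f:
        print(f.read()[-2000:])   # tail end is usually where the stall shows up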
|
|
Re: Autobuilder "rcu stall" issue summary
Richard Purdie
On Fri, 2021-06-25 at 16:34 +0100, Richard Purdie via lists.yoctoproject.org wrote:
On Thu, 2021-06-24 at 17:31 +0100, Richard Purdie via lists.yoctoproject.org wrote:
I'll stop here but will follow up on the mail if I remember more info. Anyone else...

I formed a new plan and set the stall detector to 3 seconds instead of...

I realised there was a potential lock issue in the rcu stall code in the kernel. In looking into it with Paul, we found that upstream have a fix queued for the same issue:

https://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git/commit/?id=406a2f008f2e

This would account for the BUG: entries in the previous email. I plan to test this with the short rcu stall timing and see if we still see the hangs. Adding this fix to linux-yocto would seem helpful just to rule it out if nothing else, but this fix does look like the kind of thing that would cause the issues we've been tracking.

Cheers, Richard
|
|
Re: Autobuilder "rcu stall" issue summary
Richard Purdie
On Thu, 2021-06-24 at 17:31 +0100, Richard Purdie via lists.yoctoproject.org wrote:
I'll stop here but will follow up on the mail if I remember more info. Anyone else...

I formed a new plan and set the stall detector to 3 seconds instead of 21 seconds with this hack in master-next: http://git.yoctoproject.org/cgit.cgi/poky/commit/?h=master-next&id=71b6bc157d39e09e8f76a15a049168eb72bbd3d9 i.e. adding CONFIG_RCU_CPU_STALL_TIMEOUT=3

I ran two builds on the autobuilder, one of which was a heavy rebuild, so it should trigger a lot of IO. I've been looking over the build results and we have a number of builds which failed or are in the process of doing so; I've dumped 4 below (three arm, one x86-64).

My thinking is that if we reduced the stall detector limit and only saw occasional hangs at the same rate as before, it was likely qemu. If we saw an increase in hangs (which we definitely have, even on the preliminary incomplete results), it is more likely something in the kernel RCU stall code is taking out the system. The BUG: in occasional logs also hints at the latter.

My own next step is probably to hack a "stall report" trigger into sysrq and try it in some of my own images locally, to see if I can break this at will.

Cheers, Richard

qemuarm: https://autobuilder.yoctoproject.org/typhoon/#/builders/53/builds/3608
/home/pokybuild/yocto-worker/qemuarm/build/build/tmp/work/qemuarm-poky-linux-gnueabi/core-image-sato-sdk/1.0-r0/testimage/qemu_boot_log.20210625121601
** (matchbox-desktop:715): WARNING **: 12:17:35.454: Error loading icon: Icon 'terminal' not present in theme Sato [ 2173.684752] hellomod: loading out-of-tree module taints kernel. [ 2173.726821] Hello world! [ 2178.269807] Cleaning up hellomod. [ 2426.747471] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks: [ 2426.753269] rcu: 2-....: (32 ticks this GP) idle=21e/1/0x40000000 softirq=95871/95871 fqs=19 [ 2426.823824] (detected by 0, t=302 jiffies, g=214353, q=107)

qemuarm-oecore: https://autobuilder.yoctoproject.org/typhoon/#/builders/47/builds/3601
OpenEmbedded nodistro.0 qemuarm /dev/hvc0
/home/pokybuild/yocto-worker/qemuarm-oecore/build/build/tmp-glibc/work/qemuarm-oe-linux-gnueabi/core-image-sato/1.0-r0/testimage/qemu_boot_log.20210625114847
qemuarm login: [ 53.248661] rcu: INFO: rcu_preempt self-detected stall on CPU [ 53.250587] rcu: 0-...!: (1 ticks this GP) idle=4aa/0/0x1 softirq=1385/1385 fqs=0 [ 53.251571] (t=1019 jiffies g=1293 q=4) [ 53.252192] rcu: rcu_preempt kthread starved for 1019 jiffies! g1293 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x0 - cpu=1 [ 53.253026] rcu: Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.
[ 53.253697] rcu: RCU grace-period kthread stack dump: [ 53.254348] task:rcu_preempt state:R running task stack: 0 pid: 13 ppid: 2 flags:0x00000000 [ 53.258129] [<c0a73808>] (__schedule) from [<c0a73a74>] (schedule+0xc8/0x118) [ 53.258925] [<c0a73a74>] (schedule) from [<c0a75ef8>] (schedule_timeout+0xcc/0xfc) [ 53.259247] [<c0a75ef8>] (schedule_timeout) from [<c028b8b0>] (rcu_gp_kthread+0x2ac/0x92c) [ 53.259619] [<c028b8b0>] (rcu_gp_kthread) from [<c024f314>] (kthread+0x158/0x164) [ 53.259865] [<c024f314>] (kthread) from [<c0200138>] (ret_from_fork+0x14/0x3c) [ 53.260207] Exception stack(0xc1539fb0 to 0xc1539ff8) [ 53.260691] 9fa0: 00000000 00000000 00000000 00000000 [ 53.260973] 9fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 [ 53.261236] 9fe0: 00000000 00000000 00000000 00000000 00000013 00000000 [ 53.261718] NMI backtrace for cpu 0 [ 53.262040] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.10.43-yocto-standard #1 [ 53.262324] Hardware name: Generic DT based system [ 53.262584] [<c020fe28>] (unwind_backtrace) from [<c020ab54>] (show_stack+0x18/0x1c) [ 53.262847] [<c020ab54>] (show_stack) from [<c0a6e304>] (dump_stack+0x88/0xa4) [ 53.263081] [<c0a6e304>] (dump_stack) from [<c06dc0e8>] (nmi_cpu_backtrace+0xb8/0xcc) [ 53.263340] [<c06dc0e8>] (nmi_cpu_backtrace) from [<c06dc1b8>] (nmi_trigger_cpumask_backtrace+0xbc/0x164) [ 53.263615] [<c06dc1b8>] (nmi_trigger_cpumask_backtrace) from [<c0a69248>] (rcu_dump_cpu_stacks+0xe8/0x114) [ 53.264190] [<c0a69248>] (rcu_dump_cpu_stacks) from [<c028d9d8>] (rcu_sched_clock_irq+0x35c/0xa78) [ 53.264463] [<c028d9d8>] (rcu_sched_clock_irq) from [<c0296f8c>] (update_process_times+0x5c/0x80) [ 53.264757] [<c0296f8c>] (update_process_times) from [<c02a8220>] (tick_sched_timer+0x17c/0x1fc) [ 53.265047] [<c02a8220>] (tick_sched_timer) from [<c0297fc8>] (__hrtimer_run_queues+0x19c/0x2c4) [ 53.265317] [<c0297fc8>] (__hrtimer_run_queues) from [<c0298528>] (hrtimer_interrupt+0xc8/0x204) [ 53.265559] [<c0298528>] (hrtimer_interrupt) from [<c08a112c>] (arch_timer_handler_virt+0x30/0x38) [ 53.265802] [<c08a112c>] (arch_timer_handler_virt) from [<c027f844>] (handle_percpu_devid_irq+0x60/0x118) [ 53.266079] [<c027f844>] (handle_percpu_devid_irq) from [<c0279bf0>] (generic_handle_irq+0x30/0x40) [ 53.266320] [<c0279bf0>] (generic_handle_irq) from [<c027a218>] (__handle_domain_irq+0x98/0xa8) [ 53.266554] [<c027a218>] (__handle_domain_irq) from [<c06eb518>] (gic_handle_irq+0x74/0x84) [ 53.266778] [<c06eb518>] (gic_handle_irq) from [<c0201278>] (__irq_svc+0x58/0x8c) [ 53.266998] Exception stack(0xc1201f20 to 0xc1201f68) [ 53.267171] 1f20: 0000a4a8 cfd6de84 00000000 c0218860 c1200000 00000000 c1200000 c1201f80 [ 53.267390] 1f40: c1204f94 00000001 c1204fd4 00000000 c1684040 c1201f70 c0207930 c0207920 [ 53.267611] 1f60: 60010113 ffffffff [ 53.267744] [<c0201278>] (__irq_svc) from [<c0207920>] (arch_cpu_idle+0x24/0x40) [ 53.267946] [<c0207920>] (arch_cpu_idle) from [<c0a76898>] (default_idle_call+0x40/0x5c) [ 53.268167] [<c0a76898>] (default_idle_call) from [<c025e110>] (do_idle+0x108/0x254) [ 53.268384] [<c025e110>] (do_idle) from [<c025e534>] (cpu_startup_entry+0x20/0x24) [ 53.268624] [<c025e534>] (cpu_startup_entry) from [<c1000e80>] (start_kernel+0x4a0/0x57c) [ 73.311040] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks: [ 73.313047] (detected by 2, t=428 jiffies, g=1965, q=113) [ 73.313886] rcu: All QSes seen, last rcu_preempt kthread activity 425 (-22941--23366), jiffies_till_next_fqs=1, root ->qsmask 0x0 [ 73.315413] rcu: rcu_preempt kthread 
starved for 425 jiffies! g1965 f0x2 RCU_GP_WAIT_FQS(5) ->state=0x200 - cpu=2[ 73.316823] rcu: Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior. [ 73.317775] rcu: RCU grace-period kthread stack dump: [ 73.318304] task:rcu_preempt state:R stack: 0 pid: 13 ppid: 2 flags:0x00000000 [ 73.318759] [<c0a73808>] (__schedule) from [<c0a73a74>] (schedule+0xc8/0x118) [ 73.319123] [<c0a73a74>] (schedule) from [<c0a75ef8>] (schedule_timeout+0xcc/0xfc) [ 73.319500] [<c0a75ef8>] (schedule_timeout) from [<c028b8b0>] (rcu_gp_kthread+0x2ac/0x92c) [ 73.319886] [<c028b8b0>] (rcu_gp_kthread) from [<c024f314>] (kthread+0x158/0x164) [ 73.320201] [<c024f314>] (kthread) from [<c0200138>] (ret_from_fork+0x14/0x3c) [ 73.320776] Exception stack(0xc1539fb0 to 0xc1539ff8) [ 73.321088] 9fa0: 00000000 00000000 00000000 00000000 [ 73.321403] 9fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 [ 73.321698] 9fe0: 00000000 00000000 00000000 00000000 00000013 00000000 [ 73.322376] BUG: scheduling while atomic: swapper/2/0/0x00000002 [ 73.322697] Modules linked in: [ 73.323092] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 5.10.43-yocto-standard #1 [ 73.323383] Hardware name: Generic DT based system [ 73.323621] [<c020fe28>] (unwind_backtrace) from [<c020ab54>] (show_stack+0x18/0x1c) [ 73.323908] [<c020ab54>] (show_stack) from [<c0a6e304>] (dump_stack+0x88/0xa4) [ 73.324184] [<c0a6e304>] (dump_stack) from [<c02575e4>] (__schedule_bug+0x68/0x8c) [ 73.324454] [<c02575e4>] (__schedule_bug) from [<c0a73578>] (__schedule+0x58/0x48c) [ 73.324741] [<c0a73578>] (__schedule) from [<c0a73e00>] (schedule_idle+0x68/0x80) [ 73.325046] [<c0a73e00>] (schedule_idle) from [<c025e534>] (cpu_startup_entry+0x20/0x24) [ 73.325576] [<c025e534>] (cpu_startup_entry) from [<40201c2c>] (0x40201c2c) qemuarm64-ltp (on arm64 host ubuntu1804-arm-1) https://autobuilder.yoctoproject.org/typhoon/#/builders/96/builds/1888 /home/pokybuild/yocto-worker/qemuarm64-ltp/build/build/tmp/work/qemuarm64-poky-linux/core-image-sato/1.0- r0/testimage/qemu_boot_log.20210625121516 qemuarm64 login: [ 21.445581] Process accounting resumed [ 22.705496] Process accounting resumed [ 331.583598] capability: warning: `capget01' uses 32-bit capabilities (legacy support in use) [ 331.585764] capability: warning: `capget01' uses deprecated v2 capabilities in a way that may be insecure [ 345.996555] Clock: inserting leap second 23:59:60 UTC [ 713.924549] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks: [ 714.133632] rcu: Tasks blocked on level-0 rcu_node (CPUs 0-3): P9691/1:b..l [ 714.255745] (detected by 1, t=822 jiffies, g=195937, q=3443) [ 714.256974] task:msgstress03 state:R running task stack: 0 pid: 9589 ppid: 9543 flags:0x00000000 [ 714.258971] Call trace: [ 714.259520] __switch_to+0x114/0x19c [ 714.260274] __schedule+0x274/0x610 [ 714.261008] preempt_schedule_common+0x28/0x50 [ 714.261928] preempt_schedule+0x30/0x40 [ 714.262711] _raw_spin_unlock+0x38/0x50 [ 714.263507] do_msgrcv+0x284/0x4f0 [ 714.264230] __arm64_sys_msgrcv+0x38/0x44 [ 714.265042] el0_svc_common.constprop.0+0x9c/0x1c0 [ 714.266014] do_el0_svc+0x7c/0xa0 [ 714.266718] el0_svc+0x20/0x30 [ 714.267391] el0_sync_handler+0xa4/0x130 [ 714.268193] el0_sync+0x180/0x1c0 [ 714.268900] task:msgstress03 state:R running task stack: 0 pid: 9691 ppid: 32178 flags:0x00000000 [ 714.270830] Call trace: [ 714.271369] __switch_to+0x114/0x19c [ 714.313241] __schedule+0x274/0x610 [ 714.333040] preempt_schedule_irq+0x68/0x100 [ 714.333944] arm64_preempt_schedule_irq+0x2c/0x40 
[ 714.334910] el1_irq+0xe4/0x180 [ 714.335606] wake_up_q+0xa8/0xd0 [ 714.336308] do_msgsnd+0x3f8/0x4e0 [ 714.337029] ksys_msgsnd+0x90/0xb0 [ 714.337750] __arm64_sys_msgsnd+0x2c/0x40 [ 714.338564] el0_svc_common.constprop.0+0x9c/0x1c0 [ 714.339552] do_el0_svc+0x7c/0xa0 [ 714.340266] el0_svc+0x20/0x30 [ 714.340936] el0_sync_handler+0xa4/0x130 [ 714.341735] el0_sync+0x180/0x1c0 [ 720.932547] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks: [ 721.016226] rcu: Tasks blocked on level-0 rcu_node (CPUs 0-3): [ 721.072646] (detected by 0, t=752 jiffies, g=195941, q=7890) [ 721.073841] rcu: All QSes seen, last rcu_preempt kthread activity 1 (4295072518-4295072517), jiffies_till_next_fqs=1, root ->qsmask 0x0 [ 750.072043] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks: [ 750.073887] rcu: Tasks blocked on level-0 rcu_node (CPUs 0-3): P1738/1:b..l [ 750.116802] (detected by 0, t=773 jiffies, g=196493, q=15176) [ 750.118095] task:msgstress04 state:R running task stack: 0 pid: 1733 ppid: 1482 flags:0x00000000 [ 750.120144] Call trace: [ 750.120730] __switch_to+0x114/0x19c [ 750.144388] __schedule+0x274/0x610 [ 750.145215] preempt_schedule_common+0x28/0x50 [ 750.146187] preempt_schedule+0x30/0x40 [ 750.147020] _raw_spin_unlock+0x38/0x50 [ 750.147853] do_msgsnd+0x3ec/0x4e0 [ 750.148610] ksys_msgsnd+0x90/0xb0 [ 750.149365] __arm64_sys_msgsnd+0x2c/0x40 [ 750.150219] el0_svc_common.constprop.0+0x9c/0x1c0 [ 750.151244] do_el0_svc+0x7c/0xa0 [ 750.151995] el0_svc+0x20/0x30 [ 750.152698] el0_sync_handler+0xa4/0x130 [ 750.153532] el0_sync+0x180/0x1c0 [ 750.154273] task:msgstress04 state:R running task stack: 0 pid: 1738 ppid: 1482 flags:0x00000000 [ 750.156271] Call trace: [ 750.156842] __switch_to+0x114/0x19c [ 750.157626] __schedule+0x274/0x610 [ 750.158395] preempt_schedule_common+0x28/0x50 [ 750.159363] preempt_schedule+0x30/0x40 [ 750.160196] _raw_spin_unlock+0x38/0x50 [ 750.161020] do_msgrcv+0x284/0x4f0 [ 750.161774] __arm64_sys_msgrcv+0x38/0x44 [ 750.162620] el0_svc_common.constprop.0+0x9c/0x1c0 [ 750.163643] do_el0_svc+0x7c/0xa0 [ 750.164388] el0_svc+0x20/0x30 [ 750.165086] el0_sync_handler+0xa4/0x130 [ 750.165918] el0_sync+0x180/0x1c0 [ 763.280549] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks: [ 763.282336] (detected by 0, t=752 jiffies, g=196741, q=7494) [ 763.283694] rcu: All QSes seen, last rcu_preempt kthread activity 703 (4295083105-4295082402), jiffies_till_next_fqs=1, root ->qsmask 0x0 [ 763.286260] rcu: rcu_preempt kthread starved for 703 jiffies! g196741 f0x2 RCU_GP_WAIT_FQS(5) ->state=0x0 - cpu=3[ 763.288363] rcu: Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior. 
[ 763.290242] rcu: RCU grace-period kthread stack dump: [ 763.291334] task:rcu_preempt state:R running task stack: 0 pid: 13 ppid: 2 flags:0x00000028 [ 763.293387] Call trace: [ 763.293976] __switch_to+0x114/0x19c [ 763.294779] __schedule+0x274/0x610 [ 763.295582] schedule+0x7c/0x110 [ 763.296343] schedule_timeout+0xb0/0x1c0 [ 763.297216] rcu_gp_kthread+0x4e0/0xa90 [ 763.298073] kthread+0x168/0x16c [ 763.298831] ret_from_fork+0x10/0x3c [ 763.299741] BUG: scheduling while atomic: msgstress04/18860/0x00000002 [ 763.301093] Modules linked in: [ 763.301808] CPU: 0 PID: 18860 Comm: msgstress04 Not tainted 5.10.43-yocto-standard #1 [ 763.303502] Hardware name: linux,dummy-virt (DT) [ 763.304491] Call trace: [ 763.305056] dump_backtrace+0x0/0x1a0 [ 763.305852] show_stack+0x24/0x30 [ 763.306595] dump_stack+0xd4/0x110 [ 763.307357] __schedule_bug+0x68/0x80 [ 763.308160] __schedule+0x454/0x610 [ 763.308925] schedule+0x7c/0x110 [ 763.309655] do_msgsnd+0x294/0x4e0 [ 763.310406] ksys_msgsnd+0x90/0xb0 [ 763.311167] __arm64_sys_msgsnd+0x2c/0x40 [ 763.312024] el0_svc_common.constprop.0+0x9c/0x1c0 [ 763.313039] do_el0_svc+0x7c/0xa0 [ 763.313780] el0_svc+0x20/0x30 [ 763.314479] el0_sync_handler+0xa4/0x130 [ 763.315326] el0_sync+0x180/0x1c0 multilib core-image-sato ubuntu1804-ty-1 https://autobuilder.yoctoproject.org/typhoon/#/builders/44/builds/3622 /home/pokybuild/yocto-worker/multilib/build/build/tmp/work/qemux86_64-poky-linux/core-image-sato/1.0- r0/testimage/qemu_boot_log.20210625121139 libGL error: MESA-LOADER: failed to open bochs-drm: /usr/lib/dri/bochs-drm_dri.so: cannot open shared object file: No such file or directory (search paths /usr/lib/dri) libGL error: failed to load driver: bochs-drm libGL error: pci id for fd 4: 1234:1111, driver (null) libGL error: MESA-LOADER: failed to open bochs-drm: /usr/lib64/dri/bochs-drm_dri.so: cannot open shared object file: No such file or directory (search paths /usr/lib64/dri) libGL error: failed to load driver: bochs-drm [settings daemon] Forking. run with -n to prevent fork dbus-daemon[652]: Activating service name='org.a11y.atspi.Registry' requested by ':1.1' (uid=0 pid=669 comm="matchbox-panel --start-applets showdesktop,windows") dbus-daemon[652]: Successfully activated service 'org.a11y.atspi.Registry' SpiRegistry daemon is running with well-known name - org.a11y.atspi.Registry ** (matchbox-desktop:668): [1;33mWARNING[0m **: [34m12:12:24.699[0m: Error loading icon: Icon 'terminal' not present in theme Sato ** (matchbox-desktop:668): [1;33mWARNING[0m **: [34m12:12:24.713[0m: Error loading icon: Icon 'net-48d24' not present in theme Sato [ 21.640654] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks: [ 21.642263] rcu: 3-...0: (0 ticks this GP) idle=18a/1/0x4000000000000000 softirq=1618/1618 fqs=425 [ 21.644170] (detected by 1, t=3155 jiffies, g=2653, q=16) [ 30.951654] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
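To compare stall rates between the 3-second and 21-second runs, a hypothetical scan over a directory of collected boot logs could count the stall markers seen in the traces above (the log directory layout is an assumption):

import glob
from collections import Counter

# Marker strings taken from the kernel traces above.
MARKERS = ("rcu_preempt detected stalls",
           "rcu_preempt self-detected stall",
           "BUG: scheduling while atomic")

def count_stalls(pattern="logs/qemu_boot_log.*"):
    """Count how many boot logs contain each stall marker."""
    hits = Counter()
    for path in glob.glob(pattern):
        with open(path, errors="replace") as f:
            text = f.read()
        for marker in MARKERS:
            if marker in text:
                hits[marker] += 1
    return hits

for marker, n in count_stalls().items():
    print(f"{n:3d} logs contain: {marker}")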
|
|
Re: Autobuilder "rcu stall" issue summary
Denys Dmytriyenko
On Thu, Jun 24, 2021 at 01:57:43PM -0400, Bruce Ashfield wrote:
On Thu, Jun 24, 2021 at 12:31 PM Richard Purdie wrote:
They turn on the prime suspects in the kernel to get more information. If there...

Can this be keyed off of the DEBUG_BUILD variable?

-- Regards, Denys Dmytriyenko <denis@...> PGP: 0x420902729A92C964 - https://denix.org/0x420902729A92C964 Fingerprint: 25FC E4A5 8A72 2F69 1186 6D76 4209 0272 9A92 C964
|
|
Re: Autobuilder "rcu stall" issue summary
Bruce Ashfield <bruce.ashfield@...>
On Thu, Jun 24, 2021 at 12:31 PM Richard Purdie
<richard.purdie@...> wrote:

Richard,

As mentioned earlier, we do have the kernel debug fragments available (as well as a 'developer' kernel type). We can't globally enable the developer ktype, since we haven't defined BSP entry points for all of our reference boards (but I can do that as a TODO item). But we can get the same options enabled by adding:

KERNEL_FEATURES_append = " ktypes/developer/developer.cfg features/debug/debug-kernel.scc"

They turn on the prime suspects in the kernel to get more information. If there are options we are missing, let me know and I'll add them to those fragments so we can rely on them in the future.

Bruce
-- - Thou shalt not follow the NULL pointer, for chaos and madness await thee at its end - "Use the force Harry" - Gandalf, Star Trek II
|
|
Autobuilder "rcu stall" issue summary
Richard Purdie
We have info all over the place relating to our "rcu stall" issue on the
autobuilder. This is an attempt to document what we know and what we don't.

We see builds fail periodically where there is an "rcu stall" mentioned by the kernel and the image "locks up". An easier to read example is:
https://autobuilder.yoctoproject.org/typhoon/#/builders/56/builds/1488/steps/14/logs/stdio
A core-image-minimal in testimage sitting using 394% CPU, having run for 97 minutes of cpu time. A minimal image doesn't execute much at all and should be over in a minute max.

An example we were able to "catch" and debug a bit, at the expense of breaking the output, is:
https://autobuilder.yoctoproject.org/typhoon/#/builders/45/builds/3576/steps/12/logs/stdio
where a musl qemux86-64 core-image-sato-sdk stopped mid "configure" on target during testimage. We were able to look at the kernel log, the tail end of which showed:

[settings daemon] Forking. run with -n to prevent fork
[ 102.082744] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 102.093213] rcu: 1-...!: (1 GPs behind) idle=3f6/0/0x1 softirq=1374/1375 fqs=0
[ 102.096448] rcu: 3-...!: (12 ticks this GP) idle=e34/0/0x0 softirq=1658/1658 fqs=0
[ 102.099326] (detected by 1, t=21002 jiffies, g=1937, q=278)

The kernel was:

[ 0.000000] Linux version 5.10.43-yocto-standard (oe-user@oe-host) (x86_64-poky-linux-musl-gcc (GCC) 11.1.1 20210523, GNU ld (GNU Binutils) 2.36.1.20210209) #1 SMP PREEMPT Wed Jun 16 16:00:47 UTC 2021
[ 0.000000] Command line: root=/dev/vda rw mem=512M ip=192.168.7.4::192.168.7.3:255.255.255.0 console=ttyS0 console=ttyS1 oprofile.timer=1 tsc=reliable no_timer_check rcupdate.rcu_expedited=1 printk.time=1

Full log: https://www.rpsys.net/wp/rp/qemuboot-autobuild.log

In the above state, qemu was seen with 303% and 403% CPU usage. I do have strace output from the hung qemu: https://www.rpsys.net/wp/rp/qemu-strace.log

It was interesting to note that pinging the image did work, and you can see that in the strace log, so something was still alive. You couldn't ssh in though, and existing ssh connections failed with timeouts.

We do also have the VM core dump for this, which Alexandre is looking into (obtained from qmp with:
{"execute": "qmp_capabilities"}
{"execute":"dump-guest-memory","arguments":{"paging":false,"protocol":"file:/tmp/vmcore.img"}}
). I couldn't see any extra kernel log messages in the vmcore file.

We've seen this before in core-image-sato, core-image-sato-sdk and also with qemuarm; the latter seems to imply that KVM isn't the issue and that it isn't arch specific, implying it is qemu related. The load itself doesn't appear to matter: above it was a configure script, but we've seen it on images without toolchains in a variety of states/loads. We've seen it on a variety of host workers, ubuntu1804 being the most common, but that is a more common worker OS too.

I suspect that this is the same issue that causes images to lock up in the firmware boot stage as in:
https://bugzilla.yoctoproject.org/show_bug.cgi?id=14018
There may also be other related bugs; without looking at the kernel traces, it is hard to say for sure. https://bugzilla.yoctoproject.org/show_bug.cgi?id=14273 lists a load more.

We've not been able to replicate the issue at will. I have tried various stress-ng --all loads locally with no success at replicating. I did compare the boot output from qemu on my local system with the above kernel boot and it matches apart from timings and the kvm-clock msr value.

The suspicion is that some high IO load event somehow triggers qemu to fail. I also suspect that the "rcu stall" is just a symptom of a wider issue. The stall is suspicious as it looks incomplete.

I'll stop here but will follow up on the mail if I remember more info. Anyone else feel free to add, and if anyone has any insight into what is happening or how to better debug, I'm very open to it!

Cheers, Richard
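The qmp exchange quoted above can also be scripted. A minimal sketch, assuming qemu was started with -qmp unix:/tmp/qmp.sock,server,nowait (the socket path is an assumption; the two commands are exactly those quoted above):

import json, socket

def dump_guest_memory(qmp_path="/tmp/qmp.sock", dest="file:/tmp/vmcore.img"):
    """Connect to QMP, leave capabilities negotiation, request a core dump."""
    s = socket.socket(socket.AF_UNIX)
    s.connect(qmp_path)
    reader = s.makefile()
    reader.readline()                       # consume the QMP greeting banner
    for msg in ({"execute": "qmp_capabilities"},
                {"execute": "dump-guest-memory",
                 "arguments": {"paging": False, "protocol": dest}}):
        s.sendall(json.dumps(msg).encode() + b"\n")
        print(reader.readline().strip())    # expect {"return": {}} on success

dump_guest_memory()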
|
|
Re: ltp failures on autobuilder
Randy MacLeod
On 2021-06-16 10:17 a.m., Richard Purdie wrote:
On Wed, 2021-06-16 at 08:56 -0400, Paul Gortmaker wrote:
[Re: [swat] ltp failures on autobuilder] On 11/06/2021 (Fri 14:19) Richard Purdie wrote:
Awesome work in tracking that down, much appreciated, thanks!
On Fri, 2021-06-11 at 12:36 +0100, Richard Purdie via lists.yoctoproject.org wrote:
For the mail archive trail, and for those not following the ongoing...
...as a .cfg to the kernel and that still reproduced the crash. However:
Isolated down to CONFIG_SCHED_DEBUG=y being the line which somehow "fixes"...

Dropped the wider list but, as I said to Richard, I wanted to confirm that this bug is really gone. I ran 100 tests overnight and it seems to be dead, Jim.

$ bitbake core-image-sato && \
  for i in `seq 100`; do \
    echo "--- " $i " ---"; \
    timeout --kill-after=2m 10m bitbake core-image-sato -c testimage \
      && echo GOOD || echo BAD; \
  done

$ ls -l tmp/work/qemux86_64-poky-linux/core-image-sato/1.0-r0/testimage/qemu_boot_log.202* | wc -l
100
$ grep -m1 BUG: tmp/work/qemux86_64-poky-linux/core-image-sato/1.0-r0/testimage/qemu_boot_log.2021* | wc -l
0

All the log files are there and roughly the same size, with similar but not identical contents due to differing timestamps and non-deterministic ordering of some of the output.

../Randy

-- # Randy MacLeod # Wind River Linux
|
|
Re: SWAT Rotation
Ross Burton <ross@...>
Yes, that's fine.
Ross

On Thu, 17 Jun 2021 at 16:25, Alexandre Belloni <alexandre.belloni@...> wrote:
|
|
SWAT Rotation
Alexandre Belloni
Hello Ross,
SWAT duty should rotate to you next week. Will you have some time to work on that? Thanks! -- Alexandre Belloni, Bootlin Embedded Linux and Kernel engineering https://bootlin.com
|
|