VMs hanging with rcu stall problems
Richard Purdie
Not cross posted but mentioned here for info. Seeing if the qemu devs have any ideas.
Cheers, Richard
|
|
qa-extras2 hang rcu/scheduling while atomic
Richard Purdie
https://autobuilder.yoctoproject.org/typhoon/#/builders/72/builds/3538
Failure during execution of dnf --help. Traceback from the qemu logs from the 5.10 kernel (full log attached):

qemux86-64 login: [ 133.333475] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 133.337109] (detected by 2, t=25864 jiffies, g=1529, q=10)
[ 133.339025] rcu: All QSes seen, last rcu_preempt kthread activity 4865 (4294800423-4294795558), jiffies_till_next_fqs=3, root ->qsmask 0x0
[ 133.343445] rcu: rcu_preempt kthread starved for 4870 jiffies! g1529 f0x2 RCU_GP_WAIT_FQS(5) ->state=0x200 ->cpu=2
[ 133.346976] rcu: Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.
[ 133.350262] rcu: RCU grace-period kthread stack dump:
[ 133.352704] task:rcu_preempt state:R stack: 0 pid: 13 ppid: 2 flags:0x00004000
[ 133.355581] Call Trace:
[ 133.356488] __schedule+0x1dc/0x570
[ 133.357693] ? __mod_timer+0x220/0x3c0
[ 133.359018] schedule+0x68/0xe0
[ 133.360000] schedule_timeout+0x8f/0x160
[ 133.361267] ? force_qs_rnp+0x8d/0x1c0
[ 133.362515] ? __next_timer_interrupt+0x100/0x100
[ 133.364264] rcu_gp_kthread+0x55f/0xba0
[ 133.365701] ? note_gp_changes+0x70/0x70
[ 133.367356] kthread+0x145/0x170
[ 133.368597] ? kthread_associate_blkcg+0xc0/0xc0
[ 133.370686] ret_from_fork+0x22/0x30
[ 133.371976] BUG: scheduling while atomic: swapper/2/0/0x00000002
[ 133.374066] Modules linked in: bnep
[ 133.375324] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 5.10.41-yocto-standard #1
[ 133.377813] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[ 133.381882] Call Trace:
[ 133.382744] dump_stack+0x5e/0x74
[ 133.384027] __schedule_bug.cold+0x4b/0x59
[ 133.385362] __schedule+0x3f6/0x570
[ 133.386655] schedule_idle+0x2c/0x40
[ 133.388033] do_idle+0x15a/0x250
[ 133.389257] ? complete+0x3f/0x50
[ 133.390406] cpu_startup_entry+0x20/0x30
[ 133.391827] start_secondary+0xf1/0x100
[ 133.393143] secondary_startup_64_no_verify+0xc2/0xcb
[ 191.482302] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 255.155323] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:

We shouldn't see "scheduling while atomic", and the stall detector also isn't listing tasks.

Cheers, Richard
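
For triaging logs like the one above, a small script can pull out the RCU stall reports and any "scheduling while atomic" messages along with their kernel timestamps. The following is only a minimal sketch in Python and is not part of the autobuilder tooling: the function name `summarize_kernel_log` and both regular expressions are assumptions based solely on the message formats visible in this log.

```python
import re

# Both patterns are assumptions derived from the message formats in this thread.
STALL_RE = re.compile(
    r"\[\s*(\d+\.\d+)\]\s+rcu: INFO: (\w+) detected stalls on CPUs/tasks:"
)
ATOMIC_RE = re.compile(
    r"\[\s*(\d+\.\d+)\]\s+BUG: scheduling while atomic: (\S+)"
)


def summarize_kernel_log(log_text):
    """Return timestamps of stall reports and 'scheduling while atomic' bugs."""
    # finditer scans the whole text, so it works whether or not the log
    # still has one message per line.
    stalls = [(float(m.group(1)), m.group(2)) for m in STALL_RE.finditer(log_text)]
    atomic = [(float(m.group(1)), m.group(2)) for m in ATOMIC_RE.finditer(log_text)]
    return {"stalls": stalls, "scheduling_while_atomic": atomic}


# A few lines copied from the log in this message.
SAMPLE_LOG = """\
[ 133.333475] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 133.371976] BUG: scheduling while atomic: swapper/2/0/0x00000002
[ 191.482302] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 255.155323] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
"""

if __name__ == "__main__":
    summary = summarize_kernel_log(SAMPLE_LOG)
    for ts, flavour in summary["stalls"]:
        print(f"{ts:10.3f}s stall reported by {flavour}")
    for ts, task in summary["scheduling_while_atomic"]:
        print(f"{ts:10.3f}s scheduling while atomic in {task}")
```

Running something like this over the saved qemu logs from several failed builds would make it easier to compare when the first stall appears and whether the atomic-scheduling bug always accompanies it.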
|
|
Re: [PATCH 0/4] Re-implement prserv on top of asyncrpc
Paul Barker <pbarker@...>
On Mon, 31 May 2021 at 12:25, Richard Purdie
<richard.purdie@...> wrote:

(╯°□°)╯︵ ┻━┻

If you're planning to take the day off don't worry about investigating these. I'll take a look at the patches again on Wednesday.

I think the best approach may be to add some timeouts and maybe more error handling to the asyncrpc code I extracted from hashserv. If we can turn these hangs into a proper error then we can reduce the amount of autobuilder time they take to test, and hopefully we'll get a better insight into what is actually going wrong.

My guess is that there's something in the autobuilder config, or just the level of load on the machines, which is aggravating this, as the tests finish successfully on my build machine (with a few expected test failures as noted previously).

Thanks,
--
Paul Barker
Konsulko Group
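
The idea of turning a hang into a proper error can be sketched with plain asyncio. This is not the asyncrpc code from bitbake: `RpcTimeoutError`, `call_with_timeout` and `fake_reply` are hypothetical names used only for illustration, and the only library call relied on is the standard `asyncio.wait_for`.

```python
import asyncio


class RpcTimeoutError(Exception):
    """Hypothetical error type: the peer did not answer within the deadline."""


async def call_with_timeout(awaitable, timeout, what="request"):
    """Await `awaitable`, converting a hang into RpcTimeoutError after `timeout` seconds."""
    try:
        # wait_for cancels the inner awaitable if the deadline passes.
        return await asyncio.wait_for(awaitable, timeout=timeout)
    except asyncio.TimeoutError:
        raise RpcTimeoutError(f"{what} got no reply within {timeout}s") from None


async def fake_reply(delay):
    """Stand-in for a server round trip; not real prserv or hashserv code."""
    await asyncio.sleep(delay)
    return "pong"


async def demo():
    # A prompt reply passes straight through.
    results = [await call_with_timeout(fake_reply(0), 1.0, "ping")]
    # A reply that takes too long becomes an error instead of a hang.
    try:
        await call_with_timeout(fake_reply(5), 0.05, "ping")
    except RpcTimeoutError as exc:
        results.append(str(exc))
    return results


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

With a wrapper of this kind a stuck selftest would fail quickly with a message naming the stalled call, rather than holding an autobuilder worker until the overall build timeout.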
|
|
Re: [PATCH 0/4] Re-implement prserv on top of asyncrpc
Richard Purdie
Hi Paul,
On Fri, 2021-05-28 at 09:42 +0100, Paul Barker wrote:
These changes replace the old XML-based RPC system in prserv with the ...

Thanks for these. Unfortunately I think there is still a gremlin somewhere, as this was included in an autobuilder test build that is showing as this:

https://autobuilder.yoctoproject.org/typhoon/#/builders/83/builds/2203

i.e. all four selftests have not finished and I'd have expected them to by now. I'm trying not to work today so I haven't debugged them or confirmed where they are hanging, but it seems likely related.

Cheers, Richard
|
|
SWAT Rotation
Alexandre Belloni
Hello 민재,
As discussed last week, you are on SWAT duty this week. Could you confirm you'll be able to work on the topic? I hope you moved without any issue.

Regards,
--
Alexandre Belloni, co-owner and COO, Bootlin
Embedded Linux and Kernel engineering
https://bootlin.com
|
|
qemuarm failure on autobuilder - analysis
Richard Purdie
Today's failure for analysis is a qemuarm failure where all the test images failed at the same time:

https://autobuilder.yoctoproject.org/typhoon/#/builders/53/builds/3493

We have Randy and team's logging for this here:

https://autobuilder.yocto.io/pub/non-release/20210526-2/testresults/qemuarm/2021-05-26--08-58/host_stats_1_top.txt

What is interesting is that the load average peaks at 300+ at the point the three qemu-system-arm processes are running, compared to the usual 50-80. What is the machine doing at the time? In parallel to qemuarm there appear to be:

reproducible-fedora (in stage B, non-sstate, i.e. from scratch) building llvm, webkitgtk, kernel-devsrc, qemu, kea
musl-x86-64 building webkitgtk, piglit, cmake, kernel-devsrc, llvm, stress-ng

which is a heavy workload, as those are all expensive targets to build. I question whether cmake's rpmbuild really should be using 7g of RES memory (15g VIRT).

The python bitbake-worker processes at near 100% cpu are interesting; I suspect those are python tasks for recipes. The code is meant to rename the process so we can better identify it, but that is a tricky thing under linux and this hints it may not be working.

Bottom line: this one looks load related.

Cheers, Richard
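
To compare load averages like the ones discussed above across builds, the peak can be extracted from the saved top output. The sketch below is an assumption-heavy illustration in Python: it assumes the host_stats file is a concatenation of `top` batch-mode snapshots with the usual English-locale summary line, and the sample values in `SAMPLE` are made up for the example, not taken from the real log. `load_samples` and `peak_load` are hypothetical helper names.

```python
import re

# Matches the summary line printed at the start of each top snapshot, e.g.
# "top - 08:59:00 up 3 days,  2:10,  1 user,  load average: 62.10, 58.33, 55.02"
LOAD_RE = re.compile(
    r"^top - (\d{2}:\d{2}:\d{2}) .*load average: ([\d.]+), ([\d.]+), ([\d.]+)",
    re.MULTILINE,
)


def load_samples(text):
    """Return (time-of-day, 1-minute load average) for every snapshot found."""
    return [(m.group(1), float(m.group(2))) for m in LOAD_RE.finditer(text)]


def peak_load(text):
    """Return the snapshot with the highest 1-minute load, or None if none found."""
    samples = load_samples(text)
    return max(samples, key=lambda s: s[1]) if samples else None


# Made-up sample data in the assumed format (not the real autobuilder values).
SAMPLE = """\
top - 08:59:00 up 3 days,  2:10,  1 user,  load average: 62.10, 58.33, 55.02
Tasks: 900 total,  40 running, 860 sleeping,   0 stopped,   0 zombie
top - 09:01:00 up 3 days,  2:12,  1 user,  load average: 312.40, 140.25, 85.70
top - 09:03:00 up 3 days,  2:14,  1 user,  load average: 95.00, 120.10, 90.30
"""

if __name__ == "__main__":
    when, load = peak_load(SAMPLE)
    print(f"peak 1-minute load {load} at {when}")
```

Lining the peak time up against the time the qemu images failed would give a quick first check on whether a failure is plausibly load related.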
|
|
Re: SWAT statistics for week 19
Alexandre Belloni
Hi,
On 25/05/2021 19:33:53-0400, Randy MacLeod wrote:
On 2021-05-18 6:26 p.m., Alexandre Belloni wrote:
Hi,
Hi Alexandre,

Ah sorry, I forgot to send it. It is up to date with what was triaged last week:

https://docs.google.com/spreadsheets/d/1bviDvW1SRwflofKLx9SwPUTWE3sBkvL3eb1PKnmZJcM/edit?usp=sharing

Tony is getting going on valgrind and he'll start with:

This is probably the best one to start with, the following one would be 14311 which has more occurrences but is about multiple (related) ptests.

Regards,
--
Alexandre Belloni, co-owner and COO, Bootlin
Embedded Linux and Kernel engineering
https://bootlin.com
|
|
Re: SWAT statistics for week 19
On 2021-05-18 6:26 p.m., Alexandre Belloni wrote:
Hi,
Hi Alexandre,

Any update on the list/spreadsheet?

Tony is getting going on valgrind and he'll start with:

https://bugzilla.yoctoproject.org/show_bug.cgi?id=14294
[Bug 14294] valgrind memcheck/tests/linux/timerfd-syscall ptest intermittent failure

unless there's another ptest issue that is more urgent.

../Randy

--Ross

# Randy MacLeod
# Wind River Linux
|
|
ubuntu2004-arm-1 load increase and possible instability
Michael Halstead <mhalstead@...>
The ubuntu2004-arm-1 worker has been unstable in the past and we reduced the number of simultaneous builds from 3 to 1 to see if that would stop the crashes. It didn't at first, but the crashes have now stopped, perhaps due to kernel updates.

I'm planning to increase the simultaneous builds back to 3 when the controller is next idle. This may cause the crashes to begin again and I want the SWAT team to be aware of the change.

Michael Halstead
Linux Foundation / Yocto Project
Systems Operations Engineer
|
|
Re: Further rcu stall on autobuilder
Richard Purdie
On Mon, 2021-05-24 at 15:29 +0100, Richard Purdie via lists.yoctoproject.org wrote:
On Mon, 2021-05-24 at 09:21 -0400, Bruce Ashfield wrote:
On Sun, May 23, 2021 at 12:56 PM Richard Purdie
About the time you were writing this, I'd hacked up:

I switched to Bruce's 5.12 patches. Unfortunately even with 5.12:

https://autobuilder.yoctoproject.org/typhoon/#/builders/81/builds/2118/steps/12/logs/stdio

:(

Also,

https://autobuilder.yoctoproject.org/typhoon/#/builders/110/builds/2362

and the corresponding:

https://autobuilder.yocto.io/pub/non-release/20210523-10/testresults/qemuarm-alt/2021-05-24--01-52/host_stats_1_top.txt

is interesting. That was a qemuarm-alt image (5.4 kernel) which could be a genuine load issue. It is getting 300% cpu though so hardly resource starved.

Ideas welcome at this point.

Cheers, Richard
|
|
Re: Further ltp hang - kernel issue?
Bruce Ashfield <bruce.ashfield@...>
On Mon, May 24, 2021 at 11:31 AM Richard Purdie
<richard.purdie@...> wrote:

Sent. I cherry picked it from my queue and sent it individually. I'll continue testing the rest of my updates.

Bruce

This is obviously a separate issue to the rcu stalls but I also think that ...

-- - Thou shalt not follow the NULL pointer, for chaos and madness await thee at its end - "Use the force Harry" - Gandalf, Star Trek II
|
|
Re: Further ltp hang - kernel issue?
Richard Purdie
On Sun, 2021-05-23 at 07:42 -0400, Bruce Ashfield wrote:
On Sun, May 23, 2021 at 6:36 AM Richard Purdie

Thanks for the patch, I ran with it for a number of runs. I have not seen .38 break in the way master or master-next with .37 did. I've run several and 50% of the time .37 would hang in ltp. Can we upgrade to .38 ASAP please? :)

This is obviously a separate issue to the rcu stalls but I also think that is 5.10 related.

Cheers, Richard
|
|
Re: Further rcu stall on autobuilder
Richard Purdie
On Mon, 2021-05-24 at 09:21 -0400, Bruce Ashfield wrote:
On Sun, May 23, 2021 at 12:56 PM Richard Purdie

About the time you were writing this, I'd hacked up:

http://git.yoctoproject.org/cgit.cgi/poky/commit/?h=master-next&id=de3e2253482b6d9df1137128a9fde35dec8fd915

and put it into a build on the autobuilder. It caused meta-arm to blow up and I suspect there may be other fallout but we'll see...

FWIW, I checked with Alexandre and it seems all the rcu failure issues are on qemuXXX builds but not qemuXXX-alt. The former is 5.10, the latter 5.4. I'm starting to strongly suspect there is some issue with 5.10 as we don't see this with dunfell or with poky-alt :/. I'd wonder why nobody else has noticed though...

Cheers, Richard
|
|
Re: Further rcu stall on autobuilder
Bruce Ashfield <bruce.ashfield@...>
On Sun, May 23, 2021 at 12:56 PM Richard Purdie
<richard.purdie@...> wrote:

I created the attached recipes. Built and booted on qemux86-64 with no issues.

I assume you'll set the appropriate preferred version in the test branches to make sure they are used instead of 5.10?

Bruce
-- - Thou shalt not follow the NULL pointer, for chaos and madness await thee at its end - "Use the force Harry" - Gandalf, Star Trek II
|
|
Re: SWAT Rotation
Alexandre Belloni
Hello Jaga,
Are you available for SWAT this week? I'll be looking at some of the failures today.

On 22/05/2021 11:44:56+0900, 김민재 wrote:
Hi Alexandre

Sure, no problem

--
Alexandre Belloni, co-owner and COO, Bootlin
Embedded Linux and Kernel engineering
https://bootlin.com
|
|
Re: Further rcu stall on autobuilder
Richard Purdie
On Sun, 2021-05-23 at 12:51 -0400, Bruce Ashfield wrote:
On Sun, May 23, 2021 at 12:47 PM Richard Purdie

A set of SRCREVs sounds like the best plan, I think it might be worth testing to see if things improve or not.

What is also odd is that in that same build, another qemu instance hung in syslinux loading bzImage. We've seen this before occasionally and it seems to keep happening periodically. That would seem more like a qemu bug, yet we're on the latest qemu release :/.

In neither case did Randy's stall detector trigger as far as I can tell.

Cheers, Richard
|
|
Re: Further rcu stall on autobuilder
Bruce Ashfield <bruce.ashfield@...>
On Sun, May 23, 2021 at 12:47 PM Richard Purdie
<richard.purdie@...> wrote: If you want to switch to linux-yocto-dev, it is on 5.12.x, and I have a local 5.13-rcX version of -dev. We could whip together a SRCREV recipe for it, if you don't want to use the AUTOREV. I'm not going to do a full versioned linux-yocto for 5.12, but we can special case this if we want to go that route. Bruce
-- - Thou shalt not follow the NULL pointer, for chaos and madness await thee at its end - "Use the force Harry" - Gandalf, Star Trek II
|
|
Further rcu stall on autobuilder
Richard Purdie
We've got yet another rcu stall failure on the autobuilder:
https://autobuilder.yoctoproject.org/typhoon/#/builders/80/builds/2123/steps/15/logs/stdio

and looking at the dmesg in the qemu log:

[ 20.424033] Freeing unused kernel image (rodata/data gap) memory: 652K
[ 20.425229] Run /sbin/init as init process
INIT: version 2.99 booting
FBIOPUT_VSCREENINFO failed, double buffering disabled
Starting udev
[ 20.547298] udevd[161]: starting version 3.2.10
[ 20.553329] udevd[162]: starting eudev-3.2.10
[ 20.751260] EXT4-fs (vda): re-mounted. Opts: (null)
[ 20.752548] ext4 filesystem being remounted at / supports timestamps until 2038 (0x7fffffff)
INIT: Entering runlevel: 5
Configuring network interfaces... RTNETLINK answers: File exists
Starting random number generator daemon.
Starting OpenBSD Secure Shell server: sshd done.
Starting rpcbind daemon...done.
starting statd: done
Starting atd: OK
[ 21.921925] Installing knfsd (copyright (C) 1996 okir@...).
starting 8 nfsd kernel threads: [ 23.066283] NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory
[ 23.068096] NFSD: Using legacy client tracking operations.
[ 23.069086] NFSD: starting 90-second grace period (net f0000098)
done
starting mountd: [ 45.272151] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 45.273423] rcu: 1-...0: (10 ticks this GP) idle=7ba/1/0x4000000000000000 softirq=598/612 fqs=5249
[ 45.274951] (detected by 2, t=21002 jiffies, g=-195, q=13)
[ 138.202149] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 332.762209] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:

This is with the kvm clock source disabled (in master-next) and with Bruce's 5.10.38 upgrade so that kind of rules out either of those two things for this issue. It also can't be the qemu platform or cpu emulation used since we've changed that.

What is really odd is that it never actually prints the stalled tasks. That seems really strange. It is obviously alive enough to print a stall message later but stalls out and is terminated after 1500s.
Really open to ideas at this point. Should we try a newer kernel version for testing in -next, see if we can isolate this to 5.10? Cheers, Richard
|
|
Re: Further ltp hang - kernel issue?
Bruce Ashfield <bruce.ashfield@...>
On Sun, May 23, 2021 at 6:36 AM Richard Purdie
<richard.purdie@...> wrote:

I can't think of anything specific that would cause those issues, but the Wind River guys did report some bad iommu patches that were part of 5.10.37.

I've merged .38, which has the fixes, but I haven't sent the bumps yet. It is worth trying the attached SRCREV patch, to see if there's any change in behaviour.

Bruce
-- - Thou shalt not follow the NULL pointer, for chaos and madness await thee at its end - "Use the force Harry" - Gandalf, Star Trek II
|
|
Re: Further ltp hang - kernel issue?
Richard Purdie
On Sun, 2021-05-23 at 11:33 +0100, Richard Purdie via lists.yoctoproject.org wrote:
https://autobuilder.yoctoproject.org/typhoon/#/builders/95/builds/1932

Oddly enough,

https://autobuilder.yoctoproject.org/typhoon/#/builders/95/builds/1933

on centos7-ty-4 (master build) is locked up with pretty much exactly the same issue/ps output/tests/dmesg. The first one above was debian10-ty-1 with master-next.

Recent kernel version bump?

Cheers, Richard
|
|