Yocto Autobuilder: Latency Monitor and AB-INT - Meeting notes: June 24, 2021
YP AB Intermittent failures meeting - June 24, 2021, 9 AM ET
https://windriver.zoom.us/j/3696693975

Attendees: Alex, Richard, Saul, Randy, Tony, Trevor, Sakib

Summary: Things are improving somewhat on the autobuilder; RCU stalls are now the top problem.

Meeting Notes:
==============

1. The most common problem is still the qemu RCU hang. Alex found the qemu machine protocol (QMP) debugging commands:

   {"execute": "qmp_capabilities"}
   {"execute": "dump-guest-memory", "arguments": {"paging": false, "protocol": "file:/tmp/vmcore.img"}}

This generates a kernel core dump. Saul is going to send a patch to do that when qemu hangs. (A minimal sketch of driving these commands over a QMP socket is attached at the end of these notes.)
RP investigated a few things (ACPI table alignment, etc.) and while there are differences between the crashes, nothing stands out as significant, so we still don't have a good understanding of the cause of the stalls. Richard summarized the situation here:
https://lists.yoctoproject.org/g/swat/message/177

2. We had an interesting AB failure: core-image-sato failed to start because the image copy took longer than 300 seconds:
https://autobuilder.yoctoproject.org/typhoon/#/builders/74/builds/3559/steps/13/logs/stdio
System load was ~140. There was nothing in the qemu log file, so qemu never started. There was a qemu mips in the logs, so qemu was running but not fast enough. No AB-INT log was produced, so the trigger was not effective. (We may have to change our trigger to use iostat or something else.) We can't tell from the logs whether the image 'start cmd' went over QMP; it should probably be put in the logs.
- Saul to add a log message after the QMP port has connected (maybe also add one before).
RP did some experiments with starting qemu in a controlled, stressful environment. He wasn't able to cause the RCU hang.
- stress-ng could cause qemu to pause itself but not crash. The pause may have been due to running out of disk space. qemu was running in snapshot mode initially, and after that also from a tmpfs like we use for our testing. The test generated a load of ~3000+ while the qemus were trying to boot.
Any testimage failure should run Sakib's report generator.
- Sakib to send a patch.

3. System load: make and ninja jobs.
make: Trevor and Randy were looking at some tools:
https://github.com/gscano/libjobserver.git
https://github.com/olsner/jobclient.git
that use the jobserver feature of make, but they seemed awkward and we weren't actually able to see them be useful. They passed self-tests in one case, but in the limited time we used them it seemed like a dead end. The primary purpose of the jobserver feature is to let a single recursive make limit the number of jobs dispatched. It is likely possible to have independent builds using 'make' co-operate, but we have not yet figured out how to do that. This document by Paul Smith:
http://make.mad-scientist.net/papers/jobserver-implementation/
explains in point 7 what needs to be set to use the jobserver, but exactly how to achieve that use wasn't yet clear to us.
Luckily, Trevor noticed in the make source that a patch was added in 2017 to make the load average calculation more timely:
https://github.com/mirror/make/blob/master/src/job.c#L1947
https://github.com/mirror/make/commit/d8728efc80b720c630e6b12dbd34a3d44e060690
We're confirming that this actually works on a build server and for all versions of make in the cluster. We can just re-use the existing variable, PARALLEL_MAKE = "-j 4", and add -l <NUM> in addition to -j <NUM> (see the sketch just after this item).
- Randy may write to Paul Smith about the general problem we are having, since they worked together years ago.
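To make the -l idea concrete, here is a hedged local.conf sketch; the -j and -l values are placeholders rather than numbers agreed in the meeting, and only make's load-average handling was discussed here:

   # Hypothetical local.conf snippet: cap make by load average (-l) as well
   # as by job count (-j). The numbers are placeholders only.
   PARALLEL_MAKE = "-j 4 -l 4"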
4. ptest issues are improving. Valgrind ptest results are getting better. Thanks, Tony!

5. We discussed Sakib's summary script. It's coming along.
TO DO:
- collapse the compile pipeline (cc1, as, ...) to one line.
The update was just merged to master-next; let us know if you'd like to see other info in the summary or the logs.

6. Timeouts:
- qemu-runner(?) timeout increase 120 -> 240: the qemu-runner timeout has been increased.
- ptest timeouts 300 -> 450? Not happening.

7. The iostat output is in some of the AB logs:
https://autobuilder.yocto.io/pub/non-release/
for example:
https://autobuilder.yocto.io/pub/non-release/20210623-17/testresults/meta-oe/2021-06-23--21-51/host_stats_0_top.txt
(search for "start: iostat"). It looks like the IO sub-systems are 100% utilized, but we need more time, data and the summary script to easily make a general statement about AB-INT issues and IO load, and to identify what is typically generating the IO load.

Plans for the week:

Richard: RCU stall
Alex: qemu debugging of core
Sakib: testimage failure dump summary
Trevor: make jobserver
Tony: nothing this week for YP
Saul: QMP logs
Randy: make job w/ Trevor, herd cats!! Here kitty,....

../Randy
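Attachment: a minimal, untested sketch of driving the dump-guest-memory commands from item 1 over a QMP unix socket. This is an illustration only, not the patch Saul is preparing; it assumes qemu was started with something like '-qmp unix:/tmp/qmp.sock,server,nowait', and the socket and dump paths are placeholders.

   #!/usr/bin/env python3
   # Send the QMP capability handshake, then ask qemu to dump guest memory.
   # Socket and output paths are placeholders.
   import json
   import socket

   QMP_SOCK = "/tmp/qmp.sock"            # placeholder: qemu's -qmp unix socket
   DUMP_TARGET = "file:/tmp/vmcore.img"  # as quoted in item 1

   def qmp_command(chan, cmd):
       """Send one QMP command and return the first reply that isn't an event."""
       chan.write(json.dumps(cmd) + "\n")
       chan.flush()
       while True:
           line = chan.readline()
           if not line:
               raise RuntimeError("QMP connection closed")
           msg = json.loads(line)
           if "return" in msg or "error" in msg:
               return msg

   with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
       sock.connect(QMP_SOCK)
       chan = sock.makefile("rw", encoding="utf-8")
       json.loads(chan.readline())                      # discard the QMP greeting
       qmp_command(chan, {"execute": "qmp_capabilities"})
       print(qmp_command(chan, {
           "execute": "dump-guest-memory",
           "arguments": {"paging": False, "protocol": DUMP_TARGET},
       }))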