Yocto Autobuilder: Latency Monitor and AB-INT - Meeting notes: July 29, 2021
YP AB Intermittent failures meeting
===================================
July 29, 2021, 9 AM ET
https://windriver.zoom.us/j/3696693975

Attendees: Tony, Richard, Trevor, Randy, Sakib!

Summary:
========

ptest failures are better again, but there's still room for improvement.
The make/ninja load average limit is in, but it's not clear yet whether
it's effective.

We tried mechanically summarizing all the high latency top logs for the
last week. No firm conclusions, but initial thoughts are below.

If anyone wants to help, we could use more eyes on the logs, particularly
the summary logs, and on understanding iostat # when the dd test times out.

Plans for the week:
===================

All: Wait and see if the ptest failure rate continues to be lower than
     previous weeks.
Richard:
Alex:
Sakib: Hook the more responsive load average into the latency test. (v2)
Trevor: Patch to set PARALLEL_MAKE : -l 50 -> dunfell, gatesgarth, hardknott
        Investigate dunfell, which failed with this change.
Tony:
Saul: on vacation
Randy: Look at performance data

Meeting Notes:
==============

1. job server
   - ninja could be patched with make's more responsive algorithm next,
     or is this good enough?
   - Richard suggested that we extract make's code for measuring the load
     average to a separate binary and run it in the periodic io latency
     test. Also, can we translate it to python?
   - Trevor is working on this and had some problems, so next week.

2. AB status
   ptest cases are improving, we may be close to done!
   Let's wait a week to see how things go.
   (July 29: we're not done...)
   - development week with lots of failures and a-quick builds, so it's
     hard to say.

3. Nothing new on this item this week (July 29):
   Richard reported
   - something really flaky going on with serial ports.
   - particularly bad on qemuppc but also x86.
   - related to Saul's QMP data dump?
   - July 22/29: We didn't talk about this issue this week.

4. Sakib's improvements to the logging are merged.
   We think Michael needs to update the script that generates the web page.
   Randy/Sakib to talk with Michael.
   -- Done.

   Sakib generated a summary of all high latency 'top' logs from
   ~July 23 -> July 29 by just running his summary script on the merged
   raw top logs. See all_summary attached.

   You can see which compilation jobs are most frequently associated with
   high latency events by:

   $ grep GCC ~/Downloads/all_summary.txt | less

   They are: linux-yocto, llvm, qemu, gtk+,

   # quick top 10 list:
   $ grep GCC ~/Downloads/all_summary.txt | grep cc | head -10 | \
       cut -d"/" -f1,4,5,8
   104 ~/genericx86_64-poky-linux/linux-yocto/x86_64-poky-linux
    94 ~/core2-32-poky-linux-musl/llvm/i686-poky-linux-musl
    89 ~/i686-nativesdk-pokysdk-linux/nativesdk-qemu/i686-pokysdk-linux
    74 ~/core2-32-poky-linux-musl/gtk+3/i686-poky-linux-musl
    64 ~/build-st/reproducibleB/core2-64-poky-linux
    59 ~/cortexa8hf-neon-poky-linux-gnueabi/qemu/arm-poky-linux-gnueabi
    53 ~/qemux86-poky-linux/perf/i686-poky-linux
    40 ~/core2-64-poky-linux/glibc/x86_64-poky-linux
    39 ~/build-st/reproducibleA/core2-64-poky-linux
    38 ~/ppc7400-poky-linux/ofono/powerpc-poky-linux

   If you look at the non-GCC activities that are not part of the base OS
   activities, you see processes such as:
      make, mv, perl, tar, pseudo, rm, ninja

   $ grep -v GCC ~/Downloads/all_summary.txt | grep -A 33 "Userspace Process Summary:"
   Userspace Process Summary:
   12326 bitbake-server
   12145 python3
    8112 /bin/sh
    7213 /bin/bash
    5207 make
    1329 /usr/bin/python3
     826 mv
     758 (sd-pam)
     715 perl
     694 x86_64-poky-linux-gcc
     620 top
     587 sshd:
     580 bash
     566 /lib/systemd/systemd
     561 -bash
     476 tar
     398 sh
     397 arm-poky-linux-gnueabi-gcc
     386 /usr/bin/dbus-daemon
     382 gcc
     379 /usr/sbin/irqbalance
     379 /sbin/agetty
     373 ~/pkgman-rpm-non-rpm/build/build/tmp/sysroots-components/x86_64/pseudo-native/usr/bin/pseudo
     360 qmgr
     360 dpkg-deb
     358 pickup
     356 rm
     351 as
     335 /usr/sbin/cron
     333 /usr/sbin/rsyslogd
     326 /usr/sbin/atd
     314 /usr/sbin/sshd
     296 ninja

   mv is likely blocked on IO (Sakib, please confirm from logs).

   Since make shows up far more often than ninja, we may be able to better
   control the load using the 'load average' limit and not have to patch
   ninja (with make's enhancement) to be more responsive.

   Attachments:
   a. script to gather the file: sum_sum.py
   b. ./summarize_to_outup.py all <directory w/ all the interval files>

   More analysis required....

5. (From July 8)
   Richard says that we may need to redesign the data collection system
   that Sakib's AB INT tests are based on. He was worried the current
   approach does NOT cover oe-selftests, but we do see it when we see the
   AB-INT trigger from builds. Not sure if we need to change anything yet.
   Everything goes through run-command in yocto-ab-helper.

Still relevant parts of
Previous Meeting Notes:
=======================

4. bitbake server timeout (no change July 29)

   "Timeout while waiting for a reply from the bitbake server (60s)"

   Randy mentioned that the bitbake server timeouts seen in the Wind River
   build cluster have gone away after upgrading to a newer version of docker.
   Old: Docker Version: Docker version 18.09.4, build d14af54266
   New: Docker Version: Docker version 20.10.7, build f0df350

   Clearly the YP ABs aren't running in docker, but what about firmware and
   kernel tunings?
   Michael, is the BIOS/firmware kept up to date on most nodes?

   - July 22: This was done. For the performance builder trend see:
     https://autobuilder.yocto.io/pub/non-release/20210721-9/testresults/buildperf-centos7/perf-centos7.yoctoproject.org_master_20210721150057_1ad79313a5.html
     https://autobuilder.yocto.io/pub/non-release/20210721-14/testresults/buildperf-ubuntu1604/perf-ubuntu1604_master_20210721210034_1ad79313a5.html

     Summary,
     - CentOS-7 seems to take less time (~ 1 min),
     - Ubuntu-16.04 seems to take more (~ 5 min)
     That's a bit surprising!
     Randy to look at 62659 commit number in poky.

5. io stalls (no update: July 29)
   Richard said that it would make sense to write an ftrace utility /
   script to monitor io latency, and we could install it with sudo.
   Ch^W mentioned ftrace on IRC.
   Sakib and Randy will work on that, but not for a week or two.

../Randy
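
Re item 1 in the Meeting Notes (can the load average measurement be
translated to python?): a minimal sketch, not make's actual code. It
assumes the "more responsive" measure is the instantaneous runnable-task
count in the fourth field of /proc/loadavg rather than the 1-minute
average; that is an assumption here and should be checked against make's
source. The names LoadSample, parse_loadavg, read_loadavg and
load_too_high are hypothetical, and the limit of 50 is only taken from the
"-l 50" setting in the plans above.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class LoadSample:
    avg_1m: float
    avg_5m: float
    avg_15m: float
    runnable: int  # scheduling entities runnable right now
    total: int     # scheduling entities that currently exist


def parse_loadavg(text: str) -> LoadSample:
    """Parse '<1m> <5m> <15m> <runnable>/<total> <last pid>'."""
    fields = text.split()
    runnable, total = fields[3].split("/")
    return LoadSample(
        float(fields[0]), float(fields[1]), float(fields[2]),
        int(runnable), int(total),
    )


def read_loadavg(path: str = "/proc/loadavg") -> LoadSample:
    """Take one sample from the kernel (Linux only)."""
    return parse_loadavg(Path(path).read_text())


def load_too_high(sample: LoadSample, limit: float) -> bool:
    """Assumed check: instantaneous runnable count versus the limit."""
    return sample.runnable > limit


if __name__ == "__main__":
    sample = read_loadavg()
    # 50 mirrors the '-l 50' value from the plans; adjust as needed.
    print(sample, "too high:", load_too_high(sample, 50))
```

Something like this could be called from the periodic io latency test so
that each latency sample is logged next to both the 1-minute average and
the runnable count, which would show how far the two diverge during a
stall.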