Yocto Autobuilder: Latency Monitor and AB-INT - Meeting notes: Oct 7, 2021
YP AB Intermittent failures meeting
===================================

https://windriver.zoom.us/j/3696693975

Attendees: Richard, Trevor, Randy, Saul

Summary:
========

Ptest results continue to improve yet again but there's still room for
even more improvement.

Alex made a graph of the number of AB INT issues per week:
  https://bootlin.com/~alexandre/SWAT_stats.png
We assume that weeks 15 and 16 were when the RCU bug in the kernel
started being a problem and week 29 was when it got fixed, but more
careful analysis is required.

The make/ninja load average limit is in but it's not clear if it's
effective yet, and it breaks dunfell. Trevor has a build of dunfell
that, with some patches, appears to work.

If anyone wants to help, we could use more eyes on the logs,
particularly the summary logs, and help understanding the iostat
numbers when the dd test times out.

Plans for the week:
===================

Richard: QA results for M4, etc.
Alex: ?
Sakib: hook the more responsive load average into the latency test. (v3)
Trevor: patch to set PARALLEL_MAKE: -l 50 -> dunfell, gatesgarth,
        hardknott (Aug 5, Oct 7)
        Confirm that dunfell works now, test other branches.
        (See the local.conf sketch after the Meeting Notes below.)
Saul: SBOM
Randy: # processes graph of full builds, patch ninja, graph it.
Kiran: SBOM

Nothing much new below here. Keeping the list since it's still to-do.

../Randy

Meeting Notes:
==============

1. job server - ninja could be patched with make's more responsive
   algorithm next, or is this good enough?

   Aug 26: Randy made some graphs showing that with -l NUM the number
   of compile jobs oscillates *wildly* between 0 and 200 on a 192-core
   builder compiling chromium.

   What I did was:
     $ bitbake -c cleansstate chromium-x11
     $ bitbake -c configure chromium-x11
     $ bitbake -c compile chromium-x11
   and while that compile was running:
     $ while [ ! -f /tmp/compiling-chromium-is-done ]; do \
         cat /proc/loadavg >> procs-load.log ; sleep 0.5 ; done

   Results so far: https://postimg.cc/gallery/3hjfYfG/f8f46c97

   Next step is either:
   a. collect data as above for an image build and see if the
      sub-optimal ninja behaviour makes a difference, and/or
   b. patch ninja with make's more responsive load average algorithm:
      https://git.savannah.gnu.org/cgit/make.git/commit/?id=d8728efc8

   - Richard suggested that we extract make's code for measuring the
     load average into a separate binary and run it in the periodic io
     latency test. Also, can we translate it to python?
   - Trevor is working on this and had some problems, so next week.
     (Aug 19 - Trevor is back from vacation so maybe next week.)
   - Trevor to see if the load average change really did reduce load
     on WR build systems. (Aug 19)

2. AB status
   Trevor is learning about buildbot and working on a scheduling bug
   (CentOS worker?).
   The bitbake layer setup tool should allow multiple backends,
   e.g. kas, a y-a-helper.
   ptest cases are improving; we may be close to done!
   Let's wait a week to see how things go.
   (July 29, Aug 5, Aug 19: we're not done...)
   - lttng-tools ptest is failing. RP is working on it with upstream.
     The timeout increase (done on Aug 5) hasn't helped.

3. Sakib's improvements to the logging are merged.
   Sakib generated a summary of all high latency 'top' logs from
   ~July 23 -> July 29 by just running his summary script on the
   merged raw top logs. More analysis required...
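To make the -l 50 plan mentioned above a bit more concrete, here is a
minimal local.conf sketch for trying the load limit on a single
builder. This is only an assumption about how it could be set for
local testing; the real patches may wire it up elsewhere (autobuilder
config or bitbake defaults), and the -j value is just a placeholder
for the builder's core count.

  $ # run from the build directory
  $ cat >> conf/local.conf << 'EOF'
  # -j caps the number of parallel jobs; -l 50 asks make (and ninja,
  # where the recipe class passes it through) not to start new jobs
  # while the load average is above 50.
  PARALLEL_MAKE = "-j 48 -l 50"
  EOF

With this in place, the same /proc/loadavg sampling loop from item 1
can be rerun to see whether the oscillation between 0 and 200 compile
jobs settles down.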
Still relevant parts of Previous Meeting Notes:
===============================================

4. bitbake server timeout (no change: July 29, Aug 19, Oct 7)
   "Timeout while waiting for a reply from the bitbake server (60s)"

5. io stalls (no update: July 29, Oct 7)
   Richard said that it would make sense to write an ftrace utility /
   script to monitor io latency, and we could install it with sudo.
   Ch^W mentioned ftrace on IRC.
   Sakib and Randy will work on that, but not for a week or two or
   longer! (Aug 19)

   Randy collected iostat data on 3 build servers:
     https://postimg.cc/gallery/8cN6LYB
   We agreed that having -ty-2 at ~100% utilization for many hours in
   a row is not acceptable and that a threshold of ~10 minutes at
   100% utilization may be a reasonable limit (a rough monitoring
   sketch follows below).
   I need to figure out if I can get data on the fraction of IO done
   per IO class, since we do use ionice for clean-up and other
   activities.
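   Until the ftrace tooling exists, a crude stop-gap for the
   "~10 minutes at 100% utilization" threshold could be scripted
   around iostat. The sketch below is only an illustration, not the
   autobuilder's actual latency test; the device pattern (sd*), the
   99% cutoff, the 60-second interval, and the log file name are all
   assumptions.

     $ # Assumes sysstat's iostat, gawk, and build disks named sd*;
     $ # %util is the last column of 'iostat -x' output.
     $ iostat -dxz 60 | gawk '
         $1 ~ /^sd/ {
             util = $NF
             # count consecutive 60s samples where the device is ~saturated
             busy[$1] = (util >= 99.0) ? busy[$1] + 1 : 0
             if (busy[$1] == 10)   # ~10 minutes at ~100% utilization
                 print strftime("%F %T"), $1, "saturated for ~10 minutes"
             fflush()              # flush so the log updates as samples arrive
         }' >> io-saturation.log

   This only looks at total %util per device; the per-IO-class
   breakdown mentioned above would still need a different data source.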
../Randy