Yocto Autobuilder: Latency Monitor and AB-INT - Meeting notes: Aug 19, 2021


Randy MacLeod
 

YP AB Intermittent failures meeting
===================================
Aug 19, 2021, 9 AM ET
https://windriver.zoom.us/j/3696693975

Attendees: Tony, Richard, Trevor, Randy, Alex, Saul


Summary:
========

ptest results continue to improve but there's still room
for even more improvement.

The make/ninja load average limit is in but it's not clear
if it's effective yet and it breaks dunfell. Trevor investigating.

There's not much new this week, I've commented on a few existing
activities below and added "Aug 19" in most cases.


If anyone wants to help, we could use more eyes on the logs,
particularly the summary logs and understanding iostat #
when the dd test times out.



Plans for the week:
===================

Richard: lttng-tools and more!
Alex: SWAT plans. September email, training.
Sakib: hook more responsive load average in to latency test. (v3)
Trevor: patch to set PARALLEL_MAKE : -l 50
-> dunfell, gatesgarth, hardknott (Aug 5 - it's a priority)
Investigate dunfell which failed with this change.
- data on WR AB load average.
Tony: go back to school. Thanks for all your work Tony!
Saul:
Randy: Gather more iostat data, graph it!

Meeting Notes:
==============

1. job server

- ninja could be patched with make's more responsive algorithm
next or is this good enough?

- Richard suggested that we extract make's code for measuring the load
average to a separate binary and run it in the periodic io latency
test. Also can we translate it to python?
- Trevor is working on this and had some problems so next week.
(Aug 19 - Trevor is back from vaction so maybe next week.)

- Trevor to see if the load average change really did reduce load
on WR build systems. (Aug 19)

2. AB status

Trevor is learning about buildbot and working on a scheduling bug
(CentOS worker?)

bitbake layer setup tool should allow multiple backends:
eg: kas, a y-a-helper.

ptest cases are improving, we may be close to done!
Let's wait a week to see how things go.
(July29, Aug 5, Aug 19, we're not done...)

- lttng-tools ptest is failing. RP is working on it with upstream.
The timeout (done on Aug 5) increase hasn't helped.


3. Sakib's improvements to the logging are merged.

Sakib generated a summary of all high latency 'top' logs from
~July 23->July 29 by just running his summary script on the
merged raw top logs.

More analysis required....


Still relevant parts of
Previous Meeting Notes:
=======================


4. bitbake server timeout ( no change july 29, Aug 19)

"Timeout while waiting for a reply from the bitbake server (60s)"

5. io stalls (no update: July 29)

Richard said that it would make sense to write an ftrace utility
/ script to monitor io latency and we could install it with sudo
Ch^W mentioned ftrace on IRC.
Sakib and Randy will work on that but not for a week or two
or longer! (Aug 19).

Randy collected iostat data on 3 build server:
https://postimg.cc/gallery/8cN6LYB
We agreed that having -ty-2 be ~ 100 utilization for many hours
in a row is not acceptable and that a threshold of ~ 10 minutes
at 100% utilization may be a reasonable limt. I need to figure out
if I can get data on the fraction of IO done per IO clas since
we do use ionice to do clean-up and other activities.


../Randy

Join yocto@lists.yoctoproject.org to automatically receive all group messages.