Yocto Technical Team Minutes, Engineering Sync, for June 22, 2021

Trevor Woerner

Yocto Technical Team Minutes, Engineering Sync, for June 22, 2021
archive: https://docs.google.com/document/d/1ly8nyhO14kDNnFcW2QskANXW3ZT7QwKC5wWVDg9dDH4/edit

== disclaimer ==
Best efforts are made to ensure the below is accurate and valid. However,
errors sometimes happen. If any errors or omissions are found, please feel
free to reply to this email with any corrections.

== attendees ==
Trevor Woerner, Stephen Jolley, Michael Halstead, Paul Barker, Richard
Purdie, Scott Murray, Steve Sakoman, Tony Tascioglu, Trevor Gamblin,
Alejandro Hernandez, Alexandre Belloni, Jan-Simon Möller, Randy MacLeod

== notes ==
- 3.4 m1 (honister) through QA, couple issue found, in review by TSC to decide
on release
- 3.1.9 (dunfell) currently being build then sent to QA
- thanks to PaulG for tracking down null pointer in cgroups mount code (and
others) that have led to solving some AB hangs
- the rcu dump AB vm hangs are still occurring
- 2 new manual sections: reproducible builds, and YP compatible
- multiconfig changes continue to cause issues
- still lots of AB-INT issues, we’re working on trying to define the load
pattern that causes these issues

== general ==
RP: special thanks to Paul Gortmaker for finding the null pointer issue to fix
the LTP builds

PaulB: still working on the pr-server changes. my changes work on my setup but
don’t seem to do well on the AB. if i enable parallelism in my builds
then the oom-killer gets busy. it’s annoying that the output from a test
is buffered until the test is finished, so if a test hangs then i can’t
find out which one is hanging. it looks like we’ll have to dig into
async-io or python parallelism to see if there’s something wrong there
or at least figure out how to get more output before a test finishes
RP: did you look at the trace output i found
PaulB: yes, but not exactly sure what i’m looking at
RP: if bitbake server starts one of the servers, then it’s responsible for
taking it out when it shuts down. so they run individually, but there is
some sharing. that traceback suggests to me that it’s trying to shut down
the pr server at shutdown time but not succeeding. sometimes the sqlite
database takes time to sync to disk for example (especially when there
is high io) but bitbake ends up blocked. it looks like some contention
between the parse threads and the locks. when doing python parallel there
are 2 systems and we have to make sure to not mix the two.
PaulB: in one case it looks like one of the comm sockets is already closed,
but the server is waiting for it to close again, but i would think we’d
get the same traceback every time
RP: maybe it’s stuck somewhere where it shouldn’t be, which makes it skip
some error/exit handling. or maybe add more debug code
PaulB: yes, i think adding timeouts to every wait/callback to see which ones
are timing out
RP: can python give you a dump of what’s in every waitstate?
PaulB: not sure, i’ll have to look at what python provides. bitbake isn’t
what’s telling prserv to exit on the cluster, but it is on my builds
RP: that seems unlikely
PaulB: i can’t generate the same conditions locally as what we’re seeing
on the AB. because of thread contention there seems to be, perhaps, a
resource exhaustion that i can’t reproduce
RP: put a “sleep 5” in the shutdown path in the prserv. it would cause
hashserve to exist longer than bitbake
PaulB: or backout the use of multiprocessing, keep the new json rpc system.
but do a more traditional fork() to start the process up
RP: i suspect we did the fork() for a reason
PaulB: multiprocessing didn’t exist back then
RP: it did, but it was broken. i sent a cooker daemon log which should show
what was running at the time. so we should be able to reverse engineer
what was going on at that time. i’m getting good at reading those, so i
can probably read them and tell you what tests were running at the time
PaulB: if we could just run that one test
RP: i would run just the prserv selftest and put that sleep in there. i’m
convinced it’s one of them
PaulB: i’m afraid i’m going to have to hand this off to someone else.
i’ll give one more push but then i’m out of ideas
RP: okay. it’d be good to get that in because i’m looking forward to it

TrevorW: move to SPDX license identifiers
RP: we have
TrevorW: classes and ptests yes, but not recipes and other places of oecore
RP: any place where we’re writing python we have. it’s a nice cleanup
thing, we’ve been embracing it. we’ve made some of the changes.
we’ve done it with code. the problem with recipes is that if we put that
at the top of the license it’ll get confused with the LICENCE field. so
we’re not quite sure what to do in that case
PaulB: i've been using a linting tool that checks for these identifiers.
again, the problem of what license to put on the recipe and for patches
etc (https://reuse.software/spec)
RP: licenses in patches in something to think about and would make upstreaming
easier. we’ve done the low hanging fruit but need more thought for other
cases. there’s also a standard form for copyright headers too
PaulB: yes, there’s been an update to the copyright format as well
RP: YP now has attained silver status for core infrastructure in LF. to get to
gold we need copyright on all files and many of our recipes don’t
ScottM: if there’s a statement at the top or the repo doesn’t that help
RP: i don’t think it accounts for that. and then we have to decide if these
are oe contributors or yocto contributors. so i left it at that. if anyone
wants to move this forward i’m definitely interested. there are some
questions we need to answer. our code is good at having copyright, but the
problem is with recipes
RP: there are a couple things we need to get gold status: 2-factor auth for
git pushes, some https issues (best practices changed), and licensing

TrevorG: working on the job server. create a dir in tmp to create the fifo,
but not seeing any difference in performance. i’m not sure kea is the
right way to go
RP: it was deleting the fifo
TrevorG: yes, at the end of the day it’s using the one that’s tied to the
user process
RP: that’s the correct way to do it. if it already exists then it should
get an EXISTS error and just go ahead and user it. as long as it’s not
deleting it
TrevorG: i’ve probably left something out, so i’ll look through it and see
RP: it was a POC i had written, but i can't remember how much testing it got.
these rcu issues seem to be related to a peak-load issue, so it’d be
nice to dig in and get to the core of the issue
TrevorG: tl;dr: no definitive results yet
Randy: make has the ability to look at the system load and use that to decide
whether or not to add more load, is there a reason we’re not considering
RP: good question. we’ve talked about it before, but i can't remember why we
decided not to go that route

RP: any updates on the rcu things AlexB? we had another one over the weekend
AlexB: right now i don’t have anything to say because i believe this
is something that is stuck in userspace. so i’m debugging, but
unfortunately the kernel we’re using doesn’t have debugging enabled,
so i want to build another kernel with debugging enabled so we can test
with debugging enabled. maybe we can carry a patch for the qemu kernels
on the AB which will enable more debugging. we’re also missing a few
important symbols that will help us see more. so i don't think these core
dump are important as-is.
RP: which debug symbols are you missing? our dumps have a lot of things
AlexB: debuginfo. there’s the risk that a new kernel will change behaviour
but we should try
RP: because we’re so reproducible, we should be able to debug easier. would
a complete kernel build help
AlexB: config?
RP: x86-64 musl. we should change the qmp code to run the commands that we
AlexB: yes. i’m carrying the 2nd qmp socket patch (instead of going through
the close)
RP: it’d be nice to always have this
AlexB: how do we detect the hang
RP: there are timeouts, so if we hit the timeouts then we assume there’s a
AlexB: we do that before killing…
RP: yes. if you have notes about how to set this up with gdb i’d like to try
replicating it
AlexB: okay. i was hoping this was a problem in the kernel and that the answer
would be right there. we’ve also seen the load of qemu going to 300 to
400 then 300 then 400.
RP: it’s like something is waiting for something else to happen
AlexB: you straced it
RP: yes, while pinging it. and we could see the straces going into the ping
AlexB: yes, which means the kernel is still alive, but the rcu is saying that
the kernel is not alive, so something is confused. something seems to be
waiting for something to happen that might never happen
Randy: so no progress on the rcu stalls?
RP: yes. we’ve made progress at poking qemu, and we have this core dump, but
no insight into what is causing it
Randy: has anyone tried to reproduce it? i’ve got a busy system and i’ve
run stress-ng
RP: i haven’t tried that.
Randy: where does it happen
RP: ubuntu 18.04
AlexB: but maybe that’s because that’s what most of our AB is running?
RP: that’s the info we have. if we can reproduce it then that would be
wonderful since that would make it easier to debug. making qmp better is
worth it in any case.

Randy: what about the kernel change PaulG made. i’d like to hammer on it
RP: i’ve done some hammering and haven’t seen any issues yet
Randy: i’ll do some hammering tonight so we can add some stats
RP: we’ve see one arm AB issue reading a message in /proc, but we’ve only
seen it once
Randy: we don’t have any arm builders. what are you using?
RP: ampere (something) and a huawei (something). not sure what the trigger is;
no idea how to reproduce. it’s a hanging read(), but you can still ssh
into the system. we may end up disabling the test

PaulB: back to copyright headers. i’ve pasted the output of a run i’ve
just done looking for licenses. there’s probably some low-hanging fruit
there and some false positives.