Yocto Technical Team Minutes, Engineering Sync, for October 6, 2020
== disclaimer ==
Best efforts are made to ensure the below is accurate and valid. However,
errors sometimes happen. If any errors or omissions are found, please feel
free to reply to this email with any corrections.
== attendees ==
Trevor Woerner, Stephen Jolly, Armin Kuster, Josef Holzmayr, Richard
Purdie, Joshua Watt, Trevor Gamblin, Steve Sakoman, Paul Barker, Saul
Wold, Stacy Gaikovaia, Rob Woolley, Randy MacLeod, Michael Halstead, Jon
Mason, Ross Burton, Jan-Simon Möller, Mark Hatle, Scott Murray, Vikram
Subramanian, Tim Orling, Denys Dmytriyenko, Bruce Ashfield, Christopher
Larson, Martin Jansa (JaMa)
== notes ==
- m3-rc2 released
- well into m4
- 3.1.3 out of QA (in review by TSC)
- ready for m4, won’t be built until pseudo cleared up
- large number of intermittent AB issues
== general ==
RP: 3.1.3 has a ptest regression in perl, but it was due to the test, not
perl. some tests need fixing for 3.1.4
SS: it was a problem with the count of the number of tests
RP: we’re a little behind schedule for m4, but we need to get pseudo under
RP: if you create a file in a pseudo context (LD_PRELOAD), e.g. in
do_package_qa, do_install, etc., then delete it outside that context,
then go back into pseudo and try to manipulate the file, pseudo gets
confused (it assumes that if it has the same inode then it’s the same
file)
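
A minimal sketch of the failure mode RP describes, as a toy model only. It is not pseudo's real data structure or API; the class name, paths and inode number are hypothetical and chosen purely for illustration.

```python
# Toy model only: this is NOT pseudo's real database or API.
class ToyInodeDb:
    """Maps inode number -> (path, recorded owner), standing in for a
    pseudo-like database that identifies files by inode."""

    def __init__(self):
        self.by_inode = {}

    def record(self, inode, path, owner):
        self.by_inode[inode] = (path, owner)

    def lookup(self, inode):
        return self.by_inode.get(inode)


db = ToyInodeDb()

# 1. Inside the tracked context: a file is created and recorded.
db.record(inode=1234, path="/work/image/etc/passwd", owner="root")

# 2. Outside the tracked context: the file is deleted. The database is
#    never told, so the entry for inode 1234 is now stale.

# 3. The filesystem later hands inode 1234 to an unrelated new file
#    (hypothetical path).
new_path = "/work/sysroot/usr/bin/tool"

# 4. Back inside the tracked context: the inode matches, so the toy
#    database returns the old file's record for the new file.
stale = db.lookup(1234)
mismatch = stale is not None and stale[0] != new_path
print(mismatch)
```

In this model the only way to notice the problem is to compare the recorded path against the path actually being accessed, which is roughly the kind of inconsistency the abort discussed below is meant to flag.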
RP: one solution is to add path filtering so that pseudo will ignore certain
files in certain paths (e.g. sysroot native). the more paths we filter
out from the pseudo database the less likely we’ll trip over inode
issues. pseudo needs to be unloaded when qemu is run (since they don’t
get along). we generally don’t want the files that end up in the deploy
folder to be under pseudo, but the tasks that do that (put files in the
deploy folder) need to run under pseudo. there’s also the issue of hard
linking: sometimes there’s a file under pseudo’s control but then a
hard link is made to it, so now pseudo needs to know that the new path
is the same file as the one it already knows about.
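
A rough sketch of what prefix-based path filtering could look like. This is an assumption-laden illustration, not pseudo's implementation; the function name `is_ignored` and the example directories are hypothetical.

```python
import posixpath


def is_ignored(path, ignore_prefixes):
    """Return True if path equals, or lies below, any ignored prefix.

    Matching is done on whole path components, so "/build/tmp/deploy"
    does not accidentally match "/build/tmp/deploy-other".
    """
    path = posixpath.normpath(path)
    for prefix in ignore_prefixes:
        prefix = posixpath.normpath(prefix)
        if path == prefix or path.startswith(prefix + "/"):
            return True
    return False


# Hypothetical ignore list, for illustration only.
IGNORE = ["/build/tmp/sysroots-native", "/build/tmp/deploy"]

print(is_ignored("/build/tmp/deploy/images/core.img", IGNORE))
print(is_ignored("/build/tmp/deploy-other/file", IGNORE))
print(is_ignored("/build/tmp/work/foo/image/etc", IGNORE))
```

Any path for which the check returns True would simply never be entered in the database, so it cannot later collide with a reused inode. A filter like this does not address the hard link case, where two different paths refer to the same file.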
RP: right now we’ve added code so that pseudo aborts should it find an
inconsistency. so before releasing:
- do we add path filtering: i think yes
- do we have pseudo abort if it finds inconsistencies: ??
MarkH: could we do an insane_skip?
RP: just to be clear, the aborts don’t happen consistently in all cases,
they hit randomly. i added code to do a sanity check on the db and it
found stuff, but it’s not practical to run sanity tests during a live
build. a human can look at the issues and know “this is sane, this one
isn’t” but coding it isn’t easy
MarkH: we should have clear documentation, because layers are going to hit it
RP: i was hoping these fixes would improve the builds (time, space) but it
only seems to affect space
Josef: is this a new issue?
RP: problem has been there for years. i can point to bugs 3-4 years ago that
had weird permission changes that couldn’t be explained. now i can see
this behaviour was the cause. it’s probably been in the pseudo code from
the start. why are we seeing it now? because the AB is so extremely busy
and i was lucky to catch it once in a way that made it easy to track
down. also recipe-specific sysroots exacerbate the issue, making it
potentially occur more often now.
Josef: then we should focus on fixing
RP: agreed. and we need to think of LTS too
Josef: any idea on how bad the issue is? is it 1 in 100? 1/1000? 1/10000???
RP: don’t know. the new code, working with sqlite package, brings the number
of db entries down from 10,000 to 500. so that means that there were all
those extra entries that could be causing issues. all that extra was stuff
that was installed but then later deleted (but the db not cleaned up).
much less likely to see the issue if you’re using a clean build every time
rather than reusing the same build area over and over
PaulB: are there legitimate cases where we need to delete something from a
non-pseudo task that was put in place under pseudo?
RP: yes sysroot-native
PaulB: is the abort patch available?
RP: all in master-next
MarkH: fyi i’ve been running that code and haven’t seen an issue
MarkH: meta-browser, meta-xilinx, poky, meta-oe
JS-M: i back ported it to dunfell and ran it against AGL, zero hits so far
Randy: i’m surprised we’re not seeing it “in the field”
RP: i’m not, it’s a core issue, which layers and how many layers
shouldn’t affect it
RP: e.g. there was an sstate test that would fail, but it only ever failed
on one specific worker. turned out to be a cache invalidation issue
(sstate cache). pseudo’s ignore paths, which included the cache, should
have ignored it but didn’t
RP: anyway, please test
Randy: a week or two until release?
RP: sooner than later
RP: m4-rc1. probably not before next week
PaulB: i’d rather see aborts than silent failures
RP: it’ll annoy people that the aborts are not deterministic
JS-M: can we print an error that’ll help the user
RP: it’ll be hard to convey the problem when an abort happens
Randy: i can offer a large system for helping
JS-M: i can offer a large system too
Randy: any other tests we can do to try to repeat?
Randy: is inode reuse policy done by FS or pseudo?
RP: FS. there might be different policies and figuring them out
TW: if this issue has been there potentially from the beginning, is it
possible there are bad images in the field? images that were built years
ago, in production, that might have a bad permission in some file that's
almost never used, but could fail if that file is accessed?
TW: does the file path size of the build area affect this? conversely, could
shorter paths help avoid the issue?
RP: longer paths will cause slower builds, but not breaking. path comparison
is fast, but not an issue. might need to switch to an allow list rather
than a deny list, but MarkH has warned against that approach
RP: i’ve had more ideas about improving build speed
PaulB: do we know when we’re going to talk about features for the next
RP: not planned yet, we don’t have a lot of people working on new features
(like we used to) so if nobody has the time to add new things, there’s
little point to talk about what we’d like
Randy: should we talk about it in this call? should it only be for the 1st
call of the month?
Saul: i looked at the qemu monitor issue. i’m trying to use the qemu monitor,
which gives us visibility into the state of qemu (memory/network/etc). RP
suggested we might be able to use the monitor, which often shares a
connection with the serial console, switching between them with a Ctrl
key. i was able to get the qemu-monitor away from the serial console and
interact with it via netcat/telnet in order to access the selftest.
RP: what are you seeing?
Saul: i’m using qemu-runner to try to connect to the monitor, but it hangs.
so if someone has better knowledge of python select()/poll() it might
help. i’m calling select()/poll() on the socket, but it never completes.
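
For reference, a small sketch of waiting on a socket with select() and a timeout, so the caller gets control back instead of blocking forever. This is not the qemu-runner code and does not diagnose the hang above; `socket.socketpair()` stands in for the monitor connection, and the helper name and prompt bytes are placeholders.

```python
import select
import socket


def read_when_ready(sock, timeout=2.0, bufsize=4096):
    """Wait up to `timeout` seconds for data on sock.

    Returns the received bytes, or None if nothing arrived in time.
    """
    readable, _, _ = select.select([sock], [], [], timeout)
    if not readable:
        return None
    return sock.recv(bufsize)


# socketpair() stands in for the monitor connection in this sketch.
monitor_end, client_end = socket.socketpair()

monitor_end.sendall(b"(qemu) ")                    # placeholder prompt bytes
reply = read_when_ready(client_end)                # data is waiting
silent = read_when_ready(client_end, timeout=0.1)  # nothing more to read

print(reply, silent)

monitor_end.close()
client_end.close()
```

Passing a timeout means a silent peer shows up as a None return that can be logged, rather than as a hang.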
JPEW: i can help
RP: share the code :-)
Timo: toaster-container still failing, trying to instrument the code to find
out why/where? just getting a bb-unhandled, which isn’t helpful
RP: do you have the cooker log?
Timo: yes, but it looks clean. it’s the toaster-ui log that shows the failure
RP: these are caused by changes i made in bitbake, technically this should be
a release blocker
Saul: is TOPDIR whitelisted, or magically removed elsewhere?
RP: whitelisted from what? sstate-hash
Saul: but it’s not in the whitelist itself? it’s magically done
RP: it shouldn’t be magic
RP: we exclude TMPDIR, so maybe it falls under that
JPEW: parsing improvements?
RP: PSEUDO_IGNORE_PATHS which has other stuff in it e.g. BUILDHISTORY_DIR.
i had been trying to keep BUILDHISTORY_DIR out. the vardeps excludes
aren’t working properly, i noticed all variables were being recursed
indefinitely. i think that at parse time it can get away with only going
1-deep, at build time we need to parse indefinitely, but it would speed
up the parsing step. it is a large job at parse time because at parse
time it’s looking at everything, at build time we only look at stuff
relevant to that task. There’s a chance we’re already relying on this
behaviour, which would make it harder to change.
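
To illustrate the 1-deep versus full recursion trade-off RP mentions, here is a toy dependency walk with an optional depth limit. It is not bitbake's implementation; the function name and the dependency table are hypothetical.

```python
def expand_deps(var, deps, max_depth=None):
    """Collect the names reachable from var in the deps mapping.

    max_depth=1 returns only direct dependencies; max_depth=None walks
    until no new names are found. A seen-set prevents endless loops on
    cyclic references.
    """
    seen = set()
    frontier = [var]
    depth = 0
    while frontier and (max_depth is None or depth < max_depth):
        next_frontier = []
        for name in frontier:
            for dep in deps.get(name, ()):
                if dep != var and dep not in seen:
                    seen.add(dep)
                    next_frontier.append(dep)
        frontier = next_frontier
        depth += 1
    return seen


# Hypothetical dependency table, for illustration only.
DEPS = {
    "do_install": ["D", "S"],
    "D": ["WORKDIR"],
    "S": ["WORKDIR"],
    "WORKDIR": ["TMPDIR"],
}

shallow = expand_deps("do_install", DEPS, max_depth=1)
full = expand_deps("do_install", DEPS)
print(sorted(shallow), sorted(full))
```

The shallow walk touches far fewer names per variable, which is where a parse-time saving would come from; the full walk is what a task would still need at build time.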
TW: MACHINE_EXTRA_RRECOMMENDS is not included in core-image-full-cmdline, is
RP: i stumbled across this earlier, there was an explanation
PaulB: probably not included in packagegroup-core-base
RP: yes, there was a reason, can’t quite remember