Yocto Technical Team Minutes/Engineering Sync for Feb 23, 2021

Trevor Woerner

Yocto Technical Team Minutes, Engineering Sync, for Feb 23, 2021
archive: https://docs.google.com/document/d/1ly8nyhO14kDNnFcW2QskANXW3ZT7QwKC5wWVDg9dDH4/edit

== disclaimer ==
Best efforts are made to ensure the below is accurate and valid. However,
errors sometimes happen. If any errors or omissions are found, please feel
free to reply to this email with any corrections.

== attendees ==
Trevor Woerner, Stephen Jolley, Scott Murray, Armin Kuster, Michael
Halstead, Steve Sakoman, Richard Purdie, Randy MacLeod, Saul Wold, Jon
Mason, Joshua Watt, Paul Barker, Tim Orling, Mark Morton, John Kaldas,
Alejandro H, Ross Burton

== notes ==
- 3.2.2 passed QA clean, awaiting final approval from TSC
- 3.1.6 built and in QA
- 1 week before -m3 should be built (feature freeze for 3.3)
- adding RPM to reproducibility, still needs some work
- recipe maintainers: please review the patches we’re carrying (push
upstream as many as possible)
- glibc 2.33 issue should be resolved with latest pseudo
- AUH patches are now merged or queued, a few of the failures handled
- AB issue list is at record high (not good)

== general ==
RP: can’t get stable test results out of AB on master

RP: would be nice to get RPM reproducibility issues ironed out, but there are
some epoch issues to work through, which mess up diffoscope
RP: was surprised to see how bad the interactive response is on the cmdline on
the builders. it seems like an I/O bottleneck
Randy: mostly I/O to SSDs?
RP: i believe so
RP: it was immediately after a build had been started. so it could be related
to downloads or sstate fetching. how much sstate did you expect that build
to be reusing?
SteveS: version bumps to connman, kernel… yeah, that could lead to a lot of rebuilding
Michael: we’ve been optimizing for throughput for a while. on some other
build systems we leave some overhead available for cmdline interactivity.
should we start to do that with the YP AB?
RP: i think it would have to be backed off by a significant amount to get that
breathing space. so maybe yes, but we’d have to look at it to see what
to back off and by how much. looking through the build i see that 77% was
pulled from sstate (therefore pulling data off the NAS, then extracting it)
Randy: and that’s not coordinated at all, if there are 100 items, then 100
extractions happen at once
RP: but limited by BB_THREADS
JPEW: run by buildbot? maybe we could use cgroups?
RP: i don’t think it’s CPU bound, the CPU was 50% idle when the cmdline
was very slow
MichaelH: sometimes we see that when the system isn’t healthy, i wonder if
it’s isolated to specific machines?
RP: on the CentOS machine, a command took over 5 minutes to complete. then
tried debian, same thing. then logged into the fedora machine and was
able to do stuff. but it didn’t seem isolated to any machines, it
seemed localized in time (i.e. right after a build had been started),
then dropped off. so i feel that it might be related to the initial build
startup, probably related to sstate pulling/extraction
Randy: could also limit I/O using cgroups
RP: we do use ionice for parts of the build (2.1 and 2.7)
Michael: translation: class 2 priority 1; class 2 priority 7
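For reference, the class/priority pairs Michael decodes correspond to BitBake’s
BB_TASK_IONICE_LEVEL variable, which takes a “class.priority” string. A hedged
local.conf sketch (the right value would need measuring on the builders):

```
# Run build tasks at best-effort I/O class (2), low priority (7),
# leaving I/O headroom for interactive cmdline use on the builder.
BB_TASK_IONICE_LEVEL = "2.7"
```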
Alejandro: are these sharing any hardware
RP: they’re all connected to the NAS
Michael: and they’re 100% dedicated to this work
RP: i don’t think this is a network bottleneck, i think this is sstate extraction
JPEW: maybe a different compression algorithm? gzip is notoriously slow
RP: wouldn’t that make it worse?
JPEW: does each AB have its own mirror?
RP: it’s all done over NFS
JPEW: network bandwidth should be lower than local unzipping/extraction bandwidth?
Alejandro: could we try different I/O scheduling?
RP: don’t know
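The trade-off behind JPEW’s and RP’s exchange can be sketched with a toy model:
a stronger codec shrinks the bytes pulled over NFS but can slow local
extraction. All numbers below are illustrative assumptions, not measurements
from the autobuilders:

```python
def total_seconds(compressed_mb, net_mb_s, decompress_mb_s):
    """Time to fetch one compressed sstate object and inflate it locally."""
    return compressed_mb / net_mb_s + compressed_mb / decompress_mb_s

# Hypothetical 100 MB gzip object on a 100 MB/s NFS link, ~60 MB/s inflate:
gzip_t = total_seconds(100, 100, 60)
# A stronger codec: smaller object (70 MB) but slower inflate (25 MB/s):
strong_t = total_seconds(70, 100, 25)

print(round(gzip_t, 2), round(strong_t, 2))  # here the stronger codec loses
```

Which way it goes depends entirely on the real ratios and throughputs, which
is RP’s “wouldn’t that make it worse?” point.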

RP: had a look at patches. 1,300 patches in oe-core, ~600 in pending; the rest
are submitted or not appropriate. some of these are 10 years old or more,
do we still need them? i sent 2 upstream and was told they weren’t needed
anymore (problem fixed in other ways). there’s also one in valgrind that
looks similar (different fix upstream) and is not needed.
Ross: if some people could try to do 1 a day that would be a huge help
RP: lots of patches related to reproducibility
JPEW: the big issue with perf is that it uses bison (which needs patches)

PaulB: read-only mode for PR server. i’ve been working on it, but it’s 1
big patch. there’s code to handle daemonizing and forking which in hash
server is using the python multiprocessing module. we also want to use the
same RPC mechanisms that python is using. are those good lines along which
to break the patches down?
RP: that sounds perfect
PaulB: it was easier to bash on it all together, then go back and break it up
into digestible chunks
RP: it’s 10 year old code, so i’m not surprised
PaulB: i’ve broken out a part that uses JSON RPC, then use that for the rest
RP: sounds good to me
JPEW: me too
RP: scaling that code under python2 was a challenge. glad to see this moving
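A minimal sketch of the newline-delimited JSON request/response transport the
hash equivalence server uses over asyncio streams, i.e. the mechanism PaulB
wants to share with the PR server. The method name and payload here are
invented for illustration, not the real protocol:

```python
import asyncio, json

async def handle(reader, writer):
    # One JSON object per line in, one JSON object per line out
    while line := await reader.readline():
        req = json.loads(line)
        if "get-pr" in req:                     # invented method name
            resp = {"value": 7}                 # placeholder lookup
        else:
            resp = {"error": "unknown method"}
        writer.write(json.dumps(resp).encode() + b"\n")
        await writer.drain()
    writer.close()

async def main():
    server = await asyncio.start_server(handle, "127.0.0.1", 0)
    port = server.sockets[0].getsockname()[1]
    reader, writer = await asyncio.open_connection("127.0.0.1", port)
    writer.write(json.dumps({"get-pr": {"version": "foo"}}).encode() + b"\n")
    await writer.drain()
    resp = json.loads(await reader.readline())
    writer.close()
    server.close()
    await server.wait_closed()
    return resp

print(asyncio.run(main()))  # {'value': 7}
```

Keeping the transport this simple is what makes it practical to reuse the
same serving code for read-only and read-write modes.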

RP: Randy posted rust patch set. felt it couldn’t be merged in this form
(too many patches)
Randy: do you want the history squashed?
RP: that was my feeling
Randy: i’ve been working on it bit by bit as stuff happens upstream which
leads to lots of little commits. but i can reorg by logical group and
squash the log
RP: yes. in one case there were lots of commits to the rust version, then in
the end you end up with 2
Randy: someone from MS worked on getting the sdk stuff working
RP: given that next Monday is the feature freeze, let’s get the patches out
sooner, and worry about the sdk later
Randy: ok. last remaining issue is the pre-fetcher but i don’t know much
about it. looked at PaulB’s patches
PaulB: there are 3 methods floating around, i’ve focused on one of them that
i like
1. doing the download ahead of time in do_fetch()
2. let rust-bin do the downloads itself in do_compile() which i don’t like
3. haven’t looked at the last one yet
PaulB: i like 1 because it asks cargo to output the crate list, which the
fetcher can then act on
Randy: doesn’t rely on crates?
PaulB: i think it relies on crates for things that it can’t resolve. however
Andreas’ approach relies on getting bitbake to understand the Cargo.toml
file, not sure if that’s a good approach
Randy: are there any lessons with Go that we can use?
Scott: Bruce would be a good one to talk to
PaulB: my understanding with Go is that the code tends to all be placed
together in the git repository, so the fetch side is a little simpler
Randy: so given the approach we’re using is there anything that needs to be done?
PaulB: it needs testing
Randy: i have a team working on testing the rust compiler itself. they can
successfully execute 2/3 of the tests now (of which 99.9% pass). i have
a reproducibility test for rust hello world, but it takes a long time to
run. any tests you’re thinking of?
PaulB: fetcher tests. if you have a “crate://” in a URL, just making sure
it gets translated correctly to make sure it doesn’t bitrot
Randy: is that an -m3 or -m4 activity
RP: if we’ll get it in -m4 for sure we can wait until then. in oe-core
we’d want some sort of hello world (make sure compiler works and we can
run the binaries)
Randy: we have that already
RP: both for cross-compiling and target. then reproducibility tests. i’m
happy to build them up, as long as there’s a roadmap. for -m3 i think we
should get the baseline rust set and the crate fetcher
PaulB: crate fetcher overrides the wget fetcher and makes sure everything gets
put in the right place. so it just needs a couple test cases; a map of
inputs to outputs. i’ll resubmit the patch and include a list of tests
that we need to add
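The kind of input→output test case PaulB describes could look like this
sketch. The mapping to crates.io’s download endpoint is my assumption about
what the fetcher emits, and translate_crate_url is a hypothetical helper,
not the fetcher’s real API:

```python
def translate_crate_url(url):
    """Map crate://<host>/<name>/<version> to (download_url, local_file)."""
    # Assumed scheme, modeled on crates.io's public download endpoint
    assert url.startswith("crate://")
    host, name, version = url[len("crate://"):].split("/")
    download = f"https://{host}/api/v1/crates/{name}/{version}/download"
    return download, f"{name}-{version}.crate"

print(translate_crate_url("crate://crates.io/glob/0.2.11"))
# ('https://crates.io/api/v1/crates/glob/0.2.11/download', 'glob-0.2.11.crate')
```

A table of such pairs is exactly the bitrot guard PaulB asks for: if the
translation ever changes, the map of inputs to outputs fails loudly.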
RP: if someone could reply to Andreas and let him know what’s going on and
why we’re going in a slightly different direction than the work he’s done
PaulB: the fundamental unit is the recipe. devtool is the place for some of
the functionality not bitbake
Randy: building rust hello world works for all glibc qemu targets but there’s
some breakage with musl (risc-v and powerpc). i think Khem is working on
the risc-v one. will that hold things up?
RP: no
PaulB: i have some slightly larger rust packages (larger than hello world)
that i think will test things a little more thoroughly, e.g. ripgrep
Randy: we’re testing that one already. should we add it to oe-core?
RP: it would be good to have something in oe-core to do testing
Randy: i think hello world would be good enough for oe-core and leave larger
tests for other layers
PaulB: there are rust things in oe-core (librsvg, etc) so i think oe-core will
have good test things already in them without having to add recipes just
for testing’s sake
Randy: things also seem to work well on ARM builders
Ross: yes, things are on-target

TrevorW: started work on 2021 YP conference. conversation moved to
conferences@... if you want to follow along or help

RP: sources are fetched into the workdir, so we can’t clean up the workdir
when the config changes. maybe we should fetch to a special dir and then
just symlink it in
PaulB: would it be recipe-specific
RP: yes, it would be under $WORKDIR
PaulB: would make the archiver easier
TimO: there are a number of Go modules that don’t cleanup properly so maybe
this would help
ScottM: there are lots of recipes that do post-processing on files in $WORKDIR
before moving them to the artifacts directory, so there could be breakage
PaulB: can we do it first thing next release
RP: we’ll give it a try and see
TrevorW: i think a lot of BSP layers will be affected
RP: i think there’s a good chance a lot of things (not just BSPs) will be
affected
ScottM: there are some BSP things that will be affected, but in AGL we’re
doing a lot of $WORKDIR manipulations that aren’t necessarily BSP
related as well
(several): overall it sounds like a good idea and a good cleanup to try
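The fetch-then-symlink idea above can be sketched like this; all paths and
names are illustrative, not a proposed layout:

```python
import os, tempfile

root = tempfile.mkdtemp()
fetchdir = os.path.join(root, "fetch", "myrecipe")  # shared, persistent fetch area
workdir = os.path.join(root, "work", "myrecipe")    # per-build workdir
os.makedirs(fetchdir)
os.makedirs(workdir)

# "Fetch" the source once into the shared area
src = os.path.join(fetchdir, "src.tar")
with open(src, "w") as f:
    f.write("sources")

# Expose it in the workdir via a symlink instead of a copy
link = os.path.join(workdir, "src.tar")
os.symlink(src, link)

# Cleaning the workdir removes only the link; the fetched data survives
os.remove(link)
print(os.path.exists(src))  # True
```

This is why ScottM’s caveat matters: any recipe that rewrites files in place
under $WORKDIR would be modifying data behind a symlink, not a private copy.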
