Hardknott (GCC10) Compiler Issues


Chuck Wolber
 

All,

Please accept my apologies in advance for the detailed submission. I think it is warranted in this
case.

There is something... "odd" about the GCC 10 compiler that is delivered with Hardknott. I am still
chasing it down, so I am not yet ready to declare a root cause or submit a bug, but I am posting
what I have now in case anyone has some insights to offer.

For all I know it is something unusual that I am doing, but we have a lot of history with our
build/dev/release methods, so I would be surprised if that was actually the case. I have also
discussed aspects of this on IRC for the last few days, so some of this may be familiar to some
of you.

Background: We maintain a virtual machine SDK for our developers that is as close as possible to
the actual embedded hardware environment that we target. The SDK image is our baseline Linux
OS plus lots of the expected dev and debugging tools. The image deployed to our target devices is
the baseline Linux OS plus the core application suite. It is also important to note that we only
support the x86_64 machine architecture in our target devices and development workstations.

We also spin up and spin down the SDK VM for our nightly builds. This guarantees strict consistency
and eliminates lots of variables when we are trying to troubleshoot something hairy.

We just upgraded from Thud to Hardknott. This means we built our new Hardknott based SDK VM
image from our Thud based SDK VM (GCC 8 / glibc 2.28). When we attempted to build our target
device image in the new Hardknott based SDK VM, we consistently got a segfault whenever a build
task involved bison issuing a warning of some sort. I traced this down for a very long time and it
seemed to have something to do with the libtextstyle library from gettext and the way bison used it.
But I now believe this to be a red herring. Bison seems to be very fragile, but in this case,
that may have actually been a good thing.

After some experimentation I found that the issue went away when I dropped down to the 3.6.4
recipe of bison found at OE-Core:bc95820cd. But this did not sit right with me. There is no way I
should be the only person seeing this issue.

Then I tried an experiment... I assumed I was encountering a compiler bootstrap issue with such a
big jump (GCC8 -> GCC10), so I rebuilt our hardknott based SDK VM with the 3.3.1 version of
buildtools-extended. The build worked flawlessly, but when I booted into the new SDK VM and
kicked off the build I got the same result (bison segfault when any build warnings are encountered).

This is when I started to mentally put a few more details together with other post-upgrade issues that
had been discovered in our lab. We attributed them to garden variety API and behavioral changes
expected during a Yocto upgrade, but now I am not so sure.

During the thud-to-hardknott upgrade process, we did nightly builds of the new hardknott based
target image from our thud based SDK VM. I assumed that since GCC10 was being built as part of
the build sysroot bootstrap process, we were getting a clean and consistent result irrespective of the
underlying build server OS.

One of the issues we were seeing in the lab was a periodic hang during the initramfs phase of the
boot process. We run a couple of setup scripts to manage the sysroot before the switch_root, so it
is not unusual to see some "growing pains" after an upgrade. The hangs were random with no
obvious cause, but systemd is very weird anyway so we attributed it to a new dependency or race
condition that we had to address after going from systemd 239 to 247.

It is also worth noting that systemd itself was not hung; it responded to the ol' "three finger salute"
and dutifully filled the screen with shutdown messages. It was just that the boot process randomly
stopped cold in initramfs before the switch_root. We would also occasionally see systemd
complaining in the logs, "Starting requested but asserts failed".
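
For anyone who wants to check their own boot logs for the same symptom, here is a minimal sketch
that scans saved log text for the message quoted above. It assumes the log has already been captured
to a plain text file (for example from journalctl -b); the function name is made up for illustration.

```python
# Message text as quoted in this post.
ASSERT_MESSAGE = "Starting requested but asserts failed"


def find_assert_failures(lines):
    """Return (line_number, text) pairs for every line containing the message."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if ASSERT_MESSAGE in line
    ]


def scan_log_file(path):
    """Scan a saved log file; undecodable bytes are replaced rather than fatal."""
    with open(path, "r", errors="replace") as log:
        return find_assert_failures(log)
```

Comparing the number of hits between a boot that hung and one that did not would show whether the
assert message actually tracks the hang or is just noise.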

Historically, when asserts fail, it is a sign of a much larger problem, so I did another experiment...

Since we could build our SDK VM successfully with buildtools-extended, why not build the target
images? So I did. After a day of testing in the lab, none of the testers have seen the boot hang up in
the initramfs stage, whereas before it was happening about 50% of the time. I need a good week of
successful test activity before I am willing to declare success, but the results were convincing
enough to make it worth this summary post.

I did an extensive amount of trial and error testing, including meticulously comparing
buildtools-extended with our own versions of the same files. The only intersection point was gcc.
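
The comparison was done by hand, but the idea can be sketched roughly as follows. This is not the
exact procedure I used; it is an illustration that hashes every regular file under two directory trees
and reports what is unique to each side and what differs. The function names are hypothetical and the
two root paths are whatever you pass in.

```python
import hashlib
import os


def file_digest(path):
    """Return the SHA-256 hex digest of a file's contents."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def index_tree(root):
    """Map each regular file (symlinks skipped) to its digest, keyed by relative path."""
    index = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            if os.path.islink(full) or not os.path.isfile(full):
                continue
            index[os.path.relpath(full, root)] = file_digest(full)
    return index


def compare_trees(root_a, root_b):
    """Report files present on one side only and files whose contents differ."""
    a, b = index_tree(root_a), index_tree(root_b)
    common = a.keys() & b.keys()
    return {
        "only_in_a": sorted(a.keys() - b.keys()),
        "only_in_b": sorted(b.keys() - a.keys()),
        "differ": sorted(k for k in common if a[k] != b[k]),
    }
```

Binaries built at different times will usually differ byte for byte, so the "differ" list is only a
starting point for a closer look, not proof of a functional difference.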

The gcc delivered with buildtools-extended works great. When I build hardknott's GCC 10 with the
gcc in buildtools-extended, we are not able to build our target images with the resulting compiler.
When I build our target images from the old thud environment, we get mysterious hangs and
failed systemd asserts during boot. Since GCC 10 is an intermediate piece of that build, it is
also implicated despite the native environment running GCC 8.

I will continue to troubleshoot this but I was hoping for some insight (or gentle guidance if I am
making a silly mistake). Overall, I am at a loss to think of a reason why I should not be able to build
a compiler from the buildtools-extended compiler and then use it to reliably build our target images.

Thank you,

..Ch:W..


P.S. For those who are curious, we started out on Pyro hosted on Ubuntu 16.04. From there we made
the jump to self hosting when we used that environment to build a thud based SDK VM. After years of
successful builds, we are now in the process of upgrading to Hardknott.

P.P.S. For the sake of completeness, I had to add the following files to the buildtools-extended
sysroot to fully complete the build of our images:

/usr/include/magic.h -> util-linux "more" command requires this.
/usr/include/zstd.h -> I do not recall which recipe required this.
/usr/bin/free -> The OpenJDK 8 build scripts need this.
/usr/include/sys/* -> openjdk-8-native
/lib/libcap.so.2 -> The binutils "dir" command quietly breaks the build without this. I am not a fan of the
                            lack of error checking in the binutils build...
/usr/include/sensors/error.h and sensors.h -> mesa-native
/usr/include/zstd_errors.h -> qemu-system-native
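
If anyone wants to check their own buildtools-extended sysroot against this list, a small sketch
along these lines should do it. The paths are the ones listed above (with /usr/include/sys treated
as a directory); the sysroot location is an assumption you supply yourself, and the function name is
made up for illustration.

```python
import os

# Paths taken from the list above, relative to the sysroot.
REQUIRED_PATHS = [
    "usr/include/magic.h",
    "usr/include/zstd.h",
    "usr/bin/free",
    "usr/include/sys",
    "lib/libcap.so.2",
    "usr/include/sensors/error.h",
    "usr/include/sensors/sensors.h",
    "usr/include/zstd_errors.h",
]


def missing_from_sysroot(sysroot, required=REQUIRED_PATHS):
    """Return the required paths that do not exist under the given sysroot.

    os.path.lexists is used so a symlink counts as present even if its
    target cannot be resolved from outside the sysroot.
    """
    return [
        rel for rel in required
        if not os.path.lexists(os.path.join(sysroot, rel))
    ]
```

Calling missing_from_sysroot() with the path to the unpacked sysroot returns the entries that would
still need to be added; an empty list means everything above is already there.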

--
"Perfection must be reached by degrees; she requires the slow hand of time." - Voltaire
