Is curated SPDX data sharing a thing?


Jérôme Carretero
 

Hi,


Please correct me if I'm wrong, but as far as I understand it, as of
today the flow for generating SPDX data to build software BoMs,
documented e.g. in:

- https://www.fossology.org/get-started/basic-workflow/
- https://elinux.org/images/2/20/License_Compliance_in_Embedded_Linux_with_the_Yocto_Project.pdf

involves building your own database of SPDX files after reviewing all
the sources, which doesn't look to be within reach of most
businesses.


I am wondering by extension:

- Whether there are businesses selling pre-masticated SPDX data
(I can imagine one would be willing to pay a little something to
obtain a collection of "certified" (or possibly "insured") SPDX);

- Whether there are (plans for having) public, collaborative
repositories of SPDX data that could be trusted over automatic scans
of source.


Best regards,

--
Jérôme


Richard Purdie
 

Hi,

On Fri, 2020-12-18 at 15:15 -0500, Jérôme Carretero wrote:
> Please correct me if I'm wrong, but as far as I understand it, as of
> today the flow for generating SPDX data to build software BoMs,
> documented e.g. in:
>
> - https://www.fossology.org/get-started/basic-workflow/
> - https://elinux.org/images/2/20/License_Compliance_in_Embedded_Linux_with_the_Yocto_Project.pdf
>
> involves building your own database of SPDX files after reviewing all
> the sources, which doesn't look to be within reach of most
> businesses.

The challenge is that Yocto Project lets you build your own custom
software, which means you also end up in your own BoM situation. We
generally therefore provide tooling that can help you generate the
information you need but there usually isn't "one size fits all".

I would mention the meta-spdxscanner layer as having
support/integration for some of the more recent scanning and document
generation tools.

> I am wondering by extension:
>
> - Whether there are businesses selling pre-masticated SPDX data
> (I can imagine one would be willing to pay a little something to
> obtain a collection of "certified" (or possibly "insured") SPDX);

I'm sure there are services provided, particularly by some of the
member OSVs, but as I mentioned above, it's hard to have a one size
fits all since you can patch or reconfigure the sources at will.

> - Whether there are (plans for having) public, collaborative
> repositories of SPDX data that could be trusted over automatic
> scans of source.

We are hoping to have better tools integration where the build process
may be able to generate better SBoM and SPDX information directly.
Unfortunately it's an area where it's hard to find people willing to
contribute.

Cheers,

Richard


Jérôme Carretero
 

On Fri, 18 Dec 2020 20:34:01 +0000
"Richard Purdie" <richard.purdie@linuxfoundation.org> wrote:

> The challenge is that Yocto Project lets you build your own custom
> software, which means you also end up in your own BoM situation. We
> generally therefore provide tooling that can help you generate the
> information you need but there usually isn't "one size fits all".

Of course different choices can be made regarding obligations (where
licenses are shown, how sources are distributed), but in the same way
that today ${LICENSE_DIRECTORY}/${P}/recipeinfo contains a LICENSE key
which is very useful for figuring out obligations, SPDX could be used
to have more information and more trust.

In most of my experience, a product mostly contains F/LOSS code from
major Yocto/OE layers, maybe a couple of other 3rd party libraries, a
couple of patches here and there, and a few 100kSLOC of "original" code;
the BoM consists... of an image manifest file.

A huge portion of the SPDX data could be reused, to get an
almost-complete better BoM.

> I would mention the meta-spdxscanner layer as having
> support/integration for some of the more recent scanning and document
> generation tools.

Yeah, I used it. I can see that it mostly works except for the fact
that you either spend a lifetime doing source code analysis, or just a
few years because you trust the agreement of multiple robots on the
license verdict, which only leaves you the ambiguous files to process
(and that's time-consuming work).

> I'm sure there are services provided, particularly by some of the
> member OSVs, but as I mentioned above, it's hard to have a one size
> fits all since you can patch or reconfigure the sources at will.

SPDX data contains package and also source file info (based on hashes),
so if a patch is applied, an analysis would only need to concern the
modified files. Provided a development history and a baseline SPDX are
available, it would significantly reduce the amount of work one would
face.
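
A minimal sketch of that idea, assuming the baseline SPDX data has already been reduced to a mapping from relative file path to SHA-1 digest. The function names and the shape of the `baseline` mapping are hypothetical and are not taken from any existing Yocto or SPDX tool:

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def sha1_of_file(path: Path) -> str:
    """Return the hex SHA-1 digest of a file's contents."""
    digest = hashlib.sha1()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def files_needing_review(source_dir: Path, baseline: Dict[str, str]) -> List[str]:
    """List files that are new or whose hash differs from the baseline.

    `baseline` maps relative POSIX paths to SHA-1 hex digests (assumed to
    have been extracted from the per-file checksums of a baseline SPDX
    document; that extraction step is not shown here).
    """
    pending = []
    for path in sorted(source_dir.rglob("*")):
        if not path.is_file():
            continue
        relative = path.relative_to(source_dir).as_posix()
        # Unknown files and modified files both fall out of this comparison.
        if baseline.get(relative) != sha1_of_file(path):
            pending.append(relative)
    return pending
```

Run against a patched source tree, this would return only the files touched or added by the patches, which are the ones a human would still have to look at.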

> We are hoping to have better tools integration where the build process
> may be able to generate better SBoM and SPDX information directly.
> Unfortunately it's an area where it's hard to find people willing to
> contribute.

It's certainly easy to verify after do_patch (or after do_compile in
some cases) that sources correspond to existing SPDX files, or to
look up SPDX files in an external database based on hashes of sources,
but automatically generating SPDX:

- is very time-consuming and I don't see it as something that one would
even do e.g. in continuous integration;
- is not perfect; I don't think the build process could automatically
generate more than "candidate SPDX" information except maybe for a
couple of really-clean packages where the developers care about that.
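
The cheap verification step mentioned above could look roughly like the following. It relies on the package verification code from the SPDX 2.x specification, which, as far as I understand it, is the SHA-1 of the sorted, concatenated per-file SHA-1 hex digests. The helper names are hypothetical, and the handling of excluded files (such as the SPDX document itself) is deliberately left out:

```python
import hashlib
from pathlib import Path
from typing import Iterable


def package_verification_code(file_sha1s: Iterable[str]) -> str:
    """Combine per-file SHA-1 hex digests into one order-independent code."""
    ordered = sorted(digest.lower() for digest in file_sha1s)
    return hashlib.sha1("".join(ordered).encode("ascii")).hexdigest()


def tree_verification_code(source_dir: Path) -> str:
    """Compute the verification code for every regular file under a tree."""
    digests = [
        hashlib.sha1(path.read_bytes()).hexdigest()
        for path in source_dir.rglob("*")
        if path.is_file()
    ]
    return package_verification_code(digests)


def matches_spdx(source_dir: Path, expected_code: str) -> bool:
    """True if the tree still corresponds to the code recorded in SPDX data."""
    return tree_verification_code(source_dir) == expected_code.lower()
```

A check like `matches_spdx` only says whether an existing SPDX document still applies to the unpacked and patched sources; it does nothing to produce license conclusions, which remains the expensive part.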

Is there a more focused discussion list on that topic, or is here OK?
I may have a lot of questions/ideas but don't want to cause off-topic
noise.


Best,

--
Jérôme


Richard Purdie
 

On Fri, 2020-12-18 at 16:51 -0500, Jérôme Carretero wrote:
> On Fri, 18 Dec 2020 20:34:01 +0000
> "Richard Purdie" <richard.purdie@linuxfoundation.org> wrote:
>
> > The challenge is that Yocto Project lets you build your own custom
> > software, which means you also end up in your own BoM situation. We
> > generally therefore provide tooling that can help you generate the
> > information you need but there usually isn't "one size fits all".
>
> Of course different choices can be made regarding obligations (where
> licenses are shown, how sources are distributed), but in the same way
> that today ${LICENSE_DIRECTORY}/${P}/recipeinfo contains a LICENSE key
> which is very useful for figuring out obligations, SPDX could be used
> to have more information and more trust.

It's going to take someone to stand up and provide the first "version"
of that, and I'm not sure anyone wants to step up and be that
person/organisation...

> In most of my experience, a product mostly contains F/LOSS code from
> major Yocto/OE layers, maybe a couple of other 3rd party libraries, a
> couple of patches here and there, and a few 100kSLOC of "original" code;
> the BoM consists... of an image manifest file.
>
> A huge portion of the SPDX data could be reused, to get an
> almost-complete better BoM.

It does depend on which data we're talking about. You also have the
issue that it's fine to generate tons of data, but at some point you
have to interpret what it means too...

> > I would mention the meta-spdxscanner layer as having
> > support/integration for some of the more recent scanning and document
> > generation tools.
>
> Yeah, I used it. I can see that it mostly works except for the fact
> that you either spend a lifetime doing source code analysis, or just a
> few years because you trust the agreement of multiple robots on the
> license verdict, which only leaves you the ambiguous files to process
> (and that's time-consuming work).

I watched and helped with our older LICENSE field work and I can say
it's a thankless task which it's very hard to get people to do. I fear
that the SPDX scans you refer to are so complex that it will be hard to
do this consistently across the codebase. I'm actually hoping things
may go a slightly different route, such as ultimately a majority of
code having license identifiers in it (we've tried to ensure YP code
has them).
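
To illustrate why in-file identifiers would help, here is a small sketch that separates files carrying an `SPDX-License-Identifier:` tag from those that do not. The function names, the 20-line search window and the comment-terminator stripping are assumptions made for this example; a real scanner would need to handle more comment styles:

```python
import re
from pathlib import Path
from typing import Dict, Optional

# The tag name is the standard one; everything after the colon is captured.
SPDX_TAG = re.compile(r"SPDX-License-Identifier:\s*(.+)")


def declared_license(path: Path, max_lines: int = 20) -> Optional[str]:
    """Return the license expression from an SPDX tag near the file's top."""
    with path.open("r", encoding="utf-8", errors="replace") as handle:
        for _, line in zip(range(max_lines), handle):
            match = SPDX_TAG.search(line)
            if match:
                # Drop a trailing C-style comment terminator, if any.
                return match.group(1).strip().rstrip("*/").strip()
    return None


def split_by_tag(source_dir: Path) -> Dict[str, Optional[str]]:
    """Map each file's relative path to its declared license, or None."""
    return {
        path.relative_to(source_dir).as_posix(): declared_license(path)
        for path in sorted(source_dir.rglob("*"))
        if path.is_file()
    }
```

Files that map to `None` are the ones that would still need a scanner or a human; the tagged ones can be recorded directly.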

> > I'm sure there are services provided, particularly by some of the
> > member OSVs, but as I mentioned above, it's hard to have a one size
> > fits all since you can patch or reconfigure the sources at will.
>
> SPDX data contains package and also source file info (based on hashes),
> so if a patch is applied, an analysis would only need to concern the
> modified files. Provided a development history and a baseline SPDX are
> available, it would significantly reduce the amount of work one would
> face.

Sure, how do we get people to build such a baseline though?

> > We are hoping to have better tools integration where the build process
> > may be able to generate better SBoM and SPDX information directly.
> > Unfortunately it's an area where it's hard to find people willing to
> > contribute.
>
> It's certainly easy to verify after do_patch (or after do_compile in
> some cases) that sources correspond to existing SPDX files, or to
> look up SPDX files in an external database based on hashes of sources,
> but automatically generating SPDX:
>
> - is very time-consuming and I don't see it as something that one would
> even do e.g. in continuous integration;
> - is not perfect; I don't think the build process could automatically
> generate more than "candidate SPDX" information except maybe for a
> couple of really-clean packages where the developers care about that.

There are certainly ways it could be done, if there are people who
agree on a common objective and are willing/able to contribute time to
it.

> Is there a more focused discussion list on that topic, or is here OK?
> I may have a lot of questions/ideas but don't want to cause off-topic
> noise.

We did set one up, so there is
https://lists.yoctoproject.org/g/licensing/topics but it hasn't really
taken off (yet?)...

Cheers,

Richard