@dave2wave
Member

This added file is a machine-readable form of resolved.md.

  1. The legal committee updates this file whenever resolved.md is updated.
  2. The Apache Trusted Releases (ATR) platform can read this file and run the most up-to-date license checks to ensure that releases follow ASF release policy.

@jdaugherty

Hi @dave2wave, are the identifiers you're checking in SPDX identifiers? https://spdx.org/licenses/

@dave2wave
Member Author

Mostly. The ones prefixed by LicenseRef- are specifically not found there. @sbp can provide additional references if necessary.
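
As a minimal sketch of that distinction: identifiers starting with `LicenseRef-` are user-defined references rather than entries on the SPDX license list, so a checker can separate them before looking anything up. The function name `split_identifiers` and the sample data below are hypothetical and are not taken from the file in this pull request.

```python
def split_identifiers(identifiers, spdx_ids):
    """Partition identifiers into SPDX-listed, LicenseRef- custom, and unknown."""
    listed, custom, unknown = [], [], []
    for ident in identifiers:
        if ident.startswith("LicenseRef-"):
            # User-defined reference: by design not on the SPDX license list.
            custom.append(ident)
        elif ident in spdx_ids:
            listed.append(ident)
        else:
            unknown.append(ident)
    return listed, custom, unknown


# Hypothetical sample; a real check would load the full SPDX license list.
SPDX_SAMPLE = {"Apache-2.0", "MIT", "BSD-3-Clause"}
listed, custom, unknown = split_identifiers(
    ["Apache-2.0", "LicenseRef-Example", "MIT", "Not-A-Real-Id"],
    SPDX_SAMPLE,
)
```

Anything that lands in `unknown` is neither on the list nor declared as a custom reference, which is probably the case worth flagging for review.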

@jdaugherty

Since SBOM tools like CycloneDX use the SPDX license list, should we contribute upstream to https://github.com/spdx/license-list-XML so that we have IDs for all of these and end-user applications can verify them too?

@dave2wave
Member Author

That would be ideal; in fact, reading SBOM licenses is one of the ATR use cases.

Once the additions are in place, we can plan how to keep this file updated. It may take time for the SBOM-generating tools to catch up with SPDX changes. I hope that they all go to the current source at SPDX, but I don't know that.

@jdaugherty

CycloneDX definitely uses the SPDX license IDs:

    CycloneDX incorporates SPDX license IDs and expressions to document stated licenses of open source components. Licenses can be expressed three ways, by SPDX license ID, by SPDX license expression, or as a license name. Zero or more licenses can be defined by ID or by name.

I think SPDX and CycloneDX are the major formats for SBOMs?
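
The three ways of stating a license in the quoted passage map onto the CycloneDX JSON layout, where each entry in a component's `licenses` array carries either an `expression` or a `license` object with an `id` or a `name`. A minimal sketch of reading them follows; the sample BOM and the function name `stated_licenses` are made up for illustration, and nested components are not traversed.

```python
import json


def stated_licenses(bom):
    """Yield (component, kind, value) for each stated license in a CycloneDX BOM dict."""
    for component in bom.get("components", []):
        name = component.get("name", "<unnamed>")
        for entry in component.get("licenses", []):
            if "expression" in entry:
                # SPDX license expression, e.g. "MIT OR Apache-2.0".
                yield name, "expression", entry["expression"]
                continue
            lic = entry.get("license", {})
            if "id" in lic:
                # SPDX license ID.
                yield name, "id", lic["id"]
            elif "name" in lic:
                # Free-form license name, not an SPDX ID.
                yield name, "name", lic["name"]


# Hypothetical sample BOM covering all three forms.
SAMPLE_BOM = json.loads("""
{
  "bomFormat": "CycloneDX",
  "specVersion": "1.5",
  "components": [
    {"name": "lib-a", "licenses": [{"license": {"id": "Apache-2.0"}}]},
    {"name": "lib-b", "licenses": [{"expression": "MIT OR Apache-2.0"}]},
    {"name": "lib-c", "licenses": [{"license": {"name": "Example Custom License"}}]}
  ]
}
""")

found = list(stated_licenses(SAMPLE_BOM))
```

Only the `id` form can be compared directly against a list of identifiers; expressions would need parsing first, and free-form names would need some mapping that this sketch does not attempt.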

@sbp

sbp commented Oct 8, 2025

SPDX have a list of License Inclusion Principles that we need to adhere to for each license that we suggest to them. There is an online form for submitting new licenses, which requires that you log in using GitHub. Note that the first field requires you to defend how the license adheres to their inclusion principles. Licenses apparently have to be submitted one at a time; I could not find a way to submit in bulk. You also have to follow up each submission with a GitHub issue, and "be prepared to help create the required XML and test text files if the license is accepted for submission".

@dave2wave
Member Author

I think that we are going to find inconsistencies as we validate licenses in more and more SBOMs. For example, here is a license that looks like it doesn't follow their rules: https://github.com/spdx/license-list-data/blob/main/json/licenses.json#L1438-L1449

It references a definition here: https://github.com/apache/apr/blob/trunk/LICENSE#L298C6-L298C29
That license is treated as Category A by the Apache APR release, but is not listed in resolved.html.

@sebbASF
Contributor

sebbASF commented Oct 8, 2025

This effectively creates a copy of the list.
Which one is canonical? How are the copies going to be synchronised?

The JSON version could be used to populate the HTML page, meaning only one file has to be updated.
i.e. something similar to the way the ECCN data is maintained.
Alternatively, perhaps the JSON file could be automatically created from the Markdown.

@sbp

sbp commented Oct 9, 2025

Generating the HTML from the structured data would ensure that the presentation is regular, and would therefore be preferable. But, as part of a long tradition of enriching HTML with structured data stretching back to HyperRDF and GRDDL, I have experimentally implemented the opposite approach.

Specifically, I've made a new, renamed version of resolved.md:

atr/policy/third-party-licenses.md

This version adds title attributes to license links in the format "Category X: SPDX-IDENTIFIER[, SPDX-IDENTIFIER]". Using a Makefile, I then convert the Markdown to HTML using cmark, and use a script called extract_spdx_identifiers.py to extract the JSON:

atr/policy/third-party-licenses.json
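
The actual `extract_spdx_identifiers.py` is not reproduced in this thread. As a rough sketch of the technique only, the title convention can be read with the standard-library HTML parser as shown below. The class name, the regular expression, the output shape (category letter mapped to a list of identifiers), and the sample HTML are all assumptions made for illustration.

```python
import json
import re
from html.parser import HTMLParser

# Title convention described above: "Category X: ID[, ID]".
TITLE_RE = re.compile(r"^Category ([A-Z]): (.+)$")


class LicenseTitleParser(HTMLParser):
    """Collect SPDX identifiers from the title attributes of <a> elements."""

    def __init__(self):
        super().__init__()
        self.by_category = {}

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        title = dict(attrs).get("title")
        if not title:
            return
        match = TITLE_RE.match(title)
        if match is None:
            # Titles that do not follow the convention are ignored.
            return
        category, ids = match.groups()
        bucket = self.by_category.setdefault(category, [])
        for ident in ids.split(","):
            ident = ident.strip()
            if ident and ident not in bucket:
                bucket.append(ident)


def extract(html_text):
    parser = LicenseTitleParser()
    parser.feed(html_text)
    parser.close()
    return parser.by_category


# Hypothetical HTML of the kind cmark might emit for annotated links.
SAMPLE_HTML = """
<ul>
<li><a href="https://example.org/a" title="Category A: Apache-2.0">Apache License 2.0</a></li>
<li><a href="https://example.org/b" title="Category A: MIT, LicenseRef-Example">MIT-style</a></li>
<li><a href="https://example.org/c" title="Category X: GPL-3.0-only">GPL 3</a></li>
<li><a href="https://example.org/d" title="Unrelated tooltip">Other link</a></li>
</ul>
"""

result = extract(SAMPLE_HTML)
print(json.dumps(result, indent=2))
```

Because only titles matching the full pattern are collected, ordinary tooltips elsewhere on the page do not leak into the output.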

I tried to preserve the original Markdown page as closely as possible, but I had to, for example, split a few links into separate license versions. Along the way I also updated a few of my own decisions, so the output JSON license list above differs slightly from the one in the present pull request.

There are some inconsistencies in the classification logic of the original page, but I'll give that feedback elsewhere.

@sebbASF
Contributor

sebbASF commented Oct 9, 2025

Why extract the JSON from the HTML, rather than from the simpler Markdown?
AIUI, the Markdown syntax understood by cmark may not be exactly the same as the syntax used on the rest of the site.

Also, this would be easier to review and test if it were in a branch of www-site (possibly a preview/ one), rather than in an unrelated repo.

@sbp

sbp commented Oct 9, 2025

There is an HTML parser in the Python standard library, and the HTML will be certain to exist because we do not consume Markdown directly on the web. This is a proof of concept for which cmark is adequate; the technique will adapt itself to www-site because I modified the Markdown from www-site. I did not port it to CommonMark.

I'm happy to update this pull request if that's okay with @dave2wave.

@sebbASF
Contributor

sebbASF commented Oct 9, 2025

Surely the code to create the JSON file would run as part of building the site?

In which case Markdown is one of the expected input formats; indeed, the Makefile consumes Markdown input to create the HTML.

Further, if there is any change in the way HTML is generated from Markdown, then the HTML parser may have to be updated.

@dave2wave
Member Author

@sbp The resolved.md file is the responsibility of the VP, Legal. We should not be making a copy of it. If we can annotate the links in that file and somehow script the www-site pelican build to produce an updated JSON file, that would be ideal. We can discuss what that might take tomorrow, but in general scripts can be run either before or after the pelican build via configuration: https://github.com/apache/www-site/blob/main/pelicanconf.yaml#L20-L23

@dave2wave
Member Author

Another approach would be to run a GitHub workflow whenever resolved.md is changed. There is an example here: https://github.com/apache/www-site/blob/main/.github/workflows/members_check.yml

@sbp

sbp commented Oct 10, 2025

@sebbASF Since the HTML is always available to users of the website, the HTML is always generated by the build process, and therefore the build process always has access to the HTML to generate the JSON. The build process can use the Markdown or it can use the HTML, always. Therefore it is a choice, and I believe that the better choice is to parse from the HTML because Python contains an HTML parser in its standard library but does not contain a Markdown parser.

There is no chance that "the HTML parser may have to be updated" because I chose the title attribute format "Category C: SPDX-IDENTIFIER" very carefully to be self-documenting. Could somebody add a "Category C: SPDX-IDENTIFIER" formatted title to a link, unaware of the local structured data convention, and intend it to mean something different to our convention? I do not think that other meanings are possible. Perhaps I misunderstand your concern about our having to update the HTML parser?

@dave2wave Apologies, I didn't know where else to put the file for contributing to this pull request. As it is your pull request I did not want to contribute the file directly here, and we need the data in the ATR anyway, both of which reasons motivated me to add the full flow to the ATR for now. I understand that ASF Legal should be responsible for the maintenance of this structured data. I think a pelican configuration option, which is tied only to an open source technology that we use, would be preferable to a GitHub action.

@sbp

sbp commented Oct 10, 2025

By the way, I still have a slight preference for generating the Markdown from the JSON, which we could do as a pelican pre-build step. This would result in a fairly substantial restructuring of the page, however, so it would make our changes harder to review for consistency with the existing policy, would require more extensive tooling, and I think would also be harder to edit than the structured Markdown titles convention. These are the tradeoffs to consider.
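
To make the tradeoff concrete, here is a minimal sketch of the JSON-to-Markdown direction. The JSON shape (category letter mapped to entries with `name`, `url`, and `spdx` keys) is purely hypothetical and is not the schema of either JSON file discussed in this thread; the sketch also says nothing about how it would be wired into a pelican pre-build step.

```python
def render_markdown(data):
    """Render a Markdown list per category, keeping the title convention on each link."""
    lines = []
    for category in sorted(data):
        lines.append(f"## Category {category}")
        lines.append("")
        for entry in data[category]:
            ids = ", ".join(entry["spdx"])
            title = f"Category {category}: {ids}"
            # Markdown link with a title: [text](url "title")
            lines.append(f'- [{entry["name"]}]({entry["url"]} "{title}")')
        lines.append("")
    return "\n".join(lines)


# Hypothetical input data, for illustration only.
SAMPLE_DATA = {
    "A": [
        {
            "name": "Apache License 2.0",
            "url": "https://example.org/apache-2.0",
            "spdx": ["Apache-2.0"],
        }
    ],
    "X": [
        {
            "name": "Example Copyleft License",
            "url": "https://example.org/copyleft",
            "spdx": ["LicenseRef-Example"],
        }
    ],
}

markdown = render_markdown(SAMPLE_DATA)
```

Even this small version shows the restructuring cost: the explanatory prose that surrounds each license on the existing page has no obvious home in a flat data structure.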

@sebbASF
Contributor

sebbASF commented Oct 10, 2025

JSON can be used to generate the output by using the features of EZT in combination with Markdown, which is already supported. No extra tooling needed. Comparison of the generated HTML can be used to check that the policy is consistent.
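
One way such a comparison might look, as a sketch using only the standard library: normalize whitespace in the current and the generated HTML, then produce a unified diff. The function name `html_diff` and the sample inputs are hypothetical, and a real check might need to normalize more than indentation.

```python
import difflib


def html_diff(old_html, new_html):
    """Return unified-diff lines between two HTML documents, ignoring indentation and blank lines."""
    old_lines = [line.strip() for line in old_html.splitlines() if line.strip()]
    new_lines = [line.strip() for line in new_html.splitlines() if line.strip()]
    return list(
        difflib.unified_diff(
            old_lines, new_lines, fromfile="current", tofile="generated", lineterm=""
        )
    )


# Hypothetical inputs: the same list with different indentation, then with an extra item.
CURRENT = "<ul>\n  <li>Apache-2.0</li>\n</ul>\n"
SAME = "<ul>\n<li>Apache-2.0</li>\n</ul>"
CHANGED = "<ul>\n<li>Apache-2.0</li>\n<li>MIT</li>\n</ul>"

no_changes = html_diff(CURRENT, SAME)
changes = html_diff(CURRENT, CHANGED)
```

An empty diff means the generated page matches the current one line for line after normalization; any output lines point reviewers at exactly what changed.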

@dave2wave
Member Author

dave2wave commented Oct 10, 2025

  1. If we are generating from JSON, then we can use EZMD and populate a data dictionary directly from the JSON using the asfdata plugin.
  2. The resulting HTML must have the same appearance as what resolved.md currently produces.

@dave2wave marked this pull request as draft on October 10, 2025, 16:34
@dave2wave closed this on Oct 10, 2025
@dave2wave deleted the legal-resolved-json branch on October 10, 2025, 16:35
@dave2wave
Member Author

I renamed the branch, which required a new PR; see #573.
