Create resolved.json a machine readable version of resolved.html #569

dave2wave · 2025-10-08T15:22:42Z

This added file represents a machine readable form of resolved.md.

The legal committee updates this file whenever resolved.md is updated.
The Apache Trusted Releases (ATR) platform can read this file and run up-to-date license checks to ensure that releases follow ASF release policy.

jdaugherty · 2025-10-08T17:26:11Z

Hi @dave2wave, are the identifiers you're checking in SPDX identifiers? https://spdx.org/licenses/

dave2wave · 2025-10-08T17:30:32Z

Mostly. The ones prefixed by LicenseRef- are specifically not found there. @sbp can provide additional references if necessary.
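
A minimal sketch of the distinction described above, assuming the identifiers arrive as plain strings. The function name `split_identifiers` and the sample values are hypothetical; only the `LicenseRef-` prefix convention comes from the comment.

```python
def split_identifiers(identifiers):
    """Partition identifiers into SPDX-list candidates and local LicenseRef- ones."""
    spdx_candidates = []
    local_refs = []
    for identifier in identifiers:
        # Identifiers with this prefix are not expected on the SPDX license list.
        if identifier.startswith("LicenseRef-"):
            local_refs.append(identifier)
        else:
            spdx_candidates.append(identifier)
    return spdx_candidates, local_refs


# Hypothetical sample values, not taken from resolved.json.
sample = ["Apache-2.0", "MIT", "LicenseRef-Example"]
print(split_identifiers(sample))
```

A consumer could check only the first list against the SPDX license list and handle the second list with locally maintained references.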

jdaugherty · 2025-10-08T17:33:47Z

Since SBOM tools like CycloneDX use the SPDX license list, should we contribute upstream to https://github.com/spdx/license-list-XML so that we have IDs for all of these and end-user applications can verify them too?

dave2wave · 2025-10-08T17:56:36Z

That would be ideal; in fact, reading SBOM licenses is one of the ATR use cases.

Once additions are in place we can create an appropriate plan for keeping this file updated. It may take time for the SBOM-generating tools to catch up to SPDX changes. I hope that they all go to the current source at SPDX, but I don't know that.

jdaugherty · 2025-10-08T18:13:22Z

CycloneDX definitely uses the SPDX license IDs:

    CycloneDX incorporates SPDX license IDs and expressions to document stated licenses of open source components. Licenses can be expressed three ways, by SPDX license ID, by SPDX license expression, or as a license name. Zero or more licenses can be defined by ID or by name.

I think SPDX and CycloneDX are the major formats for SBOMs?
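
The three ways of stating a license quoted above could be read from a CycloneDX JSON document roughly as follows. This is a sketch under assumptions: it expects the `components[].licenses[]` layout of the CycloneDX JSON format, ignores nested components, and uses a hypothetical function name and made-up sample data.

```python
import json


def stated_licenses(bom):
    """Collect (kind, value) pairs from each component's `licenses` array."""
    found = []
    for component in bom.get("components", []):
        for entry in component.get("licenses", []):
            # An entry is either an SPDX expression or a single license object.
            if "expression" in entry:
                found.append(("expression", entry["expression"]))
                continue
            license_obj = entry.get("license", {})
            if "id" in license_obj:
                found.append(("id", license_obj["id"]))
            elif "name" in license_obj:
                found.append(("name", license_obj["name"]))
    return found


# Made-up BOM fragment for illustration only.
example_bom = json.loads("""
{
  "bomFormat": "CycloneDX",
  "specVersion": "1.5",
  "components": [
    {"type": "library", "name": "a", "licenses": [{"license": {"id": "Apache-2.0"}}]},
    {"type": "library", "name": "b", "licenses": [{"expression": "MIT OR Apache-2.0"}]},
    {"type": "library", "name": "c", "licenses": [{"license": {"name": "Example License"}}]}
  ]
}
""")
print(stated_licenses(example_bom))
```

Only the `id` entries can be compared directly with a list of identifiers; expressions would need parsing and names would need some mapping, neither of which is attempted here.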

sbp · 2025-10-08T19:00:07Z

SPDX have a list of License Inclusion Principles that we need to adhere to for each license that we suggest to them. There is a form online for submitting new licenses, which requires that you log in using GitHub. Note that the first field requires you to defend how the license adheres to their inclusion principles. Submissions apparently have to be submitted one at a time; I could not find a way to submit in bulk. You have to follow up submission there with a GitHub issue too, and "be prepared to help create the required XML and test text files if the license is accepted for submission".

dave2wave · 2025-10-08T22:13:03Z

I think that we are going to find inconsistencies as we validate licenses in more and more SBOMs. For example, here is a license that looks like it doesn't follow their rules: https://github.com/spdx/license-list-data/blob/main/json/licenses.json#L1438-L1449

It references a definition here: https://github.com/apache/apr/blob/trunk/LICENSE#L298C6-L298C29
That license is treated as Category A by the Apache APR release, but is not listed in resolved.html.

sebbASF · 2025-10-08T23:34:53Z

This effectively creates a copy of the list.
Which one is canonical? How are the copies going to be synchronised?

The JSON version could be used to populate the HTML page, meaning only one file has to be updated.
i.e. something similar to the way the ECCN data is maintained.
Alternatively, perhaps the JSON file could be automatically created from the Markdown.

sbp · 2025-10-09T10:40:20Z

Generating the HTML from the structured data would ensure that the presentation is regular, and would therefore be preferable. But as part of a long tradition of enriching HTML with structured data stretching back to HyperRDF and GRDDL I have experimentally implemented the opposite approach.

Specifically, I've made a new, renamed version of resolved.md:

atr/policy/third-party-licenses.md

This version adds title attributes to license links in the format "Category X: SPDX-IDENTIFIER[, SPDX-IDENTIFIER]". Using a Makefile, I then convert the Markdown to HTML using cmark, and use a script called extract_spdx_identifiers.py to extract the JSON:

atr/policy/third-party-licenses.json
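
The extraction step could look roughly like the following. This is not the actual `extract_spdx_identifiers.py`; it is a minimal sketch that assumes the title format given above, and the class name, function name, output shape, and sample HTML are all hypothetical.

```python
import json
import re
from html.parser import HTMLParser

# Matches titles such as "Category A: Apache-2.0" or "Category B: X, Y".
TITLE_PATTERN = re.compile(r"^Category ([A-Z]): (.+)$")


class LicenseTitleParser(HTMLParser):
    """Collect SPDX identifiers from <a title="Category X: ..."> links."""

    def __init__(self):
        super().__init__()
        self.by_category = {}

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        title = dict(attrs).get("title")
        if not title:
            return
        match = TITLE_PATTERN.match(title)
        if match is None:
            return  # an ordinary title, not the structured convention
        category, ids = match.groups()
        bucket = self.by_category.setdefault(category, [])
        for identifier in ids.split(","):
            identifier = identifier.strip()
            if identifier and identifier not in bucket:
                bucket.append(identifier)


def extract_categories(html_text):
    parser = LicenseTitleParser()
    parser.feed(html_text)
    parser.close()
    return parser.by_category


# Made-up HTML for illustration only.
sample_html = (
    '<ul>'
    '<li><a href="https://example.org/a" title="Category A: Apache-2.0">Apache License 2.0</a></li>'
    '<li><a href="https://example.org/b" title="Category A: MIT, LicenseRef-Example">Other</a></li>'
    '<li><a href="https://example.org/c">untitled</a></li>'
    '</ul>'
)
print(json.dumps(extract_categories(sample_html), indent=2))
```

Links without a title, or with a title that does not match the pattern, are skipped, so ordinary links on the page do not affect the output.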

I tried to preserve the original Markdown page as closely as possible, but for example had to split apart a few links into separate license versions. Along the way I also updated a few of my own decisions, so the output JSON license list above differs slightly from the one in the present pull request.

There are some inconsistencies in the classification logic of the original page, but I'll give that feedback elsewhere.

sebbASF · 2025-10-09T11:57:56Z

Why extract the JSON from the HTML, rather than the simpler markdown?
AIUI, the markdown syntax understood by cmark may not be exactly the same as the syntax used on the rest of the site.
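
For comparison, extracting directly from the Markdown might be sketched as below. It assumes inline links of the form `[text](url "title")` with double-quoted titles only; reference-style links and other title syntaxes are not handled, and that kind of dialect variation is the concern here. All names and sample input are hypothetical.

```python
import re

# Inline link with a double-quoted title: [text](url "title")
INLINE_LINK = re.compile(r'\[([^\]]+)\]\(([^\s)]+)[ \t]+"([^"]*)"\)')
CATEGORY_TITLE = re.compile(r"^Category ([A-Z]): (.+)$")


def extract_from_markdown(markdown_text):
    """Return (link text, category, identifiers) for each structured title."""
    results = []
    for text, _url, title in INLINE_LINK.findall(markdown_text):
        match = CATEGORY_TITLE.match(title)
        if match:
            category, ids = match.groups()
            identifiers = [part.strip() for part in ids.split(",") if part.strip()]
            results.append((text, category, identifiers))
    return results


# Made-up Markdown for illustration only.
sample_md = (
    '- [Apache License 2.0](https://example.org/a "Category A: Apache-2.0")\n'
    '- [Plain link](https://example.org/b)\n'
)
print(extract_from_markdown(sample_md))
```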

Also, this would be easier to review and test if it were in a branch of www-site (possibly a preview/ one), rather than in an unrelated repo.

sbp · 2025-10-09T13:34:38Z

There is an HTML parser in the Python standard library, and the HTML will be certain to exist because we do not consume Markdown directly on the web. This is a proof of concept for which cmark is adequate; the technique will adapt itself to www-site because I modified the Markdown from www-site. I did not port it to CommonMark.

I'm happy to update this pull request if that's okay with @dave2wave.

sebbASF · 2025-10-09T16:02:40Z

Surely the code to create the JSON file would run as part of building the site?

In which case Markdown is one of the expected input formats; indeed, the Makefile consumes Markdown input to create the HTML.

Further, if there is any change in the way HTML is generated from Markdown, then the HTML parser may have to be updated.

dave2wave · 2025-10-09T17:00:22Z

@sbp The resolved.md file is the responsibility of the VP, Legal. We should not be making a copy of it. If we can annotate the links in that file and somehow script the www-site pelican build to produce an updated json file that would be ideal. We can discuss what that might take tomorrow, but in general scripts can be run either before or after the pelican build via configuration: https://github.com/apache/www-site/blob/main/pelicanconf.yaml#L20-L23

dave2wave · 2025-10-09T17:04:09Z

Another approach would be to run a GitHub workflow whenever resolved.md is changed. There is an example here: https://github.com/apache/www-site/blob/main/.github/workflows/members_check.yml

sbp · 2025-10-10T15:45:44Z

@sebbASF Since the HTML is always available to users of the website, the HTML is always generated by the build process, and therefore the build process always has access to the HTML to generate the JSON. The build process can use the Markdown or it can use the HTML, always. Therefore it is a choice, and I believe that the better choice is to parse from the HTML because Python contains an HTML parser in its standard library but does not contain a Markdown parser.

There is no chance that "the HTML parser may have to be updated" because I chose the title attribute format "Category C: SPDX-IDENTIFIER" very carefully to be self-documenting. Could somebody add a "Category C: SPDX-IDENTIFIER" formatted title to a link, unaware of the local structured data convention, and intend it to mean something different to our convention? I do not think that other meanings are possible. Perhaps I misunderstand your concern about our having to update the HTML parser?

@dave2wave Apologies, I didn't know where else to put the file for contributing to this pull request. As it is your pull request I did not want to contribute the file directly here, and we need the data in the ATR anyway, both of which reasons motivated me to add the full flow to the ATR for now. I understand that ASF Legal should be responsible for the maintenance of this structured data. I think a pelican configuration option, which is tied only to an open source technology that we use, would be preferable to a GitHub action.

sbp · 2025-10-10T15:52:41Z

By the way, I still have a slight preference for generating the Markdown from the JSON, which we could do as a pelican pre-build step. This would result in a fairly substantial restructuring of the page, however, so it would make our changes harder to review for consistency with the existing policy, would require more extensive tooling, and I think would also be harder to edit than the structured Markdown titles convention. These are the tradeoffs to consider.
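
As a rough illustration of the JSON-to-Markdown direction, a pre-build step might render list items like this. The JSON shape (`name`, `url`, `spdx` keys grouped by category), the heading layout, and the function name are all assumptions made for the sketch; they are not the structure of the real page or of `resolved.json`.

```python
import json


def render_markdown(data):
    """Render a hypothetical category -> entries mapping as Markdown lists."""
    lines = []
    for category in sorted(data):
        lines.append(f"## Category {category}")
        lines.append("")
        for entry in data[category]:
            ids = ", ".join(entry["spdx"])
            # Keep the structured title so the HTML stays self-describing.
            lines.append(
                f'- [{entry["name"]}]({entry["url"]} "Category {category}: {ids}")'
            )
        lines.append("")
    return "\n".join(lines)


# Made-up data for illustration only.
sample_data = json.loads(
    '{"A": [{"name": "Apache License 2.0",'
    ' "url": "https://example.org/a", "spdx": ["Apache-2.0"]}]}'
)
print(render_markdown(sample_data))
```

The explanatory prose on the existing page has no place in this shape, which is part of why this direction would mean restructuring the page.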

sebbASF · 2025-10-10T16:25:42Z

JSON can be used to generate the output by using the features of EZT in combination with Markdown, which is already supported. No extra tooling needed. Comparison of the generated HTML can be used to check that the policy is consistent.

dave2wave · 2025-10-10T16:31:59Z

If we are generating from JSON, then we can use EZMD and populate a data dictionary directly from the JSON using the asfdata plugin.
The resulting HTML must have the same appearance as what resolved.md currently produces.

dave2wave · 2025-10-10T16:44:13Z

I renamed the branch, which required a new PR; see #573.

Create resolved.json

5f9c1f6

This added file represents a machine readable form of `resolved.md`

dave2wave marked this pull request as draft October 10, 2025 16:34

dave2wave closed this Oct 10, 2025

dave2wave deleted the legal-resolved-json branch October 10, 2025 16:35

dave2wave mentioned this pull request Oct 10, 2025

Add resolved.json and replace resolved.md with resolved.ezmd #573

Draft
