Create resolved.json a machine readable version of resolved.html #569
This added file represents a machine readable form of `resolved.md`
|
Hi @dave2wave , are the identifiers you're checking in the SPDX identifier? https://spdx.org/licenses/ |
|
Mostly. The ones prefixed by |
|
Since SBOM tools like CycloneDX use the SPDX license list, should we contribute upstream to https://github.com/spdx/license-list-XML so that we have IDs for all of these and end-user applications can verify them too? |
|
That would be ideal; in fact, reading SBOM licenses is one of the ATR use cases. Once additions are in place we can create an appropriate plan for keeping this file updated. It may take time for the SBOM-generating tools to catch up to SPDX changes. I hope that they all go to the current source at SPDX, but I don't know that. |
|
CycloneDX definitely uses the SPDX license IDs. I think SPDX and CycloneDX are the major formats for SBOMs? |
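To make the verification idea concrete, here is a minimal sketch of checking a set of identifiers against the SPDX license list. It assumes the list has already been loaded from the `json/licenses.json` file in spdx/license-list-data and, as far as I understand that file, that it has a top-level `licenses` array whose entries carry a `licenseId` field. The function name `split_by_spdx` is hypothetical.

```python
def split_by_spdx(identifiers, spdx_license_list):
    """Split identifiers into those found in the SPDX list and those that are not.

    spdx_license_list is the parsed licenses.json document (assumed layout:
    {"licenses": [{"licenseId": "..."}, ...]}).
    """
    known = {entry["licenseId"] for entry in spdx_license_list["licenses"]}
    recognized = sorted(i for i in identifiers if i in known)
    unrecognized = sorted(i for i in identifiers if i not in known)
    return recognized, unrecognized
```

Anything returned in the second list would be a candidate for an upstream SPDX submission, or would need a locally defined identifier.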
|
SPDX have a list of License Inclusion Principles that we need to adhere to for each license that we suggest to them. There is a form online for submitting new licenses, which requires that you log in using GitHub. Note that the first field requires you to defend how the license adheres to their inclusion principles. Submissions apparently have to be submitted one at a time; I could not find a way to submit in bulk. You have to follow up submission there with a GitHub issue too, and "be prepared to help create the required XML and test text files if the license is accepted for submission". |
|
I think that we are going to find inconsistencies as we validate licenses in more and more SBOMs. For example, here is a license that looks like it doesn’t follow their rules: https://github.com/spdx/license-list-data/blob/main/json/licenses.json#L1438-L1449 which references a definition here: https://github.com/apache/apr/blob/trunk/LICENSE#L298C6-L298C29 |
|
This effectively creates a copy of the list. The JSON version could be used to populate the HTML page, meaning only one file has to be updated. |
|
Generating the HTML from the structured data would ensure that the presentation is regular, and would therefore be preferable. But as part of a long tradition of enriching HTML with structured data stretching back to HyperRDF and GRDDL I have experimentally implemented the opposite approach. Specifically, I've made a new, renamed version of
This version adds title attributes to license links in the format
I tried to preserve the original Markdown page as closely as possible, but for example had to split apart a few links into separate license versions. Along the way I also updated a few of my own decisions, so the output JSON license list above differs slightly from the one in the present pull request. There are some inconsistencies in the classification logic of the original page, but I'll give that feedback elsewhere. |
|
Why extract the JSON from the HTML, rather than the simpler markdown? Also, this would be easier to review and test if it were in a branch of www-site (possibly a preview/ one), rather than in an unrelated repo. |
|
There is an HTML parser in the Python standard library, and the HTML will be certain to exist because we do not consume Markdown directly on the web. This is a proof of concept for which I'm happy to update this pull request if that's okay with @dave2wave. |
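As an illustration of the standard-library approach, here is a minimal sketch that collects every link carrying a `title` attribute. It is not the script used in the proof of concept. The exact title format is not shown in this thread, so the sketch stores the raw title string without interpreting it, and the names `TitledLinkCollector` and `extract_titled_links` are hypothetical.

```python
import json
from html.parser import HTMLParser


class TitledLinkCollector(HTMLParser):
    """Collect every <a> element that has a non-empty title attribute."""

    def __init__(self):
        super().__init__()
        self.links = []        # finished records
        self._current = None   # record for the <a> currently open, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("title"):
            self._current = {
                "href": attrs.get("href"),
                "title": attrs["title"],  # raw title; its format is not assumed
                "text": "",
            }

    def handle_data(self, data):
        # Accumulate link text only while inside a titled <a>.
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self._current["text"] = self._current["text"].strip()
            self.links.append(self._current)
            self._current = None


def extract_titled_links(html_text):
    """Return a JSON-serialisable list of titled links found in html_text."""
    parser = TitledLinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.links
```

Calling `json.dumps(extract_titled_links(page), indent=2)` on the generated page would then give a starting point for the structured file.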
|
Surely the code to create the JSON file would run as part of building the site? In which case Markdown is one of the expected input formats; indeed, the Makefile consumes Markdown input to create the HTML. Further, if there is any change in the way HTML is generated from Markdown, then the HTML parser may have to be updated. |
|
@sbp The |
|
Another approach would be to run a GitHub workflow whenever |
|
@sebbASF Since the HTML is always available to users of the website, the HTML is always generated by the build process, and therefore the build process always has access to the HTML to generate the JSON. The build process can use the Markdown or it can use the HTML, always. Therefore it is a choice, and I believe that the better choice is to parse from the HTML because Python contains an HTML parser in its standard library but does not contain a Markdown parser. There is no chance that "the HTML parser may have to be updated" because I chose the
@dave2wave Apologies, I didn't know where else to put the file for contributing to this pull request. As it is your pull request I did not want to contribute the file directly here, and we need the data in the ATR anyway, both of which reasons motivated me to add the full flow to the ATR for now. I understand that ASF Legal should be responsible for the maintenance of this structured data. I think a pelican configuration option, which is tied only to an open source technology that we use, would be preferable to a GitHub action. |
|
By the way, I still have a slight preference for generating the Markdown from the JSON, which we could do as a pelican pre-build step. This would result in a fairly substantial restructuring of the page, however, so it would make our changes harder to review for consistency with the existing policy, would require more extensive tooling, and I think would also be harder to edit than the structured Markdown titles convention. These are the tradeoffs to consider. |
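For comparison, the JSON-to-Markdown direction could look roughly like the sketch below. The JSON layout (a `categories` list whose items have a `name` and a list of `licenses` with `name` and `url`) is an assumption made for illustration and is not the schema of the file in this pull request. `render_markdown` is a hypothetical name.

```python
def render_markdown(data):
    """Render an assumed {"categories": [...]} structure as Markdown sections."""
    lines = []
    for category in data["categories"]:
        lines.append(f"## {category['name']}")
        lines.append("")
        for lic in category["licenses"]:
            lines.append(f"- [{lic['name']}]({lic['url']})")
        lines.append("")
    return "\n".join(lines)
```

A real page would also need the explanatory prose between sections, which is part of why this direction implies a larger restructuring.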
|
JSON can be used to generate the output by using the features of EZT in combination with Markdown, which is already supported. No extra tooling needed. Comparison of the generated HTML can be used to check that the policy is consistent. |
|
|
I renamed the branch, which required a new PR; see #573.