Include mapped variant data and README in public data dump #664

@bencap

Description

The public data dump script (src/mavedb/scripts/export_public_data.py) currently exports metadata (main.json), score/count CSVs, and a license file. It does not include mapped variant data (VRS alleles, mapped HGVS, etc.), even though this data is available via GET /api/v1/score-sets/{urn}/mapped-variants.

We should include mapped variant JSON in the data dump so that downstream consumers have access to post-mapped VRS representations without needing to call the live API.

Proposed Changes

  1. Add mapped variant data to the dump
    For each published score set that has completed mapping, export its mapped variant data (the same payload returned by GET /api/v1/score-sets/{urn}/mapped-variants) as a JSON file in the archive, e.g.:

        mapped/tmp:00000001-a-1.mapped-variants.json

    Each file should contain the current mapped variants for that score set, including pre_mapped and post_mapped VRS allele JSON, HGVS columns, and VRS version metadata.

  2. Add a README to the archive
    Add a README.md (or README.txt) to the root of the dump archive that documents:
  • What is included in the dump (metadata JSON, score CSVs, count CSVs, mapped variant JSON, license)
  • The structure/layout of the archive directory
  • A brief description of each file type and its format
  • Any caveats (e.g. only CC0-licensed published data is included, only current mapped variants are exported)
  • A link back to MaveDB and the API documentation for further reference
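
A rough sketch of how both changes could fit together, using only the standard library. This is not the existing export script: `build_dump`, `mapped_variants_path`, and `README_TEXT` are hypothetical names, the README wording is a placeholder, and the `current` key on each mapped variant record is an assumption about the payload shape. The archive path follows the example layout given above.

```python
import io
import json
import zipfile

# Placeholder README body; the final wording is up to the maintainers.
README_TEXT = """\
MaveDB public data dump

Contents:
- main.json: metadata for the exported records
- score and count CSV files
- mapped/<urn>.mapped-variants.json: mapped variant JSON per score set
- license file

Caveats:
- Only CC0-licensed published data is included.
- Only current mapped variants are exported.
"""


def mapped_variants_path(urn: str) -> str:
    """Archive path for one score set's mapped variants (layout from the issue)."""
    return f"mapped/{urn}.mapped-variants.json"


def build_dump(mapped_by_urn: dict[str, list[dict]]) -> bytes:
    """Build an in-memory zip holding a README and one JSON file per mapped score set.

    `mapped_by_urn` maps a score set URN to its mapped variant records.
    Records are assumed (not confirmed) to carry a boolean `current` key.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("README.md", README_TEXT)
        for urn, records in mapped_by_urn.items():
            # Keep only the current mapping for each variant.
            current = [r for r in records if r.get("current")]
            if not current:
                # Score sets without completed mapping get no file.
                continue
            archive.writestr(mapped_variants_path(urn), json.dumps(current, indent=2))
    return buffer.getvalue()
```

In the real script the records would presumably come from the database rather than a dict, and the files would be written alongside the existing main.json and CSV entries; the sketch only illustrates the proposed archive layout and the "current only" filter.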

Metadata

Assignees

No one assigned

    Labels

    app: backend (Task implementation touches the backend)
    type: enhancement (Enhancement to an existing feature)
    type: maintenance (Maintaining this project)
