Add dataversion utility for extracting version info from data files #17
Merged
Add new data_version module and dataversion CLI binary to extract version metadata from IPUMS data files:

- For parquet: reads all key-value metadata from file_metadata, counts variables, ignores samples/datasets, outputs everything else as version info
- For fixed-width: reads system variables (record type '#') from layout, parses first line of compressed data file to extract values

Features:

- Auto-detects file format from path (.parquet/.dat.gz)
- Recognizes parquet datasets by convention (parent dir named "parquet")
- Handles both single parquet files and partitioned parquet datasets
- Dynamic metadata extraction - no hardcoded field names
- Output formats: human-readable text or JSON
- Reuses existing DatasetLayout parser for fixed-width layouts

Usage:

    dataversion /path/to/parquet/us2015b
    dataversion /path/to/us2015b_usa.dat.gz
    dataversion --format json /path/to/data

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add new data_version module and dataversion CLI binary to extract version metadata from IPUMS data files.
Most IPUMS fixed-width files will only have 'ZZZZ...' values for the version variables, since Workspaces requires that these not vary unless the data changes; this supports the data checksum == data version behavior. You'd mainly use the tool on the parquet data.
It's meant to support reporting on what's deployed to servers, but could be used by various deployment or data reproducibility tasks.
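For the fixed-width case noted above, here is a minimal sketch of slicing system-variable values out of the first line of a data file and flagging the all-'Z' placeholder values. It is an illustration only, not the code in this PR: the `SystemVar` struct, the function names, and the 1-based start column are assumptions, and the real `DatasetLayout` parser may represent fields differently.

```rust
/// Hypothetical layout entry (not the real DatasetLayout type):
/// variable name, 1-based start column, and width in characters.
pub struct SystemVar {
    pub name: &'static str,
    pub start: usize,
    pub width: usize,
}

/// Slice each variable's value out of the first line of the data file.
/// A value is `None` when the line is too short for that column range.
pub fn extract_values(line: &str, vars: &[SystemVar]) -> Vec<(String, Option<String>)> {
    vars.iter()
        .map(|v| {
            // Convert the assumed 1-based start column to a 0-based offset.
            let begin = v.start.saturating_sub(1);
            let value = line
                .get(begin..begin + v.width)
                .map(|s| s.trim().to_string());
            (v.name.to_string(), value)
        })
        .collect()
}

/// True when a value consists only of 'Z' characters, i.e. it looks like
/// the 'ZZZZ...' placeholder rather than real version information.
pub fn is_placeholder(value: &str) -> bool {
    !value.is_empty() && value.chars().all(|c| c == 'Z')
}
```

A caller could report placeholder values as "no version info" instead of printing the raw 'Z' run.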
Right now the data_version module repeats some functionality found in rsvp relating to parquet files, path conventions, and fixed-width handling. I plan to eventually pull the fixed-width support and multi-file parquet reading into cimdea, and then that extra code can go away.
Usage:
    dataversion /path/to/parquet/us2015b
    dataversion /path/to/us2015b_usa.dat.gz
    dataversion --format json /path/to/data
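The usage paths above rely on detecting the format from the path. A rough sketch of how that could work, using only the conventions stated in this PR (the `.parquet` and `.dat.gz` suffixes, and a parent directory named "parquet" for datasets), might look like the following. The `DataFormat` enum and `detect_format` function are hypothetical names for illustration, not the actual API of the data_version module.

```rust
use std::path::Path;

/// Hypothetical format tag; the real module's types may differ.
#[derive(Debug, PartialEq, Eq)]
pub enum DataFormat {
    ParquetFile,
    ParquetDataset,
    FixedWidth,
}

/// Guess the data format from the path alone, without touching the filesystem.
/// Returns `None` when no convention matches.
pub fn detect_format(path: &Path) -> Option<DataFormat> {
    let name = path.file_name()?.to_str()?;
    if name.ends_with(".parquet") {
        return Some(DataFormat::ParquetFile);
    }
    if name.ends_with(".dat.gz") {
        return Some(DataFormat::FixedWidth);
    }
    // Convention: a parquet dataset directory sits directly under a
    // directory named "parquet".
    let parent_name = path.parent()?.file_name()?.to_str()?;
    if parent_name == "parquet" {
        return Some(DataFormat::ParquetDataset);
    }
    None
}
```

This sketch checks names only; a real implementation would presumably also confirm that the path exists and whether it is a file or a directory.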
Partially 🤖 Generated with Claude Code