Add dataversion utility for extracting version info from data files #17

Merged
ccdavis merged 1 commit into main from check-server-status
Dec 30, 2025

ccdavis commented Dec 30, 2025

Add new data_version module and dataversion CLI binary to extract version metadata from IPUMS data files:

  • For parquet: reads all key-value metadata from file_metadata, counts variables, ignores samples/datasets, outputs everything else as version info
  • For fixed-width: reads system variables (record type '#') from layout, parses first line of compressed data file to extract values
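To illustrate the parquet branch described above, here is a minimal Python sketch of the filtering step. It is not the module's actual code. The key names `samples` and `datasets`, the function name `version_info`, and the `variable_count` output key are assumptions made for the example; the PR only says that samples/datasets entries are ignored and that variables are counted.

```python
# Hypothetical key names: the PR only states that samples/datasets are ignored.
IGNORED_KEYS = {"samples", "datasets"}


def version_info(key_value_metadata: dict[str, str], variable_count: int) -> dict[str, str]:
    """Keep every metadata entry except the ignored ones and add a variable count.

    No field names are hardcoded besides the ignore list, so any new key
    written into the file metadata shows up in the result automatically.
    """
    info = {
        key: value
        for key, value in key_value_metadata.items()
        if key not in IGNORED_KEYS
    }
    # "variable_count" is an illustrative output key, not one taken from the PR.
    info["variable_count"] = str(variable_count)
    return info
```

With pyarrow, a mapping like this could be built from `pyarrow.parquet.read_metadata(path).metadata`, whose keys and values are bytes and would need decoding first.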

Most IPUMS fixed-width files will contain only 'ZZZZ...' values for the version variables, because Workspaces requires that these values not vary unless the data changes; this supports the "data checksum == data version" behavior. In practice you'd mainly use the tool on the parquet data.
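The fixed-width path can be sketched in Python as follows, under several assumptions: the `LayoutVar` type, 1-based start columns, latin-1 decoding, and the `is_placeholder` check for all-'Z' values are illustrative and not taken from the DatasetLayout parser.

```python
import gzip
from typing import NamedTuple


class LayoutVar(NamedTuple):
    """Hypothetical layout entry; the real DatasetLayout fields may differ."""
    name: str
    record_type: str
    start: int  # assumed to be a 1-based column position
    width: int


def system_variables(layout: list[LayoutVar]) -> list[LayoutVar]:
    """Select the system variables, identified by record type '#'."""
    return [v for v in layout if v.record_type == "#"]


def read_first_line(path: str) -> str:
    """Read only the first line of a gzip-compressed data file."""
    # latin-1 is an assumption; it never fails to decode a byte.
    with gzip.open(path, "rt", encoding="latin-1") as f:
        return f.readline().rstrip("\r\n")


def extract_values(line: str, variables: list[LayoutVar]) -> dict[str, str]:
    """Slice each variable's columns out of the line."""
    return {v.name: line[v.start - 1 : v.start - 1 + v.width] for v in variables}


def is_placeholder(value: str) -> bool:
    """True when a value consists solely of 'Z' characters."""
    return value != "" and set(value) == {"Z"}
```

Because only the first line is read, the cost stays small even for large compressed files.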

It's meant to support reporting on what's deployed to servers, but it could also be used by various deployment or data reproducibility tasks.
Right now the data_version module repeats some functionality found in rsvp: parquet file handling, path conventions, and fixed-width parsing. I plan to eventually move the fixed-width support and multi-file parquet reading into cimdea, at which point that extra code can go away.

Features:

  • Auto-detects file format from path (.parquet/.dat.gz)
  • Recognizes parquet datasets by convention (parent dir named "parquet")
  • Handles both single parquet files and partitioned parquet datasets
  • Dynamic metadata extraction - no hardcoded field names
  • Output formats: human-readable text or JSON
  • Reuses existing DatasetLayout parser for fixed-width layouts
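The path-based detection rules in the list above could look roughly like this in Python. The enum, the function name, and the order of the checks are assumptions. The sketch looks only at the path string, so it cannot tell a single file from a partitioned dataset by inspecting the filesystem.

```python
from enum import Enum
from pathlib import PurePosixPath


class DataFormat(Enum):
    PARQUET_FILE = "parquet file"
    PARQUET_DATASET = "parquet dataset"
    FIXED_WIDTH = "fixed-width"


def detect_format(path: str) -> DataFormat:
    """Guess the data format from the path alone (illustrative rule order)."""
    p = PurePosixPath(path)
    if p.name.endswith(".dat.gz"):
        return DataFormat.FIXED_WIDTH
    if p.suffix == ".parquet":
        return DataFormat.PARQUET_FILE
    # Convention from the PR: a dataset lives under a directory named "parquet".
    if p.parent.name == "parquet":
        return DataFormat.PARQUET_DATASET
    raise ValueError(f"cannot determine data format from path: {path}")
```

For example, `detect_format("/path/to/parquet/us2015b")` yields `DataFormat.PARQUET_DATASET`, while a path matching none of the rules raises `ValueError`.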

Usage:
  dataversion /path/to/parquet/us2015b
  dataversion /path/to/us2015b_usa.dat.gz
  dataversion --format json /path/to/data
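As a rough idea of how the two output formats relate, the Python sketch below renders the same version info either as aligned text or as JSON. The PR does not show the tool's real text layout, so the `key : value` format, the `render` function, and its `fmt` parameter are assumptions.

```python
import json


def render(info: dict[str, str], fmt: str = "text") -> str:
    """Render version info as human-readable text or as JSON."""
    if fmt == "json":
        return json.dumps(info, indent=2, sort_keys=True)
    if fmt == "text":
        # Pad keys to a common width so the values line up.
        width = max((len(key) for key in info), default=0)
        return "\n".join(
            f"{key.ljust(width)} : {value}" for key, value in sorted(info.items())
        )
    raise ValueError(f"unknown format: {fmt}")
```

Keeping both formats behind one function means scripts can consume the JSON form while people read the text form of the same data.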

Partially 🤖 Generated with Claude Code

Add new data_version module and dataversion CLI binary to extract version
metadata from IPUMS data files:

- For parquet: reads all key-value metadata from file_metadata, counts
  variables, ignores samples/datasets, outputs everything else as version info
- For fixed-width: reads system variables (record type '#') from layout,
  parses first line of compressed data file to extract values

Features:
- Auto-detects file format from path (.parquet/.dat.gz)
- Recognizes parquet datasets by convention (parent dir named "parquet")
- Handles both single parquet files and partitioned parquet datasets
- Dynamic metadata extraction - no hardcoded field names
- Output formats: human-readable text or JSON
- Reuses existing DatasetLayout parser for fixed-width layouts

Usage:
  dataversion /path/to/parquet/us2015b
  dataversion /path/to/us2015b_usa.dat.gz
  dataversion --format json /path/to/data

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
ccdavis merged commit de9d02f into main Dec 30, 2025
1 of 2 checks passed