A Python class to help download ComStock data locally for analysis. The ComStockProcessor class provides an easy interface to download metadata and time series data from the ComStock dataset hosted on AWS S3.
Install dependencies with Poetry:

```shell
pip install poetry
poetry install
```

The ComStockProcessor class is located in lib/comstock_processor.py and provides methods to download and process ComStock building data.
```python
from pathlib import Path
from lib.comstock_processor import ComStockProcessor

# Initialize the processor
processor = ComStockProcessor(
    state="CA",                            # 2-letter state abbreviation
    county_name="All",                     # County name or "All"
    building_type="All",                   # Building type or "All"
    upgrade="0",                           # Upgrade identifier (0 = baseline)
    base_dir=Path("./datasets/comstock"),  # Local directory to save data
)
```

`process_metadata` downloads and processes ComStock metadata, filtering it according to the constraints set on the class.
- Downloads the baseline metadata parquet file if not already present
- Filters by state, county, and building type as specified during initialization
- Saves filtered results as a CSV file
- Returns a pandas DataFrame with the filtered metadata
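Conceptually, the filtering step works like the sketch below. Both the function and the column names (`in.state`, `in.county_name`, `in.comstock_building_type`) are assumptions for illustration, not the actual implementation in `lib/comstock_processor.py`; check the metadata schema before relying on them:

```python
import pandas as pd

def filter_metadata(df: pd.DataFrame, state: str, county_name: str = "All",
                    building_type: str = "All") -> pd.DataFrame:
    """Apply the state/county/building-type constraints to metadata rows.

    "All" means the corresponding constraint is not applied.
    """
    mask = df["in.state"] == state
    if county_name != "All":
        mask &= df["in.county_name"] == county_name
    if building_type != "All":
        mask &= df["in.comstock_building_type"] == building_type
    return df[mask]
```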
`process_building_time_series` downloads time series data for the buildings listed in the input DataFrame using parallel execution.
- Uses multi-threading to download building time series files efficiently
- Skips downloading files that already exist locally
- Downloads from the ComStock AWS S3 bucket
- Returns paths and building IDs of downloaded files
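The parallel-download-with-caching pattern can be sketched as follows. This is not the processor's actual internals: `fetch` stands in for the real S3 download, and the `{building_id}-0.parquet` file-name pattern is a placeholder assumption:

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def download_all(building_ids, save_dir: Path, fetch, max_workers=8):
    """Download one file per building in parallel, skipping existing files.

    `fetch(building_id, dest)` is a caller-supplied callable that writes a
    single building's time series file to `dest`.
    """
    save_dir.mkdir(parents=True, exist_ok=True)

    def worker(bldg_id):
        dest = save_dir / f"{bldg_id}-0.parquet"
        if dest.exists():  # smart caching: reuse the local copy
            return dest, bldg_id
        fetch(bldg_id, dest)
        return dest, bldg_id

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(worker, building_ids))
    paths = [p for p, _ in results]
    ids = [b for _, b in results]
    return paths, ids
```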
```python
from pathlib import Path
from lib.comstock_processor import ComStockProcessor

# Set up directories
base_dir = Path("./datasets/comstock")
timeseries_dir = base_dir / "timeseries"
for d in [base_dir, timeseries_dir]:
    d.mkdir(parents=True, exist_ok=True)

# Initialize processor for California data
processor = ComStockProcessor(
    state="CA",
    county_name="All",
    building_type="All",
    upgrade="0",
    base_dir=base_dir,
)

# Download and filter metadata
metadata_df = processor.process_metadata(save_dir=base_dir)

# Download time series data for buildings in metadata
paths, building_ids = processor.process_building_time_series(
    metadata_df,
    save_dir=timeseries_dir,
)
```

The processor downloads data from the ComStock dataset hosted on AWS S3:
- Base URL: `https://oedi-data-lake.s3.amazonaws.com/nrel-pds-building-stock/end-use-load-profiles-for-us-building-stock/2024/comstock_amy2018_release_1/`
- Data Explorer: OpenEI Data Lake Explorer
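As an illustration, a per-building time series URL could be composed from the base URL like this. The `timeseries_individual_buildings/by_state/...` path layout is an assumption based on the OEDI data lake's typical structure, not confirmed by this project; verify it in the Data Explorer:

```python
BASE_URL = (
    "https://oedi-data-lake.s3.amazonaws.com/nrel-pds-building-stock/"
    "end-use-load-profiles-for-us-building-stock/2024/"
    "comstock_amy2018_release_1/"
)

def timeseries_url(building_id: int, state: str, upgrade: str = "0") -> str:
    # Assumed path layout: partitioned by upgrade and state, one parquet
    # file per building.
    return (
        f"{BASE_URL}timeseries_individual_buildings/by_state/"
        f"upgrade={upgrade}/state={state}/{building_id}-{upgrade}.parquet"
    )
```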
- Parallel Downloads: Uses ThreadPoolExecutor for concurrent file downloads
- Smart Caching: Skips downloading files that already exist locally
- Progress Tracking: Shows download progress with tqdm progress bars
- Efficient Filtering: Uses pandas parquet filtering for large datasets
The ComStock processor includes comprehensive unit and integration tests that validate the downloading and processing functionality.
Run specific test categories:

```shell
# Unit tests only (fast)
poetry run pytest tests/ -m "unit" -v

# Integration tests (downloads small datasets)
poetry run pytest tests/ -m "integration" -v

# All tests including large dataset downloads
TEST_DATA=true poetry run pytest tests/ -m "integration" -v

# Run all tests
poetry run pytest tests -v
```

- Unit tests: Fast tests that verify initialization and basic functionality
- Integration tests: Tests that download and process real ComStock data
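One way the `TEST_DATA` environment gate could be wired up in the test suite (an assumption shown for illustration; the test names and marker usage here are hypothetical):

```python
import os
import pytest

# Skip large-download tests unless TEST_DATA=true is set in the environment.
requires_large_data = pytest.mark.skipif(
    os.environ.get("TEST_DATA", "").lower() != "true",
    reason="set TEST_DATA=true to run large dataset downloads",
)

@requires_large_data
@pytest.mark.integration
def test_full_state_download():
    ...  # would download and validate a full state's time series files
```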
Before pushing changes to GitHub, run pre-commit to format the code consistently:

```shell
pre-commit run --all-files
```

If this doesn't work, try:

```shell
poetry update
poetry run pre-commit run --all-files
```