Dataset-conversion#5

Merged
AN0DA merged 13 commits into main from dataset-conversion
Dec 7, 2025

@AN0DA AN0DA commented Dec 7, 2025

📋 Summary

This PR introduces a comprehensive access layer that automatically converts API responses to pandas DataFrames, significantly improving the developer experience for data analysis workflows. The access layer sits on top of the existing API clients and provides automatic data normalization, column renaming, and type inference, making LDB data immediately ready for analysis. Additionally, this PR enhances API clients with improved parameter handling, pagination controls, and configuration flexibility.

🎯 Purpose & Context

The Local Data Bank (LDB) API returns data in JSON format with camelCase field names and nested structures, which requires manual conversion and normalization before analysis. This PR addresses this by introducing a dedicated access layer that:

  • Automatically converts API responses to pandas DataFrames
  • Normalizes column names from camelCase to snake_case
  • Infers proper data types (integers, floats, booleans)
  • Flattens nested data structures into tabular format

This change enables users to work with LDB data more efficiently, reducing boilerplate code and making the library more accessible to data analysts and scientists.
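The renaming and flattening steps listed above can be sketched with the standard library alone. This is a minimal illustration, not the code in `BaseAccess`: the helper names `camel_to_snake` and `flatten`, the underscore-joined column naming, and the sample payload are all assumptions made for the example.

```python
import re
from typing import Any


def camel_to_snake(name: str) -> str:
    """Convert a camelCase identifier to snake_case."""
    # Insert "_" at every boundary between a lowercase letter or digit and an
    # uppercase letter, then lowercase the whole string.
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()


def flatten(record: dict[str, Any], parent: str = "") -> dict[str, Any]:
    """Flatten nested dicts into one level, joining key paths with underscores."""
    flat: dict[str, Any] = {}
    for key, value in record.items():
        column = camel_to_snake(key)
        if parent:
            column = f"{parent}_{column}"
        if isinstance(value, dict):
            # Recurse into nested objects, carrying the column prefix along.
            flat.update(flatten(value, column))
        else:
            flat[column] = value
    return flat


# Hypothetical payload; the real LDB response shape may differ.
response = {"results": [{"levelId": 1, "name": "Poland", "parent": {"unitId": "000"}}]}
rows = [flatten(item) for item in response["results"]]
```

A list of flat dictionaries like `rows` is a shape that `pandas.DataFrame(rows)` accepts directly, which is one plausible way to reach the tabular result described here.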

🔧 Changes Made

Access Layer Implementation

  • New pyldb.access module: Introduced a complete access layer with classes for all API endpoints:
    • AggregatesAccess, AttributesAccess, DataAccess, LevelsAccess, MeasuresAccess, SubjectsAccess, UnitsAccess, VariablesAccess, YearsAccess
  • BaseAccess class: Provides common functionality for DataFrame conversion, column renaming, and data normalization
  • Dual interface design: Main LDB client now exposes both:
    • Access layer (default): ldb.levels, ldb.data, etc. → Returns DataFrames
    • API layer: ldb.api.levels, ldb.api.data, etc. → Returns raw dictionaries

API Client Enhancements

  • Parameter improvements:
    • Renamed year → years across methods for consistency (supports multiple years)
    • Renamed variable_id → variable_ids in data retrieval methods (supports lists)
    • Removed all_pages parameter in favor of max_pages for clearer pagination control
  • Format enum: Introduced Format enum (JSON, JSONAPI, XML) for response format handling
  • Default page size: Added configurable page_size parameter (default: 100) for paginated requests
  • Enhanced request handling: Improved parameter and header management across all API clients
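To make the interaction between page_size and max_pages concrete, here is a rough sketch of one way such a loop could behave. It is not the library's implementation: `fetch_pages`, `fake_fetch`, zero-based page numbers, and the choice to reject negative values are assumptions. Only the default of 100 and "None means all pages" come from this PR.

```python
from typing import Callable, Optional

DEFAULT_PAGE_SIZE = 100  # default page size stated in this PR


def fetch_pages(
    fetch_page: Callable[[int, int], list],
    page_size: int = DEFAULT_PAGE_SIZE,
    max_pages: Optional[int] = None,
) -> list:
    """Collect results page by page until the source is exhausted or max_pages is hit."""
    if max_pages is not None and max_pages < 0:
        # Assumption: negative values are treated as a caller error.
        raise ValueError("max_pages must be None or a non-negative integer")
    results: list = []
    page = 0
    while max_pages is None or page < max_pages:
        batch = fetch_page(page, page_size)
        results.extend(batch)
        if len(batch) < page_size:
            # A short (or empty) page means there is nothing left to fetch.
            break
        page += 1
    return results


# Hypothetical in-memory data source standing in for an HTTP call.
DATA = list(range(250))


def fake_fetch(page: int, page_size: int) -> list:
    start = page * page_size
    return DATA[start:start + page_size]
```

In this sketch `max_pages=0` returns an empty list without issuing any request. Whether the real clients behave that way is one of the edge cases flagged for review.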

Configuration & Client Updates

  • Flexible initialization: LDB client now accepts None, dict, or LDBConfig instances
  • Configuration enhancements: Added page_size and default format to LDBConfig
  • Environment variable support: Enhanced handling for configuration overrides
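The flexible initialization described above could look roughly like the following. Treat it as a sketch under stated assumptions: `ConfigSketch` is a hypothetical stand-in for LDBConfig, `coerce_config` and `page_size_from_env` are invented helper names, and the environment variable name `LDB_PAGE_SIZE` is a placeholder, since the PR text does not give the real one.

```python
import os
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class ConfigSketch:
    """Hypothetical stand-in for LDBConfig; field names follow the PR text."""
    api_key: Optional[str] = None
    page_size: int = 100


def coerce_config(config: Union[None, dict, ConfigSketch]) -> ConfigSketch:
    """Accept None, a dict of settings, or a ready-made config instance."""
    if config is None:
        return ConfigSketch()
    if isinstance(config, dict):
        return ConfigSketch(**config)
    if isinstance(config, ConfigSketch):
        return config
    raise TypeError(f"Unsupported config type: {type(config).__name__}")


def page_size_from_env(default: int = 100, var: str = "LDB_PAGE_SIZE") -> int:
    """Read a page size override from the environment, rejecting invalid values."""
    raw = os.environ.get(var)  # variable name is a placeholder, not the real one
    if raw is None:
        return default
    try:
        value = int(raw)
    except ValueError as exc:
        raise ValueError(f"{var} must be an integer, got {raw!r}") from exc
    if value <= 0:
        raise ValueError(f"{var} must be positive, got {value}")
    return value
```

The two helpers are kept separate on purpose: the PR does not state whether an explicit page_size or an environment override takes precedence, so the sketch does not pick one.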

Testing Infrastructure

  • Test organization: Reorganized tests into unit/, integration/, and e2e/ directories
  • Test markers: Added custom pytest markers (@pytest.mark.unit, @pytest.mark.integration, @pytest.mark.e2e)
  • Comprehensive coverage:
    • Unit tests for all access layer classes
    • Integration tests with sample data for all endpoints
    • End-to-end workflow tests
  • Sample data: Added sample JSON responses for integration testing
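Custom markers need to be registered for `pytest -m unit` to run without unknown-marker warnings. One common way is a `pytest_configure` hook in `conftest.py`, sketched below. The PR does not say where the markers are registered (it could equally be `pyproject.toml`), and the marker descriptions here are made up for illustration.

```python
# conftest.py (sketch)

# Marker names come from this PR; the descriptions are illustrative.
MARKERS = {
    "unit": "fast, isolated tests",
    "integration": "tests that run against sample data",
    "e2e": "end-to-end workflow tests",
}


def pytest_configure(config):
    """Register the custom markers with pytest at startup."""
    for name, description in MARKERS.items():
        # "markers" is the ini option pytest reads registered markers from.
        config.addinivalue_line("markers", f"{name}: {description}")
```

With the markers registered, a test decorated with `@pytest.mark.unit` is selected by `pytest -m unit` and excluded by `pytest -m "not unit"`.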

Documentation

  • New documentation: Added comprehensive access_layer.rst documentation
  • Updated guides: Enhanced main_client.rst, api_clients.rst, and config.rst
  • Examples notebook: Added examples.ipynb with practical usage examples
  • Appendix: Added technical implementation details for developers

Dependencies & Infrastructure

  • MyST Notebook: Added for documentation support
  • CI updates: Upgraded python-semantic-release action (v9.15.0 → v10.5.2)
  • Gitignore: Updated to exclude IDE and environment files

✅ Testing

Test Coverage

  • Unit tests: 9 new test files covering all access layer classes (aggregates, attributes, data, levels, measures, subjects, units, variables, years)
  • Integration tests: 9 integration test files with realistic sample data
  • End-to-end tests: 2 workflow tests covering complete user scenarios
  • API client tests: Updated existing tests to reflect parameter changes

Test Execution

# Run all tests
pytest

# Run by category
pytest -m unit
pytest -m integration
pytest -m e2e

Manual Testing

  1. Access layer DataFrame conversion:

    import pandas as pd
    from pyldb import LDB, LDBConfig
    ldb = LDB(LDBConfig(api_key="your-key"))
    df = ldb.levels.list_levels()
    assert isinstance(df, pd.DataFrame)
    assert 'level_id' in df.columns  # camelCase → snake_case
  2. API layer still returns raw dicts:

    raw = ldb.api.levels.list_levels()
    assert isinstance(raw, dict)
  3. Parameter changes:

    • Verify years parameter accepts lists
    • Verify variable_ids parameter accepts lists
    • Verify max_pages controls pagination correctly

🚨 Breaking Changes & Migration Notes

Parameter Renames

  • year → years: Update calls to get_data_by_variable(), get_data_by_unit(), and related methods

    # Old
    ldb.api.data.get_data_by_variable(variable_id="3643", year=2021)
    
    # New
    ldb.api.data.get_data_by_variable(variable_id="3643", years=[2021])
  • variable_id → variable_ids: Update calls to get_data_by_unit() and aget_data_by_unit()

    # Old
    ldb.api.data.get_data_by_unit(unit_id="123", variable_id="3643")
    
    # New
    ldb.api.data.get_data_by_unit(unit_id="123", variable_ids=["3643"])

Removed Parameters

  • all_pages parameter: Removed from DataAPI, SubjectsAPI, UnitsAPI, and VariablesAPI
    # Old
    ldb.api.data.get_data_by_variable(variable_id="3643", all_pages=True)
    
    # New
    ldb.api.data.get_data_by_variable(variable_id="3643", max_pages=None)  # None = all pages

Migration Path

  1. For existing code using API layer: Update parameter names as shown above
  2. For new code: Consider using the access layer (default interface) for DataFrame-based workflows
  3. For advanced use cases: Continue using ldb.api.* for raw dictionary access
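For codebases with many call sites, the renames above can be applied mechanically. The helper below is a hypothetical migration aid, not part of pyldb: `migrate_kwargs` and its `by_unit` flag are invented for this sketch. It follows the mappings documented in this PR and drops `all_pages=False` without substituting a value, because the PR does not state the default of `max_pages`.

```python
from typing import Any


def migrate_kwargs(kwargs: dict[str, Any], *, by_unit: bool = False) -> dict[str, Any]:
    """Translate pre-PR keyword arguments into the renamed ones.

    Set by_unit=True for get_data_by_unit()/aget_data_by_unit(), the only
    methods where variable_id became variable_ids.
    """
    new = dict(kwargs)  # never mutate the caller's dict
    if "year" in new:
        # A single year becomes a one-element list.
        new["years"] = [new.pop("year")]
    if by_unit and "variable_id" in new:
        new["variable_ids"] = [new.pop("variable_id")]
    if "all_pages" in new:
        if new.pop("all_pages"):
            # all_pages=True maps to max_pages=None (fetch everything).
            new["max_pages"] = None
        # all_pages=False is dropped; the callee's own default then applies.
    return new


old_call = {"unit_id": "123", "variable_id": "3643", "year": 2021}
new_call = migrate_kwargs(old_call, by_unit=True)
```

The result can be splatted into the new signature, for example `ldb.api.data.get_data_by_unit(**new_call)`.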

🔍 Review Focus Areas

Critical Review Points

  1. DataFrame conversion logic: Verify correctness of nested data flattening in BaseAccess._to_dataframe()
  2. Column renaming: Check that _column_renames mappings are correctly applied across all access classes
  3. Pagination handling: Ensure max_pages logic correctly handles edge cases (None, 0, negative values)
  4. Type inference: Validate that data types are correctly inferred from API responses
  5. Backward compatibility: Confirm that API layer changes don't break existing integrations

Performance Considerations

  • DataFrame conversion overhead for large responses
  • Memory usage with nested data flattening
  • Pagination efficiency with max_pages parameter

Security & Configuration

  • Verify API key handling remains secure
  • Check that environment variable overrides work correctly
  • Validate rate limiting still functions properly

📦 Dependencies & Side Effects

New Dependencies

  • myst-nb: Added for MyST Notebook support in documentation

Updated Dependencies

  • python-semantic-release: Upgraded in CI workflow (v9.15.0 → v10.5.2)

Side Effects

  • Import paths: No breaking changes to public API imports
  • Configuration: New optional page_size and format config parameters (backward compatible)
  • Test organization: Tests moved into tests/unit/, tests/integration/, and tests/e2e/ directories (does not affect runtime)

🚀 Deployment Notes

Pre-Deployment Checklist

  • Verify all tests pass in CI
  • Update CHANGELOG.md with breaking changes
  • Update version number (semantic versioning)
  • Review documentation for accuracy

Post-Deployment

  • Documentation: New documentation will be available at /docs/access_layer.html
  • Examples: Jupyter notebook examples available in docs/examples.ipynb
  • User communication: Consider announcing the new access layer in release notes

Environment Considerations

  • No database migrations required
  • No infrastructure changes needed
  • Existing API-layer code keeps working once the renamed and removed parameters listed under Breaking Changes are updated

📊 Statistics

  • Files changed: 111 files
  • Lines added: ~9,623 insertions
  • Lines removed: ~1,311 deletions
  • Net change: +8,312 lines
  • New test files: 20+ test files
  • New access classes: 9 classes
  • Documentation pages: 4 new/updated pages

AN0DA added 13 commits November 30, 2025 18:38
- Upgraded the python-semantic-release action from v9.15.0 to v10.5.2 to leverage new features and improvements.
- Updated the LDB client to support a more flexible configuration input, allowing for `None` and dictionary types.
- Introduced an enrichment registry for managing data sources and improved the access layer to return DataFrames.
- Added a sentinel value in `LDBConfig` to differentiate between "not provided" and "None" for the API key.
- Enhanced quota handling in the API client to support custom quotas and improved rate limiting logic for registered and anonymous users.
- Added .envrc and .vscode/ to the .gitignore to prevent tracking of environment configuration and IDE-specific files.
- Included dev/ directory to ignore development-related files.
…uest handling

- Updated API client methods across multiple modules to include new parameters for language, format, and conditional request headers.
- Introduced centralized handling of API parameters and headers to streamline request preparation.
- Enhanced list and get methods to support pagination and sorting options, improving data retrieval flexibility.
- Updated documentation strings to reflect new parameters and usage examples for better clarity.
…t handling

- Added a new `Format` enum to define supported response formats (JSON, JSONAPI, XML).
- Updated `LDBConfig` to include a default response format, enhancing configuration flexibility.
- Modified API client methods across various modules to utilize the new format handling, defaulting to the config settings.
- Improved documentation to reflect changes in expected parameters for language and format in API methods.
…stency

- Updated the parameter name from 'year' to 'years' across multiple API methods in the DataAPI, UnitsAPI, and VariablesAPI classes to better reflect that multiple years can be specified.
- Adjusted corresponding documentation strings to ensure clarity regarding the new parameter name.
- Enhanced consistency in parameter naming across the codebase.
- Introduced a new constant `DEFAULT_PAGE_SIZE` set to 100 for pagination.
- Updated `LDBConfig` to include a `page_size` attribute, allowing customization of the default page size.
- Enhanced environment variable handling to allow overriding the default page size, with error handling for invalid values.
- Updated documentation to reflect the new `page_size` parameter in the configuration.
…gination logic

- Eliminated the 'all_pages' parameter from DataAPI, SubjectsAPI, UnitsAPI, and VariablesAPI classes to simplify pagination handling.
- Updated methods to use 'max_pages' for controlling pagination, with clear documentation on its usage.
- Adjusted logic to fetch results based on 'max_pages' value, ensuring consistent behavior across API methods.
- Enhanced documentation to clarify the new pagination approach and parameters.
- Introduced a new access layer for various API endpoints, including aggregates, attributes, data, levels, measures, subjects, units, variables, and years.
- Each access class is designed to convert API responses into pandas DataFrames, enhancing data manipulation capabilities.
- Added methods for listing and retrieving data, with support for pagination and metadata retrieval.
- Improved documentation to clarify usage and functionality of the new access layer classes.
…lients

- Introduced comprehensive end-to-end tests for access layer workflows, ensuring correct data retrieval and handling.
- Added integration tests for various access classes, including AggregatesAccess, AttributesAccess, DataAccess, LevelsAccess, MeasuresAccess, SubjectsAccess, UnitsAccess, and VariablesAccess, validating their functionality with sample data.
- Implemented unit tests for API clients, enhancing coverage for asynchronous and synchronous operations.
- Included sample data files to support integration tests, ensuring realistic scenarios for testing.
- Improved overall test structure and organization for better maintainability and clarity.
- Included MyST Notebook as a dependency for documentation.
- Introduced custom test markers for unit, integration, and end-to-end tests to enhance test categorization and organization.
- Updated dependencies
- Introduced detailed documentation for the new access layer, highlighting its features such as automatic DataFrame conversion, column name normalization, and nested data flattening.
- Updated API clients documentation to clarify the distinction between the access layer and API layer, emphasizing the benefits of using the access layer for data analysis.
- Added examples and usage scenarios to enhance user understanding and facilitate quick start with the library.
- Included technical implementation details in the appendix for developers and power users.
…ethods

- Changed the parameter name from `variable_id` to `variable_ids` in the `get_data_by_unit` and `aget_data_by_unit` methods to support multiple variable IDs as a list.
- Updated corresponding documentation and test cases to reflect this change, ensuring consistency across the API.
- Cleaned up unnecessary whitespace in several files for improved code readability.
@AN0DA AN0DA merged commit a78e0ed into main Dec 7, 2025
14 of 16 checks passed

github-actions bot commented Dec 7, 2025

Test Results (Python 3.13)

426 tests  +176   417 ✅ +167   4s ⏱️ ±0s
  1 suites ±  0     9 💤 +  9 
  1 files   ±  0     0 ❌ ±  0 

Results for commit 75f84e6. ± Comparison against base commit defc104.

This pull request removes 250 and adds 426 tests. Note that renamed tests count towards both.
tests.api.test_api_aggregates ‑ test_get_aggregate
tests.api.test_api_aggregates ‑ test_get_aggregate_error
tests.api.test_api_aggregates ‑ test_get_aggregates_metadata
tests.api.test_api_aggregates ‑ test_get_aggregates_metadata_error
tests.api.test_api_aggregates ‑ test_list_aggregates
tests.api.test_api_aggregates ‑ test_list_aggregates_error
tests.api.test_api_aggregates ‑ test_list_aggregates_extra_query
tests.api.test_api_aggregates ‑ test_list_aggregates_with_sort
tests.api.test_api_aggregates_async ‑ test_aget_aggregate
tests.api.test_api_aggregates_async ‑ test_aget_aggregate_error
…
tests.e2e.test_access_workflows.TestAccessWorkflows ‑ test_async_workflow
tests.e2e.test_access_workflows.TestAccessWorkflows ‑ test_data_by_variable_workflow
tests.e2e.test_access_workflows.TestAccessWorkflows ‑ test_get_level_workflow
tests.e2e.test_access_workflows.TestAccessWorkflows ‑ test_list_levels_workflow
tests.e2e.test_access_workflows.TestAccessWorkflows ‑ test_list_subjects_workflow
tests.e2e.test_access_workflows.TestAccessWorkflows ‑ test_pagination_workflow
tests.e2e.test_client_workflows.TestClientWorkflows ‑ test_access_vs_api_layer
tests.e2e.test_client_workflows.TestClientWorkflows ‑ test_api_layer_access
tests.e2e.test_client_workflows.TestClientWorkflows ‑ test_client_initialization
tests.integration.access.test_access_with_api_client.TestAccessWithAPIClient ‑ test_async_methods_call_async_api
…


github-actions bot commented Dec 7, 2025

Test Results (Python 3.11)

426 tests  +176   417 ✅ +167   5s ⏱️ -1s
  1 suites ±  0     9 💤 +  9 
  1 files   ±  0     0 ❌ ±  0 

Results for commit 75f84e6. ± Comparison against base commit defc104.

This pull request removes 250 and adds 426 tests. Note that renamed tests count towards both.
tests.api.test_api_aggregates ‑ test_get_aggregate
tests.api.test_api_aggregates ‑ test_get_aggregate_error
tests.api.test_api_aggregates ‑ test_get_aggregates_metadata
tests.api.test_api_aggregates ‑ test_get_aggregates_metadata_error
tests.api.test_api_aggregates ‑ test_list_aggregates
tests.api.test_api_aggregates ‑ test_list_aggregates_error
tests.api.test_api_aggregates ‑ test_list_aggregates_extra_query
tests.api.test_api_aggregates ‑ test_list_aggregates_with_sort
tests.api.test_api_aggregates_async ‑ test_aget_aggregate
tests.api.test_api_aggregates_async ‑ test_aget_aggregate_error
…
tests.e2e.test_access_workflows.TestAccessWorkflows ‑ test_async_workflow
tests.e2e.test_access_workflows.TestAccessWorkflows ‑ test_data_by_variable_workflow
tests.e2e.test_access_workflows.TestAccessWorkflows ‑ test_get_level_workflow
tests.e2e.test_access_workflows.TestAccessWorkflows ‑ test_list_levels_workflow
tests.e2e.test_access_workflows.TestAccessWorkflows ‑ test_list_subjects_workflow
tests.e2e.test_access_workflows.TestAccessWorkflows ‑ test_pagination_workflow
tests.e2e.test_client_workflows.TestClientWorkflows ‑ test_access_vs_api_layer
tests.e2e.test_client_workflows.TestClientWorkflows ‑ test_api_layer_access
tests.e2e.test_client_workflows.TestClientWorkflows ‑ test_client_initialization
tests.integration.access.test_access_with_api_client.TestAccessWithAPIClient ‑ test_async_methods_call_async_api
…
