Dataset-conversion by AN0DA · Pull Request #5 · AN0DA/pyBDL

AN0DA · 2025-12-07T14:27:25Z

📋 Summary

This PR introduces an access layer that converts API responses to pandas DataFrames. The layer sits on top of the existing API clients and handles data normalization, column renaming, and type inference, so LDB data is ready for analysis without manual conversion. The PR also improves parameter handling, pagination controls, and configuration flexibility in the API clients.

🎯 Purpose & Context

The Local Data Bank (LDB) API returns data in JSON format with camelCase field names and nested structures, which requires manual conversion and normalization before analysis. This PR addresses this by introducing a dedicated access layer that:

Automatically converts API responses to pandas DataFrames
Normalizes column names from camelCase to snake_case
Infers proper data types (integers, floats, booleans)
Flattens nested data structures into tabular format

This change enables users to work with LDB data more efficiently, reducing boilerplate code and making the library more accessible to data analysts and scientists.
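The four steps listed above can be sketched with plain pandas. This is a minimal illustration, not the code in `pyldb.access`: the helper names `camel_to_snake` and `to_dataframe` and the sample payload are hypothetical, and the real layer may rename or type columns differently.

```python
import re

import pandas as pd


def camel_to_snake(name: str) -> str:
    """Convert a camelCase field name such as 'levelName' to 'level_name'."""
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()


def to_dataframe(records: list[dict]) -> pd.DataFrame:
    """Flatten nested records, rename columns, and infer column dtypes."""
    # json_normalize expands nested dicts into 'outer_inner' columns.
    df = pd.json_normalize(records, sep="_")
    df = df.rename(columns=camel_to_snake)
    # convert_dtypes selects nullable integer/float/boolean dtypes where possible.
    return df.convert_dtypes()


# Hypothetical payload shaped like a camelCase JSON response (not real LDB data).
sample = [
    {"id": 1, "levelName": "Country", "meta": {"hasChildren": True}},
    {"id": 2, "levelName": "Region", "meta": {"hasChildren": False}},
]
df = to_dataframe(sample)
print(df.columns.tolist())
```

The nested `meta` object becomes a `meta_has_children` column, so every field ends up in a single flat table.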

🔧 Changes Made

Access Layer Implementation

New pyldb.access module: Introduced a complete access layer with classes for all API endpoints:
- AggregatesAccess, AttributesAccess, DataAccess, LevelsAccess, MeasuresAccess, SubjectsAccess, UnitsAccess, VariablesAccess, YearsAccess
BaseAccess class: Provides common functionality for DataFrame conversion, column renaming, and data normalization
Dual interface design: Main LDB client now exposes both:
- Access layer (default): ldb.levels, ldb.data, etc. → Returns DataFrames
- API layer: ldb.api.levels, ldb.api.data, etc. → Returns raw dictionaries
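One way to picture the dual interface is a client that keeps the raw clients under an `api` namespace and wraps each of them in an access object. The sketch below is an assumption-laden illustration: the class names `RawLevelsAPI`, `LevelsAccessSketch`, and `ClientSketch`, and the `results` key in the payload, are hypothetical and are not taken from the PR.

```python
from types import SimpleNamespace

import pandas as pd


class RawLevelsAPI:
    """Hypothetical stand-in for a raw API client; returns a plain dict."""

    def list_levels(self) -> dict:
        # The 'results' key is an assumption about the payload shape.
        return {"results": [{"id": 0, "name": "Country"}]}


class LevelsAccessSketch:
    """Wraps a raw client and converts its payload to a DataFrame."""

    def __init__(self, api: RawLevelsAPI) -> None:
        self._api = api

    def list_levels(self) -> pd.DataFrame:
        payload = self._api.list_levels()
        return pd.DataFrame(payload["results"])


class ClientSketch:
    """Exposes both layers: client.levels (DataFrames), client.api.levels (dicts)."""

    def __init__(self) -> None:
        self.api = SimpleNamespace(levels=RawLevelsAPI())
        self.levels = LevelsAccessSketch(self.api.levels)


client = ClientSketch()
raw = client.api.levels.list_levels()
df = client.levels.list_levels()
```

Because the access object only delegates to the raw client, both layers share one source of HTTP behavior and differ only in the return type.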

API Client Enhancements

Parameter improvements:
- Renamed year → years across methods for consistency (supports multiple years)
- Renamed variable_id → variable_ids in data retrieval methods (supports lists)
- Removed all_pages parameter in favor of max_pages for clearer pagination control
Format enum: Introduced Format enum (JSON, JSONAPI, XML) for response format handling
Default page size: Added configurable page_size parameter (default: 100) for paginated requests
Enhanced request handling: Improved parameter and header management across all API clients
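The interaction between `page_size` and `max_pages` can be illustrated with a small loop. This is a sketch under stated assumptions, not the library's implementation: `fetch_pages` and the `fetch_page` callback are hypothetical, `max_pages=0` is assumed to fetch nothing, and negative values are assumed to be rejected.

```python
from typing import Callable, Optional


def fetch_pages(
    fetch_page: Callable[[int, int], list],
    page_size: int = 100,
    max_pages: Optional[int] = None,
) -> list:
    """Collect results page by page; max_pages=None means 'until exhausted'."""
    if max_pages is not None and max_pages < 0:
        # Assumption: negative limits are treated as caller errors.
        raise ValueError("max_pages must be None or a non-negative integer")
    results: list = []
    page = 0
    while max_pages is None or page < max_pages:
        batch = fetch_page(page, page_size)
        results.extend(batch)
        if len(batch) < page_size:
            break  # a short page signals the last page
        page += 1
    return results


# Fake backend with 250 items, used only to exercise the loop.
ITEMS = list(range(250))


def fake_fetch(page: int, size: int) -> list:
    return ITEMS[page * size : (page + 1) * size]


all_items = fetch_pages(fake_fetch)               # every page
first_page = fetch_pages(fake_fetch, max_pages=1)  # first page only
```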

Configuration & Client Updates

Flexible initialization: LDB client now accepts None, dict, or LDBConfig instances
Configuration enhancements: Added page_size and default format to LDBConfig
Environment variable support: Enhanced handling for configuration overrides
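Accepting `None`, a dict, or a config instance usually comes down to one normalization step at construction time. The following is a minimal sketch of that idea; `ConfigSketch` and `resolve_config` are hypothetical names, and the real `LDBConfig` has more fields than shown here.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class ConfigSketch:
    """Hypothetical stand-in for LDBConfig with two of its fields."""

    api_key: Optional[str] = None
    page_size: int = 100


def resolve_config(config: Union[None, dict, ConfigSketch]) -> ConfigSketch:
    """Normalize the three accepted input types into one config object."""
    if config is None:
        return ConfigSketch()
    if isinstance(config, ConfigSketch):
        return config
    if isinstance(config, dict):
        # Unknown keys raise TypeError from the dataclass constructor.
        return ConfigSketch(**config)
    raise TypeError(f"Unsupported config type: {type(config).__name__}")


from_none = resolve_config(None)
from_dict = resolve_config({"api_key": "your-key", "page_size": 50})
```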

Testing Infrastructure

Test organization: Reorganized tests into unit/, integration/, and e2e/ directories
Test markers: Added custom pytest markers (@pytest.mark.unit, @pytest.mark.integration, @pytest.mark.e2e)
Comprehensive coverage:
- Unit tests for all access layer classes
- Integration tests with sample data for all endpoints
- End-to-end workflow tests
Sample data: Added sample JSON responses for integration testing
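Custom markers must be registered so that pytest does not warn about unknown marks. The PR registers them in project configuration, which is not shown here; the sketch below shows an equivalent registration from a `conftest.py` using pytest's `pytest_configure` hook and `config.addinivalue_line`. The marker descriptions are illustrative wording, not taken from the PR.

```python
# conftest.py (illustrative alternative to registering markers in project config)

MARKERS = {
    "unit": "fast tests that exercise a single class in isolation",
    "integration": "tests that run access classes against sample data",
    "e2e": "tests that cover complete user workflows",
}


def pytest_configure(config):
    """Register the custom markers so `pytest -m unit` runs without warnings."""
    for name, description in MARKERS.items():
        config.addinivalue_line("markers", f"{name}: {description}")
```

A test then opts in with a decorator such as `@pytest.mark.unit`, and `pytest -m unit` selects only those tests.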

Documentation

New documentation: Added comprehensive access_layer.rst documentation
Updated guides: Enhanced main_client.rst, api_clients.rst, and config.rst
Examples notebook: Added examples.ipynb with practical usage examples
Appendix: Added technical implementation details for developers

Dependencies & Infrastructure

MyST Notebook: Added for documentation support
CI updates: Upgraded python-semantic-release action (v9.15.0 → v10.5.2)
Gitignore: Updated to exclude IDE and environment files

✅ Testing

Test Coverage

Unit tests: 9 new test files covering all access layer classes (aggregates, attributes, data, levels, measures, subjects, units, variables, years)
Integration tests: 9 integration test files with realistic sample data
End-to-end tests: 2 workflow tests covering complete user scenarios
API client tests: Updated existing tests to reflect parameter changes

Test Execution

# Run all tests
pytest

# Run by category
pytest -m unit
pytest -m integration
pytest -m e2e

Manual Testing

Access layer DataFrame conversion:

import pandas as pd
from pyldb import LDB, LDBConfig

ldb = LDB(LDBConfig(api_key="your-key"))
df = ldb.levels.list_levels()
assert isinstance(df, pd.DataFrame)
assert 'level_id' in df.columns  # camelCase → snake_case

API layer still returns raw dicts:

raw = ldb.api.levels.list_levels()
assert isinstance(raw, dict)

Parameter changes:
- Verify years parameter accepts lists
- Verify variable_ids parameter accepts lists
- Verify max_pages controls pagination correctly

🚨 Breaking Changes & Migration Notes

Parameter Renames

year → years: Update calls to get_data_by_variable(), get_data_by_unit(), and related methods

# Old
ldb.api.data.get_data_by_variable(variable_id="3643", year=2021)

# New
ldb.api.data.get_data_by_variable(variable_id="3643", years=[2021])

variable_id → variable_ids: Update calls to get_data_by_unit() and aget_data_by_unit()

# Old
ldb.api.data.get_data_by_unit(unit_id="123", variable_id="3643")

# New
ldb.api.data.get_data_by_unit(unit_id="123", variable_ids=["3643"])

Removed Parameters

all_pages parameter: Removed from DataAPI, SubjectsAPI, UnitsAPI, and VariablesAPI

# Old
ldb.api.data.get_data_by_variable(variable_id="3643", all_pages=True)

# New
ldb.api.data.get_data_by_variable(variable_id="3643", max_pages=None)  # None = all pages

Migration Path

For existing code using API layer: Update parameter names as shown above
For new code: Consider using the access layer (default interface) for DataFrame-based workflows
For advanced use cases: Continue using ldb.api.* for raw dictionary access
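For codebases with many call sites, the renames above can be applied mechanically to keyword arguments. The helper below is a hypothetical migration aid, not part of pyldb. It assumes that the old `all_pages=False` meant "first page only", which the PR does not state, and it renames `variable_id` only when asked, because `get_data_by_variable()` keeps that parameter.

```python
def migrate_kwargs(kwargs: dict, rename_variable_id: bool = False) -> dict:
    """Translate pre-PR keyword arguments into their post-PR equivalents."""
    new = dict(kwargs)  # leave the caller's dict untouched

    if "year" in new:
        year = new.pop("year")
        # Wrap a single year in a list; pass None and lists through unchanged.
        new["years"] = year if year is None or isinstance(year, list) else [year]

    if rename_variable_id and "variable_id" in new:
        # Only get_data_by_unit() / aget_data_by_unit() renamed this parameter.
        var = new.pop("variable_id")
        new["variable_ids"] = var if isinstance(var, list) else [var]

    if "all_pages" in new:
        all_pages = new.pop("all_pages")
        # Assumption: all_pages=False previously fetched a single page.
        new["max_pages"] = None if all_pages else 1

    return new


by_variable = migrate_kwargs({"variable_id": "3643", "year": 2021, "all_pages": True})
by_unit = migrate_kwargs(
    {"unit_id": "123", "variable_id": "3643"}, rename_variable_id=True
)
```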

🔍 Review Focus Areas

Critical Review Points

DataFrame conversion logic: Verify correctness of nested data flattening in BaseAccess._to_dataframe()
Column renaming: Check that _column_renames mappings are correctly applied across all access classes
Pagination handling: Ensure max_pages logic correctly handles edge cases (None, 0, negative values)
Type inference: Validate that data types are correctly inferred from API responses
Backward compatibility: Confirm that API layer changes don't break existing integrations

Performance Considerations

DataFrame conversion overhead for large responses
Memory usage with nested data flattening
Pagination efficiency with max_pages parameter

Security & Configuration

Verify API key handling remains secure
Check that environment variable overrides work correctly
Validate rate limiting still functions properly

📦 Dependencies & Side Effects

New Dependencies

myst-nb: Added for MyST Notebook support in documentation

Updated Dependencies

python-semantic-release: Upgraded in CI workflow (v9.15.0 → v10.5.2)

Side Effects

Import paths: No breaking changes to public API imports
Configuration: New optional page_size and format config parameters (backward compatible)
Test organization: Tests reorganized into tests/unit/, tests/integration/, and tests/e2e/ directories (does not affect runtime)

🚀 Deployment Notes

Pre-Deployment Checklist

Verify all tests pass in CI
Update CHANGELOG.md with breaking changes
Update version number (semantic versioning)
Review documentation for accuracy

Post-Deployment

Documentation: New documentation will be available at /docs/access_layer.html
Examples: Jupyter notebook examples available in docs/examples.ipynb
User communication: Consider announcing the new access layer in release notes

Environment Considerations

No database migrations required
No infrastructure changes needed
Existing API-layer code keeps working once the renamed and removed parameters listed under Breaking Changes are updated

📊 Statistics

Files changed: 111 files
Lines added: ~9,623 insertions
Lines removed: ~1,311 deletions
Net change: +8,312 lines
New test files: 20+ test files
New access classes: 9 classes
Documentation pages: 4 new/updated pages

- Upgraded the python-semantic-release action from v9.15.0 to v10.5.2 to leverage new features and improvements.

- Updated the LDB client to support a more flexible configuration input, allowing for `None` and dictionary types.
- Introduced an enrichment registry for managing data sources and improved the access layer to return DataFrames.
- Added a sentinel value in `LDBConfig` to differentiate between "not provided" and "None" for the API key.
- Enhanced quota handling in the API client to support custom quotas and improved rate limiting logic for registered and anonymous users.
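The sentinel mentioned in this commit is a common Python pattern for telling an omitted argument apart from an explicit `None`. Below is a minimal sketch of the pattern; `SentinelConfig`, `_NOT_PROVIDED`, and the environment variable name `EXAMPLE_API_KEY` are hypothetical and do not reflect the actual names in `LDBConfig`.

```python
import os

# Unique object that no caller can pass by accident.
_NOT_PROVIDED = object()


class SentinelConfig:
    """Falls back to the environment only when api_key is omitted entirely."""

    def __init__(self, api_key=_NOT_PROVIDED):
        if api_key is _NOT_PROVIDED:
            # Argument omitted: consult the (hypothetical) environment variable.
            api_key = os.environ.get("EXAMPLE_API_KEY")
        # An explicit None is kept as None, e.g. to force anonymous access.
        self.api_key = api_key
```

With a plain `api_key=None` default, the two cases would be indistinguishable and an explicit `None` would silently pick up the environment value.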

- Added .envrc and .vscode/ to the .gitignore to prevent tracking of environment configuration and IDE-specific files.
- Included dev/ directory to ignore development-related files.

…uest handling
- Updated API client methods across multiple modules to include new parameters for language, format, and conditional request headers.
- Introduced centralized handling of API parameters and headers to streamline request preparation.
- Enhanced list and get methods to support pagination and sorting options, improving data retrieval flexibility.
- Updated documentation strings to reflect new parameters and usage examples for better clarity.

…t handling
- Added a new `Format` enum to define supported response formats (JSON, JSONAPI, XML).
- Updated `LDBConfig` to include a default response format, enhancing configuration flexibility.
- Modified API client methods across various modules to utilize the new format handling, defaulting to the config settings.
- Improved documentation to reflect changes in expected parameters for language and format in API methods.
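A response-format enum with a config-level default might look like the sketch below. The member names come from the commit message; the string values, the class name `FormatSketch`, and the helper `resolve_format` are assumptions made for illustration.

```python
from enum import Enum
from typing import Optional


class FormatSketch(str, Enum):
    """Illustrative response-format enum; the string values are assumed."""

    JSON = "json"
    JSONAPI = "jsonapi"
    XML = "xml"


def resolve_format(
    requested: Optional[FormatSketch],
    default: FormatSketch = FormatSketch.JSON,
) -> str:
    """Use the per-call format if given, otherwise the configured default."""
    chosen = requested if requested is not None else default
    return chosen.value


per_call = resolve_format(FormatSketch.XML)
from_default = resolve_format(None, default=FormatSketch.JSONAPI)
```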

…stency
- Updated the parameter name from 'year' to 'years' across multiple API methods in the DataAPI, UnitsAPI, and VariablesAPI classes to better reflect that multiple years can be specified.
- Adjusted corresponding documentation strings to ensure clarity regarding the new parameter name.
- Enhanced consistency in parameter naming across the codebase.

- Introduced a new constant `DEFAULT_PAGE_SIZE` set to 100 for pagination.
- Updated `LDBConfig` to include a `page_size` attribute, allowing customization of the default page size.
- Enhanced environment variable handling to allow overriding the default page size, with error handling for invalid values.
- Updated documentation to reflect the new `page_size` parameter in the configuration.
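An environment override for the page size needs to validate the value before using it. The sketch below shows one way to do that; the function name `page_size_from_env` and the variable name `EXAMPLE_PAGE_SIZE` are hypothetical, and raising `ValueError` on bad input is an assumption, since the commit does not say whether invalid values raise or fall back to the default.

```python
import os
from typing import Mapping, Optional

DEFAULT_PAGE_SIZE = 100


def page_size_from_env(
    var_name: str = "EXAMPLE_PAGE_SIZE",
    environ: Optional[Mapping[str, str]] = None,
) -> int:
    """Read a page-size override from the environment, validating the value."""
    environ = os.environ if environ is None else environ
    raw = environ.get(var_name)
    if raw is None:
        return DEFAULT_PAGE_SIZE
    try:
        value = int(raw)
    except ValueError:
        raise ValueError(f"{var_name} must be an integer, got {raw!r}") from None
    if value <= 0:
        raise ValueError(f"{var_name} must be positive, got {value}")
    return value
```

Passing `environ` explicitly keeps the function easy to test without mutating the process environment.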

…gination logic
- Eliminated the 'all_pages' parameter from DataAPI, SubjectsAPI, UnitsAPI, and VariablesAPI classes to simplify pagination handling.
- Updated methods to use 'max_pages' for controlling pagination, with clear documentation on its usage.
- Adjusted logic to fetch results based on 'max_pages' value, ensuring consistent behavior across API methods.
- Enhanced documentation to clarify the new pagination approach and parameters.

- Introduced a new access layer for various API endpoints, including aggregates, attributes, data, levels, measures, subjects, units, variables, and years.
- Each access class is designed to convert API responses into pandas DataFrames, enhancing data manipulation capabilities.
- Added methods for listing and retrieving data, with support for pagination and metadata retrieval.
- Improved documentation to clarify usage and functionality of the new access layer classes.

…lients
- Introduced comprehensive end-to-end tests for access layer workflows, ensuring correct data retrieval and handling.
- Added integration tests for various access classes, including AggregatesAccess, AttributesAccess, DataAccess, LevelsAccess, MeasuresAccess, SubjectsAccess, UnitsAccess, and VariablesAccess, validating their functionality with sample data.
- Implemented unit tests for API clients, enhancing coverage for asynchronous and synchronous operations.
- Included sample data files to support integration tests, ensuring realistic scenarios for testing.
- Improved overall test structure and organization for better maintainability and clarity.

- Included MyST Notebook as a dependency for documentation.
- Introduced custom test markers for unit, integration, and end-to-end tests to enhance test categorization and organization.
- Updated dependencies

- Introduced detailed documentation for the new access layer, highlighting its features such as automatic DataFrame conversion, column name normalization, and nested data flattening.
- Updated API clients documentation to clarify the distinction between the access layer and API layer, emphasizing the benefits of using the access layer for data analysis.
- Added examples and usage scenarios to enhance user understanding and facilitate quick start with the library.
- Included technical implementation details in the appendix for developers and power users.

…ethods
- Changed the parameter name from `variable_id` to `variable_ids` in the `get_data_by_unit` and `aget_data_by_unit` methods to support multiple variable IDs as a list.
- Updated corresponding documentation and test cases to reflect this change, ensuring consistency across the API.
- Cleaned up unnecessary whitespace in several files for improved code readability.

github-actions · 2025-12-07T14:28:00Z

Test Results (Python 3.13)

426 tests (+176): 417 ✅ passed (+167), 9 💤 skipped (+9), 0 ❌ failed (±0)
1 suite (±0), 1 file (±0), duration 4s ⏱️ (±0s)

Results for commit 75f84e6. ± Comparison against base commit defc104.

This pull request removes 250 and adds 426 tests. Note that renamed tests count towards both.

tests.api.test_api_aggregates ‑ test_get_aggregate
tests.api.test_api_aggregates ‑ test_get_aggregate_error
tests.api.test_api_aggregates ‑ test_get_aggregates_metadata
tests.api.test_api_aggregates ‑ test_get_aggregates_metadata_error
tests.api.test_api_aggregates ‑ test_list_aggregates
tests.api.test_api_aggregates ‑ test_list_aggregates_error
tests.api.test_api_aggregates ‑ test_list_aggregates_extra_query
tests.api.test_api_aggregates ‑ test_list_aggregates_with_sort
tests.api.test_api_aggregates_async ‑ test_aget_aggregate
tests.api.test_api_aggregates_async ‑ test_aget_aggregate_error
…

tests.e2e.test_access_workflows.TestAccessWorkflows ‑ test_async_workflow
tests.e2e.test_access_workflows.TestAccessWorkflows ‑ test_data_by_variable_workflow
tests.e2e.test_access_workflows.TestAccessWorkflows ‑ test_get_level_workflow
tests.e2e.test_access_workflows.TestAccessWorkflows ‑ test_list_levels_workflow
tests.e2e.test_access_workflows.TestAccessWorkflows ‑ test_list_subjects_workflow
tests.e2e.test_access_workflows.TestAccessWorkflows ‑ test_pagination_workflow
tests.e2e.test_client_workflows.TestClientWorkflows ‑ test_access_vs_api_layer
tests.e2e.test_client_workflows.TestClientWorkflows ‑ test_api_layer_access
tests.e2e.test_client_workflows.TestClientWorkflows ‑ test_client_initialization
tests.integration.access.test_access_with_api_client.TestAccessWithAPIClient ‑ test_async_methods_call_async_api
…

github-actions · 2025-12-07T14:28:00Z

Test Results (Python 3.11)

426 tests (+176): 417 ✅ passed (+167), 9 💤 skipped (+9), 0 ❌ failed (±0)
1 suite (±0), 1 file (±0), duration 5s ⏱️ (-1s)

Results for commit 75f84e6. ± Comparison against base commit defc104.


AN0DA added 13 commits November 30, 2025 18:38

chore: Update python-semantic-release action version in CI workflow

b1236b8

- Upgraded the python-semantic-release action from v9.15.0 to v10.5.2 to leverage new features and improvements.

chore: Update .gitignore to include additional environment and IDE files

2ed45bb

- Added .envrc and .vscode/ to the .gitignore to prevent tracking of environment configuration and IDE-specific files.
- Included dev/ directory to ignore development-related files.

feat: Add support for MyST Notebook and test markers in configuration

46de554

- Included MyST Notebook as a dependency for documentation.
- Introduced custom test markers for unit, integration, and end-to-end tests to enhance test categorization and organization.
- Updated dependencies

AN0DA merged commit a78e0ed into main Dec 7, 2025
14 of 16 checks passed

AN0DA merged 13 commits into main from dataset-conversion
