@MaxGhenis
Contributor

Summary

Invalid enum values passed to Enum.encode() were silently converted to index 0 (the first enum value) instead of raising an error. This caused silent data corruption in simulations.

The Problem

In Enum.encode(), numpy.select() is used to map string values to enum indices:

array = numpy.select(
    [array == item.name for item in cls],
    [item.index for item in cls],
)

When no condition matches (invalid value), numpy.select returns its default value of 0, which silently converts invalid values to the first enum item.

For example:

  • 'MARRIED' passed to FilingStatus (which has SINGLE, JOINT, SEPARATE) would silently become SINGLE (index 0)
  • Empty string '' would also silently become the first enum value
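The fallback can be reproduced outside the library. The snippet below is a minimal sketch, with a plain list of names standing in for the enum (the list and the sample input are assumptions for illustration, not the project's code):

```python
import numpy

# Hypothetical stand-in for the enum's item names, listed in index order.
names = ["SINGLE", "JOINT", "SEPARATE"]
array = numpy.array(["JOINT", "MARRIED", ""])

encoded = numpy.select(
    [array == name for name in names],
    list(range(len(names))),
)

# "MARRIED" and "" match no condition, so numpy.select uses its default of 0,
# which is also the index of SINGLE.
print(encoded.tolist())  # [1, 0, 0]
```

After encoding, the two invalid entries cannot be told apart from a genuine SINGLE.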

Real-World Impact

This bug was discovered in PolicyEngine/policyengine-us#6901 where Missouri test files had:

  • filing_status: MARRIED (invalid - should be JOINT)
  • filing_status: (empty/blank)

These tests passed silently because the invalid values were converted to SINGLE.

The Fix

Add validation after numpy.select to detect unmatched values and raise a descriptive error:

# Flag every entry that matches none of the enum's item names.
unmatched_mask = ~numpy.isin(original_array, valid_names)
if unmatched_mask.any():
    # Report each distinct invalid value once.
    invalid_values = numpy.unique(original_array[unmatched_mask])
    raise ValueError(
        f"Invalid value(s) {invalid_values.tolist()} for enum "
        f"{cls.__name__}. Valid values are: {valid_names}"
    )
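To show the check in context, here is a self-contained sketch of validation followed by encoding. It assumes a standard-library enum and a free function named encode_names, both hypothetical; the project's Enum class and its encode() method may be structured differently.

```python
import enum

import numpy


class FilingStatus(enum.Enum):
    # Hypothetical stand-in; not the project's Enum class.
    SINGLE = "single"
    JOINT = "joint"
    SEPARATE = "separate"


def encode_names(enum_cls, values):
    """Map item names to positional indices, rejecting unknown names."""
    original_array = numpy.asarray(values, dtype=str)
    valid_names = [item.name for item in enum_cls]

    # Validate first, so an unknown name can never fall through to index 0.
    unmatched_mask = ~numpy.isin(original_array, valid_names)
    if unmatched_mask.any():
        invalid_values = numpy.unique(original_array[unmatched_mask])
        raise ValueError(
            f"Invalid value(s) {invalid_values.tolist()} for enum "
            f"{enum_cls.__name__}. Valid values are: {valid_names}"
        )

    return numpy.select(
        [original_array == name for name in valid_names],
        list(range(len(valid_names))),
    )


print(encode_names(FilingStatus, ["JOINT", "SINGLE"]).tolist())  # [1, 0]
```

Calling encode_names(FilingStatus, ["MARRIED"]) raises ValueError instead of returning [0].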

Test Plan

  • Added tests/core/test_enum_encoding.py with tests for:
    • Valid single values encode correctly
    • Valid multiple values encode correctly
    • Invalid values raise ValueError with helpful message
    • Empty strings raise ValueError
    • Mixed valid/invalid values raise ValueError
    • Case sensitivity is enforced
    • Order is preserved in encoding
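A few of these cases could be written roughly as follows. This is only a sketch: encode() here is a hypothetical stand-in defined inline so the example runs on its own, and the actual contents of tests/core/test_enum_encoding.py are not shown in this pull request.

```python
import numpy

# Hypothetical enum item names, listed in index order.
VALID_NAMES = ["SINGLE", "JOINT", "SEPARATE"]


def encode(values):
    """Hypothetical stand-in for Enum.encode(): validate, then map names to indices."""
    array = numpy.asarray(values, dtype=str)
    unmatched = ~numpy.isin(array, VALID_NAMES)
    if unmatched.any():
        invalid = numpy.unique(array[unmatched]).tolist()
        raise ValueError(f"Invalid value(s) {invalid}")
    return numpy.select(
        [array == name for name in VALID_NAMES],
        list(range(len(VALID_NAMES))),
    )


def raises_value_error(values):
    """Return True if encoding the given values raises ValueError."""
    try:
        encode(values)
    except ValueError:
        return True
    return False


def test_order_is_preserved():
    assert encode(["SEPARATE", "SINGLE", "JOINT"]).tolist() == [2, 0, 1]


def test_empty_string_is_rejected():
    assert raises_value_error([""])


def test_mixed_valid_and_invalid_is_rejected():
    assert raises_value_error(["SINGLE", "MARRIED"])


def test_case_sensitivity_is_enforced():
    assert raises_value_error(["single"])
```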

Fixes #410

🤖 Generated with Claude Code

Change the behavior from logging a warning and returning index 0 to
raising a ValueError with a clear message. This prevents silent data
corruption when incorrect enum values are passed to simulations.

The previous behavior (introduced in the searchsorted refactor) would:
1. Log a warning about invalid values
2. Return 0 (first enum value) for the invalid entries
3. Continue execution with corrupted data

This was an improvement over the original np.select behavior (which
silently returned 0 without any warning), but still allowed simulations
to run with incorrect data.

Now invalid enum values will raise:
  ValueError: Invalid value(s) ['MARRIED'] for enum FilingStatus.
  Valid values are: ['SINGLE', 'JOINT', 'SEPARATE', ...]
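
For illustration only, a lookup based on numpy.searchsorted could perform the same check as below. searchsorted returns an insertion position whether or not the value is present, so the match has to be verified explicitly. The function name and structure are assumptions; the refactored code itself is not shown here.

```python
import numpy


def encode_with_searchsorted(values, names):
    """Hypothetical sketch: map names to indices via a sorted lookup, raising on misses."""
    names = numpy.asarray(names, dtype=str)
    values = numpy.asarray(values, dtype=str)

    # order[i] is the original index of the i-th name in sorted order.
    order = numpy.argsort(names)
    sorted_names = names[order]

    # Insertion positions; clip so values sorting past the end stay in range.
    positions = numpy.searchsorted(sorted_names, values)
    positions = numpy.clip(positions, 0, len(sorted_names) - 1)

    # A value is valid only if the name at its position is an exact match.
    matched = sorted_names[positions] == values
    if not matched.all():
        invalid_values = numpy.unique(values[~matched])
        raise ValueError(
            f"Invalid value(s) {invalid_values.tolist()}. "
            f"Valid values are: {names.tolist()}"
        )
    return order[positions]


names = ["SINGLE", "JOINT", "SEPARATE"]
print(encode_with_searchsorted(["SEPARATE", "SINGLE", "JOINT"], names).tolist())  # [2, 0, 1]
```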

Fixes #410

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@MaxGhenis force-pushed the fix-silent-enum-corruption branch from 6fec490 to c01e0f9 on December 3, 2025 21:59
@MaxGhenis merged commit 26fdd57 into master on Dec 3, 2025
14 checks passed
@MaxGhenis deleted the fix-silent-enum-corruption branch on December 3, 2025 22:11

Development

Successfully merging this pull request may close these issues.

Silent data corruption: Invalid enum values default to index 0 instead of raising error