Choose a new approach for deciding whether or not to check dataframe size in a step - without making step less testable #190

@lisad

Description

We originally wanted developers using phaser to be able to write steps and then write unit tests to confirm that those steps work as expected. Adding the "check_size" flag to the step decorator unfortunately made this a little harder. Testing a step now looks like this:

import pandas as pd

# dataframe_step is phaser's step decorator
@dataframe_step
def sum_bonuses(df, context):
    df['total'] = df.sum(axis=1, numeric_only=True)
    return df

def test_sum_bonuses():
    data = {'eid': ['001', '001'], 'commission': [1000, 1000], 'performance': [9000, 1000]}
    output = [{'eid': '001', 'commission': 1000, 'performance': 9000, 'total': 10000},
              {'eid': '001', 'commission': 1000, 'performance': 1000, 'total': 2000}]
    bonus_df = pd.DataFrame(data)
    # The decorated step returns the check_size flag alongside its output,
    # so the test has to unpack a tuple just to reach the result.
    test_step_output, check_size_flag = sum_bonuses(bonus_df)
    assert test_step_output == output

To test the step, the developer has to unpack the check_size flag that comes back alongside the step output, even though the test has no use for it. We should fix that. In retrospect, check_size should have been handled differently; we probably overlooked this impact on testing when we built it. The same issue affects row_step and batch_step.
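
One possible direction, as a rough sketch only: record check_size as an attribute on the wrapped function instead of returning it, so that calling a step in a test gives back just the result. The names sketch_batch_step and run_step below are hypothetical and are not part of phaser; plain lists of dicts stand in for real data so the sketch has no dependencies.

```python
from functools import wraps

def sketch_batch_step(func=None, *, check_size=False):
    """Hypothetical decorator: stores check_size on the wrapper instead of returning it."""
    def decorate(step_function):
        @wraps(step_function)
        def wrapper(batch, context=None):
            # Return only the step's result, so tests need no tuple unpacking.
            return step_function(batch, context)
        # The code that runs the pipeline can read this attribute later.
        wrapper.check_size = check_size
        return wrapper
    # Support both @sketch_batch_step and @sketch_batch_step(check_size=True).
    if func is not None:
        return decorate(func)
    return decorate

@sketch_batch_step(check_size=True)
def sum_bonuses(batch, context):
    return [{**row, 'total': row['commission'] + row['performance']} for row in batch]

def run_step(step, batch, context=None):
    """Hypothetical runner: the size check lives here, not in the step's return value."""
    result = step(batch, context)
    if getattr(step, 'check_size', False) and len(result) != len(batch):
        raise ValueError(f"{step.__name__} changed the number of rows")
    return result
```

With this shape, a unit test can call sum_bonuses(rows) and compare the return value directly, while the runner still has access to sum_bonuses.check_size.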

Even better would be some way to test the output of sum_bonuses as a dataframe. That seems more challenging than fixing our approach to check_size, but there may be a way.
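
A sketch of one way this might work, under a big assumption: if the step decorator were built with functools.wraps, the undecorated function would be reachable as sum_bonuses.__wrapped__, and a test could compare dataframes with pandas.testing.assert_frame_equal. The decorator sketch_dataframe_step below is a hypothetical stand-in, and I have not checked whether phaser's real decorator exposes __wrapped__.

```python
from functools import wraps

import pandas as pd
from pandas.testing import assert_frame_equal

def sketch_dataframe_step(step_function):
    """Hypothetical stand-in for a dataframe step decorator."""
    @wraps(step_function)  # functools.wraps stores the original as wrapper.__wrapped__
    def wrapper(df, context=None):
        result = step_function(df, context)
        # Assumption: the pipeline wants a list of row dicts back.
        return result.to_dict(orient='records')
    return wrapper

@sketch_dataframe_step
def sum_bonuses(df, context):
    df['total'] = df.sum(axis=1, numeric_only=True)
    return df

def test_sum_bonuses_as_dataframe():
    bonus_df = pd.DataFrame({'eid': ['001', '001'],
                             'commission': [1000, 1000],
                             'performance': [9000, 1000]})
    expected = bonus_df.assign(total=[10000, 2000])
    # Call the undecorated function so the result is still a DataFrame.
    actual = sum_bonuses.__wrapped__(bonus_df.copy(), None)
    # check_dtype=False keeps the comparison about values, not integer widths.
    assert_frame_equal(actual, expected, check_dtype=False)
```

The trade-off is that such a test exercises the step's logic but bypasses whatever the decorator itself does, so it complements a test of the decorated step rather than replacing it.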
