@tompollard
Owner

This PR fixes a TypeError that occurs when computing p-values for categorical variables containing missing values. The error arises because missing values are replaced with the string 'None', which produces mixed-type categories (e.g. int and str) that cannot be sorted during internal processing (ref #161 and #160).

Previously, the following code would raise TypeError: '<' not supported between instances of 'str' and 'int':

from tableone import TableOne
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'group': ['A', 'B', 'A', 'B', 'A'],
    'numeric_cat': [1, 2, np.nan, 2, 1]
})

t1 = TableOne(df, columns=['numeric_cat'], categorical=['numeric_cat'], groupby='group', pval=True)
print(t1.tableone)
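
The underlying failure can be reproduced without tableone. The snippet below is a minimal illustration, not the library's code: the list `mixed_categories` is a made-up stand-in for the category values once a null has been replaced with the string 'None'.

```python
# Hypothetical stand-in for category values after nulls become the string 'None'
mixed_categories = [1, 2, 'None']

try:
    # Python 3 refuses to order int and str values against each other
    sorted(mixed_categories)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)
```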

The fixes are:

  • Modified handle_categorical_nulls() to convert entire columns to string before replacing nulls with 'None', avoiding mixed-type category issues.
  • Added a key=str argument when sorting categories to prevent sorting errors in edge cases.
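
The two changes can be sketched roughly as follows. This is an illustration of the idea under stated assumptions, not the actual body of handle_categorical_nulls(): the helper name `fill_nulls_as_strings` and the sample data are hypothetical, and the example uses an object-dtype column so the integers stay integers until they are cast (a float column would yield labels such as '1.0').

```python
import pandas as pd


def fill_nulls_as_strings(series, null_label='None'):
    """Hypothetical helper: return a column where every value is a str.

    Non-null values are cast with str(); nulls become null_label.
    """
    return series.map(lambda v: null_label if pd.isna(v) else str(v))


# Object dtype keeps 1 and 2 as Python ints alongside the missing value
numeric_cat = pd.Series([1, 2, None, 2, 1], dtype=object)

filled = fill_nulls_as_strings(numeric_cat)
print(list(filled))

# Second safeguard: sort by string representation so that any remaining
# mixed-type categories are compared as str rather than as int vs str
sorted_mixed = sorted([2, 'None', 1], key=str)
print(sorted_mixed)
```

Because every category is a string after the first step, the key=str sort acts only as a fallback for inputs that still arrive with mixed types.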
