Description
Feature Type
- Adding new functionality to pandas
- Changing existing functionality in pandas
- Removing existing functionality in pandas
Problem Description
I am not sure whether I am looking in the right direction for this seemingly simple problem, so please advise.
The shortest formulation is this: I would like read_csv to produce fixed-length string columns where it currently produces "object" columns.
Feature Description
Some combination of read_csv arguments that prohibits the 'object' data type and instead uses appropriately sized fixed-length string data types. This would (hopefully) make it possible to process huge CSV files while creating a reasonably small number of Python objects.
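To make the target layout concrete, here is a minimal sketch of a conversion done after the fact with the existing `DataFrame.to_records` and its `column_dtypes` argument. The helper name `frame_to_fixed_width_records` is hypothetical, and it assumes every non-numeric column holds text with at least one non-missing value. It does not avoid the intermediate Python string objects that read_csv creates, which is the point of the request; it only shows the resulting dtype.

```python
import io

import pandas as pd


def frame_to_fixed_width_records(df):
    """Convert a DataFrame to a NumPy record array with 'U<n>' string fields.

    Hypothetical helper: assumes every non-numeric column holds text and
    has at least one non-missing value.
    """
    column_dtypes = {}
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            continue  # ints, floats and bools keep their NumPy dtype
        # Width of the longest value, measured on the string representation
        width = int(df[col].astype(str).str.len().max())
        column_dtypes[col] = f"U{max(1, width)}"
    # Columns are converted as whole arrays, not row by row
    return df.to_records(index=False, column_dtypes=column_dtypes)


sample_csv = io.StringIO("id,name,score\n1,alice,3.5\n2,bob,4.25\n")
records = frame_to_fixed_width_records(pd.read_csv(sample_csv))
print(records.dtype)
```

Because the conversion is column-wise, no per-row tuples are built, but the full DataFrame still has to exist in memory first.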
For my application, I am not even interested in the pandas DataFrame itself; I would rather have read_csv return a NumPy record array that I can use from Cython (Cython does not support object-typed fields, because reference counting inside C structs is not really possible).
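For comparison, NumPy's own `genfromtxt` can already return this kind of array: with `dtype=None` it infers one type per column, and with an explicit `encoding` text columns come back as fixed-width unicode fields. The sketch below uses made-up sample data; `genfromtxt` parses in Python and is likely much slower than read_csv on huge files, so it illustrates the desired output rather than a replacement.

```python
import io

import numpy as np

# Made-up sample data standing in for a real CSV file
sample_csv = io.StringIO("id,name,score\n1,alice,3.5\n2,bob,4.25\n")

# dtype=None asks NumPy to infer one type per column;
# names=True takes the field names from the header row
structured = np.genfromtxt(
    sample_csv, delimiter=",", names=True, dtype=None, encoding="utf-8"
)

# View the structured array as a record array without copying the data
rec = structured.view(np.recarray)
print(rec.dtype)
```

No field in the resulting dtype is object-typed, which is the property needed for mapping the records onto C structs.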
Alternative Solutions
A chatbot suggested this:
import numpy as np
import pandas as pd


def csv_to_recarray(filepath, delimiter=','):
    # Read CSV with pandas - single pass, infers types well
    df = pd.read_csv(filepath, delimiter=delimiter)

    # Build dtype for record array
    dtype_list = []
    for col in df.columns:
        if pd.api.types.is_integer_dtype(df[col]):
            dtype_list.append((col, 'i8'))
        elif pd.api.types.is_float_dtype(df[col]):
            dtype_list.append((col, 'f8'))
        elif pd.api.types.is_bool_dtype(df[col]):
            dtype_list.append((col, '?'))
        else:
            # String column - get max length
            max_len = df[col].astype(str).str.len().max()
            # Add buffer for safety, minimum 1
            max_len = max(1, max_len + 10)
            dtype_list.append((col, f'U{max_len}'))

    # Convert to record array; itertuples keeps each column's own type,
    # whereas df.values would first upcast all columns to one common dtype
    rec_array = np.rec.array(
        list(df.itertuples(index=False, name=None)),
        dtype=dtype_list
    )
    return rec_array


# Usage
rec_array = csv_to_recarray('data.csv')
Additional Context
No response