
ENH: Fixed-length strings in read_csv #63373

@pulkin

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

I am not sure whether I am looking in the right direction for this seemingly simple problem, so please advise.

The shortest formulation is this: I would like read_csv to produce fixed-length string columns wherever it currently produces columns of the "object" data type.

Feature Description

Some combination of read_csv arguments that prohibits the 'object' data type and instead uses appropriately sized fixed-length string data types as a replacement. This would (hopefully) make it possible to process huge CSV files while creating only a reasonably small number of Python objects.

For my application, I am not even interested in the pandas DataFrame itself. What I need is a read_csv that returns a NumPy record array that I can use from Cython. Cython does not support object-typed fields in such arrays, because reference counting inside C structs is not really possible.
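To make the distinction concrete, here is a minimal NumPy-only sketch of the two layouts being contrasted. The field names `id` and `name` and the width `S8` are arbitrary choices for illustration, not anything read_csv produces today.

```python
import numpy as np

# Structured dtype with an object field: every value is a pointer to a Python object.
obj_dtype = np.dtype([("id", "i8"), ("name", "O")])

# Same fields with a fixed-length byte string stored inline (8 bytes per value).
fixed_dtype = np.dtype([("id", "i8"), ("name", "S8")])

print(obj_dtype.hasobject)    # True
print(fixed_dtype.hasobject)  # False
print(fixed_dtype.itemsize)   # 16

# Records of the fixed layout are plain bytes that map onto a C struct.
records = np.array([(1, b"alpha"), (2, b"beta")], dtype=fixed_dtype)
```

An array with the second dtype holds no Python references, which is the property the request is after.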

Alternative Solutions

A chatbot suggested the following workaround, which still goes through an intermediate DataFrame with object columns:

import numpy as np
import pandas as pd

def csv_to_recarray(filepath, delimiter=','):
    # Read CSV with pandas - single pass, infers types well
    df = pd.read_csv(filepath, delimiter=delimiter)
    
    # Build dtype for record array
    dtype_list = []
    for col in df.columns:
        if pd.api.types.is_integer_dtype(df[col]):
            dtype_list.append((col, 'i8'))
        elif pd.api.types.is_float_dtype(df[col]):
            dtype_list.append((col, 'f8'))
        elif pd.api.types.is_bool_dtype(df[col]):
            dtype_list.append((col, '?'))
        else:
            # String column - get max length
            max_len = df[col].astype(str).str.len().max()
            # Add buffer for safety, minimum 1
            max_len = max(1, max_len + 10)
            dtype_list.append((col, f'U{max_len}'))
    
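    # Convert to record array; itertuples keeps each column's own type,
    # whereas df.values would upcast mixed columns to a common dtype first
    rec_array = np.rec.array(
        list(df.itertuples(index=False, name=None)),
        dtype=dtype_list
    )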
    
    return rec_array

# Usage
rec_array = csv_to_recarray('data.csv')
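A column-wise variant of the same workaround might look like the sketch below. It avoids building one Python tuple per row, but it still relies on read_csv creating object (or string) columns first, so it does not solve the underlying request. The helper name `frame_to_fixed_records` is hypothetical, and treating every non-numeric column as text (with missing values mapped to empty strings) is an assumption made for the example.

```python
import io

import numpy as np
import pandas as pd


def frame_to_fixed_records(df):
    """Hypothetical helper: copy a DataFrame into a structured array without object fields."""
    fields = []   # (name, dtype) pairs for the structured dtype
    columns = []  # converted column arrays, in the same order as fields
    for col in df.columns:
        values = df[col].to_numpy()
        if values.dtype.kind not in "biuf":
            # Assumption: anything non-numeric is text; missing values become "".
            text = df[col].fillna("").astype(str).to_numpy(dtype=object)
            width = max(1, max((len(s) for s in text), default=1))
            values = np.asarray(text, dtype=f"U{width}")
        fields.append((str(col), values.dtype))
        columns.append(values)

    out = np.empty(len(df), dtype=fields)
    for (name, _), values in zip(fields, columns):
        out[name] = values  # one vectorised copy per column
    return out


# Usage with a small in-memory CSV
df = pd.read_csv(io.StringIO("id,name,score\n1,alpha,0.5\n2,be,1.5\n"))
rec = frame_to_fixed_records(df)
print(rec.dtype.hasobject)  # False
```

The string width is taken from the longest value in each column, so no padding buffer is needed once the whole file has been read.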

Additional Context

No response

Metadata

Assignees

No one assigned

Labels

Enhancement, Needs Triage (issue that has not been reviewed by a pandas team member)
