Don't use non-API functions SETLENGTH and SET_TRUELENGTH #8
Bummer, I was very proud of the speed I was able to wring out of hipread, so I'm sad to lose some of it (but I can't imagine it will matter much)! Your changes look fine to me, though this was definitely copied straight out of the vintage.

A few thoughts:

It may make sense to change the rescale factor code here (lines 63 to 66 in f96c3c8) and here (lines 174 to 178 in f96c3c8). These rely on it being better to over-allocate than under-allocate, but I don't think that's true any more. So rather than inflating the guess by 1.1x (with a minimum of 1.5x), it may be faster to just use the guess. Ideally you'd benchmark to make sure my intuition is right.

If I remember correctly, a non-gzipped, rectangular file will have the guess be exactly correct, but it can be off for hierarchical files (because each record type's line lengths can be different, and the record types may not be uniformly distributed) and for gzipped files (because the compression makes the file size per line non-exact). Also worth noting that both

Also, y'all really should get on a parquet extract; it seems like such an obvious win for both R and Python users to me!
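To make the trade-off concrete, here is a minimal C sketch of the two allocation strategies being compared. The function names, the sampled average line length, and the exact 1.1x factor are illustrative assumptions based on the description above, not the actual hipread code at the cited lines.

```c
#include <stddef.h>

/* Guess the number of records from total file size and the average
 * bytes per line (sampled from the start of the file). Exact for a
 * non-gzipped rectangular file; only an estimate for hierarchical or
 * gzipped files, as noted above. */
static size_t guess_rows(size_t file_size, size_t avg_line_bytes) {
  return avg_line_bytes > 0 ? file_size / avg_line_bytes : 0;
}

/* Current strategy: inflate the guess so under-allocation is rare.
 * This made sense when shrinking an over-allocated vector was cheap,
 * but now forces a copy at the end of the read. */
static size_t alloc_rows_inflated(size_t guess) {
  return (size_t)(guess * 1.1) + 1;
}

/* Proposed strategy: trust the guess directly and only grow if the
 * file turns out to be longer than estimated. */
static size_t alloc_rows_exact(size_t guess) {
  return guess;
}
```

Benchmarking both strategies on a gzipped hierarchical extract would show whether dropping the inflation factor actually pays off.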
Was curious how replaceable I am by ChatGPT, and it had 2 suggestions that could be worth pursuing. Don't know how much time you're able to dedicate to this, and I've done no due diligence, but it seems like this is a common enough problem that you shouldn't always have to copy the whole vector.
This is a semi-supported function in the R API.
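Assuming the "semi-supported function" meant here is Rf_xlengthgets() (the resize entry point declared in Rinternals.h), a sketch of shrinking an over-allocated column without SETLENGTH/SET_TRUELENGTH might look like the following; note that it can return a copy, which is exactly the cost discussed elsewhere in this thread.

```c
#include <R.h>
#include <Rinternals.h>

/* Shrink an over-allocated REALSXP down to the number of rows that
 * were actually read. If the guess was exact there is nothing to do;
 * otherwise Rf_xlengthgets() allocates a right-sized vector and copies
 * into it. The caller must PROTECT the result and stop using `x`. */
SEXP shrink_to_fit(SEXP x, R_xlen_t n_used) {
  if (XLENGTH(x) == n_used) {
    return x;
  }
  return Rf_xlengthgets(x, n_used);
}
```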
Closing this in favor of gergness's alternative PR.
From CRAN checks:
The solution I've implemented here involves a fair bit of copying, unfortunately. However, in my testing, it doesn't cause a drastic slowdown in file read time -- as one example, a file with 19 million records and 31 variables that previously took 1.5 minutes to read will now take 2 minutes.
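For context, a minimal sketch of the copy-based approach described above, assuming a single numeric column (the helper name is illustrative; the real change has to handle every column type inside the reader):

```c
#include <string.h>
#include <R.h>
#include <Rinternals.h>

/* Copy the first n_used elements of an over-allocated numeric column
 * into a freshly allocated vector of exactly that length, instead of
 * truncating in place with SETLENGTH/SET_TRUELENGTH. */
SEXP copy_down(SEXP overalloc, R_xlen_t n_used) {
  SEXP out = PROTECT(Rf_allocVector(REALSXP, n_used));
  memcpy(REAL(out), REAL(overalloc), (size_t)n_used * sizeof(double));
  UNPROTECT(1);
  return out;
}
```

The extra allocation and memcpy per column is presumably where the 1.5-to-2-minute difference reported above comes from.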