Don't use non-API functions SETLENGTH and SET_TRUELENGTH #8
Bummer, I was very proud of the speed I was able to wring out of hipread, so I'm sad to lose some of it (but I can't imagine it will matter much)! Your changes look fine to me, though this was definitely copied straight out of the vintage.

A few thoughts:

It may make sense to change the rescale factor code here (lines 63 to 66 in f96c3c8) and here (lines 174 to 178 in f96c3c8). These rely on it being better to over-allocate than under-allocate, but I don't think that's true any more. So rather than inflating the guess by 1.1x (with a minimum of 1.5x), it may be faster to just use the guess. Ideally you'd benchmark to make sure my intuition is right.

If I remember correctly, a non-gzipped, rectangular file will have the guess be exactly correct, but it can be off for hierarchical files (because each record type's line lengths can be different, and the record types may not be uniformly distributed) and for gzipped files (because the compression makes the file size per line non-exact). Also worth noting that both

Also, y'all really should get on a parquet extract; it seems like such an obvious win for both R and Python users to me!
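To make the trade-off concrete, here is a minimal C sketch of the two allocation strategies being compared. The function names, the sampled average line length, and the exact 1.1x factor are illustrative assumptions based on the description above, not the actual hipread code at the cited lines.

```c
#include <stddef.h>

/* Guess the number of records from total file size and the average
 * bytes per line (sampled from the start of the file). Exact for a
 * non-gzipped rectangular file; only an estimate for hierarchical or
 * gzipped files, as noted above. */
static size_t guess_rows(size_t file_size, size_t avg_line_bytes) {
  return avg_line_bytes > 0 ? file_size / avg_line_bytes : 0;
}

/* Current strategy: inflate the guess so under-allocation is rare.
 * This made sense when shrinking an over-allocated vector was cheap,
 * but now forces a copy at the end of the read. */
static size_t alloc_rows_inflated(size_t guess) {
  return (size_t)(guess * 1.1) + 1;
}

/* Proposed strategy: trust the guess directly and only grow if the
 * file turns out to be longer than estimated. */
static size_t alloc_rows_exact(size_t guess) {
  return guess;
}
```

Benchmarking both strategies on a gzipped hierarchical extract would show whether dropping the inflation factor actually pays off.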
Was curious how replaceable I am by ChatGPT, and it had 2 suggestions that could be worth pursuing. Don't know how much time you're able to dedicate to this, and I've done no due diligence, but it seems like this is a common enough problem that you shouldn't always have to copy the whole vector.
This is a semi-supported function in the R API.
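Assuming the "semi-supported function" meant here is Rf_xlengthgets() (the resize entry point declared in Rinternals.h), a sketch of shrinking an over-allocated column without SETLENGTH/SET_TRUELENGTH might look like the following; note that it can return a copy, which is exactly the cost discussed elsewhere in this thread.

```c
#include <R.h>
#include <Rinternals.h>

/* Shrink an over-allocated REALSXP down to the number of rows that
 * were actually read. If the guess was exact there is nothing to do;
 * otherwise Rf_xlengthgets() allocates a right-sized vector and copies
 * into it. The caller must PROTECT the result and stop using `x`. */
SEXP shrink_to_fit(SEXP x, R_xlen_t n_used) {
  if (XLENGTH(x) == n_used) {
    return x;
  }
  return Rf_xlengthgets(x, n_used);
}
```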
Closing this in favor of gergness's alternative PR.
From CRAN checks:
The solution I've implemented here involves a fair bit of copying, unfortunately. However, in my testing, it doesn't cause a drastic slowdown in file read time -- as one example, a file with 19 million records and 31 variables that previously took 1.5 minutes to read will now take 2 minutes.
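For context, a minimal sketch of the copy-based approach described above, assuming a single numeric column (the helper name is illustrative; the real change has to handle every column type inside the reader):

```c
#include <string.h>
#include <R.h>
#include <Rinternals.h>

/* Copy the first n_used elements of an over-allocated numeric column
 * into a freshly allocated vector of exactly that length, instead of
 * truncating in place with SETLENGTH/SET_TRUELENGTH. */
SEXP copy_down(SEXP overalloc, R_xlen_t n_used) {
  SEXP out = PROTECT(Rf_allocVector(REALSXP, n_used));
  memcpy(REAL(out), REAL(overalloc), (size_t)n_used * sizeof(double));
  UNPROTECT(1);
  return out;
}
```

The extra allocation and memcpy per column is presumably where the 1.5-to-2-minute difference reported above comes from.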