Fixes #592. With the new `select`/`drop` keyword arguments, we'll now be seeing many more column types with typecodes like `(USER | MISSING)`, which are handled specially when parsing. In particular, the parsing tape is allocated with length 0: for a column of this type, we just consume the cell in the file without storing anything. The problem in the original issue was in the tape re-allocation code, where we assumed every tape had length to copy from old tape to new tape. There was another issue in the invalid-row checking code: if a row didn't have enough columns, we usually set the value of each unparsed column in the row to `missing`, but that step is neither needed nor valid for `(USER | MISSING)` columns, since their tapes have no storage.
We already allow fairly efficient means of parsing typed values when
iterating via `CSV.Rows` with `Parsers.parse` and `CSV.detect`
definitions, but ideally, you could pass your own types that you care
about to `CSV.Rows` and have them parsed while iterating/consuming the
file.
To do this properly without causing massive code duplication, it
required refactoring things quite a bit to share a lot of common logic
between CSV.File and CSV.Rows (hence a previous PR to introduce
CSV.Header). This PR further refactors things so that CSV.File and
CSV.Rows use the same core `parserow` method for consuming a single row
of a file.
One potential downside is that we're changing CSV.Rows from using a
single contiguous `Vector{UInt64}` as a tape, to using a
`Vector{Vector{UInt64}}`, where each element vector has length 1. It's
not clear to me what kind of memory/cache-line penalty we might pay, but
I plan on doing some benchmarking to see if it's drastic
(computers are pretty smart these days, so hopefully it's fine).
Otherwise, the changes here aren't too drastic; mainly the refactoring
and then the "value access" functions changing from always returning
Strings, to returning the typed value if there was one.
I need to add some additional tests here, and generally benchmark/kick
the tires to see if anything else shakes out, but seems to work fairly
well so far.
file in testfiles (by using the types detected in CSV.File), that should give us pretty good coverage. The other change is properly supporting select/drop in CSV.Rows. It's not too bad, but just requires keeping a `columnmap` field around so that when you request column 3, we can figure out whether that means the actual third column in the file, or whether some columns were dropped and you're actually accessing the 4th column.
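The lookup described above can be sketched in a few lines (hypothetical names; `RowsSketch` and `physicalcolumn` stand in for the real `CSV.Rows` internals):

```julia
# Hypothetical sketch: with `select`/`drop`, a requested column index maps
# through `columnmap` to its physical position in the file.
struct RowsSketch
    columnmap::Vector{Int}  # kept-column index => physical column index in the file
end

# e.g. if the file's 3rd column was dropped, requesting column 3 of the
# kept columns should access the file's 4th column
physicalcolumn(r::RowsSketch, i::Int) = r.columnmap[i]
```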
* Fix AM/PM parsing Fixes #597. The issue here was two-fold: one in Parsers.jl for supporting AM/PM parsing on byte buffers (fix [here](JuliaData/Parsers.jl@c0ec6ab)), and an issue here in CSV.jl involving `CSV.timetype`. This function takes a user-provided `DateFormat` and tries to figure out whether it's meant to parse `Time`, `Date`, or `DateTime`. The logic, however, was pretty basic: it only checked whether the `DateFormat` had a year part for `Date`, an `H` (hour) part for `Time`, and if it had both, it was `DateTime`. The issue is that when a user passes `'I'` for 12-hour parts, we didn't detect that the date format had a time part. The logic is now more robust, checking any year/month/day parts for `Date`, and any hour, minute, second, or millisecond parts for `Time`. * Fix tests because AM/PM isn't supported on Julia 1.0
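The fixed detection logic can be sketched roughly like this (a re-implementation for illustration, not CSV.jl's actual `timetype` code; it leans on the `tokens` field of the stdlib `DateFormat`):

```julia
using Dates

# Sketch: scan the DateFormat's tokens; any year/month/day part implies Date,
# any hour/minute/second/AM-PM part (including 12-hour 'I') implies Time.
function timetype_sketch(df::DateFormat)
    hasdate = hastime = false
    for tok in df.tokens
        tok isa Dates.DatePart || continue
        c = first(typeof(tok).parameters)  # the format character, e.g. 'y' or 'I'
        c in ('y', 'Y', 'm', 'u', 'U', 'd', 'E', 'e') && (hasdate = true)
        c in ('H', 'I', 'M', 'S', 's', 'p') && (hastime = true)
    end
    return hasdate && hastime ? DateTime : hastime ? Time : Date
end
```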
* Support different dateformats per column This has been long requested, but the internal structure always assumed that we'd only ever need a single `Parsers.Options` for an entire file. For certain parsing features, like date formats, they can vary column to column, so it makes sense to allow specifying options for a specific column. This isn't too onerous internally, because we're currently restricting the option to just dateformats; theoretically we could open up additional options that may affect a column's type, like truestrings, falsestrings, decimal, and missingstring (though for several of these you can already specify multiple options, this would allow, for example, specifying one missingstring for one column and a different missingstring for another, without overlap). I'm limiting this to just multiple dateformats for now to kick the tires and see how minimal a change we can make to support the actual requested feature, but I'm interested to see if we want any other parsing options to be configurable per column in the future. * fix CSV.Row2 * fix nothing printing
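A usage sketch of the intended shape: `dateformat` as a dict mapping a column name to its format, with other columns left to auto-detection (column names and data here are made up for illustration):

```julia
using CSV, Dates

csv = IOBuffer("logged,stamp\n2020-01-02,02/29/2020 15:30\n")
# per-column dateformat: `stamp` gets an explicit format; `logged` is auto-detected
f = CSV.File(csv; dateformat = Dict(:stamp => dateformat"mm/dd/yyyy HH:MM"))
```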
Should fix #603. Every once in a while, a bug comes along that just makes you feel bad. You dig in, you discover the root cause, and you're like, "dammit past self, what were you thinking?". This is one of those bugs. A little background: last time I overhauled `CSV.write`, the aim was to make the code much simpler and more performant. Before, we were doing individual writes to the output `IO`, which can be really slow depending on the OS, filesystem, etc. Having a buffer, in our case 4MB worth, can increase performance because then the bulk of operations are just storing bytes in an array, with the occasional "flush" out to the `IO`. In write.jl, we have a handy little `@check` macro that allows a writing function to say `@check 3` to ensure there are at least 3 bytes of space left in the buffer to write to. This, again, is an optimization vs. doing an "end-of-buffer" length check for each byte written. We check once upfront, then `@inbounds buf[pos] = byte` for as many bytes as we checked for. So obviously you're going to get in trouble if you ever forget to call `@check n` before writing to the buffer, which is what we did here. I'm sure at the time, my innocent, yet prideful self thought, "ha, this is just one measly `'-'` byte I'm writing to the buffer, it's probably fine to avoid the check just this once". But therein lies the problem: a previous write operation may have, with just the right alignment/operations, ended writing at the exact end of the 4,194,304 byte buffer. This means the current writing position would be one byte past the end of the buffer. 
Then along comes a negative Integer and pow!, not only do we write past the end of the buffer with `@inbounds buf[pos] = UInt8('-')`, but we then increment `pos += 1`, grab the length of the digits in our Integer, call `@check n`, which realizes we're going to be (are) past the end of the buffer, so it goes to flush and whoops: we just tried to flush `pos - 1`, which is also one byte past the end of the buffer. Goodness. The moral of the story is: always check your buffer lengths, folks.
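The pattern (and its failure mode) can be sketched with hypothetical names, independent of CSV.jl's actual `@check` macro and buffer types:

```julia
# Minimal sketch of the buffered-writing pattern: check space once up front,
# then write unchecked. Skipping the check before even a single byte is
# exactly the bug described above.
mutable struct Buf
    bytes::Vector{UInt8}
    pos::Int          # next position to write
    io::IOBuffer      # destination to flush to
end
Buf(n::Int) = Buf(Vector{UInt8}(undef, n), 1, IOBuffer())

# ensure at least n bytes of space remain, flushing to io if needed
function checkspace!(b::Buf, n::Int)
    if b.pos + n - 1 > length(b.bytes)
        write(b.io, view(b.bytes, 1:b.pos-1))  # flush everything written so far
        b.pos = 1
    end
    return b
end

function writebyte!(b::Buf, byte::UInt8)
    checkspace!(b, 1)              # the fix: never skip this, even for one '-' byte
    @inbounds b.bytes[b.pos] = byte
    b.pos += 1
    return b
end
```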
* Provide a row writing iterator Implements suggestion in #608. This introduces a new structure, `CSV.RowWriter`, which is the writing complement to `CSV.Rows`. It allows taking any Tables.jl-compatible input, with the same options as `CSV.write`, but instead of writing out to a file, it returns an iterator whose elements are fully csv-written rows. We mostly had all the machinery in place to provide this, so it's not a ton of new code. I wondered a bit whether the first iteration should produce column names or just data and if that should be configurable, but decided to just do column names on first iteration and data thereafter. I think it's easy enough for callers to drop the first iteration if not needed anyway, but I'd love to hear people's feedback on that. I haven't done any tests or docs for this yet; mainly put it up in order for people to review, so let me know what you think of the design and if we're missing anything major here. * lots of tests * Add docs and error when we overflow the row buffer * couple fixes
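Intended usage, sketched (the exact row strings depend on the `CSV.write` options passed through):

```julia
using CSV

# any Tables.jl-compatible source works; here, a vector of NamedTuples
rows = collect(CSV.RowWriter([(a = 1, b = "x"), (a = 2, b = "y")]))
# first iteration yields the header row, the rest yield data rows
```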
* Alternative bool values added by default * removed Manifest.toml
The change to `levels!` implies that all reference codes will be updated at the end to match the new order of levels (unless they were already sorted). Require DataFrames 0.21 which is the only version to support CategoricalArrays 0.8. Bump version to 0.6.2
* Refactor CSV internals to produce fully mutable columns by default This involves quite a bit, so I'll try to spell out some of the chunks of work going on here, starting with the key insight that kicked off all this work. Previously, to get a "type-stable" inner parsing loop over columns, I had come to the conclusion you actually needed the same concrete typed column type, no matter the column type of the data. This led to the development of the `typecode` + `tapes` system that has served us pretty well over the last little while. The idea there was any typed value could be coerced to a `UInt64` and thus every column's underlying storage was `Vector{UInt64}`, wrapped in our `CSV.Column` type to reinterpret the bits as `Int64`, `Float64`, etc. when indexed. This was a huge step forward for the package because we were no longer trying to compile custom parsing kernels for every single file with a unique set of column types (which could be fast once compiled, but with considerable overhead at every first run). The disadvantage of the tape system was it left our columns read-only; yes, we could have made the regular bits types mutable without too much trouble, but it was hard to generalize to all types and we'd be stuck playing the "make `CSV.Column` be more like `Array` in _this_ way" game forever. All those poor users who tried mutating operations on `CSV.Column`, not realizing they needed to make a copy. While refactoring ODBC.jl recently, at one point I realized that with the fixed, small set of unique types you can receive from a database, it'd be nice to roll my own "union splitting" optimization and unroll the 10-15 types myself in order to specialize. This is because the core Julia-provided union-splitting algorithm will bail once you hit 4 unique types. But in the case of ODBC.jl, I knew the static set of types, and unrolling the 12 or so types for basically `setindex!` operations could be a boon to performance.
As I sat pondering how I could reach into the compiler with generated functions or macros or some other nonsense, the ever heroic Tim Holy swooped in on my over-complicated thinking and showed me that just writing the `if` branches and checking the objects against concrete types _is exactly the same_ as what union-splitting does. What luck! What this means for CSV.jl is we can now allocate "normal" vectors and, in the innermost parsing loop (see `parserow` in `file.jl`), pull the column out of our `columns::Vector{AbstractVector}`, check it against the standard set of types we support, and manually dispatch to the right parsing function. All without spurious allocations or any kind of dynamic dispatch. Brilliant! In implementing this "trick", I realized several opportunities to simplify/clean things up (net -400 LOC!), which include: * Getting rid of `iteration.jl` and `tables.jl`; we only need a couple lines from each now that we are getting rid of `CSV.Column` and the threaded-row iteration shenanigans we were up to * Getting rid of the `typecodes` we were using as our pseudo-type system. While this technically could have stayed, I opted to switch to using regular Julia types to streamline the process from parsing -> output, and hopefully pave the way to supporting a broader range of types while parsing (see #431). * Move all the "sentinel array" tricks we were using into a more formally established (and tested) [SentinelArrays.jl](https://github.com/JuliaData/SentinelArrays.jl) package; a `SentinelArray` uses a sentinel value of the underlying array type for `missing`, so it can be thought of as a `Vector{Union{T, Missing}}` for all intents and purposes. This package's array types will be used while solutions to using regular `Vector{Union{T, Missing}}` are worked out; specifically, how to efficiently allow a non-copying operation like `convert(Vector{T}, A::Vector{Union{T, Missing}})`. Perhaps most surprising about this PR is that it's pretty much non-breaking!
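The manual union-splitting trick can be sketched like so (hypothetical function name; CSV.jl's real loop lives in `parserow` in `file.jl`):

```julia
# Checking an abstractly-typed value against concrete types in plain `if`
# branches lets the compiler specialize each branch -- no dynamic dispatch,
# even though `col::AbstractVector` is abstract at the call site.
function setcell!(col::AbstractVector, row::Int, val)
    if col isa Vector{Int64}
        col[row] = val::Int64
    elseif col isa Vector{Float64}
        col[row] = val::Float64
    elseif col isa Vector{String}
        col[row] = val::String
    else
        col[row] = val   # fallback: genuinely dynamic
    end
    return col
end
```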
I haven't finished the `CategoricalArrays` support just yet, but we'll include it for now as we get ready to make other deprecations in preparation for 1.0. The other bit I'm still mulling over is how exactly to treat string columns; in this PR, we just fully materialize them as `Vector{String}`, but that can be a tad expensive depending on the data. I'm going to try out some solutions locally to see if we can utilize `WeakRefStringArray`, but will probably leave the option to just materialize the full string columns, since we've had people ask for that before. If you've made it this far, congratulations! I hope it was useful in explaining a bit about what's going on here. I'd love any feedback/thoughts if you have them. I know it's a lot to ask anyone to review such a monster PR in their free time, but if you're willing, I'm more than happy to answer questions or chat on slack (#data channel) to help clarify things. * Support CategoricalArrays * fixes * travis updates * Make string allocating more efficient, particularly in the threaded case where we'll piece the threaded chunks together via threads * More multithreaded refactoring * Get tests passing again * fix 32-bit
Previously, we relied on a BoundsError to be thrown, but that can be cryptic for users who inadvertently hit the internal buffer size limit. Fixes #643 by throwing a more descriptive, helpful error.
turning them into Vector{Missing}
Instead of keeping track of the position/length of every single cell, we move to a "reparse" strategy. The problem with tracking poslengths was the prohibitive memory cost, usually doubling the memory requirements in the type detection case. This change tracks the starting position of a chunk and will go back and reparse a single column as string if needed. I'm not exactly sure how much test coverage we have on this, but I'll dive in to review and make sure we have some good threaded/non-threaded cases where promotion is needed.
…le parsing (#650) Support lazystrings=true option to avoid fully allocating strings while parsing
* Clean up limit support and allow limiting when multithreaded parsing
Fixes #642. There were a couple of problems with how our errors/warnings reported which row was affected. We weren't accounting for any initial data row offsets. For multithreading, this is a tricky problem because each chunk doesn't inherently know which absolute row it's parsing when it encounters a problem. For now, we pass in a "guess" of which row we're on based on the estimated # of rows. It's not amazingly accurate by any means, but should at least be in the general ballpark. The consolation is that in non-threaded mode or CSV.Rows, the row # reporting should be exact, if you really need to track down a problematic row.
Fixes #631. skipto and datarow are treated identically. If they're specified, they don't affect the header arg at all, so we clarify that in the docs. We also ensure skipto gets the same validation as datarow had.
Fixes #945. The core issue here is an inconsistency between the initial type detection code and the later type promotion code used while parsing. The type detection code had a limit set up so that inline string column types weren't allowed to be larger than `String31`. The promotion code, however, allowed inline string types to continue to promote up to `String255`. That means if a column type was initially detected as some smaller-than-`String31` type, and later on a value larger than 31 bytes was parsed, it would continue to promote up to larger inline string types. The change proposed in this PR is to limit the promotion code to be consistent with the type detection code: if a parsed value "overflows" the `String31` type, we'll just promote to a regular `String` instead of promoting to larger inline string types.
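The fixed rule can be sketched in plain Julia (widths here correspond to the content capacities of the `String1`...`String31` InlineStrings types; the function is a stand-in for the real promotion code):

```julia
# After the fix: promote within inline string widths only up to 31 bytes;
# anything longer falls back to a plain heap-allocated String.
const INLINE_WIDTHS = (1, 3, 7, 15, 31)

function promoted_width(len::Int)
    for w in INLINE_WIDTHS
        len <= w && return w      # smallest inline string type wide enough
    end
    return nothing                # overflowed String31 => use String
end
```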
Fixes #948. As noted in the included code comment, if the last expected value to be parsed is `missing`, the delimited file may not include a final newline, so the EOF, while abrupt, can be treated as valid. In this specific case, we can avoid emitting the warning and just ensure that column is marked as having a `missing` value parsed.
* Add tests for duplicate column name handling In the next commit, we'll be changing the code responsible for naming duplicate columns and these tests should ensure that the behavior doesn't change. * Improve performance of reading files with duplicate column names I need to load a file with 30k columns, 10k of these have the same name. Currently, this is practically impossible because makeunique(), which produces unique column names, has cubic complexity. This commit changes the algorithm to use a Dict to quickly look up the existence of columns and to cache the next numeric suffix used to uniquify column names. Care has been taken to ensure that columns are named the same way as before. To that extent, additional tests were added in the previous commit.
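The Dict-based approach can be sketched roughly as follows (a simplified stand-in for the real `makeunique`, using the `_N` suffix convention described above):

```julia
# Roughly linear-time deduplication: a Dict tracks which names already exist
# and caches the next numeric suffix to try per base name, replacing the old
# cubic rescan of all prior names.
function makeunique_sketch(names::Vector{Symbol})
    seen = Dict{Symbol,Int}()          # name => next suffix to try for it
    out = Symbol[]
    for nm in names
        if !haskey(seen, nm)
            seen[nm] = 1
            push!(out, nm)
        else
            k = seen[nm]
            candidate = Symbol(nm, :_, k)
            while haskey(seen, candidate)  # skip suffixes already taken
                k += 1
                candidate = Symbol(nm, :_, k)
            end
            seen[nm] = k + 1               # cache where to resume next time
            seen[candidate] = 1
            push!(out, candidate)
        end
    end
    return out
end
```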
* Overhaul how column pooling works while parsing Fixes #950. Ok, so there's a lot going on here; let me try to summarize the motivation and what the changes are doing: * In 0.8 releases, `CSV.File` would parse the first 100 rows of a file, then use the # of unique values and the provided `pool` argument to determine whether a column should be pooled or not * In 0.9, this was changed, since considering just the first 100 rows tends not to be very representative of large files in terms of # of unique values. Instead, we parsed the entire file, and afterwards did the calculation for whether the column should be pooled. * Additionally, in multithreaded parsing, each thread parsed its own chunk of the file, assuming a column would be pooled. After parsing, we needed to "synchronize" the ref codes that would be used by a pooled column. This ended up being expensive: recoding each chunk of a column for the whole file and processing/reprocessing the unique values of each chunk. This was notably expensive when the # of unique values was large (10s of thousands of values), as reported in #950. That provides the context for why we needed to change, so what is changing here: * We previously tried to assume a column would be pooled before parsing and build up the "pool dict" while parsing. This ended up complicating a lot of code: additional code to build the dict, complications because the element type doesn't match the column type (`UInt32` for the ref array, `T` for the pool dict key type), and complications for all the promotion machinery (needing to promote/convert pool values instead of ref values). That's in addition to the very complex `syncrefs!` routine, which has been the source of more than a few bugs. * Proposed in this PR is to ignore pooling until post-parsing, where we'll start checking unique values and do the calculation for whether a column should be pooled or not.
* Also introduced in this PR is allowing the `pool` keyword argument to be a number greater than `1.0`, which will be interpreted as an absolute upper limit on the # of unique values allowed before a column switches from pooled to non-pooled. This seems to make sense for larger files for several reasons: using a % can result in a really large # of unique values allowed, which can be a performance bottleneck (though still allowed in this PR), and it provides a way for us to "short-circuit" the pooling check post-parsing. If the pool of unique values gets larger than the pool upper limit threshold, we can immediately abort building an expensive pool with potentially lots of unique values. We update the default of the `pool` keyword argument to `500`. Another note is that a file must have a number of rows greater than the `pool` upper limit in order to pool; otherwise, the column won't be pooled. One consideration, and potential optimization, is that if `stringtype=String` or a column otherwise has a type of `String`, we'll individually allocate each row's `String` during parsing, instead of getting the natural "interning" benefit we got with the pooling strategies of 0.8 or previous to this PR. During various rounds of benchmarking ever since before 0.8, one thing that has continued to surface is how efficient allocating individual strings actually is. So while we could _potentially_ get a little efficiency by interning string columns while parsing, it's actually a more minimal benefit than people may guess. The memory saved by interning gets partly used up by having to keep the intern pool around, and the extra cost of hashing to check the intern pool can be comparable to just allocating a smallish string and setting it in our parsed array. The other consideration is multithreaded parsing: the interning benefit isn't as strong when each thread has to maintain its own intern pool (not as many values to intern against).
We could perhaps explore using a lock-free or spin-lock based dict/set for interning globally for a column, which may end up providing enough performance benefit. If we do manage to figure out efficient interning, we could leverage the intern pool post-parsing when considering whether a column should be pooled or not. * Allow pool keyword arg to be Tuple{Float64, Int}
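A usage sketch of the expanded `pool` argument (data made up for illustration; no assertion here about which columns actually end up pooled):

```julia
using CSV

io = IOBuffer("id,city\n" * join(("$i,City$(i % 3)" for i in 1:10), "\n"))
# pool as a (fraction, limit) tuple: pool a column only if its unique values
# are at most 50% of rows AND at most 500 distinct values
f = CSV.File(io; pool = (0.5, 500))
```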
* add option stripwhitespace * add stripwhitespace to Contexts * Add tests for stripwhitespace (issue 951) * Update src/keyworddocs.jl * Update test/basics.jl Co-authored-by: Thomas Poulsen <ta.poulsen@gmail.com> Co-authored-by: Jacob Quinn <quinn.jacobd@gmail.com>
* Correct error writing struct * Update src/write.jl * Update src/write.jl * Update test/write.jl * Delete jcunwin.jl Co-authored-by: John Unwin <john@fingertip.co.nz> Co-authored-by: Jacob Quinn <quinn.jacobd@gmail.com>
* Fix use of limit in multithreaded parsing Fixes #963. The issue here is that although we were adjusting the # of rows to a provided limit when parsing multithreaded, we failed to adjust the actual column arrays to the correct size. This has been an issue since we converted from the old `CSV.Column` custom array type to returning "normal" arrays in the 0.7 -> 0.8 transition. With `CSV.Column`, we just passed the final row total and it adjusted the size dynamically, without physically resizing the underlying array. With regular arrays, however, we need to ensure the array gets resized appropriately. This became more apparent with the recently released pooling change, since it surfaced as a silenced BoundsError because of the use of `@inbounds` in the new `checkpooled!` routine. I've taken out those `@inbounds` uses for now to be more conservative. The fix is fairly straightforward: if we adjust our final row count down to a user-provided limit, we loop over the parsing tasks and "accumulate" rows until we hit the limit, then resize or `empty!` columns as appropriate. * Fix
These silence some long warnings when inspecting a CSV-dependent package with JET.
Implements #1001. We already had the internal methods here, so this PR just adds some higher-level user-facing methods that take a plain "row" (from the Tables.jl "Row" interface) or an `io` and row. This is convenient when you don't have a traditional table iterator (and so can't use RowWriter), but just want to repeatedly call `CSV.writerow` yourself and get the delimited output.
* add alternatives to readme * two more links * tweak * update DelimitedFiles note Co-authored-by: Chris Elrod <elrodc@gmail.com> * load/save suggestion * Update README.md Co-authored-by: Chris Elrod <elrodc@gmail.com>
* typos * typo
…eyword (#1016) * Support `types` being `Dict{Regex}` * Validate `types` getting `Regex` * Support `types` being `Dict{Any}` with `Regex` key(s) * Test `dateformat` now also understands Regex * fix typo in test * Add test for when `Regex` and exact name both match * Document that Regex can be used to identify columns by name
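A usage sketch of the `Regex`-keyed `types` dict (column names and data made up for illustration):

```julia
using CSV

io = IOBuffer("price_a,price_b,label\n1,2,x\n")
# a Regex key applies the requested type to every column whose name matches
f = CSV.File(io; types = Dict(r"^price" => Float64))
```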
…ing (#1023) * Pull out initial type-setting logic into function * Refactor function for initialising cols with types * Re-use type-setting logic when new column found * Split fallback case into specific methods * Add some internal docs/comments * Test `types isa Function` case * fixup! Re-use type-setting logic when new column found * Test `types isa AbstractDict` case * Log message if we expect parsing to fail - We expect parsing to fail in the edge case where we unexpectedly find an extra column and its type was set via the `types` keyword to be some type `T` which we don't know how to parse (i.e. `T` is not a standard/supported type nor a custom type for which we have already compiled a specialised parse method). - In such a case, log an error-level message to inform the user we are in this case and parsing is not expected to work. - In future we could consider trying to support this situation. * fixup! Test `types isa AbstractDict` case * fixup! Log message if we expect parsing to fail
Fixes #982. Since the size of `String1` and `String3` are <= the size of the ref integer type we use for pooling (`UInt32`), let's avoid pooling them by default. Users can still request specific columns be pooled like always.