-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Description
From discord conversation.
The idea is that since we now dynamically adapt predicates/filters on a per-file basis including adding/removing casts, etc. we could dynamically at write time narrow the logical types we encode in Parquet.
For example, if we are writing a column that is Int64 in the schema but for this file in particular the min/max values we see are (2, 19) we can add a UINT_16 annotation to signal to readers that this column is a UInt16 column. I'm not sure but I'd guess this can be done after writing the data as it's a metadata only operation. Then when a reader sees a predicate like col > 5 it can cast 5 to UInt16 and evaluate the whole thing more efficiently. I guess if it were col > 9999 it could (just from the metadata, without even looking at stats) replace that with false?.
A next step would be to narrow the physical representation of the data to save storage costs, etc, but I think that would be more involved since it would require two passes over the data or something like that.
cc @asubiotto