-
Notifications
You must be signed in to change notification settings - Fork 85
Support zarr-python 3.x #646
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This implements a zarr-python 3.x compatible GDSStore.
|
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
|
I added a few notes / questions to the original post about how we want to handle backwards compatibility and packaging, if anyone has thoughts there. |
|
I've updated this to make the If the user has zarr-python 2.x installed, then there's no change. If you have zarr-python 3.x, then your code might continue to work if you're just using |
madsbk
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great. One question, can we get CI to run both tests?
Maybe pip install Zarr v3 in run_pytests.sh or something?
|
It looks like zarr-python 3 is python >= 3.11, whereas I think we're >= 3.10. That might complicate things a bit. Per their migration guidelines, v2 is only likely to be supported until the summer. Given that, I would be OK with only supporting v3 and just not supporting this feature with py3.10. |
Is also fine with me. |
|
I'm looking into the versions resolved during CI now. Currently, all jobs are using zarr-python 2.x
Assuming we can get the 3.12 job using zarr 3.x, I think we'll be in a good spot for CI / testing. |
| from importlib import metadata as _metadata | ||
|
|
||
| from packaging.version import Version as _Version, parse as _parse | ||
|
|
||
| if _parse(_metadata.version("zarr")) >= _Version("3.0.0"): | ||
| from ._zarr_python_3 import * # noqa: F401,F403 | ||
| else: | ||
| from ._zarr_python_2 import * # type: ignore[assignment] # noqa: F401,F403 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW, if we're doing this, I would prefer that we have:
import kvikio.zarr # old zarr-python 2 interface, for backwards compat, warns about deprecation
import kvikio.zarr2 # move the zarr-python 2 code here
import kvikio.zarr3 # new zarr-python 3 interface
This way, a user's scripts don't suddenly break because they run on them on a machine with a different environment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have two strategies here: try to preserve source-code compatibility or not.
So under option 1, we have the potential for some user's code working with either zarr-python 2.x or 3.x, but if and only if the user limits themselves to kvikio.zarr.GDSStore and the public Zarr API that's compatible between v2 and v3 (essentially, passing in store=store to most of the group and array APIs).
If you're using any of the other features of kvikio.zarr like open_cupy_array or any of the compressors then that will break if you update to zarr-python 3.x, because we don't attempt to provide source compatibility there. And if you're using one of the zarr-python APIs that changed between zarr-python 2.x and 3.x, then stuff will break but that's outside of kvikio's hands.
We also get the (IMO) nicer name of kvikio.zarr. I dislike kvikio.zarr3 since it's not immediately clear whether the 3 refers to zarr format version or the zarr-python version (it's the zarr-python version, which can write files with zarr-format=2 or 3).
I don't have a strong preference between the two strategies. My initial implementation would have required users to import the new module name with zarr-python 3.x, but I very slightly preferred trying to preserve source compatibility for that (potentially narrow) subset of use cases where source compatibility is possible to preserve.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I guess I see that. I'm kind of on the fence because presumably at some point in the future we will only support zarr3 and then there'll be another breakage.
What do we think is our best migration strategy, if at some point in the future kvikio.zarr will mean kvikio.zarr3?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
at some point in the future we will only support zarr3 and then there'll be another breakage.
What would cause the second breakage? Moving people back to kvikio.zarr, if we do the kvikio.zarr2 / kvikio.zarr3 split? IMO that wouldn't be worth the extra churn.
Here's an attempt at a summary. Under the PR as currently written, we have
| time | import | behavior |
|---|---|---|
| previous | kvikio.zarr | only zarr-v2 supported |
| today | kvikio.zarr | zarr v2 or v3 supported. API depends on version at runtime. Some code will work with either, some code will break. |
| future | kvikio.zarr | only zarr v3 supported. |
Under the split zarr.v2, zarr.v3 proposal
| time | import | behavior |
|---|---|---|
| previous | kvikio.zarr | only zarr-v2 supported |
| today | kvikio.zarr2 or kvikio.zarr3 | zarr v2 or v3 supported. Users must update source to use the version they want. |
| future | kvikio.zarr (or kvikio.zarr3?) | only zarr v3 supported. Users must update their source if we deprecate kvikio.zarr3, or do nothing if we leave it as kvikio.zarr3. |
I'm still leaning slightly towards just using kvikio.zarr thanks to that potential for maintaining source-compatibility, but won't push strongly for it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still leaning slightly towards just using
kvikio.zarrthanks to that potential for maintaining source-compatibility, but won't push strongly for it.
Agree, but I have no strong opinion either :)
@wence- ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that's a convincing argument in favour of the approach here. Let's go with that
| with kvikio.CuFile(path) as f: | ||
| # Note: this currently creates an IOFuture and then blocks | ||
| # on reading it. The blocking read means this is in a regular | ||
| # sync function, and so this must be run in a threadpool. | ||
| future = f.pread(raw, size=nbytes, file_offset=file_offset) | ||
| future.get() # blocks |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
question: Should we figure out a way to hook this up with the normal asyncio stuff?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe... When I went down a rabbit hole investigating zarr-developers/zarr-python#2863 (comment) I looked into this a bit and came across some discussions around supporting io_uring for asyncio on local files (python/cpython#85443, https://discuss.python.org/t/asyncio-for-files/31077). Those haven't gone anywhere.
In theory we ought to be able to do stuff with selectors and file descriptors. I'll do a bit more research and will open an issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#79 (comment) has a bit of a writeup with my findings if you're curious, but the tl/dr is that this seems pretty challenging (at least given my understanding of the C++ / CUDA side of things).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, thanks. It looks like if we would want to do that we need to return honest-to-goodness python futures from pread on the python side. There's some work changing how the C++ futures are launched that might help things here.
What I would be worried about getting right is that setting the python future status requires the gil so locking order is going to be tricky.
In https://github.com/rapidsai/kvikio/actions/runs/13633004880/job/38105532469?pr=646#step:10:88 we pick up an alpha release of zarr-python 3. There aren't any official releases of zarr-python 3.x that support python 3.10, but some alphas did. I'm not 100% why pip decided to pick that alpha release, but it's possible that the upper bound on zarr-python<4.0.0a0 allowed pip to select an alpha release. By removing that `.a0` I'm hopeful that pip will only consider official releases, and pick the latest 2.x release on python 3.10
|
CI stuff should be sorted out in 172eeb7. The 3.10 environments are resolving to zarr-python 2.18.3 and running the tests in The 3.12 tests are getting the latest version of zarr and are running the tests in |
madsbk
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good
| packages: | ||
| - rapids-dask-dependency==25.4.*,>=0.0.0a0 | ||
| - pytest | ||
| - pytest-asyncio |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was needed to run the tests from zarr's StoreTests.
bdice
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Packaging changes look good.
Before merging, could you rewrite the PR description to reflect the final state rather than the wide range of design options discussed in the early stages of this PR? It will go into the commit message and will also help future readers of this PR.
This adds support for zarr-python 3.x to kvikio.
zarr-python 3.x made large changes to the
storeAPI and the way arrays (and their location) are specified, but minimal changes to the mainGroupandArrayAPIs. So in this context "support" for zarr-python 3.x means just a newStoreclass that useskvikiofor file I/O.In particular, we don't need a replacement for anything else like
open_cupy_arrayor compressors: these are built into zarr-python 3.x now.Closes #582.