[RFC] Add module to make datasets IO easier with pandas #152
This is quite interesting! Some initial thoughts, and coming from a place of ignorance:
Also, just for clarity: with the two options above, if the second is used, do you mean that the whole of pandas would be available as a namespaced component, or just these functions? For the shorter import, I wonder if either would be more natural (so keeping the original
Thanks @acroz, thought about this a bit more and considered all the comments above. How about the following?

```python
from faculty import datasets

url = datasets.presigned_url("/path/to/any/file")
```

We can then add a section in the docs (or docstrings) illustrating usage with

As for writing, you can do something like

This satisfies the two requirements:
@acroz Also ran a quick test to compare speed for an AWS backend. For a 139M CSV file,
These are very promising because
f9d8835 to 8638804
This needs tests added before merging.
While this is still in draft stage, and prior to implementing tests, I'd like to get review on the API proposed by this PR.
Expected usage looks like:
These closely mirror the pandas API (extra args and kwargs are just passed through), except that the `to_csv` functionality in pandas is a method and not available (AFAICT) as a static function.

## Pandas as an optional dependency
`faculty` does not currently depend on numpy or pandas. It's nice to keep it that way, as the library can be kept lightweight for the majority of applications where the sometimes-expensive installation of numpy is not required. I propose that an optional dependency on pandas be included for this functionality via an `extras_require` entry in `setup.py`.

For the main expected use case (inside the platform), pandas is always expected to be available, so users will rarely encounter the case where it's not available. Managing the case where pandas is not installed could be:

`ModuleNotFoundError`.

I'm interested in input on the above or other options.
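
To make the call-time import idea concrete, here is a minimal sketch of how a wrapper could defer importing pandas until it is actually used and fail with an actionable message otherwise. The helper name `import_optional`, the extra name `pandas` in `faculty[pandas]`, and the simplified `read_csv` signature are all assumptions for illustration, not the code in this PR.

```python
import importlib


def import_optional(module_name, extra):
    """Import ``module_name`` at call time, or fail with an actionable message.

    ``extra`` is the (assumed) name of the ``extras_require`` entry that
    would install the module, e.g. ``pip install faculty[pandas]``.
    """
    try:
        return importlib.import_module(module_name)
    except ModuleNotFoundError as error:
        raise ModuleNotFoundError(
            "{} is required for this function; install it with "
            "'pip install faculty[{}]'".format(module_name, extra)
        ) from error


def read_csv(path_or_buffer, *args, **kwargs):
    """Hypothetical wrapper: pandas is only imported when this is called."""
    pandas = import_optional("pandas", extra="pandas")
    # Extra args and kwargs are passed straight through to pandas.
    return pandas.read_csv(path_or_buffer, *args, **kwargs)
```

With this shape, importing the wrapper module stays cheap and never fails on its own; users without pandas only see an error when they call one of the pandas-backed functions.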
## Possible aliases
Current recommended style when using `faculty.datasets` is:

People seem to prefer shorter aliases for things (it seems the data science community finds the 5/6 characters of numpy/pandas too lengthy!) so we may want to encourage a particular alias, such as:
Extra ideas welcome.
Alternatively, if we go with option 2 above (import pandas at function call-time), we could import `faculty.datasets.pandas` in `faculty/datasets/__init__.py`, and then the pandas functionality appears as some namespaced components of `faculty.datasets`, e.g.: