Conversation
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces a robust and efficient data upload mechanism by leveraging signed URLs for direct cloud storage interaction and adopting compressed Parquet format for datasets. It also centralizes dataset limit management by fetching them from the server, ensuring consistent validation across the client. These changes collectively improve data transfer security, speed, and maintainability.

Highlights
Code Review
This pull request introduces a significant change to the data upload mechanism, switching to signed URLs and parquet format for better performance and security. It also fetches dataset limits from the server at initialization, making the client more robust to server-side changes. The implementation is solid, but I have a few suggestions to improve maintainability, resource management, and error handling, particularly around the new upload logic and data caching.
Note: Security Review did not run due to the size of the PR.
d783fee to c65b8da
@simo-prior, we're going to have @ggprior give this a pass so that he can get some context for the v3 flag and whether that's needed at all, as you pointed out. I think the one outstanding product question is whether we should enforce that users can't access the new model unless they're in the experiment, or just control this client-side with the understanding that some people will slip through early.
```python
logging.getLogger("httpcore").setLevel(logging.WARNING)
logging.getLogger("httpcore.http11").setLevel(logging.WARNING)


_DEFAULT_CHUNK_UPLOAD_PARALLELISM = 16
```
@simo-prior how did we come up with this level of parallelism?
```python
def _get_crc32c_hash(data: bytes) -> str:
    """Computes the CRC32C checksum and returns it as a base64 encoded string."""
    crc32c_value = google_crc32c.value(data)
    return base64.b64encode(struct.pack(">I", crc32c_value)).decode("ascii")
```
My understanding was that we're using the file hash provided by GCS? Or is this the only hash we use, including within any upstream systems?
```python
if cached_dataset_uid:
    return cached_dataset_uid
limits = cls.get_dataset_limits()
if limits is not None:
```
What if dataset limits are indeed None - which can happen if the endpoint doesn't return any information about them? How do we treat that case?
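One way to handle that case, sketched below with hypothetical names and default values (not the PR's actual ones), is to fall back to conservative client-side defaults whenever the endpoint returns nothing, so validation never silently disappears:

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical fallbacks mirroring previously hardcoded client-side limits.
_FALLBACK_MAX_CLASSES = 10
_FALLBACK_MAX_ROWS = 10_000


@dataclass
class DatasetLimits:
    max_classes: int
    max_rows: int


def resolve_limits(server_limits: Optional[DatasetLimits]) -> DatasetLimits:
    """Use server-provided limits when available; otherwise fall back to
    conservative defaults instead of skipping validation entirely."""
    if server_limits is None:
        return DatasetLimits(_FALLBACK_MAX_CLASSES, _FALLBACK_MAX_ROWS)
    return server_limits


print(resolve_limits(None).max_classes)  # → 10
```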
```python
        ): i
        for i in range(num_chunks)
    }
    for future in as_completed(futures):
```
Just confirming: will this run _upload_single_chunk in sequence? Obviously, order matters here.
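For what it's worth, `as_completed` yields futures in completion order, not submission order, so the chunks do run concurrently. The `futures` dict in the snippet maps each future back to its chunk index, which is what lets results be reassembled in order afterwards. A minimal sketch of the pattern, with a stub in place of the real upload call:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def _upload_single_chunk_stub(chunk_index: int) -> str:
    # Stand-in for the real chunk upload; returns a fake ETag.
    return f"etag-{chunk_index}"


def upload_chunks(num_chunks: int, parallelism: int = 4) -> list:
    etags = [None] * num_chunks
    with ThreadPoolExecutor(max_workers=parallelism) as executor:
        # Map each future back to its chunk index, as in the PR snippet.
        futures = {
            executor.submit(_upload_single_chunk_stub, i): i
            for i in range(num_chunks)
        }
        # as_completed yields in COMPLETION order, which can differ from
        # submission order; the stored index slots each result back in place.
        for future in as_completed(futures):
            i = futures[future]
            etags[i] = future.result()
    return etags


print(upload_chunks(3))  # → ['etag-0', 'etag-1', 'etag-2']
```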
```python
raise ValueError(
    f"Number of classes {len(self.classes_)} exceeds the maximal number of "
    f"{limits.max_classes} classes supported by TabPFN. Consider using "
    "a strategy to reduce the number of classes. For code see "
```
"Consider using a strategy..." sounds a bit vague - can we mention the many-class extension, besides showing the URL - which you already do nicely?
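As a concrete illustration of one such strategy (a simple fallback, not the many-class extension itself, and not part of the PR): merge the rarest labels into a single bucket so the class count fits within the limit. Function and bucket names here are hypothetical:

```python
from collections import Counter


def cap_num_classes(labels, max_classes, other_label="__other__"):
    """Keep the (max_classes - 1) most frequent labels and merge the rest
    into a single bucket. A simple reduction strategy; TabPFN's many-class
    extension is the more principled option."""
    keep = {label for label, _ in Counter(labels).most_common(max_classes - 1)}
    return [label if label in keep else other_label for label in labels]


y = ["a", "a", "a", "b", "b", "c", "d"]
print(cap_num_classes(y, max_classes=3))
# → ['a', 'a', 'a', 'b', 'b', '__other__', '__other__']
```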
```python
logging.getLogger("httpcore").setLevel(logging.WARNING)
logging.getLogger("httpcore.http11").setLevel(logging.WARNING)


_DEFAULT_CHUNK_UPLOAD_PARALLELISM = 16
```
Hm, so we've got two layers of thread pool executors, so the actual max concurrency is 32. That seems a little high to me, given that this should probably support a wide range of client setups.
```python
    on_backoff=_on_backoff,
    on_giveup=_on_giveup,
)
def _upload_single_chunk(
```
For my understanding, is there any reason why we use a home-brew implementation over the google.cloud.storage SDK methods (which come with native backoff, concurrency, and other bells and whistles, and are pretty well tested)?
ggprior left a comment
Small concerns from me on the chunked upload implementation, as this sort of thing is usually handled well on the cloud provider SDK side, which reduces maintenance overhead in tabpfn-client (but you may not want to add the dependency, which might be a good reason to do it this way).
Also, I'm wondering whether the functionality could fall back cleanly to the legacy API on any failure (directly sending the files through the API). But that's optional, and a clean cut as proposed may even be preferable.
Description