Conversation
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces a robust and efficient data upload mechanism by leveraging signed URLs for direct cloud storage interaction and adopting compressed Parquet format for datasets. It also centralizes dataset limit management by fetching them from the server, ensuring consistent validation across the client. These changes collectively improve data transfer security, speed, and maintainability.

Highlights
Code Review
This pull request introduces a significant change to the data upload mechanism, switching to signed URLs and parquet format for better performance and security. It also fetches dataset limits from the server at initialization, making the client more robust to server-side changes. The implementation is solid, but I have a few suggestions to improve maintainability, resource management, and error handling, particularly around the new upload logic and data caching.
Note: Security Review did not run due to the size of the PR.
d783fee to c65b8da
@simo-prior, we're going to have @ggprior give this a pass so that he can get some context for the v3 flag and whether that's needed at all, as you pointed out. I think the one outstanding product question is whether we should enforce that users can't access the new model unless they're in the experiment, or just control this client-side with the understanding that some people will slip through early.
```python
logging.getLogger("httpcore").setLevel(logging.WARNING)
logging.getLogger("httpcore.http11").setLevel(logging.WARNING)


_DEFAULT_CHUNK_UPLOAD_PARALLELISM = 16
```
@simo-prior how did we come up with this level of parallelism?
```python
def _get_crc32c_hash(data: bytes) -> str:
    """Computes the CRC32C checksum and returns it as a base64 encoded string."""
    crc32c_value = google_crc32c.value(data)
    return base64.b64encode(struct.pack(">I", crc32c_value)).decode("ascii")
```
My understanding was that we're using the file hash provided by GCS? Or is this the only hash we use, including within any upstream systems?
```python
if cached_dataset_uid:
    return cached_dataset_uid
limits = cls.get_dataset_limits()
if limits is not None:
```
What if dataset limits are indeed None - which can happen if the endpoint doesn't return any information about them? How do we treat that case?
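One way to handle that case, sketched below with hypothetical names and default values (not the PR's actual ones), is to fall back to conservative client-side defaults whenever the endpoint returns nothing, so validation never silently disappears:

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical fallbacks mirroring previously hardcoded client-side limits.
_FALLBACK_MAX_CLASSES = 10
_FALLBACK_MAX_ROWS = 10_000


@dataclass
class DatasetLimits:
    max_classes: int
    max_rows: int


def resolve_limits(server_limits: Optional[DatasetLimits]) -> DatasetLimits:
    """Use server-provided limits when available; otherwise fall back to
    conservative defaults instead of skipping validation entirely."""
    if server_limits is None:
        return DatasetLimits(_FALLBACK_MAX_CLASSES, _FALLBACK_MAX_ROWS)
    return server_limits


print(resolve_limits(None).max_classes)  # → 10
```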
```python
        ): i
        for i in range(num_chunks)
    }
    for future in as_completed(futures):
```
Just confirming: will this run _upload_single_chunk in sequence? Obviously, order matters here.
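For what it's worth, `as_completed` yields futures in completion order, not submission order, so the chunks do run concurrently. The `futures` dict in the snippet maps each future back to its chunk index, which is what lets results be reassembled in order afterwards. A minimal sketch of the pattern, with a stub in place of the real upload call:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def _upload_single_chunk_stub(chunk_index: int) -> str:
    # Stand-in for the real chunk upload; returns a fake ETag.
    return f"etag-{chunk_index}"


def upload_chunks(num_chunks: int, parallelism: int = 4) -> list:
    etags = [None] * num_chunks
    with ThreadPoolExecutor(max_workers=parallelism) as executor:
        # Map each future back to its chunk index, as in the PR snippet.
        futures = {
            executor.submit(_upload_single_chunk_stub, i): i
            for i in range(num_chunks)
        }
        # as_completed yields in COMPLETION order, which can differ from
        # submission order; the stored index slots each result back in place.
        for future in as_completed(futures):
            i = futures[future]
            etags[i] = future.result()
    return etags


print(upload_chunks(3))  # → ['etag-0', 'etag-1', 'etag-2']
```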
```python
raise ValueError(
    f"Number of classes {len(self.classes_)} exceeds the maximal number of "
    f"{limits.max_classes} classes supported by TabPFN. Consider using "
    "a strategy to reduce the number of classes. For code see "
```
"Consider using a strategy..." sounds a bit vague - can we mention the many-class extension, besides showing the URL - which you already do nicely?
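As a concrete illustration of one such strategy (a simple fallback, not the many-class extension itself, and not part of the PR): merge the rarest labels into a single bucket so the class count fits within the limit. Function and bucket names here are hypothetical:

```python
from collections import Counter


def cap_num_classes(labels, max_classes, other_label="__other__"):
    """Keep the (max_classes - 1) most frequent labels and merge the rest
    into a single bucket. A simple reduction strategy; TabPFN's many-class
    extension is the more principled option."""
    keep = {label for label, _ in Counter(labels).most_common(max_classes - 1)}
    return [label if label in keep else other_label for label in labels]


y = ["a", "a", "a", "b", "b", "c", "d"]
print(cap_num_classes(y, max_classes=3))
# → ['a', 'a', 'a', 'b', 'b', '__other__', '__other__']
```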
```python
logging.getLogger("httpcore").setLevel(logging.WARNING)
logging.getLogger("httpcore.http11").setLevel(logging.WARNING)


_DEFAULT_CHUNK_UPLOAD_PARALLELISM = 16
```
Hm, so we've got two layers of thread pool executors, so the actual max concurrency is 32. That seems a little high to me, given that this should probably support a wide range of client setups.
```python
    on_backoff=_on_backoff,
    on_giveup=_on_giveup,
)
def _upload_single_chunk(
```
For my understanding, is there any reason why we use a home-brew implementation over the google.cloud.storage SDK methods (which come with native backoff, concurrency, and other bells and whistles, and are pretty well tested)?
ggprior left a comment
Small concerns from me on the chunked upload implementation, as this sort of thing is usually handled well on the cloud provider SDK side, which reduces maintenance overhead in tabpfn-client (but you may not want to add the dependency, which might be a good reason to do it this way).
Also, I'm wondering whether the functionality could fall back cleanly to the legacy API on any failure (directly sending the files through the API). But that's optional, and a clean cut as proposed may even be preferable.
Description