Implement lazy loading for inline Arrow results #1029
Conversation
This PR introduces lazy loading support for inline Arrow results to improve memory efficiency when handling large result sets.
Previously, InlineChunkProvider would eagerly fetch all arrow batches upfront when results had hasMoreRows = true, which could lead to memory issues with large datasets. This change splits the handling into two separate paths:
1. Lazy path (new): For Thrift-based inline Arrow results (when ARROW_BASED_SET is returned), we now use LazyThriftInlineArrowResult which fetches arrow batches on-demand as the client iterates through rows. This is similar to how LazyThriftResult works for columnar data.
2. Remote path (existing): For URL-based Arrow results (URL_BASED_SET), we continue using ArrowStreamResult with RemoteChunkProvider which downloads chunks from cloud storage.
The InlineChunkProvider is now only used for SEA results with JSON_ARRAY format and INLINE disposition, which contain all data inline (no hasMoreRows flag set).
This should reduce memory consumption and improve performance when dealing with large inline Arrow result sets, similar to #975.
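For illustration, the lazy path boils down to the following fetch-on-demand shape. This is only a sketch: ThriftClient, FetchResult, ArrowBatch, and OperationHandle are hypothetical stand-ins for the driver's real Thrift types, not the actual LazyThriftInlineArrowResult code.

```java
import java.util.Collections;
import java.util.List;

class LazyBatchIterator {
  private final ThriftClient client;      // hypothetical wrapper around FetchResults
  private final OperationHandle handle;
  private List<ArrowBatch> batches = Collections.emptyList();
  private int nextIndex = 0;
  private boolean hasMoreRows = true;

  LazyBatchIterator(ThriftClient client, OperationHandle handle) {
    this.client = client;
    this.handle = handle;
  }

  /** Returns the next Arrow batch, issuing a FetchResults round trip only when needed. */
  ArrowBatch nextBatch() {
    while (nextIndex >= batches.size()) {
      if (!hasMoreRows) {
        return null; // result set fully consumed
      }
      FetchResult result = client.fetchResults(handle); // one server round trip
      batches = result.getArrowBatches();
      hasMoreRows = result.hasMoreRows();
      nextIndex = 0;
    }
    return batches.get(nextIndex++);
  }
}
```

The key difference from the eager path is that only one server response's worth of batches is held in memory at a time, instead of every batch for the whole result set.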
I need to make some changes related to the JDBC spec around row count, because we don't have that data point when lazily fetching results.
This PR has been marked as Stale because it has been open for 30 days with no activity. If you would like the PR to remain open, please remove the stale label or comment on the PR.
This PR was closed because it has been inactive for 7 days since being marked as stale.
```java
// Check if we've reached the maxRows limit
boolean hasRowLimit = maxRows > 0;
if (hasRowLimit && globalRowIndex + 1 >= maxRows) {
  return null;
}
```
So globalRowIndex 0 means the 1st row, and maxRows - 1 would be the last row.
At the last row, globalRowIndex + 1 == maxRows, which means no more rows. We can just check == instead of >=; any reason you are doing that?
You are right, we don't strictly need >= (I was being defensive).
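For reference, a worked walk-through of the boundary (not part of the diff), assuming globalRowIndex is the 0-based index of the row just returned:

```java
// With maxRows = 3:
//   after row 0: 0 + 1 = 1 -> 1 < 3, more rows allowed
//   after row 1: 1 + 1 = 2 -> 2 < 3, more rows allowed
//   after row 2: 2 + 1 = 3 -> equals maxRows, limit reached, stop
// Because globalRowIndex advances one row at a time, equality is the first
// case the check can hit, so == and >= behave the same; >= merely guards
// against an unexpected jump past the limit.
boolean hasRowLimit = maxRows > 0;
if (hasRowLimit && globalRowIndex + 1 >= maxRows) {
  return null;
}
```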
```java
private Schema hiveSchemaToArrowSchema(TTableSchema hiveSchema)
```
Are we not getting the arrowSchema for inline Arrow results?
I haven't looked into this in detail. Arrow parsing is ported as-is in this PR (no changes from existing code; the methods were moved from InlineChunkProvider, the class that previously handled this). The PR only changes the data download. Let me check this separately.
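For context, deriving an Arrow schema from the Thrift table schema typically looks like the condensed sketch below. The type mapping here is abbreviated and assumed; the ported method covers the full TTypeId range and its exact mapping may differ.

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.arrow.vector.types.FloatingPointPrecision;
import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.Schema;

// TTableSchema, TColumnDesc, and TTypeId come from the driver's Thrift bindings.
private Schema hiveSchemaToArrowSchema(TTableSchema hiveSchema) {
  List<Field> fields = new ArrayList<>();
  for (TColumnDesc col : hiveSchema.getColumns()) {
    TTypeId typeId =
        col.getTypeDesc().getTypes().get(0).getPrimitiveEntry().getType();
    fields.add(Field.nullable(col.getColumnName(), mapType(typeId)));
  }
  return new Schema(fields);
}

// Abbreviated type mapping for the sketch; the real code handles many more types.
private ArrowType mapType(TTypeId typeId) {
  switch (typeId) {
    case INT_TYPE:
      return new ArrowType.Int(32, true);
    case BIGINT_TYPE:
      return new ArrowType.Int(64, true);
    case DOUBLE_TYPE:
      return new ArrowType.FloatingPoint(FloatingPointPrecision.DOUBLE);
    case BOOLEAN_TYPE:
      return ArrowType.Bool.INSTANCE;
    default:
      return ArrowType.Utf8.INSTANCE; // fall back to string
  }
}
```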
```java
if (result == null) {
  return null;
}
ComplexDataTypeParser parser = new ComplexDataTypeParser();
```
Do we need to create this for every getObject call?
Actually, this is just existing code from the getObject(int columnIndex) method in the same class. I moved it into a separate method because getObject(int columnIndex) was unreadable.
To answer the question: no, we shouldn't create such objects in hot paths like getObject, but I don't want to change the scope of this PR. I will create a separate change.
This PR has been marked as Stale because it has been open for 30 days with no activity. If you would like the PR to remain open, please remove the stale label or comment on the PR.
nikhilsuri-db left a comment:
Do we have artefacts showing the improvement in memory usage with LazyInline Fetch?
The improvements are identical and follow the same patterns as the images in #966 (comment).
```java
// Check if we need to convert geospatial types to string when geospatial support is disabled
// This check must come before the general complex type check
if (!isGeoSpatialSupportEnabled && isGeospatialType(requiredType)) {
```
Sreekanth is making changes in similar code; make sure you don't override his changes for geospatial.