@mao-liu mao-liu commented Dec 10, 2025

Important Read

  • Please ensure the GitHub issue is mentioned at the beginning of the PR

Closes #755

What is the purpose of the pull request

Adds support for Paimon metadata stats

Brief change log

  • Adds support for Paimon file-level stats metadata
  • Adds unit tests for Paimon file stats

Verify this pull request

This change added tests and can be verified as follows:

  • Extends TestPaimonDataFileExtractor to verify the change.
  • Manually verified the change by running a job locally.

Dev Notes

  • build and test this out against a real table
  • open draft PR against upstream
  • complete test cases
  • confirm assumptions with Paimon user group
  • remove logging

@mao-liu changed the title from "Support column stats for paimon" to "#755 - Support column stats for paimon" on Dec 10, 2025
List<String> colNames = file.valueStatsCols();
// log.info("valueStatsCols: {}", colNames);
if (colNames == null || colNames.isEmpty()) {
// if column names are not present, we assume all columns in the schema are present in the same order as the schema - TODO: validate this assumption
Author

Q to Paimon experts: Is this assumption valid?

Contributor

@mao-liu are you in contact with anyone on the Paimon side to get these questions answered?

Contributor

@mikedias do you know the answers to any of these Paimon questions on this PR by any chance?

Contributor

sorry, I don't have these answers...

Author

Hey @the-other-tim-brown, apologies that I haven't been very active on this PR until this week.
I have just emailed the Paimon user group about these questions, and am hoping to hear back soon.

We have been busy test-driving this change, and are happy to report it's working well thus far!
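
For readers following this thread, the fallback in question can be sketched roughly as follows. This is a minimal illustration and not the PR's code: `resolveStatsColumns` and the plain `List<String>` schema are hypothetical stand-ins, and the sketch assumes (as the code comment above does) that a missing or empty `valueStatsCols` means the stats cover every schema column in schema order.

```java
import java.util.Arrays;
import java.util.List;

public class StatsColumnFallback {
  // Hypothetical helper: decides which column names the stats arrays line up with.
  // Assumption: null/empty valueStatsCols means "all schema columns, in schema order".
  static List<String> resolveStatsColumns(List<String> valueStatsCols, List<String> schemaColumns) {
    if (valueStatsCols == null || valueStatsCols.isEmpty()) {
      return schemaColumns;
    }
    return valueStatsCols;
  }

  public static void main(String[] args) {
    List<String> schema = Arrays.asList("id", "name", "level");
    // No explicit stats columns: fall back to the full schema.
    System.out.println(resolveStatsColumns(null, schema));
    // Explicit stats columns: use them as given.
    System.out.println(resolveStatsColumns(Arrays.asList("id"), schema));
  }
}
```

Whether the fallback is safe depends entirely on the ordering assumption, which is why it was raised with the Paimon community.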

@mao-liu force-pushed the feat/paimon-column-stats branch 3 times, most recently from c0cabf3 to 2757735, on December 10, 2025 12:38
@mao-liu force-pushed the feat/paimon-column-stats branch from 2757735 to 6e927e2 on December 10, 2025 13:03
// TODO: Implement logic to extract column stats from the file meta
// https://github.com/apache/incubator-xtable/issues/755
return Collections.emptyList();
private List<ColumnStat> toColumnStats(DataFileMeta file, InternalSchema internalSchema) {
Contributor

@the-other-tim-brown Dec 10, 2025

Can we move the column stats conversion to its own class? In the future if we add Paimon as a target then we will also need to convert to the Paimon representation and it would be nice to have all this stats logic in its own class.

Author

Will do, thanks for the early review @the-other-tim-brown!

I do wonder though, if it is even possible to have Paimon as a target... Paimon has a pretty unique file layout, and might not be as easily "tricked" as other formats.

Contributor

I don't know enough about Paimon to say. Hudi also has a unique native layout structure to allow for update-heavy workloads, though, and we were able to make this work.

Mainly we do this separation to keep the logic isolated, though. This is not necessarily relevant to Paimon, but if a table format changes how it represents stats in a new version, we can plug in the appropriate converter based on the version.
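
To illustrate the "plug in the appropriate converter based on the version" idea, here is a minimal sketch. Everything in it is hypothetical: `StatsConverter`, `StatsConverterRegistry`, and the integer format versions are illustrative names only and are not XTable or Paimon APIs.

```java
import java.util.Map;
import java.util.TreeMap;

public class StatsConverterRegistry {
  // Hypothetical converter interface; a real one would convert stats, not describe itself.
  interface StatsConverter {
    String describe();
  }

  // Converters keyed by the first format version they apply to.
  private final TreeMap<Integer, StatsConverter> byMinVersion = new TreeMap<>();

  void register(int minVersion, StatsConverter converter) {
    byMinVersion.put(minVersion, converter);
  }

  // Returns the converter registered for the highest minVersion <= formatVersion.
  StatsConverter forVersion(int formatVersion) {
    Map.Entry<Integer, StatsConverter> entry = byMinVersion.floorEntry(formatVersion);
    if (entry == null) {
      throw new IllegalArgumentException("No stats converter for version " + formatVersion);
    }
    return entry.getValue();
  }

  public static void main(String[] args) {
    StatsConverterRegistry registry = new StatsConverterRegistry();
    registry.register(1, () -> "v1 stats layout");
    registry.register(3, () -> "v3 stats layout");
    System.out.println(registry.forVersion(2).describe());
    System.out.println(registry.forVersion(3).describe());
  }
}
```

Keeping the conversion behind an interface like this is what makes it possible to isolate the stats logic in its own class, as requested in the review.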

@mao-liu changed the title from "#755 - Support column stats for paimon" to "[755] Support column stats for paimon" on Jan 6, 2026
@mao-liu marked this pull request as ready for review on January 8, 2026 09:43
Author

mao-liu commented Jan 9, 2026

Hey @the-other-tim-brown, we have validated the assumptions in this PR with responses from Paimon maintainers. No more TODOs from me, ready for your review now :)

Comment on lines +112 to +113
Object min = getValue(minValues, i, type, field.getSchema());
Object max = getValue(maxValues, i, type, field.getSchema());
Contributor

If the getValue returns null, should we set the range value to null as well?

Author

good call, added this
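
One possible reading of this suggestion is sketched below. The PR's actual change is not shown in this thread, so this is only an assumption about its shape: `MinMax` and `toRange` are hypothetical names and not XTable's real range type.

```java
public class NullSafeRange {
  // Hypothetical stand-in for a min/max pair; not XTable's actual Range class.
  static final class MinMax {
    final Object min;
    final Object max;

    MinMax(Object min, Object max) {
      this.min = min;
      this.max = max;
    }
  }

  // Returns null when either bound is missing, so callers never see a
  // half-populated range (assumed interpretation of the review comment).
  static MinMax toRange(Object min, Object max) {
    if (min == null || max == null) {
      return null;
    }
    return new MinMax(min, max);
  }

  public static void main(String[] args) {
    System.out.println(toRange(null, 5) == null);
    System.out.println(toRange(1, 5) != null);
  }
}
```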

InternalType type = field.getSchema().getDataType();
Object min = getValue(minValues, i, type, field.getSchema());
Object max = getValue(maxValues, i, type, field.getSchema());
Long nullCount = (nullCounts != null && i < nullCounts.size()) ? nullCounts.getLong(i) : 0L;
Contributor

nitpick: make this a primitive long

Author

done!

TestPaimonTable.createTable("test_table", "level", tempDir, new Configuration(), false);
paimonTable = testTable.getPaimonTable();

// just the partition field matters for this test
Contributor

Just curious why the test setup is changing as part of this PR

Author

@mao-liu Jan 13, 2026

The previous tests did not accurately represent the schema of the test tables - the tables had certain columns, but the "testSchema" objects were empty.

The stats extractor relies on the table schema to parse binary stats data from Paimon metadata, hence the change to extract the schema properly.

// compaction create commits that are DELETE and ADD on the same file
// with `manifest.delete-file-drop-stats` enabled, this means stats are empty after compaction
// this is a smoke test to ensure exceptions aren't raised for this scenario
// TODO: Question for Paimon experts - is this the expected behaviour?
Contributor

has this question been answered?

Author

This hasn't been validated with the Paimon community, though empirically this is the behaviour of the manifest.delete-file-drop-stats configuration today.

I have created a bug report at apache/paimon#7026, and replaced the TODO with a link to the GH issue.

import org.apache.xtable.model.storage.InternalDataFile;

@Log4j2
public class TestPaimonStatsExtractor {
Contributor

Is it possible to add a test case where the schema has nested fields?

If it will cause a lot of changes, feel free to suggest we handle this in a follow up pull request.

Author

Unfortunately, I don't believe Paimon collects stats on nested or complex fields.

I have added a test case showing this is the case.

// Insert some data to create files
testTable.insertRows(5);

List<InternalDataFile> result =
Contributor

Can we add some basic sanity checks that the stats are non-null/empty for the result?

Author

good call! This has been added

@mao-liu force-pushed the feat/paimon-column-stats branch from 98012d5 to ad5b35a on January 13, 2026 09:57
@mao-liu force-pushed the feat/paimon-column-stats branch from ad5b35a to 4867167 on January 13, 2026 09:58
Author

mao-liu commented Jan 13, 2026

@the-other-tim-brown thanks for your review!

I have replied to your comments and made the suggested changes, thanks again!

@the-other-tim-brown
Contributor

@mao-liu things look good overall but it looks like the tests are not passing. Can you take a look?

Author

mao-liu commented Jan 14, 2026

@the-other-tim-brown Sorry fixed a test! Missed it somehow on my local. Shall we try the build again?

<version>${spark.version}</version>
<scope>provided</scope>
</dependency>
<dependency>
Contributor

@mao-liu the Paimon dependency brings in a transitive dependency on spark-catalyst. I had to pin the version to keep it consistent, which resolved the test failure.

Author

Thanks heaps @the-other-tim-brown

The latest CI run also had some more test failures in xtable-utilities, I've fixed them up now

@mao-liu force-pushed the feat/paimon-column-stats branch from b10488e to 1183121 on January 14, 2026 21:29
@the-other-tim-brown
Contributor

I'm tracking the CI issues in this ticket #787

@the-other-tim-brown
Contributor

> I'm tracking the CI issues in this ticket #787

The CI Issue is now resolved, please rebase your PR when you have a chance.

Author

mao-liu commented Jan 16, 2026

> I'm tracking the CI issues in this ticket #787
>
> The CI Issue is now resolved, please rebase your PR when you have a chance.

Thanks @the-other-tim-brown, done!

Development

Successfully merging this pull request may close these issues.

[Apache Paimon] Support for extracting column-level statistics on the source

3 participants