@Giulio2002 Giulio2002 commented Jan 7, 2026

Problem

When distributing pre-built xor/binary fuse filters to users (e.g., as part of a database, blocklist, or other data file), the serialized filter data may be created on a machine whose endianness differs from that of the target machine. Since the library stores multi-byte fingerprints (uint16, uint32) in native byte order, a filter built on a little-endian x86 machine and written out byte-for-byte will produce incorrect results when loaded on a big-endian PowerPC or SPARC machine.

This is problematic for use cases like:

  • Shipping pre-computed filters as part of an application or data package (my use-case)
  • Storing filters in a database that may be accessed from different architectures
  • Distributing filters over a network to heterogeneous clients
  • Embedding filters in cross-platform file formats

The rationale for defaulting to little-endian (LE) is that LE CPUs are far more common than big-endian (BE) ones.

Solution

Add a Portable bool field to BinaryFuse[T] and Xor8 structs. When Portable=true:

  1. On filter creation: Fingerprints are converted to little-endian byte order (the most common format on modern systems)
  2. On lookup: Fingerprints are converted back from little-endian to native byte order before comparison

This ensures filters created with Portable=true can be serialized once and used correctly on any platform.
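
The two conversion steps above can be sketched as follows. This is only an illustration of the idea, not the code in this PR: the helper names toLittleEndian16, fromLittleEndian16, storedBytes and containsFingerprint are hypothetical, and the sketch assumes Go 1.21 or later for binary.NativeEndian.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// toLittleEndian16 (hypothetical helper) returns a value whose in-memory
// bytes on this host are the little-endian encoding of v. It is the
// identity on little-endian hosts and a byte swap on big-endian hosts.
func toLittleEndian16(v uint16) uint16 {
	var b [2]byte
	binary.LittleEndian.PutUint16(b[:], v)
	return binary.NativeEndian.Uint16(b[:])
}

// fromLittleEndian16 (hypothetical helper) is the inverse: it interprets
// the in-memory bytes of v as little-endian and returns the native value.
func fromLittleEndian16(v uint16) uint16 {
	var b [2]byte
	binary.NativeEndian.PutUint16(b[:], v)
	return binary.LittleEndian.Uint16(b[:])
}

// storedBytes shows the memory layout of a converted fingerprint.
func storedBytes(v uint16) [2]byte {
	var b [2]byte
	binary.NativeEndian.PutUint16(b[:], toLittleEndian16(v))
	return b
}

// containsFingerprint (hypothetical) mimics the lookup step: convert the
// stored fingerprint back to native order before comparing.
func containsFingerprint(stored []uint16, idx int, want uint16) bool {
	return fromLittleEndian16(stored[idx]) == want
}

func main() {
	// Creation step: store the fingerprints in little-endian layout.
	stored := []uint16{toLittleEndian16(0x1234), toLittleEndian16(0xabcd)}

	b := storedBytes(0x1234)
	fmt.Printf("% x\n", b[:]) // 34 12 on any host

	// Lookup step: compare in native order.
	fmt.Println(containsFingerprint(stored, 1, 0xabcd)) // true
}
```

On a little-endian host both helpers are no-ops, so the extra cost of a portable filter would only be paid on big-endian hardware.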

Usage

// Create a portable filter that can be safely serialized
filter, err := xorfilter.NewBinaryFusePortable[uint16](keys)

// Serialize filter.Fingerprints, filter.Seed, etc. to a file
// The file can now be distributed to any platform

// On the target machine (any endianness), deserialize and use:
filter.Contains(key) // Works correctly

Backward Compatibility

This change is fully backward compatible:

  • Existing code continues to work unchanged
  • The Portable field defaults to false, preserving existing behavior
  • Filters created without the Portable flag work exactly as before

@lemire
Member

lemire commented Jan 7, 2026

Thanks.

Currently, we do not have any serialization or deserialization function, thus the problem that you describe is not present in the library.

Thus far, the library has not told you how to share the data. The fingerprints are simple arrays. Users who want to support big-endian platforms (which are vanishingly rare these days) should consider the issue when designing their data interchange.

If you use protobuf (for example), it will handle endianness automatically.

We are going to add helper functions that can be used to serialize/deserialize the data for users that do not want to roll their own, or for users that do not want to use protobuf or some existing data interchange format.

#51

The library itself does not care about endianness.
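
As a rough sketch of what an endianness-independent helper of the kind mentioned above might look like when written by hand with the standard encoding/binary package: the function names (writeFingerprints, readFingerprints, encode, decode), the length-prefixed layout, and the maxFingerprints limit are all assumptions made for illustration, not the library's API or the format used in #51.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"errors"
	"fmt"
	"io"
)

// maxFingerprints is an arbitrary illustrative cap so that a corrupt
// length prefix cannot trigger a huge allocation.
const maxFingerprints = 1 << 28

// writeFingerprints (hypothetical) writes a uint64 length followed by the
// fingerprints, all in little-endian order regardless of the host CPU.
func writeFingerprints(w io.Writer, fp []uint16) error {
	if err := binary.Write(w, binary.LittleEndian, uint64(len(fp))); err != nil {
		return err
	}
	return binary.Write(w, binary.LittleEndian, fp)
}

// readFingerprints (hypothetical) reads back what writeFingerprints wrote.
func readFingerprints(r io.Reader) ([]uint16, error) {
	var n uint64
	if err := binary.Read(r, binary.LittleEndian, &n); err != nil {
		return nil, err
	}
	if n > maxFingerprints {
		return nil, errors.New("fingerprint count too large")
	}
	fp := make([]uint16, n)
	if err := binary.Read(r, binary.LittleEndian, fp); err != nil {
		return nil, err
	}
	return fp, nil
}

// encode and decode are small in-memory wrappers for the example.
func encode(fp []uint16) []byte {
	var buf bytes.Buffer
	if err := writeFingerprints(&buf, fp); err != nil {
		panic(err) // writing to a bytes.Buffer is not expected to fail
	}
	return buf.Bytes()
}

func decode(b []byte) ([]uint16, error) {
	return readFingerprints(bytes.NewReader(b))
}

func main() {
	b := encode([]uint16{0x1234, 0xabcd})
	fmt.Printf("% x\n", b)

	fp, err := decode(b)
	fmt.Println(fp, err)
}
```

A real format would also need to carry the seed and the other filter parameters; they are left out here to keep the sketch short.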

@lemire lemire closed this Jan 7, 2026
@Giulio2002
Author

Giulio2002 commented Jan 8, 2026

We were doing it by writing the whole thing to a file. Every in-memory representation is already serialized, and yes, you do care about endianness: when you do some operations, the underlying byte memory is ordered with respect to the CPU endianness. In any case, your PR is also fine and probably has less moat, so thanks.
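
To make the point about memory layout concrete, here is a small sketch comparing a verbatim dump of a []uint16 with an explicit little-endian encoding. The helper names (hostIsLittleEndian, rawBytes, portableBytes) are hypothetical, the use of unsafe stands in for "writing the whole thing to a file", and Go 1.21 or later is assumed for binary.NativeEndian.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"unsafe"
)

// hostIsLittleEndian reports the byte order of the machine running the code.
func hostIsLittleEndian() bool {
	return binary.NativeEndian.Uint16([]byte{0x01, 0x00}) == 1
}

// rawBytes copies the in-memory bytes of fp exactly as the CPU laid them
// out, which is what a verbatim memory dump would write.
func rawBytes(fp []uint16) []byte {
	if len(fp) == 0 {
		return nil
	}
	view := unsafe.Slice((*byte)(unsafe.Pointer(&fp[0])), len(fp)*2)
	return append([]byte(nil), view...)
}

// portableBytes encodes fp in little-endian order on every host.
func portableBytes(fp []uint16) []byte {
	out := make([]byte, 0, len(fp)*2)
	for _, v := range fp {
		out = binary.LittleEndian.AppendUint16(out, v)
	}
	return out
}

func main() {
	fp := []uint16{0x1234}
	fmt.Println("little-endian host:", hostIsLittleEndian())
	fmt.Printf("raw:      % x\n", rawBytes(fp))
	fmt.Printf("portable: % x\n", portableBytes(fp)) // 34 12 on any host
}
```

The two encodings coincide only on little-endian hosts, so a raw dump is portable only among machines that share a byte order.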

@lemire
Member

lemire commented Jan 8, 2026

@Giulio2002 Can you elaborate on your business application and how you are encountering big-endian hardware? Are you working with IBM mainframes?


2 participants