Skip to content

Conversation

@abhijat
Copy link
Contributor

@abhijat abhijat commented Jan 11, 2026

WIP

Copilot AI review requested due to automatic review settings January 11, 2026 10:31
@abhijat abhijat force-pushed the abhijat/spike/interned-strings branch from 32413a1 to ac82dfa Compare January 11, 2026 10:39
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces a reference-counted string interning system for JSON object keys to reduce memory usage when the same keys appear multiple times. The implementation adds InternedBlob (raw string storage with refcounting) and InternedString (string-like wrapper) classes, and integrates them into jsoncons via a custom InternedStringPolicy.

Changes:

  • Added InternedBlob class for memory-efficient string storage with reference counting
  • Added InternedString class providing a string-like API over interned blobs
  • Modified JsonType to use InternedStringPolicy for object keys instead of sorted_policy

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 16 comments.

Show a summary per file
File Description
src/core/json/json_object.h Integrates InternedString into JsonType via custom InternedStringPolicy
src/core/json/detail/interned_string.h Defines InternedString class with string-like interface and template constructors
src/core/json/detail/interned_string.cc Implements InternedString with refcount management and thread_local pool
src/core/json/detail/interned_blob.h Defines InternedBlob for raw string storage with refcounting and hash/equality support
src/core/json/detail/interned_blob.cc Implements InternedBlob memory layout and refcount operations
src/core/json/CMakeLists.txt Adds new source files to jsonpath library

Comment on lines 1 to 87
// Copyright 2025, DragonflyDB authors. All rights reserved.
// See LICENSE for licensing terms.

#pragma once

#include "core/detail/stateless_allocator.h"
#include "core/json/detail/interned_blob.h"

namespace dfly::detail {

// The interned string has access to a thread local pool of InternedBlobs. It only holds a single
// pointer into that pool as state. The pool is a node_hash_set so the pointer is stable.
// InternedString handles incrementing and decrementing reference counts of the blobs tied to its
// own lifecycle. It deletes the blob from the pool when refcount is 0. The pool is per shard.
// TODO examine cross shard json object interactions. Can a pool end up access from another shard?
class InternedString {
public:
using allocator_type = StatelessAllocator<char>;

InternedString();

// The following constructors and members are added because they are required by jsoncons for
// keys. Each of these is added in response to compiler errors and should not be removed, even if
// they are seemingly a no-op or duplicated.
template <typename Alloc> explicit InternedString(Alloc /*unused*/) {
}

explicit InternedString(std::string_view);

template <typename Alloc> InternedString(const char* data, size_t size, Alloc alloc);

template <typename It> InternedString(It begin, It end);

InternedString(const InternedString& other);
InternedString(InternedString&& other) noexcept;
InternedString& operator=(const InternedString& other);
InternedString& operator=(InternedString&& other) noexcept;

~InternedString();

operator std::string_view() const;

const char* data() const;

const char* c_str() const;

void swap(InternedString&) noexcept;

size_t length() const;

size_t size() const;

int compare(const InternedString& other) const;
int compare(std::string_view) const;

bool operator==(const InternedString& other) const;
bool operator!=(const InternedString& other) const;
bool operator<(const InternedString& other) const;

void shrink_to_fit() { // NOLINT
}

private:
// If a string exists in pool, increments its refcount and returns a pointer to it. If not, adds
// the string to the pool.
static const InternedBlob* Intern(std::string_view sv);

// Increments the refcount if the entry is not null
void Acquire();

// Decrements the refcount, removes entry from pool if necessary, destroying the interned blob
void Release();

static InternedBlobPool& GetPoolRef();

static InternedBlobPool pool_;

const InternedBlob* entry_ = nullptr;
};

template <typename Alloc>
InternedString::InternedString(const char* data, size_t size, Alloc /*unused*/)
: InternedString(std::string_view{data, size}) {
}
Copy link

Copilot AI Jan 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new InternedString and InternedBlob classes lack dedicated test coverage. Given the complexity of reference counting, memory management, and the integration with jsoncons, comprehensive unit tests should be added to verify: empty string handling, refcount correctness, pool insertion/removal, copy/move semantics, self-assignment, iterator constructor edge cases, and memory leak prevention. The existing json_test.cc file shows the project has test infrastructure for JSON components.

Copilot uses AI. Check for mistakes.
Comment on lines 37 to 57
uint32_t InternedBlob::Size() const {
// TODO - should this check be removed to uncover bugs? or an assert here?
if (!blob_)
return 0;

uint32_t size;
std::memcpy(&size, blob_, sizeof(size));
return size;
}
Copy link

Copilot AI Jan 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The TODO questions whether null checks should be removed or converted to assertions. For production code dealing with memory management and reference counting, these null checks are important for defensive programming. However, if the invariant is that blob_ should never be null during these operations, assertions would be more appropriate to catch bugs in debug builds while avoiding the overhead in release builds.

Copilot uses AI. Check for mistakes.
Copilot AI review requested due to automatic review settings January 11, 2026 10:47
@abhijat abhijat force-pushed the abhijat/spike/interned-strings branch from 1ad1102 to 6734438 Compare January 11, 2026 10:47
@abhijat abhijat force-pushed the abhijat/spike/interned-strings branch from 6734438 to ceb8568 Compare January 11, 2026 10:51
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 10 comments.

Comment on lines 14 to 15
// own lifecycle. It deletes the blob from the pool when refcount is 0. The pool is per shard.
// TODO examine cross shard json object interactions. Can a pool end up access from another shard?
Copy link

Copilot AI Jan 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a TODO about examining cross-shard JSON object interactions. Since the pool is thread_local, if a JSON object with InternedStrings is passed to another shard/thread, those strings would reference a pool on a different thread, potentially causing crashes or undefined behavior. This needs to be resolved before merging, or at minimum, documented as a known limitation with guards to prevent cross-thread usage.

Suggested change
// own lifecycle. It deletes the blob from the pool when refcount is 0. The pool is per shard.
// TODO examine cross shard json object interactions. Can a pool end up access from another shard?
// own lifecycle. It deletes the blob from the pool when refcount is 0. The pool is per shard
// and per thread.
//
// IMPORTANT: InternedString instances (and JSON objects containing them) are local to the shard/
// thread whose pool they were created in. They must not be passed to or accessed from another
// shard/thread. Doing so would cause entry_ to refer to an InternedBlob allocated from a
// different thread-local pool, which can lead to crashes or undefined behavior when reference
// counts are updated or entries are removed from the pool. Callers are responsible for enforcing
// this shard/thread affinity and must avoid cross-shard/thread usage.

Copilot uses AI. Check for mistakes.
Comment on lines +24 to +32
if (this != &other) {
Release();
entry_ = other.entry_;
Acquire();
}
return *this;
}
Copy link

Copilot AI Jan 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comparison in the copy assignment operator using operator!= may cause issues with self-assignment. Using string content comparison instead of identity check means self-assignment of an InternedString to itself would still trigger Release() and Acquire(), unnecessarily modifying the reference count. Change the condition to 'if (this != &other)' for proper self-assignment detection.

Copilot uses AI. Check for mistakes.
Comment on lines 66 to 74
void InternedBlob::IncrRefCount() const {
const uint32_t updated_count = RefCount() + 1;
std::memcpy(blob_ + sizeof(updated_count), &updated_count, sizeof(updated_count));
}

void InternedBlob::DecrRefCount() const {
// Caller must ensure refcount does not go below 0
const uint32_t updated_count = RefCount() - 1;
std::memcpy(blob_ + sizeof(updated_count), &updated_count, sizeof(updated_count));
Copy link

Copilot AI Jan 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reference counting operations IncrRefCount() and DecrRefCount() are not atomic. In a multi-threaded environment where the same InternedString might be accessed from multiple threads, this could lead to race conditions and incorrect reference counts. Consider using std::atomic<uint32_t> for the reference count field or documenting that InternedStrings must not be shared across threads.

Copilot uses AI. Check for mistakes.
@abhijat abhijat force-pushed the abhijat/spike/interned-strings branch from ceb8568 to bcd7ffe Compare January 11, 2026 10:51
Copilot AI review requested due to automatic review settings January 11, 2026 10:53
@abhijat abhijat force-pushed the abhijat/spike/interned-strings branch from bcd7ffe to a1f55e6 Compare January 11, 2026 10:53
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 12 comments.

Comment on lines +84 to +96
template <typename It> InternedString::InternedString(It begin, It end) {
if (begin == end)
return;

const auto size = std::distance(begin, end);
const auto data_ptr = &*begin;
entry_ = Intern(std::string_view(data_ptr, size));
}
Copy link

Copilot AI Jan 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The iterator constructor template does not validate that the iterator points to contiguous memory before taking the address with &*begin. If the iterator type is not a contiguous iterator (e.g., a list iterator), this will create undefined behavior. Consider adding a static_assert to enforce that It is a contiguous iterator, or copying the data into a temporary buffer.

Copilot uses AI. Check for mistakes.
Comment on lines 27 to 33
// These setters are const as a hack. They are allowed to be const because they mutate the
// refcount via pointer. If they were non-const InternedString would not be able to mutate pool
// entries to incr/decr refcounts. Since we do not use the refcount or size in hashing, this is
// fine.
void IncrRefCount() const;

void DecrRefCount() const;
Copy link

Copilot AI Jan 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The const_cast hack mentioned in the comment is problematic. Making IncrRefCount and DecrRefCount const methods that modify mutable state breaks const-correctness and makes the API misleading. This pattern can hide bugs where const methods unexpectedly modify state. A better design would be to make the InternedBlob* in the hash set mutable, or use a different pool structure that doesn't require const methods to mutate state.

Suggested change
// These setters are const as a hack. They are allowed to be const because they mutate the
// refcount via pointer. If they were non-const InternedString would not be able to mutate pool
// entries to incr/decr refcounts. Since we do not use the refcount or size in hashing, this is
// fine.
void IncrRefCount() const;
void DecrRefCount() const;
// These methods mutate the internal refcount of the blob.
void IncrRefCount();
void DecrRefCount();

Copilot uses AI. Check for mistakes.
Comment on lines 95 to 102
auto it = pool_ref.find(sv);
if (it != pool_ref.end()) {
it->IncrRefCount();
return &*it;
}

auto [new_elem, _] = pool_ref.emplace(sv);
return &*new_elem;
Copy link

Copilot AI Jan 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inefficient double-lookup in Intern method. The code first calls find() at line 95, then if not found, calls emplace() at line 101 which will perform another lookup internally. This results in two hash computations and lookups for new strings. Consider using try_emplace() which performs a single lookup and only constructs the object if insertion occurs.

Suggested change
auto it = pool_ref.find(sv);
if (it != pool_ref.end()) {
it->IncrRefCount();
return &*it;
}
auto [new_elem, _] = pool_ref.emplace(sv);
return &*new_elem;
auto [it, inserted] = pool_ref.emplace(sv);
if (!inserted) {
it->IncrRefCount();
}
return &*it;

Copilot uses AI. Check for mistakes.
Comment on lines 18 to 31
blob_ = alloc.allocate(kHeaderSize + str_len + 1);

std::memcpy(blob_, &str_len, sizeof(uint32_t));
std::memcpy(blob_ + sizeof(uint32_t), &ref_count, sizeof(uint32_t));

std::memcpy(blob_ + kHeaderSize, sv.data(), str_len);

// null terminate so jsoncons can directly access the char* as string
blob_[kHeaderSize + str_len] = '\0';
Copy link

Copilot AI Jan 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potential alignment issues with the blob memory layout. The code assumes uint32_t can be safely memcpy'd to/from char* at arbitrary offsets. While this works in practice on most platforms, it's technically undefined behavior. Consider using std::aligned_storage or properly aligned allocation to ensure the header is correctly aligned for uint32_t access.

Copilot uses AI. Check for mistakes.

const char* c_str() const;

void swap(InternedString&) noexcept;
Copy link

Copilot AI Jan 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing const qualifier on the swap method parameter. The parameter should be a non-const reference to allow swapping, but the method signature shows it correctly. However, there's an inconsistency with standard library conventions - typically swap is a free function, not a member. Consider providing both a member swap and a non-member swap in the namespace for ADL (Argument-Dependent Lookup) compatibility.

Copilot uses AI. Check for mistakes.
Comment on lines 1 to 98
// Copyright 2025, DragonflyDB authors. All rights reserved.
// See LICENSE for licensing terms.

#pragma once

#include "core/detail/stateless_allocator.h"
#include "core/json/detail/interned_blob.h"

namespace dfly::detail {

// The interned string has access to a thread local pool of InternedBlobs. It only holds a single
// pointer into that pool as state. The pool is a node_hash_set so the pointer is stable.
// InternedString handles incrementing and decrementing reference counts of the blobs tied to its
// own lifecycle. It deletes the blob from the pool when refcount is 0. The pool is per shard.
// TODO examine cross shard json object interactions. Can a pool end up access from another shard?
class InternedString {
public:
using allocator_type = StatelessAllocator<char>;

InternedString();

// The following constructors and members are added because they are required by jsoncons for
// keys. Each of these is added in response to compiler errors and should not be removed, even if
// they are seemingly a no-op or duplicated.
template <typename Alloc> explicit InternedString(Alloc /*unused*/) {
}

explicit InternedString(std::string_view);

template <typename Alloc> InternedString(const char* data, size_t size, Alloc alloc);

template <typename It> InternedString(It begin, It end);

InternedString(const InternedString& other);
InternedString(InternedString&& other) noexcept;
InternedString& operator=(const InternedString& other);
InternedString& operator=(InternedString&& other) noexcept;

~InternedString();

operator std::string_view() const;

const char* data() const;

const char* c_str() const;

void swap(InternedString&) noexcept;

size_t length() const;

size_t size() const;

int compare(const InternedString& other) const;
int compare(std::string_view) const;

bool operator==(const InternedString& other) const;
bool operator!=(const InternedString& other) const;
bool operator<(const InternedString& other) const;

void shrink_to_fit() { // NOLINT
}

private:
// If a string exists in pool, increments its refcount and returns a pointer to it. If not, adds
// the string to the pool.
static const InternedBlob* Intern(std::string_view sv);

// Increments the refcount if the entry is not null
void Acquire();

// Decrements the refcount, removes entry from pool if necessary, destroying the interned blob
void Release();

static InternedBlobPool& GetPoolRef();

const InternedBlob* entry_ = nullptr;
};

template <typename Alloc>
InternedString::InternedString(const char* data, size_t size, Alloc /*unused*/)
: InternedString(std::string_view{data, size}) {
}

template <typename It> InternedString::InternedString(It begin, It end) {
if (begin == end)
return;

const auto size = std::distance(begin, end);
const auto data_ptr = &*begin;
entry_ = Intern(std::string_view(data_ptr, size));
}

} // namespace dfly::detail
Copy link

Copilot AI Jan 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing test coverage for the new InternedString and InternedBlob functionality. The repository has comprehensive test coverage for JSON functionality (json_test.cc), but no tests are added for the new interned string implementation. Critical scenarios that should be tested include: reference counting correctness, pool management, string interning deduplication, edge cases like empty strings, move/copy semantics, and memory leak verification. Without tests, bugs in this critical memory management code may not be caught.

Copilot uses AI. Check for mistakes.
Comment on lines +77 to +88
bool InternedString::operator==(const InternedString& other) const {
// Compare pointers since we store them in the same pool
return entry_ == other.entry_;
}

bool InternedString::operator!=(const InternedString& other) const {
return entry_ != other.entry_;
}

bool InternedString::operator<(const InternedString& other) const {
Copy link

Copilot AI Jan 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing noexcept specification on comparison operators. The operator< is implemented in terms of compare(), which calls string_view::compare(). Since string_view::compare() is noexcept, this operator should also be marked noexcept for better performance and exception safety guarantees. The same applies to operator== and operator!= which should be noexcept as they only compare pointers.

Suggested change
bool InternedString::operator==(const InternedString& other) const {
// Compare pointers since we store them in the same pool
return entry_ == other.entry_;
}
bool InternedString::operator!=(const InternedString& other) const {
return entry_ != other.entry_;
}
bool InternedString::operator<(const InternedString& other) const {
bool InternedString::operator==(const InternedString& other) const noexcept {
// Compare pointers since we store them in the same pool
return entry_ == other.entry_;
}
bool InternedString::operator!=(const InternedString& other) const noexcept {
return entry_ != other.entry_;
}
bool InternedString::operator<(const InternedString& other) const noexcept {

Copilot uses AI. Check for mistakes.
Comment on lines 37 to 50
uint32_t InternedBlob::Size() const {
// TODO - should this check be removed to uncover bugs? or an assert here?
if (!blob_)
return 0;

uint32_t size;
std::memcpy(&size, blob_, sizeof(size));
return size;
}

uint32_t InternedBlob::RefCount() const {
// TODO - should this check be removed to uncover bugs? or an assert here?
if (!blob_)
return 0;
Copy link

Copilot AI Jan 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The TODO comments asking whether null checks should be removed or replaced with asserts indicate incomplete design decisions. These methods are part of the public API (accessible via InternedString), so proper error handling is important. The decision should be made: either document that callers must ensure blob_ is valid (and use asserts in debug builds), or handle null gracefully. The current inconsistency should be resolved before merging.

Copilot uses AI. Check for mistakes.
Comment on lines 66 to 68
void InternedBlob::IncrRefCount() const {
const uint32_t updated_count = RefCount() + 1;
std::memcpy(blob_ + sizeof(uint32_t), &updated_count, sizeof(updated_count));
Copy link

Copilot AI Jan 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potential integer overflow in refcount increment. If RefCount returns UINT32_MAX, adding 1 will cause overflow and wrap to 0, which would then trigger incorrect deletion of the blob. Consider adding overflow detection or using a saturating increment.

Copilot uses AI. Check for mistakes.
Comment on lines 37 to 74
uint32_t InternedBlob::Size() const {
// TODO - should this check be removed to uncover bugs? or an assert here?
if (!blob_)
return 0;

uint32_t size;
std::memcpy(&size, blob_, sizeof(size));
return size;
}

uint32_t InternedBlob::RefCount() const {
// TODO - should this check be removed to uncover bugs? or an assert here?
if (!blob_)
return 0;

uint32_t ref_count;
// Assumes size and refcount are both 4 bytes
std::memcpy(&ref_count, blob_ + sizeof(uint32_t), sizeof(uint32_t));
return ref_count;
}

std::string_view InternedBlob::View() const {
return {blob_ + kHeaderSize, Size()};
}

const char* InternedBlob::Data() const {
return blob_ ? blob_ + kHeaderSize : nullptr;
}
Copy link

Copilot AI Jan 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inconsistent null checking pattern. The Size() and RefCount() methods check for null blob_ and return 0, but View() and Data() handle null differently. View() would create a string_view from an invalid pointer when blob_ is null (though Size() returns 0, which makes it safe), while Data() explicitly checks. This inconsistency could lead to maintenance issues. Consider standardizing the null-checking approach across all methods.

Copilot uses AI. Check for mistakes.
@abhijat abhijat force-pushed the abhijat/spike/interned-strings branch from ef79c65 to a9b2552 Compare January 12, 2026 07:27
Copilot AI review requested due to automatic review settings January 12, 2026 07:27
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 7 comments.

Comment on lines 10 to 19
namespace dfly::detail {
class InternedBlob {
static constexpr auto kHeaderSize = sizeof(uint32_t) * 2;

public:
// TODO - does this need to accept an allocator??
explicit InternedBlob(std::string_view sv);
Copy link

Copilot AI Jan 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The TODO questions whether InternedBlob needs to accept an allocator. Given that this code uses StatelessAllocator which is designed for the shard-specific mimalloc heap, and the InternedBlob is used in a per-shard pool, it should accept and use the allocator for consistency with the rest of the codebase. This ensures proper memory accounting and allocation on the correct heap.

Update the constructor to accept an allocator parameter and use it instead of creating a new StatelessAllocator instance.

Suggested change
namespace dfly::detail {
class InternedBlob {
static constexpr auto kHeaderSize = sizeof(uint32_t) * 2;
public:
// TODO - does this need to accept an allocator??
explicit InternedBlob(std::string_view sv);
namespace dfly {
template <typename T> class StatelessAllocator;
namespace detail {
class InternedBlob {
static constexpr auto kHeaderSize = sizeof(uint32_t) * 2;
public:
// TODO - does this need to accept an allocator??
explicit InternedBlob(std::string_view sv, const dfly::StatelessAllocator<char>& allocator);

Copilot uses AI. Check for mistakes.
Comment on lines 1 to 98
// Copyright 2025, DragonflyDB authors. All rights reserved.
// See LICENSE for licensing terms.

#pragma once

#include "core/detail/stateless_allocator.h"
#include "core/json/detail/interned_blob.h"

namespace dfly::detail {

// The interned string has access to a thread local pool of InternedBlobs. It only holds a single
// pointer into that pool as state. The pool is a node_hash_set so the pointer is stable.
// InternedString handles incrementing and decrementing reference counts of the blobs tied to its
// own lifecycle. It deletes the blob from the pool when refcount is 0. The pool is per shard.
// TODO examine cross shard json object interactions. Can a pool end up access from another shard?
class InternedString {
public:
using allocator_type = StatelessAllocator<char>;

InternedString();

// The following constructors and members are added because they are required by jsoncons for
// keys. Each of these is added in response to compiler errors and should not be removed, even if
// they are seemingly a no-op or duplicated.
template <typename Alloc> explicit InternedString(Alloc /*unused*/) {
}

explicit InternedString(std::string_view);

template <typename Alloc> InternedString(const char* data, size_t size, Alloc alloc);

template <typename It> InternedString(It begin, It end);

InternedString(const InternedString& other);
InternedString(InternedString&& other) noexcept;
InternedString& operator=(const InternedString& other);
InternedString& operator=(InternedString&& other) noexcept;

~InternedString();

operator std::string_view() const;

const char* data() const;

const char* c_str() const;

void swap(InternedString&) noexcept;

size_t length() const;

size_t size() const;

int compare(const InternedString& other) const;
int compare(std::string_view) const;

bool operator==(const InternedString& other) const;
bool operator!=(const InternedString& other) const;
bool operator<(const InternedString& other) const;

void shrink_to_fit();

// For tests
static void ResetPool();

private:
// If a string exists in pool, increments its refcount and returns a pointer to it. If not, adds
// the string to the pool.
static const InternedBlob* Intern(std::string_view sv);

// Increments the refcount if the entry is not null
void Acquire();

// Decrements the refcount, removes entry from pool if necessary, destroying the interned blob
void Release();

static InternedBlobPool& GetPoolRef();

const InternedBlob* entry_ = nullptr;
};

template <typename Alloc>
InternedString::InternedString(const char* data, size_t size, Alloc /*unused*/)
: InternedString(std::string_view{data, size}) {
}

template <typename It> InternedString::InternedString(It begin, It end) {
if (begin == end)
return;

const auto size = std::distance(begin, end);
const auto data_ptr = &*begin;
entry_ = Intern(std::string_view(data_ptr, size));
}

} // namespace dfly::detail
Copy link

Copilot AI Jan 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new InternedString and InternedBlob classes lack dedicated unit tests. While json_family_memory_test.cc has been updated to account for the new behavior, there are no direct tests for:

  • Reference counting correctness (increment/decrement/cleanup)
  • Pool management (insertion, lookup, deletion)
  • Edge cases (empty strings, large strings, special characters)
  • Memory lifecycle (construction, copy, move, destruction)
  • The ResetPool() functionality

Add a dedicated test file (e.g., interned_string_test.cc) with comprehensive unit tests for these core components before merging.

Copilot uses AI. Check for mistakes.
Comment on lines 58 to 59
// pool. Note that the pool itself does not use stateless allocator or mimalloc heap, it probably
// could.
Copy link

Copilot AI Jan 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment states that the pool "probably could" use StatelessAllocator or mimalloc heap, but doesn't explain why it currently doesn't. Given that this code is part of a system designed around per-shard heaps and stateless allocators (as evidenced by the usage in InternedBlob), the pool should use the same allocation strategy for consistency and proper memory accounting.

Either update the pool to use StatelessAllocator, or document why using the default allocator is the correct choice here.

Suggested change
// pool. Note that the pool itself does not use stateless allocator or mimalloc heap, it probably
// could.
// pool. The pool intentionally uses the default allocator rather than a stateless allocator or a
// dedicated mimalloc heap: the dominant memory usage comes from the InternedBlob payloads
// themselves (managed by InternedBlob), while the node_hash_set only stores lightweight
// bookkeeping nodes. Using the default allocator here keeps the container simple and avoids
// coupling it to shard-local heap configuration, without materially affecting memory accounting.

Copilot uses AI. Check for mistakes.
Comment on lines 169 to 173
// When deleting, json object member is erased. This causes deletion of two interned string
// wrappers, which has to be accounted for.
const size_t size_after_delete =
start_size - sizeof(detail::InternedBlob) - sizeof(detail::InternedString);
EXPECT_THAT(resp, IntArg(size_after_delete));
Copy link

Copilot AI Jan 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The calculation subtracts sizeof(detail::InternedBlob) and sizeof(detail::InternedString), but this doesn't accurately represent the actual memory freed. When a JSON object member is erased:

  1. sizeof(detail::InternedString) is the size of the wrapper (just a pointer), not the interned blob data
  2. The actual InternedBlob may not be deleted if other JSON objects reference the same interned string (e.g., if "a" appears elsewhere)
  3. The calculation doesn't account for any internal allocations or padding

This test may be fragile and could fail if the string happens to be reused elsewhere. Consider using a more robust approach that doesn't rely on exact sizeof calculations, or ensure the test strings are truly unique.

Copilot uses AI. Check for mistakes.
const uint32_t str_len = sv.size();

// We need \0 because jsoncons expects c_str() and data() style accessors on keys
blob_ = alloc.allocate(kHeaderSize + str_len + 1);
Copy link

Copilot AI Jan 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The memory allocation calculation kHeaderSize + str_len + 1 could overflow if str_len is close to UINT32_MAX. While unlikely in practice for JSON keys, this should be validated to prevent allocation of incorrect memory sizes.

Add a check to ensure str_len doesn't exceed a reasonable maximum (e.g., SIZE_MAX - kHeaderSize - 1) before allocating.

Copilot uses AI. Check for mistakes.
return compare(other) < 0;
}
void InternedString::shrink_to_fit() { // NOLINT (must be non-const to align with jsoncons usage)
DCHECK(entry_ != nullptr);
Copy link

Copilot AI Jan 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The shrink_to_fit method contains a DCHECK that entry_ is not nullptr, but this assertion will fail for default-constructed or empty InternedStrings where entry_ is legitimately nullptr. The check should either allow nullptr (making it a no-op) or the function should not be called on empty strings. Review jsoncons usage to determine the correct behavior.

Suggested change
DCHECK(entry_ != nullptr);
if (!entry_) {
return;
}
// Interned strings are pooled and immutable; nothing to shrink.

Copilot uses AI. Check for mistakes.
Comment on lines 14 to 20
public:
// TODO - does this need to accept an allocator??
explicit InternedBlob(std::string_view sv);
~InternedBlob();
Copy link

Copilot AI Jan 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

InternedBlob lacks explicit copy constructor, move constructor, copy assignment, and move assignment operators. The compiler-generated versions will be incorrect:

  • Copy operations will shallow-copy blob_ pointer, leading to double-free
  • Move operations will leave the moved-from object with a dangling pointer that gets deallocated

Since InternedBlob manages raw memory with new/delete-like semantics, it should follow the Rule of Five. Either:

  1. Delete copy/move operations if blobs should not be copied/moved (they're stored in hash_set by value)
  2. Implement proper deep-copy and move semantics
  3. Use a smart pointer like unique_ptr for blob_

Given that these are stored in a hash_set and never copied/moved after insertion, deleting these operations is likely the correct choice.

Copilot uses AI. Check for mistakes.
@abhijat
Copy link
Contributor Author

abhijat commented Jan 12, 2026

@copilot please do not review this PR until explicitly told to do so in future.

This is a spike PR and is not ready for review yet.

Copy link
Contributor

Copilot AI commented Jan 12, 2026

@abhijat I've opened a new pull request, #6369, to work on those changes. Once the pull request is ready, I'll request review from you.

Copilot AI review requested due to automatic review settings January 12, 2026 08:59
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 7 comments.

Comment on lines 73 to 82
void InternedBlob::IncrRefCount() const {
const uint32_t updated_count = RefCount() + 1;
std::memcpy(blob_ + sizeof(uint32_t), &updated_count, sizeof(updated_count));
}

void InternedBlob::DecrRefCount() const {
// Caller must ensure refcount does not go below 0
const uint32_t updated_count = RefCount() - 1;
std::memcpy(blob_ + sizeof(uint32_t), &updated_count, sizeof(updated_count));
}
Copy link

Copilot AI Jan 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The refcount increment and decrement operations are not atomic, which could lead to race conditions in concurrent scenarios. While the comment at line 15 states the pool is per-shard and thread_local, if InternedBlob instances are accessed from multiple fibers on the same shard, the read-modify-write sequence (RefCount() + 1, then memcpy) is not atomic. This could result in lost updates or incorrect refcounts. Consider using atomic operations or documenting that InternedBlob access must be properly synchronized.

Copilot uses AI. Check for mistakes.
Comment on lines 66 to 67
// pool. Note that the pool itself does not use stateless allocator or mimalloc heap, it probably
// could.
Copy link

Copilot AI Jan 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment states "This pool holds blobs and is used by InternedString to manage string access. node_hash_set instead of flat to maintain pointer stability, as interned strings hold onto pointers into this pool. Note that the pool itself does not use stateless allocator or mimalloc heap, it probably could." This is misleading because while the pool container doesn't use StatelessAllocator, the blob_ data inside each InternedBlob does use StatelessAllocator (see interned_blob.cc lines 14, 20, 87). The comment should clarify that only the pool's node allocation doesn't use StatelessAllocator, but the blob data itself does.

Suggested change
// pool. Note that the pool itself does not use stateless allocator or mimalloc heap, it probably
// could.
// pool. Note that the pool's node allocations (the hash_set container itself) do not use
// StatelessAllocator or the mimalloc heap, though the blob_ data inside each InternedBlob does.

Copilot uses AI. Check for mistakes.
Comment on lines 73 to 76
void InternedBlob::IncrRefCount() const {
const uint32_t updated_count = RefCount() + 1;
std::memcpy(blob_ + sizeof(uint32_t), &updated_count, sizeof(updated_count));
}
Copy link

Copilot AI Jan 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The IncrRefCount method does not check for potential reference count overflow. Since the refcount is uint32_t, incrementing it when it's already at UINT32_MAX would cause it to wrap to 0, leading to premature deletion of the blob. Consider adding a DCHECK or LOG(FATAL) to prevent this.

Copilot uses AI. Check for mistakes.
Comment on lines 13 to 33
InternedBlob::InternedBlob(const std::string_view sv) {
auto alloc = StatelessAllocator<char>{};

constexpr uint32_t ref_count = 1;
const uint32_t str_len = sv.size();

// We need \0 because jsoncons expects c_str() and data() style accessors on keys
blob_ = alloc.allocate(kHeaderSize + str_len + 1);

std::memcpy(blob_, &str_len, sizeof(uint32_t));
std::memcpy(blob_ + sizeof(uint32_t), &ref_count, sizeof(uint32_t));

std::memcpy(blob_ + kHeaderSize, sv.data(), str_len);

// null terminate so jsoncons can directly access the char* as string
blob_[kHeaderSize + str_len] = '\0';
}
Copy link

Copilot AI Jan 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The InternedBlob constructor lacks exception safety. If memcpy throws (unlikely but possible with invalid pointers), or if allocation fails, the blob_ pointer is left in an inconsistent state. While StatelessAllocator likely throws on allocation failure rather than returning nullptr, the code should use RAII or ensure proper cleanup in case of exceptions. Consider using a unique_ptr with a custom deleter or ensuring proper exception handling.

Copilot uses AI. Check for mistakes.
Comment on lines 124 to 138
entry_->DecrRefCount();

if (entry_->RefCount() == 0) {
GetPoolRef().erase(*entry_);
entry_ = nullptr;
Copy link

Copilot AI Jan 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Release() method has a use-after-free bug. After calling GetPoolRef().erase(*entry_), the entry_ pointer is dereferenced, but the blob has already been deleted from the pool. The erase operation will invoke the InternedBlob destructor, which deallocates the blob_ memory. Setting entry_ to nullptr after the erase is meaningless since we're working with a dangling pointer. The entry_ should be saved to a local variable, then set to nullptr before calling erase.

Suggested change
entry_->DecrRefCount();
if (entry_->RefCount() == 0) {
GetPoolRef().erase(*entry_);
entry_ = nullptr;
// Save the current entry pointer locally and clear the member pointer
auto* entry = entry_;
entry_ = nullptr;
entry->DecrRefCount();
if (entry->RefCount() == 0) {
GetPoolRef().erase(*entry);

Copilot uses AI. Check for mistakes.
Comment on lines 65 to 70
return blob_ ? std::string_view{blob_ + kHeaderSize, Size()} : "";
}

const char* InternedBlob::Data() const {
DCHECK(blob_) << "Called Data() on empty blob";
return blob_ ? blob_ + kHeaderSize : nullptr;
Copy link

Copilot AI Jan 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The View() and Data() methods contain redundant null checks. The DCHECK at the start already validates that blob_ is not null, but the ternary operators check blob_ again. This is inconsistent - either remove the DCHECK assertions and keep the null checks, or remove the redundant null checks since the DCHECK ensures blob_ is not null. The latter is more appropriate since these methods should not be called on empty blobs.

Suggested change
return blob_ ? std::string_view{blob_ + kHeaderSize, Size()} : "";
}
const char* InternedBlob::Data() const {
DCHECK(blob_) << "Called Data() on empty blob";
return blob_ ? blob_ + kHeaderSize : nullptr;
return std::string_view{blob_ + kHeaderSize, Size()};
}
const char* InternedBlob::Data() const {
DCHECK(blob_) << "Called Data() on empty blob";
return blob_ + kHeaderSize;

Copilot uses AI. Check for mistakes.
return compare(other) < 0;
}
void InternedString::shrink_to_fit() { // NOLINT (must be non-const to align with jsoncons usage)
DCHECK(entry_ != nullptr);
Copy link

Copilot AI Jan 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The shrink_to_fit method has a DCHECK that is never evaluated because it's checking a condition that should be verified. The DCHECK statement "DCHECK(entry_ != nullptr);" appears to be incomplete - it should either verify an actual invariant or be removed if shrink_to_fit is expected to be a no-op for interned strings.

Suggested change
DCHECK(entry_ != nullptr);

Copilot uses AI. Check for mistakes.
Copilot AI review requested due to automatic review settings January 12, 2026 13:33
@abhijat abhijat force-pushed the abhijat/spike/interned-strings branch from 3f576f3 to cc627b4 Compare January 12, 2026 13:36
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.

}

void InternedString::ResetPool() {
GetPoolRef().clear();
Copy link

Copilot AI Jan 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ResetPool function clears the pool but does not deallocate the InternedBlob objects that are stored in it. This will cause a memory leak if the pool contains any entries when cleared. Consider iterating through the pool and deleting each blob before clearing, or ensure that this function is only called when the pool is guaranteed to be empty (i.e., all refcounts are zero).

Suggested change
GetPoolRef().clear();
auto& pool = GetPoolRef();
for (InternedBlob* blob : pool) {
delete blob;
}
pool.clear();

Copilot uses AI. Check for mistakes.
Comment on lines 72 to 81
void InternedBlob::IncrRefCount() {
const uint32_t updated_count = RefCount() + 1;
std::memcpy(blob_ - sizeof(uint32_t), &updated_count, sizeof(updated_count));
}
Copy link

Copilot AI Jan 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The increment and decrement operations on refcount do not check for overflow or underflow. While the comment on line 78 states "Caller must ensure refcount does not go below 0", there are no safeguards to detect when uint32_t refcount overflows past its maximum value during IncrRefCount. Consider adding DCHECK or runtime assertions to detect these conditions in debug builds.

Copilot uses AI. Check for mistakes.
Comment on lines +63 to +69
std::string_view InternedBlob::View() const {
DCHECK(blob_) << "Called View() on empty blob";
return {blob_, Size()};
Copy link

Copilot AI Jan 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The View method returns a string_view constructed from the blob_ pointer without explicitly providing the size. While this works because blob_ is null-terminated, it relies on the std::string_view constructor that searches for the null terminator, which may be less efficient than providing the size explicitly. Consider using string_view{blob_, Size()} for better performance and clarity.

Copilot uses AI. Check for mistakes.

using TmpJson = jsoncons::json;
using JsonType = jsoncons::basic_json<char, jsoncons::sorted_policy, StatelessAllocator<char>>;

Copy link

Copilot AI Jan 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The InternedStringPolicy struct lacks documentation explaining its purpose and how it integrates with jsoncons. Consider adding a comment that explains this is a policy class for jsoncons::basic_json that customizes the member key type to use InternedString for memory optimization of JSON object keys.

Suggested change
// Policy class for jsoncons::basic_json that customizes the member key type.
// Inherits from jsoncons::sorted_policy and replaces the default object member
// key type with detail::InternedString so that JSON object keys are interned,
// reducing memory usage when many identical keys are present.

Copilot uses AI. Check for mistakes.
Copilot AI review requested due to automatic review settings January 13, 2026 07:30
@abhijat abhijat force-pushed the abhijat/spike/interned-strings branch from 8321c49 to d7ef37e Compare January 13, 2026 07:30
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated no new comments.

Copilot AI review requested due to automatic review settings January 14, 2026 11:31
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

#include "core/detail/stateless_allocator.h"

namespace {
constexpr size_t int_size = sizeof(uint32_t);
Copy link

Copilot AI Jan 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The constant int_size is misleading as it refers to the size of uint32_t, not int. Consider renaming to uint32_size or field_size for clarity.

Copilot uses AI. Check for mistakes.
return *new_elem;
}

void InternedString::Acquire() { // NOLINT
Copy link

Copilot AI Jan 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The NOLINT comment lacks context explaining why the suppression is needed. Add a brief comment explaining the specific warning being suppressed and why it's safe to ignore.

Suggested change
void InternedString::Acquire() { // NOLINT
void InternedString::Acquire() { // NOLINT: small ref-counting helper; suppression avoids noise from style/usage warnings and is safe as it only increments an existing entry_.

Copilot uses AI. Check for mistakes.
Copilot AI review requested due to automatic review settings January 14, 2026 12:02
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Copilot AI review requested due to automatic review settings January 15, 2026 10:36
@abhijat abhijat force-pushed the abhijat/spike/interned-strings branch from 4e8f135 to 6fab414 Compare January 15, 2026 10:36
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated 7 comments.

Comment on lines +89 to +96
template <typename It> InternedString::InternedString(It begin, It end) {
if (begin == end)
return;

const auto size = std::distance(begin, end);
const auto data_ptr = &*begin;
entry_ = Intern(std::string_view(data_ptr, size));
}
Copy link

Copilot AI Jan 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The iterator-based constructor uses std::distance and dereferences the iterator to get a data pointer. However, this assumes the iterator provides contiguous storage, which is not guaranteed for all iterator types. This constructor should either be constrained to contiguous iterators using concepts/SFINAE, or should build the string character-by-character for non-contiguous iterators.

Copilot uses AI. Check for mistakes.
auto alloc = StatelessAllocator<char>{};

constexpr uint32_t ref_count = 1;
const uint32_t str_len = sv.size();
Copy link

Copilot AI Jan 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The string view size is stored as uint32_t, but std::string_view::size() returns size_t. On 64-bit systems, if the string view has a size larger than UINT32_MAX, this will cause silent truncation and memory corruption. The code should either validate that sv.size() fits in a uint32_t or use a larger type for the size field.

Copilot uses AI. Check for mistakes.
const uint32_t str_len = sv.size();

// We need +1 byte for \0 because jsoncons expects c_str() and data() style accessors on keys
blob_ = alloc.allocate(kHeaderSize + str_len + 1);
Copy link

Copilot AI Jan 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The allocation size calculation (kHeaderSize + str_len + 1) could theoretically overflow if str_len is close to UINT32_MAX, though this is unlikely in practice. For robustness, consider checking for overflow before allocation or using a checked arithmetic library.

Copilot uses AI. Check for mistakes.
inline void CleanupStatelessAllocMR() {
for (auto it = detail::cleanup_queue.rbegin(); it != detail::cleanup_queue.rend(); ++it) {
(*it)();
}
Copy link

Copilot AI Jan 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cleanup_queue is not being cleared after running all cleanup functions. This means that on subsequent calls to CleanupStatelessAllocMR(), the queue will still contain function pointers that have already been executed. This could lead to use-after-free bugs if the cleanup functions capture state that gets deallocated. The queue should be cleared after executing all cleanup functions.

Suggested change
}
}
detail::cleanup_queue.clear();

Copilot uses AI. Check for mistakes.
mi_free(shard_);
shard_ = nullptr;
CompactObj::InitThreadLocal(nullptr);
detail::InternedString::ResetPool();
Copy link

Copilot AI Jan 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The InternedString::ResetPool() call is placed before CleanupStatelessAllocMR(), but CleanupStatelessAllocMR() is responsible for calling the pool reset (as registered via EnqueueCleanup in GetPoolRef). This means ResetPool() is called twice: once explicitly here and once via the cleanup queue. The explicit call should be removed since the cleanup mechanism handles it automatically.

Suggested change
detail::InternedString::ResetPool();

Copilot uses AI. Check for mistakes.

info = (await p.execute())[-1]
assert info["used_memory"] < 2 * 1_000_000 # Table memory
assert info["used_memory"] < 4 * 1_000_000 # Table memory
Copy link

Copilot AI Jan 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The memory threshold has been doubled from 2MB to 4MB. While this change is likely necessary due to the additional overhead from InternedString/InternedBlob wrappers, the comment should be updated to reflect why this increase is needed, especially given that this is a WIP PR implementing interned strings.

Suggested change
assert info["used_memory"] < 4 * 1_000_000 # Table memory
# After FLUSHALL only a small amount of table metadata should remain. The threshold was raised
# from 2MB to 4MB to account for overhead introduced by InternedString/InternedBlob wrappers
# used by the interned strings implementation.
assert info["used_memory"] < 4 * 1_000_000

Copilot uses AI. Check for mistakes.
// BaseFamilyTest initializes the heap on shards serving transactions, the core running the test
// needs this initialized explicitly.
InitTLStatelessAllocMR(GetMemoryResource());
detail::InternedString::ResetPool();
Copy link

Copilot AI Jan 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same issue as in engine_shard.cc: InternedString::ResetPool() is being called explicitly before CleanupStatelessAllocMR(), but the cleanup mechanism already handles this. This results in the pool being reset twice, which is redundant and could mask issues if the cleanup mechanism fails. Remove the explicit ResetPool() call.

Suggested change
detail::InternedString::ResetPool();

Copilot uses AI. Check for mistakes.
Copilot AI review requested due to automatic review settings January 16, 2026 08:52
@abhijat abhijat force-pushed the abhijat/spike/interned-strings branch from 2da8c4d to d36a8c0 Compare January 16, 2026 08:52
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated 3 comments.

Comment on lines +28 to +30
void TearDown() override {
CleanupStatelessAllocMR();
}
Copy link

Copilot AI Jan 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The TearDown method should also reset the InternedString pool to prevent test contamination between test cases. Since InternedString uses a thread-local pool that persists across tests, strings created in one test could affect memory usage calculations in subsequent tests. Add detail::InternedString::ResetPool(); before CleanupStatelessAllocMR();.

Copilot uses AI. Check for mistakes.
EXPECT_THAT(resp, IntArg(start_size));
// When deleting, json object member is erased. This causes deletion of the blob and the data it
// contained.
const size_t size_after_delete = start_size - sizeof(detail::InternedBlob) - 16;
Copy link

Copilot AI Jan 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The magic number 16 is unclear. Add a comment explaining what this represents (e.g., the string data size for key 'a' plus allocation overhead, or jsoncons internal structures).

Suggested change
const size_t size_after_delete = start_size - sizeof(detail::InternedBlob) - 16;
const size_t size_after_delete =
start_size - sizeof(detail::InternedBlob) -
16; // 16 bytes account for jsoncons' internal storage/overhead for the erased string value

Copilot uses AI. Check for mistakes.
// The interned string has access to a thread local pool of InternedBlob pointers.
// InternedString handles incrementing and decrementing reference counts of the blobs tied to its
// own lifecycle. It deletes the blob from the pool when refcount is 0.
// TODO examine cross shard json object interactions. Can a pool end up access from another shard?
Copy link

Copilot AI Jan 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This TODO raises an important thread-safety concern. Since the pool is thread-local, cross-shard access would be unsafe. Either confirm this is safe by design (JSON objects don't move between shards) or add runtime assertions to detect violations.

Copilot uses AI. Check for mistakes.
Copilot AI review requested due to automatic review settings January 16, 2026 15:39
@abhijat abhijat force-pushed the abhijat/spike/interned-strings branch from fdcd914 to d36a8c0 Compare January 16, 2026 15:39
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated no new comments.

@abhijat abhijat force-pushed the abhijat/spike/interned-strings branch from d36a8c0 to 3601e58 Compare January 17, 2026 02:14
Copilot AI review requested due to automatic review settings January 19, 2026 07:53
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated 7 comments.

Comment on lines +7 to 9
from .utility import *


Copy link

Copilot AI Jan 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The import statement 'from .utility import *' appears twice (lines 1 and 7). This is redundant and should be removed. Keep only one import from .utility at the top of the file.

Suggested change
from .utility import *

Copilot uses AI. Check for mistakes.
InitTLStatelessAllocMR(GetMemoryResource());
detail::InternedString::ResetPool();
}

Copy link

Copilot AI Jan 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The SetUp method calls InitTLStatelessAllocMR and ResetPool, but there is no corresponding TearDown method to clean up these resources. Consider adding a TearDown method that calls CleanupStatelessAllocMR() to match the pattern used in other test files like interned_blob_test.cc and page_usage_stats_test.cc.

Suggested change
void TearDown() override {
CleanupStatelessAllocMR();
BaseFamilyTest::TearDown();
}

Copilot uses AI. Check for mistakes.
from redis import asyncio as aioredis
from .utility import *
import logging
import logging
Copy link

Copilot AI Jan 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The import statement for 'logging' is duplicated on lines 2 and 3. Remove the duplicate import.

Suggested change
import logging

Copilot uses AI. Check for mistakes.

void InternedBlob::DecrRefCount() {
const uint32_t ref_count = RefCount();
CHECK_GE(ref_count, 1ul) << "Attempt to decrease zero refcount";
Copy link

Copilot AI Jan 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The literal '1ul' is used with CHECK_GE for comparing with a uint32_t variable. For consistency with the rest of the codebase and to avoid potential type issues, use '1u' instead of '1ul' since ref_count is uint32_t, not unsigned long.

Suggested change
CHECK_GE(ref_count, 1ul) << "Attempt to decrease zero refcount";
CHECK_GE(ref_count, 1u) << "Attempt to decrease zero refcount";

Copilot uses AI. Check for mistakes.
EXPECT_THAT(resp, IntArg(start_size));
// When deleting, json object member is erased. This causes deletion of the blob and the data it
// contained.
const size_t size_after_delete = start_size - sizeof(detail::InternedBlob) - 16;
Copy link

Copilot AI Jan 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The magic number '16' is used to calculate the expected size after deletion. This appears to be related to the string size ("a":"some text"), but it's not documented. Consider adding a comment explaining what this 16 represents or using a named constant to make the calculation more maintainable and clear.

Copilot uses AI. Check for mistakes.
Comment on lines +65 to +115
blob.DecrRefCount();
EXPECT_EQ(blob.RefCount(), 0);
EXPECT_DEATH(blob.DecrRefCount(), "Attempt to decrease zero refcount");
blob.SetRefCount(std::numeric_limits<uint32_t>::max());
EXPECT_DEATH(blob.IncrRefCount(), "Attempt to increase max refcount");
}

Copy link

Copilot AI Jan 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test manipulates the reference count to 0 and then continues to use the blob object to test EXPECT_DEATH scenarios. After decrementing the refcount to 0 at line 66, the blob's internal state indicates it should be deleted. Using the blob after this point (lines 67-69) may lead to undefined behavior in production code, even though it works in the test. Consider restructuring the test to avoid this pattern or add a comment explaining why this is safe in the test context.

Suggested change
blob.DecrRefCount();
EXPECT_EQ(blob.RefCount(), 0);
EXPECT_DEATH(blob.DecrRefCount(), "Attempt to decrease zero refcount");
blob.SetRefCount(std::numeric_limits<uint32_t>::max());
EXPECT_DEATH(blob.IncrRefCount(), "Attempt to increase max refcount");
}
EXPECT_EQ(blob.RefCount(), 1);
}
TEST_F(InternedBlobTest, RefCountsDeath) {
// These death tests intentionally violate the refcount invariants to verify
// that InternedBlob enforces them. Each scenario uses a fresh blob instance
// scoped to the EXPECT_DEATH body, so no object is reused after reaching a
// terminal refcount state in normal test execution.
EXPECT_DEATH(
[] {
auto blob = InternedBlob{"1234567"};
blob.DecrRefCount(); // refcount goes from 1 to 0
blob.DecrRefCount(); // invalid: attempt to decrement zero refcount
}(),
"Attempt to decrease zero refcount");
EXPECT_DEATH(
[] {
auto blob = InternedBlob{"1234567"};
blob.SetRefCount(std::numeric_limits<uint32_t>::max());
blob.IncrRefCount(); // invalid: attempt to increment max refcount
}(),
"Attempt to increase max refcount");
}

Copilot uses AI. Check for mistakes.
from . import dfly_args
from .instance import DflyInstance, DflyInstanceFactory
from .instance import DflyInstanceFactory
from .utility import *
Copy link

Copilot AI Jan 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Import pollutes the enclosing namespace, as the imported module dragonfly.utility does not define 'all'.

Copilot uses AI. Check for mistakes.
The stateless allocator is backed by a thread local memory resource. In
order to catch bugs during testing, this resource is reset to null
pointer at the end of tests where applicable.

One purpose is to catch issues with interned strings introduced in the
next few changes, which use a thread local string pool which allocates
from the stateless allocator. Setting the memory resource to null
ensures that buggy behavior where blobs or strings outlive the memory
resource is caught early, via the assert within the allocator.

Signed-off-by: Abhijat Malviya <abhijat@dragonflydb.io>
@abhijat abhijat force-pushed the abhijat/spike/interned-strings branch from 20ec6da to 206cb5b Compare January 23, 2026 10:09
A blob contains one char * allocated on the mimalloc heap via stateless
allocator.

The blob contains a string, its size and reference count. The size and
ref count are stored before the char data. The chars end with a nul char
so as to be able to use the char* directly as a string. This is
important for jsoncons which expects such an API.

There are custom hash and eq operators which allow for heterogenous
lookup ie do not require conversion to blob type for hashing/comparison,
and also allow only using the string content for checks, ignoring ref.
count and size.

An interned pool is introduced which is simply a flat set of blob
pointers.

The blob itself does not control its reference count. In a future PR a
string will be added, which will use refcount.

Signed-off-by: Abhijat Malviya <abhijat@dragonflydb.io>
The string class wraps the previously added blob class. It decides at
creation time whether to create a new blob, or bump up the refcount of
an existing blob in the pool.

The pool is thread local, and is used by the strings on that thread to
store pointers to allocated blobs.

When a string is destroyed it decrements the ref-count, dropping the
blob if necessary (ie refcount == 0).

A lot of boilerplate in this class is from trial and error, adding
methods that jsoncons required to be able to use this class as a key.

Signed-off-by: Abhijat Malviya <abhijat@dragonflydb.io>
A policy is defined which hooks interned strings into jsoncons.

Tests are adjusted for the now reduced memory size.

Engine shard is set up to reset (clear out) the blob pool before setting
the memory resource to nullptr. This way no blobs are around which may
refer to the memory resource after it is set to nullptr.

Signed-off-by: Abhijat Malviya <abhijat@dragonflydb.io>
Some memory usage is increased because of the blob pool. The tests are
adjusted accordingly.

Signed-off-by: Abhijat Malviya <abhijat@dragonflydb.io>
@abhijat abhijat force-pushed the abhijat/spike/interned-strings branch from 206cb5b to 0593c8f Compare January 23, 2026 10:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants