-
Notifications
You must be signed in to change notification settings - Fork 1.1k
[WIP]: use refcounted interned strings #6368
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
32413a1 to
ac82dfa
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR introduces a reference-counted string interning system for JSON object keys to reduce memory usage when the same keys appear multiple times. The implementation adds InternedBlob (raw string storage with refcounting) and InternedString (string-like wrapper) classes, and integrates them into jsoncons via a custom InternedStringPolicy.
Changes:
- Added
InternedBlobclass for memory-efficient string storage with reference counting - Added
InternedStringclass providing a string-like API over interned blobs - Modified
JsonTypeto useInternedStringPolicyfor object keys instead ofsorted_policy
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 16 comments.
Show a summary per file
| File | Description |
|---|---|
| src/core/json/json_object.h | Integrates InternedString into JsonType via custom InternedStringPolicy |
| src/core/json/detail/interned_string.h | Defines InternedString class with string-like interface and template constructors |
| src/core/json/detail/interned_string.cc | Implements InternedString with refcount management and thread_local pool |
| src/core/json/detail/interned_blob.h | Defines InternedBlob for raw string storage with refcounting and hash/equality support |
| src/core/json/detail/interned_blob.cc | Implements InternedBlob memory layout and refcount operations |
| src/core/json/CMakeLists.txt | Adds new source files to jsonpath library |
| // Copyright 2025, DragonflyDB authors. All rights reserved. | ||
| // See LICENSE for licensing terms. | ||
|
|
||
| #pragma once | ||
|
|
||
| #include "core/detail/stateless_allocator.h" | ||
| #include "core/json/detail/interned_blob.h" | ||
|
|
||
| namespace dfly::detail { | ||
|
|
||
| // The interned string has access to a thread local pool of InternedBlobs. It only holds a single | ||
| // pointer into that pool as state. The pool is a node_hash_set so the pointer is stable. | ||
| // InternedString handles incrementing and decrementing reference counts of the blobs tied to its | ||
| // own lifecycle. It deletes the blob from the pool when refcount is 0. The pool is per shard. | ||
| // TODO examine cross shard json object interactions. Can a pool end up access from another shard? | ||
| class InternedString { | ||
| public: | ||
| using allocator_type = StatelessAllocator<char>; | ||
|
|
||
| InternedString(); | ||
|
|
||
| // The following constructors and members are added because they are required by jsoncons for | ||
| // keys. Each of these is added in response to compiler errors and should not be removed, even if | ||
| // they are seemingly a no-op or duplicated. | ||
| template <typename Alloc> explicit InternedString(Alloc /*unused*/) { | ||
| } | ||
|
|
||
| explicit InternedString(std::string_view); | ||
|
|
||
| template <typename Alloc> InternedString(const char* data, size_t size, Alloc alloc); | ||
|
|
||
| template <typename It> InternedString(It begin, It end); | ||
|
|
||
| InternedString(const InternedString& other); | ||
| InternedString(InternedString&& other) noexcept; | ||
| InternedString& operator=(const InternedString& other); | ||
| InternedString& operator=(InternedString&& other) noexcept; | ||
|
|
||
| ~InternedString(); | ||
|
|
||
| operator std::string_view() const; | ||
|
|
||
| const char* data() const; | ||
|
|
||
| const char* c_str() const; | ||
|
|
||
| void swap(InternedString&) noexcept; | ||
|
|
||
| size_t length() const; | ||
|
|
||
| size_t size() const; | ||
|
|
||
| int compare(const InternedString& other) const; | ||
| int compare(std::string_view) const; | ||
|
|
||
| bool operator==(const InternedString& other) const; | ||
| bool operator!=(const InternedString& other) const; | ||
| bool operator<(const InternedString& other) const; | ||
|
|
||
| void shrink_to_fit() { // NOLINT | ||
| } | ||
|
|
||
| private: | ||
| // If a string exists in pool, increments its refcount and returns a pointer to it. If not, adds | ||
| // the string to the pool. | ||
| static const InternedBlob* Intern(std::string_view sv); | ||
|
|
||
| // Increments the refcount if the entry is not null | ||
| void Acquire(); | ||
|
|
||
| // Decrements the refcount, removes entry from pool if necessary, destroying the interned blob | ||
| void Release(); | ||
|
|
||
| static InternedBlobPool& GetPoolRef(); | ||
|
|
||
| static InternedBlobPool pool_; | ||
|
|
||
| const InternedBlob* entry_ = nullptr; | ||
| }; | ||
|
|
||
| template <typename Alloc> | ||
| InternedString::InternedString(const char* data, size_t size, Alloc /*unused*/) | ||
| : InternedString(std::string_view{data, size}) { | ||
| } |
Copilot
AI
Jan 11, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new InternedString and InternedBlob classes lack dedicated test coverage. Given the complexity of reference counting, memory management, and the integration with jsoncons, comprehensive unit tests should be added to verify: empty string handling, refcount correctness, pool insertion/removal, copy/move semantics, self-assignment, iterator constructor edge cases, and memory leak prevention. The existing json_test.cc file shows the project has test infrastructure for JSON components.
| uint32_t InternedBlob::Size() const { | ||
| // TODO - should this check be removed to uncover bugs? or an assert here? | ||
| if (!blob_) | ||
| return 0; | ||
|
|
||
| uint32_t size; | ||
| std::memcpy(&size, blob_, sizeof(size)); | ||
| return size; | ||
| } |
Copilot
AI
Jan 11, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The TODO questions whether null checks should be removed or converted to assertions. For production code dealing with memory management and reference counting, these null checks are important for defensive programming. However, if the invariant is that blob_ should never be null during these operations, assertions would be more appropriate to catch bugs in debug builds while avoiding the overhead in release builds.
1ad1102 to
6734438
Compare
6734438 to
ceb8568
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated 10 comments.
| // own lifecycle. It deletes the blob from the pool when refcount is 0. The pool is per shard. | ||
| // TODO examine cross shard json object interactions. Can a pool end up access from another shard? |
Copilot
AI
Jan 11, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a TODO about examining cross-shard JSON object interactions. Since the pool is thread_local, if a JSON object with InternedStrings is passed to another shard/thread, those strings would reference a pool on a different thread, potentially causing crashes or undefined behavior. This needs to be resolved before merging, or at minimum, documented as a known limitation with guards to prevent cross-thread usage.
| // own lifecycle. It deletes the blob from the pool when refcount is 0. The pool is per shard. | |
| // TODO examine cross shard json object interactions. Can a pool end up access from another shard? | |
| // own lifecycle. It deletes the blob from the pool when refcount is 0. The pool is per shard | |
| // and per thread. | |
| // | |
| // IMPORTANT: InternedString instances (and JSON objects containing them) are local to the shard/ | |
| // thread whose pool they were created in. They must not be passed to or accessed from another | |
| // shard/thread. Doing so would cause entry_ to refer to an InternedBlob allocated from a | |
| // different thread-local pool, which can lead to crashes or undefined behavior when reference | |
| // counts are updated or entries are removed from the pool. Callers are responsible for enforcing | |
| // this shard/thread affinity and must avoid cross-shard/thread usage. |
| if (this != &other) { | ||
| Release(); | ||
| entry_ = other.entry_; | ||
| Acquire(); | ||
| } | ||
| return *this; | ||
| } |
Copilot
AI
Jan 11, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comparison in the copy assignment operator using operator!= may cause issues with self-assignment. Using string content comparison instead of identity check means self-assignment of an InternedString to itself would still trigger Release() and Acquire(), unnecessarily modifying the reference count. Change the condition to 'if (this != &other)' for proper self-assignment detection.
| void InternedBlob::IncrRefCount() const { | ||
| const uint32_t updated_count = RefCount() + 1; | ||
| std::memcpy(blob_ + sizeof(updated_count), &updated_count, sizeof(updated_count)); | ||
| } | ||
|
|
||
| void InternedBlob::DecrRefCount() const { | ||
| // Caller must ensure refcount does not go below 0 | ||
| const uint32_t updated_count = RefCount() - 1; | ||
| std::memcpy(blob_ + sizeof(updated_count), &updated_count, sizeof(updated_count)); |
Copilot
AI
Jan 11, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reference counting operations IncrRefCount() and DecrRefCount() are not atomic. In a multi-threaded environment where the same InternedString might be accessed from multiple threads, this could lead to race conditions and incorrect reference counts. Consider using std::atomic<uint32_t> for the reference count field or documenting that InternedStrings must not be shared across threads.
ceb8568 to
bcd7ffe
Compare
bcd7ffe to
a1f55e6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated 12 comments.
| template <typename It> InternedString::InternedString(It begin, It end) { | ||
| if (begin == end) | ||
| return; | ||
|
|
||
| const auto size = std::distance(begin, end); | ||
| const auto data_ptr = &*begin; | ||
| entry_ = Intern(std::string_view(data_ptr, size)); | ||
| } |
Copilot
AI
Jan 11, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The iterator constructor template does not validate that the iterator points to contiguous memory before taking the address with &*begin. If the iterator type is not a contiguous iterator (e.g., a list iterator), this will create undefined behavior. Consider adding a static_assert to enforce that It is a contiguous iterator, or copying the data into a temporary buffer.
src/core/json/detail/interned_blob.h
Outdated
| // These setters are const as a hack. They are allowed to be const because they mutate the | ||
| // refcount via pointer. If they were non-const InternedString would not be able to mutate pool | ||
| // entries to incr/decr refcounts. Since we do not use the refcount or size in hashing, this is | ||
| // fine. | ||
| void IncrRefCount() const; | ||
|
|
||
| void DecrRefCount() const; |
Copilot
AI
Jan 11, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The const_cast hack mentioned in the comment is problematic. Making IncrRefCount and DecrRefCount const methods that modify mutable state breaks const-correctness and makes the API misleading. This pattern can hide bugs where const methods unexpectedly modify state. A better design would be to make the InternedBlob* in the hash set mutable, or use a different pool structure that doesn't require const methods to mutate state.
| // These setters are const as a hack. They are allowed to be const because they mutate the | |
| // refcount via pointer. If they were non-const InternedString would not be able to mutate pool | |
| // entries to incr/decr refcounts. Since we do not use the refcount or size in hashing, this is | |
| // fine. | |
| void IncrRefCount() const; | |
| void DecrRefCount() const; | |
| // These methods mutate the internal refcount of the blob. | |
| void IncrRefCount(); | |
| void DecrRefCount(); |
| auto it = pool_ref.find(sv); | ||
| if (it != pool_ref.end()) { | ||
| it->IncrRefCount(); | ||
| return &*it; | ||
| } | ||
|
|
||
| auto [new_elem, _] = pool_ref.emplace(sv); | ||
| return &*new_elem; |
Copilot
AI
Jan 11, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Inefficient double-lookup in Intern method. The code first calls find() at line 95, then if not found, calls emplace() at line 101 which will perform another lookup internally. This results in two hash computations and lookups for new strings. Consider using try_emplace() which performs a single lookup and only constructs the object if insertion occurs.
| auto it = pool_ref.find(sv); | |
| if (it != pool_ref.end()) { | |
| it->IncrRefCount(); | |
| return &*it; | |
| } | |
| auto [new_elem, _] = pool_ref.emplace(sv); | |
| return &*new_elem; | |
| auto [it, inserted] = pool_ref.emplace(sv); | |
| if (!inserted) { | |
| it->IncrRefCount(); | |
| } | |
| return &*it; |
| blob_ = alloc.allocate(kHeaderSize + str_len + 1); | ||
|
|
||
| std::memcpy(blob_, &str_len, sizeof(uint32_t)); | ||
| std::memcpy(blob_ + sizeof(uint32_t), &ref_count, sizeof(uint32_t)); | ||
|
|
||
| std::memcpy(blob_ + kHeaderSize, sv.data(), str_len); | ||
|
|
||
| // null terminate so jsoncons can directly access the char* as string | ||
| blob_[kHeaderSize + str_len] = '\0'; |
Copilot
AI
Jan 11, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Potential alignment issues with the blob memory layout. The code assumes uint32_t can be safely memcpy'd to/from char* at arbitrary offsets. While this works in practice on most platforms, it's technically undefined behavior. Consider using std::aligned_storage or properly aligned allocation to ensure the header is correctly aligned for uint32_t access.
|
|
||
| const char* c_str() const; | ||
|
|
||
| void swap(InternedString&) noexcept; |
Copilot
AI
Jan 11, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing const qualifier on the swap method parameter. The parameter should be a non-const reference to allow swapping, but the method signature shows it correctly. However, there's an inconsistency with standard library conventions - typically swap is a free function, not a member. Consider providing both a member swap and a non-member swap in the namespace for ADL (Argument-Dependent Lookup) compatibility.
| // Copyright 2025, DragonflyDB authors. All rights reserved. | ||
| // See LICENSE for licensing terms. | ||
|
|
||
| #pragma once | ||
|
|
||
| #include "core/detail/stateless_allocator.h" | ||
| #include "core/json/detail/interned_blob.h" | ||
|
|
||
| namespace dfly::detail { | ||
|
|
||
| // The interned string has access to a thread local pool of InternedBlobs. It only holds a single | ||
| // pointer into that pool as state. The pool is a node_hash_set so the pointer is stable. | ||
| // InternedString handles incrementing and decrementing reference counts of the blobs tied to its | ||
| // own lifecycle. It deletes the blob from the pool when refcount is 0. The pool is per shard. | ||
| // TODO examine cross shard json object interactions. Can a pool end up access from another shard? | ||
| class InternedString { | ||
| public: | ||
| using allocator_type = StatelessAllocator<char>; | ||
|
|
||
| InternedString(); | ||
|
|
||
| // The following constructors and members are added because they are required by jsoncons for | ||
| // keys. Each of these is added in response to compiler errors and should not be removed, even if | ||
| // they are seemingly a no-op or duplicated. | ||
| template <typename Alloc> explicit InternedString(Alloc /*unused*/) { | ||
| } | ||
|
|
||
| explicit InternedString(std::string_view); | ||
|
|
||
| template <typename Alloc> InternedString(const char* data, size_t size, Alloc alloc); | ||
|
|
||
| template <typename It> InternedString(It begin, It end); | ||
|
|
||
| InternedString(const InternedString& other); | ||
| InternedString(InternedString&& other) noexcept; | ||
| InternedString& operator=(const InternedString& other); | ||
| InternedString& operator=(InternedString&& other) noexcept; | ||
|
|
||
| ~InternedString(); | ||
|
|
||
| operator std::string_view() const; | ||
|
|
||
| const char* data() const; | ||
|
|
||
| const char* c_str() const; | ||
|
|
||
| void swap(InternedString&) noexcept; | ||
|
|
||
| size_t length() const; | ||
|
|
||
| size_t size() const; | ||
|
|
||
| int compare(const InternedString& other) const; | ||
| int compare(std::string_view) const; | ||
|
|
||
| bool operator==(const InternedString& other) const; | ||
| bool operator!=(const InternedString& other) const; | ||
| bool operator<(const InternedString& other) const; | ||
|
|
||
| void shrink_to_fit() { // NOLINT | ||
| } | ||
|
|
||
| private: | ||
| // If a string exists in pool, increments its refcount and returns a pointer to it. If not, adds | ||
| // the string to the pool. | ||
| static const InternedBlob* Intern(std::string_view sv); | ||
|
|
||
| // Increments the refcount if the entry is not null | ||
| void Acquire(); | ||
|
|
||
| // Decrements the refcount, removes entry from pool if necessary, destroying the interned blob | ||
| void Release(); | ||
|
|
||
| static InternedBlobPool& GetPoolRef(); | ||
|
|
||
| const InternedBlob* entry_ = nullptr; | ||
| }; | ||
|
|
||
| template <typename Alloc> | ||
| InternedString::InternedString(const char* data, size_t size, Alloc /*unused*/) | ||
| : InternedString(std::string_view{data, size}) { | ||
| } | ||
|
|
||
| template <typename It> InternedString::InternedString(It begin, It end) { | ||
| if (begin == end) | ||
| return; | ||
|
|
||
| const auto size = std::distance(begin, end); | ||
| const auto data_ptr = &*begin; | ||
| entry_ = Intern(std::string_view(data_ptr, size)); | ||
| } | ||
|
|
||
| } // namespace dfly::detail |
Copilot
AI
Jan 11, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing test coverage for the new InternedString and InternedBlob functionality. The repository has comprehensive test coverage for JSON functionality (json_test.cc), but no tests are added for the new interned string implementation. Critical scenarios that should be tested include: reference counting correctness, pool management, string interning deduplication, edge cases like empty strings, move/copy semantics, and memory leak verification. Without tests, bugs in this critical memory management code may not be caught.
| bool InternedString::operator==(const InternedString& other) const { | ||
| // Compare pointers since we store them in the same pool | ||
| return entry_ == other.entry_; | ||
| } | ||
|
|
||
| bool InternedString::operator!=(const InternedString& other) const { | ||
| return entry_ != other.entry_; | ||
| } | ||
|
|
||
| bool InternedString::operator<(const InternedString& other) const { |
Copilot
AI
Jan 11, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing noexcept specification on comparison operators. The operator< is implemented in terms of compare(), which calls string_view::compare(). Since string_view::compare() is noexcept, this operator should also be marked noexcept for better performance and exception safety guarantees. The same applies to operator== and operator!= which should be noexcept as they only compare pointers.
| bool InternedString::operator==(const InternedString& other) const { | |
| // Compare pointers since we store them in the same pool | |
| return entry_ == other.entry_; | |
| } | |
| bool InternedString::operator!=(const InternedString& other) const { | |
| return entry_ != other.entry_; | |
| } | |
| bool InternedString::operator<(const InternedString& other) const { | |
| bool InternedString::operator==(const InternedString& other) const noexcept { | |
| // Compare pointers since we store them in the same pool | |
| return entry_ == other.entry_; | |
| } | |
| bool InternedString::operator!=(const InternedString& other) const noexcept { | |
| return entry_ != other.entry_; | |
| } | |
| bool InternedString::operator<(const InternedString& other) const noexcept { |
| uint32_t InternedBlob::Size() const { | ||
| // TODO - should this check be removed to uncover bugs? or an assert here? | ||
| if (!blob_) | ||
| return 0; | ||
|
|
||
| uint32_t size; | ||
| std::memcpy(&size, blob_, sizeof(size)); | ||
| return size; | ||
| } | ||
|
|
||
| uint32_t InternedBlob::RefCount() const { | ||
| // TODO - should this check be removed to uncover bugs? or an assert here? | ||
| if (!blob_) | ||
| return 0; |
Copilot
AI
Jan 11, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The TODO comments asking whether null checks should be removed or replaced with asserts indicate incomplete design decisions. These methods are part of the public API (accessible via InternedString), so proper error handling is important. The decision should be made: either document that callers must ensure blob_ is valid (and use asserts in debug builds), or handle null gracefully. The current inconsistency should be resolved before merging.
| void InternedBlob::IncrRefCount() const { | ||
| const uint32_t updated_count = RefCount() + 1; | ||
| std::memcpy(blob_ + sizeof(uint32_t), &updated_count, sizeof(updated_count)); |
Copilot
AI
Jan 11, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Potential integer overflow in refcount increment. If RefCount returns UINT32_MAX, adding 1 will cause overflow and wrap to 0, which would then trigger incorrect deletion of the blob. Consider adding overflow detection or using a saturating increment.
| uint32_t InternedBlob::Size() const { | ||
| // TODO - should this check be removed to uncover bugs? or an assert here? | ||
| if (!blob_) | ||
| return 0; | ||
|
|
||
| uint32_t size; | ||
| std::memcpy(&size, blob_, sizeof(size)); | ||
| return size; | ||
| } | ||
|
|
||
| uint32_t InternedBlob::RefCount() const { | ||
| // TODO - should this check be removed to uncover bugs? or an assert here? | ||
| if (!blob_) | ||
| return 0; | ||
|
|
||
| uint32_t ref_count; | ||
| // Assumes size and refcount are both 4 bytes | ||
| std::memcpy(&ref_count, blob_ + sizeof(uint32_t), sizeof(uint32_t)); | ||
| return ref_count; | ||
| } | ||
|
|
||
| std::string_view InternedBlob::View() const { | ||
| return {blob_ + kHeaderSize, Size()}; | ||
| } | ||
|
|
||
| const char* InternedBlob::Data() const { | ||
| return blob_ ? blob_ + kHeaderSize : nullptr; | ||
| } |
Copilot
AI
Jan 11, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Inconsistent null checking pattern. The Size() and RefCount() methods check for null blob_ and return 0, but View() and Data() handle null differently. View() would create a string_view from an invalid pointer when blob_ is null (though Size() returns 0, which makes it safe), while Data() explicitly checks. This inconsistency could lead to maintenance issues. Consider standardizing the null-checking approach across all methods.
ef79c65 to
a9b2552
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 7 comments.
| namespace dfly::detail { | ||
| class InternedBlob { | ||
| static constexpr auto kHeaderSize = sizeof(uint32_t) * 2; | ||
|
|
||
| public: | ||
| // TODO - does this need to accept an allocator?? | ||
| explicit InternedBlob(std::string_view sv); |
Copilot
AI
Jan 12, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The TODO questions whether InternedBlob needs to accept an allocator. Given that this code uses StatelessAllocator which is designed for the shard-specific mimalloc heap, and the InternedBlob is used in a per-shard pool, it should accept and use the allocator for consistency with the rest of the codebase. This ensures proper memory accounting and allocation on the correct heap.
Update the constructor to accept an allocator parameter and use it instead of creating a new StatelessAllocator instance.
| namespace dfly::detail { | |
| class InternedBlob { | |
| static constexpr auto kHeaderSize = sizeof(uint32_t) * 2; | |
| public: | |
| // TODO - does this need to accept an allocator?? | |
| explicit InternedBlob(std::string_view sv); | |
| namespace dfly { | |
| template <typename T> class StatelessAllocator; | |
| namespace detail { | |
| class InternedBlob { | |
| static constexpr auto kHeaderSize = sizeof(uint32_t) * 2; | |
| public: | |
| // TODO - does this need to accept an allocator?? | |
| explicit InternedBlob(std::string_view sv, const dfly::StatelessAllocator<char>& allocator); |
| // Copyright 2025, DragonflyDB authors. All rights reserved. | ||
| // See LICENSE for licensing terms. | ||
|
|
||
| #pragma once | ||
|
|
||
| #include "core/detail/stateless_allocator.h" | ||
| #include "core/json/detail/interned_blob.h" | ||
|
|
||
| namespace dfly::detail { | ||
|
|
||
| // The interned string has access to a thread local pool of InternedBlobs. It only holds a single | ||
| // pointer into that pool as state. The pool is a node_hash_set so the pointer is stable. | ||
| // InternedString handles incrementing and decrementing reference counts of the blobs tied to its | ||
| // own lifecycle. It deletes the blob from the pool when refcount is 0. The pool is per shard. | ||
| // TODO examine cross shard json object interactions. Can a pool end up access from another shard? | ||
| class InternedString { | ||
| public: | ||
| using allocator_type = StatelessAllocator<char>; | ||
|
|
||
| InternedString(); | ||
|
|
||
| // The following constructors and members are added because they are required by jsoncons for | ||
| // keys. Each of these is added in response to compiler errors and should not be removed, even if | ||
| // they are seemingly a no-op or duplicated. | ||
| template <typename Alloc> explicit InternedString(Alloc /*unused*/) { | ||
| } | ||
|
|
||
| explicit InternedString(std::string_view); | ||
|
|
||
| template <typename Alloc> InternedString(const char* data, size_t size, Alloc alloc); | ||
|
|
||
| template <typename It> InternedString(It begin, It end); | ||
|
|
||
| InternedString(const InternedString& other); | ||
| InternedString(InternedString&& other) noexcept; | ||
| InternedString& operator=(const InternedString& other); | ||
| InternedString& operator=(InternedString&& other) noexcept; | ||
|
|
||
| ~InternedString(); | ||
|
|
||
| operator std::string_view() const; | ||
|
|
||
| const char* data() const; | ||
|
|
||
| const char* c_str() const; | ||
|
|
||
| void swap(InternedString&) noexcept; | ||
|
|
||
| size_t length() const; | ||
|
|
||
| size_t size() const; | ||
|
|
||
| int compare(const InternedString& other) const; | ||
| int compare(std::string_view) const; | ||
|
|
||
| bool operator==(const InternedString& other) const; | ||
| bool operator!=(const InternedString& other) const; | ||
| bool operator<(const InternedString& other) const; | ||
|
|
||
| void shrink_to_fit(); | ||
|
|
||
| // For tests | ||
| static void ResetPool(); | ||
|
|
||
| private: | ||
| // If a string exists in pool, increments its refcount and returns a pointer to it. If not, adds | ||
| // the string to the pool. | ||
| static const InternedBlob* Intern(std::string_view sv); | ||
|
|
||
| // Increments the refcount if the entry is not null | ||
| void Acquire(); | ||
|
|
||
| // Decrements the refcount, removes entry from pool if necessary, destroying the interned blob | ||
| void Release(); | ||
|
|
||
| static InternedBlobPool& GetPoolRef(); | ||
|
|
||
| const InternedBlob* entry_ = nullptr; | ||
| }; | ||
|
|
||
| template <typename Alloc> | ||
| InternedString::InternedString(const char* data, size_t size, Alloc /*unused*/) | ||
| : InternedString(std::string_view{data, size}) { | ||
| } | ||
|
|
||
| template <typename It> InternedString::InternedString(It begin, It end) { | ||
| if (begin == end) | ||
| return; | ||
|
|
||
| const auto size = std::distance(begin, end); | ||
| const auto data_ptr = &*begin; | ||
| entry_ = Intern(std::string_view(data_ptr, size)); | ||
| } | ||
|
|
||
| } // namespace dfly::detail |
Copilot
AI
Jan 12, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new InternedString and InternedBlob classes lack dedicated unit tests. While json_family_memory_test.cc has been updated to account for the new behavior, there are no direct tests for:
- Reference counting correctness (increment/decrement/cleanup)
- Pool management (insertion, lookup, deletion)
- Edge cases (empty strings, large strings, special characters)
- Memory lifecycle (construction, copy, move, destruction)
- The ResetPool() functionality
Add a dedicated test file (e.g., interned_string_test.cc) with comprehensive unit tests for these core components before merging.
src/core/json/detail/interned_blob.h
Outdated
| // pool. Note that the pool itself does not use stateless allocator or mimalloc heap, it probably | ||
| // could. |
Copilot
AI
Jan 12, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment states that the pool "probably could" use StatelessAllocator or mimalloc heap, but doesn't explain why it currently doesn't. Given that this code is part of a system designed around per-shard heaps and stateless allocators (as evidenced by the usage in InternedBlob), the pool should use the same allocation strategy for consistency and proper memory accounting.
Either update the pool to use StatelessAllocator, or document why using the default allocator is the correct choice here.
| // pool. Note that the pool itself does not use stateless allocator or mimalloc heap, it probably | |
| // could. | |
| // pool. The pool intentionally uses the default allocator rather than a stateless allocator or a | |
| // dedicated mimalloc heap: the dominant memory usage comes from the InternedBlob payloads | |
| // themselves (managed by InternedBlob), while the node_hash_set only stores lightweight | |
| // bookkeeping nodes. Using the default allocator here keeps the container simple and avoids | |
| // coupling it to shard-local heap configuration, without materially affecting memory accounting. |
| // When deleting, json object member is erased. This causes deletion of two interned string | ||
| // wrappers, which has to be accounted for. | ||
| const size_t size_after_delete = | ||
| start_size - sizeof(detail::InternedBlob) - sizeof(detail::InternedString); | ||
| EXPECT_THAT(resp, IntArg(size_after_delete)); |
Copilot
AI
Jan 12, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The calculation subtracts sizeof(detail::InternedBlob) and sizeof(detail::InternedString), but this doesn't accurately represent the actual memory freed. When a JSON object member is erased:
- sizeof(detail::InternedString) is the size of the wrapper (just a pointer), not the interned blob data
- The actual InternedBlob may not be deleted if other JSON objects reference the same interned string (e.g., if "a" appears elsewhere)
- The calculation doesn't account for any internal allocations or padding
This test may be fragile and could fail if the string happens to be reused elsewhere. Consider using a more robust approach that doesn't rely on exact sizeof calculations, or ensure the test strings are truly unique.
| const uint32_t str_len = sv.size(); | ||
|
|
||
| // We need \0 because jsoncons expects c_str() and data() style accessors on keys | ||
| blob_ = alloc.allocate(kHeaderSize + str_len + 1); |
Copilot
AI
Jan 12, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The memory allocation calculation kHeaderSize + str_len + 1 could overflow if str_len is close to UINT32_MAX. While unlikely in practice for JSON keys, this should be validated to prevent allocation of incorrect memory sizes.
Add a check to ensure str_len doesn't exceed a reasonable maximum (e.g., SIZE_MAX - kHeaderSize - 1) before allocating.
| return compare(other) < 0; | ||
| } | ||
| void InternedString::shrink_to_fit() { // NOLINT (must be non-const to align with jsoncons usage) | ||
| DCHECK(entry_ != nullptr); |
Copilot
AI
Jan 12, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The shrink_to_fit method contains a DCHECK that entry_ is not nullptr, but this assertion will fail for default-constructed or empty InternedStrings where entry_ is legitimately nullptr. The check should either allow nullptr (making it a no-op) or the function should not be called on empty strings. Review jsoncons usage to determine the correct behavior.
| DCHECK(entry_ != nullptr); | |
| if (!entry_) { | |
| return; | |
| } | |
| // Interned strings are pooled and immutable; nothing to shrink. |
| public: | ||
| // TODO - does this need to accept an allocator?? | ||
| explicit InternedBlob(std::string_view sv); | ||
| ~InternedBlob(); |
Copilot
AI
Jan 12, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
InternedBlob lacks explicit copy constructor, move constructor, copy assignment, and move assignment operators. The compiler-generated versions will be incorrect:
- Copy operations will shallow-copy blob_ pointer, leading to double-free
- Move operations will leave the moved-from object with a dangling pointer that gets deallocated
Since InternedBlob manages raw memory with new/delete-like semantics, it should follow the Rule of Five. Either:
- Delete copy/move operations if blobs should not be copied/moved (they're stored in hash_set by value)
- Implement proper deep-copy and move semantics
- Use a smart pointer like unique_ptr for blob_
Given that these are stored in a hash_set and never copied/moved after insertion, deleting these operations is likely the correct choice.
|
@copilot please do not review this PR until explicitly told to do so in future. This is a spike PR and is not ready for review yet. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 7 comments.
| void InternedBlob::IncrRefCount() const { | ||
| const uint32_t updated_count = RefCount() + 1; | ||
| std::memcpy(blob_ + sizeof(uint32_t), &updated_count, sizeof(updated_count)); | ||
| } | ||
|
|
||
| void InternedBlob::DecrRefCount() const { | ||
| // Caller must ensure refcount does not go below 0 | ||
| const uint32_t updated_count = RefCount() - 1; | ||
| std::memcpy(blob_ + sizeof(uint32_t), &updated_count, sizeof(updated_count)); | ||
| } |
Copilot
AI
Jan 12, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The refcount increment and decrement operations are not atomic, which could lead to race conditions in concurrent scenarios. While the comment at line 15 states the pool is per-shard and thread_local, if InternedBlob instances are accessed from multiple fibers on the same shard, the read-modify-write sequence (RefCount() + 1, then memcpy) is not atomic. This could result in lost updates or incorrect refcounts. Consider using atomic operations or documenting that InternedBlob access must be properly synchronized.
src/core/json/detail/interned_blob.h
Outdated
| // pool. Note that the pool itself does not use stateless allocator or mimalloc heap, it probably | ||
| // could. |
Copilot
AI
Jan 12, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment states "This pool holds blobs and is used by InternedString to manage string access. node_hash_set instead of flat to maintain pointer stability, as interned strings hold onto pointers into this pool. Note that the pool itself does not use stateless allocator or mimalloc heap, it probably could." This is misleading because while the pool container doesn't use StatelessAllocator, the blob_ data inside each InternedBlob does use StatelessAllocator (see interned_blob.cc lines 14, 20, 87). The comment should clarify that only the pool's node allocation doesn't use StatelessAllocator, but the blob data itself does.
| // pool. Note that the pool itself does not use stateless allocator or mimalloc heap, it probably | |
| // could. | |
| // pool. Note that the pool's node allocations (the hash_set container itself) do not use | |
| // StatelessAllocator or the mimalloc heap, though the blob_ data inside each InternedBlob does. |
| void InternedBlob::IncrRefCount() const { | ||
| const uint32_t updated_count = RefCount() + 1; | ||
| std::memcpy(blob_ + sizeof(uint32_t), &updated_count, sizeof(updated_count)); | ||
| } |
Copilot
AI
Jan 12, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The IncrRefCount method does not check for potential reference count overflow. Since the refcount is uint32_t, incrementing it when it's already at UINT32_MAX would cause it to wrap to 0, leading to premature deletion of the blob. Consider adding a DCHECK or LOG(FATAL) to prevent this.
| InternedBlob::InternedBlob(const std::string_view sv) { | ||
| auto alloc = StatelessAllocator<char>{}; | ||
|
|
||
| constexpr uint32_t ref_count = 1; | ||
| const uint32_t str_len = sv.size(); | ||
|
|
||
| // We need \0 because jsoncons expects c_str() and data() style accessors on keys | ||
| blob_ = alloc.allocate(kHeaderSize + str_len + 1); | ||
|
|
||
| std::memcpy(blob_, &str_len, sizeof(uint32_t)); | ||
| std::memcpy(blob_ + sizeof(uint32_t), &ref_count, sizeof(uint32_t)); | ||
|
|
||
| std::memcpy(blob_ + kHeaderSize, sv.data(), str_len); | ||
|
|
||
| // null terminate so jsoncons can directly access the char* as string | ||
| blob_[kHeaderSize + str_len] = '\0'; | ||
| } |
Copilot
AI
Jan 12, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The InternedBlob constructor lacks exception safety. If memcpy throws (unlikely but possible with invalid pointers), or if allocation fails, the blob_ pointer is left in an inconsistent state. While StatelessAllocator likely throws on allocation failure rather than returning nullptr, the code should use RAII or ensure proper cleanup in case of exceptions. Consider using a unique_ptr with a custom deleter or ensuring proper exception handling.
| entry_->DecrRefCount(); | ||
|
|
||
| if (entry_->RefCount() == 0) { | ||
| GetPoolRef().erase(*entry_); | ||
| entry_ = nullptr; |
Copilot
AI
Jan 12, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Release() method has a use-after-free bug. After calling GetPoolRef().erase(*entry_), the entry_ pointer is dereferenced, but the blob has already been deleted from the pool. The erase operation will invoke the InternedBlob destructor, which deallocates the blob_ memory. Setting entry_ to nullptr after the erase is meaningless since we're working with a dangling pointer. The entry_ should be saved to a local variable, then set to nullptr before calling erase.
| entry_->DecrRefCount(); | |
| if (entry_->RefCount() == 0) { | |
| GetPoolRef().erase(*entry_); | |
| entry_ = nullptr; | |
| // Save the current entry pointer locally and clear the member pointer | |
| auto* entry = entry_; | |
| entry_ = nullptr; | |
| entry->DecrRefCount(); | |
| if (entry->RefCount() == 0) { | |
| GetPoolRef().erase(*entry); |
| return blob_ ? std::string_view{blob_ + kHeaderSize, Size()} : ""; | ||
| } | ||
|
|
||
| const char* InternedBlob::Data() const { | ||
| DCHECK(blob_) << "Called Data() on empty blob"; | ||
| return blob_ ? blob_ + kHeaderSize : nullptr; |
Copilot
AI
Jan 12, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The View() and Data() methods contain redundant null checks. The DCHECK at the start already validates that blob_ is not null, but the ternary operators check blob_ again. This is inconsistent - either remove the DCHECK assertions and keep the null checks, or remove the redundant null checks since the DCHECK ensures blob_ is not null. The latter is more appropriate since these methods should not be called on empty blobs.
| return blob_ ? std::string_view{blob_ + kHeaderSize, Size()} : ""; | |
| } | |
| const char* InternedBlob::Data() const { | |
| DCHECK(blob_) << "Called Data() on empty blob"; | |
| return blob_ ? blob_ + kHeaderSize : nullptr; | |
| return std::string_view{blob_ + kHeaderSize, Size()}; | |
| } | |
| const char* InternedBlob::Data() const { | |
| DCHECK(blob_) << "Called Data() on empty blob"; | |
| return blob_ + kHeaderSize; |
| return compare(other) < 0; | ||
| } | ||
| void InternedString::shrink_to_fit() { // NOLINT (must be non-const to align with jsoncons usage) | ||
| DCHECK(entry_ != nullptr); |
Copilot
AI
Jan 12, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The shrink_to_fit method has a DCHECK that is never evaluated because it's checking a condition that should be verified. The DCHECK statement "DCHECK(entry_ != nullptr);" appears to be incomplete - it should either verify an actual invariant or be removed if shrink_to_fit is expected to be a no-op for interned strings.
| DCHECK(entry_ != nullptr); |
3f576f3 to
cc627b4
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.
| } | ||
|
|
||
| void InternedString::ResetPool() { | ||
| GetPoolRef().clear(); |
Copilot
AI
Jan 12, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The ResetPool function clears the pool but does not deallocate the InternedBlob objects that are stored in it. This will cause a memory leak if the pool contains any entries when cleared. Consider iterating through the pool and deleting each blob before clearing, or ensure that this function is only called when the pool is guaranteed to be empty (i.e., all refcounts are zero).
| GetPoolRef().clear(); | |
| auto& pool = GetPoolRef(); | |
| for (InternedBlob* blob : pool) { | |
| delete blob; | |
| } | |
| pool.clear(); |
| void InternedBlob::IncrRefCount() { | ||
| const uint32_t updated_count = RefCount() + 1; | ||
| std::memcpy(blob_ - sizeof(uint32_t), &updated_count, sizeof(updated_count)); | ||
| } |
Copilot
AI
Jan 12, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The increment and decrement operations on refcount do not check for overflow or underflow. While the comment on line 78 states "Caller must ensure refcount does not go below 0", there are no safeguards to detect when uint32_t refcount overflows past its maximum value during IncrRefCount. Consider adding DCHECK or runtime assertions to detect these conditions in debug builds.
| std::string_view InternedBlob::View() const { | ||
| DCHECK(blob_) << "Called View() on empty blob"; | ||
| return {blob_, Size()}; |
Copilot
AI
Jan 12, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The View method returns a string_view constructed from the blob_ pointer without explicitly providing the size. While this works because blob_ is null-terminated, it relies on the std::string_view constructor that searches for the null terminator, which may be less efficient than providing the size explicitly. Consider using string_view{blob_, Size()} for better performance and clarity.
|
|
||
| using TmpJson = jsoncons::json; | ||
| using JsonType = jsoncons::basic_json<char, jsoncons::sorted_policy, StatelessAllocator<char>>; | ||
|
|
Copilot
AI
Jan 12, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The InternedStringPolicy struct lacks documentation explaining its purpose and how it integrates with jsoncons. Consider adding a comment that explains this is a policy class for jsoncons::basic_json that customizes the member key type to use InternedString for memory optimization of JSON object keys.
| // Policy class for jsoncons::basic_json that customizes the member key type. | |
| // Inherits from jsoncons::sorted_policy and replaces the default object member | |
| // key type with detail::InternedString so that JSON object keys are interned, | |
| // reducing memory usage when many identical keys are present. |
8321c49 to
d7ef37e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 9 out of 9 changed files in this pull request and generated no new comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.
| #include "core/detail/stateless_allocator.h" | ||
|
|
||
| namespace { | ||
| constexpr size_t int_size = sizeof(uint32_t); |
Copilot
AI
Jan 14, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The constant int_size is misleading as it refers to the size of uint32_t, not int. Consider renaming to uint32_size or field_size for clarity.
| return *new_elem; | ||
| } | ||
|
|
||
| void InternedString::Acquire() { // NOLINT |
Copilot
AI
Jan 14, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The NOLINT comment lacks context explaining why the suppression is needed. Add a brief comment explaining the specific warning being suppressed and why it's safe to ignore.
| void InternedString::Acquire() { // NOLINT | |
| void InternedString::Acquire() { // NOLINT: small ref-counting helper; suppression avoids noise from style/usage warnings and is safe as it only increments an existing entry_. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.
4e8f135 to
6fab414
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 14 out of 14 changed files in this pull request and generated 7 comments.
| template <typename It> InternedString::InternedString(It begin, It end) { | ||
| if (begin == end) | ||
| return; | ||
|
|
||
| const auto size = std::distance(begin, end); | ||
| const auto data_ptr = &*begin; | ||
| entry_ = Intern(std::string_view(data_ptr, size)); | ||
| } |
Copilot
AI
Jan 15, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The iterator-based constructor uses std::distance and dereferences the iterator to get a data pointer. However, this assumes the iterator provides contiguous storage, which is not guaranteed for all iterator types. This constructor should either be constrained to contiguous iterators using concepts/SFINAE, or should build the string character-by-character for non-contiguous iterators.
| auto alloc = StatelessAllocator<char>{}; | ||
|
|
||
| constexpr uint32_t ref_count = 1; | ||
| const uint32_t str_len = sv.size(); |
Copilot
AI
Jan 15, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The string view size is stored as uint32_t, but std::string_view::size() returns size_t. On 64-bit systems, if the string view has a size larger than UINT32_MAX, this will cause silent truncation and memory corruption. The code should either validate that sv.size() fits in a uint32_t or use a larger type for the size field.
| const uint32_t str_len = sv.size(); | ||
|
|
||
| // We need +1 byte for \0 because jsoncons expects c_str() and data() style accessors on keys | ||
| blob_ = alloc.allocate(kHeaderSize + str_len + 1); |
Copilot
AI
Jan 15, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The allocation size calculation (kHeaderSize + str_len + 1) could theoretically overflow if str_len is close to UINT32_MAX, though this is unlikely in practice. For robustness, consider checking for overflow before allocation or using a checked arithmetic library.
| inline void CleanupStatelessAllocMR() { | ||
| for (auto it = detail::cleanup_queue.rbegin(); it != detail::cleanup_queue.rend(); ++it) { | ||
| (*it)(); | ||
| } |
Copilot
AI
Jan 15, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The cleanup_queue is not being cleared after running all cleanup functions. This means that on subsequent calls to CleanupStatelessAllocMR(), the queue will still contain function pointers that have already been executed. This could lead to use-after-free bugs if the cleanup functions capture state that gets deallocated. The queue should be cleared after executing all cleanup functions.
| } | |
| } | |
| detail::cleanup_queue.clear(); |
| mi_free(shard_); | ||
| shard_ = nullptr; | ||
| CompactObj::InitThreadLocal(nullptr); | ||
| detail::InternedString::ResetPool(); |
Copilot
AI
Jan 15, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The InternedString::ResetPool() call is placed before CleanupStatelessAllocMR(), but CleanupStatelessAllocMR() is responsible for calling the pool reset (as registered via EnqueueCleanup in GetPoolRef). This means ResetPool() is called twice: once explicitly here and once via the cleanup queue. The explicit call should be removed since the cleanup mechanism handles it automatically.
| detail::InternedString::ResetPool(); |
|
|
||
| info = (await p.execute())[-1] | ||
| assert info["used_memory"] < 2 * 1_000_000 # Table memory | ||
| assert info["used_memory"] < 4 * 1_000_000 # Table memory |
Copilot
AI
Jan 15, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The memory threshold has been doubled from 2MB to 4MB. While this change is likely necessary due to the additional overhead from InternedString/InternedBlob wrappers, the comment should be updated to reflect why this increase is needed, especially given that this is a WIP PR implementing interned strings.
| assert info["used_memory"] < 4 * 1_000_000 # Table memory | |
| # After FLUSHALL only a small amount of table metadata should remain. The threshold was raised | |
| # from 2MB to 4MB to account for overhead introduced by InternedString/InternedBlob wrappers | |
| # used by the interned strings implementation. | |
| assert info["used_memory"] < 4 * 1_000_000 |
| // BaseFamilyTest initializes the heap on shards serving transactions, the core running the test | ||
| // needs this initialized explicitly. | ||
| InitTLStatelessAllocMR(GetMemoryResource()); | ||
| detail::InternedString::ResetPool(); |
Copilot
AI
Jan 15, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same issue as in engine_shard.cc: InternedString::ResetPool() is being called explicitly before CleanupStatelessAllocMR(), but the cleanup mechanism already handles this. This results in the pool being reset twice, which is redundant and could mask issues if the cleanup mechanism fails. Remove the explicit ResetPool() call.
| detail::InternedString::ResetPool(); |
2da8c4d to
d36a8c0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 14 out of 14 changed files in this pull request and generated 3 comments.
| void TearDown() override { | ||
| CleanupStatelessAllocMR(); | ||
| } |
Copilot
AI
Jan 16, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The TearDown method should also reset the InternedString pool to prevent test contamination between test cases. Since InternedString uses a thread-local pool that persists across tests, strings created in one test could affect memory usage calculations in subsequent tests. Add detail::InternedString::ResetPool(); before CleanupStatelessAllocMR();.
| EXPECT_THAT(resp, IntArg(start_size)); | ||
| // When deleting, json object member is erased. This causes deletion of the blob and the data it | ||
| // contained. | ||
| const size_t size_after_delete = start_size - sizeof(detail::InternedBlob) - 16; |
Copilot
AI
Jan 16, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The magic number 16 is unclear. Add a comment explaining what this represents (e.g., the string data size for key 'a' plus allocation overhead, or jsoncons internal structures).
| const size_t size_after_delete = start_size - sizeof(detail::InternedBlob) - 16; | |
| const size_t size_after_delete = | |
| start_size - sizeof(detail::InternedBlob) - | |
| 16; // 16 bytes account for jsoncons' internal storage/overhead for the erased string value |
| // The interned string has access to a thread local pool of InternedBlob pointers. | ||
| // InternedString handles incrementing and decrementing reference counts of the blobs tied to its | ||
| // own lifecycle. It deletes the blob from the pool when refcount is 0. | ||
| // TODO examine cross shard json object interactions. Can a pool end up access from another shard? |
Copilot
AI
Jan 16, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This TODO raises an important thread-safety concern. Since the pool is thread-local, cross-shard access would be unsafe. Either confirm this is safe by design (JSON objects don't move between shards) or add runtime assertions to detect violations.
fdcd914 to
d36a8c0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 14 out of 14 changed files in this pull request and generated no new comments.
d36a8c0 to
3601e58
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 14 out of 14 changed files in this pull request and generated 7 comments.
| from .utility import * | ||
|
|
||
|
|
Copilot
AI
Jan 19, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The import statement 'from .utility import *' appears twice (lines 1 and 7). This is redundant and should be removed. Keep only one import from .utility at the top of the file.
| from .utility import * |
| InitTLStatelessAllocMR(GetMemoryResource()); | ||
| detail::InternedString::ResetPool(); | ||
| } | ||
|
|
Copilot
AI
Jan 19, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The SetUp method calls InitTLStatelessAllocMR and ResetPool, but there is no corresponding TearDown method to clean up these resources. Consider adding a TearDown method that calls CleanupStatelessAllocMR() to match the pattern used in other test files like interned_blob_test.cc and page_usage_stats_test.cc.
| void TearDown() override { | |
| CleanupStatelessAllocMR(); | |
| BaseFamilyTest::TearDown(); | |
| } |
| from redis import asyncio as aioredis | ||
| from .utility import * | ||
| import logging | ||
| import logging |
Copilot
AI
Jan 19, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The import statement for 'logging' is duplicated on lines 2 and 3. Remove the duplicate import.
| import logging |
|
|
||
| void InternedBlob::DecrRefCount() { | ||
| const uint32_t ref_count = RefCount(); | ||
| CHECK_GE(ref_count, 1ul) << "Attempt to decrease zero refcount"; |
Copilot
AI
Jan 19, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The literal '1ul' is used with CHECK_GE for comparing with a uint32_t variable. For consistency with the rest of the codebase and to avoid potential type issues, use '1u' instead of '1ul' since ref_count is uint32_t, not unsigned long.
| CHECK_GE(ref_count, 1ul) << "Attempt to decrease zero refcount"; | |
| CHECK_GE(ref_count, 1u) << "Attempt to decrease zero refcount"; |
| EXPECT_THAT(resp, IntArg(start_size)); | ||
| // When deleting, json object member is erased. This causes deletion of the blob and the data it | ||
| // contained. | ||
| const size_t size_after_delete = start_size - sizeof(detail::InternedBlob) - 16; |
Copilot
AI
Jan 19, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The magic number '16' is used to calculate the expected size after deletion. This appears to be related to the string size ("a":"some text"), but it's not documented. Consider adding a comment explaining what this 16 represents or using a named constant to make the calculation more maintainable and clear.
| blob.DecrRefCount(); | ||
| EXPECT_EQ(blob.RefCount(), 0); | ||
| EXPECT_DEATH(blob.DecrRefCount(), "Attempt to decrease zero refcount"); | ||
| blob.SetRefCount(std::numeric_limits<uint32_t>::max()); | ||
| EXPECT_DEATH(blob.IncrRefCount(), "Attempt to increase max refcount"); | ||
| } | ||
|
|
Copilot
AI
Jan 19, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test manipulates the reference count to 0 and then continues to use the blob object to test EXPECT_DEATH scenarios. After decrementing the refcount to 0 at line 66, the blob's internal state indicates it should be deleted. Using the blob after this point (lines 67-69) may lead to undefined behavior in production code, even though it works in the test. Consider restructuring the test to avoid this pattern or add a comment explaining why this is safe in the test context.
| blob.DecrRefCount(); | |
| EXPECT_EQ(blob.RefCount(), 0); | |
| EXPECT_DEATH(blob.DecrRefCount(), "Attempt to decrease zero refcount"); | |
| blob.SetRefCount(std::numeric_limits<uint32_t>::max()); | |
| EXPECT_DEATH(blob.IncrRefCount(), "Attempt to increase max refcount"); | |
| } | |
| EXPECT_EQ(blob.RefCount(), 1); | |
| } | |
| TEST_F(InternedBlobTest, RefCountsDeath) { | |
| // These death tests intentionally violate the refcount invariants to verify | |
| // that InternedBlob enforces them. Each scenario uses a fresh blob instance | |
| // scoped to the EXPECT_DEATH body, so no object is reused after reaching a | |
| // terminal refcount state in normal test execution. | |
| EXPECT_DEATH( | |
| [] { | |
| auto blob = InternedBlob{"1234567"}; | |
| blob.DecrRefCount(); // refcount goes from 1 to 0 | |
| blob.DecrRefCount(); // invalid: attempt to decrement zero refcount | |
| }(), | |
| "Attempt to decrease zero refcount"); | |
| EXPECT_DEATH( | |
| [] { | |
| auto blob = InternedBlob{"1234567"}; | |
| blob.SetRefCount(std::numeric_limits<uint32_t>::max()); | |
| blob.IncrRefCount(); // invalid: attempt to increment max refcount | |
| }(), | |
| "Attempt to increase max refcount"); | |
| } |
| from . import dfly_args | ||
| from .instance import DflyInstance, DflyInstanceFactory | ||
| from .instance import DflyInstanceFactory | ||
| from .utility import * |
Copilot
AI
Jan 19, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Import pollutes the enclosing namespace, as the imported module dragonfly.utility does not define 'all'.
The stateless allocator is backed by a thread local memory resource. In order to catch bugs during testing, this resource is reset to null pointer at the end of tests where applicable. One purpose is to catch issues with interned strings introduced in the next few changes, which use a thread local string pool which allocates from the stateless allocator. Setting the memory resource to null ensures that buggy behavior where blobs or strings outlive the memory resource is caught early, via the assert within the allocator. Signed-off-by: Abhijat Malviya <abhijat@dragonflydb.io>
20ec6da to
206cb5b
Compare
A blob contains one char * allocated on the mimalloc heap via stateless allocator. The blob contains a string, its size and reference count. The size and ref count are stored before the char data. The chars end with a nul char so as to be able to use the char* directly as a string. This is important for jsoncons which expects such an API. There are custom hash and eq operators which allow for heterogenous lookup ie do not require conversion to blob type for hashing/comparison, and also allow only using the string content for checks, ignoring ref. count and size. An interned pool is introduced which is simply a flat set of blob pointers. The blob itself does not control its reference count. In a future PR a string will be added, which will use refcount. Signed-off-by: Abhijat Malviya <abhijat@dragonflydb.io>
The string class wraps the previously added blob class. It decides at creation time whether to create a new blob, or bump up the refcount of an existing blob in the pool. The pool is thread local, and is used by the strings on that thread to store pointers to allocated blobs. When a string is destroyed it decrements the ref-count, dropping the blob if necessary (ie refcount == 0). A lot of boilerplate in this class is from trial and error, adding methods that jsoncons required to be able to use this class as a key. Signed-off-by: Abhijat Malviya <abhijat@dragonflydb.io>
A policy is defined which hooks interned strings into jsoncons. Tests are adjusted for the now reduced memory size. Engine shard is set up to reset (clear out) the blob pool before setting the memory resource to nullptr. This way no blobs are around which may refer to the memory resource after it is set to nullptr. Signed-off-by: Abhijat Malviya <abhijat@dragonflydb.io>
Some memory usage is increased because of the blob pool. The tests are adjusted accordingly. Signed-off-by: Abhijat Malviya <abhijat@dragonflydb.io>
206cb5b to
0593c8f
Compare
WIP