A high-performance, thread-safe memory allocator featuring segregated free lists and lock-free operations. Built from scratch to understand low-level memory management and demonstrate systems programming expertise.
| Scenario | imalloc | system malloc | Advantage |
|---|---|---|---|
| Small allocations (32B) | 9.06 ns | 9.47 ns | 4% faster |
| Medium allocations (512B) | 7.82 ns | 9.41 ns | 17% faster |
| Bulk allocations (10k blocks) | 111 μs | 732 μs | 85% faster |
| Fragmented workloads | 11.3 μs | 22.8 μs | 50% faster |
Latest benchmarks on 16-core 3GHz CPU
Where imalloc excels:
- High-frequency small/medium allocations (16B-8KB)
- Bulk allocation patterns (85% faster than malloc)
- Fragmentation-heavy workloads (50% faster)
Where system malloc is the better choice:
- Very large allocations (>8KB) - imalloc falls back to direct mmap, which is slower
- Tight memory budgets - imalloc's internal fragmentation wastes memory
Typical use cases:
- Game engines - frequent small object allocation
- Web servers - request/response object pools
- Data processing - temporary buffer management
- Lock-free operations using atomic compare-and-swap for thread safety
- Segregated free lists with optimized bucket sizes (16B-12KB)
- mmap-based heap management for efficient memory usage
- Complete malloc API - drop-in replacement with `imalloc`, `ifree`, `icalloc`, `irealloc`
- Comprehensive test suite with 18 unit tests and performance benchmarks
imalloc uses a multi-step bucket system inspired by jemalloc:
Bucket Sizes:
├── Tier 1: 16-128 bytes (16B increments) - Small objects
├── Tier 2: 128-512 bytes (32B increments) - Medium objects
└── Tier 3: 512B-12KB (128B increments) - Large objects
Core Components:
- Lock-free free lists - Atomic operations for thread-safe bucket management
- Arena-based allocation - Efficient memory regions with mmap/munmap
- Boundary tag system - Block headers for fast coalescing
- Size-class optimization - Minimizes internal fragmentation
- C++20 compatible compiler (GCC 10+, Clang 12+)
- CMake 3.14+
- Linux/Unix system (uses mmap/munmap)
Optional Development Tools:
- clang-tidy (for static analysis and code quality checks)
git clone https://github.com/username/imalloc.git
cd imalloc
mkdir build && cd build
cmake ..
make
# Enable clang-tidy during configuration
cmake .. -DENABLE_CLANG_TIDY=ON
make
# Run static analysis
make tidy
# Run all tests
./test_imalloc
# Run only benchmarks
./test_imalloc --gtest_filter="*Benchmark*"
#include "imalloc.h"
#include <cstring>  // memset
// Drop-in replacement for malloc/free
void* ptr = imalloc(1024);
memset(ptr, 0, 1024);
ifree(ptr);
// Zeroed allocation
void* clean_mem = icalloc(100, sizeof(int));
ifree(clean_mem);
// Reallocation
void* small = imalloc(64);
void* large = irealloc(small, 1024); // Efficiently resized
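ifree(large);
// Custom allocator for containers
#include <new>     // std::bad_alloc
#include <vector>
template<typename T>
class IMallocAllocator {
public:
    using value_type = T;  // Required by std::allocator_traits
    IMallocAllocator() = default;
    template<typename U>
    IMallocAllocator(const IMallocAllocator<U>&) noexcept {}  // Allows rebinding to other types
    T* allocate(size_t n) {
        void* p = imalloc(n * sizeof(T));
        if (!p) throw std::bad_alloc();  // Assumes imalloc returns nullptr on failure, like malloc
        return static_cast<T*>(p);
    }
    void deallocate(T* ptr, size_t) noexcept {
        ifree(ptr);
    }
};
// Stateless allocator: any two instances are interchangeable
template<typename T, typename U>
bool operator==(const IMallocAllocator<T>&, const IMallocAllocator<U>&) noexcept { return true; }
// Use with STL containers
std::vector<int, IMallocAllocator<int>> fast_vector;
Free List Operations: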
// Lock-free push using atomic compare-and-swap
do {
old_head = free_lists[bucket].load(memory_order_acquire);
block->next_free = old_head;
} while (!free_lists[bucket].compare_exchange_weak(
old_head, block, memory_order_release, memory_order_relaxed));
Key Design Decisions:
- Memory ordering: `acquire` for loads and `release` for stores ensure proper synchronization
- ABA prevention: Using pointers instead of indices avoids the ABA problem in most cases
- Retry loops: CAS failures trigger automatic retry without blocking other threads
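The push loop above can be dropped into a small, self-contained sketch to see it work end to end. `FreeBlock` and `FreeList` are hypothetical stand-ins for imalloc's block header and `free_lists[bucket]`, and the `pop` shown here is an assumption about how the matching removal might look, not the project's actual code. A plain CAS pop like this one is still exposed to ABA if a block is popped and pushed back while another thread is mid-pop, so a production version would need a version tag or a similar safeguard.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// Hypothetical free-block node; the real header also stores a size.
struct FreeBlock {
    FreeBlock* next_free = nullptr;
};

// One list head, standing in for free_lists[bucket].
struct FreeList {
    std::atomic<FreeBlock*> head{nullptr};

    // Push: link the block in front of the current head and publish it
    // with a release CAS. On failure, compare_exchange_weak refreshes
    // old_head with the current value, so the loop relinks and retries.
    void push(FreeBlock* block) {
        assert(block != nullptr);
        FreeBlock* old_head = head.load(std::memory_order_relaxed);
        do {
            block->next_free = old_head;
        } while (!head.compare_exchange_weak(
            old_head, block,
            std::memory_order_release, std::memory_order_relaxed));
    }

    // Pop: returns nullptr when the list is empty.
    // Caveat: not ABA-safe on its own (see the note above).
    FreeBlock* pop() {
        FreeBlock* old_head = head.load(std::memory_order_acquire);
        while (old_head != nullptr &&
               !head.compare_exchange_weak(
                   old_head, old_head->next_free,
                   std::memory_order_acquire, std::memory_order_acquire)) {
            // old_head was refreshed by the failed CAS; retry.
        }
        return old_head;
    }
};
```

Blocks come back in last-in, first-out order, which tends to hand out recently freed (and therefore cache-warm) memory first.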
Three-Tier System Design:
Tier 1 (16-128B): 16B steps → 8 buckets → Small objects
Tier 2 (128-512B): 32B steps → 12 buckets → Medium objects
Tier 3 (512B-12KB): 128B steps → 92 buckets → Large objects
Total: 112 buckets
Size-to-Bucket Mapping:
size_t get_bucket_index(size_t size) {
    if (size < 16) size = 16;                            // Clamp to the smallest class; avoids unsigned underflow for size 0
    if (size <= 128) return (size - 16 + 15) / 16;       // Tier 1: buckets 0-7, O(1) division
    if (size <= 512) return 7 + (size - 128 + 31) / 32;  // Tier 2: buckets 8-19, constant offset
    return 19 + (size - 512 + 127) / 128;                // Tier 3: buckets 20-111, minimal buckets
}
This design reduces internal fragmentation at smaller sizes while keeping the bucket count and initial memory usage manageable.
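A quick way to check a mapping like this is to pair it with its inverse and confirm that every request size lands in the smallest bucket that can hold it. The sketch below assumes contiguous zero-based indices 0-111 (8 + 12 + 92 buckets, so tier 2 starts at index 8 and tier 3 at index 20); `bucket_index`, `bucket_block_size`, and `mapping_is_consistent` are illustrative names, not part of imalloc's API.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kBucketCount = 112;        // 8 + 12 + 92
constexpr std::size_t kMaxBucketed = 12 * 1024;  // largest bucketed size

// Request size -> bucket index (assumed zero-based and contiguous).
constexpr std::size_t bucket_index(std::size_t size) {
    if (size < 16) size = 16;                            // clamp tiny requests
    if (size <= 128) return (size - 1) / 16;             // indices 0..7
    if (size <= 512) return 7 + (size - 128 + 31) / 32;  // indices 8..19
    return 19 + (size - 512 + 127) / 128;                // indices 20..111
}

// Bucket index -> block size served by that bucket (inverse mapping).
constexpr std::size_t bucket_block_size(std::size_t index) {
    assert(index < kBucketCount);
    if (index < 8) return 16 * (index + 1);          // 16, 32, ..., 128
    if (index < 20) return 128 + 32 * (index - 7);   // 160, 192, ..., 512
    return 512 + 128 * (index - 19);                 // 640, 768, ..., 12288
}

// True if every size in [1, 12KB] maps to an in-range bucket that fits the
// request and is the tightest such bucket.
inline bool mapping_is_consistent() {
    for (std::size_t size = 1; size <= kMaxBucketed; ++size) {
        const std::size_t index = bucket_index(size);
        if (index >= kBucketCount) return false;
        if (bucket_block_size(index) < size) return false;
        if (index > 0 && bucket_block_size(index - 1) >= size) return false;
    }
    return true;
}
```

Under these assumptions the worst-case waste per block is bounded by the tier's step size: under 16 bytes in tier 1, under 32 in tier 2, and under 128 in tier 3.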
Block Structure:
[BlockHeader: 16B][Payload: N bytes][Padding: align to 16B]
├─ size: 8B
├─ next_free: 8B (pointer to next free block or LONE_BLOCK_POINTER)
└─ Payload aligned to max_align_t (16B)
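The layout above can be sketched as a struct plus the pointer arithmetic that converts between a header and its payload. This is an illustration under the assumption of a 64-bit target (8-byte `size_t` and pointers); `align_up`, `payload_of`, `header_of`, and `block_footprint` are hypothetical helper names, not imalloc functions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical mirror of the 16-byte header described above.
struct BlockHeader {
    std::size_t size;        // payload size in bytes
    BlockHeader* next_free;  // next free block in the bucket's list
};
static_assert(sizeof(BlockHeader) == 16,
              "layout assumes 8-byte size_t and pointers");

constexpr std::size_t kAlignment = 16;

// Round a request up to the next multiple of 16 bytes.
constexpr std::size_t align_up(std::size_t n) {
    return (n + kAlignment - 1) & ~(kAlignment - 1);
}

// The payload starts immediately after the header.
inline void* payload_of(BlockHeader* header) {
    return reinterpret_cast<unsigned char*>(header) + sizeof(BlockHeader);
}

// Recover the header from a pointer previously handed to the caller.
inline BlockHeader* header_of(void* payload) {
    assert(payload != nullptr);
    return reinterpret_cast<BlockHeader*>(
        static_cast<unsigned char*>(payload) - sizeof(BlockHeader));
}

// Total bytes a block occupies: header plus payload padded to 16 bytes.
constexpr std::size_t block_footprint(std::size_t payload_bytes) {
    return sizeof(BlockHeader) + align_up(payload_bytes);
}
```

Because the header is exactly 16 bytes, a 16-byte-aligned header implies a 16-byte-aligned payload, which is why a free routine can locate the header with a single subtraction.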
Arena Organization:
[ArenaHeader: 24B][Block₁][Block₂]...[BlockN]
├─ size: 8B
├─ prev: 8B
└─ next: 8B
All arenas are page-aligned (4KB) for optimal mmap performance.
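As a rough illustration of how an arena like this might be obtained, the sketch below maps anonymous memory with `mmap`, rounds the request up to whole pages, and writes an `ArenaHeader` at the start. `round_to_pages`, `create_arena`, and `destroy_arena` are hypothetical names, and linking the arena into the doubly-linked list is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>
#include <sys/mman.h>
#include <unistd.h>

// Hypothetical mirror of the 24-byte arena header described above.
struct ArenaHeader {
    std::size_t size;   // total mapped bytes, header included
    ArenaHeader* prev;
    ArenaHeader* next;
};

// Round a byte count up to a whole number of pages.
inline std::size_t round_to_pages(std::size_t bytes) {
    const std::size_t page = static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
    return (bytes + page - 1) / page * page;
}

// Map a fresh arena with room for at least min_bytes after the header.
// Returns nullptr if the mapping fails.
inline ArenaHeader* create_arena(std::size_t min_bytes) {
    const std::size_t total = round_to_pages(min_bytes + sizeof(ArenaHeader));
    void* mem = mmap(nullptr, total, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) return nullptr;
    // mmap returns page-aligned memory, so the header sits on a page boundary.
    return new (mem) ArenaHeader{total, nullptr, nullptr};
}

// Return the whole arena to the operating system.
inline bool destroy_arena(ArenaHeader* arena) {
    assert(arena != nullptr);
    return munmap(arena, arena->size) == 0;
}
```

Storing the mapped size in the header means `munmap` can be called later without any external bookkeeping.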
Lock-Free Components:
- Free list operations: Pure atomic CAS-based, no blocking
- Bucket initialization: One-time atomic flag with spin-wait
- Memory allocation: Fully concurrent for different size classes
Mutex-Protected Components:
- Arena list management: Complex doubly-linked list operations
- Large allocation tracking: Infrequent but requires consistency
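The "one-time atomic flag with spin-wait" used for bucket initialization might look something like the sketch below: the first thread to win a CAS runs the initializer, and any thread that arrives while it is in progress spins until the state becomes ready. `InitState` and `ensure_initialized` are hypothetical names chosen for this example; the project's actual flag layout is not shown in this README.

```cpp
#include <atomic>
#include <cassert>

// Three-state flag: nobody started, someone is initializing, done.
enum class InitState : int { Uninitialized, InProgress, Ready };

// Runs init() exactly once across all callers sharing `state`.
// Callers that lose the race spin until the winner publishes Ready.
template <typename Init>
void ensure_initialized(std::atomic<InitState>& state, Init&& init) {
    // Fast path: already initialized.
    if (state.load(std::memory_order_acquire) == InitState::Ready) return;

    InitState expected = InitState::Uninitialized;
    if (state.compare_exchange_strong(expected, InitState::InProgress,
                                      std::memory_order_acquire,
                                      std::memory_order_acquire)) {
        init();  // only the CAS winner gets here
        // Release store: everything init() wrote becomes visible to
        // threads that later observe Ready with an acquire load.
        state.store(InitState::Ready, std::memory_order_release);
        return;
    }

    // Another thread is initializing (or just finished): spin-wait.
    while (state.load(std::memory_order_acquire) != InitState::Ready) {
        // busy-wait; initialization is expected to be short
    }
}
```

The spin is only hit during the brief window in which a bucket is first set up; after that every call takes the single-load fast path.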
Memory Ordering Guarantees:
// Load with acquire semantics
head = free_lists[bucket].load(memory_order_acquire);
// Store with release semantics
arena_list.store(new_arena, memory_order_release);
This ensures that all writes made before a release store are visible to any thread whose acquire load observes that store.
Cache Optimization:
- Sequential block layout: Blocks are laid out back to back, so consecutive allocations tend to share or neighbor cache lines
- Minimal metadata: 16B header per block
- Hot path optimization: Common allocations avoid mmap syscalls because they are served from preallocated arenas
Memory Management Strategy:
- Lazy expansion: Only create new arenas when current ones are exhausted
- Bulk initialization: Pre-allocate entire arena's worth of blocks
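Bulk initialization could be sketched roughly as follows: a raw region is carved into equally sized blocks in one pass, and the blocks are chained through `next_free` so the whole chain can be handed to a bucket's free list. `carve_blocks` is a hypothetical helper, and the `BlockHeader` here is a simplified stand-in that assumes a 16-byte header and payload sizes that are multiples of 16.

```cpp
#include <cassert>
#include <cstddef>
#include <new>

// Simplified stand-in for the 16-byte block header.
struct BlockHeader {
    std::size_t size;
    BlockHeader* next_free;
};

// Carve `bytes` of raw memory into blocks of (header + payload_size) and
// chain them in address order. Returns the first block, or nullptr if the
// region is too small for even one block; count_out receives the number of
// blocks created. Leftover bytes at the end of the region are unused.
inline BlockHeader* carve_blocks(void* region, std::size_t bytes,
                                 std::size_t payload_size,
                                 std::size_t& count_out) {
    assert(region != nullptr);
    assert(payload_size > 0 && payload_size % 16 == 0);
    const std::size_t stride = sizeof(BlockHeader) + payload_size;
    auto* cursor = static_cast<unsigned char*>(region);
    BlockHeader* head = nullptr;
    BlockHeader* tail = nullptr;
    count_out = 0;
    for (; bytes >= stride; bytes -= stride, cursor += stride) {
        auto* block = new (cursor) BlockHeader{payload_size, nullptr};
        if (tail != nullptr) {
            tail->next_free = block;  // append to the chain
        } else {
            head = block;             // first block becomes the head
        }
        tail = block;
        ++count_out;
    }
    return head;
}
```

Chaining in address order keeps neighboring list entries adjacent in memory, which fits the sequential block layout described under Cache Optimization.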
Size Class Design Rationale:
Small (16-128B): Most common, tight packing crucial
Medium (128-512B): Object-oriented allocations
Large (512B-12KB): Commonly array allocations, less frequent
# All tests (unit + benchmarks)
./test_imalloc
# Unit tests only
./test_imalloc --gtest_filter="-*Benchmark*"
# Specific test category
./test_imalloc --gtest_filter="IMallocTest.*"
The project includes comprehensive static analysis using clang-tidy:
# Configure with static analysis enabled
cmake .. -DENABLE_CLANG_TIDY=ON
# Run clang-tidy analysis
make tidy
Enabled Checks:
- `readability-*` - Code readability and style
- `performance-*` - Performance optimization opportunities
- `modernize-*` - Modern C++ best practices
- `bugprone-*` - Potential bug detection
- `clang-analyzer-*` - Deep static analysis
The codebase maintains high quality standards with zero critical static analysis warnings.
Achieved:
- ✅ Constant-time bucket lookup
- ✅ Thread-safe with efficient use of locks
- ✅ Higher performance than system malloc in the benchmarked scenarios
Technical Challenges Solved:
- Lock-free data structures with ABA prevention
- Memory coalescing algorithm optimization
- Cache-efficient bucket organization
- Cross-platform mmap abstraction