Skip to content

My personal implementation of malloc in an attempt to learn more about the workings of computer architecture and memory.

License

Notifications You must be signed in to change notification settings

IkechiNk/imalloc

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

9 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

imalloc - High-Performance Lock-Free Memory Allocator

Build Status Tests License: Unlicense

A high-performance, thread-safe memory allocator featuring segregated free lists and lock-free operations. Built from scratch to understand low-level memory management and demonstrate systems programming expertise.

πŸš€ Performance Highlights

Scenario imalloc system malloc Advantage
Small allocations (32B) 9.06 ns 9.47 ns 4% faster
Medium allocations (512B) 7.82 ns 9.41 ns 17% faster
Bulk allocations (10k blocks) 111 ΞΌs 732 ΞΌs 85% faster
Fragmented workloads 11.3 ΞΌs 22.8 ΞΌs 50% faster

Latest benchmarks on 16-core 3GHz CPU

🎯 When to Use imalloc vs malloc

βœ… imalloc Excels At:

  • High-frequency small/medium allocations (16B-8KB)
  • Bulk allocation patterns (85% faster than malloc)
  • Fragmentation-heavy workloads (50% faster)

⚠️ malloc Still Better For:

  • Very large allocations (>8KB) - imalloc uses direct mmap (slower)
  • Lower memory constraints - imalloc has a lot of internal fragmentation, wasting memory

πŸ”§ Sweet Spot Applications:

  • Game engines - frequent small object allocation
  • Web servers - request/response object pools
  • Data processing - temporary buffer management

✨ Key Features

  • Lock-free operations using atomic compare-and-swap for thread safety
  • Segregated free lists with optimized bucket sizes (16B-12KB)
  • mmap-based heap management for efficient memory usage
  • Complete malloc API - drop-in replacement with imalloc, ifree, icalloc, irealloc
  • Comprehensive test suite with 18 unit tests and performance benchmarks

πŸ—οΈ Architecture

imalloc uses a multi-step bucket system inspired by jemalloc:

Bucket Sizes:
β”œβ”€β”€ Tier 1: 16-128 bytes (16B increments) - Small objects
β”œβ”€β”€ Tier 2: 128-512 bytes (32B increments) - Medium objects  
└── Tier 3: 512-12KB (128B increments) - Large objects

Core Components:

  • Lock-free free lists - Atomic operations for thread-safe bucket management
  • Arena-based allocation - Efficient memory regions with mmap/munmap
  • Boundary tag system - Block headers for fast coalescing
  • Size-class optimization - Minimizes internal fragmentation

πŸ“¦ Quick Start

Prerequisites

  • C++20 compatible compiler (GCC 10+, Clang 12+)
  • CMake 3.14+
  • Linux/Unix system (uses mmap/munmap)

Optional Development Tools:

  • clang-tidy (for static analysis and code quality checks)

Build

git clone https://github.com/username/imalloc.git
cd imalloc
mkdir build && cd build
cmake ..
make

Build with Static Analysis

# Enable clang-tidy during configuration
cmake .. -DENABLE_CLANG_TIDY=ON
make

# Run static analysis
make tidy

Test

# Run all tests
./test_imalloc

# Run only benchmarks
./test_imalloc --gtest_filter="*Benchmark*"

πŸ’» Usage Examples

Basic Usage

#include "imalloc.h"

// Drop-in replacement for malloc/free
void* ptr = imalloc(1024);
memset(ptr, 0, 1024);
ifree(ptr);

// Zeroed allocation
void* clean_mem = icalloc(100, sizeof(int));
ifree(clean_mem);

// Reallocation
void* small = imalloc(64);
void* large = irealloc(small, 1024); // Efficiently resized
ifree(large);

Integration Example

// Custom allocator for containers
template<typename T>
class IMallocAllocator {
public:
    T* allocate(size_t n) {
        return static_cast<T*>(imalloc(n * sizeof(T)));
    }
    
    void deallocate(T* ptr, size_t) {
        ifree(ptr);
    }
};

// Use with STL containers
std::vector<int, IMallocAllocator<int>> fast_vector;

πŸ§ͺ Technical Implementation Details

Lock-Free Algorithm Design

Free List Operations:

// Lock-free push using atomic compare-and-swap
do {
    old_head = free_lists[bucket].load(memory_order_acquire);
    block->next_free = old_head;
} while (!free_lists[bucket].compare_exchange_weak(
    old_head, block, memory_order_release, memory_order_relaxed));

Key Design Decisions:

  • Memory ordering: acquire for loads, release for stores ensures proper synchronization
  • ABA prevention: Using pointers instead of indices eliminates ABA problem in most cases
  • Retry loops: CAS failures trigger automatic retry without blocking other threads

Bucket Sizing Mathematics

Three-Tier System Design:

Tier 1 (16-128B):   16B steps  β†’ 8 buckets  β†’ Small objects
Tier 2 (128-512B):  32B steps  β†’ 12 buckets β†’ Medium objects  
Tier 3 (512-12KB):  128B steps β†’ 92 buckets β†’ Large objects
Total: 112 buckets

Size-to-Bucket Mapping:

size_t get_bucket_index(size_t size) {
    if (size <= 128)  return (size - 16 + 15) / 16;        // O(1) division
    if (size <= 512)  return 8 + (size - 128 + 31) / 32;  // Constant offset  
    return 20 + (size - 512 + 127) / 128;                 // Minimal buckets
}

This design helps to lessen the amount of internal fragmentation at smaller sizes while keeping bucket count and initial memory usage manageable.

Memory Layout & Alignment

Block Structure:

[BlockHeader: 16B][Payload: N bytes][Padding: align to 16B]
β”œβ”€ size: 8B           
β”œβ”€ next_free: 8B (pointer to next free block or LONE_BLOCK_POINTER)
└─ Payload aligned to max_align_t (16B)

Arena Organization:

[ArenaHeader: 24B][Block₁][Blockβ‚‚]...[BlockN]
β”œβ”€ size: 8B     
β”œβ”€ prev: 8B     
└─ next: 8B     

All arenas are page-aligned (4KB) for optimal mmap performance.

Thread Safety Implementation

Lock-Free Components:

  • Free list operations: Pure atomic CAS-based, no blocking
  • Bucket initialization: One-time atomic flag with spin-wait
  • Memory allocation: Fully concurrent for different size classes

Mutex-Protected Components:

  • Arena list management: Complex doubly-linked list operations
  • Large allocation tracking: Infrequent but requires consistency

Memory Ordering Guarantees:

// Load with acquire semantics
head = free_lists[bucket].load(memory_order_acquire);

// Store with release semantics  
arena_list.store(new_arena, memory_order_release);

This ensures all memory operations before a release are visible after an acquire.

Performance Engineering

Cache Optimization:

  • Sequential block layout: New allocations come from the same cache line
  • Minimal metadata: 16B header
  • Hot path optimization: Common allocations avoid mmap syscalls as they can be placed on the preallocated arenas

Memory Management Strategy:

  • Lazy expansion: Only create new arenas when current ones are exhausted
  • Bulk initialization: Pre-allocate entire arena's worth of blocks

Size Class Design Rationale:

Small (16-128B):  Most common, tight packing crucial
Medium (128-512B): Object-oriented allocations  
Large (512KB+):   Are commonly array allocations, less frequent

πŸ› οΈ Development

Running Tests

# All tests (unit + benchmarks)
./test_imalloc

# Unit tests only  
./test_imalloc --gtest_filter="-*Benchmark*"

# Specific test category
./test_imalloc --gtest_filter="IMallocTest.*"

Static Analysis

The project includes comprehensive static analysis using clang-tidy:

# Configure with static analysis enabled
cmake .. -DENABLE_CLANG_TIDY=ON

# Run clang-tidy analysis
make tidy

Enabled Checks:

  • readability-* - Code readability and style
  • performance-* - Performance optimization opportunities
  • modernize-* - Modern C++ best practices
  • bugprone-* - Potential bug detection
  • clang-analyzer-* - Deep static analysis

The codebase maintains high quality standards with zero critical static analysis warnings.

🎯 Design Goals

Achieved:

  • βœ… Constant-time bucket lookup
  • βœ… Thread-safe with efficient use of locks
  • βœ… High performance compared system malloc

Technical Challenges Solved:

  • Lock-free data structures with ABA prevention
  • Memory coalescing algorithm optimization
  • Cache-efficient bucket organization
  • Cross-platform mmap abstraction

About

My personal implementation of malloc in an attempt to learn more about the workings of computer architecture and memory.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published