
[SIMD] Adaptive Vector Width: Support Vector128/256/512 Based on Hardware #579

@Nucs

Description

Overview

Add runtime detection and adaptive code generation to support all vector widths (128, 256, 512 bits) based on available hardware. Currently hardcoded to Vector256 only.

Parent issue: #545

Current State

// ILKernelGenerator.cs - HARDCODED to Vector256
private static int GetVectorCount<T>() => Vector256<T>.Count;

// EmitVectorLoad()      → typeof(Vector256).GetMethod("Load", ...)
// EmitVectorStore()     → typeof(Vector256).GetMethod("Store", ...)
// EmitVectorOperation() → typeof(Vector256<T>).GetMethod("op_Addition", ...)

Problem

| Hardware | Vector Support | Current NumSharp | Issue |
|----------|----------------|------------------|-------|
| Intel Xeon, AMD Zen4 | V512 ✓ | Uses V256 | Missing 2× speedup |
| Most consumer CPUs | V256 ✓ | Uses V256 | OK |
| Older CPUs, ARM | V128 only | Crashes or falls back to scalar | No SIMD benefit |
| No SIMD | None | Falls back to scalar | OK |

Solution: Runtime Detection + Parameterized Emission

Step 1: Detect Hardware Once at Startup

public static class ILKernelGenerator
{
    /// <summary>
    /// Detected vector width at startup. Checked once, used forever.
    /// 512, 256, 128, or 0 (no SIMD).
    /// </summary>
    public static readonly int VectorBits = 
        Vector512.IsHardwareAccelerated ? 512 :
        Vector256.IsHardwareAccelerated ? 256 :
        Vector128.IsHardwareAccelerated ? 128 : 0;
}
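The task list below mentions forcing the V128 path in unit tests. One way to do that without reflection hacks is to route the readonly field through a detection helper that honors an environment-variable override. A minimal sketch, assuming a hypothetical `NUMSHARP_VECTOR_BITS` variable (not an existing NumSharp setting):

```csharp
using System;
using System.Runtime.Intrinsics;

public static class VectorWidthDetector
{
    // Detected once at startup; tests can force a narrower width (or 0
    // for the scalar path) via the hypothetical NUMSHARP_VECTOR_BITS.
    public static readonly int VectorBits = Detect();

    private static int Detect()
    {
        var forced = Environment.GetEnvironmentVariable("NUMSHARP_VECTOR_BITS");
        if (int.TryParse(forced, out int bits) &&
            (bits == 0 || bits == 128 || bits == 256 || bits == 512))
            return bits;

        return Vector512.IsHardwareAccelerated ? 512 :
               Vector256.IsHardwareAccelerated ? 256 :
               Vector128.IsHardwareAccelerated ? 128 : 0;
    }
}
```

Since the override is read before the hardware check, a CI job can exercise the V128 codegen path on AVX2 machines without touching the field via reflection.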

Step 2: Parameterize Helper Methods

/// <summary>
/// Get element count for current hardware's vector width.
/// </summary>
private static int GetVectorCount(NPTypeCode type)
{
    int typeSize = GetTypeSize(type);
    return VectorBits / (typeSize * 8);  // bits / bits-per-element
}
// VectorBits=512, Int32 → 512/32 = 16 elements
// VectorBits=256, Int32 → 256/32 = 8 elements
// VectorBits=128, Int32 → 128/32 = 4 elements

/// <summary>
/// Get the Vector container type (Vector128, Vector256, or Vector512).
/// </summary>
private static Type GetVectorContainerType() => VectorBits switch
{
    512 => typeof(Vector512),
    256 => typeof(Vector256),
    128 => typeof(Vector128),
    _ => throw new NotSupportedException("No SIMD support")
};

/// <summary>
/// Get the closed generic vector type (Vector128<T>, Vector256<T>, or Vector512<T>) for the current width.
/// </summary>
private static Type GetVectorType(Type elementType) => VectorBits switch
{
    512 => typeof(Vector512<>).MakeGenericType(elementType),
    256 => typeof(Vector256<>).MakeGenericType(elementType),
    128 => typeof(Vector128<>).MakeGenericType(elementType),
    _ => throw new NotSupportedException("No SIMD support")
};

/// <summary>
/// Check if SIMD is available for this type.
/// </summary>
private static bool CanUseSimd(NPTypeCode type)
{
    if (VectorBits == 0) return false;  // No SIMD hardware
    
    return type switch
    {
        NPTypeCode.Byte => true,
        NPTypeCode.Int16 or NPTypeCode.UInt16 => true,
        NPTypeCode.Int32 or NPTypeCode.UInt32 => true,
        NPTypeCode.Int64 or NPTypeCode.UInt64 => true,
        NPTypeCode.Single or NPTypeCode.Double => true,
        _ => false  // Boolean, Char, Decimal - no SIMD
    };
}

Step 3: Update Emit Methods

private static void EmitVectorLoad(ILGenerator il, NPTypeCode type)
{
    var containerType = GetVectorContainerType();  // Vector128/256/512
    var elementType = GetClrType(type);
    
    var loadMethod = containerType
        .GetMethod("Load", BindingFlags.Public | BindingFlags.Static)
        .MakeGenericMethod(elementType);
    
    il.EmitCall(OpCodes.Call, loadMethod, null);
}

private static void EmitVectorStore(ILGenerator il, NPTypeCode type)
{
    var containerType = GetVectorContainerType();
    var elementType = GetClrType(type);
    
    var storeMethod = containerType
        .GetMethods(BindingFlags.Public | BindingFlags.Static)
        .First(m => m.Name == "Store" && 
                    m.GetParameters().Length == 2 &&
                    m.GetParameters()[0].ParameterType.IsGenericType)
        .MakeGenericMethod(elementType);
    
    il.EmitCall(OpCodes.Call, storeMethod, null);
}

private static void EmitVectorOperation(ILGenerator il, BinaryOp op, NPTypeCode type)
{
    var elementType = GetClrType(type);
    var vectorType = GetVectorType(elementType);  // Vector128<T>/256<T>/512<T>
    
    string methodName = op switch
    {
        BinaryOp.Add => "op_Addition",
        BinaryOp.Subtract => "op_Subtraction",
        BinaryOp.Multiply => "op_Multiply",
        BinaryOp.Divide => "op_Division",
        _ => throw new NotSupportedException()
    };
    
    var opMethod = vectorType.GetMethod(methodName, 
        BindingFlags.Public | BindingFlags.Static,
        null, new[] { vectorType, vectorType }, null);
    
    il.EmitCall(OpCodes.Call, opMethod, null);
}

private static void EmitVectorCreate(ILGenerator il, NPTypeCode type)
{
    var containerType = GetVectorContainerType();
    var elementType = GetClrType(type);
    
    var createMethod = containerType.GetMethod("Create", new[] { elementType });
    il.EmitCall(OpCodes.Call, createMethod, null);
}
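The reflection lookups above can be smoke-tested without emitting any IL. A sketch that checks the `Load`, `op_Addition`, and `Create` members resolve for a given width (on recent .NET versions; `Vector512` requires .NET 8):

```csharp
using System;
using System.Reflection;
using System.Runtime.Intrinsics;

public static class EmitLookupSmokeTest
{
    // containerType is the non-generic static class (Vector128/256/512);
    // vectorType is the matching closed generic (e.g. Vector256<int>).
    public static bool Resolves(Type containerType, Type vectorType, Type elementType)
    {
        // Load<T>(T*) - generic definition on the static class.
        var load = containerType
            .GetMethod("Load", BindingFlags.Public | BindingFlags.Static)
            ?.MakeGenericMethod(elementType);

        // op_Addition(VectorN<T>, VectorN<T>) on the closed generic type.
        var add = vectorType.GetMethod("op_Addition",
            BindingFlags.Public | BindingFlags.Static,
            null, new[] { vectorType, vectorType }, null);

        // Create(elementType) - broadcast overload on the static class.
        var create = containerType.GetMethod("Create", new[] { elementType });

        return load != null && add != null && create != null;
    }
}

// e.g. EmitLookupSmokeTest.Resolves(typeof(Vector256), typeof(Vector256<int>), typeof(int))
```

Running this once per supported `(width, element type)` pair in the test suite catches API-surface differences between widths before they surface as emit-time `NullReferenceException`s.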

Step 4: Loop Code Unchanged!

The SIMD loop structure stays exactly the same - only vectorCount changes:

private static void EmitSimdLoop(ILGenerator il, ...)
{
    int vectorCount = GetVectorCount(resultType);  // 4, 8, or 16
    
    // vectorEnd = totalSize - vectorCount
    il.Emit(OpCodes.Ldarg, totalSizeArg);
    il.Emit(OpCodes.Ldc_I4, vectorCount);
    il.Emit(OpCodes.Sub);
    il.Emit(OpCodes.Stloc, locVectorEnd);
    
    // SIMD loop - identical structure for V128/V256/V512
    il.MarkLabel(lblSimdLoop);
    EmitVectorLoad(il, lhsType);      // Emits V128/V256/V512.Load
    EmitVectorLoad(il, rhsType);
    EmitVectorOperation(il, op, resultType);
    EmitVectorStore(il, resultType);  // Emits V128/V256/V512.Store
    // ... increment by vectorCount, loop
}
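For reference, the emitted IL is equivalent to the following hand-written kernel - a sketch over int32 pointers, shown with Vector256 for concreteness (the generated kernel substitutes whichever width was detected). Note the scalar tail handling the `total % vectorCount` leftover elements:

```csharp
using System.Runtime.Intrinsics;

public static class AddKernelSketch
{
    public static unsafe void Add(int* a, int* b, int* result, int total)
    {
        int vectorCount = Vector256<int>.Count;   // 8 for int32 at 256 bits
        int i = 0;
        int vectorEnd = total - vectorCount;      // last index a full vector fits at

        // SIMD loop - identical shape for V128/V256/V512.
        for (; i <= vectorEnd; i += vectorCount)
        {
            var va = Vector256.Load(a + i);
            var vb = Vector256.Load(b + i);
            Vector256.Store(va + vb, result + i);
        }

        // Scalar tail for the remaining elements.
        for (; i < total; i++)
            result[i] = a[i] + b[i];
    }
}
```

When `total < vectorCount`, `vectorEnd` goes negative, the SIMD loop never runs, and the scalar tail processes everything - which is why the structure is safe at any width.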

Task List

  • Add VectorBits static readonly field with hardware detection
  • Add GetVectorContainerType() helper
  • Add GetVectorType(Type elementType) helper
  • Update GetVectorCount() to use VectorBits
  • Update CanUseSimd() to check VectorBits > 0
  • Update EmitVectorLoad() to use parameterized types
  • Update EmitVectorStore() to use parameterized types
  • Update EmitVectorOperation() to use parameterized types
  • Update EmitVectorCreate() to use parameterized types
  • Update SimdThresholds for width-appropriate thresholds
  • Add unit tests for V128 path (can force via reflection)
  • Benchmark on AVX-512 hardware if available

Files to Modify

| File | Changes |
|------|---------|
| ILKernelGenerator.cs | Add detection + parameterize ~10 methods |
| SimdThresholds.cs | Adjust thresholds per vector width |
| Tests | Add V128/V512 path verification |

Expected Results

| Hardware | VectorBits | Elements/Vector (int32) | Speedup vs Scalar |
|----------|------------|-------------------------|-------------------|
| No SIMD | 0 | 1 | 1× (baseline) |
| SSE2/NEON | 128 | 4 | ~4× |
| AVX2 | 256 | 8 | ~8× |
| AVX-512 | 512 | 16 | ~16× |

V512 vs V256 Comparison (10M elements)

| Operation | V256 Time | V512 Time | Improvement |
|-----------|-----------|-----------|-------------|
| a + b | ~16 ms | ~8 ms | ~2× |
| np.sum | ~5 ms | ~2.5 ms | ~2× |
| a * b | ~16 ms | ~8 ms | ~2× |

Hardware Coverage

| Vector Width | CPUs |
|--------------|------|
| V512 | Intel Xeon Scalable (Skylake-SP+), AMD EPYC/Ryzen 7000+ (Zen4) |
| V256 | Intel Core (Haswell+, 2013+), AMD (Excavator+, Zen+) |
| V128 | All x64 CPUs (SSE2), Apple Silicon (NEON), older AMD |

Implementation Complexity

| Aspect | Assessment |
|--------|------------|
| Lines of code | ~80 lines changed |
| Risk | Low - clean parameterization |
| Testing | Medium - need V128/V512 path coverage |
| Backwards compatible | Yes - V256 remains the default on most hardware |

Success Criteria

  1. VectorBits correctly detects hardware at startup
  2. V512 path used automatically on AVX-512 hardware
  3. V128 path works on older/ARM hardware
  4. No performance regression on V256 hardware
  5. All existing tests pass
  6. Kernel cache works correctly (same key → same kernel)
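Criterion 6 suggests the kernel cache key should incorporate the detected width, so a kernel compiled under a forced-V128 test run is never served for a native V256 request. A hypothetical key sketch (`KernelKey` is an assumption for illustration, not an existing NumSharp type):

```csharp
// Hypothetical cache key: the width participates in equality, so kernels
// compiled at different vector widths never collide in the cache.
public readonly record struct KernelKey(
    string Operation,   // e.g. "Add"
    int TypeCode,       // element type discriminator (e.g. (int)NPTypeCode.Int32)
    int VectorBits);    // 0, 128, 256, or 512

// usage sketch:
// var key = new KernelKey("Add", (int)NPTypeCode.Int32, ILKernelGenerator.VectorBits);
// var kernel = cache.GetOrAdd(key, k => CompileKernel(k));
```

Because `record struct` generates value-based equality and hashing, same key → same kernel holds automatically, satisfying criterion 6 even when tests override the width.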

Metadata

Labels

core (Internal engine: Shape, Storage, TensorEngine, iterators), enhancement (New feature or request), performance (Performance improvements or optimizations)
