Labels: core (Internal engine: Shape, Storage, TensorEngine, iterators), enhancement (New feature or request), performance (Performance improvements or optimizations)
Overview
Add runtime detection and adaptive code generation to support all vector widths (128, 256, 512 bits) based on available hardware. Currently hardcoded to Vector256 only.
Parent issue: #545
Current State
```csharp
// ILKernelGenerator.cs - HARDCODED to Vector256
private static int GetVectorCount<T>() => Vector256<T>.Count;

EmitVectorLoad()      → typeof(Vector256).GetMethod("Load", ...)
EmitVectorStore()     → typeof(Vector256).GetMethod("Store", ...)
EmitVectorOperation() → typeof(Vector256<T>).GetMethod("op_Addition", ...)
```

Problem
| Hardware | Vector Support | Current NumSharp | Issue |
|---|---|---|---|
| Intel Xeon, AMD Zen4 | V512 ✓ | Uses V256 | Missing 2× speedup |
| Most consumer CPUs | V256 ✓ | Uses V256 | OK |
| Older CPUs, ARM | V128 only | Crashes or falls back to scalar | No SIMD benefit |
| No SIMD | None | Falls back to scalar | OK |
Solution: Runtime Detection + Parameterized Emission
Step 1: Detect Hardware Once at Startup
```csharp
public static class ILKernelGenerator
{
    /// <summary>
    /// Detected vector width at startup. Checked once, used forever.
    /// 512, 256, 128, or 0 (no SIMD).
    /// </summary>
    public static readonly int VectorBits =
        Vector512.IsHardwareAccelerated ? 512 :
        Vector256.IsHardwareAccelerated ? 256 :
        Vector128.IsHardwareAccelerated ? 128 : 0;
}
```

Step 2: Parameterize Helper Methods
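The same ternary cascade can be exercised as a standalone probe outside the generator. This is a minimal sketch (the class name `SimdProbe` is hypothetical) assuming .NET 8's `System.Runtime.Intrinsics`:

```csharp
using System;
using System.Runtime.Intrinsics;

Console.WriteLine($"Detected SIMD width: {SimdProbe.VectorBits} bits");

// Hypothetical standalone mirror of ILKernelGenerator.VectorBits.
// A static readonly field is initialized exactly once; the JIT then
// treats it as a constant, so downstream branches on it fold away.
static class SimdProbe
{
    public static readonly int VectorBits =
        Vector512.IsHardwareAccelerated ? 512 :
        Vector256.IsHardwareAccelerated ? 256 :
        Vector128.IsHardwareAccelerated ? 128 : 0;
}
```

Note the check order matters: `Vector512.IsHardwareAccelerated` does not imply the narrower checks are false, so the widest width must be tested first.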
```csharp
/// <summary>
/// Get element count for current hardware's vector width.
/// </summary>
private static int GetVectorCount(NPTypeCode type)
{
    int typeSize = GetTypeSize(type);
    return VectorBits / (typeSize * 8); // bits / bits-per-element
}
// VectorBits=512, Int32 → 512/32 = 16 elements
// VectorBits=256, Int32 → 256/32 = 8 elements
// VectorBits=128, Int32 → 128/32 = 4 elements

/// <summary>
/// Get the Vector container type (Vector128, Vector256, or Vector512).
/// </summary>
private static Type GetVectorContainerType() => VectorBits switch
{
    512 => typeof(Vector512),
    256 => typeof(Vector256),
    128 => typeof(Vector128),
    _ => throw new NotSupportedException("No SIMD support")
};

/// <summary>
/// Get the closed generic vector type (Vector128&lt;T&gt;, Vector256&lt;T&gt;,
/// or Vector512&lt;T&gt;) for the current width.
/// </summary>
private static Type GetVectorType(Type elementType) => VectorBits switch
{
    512 => typeof(Vector512<>).MakeGenericType(elementType),
    256 => typeof(Vector256<>).MakeGenericType(elementType),
    128 => typeof(Vector128<>).MakeGenericType(elementType),
    _ => throw new NotSupportedException("No SIMD support")
};

/// <summary>
/// Check if SIMD is available for this type.
/// </summary>
private static bool CanUseSimd(NPTypeCode type)
{
    if (VectorBits == 0) return false; // No SIMD hardware
    return type switch
    {
        NPTypeCode.Byte => true,
        NPTypeCode.Int16 or NPTypeCode.UInt16 => true,
        NPTypeCode.Int32 or NPTypeCode.UInt32 => true,
        NPTypeCode.Int64 or NPTypeCode.UInt64 => true,
        NPTypeCode.Single or NPTypeCode.Double => true,
        _ => false // Boolean, Char, Decimal - no SIMD
    };
}
```

Step 3: Update Emit Methods
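The width-to-count arithmetic is easy to verify in isolation. This sketch (the helper name `ElementCount` is hypothetical) reproduces the `VectorBits / (typeSize * 8)` formula for the element sizes in the table above:

```csharp
using System;

// Hypothetical mirror of GetVectorCount: element count is the vector
// width in bits divided by the bits per element.
static int ElementCount(int vectorBits, int typeSizeBytes)
    => vectorBits / (typeSizeBytes * 8);

Console.WriteLine(ElementCount(512, sizeof(int)));    // 16
Console.WriteLine(ElementCount(256, sizeof(int)));    // 8
Console.WriteLine(ElementCount(128, sizeof(int)));    // 4
Console.WriteLine(ElementCount(256, sizeof(double))); // 4
Console.WriteLine(ElementCount(256, sizeof(byte)));   // 32
```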
```csharp
private static void EmitVectorLoad(ILGenerator il, NPTypeCode type)
{
    var containerType = GetVectorContainerType(); // Vector128/256/512
    var elementType = GetClrType(type);
    var loadMethod = containerType
        .GetMethod("Load", BindingFlags.Public | BindingFlags.Static)
        .MakeGenericMethod(elementType);
    il.EmitCall(OpCodes.Call, loadMethod, null);
}

private static void EmitVectorStore(ILGenerator il, NPTypeCode type)
{
    var containerType = GetVectorContainerType();
    var elementType = GetClrType(type);
    var storeMethod = containerType
        .GetMethods(BindingFlags.Public | BindingFlags.Static)
        .First(m => m.Name == "Store" &&
                    m.GetParameters().Length == 2 &&
                    m.GetParameters()[0].ParameterType.IsGenericType)
        .MakeGenericMethod(elementType);
    il.EmitCall(OpCodes.Call, storeMethod, null);
}

private static void EmitVectorOperation(ILGenerator il, BinaryOp op, NPTypeCode type)
{
    var elementType = GetClrType(type);
    var vectorType = GetVectorType(elementType); // Vector128<T>/256<T>/512<T>
    string methodName = op switch
    {
        BinaryOp.Add => "op_Addition",
        BinaryOp.Subtract => "op_Subtraction",
        BinaryOp.Multiply => "op_Multiply",
        BinaryOp.Divide => "op_Division",
        _ => throw new NotSupportedException($"Unsupported binary op: {op}")
    };
    var opMethod = vectorType.GetMethod(methodName,
        BindingFlags.Public | BindingFlags.Static,
        null, new[] { vectorType, vectorType }, null);
    il.EmitCall(OpCodes.Call, opMethod, null);
}

private static void EmitVectorCreate(ILGenerator il, NPTypeCode type)
{
    var containerType = GetVectorContainerType();
    var elementType = GetClrType(type);
    var createMethod = containerType.GetMethod("Create", new[] { elementType });
    il.EmitCall(OpCodes.Call, createMethod, null);
}
```

Step 4: Loop Code Unchanged!
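Reflection lookups like these fail silently at kernel-build time if a member name or overload shape changes, so a sanity check that each container type actually exposes the members the emit methods ask for is cheap insurance. A standalone sketch of that check:

```csharp
using System;
using System.Linq;
using System.Reflection;
using System.Runtime.Intrinsics;

// For each container type the generator may select, verify that the
// reflection lookups used by EmitVectorLoad/EmitVectorStore resolve.
foreach (var container in new[] { typeof(Vector128), typeof(Vector256), typeof(Vector512) })
{
    var load = container.GetMethod("Load", BindingFlags.Public | BindingFlags.Static);
    var store = container.GetMethods(BindingFlags.Public | BindingFlags.Static)
        .FirstOrDefault(m => m.Name == "Store" &&
                             m.GetParameters().Length == 2 &&
                             m.GetParameters()[0].ParameterType.IsGenericType);
    Console.WriteLine($"{container.Name}: Load={load != null}, Store={store != null}");
}

// The operator lookup targets the closed generic vector type.
var addOp = typeof(Vector256<float>).GetMethod("op_Addition",
    BindingFlags.Public | BindingFlags.Static);
Console.WriteLine($"Vector256<float>.op_Addition found: {addOp != null}");
```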
The SIMD loop structure stays exactly the same; only `vectorCount` changes:

```csharp
private static void EmitSimdLoop(ILGenerator il, ...)
{
    int vectorCount = GetVectorCount(resultType); // 4, 8, or 16

    // vectorEnd = totalSize - vectorCount
    il.Emit(OpCodes.Ldarg, totalSizeArg);
    il.Emit(OpCodes.Ldc_I4, vectorCount);
    il.Emit(OpCodes.Sub);
    il.Emit(OpCodes.Stloc, locVectorEnd);

    // SIMD loop - identical structure for V128/V256/V512
    il.MarkLabel(lblSimdLoop);
    EmitVectorLoad(il, lhsType);      // Emits V128/V256/V512.Load
    EmitVectorLoad(il, rhsType);
    EmitVectorOperation(il, op, resultType);
    EmitVectorStore(il, resultType);  // Emits V128/V256/V512.Store
    // ... increment by vectorCount, loop
}
```

Task List
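For reference, the control flow the generator emits corresponds to this managed loop, written here with `Vector256` for concreteness (the emitted version swaps in the detected width). The scalar tail is the part the `// ... increment by vectorCount, loop` comment elides:

```csharp
using System;
using System.Runtime.Intrinsics;

float[] lhs = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 };
float[] rhs = { 10, 20, 30, 40, 50, 60, 70, 80, 90, 100 };
float[] result = new float[lhs.Length];
Add(lhs, rhs, result);
Console.WriteLine(string.Join(", ", result)); // 11, 22, ..., 110

// Managed-code equivalent of the emitted kernel; assumes equal-length arrays.
static void Add(float[] lhs, float[] rhs, float[] result)
{
    int i = 0;
    if (Vector256.IsHardwareAccelerated)
    {
        int vectorCount = Vector256<float>.Count;    // 8 floats per vector
        int vectorEnd = result.Length - vectorCount; // last full-vector start index
        for (; i <= vectorEnd; i += vectorCount)
        {
            var a = Vector256.Create(lhs, i);        // vector load
            var b = Vector256.Create(rhs, i);
            (a + b).CopyTo(result, i);               // vector op + store
        }
    }
    // Scalar tail for the remaining (< vectorCount) elements.
    for (; i < result.Length; i++)
        result[i] = lhs[i] + rhs[i];
}
```

With 10 elements and a vector count of 8, the vector loop handles indices 0–7 and the scalar tail handles 8–9; on hardware without `Vector256` support, the whole array goes through the scalar path.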
- Add `VectorBits` static readonly field with hardware detection
- Add `GetVectorContainerType()` helper
- Add `GetVectorType(Type elementType)` helper
- Update `GetVectorCount()` to use `VectorBits`
- Update `CanUseSimd()` to check `VectorBits > 0`
- Update `EmitVectorLoad()` to use parameterized types
- Update `EmitVectorStore()` to use parameterized types
- Update `EmitVectorOperation()` to use parameterized types
- Update `EmitVectorCreate()` to use parameterized types
- Update `SimdThresholds` for width-appropriate thresholds
- Add unit tests for V128 path (can force via reflection)
- Benchmark on AVX-512 hardware if available
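One caveat on the "force via reflection" test item: recent .NET runtimes reject `FieldInfo.SetValue` on static readonly fields, so it may be more reliable to route the helpers through internal overloads that take the width as a parameter. A sketch of that testability option (names hypothetical):

```csharp
using System;
using System.Runtime.Intrinsics;

// Hypothetical width-parameterized overload: the public VectorBits field
// stays, but tests can exercise the V128/V512 switch arms on any machine
// by passing the width explicitly instead of mutating a readonly field.
Console.WriteLine(GetVectorContainerType(128).Name); // Vector128
Console.WriteLine(GetVectorContainerType(512).Name); // Vector512

static Type GetVectorContainerType(int vectorBits) => vectorBits switch
{
    512 => typeof(Vector512),
    256 => typeof(Vector256),
    128 => typeof(Vector128),
    _ => throw new NotSupportedException("No SIMD support")
};
```

The production entry point would simply call `GetVectorContainerType(VectorBits)`, so runtime behavior is unchanged.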
Files to Modify
| File | Changes |
|---|---|
| `ILKernelGenerator.cs` | Add detection + parameterize ~10 methods |
| `SimdThresholds.cs` | Adjust thresholds per vector width |
| Tests | Add V128/V512 path verification |
Expected Results
| Hardware | VectorBits | Elements/Vector (int32) | Speedup vs Scalar |
|---|---|---|---|
| No SIMD | 0 | 1 | 1× (baseline) |
| SSE2/NEON | 128 | 4 | ~4× |
| AVX2 | 256 | 8 | ~8× |
| AVX-512 | 512 | 16 | ~16× |
V512 vs V256 Comparison (10M elements)
| Operation | V256 Time | V512 Time | Improvement |
|---|---|---|---|
| a + b | ~16 ms | ~8 ms | 2× |
| np.sum | ~5 ms | ~2.5 ms | 2× |
| a * b | ~16 ms | ~8 ms | 2× |
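The timings above are estimates. A rough `Stopwatch` harness like the following (not a substitute for a proper benchmark tool) can reproduce the comparison on local hardware; note that `System.Numerics.Vector<T>` sizes itself to the runtime's preferred width, which defaults to at most 256 bits:

```csharp
using System;
using System.Diagnostics;
using System.Numerics;

const int N = 10_000_000;
var a = new float[N];
var b = new float[N];
var r = new float[N];
var rnd = new Random(42);
for (int i = 0; i < N; i++) { a[i] = rnd.NextSingle(); b[i] = rnd.NextSingle(); }

// Scalar baseline.
var sw = Stopwatch.StartNew();
for (int i = 0; i < N; i++) r[i] = a[i] + b[i];
sw.Stop();
Console.WriteLine($"scalar: {sw.Elapsed.TotalMilliseconds:F1} ms");

// Vectorized loop with scalar tail.
sw.Restart();
int w = Vector<float>.Count;
int j = 0;
for (; j <= N - w; j += w)
{
    var va = new Vector<float>(a, j);
    var vb = new Vector<float>(b, j);
    (va + vb).CopyTo(r, j);
}
for (; j < N; j++) r[j] = a[j] + b[j];
sw.Stop();
Console.WriteLine($"Vector<float> ({w * 32}-bit): {sw.Elapsed.TotalMilliseconds:F1} ms");
```

On large arrays the operation is memory-bandwidth-bound, so real-world V512-over-V256 gains are often below the 2× ALU-throughput ceiling.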
Hardware Coverage
| Vector Width | CPUs |
|---|---|
| V512 | Intel Xeon Scalable (Skylake-SP+), AMD EPYC/Ryzen 7000+ (Zen4) |
| V256 | Intel Core (Haswell+, 2013+), AMD (Excavator+, Zen+) |
| V128 | All x64 CPUs (SSE2), Apple Silicon (NEON), older AMD |
Implementation Complexity
| Aspect | Assessment |
|---|---|
| Lines of code | ~80 lines changed |
| Risk | Low - clean parameterization |
| Testing | Medium - need V128/V512 path coverage |
| Backwards compatible | Yes - V256 remains default on most hardware |
Success Criteria
- `VectorBits` correctly detects hardware at startup
- V512 path used automatically on AVX-512 hardware
- V128 path works on older/ARM hardware
- No performance regression on V256 hardware
- All existing tests pass
- Kernel cache works correctly (same key → same kernel)