
[SIMD] Adaptive Vector Width: Support Vector128/256/512 Based on Hardware #579

@Nucs

Description

Overview

Add runtime detection and adaptive code generation to support all vector widths (128, 256, 512 bits) based on available hardware. Currently hardcoded to Vector256 only.

Parent issue: #545

Current State

// ILKernelGenerator.cs - HARDCODED to Vector256
private static int GetVectorCount<T>() => Vector256<T>.Count;

// EmitVectorLoad()      → typeof(Vector256).GetMethod("Load", ...)
// EmitVectorStore()     → typeof(Vector256).GetMethod("Store", ...)
// EmitVectorOperation() → typeof(Vector256<T>).GetMethod("op_Addition", ...)

Problem

| Hardware | Vector Support | Current NumSharp | Issue |
|----------|----------------|------------------|-------|
| Intel Xeon, AMD Zen4 | V512 ✓ | Uses V256 | Missing 2× speedup |
| Most consumer CPUs | V256 ✓ | Uses V256 | OK |
| Older CPUs, ARM | V128 only | Crashes or falls back to scalar | No SIMD benefit |
| No SIMD | None | Falls back to scalar | OK |

Solution: Runtime Detection + Parameterized Emission

Step 1: Detect Hardware Once at Startup

public static class ILKernelGenerator
{
    /// <summary>
    /// Detected vector width at startup. Checked once, used forever.
    /// 512, 256, 128, or 0 (no SIMD).
    /// </summary>
    public static readonly int VectorBits = 
        Vector512.IsHardwareAccelerated ? 512 :
        Vector256.IsHardwareAccelerated ? 256 :
        Vector128.IsHardwareAccelerated ? 128 : 0;
}
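The task list below mentions forcing the V128 path in unit tests. One way to do that without reflection hacks is to route the readonly field through a detection helper that honors an environment-variable override. A minimal sketch, assuming a hypothetical `NUMSHARP_VECTOR_BITS` variable (not an existing NumSharp setting):

```csharp
using System;
using System.Runtime.Intrinsics;

public static class VectorWidthDetector
{
    // Detected once at startup; tests can force a narrower width (or 0
    // for the scalar path) via the hypothetical NUMSHARP_VECTOR_BITS.
    public static readonly int VectorBits = Detect();

    private static int Detect()
    {
        var forced = Environment.GetEnvironmentVariable("NUMSHARP_VECTOR_BITS");
        if (int.TryParse(forced, out int bits) &&
            (bits == 0 || bits == 128 || bits == 256 || bits == 512))
            return bits;

        return Vector512.IsHardwareAccelerated ? 512 :
               Vector256.IsHardwareAccelerated ? 256 :
               Vector128.IsHardwareAccelerated ? 128 : 0;
    }
}
```

Since the override is read before the hardware check, a CI job can exercise the V128 codegen path on AVX2 machines without touching the field via reflection.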

Step 2: Parameterize Helper Methods

/// <summary>
/// Get element count for current hardware's vector width.
/// </summary>
private static int GetVectorCount(NPTypeCode type)
{
    int typeSize = GetTypeSize(type);
    return VectorBits / (typeSize * 8);  // bits / bits-per-element
}
// VectorBits=512, Int32 → 512/32 = 16 elements
// VectorBits=256, Int32 → 256/32 = 8 elements
// VectorBits=128, Int32 → 128/32 = 4 elements

/// <summary>
/// Get the Vector container type (Vector128, Vector256, or Vector512).
/// </summary>
private static Type GetVectorContainerType() => VectorBits switch
{
    512 => typeof(Vector512),
    256 => typeof(Vector256),
    128 => typeof(Vector128),
    _ => throw new NotSupportedException("No SIMD support")
};

/// <summary>
/// Get the closed generic vector type (Vector128<T>, Vector256<T>, or Vector512<T>) for the current width.
/// </summary>
private static Type GetVectorType(Type elementType) => VectorBits switch
{
    512 => typeof(Vector512<>).MakeGenericType(elementType),
    256 => typeof(Vector256<>).MakeGenericType(elementType),
    128 => typeof(Vector128<>).MakeGenericType(elementType),
    _ => throw new NotSupportedException("No SIMD support")
};

/// <summary>
/// Check if SIMD is available for this type.
/// </summary>
private static bool CanUseSimd(NPTypeCode type)
{
    if (VectorBits == 0) return false;  // No SIMD hardware
    
    return type switch
    {
        NPTypeCode.Byte => true,
        NPTypeCode.Int16 or NPTypeCode.UInt16 => true,
        NPTypeCode.Int32 or NPTypeCode.UInt32 => true,
        NPTypeCode.Int64 or NPTypeCode.UInt64 => true,
        NPTypeCode.Single or NPTypeCode.Double => true,
        _ => false  // Boolean, Char, Decimal - no SIMD
    };
}

Step 3: Update Emit Methods

private static void EmitVectorLoad(ILGenerator il, NPTypeCode type)
{
    var containerType = GetVectorContainerType();  // Vector128/256/512
    var elementType = GetClrType(type);
    
    var loadMethod = containerType
        .GetMethod("Load", BindingFlags.Public | BindingFlags.Static)
        .MakeGenericMethod(elementType);
    
    il.EmitCall(OpCodes.Call, loadMethod, null);
}

private static void EmitVectorStore(ILGenerator il, NPTypeCode type)
{
    var containerType = GetVectorContainerType();
    var elementType = GetClrType(type);
    
    var storeMethod = containerType
        .GetMethods(BindingFlags.Public | BindingFlags.Static)
        .First(m => m.Name == "Store" && 
                    m.GetParameters().Length == 2 &&
                    m.GetParameters()[0].ParameterType.IsGenericType)
        .MakeGenericMethod(elementType);
    
    il.EmitCall(OpCodes.Call, storeMethod, null);
}

private static void EmitVectorOperation(ILGenerator il, BinaryOp op, NPTypeCode type)
{
    var elementType = GetClrType(type);
    var vectorType = GetVectorType(elementType);  // Vector128<T>/256<T>/512<T>
    
    string methodName = op switch
    {
        BinaryOp.Add => "op_Addition",
        BinaryOp.Subtract => "op_Subtraction",
        BinaryOp.Multiply => "op_Multiply",
        BinaryOp.Divide => "op_Division",
        _ => throw new NotSupportedException()
    };
    
    var opMethod = vectorType.GetMethod(methodName, 
        BindingFlags.Public | BindingFlags.Static,
        null, new[] { vectorType, vectorType }, null);
    
    il.EmitCall(OpCodes.Call, opMethod, null);
}

private static void EmitVectorCreate(ILGenerator il, NPTypeCode type)
{
    var containerType = GetVectorContainerType();
    var elementType = GetClrType(type);
    
    var createMethod = containerType.GetMethod("Create", new[] { elementType });
    il.EmitCall(OpCodes.Call, createMethod, null);
}
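The reflection lookups above can be smoke-tested without emitting any IL. A sketch that checks the `Load`, `op_Addition`, and `Create` members resolve for a given width (on recent .NET versions; `Vector512` requires .NET 8):

```csharp
using System;
using System.Reflection;
using System.Runtime.Intrinsics;

public static class EmitLookupSmokeTest
{
    // containerType is the non-generic static class (Vector128/256/512);
    // vectorType is the matching closed generic (e.g. Vector256<int>).
    public static bool Resolves(Type containerType, Type vectorType, Type elementType)
    {
        // Load<T>(T*) - generic definition on the static class.
        var load = containerType
            .GetMethod("Load", BindingFlags.Public | BindingFlags.Static)
            ?.MakeGenericMethod(elementType);

        // op_Addition(VectorN<T>, VectorN<T>) on the closed generic type.
        var add = vectorType.GetMethod("op_Addition",
            BindingFlags.Public | BindingFlags.Static,
            null, new[] { vectorType, vectorType }, null);

        // Create(elementType) - broadcast overload on the static class.
        var create = containerType.GetMethod("Create", new[] { elementType });

        return load != null && add != null && create != null;
    }
}

// e.g. EmitLookupSmokeTest.Resolves(typeof(Vector256), typeof(Vector256<int>), typeof(int))
```

Running this once per supported `(width, element type)` pair in the test suite catches API-surface differences between widths before they surface as emit-time `NullReferenceException`s.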

Step 4: Loop Code Unchanged!

The SIMD loop structure stays exactly the same - only vectorCount changes:

private static void EmitSimdLoop(ILGenerator il, ...)
{
    int vectorCount = GetVectorCount(resultType);  // 4, 8, or 16
    
    // vectorEnd = totalSize - vectorCount
    il.Emit(OpCodes.Ldarg, totalSizeArg);
    il.Emit(OpCodes.Ldc_I4, vectorCount);
    il.Emit(OpCodes.Sub);
    il.Emit(OpCodes.Stloc, locVectorEnd);
    
    // SIMD loop - identical structure for V128/V256/V512
    il.MarkLabel(lblSimdLoop);
    EmitVectorLoad(il, lhsType);      // Emits V128/V256/V512.Load
    EmitVectorLoad(il, rhsType);
    EmitVectorOperation(il, op, resultType);
    EmitVectorStore(il, resultType);  // Emits V128/V256/V512.Store
    // ... increment by vectorCount, loop
}
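For reference, the emitted IL is equivalent to the following hand-written kernel - a sketch over int32 pointers, shown with Vector256 for concreteness (the generated kernel substitutes whichever width was detected). Note the scalar tail handling the `total % vectorCount` leftover elements:

```csharp
using System.Runtime.Intrinsics;

public static class AddKernelSketch
{
    public static unsafe void Add(int* a, int* b, int* result, int total)
    {
        int vectorCount = Vector256<int>.Count;   // 8 for int32 at 256 bits
        int i = 0;
        int vectorEnd = total - vectorCount;      // last index a full vector fits at

        // SIMD loop - identical shape for V128/V256/V512.
        for (; i <= vectorEnd; i += vectorCount)
        {
            var va = Vector256.Load(a + i);
            var vb = Vector256.Load(b + i);
            Vector256.Store(va + vb, result + i);
        }

        // Scalar tail for the remaining elements.
        for (; i < total; i++)
            result[i] = a[i] + b[i];
    }
}
```

When `total < vectorCount`, `vectorEnd` goes negative, the SIMD loop never runs, and the scalar tail processes everything - which is why the structure is safe at any width.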

Task List

  • Add VectorBits static readonly field with hardware detection
  • Add GetVectorContainerType() helper
  • Add GetVectorType(Type elementType) helper
  • Update GetVectorCount() to use VectorBits
  • Update CanUseSimd() to check VectorBits > 0
  • Update EmitVectorLoad() to use parameterized types
  • Update EmitVectorStore() to use parameterized types
  • Update EmitVectorOperation() to use parameterized types
  • Update EmitVectorCreate() to use parameterized types
  • Update SimdThresholds for width-appropriate thresholds
  • Add unit tests for V128 path (can force via reflection)
  • Benchmark on AVX-512 hardware if available

Files to Modify

| File | Changes |
|------|---------|
| ILKernelGenerator.cs | Add detection + parameterize ~10 methods |
| SimdThresholds.cs | Adjust thresholds per vector width |
| Tests | Add V128/V512 path verification |

Expected Results

| Hardware | VectorBits | Elements/Vector (int32) | Speedup vs Scalar |
|----------|------------|-------------------------|-------------------|
| No SIMD | 0 | 1 | 1× (baseline) |
| SSE2/NEON | 128 | 4 | ~4× |
| AVX2 | 256 | 8 | ~8× |
| AVX-512 | 512 | 16 | ~16× |

V512 vs V256 Comparison (10M elements)

| Operation | V256 Time | V512 Time | Improvement |
|-----------|-----------|-----------|-------------|
| a + b | ~16 ms | ~8 ms | ~2× |
| np.sum | ~5 ms | ~2.5 ms | ~2× |
| a * b | ~16 ms | ~8 ms | ~2× |

Hardware Coverage

| Vector Width | CPUs |
|--------------|------|
| V512 | Intel Xeon Scalable (Skylake-SP+), AMD EPYC/Ryzen 7000+ (Zen4) |
| V256 | Intel Core (Haswell+, 2013+), AMD (Excavator+, Zen+) |
| V128 | All x64 CPUs (SSE2), Apple Silicon (NEON), older AMD |

Implementation Complexity

| Aspect | Assessment |
|--------|------------|
| Lines of code | ~80 lines changed |
| Risk | Low - clean parameterization |
| Testing | Medium - need V128/V512 path coverage |
| Backwards compatible | Yes - V256 remains the default on most hardware |

Success Criteria

  1. VectorBits correctly detects hardware at startup
  2. V512 path used automatically on AVX-512 hardware
  3. V128 path works on older/ARM hardware
  4. No performance regression on V256 hardware
  5. All existing tests pass
  6. Kernel cache works correctly (same key → same kernel)
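Criterion 6 suggests the kernel cache key should incorporate the detected width, so a kernel compiled under a forced-V128 test run is never served for a native V256 request. A hypothetical key sketch (`KernelKey` is an assumption for illustration, not an existing NumSharp type):

```csharp
// Hypothetical cache key: the width participates in equality, so kernels
// compiled at different vector widths never collide in the cache.
public readonly record struct KernelKey(
    string Operation,   // e.g. "Add"
    int TypeCode,       // element type discriminator (e.g. (int)NPTypeCode.Int32)
    int VectorBits);    // 0, 128, 256, or 512

// usage sketch:
// var key = new KernelKey("Add", (int)NPTypeCode.Int32, ILKernelGenerator.VectorBits);
// var kernel = cache.GetOrAdd(key, k => CompileKernel(k));
```

Because `record struct` generates value-based equality and hashing, same key → same kernel holds automatically, satisfying criterion 6 even when tests override the width.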

Metadata

Labels

core (Internal engine: Shape, Storage, TensorEngine, iterators), enhancement (New feature or request), performance (Performance improvements or optimizations)
