245 changes: 245 additions & 0 deletions .claude/SIMD_INVESTIGATION_RESULTS.md
@@ -0,0 +1,245 @@
# SIMD Optimization Investigation Results

## Executive Summary

The investigation revealed that NumSharp already had optimal SIMD scalar paths for **same-type operations** (via the C# SimdKernels), while **mixed-type operations** fell back to scalar loops in the IL kernels. **This has now been fixed.**

### Implementation Complete ✅

SIMD scalar paths have been added to the IL kernel generator for mixed-type operations where the array type equals the result type (no per-element conversion needed).

**Final Benchmark Results:**
```
Array size: 10,000,000 elements

Same-type operations (C# SIMD baseline):
double + double_scalar 15.29 ms [C# SIMD]
float + float_scalar 8.35 ms [C# SIMD]

Mixed-type with IL SIMD (LHS type == Result type):
double + int_scalar 14.96 ms [IL SIMD ✓] <- NOW OPTIMIZED
float + int_scalar 7.18 ms [IL SIMD ✓] <- NOW OPTIMIZED

Mixed-type without SIMD (requires conversion):
int + double_scalar 15.84 ms [Scalar loop]
```

**Tests:** All 2597 tests pass, 0 failures.

---

## Hardware Detection Results

| Feature | Supported |
|---------|-----------|
| SSE | Yes |
| SSE2 | Yes |
| SSE3 | Yes |
| SSSE3 | Yes |
| SSE4.1 | Yes |
| SSE4.2 | Yes |
| AVX | Yes |
| AVX2 | Yes |
| **AVX-512** | **No** |
| Vector256 | Yes (hardware accelerated) |
| Vector512 | No |

**Conclusion**: This machine (and most consumer CPUs) only supports up to AVX2/Vector256. AVX-512 hardware detection should be added but has lower priority since adoption is limited.

---

## Scalar SIMD Benchmark Results

```
Benchmark: array[10,000,000] + scalar

1. Scalar Loop : 25.42 ms
2. SIMD Hoisted : 16.28 ms (1.56x faster)
3. SIMD In-Loop : 22.42 ms (JIT doesn't fully hoist)
```

**Key Findings:**
- SIMD with hoisted `Vector256.Create(scalar)` is **1.56x faster** than scalar loop
- JIT does NOT fully hoist `Vector256.Create` - explicit hoisting gains another **1.38x**
- Explicit hoisting before the loop is critical for performance
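The effect measured above can be reproduced with a standalone sketch (not NumSharp code; class and method names are illustrative) contrasting the two loop shapes:

```csharp
using System;
using System.Runtime.Intrinsics;

static class HoistDemo
{
    // Fast shape: broadcast the scalar into a vector once, before the loop.
    public static unsafe void AddHoisted(double* src, double scalar, double* dst, int n)
    {
        var vScalar = Vector256.Create(scalar);               // hoisted broadcast
        int i = 0;
        for (; i <= n - Vector256<double>.Count; i += Vector256<double>.Count)
            Vector256.Store(Vector256.Load(src + i) + vScalar, dst + i);
        for (; i < n; i++)
            dst[i] = src[i] + scalar;                          // scalar remainder
    }

    // Slow shape: Create() inside the loop; the JIT does not reliably hoist it.
    public static unsafe void AddInLoop(double* src, double scalar, double* dst, int n)
    {
        int i = 0;
        for (; i <= n - Vector256<double>.Count; i += Vector256<double>.Count)
            Vector256.Store(Vector256.Load(src + i) + Vector256.Create(scalar), dst + i);
        for (; i < n; i++)
            dst[i] = src[i] + scalar;
    }
}
```

Both variants produce identical results; only the placement of the broadcast differs, which is where the 1.38x gap comes from.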

---

## NumSharp Current State Analysis

### Execution Path Dispatch

```
Operation Type | Path Classification | Kernel Used | SIMD Scalar?
------------------|---------------------|----------------------|-------------
double + double | SimdScalarRight | C# SimdKernels | YES (optimal)
int + double | SimdScalarRight | IL MixedTypeKernel | NO (scalar loop)
int + int | SimdScalarRight | C# SimdKernels | YES (for int/double/float/long)
byte + float | SimdScalarRight | IL MixedTypeKernel | NO (scalar loop)
```
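The table reduces to a simple eligibility rule, sketched below with illustrative names (NumSharp's actual dispatch is internal to its backend):

```csharp
using System;

static class DispatchSketch
{
    // Illustrative predicate: an IL mixed-type scalar kernel can take a SIMD
    // path only when the array's element type already equals the result type
    // (so no per-element conversion is needed) and that type has a Vector256
    // arithmetic path.
    public static bool CanUseIlSimdScalarPath(Type arrayType, Type resultType) =>
        arrayType == resultType &&
        (resultType == typeof(float) || resultType == typeof(double));
}
```

Per the table, `double + int_scalar` qualifies (array and result are both `double`), while `int + double_scalar` does not: the `int` array must be converted element by element to `double`.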

### Performance Comparison

```
Benchmark: array[10,000,000] + scalar

Same-type (double+double): 14.26 ms (C# SIMD kernel)
Mixed-type (int+double): 18.07 ms (IL scalar kernel)

Performance gap: ~27%
```

### Code Analysis

**C# SimdKernels.cs (lines 217-231)** - Optimal implementation:
```csharp
private static unsafe void SimdScalarRight_Add_Double(double* lhs, double scalar, double* result, int totalSize)
{
    var scalarVec = Vector256.Create(scalar); // Hoisted!
    int i = 0;
    int vectorEnd = totalSize - Vector256<double>.Count;

    for (; i <= vectorEnd; i += Vector256<double>.Count)
    {
        var vl = Vector256.Load(lhs + i);
        Vector256.Store(vl + scalarVec, result + i); // SIMD!
    }

    for (; i < totalSize; i++)
        result[i] = lhs[i] + scalar; // Remainder
}
```

**ILKernelGenerator.cs (lines 912-970)** - Suboptimal implementation:
```csharp
private static void EmitScalarRightLoop(ILGenerator il, MixedTypeKernelKey key, ...)
{
    // Lines 916-925: Hoist scalar value to local (good!)
    var locRhsVal = il.DeclareLocal(GetClrType(key.ResultType));
    il.Emit(OpCodes.Ldarg_1); // rhs
    EmitLoadIndirect(il, key.RhsType);
    EmitConvertTo(il, key.RhsType, key.ResultType);
    il.Emit(OpCodes.Stloc, locRhsVal);

    // Lines 938-960 emit the equivalent of this loop: scalar only, NO SIMD!
    //
    //   for (int i = 0; i < totalSize; i++)
    //       result[i] = lhs[i] + rhsVal; // Scalar add
}
```

---

## Recommendations

### Priority 1: Add SIMD to IL Scalar Paths (HIGH IMPACT)

**Why**: closes the ~27% gap measured above for mixed-type scalar operations (18.07 ms vs 14.26 ms).

**Implementation**:
1. Modify `EmitScalarRightLoop()` to emit SIMD code for supported types
2. Hoist `Vector256.Create(scalar)` before the loop
3. Add Vector256 load/add/store in the main loop
4. Keep scalar remainder loop for sizes not divisible by vector count

**Target types**: float, double (already have Vector256 support)

**Files to modify**:
- `ILKernelGenerator.cs`: Add `EmitSimdScalarRightLoop()` method
- Update `GenerateSimdScalarRightKernel()` to choose SIMD vs scalar based on type
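The loop shape that steps 1-4 describe, written out as its C# equivalent (a sketch: the actual change emits this pattern through `ILGenerator`, and the class and method names here are illustrative). `float` array plus `int` scalar is shown; the scalar is converted once, outside the loop, which is exactly why the SIMD path is only valid when the array type equals the result type:

```csharp
using System;
using System.Runtime.Intrinsics;

static class IlSimdSketch
{
    // C# equivalent of the loop the modified emitter produces for
    // float_array + int_scalar (array type == result type).
    public static unsafe void AddFloatArrayIntScalar(float* lhs, int rhs, float* result, int totalSize)
    {
        float rhsVal = rhs;                                   // convert the scalar once
        var scalarVec = Vector256.Create(rhsVal);             // step 2: hoisted broadcast
        int i = 0;
        int vectorEnd = totalSize - Vector256<float>.Count;

        for (; i <= vectorEnd; i += Vector256<float>.Count)   // step 3: vector body
            Vector256.Store(Vector256.Load(lhs + i) + scalarVec, result + i);

        for (; i < totalSize; i++)                            // step 4: scalar remainder
            result[i] = lhs[i] + rhsVal;
    }
}
```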

### Priority 2: Hardware Detection (LOW PRIORITY)

**Why**: AVX-512 adoption is limited. Most CPUs (including this dev machine) only support AVX2.

**Implementation** (when AVX-512 becomes common):
1. Add static readonly flags in `SimdThresholds.cs`:
```csharp
public static readonly bool HasAvx512 = Vector512.IsHardwareAccelerated;
public static readonly int PreferredVectorWidth = HasAvx512 ? 512 : 256;
```
2. Add Vector512 code paths alongside Vector256
3. Use runtime dispatch based on `HasAvx512`

**Expected benefit**: 2x throughput on AVX-512 hardware (16 floats vs 8 floats per instruction)
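A sketch of what that runtime dispatch could look like (class and method names are illustrative, not NumSharp's API). Because the width check is a `static readonly` field read once at startup, the JIT can fold the branch away:

```csharp
using System;
using System.Runtime.Intrinsics;

static class VectorWidthDispatchDemo
{
    public static readonly bool HasAvx512 = Vector512.IsHardwareAccelerated;

    public static unsafe void AddScalar(double* src, double scalar, double* dst, int n)
    {
        int i = 0;
        if (HasAvx512)
        {
            var v = Vector512.Create(scalar);                 // 8 doubles per iteration
            for (; i <= n - Vector512<double>.Count; i += Vector512<double>.Count)
                Vector512.Store(Vector512.Load(src + i) + v, dst + i);
        }
        else if (Vector256.IsHardwareAccelerated)
        {
            var v = Vector256.Create(scalar);                 // 4 doubles per iteration
            for (; i <= n - Vector256<double>.Count; i += Vector256<double>.Count)
                Vector256.Store(Vector256.Load(src + i) + v, dst + i);
        }
        for (; i < n; i++)
            dst[i] = src[i] + scalar;                         // remainder / scalar fallback
    }
}
```

On AVX2-only hardware like the dev machine above, `HasAvx512` is false and the Vector256 branch runs; on AVX-512 hardware the same binary takes the wider path.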

---

## Implementation Checklist

### Phase 1: SIMD Scalar for IL Kernels ✅ COMPLETE

- [x] Add `EmitSimdScalarRightLoop()` for float/double
- [x] Add `EmitSimdScalarLeftLoop()` for float/double
- [x] Add `EmitVectorCreate()` helper for Vector256.Create(scalar)
- [x] Update `GenerateSimdScalarRightKernel()` to choose SIMD path
- [x] Update `GenerateSimdScalarLeftKernel()` to choose SIMD path
- [x] Verify correctness with small arrays
- [x] Run full test suite (2597 passed, 0 failed)
- [x] Benchmark before/after

### Phase 2: Hardware Detection (Defer)

- [ ] Add `SimdCapabilities` static class
- [ ] Cache detection results at startup
- [ ] Add Vector512 code paths (when adopting)
- [ ] Runtime dispatch mechanism

---

## Files Modified

- `src/NumSharp.Core/Backends/Kernels/ILKernelGenerator.cs`:
- Added `EmitSimdScalarRightLoop()` method (lines 1063-1178)
- Added `EmitSimdScalarLeftLoop()` method (lines 1180-1295)
- Added `EmitVectorCreate()` helper (lines 1900-1914)
- Updated `GenerateSimdScalarRightKernel()` to check SIMD eligibility
- Updated `GenerateSimdScalarLeftKernel()` to check SIMD eligibility

---

## Appendix: Raw Benchmark Data

### Test 1: Hardware Detection
```
X86 Intrinsics:
Sse: True
Sse2: True
Avx: True
Avx2: True
Avx512F: False

Generic Vector Types:
Vector256<float>: True
Vector512<float>: False
```

### Test 2: Scalar vs SIMD
```
array[10,000,000] + scalar

1. Scalar Loop : 25.42 ms
2. SIMD Hoisted : 16.28 ms
3. SIMD In-Loop : 22.42 ms
```

### Test 3: NumSharp Same-type vs Mixed-type
```
Same-type (double+double): 14.26 ms
Mixed-type (int+double): 18.07 ms
```

---

## Conclusion

The investigation confirmed:
1. **Scalar SIMD** with hoisted broadcast provides **1.56x speedup** over scalar loops
2. NumSharp's C# SimdKernels already implement this optimally for same-type operations
3. ~~**IL MixedTypeKernels lack SIMD for scalar paths**~~ **FIXED** ✅
4. AVX-512 hardware detection is low priority due to limited adoption

**Status**: SIMD scalar paths have been implemented for IL kernels. Mixed-type operations like `double_array + int_scalar` now use SIMD when the array type equals the result type.

**Remaining work**: Hardware detection for AVX-512 (deferred until adoption increases).
36 changes: 36 additions & 0 deletions .gitattributes
@@ -0,0 +1,36 @@
# Auto detect text files and normalize to LF in repo
* text=auto eol=lf

# Explicit text files
*.cs text eol=lf
*.csproj text eol=lf
*.sln text eol=lf
*.md text eol=lf
*.json text eol=lf
*.xml text eol=lf
*.yml text eol=lf
*.yaml text eol=lf
*.txt text eol=lf
*.sh text eol=lf
*.ps1 text eol=lf
*.py text eol=lf
*.config text eol=lf
*.props text eol=lf
*.targets text eol=lf
*.editorconfig text eol=lf
*.gitignore text eol=lf
*.gitattributes text eol=lf

# Binary files
*.png binary
*.jpg binary
*.jpeg binary
*.gif binary
*.ico binary
*.snk binary
*.npy binary
*.npz binary
*.dll binary
*.exe binary
*.pdb binary
*.zip binary
@@ -0,0 +1,83 @@
using System.Runtime.InteropServices;
using BenchmarkDotNet.Attributes;
using NumSharp.Benchmark.GraphEngine.Infrastructure;

namespace NumSharp.Benchmark.GraphEngine.Benchmarks.Allocation;

/// <summary>
/// Micro-benchmarks comparing allocation primitives:
/// - Marshal.AllocHGlobal (current)
/// - NativeMemory.Alloc (proposed)
/// - NativeMemory.AlignedAlloc (for SIMD)
///
/// These benchmarks inform issue #528: NativeMemory modernization.
/// </summary>
[BenchmarkCategory("Allocation", "Micro")]
public class AllocationMicroBenchmarks : BenchmarkBase
{
    /// <summary>
    /// Byte counts to allocate (matching typical NumSharp array sizes).
    /// </summary>
    [Params(64, 1_000, 100_000, 10_000_000)]
    public int Bytes { get; set; }

    // ========================================================================
    // Allocation Only (no free) - measures allocation overhead
    // ========================================================================

    [Benchmark(Baseline = true, Description = "Marshal.AllocHGlobal")]
    [BenchmarkCategory("AllocOnly")]
    public nint MarshalAllocHGlobal() => Marshal.AllocHGlobal(Bytes);

    [Benchmark(Description = "NativeMemory.Alloc")]
    [BenchmarkCategory("AllocOnly")]
    public unsafe void* NativeMemoryAlloc() => NativeMemory.Alloc((nuint)Bytes);

    [Benchmark(Description = "NativeMemory.AlignedAlloc(32)")]
    [BenchmarkCategory("AllocOnly")]
    public unsafe void* NativeMemoryAlignedAlloc32() => NativeMemory.AlignedAlloc((nuint)Bytes, 32);

    [Benchmark(Description = "NativeMemory.AlignedAlloc(64)")]
    [BenchmarkCategory("AllocOnly")]
    public unsafe void* NativeMemoryAlignedAlloc64() => NativeMemory.AlignedAlloc((nuint)Bytes, 64);

    [Benchmark(Description = "NativeMemory.AllocZeroed")]
    [BenchmarkCategory("AllocOnly")]
    public unsafe void* NativeMemoryAllocZeroed() => NativeMemory.AllocZeroed((nuint)Bytes);

    // ========================================================================
    // Round-Trip (alloc + free) - measures full lifecycle
    // ========================================================================

    [Benchmark(Description = "Marshal alloc+free")]
    [BenchmarkCategory("RoundTrip")]
    public void MarshalRoundTrip()
    {
        var ptr = Marshal.AllocHGlobal(Bytes);
        Marshal.FreeHGlobal(ptr);
    }

    [Benchmark(Description = "NativeMemory alloc+free")]
    [BenchmarkCategory("RoundTrip")]
    public unsafe void NativeMemoryRoundTrip()
    {
        var ptr = NativeMemory.Alloc((nuint)Bytes);
        NativeMemory.Free(ptr);
    }

    [Benchmark(Description = "NativeMemory aligned alloc+free")]
    [BenchmarkCategory("RoundTrip")]
    public unsafe void NativeMemoryAlignedRoundTrip()
    {
        var ptr = NativeMemory.AlignedAlloc((nuint)Bytes, 32);
        NativeMemory.AlignedFree(ptr);
    }

    [Benchmark(Description = "NativeMemory zeroed alloc+free")]
    [BenchmarkCategory("RoundTrip")]
    public unsafe void NativeMemoryZeroedRoundTrip()
    {
        var ptr = NativeMemory.AllocZeroed((nuint)Bytes);
        NativeMemory.Free(ptr);
    }
}