245 changes: 245 additions & 0 deletions .claude/SIMD_INVESTIGATION_RESULTS.md
@@ -0,0 +1,245 @@
# SIMD Optimization Investigation Results

## Executive Summary

The investigation revealed that NumSharp already had optimal SIMD scalar paths for **same-type operations** (via the C# SimdKernels), while **mixed-type operations** fell back to scalar loops in the IL kernels. **This has now been fixed.**

### Implementation Complete ✅

SIMD scalar paths have been added to the IL kernel generator for mixed-type operations where the array type equals the result type (no per-element conversion needed).

**Final Benchmark Results:**
```
Array size: 10,000,000 elements

Same-type operations (C# SIMD baseline):
double + double_scalar 15.29 ms [C# SIMD]
float + float_scalar 8.35 ms [C# SIMD]

Mixed-type with IL SIMD (LHS type == Result type):
double + int_scalar 14.96 ms [IL SIMD ✓] <- NOW OPTIMIZED
float + int_scalar 7.18 ms [IL SIMD ✓] <- NOW OPTIMIZED

Mixed-type without SIMD (requires conversion):
int + double_scalar 15.84 ms [Scalar loop]
```

**Tests:** All 2597 tests pass, 0 failures.

---

## Hardware Detection Results

| Feature | Supported |
|---------|-----------|
| SSE | Yes |
| SSE2 | Yes |
| SSE3 | Yes |
| SSSE3 | Yes |
| SSE4.1 | Yes |
| SSE4.2 | Yes |
| AVX | Yes |
| AVX2 | Yes |
| **AVX-512** | **No** |
| Vector256 | Yes (hardware accelerated) |
| Vector512 | No |

**Conclusion**: This machine (and most consumer CPUs) only supports up to AVX2/Vector256. AVX-512 hardware detection should be added but has lower priority since adoption is limited.

---

## Scalar SIMD Benchmark Results

```
Benchmark: array[10,000,000] + scalar

1. Scalar Loop : 25.42 ms
2. SIMD Hoisted : 16.28 ms (1.56x faster)
3. SIMD In-Loop : 22.42 ms (JIT doesn't fully hoist)
```

**Key Findings:**
- SIMD with hoisted `Vector256.Create(scalar)` is **1.56x faster** than scalar loop
- JIT does NOT fully hoist `Vector256.Create` - explicit hoisting gains another **1.38x**
- Explicit hoisting before the loop is critical for performance
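The effect measured above can be reproduced with a standalone sketch (not NumSharp code; class and method names are illustrative) contrasting the two loop shapes:

```csharp
using System;
using System.Runtime.Intrinsics;

static class HoistDemo
{
    // Fast shape: broadcast the scalar into a vector once, before the loop.
    public static unsafe void AddHoisted(double* src, double scalar, double* dst, int n)
    {
        var vScalar = Vector256.Create(scalar);               // hoisted broadcast
        int i = 0;
        for (; i <= n - Vector256<double>.Count; i += Vector256<double>.Count)
            Vector256.Store(Vector256.Load(src + i) + vScalar, dst + i);
        for (; i < n; i++)
            dst[i] = src[i] + scalar;                          // scalar remainder
    }

    // Slow shape: Create() inside the loop; the JIT does not reliably hoist it.
    public static unsafe void AddInLoop(double* src, double scalar, double* dst, int n)
    {
        int i = 0;
        for (; i <= n - Vector256<double>.Count; i += Vector256<double>.Count)
            Vector256.Store(Vector256.Load(src + i) + Vector256.Create(scalar), dst + i);
        for (; i < n; i++)
            dst[i] = src[i] + scalar;
    }
}
```

Both variants produce identical results; only the placement of the broadcast differs, which is where the 1.38x gap comes from.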

---

## NumSharp Current State Analysis

### Execution Path Dispatch

```
Operation Type | Path Classification | Kernel Used | SIMD Scalar?
------------------|---------------------|----------------------|-------------
double + double | SimdScalarRight | C# SimdKernels | YES (optimal)
int + double | SimdScalarRight | IL MixedTypeKernel | NO (scalar loop)
int + int | SimdScalarRight | C# SimdKernels | YES (for int/double/float/long)
byte + float | SimdScalarRight | IL MixedTypeKernel | NO (scalar loop)
```
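The table reduces to a simple eligibility rule, sketched below with illustrative names (NumSharp's actual dispatch is internal to its backend):

```csharp
using System;

static class DispatchSketch
{
    // Illustrative predicate: an IL mixed-type scalar kernel can take a SIMD
    // path only when the array's element type already equals the result type
    // (so no per-element conversion is needed) and that type has a Vector256
    // arithmetic path.
    public static bool CanUseIlSimdScalarPath(Type arrayType, Type resultType) =>
        arrayType == resultType &&
        (resultType == typeof(float) || resultType == typeof(double));
}
```

Per the table, `double + int_scalar` qualifies (array and result are both `double`), while `int + double_scalar` does not: the `int` array must be converted element by element to `double`.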

### Performance Comparison

```
Benchmark: array[10,000,000] + scalar

Same-type (double+double): 14.26 ms (C# SIMD kernel)
Mixed-type (int+double): 18.07 ms (IL scalar kernel)

Performance gap: ~27%
```

### Code Analysis

**C# SimdKernels.cs (lines 217-231)** - Optimal implementation:
```csharp
private static unsafe void SimdScalarRight_Add_Double(double* lhs, double scalar, double* result, int totalSize)
{
    var scalarVec = Vector256.Create(scalar); // Hoisted!
    int i = 0;
    int vectorEnd = totalSize - Vector256<double>.Count;

    for (; i <= vectorEnd; i += Vector256<double>.Count)
    {
        var vl = Vector256.Load(lhs + i);
        Vector256.Store(vl + scalarVec, result + i); // SIMD!
    }

    for (; i < totalSize; i++)
        result[i] = lhs[i] + scalar; // Remainder
}
```

**ILKernelGenerator.cs (lines 912-970)** - Suboptimal implementation:
```csharp
private static void EmitScalarRightLoop(ILGenerator il, MixedTypeKernelKey key, ...)
{
    // Lines 916-925: Hoist scalar value to local (good!)
    var locRhsVal = il.DeclareLocal(GetClrType(key.ResultType));
    il.Emit(OpCodes.Ldarg_1); // rhs
    EmitLoadIndirect(il, key.RhsType);
    EmitConvertTo(il, key.RhsType, key.ResultType);
    il.Emit(OpCodes.Stloc, locRhsVal);

    // Lines 938-960 emit the equivalent of this loop: scalar only, NO SIMD!
    //
    //   for (int i = 0; i < totalSize; i++)
    //       result[i] = lhs[i] + rhsVal; // Scalar add
}
```

---

## Recommendations

### Priority 1: Add SIMD to IL Scalar Paths (HIGH IMPACT)

**Why**: closes the ~27% gap measured above for mixed-type scalar operations (18.07 ms vs 14.26 ms).

**Implementation**:
1. Modify `EmitScalarRightLoop()` to emit SIMD code for supported types
2. Hoist `Vector256.Create(scalar)` before the loop
3. Add Vector256 load/add/store in the main loop
4. Keep scalar remainder loop for sizes not divisible by vector count

**Target types**: float, double (already have Vector256 support)

**Files to modify**:
- `ILKernelGenerator.cs`: Add `EmitSimdScalarRightLoop()` method
- Update `GenerateSimdScalarRightKernel()` to choose SIMD vs scalar based on type
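The loop shape that steps 1-4 describe, written out as its C# equivalent (a sketch: the actual change emits this pattern through `ILGenerator`, and the class and method names here are illustrative). `float` array plus `int` scalar is shown; the scalar is converted once, outside the loop, which is exactly why the SIMD path is only valid when the array type equals the result type:

```csharp
using System;
using System.Runtime.Intrinsics;

static class IlSimdSketch
{
    // C# equivalent of the loop the modified emitter produces for
    // float_array + int_scalar (array type == result type).
    public static unsafe void AddFloatArrayIntScalar(float* lhs, int rhs, float* result, int totalSize)
    {
        float rhsVal = rhs;                                   // convert the scalar once
        var scalarVec = Vector256.Create(rhsVal);             // step 2: hoisted broadcast
        int i = 0;
        int vectorEnd = totalSize - Vector256<float>.Count;

        for (; i <= vectorEnd; i += Vector256<float>.Count)   // step 3: vector body
            Vector256.Store(Vector256.Load(lhs + i) + scalarVec, result + i);

        for (; i < totalSize; i++)                            // step 4: scalar remainder
            result[i] = lhs[i] + rhsVal;
    }
}
```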

### Priority 2: Hardware Detection (LOW PRIORITY)

**Why**: AVX-512 adoption is limited. Most CPUs (including this dev machine) only support AVX2.

**Implementation** (when AVX-512 becomes common):
1. Add static readonly flags in `SimdThresholds.cs`:
```csharp
public static readonly bool HasAvx512 = Vector512.IsHardwareAccelerated;
public static readonly int PreferredVectorWidth = HasAvx512 ? 512 : 256;
```
2. Add Vector512 code paths alongside Vector256
3. Use runtime dispatch based on `HasAvx512`

**Expected benefit**: 2x throughput on AVX-512 hardware (16 floats vs 8 floats per instruction)
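A sketch of what that runtime dispatch could look like (class and method names are illustrative, not NumSharp's API). Because the width check is a `static readonly` field read once at startup, the JIT can fold the branch away:

```csharp
using System;
using System.Runtime.Intrinsics;

static class VectorWidthDispatchDemo
{
    public static readonly bool HasAvx512 = Vector512.IsHardwareAccelerated;

    public static unsafe void AddScalar(double* src, double scalar, double* dst, int n)
    {
        int i = 0;
        if (HasAvx512)
        {
            var v = Vector512.Create(scalar);                 // 8 doubles per iteration
            for (; i <= n - Vector512<double>.Count; i += Vector512<double>.Count)
                Vector512.Store(Vector512.Load(src + i) + v, dst + i);
        }
        else if (Vector256.IsHardwareAccelerated)
        {
            var v = Vector256.Create(scalar);                 // 4 doubles per iteration
            for (; i <= n - Vector256<double>.Count; i += Vector256<double>.Count)
                Vector256.Store(Vector256.Load(src + i) + v, dst + i);
        }
        for (; i < n; i++)
            dst[i] = src[i] + scalar;                         // remainder / scalar fallback
    }
}
```

On AVX2-only hardware like the dev machine above, `HasAvx512` is false and the Vector256 branch runs; on AVX-512 hardware the same binary takes the wider path.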

---

## Implementation Checklist

### Phase 1: SIMD Scalar for IL Kernels ✅ COMPLETE

- [x] Add `EmitSimdScalarRightLoop()` for float/double
- [x] Add `EmitSimdScalarLeftLoop()` for float/double
- [x] Add `EmitVectorCreate()` helper for Vector256.Create(scalar)
- [x] Update `GenerateSimdScalarRightKernel()` to choose SIMD path
- [x] Update `GenerateSimdScalarLeftKernel()` to choose SIMD path
- [x] Verify correctness with small arrays
- [x] Run full test suite (2597 passed, 0 failed)
- [x] Benchmark before/after

### Phase 2: Hardware Detection (Defer)

- [ ] Add `SimdCapabilities` static class
- [ ] Cache detection results at startup
- [ ] Add Vector512 code paths (when adopting)
- [ ] Runtime dispatch mechanism

---

## Files Modified

- `src/NumSharp.Core/Backends/Kernels/ILKernelGenerator.cs`:
- Added `EmitSimdScalarRightLoop()` method (lines 1063-1178)
- Added `EmitSimdScalarLeftLoop()` method (lines 1180-1295)
- Added `EmitVectorCreate()` helper (lines 1900-1914)
- Updated `GenerateSimdScalarRightKernel()` to check SIMD eligibility
- Updated `GenerateSimdScalarLeftKernel()` to check SIMD eligibility

---

## Appendix: Raw Benchmark Data

### Test 1: Hardware Detection
```
X86 Intrinsics:
Sse: True
Sse2: True
Avx: True
Avx2: True
Avx512F: False

Generic Vector Types:
Vector256<float>: True
Vector512<float>: False
```

### Test 2: Scalar vs SIMD
```
array[10,000,000] + scalar

1. Scalar Loop : 25.42 ms
2. SIMD Hoisted : 16.28 ms
3. SIMD In-Loop : 22.42 ms
```

### Test 3: NumSharp Same-type vs Mixed-type
```
Same-type (double+double): 14.26 ms
Mixed-type (int+double): 18.07 ms
```

---

## Conclusion

The investigation confirmed:
1. **Scalar SIMD** with hoisted broadcast provides **1.56x speedup** over scalar loops
2. NumSharp's C# SimdKernels already implement this optimally for same-type operations
3. ~~**IL MixedTypeKernels lack SIMD for scalar paths**~~ **FIXED** ✅
4. AVX-512 hardware detection is low priority due to limited adoption

**Status**: SIMD scalar paths have been implemented for IL kernels. Mixed-type operations like `double_array + int_scalar` now use SIMD when the array type equals the result type.

**Remaining work**: Hardware detection for AVX-512 (deferred until adoption increases).
36 changes: 36 additions & 0 deletions .gitattributes
@@ -0,0 +1,36 @@
# Auto detect text files and normalize to LF in repo
* text=auto eol=lf

# Explicit text files
*.cs text eol=lf
*.csproj text eol=lf
*.sln text eol=lf
*.md text eol=lf
*.json text eol=lf
*.xml text eol=lf
*.yml text eol=lf
*.yaml text eol=lf
*.txt text eol=lf
*.sh text eol=lf
*.ps1 text eol=lf
*.py text eol=lf
*.config text eol=lf
*.props text eol=lf
*.targets text eol=lf
*.editorconfig text eol=lf
*.gitignore text eol=lf
*.gitattributes text eol=lf

# Binary files
*.png binary
*.jpg binary
*.jpeg binary
*.gif binary
*.ico binary
*.snk binary
*.npy binary
*.npz binary
*.dll binary
*.exe binary
*.pdb binary
*.zip binary
@@ -0,0 +1,83 @@
using System.Runtime.InteropServices;
using BenchmarkDotNet.Attributes;
using NumSharp.Benchmark.GraphEngine.Infrastructure;

namespace NumSharp.Benchmark.GraphEngine.Benchmarks.Allocation;

/// <summary>
/// Micro-benchmarks comparing allocation primitives:
/// - Marshal.AllocHGlobal (current)
/// - NativeMemory.Alloc (proposed)
/// - NativeMemory.AlignedAlloc (for SIMD)
///
/// These benchmarks inform issue #528: NativeMemory modernization.
/// </summary>
[BenchmarkCategory("Allocation", "Micro")]
public class AllocationMicroBenchmarks : BenchmarkBase
{
    /// <summary>
    /// Byte counts to allocate (matching typical NumSharp array sizes).
    /// </summary>
    [Params(64, 1_000, 100_000, 10_000_000)]
    public int Bytes { get; set; }

    // ========================================================================
    // Allocation Only (no free) - measures allocation overhead
    // ========================================================================

    [Benchmark(Baseline = true, Description = "Marshal.AllocHGlobal")]
    [BenchmarkCategory("AllocOnly")]
    public nint MarshalAllocHGlobal() => Marshal.AllocHGlobal(Bytes);

    [Benchmark(Description = "NativeMemory.Alloc")]
    [BenchmarkCategory("AllocOnly")]
    public unsafe void* NativeMemoryAlloc() => NativeMemory.Alloc((nuint)Bytes);

    [Benchmark(Description = "NativeMemory.AlignedAlloc(32)")]
    [BenchmarkCategory("AllocOnly")]
    public unsafe void* NativeMemoryAlignedAlloc32() => NativeMemory.AlignedAlloc((nuint)Bytes, 32);

    [Benchmark(Description = "NativeMemory.AlignedAlloc(64)")]
    [BenchmarkCategory("AllocOnly")]
    public unsafe void* NativeMemoryAlignedAlloc64() => NativeMemory.AlignedAlloc((nuint)Bytes, 64);

    [Benchmark(Description = "NativeMemory.AllocZeroed")]
    [BenchmarkCategory("AllocOnly")]
    public unsafe void* NativeMemoryAllocZeroed() => NativeMemory.AllocZeroed((nuint)Bytes);

    // ========================================================================
    // Round-Trip (alloc + free) - measures full lifecycle
    // ========================================================================

    [Benchmark(Description = "Marshal alloc+free")]
    [BenchmarkCategory("RoundTrip")]
    public void MarshalRoundTrip()
    {
        var ptr = Marshal.AllocHGlobal(Bytes);
        Marshal.FreeHGlobal(ptr);
    }

    [Benchmark(Description = "NativeMemory alloc+free")]
    [BenchmarkCategory("RoundTrip")]
    public unsafe void NativeMemoryRoundTrip()
    {
        var ptr = NativeMemory.Alloc((nuint)Bytes);
        NativeMemory.Free(ptr);
    }

    [Benchmark(Description = "NativeMemory aligned alloc+free")]
    [BenchmarkCategory("RoundTrip")]
    public unsafe void NativeMemoryAlignedRoundTrip()
    {
        var ptr = NativeMemory.AlignedAlloc((nuint)Bytes, 32);
        NativeMemory.AlignedFree(ptr);
    }

    [Benchmark(Description = "NativeMemory zeroed alloc+free")]
    [BenchmarkCategory("RoundTrip")]
    public unsafe void NativeMemoryZeroedRoundTrip()
    {
        var ptr = NativeMemory.AllocZeroed((nuint)Bytes);
        NativeMemory.Free(ptr);
    }
}