FXC-5512 perf: stream outer_dot overlap kernel to reduce runtime and memory #3258
base: develop
Conversation
a48f2a2 to af528fa
Diff Coverage
Diff: origin/develop...HEAD, staged and unstaged changes

Summary

tidy3d/components/data/monitor_data.py

Lines 1095-1111

```
  1095
  1096     for dim, ind in zip(preserve_dims, preserve_index):
  1097         axis = dim_to_axis.get(dim)
  1098         if axis is None:
! 1099             continue
  1100         indexed_dims.add(dim)
  1101         axis_size = arr.shape[axis]
  1102         if axis_size == 1:
! 1103             idx[axis] = 0
  1104         elif ind < axis_size:
  1105             idx[axis] = ind
  1106         else:
! 1107             raise ValueError(
  1108                 "Cannot broadcast preserve dimension in streaming outer_dot kernel."
  1109             )
  1110
  1111     freq_axis = dim_to_axis.get(freq_dim)
```

Lines 1109-1117

```
  1109             )
  1110
  1111     freq_axis = dim_to_axis.get(freq_dim)
  1112     if freq_axis is None:
! 1113         raise ValueError("Frequency dimension missing in streaming outer_dot kernel.")
  1114     indexed_dims.add(freq_dim)
  1115     if arr.shape[freq_axis] == 1:
  1116         idx[freq_axis] = 0
  1117     else:
```

Lines 1118-1126

```
  1118         idx[freq_axis] = freq_index
  1119
  1120     mode_axis = dim_to_axis.get(mode_dim)
  1121     if mode_axis is None:
! 1122         raise ValueError("Mode dimension missing in streaming outer_dot kernel.")
  1123     idx[mode_axis] = mode_slice
  1124
  1125     sliced = arr[tuple(idx)]
  1126     remaining_dims = [dim for dim in dims if dim not in indexed_dims]
```

Lines 1125-1133

```
  1125     sliced = arr[tuple(idx)]
  1126     remaining_dims = [dim for dim in dims if dim not in indexed_dims]
  1127     required_dims = {mode_dim, tangential_dims[0], tangential_dims[1]}
  1128     if len(remaining_dims) != 3 or set(remaining_dims) != required_dims:
! 1129         raise ValueError("Unexpected dimensions in streaming outer_dot kernel.")
  1130
  1131     axes = (
  1132         remaining_dims.index(mode_dim),
  1133         remaining_dims.index(tangential_dims[0]),
```

Lines 1168-1176

```
  1168     coords[outer_dim_2] = data_array_temp_2.coords[outer_dim_2].to_numpy()
  1169     coords = {key: val for key, val in coords.items() if len(val.shape) != 0}
  1170
  1171     if "f" not in coords:
! 1172         raise ValueError("Frequency coordinate missing in streaming outer_dot kernel.")
  1173
  1174     dims = tuple(coords.keys())
  1175     shape = [len(val) for val in coords.values()]
  1176     dtype = np.result_type(
```

Lines 1212-1220

```
  1212     itemsize = np.dtype(np.result_type(fields_1[e_1].dtype, fields_2[e_1].dtype)).itemsize
  1213     # Heuristic: choose a mode block size that targets bounded temporary allocations,
  1214     # then clamp to practical limits to avoid tiny or overly large chunks.
  1215     if num_grid_points == 0:
! 1216         block_size = OUTER_DOT_BLOCK_MAX_SIZE
  1217     else:
  1218         block_size = int(OUTER_DOT_BLOCK_TARGET_BYTES // (num_grid_points * itemsize))
  1219         block_size = max(OUTER_DOT_BLOCK_MIN_SIZE, block_size)
  1220         block_size = min(OUTER_DOT_BLOCK_MAX_SIZE, block_size)
```

Lines 1251-1259

```
  1251             mode_slice=slice(i0, i0_end),
  1252         )
  1253
  1254         if left_block.shape[1] != d_area.size:
! 1255             raise ValueError(
  1256                 "Tangential area shape mismatch in streaming outer_dot kernel."
  1257             )
  1258
  1259         for i1 in range(0, num_modes_right, block_size_right):
```
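For a sense of the numbers the block-size heuristic above produces, here is a minimal worked example, assuming complex128 fields at production-scale grid size; the values of `OUTER_DOT_BLOCK_MIN_SIZE` and `OUTER_DOT_BLOCK_MAX_SIZE` below are illustrative stand-ins, not the module's exact constants:

```python
import numpy as np

# Stand-ins for the module constants (the min/max values here are assumptions).
OUTER_DOT_BLOCK_TARGET_BYTES = 64 * 1024**2
OUTER_DOT_BLOCK_MIN_SIZE = 8
OUTER_DOT_BLOCK_MAX_SIZE = 64

num_grid_points = 250_000                    # S: transverse grid points
itemsize = np.dtype(np.complex128).itemsize  # 16 bytes per element

# Same arithmetic as lines 1218-1220 above.
block_size = int(OUTER_DOT_BLOCK_TARGET_BYTES // (num_grid_points * itemsize))
block_size = max(OUTER_DOT_BLOCK_MIN_SIZE, block_size)
block_size = min(OUTER_DOT_BLOCK_MAX_SIZE, block_size)
print(block_size)  # 16 -> each (block_size, S) temporary is ~64 MB of complex128
```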
Thanks Yannick, this is a helpful PR; it's worth digging into this sort of thing. I did some benchmarking in frontend, backend, and EME. The current implementation has a memory issue for our primary use case (large transverse grids, many modes, few frequencies), but it's fixable without giving up the BLAS speedup. Let M = num_modes, F = num_freqs, S = num_transverse_grid_points.

What the "streaming" restructuring actually trades off

The old approach loops over mode pairs (M^2 iterations), processing all frequencies per iteration. The new approach loops over frequencies (F iterations), processing all modes per iteration via BLAS. This "streams" over frequency -- removing F from per-iteration memory -- but vectorizes over modes -- adding M:

The speed improvement comes from replacing M^2 Python loop iterations with a single BLAS matmul -- same total arithmetic, but much better cache utilization, SIMD, and lower per-iteration overhead. For the primary use case (M >> F), this trades a large loop count for a small one. But the memory trade-off goes the wrong way for the same case: when M >> F, replacing O(S*F) temporaries with O(M*S) temporaries is a net increase.

Benchmarks

At benchmark scale (kernel-only, excluding shared field extraction overhead):

At production scale (M=50, S=250,000, F=1), measured via tracemalloc: the current kernel peaks at ~760 MiB vs ~31 MiB for the old approach. The ~760 MiB comes from materializing all 4 weighted right-side matrices of shape (M, S) simultaneously inside the frequency loop. The speed gain is real (8.7 s to 376 ms), but the ~25x memory increase is prohibitive for large-grid workloads -- especially under parallel execution where per-worker memory matters.

Suggested fix

The memory increase is incidental, not fundamental to the BLAS approach. Two changes reduce peak memory to O(block*S) -- independent of M -- while retaining the BLAS speedup:

Benchmarked at M=50, S=250,000, F=1 (tracemalloc peak):

At block=8, peak memory matches the old approach while still being 5x faster. The block size is a tunable speed-memory knob.

Memory regression test

I'd also suggest adding a production-scale memory regression test -- a tracemalloc-guarded call at M>=30, S>=50,000 that asserts peak kernel memory stays within a bound proportional to block * S, not M * S. This would have caught the current issue.

Summary

The core idea (BLAS matmul over mode pairs) is the right direction and the speed gains are significant. I'd like to see the double-sided blocking optimization and a memory test before merging.
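A minimal sketch of the double-sided blocking suggested above (the field names and the two-term overlap form are placeholders, not the PR's actual kernel): the area weights are folded into the small left block once per outer iteration, and both mode axes are processed in blocks, so peak temporaries scale with block * S rather than M * S while the inner products remain BLAS matmuls.

```python
import numpy as np


def blocked_overlap(e1x, e1y, h2x, h2y, d_area, block=8):
    """result[i, j] = sum_s (e1x[i, s] * h2y[j, s] - e1y[i, s] * h2x[j, s]) * d_area[s].

    Inputs are (num_modes, S) arrays plus d_area of shape (S,). Both mode axes are
    processed in blocks, so peak temporaries are O(block * S), independent of num_modes.
    """
    out = np.zeros((e1x.shape[0], h2x.shape[0]), dtype=np.result_type(e1x, h2y))
    for i0 in range(0, e1x.shape[0], block):
        # Fold the area weights into the small left block once per outer iteration.
        e1x_b = e1x[i0:i0 + block] * d_area
        e1y_b = e1y[i0:i0 + block] * d_area
        for j0 in range(0, h2x.shape[0], block):
            # Two (block, S) @ (S, block) BLAS calls per block pair.
            out[i0:i0 + block, j0:j0 + block] = (
                e1x_b @ h2y[j0:j0 + block].T - e1y_b @ h2x[j0:j0 + block].T
            )
    return out
```

At block=8 with S=250,000 complex128 values, each weighted (block, S) temporary is about 32 MB, which lines up with the block=8 numbers quoted above.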
Cursor Bugbot has reviewed your changes and found 2 potential issues.
Posted updated benchmark comparison with commit-hash columns for clarity:
Environment:
Key point on regression case:
Aggregate runtime (sum of listed cases):
caseyflex left a comment
I think the speed-memory tradeoff is good overall. And I think this is an algorithmic improvement even when we do try to keep memory usage low.
A couple more comments:
- This will conflict heavily with @dmarek-flex 's outer dot colocation PR. Whoever merges second will need to carefully incorporate the logic improvements from the other
- I'd still like a unit test that explicitly checks that the memory usage is as expected. This could be here or backend.
- Do we need to update any memory estimation logic?
- Can we lower OUTER_DOT_BLOCK_TARGET_BYTES to 8, and make the min block size 1 instead of 8? This may be safer for parallel execution.
- I also talked with @dmarek-flex about this, but maybe you have thoughts -- you keep the old _outer_fn_summation because it is more general (it takes an arbitrary function) and is used elsewhere, but that flexibility comes at the cost of performance. Do you have any ideas about how to handle all use cases with minimal code duplication?
here is the other PR:
#3100
```python
    tangential_dims=tangential_dims,
    mode_slice=slice(i1, i1_end),
)
weighted_right = right_block * d_area
```
maybe this can be moved out of the inner loop, either by folding it into the left factor or by swapping the loop ordering
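A self-contained toy check of that equivalence (array names and sizes here are made up, not the PR's variables): since d_area acts as a diagonal weight, it can be folded into either factor, so weighting the left block once per outer iteration gives the same block products as re-weighting every right-side block.

```python
import numpy as np

rng = np.random.default_rng(0)
num_points, block_left, block_right, num_right = 1000, 8, 8, 32
left_block = rng.standard_normal((block_left, num_points))
right_modes = rng.standard_normal((num_right, num_points))
d_area = rng.random(num_points)

# Current form: weight each right-side block inside the inner loop.
inner = [left_block @ (right_modes[i1:i1 + block_right] * d_area).T
         for i1 in range(0, num_right, block_right)]

# Hoisted form: weight the left block once, outside the inner loop.
weighted_left = left_block * d_area
hoisted = [weighted_left @ right_modes[i1:i1 + block_right].T
           for i1 in range(0, num_right, block_right)]

assert all(np.allclose(a, b) for a, b in zip(inner, hoisted))
```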
```python
# Threshold for cos(theta) to avoid unphysically large amplitudes near grazing angles
COS_THETA_THRESH = 1e-5
MODE_INTERP_EXTRAPOLATION_TOLERANCE = 1e-2
OUTER_DOT_BLOCK_TARGET_BYTES = 64 * 1024**2
```
Actually, maybe we can use simpler logic for the block size. We could just have some default block size and add it as an optional parameter to the private outer dot implementation function.
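One possible shape for that, as a sketch (the function name, parameter, and default value here are assumptions, not the current signature):

```python
OUTER_DOT_DEFAULT_BLOCK_SIZE = 8  # assumed default, replacing the byte-budget heuristic


def _outer_dot_kernel(fields_1, fields_2, d_area, block_size: int | None = None):
    """Hypothetical private kernel: callers (and tests) can pass an explicit block size."""
    if block_size is None:
        block_size = OUTER_DOT_DEFAULT_BLOCK_SIZE
    ...  # blockwise mode products as in the streaming kernel
```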
```python
for sign, left_comp, right_comp in term_specs:
    for i0 in range(0, num_modes_left, block_size_left):
        i0_end = min(i0 + block_size_left, num_modes_left)
        left_block = ElectromagneticFieldData._slice_mode_xy_matrix(
```
Not sure we need to optimize further now, but maybe we don't need to slice the data 4 times here
Correct me if I am wrong, but I think most of the effort here is spent figuring out the dimensions and their order and prepping the arrays for the matrix multiply. So I think this optimization could easily be added to the numpy part of my branch (tidy3d/tidy3d/components/data/utils.py, lines 229 to 301 in e147035).
Yeah, I'm not completely sure. It seems to be a mix of that and the actual performance gain from changing the Python-looping / BLAS matrix-multiply tradeoff. But I also didn't study your handling in detail. So I think it would be critical to have some unit tests that at least control the memory usage of the implementation, as there's just so much that can go wrong here -- one extra line of Python can double the memory usage.
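For the memory-controlling test, a minimal sketch of the shape it could take, assuming a hypothetical fixture that builds two field-data objects at production-like size (the fixture name, sizes, and slack factor below are placeholders):

```python
import tracemalloc

import numpy as np


def test_outer_dot_peak_memory(make_field_data_pair):  # hypothetical fixture
    """Guard against O(num_modes * grid) temporaries in the outer_dot kernel."""
    data_1, data_2 = make_field_data_pair(num_modes=30, grid_points=50_000, num_freqs=1)

    tracemalloc.start()
    _ = data_1.outer_dot(data_2)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()

    # Bound proportional to block * S (not M * S); constants here are illustrative.
    block, grid_points, itemsize, slack = 8, 50_000, np.dtype(np.complex128).itemsize, 16
    assert peak < slack * block * grid_points * itemsize
```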
Summary

- Replaces the `outer_dot` hot path in `ElectromagneticFieldData.outer_dot(...)` with an adaptive streaming kernel:
  - `unblocked` path for smaller working sets.
  - `blocked` term-by-term path for large working sets to cap temporary allocations.
- `try/except` fallback to `_outer_fn_summation` from `outer_dot`.
- Output mode-pair coordinates are unchanged (`mode_index_0`, `mode_index_1`).

Why This Is Safe
Benchmark (Legacy vs Current vs Final)
Environment: local macOS, median of 3 runs per case.
Memory metric: `tracemalloc` peak allocations during kernel call.

Benchmark cases (results table omitted): (48x48, f=3, m=12x12); (64x64, f=3, sweep=4, cell=6, m=20x20); (128x128, f=2, m=72x72).

Notes:
- `mode_axis_middle` is the case that reproduces the current memory regression (the current unblocked path spikes to 162.5 MiB peak alloc); the final blocked path reduces that to 50.4 MiB (~3.2x lower) while still being ~22x faster than legacy.

Validation
Executed locally:
- `poetry run pytest -q tests/test_data/test_monitor_data.py`
- `poetry run pytest -q tests/test_plugins/test_mode_solver.py`
- `poetry run pytest -q tests/test_components/test_eme.py`
- `poetry run ruff check tidy3d/components/data/monitor_data.py tests/test_data/test_monitor_data.py`

Results:
- `tests/test_data/test_monitor_data.py`: 354 passed
- `tests/test_plugins/test_mode_solver.py`: 49 passed
- `tests/test_components/test_eme.py`: 18 passed

Known Limitations
- `tracemalloc` measures Python allocator peaks (good for NumPy temporaries), not full process RSS.

Note
Medium Risk
Touches a performance-critical numerical kernel and reshapes how overlaps are computed; while API/output intent is preserved, errors in dimension handling or blocking could cause subtle numerical or shape regressions.
Overview
Replaces the `ElectromagneticFieldData.outer_dot()` hot path with a new streaming NumPy kernel (`_outer_dot_numpy_kernel_streaming`) that iterates over frequency/preserved dims and performs blockwise mode matrix products to avoid materializing large mode-pair broadcast tensors, with new tunables like `OUTER_DOT_BLOCK_TARGET_BYTES`.

Adds a strict slicing helper (`_slice_mode_xy_matrix`) and updates the implementation to keep frequency ordering and mode-dimension semantics while reducing peak allocations. Extends `tests/test_monitor_data.py` with targeted parity tests covering preserved-dimension outputs and cases where the mode axis is not last.

Written by Cursor Bugbot for commit db498b4.
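For reference, the parity checks mentioned above can stay very small; a minimal, self-contained sketch of the idea (the explicit mode-pair loop below is a stand-in reference, not `_outer_fn_summation` itself, and the blockwise function only mirrors the streaming kernel's structure):

```python
import numpy as np


def reference_overlap(e1x, e1y, h2x, h2y, d_area):
    """Explicit mode-pair loop: slow but obviously correct."""
    out = np.empty((e1x.shape[0], h2x.shape[0]), dtype=complex)
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum((e1x[i] * h2y[j] - e1y[i] * h2x[j]) * d_area)
    return out


def blockwise_overlap(e1x, e1y, h2x, h2y, d_area, block=2):
    """Blockwise matmul form, structured like the streaming kernel."""
    out = np.empty((e1x.shape[0], h2x.shape[0]), dtype=complex)
    for i0 in range(0, e1x.shape[0], block):
        wx = e1x[i0:i0 + block] * d_area
        wy = e1y[i0:i0 + block] * d_area
        out[i0:i0 + block] = wx @ h2y.T - wy @ h2x.T
    return out


rng = np.random.default_rng(1)
shape = (5, 200)  # (modes, grid points); arbitrary small test sizes
e1x, e1y, h2x, h2y = (rng.standard_normal(shape) + 1j * rng.standard_normal(shape) for _ in range(4))
d_area = rng.random(shape[1])

assert np.allclose(blockwise_overlap(e1x, e1y, h2x, h2y, d_area),
                   reference_overlap(e1x, e1y, h2x, h2y, d_area))
```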