
Conversation

@yaugenst-flex (Collaborator) commented Feb 9, 2026

Summary

  • Replaced the outer_dot hot path in ElectromagneticFieldData.outer_dot(...) with an adaptive streaming kernel (see the sketch after this list):
    • unblocked path for smaller working sets.
    • blocked term-by-term path for large working sets to cap temporary allocations.
  • Removed the runtime try/except fallback to _outer_fn_summation from outer_dot.
  • Kept existing preprocessing and output semantics intact:
    • tangential-dim checks,
    • conjugation behavior,
    • interpolation/alignment behavior,
    • frequency intersection with left-order semantics,
    • mode add/rename/drop behavior (mode_index_0, mode_index_1).
  • Added parity coverage for blocked vs unblocked streaming kernels.
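
For orientation, a minimal sketch of how such an adaptive dispatch can work; the names and threshold logic are illustrative, not the PR's actual code (the real kernel is _outer_dot_numpy_kernel_streaming with tunables like OUTER_DOT_BLOCK_TARGET_BYTES):

BLOCK_TARGET_BYTES = 64 * 1024**2  # illustrative budget for per-call temporaries

def choose_kernel_path(num_modes: int, num_grid_points: int, itemsize: int = 16) -> str:
    # Pick the unblocked path when the full (modes, grid) temporaries fit
    # within a fixed byte budget, otherwise fall back to blocking.
    working_set = num_modes * num_grid_points * itemsize
    return "unblocked" if working_set <= BLOCK_TARGET_BYTES else "blocked"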

Why This Is Safe

  • No public API/signature changes.
  • No schema/data-model architecture changes.
  • EME call sites unchanged.
  • Numerical parity verified against the legacy and current implementations in the benchmark harness (checksums matched within floating-point tolerance).

Benchmark (Legacy vs Current vs Final)

Environment: local macOS, median of 3 runs per case.
Memory metric: tracemalloc peak allocations during kernel call.
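
For reference, the measurements were taken roughly like this (sketch; measure and kernel are illustrative names, not the benchmark harness's actual code):

import time
import tracemalloc

def measure(kernel, *args):
    # Median walltime over 3 runs, then tracemalloc peak for one kernel call.
    times = []
    for _ in range(3):
        t0 = time.perf_counter()
        kernel(*args)
        times.append(time.perf_counter() - t0)
    tracemalloc.start()
    kernel(*args)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return sorted(times)[1], peak  # (median seconds, peak bytes)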

Case | Layout | Legacy wall (s) | Current wall (s) | Final wall (s) | Final speedup vs legacy | Legacy peak alloc | Current peak alloc | Final peak alloc
--- | --- | --- | --- | --- | --- | --- | --- | ---
small (48x48, f=3, m=12x12) | mode last | 0.027 | 0.002 | 0.002 | 11.08x | 1.6 MiB | 744.9 KiB | 745.1 KiB
eme_like (64x64, f=3, sweep=4, cell=6, m=20x20) | mode last | 8.230 | 0.114 | 0.108 | 76.17x | 68.0 MiB | 2.0 MiB | 2.0 MiB
mode_axis_middle (128x128, f=2, m=72x72) | mode middle | 2.434 | 0.082 | 0.109 | 22.42x | 7.7 MiB | 162.5 MiB | 50.4 MiB
Total (sum) | mixed | 10.691 | 0.198 | 0.219 | 48.80x | - | - | -

Notes:

  • mode_axis_middle is the case that reproduces the current memory regression (the current unblocked path spikes to 162.5 MiB peak alloc); the final blocked path reduces that to 50.4 MiB (~3.2x lower) while still being ~22x faster than legacy.
  • In mode-last workloads, the final version stays on the unblocked path and shows effectively no regression vs current.

Validation

Executed locally:

  • poetry run pytest -q tests/test_data/test_monitor_data.py
  • poetry run pytest -q tests/test_plugins/test_mode_solver.py
  • poetry run pytest -q tests/test_components/test_eme.py
  • poetry run ruff check tidy3d/components/data/monitor_data.py tests/test_data/test_monitor_data.py

Results:

  • tests/test_data/test_monitor_data.py: 354 passed
  • tests/test_plugins/test_mode_solver.py: 49 passed
  • tests/test_components/test_eme.py: 18 passed

Known Limitations

  • Benchmark scenarios are synthetic but chosen to cover small, EME-like preserved-dim, and memory-regression layouts.
  • tracemalloc measures Python allocator peaks (good for NumPy temporaries), not full process RSS.

Note

Medium Risk
Touches a performance-critical numerical kernel and reshapes how overlaps are computed; while API/output intent is preserved, errors in dimension handling or blocking could cause subtle numerical or shape regressions.

Overview
Replaces the ElectromagneticFieldData.outer_dot() hot path with a new streaming NumPy kernel (_outer_dot_numpy_kernel_streaming) that iterates over frequency/preserved dims and performs blockwise mode matrix products, avoiding materialization of large mode-pair broadcast tensors. New tunables such as OUTER_DOT_BLOCK_TARGET_BYTES control the blocking.

Adds a strict slicing helper (_slice_mode_xy_matrix) and updates the implementation to keep frequency ordering and mode-dimension semantics while reducing peak allocations. Extends tests/test_data/test_monitor_data.py with targeted parity tests covering preserved-dimension outputs and cases where the mode axis is not last.

Written by Cursor Bugbot for commit db498b4. This will update automatically on new commits.

@yaugenst-flex force-pushed the yaugenst-flex/FXC-5512-stream-outer-dot-overlap-kernel-to-reduce-runtime-and-memory branch from a48f2a2 to af528fa on February 9, 2026 14:03
@yaugenst-flex yaugenst-flex self-assigned this Feb 9, 2026
github-actions bot (Contributor) commented Feb 9, 2026

Diff Coverage

Diff: origin/develop...HEAD, staged and unstaged changes

  • tidy3d/components/data/monitor_data.py (90.9%): Missing lines 1099,1103,1107,1113,1122,1129,1172,1216,1255

Summary

  • Total: 99 lines
  • Missing: 9 lines
  • Coverage: 90%

tidy3d/components/data/monitor_data.py

Lines 1095-1111

  1095 
  1096         for dim, ind in zip(preserve_dims, preserve_index):
  1097             axis = dim_to_axis.get(dim)
  1098             if axis is None:
! 1099                 continue
  1100             indexed_dims.add(dim)
  1101             axis_size = arr.shape[axis]
  1102             if axis_size == 1:
! 1103                 idx[axis] = 0
  1104             elif ind < axis_size:
  1105                 idx[axis] = ind
  1106             else:
! 1107                 raise ValueError(
  1108                     "Cannot broadcast preserve dimension in streaming outer_dot kernel."
  1109                 )
  1110 
  1111         freq_axis = dim_to_axis.get(freq_dim)

Lines 1109-1117

  1109                 )
  1110 
  1111         freq_axis = dim_to_axis.get(freq_dim)
  1112         if freq_axis is None:
! 1113             raise ValueError("Frequency dimension missing in streaming outer_dot kernel.")
  1114         indexed_dims.add(freq_dim)
  1115         if arr.shape[freq_axis] == 1:
  1116             idx[freq_axis] = 0
  1117         else:

Lines 1118-1126

  1118             idx[freq_axis] = freq_index
  1119 
  1120         mode_axis = dim_to_axis.get(mode_dim)
  1121         if mode_axis is None:
! 1122             raise ValueError("Mode dimension missing in streaming outer_dot kernel.")
  1123         idx[mode_axis] = mode_slice
  1124 
  1125         sliced = arr[tuple(idx)]
  1126         remaining_dims = [dim for dim in dims if dim not in indexed_dims]

Lines 1125-1133

  1125         sliced = arr[tuple(idx)]
  1126         remaining_dims = [dim for dim in dims if dim not in indexed_dims]
  1127         required_dims = {mode_dim, tangential_dims[0], tangential_dims[1]}
  1128         if len(remaining_dims) != 3 or set(remaining_dims) != required_dims:
! 1129             raise ValueError("Unexpected dimensions in streaming outer_dot kernel.")
  1130 
  1131         axes = (
  1132             remaining_dims.index(mode_dim),
  1133             remaining_dims.index(tangential_dims[0]),

Lines 1168-1176

  1168         coords[outer_dim_2] = data_array_temp_2.coords[outer_dim_2].to_numpy()
  1169         coords = {key: val for key, val in coords.items() if len(val.shape) != 0}
  1170 
  1171         if "f" not in coords:
! 1172             raise ValueError("Frequency coordinate missing in streaming outer_dot kernel.")
  1173 
  1174         dims = tuple(coords.keys())
  1175         shape = [len(val) for val in coords.values()]
  1176         dtype = np.result_type(

Lines 1212-1220

  1212         itemsize = np.dtype(np.result_type(fields_1[e_1].dtype, fields_2[e_1].dtype)).itemsize
  1213         # Heuristic: choose a mode block size that targets bounded temporary allocations,
  1214         # then clamp to practical limits to avoid tiny or overly large chunks.
  1215         if num_grid_points == 0:
! 1216             block_size = OUTER_DOT_BLOCK_MAX_SIZE
  1217         else:
  1218             block_size = int(OUTER_DOT_BLOCK_TARGET_BYTES // (num_grid_points * itemsize))
  1219             block_size = max(OUTER_DOT_BLOCK_MIN_SIZE, block_size)
  1220             block_size = min(OUTER_DOT_BLOCK_MAX_SIZE, block_size)
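
To make the heuristic concrete, a worked example (using the OUTER_DOT_BLOCK_TARGET_BYTES = 64 MiB constant quoted later in this thread; the MIN/MAX clamp values are not shown in the diff):

OUTER_DOT_BLOCK_TARGET_BYTES = 64 * 1024**2  # 64 MiB

num_grid_points = 128 * 128  # e.g. the mode_axis_middle benchmark grid
itemsize = 16                # complex128

# block size before clamping to [OUTER_DOT_BLOCK_MIN_SIZE, OUTER_DOT_BLOCK_MAX_SIZE]
block_size = OUTER_DOT_BLOCK_TARGET_BYTES // (num_grid_points * itemsize)
print(block_size)  # 256 modes per block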

Lines 1251-1259

  1251                             mode_slice=slice(i0, i0_end),
  1252                         )
  1253 
  1254                         if left_block.shape[1] != d_area.size:
! 1255                             raise ValueError(
  1256                                 "Tangential area shape mismatch in streaming outer_dot kernel."
  1257                             )
  1258 
  1259                         for i1 in range(0, num_modes_right, block_size_right):

@caseyflex (Contributor)

Thanks Yannick, this is a helpful PR -- it's good to dig into this sort of thing.

I did some benchmarking in frontend, backend, and EME. The current implementation has a memory issue for our primary use case (large transverse grids, many modes, few frequencies), but it's fixable without giving up the BLAS speedup.

Let M = num_modes, F = num_freqs, S = num_transverse_grid_points.

What the "streaming" restructuring actually trades off

The old approach loops over mode pairs (M^2 iterations), processing all frequencies per iteration. The new approach loops over frequencies (F iterations), processing all modes per iteration via BLAS. This "streams" over frequency -- removing F from per-iteration memory -- but vectorizes over modes -- adding M:

Loop | Per-iter work | Per-iter kernel memory
--- | --- | ---
Old (M^2 iterations) | O(F*S) element-wise | O(S*F) -- independent of M
New (F iterations) | O(M^2*S) BLAS gemm | O(M*S) -- independent of F

The speed improvement comes from replacing M^2 Python loop iterations with a single BLAS matmul -- same total arithmetic, but much better cache utilization, SIMD, and lower per-iteration overhead. For the primary use case (M >> F), this trades a large loop count for a small one.

But the memory trade-off goes the wrong way for the same case: when M >> F, replacing O(S*F) temporaries with O(M*S) temporaries is a net increase.
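
To make the structural difference concrete, a schematic of the two loop shapes under simplified assumptions (a single overlap term, fields already flattened to shape (M, F, S); names are illustrative):

import numpy as np

def overlap_old(left, right, dS):
    # M^2 Python iterations; the (F, S) temporaries are independent of M.
    M, F, _ = left.shape
    out = np.empty((F, M, M), dtype=left.dtype)
    for i in range(M):
        for j in range(M):
            out[:, i, j] = np.sum(left[i] * right[j] * dS, axis=-1)
    return out

def overlap_new(left, right, dS):
    # F iterations of one BLAS gemm; the (M, S) temporaries are independent of F.
    M, F, _ = left.shape
    out = np.empty((F, M, M), dtype=left.dtype)
    for f in range(F):
        out[f] = left[:, f, :] @ (right[:, f, :] * dS).T  # (M, S) @ (S, M)
    return out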

Benchmarks

At benchmark scale (kernel-only, excluding shared field extraction overhead):

Config | Old time | New time | Speedup | Old mem | New mem
--- | --- | --- | --- | --- | ---
M=30, S=2961, F=1 | 38 ms | 2.3 ms | 16x | 313 KiB | 7,155 KiB
M=30, S=4800, F=3 | 230 ms | 8.6 ms | 27x | 1,219 KiB | 11,537 KiB

At production scale (M=50, S=250,000, F=1), measured via tracemalloc: the current kernel peaks at ~760 MiB vs ~31 MiB for the old approach. The ~760 MiB comes from materializing all 4 weighted right-side matrices of shape (M, S) simultaneously inside the frequency loop. The speed gain is real (8.7s to 376ms), but the ~25x memory increase is prohibitive for large-grid workloads -- especially under parallel execution where per-worker memory matters.

Suggested fix

The memory increase is incidental, not fundamental to the BLAS approach. Two changes reduce peak memory to O(block*S) -- independent of M -- while retaining the BLAS speedup:

  1. Process the 4 cross-product terms one at a time instead of holding all weighted matrices simultaneously. Nearly free -- reduces peak from 760 to 191 MiB with negligible speed cost.
  2. Block both sides of the mode dimension -- compute (block, S) @ (S, block) chunks. The left block is a view into the source field data (zero alloc for tidy3d's standard (tan1, tan2, f, mode) dim ordering); only the weighted right block allocates block * S * 16 bytes per inner iteration. BLAS remains efficient because S is large.
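
A minimal sketch of the suggested blocked, term-by-term computation, using standalone arrays instead of the PR's slicing helpers (the full kernel would accumulate four such signed terms):

import numpy as np

def blocked_overlap_term(left, right, d_area, block=8):
    # left: (M1, S) view into the source fields; right: (M2, S); d_area: (S,).
    M1, M2 = left.shape[0], right.shape[0]
    out = np.zeros((M1, M2), dtype=np.result_type(left, right))
    for i0 in range(0, M1, block):
        left_blk = left[i0 : i0 + block]  # slice view: zero allocation
        for i1 in range(0, M2, block):
            # only this weighted block allocates: block * S * itemsize bytes
            weighted = right[i1 : i1 + block] * d_area
            out[i0 : i0 + block, i1 : i1 + block] = left_blk @ weighted.T
    return out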

Benchmarked at M=50, S=250,000, F=1 (tracemalloc peak):

Strategy | Time | Speedup | Peak alloc
--- | --- | --- | ---
Old (M^2 loop) | 8.7 s | -- | 31 MiB
Current impl (4 weighted alive) | 376 ms | 23x | 763 MiB
+ term-by-term | 403 ms | 22x | 191 MiB
+ term-by-term + block=8 | 1.6 s | 5.4x | 31 MiB
+ term-by-term + block=16 | 1.1 s | 8.0x | 61 MiB
+ term-by-term + block=32 | 587 ms | 14.8x | 122 MiB

At block=8, peak memory matches the old approach while still being 5x faster. The block size is a tunable speed-memory knob.

Memory regression test

I'd also suggest adding a production-scale memory regression test -- a tracemalloc-guarded call at M>=30, S>=50,000 that asserts peak kernel memory stays within a bound proportional to block * S, not M * S. This would have caught the current issue.
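
A sketch of what such a guard could look like (kernel stands in for the real streaming implementation; the bound's constant factor would need tuning):

import tracemalloc
import numpy as np

def test_outer_dot_kernel_memory_bound():
    M, S, BLOCK = 30, 50_000, 8
    rng = np.random.default_rng(0)
    left = rng.standard_normal((M, S)) + 1j * rng.standard_normal((M, S))
    right = rng.standard_normal((M, S)) + 1j * rng.standard_normal((M, S))
    d_area = rng.random(S)

    tracemalloc.start()
    kernel(left, right, d_area)  # placeholder for the streaming outer_dot kernel
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()

    itemsize = 16  # complex128
    # peak must scale with block * S, not M * S
    assert peak < 4 * BLOCK * S * itemsize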

Summary

The core idea (BLAS matmul over mode pairs) is the right direction and the speed gains are significant. I'd like to see the double-sided blocking optimization and a memory test before merging.

@cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.

@yaugenst-flex (Collaborator, Author)

Posted updated benchmark comparison with commit-hash columns for clarity:

  • af528fad = before blocked/adaptive kernel changes ("current" in earlier discussion)
  • 287209c0 = after blocked/adaptive kernel changes (final)

Environment:

  • local macOS
  • median walltime over 3 runs
  • peak allocation from tracemalloc during kernel call

Case | Layout | Legacy wall (s) | af528fad wall (s) | 287209c0 wall (s) | af528fad peak alloc | 287209c0 peak alloc
--- | --- | --- | --- | --- | --- | ---
small (48x48, f=3, m=12x12) | mode-last | 0.027 | 0.00244 | 0.00245 | 744.9 KiB | 745.1 KiB
eme_like (64x64, f=3, sweep=4, cell=6, m=20x20) | mode-last | 8.230 | 0.11375 | 0.10805 | 2.0 MiB | 2.0 MiB
mode_axis_middle (128x128, f=2, m=72x72) | mode-middle | 2.434 | 0.08209 | 0.10856 | 162.5 MiB | 50.4 MiB

Key point on regression case:

  • In mode_axis_middle, af528fad shows the high temporary-allocation footprint (162.5 MiB).
  • 287209c0 reduces that to 50.4 MiB (~3.2x lower), which is the intended memory fix from the PR discussion.

Aggregate runtime (sum of listed cases):

  • af528fad: 0.198s
  • 287209c0: 0.219s
  • Overall: ~0.90x vs af528fad (tradeoff accepted to remove memory spike on mode-middle layout).

@caseyflex (Contributor) left a comment

I think the speed-memory tradeoff is good overall. And I think this is an algorithmic improvement even when we do try to keep memory usage low.

A couple more comments:

  • This will conflict heavily with @dmarek-flex's outer dot colocation PR. Whoever merges second will need to carefully incorporate the logic improvements from the other.
  • I'd still like a unit test that explicitly checks that the memory usage is as expected. This could be here or in the backend.
  • Do we need to update any memory estimation logic?
  • Can we lower OUTER_DOT_BLOCK_TARGET_BYTES to 8, and make the min block size 1 instead of 8? This may be safer for parallel execution.
  • I also talked with @dmarek-flex about this, but maybe you have thoughts -- you keep the old _outer_fn_summation because it is more general (it takes an arbitrary function) and is used elsewhere, but that flexibility comes at the cost of performance. Do you have any ideas about how to handle all use cases with minimal code duplication? (One possibility is sketched after the PR link below.)

here is the other PR:
#3100
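
One way that dispatch could be kept in a single entry point, sketched with assumed signatures (the actual interfaces of _outer_fn_summation and the streaming kernel differ):

def _outer_apply(fields_1, fields_2, fn=None, block_size=8):
    # Illustrative: route the common overlap (dot) case to the fast BLAS
    # streaming kernel; keep the general fn-based summation for arbitrary
    # element-wise functions used elsewhere.
    if fn is None:
        return _outer_dot_numpy_kernel_streaming(fields_1, fields_2, block_size=block_size)
    return _outer_fn_summation(fields_1, fields_2, fn)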

        tangential_dims=tangential_dims,
        mode_slice=slice(i1, i1_end),
    )
    weighted_right = right_block * d_area
Review comment (Contributor):

maybe this can be moved out of the inner loop, by moving it to the left factor or swapping loop ordering
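
One way the hoist could look, with standalone arrays in place of the PR's slicing helper (illustrative):

import numpy as np

def blocked_term_hoisted(left, right, d_area, block=8):
    # Weight the left block once per outer iteration instead of re-weighting
    # each right block inside the inner loop; the result is identical because
    # the area weights multiply the integrand either way.
    M1, M2 = left.shape[0], right.shape[0]
    out = np.zeros((M1, M2), dtype=np.result_type(left, right))
    for i0 in range(0, M1, block):
        weighted_left = left[i0 : i0 + block] * d_area  # one temporary per outer iter
        for i1 in range(0, M2, block):
            out[i0 : i0 + block, i1 : i1 + block] = weighted_left @ right[i1 : i1 + block].T
    return out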

# Threshold for cos(theta) to avoid unphysically large amplitudes near grazing angles
COS_THETA_THRESH = 1e-5
MODE_INTERP_EXTRAPOLATION_TOLERANCE = 1e-2
OUTER_DOT_BLOCK_TARGET_BYTES = 64 * 1024**2
Review comment (Contributor):

Actually, maybe we can use a simpler logic for block size. We could just have some default block size and add it as an optional parameter to the private outer dot implementation function
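
For illustration, the simpler tunable could look like this (signature and default are assumptions, not the PR's code):

DEFAULT_OUTER_DOT_BLOCK_SIZE = 8  # assumed default; a direct speed-memory knob

def _outer_dot_streaming(left, right, d_area, block_size: int = DEFAULT_OUTER_DOT_BLOCK_SIZE):
    # Sketch: expose the mode block size as a plain optional parameter,
    # replacing the OUTER_DOT_BLOCK_TARGET_BYTES byte-budget heuristic.
    ...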

for sign, left_comp, right_comp in term_specs:
    for i0 in range(0, num_modes_left, block_size_left):
        i0_end = min(i0 + block_size_left, num_modes_left)
        left_block = ElectromagneticFieldData._slice_mode_xy_matrix(
Review comment (Contributor):

Not sure we need to optimize further now, but maybe we don't need to slice the data 4 times here

@dmarek-flex (Contributor)

Correct me if I am wrong, but I think the most effort here is spent figuring out the dimensions and their order and prepping the arrays for the matrix multiply. So I think this optimization could be easily added to the numpy part of my branch.

def _outer_dot_numpy(
    E1: tuple[np.ndarray, np.ndarray],
    H1: tuple[np.ndarray, np.ndarray],
    E2: tuple[np.ndarray, np.ndarray],
    H2: tuple[np.ndarray, np.ndarray],
    dS: tuple[np.ndarray, np.ndarray],
    conjugate: bool = False,
    bidirectional: bool = True,
) -> np.ndarray:
    """Compute pairwise modal overlap matrix.

    Computes all elements of the overlap matrix S[i,j] = <mode_i | mode_j>.
    By default computes the bidirectional overlap: 1/4 * integral(E1 x H2 + H1 x E2) dS.
    With bidirectional=False, computes just: 1/2 * integral(E1 x H2) dS.

    Parameters
    ----------
    E1 : tuple[np.ndarray, np.ndarray]
        Tangential E-field components (Eu, Ev) from first dataset, each shape ``(..., n_modes_1, nu, nv)``.
        Mode index is at -3, spatial dims at -2, -1.
    H1 : tuple[np.ndarray, np.ndarray]
        Tangential H-field components (Hu, Hv) from first dataset, each shape ``(..., n_modes_1, nu, nv)``.
    E2 : tuple[np.ndarray, np.ndarray]
        Tangential E-field components (Eu, Ev) from second dataset, each shape ``(..., n_modes_2, nu, nv)``.
    H2 : tuple[np.ndarray, np.ndarray]
        Tangential H-field components (Hu, Hv) from second dataset, each shape ``(..., n_modes_2, nu, nv)``.
    dS : tuple[np.ndarray, np.ndarray]
        Area elements at the two Yee grid locations, each shape ``(nu, nv)``.
        ``dS[0]`` is for Eu/Hv location, ``dS[1]`` for Ev/Hu location.
    conjugate : bool
        If True, conjugate the first set of fields (E1, H1) before computing overlap.
    bidirectional : bool
        If True (default), computes symmetric overlap 1/4 * (E1 x H2 + H1 x E2).
        If False, computes just 1/2 * (E1 x H2).

    Returns
    -------
    np.ndarray
        Overlap matrix with shape ``(..., n_modes_1, n_modes_2)``.
    """
    E1u, E1v = E1
    H1u, H1v = H1
    E2u, E2v = E2
    H2u, H2v = H2
    assert E1u.shape == H1v.shape
    assert E1v.shape == H1u.shape
    assert E2u.shape == H2v.shape
    assert E2v.shape == H2u.shape
    # Get number of modes and broadcast shape
    n_modes_1 = E1u.shape[-3]
    n_modes_2 = E2u.shape[-3]
    broadcast_shape = E1u.shape[:-3]
    dtype = E1u.dtype
    # Initialize output matrix
    S = np.zeros((*broadcast_shape, n_modes_1, n_modes_2), dtype=dtype)
    # Conjugate outside loop to avoid copies
    if conjugate:
        E1u, E1v = np.conj(E1u), np.conj(E1v)
        H1u, H1v = np.conj(H1u), np.conj(H1v)
    # Compute all elements of the overlap matrix
    # Use slices instead of fancy indexing for E2/H2 to avoid copies
    for i in range(n_modes_1):
        E1_i = (E1u[..., i : i + 1, :, :], E1v[..., i : i + 1, :, :])
        H1_i = (H1u[..., i : i + 1, :, :], H1v[..., i : i + 1, :, :])
        # E2/H2 use all modes, so just pass them directly (no indexing needed)
        S[..., i, :] = _dot_numpy(E1_i, H1_i, E2, H2, dS, False, bidirectional)
    return S

@caseyflex (Contributor)

> Correct me if I am wrong, but I think the most effort here is spent figuring out the dimensions and their order and prepping the arrays for the matrix multiply. So I think this optimization could be easily added to the numpy part of my branch.

Yeah, I'm not completely sure. It seems like a mix of that and the actual performance gain from changing the Python-looping / BLAS matrix-multiply tradeoff. But I also didn't study your handling in detail. So I think it would be critical to have some unit tests that at least control the memory of the implementation, since there's just so much that can go wrong here -- one extra line of Python can double the memory usage.
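
To make that last point concrete: at production scale a single extra (M, S) temporary is ~200 MB, so one stray line really can double the peak (minimal example, sizes from the benchmarks above):

import numpy as np

M, S = 50, 250_000
a = np.ones((M, S), dtype=np.complex128)  # ~200 MB
w = np.ones(S)

b = a * w                 # allocates a second full (M, S) array: peak roughly doubles
np.multiply(a, w, out=a)  # in-place variant: no extra (M, S) temporary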
