FXC-5512 perf: stream outer_dot overlap kernel to reduce runtime and memory #3258
base: develop
Conversation
a48f2a2 to af528fa
Diff Coverage
Diff: origin/develop...HEAD, staged and unstaged changes

Summary

tidy3d/components/data/monitor_data.py

Lines 1095-1111

```
  1095
  1096     for dim, ind in zip(preserve_dims, preserve_index):
  1097         axis = dim_to_axis.get(dim)
  1098         if axis is None:
! 1099             continue
  1100         indexed_dims.add(dim)
  1101         axis_size = arr.shape[axis]
  1102         if axis_size == 1:
! 1103             idx[axis] = 0
  1104         elif ind < axis_size:
  1105             idx[axis] = ind
  1106         else:
! 1107             raise ValueError(
  1108                 "Cannot broadcast preserve dimension in streaming outer_dot kernel."
  1109             )
  1110
  1111     freq_axis = dim_to_axis.get(freq_dim)
```

Lines 1109-1117

```
  1109             )
  1110
  1111     freq_axis = dim_to_axis.get(freq_dim)
  1112     if freq_axis is None:
! 1113         raise ValueError("Frequency dimension missing in streaming outer_dot kernel.")
  1114     indexed_dims.add(freq_dim)
  1115     if arr.shape[freq_axis] == 1:
  1116         idx[freq_axis] = 0
  1117     else:
```

Lines 1118-1126

```
  1118         idx[freq_axis] = freq_index
  1119
  1120     mode_axis = dim_to_axis.get(mode_dim)
  1121     if mode_axis is None:
! 1122         raise ValueError("Mode dimension missing in streaming outer_dot kernel.")
  1123     idx[mode_axis] = mode_slice
  1124
  1125     sliced = arr[tuple(idx)]
  1126     remaining_dims = [dim for dim in dims if dim not in indexed_dims]
```

Lines 1125-1133

```
  1125     sliced = arr[tuple(idx)]
  1126     remaining_dims = [dim for dim in dims if dim not in indexed_dims]
  1127     required_dims = {mode_dim, tangential_dims[0], tangential_dims[1]}
  1128     if len(remaining_dims) != 3 or set(remaining_dims) != required_dims:
! 1129         raise ValueError("Unexpected dimensions in streaming outer_dot kernel.")
  1130
  1131     axes = (
  1132         remaining_dims.index(mode_dim),
  1133         remaining_dims.index(tangential_dims[0]),
```

Lines 1168-1176

```
  1168     coords[outer_dim_2] = data_array_temp_2.coords[outer_dim_2].to_numpy()
  1169     coords = {key: val for key, val in coords.items() if len(val.shape) != 0}
  1170
  1171     if "f" not in coords:
! 1172         raise ValueError("Frequency coordinate missing in streaming outer_dot kernel.")
  1173
  1174     dims = tuple(coords.keys())
  1175     shape = [len(val) for val in coords.values()]
  1176     dtype = np.result_type(
```

Lines 1212-1220

```
  1212     itemsize = np.dtype(np.result_type(fields_1[e_1].dtype, fields_2[e_1].dtype)).itemsize
  1213     # Heuristic: choose a mode block size that targets bounded temporary allocations,
  1214     # then clamp to practical limits to avoid tiny or overly large chunks.
  1215     if num_grid_points == 0:
! 1216         block_size = OUTER_DOT_BLOCK_MAX_SIZE
  1217     else:
  1218         block_size = int(OUTER_DOT_BLOCK_TARGET_BYTES // (num_grid_points * itemsize))
  1219         block_size = max(OUTER_DOT_BLOCK_MIN_SIZE, block_size)
  1220         block_size = min(OUTER_DOT_BLOCK_MAX_SIZE, block_size)
```

Lines 1251-1259

```
  1251             mode_slice=slice(i0, i0_end),
  1252         )
  1253
  1254         if left_block.shape[1] != d_area.size:
! 1255             raise ValueError(
  1256                 "Tangential area shape mismatch in streaming outer_dot kernel."
  1257             )
  1258
  1259         for i1 in range(0, num_modes_right, block_size_right):
```
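For a sense of the numbers the block-size heuristic above produces, here is a minimal worked example, assuming complex128 fields at production-scale grid size; the values of `OUTER_DOT_BLOCK_MIN_SIZE` and `OUTER_DOT_BLOCK_MAX_SIZE` below are illustrative stand-ins, not the module's exact constants:

```python
import numpy as np

# Stand-ins for the module constants (the min/max values here are assumptions).
OUTER_DOT_BLOCK_TARGET_BYTES = 64 * 1024**2
OUTER_DOT_BLOCK_MIN_SIZE = 8
OUTER_DOT_BLOCK_MAX_SIZE = 64

num_grid_points = 250_000                    # S: transverse grid points
itemsize = np.dtype(np.complex128).itemsize  # 16 bytes per element

# Same arithmetic as lines 1218-1220 above.
block_size = int(OUTER_DOT_BLOCK_TARGET_BYTES // (num_grid_points * itemsize))
block_size = max(OUTER_DOT_BLOCK_MIN_SIZE, block_size)
block_size = min(OUTER_DOT_BLOCK_MAX_SIZE, block_size)
print(block_size)  # 16 -> each (block_size, S) temporary is ~64 MB of complex128
```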
Thanks Yannick, this is a helpful PR; it's worth digging into this sort of thing. I did some benchmarking in frontend, backend, and EME. The current implementation has a memory issue for our primary use case (large transverse grids, many modes, few frequencies), but it's fixable without giving up the BLAS speedup. Let M = num_modes, F = num_freqs, S = num_transverse_grid_points.

What the "streaming" restructuring actually trades off

The old approach loops over mode pairs (M^2 iterations), processing all frequencies per iteration. The new approach loops over frequencies (F iterations), processing all modes per iteration via BLAS. This "streams" over frequency -- removing F from per-iteration memory -- but vectorizes over modes -- adding M:

The speed improvement comes from replacing M^2 Python loop iterations with a single BLAS matmul -- same total arithmetic, but much better cache utilization, SIMD, and lower per-iteration overhead. For the primary use case (M >> F), this trades a large loop count for a small one. But the memory trade-off goes the wrong way for the same case: when M >> F, replacing O(S*F) temporaries with O(M*S) temporaries is a net increase.

Benchmarks

At benchmark scale (kernel-only, excluding shared field extraction overhead):

At production scale (M=50, S=250,000, F=1), measured via tracemalloc: the current kernel peaks at ~760 MiB vs ~31 MiB for the old approach. The ~760 MiB comes from materializing all 4 weighted right-side matrices of shape (M, S) simultaneously inside the frequency loop. The speed gain is real (8.7 s to 376 ms), but the ~25x memory increase is prohibitive for large-grid workloads -- especially under parallel execution where per-worker memory matters.

Suggested fix

The memory increase is incidental, not fundamental to the BLAS approach. Two changes reduce peak memory to O(block*S) -- independent of M -- while retaining the BLAS speedup:

Benchmarked at M=50, S=250,000, F=1 (tracemalloc peak):

At block=8, peak memory matches the old approach while still being 5x faster. The block size is a tunable speed-memory knob.

Memory regression test

I'd also suggest adding a production-scale memory regression test -- a tracemalloc-guarded call at M>=30, S>=50,000 that asserts peak kernel memory stays within a bound proportional to block * S, not M * S. This would have caught the current issue.

Summary

The core idea (BLAS matmul over mode pairs) is the right direction and the speed gains are significant. I'd like to see the double-sided blocking optimization and a memory test before merging.
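A minimal sketch of the double-sided blocking suggested above (the field names and the two-term overlap form are placeholders, not the PR's actual kernel): the area weights are folded into the small left block once per outer iteration, and both mode axes are processed in blocks, so peak temporaries scale with block * S rather than M * S while the inner products remain BLAS matmuls.

```python
import numpy as np


def blocked_overlap(e1x, e1y, h2x, h2y, d_area, block=8):
    """result[i, j] = sum_s (e1x[i, s] * h2y[j, s] - e1y[i, s] * h2x[j, s]) * d_area[s].

    Inputs are (num_modes, S) arrays plus d_area of shape (S,). Both mode axes are
    processed in blocks, so peak temporaries are O(block * S), independent of num_modes.
    """
    out = np.zeros((e1x.shape[0], h2x.shape[0]), dtype=np.result_type(e1x, h2y))
    for i0 in range(0, e1x.shape[0], block):
        # Fold the area weights into the small left block once per outer iteration.
        e1x_b = e1x[i0:i0 + block] * d_area
        e1y_b = e1y[i0:i0 + block] * d_area
        for j0 in range(0, h2x.shape[0], block):
            # Two (block, S) @ (S, block) BLAS calls per block pair.
            out[i0:i0 + block, j0:j0 + block] = (
                e1x_b @ h2y[j0:j0 + block].T - e1y_b @ h2x[j0:j0 + block].T
            )
    return out
```

At block=8 with S=250,000 complex128 values, each weighted (block, S) temporary is about 32 MB, which lines up with the block=8 numbers quoted above.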
Cursor Bugbot has reviewed your changes and found 2 potential issues.
Posted updated benchmark comparison with commit-hash columns for clarity:
Environment:
Key point on regression case:
Aggregate runtime (sum of listed cases):
caseyflex left a comment
I think the speed-memory tradeoff is good overall. And I think this is an algorithmic improvement even when we do try to keep memory usage low.
A couple more comments:
- This will conflict heavily with @dmarek-flex 's outer dot colocation PR. Whoever merges second will need to carefully incorporate the logic improvements from the other
- I'd still like a unit test that explicitly checks that the memory usage is as expected. This could be here or backend.
- Do we need to update any memory estimation logic?
- Can we lower OUTER_DOT_BLOCK_TARGET_BYTES to 8, and make the min block size 1 instead of 8? This may be safer for parallel execution.
- I also talked with @dmarek-flex about this, but maybe you have thoughts -- you keep the old _outer_fn_summation because it is more general (it takes an arbitrary function) and is used elsewhere, but that flexibility comes at the cost of performance. Do you have any ideas about how to handle all use cases with minimal code duplication?
here is the other PR:
#3100
```python
    tangential_dims=tangential_dims,
    mode_slice=slice(i1, i1_end),
)
weighted_right = right_block * d_area
```
maybe this can be moved out of the inner loop, either by folding it into the left factor or by swapping the loop ordering
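A self-contained toy check of that equivalence (array names and sizes here are made up, not the PR's variables): since d_area acts as a diagonal weight, it can be folded into either factor, so weighting the left block once per outer iteration gives the same block products as re-weighting every right-side block.

```python
import numpy as np

rng = np.random.default_rng(0)
num_points, block_left, block_right, num_right = 1000, 8, 8, 32
left_block = rng.standard_normal((block_left, num_points))
right_modes = rng.standard_normal((num_right, num_points))
d_area = rng.random(num_points)

# Current form: weight each right-side block inside the inner loop.
inner = [left_block @ (right_modes[i1:i1 + block_right] * d_area).T
         for i1 in range(0, num_right, block_right)]

# Hoisted form: weight the left block once, outside the inner loop.
weighted_left = left_block * d_area
hoisted = [weighted_left @ right_modes[i1:i1 + block_right].T
           for i1 in range(0, num_right, block_right)]

assert all(np.allclose(a, b) for a, b in zip(inner, hoisted))
```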
```python
# Threshold for cos(theta) to avoid unphysically large amplitudes near grazing angles
COS_THETA_THRESH = 1e-5
MODE_INTERP_EXTRAPOLATION_TOLERANCE = 1e-2
OUTER_DOT_BLOCK_TARGET_BYTES = 64 * 1024**2
```
Actually, maybe we can use simpler logic for the block size. We could just have some default block size and add it as an optional parameter to the private outer dot implementation function.
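One possible shape for that, as a sketch (the function name, parameter, and default value here are assumptions, not the current signature):

```python
OUTER_DOT_DEFAULT_BLOCK_SIZE = 8  # assumed default, replacing the byte-budget heuristic


def _outer_dot_kernel(fields_1, fields_2, d_area, block_size: int | None = None):
    """Hypothetical private kernel: callers (and tests) can pass an explicit block size."""
    if block_size is None:
        block_size = OUTER_DOT_DEFAULT_BLOCK_SIZE
    ...  # blockwise mode products as in the streaming kernel
```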
```python
for sign, left_comp, right_comp in term_specs:
    for i0 in range(0, num_modes_left, block_size_left):
        i0_end = min(i0 + block_size_left, num_modes_left)
        left_block = ElectromagneticFieldData._slice_mode_xy_matrix(
```
Not sure we need to optimize further now, but maybe we don't need to slice the data 4 times here
Correct me if I am wrong, but I think most of the effort here is spent figuring out the dimensions and their order and prepping the arrays for the matrix multiply. So I think this optimization could easily be added to the numpy part of my branch (tidy3d/tidy3d/components/data/utils.py, lines 229 to 301 in e147035).
Yeah, I'm not completely sure. It seems to be a mix of that and the actual performance gain from changing the Python-looping / BLAS matrix-multiply tradeoff. But I also didn't study your handling in detail. So I think it would be critical to have some unit tests that at least control the memory usage of the implementation, as there's just so much that can go wrong here -- one extra line of Python can double the memory usage.
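For the memory-controlling test, a minimal sketch of the shape it could take, assuming a hypothetical fixture that builds two field-data objects at production-like size (the fixture name, sizes, and slack factor below are placeholders):

```python
import tracemalloc

import numpy as np


def test_outer_dot_peak_memory(make_field_data_pair):  # hypothetical fixture
    """Guard against O(num_modes * grid) temporaries in the outer_dot kernel."""
    data_1, data_2 = make_field_data_pair(num_modes=30, grid_points=50_000, num_freqs=1)

    tracemalloc.start()
    _ = data_1.outer_dot(data_2)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()

    # Bound proportional to block * S (not M * S); constants here are illustrative.
    block, grid_points, itemsize, slack = 8, 50_000, np.dtype(np.complex128).itemsize, 16
    assert peak < slack * block * grid_points * itemsize
```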
Summary

- Replaces the `outer_dot` hot path in `ElectromagneticFieldData.outer_dot(...)` with an adaptive streaming kernel:
  - `unblocked` path for smaller working sets.
  - `blocked` term-by-term path for large working sets to cap temporary allocations.
- `try/except` fallback to `_outer_fn_summation` from `outer_dot`.
- Output mode-pair coordinates are unchanged (`mode_index_0`, `mode_index_1`).

Why This Is Safe
Benchmark (Legacy vs Current vs Final)
Environment: local macOS, median of 3 runs per case.
Memory metric: `tracemalloc` peak allocations during kernel call.

Benchmark cases (results table omitted): (48x48, f=3, m=12x12); (64x64, f=3, sweep=4, cell=6, m=20x20); (128x128, f=2, m=72x72).

Notes:
- `mode_axis_middle` is the case that reproduces the current memory regression (the current unblocked path spikes to 162.5 MiB peak alloc); the final blocked path reduces that to 50.4 MiB (~3.2x lower) while still being ~22x faster than legacy.

Validation
Executed locally:
- `poetry run pytest -q tests/test_data/test_monitor_data.py`
- `poetry run pytest -q tests/test_plugins/test_mode_solver.py`
- `poetry run pytest -q tests/test_components/test_eme.py`
- `poetry run ruff check tidy3d/components/data/monitor_data.py tests/test_data/test_monitor_data.py`

Results:
- `tests/test_data/test_monitor_data.py`: 354 passed
- `tests/test_plugins/test_mode_solver.py`: 49 passed
- `tests/test_components/test_eme.py`: 18 passed

Known Limitations
- `tracemalloc` measures Python allocator peaks (good for NumPy temporaries), not full process RSS.

Note
Medium Risk
Touches a performance-critical numerical kernel and reshapes how overlaps are computed; while API/output intent is preserved, errors in dimension handling or blocking could cause subtle numerical or shape regressions.
Overview
Replaces the `ElectromagneticFieldData.outer_dot()` hot path with a new streaming NumPy kernel (`_outer_dot_numpy_kernel_streaming`) that iterates over frequency/preserved dims and performs blockwise mode matrix products to avoid materializing large mode-pair broadcast tensors, with new tunables like `OUTER_DOT_BLOCK_TARGET_BYTES`.

Adds a strict slicing helper (`_slice_mode_xy_matrix`) and updates the implementation to keep frequency ordering and mode-dimension semantics while reducing peak allocations. Extends `tests/test_monitor_data.py` with targeted parity tests covering preserved-dimension outputs and cases where the mode axis is not last.

Written by Cursor Bugbot for commit db498b4.
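For reference, the parity checks mentioned above can stay very small; a minimal, self-contained sketch of the idea (the explicit mode-pair loop below is a stand-in reference, not `_outer_fn_summation` itself, and the blockwise function only mirrors the streaming kernel's structure):

```python
import numpy as np


def reference_overlap(e1x, e1y, h2x, h2y, d_area):
    """Explicit mode-pair loop: slow but obviously correct."""
    out = np.empty((e1x.shape[0], h2x.shape[0]), dtype=complex)
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum((e1x[i] * h2y[j] - e1y[i] * h2x[j]) * d_area)
    return out


def blockwise_overlap(e1x, e1y, h2x, h2y, d_area, block=2):
    """Blockwise matmul form, structured like the streaming kernel."""
    out = np.empty((e1x.shape[0], h2x.shape[0]), dtype=complex)
    for i0 in range(0, e1x.shape[0], block):
        wx = e1x[i0:i0 + block] * d_area
        wy = e1y[i0:i0 + block] * d_area
        out[i0:i0 + block] = wx @ h2y.T - wy @ h2x.T
    return out


rng = np.random.default_rng(1)
shape = (5, 200)  # (modes, grid points); arbitrary small test sizes
e1x, e1y, h2x, h2y = (rng.standard_normal(shape) + 1j * rng.standard_normal(shape) for _ in range(4))
d_area = rng.random(shape[1])

assert np.allclose(blockwise_overlap(e1x, e1y, h2x, h2y, d_area),
                   reference_overlap(e1x, e1y, h2x, h2y, d_area))
```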