It should be quite simple (maybe even simpler than the current approach) to iterate the computation of $H$ for each value of $n$, and thence to iterate computation of $d^{(\ell)}$, $\mathfrak{D}^{(\ell)}$, or ${}_{s}Y_{\ell}$ for each value of $\ell$ separately. This would decrease the storage needed for the calculations, and simplify the reinterpretation of the data as matrices (for $d$ and $\mathfrak{D}$).
The key problem is computation of $H$. For a given value of $n$, to compute $H_{n}^{0,m}$ for all $m$, we need $H_{n-1}^{0,m}$ for all $m$; and to compute $H_{n}^{1,m}$ for all $m$, we need $H_{n+1}^{0,m}$ for all $m$. So, to compute all $H_{n}^{m',m}$ for all $m',m$, we need to provide storage for $H_{n}^{m',m}$, $H_{n-1}^{0,m}$, and $H_{n+1}^{0,m}$. As we iterate to the next $n$, we just move the $H_{n}^{0,m}$ to the storage for $H_{n-1}^{0,m}$, the $H_{n+1}^{0,m}$ into the $H_{n}^{0,m}$ slots, and the compute the new $H_{n+1}^{0,m}$ values. Of course, all this storage should be pre-allocated with the largest sizes that will ever be required, and views taken as necessary, with reshape called when matrices are needed.
Should this happen?
While there is much to be said for the elegance of this approach, it is not clear that it is the best idea. It still won't thread any better because of the recursion. And my use case involves a lot of stepping through entire sets of mode weights at different times. For $T$ times with $W$ mode weights at each time, that's a $W \times T$ matrix, and we step through all $W$ modes at a time, for each value in $T$, making just one trip through all the data. If we were to switch here, I would have to make $\sim\ell_{\mathrm{max}}$ separate trips through the data — one for each $\ell$ value.