asm clean up and stack size reduction#209
Conversation
There was a problem hiding this comment.
Pull request overview
This PR cleans up and optimises the VX4B filter assembly implementations in lib_xcore_math, primarily by reducing stack usage, and adjusts tests to match the VX4B numerical behaviour and fix a shift-handling bug in the FIR S32 tests.
Changes:
- Reduce
NSTACKWORDSand adjust stack layouts for VX4B filter assembly routines (filter_fir_s32,filter_fir_s16, biquad variants, and support routines) to minimise stack usage and remove old tool-annotation clutter. - Fix the S32 FIR test’s expected-value computation for
shift == 0and relax several S16 FIR tests on__VX4B__to allow small LSB differences. - Clean up VX4B VPU usage in various filter assembly files (use of
lafor constants, updated mask and pointer handling, consistent.globl/metadata directives).
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/filter_tests/src/filter/test_filter_fir_s32.c | Adjusts expected-value logic to treat shift == 0 as a non-shift (avoiding undefined (1LL << -1)), fixing the S32 FIR test bug. |
| tests/filter_tests/src/filter/test_filter_fir_s16.c | For __VX4B__, replaces exact equality checks with small INT16_WITHIN tolerances to account for VX4B rounding/saturation differences. |
| lib_xcore_math/src/arch/vx4b/filter/vect_s32_convolve_valid.S | Reduces stack usage and removes translation warnings while keeping the convolution loop and tail handling logic intact. |
| lib_xcore_math/src/arch/vx4b/filter/push_sample_up_s16.S | Shrinks the stack frame, fixes prologue/epilogue register save/restore offsets, and updates comments and branch usage for VX4B. |
| lib_xcore_math/src/arch/vx4b/filter/push_sample_down_s16.S | Similar stack-frame reduction and cleanup for the “push sample down” helper, with corrected save/restore and VPU loop structure. |
| lib_xcore_math/src/arch/vx4b/filter/filter_fir_s32.S | Reduces stack usage and cleans up the VPU setup, tail processing, and accumulator-combination logic, but currently has an uninitialised s4 used as a shift amount in the left-shift path. |
| lib_xcore_math/src/arch/vx4b/filter/filter_fir_s16.S | Refactors stack layout and adds extra vector scratch regions for VX4B, but currently computes a stack-based pointer (s7) that underflows the frame and is then used by VPU load/store instructions. |
| lib_xcore_math/src/arch/vx4b/filter/filter_biquad_sat_s32.S | Tightens the stack frame and vector scratch allocation while preserving the biquad-saturated processing pipeline and state-update logic. |
| lib_xcore_math/src/arch/vx4b/filter/filter_biquad_s32.S | Minimises stack usage and simplifies the non-saturating biquad filter implementation, updating VPU constant loads and state propagation. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
@copilot please review |
|
@andrewstanfordjason I've opened a new pull request, #210, to work on those changes. Once the pull request is ready, I'll request review from you. |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
This PR cleans up the lib_xcore_math filter code for vx4b and reduces the stack usage to its minimum. It also fixes a bug in the filter tests.