Change MTA logic for foreach_copy #2455
base: main
Conversation
Pull request overview
This PR aligns XPU's foreach_copy operator with CUDA's logic for selecting the Multi-Tensor Apply (MTA) fast path, enabling dtype conversions within the MTA path rather than falling back to slower operations.
Key changes:
- Introduces can_use_fast_route_for_copy to match CUDA's fast path selection logic
- Implements dtype conversion support via CopyFunctor and the Copy struct
- Adds nested dispatch logic to handle different source and destination types
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| src/ATen/native/xpu/sycl/ForeachCopyKernels.cpp | Replaces Identity operation with Copy functor supporting dtype conversion; implements CopyFunctor for MTA path |
| src/ATen/native/xpu/ForeachOpList.cpp | Adds can_use_fast_route_for_copy function to enable fast path for dtype conversions |
I reviewed the issue. It should not be a real issue, right? If so, why do we need this kind of change? What's the performance benefit?
@EikanWang Without these changes, foreach_copy will still run and produce correct results. However, performance may be somewhat worse, because we launch multiple kernels instead of a single kernel covering multiple tensors. Given that we already use the MultiTensorApply approach in other ops, and partially in foreach_copy, it makes sense to take the fast path wherever possible; it is the more performant route, and I see no reason not to implement it to the fullest extent for each op. Another minor point: without this change, we would need more complicated conditions to predict when to expect the fast path while porting tests from torch.
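For illustration only (this is not the PR's code): the slow path issues one copy kernel per tensor pair, while the fast path fuses the whole list into a single MTA launch. The fast-path call in the comment mirrors the snippet discussed below; helper names are assumptions.

#include <ATen/ATen.h>

// Hedged sketch of the slow path: one kernel launch per tensor pair.
void foreach_copy_slow_sketch(
    at::TensorList self,
    at::TensorList src,
    bool non_blocking) {
  for (size_t i = 0; i < self.size(); ++i) {
    self[i].copy_(src[i], non_blocking);
  }
}

// Fast path (conceptually): all tensors are chunked and handled by one fused
// launch, e.g.
//   multi_tensor_apply<2>(
//       tensor_lists,
//       CopyFunctor</*depth*/ 2, /*r_args_depth*/ 1, /*res_arg_index*/ 1>(),
//       Copy<dst_opmath_t, src_opmath_t>());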
AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND3(
    at::ScalarType::Half,
    at::ScalarType::BFloat16,
    at::ScalarType::Bool,
    src[0].scalar_type(),
    "foreach_tensor_copy",
    [&]() {
Why does this PR need the nested AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND3?
        /* depth */ 2,
        /* r_args_depth */ 1,
        /* res_arg_index */ 1>(),
    Copy<dst_opmath_t, src_opmath_t>());
Since it is a copy function, why are dst_opmath_t and src_opmath_t required?
It's because PyTorch's torch.copy_ (and the foreach version as well) supports copying between different dtypes. If we want to support such cases here, we need both the src and dst dtypes. That's also the reason we do a nested AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND3.
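A minimal sketch of that idea (the exact definition in this PR may differ, and special cases such as complex-to-real conversion are omitted): with both types as template parameters, the functor performs the cast itself, which the previous Identity op could not.

// Hedged sketch only; not the PR's actual Copy definition.
template <typename dst_opmath_t, typename src_opmath_t>
struct Copy {
  inline dst_opmath_t operator()(src_opmath_t src_val) const {
    // The cast between source and destination opmath types happens here,
    // on the fast path, instead of in a separate conversion kernel.
    return static_cast<dst_opmath_t>(src_val);
  }
};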
I suppose you are addressing the above comment.
// Outer dispatch
AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND3(
    at::ScalarType::Half,
    at::ScalarType::BFloat16,
    at::ScalarType::Bool,
    self[0].scalar_type(),
    "foreach_tensor_copy",
    [&]() {
      using dst_t = scalar_t;
      using dst_opmath_t = at::opmath_type<dst_t>;
      // Inner dispatch (newly added)
      AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND3(
          at::ScalarType::Half,
          at::ScalarType::BFloat16,
          at::ScalarType::Bool,
          src[0].scalar_type(),
          "foreach_tensor_copy",

The outer dispatch has decided the data type. Why is the inner dispatch necessary? @PawelSwider2000
Here, the dispatch needs to decide based on both the src and dst datatypes, which is the reason for the outer and inner dispatch. We want dispatching such as:
AT_DISPATCH_CASE((float32, float32))
AT_DISPATCH_CASE((float32, bfloat16))
etc...
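To make that concrete, a sketch of how the nested dispatch resolves both types; the kernel-launch helper name is hypothetical, only the dispatch structure mirrors the snippet above.

AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND3(
    at::ScalarType::Half, at::ScalarType::BFloat16, at::ScalarType::Bool,
    self[0].scalar_type(), "foreach_tensor_copy", [&]() {
      using dst_t = scalar_t;  // fixed by the outer dispatch
      AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND3(
          at::ScalarType::Half, at::ScalarType::BFloat16, at::ScalarType::Bool,
          src[0].scalar_type(), "foreach_tensor_copy", [&]() {
            using src_t = scalar_t;  // inner scalar_t shadows the outer one
            // One instantiation per (dst_t, src_t) pair, e.g. (float, float),
            // (float, BFloat16), (Half, int64_t), ...
            launch_foreach_copy_kernel<dst_t, src_t>(tensor_lists);  // hypothetical helper
          });
    });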
Makes sense.
This PR only replaces the

#pragma unroll
for (int ii = 0; ii < kILP; ++ii) {
  r_args[0][ii] =
      static_cast<T>(op(static_cast<opmath_t>(r_args[0][ii])));
}
load_store(args[res_arg_index], r_args[0], i, 0);
}

However, the … @PawelSwider2000 any performance data?
Direct replacing … However, some performance comparison will be useful here. I will prepare it.
I made a performance analysis for different numbers of tensors, shapes, and (src, dst) dtypes. On average we see a perf increase, which is bigger for tensor lists with many small tensors and smaller for tensor lists with a few big tensors. On average we get about a 2x speedup, but it varies with the size and number of tensors: it can be up to around 6x faster than the current implementation, and about 2x slower in some specific cases. It looks like copy perf differs based on shape, e.g.: @EikanWang Do you know somebody who could help understand why we see significantly worse performance in some cases?
Yes. However, see torch-xpu-ops/src/ATen/native/xpu/sycl/ForeachFunctors.h, lines 151 to 154 (de582ba).
I need to take time to understand the behavior, because it cannot be explained well yet. I may be missing something.
torch-xpu-ops/src/ATen/native/xpu/sycl/ForeachFunctors.h, lines 151 to 154 (de582ba)
In this code the …
@EikanWang I made some changes to speed up performance. The changes consist of simplifying … As for results, I will focus on large tensors, since for small ones the absolute differences are tiny, and in most cases (especially with many inputs) the new implementation is much faster anyway. Generally there is a visible speedup, and for some types it is really big. However, copying from int64 to other, smaller types is a bit slower; that looks like a small drawback compared to the massive speedup for the other types.
Performance outliers, please check!
On XPU, the foreach_copy operator follows logic similar to other foreach ops for switching between multi-tensor apply and the slower mode. For CUDA the behavior differs from the other foreach ops: it enables the MTA path even when the source and destination datatypes are different. This commit aligns the XPU logic with CUDA. To do so:
- Add can_use_fast_route_for_copy, matching CUDA's logic for selecting the fast MTA path.
- Add CopyFunctor and modify Copy to handle different dtypes and perform casts instead of the Identity operation.
- Change the foreach_copy_list_kernel_ dispatch logic to take into account both src and dst types.

PR fixes issues in: #2313
Tests for the issue are being enabled in the issue mentioned above, and we can also use the existing foreach_copy tests.
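For readers following along, a hedged sketch of what such a check can look like (the actual helper in this PR may use different conditions and utilities): unlike the generic foreach fast-route check, it deliberately does not require matching dtypes, since the MTA copy kernel now casts per element.

#include <ATen/ATen.h>

// Hedged sketch only; name and condition set are assumptions, not the PR's code.
static bool can_use_fast_route_for_copy_sketch(
    at::TensorList self,
    at::TensorList src) {
  for (size_t i = 0; i < self.size(); ++i) {
    const auto& d = self[i];
    const auto& s = src[i];
    if (d.device() != s.device() ||            // must live on the same device
        d.sizes() != s.sizes() ||              // element-wise mapping must match
        !d.is_non_overlapping_and_dense() ||   // MTA needs dense, flat memory
        !s.is_non_overlapping_and_dense() ||
        d.strides() != s.strides()) {
      return false;  // fall back to the slow per-tensor path
    }
    // Note: no d.scalar_type() == s.scalar_type() check here; mismatched
    // dtypes are handled on the fast path by the Copy functor.
  }
  return true;
}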