
Conversation

@PawelSwider2000
Contributor

On XPU, the foreach_copy operator follows logic similar to the other foreach ops for switching between multi tensor apply (MTA) and the slower mode. On CUDA the behavior differs from the other foreach ops: the MTA path is enabled even when the source and destination datatypes are different. This commit aligns the XPU logic with CUDA.

To do so:

  • Create can_use_fast_route_for_copy, matching the CUDA logic for selecting the fast MTA path (see the sketch after this list).
  • Create CopyFunctor and modify Copy to handle different dtypes and perform casts instead of the Identity operation.
  • Change the foreach_copy_list_kernel_ dispatch logic to take both the src and dst types into account.
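
As a rough illustration of the first point (this is a sketch based on the description above and ATen's generic foreach fast-route checks, not necessarily the exact code in the diff), the copy-specific check keeps the usual restrictions but drops the dtype-equality requirement:

#include <ATen/ATen.h>

// Hypothetical sketch of can_use_fast_route_for_copy: keep the usual foreach
// fast-route restrictions (same device, same sizes, dense and non-overlapping
// tensors), but do not require that src and dst share a dtype, since the MTA
// kernel now performs the cast itself.
static bool can_use_fast_route_for_copy(
    at::TensorList dsts,
    at::TensorList srcs) {
  for (size_t i = 0; i < dsts.size(); ++i) {
    const auto& dst = dsts[i];
    const auto& src = srcs[i];
    if (dst.device() != src.device() || dst.sizes() != src.sizes() ||
        !dst.is_non_overlapping_and_dense() ||
        !src.is_non_overlapping_and_dense()) {
      return false;
    }
    // Intentionally no dtype equality check here.
  }
  return true;
}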

This PR fixes issues described in #2313.

Tests for the issue are being enabled in the issue mentioned above, and we can also use the existing foreach_copy tests.

Copilot AI review requested due to automatic review settings December 3, 2025 08:26
Contributor

Copilot AI left a comment

Pull request overview

This PR aligns XPU's foreach_copy operator with CUDA's logic for selecting the Multi-Tensor Apply (MTA) fast path, enabling dtype conversions within the MTA path rather than falling back to slower operations.

Key changes:

  • Introduces can_use_fast_route_for_copy to match CUDA's fast path selection logic
  • Implements dtype conversion support via CopyFunctor and Copy struct
  • Adds nested dispatch logic to handle different source and destination types

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| src/ATen/native/xpu/sycl/ForeachCopyKernels.cpp | Replaces Identity operation with Copy functor supporting dtype conversion; implements CopyFunctor for MTA path |
| src/ATen/native/xpu/ForeachOpList.cpp | Adds can_use_fast_route_for_copy function to enable fast path for dtype conversions |


@EikanWang
Contributor

I reviewed the issue. It should not be a real issue, right? If so, why do we need this kind of change? What's the performance benefit?

@PawelSwider2000
Contributor Author

I reviewed the issue. It should not be a real issue, right? If so, why do we need this kind of change? What's the performance benefit?

@EikanWang Without these changes, foreach_copy will still run and produce correct results. However, without the fix, performance may be somewhat worse as we will launch multiple kernels instead of a single kernel with multiple operations.

The fact that we already use the MultiTensorApply approach in other ops, and to some extent in foreach_copy, suggests that we benefit from running the fast path. It is the more performant way, and we should use it whenever possible. I see no reason not to apply it to the maximum extent possible for each op.

Another minor issue is that, without this change, we would need more sophisticated conditions to determine when to expect the fast path while porting tests from torch.

Comment on lines +112 to +118
AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND3(
    at::ScalarType::Half,
    at::ScalarType::BFloat16,
    at::ScalarType::Bool,
    src[0].scalar_type(),
    "foreach_tensor_copy",
    [&]() {
Contributor

Why does this PR need the nested AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND3?

/* depth */ 2,
/* r_args_depth */ 1,
/* res_arg_index */ 1>(),
Copy<dst_opmath_t, src_opmath_t>());
Contributor

Since it is a copy function, why are dst_opmath_t and src_opmath_t required?

Contributor Author

It's because PyTorch's copy (and its foreach version as well) supports copying between different dtypes; if we want to support such cases here, we need both the src and dst dtypes. That's also the reason why we are doing a nested AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND3.
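
For illustration, the two-type cast op could look roughly like this (a sketch following the naming used in this thread; the actual code may differ, e.g. in how complex-to-real conversions are handled):

// Sketch: unlike Identity, the copy op is parameterized on both the source and
// destination opmath types, so the dtype conversion happens inside the op.
template <typename dst_opmath_t, typename src_opmath_t>
struct Copy {
  inline dst_opmath_t operator()(const src_opmath_t& x) const {
    return static_cast<dst_opmath_t>(x);
  }
};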

Contributor

I suppose you are addressing the above comment.

https://github.com/intel/torch-xpu-ops/pull/2455/changes/BASE..18ab57c4ee1faefd76b003818ac20bd8c0aab103#diff-346dedcf59d4381e08b92e700ab3354c84bc13e5a13097ee55f7872cfd86e3a7R103-R108

  // Outer dispatch
  AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND3(
      at::ScalarType::Half,
      at::ScalarType::BFloat16,
      at::ScalarType::Bool,
      self[0].scalar_type(),
      "foreach_tensor_copy",
      [&]() {
        using dst_t = scalar_t;
        using dst_opmath_t = at::opmath_type<dst_t>;
        
        // Inner dispatch (newly added)
        AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND3(
            at::ScalarType::Half,
            at::ScalarType::BFloat16,
            at::ScalarType::Bool,
            src[0].scalar_type(),
            "foreach_tensor_copy",

The outer dispatch has decided the data type. Why is the inner dispatch necessary? @PawelSwider2000

Contributor Author

Here, the dispatch needs to decide based on both the src and dst datatypes, which is the reason for the outer and inner dispatch. We want dispatching like:

AT_DISPATCH_CASE((float32, float32))
AT_DISPATCH_CASE((float32, bfloat16))
etc...

Contributor

Makes sense.

@EikanWang
Contributor

EikanWang commented Dec 15, 2025

However, without the fix, performance may be somewhat worse as we will launch multiple kernels instead of a single kernel with multiple operations.

This PR only replaces the UnaryOpFunctor with CopyFunctor. So why would it impact the number of kernel launches? Per my understanding, the key change in CopyFunctor is to avoid a useless cast between scalar_t and opmath_type<scalar_t>.

#pragma unroll
        for (int ii = 0; ii < kILP; ++ii) {
          r_args[0][ii] =
              static_cast<T>(op(static_cast<opmath_t>(r_args[0][ii])));
        }
        load_store(args[res_arg_index], r_args[0], i, 0);
      }

However, the op introduces the opmath_type conversion again via Copy<dst_opmath_t, src_opmath_t>(). From a performance perspective, that does not look better than UnaryOpFunctor.

@PawelSwider2000 any performance data?

@PawelSwider2000
Contributor Author

Directly replacing UnaryOpFunctor with CopyFunctor does not by itself improve performance, but it can handle copying from one dtype to another, which allows us to use the MTA logic in more cases.

However, some performance comparison will be useful here. I will prepare one.

@PawelSwider2000
Contributor Author

PawelSwider2000 commented Dec 17, 2025

I did some performance analysis for different numbers of tensors, shapes, and (src, dst) dtypes. On average we see a perf increase, which is bigger for tensorlists with many small tensors and smaller for tensorlists with a few big tensors.

On average we get about a 2x speedup, although it varies with the size and number of tensors. It can be up to around 6x faster than the current implementation, but also up to 2x slower in some specific cases.

It looks like copy perf also differs based on shape, e.g.:
int32->int64 for (2048, 2048): the speedup is 0.63 (we are much slower), but
int32->int64 for (1024,): the speedup is the highest, equal to 5.64 (a massive improvement).

@EikanWang Do you know somebody who could help us understand why performance is significantly worse in some cases?

@EikanWang
Contributor

EikanWang commented Dec 17, 2025

but it can handle copying from one dtype to another, which allows us to use the MTA logic in more cases.

Yes. However, UnaryOpFunctor can handle it as well.

for (int ii = 0; ii < kILP; ++ii) {
  r_args[0][ii] =
      static_cast<T>(op(static_cast<opmath_t>(r_args[0][ii])));
}

On average we see a perf increase, which is bigger for tensorlists with many small tensors and smaller for tensorlists with a few big tensors.

I need to take time to understand this behavior, because it cannot be well explained yet. I may be missing something.

@PawelSwider2000
Contributor Author

for (int ii = 0; ii < kILP; ++ii) {
  r_args[0][ii] =
      static_cast<T>(op(static_cast<opmath_t>(r_args[0][ii])));
}

In this code, opmath_t and T are tied to each other, and in foreach_copy they cannot be, since the source and destination may have different dtypes. I will also try to analyze and debug it in the meantime.
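
To make the contrast concrete, here is a simplified, hypothetical view (the helper name copy_chunk is made up for illustration; the real functor also carries the MTA load/store machinery):

// Existing UnaryOpFunctor loop: T and opmath_t both come from the single
// dispatched scalar type, so source and destination are implicitly the same:
//   r_args[0][ii] = static_cast<T>(op(static_cast<opmath_t>(r_args[0][ii])));

// What foreach_copy needs (sketch): source and destination types are
// dispatched independently, and each element is converted on the fly.
template <typename dst_t, typename src_t, int kILP>
inline void copy_chunk(dst_t* dst, const src_t* src) {
#pragma unroll
  for (int ii = 0; ii < kILP; ++ii) {
    dst[ii] = static_cast<dst_t>(src[ii]);
  }
}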

@PawelSwider2000
Contributor Author

@EikanWang I made some changes to improve performance. They consist of simplifying CopyFunctor, using the vectorized path only for dtypes with the same type length (an empirical observation), and setting kILP based on the types, which improved performance (logic similar to syclPrefVectorWidth); see the sketch below.
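
As a rough sketch of the kILP point (the helper name and thresholds below are illustrative assumptions, not the actual code), the vector width can be derived from the element sizes instead of being fixed:

// Hypothetical sketch: pick kILP from the larger of the two element sizes,
// in the spirit of syclPrefVectorWidth; the thresholds are illustrative.
template <typename dst_t, typename src_t>
constexpr int pick_kILP() {
  constexpr int max_bytes = sizeof(dst_t) > sizeof(src_t)
      ? static_cast<int>(sizeof(dst_t))
      : static_cast<int>(sizeof(src_t));
  // Aim for roughly 16-byte accesses: 8-byte elements -> 2, 4-byte -> 4, else 8.
  return max_bytes >= 8 ? 2 : (max_bytes >= 4 ? 4 : 8);
}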

As for the results, I will focus on large tensors, since for small ones the absolute differences are tiny and in most cases (especially with many inputs) the new implementation is much faster anyway.
Example results:

N      Shape          Src         Dst      Speedup
5    (2048, 2048)    int64      int16      0.8205
10   (2001, 2001)    int64      float16    0.8459
20   (2048, 2048)    float32    int16      0.9369
20   (2048, 2048)    float32    float32    0.9998
50   (2048, 2048)    int16      int16      0.9995
100  (2048, 2048)    float32    float32    1.0737
100  (2048, 2048)    int16      int64      1.2181
100  (2048, 2048)    float16    int64      1.2169
20   (2001, 2001)    bfloat16   int64      1.2149
20   (2048, 2048)    bfloat16   int32      1.4098
50   (2001, 2001)    int16      int16      1.4080
10   (2001, 2001)    float16    float16    1.4046
100  (2048, 2048)    float16    bfloat16   1.7854
100  (2048, 2048)    int64      int64      2.5141

Generally there is a visible speedup, and for some types it is really big. Copying from int64 to a smaller type is a bit slower, but that looks like a small drawback compared to the massive speedup for the other type combinations.

@github-actions

github-actions bot commented Jan 9, 2026

Performance outliers, please check!

  • 🔴 [-1, 80%), should be regression
| Category | Model | Target vs. Baseline [Eager] | Target vs. Baseline [Inductor] |
| --- | --- | --- | --- |
| torchbench_bfloat16_training | resnet18 | 0.871456 | 0.774898 |

Development

Successfully merging this pull request may close these issues.

[xpu][test]profiler can not get the key of multi_tensor_apply_kernel
