
Conversation

@PawelSwider2000
Contributor

On XPU, the foreach_copy operator follows logic similar to the other foreach ops for switching between multi tensor apply (MTA) and the slower mode. On CUDA the behavior differs from the other foreach ops: the MTA path is enabled even when the source and destination datatypes are different. This commit aligns the XPU logic with CUDA.

To do so:

  • Create can_use_fast_route_for_copy, matching the CUDA logic for selecting the fast MTA path (see the sketch after this list).
  • Create CopyFunctor and modify Copy to handle different dtypes and perform casts instead of the Identity operation.
  • Change the foreach_copy_list_kernel_ dispatch logic to take both the src and dst types into account.
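
As a rough illustration of the first point (this is a sketch based on the description above and ATen's generic foreach fast-route checks, not necessarily the exact code in the diff), the copy-specific check keeps the usual restrictions but drops the dtype-equality requirement:

#include <ATen/ATen.h>

// Hypothetical sketch of can_use_fast_route_for_copy: keep the usual foreach
// fast-route restrictions (same device, same sizes, dense and non-overlapping
// tensors), but do not require that src and dst share a dtype, since the MTA
// kernel now performs the cast itself.
static bool can_use_fast_route_for_copy(
    at::TensorList dsts,
    at::TensorList srcs) {
  for (size_t i = 0; i < dsts.size(); ++i) {
    const auto& dst = dsts[i];
    const auto& src = srcs[i];
    if (dst.device() != src.device() || dst.sizes() != src.sizes() ||
        !dst.is_non_overlapping_and_dense() ||
        !src.is_non_overlapping_and_dense()) {
      return false;
    }
    // Intentionally no dtype equality check here.
  }
  return true;
}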

This PR fixes issues described in #2313.

Tests for the issue are being enabled in the issue mentioned above, and we can also use the existing foreach_copy tests.

Copilot AI review requested due to automatic review settings December 3, 2025 08:26
Contributor

Copilot AI left a comment

Pull request overview

This PR aligns XPU's foreach_copy operator with CUDA's logic for selecting the Multi-Tensor Apply (MTA) fast path, enabling dtype conversions within the MTA path rather than falling back to slower operations.

Key changes:

  • Introduces can_use_fast_route_for_copy to match CUDA's fast path selection logic
  • Implements dtype conversion support via CopyFunctor and Copy struct
  • Adds nested dispatch logic to handle different source and destination types

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| src/ATen/native/xpu/sycl/ForeachCopyKernels.cpp | Replaces Identity operation with Copy functor supporting dtype conversion; implements CopyFunctor for MTA path |
| src/ATen/native/xpu/ForeachOpList.cpp | Adds can_use_fast_route_for_copy function to enable fast path for dtype conversions |


@EikanWang
Contributor

I reviewed the issue. It should not be a real issue, right? If so, why do we need this kind of change? What's the performance benefit?

@PawelSwider2000
Contributor Author

I reviewed the issue. It should not be a real issue, right? If so, why do we need this kind of change? What's the performance benefit?

@EikanWang Without these changes, foreach_copy will still run and produce correct results. However, without the fix, performance may be somewhat worse as we will launch multiple kernels instead of a single kernel with multiple operations.

The fact that we already use the MultiTensorApply approach in other ops, and to some extent in foreach_copy, suggests that we benefit from running the fast path. It is the more performant way, and we should use it whenever possible. I see no reason not to apply it to the maximum extent possible for each op.

Another minor issue is that, without this change, we would need more sophisticated conditions to determine when to expect the fast path while porting tests from torch.

Comment on lines +112 to +118
AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND3(
    at::ScalarType::Half,
    at::ScalarType::BFloat16,
    at::ScalarType::Bool,
    src[0].scalar_type(),
    "foreach_tensor_copy",
    [&]() {
Contributor

Why does this PR need the nested AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND3?

/* depth */ 2,
/* r_args_depth */ 1,
/* res_arg_index */ 1>(),
Copy<dst_opmath_t, src_opmath_t>());
Contributor

Since it is a copy function, why are dst_opmath_t and src_opmath_t required?

Contributor Author

It's because PyTorch's copy (and its foreach version as well) supports copying between different dtypes; if we want to support such cases here, we need both the src and dst dtypes. That's also the reason why we are doing a nested AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND3.
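
For illustration, the two-type cast op could look roughly like this (a sketch following the naming used in this thread; the actual code may differ, e.g. in how complex-to-real conversions are handled):

// Sketch: unlike Identity, the copy op is parameterized on both the source and
// destination opmath types, so the dtype conversion happens inside the op.
template <typename dst_opmath_t, typename src_opmath_t>
struct Copy {
  inline dst_opmath_t operator()(const src_opmath_t& x) const {
    return static_cast<dst_opmath_t>(x);
  }
};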

Contributor

I suppose you are addressing the above comment.

https://github.com/intel/torch-xpu-ops/pull/2455/changes/BASE..18ab57c4ee1faefd76b003818ac20bd8c0aab103#diff-346dedcf59d4381e08b92e700ab3354c84bc13e5a13097ee55f7872cfd86e3a7R103-R108

  // Outer dispatch
  AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND3(
      at::ScalarType::Half,
      at::ScalarType::BFloat16,
      at::ScalarType::Bool,
      self[0].scalar_type(),
      "foreach_tensor_copy",
      [&]() {
        using dst_t = scalar_t;
        using dst_opmath_t = at::opmath_type<dst_t>;
        
        // Inner dispatch (newly added)
        AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND3(
            at::ScalarType::Half,
            at::ScalarType::BFloat16,
            at::ScalarType::Bool,
            src[0].scalar_type(),
            "foreach_tensor_copy",

The outer dispatch has decided the data type. Why is the inner dispatch necessary? @PawelSwider2000

Contributor Author

Here, the dispatch needs to decide based on both the src and dst datatypes, which is the reason for the outer and inner dispatch. We want dispatching like:

AT_DISPATCH_CASE((float32, float32))
AT_DISPATCH_CASE((float32, bfloat16))
etc...

Contributor

Makes sense.

@EikanWang
Contributor

EikanWang commented Dec 15, 2025

However, without the fix, performance may be somewhat worse as we will launch multiple kernels instead of a single kernel with multiple operations.

This PR only replaces the UnaryOpFunctor with CopyFunctor. So why would it impact the number of kernel launches? Per my understanding, the key change in CopyFunctor is to avoid a useless cast between scalar_t and opmath_type<scalar_t>.

#pragma unroll
        for (int ii = 0; ii < kILP; ++ii) {
          r_args[0][ii] =
              static_cast<T>(op(static_cast<opmath_t>(r_args[0][ii])));
        }
        load_store(args[res_arg_index], r_args[0], i, 0);
      }

However, the op introduces the opmath_type conversion again via Copy<dst_opmath_t, src_opmath_t>(). From a performance perspective, that does not look better than UnaryOpFunctor.

@PawelSwider2000 any performance data?

@PawelSwider2000
Contributor Author

Directly replacing UnaryOpFunctor with CopyFunctor does not by itself improve performance, but it can handle copying from one dtype to another, which allows us to use the MTA logic in more cases.

However, some performance comparison will be useful here. I will prepare one.

@PawelSwider2000
Contributor Author

PawelSwider2000 commented Dec 17, 2025

I did some performance analysis for different numbers of tensors, shapes, and (src, dst) dtypes. On average we see a perf increase, which is bigger for tensorlists with many small tensors and smaller for tensorlists with a few big tensors.

On average we get about a 2x speedup, although it varies with the size and number of tensors. It can be up to around 6x faster than the current implementation, but also up to 2x slower in some specific cases.

It looks like copy perf also differs based on shape, e.g.:
int32->int64 for (2048, 2048): the speedup is 0.63 (we are much slower), but
int32->int64 for (1024,): the speedup is the highest, equal to 5.64 (a massive improvement).

@EikanWang Do you know somebody who could help us understand why performance is significantly worse in some cases?

@EikanWang
Contributor

EikanWang commented Dec 17, 2025

but it can handle copying from one dtype to another, which allows us to use the MTA logic in more cases.

Yes. However, UnaryOpFunctor can handle it as well.

for (int ii = 0; ii < kILP; ++ii) {
  r_args[0][ii] =
      static_cast<T>(op(static_cast<opmath_t>(r_args[0][ii])));
}

On average we see a perf increase, which is bigger for tensorlists with many small tensors and smaller for tensorlists with a few big tensors.

I need to take time to understand this behavior, because it cannot be well explained yet. I may be missing something.

@PawelSwider2000
Contributor Author

for (int ii = 0; ii < kILP; ++ii) {
  r_args[0][ii] =
      static_cast<T>(op(static_cast<opmath_t>(r_args[0][ii])));
}

In this code, opmath_t and T are tied to each other, and in foreach_copy they cannot be, since the source and destination may have different dtypes. I will also try to analyze and debug it in the meantime.
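
To make the contrast concrete, here is a simplified, hypothetical view (the helper name copy_chunk is made up for illustration; the real functor also carries the MTA load/store machinery):

// Existing UnaryOpFunctor loop: T and opmath_t both come from the single
// dispatched scalar type, so source and destination are implicitly the same:
//   r_args[0][ii] = static_cast<T>(op(static_cast<opmath_t>(r_args[0][ii])));

// What foreach_copy needs (sketch): source and destination types are
// dispatched independently, and each element is converted on the fly.
template <typename dst_t, typename src_t, int kILP>
inline void copy_chunk(dst_t* dst, const src_t* src) {
#pragma unroll
  for (int ii = 0; ii < kILP; ++ii) {
    dst[ii] = static_cast<dst_t>(src[ii]);
  }
}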

@PawelSwider2000
Contributor Author

@EikanWang I made some changes to improve performance. They consist of simplifying CopyFunctor, using the vectorized path only for dtypes with the same type length (an empirical observation), and setting kILP based on the types, which improved performance (logic similar to syclPrefVectorWidth); see the sketch below.
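
As a rough sketch of the kILP point (the helper name and thresholds below are illustrative assumptions, not the actual code), the vector width can be derived from the element sizes instead of being fixed:

// Hypothetical sketch: pick kILP from the larger of the two element sizes,
// in the spirit of syclPrefVectorWidth; the thresholds are illustrative.
template <typename dst_t, typename src_t>
constexpr int pick_kILP() {
  constexpr int max_bytes = sizeof(dst_t) > sizeof(src_t)
      ? static_cast<int>(sizeof(dst_t))
      : static_cast<int>(sizeof(src_t));
  // Aim for roughly 16-byte accesses: 8-byte elements -> 2, 4-byte -> 4, else 8.
  return max_bytes >= 8 ? 2 : (max_bytes >= 4 ? 4 : 8);
}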

As for the results, I will focus on large tensors, since for small ones the absolute differences are tiny and in most cases (especially with many inputs) the new implementation is much faster anyway.
Example results:

N      Shape          Src         Dst      Speedup
5    (2048, 2048)    int64      int16      0.8205
10   (2001, 2001)    int64      float16    0.8459
20   (2048, 2048)    float32    int16      0.9369
20   (2048, 2048)    float32    float32    0.9998
50   (2048, 2048)    int16      int16      0.9995
100  (2048, 2048)    float32    float32    1.0737
100  (2048, 2048)    int16      int64      1.2181
100  (2048, 2048)    float16    int64      1.2169
20   (2001, 2001)    bfloat16   int64      1.2149
20   (2048, 2048)    bfloat16   int32      1.4098
50   (2001, 2001)    int16      int16      1.4080
10   (2001, 2001)    float16    float16    1.4046
100  (2048, 2048)    float16    bfloat16   1.7854
100  (2048, 2048)    int64      int64      2.5141

Generally there is a visible speedup, and for some types it is really big. Copying from int64 to a smaller type is a bit slower, but that looks like a small drawback compared to the massive speedup for the other type combinations.

@github-actions

github-actions bot commented Jan 9, 2026

Performance outliers, please check!

  • 🔴 [-1, 80%), should be regression
| Category | Model | Target vs. Baseline [Eager] | Target vs. Baseline [Inductor] |
| --- | --- | --- | --- |
| torchbench_bfloat16_training | resnet18 | 0.871456 | 0.774898 |

Development

Successfully merging this pull request may close these issues.

[xpu][test]profiler can not get the key of multi_tensor_apply_kernel
