@ShangkunLi ShangkunLi commented Jan 5, 2026

This PR completes the following.

The Taskflow Dialect

We introduce the taskflow dialect, which provides the following ops to build a computation abstraction for both scale-out and scale-up spatial architectures:

  1. The taskflow.graph op: wraps computation-intensive workloads in its region for multi-CGRA system acceleration.
  2. The taskflow.task op: wraps a specific operation in its body for a single CGRA (with an affine controller and a tile array).
  3. The taskflow.channel op: carries the data dependencies between two tasks. Resource-binding attributes (e.g., streaming, sequential, coarse-grained pipeline) can be added to this op to denote how data is transferred between two tasks in cooperation with the affine controller.
  4. The taskflow.drive op: carries the control dependencies between two tasks. It is mainly used to partition irregular workloads across multi-CGRA systems.
    For example:
#set = affine_set<(d0, d1) : (d0 - 3 == 0, d1 - 7 == 0)>
module attributes {} {
  func.func @_Z21irregularLoopExample1v() -> i32 attributes {llvm.linkage = #llvm.linkage<external>} {
    %c2_i32 = arith.constant 2 : i32
    %c8_i32 = arith.constant 8 : i32
    %c0_i32 = arith.constant 0 : i32
    %alloca = memref.alloca() : memref<i32>
    %alloca_0 = memref.alloca() : memref<4x8xi32>
    %0 = affine.for %arg0 = 0 to 5 iter_args(%arg1 = %c0_i32) -> (i32) {
      %2 = arith.index_cast %arg0 : index to i32
      %3 = arith.addi %arg1, %2 : i32
      affine.yield %3 : i32
    }
    // Loop 1: wrapped in task 1, and uses taskflow.drive to control task 2 & 3
    affine.for %arg0 = 0 to 4 { 
      %2 = arith.index_cast %arg0 : index to i32
      %3 = arith.muli %2, %c8_i32 : i32
      // Loop 2: wrapped in task 2, controlled by task 1
      affine.for %arg1 = 0 to 8 { 
        %4 = arith.index_cast %arg1 : index to i32
        %5 = arith.addi %3, %4 : i32
        affine.store %5, %alloca_0[%arg0, %arg1] : memref<4x8xi32>
      }
      // Loop 3: wrapped in task 3, controlled by task 1
      affine.for %arg1 = 0 to 8 {
        %4 = affine.load %alloca_0[%arg0, %arg1] : memref<4x8xi32>
        %5 = arith.addi %4, %0 : i32
        affine.if #set(%arg0, %arg1) {
          affine.store %5, %alloca[] : memref<i32>
          %6 = arith.muli %5, %c2_i32 : i32
          affine.store %6, %alloca[] : memref<i32>
        }
      }
    }
    %1 = affine.load %alloca[] : memref<i32>
    return %1 : i32
  }
}
  5. We introduce a packet data type in the taskflow dialect. This type is carried by the taskflow.drive op and holds per-task metadata (e.g., iteration space, task-level execution conditions).
  6. The taskflow.task ops are the nodes of the taskflow.graph, while the taskflow.channel and taskflow.drive ops are its edges.
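To make the node/edge structure concrete, the irregular-loop example above might lower to taskflow IR along the following lines. This is a hedged sketch only: the assembly format, the `@loop*` symbol names, and the `!taskflow.packet` syntax are illustrative assumptions, not the dialect's actual printed form.

```mlir
// Hypothetical sketch of the taskflow lowering of the example above.
// All op syntax below is illustrative; the real assembly format may differ.
taskflow.graph {
  // Task 1: the outer loop (Loop 1). It produces packets carrying the
  // iteration-space metadata for its two inner loops.
  %pkt2, %pkt3 = taskflow.task @loop1 : !taskflow.packet, !taskflow.packet

  // Control edges: task 1 drives tasks 2 and 3 via their packets.
  taskflow.drive %pkt2 to @loop2 : !taskflow.packet
  taskflow.drive %pkt3 to @loop3 : !taskflow.packet

  // Task 2: the store loop (Loop 2), executed once per received packet.
  taskflow.task @loop2

  // Task 3: the load/affine.if loop (Loop 3), likewise packet-driven.
  taskflow.task @loop3
}
```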

The convert-linalg-to-taskflow Pass

We provide an initial conversion pass that produces the taskflow representation for a simple ResNet block generated by PyTorch.
We implement the linalg-to-taskflow conversion first because almost all ML workloads need no control flow: they only carry inter-task data dependencies.
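As an illustration of what this pass might produce, consider a single linalg op from such a block. The input is standard MLIR, but the taskflow output below is a hedged sketch: the task/channel assembly syntax and the `binding` attribute name are assumptions for illustration.

```mlir
// Input: one op from a PyTorch-generated linalg module.
linalg.matmul ins(%A, %B : memref<4x8xf32>, memref<8x4xf32>)
              outs(%C : memref<4x4xf32>)

// Hypothetical output: the op is wrapped in a taskflow.task, and the
// produced memref becomes a taskflow.channel edge to the consuming task.
// Syntax is illustrative only.
taskflow.graph {
  taskflow.task @matmul {
    linalg.matmul ins(%A, %B : memref<4x8xf32>, memref<8x4xf32>)
                  outs(%C : memref<4x4xf32>)
  }
  // Data-dependency edge to the next task (e.g., a following ReLU),
  // annotated with a resource-binding attribute.
  taskflow.channel %C : memref<4x4xf32> {binding = "streaming"}
}
```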

Features to Support

Compiler Level:

  • Use this dialect to represent the provided irregular workloads
  • Realize taskflow.task-level fusion, enabling multiple kernels to run on a single CGRA

RTL Level:

  • Implement an affine controller in the RTL repo (needs further discussion)

@ShangkunLi ShangkunLi self-assigned this Jan 5, 2026
@ShangkunLi ShangkunLi requested a review from guosran January 6, 2026 04:24
@ShangkunLi ShangkunLi merged commit 68829b7 into coredac:main Jan 6, 2026