Large Language Models (LLMs) are being widely used across numerous applications. The key model architecture powering these LLMs is the Transformer, which relies on a fundamental component known as the self-attention operation.
In this project, we will delve into the mechanics of how the attention operation works. The primary goal is to implement a highly optimized CUDA kernel for this operation, specifically targeting the FlashAttention algorithm.
Attention is a powerful operation first introduced in the "Attention Is All You Need" paper (link). The basic operation defines a relationship between three matrices: Query (Q), Key (K), and Value (V). The output is calculated using the following formula:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

where $d_k$ is the dimension of the key (and query) vectors, used to scale the dot products before the softmax.
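For reference, the same formula written directly in PyTorch (a minimal single-head, unbatched sketch with illustrative sizes):

```python
import math
import torch

# Illustrative shapes: q, k, v are [seq_len, d_k] (single head, no batch).
q = torch.randn(8, 64)
k = torch.randn(8, 64)
v = torch.randn(8, 64)

scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])  # [seq_len, seq_len]
out = torch.softmax(scores, dim=-1) @ v                    # [seq_len, d_k]
```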
Self-attention is a specific application of this operation where the Q, K, and V matrices are all generated from the same input sequence, X.

In the context of language models, this input sequence X is a tensor of shape `[batch_size, seq_len, hidden_dim]`.

From this single input X, we compute the three matrices using learned weight matrices:

- Query: $Q = X \cdot W_Q$
- Key: $K = X \cdot W_K$
- Value: $V = X \cdot W_V$
Each of these weight matrices has the shape `[hidden_dim, hidden_dim]`, which means that Q, K, and V will have the same shape as the input: `[batch_size, seq_len, hidden_dim]`. These projections effectively map the input embeddings into three different "representation subspaces" required for the attention operation.
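A minimal PyTorch sketch of these projections (random weights and illustrative sizes; real models use learned `nn.Linear` layers):

```python
import torch

batch_size, seq_len, hidden_dim = 2, 16, 256
X = torch.randn(batch_size, seq_len, hidden_dim)

# Learned projection weights, each of shape [hidden_dim, hidden_dim].
W_Q = torch.randn(hidden_dim, hidden_dim)
W_K = torch.randn(hidden_dim, hidden_dim)
W_V = torch.randn(hidden_dim, hidden_dim)

Q = X @ W_Q  # [batch_size, seq_len, hidden_dim]
K = X @ W_K
V = X @ W_V
```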
For a better visual and conceptual understanding of self-attention, refer to these links and YouTube videos:
In many Transformer models, such as those used for text generation, we must ensure that tokens can only attend to previous tokens in the sequence. This is achieved by applying a causal mask (also known as a look-ahead mask) to the score matrix ($QK^T / \sqrt{d_k}$).

The mask is derived from a lower triangular matrix and is applied to the scores before the softmax operation. By masking out all elements corresponding to future tokens (i.e., setting them to negative infinity), we ensure that a token at a given position cannot "see" or attend to any tokens that come after it.
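For example, a causal mask can be constructed and applied in PyTorch as follows (a small standalone sketch; `scores` stands in for the pre-softmax score matrix):

```python
import torch

seq_len = 8
scores = torch.randn(seq_len, seq_len)  # pre-softmax attention scores

# True above the diagonal = positions that correspond to future tokens.
future = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
masked_scores = scores.masked_fill(future, float('-inf'))

weights = torch.softmax(masked_scores, dim=-1)  # future positions get weight 0
```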
To allow the model to learn richer representations, we want each token to attend to other tokens in multiple contexts or "representation subspaces." This is the goal of Multi-headed Attention (MHA), which runs multiple self-attention operations in parallel.
Instead of performing a single attention operation over the full `hidden_dim`, MHA does the following:

- Perform one large linear projection for each of Q, K, and V (using $W_Q, W_K, W_V$ as described previously).
- "Split" the resulting $Q$, $K$, and $V$ matrices into $h$ heads.

This "split" is typically a tensor reshape, where the `hidden_dim` of the `[batch_size, seq_len, hidden_dim]` tensor is reshaped to `[batch_size, seq_len, num_heads, head_dim]`, where `num_heads` is the number of heads and `head_dim` is the dimension of each head (`head_dim = hidden_dim / num_heads`).

We can then compute the scaled dot-product attention described above independently for each head.

The outputs of all heads (each of shape `[batch_size, seq_len, head_dim]`) are then concatenated back along the last dimension to restore the `[batch_size, seq_len, hidden_dim]` shape.

It is important to note that while this architecture is "Multi-head," all heads are computed in parallel as a single batched operation; splitting into heads changes how the hidden dimension is partitioned, not the total amount of computation.
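A minimal PyTorch sketch of the split (and the corresponding merge) for one of the projected tensors; the names are illustrative:

```python
import torch

batch_size, seq_len, hidden_dim, num_heads = 2, 16, 256, 8
head_dim = hidden_dim // num_heads

Q = torch.randn(batch_size, seq_len, hidden_dim)

# Split: [B, S, H] -> [B, S, num_heads, head_dim] -> [B, num_heads, S, head_dim]
Q_heads = Q.view(batch_size, seq_len, num_heads, head_dim).transpose(1, 2)

# ... per-head attention operates on the last two dimensions ...

# Merge: [B, num_heads, S, head_dim] -> [B, S, H]
Q_merged = Q_heads.transpose(1, 2).contiguous().view(batch_size, seq_len, hidden_dim)
```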
Follow these instructions to set up the project environment, especially if you are working on the ICE cluster.
First, allocate an interactive session.
From the command line (via SSH): A typical allocation request for one H100 GPU, 8 CPUs, and 128GB of memory for 4 hours looks like this:
```bash
salloc --gres=gpu:h100 --cpus-per-task=8 --mem=128G --time=4:00:00
```

From the web: Alternatively, you can use the Open OnDemand website to request a GPU allocation.
Important: Use Scratch Directory on ICE Cluster
When cloning this repository on the ICE cluster, make sure to work inside your scratch directory. Your home directory has only 30 GB of storage, while scratch provides up to 300 GB.
You can check your scratch path by running `pace-quota` on any ICE node.
To simplify future access, consider creating a symlink from your home directory to your scratch directory:
```bash
ln -s /path/to/your/scratch ~/scratch
```
This ensures you always work within the larger storage space and avoid exceeding your home directory limit. Once this setup is done, please clone this repo and start working on this project.
```bash
cd <path to your scratch directory>
git clone <URL_TO_THIS_REPO>
```

Important: The default home folder (~/) is NFS-based and has a very limited quota. Working from this directory will be very slow and may cause your programs to fail. Always use ~/scratch/ for your projects.
To get started, you first need to load the uv module, which is a fast Python package installer and virtual environment manager.
```bash
module load uv
```

Next, create a new virtual environment and install the required packages.
```bash
# Create a virtual environment using Python 3.12
uv venv --python 3.12

# Activate the new environment
source .venv/bin/activate

# Install the required libraries
uv pip install --no-cache-dir torch numpy transformers
```

For any future sessions after you log out and log back into PACE, you only need to re-activate the virtual environment to get started:
```bash
source .venv/bin/activate
```

Your first task is to implement a naive, non-optimized self-attention operation in PyTorch. This will help you become familiar with the multi-headed self-attention mechanism.
You will implement your algorithm in the forward() method of attention.py. You can then use the provided test.py script to compare your results with PyTorch's built-in torch.nn.MultiheadAttention to verify correctness.
Here are the steps to follow (a condensed sketch that assembles them is provided after the list):
- Split Heads: Use the `torch.Tensor.view()` method to reshape the input `q`, `k`, and `v` tensors. You will change their dimensions from `[batch_size, seq_len, hidden_dim]` to `[batch_size, seq_len, num_heads, head_dim]` to split the `hidden_dim` into multiple attention heads.

- Transpose: Use the `.transpose()` method to swap the sequence length and head dimensions. This changes the tensor shape to `[batch_size, num_heads, seq_len, head_dim]`, which is the standard layout for batched attention computation.

- Compute Scores: Compute the scaled dot-product attention scores. This is done by performing a matrix multiplication of $Q$ and $K^T$, and then scaling by the square root of the head dimension ($d_k$). You can achieve this using the `@` operator for matrix multiplication and `math.sqrt()` for scaling. The full operation is $QK^T / \sqrt{d}$.

- Apply Causal Mask (if enabled): If the `causal=True` flag is passed to the function, you must apply a causal (look-ahead) mask.
  - First, create the mask using `torch.triu` to get an upper-triangular matrix.
  - Then, apply this mask to the score matrix using the `.masked_fill(mask, value)` method. You should fill the masked positions with `float('-inf')` to ensure they become zero after the softmax.

- Apply Softmax: Use `torch.nn.functional.softmax()` (imported as `F`) to apply the softmax function to each row (i.e., along the last dimension) of the scaled and masked score matrix.

- Compute Output: Compute the final output matrix $O$ by multiplying the softmax-normalized attention weights with the `V` (Value) tensor. We have already provided the code for transposing and concatenating the heads back into the final output shape.
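For reference, here is a condensed sketch that assembles the steps above (the final head merge is already provided in the template). Tensor and argument names are illustrative and may differ from the actual signature in attention.py:

```python
import math
import torch
import torch.nn.functional as F

def naive_mha(q, k, v, num_heads, causal=False):
    # q, k, v: [batch_size, seq_len, hidden_dim]
    B, S, H = q.shape
    d = H // num_heads

    # 1-2. Split heads and transpose -> [B, num_heads, S, head_dim]
    q = q.view(B, S, num_heads, d).transpose(1, 2)
    k = k.view(B, S, num_heads, d).transpose(1, 2)
    v = v.view(B, S, num_heads, d).transpose(1, 2)

    # 3. Scaled dot-product scores -> [B, num_heads, S, S]
    scores = (q @ k.transpose(-2, -1)) / math.sqrt(d)

    # 4. Optional causal mask
    if causal:
        mask = torch.triu(torch.ones(S, S, dtype=torch.bool, device=q.device), diagonal=1)
        scores = scores.masked_fill(mask, float('-inf'))

    # 5-6. Softmax and output, then merge the heads back
    out = F.softmax(scores, dim=-1) @ v           # [B, num_heads, S, head_dim]
    return out.transpose(1, 2).reshape(B, S, H)   # [B, S, hidden_dim]
```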
Once you have completed your implementation, please read through the test code in test.py to understand how it works. You can then run the test case from your terminal with the following command:
```bash
python -m task1.test
```

In this task, you will implement a naive self-attention operation using a combination of PyTorch and your own custom CUDA kernels.
While we will continue to use the PyTorch framework for high-level evaluation, the core computations will be replaced by your CUDA code. These kernels will be bound to Python, allowing them to be called directly from the attention.py script instead of using standard PyTorch tensor operations. Please examine the forward() method to understand how this Python-CUDA binding is structured.
Your specific task is to complete the attention_kernel.cu file by implementing the following components.
Steps to Follow:
- Softmax Kernel

  This kernel will compute the softmax function, which normalizes a vector of values into a probability distribution. To ensure numerical stability, your implementation must first find the maximum value of each input row, subtract it from all elements in that row, and then compute the exponent and sum. This two-pass approach prevents overflow or underflow issues with large input values.

  You must use shared memory (`__shared__`) to minimize costly global memory accesses. Implement techniques such as parallel reduction within a thread block to efficiently find the row maximum and, subsequently, the sum of the exponents.

  Finally, you must implement a batched softmax kernel. The design should map one CUDA block to compute the softmax for one input row vector. As you can observe from the `test_batched_softmax()` function, `gridDim.x` (the first dimension of the grid) will correspond to the batch dimension.

- GEMM Kernels

  The full attention operation requires two distinct matrix multiplication (GEMM) steps: $S = QK^T$ and $O = SV$ (where $S$ is the softmax result). Instead of creating one complex kernel that handles transposition on the fly, you will implement two separate, optimized GEMM kernels:

  - GEMM_NT: Performs `A @ B.T` (Normal-Transposed). This kernel will be used for the $QK^T$ computation.
  - GEMM_NN: Performs `A @ B` (Normal-Normal). This kernel will be used for the $SV$ computation.

  Note on Matrix Layout: Our `N` (Normal) and `T` (Transposed) notation differs from the cuBLAS convention. In PyTorch and our CUDA implementation, matrices are row-major by default. Therefore, `N` signifies a row-major matrix, and `T` signifies a column-major (transposed row-major) matrix.

  For the implementation, you must use the standard tiled-GEMM approach (as you learned in Project 1). This involves using shared memory to stage tiles of the input matrices, which significantly reduces global memory bandwidth and improves performance. Pay close attention to the test code to determine how batching is handled in the kernel launch parameters.

- Attention Computation

  Finally, you will assemble the full operation inside the `custom_attention()` host function in `attention_kernel.cu`. You must call your newly implemented kernels in the correct sequence.

  We have provided the `scale_and_causal_mask` kernel for you. You must call this kernel after your $QK^T$ (GEMM_NT) computation to perform both the scaling (division by $\sqrt{d}$) and the application of the causal mask (if enabled).

  It is your responsibility to choose the correct block size, grid size, and tile size (`TILE_SIZE`) for all kernel launches. If you are unsure about the launch configuration, study the testing functions in `test.py` for guidance. (A PyTorch reference of this kernel sequence is sketched below.)
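For reference, the sequence of kernels that `custom_attention()` must launch corresponds to the following PyTorch computation. This is only a sketch with an illustrative 3-D layout (the leading dimension standing in for batch × heads); the two-pass max-subtract softmax shown here is the numerical-stability scheme your softmax kernel must implement:

```python
import math
import torch

def reference_attention(Q, K, V, causal=True):
    # Q, K, V: [batch, seq_len, head_dim]
    d = Q.shape[-1]

    # GEMM_NT: S = Q @ K^T
    S = Q @ K.transpose(-2, -1)                        # [batch, seq_len, seq_len]

    # scale_and_causal_mask: divide by sqrt(d), mask out future positions
    S = S / math.sqrt(d)
    if causal:
        future = torch.triu(torch.ones_like(S, dtype=torch.bool), diagonal=1)
        S = S.masked_fill(future, float('-inf'))

    # Numerically stable softmax: subtract the row max before exponentiating.
    row_max = S.max(dim=-1, keepdim=True).values
    P = torch.exp(S - row_max)
    P = P / P.sum(dim=-1, keepdim=True)

    # GEMM_NN: O = P @ V
    return P @ V                                       # [batch, seq_len, head_dim]
```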
To test your code, you must first compile the CUDA kernels. The provided script handles the PyTorch C++ extension building process.
```bash
# Compile the CUDA kernels
python -m task2.compile
```

After a successful compilation, you can test your individual kernels to isolate any issues:
```bash
# Test only the batched softmax kernel
python -m task2.test --softmax

# Test only the batched GEMM kernels
python -m task2.test --gemm
```

Once your individual kernels are passing, you can test the full end-to-end attention kernel.
```bash
# Test the complete custom attention implementation
python -m task2.test --attention
```

Note: This CUDA implementation in Task 2 still involves multiple separate kernel calls (GEMM, scale/mask, softmax, GEMM). Because each kernel launch has overhead, we are not expecting to see significant performance benefits over PyTorch just yet.
Now, you will implement the FlashAttention 2 algorithm in PyTorch. This task is similar in spirit to Task 1 (as it's a PyTorch-only implementation), but the logic is significantly more complex. Your goal here is to understand and implement the attention operation in a tiled manner, which is the key to its efficiency.
Your implementation must follow Algorithm 1 from the paper "FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning" by Tri Dao.
I strongly suggest you first read the original FlashAttention-1 paper. This will help you understand why this tiled approach, which avoids materializing the massive intermediate attention score matrix in slow GPU global memory, is so much faster than the naive implementation.
You will be implementing the forward pass only, so you can safely ignore the parts of Algorithm 1 that exist solely to support the backward pass (such as storing the logsumexp values).
If you trace through Algorithm 1 and carefully track the intermediate values and their tensor dimensions, you should be able to implement the logic in Python without too much difficulty.
Important Typo in Paper!
There is a well-known typo in Algorithm 1 of the original paper. Please refer to this GitHub issue for the correction.
Tips for PyTorch Implementation: The notation $\mathrm{diag}(v)A$ used in the paper can be implemented with simple broadcasting (see the sketch after this list):

- For $\mathrm{diag}(v)A$, you can use: `A * v.unsqueeze(-1)`
- Similarly, for $\mathrm{diag}(v)^{-1}A$, you can use: `A / v.unsqueeze(-1)`
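As a concrete (and deliberately simplified) illustration of how these broadcasting tricks appear in Algorithm 1, here is a sketch of a single inner-loop update for one query block and one key/value block. Names follow the paper's notation; initialization, the outer loops, causal masking, and the final $\mathrm{diag}(l)^{-1}$ normalization are left out (the last is shown only as a comment):

```python
import math
import torch

def online_softmax_block_update(O, l, m, Q_i, K_j, V_j):
    """One inner-loop step of the FlashAttention-2 forward pass.
    Q_i: [B_r, d] query block; K_j, V_j: [B_c, d] key/value block.
    O: [B_r, d] running (unnormalized) output, l: [B_r] running row sums,
    m: [B_r] running row maxima. Returns the updated (O, l, m)."""
    d = Q_i.shape[-1]
    S = Q_i @ K_j.transpose(-2, -1) / math.sqrt(d)   # [B_r, B_c]

    m_new = torch.maximum(m, S.max(dim=-1).values)   # updated row maxima
    P = torch.exp(S - m_new.unsqueeze(-1))           # exponentials w.r.t. new maxima
    scale = torch.exp(m - m_new)                     # rescale the old statistics

    l_new = scale * l + P.sum(dim=-1)                # diag(exp(m - m_new)) l + rowsum(P)
    O_new = scale.unsqueeze(-1) * O + P @ V_j        # diag(exp(m - m_new)) O + P V_j
    return O_new, l_new, m_new

# After iterating over all K/V blocks, the final output is diag(l)^{-1} O:
# O = O / l.unsqueeze(-1)
```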
To test the functionality of your implementation, run the following command:
```bash
python -m task3.test
```
This is the final and most important task. You will now implement the full FlashAttention algorithm in PyTorch + CUDA.
Your goal is to translate the tiled logic you developed in Task 3 (the PyTorch FlashAttention-2 algorithm) into a single, high-performance CUDA kernel. This fused kernel will perform the entire attention computation (GEMMs, masking, softmax, and output GEMM) in one pass, using tiling to avoid materializing the large intermediate score matrix in global memory.
You must use the pre-defined shared memory arrays to store blocks of Q, K, and V. You can assume for your implementation that the maximum head_dim will be 128.
A critical part of this task is performance tuning. In the custom_flash_attention() host function, you must experiment with and tune the tile sizes (B_r and B_c, the block sizes for rows and columns) to find the configuration that achieves the best performance on your GPU.
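One way to drive this tuning is a small timing harness on the Python side. The sketch below uses CUDA events for accurate GPU timing; `run_flash_attention` is a hypothetical placeholder for however your compiled extension is actually invoked with a given (B_r, B_c) configuration:

```python
import itertools
import torch

def time_gpu(fn, iters=20):
    """Median wall-clock time of fn() in milliseconds, measured with CUDA events."""
    # Warm-up so one-time compilation/caching effects are excluded from timing.
    for _ in range(3):
        fn()
    torch.cuda.synchronize()
    times = []
    for _ in range(iters):
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        fn()
        end.record()
        torch.cuda.synchronize()
        times.append(start.elapsed_time(end))
    return sorted(times)[len(times) // 2]

# Hypothetical sweep over candidate tile sizes (adapt the call to your binding):
# for B_r, B_c in itertools.product([32, 64, 128], [32, 64, 128]):
#     ms = time_gpu(lambda: run_flash_attention(q, k, v, B_r=B_r, B_c=B_c))
#     print(f"B_r={B_r:>3} B_c={B_c:>3}: {ms:.3f} ms")
```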
Testing Your Kernel: First, compile your new CUDA kernel with the following command.
```bash
python -m task4.compile
```

Then, run the test and benchmarking script:

```bash
python -m task4.test
```

The test code will verify the correctness of your output and then compare the performance of your custom FlashAttention kernel against the naive `torch.nn.MultiheadAttention` implementation.
Reporting Your Results: The score for this project will be based on the performance difference you achieve. You must report the 4 speedup numbers from the test script in the Results.md file.
This task focuses on the second phase of LLM inference: token generation (decode), and how to optimize it using a KV Cache.
How LLMs Generate New Tokens
LLM token generation is a two-step process:
- Prefill Phase: We run the full self-attention (as in Task 4) on the entire input prompt (e.g., 1024 tokens) to generate the first new token.
- Decode Phase: To generate every subsequent token, we append the newest token to our input sequence and run the attention operation again.
However, since we use causal attention, old tokens can never attend to new tokens. Only the single, newest token needs to attend to all the previous tokens (including itself). This insight changes the computation dramatically:
- Instead of a full matrix-matrix multiply (GEMM), the attention computation for the new token is effectively a matrix-vector multiply (GEMV).
This is much faster, but we still need the Key (K) and Value (V) matrices from all the old tokens. To avoid recomputing them at every step, we store them in a KV Cache and simply reuse them.
- To better understand KV Caching, please refer to this article: link
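To make the GEMV-plus-cache idea concrete, here is a minimal single-head, unbatched PyTorch sketch of one decode step. The names and the cache layout are illustrative, not the layout used by the provided code:

```python
import math
import torch

def decode_step(q_new, k_new, v_new, k_cache, v_cache, pos):
    """One decode step for a single head.
    q_new, k_new, v_new: [head_dim] vectors for the newest token.
    k_cache, v_cache:    [max_seq_len, head_dim] pre-allocated buffers.
    pos:                 index of the newest token (number of tokens cached so far)."""
    # Write the new K and V into their pre-allocated slots
    # (this is the job of update_cache_kernel).
    k_cache[pos] = k_new
    v_cache[pos] = v_new

    # Attention for the single new query is a matrix-vector product (GEMV)
    # over only the valid part of the cache. Causality is automatic because
    # positions beyond `pos` are simply not included.
    k = k_cache[: pos + 1]                              # [pos+1, head_dim]
    v = v_cache[: pos + 1]
    scores = k @ q_new / math.sqrt(q_new.shape[-1])     # [pos+1]
    weights = torch.softmax(scores, dim=-1)
    return weights @ v                                  # [head_dim]
```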
Your Task
Go through attention.py and test.py to understand the new workflow. You will see that it first generates Q, K, and V. Then, it uses a CUDA kernel to update (append) the newest K and V vectors into the KV cache. Finally, it runs a new decode kernel using the single new Q and the entire K and V from the cache.
Step 1: Implement update_cache_kernel()
Your first task is to write the update_cache_kernel() in CUDA. This kernel's simple job is to copy the new K and V vectors (from the current step) into the correct slot in the kv_cache tensor, based on the current token index.
We do this because we don't want to allocate and free new memory blocks at every decode step, as that would be extremely slow. Instead, we pre-allocate a fixed, large amount of memory for the KV cache (up to a max_seq_len) and just write into it.
This method can become inefficient if the allocated `max_seq_len` is much larger than the real sequence length. The state-of-the-art solution to this problem is PagedAttention. If you are interested, please look into the PagedAttention paper.
Step 2: Implement the Decode Kernel
Your second task is to modify the FlashAttention kernel you built in Task 4 and turn it into a specialized kernel for the decode phase.
The main change is that you can now assume the Q tensor's sequence length is always 1. This greatly simplifies your kernel's logic. The core FlashAttention concepts (tiling, shared memory, online softmax) are the same, but this assumption will likely eliminate one of the main for loops in your implementation (the one that iterates over blocks of Q, since the query now consists of a single row).
Testing and Performance
After implementing both kernels, you will be able to run test.py. This test will:
- Run a prefill on 1024 tokens.
- Use your `update_cache_kernel` and `decode_kernel` to generate 100 new tokens, one by one.
Again, try tuning the tile sizes (B_r and B_c) in your decode kernel for the best performance. Be aware that a tile size that is too large may exceed the available shared memory.
Compile and run the test:
```bash
python -m task5.compile
python -m task5.test
```

The test script will compare the performance of your decode kernel. Please report the speedup numbers in Results.md.
(Optional) Optimizing the Decode Kernel
Explore performance improvements to the decode stage by implementing a chunked FlashAttention kernel, as described in Stanford CRFM’s FlashDecoding article (link). Compare its performance against the baseline decode kernel to quantify potential speedups.
This final task brings all your work together. You will now implement an end-to-end LLM inference pipeline on a dummy LLM.
Your objective is to fill in the forward() method of the customAttention class. This single function must now be able to intelligently handle both the prefill and decode phases of inference.
Specifically, your forward() method must:
- Detect the Prefill Phase: If the model detects the prefill phase, it must call your optimized Task 4 FlashAttention kernel to process the full prompt.
- Detect the Decode Phase: If the model detects the decode phase, it must call your specialized Task 5 decode kernel, using the provided KV cache.
- Manage the Cache: In the decode phase, it must correctly call your `update_kv_cache` kernel (from Task 5) to append the newly computed K and V vectors to the cache, making them available for the next decode step. In the prefill phase, it must store the current K and V vectors in the cache for future use.
Testing and Reporting
The test script for this task simulates a full generative inference loop, generating 100 tokens:

```bash
python -m task6.test
```

This test will measure two key performance metrics:
- TTFT (Time To First Token): This measures the performance of your prefill kernel (Task 4).
- TBT (Time Between Tokens): This measures the average performance of your decode kernel (Task 5) over many steps.
Please report the final TTFT and TBT times in Results.md.
This final task is a demo to see your kernels working in a real end-to-end LLM inference pipeline with a Hugging Face model. You will see your custom attention kernel generating real tokens!
We (or the "cursor") have provided all the code for you, which attaches your prefill (Task 4) and decode (Task 5) kernels to a GPT-2 model. There is nothing new for you to code.
Your only job here is to run the model and verify that it works correctly. Make sure that your LLM is generating something useful and coherent. If it generates random, noisy tokens, it means your attention kernel has a bug. You will need to go back and debug your CUDA implementations from Tasks 4 and 5.
Running the Demo
To run the inference with your custom kernels:
```bash
python -m task7.inference
```

Please report the generated text in Results.md.
Verifying Correctness
You might observe that the GPT-2 model generates sentences that are not factually correct. This is expected, as GPT-2 is an older model.
The important thing is to check if your model's "intelligence" matches the original GPT-2. To do this, you can run the reference implementation, which uses the standard Hugging Face attention:
```bash
python -m task7.inference_ref
```

Please check whether you think the two models are "equally intelligent." If your model (task7.inference) seems significantly dumber or produces nonsensical garbage compared to the reference, go back and check your kernel implementations for bugs.
We are still finalizing the official code collection process for submission.
For all current beta testers, please upload your completed Results.md file directly to the class Teams channel. We appreciate your help in testing this assignment.
Furthermore, please feel free to make any pull requests to this repository to update the content of this README file. If you find any discrepancies, or if you think additional materials would be helpful for future students, your contributions are welcome.
Euijun Chung (echung67@gatech.edu)