NextGen AI @UC-19s9JDuopv1csg_e5nTIg@youtube.com

NextGen AI
Posted 4 months ago

If you want to work in the field of AI, more than ever, you need to understand the hardware used to train your models!

In a GPU, the smallest unit of processing is called a “thread”. It can perform simple arithmetic operations like addition, subtraction, multiplication, and division. Common GPU cards have thousands of CUDA cores, each of which can run many threads. For example, the NVIDIA H100 has 16,896 CUDA cores.
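
To make that concrete, here is a minimal CUDA sketch (the kernel name, sizes, and values are purely illustrative): each thread performs a single addition, and the grid launches enough threads to cover the whole array.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread handles exactly one element: one tiny arithmetic operation.
__global__ void add_kernel(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) {
        c[i] = a[i] + b[i];
    }
}

int main() {
    const int n = 1 << 20;                 // 1M elements (illustrative size)
    float *a, *b, *c;
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    int threads_per_block = 256;
    int blocks = (n + threads_per_block - 1) / threads_per_block;
    add_kernel<<<blocks, threads_per_block>>>(a, b, c, n);
    cudaDeviceSynchronize();

    printf("c[0] = %f\n", c[0]);           // expect 3.0
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```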

Threads are grouped into thread blocks, where every thread runs the same program. For example, common NVIDIA GPU cards allow up to 1,024 threads per thread block. Each thread block has access to a fast shared memory (SRAM). That memory is small but fast: most high-end GPUs have between 10 MB and 40 MB of it in total across the chip.
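
Here is a hedged sketch of how a thread block cooperates through that shared memory, using a simple block-wide sum as the example (the kernel name and sizes are assumptions, not anything from the post):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define BLOCK_SIZE 1024  // common per-block thread limit on NVIDIA GPUs

// All threads in a block run the same code and share a fast on-chip buffer.
__global__ void block_sum(const float* in, float* out, int n) {
    __shared__ float tile[BLOCK_SIZE];     // lives in shared memory (SRAM)

    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    tile[tid] = (i < n) ? in[i] : 0.0f;    // stage data from global memory
    __syncthreads();                       // wait for the whole block

    // Tree reduction entirely inside shared memory.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) tile[tid] += tile[tid + stride];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = tile[0];  // one result per block
}

int main() {
    const int n = 1 << 20;
    int blocks = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, blocks * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = 1.0f;

    block_sum<<<blocks, BLOCK_SIZE>>>(in, out, n);
    cudaDeviceSynchronize();

    float total = 0.0f;
    for (int b = 0; b < blocks; ++b) total += out[b];
    printf("sum = %.0f (expected %d)\n", total, n);
    cudaFree(in); cudaFree(out);
    return 0;
}
```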

All the thread blocks also share a large global memory. On most of the latest GPUs, this global memory is a faster high-bandwidth memory (HBM). HBM can be more than a thousand times larger than the SRAM. Data access to HBM is fast, but still slower than the SRAM's:
- HBM bandwidth: 1.5-2.0 TB/s
- SRAM bandwidth: ~19 TB/s, roughly 10x HBM
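
As a rough back-of-the-envelope illustration of what those numbers mean (assuming, purely for the example, a 4096 x 4096 attention-score matrix stored in fp16), a single pass over that matrix costs roughly ten times longer from HBM than from SRAM:

```cuda
#include <cstdio>

int main() {
    // Illustrative assumptions only: sequence length 4096, fp16 scores (2 bytes each).
    const double bytes   = 4096.0 * 4096.0 * 2.0;  // ~33.5 MB for the S matrix
    const double hbm_bw  = 2.0e12;                 // 2.0 TB/s, upper HBM figure above
    const double sram_bw = 19.0e12;                // 19 TB/s, SRAM figure above

    printf("S matrix size      : %.1f MB\n", bytes / 1e6);
    printf("one pass over HBM  : %.1f us\n", bytes / hbm_bw * 1e6);
    printf("one pass over SRAM : %.1f us\n", bytes / sram_bw * 1e6);
    return 0;
}
```

A naive attention implementation, sketched further below, pays the HBM cost several times per layer, and that traffic is exactly what the rest of the post is about avoiding.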

Understanding the way data moves to and from these memories is critical to writing better algorithms. For example, in the attention layer, we need to compute the matrix multiplication between the queries and the keys:

S = QK^T

The computation gets distributed across thread blocks, and the resulting matrix S gets written into the global memory (or HBM, if available). Once this is done, we need to pull the S matrix back onto the threads to compute the softmax transformation:

Attention = Softmax(S)

And again, we need to move the resulting matrix back to the global memory.
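
To spell that data flow out, here is a deliberately naive two-kernel sketch (kernel names, sizes, and the one-thread-per-row softmax are illustrative simplifications): the first kernel writes S to global memory, and the second has to read it back just to apply the row-wise softmax, then writes the result out again.

```cuda
#include <cmath>
#include <cstdio>
#include <cuda_runtime.h>

// Kernel 1: S = Q * K^T. One thread per output element; the full S matrix
// lands in global memory (HBM).
__global__ void scores_kernel(const float* Q, const float* K, float* S, int N, int d) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < d; ++k)
            acc += Q[row * d + k] * K[col * d + k];   // K^T: dot product of two rows
        S[row * N + col] = acc;
    }
}

// Kernel 2: row-wise softmax. S has to be read back from HBM and the result
// written out to HBM again. One thread per row, kept deliberately simple.
__global__ void softmax_kernel(float* S, int N) {
    int row = blockIdx.x;
    if (threadIdx.x != 0) return;
    float m = -INFINITY;
    for (int j = 0; j < N; ++j) m = fmaxf(m, S[row * N + j]);
    float sum = 0.0f;
    for (int j = 0; j < N; ++j) { S[row * N + j] = expf(S[row * N + j] - m); sum += S[row * N + j]; }
    for (int j = 0; j < N; ++j) S[row * N + j] /= sum;
}

int main() {
    const int N = 1024, d = 64;                        // illustrative sizes
    float *Q, *K, *S;
    cudaMallocManaged(&Q, N * d * sizeof(float));
    cudaMallocManaged(&K, N * d * sizeof(float));
    cudaMallocManaged(&S, (size_t)N * N * sizeof(float));
    for (int i = 0; i < N * d; ++i) { Q[i] = 0.01f; K[i] = 0.02f; }

    dim3 block(16, 16);
    dim3 grid((N + 15) / 16, (N + 15) / 16);
    scores_kernel<<<grid, block>>>(Q, K, S, N, d);     // round trip #1: S -> HBM
    softmax_kernel<<<N, 1>>>(S, N);                    // round trip #2: HBM -> S -> HBM
    cudaDeviceSynchronize();

    printf("S[0] after softmax = %f\n", S[0]);         // expect 1/N = 0.000977 here
    cudaFree(Q); cudaFree(K); cudaFree(S);
    return 0;
}
```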

The matrices get moved back and forth between the threads and the global memory because, in a naive implementation, there is no way to isolate the computation within a single thread block and therefore use the SRAM to cache the intermediate matrices. One strategy that is becoming common is to tile the computation of the S matrix into smaller matrices, such that each small operation can be isolated on a single thread block.
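
Here is a minimal sketch of that tiling idea for S = QK^T (the 32 x 32 tile size and the kernel name are assumptions): each thread block stages a tile of Q and a tile of K in shared memory and accumulates one tile of S, so the partial products never touch HBM.

```cuda
#include <cuda_runtime.h>

#define TILE 32  // illustrative tile size; one thread block owns one TILE x TILE tile of S

__global__ void scores_tiled(const float* Q, const float* K, float* S, int N, int d) {
    __shared__ float Qs[TILE][TILE];   // tile of Q rows, staged in fast SRAM
    __shared__ float Ks[TILE][TILE];   // tile of K rows, staged in fast SRAM

    int row = blockIdx.y * TILE + threadIdx.y;   // row of S (query index)
    int col = blockIdx.x * TILE + threadIdx.x;   // column of S (key index)
    float acc = 0.0f;

    // March over the head dimension d in TILE-wide chunks.
    for (int k0 = 0; k0 < d; k0 += TILE) {
        int kq = k0 + threadIdx.x;    // which feature this thread loads for Q
        int kk = k0 + threadIdx.y;    // which feature this thread loads for K
        Qs[threadIdx.y][threadIdx.x] = (row < N && kq < d) ? Q[row * d + kq] : 0.0f;
        Ks[threadIdx.x][threadIdx.y] = (col < N && kk < d) ? K[col * d + kk] : 0.0f;
        __syncthreads();

        // Partial dot products read only from shared memory, never from HBM.
        for (int k = 0; k < TILE; ++k)
            acc += Qs[threadIdx.y][k] * Ks[threadIdx.x][k];
        __syncthreads();
    }
    if (row < N && col < N) S[row * N + col] = acc;   // only the finished tile goes back to HBM
}
```

Launched with dim3 block(TILE, TILE) and a grid of (N + TILE - 1) / TILE blocks in each dimension, it is a drop-in replacement for scores_kernel in the previous sketch.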

This strategy is used in what we call FlashAttention, which makes much more efficient use of the fast access to the SRAM. Here is the paper, if you want to read it: arxiv.org/pdf/2307.08691
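
The piece that makes the softmax itself tileable is the online-softmax rescaling that FlashAttention builds on: a row can be consumed tile by tile while keeping only a running maximum and a running sum. A minimal single-threaded host-side sketch of that idea (function name and sizes are illustrative):

```cuda
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// Online softmax over one row, processed in tiles: only a running max and a
// running sum are kept, so the full row of scores never has to be materialized.
std::vector<float> online_softmax(const std::vector<float>& scores, int tile) {
    float m = -INFINITY;   // running maximum seen so far
    float l = 0.0f;        // running sum of exp(score - m)
    int n = (int)scores.size();

    for (int start = 0; start < n; start += tile) {
        int end = std::min(start + tile, n);
        float m_tile = -INFINITY;
        for (int j = start; j < end; ++j) m_tile = std::max(m_tile, scores[j]);
        float m_new = std::max(m, m_tile);

        float l_tile = 0.0f;
        for (int j = start; j < end; ++j) l_tile += std::exp(scores[j] - m_new);
        l = l * std::exp(m - m_new) + l_tile;   // rescale the old sum to the new max
        m = m_new;
    }

    // One final pass produces the normalized probabilities.
    std::vector<float> out(n);
    for (int j = 0; j < n; ++j) out[j] = std::exp(scores[j] - m) / l;
    return out;
}

int main() {
    std::vector<float> s = {1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f, 8.0f};
    auto p = online_softmax(s, /*tile=*/3);     // same result as a one-shot softmax
    float sum = 0.0f;
    for (float x : p) sum += x;
    printf("sum of probabilities = %f\n", sum); // expect 1.0
    return 0;
}
```

In FlashAttention the same rescaling is also applied to the partially accumulated output softmax(S)V, which is what lets every tile of the computation stay in SRAM.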
