What I cannot create, I do not understand. - Richard Feynman
I wanted to try CUDA programming for a long time, and this weekend I finally checked it off.
When learning something new, I follow something I call the W3 rule.
Each W stands for:
- What - What is CUDA?
- Why - Why CUDA?
- How - How to CUDA?
What is CUDA?
Matrix multiplication is by far the simplest hands-on exercise for learning CUDA; I'd say it is the Hello World of CUDA programming.
Why? Because parallelism shows up very naturally here. If we multiply two matrices A and B to get C, each value in C can be computed independently.
That is a perfect fit for CUDA’s execution model: give one thread one output cell of C, then let many threads work across the GPU at the same time.
One small but important detail: one thread is not doing just one multiplication. It is responsible for one final output cell, so it still has to compute the full dot product for that cell.
More formally, each output element of C = A x B can be computed independently as follows:

C[row][col] = sum over k of A[row][k] * B[k][col], for k = 0 to K - 1
This kernel (kernel is just CUDA's name for a function that runs on the GPU) is considered naive because it does not use shared memory. In other words, each thread keeps going back to global memory instead of first cooperating with nearby threads to cache tiles of A and B.
So I would not read too much into the profiled results just yet. There is still plenty of room to improve it, and that is part of what makes this a nice first CUDA exercise.
C’s Shape
You might already know this from linear algebra, but it helps to quickly revisit how matrix shapes relate.
If:
- A has shape M x K
- B has shape K x N
then:
- C has shape M x N
So each output position needs:
- one row from A
- one column from B
That is exactly why row and col are the two indices that matter most in the kernel.
CUDA’s Thread Mapping
CUDA organizes work into blocks and grids. An easy way to picture it is as larger containers that each hold many smaller ones.
Each thread combines blockIdx and threadIdx to compute one output coordinate (row, col).
In the chart, each larger box is a block, and each small square inside it is a thread. The important idea is not that a thread has some mysterious global thread number floating around. The real story is simpler: each thread knows which block it is in, and where it sits inside that block. From those two pieces, it can compute its output coordinates (row, col).
dim3 block(16, 16);
dim3 grid((N + block.x - 1) / block.x, (M + block.y - 1) / block.y);
The first line sets up a block of 256 threads arranged as 16 x 16.
The grid is then computed in 2D:
- grid.x covers columns (N) with block.x threads per block in x.
- grid.y covers rows (M) with block.y threads per block in y.
The + block_size - 1 part is a neat integer trick for ceiling division.
For any dimension, it is equivalent to:
ceil(total / block_size) = (total + block_size - 1) / block_size
Quick intuition:
- If N = 65 and block.x = 16, then grid.x = (65 + 16 - 1) / 16 = 5.
- If N = 64 and block.x = 16, then grid.x = (64 + 16 - 1) / 16 = 4.
So this handles both exact multiples and dimensions that do not divide evenly.
Now let us see how this 2D setup maps block coordinates and thread coordinates to matrix coordinates.
int row = blockIdx.y * blockDim.y + threadIdx.y;
int col = blockIdx.x * blockDim.x + threadIdx.x;
- blockIdx.y is the block's row index in the grid
- blockDim.y is the number of threads along the block's y dimension
- threadIdx.y is the thread's row index within its block
- row gives the output row this thread is responsible for
The exact same logic in x handles the columns.
So you can read the formula like this:
- blockIdx.y * blockDim.y jumps to the first row covered by this block
- + threadIdx.y moves to this thread's row inside the block
The same reasoning applies in x for columns.
Say we want to calculate the row and col values for the thread at local position (0, 1) inside block (1, 0) in the chart above:
- blockIdx = (1, 0)
- blockDim = (3, 3)
- threadIdx = (0, 1)
row = blockIdx.y * blockDim.y + threadIdx.y
= 0 * 3 + 1
= 1
col = blockIdx.x * blockDim.x + threadIdx.x
= 1 * 3 + 0
= 3
Before we jump into the loop, it helps to look at the memory mapping this relies on.
Flattened Memory Mapping
Even though we think about matrices in 2D, they are stored in contiguous 1D memory.
index = row * numberOfColumns + col.
In row-major layout:
- A[row, k] becomes A[row * K + k]
- B[k, col] becomes B[k * N + col]
- C[row, col] becomes C[row * N + col]
That is the core mapping the kernel keeps using behind the scenes.
Now we multiply the elements in row row of A with the elements in column col of B, then sum them to get one output value.
So each thread walks along one row of A and one column of B, pairing entries one by one.
float sum = 0.0f;
for (int k = 0; k < K; k++) {
    sum += da[row * K + k] * db[k * N + col];
}
dc[row * N + col] = sum;
- da and db represent the matrices A and B in the flattened memory layout.
- K is the number of columns in A and rows in B.
- row * K + k points to A[row, k]: row * K jumps to the start of row row in A, and + k moves across columns.
- k * N + col points to B[k, col]: k * N jumps to the start of row k in B, and + col picks the target column.
So even though one thread owns just one output cell, it still has to do K multiply-add steps to finish that cell.
Kernel’s Code - The HOW!
Here is the code stripped down to the parts that matter most:
#include <iostream>
using namespace std;
// CUDA kernel: computes one output element per thread
__global__ void matmul(float* da, float* db, float* dc, int M, int N, int K) {
int row = blockIdx.y * blockDim.y + threadIdx.y; // global row index
int col = blockIdx.x * blockDim.x + threadIdx.x; // global column index
if (row < M && col < N) { // bounds check for matrix dimensions
float sum = 0;
for (int k = 0; k < K; k++) { // dot product of row and column
sum += da[row * K + k] * db[k * N + col];
}
dc[row * N + col] = sum; // write result to output matrix
}
}
int main() {
int M = 8072, N = 8072, K = 8072;
// host memory allocation for matrices A, B, and C
float *ha = new float[M * K];
float *hb = new float[K * N];
float *hc = new float[M * N];
float *da, *db, *dc;
// initialize input matrices
for (int i = 0; i < M * K; i++) ha[i] = 1.0f;
for (int i = 0; i < K * N; i++) hb[i] = 1.0f;
// allocate device memory
cudaMalloc(&da, M * K * sizeof(float));
cudaMalloc(&db, K * N * sizeof(float));
cudaMalloc(&dc, M * N * sizeof(float));
// create CUDA events for kernel timing
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start); // start timing (note: this also times the host-to-device copies below)
// copy inputs from host to device
cudaMemcpy(da, ha, M * K * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(db, hb, K * N * sizeof(float), cudaMemcpyHostToDevice);
// define execution configuration (2D grid of 16x16 thread blocks)
dim3 block(16, 16);
dim3 grid((N + 15) / 16, (M + 15) / 16);
// launch matrix multiplication kernel
matmul<<<grid, block>>>(da, db, dc, M, N, K);
cudaDeviceSynchronize(); // ensure kernel completion
cudaEventRecord(stop); // stop timing
cudaEventSynchronize(stop);
float ms = 0;
cudaEventElapsedTime(&ms, start, stop); // compute elapsed time
cout << "Kernel time: " << ms << " ms\n";
// copy result back to host
cudaMemcpy(hc, dc, M * N * sizeof(float), cudaMemcpyDeviceToHost);
// free device memory
cudaFree(da);
cudaFree(db);
cudaFree(dc);
// free host memory
delete[] ha;
delete[] hb;
delete[] hc;
return 0;
}
Important syntax points:
- __global__ means the function is a CUDA kernel launched from the CPU and executed on the GPU.
- <<<grid, block>>> is CUDA launch syntax.
- blockIdx, blockDim, and threadIdx are built-in CUDA variables.
- The boundary check matters because the grid is rounded up to cover the whole matrix. If N = 65 and a block covers 16 columns, the last block still launches threads for columns 64 through 79, even though only column 64 is valid.
BEATS NumPy!? -> That’s Why!
I profiled this kernel against NumPy’s matrix multiplication in a Colab notebook, and for matrices of size 8072 x 8072, it showed about a 4x speedup.
That does not mean this kernel is anywhere near optimal. It just shows that even a simple “one thread per output cell” GPU version can outperform a CPU-side baseline once the matrices are large enough.
You can find that code here if you want to look through the notebook and results yourself.
*Note: I used a T4 GPU in Colab, so the results may vary based on the GPU used.