What I cannot create, I do not understand. - Richard Feynman
I wanted to try CUDA programming for a long time, and this weekend I finally checked it off.
When learning something new, I follow something I call the W3 rule.
Each W stands for:
- What - What is CUDA?
- Why - Why CUDA?
- How - How to CUDA?
What is CUDA?
Matrix multiplication is by far the simplest hands-on exercise for learning CUDA; I'd say it is the Hello World of CUDA programming.
Why? Because parallelism shows up very naturally here. If we multiply two matrices A and B to get C, each value in C can be computed independently.
That is a perfect fit for CUDA’s execution model: give one thread one output cell of C, then let many threads work across the GPU at the same time.
One small but important detail: one thread is not doing just one multiplication. It is responsible for one final output cell, so it still has to compute the full dot product for that cell.
More formally, each output element of C = A x B can be computed independently as follows:

C[row][col] = sum over k of A[row][k] * B[k][col], for k = 0 to K - 1
This kernel (kernel is just CUDA's name for a function that runs on the GPU) is considered naive because it does not use shared memory. In other words, each thread keeps going back to global memory instead of first cooperating with nearby threads to cache tiles of A and B.
So I would not read too much into the profiled results just yet. There is still plenty of room to improve it, and that is part of what makes this a nice first CUDA exercise.
C’s Shape
You might already know this from linear algebra, but it helps to quickly revisit how matrix shapes relate.
If:
- A has shape M x K
- B has shape K x N
then:
- C has shape M x N
So each output position needs:
- one row from A
- one column from B
That is exactly why row and col are the two indices that matter most in the kernel.
CUDA’s Thread Mapping
CUDA organizes work into blocks and grids. An easy way to picture it is as larger containers that each hold many smaller ones.
Each thread combines blockIdx and threadIdx to compute one output coordinate (row, col).
In the chart, each larger box is a block, and each small square inside it is a thread. The important idea is not that a thread has some mysterious global thread number floating around. The real story is simpler: each thread knows which block it is in, and where it sits inside that block. From those two pieces, it can compute its output coordinates (row, col).
dim3 block(16, 16);
dim3 grid((N + block.x - 1) / block.x, (M + block.y - 1) / block.y);
The first line sets up a block of 256 threads arranged as 16 x 16.
The grid is then computed in 2D:
- grid.x covers columns (N) with block.x threads per block in x.
- grid.y covers rows (M) with block.y threads per block in y.
The + block_size - 1 part is a neat integer trick for ceiling division.
For any dimension, it is equivalent to:
ceil(total / block_size) = (total + block_size - 1) / block_size
Quick intuition:
- If N = 65 and block.x = 16, then grid.x = (65 + 16 - 1) / 16 = 5.
- If N = 64 and block.x = 16, then grid.x = (64 + 16 - 1) / 16 = 4.
So this handles both exact multiples and dimensions that do not divide evenly.
Now let us see how this 2D setup maps block coordinates and thread coordinates to matrix coordinates.
int row = blockIdx.y * blockDim.y + threadIdx.y;
int col = blockIdx.x * blockDim.x + threadIdx.x;
- blockIdx.y is the block's row index in the grid
- blockDim.y is the number of threads along the block's y dimension
- threadIdx.y is the thread's row index within its block
- row gives the output row this thread is responsible for
The exact same logic in x handles the columns.
So you can read the formula like this:
- blockIdx.y * blockDim.y jumps to the first row covered by this block
- + threadIdx.y moves to this thread's row inside the block
The same reasoning applies in x for columns.
Say we want to calculate the row and col values for the thread at local position (0, 1) inside block (1, 0) in the chart above:
- blockIdx = (1, 0)
- blockDim = (3, 3)
- threadIdx = (0, 1)
row = blockIdx.y * blockDim.y + threadIdx.y
= 0 * 3 + 1
= 1
col = blockIdx.x * blockDim.x + threadIdx.x
= 1 * 3 + 0
= 3
Before we jump into the loop, it helps to look at the memory mapping this relies on.
Flattened Memory Mapping
Even though we think about matrices in 2D, they are stored in contiguous 1D memory.
index = row * numberOfColumns + col.
In row-major layout:
- A[row, k] becomes A[row * K + k]
- B[k, col] becomes B[k * N + col]
- C[row, col] becomes C[row * N + col]
That is the core mapping the kernel keeps using behind the scenes.
Now we multiply the elements in row row of A with the elements in column col of B, then sum them to get one output value.
So each thread walks along one row of A and one column of B, pairing entries one by one.
float sum = 0.0f;
for (int k = 0; k < K; k++) {
    sum += da[row * K + k] * db[k * N + col];
}
dc[row * N + col] = sum;
- da and db represent the matrices A and B in the flattened memory layout.
- K is the number of columns in A and rows in B.
- row * K + k points to A[row, k]: row * K jumps to the start of row row in A, and + k moves across columns.
- k * N + col points to B[k, col]: k * N jumps to the start of row k in B, and + col picks the target column.
So even though one thread owns just one output cell, it still has to do K multiply-add steps to finish that cell.
Kernel’s Code - The HOW!
Here is the code stripped down to the parts that matter most:
#include <iostream>
using namespace std;
// CUDA kernel: computes one output element per thread
__global__ void matmul(float* da, float* db, float* dc, int M, int N, int K) {
int row = blockIdx.y * blockDim.y + threadIdx.y; // global row index
int col = blockIdx.x * blockDim.x + threadIdx.x; // global column index
if (row < M && col < N) { // bounds check for matrix dimensions
float sum = 0;
for (int k = 0; k < K; k++) { // dot product of row and column
sum += da[row * K + k] * db[k * N + col];
}
dc[row * N + col] = sum; // write result to output matrix
}
}
int main() {
int M = 8072, N = 8072, K = 8072;
// host memory allocation for matrices A, B, and C
float *ha = new float[M * K];
float *hb = new float[K * N];
float *hc = new float[M * N];
float *da, *db, *dc;
// initialize input matrices
for (int i = 0; i < M * K; i++) ha[i] = 1.0f;
for (int i = 0; i < K * N; i++) hb[i] = 1.0f;
// allocate device memory
cudaMalloc(&da, M * K * sizeof(float));
cudaMalloc(&db, K * N * sizeof(float));
cudaMalloc(&dc, M * N * sizeof(float));
// create CUDA events for kernel timing
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start); // start timing (note: this also times the host-to-device copies below)
// copy inputs from host to device
cudaMemcpy(da, ha, M * K * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(db, hb, K * N * sizeof(float), cudaMemcpyHostToDevice);
// define execution configuration (2D grid of 16x16 thread blocks)
dim3 block(16, 16);
dim3 grid((N + 15) / 16, (M + 15) / 16);
// launch matrix multiplication kernel
matmul<<<grid, block>>>(da, db, dc, M, N, K);
cudaDeviceSynchronize(); // ensure kernel completion
cudaEventRecord(stop); // stop timing
cudaEventSynchronize(stop);
float ms = 0;
cudaEventElapsedTime(&ms, start, stop); // compute elapsed time
cout << "Kernel time: " << ms << " ms\n";
// copy result back to host
cudaMemcpy(hc, dc, M * N * sizeof(float), cudaMemcpyDeviceToHost);
// free device memory
cudaFree(da);
cudaFree(db);
cudaFree(dc);
// free host memory
delete[] ha;
delete[] hb;
delete[] hc;
return 0;
}
Important syntax points:
- __global__ means the function is a CUDA kernel launched from the CPU and executed on the GPU.
- <<<grid, block>>> is CUDA launch syntax.
- blockIdx, blockDim, and threadIdx are built-in CUDA variables.
- The boundary check matters because the grid is rounded up to cover the whole matrix. If N = 65 and a block covers 16 columns, the last block still launches threads for columns 64 through 79, even though only column 64 is valid.
BEATS NumPy!? -> That’s Why!
I profiled this kernel against NumPy’s matrix multiplication in a Colab notebook, and for matrices of size 8072 x 8072, it showed about a 4x speedup.
That does not mean this kernel is anywhere near optimal. It just shows that even a simple “one thread per output cell” GPU version can outperform a CPU-side baseline once the matrices are large enough.
You can find that code here if you want to look through the notebook and results yourself.
*Note: I used a T4 GPU in Colab, so the results may vary based on the GPU used.