
Learning CUDA Through Matrix Multiplication

What I cannot create, I do not understand. - Richard Feynman

I had wanted to try CUDA programming for a long time, and this weekend I finally checked it off.

When learning something new, I follow something I call the W3 rule.

Each W stands for:

- What is it?
- hoW does it work?
- Why does it matter?

What is CUDA?

CUDA is NVIDIA's platform and programming model for writing general-purpose programs that run on the GPU. Matrix multiplication is by far the simplest hands-on exercise to start learning it; I'd say it's the Hello World of CUDA programming. Why? Because parallelism shows up very naturally here. If we multiply two matrices A and B to get C, each value in C can be computed independently.

That is a perfect fit for CUDA’s execution model: give one thread one output cell of C, then let many threads work across the GPU at the same time. One small but important detail: one thread is not doing just one multiplication. It is responsible for one final output cell, so it still has to compute the full dot product for that cell. More formally, each output element of

C = A \times B

can be computed independently as follows:

C_{[row,col]} = \sum_{k=0}^{K-1} A_{[row,k]} \, B_{[k,col]}

This kernel (kernel is just CUDA's name for a function that runs on the GPU) is considered naive because it does not use shared memory. In other words, each thread keeps going back to global memory instead of first cooperating with nearby threads to cache tiles of A and B. So I would not read too much into the profiled results just yet. There is still plenty of room to improve it, and that is part of what makes this a nice first CUDA exercise.
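
To ground the formula before any GPU code appears, here is a plain CPU version of the same computation. This is my own reference sketch, not part of the post's kernel, but it uses the same row-major indexing the kernel will use:

void matmulCPU(const float* A, const float* B, float* C, int M, int N, int K) {
    for (int row = 0; row < M; row++) {
        for (int col = 0; col < N; col++) {
            float sum = 0.0f;
            for (int k = 0; k < K; k++) {
                sum += A[row * K + k] * B[k * N + col]; // one dot product step
            }
            C[row * N + col] = sum; // one output cell done
        }
    }
}

The GPU version will essentially peel off the two outer loops and hand each (row, col) pair to its own thread.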


C’s Shape

You might already know this from linear algebra, but it helps to quickly revisit how matrix shapes relate.

If:

- A has shape M x K
- B has shape K x N

then:

- C has shape M x N (for instance, a 2 x 3 A times a 3 x 4 B gives a 2 x 4 C)

So each output position needs:

- one full row of A (K elements)
- one full column of B (K elements)

That is exactly why row and col are the two indices that matter most in the kernel.


CUDA’s Thread Mapping

CUDA organizes work into blocks and grids. An easy way to picture it is as larger containers that each hold many smaller ones.

Grid (2D)

Each thread is identified by its block coordinates and its local coordinates inside that block.

[Diagram: a 2D grid of six blocks, three across and two down, labeled block (0, 0) through block (2, 1). Each block holds a 3 x 3 arrangement of threads, each labeled with its local (x, y) coordinates from (0, 0) to (2, 2).]
The kernel uses blockIdx and threadIdx to compute one output coordinate (row, col) per thread.

In the chart, each larger box is a block, and each small square inside it is a thread. The important idea is not that a thread has some mysterious global thread number floating around. The real story is simpler: each thread knows which block it is in, and where it sits inside that block. From those two pieces, it can compute its output coordinates (row, col).

dim3 block(16, 16);
dim3 grid((N + block.x - 1) / block.x, (M + block.y - 1) / block.y);

The first line sets up a block of 256 threads arranged as 16 x 16. The grid is then computed in 2D:

- grid.x = ceil(N / 16), enough blocks to cover all N columns
- grid.y = ceil(M / 16), enough blocks to cover all M rows

The + block_size - 1 part is a neat integer trick for ceiling division. For any dimension, it is equivalent to:

ceil(total / block_size) = (total + block_size - 1) / block_size

Quick intuition:

- If total = 32 and block_size = 16, then (32 + 15) / 16 = 47 / 16 = 2 blocks: an exact fit.
- If total = 33, then (33 + 15) / 16 = 48 / 16 = 3 blocks: the one leftover element still gets a whole block to itself.

So this handles both exact multiples and dimensions that do not divide evenly.
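
As a tiny sketch (ceilDiv is a made-up helper name, not CUDA API), the trick can be wrapped up like this:

constexpr int ceilDiv(int total, int blockSize) {
    return (total + blockSize - 1) / blockSize; // integer ceiling division
}

// ceilDiv(32, 16) == 2   (exact multiple: no extra block)
// ceilDiv(33, 16) == 3   (leftover element still needs a whole block)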

Now let us see how this 2D setup maps block coordinates and thread coordinates to matrix coordinates.

  int row = blockIdx.y * blockDim.y + threadIdx.y;
  int col = blockIdx.x * blockDim.x + threadIdx.x;

So you can read the formula like this:

- blockIdx.y * blockDim.y jumps to the first row owned by this block.
- threadIdx.y then offsets to the specific row within that block.

The exact same logic in x handles the columns.

Say we want to calculate the row and col values for the thread at local position (0, 1) inside block (1, 0) in the chart above:

  row = blockIdx.y * blockDim.y + threadIdx.y
    = 0 * 3 + 1
    = 1

  col = blockIdx.x * blockDim.x + threadIdx.x
    = 1 * 3 + 0
    = 3
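
If you want to see this mapping for yourself, a tiny throwaway kernel can print every thread's coordinates. This is my own sketch (whoAmI is a made-up name), launched with the same 3 x 2 grid of 3 x 3 blocks as the chart above:

#include <cstdio>

__global__ void whoAmI() {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    printf("block (%d,%d), thread (%d,%d) -> row %d, col %d\n",
           blockIdx.x, blockIdx.y, threadIdx.x, threadIdx.y, row, col);
}

int main() {
    whoAmI<<<dim3(3, 2), dim3(3, 3)>>>();
    cudaDeviceSynchronize(); // wait for the device-side printf output to flush
    return 0;
}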

Before we jump into the loop, it helps to look at the memory mapping this relies on.


Flattened Memory Mapping

Even though we think about matrices in 2D, they are stored in contiguous 1D memory.


Row-major flattening uses index = row * numberOfColumns + col.

In row-major layout:

- all of row 0 is stored first, then all of row 1, and so on
- moving one column to the right adds 1 to the index
- moving one row down adds numberOfColumns to the index

That is the core mapping the kernel keeps using behind the scenes.
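
As a quick sketch (flatten is a made-up helper name, not something the kernel actually calls), the mapping looks like this:

int flatten(int row, int col, int numberOfColumns) {
    return row * numberOfColumns + col; // row-major 2D -> 1D index
}

// In a 3 x 4 matrix, element (1, 2) lives at index 1 * 4 + 2 = 6.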


Now we multiply the elements in row row of A with the elements in column col of B, then sum them to get one output value. So each thread walks along one row of A and one column of B, pairing entries one by one.

  for (int k = 0; k < K; k++) {
      sum += da[row * K + k] * db[k * N + col];
  }
  dc[row * N + col] = sum;

So even though one thread owns just one output cell, it still has to do K multiply-add steps to finish that cell.


Kernel’s Code - The HOW!

Here is the code stripped down to the parts that matter most:

#include <iostream>
using namespace std;

// CUDA kernel: computes one output element per thread
__global__ void matmul(float* da, float* db, float* dc, int M, int N, int K) {
    int row = blockIdx.y * blockDim.y + threadIdx.y; // global row index
    int col = blockIdx.x * blockDim.x + threadIdx.x; // global column index

    if (row < M && col < N) { // bounds check for matrix dimensions
        float sum = 0;
        for (int k = 0; k < K; k++) { // dot product of row and column
            sum += da[row * K + k] * db[k * N + col];
        }
        dc[row * N + col] = sum; // write result to output matrix
    }
}

int main() {
    int M = 8072, N = 8072, K = 8072;

    // host memory allocation for matrices A, B, and C
    float *ha = new float[M * K];
    float *hb = new float[K * N];
    float *hc = new float[M * N];

    float *da, *db, *dc;

    // initialize input matrices
    for (int i = 0; i < M * K; i++) ha[i] = 1.0f;
    for (int i = 0; i < K * N; i++) hb[i] = 1.0f;

    // allocate device memory
    cudaMalloc(&da, M * K * sizeof(float));
    cudaMalloc(&db, K * N * sizeof(float));
    cudaMalloc(&dc, M * N * sizeof(float));

    // create CUDA events for kernel timing
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // copy inputs from host to device
    cudaMemcpy(da, ha, M * K * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, K * N * sizeof(float), cudaMemcpyHostToDevice);

    cudaEventRecord(start); // start timing just before the kernel launch

    // define execution configuration (2D grid of 16x16 thread blocks)
    dim3 block(16, 16);
    dim3 grid((N + block.x - 1) / block.x, (M + block.y - 1) / block.y);


    // launch matrix multiplication kernel
    matmul<<<grid, block>>>(da, db, dc, M, N, K);

    cudaDeviceSynchronize(); // ensure kernel completion

    cudaEventRecord(stop); // stop timing
    cudaEventSynchronize(stop);

    float ms = 0;
    cudaEventElapsedTime(&ms, start, stop); // compute elapsed time

    cout << "Kernel time: " << ms << " ms\n";

    // copy result back to host
    cudaMemcpy(hc, dc, M * N * sizeof(float), cudaMemcpyDeviceToHost);

    // free device memory
    cudaFree(da);
    cudaFree(db);
    cudaFree(dc);

    // free host memory
    delete[] ha;
    delete[] hb;
    delete[] hc;

    return 0;
}
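
Before trusting any timing numbers, it is worth checking correctness. Since every input element is 1.0f, every element of C should come out to exactly K (a sum of K ones, which is exact in float32 because all partial sums stay below 2^24). A sketch of a check that would go right after the cudaMemcpy back to the host:

    // sanity check: with all-ones inputs, every output element should equal K
    bool ok = true;
    for (int i = 0; i < M * N; i++) {
        if (hc[i] != (float)K) { ok = false; break; }
    }
    cout << (ok ? "Result verified\n" : "Mismatch found\n");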

Important syntax points:

- __global__ marks matmul as a kernel: a function called from the host but executed on the device.
- matmul<<<grid, block>>>(...) is the launch syntax; the triple angle brackets carry the execution configuration.
- cudaMalloc, cudaMemcpy, and cudaFree manage device memory; the host (ha, hb, hc) and device (da, db, dc) pointers live in separate address spaces.
- Kernel launches are asynchronous, which is why the code synchronizes before reading the timer.
- The file must be compiled with nvcc (for example, nvcc matmul.cu -o matmul), not a plain C++ compiler.
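
One more thing the stripped-down code above skips: kernel launch errors are silent unless you ask for them. The usual pattern, as a sketch:

    matmul<<<grid, block>>>(da, db, dc, M, N, K);
    cudaError_t err = cudaGetLastError(); // catches launch-configuration errors
    if (err != cudaSuccess) {
        cout << "Kernel launch failed: " << cudaGetErrorString(err) << "\n";
    }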


BEATS NumPy!? -> That’s Why!

I profiled this kernel against NumPy’s matrix multiplication in a Colab notebook, and for matrices of size 8072 x 8072, it showed about a 4x speedup. That does not mean this kernel is anywhere near optimal. It just shows that even a simple “one thread per output cell” GPU version can outperform a CPU-side baseline once the matrices are large enough.

You can find that code here if you want to look through the notebook and results yourself.

*Note: I used a T4 GPU in Colab, so the results may vary based on the GPU used.

