I’ve talked before about tackling the difficult problem of performing matrix multiplication with less computational complexity (link). But there is another angle that can be pursued to speed up the calculation, and that is massive parallelism, i.e., the Graphics Processing Unit (GPU).


Both the Strassen algorithm and the naive algorithm can be run in a parallel fashion. For instance, as you know, the naive algorithm is $$c_{ij}=\sum_{k=1}^{p} a_{ik} b_{kj}$$ If the matrices being operated on have indices that range for $i$ from $0$ to $n$ and for $j$ from $0$ to $m$, then you could launch $n\times m$ threads, each computing the summation above. With the Strassen algorithm, you would split the matrix into 4, run the 7 products described in the Strassen post in separate threads, and each time you come across a multiplication of sub-matrices you would launch 7 more threads, recursively, until it is more convenient to use the naive algorithm.

Now, for small matrices, running these calculations on the CPU is faster, since the latency of having the CPU talk to the GPU, the GPU do the calculation, and the GPU talk back to the CPU is greater than the cost of just running the calculation on the CPU. Not to mention that, with fewer threads of work, a processor with fewer but faster cores finishes sooner. In this post I won’t discuss at which point it becomes more convenient to use the GPU over the CPU, but rather present a small, understandable implementation. In the example I use CUDA and, fortunately, the folks at NVIDIA have implemented a CUDA C version of the BLAS library, which is one of the fastest linear algebra libraries available, using some of the latest algorithms.

In my humble opinion, the CUDA API is ugly, so in my implementation I aim to abstract it away.


If we take a look at the CUDA documentation for cuBLAS, the CUDA C implementation of BLAS, (link), we can see that the function that performs single-precision matrix multiplication, cublasSgemm, takes 14 arguments and computes $C=\alpha\,op(A)\,op(B)+\beta C$

  1. cublasHandle_t, a handle created ahead of time that holds the cuBLAS library context.
  2. cublasOperation_t, an enum telling the function whether to take the transpose of $A$.
  3. cublasOperation_t, an enum telling the function whether to take the transpose of $B$.
  4. $m$, an integer with the number of rows of $op(A)$ and of $C$.
  5. $n$, an integer with the number of columns of $op(B)$ and of $C$.
  6. $k$, an integer with the number of columns of $op(A)$ and the number of rows of $op(B)$.
  7. a constant float pointer to $\alpha$.
  8. a constant float pointer to the first element of the array that holds the data of $A$.
  9. lda, an integer giving the leading dimension of the array storing $A$; cuBLAS assumes column-major storage, so this is the number of rows as stored in memory.
  10. a constant float pointer to the first element of the array that holds the data of $B$.
  11. ldb, an integer giving the leading dimension of the array storing $B$.
  12. a constant float pointer to $\beta$, which will be zero in our case.
  13. the pointer where the calculated data will go.
  14. ldc, an integer giving the leading dimension of the array storing $C$.
To abstract this, I think it would be best to wrap the matrix multiplication in an overload of the * operator between two instances of a custom matrix class. The class needs to allocate its data in a linear fashion, as in no pointers to pointers to data or any such thing, because CUDA doesn't like those; so something like float *elements = new float[N * M], where N and M are the dimensions of the matrix.

Anyway, when writing C or C++ code I personally like to use CMake because of its capabilities and because it is cross-platform. Let's create a project folder and a text file called CMakeLists.txt; the text file is CMake's configuration file. In it, add the following
cmake_minimum_required(VERSION 3.10)
project(CUDA_Matrix_Multiplication CXX CUDA)

set(CMAKE_CXX_STANDARD 11)

if (NOT DEFINED CMAKE_CUDA_STANDARD)
    set(CMAKE_CUDA_STANDARD 11)
    set(CMAKE_CUDA_STANDARD_REQUIRED ON)
endif ()

add_executable(CUDA_Matrix_Multiplication cumatrix.cu main.cpp)

set_target_properties(CUDA_Matrix_Multiplication PROPERTIES
        POSITION_INDEPENDENT_CODE ON
        CUDA_SEPARABLE_COMPILATION ON)

target_link_libraries(CUDA_Matrix_Multiplication -lcublas)

On the first line, we are telling CMake that the minimum version of CMake that can run this configuration file is 3.10. I chose that because the ability to add CUDA as a language was added in version 3.8 (3.9 for Windows). The second line tells CMake the name of the project, CUDA_Matrix_Multiplication, and the languages the project is going to use, in this case C++ and CUDA, so CXX & CUDA. The fourth line sets the standard to C++11, which is an extension of C++98 that adds a lot of new features. NVIDIA brought many C++11 features to their compiler, nvcc, which is the reason for lines 6-9. In line 11 we tell CMake what the source files are, and in lines 13-15 we give the CUDA compiler flags to generate position-independent code, which allows for cross-platform projects, and to use separable compilation, which delays the linking of CUDA code as long as possible (link). The last line links the cuBLAS library. Now, to get to the substance.
Let's start by creating a header for our cumatrix class and name it cumatrix.h. Most of the code I wrote here is a modified version of the STL's std::array source code, which you can access at this link. In that header file, let's add include guards to ensure it doesn't get included twice, and place our struct in it.

// cumatrix.h
#ifndef CUMATRIX_H
#define CUMATRIX_H

#include <iostream>

struct cumatrix {
  // members and functions go here
};

#endif // CUMATRIX_H

In this struct, let's add a couple of typedefs to simplify our code and make it more readable.

struct cumatrix {
  // types:
  typedef float value_type;
  typedef value_type &reference;
  typedef const value_type &const_reference;
  typedef int size_type;
  typedef value_type *pointer;

Here we are using float instead of double because GPUs typically process single precision quicker than double precision, but this struct could be generalized with templates and the help of some of the CUDA sample code for cuBLAS. Anyway, add two float pointers: one will contain the address of the data on the host, in regular memory, and the second will contain the address in device memory, the GPU's memory. We also need to know whether we have passed the data to the graphics card, so a boolean will be useful. Finally, two integers containing the number of rows and the number of columns.

float *elemns;
float *d_elemns;
bool in_device = false;
int N, M;

Let's create the constructor and destructor. The constructor needs to know the number of rows and the number of columns. Later on we will have a function that releases the data from device memory; I called it release_device_data, and the destructor calls it.

cumatrix(int n, int m) {
  N = n; M = m;
  elemns = new float[N * M];
}
~cumatrix() {
  if (in_device) release_device_data();
  delete [] elemns;
}

While the N and M integers are directly accessible, since we are building a struct, it is still nice to have dedicated functions for the number of rows, columns, and size. We can also let the compiler know that these functions can be evaluated at compile time by using the constexpr specifier.

// Capacity
constexpr size_type size() const noexcept { return (N * M); }
constexpr size_type rows() const noexcept { return N; }
constexpr size_type cols() const noexcept { return M; }

We need to be able to access the elemns elements through the cumatrix struct by overloading the [] operator.

// element access:
reference operator[](size_type n) { return elemns[n]; }
const_reference operator[](size_type n) const { return elemns[n]; }

Now we need to be able to access the array not as an array but as a matrix, with a function that takes the row and column numbers. Since the data is stored row-major, element $(x, y)$ lives at flat index $x \cdot M + y$.

reference operator()(size_type x, size_type y) { return elemns[x * M + y]; }
const_reference operator()(size_type x, size_type y) const {  return elemns[x * M + y]; }

Let's add the ability to get the underlying pointer.

value_type *data() noexcept { return elemns; }

For the functions that deal with the CUDA API, as shown before, we need a function that releases the device data. We also need a function that allocates memory on the device, moves the host data there, and returns the pointer to that device data. Also, we need a function that syncs the data from host to device and another from device to host.

pointer get_device_pointer(bool copy = true);
void refresh_from_device();
void refresh_to_device();
void release_device_data();

To finish off the struct, we need a way to fill the matrix with random float data.

  void fill_rand();

To check that our multiplication was successful, we can overload the << operator so that we can print our matrix to std::cout (in the same header).

std::ostream& operator<<(std::ostream& out, const cumatrix& mat);

Finally, the operator overload for matrix multiplication

cumatrix operator*(cumatrix& a, cumatrix& b);

Let's now create a source file for both operators, the fill function, and the device memory functions. To let CMake know to pass this through the CUDA compiler, nvcc, the source file needs to have the file extension .cu. So, our source file will be called cumatrix.cu.
In the source file, let's include some useful headers along with our own.

// cumatrix.cu
#include <cuda_runtime.h>
#include <cublas_v2.h>
#include <cstdio>
#include <iostream>
#include <random>
#include <stdexcept>

#include "cumatrix.h"

It's pretty simple to implement our << operator to display our matrices in an orderly fashion.

std::ostream& operator << (std::ostream& out, const cumatrix& mat) {
  for (int i = 0; i < mat.rows(); i++) {
    for (int j = 0; j < mat.cols(); j++) {
      out << " " << mat(i, j);
    }
    out << std::endl;
  }
  return out;
}

We need a function that will not only check whether there was an error running CUDA code but also report where in the source the error occurred. This can be done with compiler macros. This is from link

#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort = true) {
  if (code != cudaSuccess) {
    fprintf(stderr, "Error: %s %s %d\n", cudaGetErrorString(code), file, line);
    if (abort) throw std::runtime_error(cudaGetErrorString(code));
  }
}

Let's make the fill_rand function that will fill our matrix with random numbers. We can take advantage of C++11 features like std::uniform_real_distribution.

void cumatrix::fill_rand() {
  std::random_device rd;
  std::mt19937 gen(rd());
  std::uniform_real_distribution<float> dis(-1.0f, 1.0f);
  for (int i = 0; i < size(); i++)
    (*this)[i] = dis(gen);
}

The functions that allocate device memory, and transfer from host to device when bool copy is true, are cudaMalloc and cudaMemcpy. Here is the documentation on them (link).

float* cumatrix::get_device_pointer(bool copy) {
  if (!in_device) {
    gpuErrchk(cudaMalloc((void **) &d_elemns, N * M * sizeof(value_type)));
    if (copy) gpuErrchk(cudaMemcpy(d_elemns, elemns, N * M * sizeof(value_type), cudaMemcpyHostToDevice));
    in_device = true;
  }
  return d_elemns;
}

For the functions that sync the data between host and device, only the direction argument of cudaMemcpy differs, and the source and destination pointers are swapped.

void cumatrix::refresh_from_device() {
  if (in_device) gpuErrchk(cudaMemcpy(elemns, d_elemns, N * M * sizeof(value_type), cudaMemcpyDeviceToHost));
}

void cumatrix::refresh_to_device() {
  if (in_device) gpuErrchk(cudaMemcpy(d_elemns, elemns, N * M * sizeof(value_type), cudaMemcpyHostToDevice));
}

To release the device data, use cudaFree.

void cumatrix::release_device_data() {
  if (in_device) {
    gpuErrchk(cudaFree(d_elemns));
    in_device = false;
  }
}

Finally, for the * operator, where the magic happens

cumatrix operator*(cumatrix& a, cumatrix& b) {
  // the body is assembled step by step below
}

Let's create an output cumatrix that will have the number of rows a has and the number of columns b has, so

cumatrix output(a.rows(), b.cols());

Then obtain the device pointers; in the background this will also allocate and transfer to the device.

const float* d_a = a.get_device_pointer();
const float* d_b = b.get_device_pointer();

For the output, we will only allocate and not transfer so,

float* d_c = output.get_device_pointer(false);

Let's put the number of rows of a in the variable N, the number of columns of b in M, and the number of columns of a in K. Because cuBLAS assumes column-major storage while our data is row-major, we will ask cuBLAS for $C^\top = B^\top A^\top$ by passing $B$ first and $A$ second; the leading dimensions are then $K$ for $A$, $M$ for $B$, and $M$ for $C$.

int N = a.rows(), M = b.cols(), K = a.cols();
int lda = K, ldb = M, ldc = M;

For alpha and beta

const float alpha = 1;
const float beta = 0;

Create a cuBLAS handle with cublasHandle_t and initialize it by passing its address to the cublasCreate function.

cublasHandle_t handle;
cublasCreate(&handle);

For the entrée, the call to cublasSgemm. Note that b is passed before a: since our row-major arrays look like transposes to column-major cuBLAS, this computes $C^\top = B^\top A^\top$, which read back row-major is $AB$.

cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, M, N, K, &alpha, d_b, ldb, d_a, lda, &beta, d_c, ldc);

To clean up, we destroy the handle, transfer the calculated data back to host memory, and return the output.

cublasDestroy(handle);
output.refresh_from_device();
return output;

To test out our matrix struct, create a main.cpp with two arbitrarily large cumatrices, fill them with random numbers, multiply them with *, and output the resulting cumatrix.

#include <iostream>
#include "cumatrix.h"

int main() {
  int size = 1000;
  cumatrix a(size, size), b(size, size);
  a.fill_rand();
  b.fill_rand();

  cumatrix c = a * b;

  std::cout << c;

  return 0;
}
When run, you should see a wall of numbers in the terminal.