Introduction

Matrix Multiplication is a staple in mathematics. Since most images are made out of multiple matrices, 3 to be exact, matrix multiplication is a great tool for manipulating these images. For example, by using Singular-value decomposition (SVD), which is a factorization of a matrix, it is possible to reduce the noise stored in images. And, SVD is possible with matrix multiplication but, that is a story for a future post. Matrix Multiplication is used in neural networks, robotics, image processing, and etc. Of course, for faster image processing, faster training of neural networks, and less delay of robotics, quickening the processes of matrix multiplication is important.
For a really long time it was thought that in terms of computational complexity the naive algorithm for the multiplication of matrices was the optimal one, with a computational complexity of $O(n^3)$. In 1969, Volker Strassen proved that wrong with a paper titled "Gaussian elimination is not optimal". Of course, there have been faster algorithms developed since then, like the fastest yet algorithm developed by Coppersmith and Winograd and generalized by Francois Le Gall. Here is a link to his paper.

Algorithm

The strassen is as follows. Where $C=AB$ and $A,B,C\in R^{2^{n-1} \times 2^{n-1}}$, we divide each matrix into 4 equal size matrices at divided starting from the corners.
$$A=\begin{bmatrix}
A_{1,1} & A_{1,2} \\
A_{2,1} & A_{2,2}
\end{bmatrix},B=\begin{bmatrix}
B_{1,1} & B_{1,2} \\
B_{2,1} & B_{2,2}
\end{bmatrix},C=\begin{bmatrix}
C_{1,1} & C_{1,2} \\
C_{2,1} & C_{2,2}
\end{bmatrix}$$
We then define new matrices where these matrices are added and multiplied together recursively until only scalers remain.
\begin{align}
I&:=(A_{1,1}+A_{2,2})(B_{1,1}+B_{2,2}) \\
II&:=(A_{2,1}+A_{2,2})B_{1,1} \\
III&:=A_{1,1}(B_{1,2}-B_{2,2}) \\
IV&:=A_{2,2}(B_{2,1}-B_{1,1}) \\
V&:=(A_{1,1}+A_{1,2})B_{2,2} \\
VI&:=(A_{2,1}-A_{1,1})(B_{1,1}+B_{1,2}) \\
VII&:=(A_{1,2}-A_{2,2})(B_{2,1}+B_{2,2})
\end{align}
Then we add these as follows
\begin{align}
C_{1,1}&=I+IV-V+VII \\
C_{1,2}&=III+V \\
C_{2,1}&=II+IV \\
C_{2,2}&=I-II+III+VI
\end{align}
Doing this results in 7 multiplication operations while with the naive way of matrix multiplication, $c_{ij}=\sum_{k=1}^{m}a_{ik}b_{kj}$, 8 multiplication operations are required. But, the strassen algorithm requires more additions and subtractions then the naive way, meaning, for small matrices the naive way of matrix multiplication is faster and on very large matrices the strassen algorithm is faster due to the fact that multiplication operations are more computationally complex then addition or subtraction operations. It would be optimal to use the strassen algorithm to break down a large matrix into matrices small to the point where regular matrix multiplication is faster. Refer to this link for more information on the complexity of operations.

Examples

Let's say we have two matrices $A$ and $B$ whose multiplication results in $AB=C$ and they are defined by $A,B,C\in R^{2\times 2}$. Let's give the two matrices $A$ and $B$ random real elements.
$$A=\begin{bmatrix}
0.5 & 0.75 \\
1.5 & 1.75\end{bmatrix}
,B=\begin{bmatrix}
0.8 & 0.12 \\
1.7 & .85
\end{bmatrix}$$
If we follow the strassen algorithm
\begin{align}
I&:=3.7125 \\
II&:=2.6 \\
III&:=-0.3650 \\
IV&:=1.575 \\
V&:=1.0625 \\
VI&:=0.92 \\
VII&:=-2.55
\end{align}
And, finally
\begin{align}
c_{11}&=I+IV-V+VII=1.675 \\
c_{12}&=III+V=0.6225 \\
c_{21}&=II+IV=4.175 \\
c_{22}&=I-II+III+VI=1.6675
\end{align}
For matrices larger then $R^{2 \times 2}$, we add rows and columns of zeros until the number of rows and columns is the same and divisible by 2. After which the matrix is divided into 4 sub-matrices for which the same operations are performed. For example, if have the matrices, $D,E,F\in R^{4\times 4}$, where $DE=F$. We will split the matrix into four quadrants with which we do the operations above recursively. Let's give these matrices random real numbers:
\begin{align}
D&=\left[\begin{array}{cc|cc}
4 & 10 & 15 & 11 \\
4 & 10 & 7 & 18 \\
\hline
3 & 14 & 12 & 11 \\
5 & 17 & 19 & 17
\end{array}\right] \\
E&=\left[\begin{array}{cc|cc}
3 & 11 & 3 & 8 \\
9 & 15 & 9 & 17 \\
\hline
11 & 19 & 12 & 16 \\
18 & 15 & 9 & 17
\end{array}\right]
\end{align}
So,
\begin{align}
D_{1,1}&=\begin{bmatrix}
4 & 10 \\
4 & 10
\end{bmatrix} \\
D_{1,2}&=\begin{bmatrix}
15 & 11 \\
7 & 8
\end{bmatrix} \\
D_{2,1}&=\begin{bmatrix}
3 & 14 \\
5 & 17
\end{bmatrix} \\
D_{2,2}&=\begin{bmatrix}
12 & 11 \\
19 & 17
\end{bmatrix} \\
E_{1,1}&=\begin{bmatrix}
3 & 11 \\
9 & 15
\end{bmatrix} \\
E_{1,2}&=\begin{bmatrix}
3 & 8 \\
9 & 17
\end{bmatrix} \\
E_{2,1}&=\begin{bmatrix}
11 & 19 \\
18 & 15
\end{bmatrix} \\
E_{2,2}&=\begin{bmatrix}
12 & 16 \\
9 & 17
\end{bmatrix}
\end{align}
And, using the algorithm while also doing the additions and subtractions for each temporary matrix
\begin{align}
I&:=\begin{bmatrix}
16 & 21 \\
23 & 27
\end{bmatrix}\begin{bmatrix}
15 & 27 \\
18 & 32
\end{bmatrix} \\
II&:=\begin{bmatrix}
15 & 25 \\
24 & 34
\end{bmatrix}\begin{bmatrix}
3 & 11 \\
9 & 15
\end{bmatrix} \\
III&:=\begin{bmatrix}
3 & 11 \\
9 & 15
\end{bmatrix}\begin{bmatrix}
-9 & -8 \\
0 & 0
\end{bmatrix} \\
IV&:=\begin{bmatrix}
12 & 11 \\
19 & 17
\end{bmatrix}\begin{bmatrix}
8 & 8 \\
9 & 0
\end{bmatrix} \\
V&:=\begin{bmatrix}
19 & 21 \\
11 & 18
\end{bmatrix}\begin{bmatrix}
12 & 16 \\
9 & 17
\end{bmatrix} \\
VI&:=\begin{bmatrix}
-1 & 4 \\
1 & 7
\end{bmatrix}\begin{bmatrix}
6 & 19 \\
18 & 32
\end{bmatrix} \\
VII&:=\begin{bmatrix}
3 & 0 \\
-12 & 1
\end{bmatrix}\begin{bmatrix}
23 & 35 \\
27 & 32
\end{bmatrix}
\end{align}
Now, we apply the strassen algorithm, again, as shown above, for each time we do matrix multiplication
\begin{align}
I&:=\begin{bmatrix}
618 & 1104 \\
831 & 1485
\end{bmatrix} \\
II&:=\begin{bmatrix}
270 & 540 \\
378 & 774
\end{bmatrix} \\
III&:=\begin{bmatrix}
-36 & -32 \\
-36 & -32
\end{bmatrix} \\
IV&:=\begin{bmatrix}
195 & 96 \\
305 & 152
\end{bmatrix} \\
V&:=\begin{bmatrix}
417 & 661 \\
384 & 652
\end{bmatrix} \\
VI&:=\begin{bmatrix}
66 & 109 \\
132 & 243
\end{bmatrix} \\
VII&:=\begin{bmatrix}
69 & 109 \\
132 & 243
\end{bmatrix}
\end{align}
Now, we just add and subtract the matrices to get the four quadrants for the $F$ matrix
\begin{align}
F_{1,1}&=I+IV-V+VII=\begin{bmatrix}
465 & 644 \\
503 & 597
\end{bmatrix} \\
F_{1,2}&=III+V=\begin{bmatrix}
381 & 629 \\
348 & 620
\end{bmatrix} \\
F_{2,1}&=II+IV=\begin{bmatrix}
465 & 636 \\
683 & 926
\end{bmatrix}\\
F_{2,2}&=I-II+III+VI=\begin{bmatrix}
378 & 641 \\
549 & 922
\end{bmatrix}
\end{align}
We put the four quadrants together
$$F=\begin{bmatrix}
465 & 644 & 381 & 629 \\
503 & 597 & 348 & 620 \\
465 & 636 & 378 & 641 \\
683 & 926 & 549 & 922
\end{bmatrix}$$
For matrices that are not cube shaped or whose sides are not divisible by 2 we add rows and columns of zeros until the matrix complies with the requirement.

Experiment

I wanted to make sure that this algorithm actually worked in practice, so I fired up MATLAB. I quickly implemented the naive algorithm and called the function matrix_multiplication, then, I not so quickly setup the strassen algorithm in a function called strassen_multiplication not taking into account matrices whose sides are not square and not completely divisible by 2. Lastly, I went ahead and used the built in function provided by MATLAB. When implementing the strassen algorithm, I found that only using the strassen algorithm while recursing was wasting such a large amount of resources that it was slower then the naive algorithm. A fix was to, when the side size reach $48$, use the naive algorithm, thereby creating a faster hybrid algorithm.
We need two large matrices whose dimensions are square and completely divisible by 2 and whose elements are random. By completely divisible by 2 I mean that you can divide that number over and over by 2 until only two 2s remain, to come up with such a number we would just raise 2 to any power. I chose side dimensions that would not take too long to compute but still shows the difference in computation time. I chose $A,B,C\in R^{2^{11}\times 2^{11}}\rightarrow A,B,C\in R^{2048\times 2048}$. In MATLAB, to do that you would use the rand function (let's write a script named strassen.m)

% strassen.m
n = 2048;
a = rand(n, n);
b = rand(n, n);

We then run the functions and use the tic toc scheme to time the code

fprintf("NAIVE MATMUL\n");
tic;
c1 = matrix_multiplication(a, b);
toc;

Implementing the naive algorithm is pretty straight forward

% matrix_multiplication.m
function f = matrix_multiplication(a, b)
  [n1, m1] = size(a);
  [n2, m2] = size(b);
  f = zeros(n1,m2);
  for i = 1:n1
    for j = 1:m2
      for k = 1:m1
        f(i,j) = f(i,j) + a(i,k) * b(k,j);
      end
    end
  end
end

We add the code to run the strassen algorithm to the main script

fprintf("STRASSEN MATMUL\n");
tic;
c2 = strassen_multiplication(a, b);
toc;

As said before, implementing the strassen algorithm is not so straight forward. Let's first create the function, get the size of the matrices, and fill the output matrix with zeros

% strassen_multiplication.m
function f = strassen_multiplication(a, b)
  [n,~] = size(a);
  f = zeros(n, n);
  % TODO
end

We then check if the size of the input matrix is less then $48$ and switch over to the naive implementation.

if n <= 48
  f = matrix_multiplication(a, b);
  return
end

Since we cannot use float point numbers when using indexing

nhalf = int32(n/2);

We now get the four quadrants of the two matrices for ease of use

a11 = a(1:nhalf,1:nhhalf);
a12 = a(1:nhalf,nhalf+1:n);
a21 = a(nhalf+1:n,1:nhalf);
a22 = a(nhalf+1:n,nhalf+1:n);

b11 = b(1:nhalf,1:nhalf);
b12 = b(1:nhalf,nhalf+1:n);
b21 = b(nhalf+1:n,1:nhalf);
b22 = b(nhalf+1:n,nhalf+1:n);

Calculate the temporary matrices indicated in the strassen algorithm

m1 = strassen_multiplication(a11 + a22, b11 + b22);
m2 = strassen_multiplication(a21 + a22, b11);
m3 = strassen_multiplication(a11, b12 - b22);
m4 = strassen_multiplication(a22, b21 - b11);
m5 = strassen_multiplication(a11 + a12, b22);
m6 = strassen_multiplication(a21 - a11, b11 + b12);
m7 = strassen_multiplication(a12 - a22, b21 + b22);

Finally, use the temporary matrices to get the output matrix and end the function

  f(1:nhalf,1:nhalf) = m1 + m4 - m5 + m7;
  f(1:nhalf,nhalf+1:n) = m3 + m5;
  f(nhalf+1:n,1:nhalf) = m2 + m4;
  f(nhalf+1:n,nhalf+1:n) = m1 - m2 + m3 + m6;
end

In the main script we now will use the function provided by MATLAB for matrix multiplication

fprintf("MATLAB'S MATMUL\n");
tic;
c3 = mtimes(a,b);
toc;

To validate the function we created, and since we know that MATLAB function is pretty accurate, we can use the c3 matrix to validate. We know that the three matrices are suppose to be equal to each other so we can subtract them and we know that it will be at least close to a matrix of zeros. We then average that across the elements of the matrix to get the average deviation.

zero1 = c1 - c3;
zero2 = c2 - c3;
avg1 = 0;
avg2 = 0;
for i = 1:n
  for j = 1:n
    avg1 = avg1 + zero1(i, j);
	avg2 = avg2 + zero2(i, j);
  end
end
avg1 = avg1 / (n^2);
avg2 = avg2 / (n^2);
fprintf("Average deviation for naive %d \n", avg1);
fprintf("Average deviation for strassen %d \n", avg2);

In the source code, I also calculate the Percent Average Deviation for good measure.
The output from my machine was

>> run strassen.m
NAIVE MATMUL
Elapsed time is 276.967041 seconds.
STRASSEN MATMUL
Elapsed time is 32.829293 seconds.
MATLAB's MATMUL
Elapsed time is 0.115984 seconds.
Average deviation for naive 4.605590e-13 
Average deviation for strassen 1.172674e-12 
Percent Average Deviation 9.005056e-14 % 
Percent Average Deviation 2.292865e-13 %

The fact that MATLAB's matrix multiplication was under a second is a testament to the developers of LAPACK and BLAS and the low overhead that the Fortran programming language provides as MATLAB uses the LAPACK and BLAS Fortran libraries to do these calculation.
As shown above, the Strassen hybrid algorithm is quite a bit faster then the Naive algorithm.