Both the Stochastic Gradient Descent and Backpropagation are important concepts to understand when studying Neural Networks. Stochastic Gradient Descent optimizes the weights and Backpropagation finds the gradient of the error function quickly.

# Error Functions

There are two functions most in use with respect to Neural Networks, one is the L1 error function and the other is the L2 error function.

## L1 Error Function

The L1 error function is simply the absolute value of the difference of the actual and the predicted. So, $$L1=\sum_{i=1}^n|y_{\text{true}} - y_{\text{predicted}}|$$

## L2 Error Function

The L2 is the square of the difference of the actual and the predicted. So, $$L2=\sum_{i=1}^n (y_{\text{true}} - y_{\text{predicted}}) ^2$$

# Stochastic Gradient Descent

Gradient Descent is an algorithm that finds the minimum of a function. It uses the fact that if a function $f(x)$ is defined and differentiable then the direction of fastest descent is that of the negative gradient. So, for a function $\vec{f}(\vec{x})$ we can find the minimum with the following equation $$\vec{x}_{n+1}=\vec{x}_n-\gamma \nabla \vec{f}(\vec{x})$$ Stochastic Gradient Descent, on the other hand, is a stochastic approximation of Gradient Descent. Meaning, according to wikipedia, "samples are selected randomly (or shuffled) instead of as a single group... ". It is used mostly in Machine Learning where one is minimizing the error function of a Neural Network. Where $w$ are the weights $$w:=w-\gamma \nabla f_i(w)$$ Or, as a batch $$w:=w-\frac{\gamma}{2}\sum_{i=1}^n \nabla f_i(w)$$

# Backpropagation, Neural Networks

In the book "Neural Network Design" by Martin T. Hagan, et al., the backpropagation algorithm with the Stochastic Gradient Descent is presented and explained with greater detail then I can in this forum, link. I'll present the algorithm as shown and leave the research of the proof, which is available in the link, to the reader.

Where the input is $\vec{a}^0$, we first propagate forward through the network $$\vec{a}^{m+1}=f^{m+1}(\bar{W}^{m+1}\vec{a}^m+\vec{b}^{m+1})|m\in \mathbb{Z}\cap [0,M-1]$$

Where $M$ is the number of layers, $\bar{W}$ the weights, and $\vec{b}$ the bias. we define the output of the network $\vec{a}=\vec{a}^M$. Let's set $\vec{t}$ as the target $\dot{\vec{F}}$ is the derivative of the activation function, $\vec{n}$ is the output of the network before applying the activation function. We calculate the sensitivities $\vec{s}$ \begin{align}

s^M&=-2\dot{\bar{F}}^M(\vec{n}^M)(\vec{t}-\vec{a}) \\

s^m&=\dot{\bar{F}}^m(\vec{n}^m)(\bar{W}^{m+1})^Ts^{m+1} &m=M-1,\dots,2,1

\end{align} Then apply the sensitivities on the weights $\bar{W}$ and biases $\vec{b}$ \begin{align}

\bar{W}^m(k+1)&=\bar{W}^m(k)-\alpha \vec{s}^m (\vec{a}^{m-1})^T \\

\vec{b}^m(k+1)&=\vec{b}^m(k)-\alpha \vec{s}^m

\end{align}

One thing that was not clear in the book, is that $\dot{\bar{F}}$ refers to the diagonal matrix whose diagonal elements are $\dot{f}^m(n_1^m),\dot{f}^m(n_2^m),\dots,\dot{f}^m(n^m_{S^m})$ and calculating the sensitivities involve matrix multiplication. Since, the matrix is a diagonal one we can evaluate that simply as $$

\begin{bmatrix}\dot{f}^m(n_1^m) a_{1,1} & \dot{f}^m(n_1^m) a_{1,2} & \dots \\

\dot{f}^m(n_2^m) a_{2,1} & \dot{f}^m(n_2^m) a_{2,2} & \dots \\

\vdots & \vdots & \ddots

\end{bmatrix}$$ Where $a_{i,j}$ are the components of the matrix we are multiplying with. So, altogether \begin{align}

s_i^M&=-\dot{f}^M(n_i^M)(t_i-a_i)|i\in \mathbb{Z}\cap [1,S^M] \\

s_i^m&=\sum_{j=1}^{S^m}\dot{f}^m(n_i^m)w_{ji}^{m+1}s_j^{m+1}|i\in \mathbb{Z}\cap [1,S^m], m=M-1,\dots,2,1

\end{align}