Dimensions

Dimension                               Variable
# Samples                               m
# Layers (input layer excluded)         L
# Units in Input Layer                  n^[0]
# Units in Hidden Layer l               n^[l]
# Units in Output Layer / # Classes     n^[L]

Constants

Constant                Symbol
Learning Rate           α
Regularization Factor   λ

Matrices

Notation, equation, dimensions, and layers:

Input             X  (given)                                          n^[0] × m         (global)
Output            Y  (given)                                          n^[L] × m         (global)

Feedforward:
Weight            W^[l]  (given / calculated)                         n^[l] × n^[l−1]   l = 1 … L
Bias              b^[l]  (given / calculated)                         n^[l] × 1         l = 1 … L
Input             A^[0] = X                                           n^[0] × m         l = 0
Weighted Input    Z^[l] = W^[l] A^[l−1] + b^[l]                       n^[l] × m         l = 1 … L
Activation        A^[l] = g^[l](Z^[l])                                n^[l] × m         l = 1 … L
Predicted Output  Ŷ = A^[L]                                           n^[L] × m         l = L

Backpropagation:
Loss Function     CE:  L(Ŷ, Y) = −(Y ∘ log Ŷ + (1−Y) ∘ log(1−Ŷ))     n^[L] × m         l = L
(CE or MSE)       MSE: L(Ŷ, Y) = (Y − Ŷ)^∘2
Cost Function     J = (1/m) Σ L(Ŷ, Y) + (λ/2m) Σ_l Σ (W^[l])^∘2      (scalar)          (global)
Optimization      min over W, b of J(W, b)                            (scalar)          (global)
Output Error      dZ^[L] = A^[L] − Y                                  n^[L] × m         l = L
Hidden Error      dZ^[l] = (W^[l+1])^T dZ^[l+1] ∘ g'^[l](Z^[l])      n^[l] × m         l = 1 … L−1
Weight Update     dW^[l] = (1/m) dZ^[l] (A^[l−1])^T + (λ/m) W^[l]    n^[l] × n^[l−1]   l = 1 … L
(Gradient Descent) W^[l] := W^[l] − α dW^[l]
Bias Update       db^[l] = (1/m) Σ_cols dZ^[l]                        n^[l] × 1         l = 1 … L
(Gradient Descent) b^[l] := b^[l] − α db^[l]
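The feedforward and backpropagation equations above can be sketched directly in NumPy. This is a minimal sketch under stated assumptions: the layer sizes, the random data, and the choice of a sigmoid activation with cross-entropy loss (which gives the output error dZ^[L] = A^[L] − Y) are all illustrative, and the λ regularization term is omitted for brevity.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
m, n = 8, [3, 4, 2]        # m samples; layer sizes n[0] (input) ... n[L] (output)
L = len(n) - 1             # number of layers, input layer excluded

X = rng.standard_normal((n[0], m))           # input, n[0] x m
Y = (rng.random((n[L], m)) < 0.5) * 1.0      # output, n[L] x m

# Parameters, indexed 1..L (index 0 unused, matching the layer numbering)
W = [None] + [rng.standard_normal((n[l], n[l - 1])) for l in range(1, L + 1)]
b = [None] + [np.zeros((n[l], 1)) for l in range(1, L + 1)]

# Feedforward: Z[l] = W[l] A[l-1] + b[l],  A[l] = g(Z[l])
A = [X] + [None] * L
Z = [None] * (L + 1)
for l in range(1, L + 1):
    Z[l] = W[l] @ A[l - 1] + b[l]
    A[l] = sigmoid(Z[l])
assert A[L].shape == (n[L], m)               # predicted output Y_hat

# Backpropagation
dZ = [None] * (L + 1)
dW = [None] * (L + 1)
db = [None] * (L + 1)
dZ[L] = A[L] - Y                             # output error (CE + sigmoid)
for l in range(L, 0, -1):
    dW[l] = dZ[l] @ A[l - 1].T / m           # n[l] x n[l-1], same shape as W[l]
    db[l] = dZ[l].sum(axis=1, keepdims=True) / m
    if l > 1:                                # hidden error; g'(Z) = A(1-A) for sigmoid
        dZ[l - 1] = (W[l].T @ dZ[l]) * A[l - 1] * (1 - A[l - 1])

# Gradient-descent update
alpha = 0.1                                  # learning rate (arbitrary choice)
for l in range(1, L + 1):
    W[l] -= alpha * dW[l]
    b[l] -= alpha * db[l]
```

Note how every gradient has the same dimensions as the quantity it differentiates, which makes the shape column of the table a useful debugging checklist.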

Details

For the sigmoid activation,

  g(z) = 1 / (1 + e^(−z)),   g'(z) = g(z)(1 − g(z))

and for the two loss functions (element-wise),

  CE:   ∂L/∂Ŷ = −Y / Ŷ + (1 − Y) / (1 − Ŷ)
  MSE:  ∂L/∂Ŷ = −2(Y − Ŷ)
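The sigmoid identity g'(z) = g(z)(1 − g(z)) can be verified numerically against a central finite difference; this check is an illustration, not part of the original table, and the grid of test points and step size h are arbitrary choices:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    # Analytic derivative: g'(z) = g(z) * (1 - g(z))
    s = sigmoid(z)
    return s * (1.0 - s)

z = np.linspace(-4.0, 4.0, 9)
h = 1e-6
# Central difference approximates g'(z) to O(h^2)
finite_diff = (sigmoid(z + h) - sigmoid(z - h)) / (2.0 * h)
assert np.allclose(sigmoid_prime(z), finite_diff, atol=1e-8)
```

The same finite-difference pattern works for checking any hand-derived gradient in the table.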

Chain Rule

The hidden error follows from the output error by the chain rule: for l = L−1, …, 1,

  dZ^[l] = ∂J/∂Z^[l] = (∂Z^[l+1]/∂A^[l])^T (∂J/∂Z^[l+1]) ∘ ∂A^[l]/∂Z^[l]
         = (W^[l+1])^T dZ^[l+1] ∘ g'^[l](Z^[l])

Weight / Bias Update (Gradient Descent)

Each gradient-descent step moves the parameters against their gradients:

  W^[l] := W^[l] − α dW^[l],   b^[l] := b^[l] − α db^[l]   for l = 1 … L
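The effect of this update rule is easiest to see in one dimension. A minimal sketch, using an assumed toy objective J(w) = (w − 3)^2 and an arbitrary learning rate α = 0.1:

```python
# Gradient descent on J(w) = (w - 3)^2, whose gradient is dJ/dw = 2(w - 3).
alpha = 0.1
w = 0.0
for _ in range(100):
    dw = 2.0 * (w - 3.0)     # dJ/dw at the current w
    w -= alpha * dw          # update rule: w := w - alpha * dw
assert abs(w - 3.0) < 1e-6   # converges to the minimizer w = 3
```

Each step shrinks the distance to the minimizer by the factor 1 − 2α; the matrix-form updates for W^[l] and b^[l] apply the same rule element-wise.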

Examples

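As an illustration, here is a small end-to-end run combining the equations above. The XOR data, the single hidden layer of 4 sigmoid units, the learning rate, the iteration count, and the random seed are all assumptions made for this sketch:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# XOR: n[0] = 2 inputs, m = 4 samples, n[L] = 1 output class
X = np.array([[0, 0, 1, 1],
              [0, 1, 0, 1]], dtype=float)      # 2 x 4
Y = np.array([[0, 1, 1, 0]], dtype=float)      # 1 x 4
m = X.shape[1]

rng = np.random.default_rng(1)
W1, b1 = rng.standard_normal((4, 2)), np.zeros((4, 1))
W2, b2 = rng.standard_normal((1, 4)), np.zeros((1, 1))
alpha = 1.0

def cost(Y_hat):
    # Cross-entropy cost: J = -(1/m) sum(Y log Y_hat + (1-Y) log(1-Y_hat))
    return -(Y * np.log(Y_hat) + (1 - Y) * np.log(1 - Y_hat)).sum() / m

costs = []
for _ in range(2000):
    A1 = sigmoid(W1 @ X + b1)                  # feedforward
    A2 = sigmoid(W2 @ A1 + b2)
    costs.append(cost(A2))
    dZ2 = A2 - Y                               # output error
    dW2 = dZ2 @ A1.T / m
    db2 = dZ2.sum(axis=1, keepdims=True) / m
    dZ1 = (W2.T @ dZ2) * A1 * (1 - A1)         # hidden error (chain rule)
    dW1 = dZ1 @ X.T / m
    db1 = dZ1.sum(axis=1, keepdims=True) / m
    W1 -= alpha * dW1; b1 -= alpha * db1       # gradient-descent updates
    W2 -= alpha * dW2; b2 -= alpha * db2

assert costs[-1] < costs[0]                    # the cost decreases over training
```

A single hidden layer suffices here because XOR is not linearly separable but becomes separable in the hidden layer's feature space.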

Remarks

  • M^[l−1] is the matrix of the previous layer, M^[l+1] is that of the next layer; otherwise W, b, Z, A implicitly refer to the current layer l
  • g is the activation function (e.g. sigmoid, tanh, ReLU)
  • ∘ is the element-wise (Hadamard) product
  • ^∘ is the element-wise power
  • Σ is the matrix's sum of elements
  • dM = ∂J/∂M is the matrix derivative
  • Variations:
    1. All matrices transposed, matrix multiplications in reverse order (row vectors instead of column vectors)
    2. W and b combined into one parameter matrix θ
    3. No λ (regularization) term in J
