|
46 | 46 | None, |
47 | 47 | 'reminder-on-books-with-hands-on-material-and-codes'), |
48 | 48 | ('Reading recommendations', 2, None, 'reading-recommendations'), |
| 49 | + ('From last week: Overarching view of a neural network', |
| 50 | + 2, |
| 51 | + None, |
| 52 | + 'from-last-week-overarching-view-of-a-neural-network'), |
| 53 | + ('The optimization problem', 2, None, 'the-optimization-problem'), |
| 54 | + ('Parameters of neural networks', |
| 55 | + 2, |
| 56 | + None, |
| 57 | + 'parameters-of-neural-networks'), |
| 58 | + ('Other ingredients of a neural network', |
| 59 | + 2, |
| 60 | + None, |
| 61 | + 'other-ingredients-of-a-neural-network'), |
| 62 | + ('Other parameters', 2, None, 'other-parameters'), |
49 | 63 | ('From last week, overarching discussions of neural networks: ' |
50 | 64 | 'Fine-tuning neural network hyperparameters', |
51 | 65 | 2, |
|
249 | 263 | <!-- navigation toc: --> <li><a href="#mathematics-of-deep-learning" style="font-size: 80%;"><b>Mathematics of deep learning</b></a></li> |
250 | 264 | <!-- navigation toc: --> <li><a href="#reminder-on-books-with-hands-on-material-and-codes" style="font-size: 80%;"><b>Reminder on books with hands-on material and codes</b></a></li> |
251 | 265 | <!-- navigation toc: --> <li><a href="#reading-recommendations" style="font-size: 80%;"><b>Reading recommendations</b></a></li> |
| 266 | + <!-- navigation toc: --> <li><a href="#from-last-week-overarching-view-of-a-neural-network" style="font-size: 80%;"><b>From last week: Overarching view of a neural network</b></a></li> |
| 267 | + <!-- navigation toc: --> <li><a href="#the-optimization-problem" style="font-size: 80%;"><b>The optimization problem</b></a></li> |
| 268 | + <!-- navigation toc: --> <li><a href="#parameters-of-neural-networks" style="font-size: 80%;"><b>Parameters of neural networks</b></a></li> |
| 269 | + <!-- navigation toc: --> <li><a href="#other-ingredients-of-a-neural-network" style="font-size: 80%;"><b>Other ingredients of a neural network</b></a></li> |
| 270 | + <!-- navigation toc: --> <li><a href="#other-parameters" style="font-size: 80%;"><b>Other parameters</b></a></li> |
252 | 271 | <!-- navigation toc: --> <li><a href="#from-last-week-overarching-discussions-of-neural-networks-fine-tuning-neural-network-hyperparameters" style="font-size: 80%;"><b>From last week, overarching discussions of neural networks: Fine-tuning neural network hyperparameters</b></a></li> |
253 | 272 | <!-- navigation toc: --> <li><a href="#hidden-layers" style="font-size: 80%;"><b>Hidden layers</b></a></li> |
254 | 273 | <!-- navigation toc: --> <li><a href="#which-activation-function-should-i-use" style="font-size: 80%;"><b>Which activation function should I use?</b></a></li> |
@@ -401,6 +420,84 @@ <h2 id="reading-recommendations" class="anchor">Reading recommendations </h2> |
401 | 420 | <li> Raschka et al., chapters 11-13 for NNs and chapter 14 for CNNs, jupyter-notebook sent separately, from <a href="https://github.com/rasbt/machine-learning-book" target="_self">GitHub</a></li> |
402 | 421 | <li> Goodfellow et al., chapters 6 and 7 contain most of the neural network background. For CNNs see chapter 9.</li> |
403 | 422 | </ol> |
| 423 | +<!-- !split --> |
| 424 | +<h2 id="from-last-week-overarching-view-of-a-neural-network" class="anchor">From last week: Overarching view of a neural network </h2> |
| 425 | + |
| 426 | +<p>The architecture of a neural network defines our model. This model |
| 427 | +aims at approximating some function \( f(\boldsymbol{x}) \) that describes |
| 428 | +some final result (outputs or target values \( \boldsymbol{y} \)) given a specific input |
| 429 | +\( \boldsymbol{x} \). Note that here \( \boldsymbol{y} \) and \( \boldsymbol{x} \) are not limited to being |
| 430 | +vectors. |
| 431 | +</p> |
| 432 | + |
| 433 | +<p>The architecture consists of</p> |
| 434 | +<ol> |
| 435 | +<li> An input and an output layer where the input layer is defined by the inputs \( \boldsymbol{x} \). The output layer produces the model output \( \boldsymbol{\tilde{y}} \), which is compared with the target value \( \boldsymbol{y} \)</li> |
| 436 | +<li> A given number of hidden layers and neurons/nodes/units for each layer (this may vary)</li> |
| 437 | +<li> A given activation function \( \sigma(\boldsymbol{z}) \) with arguments \( \boldsymbol{z} \) to be defined below. The activation functions may differ from layer to layer.</li> |
| 438 | +<li> The last layer, normally called the <b>output</b> layer, has an activation function tailored to the specific problem</li> |
| 439 | +<li> Finally, we define a so-called cost or loss function which is used to gauge the quality of our model.</li> |
| 440 | +</ol> |
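<p>As a minimal sketch of these ingredients, assuming an arbitrary toy architecture (two inputs, one hidden layer with eight nodes, a sigmoid activation and an identity output activation, all chosen only for illustration), the model could be set up in NumPy as follows.</p>
<pre><code>
import numpy as np

# assumed toy architecture: 2 inputs, one hidden layer with 8 nodes, 1 output
sizes = [2, 8, 1]

def sigmoid(z):
    # activation function sigma(z), applied elementwise
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
# one weight matrix and one bias vector per layer beyond the input
weights = [rng.normal(size=(m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

def forward(x):
    # propagate the input x through all layers; the output layer uses the
    # identity as activation, a common choice for regression problems
    a = x
    for l, (W, b) in enumerate(zip(weights, biases)):
        z = a @ W + b
        a = z if l == len(weights) - 1 else sigmoid(z)
    return a

x = np.array([0.5, -1.2])
print(forward(x))   # the model output, to be compared with the target y
</code></pre>
<p>The cost function discussed next measures how far this model output is from the target values.</p>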
| 441 | +<!-- !split --> |
| 442 | +<h2 id="the-optimization-problem" class="anchor">The optimization problem </h2> |
| 443 | + |
| 444 | +<p>The cost function is a function of the unknown parameters |
| 445 | +\( \boldsymbol{\Theta} \) where the latter is a container for all possible |
| 446 | +parameters needed to define a neural network. |
| 447 | +</p> |
| 448 | + |
| 449 | +<p>If we are dealing with a regression task a typical cost/loss function |
| 450 | +is the mean squared error |
| 451 | +</p> |
| 452 | +$$ |
| 453 | +C(\boldsymbol{\Theta})=\frac{1}{n}\left\{\left(\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\Theta}\right)^T\left(\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\Theta}\right)\right\}. |
| 454 | +$$ |
| 455 | + |
| 456 | +<p>This function represents one of many possible ways to define |
| 457 | +the so-called cost function. Note that here we have assumed a linear dependence in terms of the parameters \( \boldsymbol{\Theta} \). This is in general not the case. |
| 458 | +</p> |
| 459 | + |
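<p>As a small sketch of this cost function, with a hypothetical design matrix \( \boldsymbol{X} \) and parameter vector chosen only to exercise the formula:</p>
<pre><code>
import numpy as np

def mse_cost(theta, X, y):
    # C(theta) = (1/n) (y - X theta)^T (y - X theta)
    residual = y - X @ theta
    return (residual @ residual) / len(y)

# hypothetical data, used only to illustrate the bookkeeping
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
theta = np.array([1.0, -2.0, 0.5])
y = X @ theta + 0.1 * rng.normal(size=100)
print(mse_cost(theta, X, y))
</code></pre>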
| 460 | +<!-- !split --> |
| 461 | +<h2 id="parameters-of-neural-networks" class="anchor">Parameters of neural networks </h2> |
| 462 | +<p>For neural networks the parameters |
| 463 | +\( \boldsymbol{\Theta} \) are given by the so-called weights and biases (to be |
| 464 | +defined below). |
| 465 | +</p> |
| 466 | + |
| 467 | +<p>The weights are given by matrix elements \( w_{ij}^{(l)} \) where the |
| 468 | +superscript indicates the layer number. The biases are typically given |
| 469 | +by vector elements representing each single node of a given layer, |
| 470 | +that is \( b_j^{(l)} \). |
| 471 | +</p> |
| 472 | + |
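<p>To make the indexing concrete, here is a sketch with arbitrarily chosen layer sizes, storing one weight matrix \( w_{ij}^{(l)} \) and one bias vector \( b_j^{(l)} \) per layer and counting the total number of parameters collected in \( \boldsymbol{\Theta} \):</p>
<pre><code>
import numpy as np

# hypothetical layer sizes: input, two hidden layers, output
sizes = [3, 16, 16, 1]

rng = np.random.default_rng(0)
# one weight matrix w^(l) and one bias vector b^(l) per layer l = 1, ..., L
weights = [rng.normal(size=(m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

# Theta collects all weights and biases; count the total number of parameters
n_params = sum(W.size for W in weights) + sum(b.size for b in biases)
print(n_params)   # 3*16 + 16 + 16*16 + 16 + 16*1 + 1 = 353
</code></pre>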
| 473 | +<!-- !split --> |
| 474 | +<h2 id="other-ingredients-of-a-neural-network" class="anchor">Other ingredients of a neural network </h2> |
| 475 | + |
| 476 | +<p>Having defined the architecture of a neural network, the optimization |
| 477 | +of the cost function with respect to the parameters \( \boldsymbol{\Theta} \) |
| 478 | +involves the calculation of gradients. The |
| 479 | +gradients are the derivatives of a multidimensional function, and |
| 480 | +the optimization is typically carried out with various gradient methods, including |
| 481 | +</p> |
| 482 | +<ol> |
| 483 | +<li> various quasi-Newton methods,</li> |
| 484 | +<li> plain gradient descent (GD) with a constant learning rate \( \eta \),</li> |
| 485 | +<li> GD with momentum and other schemes that adapt the learning rate, such as</li> |
| 486 | +<ul> |
| 487 | + <li> Adaptive gradient (ADAgrad)</li> |
| 488 | + <li> Root mean-square propagation (RMSprop)</li> |
| 489 | + <li> Adaptive moment estimation (ADAM) and many others</li> |
| 490 | +</ul> |
| 491 | +<li> Stochastic gradient descent and various families of learning rate schedules</li> |
| 492 | +</ol> |
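<p>A sketch of the two simplest choices above, plain gradient descent with a constant learning rate \( \eta \) and its momentum variant, applied to the mean squared error cost of the linear example (the data, learning rate and number of iterations are arbitrary illustrations):</p>
<pre><code>
import numpy as np

# hypothetical regression data for the linear MSE cost
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3))
theta_true = np.array([2.0, -1.0, 0.5])
y = X @ theta_true + 0.1 * rng.normal(size=200)

def gradient(theta, X, y):
    # gradient of C(theta) = (1/n) (y - X theta)^T (y - X theta)
    return -2.0 / len(y) * X.T @ (y - X @ theta)

eta = 0.1               # constant learning rate
gamma = 0.9             # momentum parameter
theta = np.zeros(3)
velocity = np.zeros(3)

for _ in range(500):
    grad = gradient(theta, X, y)
    velocity = gamma * velocity + eta * grad   # set gamma = 0 for plain GD
    theta = theta - velocity

print(theta)   # approaches theta_true for this toy problem
</code></pre>
<p>ADAgrad, RMSprop and ADAM refine this scheme by adapting the learning rate separately for each parameter.</p>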
| 493 | +<!-- !split --> |
| 494 | +<h2 id="other-parameters" class="anchor">Other parameters </h2> |
| 495 | + |
| 496 | +<p>In addition to the above, there are often further hyperparameters |
| 497 | +which are included in the setup of a neural network. These will be |
| 498 | +discussed below. |
| 499 | +</p> |
| 500 | + |
404 | 501 | <!-- !split --> |
405 | 502 | <h2 id="from-last-week-overarching-discussions-of-neural-networks-fine-tuning-neural-network-hyperparameters" class="anchor">From last week, overarching discussions of neural networks: Fine-tuning neural network hyperparameters </h2> |
406 | 503 |
|
|