I assume you are familiar with the following topics:

Now we are finally going to construct a complete feedforward neural network. The layers are connected by forward propagation: each layer's output becomes the next layer's input. Later in this tutorial, we will train this network with the backpropagation algorithm, which computes the gradients needed for gradient-based optimization.

Forward Propagation

Forward propagation, usually abbreviated fprop, is rather simple in a feedforward network: each layer takes the previous layer's output as its input and produces the output for the next layer.

An example implementation looks like this:

    def fprop(self, X):
        """Forward propagation

        Parameters
        ----------
        X : matrix or 4D tensor
            input samples, the size is (number of cases, in_dim)

        Returns
        -------
        out : list
            output list from each layer
        """
        out = []
        level_out = X
        for layer in self.layers:
            # Each layer consumes the previous layer's output.
            level_out = layer.apply(level_out)
            out.append(level_out)
        return out

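Here the loop iterates through all the layers, and each layer is applied to the previous layer's output. Note that a uniform API is important when you develop a deep learning library: because every layer exposes the same apply method, the loop does not need to know what kind of layer it is calling.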
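To make the role of the uniform API concrete, here is a minimal, framework-free sketch of the same loop. ScaleLayer and ShiftLayer are hypothetical toy classes (they are not part of this tutorial's code base); the only thing they share with real layers is an apply method.

```python
class ScaleLayer:
    """Hypothetical toy layer: multiplies every input value by a factor."""

    def __init__(self, factor):
        self.factor = factor

    def apply(self, X):
        return [x * self.factor for x in X]


class ShiftLayer:
    """Hypothetical toy layer: adds a constant to every input value."""

    def __init__(self, offset):
        self.offset = offset

    def apply(self, X):
        return [x + self.offset for x in X]


def fprop(layers, X):
    """Apply each layer to the previous layer's output and keep every result."""
    out = []
    level_out = X
    for layer in layers:
        level_out = layer.apply(level_out)
        out.append(level_out)
    return out


outputs = fprop([ScaleLayer(2), ShiftLayer(1)], [1, 2, 3])
print(outputs)  # [[2, 4, 6], [3, 5, 7]]
```

The last entry of the returned list is the network output; the earlier entries are the intermediate activations of each layer.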

Example of creating a network model

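layer_0 = ReLULayer(...,      # first argument missing from the original listing
                    out_dim=500)
layer_1 = ReLULayer(...,      # first argument missing from the original listing
                    out_dim=200)
layer_2 = SoftmaxLayer(...,   # first argument missing from the original listing
                       out_dim=10)
model = FeedForward(layers=[layer_0, layer_1, layer_2])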

The above code describes a 3-layer network: the first 2 layers are ReLULayer instances and the output layer is a SoftmaxLayer.

Cost Function

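In order to train a neural network using a gradient-based algorithm, two parts are necessary: a cost function and a list of parameters to be updated. The simplest form, Stochastic Gradient Descent (SGD), updates each parameter by:

\[\theta \leftarrow \theta - \alpha \frac{\partial J}{\partial \theta}\]

where \(J\) is the cost function, \(\theta\) is a parameter, and \(\alpha\) is the learning rate.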

A cost measures the difference between the actual output and the target output. Since we use a Softmax layer here, the code below implements the categorical cross entropy cost.

Categorical cross entropy measures the cross entropy between 2 probability distributions (remember that the output of a Softmax layer is a probability distribution).

import theano.tensor as T


def categorical_cross_entropy_cost(Y_hat, Y_star):
    """Categorical Cross Entropy Cost

    Parameters
    ----------
    Y_hat : tensor
        predicted output of neural network
    Y_star : tensor
        target output of neural network

    Returns
    -------
    costs : scalar
        mean categorical cross entropy over the batch
    """
    return T.nnet.categorical_crossentropy(Y_hat, Y_star).mean()
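For readers who want to see the arithmetic behind that call, the following is a plain-Python sketch of the same quantity for one-hot or soft targets. It is an illustration under the standard definition of cross entropy (the negative sum of target times log prediction, averaged over samples), not the Theano implementation; the function name and the sample values are made up for this example.

```python
import math


def categorical_cross_entropy(y_hat, y_star):
    """Mean cross entropy between predicted and target distributions.

    y_hat  : list of rows, each a predicted probability distribution
    y_star : list of rows, each a target distribution (e.g. one-hot)
    """
    per_sample = []
    for p_row, t_row in zip(y_hat, y_star):
        # -sum_k t_k * log(p_k); terms with t_k == 0 contribute nothing.
        loss = -sum(t * math.log(p) for p, t in zip(p_row, t_row) if t > 0)
        per_sample.append(loss)
    return sum(per_sample) / len(per_sample)


y_hat = [[0.7, 0.2, 0.1],
         [0.1, 0.8, 0.1]]
y_star = [[1, 0, 0],
          [0, 1, 0]]
cost = categorical_cross_entropy(y_hat, y_star)
print(cost)  # equals -(log(0.7) + log(0.8)) / 2
```

With one-hot targets, only the probability assigned to the correct class matters, so the cost falls toward 0 as that probability approaches 1.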

Besides the cost between the actual output and the target, we usually also add L1 and L2 regularization terms, which penalize large weights.
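As a rough illustration of how such penalties enter the cost, here is a small plain-Python sketch. The names l1_rate and l2_rate, the helper functions, and the sample numbers are assumptions made for this example; the tutorial's own code may weight or structure these terms differently.

```python
def l1_penalty(weights):
    """Sum of absolute values of all entries in a weight matrix."""
    return sum(abs(w) for row in weights for w in row)


def l2_penalty(weights):
    """Sum of squares of all entries in a weight matrix."""
    return sum(w * w for row in weights for w in row)


def regularized_cost(data_cost, weight_matrices, l1_rate, l2_rate):
    """Data cost plus weighted L1 and L2 penalties over all weight matrices."""
    l1 = sum(l1_penalty(W) for W in weight_matrices)
    l2 = sum(l2_penalty(W) for W in weight_matrices)
    return data_cost + l1_rate * l1 + l2_rate * l2


W = [[1.0, -2.0],
     [0.5, 0.0]]
total = regularized_cost(data_cost=0.5, weight_matrices=[W],
                         l1_rate=0.25, l2_rate=0.5)
print(total)  # 0.5 + 0.25 * 3.5 + 0.5 * 5.25 = 4.0
```

Only the weights are penalized in this sketch; whether biases are included is a design choice.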

Summarizing Parameters

All trainable parameters in the model should be collected together in order to get the correct gradient for the entire model. You can use a one-liner to do this job:

    @property
    def params(self):
        return [param
                for layer in self.layers if hasattr(layer, 'params')
                for param in layer.params]

The above code is what I use to gather the parameters of every layer in the neural network into one flat list. Layers without a params attribute are skipped.
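The behavior of that nested comprehension can be checked without Theano. In the sketch below, DenseLayer, DropoutLayer, and Network are hypothetical stand-ins, and the parameters are plain strings rather than shared variables.

```python
class DenseLayer:
    """Hypothetical layer that owns a weight and a bias."""

    def __init__(self, name):
        self.params = [name + "_W", name + "_b"]


class DropoutLayer:
    """Hypothetical layer with nothing to learn, so it has no params attribute."""


class Network:
    def __init__(self, layers):
        self.layers = layers

    @property
    def params(self):
        # Outer loop over layers (filtered), inner loop over each layer's params.
        return [param
                for layer in self.layers if hasattr(layer, "params")
                for param in layer.params]


net = Network([DenseLayer("l0"), DropoutLayer(), DenseLayer("l1")])
print(net.params)  # ['l0_W', 'l0_b', 'l1_W', 'l1_b']
```

The order of the flat list follows the order of the layers, which keeps it aligned with the list of gradients computed later.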

Training Model

Writing a backpropagation (BP) algorithm by hand is usually tedious and error-prone. Theano provides automatic differentiation based on its computation graph, which makes this process easy and flexible enough for all kinds of models.

You just need to call the function grad(cost, params); it computes the corresponding list of parameter gradients.

    from collections import OrderedDict

    gparams = T.grad(cost, params)
    updates = OrderedDict()
    for gparam, param in zip(gparams, params):
        if method == "sgd":
            updates[param] = param - learning_rate * gparam

The above is a typical example of SGD.
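The same update rule can be tried on a toy problem in plain Python. In this sketch the gradient is written by hand instead of coming from T.grad, and the cost J(w, b) = (w - 3)^2 + (b + 1)^2, the parameter names, and the learning rate are all chosen just for illustration.

```python
def sgd_updates(params, grads, learning_rate):
    """Return new values: param - learning_rate * grad for each parameter."""
    return {name: params[name] - learning_rate * grads[name] for name in params}


def grad_J(params):
    """Hand-written gradient of the toy cost J(w, b) = (w - 3)^2 + (b + 1)^2."""
    return {"w": 2.0 * (params["w"] - 3.0),
            "b": 2.0 * (params["b"] + 1.0)}


params = {"w": 0.0, "b": 0.0}
for step in range(100):
    params = sgd_updates(params, grad_J(params), learning_rate=0.1)

print(params)  # close to the minimum at w = 3, b = -1
```

Each step moves every parameter a small distance against its gradient, so the values approach the minimum of the cost.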

You can then use the parameter updates to build a training function:

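# The opening line is missing from the original listing; the name `train`
# and inputs=[idx] are assumed here.
train = theano.function(inputs=[idx],
                        outputs=cost,
                        updates=updates,
                        givens={X: train_set_x[idx * batch_size: (idx + 1) * batch_size],
                                y: train_set_y[idx * batch_size: (idx + 1) * batch_size]})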

The above Theano function can be used to train all the parameters. Given the index of a batch, it computes the cost on that batch and applies the updates to the parameters. The rest of the training is just to call this function on every training batch for a number of epochs.
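The outer loop described above might look like the following sketch. The compiled Theano function is replaced by a hypothetical fake_train callable so that the example runs on its own; run_training and its argument names are also assumptions, not part of the tutorial's code.

```python
def run_training(train_fn, n_samples, batch_size, n_epochs):
    """Call train_fn(idx) for every batch index in every epoch.

    Returns the mean cost of each epoch. Assumes n_samples >= batch_size.
    """
    n_batches = n_samples // batch_size  # a final incomplete batch is dropped
    epoch_costs = []
    for epoch in range(n_epochs):
        costs = [train_fn(idx) for idx in range(n_batches)]
        epoch_costs.append(sum(costs) / len(costs))
    return epoch_costs


calls = []


def fake_train(idx):
    """Hypothetical stand-in for the compiled training function."""
    calls.append(idx)
    return 1.0 / len(calls)  # pretend the cost shrinks with every update


history = run_training(fake_train, n_samples=100, batch_size=20, n_epochs=3)
print(len(calls))  # 15 calls: 5 batches per epoch, 3 epochs
```

Tracking the mean cost per epoch, as history does here, is a simple way to check that training is making progress.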