# Natural Language Processing — Neural Networks and Neural Language Models Lecture series — Training a feed-forward neural network part 2 (The cross-entropy loss function)

In the previous post, we spoke about what it means to ‘train’ a feed-forward neural network. We also briefly touched on the different tasks that are performed in the training process of a feed-forward neural network. In this post, we will be solely focused on the loss function, the cross-entropy loss function to be precise.

What is a loss function and what role does it play in training a neural network?

As briefly stated in the previous post, the main purpose of the loss function is to indicate how close the predicted output value of the neural network is to the actual output value. The loss value can be represented mathematically by the following expression:

L(ŷ, y) = how much the predicted ŷ differs from the true y

where ŷ (y-hat) represents the predicted output value and y represents the actual output value.

What is the cross-entropy loss function?

In this post, the loss function that we will be discussing is known as the cross-entropy loss function. The cross-entropy loss function expresses a preference that the correct output values be assigned high probability given their corresponding input values. That is, the higher the probability the network assigns to the correct output value for a given input, the lower the loss value; the lower that probability, the higher the loss value.
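This inverse relationship can be seen numerically. The sketch below assumes the standard binary cross-entropy formula, −[y · log ŷ + (1 − y) · log(1 − ŷ)], which this post derives step by step further down; the function name and the example probabilities are illustrative:

```python
import math

def binary_cross_entropy(y_hat, y):
    """Cross-entropy loss for a single binary prediction.

    y_hat: predicted probability that the output is 1 (0 < y_hat < 1)
    y:     actual output value (0 or 1)
    """
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

# A confident, correct prediction yields a low loss ...
low_loss = binary_cross_entropy(0.9, 1)   # -ln(0.9), about 0.105
# ... while a confident, wrong prediction yields a high loss.
high_loss = binary_cross_entropy(0.1, 1)  # -ln(0.1), about 2.303

print(low_loss, high_loss)
```

Note how assigning only 0.1 probability to the correct output makes the loss roughly twenty times larger than assigning it 0.9.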

How can we derive the cross-entropy loss function?

Assume that there are only 2 discrete output values (0 or 1) for a given set of input values, and that we would like to learn the weights (w and b) that maximise the probability of the correct output value y being predicted given its corresponding input values x (i.e. maximise p(y|x)). We can start off by observing the following equation in Fig. 1:

p(y|x) = ŷ^y · (1 − ŷ)^(1 − y) (Fig. 1)

If you are already familiar with various mathematical expressions representing statistical distributions, you will realise that the equation written in Fig. 1 represents the Bernoulli distribution. We are using the mathematical expression of a Bernoulli distribution here because we stated in our opening assumption that only 2 discrete output values can be produced (either 0 or 1).
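To see why the Bernoulli form is convenient, we can plug in the two possible output values. The exponents act as a switch: when y = 1 the expression ŷ^y · (1 − ŷ)^(1 − y) reduces to ŷ, and when y = 0 it reduces to 1 − ŷ. A minimal sketch (the helper name and the example probability 0.7 are illustrative):

```python
def bernoulli_pmf(y, y_hat):
    """p(y|x) = y_hat**y * (1 - y_hat)**(1 - y) for y in {0, 1}."""
    return y_hat ** y * (1 - y_hat) ** (1 - y)

y_hat = 0.7  # hypothetical predicted probability that y = 1

print(bernoulli_pmf(1, y_hat))  # exponents select y_hat itself: 0.7
print(bernoulli_pmf(0, y_hat))  # exponents select 1 - y_hat
```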

Now, we can take the log of both sides of the equation in Fig. 1 to yield the equation in Fig. 2. Because the log is a monotonically increasing function, maximising the log probability is equivalent to maximising the probability itself, and working in log space prevents numeric underflow when very small probabilities are represented in a computer system:

log p(y|x) = y · log ŷ + (1 − y) · log(1 − ŷ) (Fig. 2)
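The underflow problem is easy to demonstrate. Multiplying many small probabilities quickly drops below the smallest representable floating-point number and collapses to exactly 0.0, while the equivalent sum of logs remains a perfectly ordinary number (the probability 0.01 and the count 200 below are arbitrary illustrative values):

```python
import math

probs = [0.01] * 200

# Product of many small probabilities underflows to 0.0 in floating point:
# 0.01 ** 200 == 10 ** -400, far below the smallest positive double.
product = 1.0
for p in probs:
    product *= p
print(product)  # 0.0

# The equivalent sum of logs stays representable.
log_sum = sum(math.log(p) for p in probs)
print(log_sum)  # about -921.03
```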

In maximising the probability represented in Fig. 2, we are simultaneously minimising the loss value. This is because, to reiterate, the higher the probability our neural network assigns to the correct output values, the lower the loss value, since the loss value measures the difference between the correct output value and the predicted output value.

To turn this maximisation into a minimisation of the loss value, we negate the log probability, which produces the mathematical representation of the cross-entropy loss function as demonstrated in Fig. 3:

L_CE(ŷ, y) = −[y · log ŷ + (1 − y) · log(1 − ŷ)] (Fig. 3)
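In practice, the loss in Fig. 3 is computed per training example and averaged over a batch. The sketch below is a minimal illustration under two assumptions not in the derivation above: the function name is invented, and the predictions are clipped into (eps, 1 − eps) — a common numerical safeguard, not part of the formula itself — so that log(0) is never evaluated:

```python
import math

def batch_cross_entropy(y_hats, ys, eps=1e-12):
    """Average cross-entropy loss over a batch of binary examples.

    y_hats: predicted probabilities that each output is 1
    ys:     actual output values (0 or 1)
    """
    total = 0.0
    for y_hat, y in zip(y_hats, ys):
        # Clip to avoid log(0) for extreme predictions.
        y_hat = min(max(y_hat, eps), 1 - eps)
        total += -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))
    return total / len(ys)

# Hypothetical predictions against their true labels:
loss = batch_cross_entropy([0.9, 0.2, 0.8], [1, 0, 1])
print(loss)  # about 0.184
```

Training then amounts to adjusting the weights (w and b) so that this average loss goes down, which is exactly the topic of gradient descent.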