CROSS-ENTROPY-LOSS : BINARY AND CATEGORICAL
Let’s first understand what Cross-Entropy (CE) is. Suppose there are two probability distributions, say p and q. In information theory, CE measures the average number of bits required to identify an event drawn from p when we use a coding scheme optimized for q instead of p.
The formal definition of cross-entropy is:

H(p, q) = - Σ_x p(x) ln q(x)

Used as a loss function, p is the ground-truth distribution and q is the predicted distribution, so the Cross-Entropy Loss is:

L = - Σ_x p(x) ln q(x)
For multi-class classification, the prediction (q) and the ground truth (p) will be vectors rather than single numbers:

p = (p_1, p_2, … , p_n) , q = (q_1, q_2, … , q_n)

Then the loss function will be:

L = - Σ_{i=1}^{n} p_i ln(q_i)
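As a quick sanity check, this loss can be sketched in plain Python (the function name and the eps guard against ln(0) are my own choices, not from the original post):

```python
import math

def cross_entropy(p, q, eps=1e-12):
    """Cross-entropy loss L = -sum_i p_i * ln(q_i).

    p   : ground-truth probability vector
    q   : predicted probability vector
    eps : small constant guarding against ln(0)
    """
    return -sum(pi * math.log(qi + eps) for pi, qi in zip(p, q))
```

For a one-hot p, only the true class’s term survives the sum, which is exactly what the worked example relies on.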
Still scratching your head?……Let’s understand through an example of multi-class classification.
CATEGORICAL-CROSS-ENTROPY LOSS :
Suppose in our data set there are three classes to be identified:

{0 : ‘crocodile’ , 1 : ‘Deer’ , 2 : ‘Rhino’}

The actual (one-hot) distributions for the three classes will be:

crocodile : p = (1, 0, 0) , Deer : p = (0, 1, 0) , Rhino : p = (0, 0, 1)

Now suppose, during training, Rhino data is fed to the network and the algorithm returns the corresponding estimated probability distribution:

q = (q_{r1}, q_{r2}, q_{r3})

So, the CE loss for the Rhino data will be:

L = -(0 · ln q_{r1} + 0 · ln q_{r2} + 1 · ln q_{r3}) = -ln(q_{r3})
Observe that the contribution to the CE loss comes from q_{r3} only.
But there is a constraint on the values of q_{r1}, q_{r2} and q_{r3}, since together they form a probability distribution:

q_{r1} + q_{r2} + q_{r3} = 1
While training, the loss tends to decrease, which implies that:

-ln(q_{r3}) decreases ⇒ q_{r3} → 1

So, because of the constraint above,

q_{r1} → 0 and q_{r2} → 0
When q_{r3} = 1, the loss becomes 0, because ln(1) = 0.
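The Rhino example can be checked numerically. Here is a small sketch (the helper name ce_loss and the sample q vectors are illustrative, not from the post):

```python
import math

# One-hot ground truth for Rhino (class 2): p = (0, 0, 1)
p = [0.0, 0.0, 1.0]

def ce_loss(p, q):
    # Categorical CE: -sum_i p_i * ln(q_i); only the true class contributes,
    # so terms with p_i = 0 are skipped (this also avoids ln(0)).
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

loss_bad  = ce_loss(p, [0.60, 0.30, 0.1])  # -ln(0.1) ≈ 2.303 (poor prediction)
loss_good = ce_loss(p, [0.05, 0.05, 0.9])  # -ln(0.9) ≈ 0.105 (good prediction)
loss_zero = ce_loss(p, [0.00, 0.00, 1.0])  # -ln(1)  =  0     (perfect prediction)
```

As q_{r3} climbs toward 1, the loss falls toward 0, matching the argument above.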
Binary Cross Entropy Loss :
In this case, there are only two classes.

So, the CE loss function will be:

L = -(p_1 ln q_1 + p_2 ln q_2)

But now,

p_1 + p_2 = 1 ⇒ p_2 = 1 - p_1

and,

q_1 + q_2 = 1 ⇒ q_2 = 1 - q_1

Because there are only two classes, if we know the probability distribution for one class, that of the other is simply its complement.

So, the final loss function will be:

L = -(p_1 ln q_1 + (1 - p_1) ln(1 - q_1))
As is evident from the above equation, if we have only q_1 and the corresponding p_1, the whole loss can be evaluated.

So, to obtain q_1, we need only one node at the output.

That is why, while applying Sigmoid, we need only one node at the output, holding the probability of the positive class only.
Conclusion:
As is evident from the above discussion, Binary CE is a special case of Categorical CE: knowing the probability distribution for one class determines that of the other, so a single predicted probability is enough.
Have a look at this blog and let me know your thoughts in the comment section. Constructive criticism will be appreciated.