CROSS-ENTROPY LOSS: BINARY AND CATEGORICAL

Akash Manna
4 min read · May 18, 2021


Let’s first understand what Cross-Entropy (CE) is. Suppose there are two probability distributions, say p and q. In information theory, CE measures the average number of bits required to identify an event drawn from p when the coding scheme is optimized for the estimated distribution q.

The formal definition of cross-entropy is:

H(p, q) = -\sum_x p(x) \ln q(x)

Definition of the Cross-Entropy Loss function, where p is the ground truth and q is the prediction:

L_{CE}(p, q) = -\sum_x p(x) \ln q(x)

For multi-class classification, the prediction q and the ground truth p are n-dimensional vectors rather than single numbers:

p = (p_1, p_2, \ldots, p_n), \quad q = (q_1, q_2, \ldots, q_n)

Then the loss function will be:

L = -\sum_{i=1}^{n} p_i \ln(q_i)

for n-class classification.
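To make this concrete, here is a minimal NumPy sketch of the n-class loss above (the function name, and the small epsilon used to avoid ln(0), are my additions, not part of the original derivation):

```python
import numpy as np

def cross_entropy(p, q, eps=1e-12):
    """n-class cross-entropy loss: -sum_i p_i * ln(q_i).

    p : ground-truth distribution (e.g. a one-hot vector of length n)
    q : predicted distribution of length n (entries sum to 1)
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    q = np.clip(q, eps, 1.0)  # guard against ln(0)
    return -np.sum(p * np.log(q))
```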

Still scratching your head? Let’s understand it through an example of multi-class classification.

CATEGORICAL CROSS-ENTROPY LOSS:

Suppose in our dataset there are three classes to be identified:

{0 : ‘crocodile’ , 1 : ‘Deer’ , 2 : ‘Rhino’}

Now, the actual (ground-truth) distributions for the three classes will be one-hot vectors:

p = (1, 0, 0) for crocodile
p = (0, 1, 0) for Deer
p = (0, 0, 1) for Rhino.
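In code, these one-hot ground-truth vectors could be built like this (a small sketch; the index-to-class mapping follows the dictionary above):

```python
import numpy as np

classes = {0: 'crocodile', 1: 'Deer', 2: 'Rhino'}

# np.eye(3) gives the three one-hot rows: a 1 at the class index, 0 elsewhere.
one_hot = np.eye(len(classes))
p_crocodile, p_deer, p_rhino = one_hot  # (1,0,0), (0,1,0), (0,0,1)
```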

Now suppose, during training, Rhino data is fed to the network and the algorithm returns the corresponding estimated probability distribution:

q = (q_{r1}, q_{r2}, q_{r3})

So, the CE loss for the Rhino data will be:

L = -(0 \cdot \ln(q_{r1}) + 0 \cdot \ln(q_{r2}) + 1 \cdot \ln(q_{r3})) = -\ln(q_{r3})

Observe that the contribution to the CE loss comes from q_{r3} only.

But there is a constraint on the values of q_{r1}, q_{r2} and q_{r3}: they form a probability distribution, so

q_{r1} + q_{r2} + q_{r3} = 1

While training, the loss tends to decrease, which implies that:

-\ln(q_{r3}) \downarrow \;\implies\; q_{r3} \uparrow

So, by the constraint above,

q_{r1} + q_{r2} = 1 - q_{r3} \to 0

During training, the value corresponding to Rhino increases towards 1 while the other two values decrease towards 0.

When q_{r3} = 1, the loss becomes 0, because \ln(1) = 0.
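Putting this together numerically (the predicted probabilities below are hypothetical values chosen for illustration):

```python
import numpy as np

p_rhino = np.array([0.0, 0.0, 1.0])  # ground truth: Rhino
q = np.array([0.1, 0.2, 0.7])        # hypothetical network output; sums to 1

# Only the Rhino term survives: loss = -ln(q[2])
loss = -np.sum(p_rhino * np.log(q))
print(loss)  # -ln(0.7) ~ 0.357

# As q[2] grows towards 1 (and the others shrink towards 0),
# the loss falls towards -ln(1) = 0.
```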

BINARY CROSS-ENTROPY LOSS:

In this case, there are only two classes.

So, the CE loss function will be:

L = -(p_1 \ln(q_1) + p_2 \ln(q_2))

But now,

p_1 + p_2 = 1

and,

q_1 + q_2 = 1

Because there are only two classes, if we know the probability of one class, the probability of the other is its complement: p_2 = 1 - p_1 and q_2 = 1 - q_1.

So, the final loss function will be:

L = -(p_1 \ln(q_1) + (1 - p_1) \ln(1 - q_1))

As evident from the above equation, if we have only q_1 and the corresponding p_1, the whole loss can be evaluated.

So, to know q_1, we need only one node at the output.

That is why, when applying a sigmoid, we need only one output node, which holds the probability of the positive class; a sketch follows below.
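As a minimal sketch of this idea, a binary cross-entropy written in terms of the single positive-class probability might look like the following (the function name and example numbers are mine):

```python
import numpy as np

def binary_cross_entropy(p1, q1, eps=1e-12):
    """Binary CE: -(p1*ln(q1) + (1-p1)*ln(1-q1)).

    p1 : ground-truth label for the positive class (0 or 1)
    q1 : predicted probability of the positive class (one sigmoid node)
    """
    q1 = np.clip(q1, eps, 1.0 - eps)  # keep both logs finite
    return -(p1 * np.log(q1) + (1.0 - p1) * np.log(1.0 - q1))

print(binary_cross_entropy(1.0, 0.9))  # -ln(0.9) ~ 0.105
print(binary_cross_entropy(0.0, 0.9))  # -ln(0.1) ~ 2.303
```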

Conclusion:

As evident from the above discussion, binary CE is a special case of categorical CE with n = 2, where the probability distribution of one class is enough to determine that of the other.

Let me know your thoughts in the comment section. Constructive criticism will be appreciated.
