# Notes on Python Deep Learning (11): Loss Functions

2022-02-02 09:50:02 Tanlyn

# 1. Definition

The general form of a loss function is L(y, f(x)). It measures the degree of inconsistency between the true value y and the prediction f(x); smaller is generally better. To make different loss functions easier to compare, the loss is often written as a function of a single variable: in regression problems that variable is y − f(x), the residual; in classification problems it is yf(x), which indicates whether the prediction and the label agree in sign.
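As a minimal sketch of the two quantities above (the variable names are mine, chosen for illustration):

```python
# Regression: the residual y - f(x) measures inconsistency.
y_reg, f_reg = 3.0, 2.5
residual = y_reg - f_reg   # 0.5

# Classification with labels in {+1, -1}: the margin y * f(x).
# A positive margin means the prediction agrees in sign with the label.
y_cls, f_cls = 1, -0.8
margin = y_cls * f_cls     # -0.8, i.e. a misclassification
```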

A neural network with multiple outputs may have several loss functions, one per output. However, gradient descent must operate on a single scalar loss value, so for networks with multiple loss functions, all losses are averaged into one scalar.

# 2. Classification

## 2.1 Classification problems

For binary classification with labels y ∈ {+1, −1}, the loss function is often expressed as a monotonically decreasing function of yf(x), as shown in the figure below. The quantity yf(x) is called the margin: yf(x) > 0 means the classifier is correct, while yf(x) < 0 means it is wrong, where f(x) = 0 is the classifier's decision hyperplane. This is reminiscent of classifiers such as the perceptron and the support vector machine, where minimizing the loss function is a process of maximizing the margin.

### 2.1.1 0-1 loss (0-1 loss)

The formula is:

$$L(y, f(x)) = \begin{cases} 1, & yf(x) < 0 \\ 0, & yf(x) \ge 0 \end{cases}$$

The 0-1 loss treats every misclassified point the same, even points far from the decision boundary. It is a discrete, non-convex function and therefore hard to optimize, so other surrogate loss functions are usually optimized in its place.
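The piecewise definition above is trivial to code directly. A minimal sketch (the function name `zero_one_loss` is my own, not from the original):

```python
def zero_one_loss(y, fx):
    """0-1 loss for labels y in {+1, -1}: 1 if misclassified, else 0."""
    return 1 if y * fx < 0 else 0

print(zero_one_loss(+1, 2.3))   # correct side of the boundary -> 0
print(zero_one_loss(-1, 0.7))   # wrong side of the boundary -> 1
```

Note that both a near miss (yf(x) = −0.01) and a distant outlier (yf(x) = −100) cost exactly 1, which is what "treats every misclassified point the same" means.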

### 2.1.2 Cross-entropy loss (cross entropy loss)

Cross-entropy loss is the loss function most commonly used for classification problems in deep learning. Cross-entropy is also an important concept in information theory, mainly used to measure the difference between two probability distributions. Rather than covering the theory in detail, this note only shows concrete code implementations.

Background for using cross-entropy:

When solving a classification problem with a neural network, the network is usually given k output nodes, where k is the number of classes, as in the figure below. Each output node emits a score for its corresponding class, e.g. [cat, dog, car, pedestrian] → [44, 10, 22, 5]. But these scores are not a probability distribution, so cross-entropy cannot yet be used to compare the prediction with the ground truth. The solution is to append a softmax layer after the output, which converts the scores into a probability distribution.
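The score-to-probability conversion can be sketched with NumPy, using the example scores above (subtracting the maximum before exponentiating is a standard numerical-stability trick, not something the original text mentions):

```python
import numpy as np

scores = np.array([44., 10., 22., 5.])   # raw scores for [cat, dog, car, pedestrian]

def softmax(z):
    # Subtract the max for numerical stability; the result is mathematically unchanged.
    e = np.exp(z - np.max(z))
    return e / e.sum()

probs = softmax(scores)
print(probs)         # a valid probability distribution, dominated by "cat"
print(probs.sum())   # sums to 1
```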

Binary classification: binary cross-entropy (binary_cross_entropy), implemented with TensorFlow:

```python
# Case: batch_size is 1, a single label, output shape (1, 1, 1)
import tensorflow as tf

y_true = [[[0.]]]
y_pred = [[[0.5]]]

# Using the built-in function
loss = tf.keras.losses.binary_crossentropy(y_true, y_pred)
print(loss.numpy())

# Hand-coded from the formula
loss_1 = -(1 / 1) * (0 * tf.math.log(0.5) + (1 - 0) * tf.math.log(1 - 0.5))
print(loss_1)
```

Implemented with PyTorch:

```python
import numpy as np
import torch
import torch.nn.functional as F

y_true = np.array([0., 1., 1., 1., 1., 1., 1., 1., 1., 1.])
y_pred = np.array([0.2, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8])

# Hand-coded from the formula
my_loss = -y_true * np.log(y_pred) - (1 - y_true) * np.log(1 - y_pred)
mean_my_loss = np.mean(my_loss)
print('my_loss:', mean_my_loss)

# Using PyTorch's built-in function
torch_pred = torch.tensor(y_pred)
torch_true = torch.tensor(y_true)
bce_loss = F.binary_cross_entropy(torch_pred, torch_true)
print('bce_loss:', bce_loss)
```

Multiclass classification: categorical cross-entropy (categorical_cross_entropy)

Implemented with TensorFlow:

```python
import tensorflow as tf

y_true = [[[0., 1.]]]
y_pred = [[[0.4, 0.6]]]  # assumed to be post-softmax, so the entries sum to 1

# Using the built-in function
loss = tf.keras.losses.categorical_crossentropy(y_true, y_pred)
print(loss.numpy())

# Hand-coded from the formula
loss = -(0 * tf.math.log(0.4) + 1 * tf.math.log(0.6))
print(loss.numpy())
```

## 2.2 Regression problems

Learning a regression problem amounts to function fitting: choosing a curve that fits the known data well and predicts unknown data well. In regression, both y and f(x) are real-valued, so the residual y − f(x) is used to measure the inconsistency between the predicted and true values.

### 2.2.1 Mean squared error loss (MSE, L2 loss)

Mean squared error loss is also called L2 loss. The mathematical expression is:

$$L(y, f(x)) = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - f(x_i)\right)^2$$

This is the most common loss function. It is convex and can be optimized with gradient descent. However, it is sensitive to points far from the true value: such points incur a very large loss, which makes the mean squared error loss less robust.
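The outlier sensitivity is easy to see numerically. A small sketch with made-up data (the specific numbers are mine, for illustration only):

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 4.0])

mse = np.mean((y_true - y_pred) ** 2)
print('MSE:', mse)   # small: all residuals are small

# One far-off prediction dominates the loss because its residual is squared.
y_pred_outlier = np.array([1.1, 1.9, 3.2, 40.0])
mse_outlier = np.mean((y_true - y_pred_outlier) ** 2)
print('MSE with outlier:', mse_outlier)
```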

### 2.2.2 Absolute value loss (MAE, L1 loss)

The absolute value loss function is also called the L1 loss. The mathematical expression is:

$$L(y, f(x)) = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - f(x_i)\right|$$

Compared with mean squared error, the absolute value loss handles outlying points better, but it is non-differentiable at y = f(x), and the MAE gradient has the same magnitude everywhere, so the gradient can stay large near the optimum and overshoot it.

# 3. Summary

1. Squared loss is the most commonly used. Its drawback is that it punishes outliers heavily, so it is not very robust.
2. Absolute loss resists interference from outliers, but its derivative is discontinuous at y − f(x) = 0, which makes it harder to optimize.
3. Huber loss combines the two: when |y − f(x)| is smaller than a pre-specified threshold δ it behaves like the squared loss, and when it is larger than δ it behaves like the absolute loss, so Huber loss is also a relatively robust loss function.
4. If outliers represent important anomalies that need to be detected, use MSE (regression problems often use the MSE loss). If abnormal data is just treated as corrupted data, use MAE.
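The Huber loss described in the summary can be sketched as follows (the function name and the default δ = 1.0 are my own choices, not from the original):

```python
import numpy as np

def huber_loss(y, fx, delta=1.0):
    """Huber loss: quadratic for |y - f(x)| <= delta, linear beyond it."""
    r = np.abs(y - fx)
    return np.where(r <= delta,
                    0.5 * r ** 2,                # squared-loss regime
                    delta * (r - 0.5 * delta))   # absolute-loss regime

print(huber_loss(2.0, 1.5))    # small residual (0.5) -> 0.5 * 0.5**2 = 0.125
print(huber_loss(10.0, 1.5))   # large residual (8.5) -> 1.0 * (8.5 - 0.5) = 8.0
```

The two branches are chosen so the pieces join with matching value and slope at |y − f(x)| = δ, which is what makes the loss both differentiable everywhere and robust to outliers.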