How To Train ML Models With Mislabeled Data

Leaderboard position of the Cassava Leaf Disease Classification competition.

Rule #1 in data science: Garbage In = Garbage Out.

A brief introduction to the dataset:

Figure 1. Cassava plant classes, 4 diseases and 1 healthy class: CBB, Healthy, CBSD, CGM, and CMD. [Image by Author]
Figure 2. An example of a mislabeled image (left, dead leaves) and a correctly labeled image (right, healthy leaves). [Image by Author]

1- Bi-Tempered loss function:

Figure 3. Difference in decision boundaries between logistic loss (cross-entropy, stretched boundaries) and Bi-Tempered loss (adjusted boundaries). Source:
  • With small-margin noise: the noise stretches the decision boundary into a heavy-tailed form. The Bi-Tempered loss corrects this by tuning the t2 parameter from t2=1 to t2=4.
  • With large-margin noise: the noise stretches the decision boundary in a bounded way, covering more surface than the heavy tail in the small-margin case. The Bi-Tempered loss corrects this by tuning the t1 parameter from t1=1 to t1=0.2.
  • With random noise: both heavy tails and bounded distortions appear, so both t1 and t2 are adjusted in the Bi-Tempered loss.
# Import the plotting helper
from mlxtend.plotting import plot_decision_regions
import matplotlib.pyplot as plt

# Plot the decision boundary of a fitted classifier
# (X_test: 2D feature array, y_test: integer labels, model: fitted classifier)
plot_decision_regions(X=X_test, y=y_test, clf=model, legend=2)
plt.show()
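The effect of t1 on boundedness can be sketched with the tempered logarithm at the heart of the Bi-Tempered loss. This is a minimal illustration, not the full loss (which also needs a tempered softmax controlled by t2): it only shows how t1 < 1 keeps the per-example loss bounded so that badly mislabeled examples cannot dominate training.

```python
import numpy as np

def log_t(x, t):
    """Tempered logarithm; reduces to log(x) as t -> 1."""
    if t == 1.0:
        return np.log(x)
    return (x ** (1.0 - t) - 1.0) / (1.0 - t)

# With t1 < 1 the loss -log_t(p, t1) is bounded: as the predicted
# probability p of the true class goes to 0, the loss approaches
# 1 / (1 - t1) instead of infinity (here 1 / 0.8 = 1.25).
p = np.array([1e-6, 0.01, 0.5, 1.0])
bounded_loss = -log_t(p, 0.2)
standard_loss = -np.log(p)  # unbounded as p -> 0
```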

2- Self Distillation:

Self Distillation: You train your model and then you retrain it using itself as a teacher.

  • 1- Split your dataset into k folds for cross-validation.
  • 2- Train Model 1 and collect its out-of-fold (OOF) predictions.
Figure 4. The dataset is split into 5 cross-validation folds, and Model 1 produces out-of-fold predictions. [Image by Author]
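Steps 1-2 can be sketched with scikit-learn. A LogisticRegression on synthetic data stands in for the neural network here; `cross_val_predict` returns, for every sample, the prediction of the model trained on the folds that did not contain that sample.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Synthetic stand-in data: 100 samples, 4 features, 5 classes
rng = np.random.default_rng(0)
X = rng.random((100, 4))
y = rng.integers(0, 5, size=100)

# Out-of-fold probability predictions over 5 folds
oof_preds = cross_val_predict(
    LogisticRegression(max_iter=1000), X, y,
    cv=5, method="predict_proba",
)
print(oof_preds.shape)  # one probability vector per sample
```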
  • 3- Load the saved out-of-fold predictions and blend them with the original labels. The blending coefficients are tunable; the original labels should get the higher coefficient.
Figure 5. Model 2 blends the original labels with the out-of-fold (OOF) predictions from Model 1 to improve its performance. [Image by Author]
  • In this particular example we have a multiclass classification problem with 5 classes: [0, 1, 2, 3, 4].
  • The labels are one-hot encoded; class 2 is represented as [0, 0, 1, 0, 0].
  • Model 1 predicted class 2 correctly: [0.1, 0.1, 0.4, 0.1, 0.3]. It gave class 2 the highest probability (0.4), but it also gave class 4 a fairly high probability (0.3).
  • Model 2 will use this extra information to improve its predictions.
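The blending step above can be sketched in a few lines of NumPy. Here `alpha` is the tunable coefficient on the original labels; 0.7 is an arbitrary choice for illustration, kept above 0.5 so the original labels dominate.

```python
import numpy as np

alpha = 0.7  # weight on the original (hard) labels

labels = np.array([0, 0, 1, 0, 0], dtype=float)  # one-hot label for class 2
oof_preds = np.array([0.1, 0.1, 0.4, 0.1, 0.3])  # Model 1's OOF prediction

# Soft targets for Model 2: mostly the original label, plus a hint
# from Model 1 that class 4 is also plausible for this image.
soft_labels = alpha * labels + (1 - alpha) * oof_preds
print(soft_labels)  # [0.03 0.03 0.82 0.03 0.09]
```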

3- Ensemble learning:

Figure 6. Ensemble of 3 different models: VitBase-16, resnext50_32x4d, and EfficientNet B3. [Image by Author]
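Ensembling the predictions can be sketched as a simple average of class probabilities. The arrays below are made-up stand-ins for the softmax outputs of the three models on one image; weighted averaging is a common variant when one model is clearly stronger.

```python
import numpy as np

# Hypothetical softmax outputs of the three models (5 classes)
preds_vit = np.array([0.60, 0.10, 0.10, 0.10, 0.10])
preds_resnext = np.array([0.40, 0.35, 0.10, 0.05, 0.10])
preds_effnet = np.array([0.50, 0.20, 0.10, 0.10, 0.10])

# Unweighted mean of the probabilities; the class with the highest
# averaged probability becomes the ensemble's prediction.
ensemble = (preds_vit + preds_resnext + preds_effnet) / 3
predicted_class = int(ensemble.argmax())
print(predicted_class)  # class 0 wins
```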
To recap, three techniques for training with mislabeled data:
  • The Bi-Tempered loss function, with its parameters t1 and t2 tuned correctly.
  • Self Distillation: train your model, then retrain it using itself as a teacher.
  • Ensemble learning: combine the predictions of different models.



