How To Train ML Models With Mislabeled Data

3 Tips on how to train machine learning models efficiently when your data is noisy and mislabeled…

6 min read · Feb 24, 2021


Photo by Alex Chumak on Unsplash

In this article, I would like to talk about 3 tricks that helped me train models efficiently and win a silver medal in a Kaggle competition where the dataset was mislabeled and contained a significant amount of noise.

Leaderboard position of the Cassava Leaf Disease Classification competition.

By using those 3 tricks, I managed to deal with the noisy data and finish in the 114th position out of 3900 teams in this competition.

Rule n° 1 in data science: Garbage In = Garbage Out.

Mislabeled data is part of real-world data, and not all datasets are clean. Most datasets contain some amount of noise, which can be challenging when training a machine learning model. The good news is that the Garbage In = Garbage Out rule can be softened with some tricks that help your model adapt to the mislabeled data.

A brief introduction to the dataset:

Cassava leaf disease prediction: It’s a computer vision competition with a dataset of 21,367 labeled images of cassava plants. The aim of the competition was to classify each cassava image into one of four disease categories or a fifth category indicating a healthy leaf.

5 images representing cassava plants with 4 different diseases 1 healthy plant
Figure 1. Cassava plant classes: 4 diseases and one healthy class (CBB, Healthy, CBSD, CGM and CMD). [Image by Author]

After a quick exploratory data analysis, I realized that some of the images were mislabeled. Take the 2 images below as an example:

2 images of cassava plants. the first image has dead leaves, the second one has healthy leaves
Figure 2. An example of a mislabeled image (left) and a correctly labeled image (right) [Image By Author]

We can clearly see that the 1st image contains diseased leaves while the 2nd one has healthy leaves. Yet both images were labeled as ‘healthy’ in this dataset, which makes the model’s task harder, since it has to extract and learn the features of both healthy and diseased leaves and assign them to the same class: Healthy.

In the following section, I would like to talk about 3 tricks I found useful to deal with noisy datasets:

1- Bi-Tempered loss function:

Picking the right loss function is critical in machine learning. It depends a lot on your data, task and metric. In this case, we have a multi-class classification problem (5 classes) with categorical accuracy as the metric. So, the first loss function that comes to mind is categorical cross-entropy.

However, we have a mislabeled dataset and the cross-entropy loss is very sensitive to outliers. Mislabeled images can stretch the decision boundaries and dominate the overall loss.

To solve this problem, Google AI researchers introduced a “bi-tempered” generalization of the logistic loss with two tunable parameters, which they call “temperatures”: t1, which controls boundedness, and t2, which controls tail-heaviness.

It’s basically a cross-entropy loss with 2 new tunable parameters t1 and t2. The standard cross-entropy can be recovered by setting both t1 and t2 equal to 1.
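To make this concrete, here is a minimal NumPy sketch of the loss. It is written from the formulas in the bi-tempered loss paper, not copied from the Google repository, and it rests on a few assumptions: labels are integer class indices (one-hot targets), t1 is at most 1, and t2 is at least 1. The function names and the number of fixed-point iterations are illustrative choices of mine.

```python
import numpy as np

def log_t(x, t):
    """Tempered logarithm; reduces to np.log when t == 1."""
    if t == 1.0:
        return np.log(x)
    return (x ** (1.0 - t) - 1.0) / (1.0 - t)

def exp_t(x, t):
    """Tempered exponential; reduces to np.exp when t == 1."""
    if t == 1.0:
        return np.exp(x)
    return np.maximum(1.0 + (1.0 - t) * x, 0.0) ** (1.0 / (1.0 - t))

def tempered_softmax(logits, t2, num_iters=30):
    """Heavy-tailed softmax for t2 >= 1, normalized by fixed-point iteration."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    if t2 == 1.0:
        e = np.exp(shifted)
        return e / e.sum(axis=-1, keepdims=True)
    a = shifted
    for _ in range(num_iters):
        z = exp_t(a, t2).sum(axis=-1, keepdims=True)
        a = shifted * z ** (1.0 - t2)
    z = exp_t(a, t2).sum(axis=-1, keepdims=True)
    norm = -log_t(1.0 / z, t2)  # normalization constant (relative to the max logit)
    return exp_t(shifted - norm, t2)

def bi_tempered_loss(logits, labels, t1, t2):
    """Per-sample bi-tempered loss for integer labels (one-hot targets)."""
    probs = tempered_softmax(logits, t2)
    p_true = probs[np.arange(len(labels)), labels]
    correction = (1.0 - (probs ** (2.0 - t1)).sum(axis=-1)) / (2.0 - t1)
    return -log_t(p_true, t1) - correction

logits = np.array([[2.0, 0.5, -1.0, 0.0, 0.3]])
labels = np.array([0])
print(bi_tempered_loss(logits, labels, t1=1.0, t2=1.0))  # same value as cross-entropy
print(bi_tempered_loss(logits, labels, t1=0.2, t2=4.0))
```

For real training you would use the official implementation, which also handles gradients and soft labels; this sketch only shows how the two temperatures enter the computation.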

So, what happens when we tune t1 and t2 parameters?

Examples of the decision boundaries of models using logistic loss (stretched boundaries) and Bi-Tempered loss (adjusted boundaries) functions
Figure 3. Difference in decision boundaries between logistic loss (cross-entropy loss) and Bi-Tempered loss Source:

Let’s understand what’s happening here:

  • With small margin noise: Mislabeled points close to the boundary stretch it towards them, because the standard softmax has a short, exponentially decaying tail. The Bi-Tempered loss fixes this with a heavier-tailed softmax, obtained by raising t2 from t2=1 to t2=4.
  • With large margin noise: Mislabeled points far from the boundary stretch it over a much larger area, because the logistic loss is unbounded and a single distant outlier can dominate the total loss. The Bi-Tempered loss fixes this by bounding the loss, which is done by lowering t1 from t1=1 to t1=0.2.
  • With random noise: Here, mislabeled points appear both close to and far from the boundary, so both t1 and t2 are adjusted in the Bi-Tempered loss.
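A quick numerical check shows what "bounded" and "heavy-tailed" mean here. The sketch below uses the tempered logarithm and exponential from the paper with the temperature values from the figure (t1=0.2, t2=4); the sample probabilities and activations are arbitrary values I picked for illustration.

```python
import numpy as np

# Boundedness (t1 < 1): the penalty for a tiny true-class probability p is capped.
t1 = 0.2
p = np.array([0.5, 1e-2, 1e-6])
ce_term = -np.log(p)                               # grows without limit as p -> 0
tempered_term = -(p ** (1 - t1) - 1) / (1 - t1)    # -log_t1(p), never above 1 / (1 - t1)
print(ce_term)
print(tempered_term)

# Tail-heaviness (t2 > 1): very negative activations keep non-negligible mass.
t2 = 4.0
x = np.array([-1.0, -5.0, -10.0])
standard = np.exp(x)                               # exponential decay
tempered = (1 + (1 - t2) * x) ** (1 / (1 - t2))    # exp_t2(x) for x <= 0, polynomial decay
print(standard)
print(tempered)
```

With t1=0.2, a confidently wrong prediction can cost at most 1 / (1 - 0.2) = 1.25 from this term, while the cross-entropy penalty keeps growing.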

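A practical way to tune t1 and t2 is to plot your model’s decision boundary and check how the noisy points distort it. If mislabeled points far from the boundary drag it towards them, lower t1 (bounded loss). If mislabeled points near the boundary stretch it, raise t2 (heavier tail). If both happen, adjust both, keeping t1 below 1 and t2 above 1.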

If you are dealing with tabular data, you can use the plot_decision_regions() function from the mlxtend package to visualize your model’s decision boundaries.

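# Import packages
import matplotlib.pyplot as plt
from mlxtend.plotting import plot_decision_regions

# Plot the decision boundary of a fitted classifier.
# X must be a NumPy array with two feature columns, y an integer label array.
plot_decision_regions(X=x_test, y=y_test, clf=model, legend=2)
plt.show()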

You can learn more about the Bi-Tempered loss in the Google AI blog and their GitHub repository.

2- Self Distillation:

If you are already familiar with knowledge distillation, where knowledge is transferred from a teacher model to a student model, self distillation is a very similar concept.

This concept was introduced in the paper: Be Your Own Teacher: Improve the Performance of Convolutional Neural Networks via Self Distillation. The idea is simple:

Self Distillation: You train your model and then you retrain it using itself as a teacher.

The paper discusses a more advanced approach that includes several loss functions and some architecture modifications (additional bottleneck and fully connected layers). In this article, I’d like to introduce a much simpler approach.

I first read about this approach in the first place solution of the plant pathology competition on Kaggle, where the winning team used self-distillation to deal with the noisy dataset. You can check the code in their GitHub repository.

Self Distillation in 3 steps:

  • 1- Split your dataset into k cross-validation folds.
  • 2- Train model 1 on each split and save its out-of-fold predictions.
A diagram representing a dataset split to 5 folds cross-validation fed to a neural networks model that predicts the out of folds predictions
Figure 4. The dataset is split to 5 folds cross-validation. Model 1 predicts out of fold predictions. [Image by Author]
  • 3- Load the saved out-of-fold predictions and blend them with the original labels. The blending coefficients are tunable, but the original labels should have the higher coefficient.
A dataset split to 5 fold cross validation, the labels are blend with the out of folds predictions from the previous figure. The new dataset is fed to the neural networks model
Figure 5. Model 2 uses out-of-fold predictions (OOF) from Model 1 to improve its performance. [Image by Author]
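The three steps can be sketched end to end on toy data. This is only an illustration under stated assumptions: a small NumPy softmax regression stands in for the neural network, the data is synthetic with 10% of the labels flipped, and the blending weight of 0.7 is a placeholder value rather than the one used in the competition. All function names are hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_softmax_regression(X, targets, lr=0.3, epochs=200):
    """Stand-in model: linear softmax classifier trained on (possibly soft) targets."""
    W = np.zeros((X.shape[1], targets.shape[1]))
    b = np.zeros(targets.shape[1])
    for _ in range(epochs):
        grad = softmax(X @ W + b) - targets  # gradient of cross-entropy w.r.t. logits
        W -= lr * X.T @ grad / len(X)
        b -= lr * grad.mean(axis=0)
    return W, b

def out_of_fold_probs(X, targets, k=5, seed=0):
    """Steps 1 and 2: every sample is predicted by a model that never trained on it."""
    idx = np.random.default_rng(seed).permutation(len(X))
    oof = np.zeros(targets.shape)
    for val_idx in np.array_split(idx, k):
        train_idx = np.setdiff1d(idx, val_idx)
        W, b = fit_softmax_regression(X[train_idx], targets[train_idx])
        oof[val_idx] = softmax(X[val_idx] @ W + b)
    return oof

# Toy data: 3 classes, 2 features, roughly 10% of the labels flipped at random.
rng = np.random.default_rng(0)
true_labels = rng.integers(0, 3, size=300)
X = np.eye(3)[true_labels][:, :2] * 3.0 + rng.normal(size=(300, 2))
noisy_labels = true_labels.copy()
flip = rng.random(300) < 0.1
noisy_labels[flip] = rng.integers(0, 3, size=flip.sum())
hard_targets = np.eye(3)[noisy_labels]

oof = out_of_fold_probs(X, hard_targets)           # model 1, out-of-fold probabilities
label_weight = 0.7                                 # assumed coefficient; tune it
soft_targets = label_weight * hard_targets + (1.0 - label_weight) * oof  # step 3
W2, b2 = fit_softmax_regression(X, soft_targets)   # model 2, trained on blended labels
```

Because the original labels keep the larger weight, the blended target never changes the labeled class; it only moves some probability mass towards the classes model 1 found plausible.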

The out of fold predictions here are class probabilities predicted by model 1:

  • In this particular example we have a multiclass classification with 5 classes [0,1,2,3,4].
  • The labels are one hot encoded. Class 2 is represented as [0,0,1,0,0].
  • Model 1 predicted class 2 correctly with [0.1, 0.1, 0.4, 0.1, 0.3]: it gave class 2 a probability of 0.4, higher than any other class, but it also gave class 4 a fairly high probability of 0.3.
  • Model 2 will use this information to improve its predictions.
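Here is the arithmetic for this exact example, as a small sketch. The label weight of 0.7 is an assumed value for illustration, and soft_cross_entropy is a hypothetical helper showing the kind of loss model 2 would minimize against the blended target.

```python
import numpy as np

label = np.array([0.0, 0.0, 1.0, 0.0, 0.0])  # one-hot label for class 2
oof = np.array([0.1, 0.1, 0.4, 0.1, 0.3])    # model 1's out-of-fold probabilities
label_weight = 0.7                           # assumed coefficient, not from the competition

soft_target = label_weight * label + (1.0 - label_weight) * oof
print(soft_target)  # approximately [0.03 0.03 0.82 0.03 0.09]

def soft_cross_entropy(probs, target):
    """Cross-entropy of predicted probabilities against a soft target distribution."""
    return -np.sum(target * np.log(probs))

# Loss of a hypothetical model 2 prediction against the blended target.
model2_probs = np.array([0.05, 0.05, 0.7, 0.05, 0.15])
print(soft_cross_entropy(model2_probs, soft_target))
```

The blended target still says "class 2", but it no longer insists that class 4 is impossible, which is useful when the original label might be wrong.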

3- Ensemble learning:

Ensemble learning is well known to improve the quality of predictions in general. In the case of noisy datasets it can be very helpful because each model has a different architecture and learns different patterns.

I had been planning to try the Vision Transformer models introduced by Google AI in the paper: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. This competition was the perfect place to try them and learn more about them, since they bring a new approach to computer vision that is different from the convolutional neural networks that dominate the field.

In short, an ensemble of a vision transformer model with 2 different CNN architectures gave better predictions than any of the single models:

A diagram representing a dataset split to 5 folds cross-validation fed to 3 different neural networks: Vision transformer, resnext50_32x4d and efficientnet B3
Figure 6. Ensemble of 3 different models: VitBase-16, resnext50_32x4d and EfficientNet B3. [Image by Author]
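One common way to combine such models is to average their predicted class probabilities and take the most likely class. The sketch below assumes that scheme; the probability values are made up for illustration, and ensemble_predict is a hypothetical helper, not code from my competition pipeline.

```python
import numpy as np

def ensemble_predict(prob_list, weights=None):
    """Average per-model class probabilities, then pick the most likely class."""
    stacked = np.stack(prob_list)                       # (n_models, n_samples, n_classes)
    avg = np.average(stacked, axis=0, weights=weights)  # weights: one per model, or None
    return avg, avg.argmax(axis=1)

# Made-up probabilities for 2 images and 5 classes from 3 models.
vit = np.array([[0.10, 0.10, 0.40, 0.10, 0.30],
                [0.05, 0.05, 0.10, 0.70, 0.10]])
resnext = np.array([[0.05, 0.05, 0.30, 0.10, 0.50],
                    [0.10, 0.10, 0.10, 0.60, 0.10]])
effnet = np.array([[0.10, 0.05, 0.55, 0.10, 0.20],
                   [0.05, 0.05, 0.20, 0.60, 0.10]])

avg, pred = ensemble_predict([vit, resnext, effnet])
print(pred)  # [2 3]
```

In this toy case the second model alone would pick class 4 for the first image, but the other two outvote it. Passing weights lets you trust stronger models more.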

To sum up, you can train machine learning models with mislabeled data by using:

  • The Bi-Tempered loss function and tuning its parameters t1 and t2 correctly.
  • Self Distillation: Train your model, then retrain it using itself as a teacher.
  • Ensemble learning: Ensemble the predictions of different models.

If you would like more details about the model training process, check the summary of my approach on Kaggle.

The charts were made with the app