Azure Machine Learning. Classifier with extremely imbalanced dataset.

Ivan Casana-Gallen 26 Reputation points
2024-10-04T15:28:44.2766667+00:00

Hi All,

I am working on a binary classifier. The challenge I am facing is how imbalanced my dataset is: only 2% of the rows belong to Category A (positive), whereas the remaining 98% belong to Category B (negative).

High accuracy does not mean anything in this case. A model that always predicts negative would be right almost every single time.
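To make the point concrete, here is a minimal sketch in plain Python. The 1,000-row dataset and the always-negative "model" are assumptions for illustration only, chosen to match the 2% / 98% split described above.

```python
# Hypothetical data: 1000 rows, 2% positive (Category A), 98% negative (Category B).
y_true = [1] * 20 + [0] * 980

# Baseline "model" that always predicts negative.
y_pred = [0] * len(y_true)

# Confusion-matrix counts.
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

accuracy = (tp + tn) / len(y_true)
# Guard against division by zero when nothing is predicted positive.
recall = tp / (tp + fn) if (tp + fn) else 0.0
precision = tp / (tp + fp) if (tp + fp) else 0.0

print(f"accuracy={accuracy:.2f} recall={recall:.2f} precision={precision:.2f}")
# prints: accuracy=0.98 recall=0.00 precision=0.00
```

Accuracy is 98% even though not a single positive is found, which is why recall and precision on the positive class are the numbers worth tracking here.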

As you may imagine, I am mostly interested in true positives.

I have tried SMOTE in Azure Machine Learning Designer, but the results I get are quite poor. To avoid any data leakage, I apply SMOTE after splitting the data. If I apply SMOTE before splitting the data, I get better results, but these are probably inflated by data leakage.
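For reference, the "split first, then oversample" order can be sketched in plain Python as below. This is not what the Designer component does internally: `stratified_split` and `smote_like` are hypothetical helpers, the data is randomly generated, and the interpolation is a simplified SMOTE-style step rather than the full algorithm.

```python
import random

def stratified_split(labels, test_fraction, rng):
    """Split indices per class so both parts keep the original class ratio."""
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = max(1, round(len(idx) * test_fraction))
        test_idx += idx[:n_test]
        train_idx += idx[n_test:]
    return train_idx, test_idx

def smote_like(minority, n_new, k, rng):
    """Create n_new synthetic points by interpolating between a minority
    point and one of its k nearest minority neighbours (simplified SMOTE)."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()
        synthetic.append([a + gap * (b - a) for a, b in zip(base, other)])
    return synthetic

rng = random.Random(0)
# Hypothetical 2-feature data: 20 positives (2%) and 980 negatives (98%).
X = [[rng.gauss(2, 1), rng.gauss(2, 1)] for _ in range(20)] + \
    [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(980)]
y = [1] * 20 + [0] * 980

# 1) Split first, so the test part contains only original rows.
train_idx, test_idx = stratified_split(y, test_fraction=0.25, rng=rng)
X_train = [X[i] for i in train_idx]
y_train = [y[i] for i in train_idx]
X_test = [X[i] for i in test_idx]
y_test = [y[i] for i in test_idx]

# 2) Oversample the minority class in the training part only.
minority = [x for x, label in zip(X_train, y_train) if label == 1]
n_new = y_train.count(0) - y_train.count(1)
X_train_res = X_train + smote_like(minority, n_new, k=5, rng=rng)
y_train_res = y_train + [1] * n_new
```

With these assumed sizes the training part ends up balanced at 735 rows per class, while the test part keeps its original 5 positives out of 250 rows. Oversampling before the split would instead let synthetic points built from test-set neighbours leak into training.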

I have tried Automated ML hoping it would somehow deal with the imbalanced dataset, but it does not. It just shows me a warning message about class imbalance. The results are pretty bad too.

I was wondering if I could apply SMOTE, and possibly Tomek links too, but only to the training dataset for Automated ML.
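As a rough illustration of the Tomek links step, here is a minimal plain-Python sketch. `nearest` and `remove_tomek_links` are hypothetical helpers, the 1-D data is made up, and the brute-force neighbour search is only suitable for small training sets. It is applied to the training rows only.

```python
def nearest(i, points):
    """Index of the nearest other point to points[i] (squared Euclidean)."""
    return min(
        (j for j in range(len(points)) if j != i),
        key=lambda j: sum((a - b) ** 2 for a, b in zip(points[i], points[j])),
    )

def remove_tomek_links(X, y, majority_label=0):
    """Drop majority-class members of Tomek links: pairs of opposite-class
    points that are each other's nearest neighbour."""
    nn = [nearest(i, X) for i in range(len(X))]
    drop = {
        i for i in range(len(X))
        if y[i] == majority_label and y[nn[i]] != y[i] and nn[nn[i]] == i
    }
    keep = [i for i in range(len(X)) if i not in drop]
    return [X[i] for i in keep], [y[i] for i in keep]

# Hypothetical 1-D training data (already split off; test rows are not touched).
X_train = [[0.0], [1.0], [4.0], [4.4], [8.0], [9.0]]
y_train = [0, 0, 0, 1, 1, 1]

X_clean, y_clean = remove_tomek_links(X_train, y_train)
print(X_clean, y_clean)
# prints: [[0.0], [1.0], [4.4], [8.0], [9.0]] [0, 0, 1, 1, 1]
```

The negative row at 4.0 and the positive row at 4.4 are mutual nearest neighbours from opposite classes, so the negative one is removed and the boundary around the positive row gets cleaner.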

I imagine I would need to go to notebooks, copy the code of the best-ranked automated model, and tweak it so that those transformations are applied to the training part only (if this is possible). However, by doing this I would no longer be using the power of Automated ML after transforming the data; I would just be training a single model.

Any thoughts and guidance are very welcome.

Thank you

Azure Machine Learning

1 answer

  1. Vahid Ghafarpour 21,725 Reputation points
    2024-10-04T15:58:42.82+00:00
