Azure Machine Learning. Classifier with extremely imbalanced dataset.

Question

Hi All,

I am working on a binary classifier. The challenge I am facing is how imbalanced my dataset is. Only 2% of the rows belong to Category A (positive), whereas the remaining 98% belong to Category B (negative).

High accuracy does not mean anything in this situation. A model that always predicts negative will be right almost every single time.
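To make that concrete, here is a small sketch with made-up numbers (1,000 rows at the same 2% / 98% ratio, not my real data). It scores a model that always predicts negative:

```python
# Hypothetical dataset: 1,000 rows, 2% positive (Category A), 98% negative (Category B).
y_true = [1] * 20 + [0] * 980

# A "classifier" that always predicts the negative class.
y_pred = [0] * len(y_true)

true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
false_neg = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)

accuracy = correct / len(y_true)
recall = true_pos / (true_pos + false_neg)  # share of real positives that were found

print(f"accuracy = {accuracy:.2f}")  # accuracy = 0.98
print(f"recall   = {recall:.2f}")    # recall   = 0.00
```

Accuracy is 98% even though not a single positive is found, which is why I am looking at recall and precision on the positive class instead.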

As you may imagine, what I care about most is correctly identifying the true positives.

I have tried using SMOTE in Azure Machine Learning Designer, but the results I get are quite poor. To avoid any data leakage, I apply SMOTE after splitting the data. If I apply SMOTE before splitting the data I get better results, but these are probably inflated by data leakage.

I have tried Automated ML hoping it would somehow deal with the imbalanced dataset, but it does not. It just gives me a warning message about class imbalance. The results are pretty bad too.

I was wondering if I could apply SMOTE, and possibly Tomek Links too, but only to the training dataset used by Automated ML.

I imagine I would need to go to notebooks, copy the code of the best-ranked automated model, and tweak it so that those transformations are applied to the training part only (if this is possible). The drawback is that I would no longer be using the power of Automated ML after transforming the data; I would be using just a single model.

Any thoughts and guidance are very welcome.

Thank you

Answer

This article may help you:

https://learn.microsoft.com/en-us/azure/machine-learning/concept-manage-ml-pitfalls?view=azureml-api-2
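On the leakage concern in the question, the split-then-resample order can be sketched roughly as below. This is only an illustration in plain Python on made-up data: `split_stratified` and `smote_like` are hypothetical helpers written for this example, and `smote_like` is a simplified stand-in for SMOTE (interpolating between a minority point and one of its nearest minority neighbours), not the Designer component or a library implementation.

```python
import random

def split_stratified(rows, labels, test_fraction, rng):
    """Shuffle each class separately and hold out test_fraction of it."""
    train, test = [], []
    for cls in sorted(set(labels)):
        members = [(x, y) for x, y in zip(rows, labels) if y == cls]
        rng.shuffle(members)
        n_test = max(1, round(len(members) * test_fraction))
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test

def smote_like(minority, n_new, k, rng):
    """Create n_new synthetic points by interpolating between a minority
    point and one of its k nearest minority neighbours (the core SMOTE idea)."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to other
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, other)))
    return synthetic

rng = random.Random(0)

# Made-up data: 20 positives around (2, 2), 980 negatives around (0, 0).
rows = [(rng.gauss(2, 1), rng.gauss(2, 1)) for _ in range(20)]
rows += [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(980)]
labels = [1] * 20 + [0] * 980

# 1. Split first, so the test set contains only real rows.
train, test = split_stratified(rows, labels, test_fraction=0.25, rng=rng)

# 2. Oversample the minority class in the training set only.
train_pos = [x for x, y in train if y == 1]
train_neg = [x for x, y in train if y == 0]
new_pos = smote_like(train_pos, n_new=len(train_neg) - len(train_pos), k=5, rng=rng)
train_balanced = train + [(x, 1) for x in new_pos]

print(sum(y for _, y in train_balanced), len(train_neg))  # positives vs negatives in training
print(sum(y for _, y in test), len(test))                 # positives vs total rows in test
```

The test set is never resampled, so it keeps the original class ratio and the metrics measured on it are not inflated by synthetic rows. In a notebook, the imbalanced-learn package provides `SMOTE` and `SMOTETomek`, which would normally replace the hand-written helper here; whether that resampled training set can then be passed back to Automated ML depends on your setup, so treat this as a starting point.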
