Machine Learning Classification How to Deal with Imbalanced Data ❌ Practical ML Project with Python


Microsoft Azure Certified:

Databricks Certified:

---

---

COURSERA SPECIALIZATIONS:

COURSES:

LEARN PYTHON:

LEARN SQL:

LEARN STATISTICS:

LEARN MACHINE LEARNING:

---

For business enquiries please connect with me on LinkedIn or book a call:

Disclaimer: I may earn a commission if you decide to use the links above. Thank you for supporting the channel!

#DecisionForest
Comments

I am not sure this is the best way to deal with data imbalance, and it won't work in a real case. You used SMOTE to balance the dataset and then drew your test set from the oversampled data, which is synthetic. To make sure your model is working well, you have to hold out part of the original imbalanced dataset as your test set and then apply SMOTE only to the rest. That way your test set is a faithful representation of the original data. I am sure your F1-score will then be very small. Some of the best methods are One-Class Support Vector Machines (OCSVM), Generalized One-class Discriminative Subspaces (GODS), One-Class CNNs (OCCNN) and Deep SVDD (DSVDD).

amansamsonmogos
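
A minimal sketch of the split-first workflow this comment describes, assuming scikit-learn and imbalanced-learn, with X and y standing in for the original imbalanced features and labels (hypothetical names, not from the video):

# X, y: the original imbalanced features and labels (assumed to exist already).
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from imblearn.over_sampling import SMOTE

# Hold out a test set that keeps the original class ratios.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Oversample the training split only; the test split stays untouched.
X_train_res, y_train_res = SMOTE(random_state=42).fit_resample(X_train, y_train)

clf = RandomForestClassifier(random_state=42).fit(X_train_res, y_train_res)
print(f1_score(y_test, clf.predict(X_test), average="macro"))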

Hello, thanks for the video.

However, I noticed that you did SMOTE before running the train/test split. I am afraid this might be what is causing the results to improve so drastically, since the upsampled observations from the minority class may have leaked into the test dataset. So basically your model trained and tested on pretty much the same data, which inflated the results.

Let me know what you think.

ammarkamran
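
One way to rule out the leakage described above is to put the resampling step inside an imbalanced-learn Pipeline, so SMOTE is refit on the training portion of each cross-validation fold only. A rough sketch, again assuming hypothetical X and y holding the original imbalanced data:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline  # imblearn's Pipeline accepts samplers

# SMOTE is applied inside each training fold; validation folds keep their imbalance.
pipe = Pipeline([
    ("smote", SMOTE(random_state=42)),
    ("model", LogisticRegression(max_iter=1000)),
])
scores = cross_val_score(pipe, X, y, cv=5, scoring="f1_macro")
print(scores.mean())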

You do realize that in your pipeline, once you run the oversample step, you have 6 perfectly balanced groups with 900 samples in each group. There's no real majority class to sample from. When you then undersample from a perfectly balanced dataset, it appears to leave one group intact and resample the others. If you plot the data, it will look essentially the same as before, when you only oversampled, with some samples missing and other samples duplicated. The scores will be similar as well.

philwebb
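
A quick illustration of that point, assuming hypothetical X and y with six classes as in the example above: with the default sampling_strategy, the undersampler's target is already satisfied once SMOTE has balanced everything, so the class counts do not change.

from collections import Counter
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

X_over, y_over = SMOTE(random_state=42).fit_resample(X, y)
print(Counter(y_over))    # every class at the same count, e.g. 900 each

# The default strategy undersamples only classes larger than the minority;
# after SMOTE there are none, so the counts stay the same.
X_both, y_both = RandomUnderSampler(random_state=42).fit_resample(X_over, y_over)
print(Counter(y_both))    # unchanged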

Can you suggest any techniques for handling an imbalanced image dataset?
Thank you.

nickpgr

In order to truly evaluate, you need to test on an IMBALANCED test set. :) You can train on a balanced training set, but the hold-out set needs to keep the true imbalance, because in the real world the data you encounter will have the same imbalance, and that's what your performance metric needs to measure: how well you score on unseen imbalanced data.

TrainingDay
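
Building on that, one way to report performance on the untouched, imbalanced hold-out is with per-class metrics rather than plain accuracy. A sketch that reuses clf, X_test and y_test from the split-then-SMOTE sketch earlier in the thread (all hypothetical names):

from sklearn.metrics import balanced_accuracy_score, classification_report

# X_test / y_test keep the real class ratios; nothing synthetic in here.
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))       # per-class precision/recall/F1
print(balanced_accuracy_score(y_test, y_pred))     # recall averaged over classes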

Hi, thanks for the content!

I am confused: instead of applying this method to the y variable, can I apply this technique to imbalanced predictors whose levels have large differences in sample size?

For example, class A: 900, class B: 100, class C: 2.

Thanks!

kar

Unfortunately, your website link and notebook link are not available here. Any suggestions?

subhajit

When should we use under-sampling? As I see it, there's a potential risk of losing information.

rahuldey

I don't understand how you apply under- and oversampling at the same time. One of them will balance the data, and the other one has nothing left to do...

Mustistics
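
Combining the two usually only makes sense with explicit targets, so that each step has something to do: oversample the minority part of the way, then trim the majority toward it. A sketch assuming a binary problem (float ratios only work for two classes) and hypothetical X, y:

from collections import Counter
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# Oversample the minority up to 50% of the majority size...
X_over, y_over = SMOTE(sampling_strategy=0.5, random_state=42).fit_resample(X, y)
# ...then undersample the majority until the minority:majority ratio is 0.8.
X_res, y_res = RandomUnderSampler(sampling_strategy=0.8, random_state=42).fit_resample(X_over, y_over)

print(Counter(y), Counter(y_over), Counter(y_res))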

It is not clear, sir, and I have a question: what technique did you use to address the class imbalance problem?

wliiliammitiku

You said this deals with 'multi-class classification problems'. But what if we have imbalanced data in a binary classification problem?

titow
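
SMOTE and the other samplers handle the binary case in the same way, and class weighting is a resampling-free alternative. A sketch assuming hypothetical X_train, y_train with a binary target:

from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import SMOTE

# SMOTE oversamples whichever class is the minority, whether there are two classes or many.
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)

# Resampling-free alternative: penalise mistakes on the rare class more heavily.
clf = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_train, y_train)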

Please, how can we get the Jupyter notebook code?

tahirullah

Thanks for this share.
Could you please send me this code?
I need it.

mahdimed

Excuse me, is there any way to find the original Notebook file? Can't open the one in the description. Thank you.

itstoufique

Are you sure the undersampling method works? The numbers are still the same, 900 per class.

farisocta