How to handle imbalanced datasets in Python

In this video, you will learn how to handle imbalanced datasets, in particular the case where the class labels for your classification model are imbalanced (one class is significantly larger than the other, which gives rise to a majority class and a minority class). Here, we will use the imbalanced-learn Python library to perform random undersampling and random oversampling to address the imbalance.
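As a rough illustration of what the two resampling strategies do, here is a minimal sketch in plain Python so that it runs without extra packages. It is not the code from the video, which uses imbalanced-learn (the library's `RandomUnderSampler` and `RandomOverSampler` classes handle this for you). The helper names and the toy 90/10 dataset are assumptions made up for this example.

```python
import random
from collections import Counter

def group_by_class(rows, labels):
    """Collect rows into a dict keyed by class label."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(label, []).append(row)
    return groups

def random_undersample(rows, labels, seed=0):
    """Randomly drop rows so every class shrinks to the smallest class size."""
    rng = random.Random(seed)
    groups = group_by_class(rows, labels)
    target = min(len(group) for group in groups.values())
    out_rows, out_labels = [], []
    for label, group in groups.items():
        out_rows.extend(rng.sample(group, target))  # sampling without replacement
        out_labels.extend([label] * target)
    return out_rows, out_labels

def random_oversample(rows, labels, seed=0):
    """Randomly duplicate rows so every class grows to the largest class size."""
    rng = random.Random(seed)
    groups = group_by_class(rows, labels)
    target = max(len(group) for group in groups.values())
    out_rows, out_labels = [], []
    for label, group in groups.items():
        # Draw the missing rows with replacement from the same class.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        out_rows.extend(group + extra)
        out_labels.extend([label] * target)
    return out_rows, out_labels

# Hypothetical toy data: 90 majority rows (label 0), 10 minority rows (label 1).
rows = [[i] for i in range(100)]
labels = [0] * 90 + [1] * 10

under_rows, under_labels = random_undersample(rows, labels)
over_rows, over_labels = random_oversample(rows, labels)

print("original:    ", Counter(labels))
print("undersampled:", Counter(under_labels))
print("oversampled: ", Counter(over_labels))
```

Undersampling throws away majority rows, so it loses information; oversampling keeps everything but repeats minority rows, so the model sees exact duplicates. Counting the labels with `Counter` before and after is a quick way to confirm the classes are balanced.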

#python #data #datascience #dataprofessor
Comments

It would have been nice to demonstrate the impact these resampling methods have on the test metrics of some benchmark model (especially one that can use class weights in the loss function). In my experience, resampling can sometimes make a model perform worse and it can be better to use models with class-weighted loss functions.

alexioannides
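The class-weighting alternative raised in this comment can be sketched as follows. This is an illustrative example only, not something shown in the video: it computes per-class weights as `n_samples / (n_classes * class_count)`, which is the heuristic scikit-learn documents for `class_weight="balanced"`, and plugs them into a hand-written weighted log loss. The function names and the toy data are assumptions.

```python
import math
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class inversely to its frequency."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}

def weighted_log_loss(y_true, p_pred, weights):
    """Binary log loss where each sample counts according to its class weight."""
    total = 0.0
    weight_sum = 0.0
    for y, p in zip(y_true, p_pred):
        w = weights[y]
        loss = -math.log(p) if y == 1 else -math.log(1.0 - p)
        total += w * loss
        weight_sum += w
    return total / weight_sum

# Hypothetical toy labels: 90 negatives, 10 positives.
labels = [0] * 90 + [1] * 10
weights = balanced_class_weights(labels)

# A lazy model that predicts a 10% positive probability for every sample.
p_pred = [0.1] * len(labels)
uniform = {0: 1.0, 1: 1.0}

plain_loss = weighted_log_loss(labels, p_pred, uniform)
balanced_loss = weighted_log_loss(labels, p_pred, weights)

print("class weights:", weights)
print("unweighted loss:", plain_loss)
print("class-weighted loss:", balanced_loss)
```

With these weights both classes contribute the same total weight to the loss, so mistakes on the minority class are penalized more heavily without duplicating or discarding any rows.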

Great example. Perhaps you could make another video showing the oversampling on training data. Lots of people (myself included) start doing the oversampling on the whole dataset, which leads to data leakage... which is a mistake.

caioglech
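The leakage point in this comment can be made concrete with a small sketch: split first, then oversample only the training portion, so no duplicated minority row can end up in the test set. This is a plain-Python illustration under assumed toy data, not the workflow from the video; the per-class 80/20 split and the helper name `split_class` are assumptions made for the example.

```python
import random
from collections import Counter

rng = random.Random(0)

# Hypothetical toy data as (features, label) pairs; the feature doubles as a row id.
majority = [([i], 0) for i in range(90)]
minority = [([i], 1) for i in range(90, 100)]

def split_class(group, train_percent=80):
    """Shuffle one class and cut it into train and test parts."""
    shuffled = group[:]
    rng.shuffle(shuffled)
    cut = len(shuffled) * train_percent // 100
    return shuffled[:cut], shuffled[cut:]

# Step 1: split BEFORE any resampling.
train_major, test_major = split_class(majority)
train_minor, test_minor = split_class(minority)
test = test_major + test_minor

# Step 2: oversample the minority class using training rows only.
n_extra = len(train_major) - len(train_minor)
extra = [rng.choice(train_minor) for _ in range(n_extra)]
train = train_major + train_minor + extra

train_ids = {row[0] for row, _ in train}
test_ids = {row[0] for row, _ in test}

print("train:", Counter(label for _, label in train))  # balanced
print("test: ", Counter(label for _, label in test))   # left imbalanced on purpose
print("rows shared between train and test:", train_ids & test_ids)
```

The test set keeps its original class ratio, which is what the model will face in practice. Oversampling the whole dataset before splitting would instead place copies of the same minority row on both sides of the split.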

Great job, professor! Thank you so much for this clear video. By the way, do you think that after applying oversampling, for example, and training a model (like XGBoost) on the data, it would be worth using the Matthews Correlation Coefficient as a KPI to measure the efficiency of the model? Or do you think it is not necessary? Thank you 🙏🏽

samuelbaba
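For readers unfamiliar with the Matthews Correlation Coefficient mentioned here, the following is a minimal sketch of how it is computed from a binary confusion matrix, and why it can be more informative than accuracy on imbalanced data. The confusion-matrix counts are made-up numbers for illustration; in practice scikit-learn provides `sklearn.metrics.matthews_corrcoef`.

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from binary confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # common convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denom

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

# Hypothetical test set with 90 negatives and 10 positives.
# Model A always predicts the majority class.
always_majority = dict(tp=0, tn=90, fp=0, fn=10)
# Model B finds 8 of the 10 positives at the cost of 10 false alarms.
finds_minority = dict(tp=8, tn=80, fp=10, fn=2)

for name, cm in [("always majority", always_majority), ("finds minority", finds_minority)]:
    print(name, "accuracy:", accuracy(**cm), "MCC:", round(mcc(**cm), 3))
```

Model A has the higher accuracy even though it never detects the minority class, while its MCC is 0. Model B has slightly lower accuracy but a clearly positive MCC.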

Thanks for this quick guide to overcoming the imbalance issue. I'd like to know: before applying these oversampling or undersampling techniques, do I need to standardize my dataset, or can I go with the dataset in its original form?

gunjankumar

It's helpful for me and many more. Great tutorial, Chanin. Thank you so much for sharing with us.

thinamG

Thank you so so much. This is something that I have been looking for. I struggled with this step in R for many months. I understand that randomly sampling the overweight samples to mix with the underweight samples just once, and then developing the model, would create a poor model. So my questions are: 1. How many times should I randomly sample? 2. Does the distribution of both the overweight and underweight samples affect the number of times we have to sample? Could you please share your thoughts?

minicorefacility

Hi professor, I am trying to do binary classification on advertising conversions using a Markov chain, but I'm not sure how I should implement it. Do you have any suggestions on this?

joeyng

I've been following your channel since the collab with Ken Jee without realizing your name. Now you're inspiring me to pursue Data science even more! Thank you krub Ajarn Chanin! 🙏😂

sericthueksuban

Great tutorial, sir. After you split the data into X and Y and perform the resampling, how can you concatenate them again later?

ahmedjamel

Thanks a lot. Very precise and easy to understand.

ifeanyiedward

This is a clear and simple guide to get started, thanks for sharing! About your last question, I am curious what would be your answer, which approach do you prefer from your experience?

michellpayano

Great tutorial as usual. Thanks for sharing, Professor!

Ghasforing

Ooo awesome tutorial! Love how clear it is

TinaHuang

What are the side effects of using synthetic data to handle the imbalance when building models? And if we have a lot of data, should we oversample or undersample? Thank you, prof.

allanmarzuki

Awesome, you explained every line of code. Very helpful for a novice in understanding the ipynb.

aashishmalhotra

I think there are some scenarios where we can use this technique differently. Can you tell us the different scenarios where we can perform oversampling, undersampling, or random sampling?

mukeshkund

Can you explain how logistic regression behaves with an imbalanced dataset?

aashishmalhotra

Prof, thank you for the nice video. But I want to ask: how do I show the balanced data after applying SMOTE?

farahilyana

Why do undersampling instead of slicing the dataset to take the same number of results from each class?

eduardodimperio

In my data science course, we used the stratify parameter of train_test_split() from sklearn. How do the two approaches differ?

hubbiemid