Handling Missing Data Easily Explained | Machine Learning

Data can have missing values for a number of reasons such as observations that were not recorded and data corruption.

Handling missing data is important as many machine learning algorithms do not support data with missing values.

In this tutorial, you will discover how to handle missing data for machine learning with Python.

Specifically, after completing this tutorial you will know:

How to mark invalid or corrupt values as missing in your dataset.
How to remove rows with missing data from your dataset.
How to impute missing values with mean values in your dataset.
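The three steps above can be sketched in pandas; the column names and the use of 0 as an invalid placeholder are hypothetical, chosen only for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical dataset: 0 is an invalid placeholder in 'age'.
df = pd.DataFrame({
    "age": [22.0, 0.0, 35.0, 28.0, 0.0],
    "fare": [7.25, 71.28, 8.05, np.nan, 13.0],
})

# 1. Mark invalid or corrupt values as missing (NaN).
df["age"] = df["age"].replace(0.0, np.nan)

# 2. Remove rows with missing data...
dropped = df.dropna()

# 3. ...or impute missing values with the column mean instead.
imputed = df.fillna(df.mean())
```

`dropna` discards every row containing at least one NaN, while `fillna(df.mean())` fills each column with its own mean computed over the observed values.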

You can buy my book, where I have provided a detailed explanation of how we can use Machine Learning and Deep Learning in Finance using Python.

Comments

Your channel is awesome, please keep going! Can't tell you how valuable your videos are when starting to learn!

stevechops

Honestly, I really love your videos, simple and easy to understand. Always answering my machine learning and data science questions! I do have one question, though. I watched your video on standardisation and normalisation. I am trying to build a benchmark/index; would it be okay to standardize the data before creating it?

dishydez

Thank you Krish sir. I was following the Kaggle Learn course on machine learning but couldn't understand this topic even after so much hard work; now it's all clear. Keep it up.

raunasur

Thanks Krish. I can't think of an easier explanation of a tricky topic!!! Simply superb!!!👍

equiwave

Today I started working on the Titanic data. I tried to predict the missing age values but failed and was very tense. So I started watching your video in the hope of finding a way. When you opened the notebook I felt such relief: 'now it will surely work out'. Thank you for making this video.

himalayasinghsheoran

Your explanation is amazing and you're perfect as usual.

hanman

I think there is a quantitative justification for why we should fill the NaN values in 'Age' with the median grouped by 'Sex' and 'Pclass'. In the EDA step, we can print or visualize a heatmap of the correlations between columns (dataset.corr().abs()). We can see that the 'Age' column has a relatively high correlation with the 'Sex' and 'Pclass' columns.
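The approach this comment describes can be sketched in pandas; the tiny DataFrame and the numeric encoding of 'Sex' (needed for it to appear in `corr()`) are assumptions for illustration:

```python
import numpy as np
import pandas as pd

# Tiny Titanic-style sample; 'Sex' is numerically encoded.
df = pd.DataFrame({
    "Sex":    [0, 0, 1, 1, 0, 1],
    "Pclass": [1, 3, 1, 3, 1, 3],
    "Age":    [40.0, 22.0, 35.0, np.nan, np.nan, 20.0],
})

# Inspect absolute correlations to justify the grouping columns.
print(df.corr().abs())

# Fill each missing 'Age' with the median of its ('Sex', 'Pclass') group.
df["Age"] = df.groupby(["Sex", "Pclass"])["Age"].transform(
    lambda s: s.fillna(s.median())
)
```

`transform` keeps the original index, so each NaN is replaced by the median of only those passengers sharing its sex and class, rather than one global median.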

radifantaufik

Nice explanation. The conclusion depends on your end goal and on whether dropping rows or replacing with the mean will affect your analysis; in his example he needed the age but didn't need the cabin.

strangereview

Hi Krish,


Your videos are quite useful and simple to understand. My request: could you create a video on how to deploy an ML model with Flask? That would be very useful.

dilipgawade

I thought that you would also implement a regression model for synthetic imputation. But the content is great!!

finance_tamil

Thank you for making life so much easier for us!

aimenbaig

Cleared all my doubts! Great..Thank you so much!!

bhaktibailurkar

Thank you Krish, you have explained the second option very well. Wondering how we do this for categorical columns, and when values are missing from multiple fields.

amarendrakolukula

Thanks a lot for sharing your knowledge with us. Kindly address one confusion: do we need to impute missing values in the test dataset the same way you taught in the video?

gaziya

Thanks a lot for the detailed explanation. It really helps.

coolsun-lifestyle

Well, I appreciate the video that Mr. Krish Naik made, I love his videos, and I really want to discuss how we can handle missing values. Using a separate model on the complete rows to learn the relation between variables is not great, though, because the imputed value, being generated by machine learning, is not real data and may be statistically far from the center of the population, since it comes from another equation. I would rather use a statistical method like mean, median, or mode and, I don't know whether this will work or not, check the range of the population mean and make sure the imputed value does not go far from it.
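The sanity check this comment proposes can be sketched as: fill with a simple statistic, then verify that the fill value sits near the mean of the observed data. The synthetic data and the lenient one-standard-deviation tolerance below are arbitrary choices for illustration:

```python
import numpy as np
import pandas as pd

# Synthetic 'age' column with some values knocked out.
rng = np.random.default_rng(0)
age = pd.Series(rng.normal(30.0, 5.0, 500))
age.iloc[:25] = np.nan

# Statistical imputation: fill with the median of the observed values.
fill_value = age.median()
imputed = age.fillna(fill_value)

# Sanity check: the fill value should not sit far from the observed
# mean; here "far" is (arbitrarily) one standard deviation.
observed = age.dropna()
assert abs(fill_value - observed.mean()) < observed.std()
```

A model-generated imputation could be screened the same way before being accepted, which is the check the commenter is asking for.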

channelfisikaasik

Beautifully explained, with great detail!

kukulaarohi

Great way of explaining things. I like it very much.

konradpyrz

Thanks for the video. You said that option 2 (model-based imputation) is less preferred for huge datasets; does that mean that in general it is good to go with statistics-based imputation over model-based imputation on real-world datasets, since we get a lot of data in the real world? I am working on the Home-Credit-Default-Risk Kaggle competition dataset; I'd appreciate your comment on which imputation method to use.
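For context, the two options this comment contrasts can be sketched side by side. The toy data and the choice of a simple linear fit via `np.polyfit` are assumptions for illustration, not the video's exact code:

```python
import numpy as np
import pandas as pd

# Toy data with a roughly linear Age~Fare relationship (an assumption).
df = pd.DataFrame({
    "Fare": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
    "Age":  [20.0, 25.0, 30.0, np.nan, 40.0, np.nan],
})

# Option 1 - statistical imputation: one cheap pass, scales easily.
stat = df["Age"].fillna(df["Age"].median())

# Option 2 - model-based imputation: fit on complete rows, predict the rest.
known = df.dropna(subset=["Age"])
slope, intercept = np.polyfit(known["Fare"], known["Age"], deg=1)
model = df["Age"].fillna(slope * df["Fare"] + intercept)
```

The statistical fill gives every missing row the same value, while the model-based fill varies with `Fare`; the trade-off is that fitting and predicting costs more on huge datasets.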

gopie

A really good idea to create a separate model, thanks for sharing.

saurabhtripathi