Missing Data? No Problem!

5 Ways Data Scientists deal with Missing Values.

Comments

Didn’t know about the built in interpolate, back fill, and forward fill functions. Thanks!

michael_bryant
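
For readers who have not seen the three functions this comment mentions, here is a minimal sketch of how they behave on a small gap. The series values are made up for illustration; only `ffill`, `bfill`, and `interpolate` are pandas names.

```python
import numpy as np
import pandas as pd

# Hypothetical series with a two-value gap in the middle.
s = pd.Series([1.0, np.nan, np.nan, 4.0, 5.0])

forward = s.ffill()       # carry the last observed value forward
backward = s.bfill()      # pull the next observed value backward
linear = s.interpolate()  # straight line between the known neighbours

# Show the three results side by side.
print(pd.DataFrame({"raw": s, "ffill": forward, "bfill": backward, "interpolate": linear}))
```

Forward and backward fill repeat a real observation, while interpolation produces values that were never measured, so which one is appropriate depends on the data.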

This is data fabrication, which (at least in physics) is academic malpractice and can make your data seem more precise than it is.

mapo

Please never drop samples with missing values. Dropping them biases your analysis and breaks your pipeline when new data arrives with gaps. Mean/mode imputation is a fast way to handle the problem. With non-linear predictive models, mean/mode imputation works well and doesn't hurt performance.

theaerogr
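
A rough sketch of the mean/mode imputation described above, assuming plain pandas and two made-up frames (`train` and `new`). The fill values are computed once from the training data and reused on new data, which is one way to keep a pipeline from breaking on unseen gaps.

```python
import numpy as np
import pandas as pd

# Hypothetical training data with gaps in a numeric and a categorical column.
train = pd.DataFrame({
    "temp": [20.0, np.nan, 24.0, 22.0],
    "city": ["Oslo", "Rome", None, "Rome"],
})
# Hypothetical new data arriving later, also with gaps.
new = pd.DataFrame({"temp": [np.nan, 30.0], "city": [None, "Oslo"]})

# Learn the fill values from the training data only.
fill_values = {
    "temp": train["temp"].mean(),          # mean for the numeric column
    "city": train["city"].mode().iloc[0],  # most frequent value for the categorical column
}

# Apply the same fill values to both frames.
train_filled = train.fillna(fill_values)
new_filled = new.fillna(fill_values)
print(new_filled)
```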

Excellent. Can't remember the last time I learned so many things in such a short time. 👍

johnmo

If you want to use that to train a neural net or a support vector machine, you may want to add a new column indicating that the original value was missing ...

stevenkovacs
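
The indicator-column idea above could look roughly like this. The column names and values are hypothetical, and the median fill is just one possible choice made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical feature column with missing readings.
df = pd.DataFrame({"humidity": [41.0, np.nan, 39.5, np.nan]})

# Record which rows were missing BEFORE filling, so the model can see it.
df["humidity_missing"] = df["humidity"].isna().astype(int)

# Then fill the gaps (median chosen only for illustration).
df["humidity"] = df["humidity"].fillna(df["humidity"].median())
print(df)
```

The flag has to be created before the fill; afterwards the information about which values were imputed is gone.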

So informative, clear, precise, and easy to understand.

sumangorkhali

These shorts are awesome; thank you for sharing 😀.

BrunetteViking

Actually the best data science channel. These tips help you choose a solution for missing info.

thebreath

Love me a lil casual data manipulation

pbaby

Please don't use NaN and null interchangeably; it will probably confuse people. They are different concepts that have different implications. Null generally means there was an unreported error or there's a serious bug in the program.

NaN generally means either that nothing was collected, or that the value calculated/measured is not expressible as a floating point number. This can suggest a flaw in the math performed on floating point numbers.

Null is only possible with numbers that are stored as integers of some kind, and NaN is only possible with floating point numbers. The closest thing to null that Python has is `None`, but it's not identical.

Rose-eche
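
As a small illustration of the distinction raised above, this sketch shows how NaN and `None` behave in plain Python. It uses only the standard library and does not settle the broader claims about other languages or storage formats.

```python
import math

nan = float("nan")

# NaN is an ordinary float value that compares unequal to everything, itself included.
nan_is_float = isinstance(nan, float)
nan_equals_itself = nan == nan
nan_detected = math.isnan(nan)

# None is a separate singleton object, not a number at all.
none_is_float = isinstance(None, float)

# Arithmetic propagates NaN silently, while None raises immediately.
propagated = nan + 1.0
try:
    None + 1.0
    none_arithmetic_failed = False
except TypeError:
    none_arithmetic_failed = True

print(nan_is_float, nan_equals_itself, nan_detected,
      none_is_float, math.isnan(propagated), none_arithmetic_failed)
```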

This is awesome! I had no idea you could interpolate like that. Thanks Rob!

ezhankhan

I definitely should read the f..ing pandas manual 😅

molmock

Great video! Wish I knew this when I was working with time series data at a past job... those NaNs gave me a couple of headaches for sure 😂

stevecti

Your channel is totally different. You bring to light things that people don't even know exist in many libraries. Big ups to you.

kinghezzy

Side note: blindly filling is NOT acceptable. It is best to estimate the missing values from existing information. Interpolation is a good option.

hdtlab

The true approach:

1. Determine why the data was lost: defective measuring equipment, data corruption, etc. There is no point in keeping data if the measurement is bad.

2. Think thoroughly about the topic/domain of research. For instance, the temperature of what? Air in a specific city? Then you could probably find the missing values online from the forecast. An experiment? You'd probably have to repeat it anyway.

3. If you can't find other sources, only then should you choose these methods. And still think about what is measured and why the data went missing.

repeatbot

Why is having gaps in time data a bad thing if we drop all NaN values? I don't see why this would cause problems. Can anyone please explain?

koen._.

I would just use interpolation, to be honest. I work with sensor data, and the readings rarely jump. Looking through generic data streams and daily records in general, it is usually easy to simply interpolate the data. It is easy, versatile, and works for way more things than just missing data in a graph.

sanfera
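
For sensor data like the comment above describes, timestamps are often unevenly spaced. A minimal sketch, assuming a made-up three-reading series, of how the default interpolation differs from `method="time"` in pandas:

```python
import numpy as np
import pandas as pd

# Hypothetical sensor readings at uneven timestamps (10 min gap, then 30 min gap).
idx = pd.to_datetime(["2024-01-01 00:00", "2024-01-01 00:10", "2024-01-01 00:40"])
s = pd.Series([10.0, np.nan, 14.0], index=idx)

# Default linear interpolation ignores the index and treats rows as evenly spaced.
by_position = s.interpolate()

# method="time" weights the estimate by the elapsed time between readings.
by_time = s.interpolate(method="time")

print(by_position.iloc[1], by_time.iloc[1])
```

Here the missing reading sits a quarter of the way through the 40-minute span, so the time-aware estimate lands closer to the first reading than the position-based one does.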

Can you mention the scenarios in which interpolate is used in real-life projects?

parthibank

In my field of study (thermal engineering), we don't really have any data scientists or even Python gurus, just some script kiddies who can barely copy or understand code given by AI. One master's thesis used AI to determine the best configuration for the required humidity of a room. After collecting data, the student and the supervisor didn't really preprocess it and did the unforgivable: zeroing all the NaNs. When they fed the data into an ANN, it gave the worst results ever, and neither of them knew why, so they ended up dropping the idea of using AI because 'humans can't be replaced yet'.

and_rotate
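
To see why zeroing NaNs can hurt, here is a small sketch with made-up humidity readings (not the thesis data, which is not available). It compares the mean after zero-filling with the mean after linear interpolation.

```python
import numpy as np
import pandas as pd

# Hypothetical relative-humidity readings (%) with two missing values.
humidity = pd.Series([45.0, 46.0, np.nan, np.nan, 49.0, 50.0])

zero_filled = humidity.fillna(0)       # pretends the room hit 0 % humidity twice
interpolated = humidity.interpolate()  # estimates the gap from its neighbours

# Compare the observed mean (NaNs skipped) with the mean after each fill.
print("observed mean:    ", humidity.mean())
print("zero-filled mean: ", zero_filled.mean())
print("interpolated mean:", interpolated.mean())
```

The zero-filled series contains physically implausible readings and a much lower mean, which is the kind of distortion a model would then be trained on.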