Dealing with Missing Data in R

Data imputation is a technique for replacing missing values with plausible substitutes, ideally without distorting the trends in the analysis. It can be done in many ways. R has many packages that make imputation easy, as long as you understand the method you want and why you are using it. In this video I want to showcase how you can use the mice package to easily replace missing values in a matrix and how you can compare the performance of each algorithm using ggplot2.
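As a rough illustration of what comparing the performance of an imputation method can mean, here is a minimal sketch. It is not the R code from the video: it is plain Python with made-up numbers, and the names `mean_impute` and `rmse_on_missing` are hypothetical helpers introduced only for this example. The idea is to hide values whose truth is known, impute them, and score the result on the hidden positions only.

```python
import math

# Hypothetical toy data: the true values, used only to score the imputation.
truth = [2.0, 4.0, 6.0, 8.0, 10.0]
# The same series with two entries hidden (None marks a missing value).
observed = [2.0, None, 6.0, None, 10.0]

def mean_impute(values):
    """Replace each missing entry with the mean of the observed entries."""
    present = [v for v in values if v is not None]
    fill = sum(present) / len(present)
    return [fill if v is None else v for v in values]

def rmse_on_missing(truth, observed, imputed):
    """Root-mean-square error over the positions that were missing."""
    errors = [
        (t - i) ** 2
        for t, o, i in zip(truth, observed, imputed)
        if o is None
    ]
    return math.sqrt(sum(errors) / len(errors))

imputed = mean_impute(observed)
score = rmse_on_missing(truth, observed, imputed)
```

The same masking-and-scoring loop could be repeated for each imputation method, giving one number per method to compare alongside any plots.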

Slides

Github

Chapters
0:00 Introduction
1:05 What's imputation
1:45 Types of missing data
3:22 Measuring success
3:55 A number of different imputation techniques
9:05 R Script: introduction to the Rmd format
10:06 Mean Imputation
11:40 LOCF and NOCB
14:36 kNN and kNN imputation
19:00 Advanced imputation with mice()
23:00 How did pmm and rf perform?
25:07 TCGA data Imputation
30:13 Effectiveness of Imputation
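
For readers unfamiliar with the two abbreviations in the chapter list, LOCF means "last observation carried forward" and NOCB means "next observation carried backward". The sketch below is a plain Python illustration with made-up numbers, not the R functions used in the video; the function names `locf` and `nocb` here are hypothetical helpers written for this example.

```python
def locf(values):
    """Last observation carried forward; leading gaps stay missing."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v  # remember the most recent observed value
        filled.append(last)
    return filled

def nocb(values):
    """Next observation carried backward: LOCF applied to the reversed series."""
    return locf(values[::-1])[::-1]

# None marks a missing value.
series = [None, 1.0, None, None, 4.0, None]
forward = locf(series)
backward = nocb(series)
```

Note that LOCF cannot fill a gap at the start of the series and NOCB cannot fill one at the end, since there is no observation to carry in that direction.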
Comments

Fantastic video! Really really helpful and informative! I recommend! Thanks for your video!

helenahh

Woow.. This is wonderful.. Thank you for creating and sharing informative videos

mangalahegde

Thanks for this thorough demonstration! I wonder what you think about what percentage of missing values is acceptable for imputation. The number of available complete cases might also be important. E.g. if I have 3.000 complete cases, is it okay to impute 12.000 missing values in the other cases? Information on these considerations is rarely to be found.

Philantrope

Thank you for your informative video! At 15:03, could you explain why the data need to be normalised before applying kNN imputation? What would the consequences be if the actual values were used for kNN imputation directly? Also, are there quantitative methods that could be used to assess the accuracy of the imputation, rather than visualisation? My data contain more than three thousand rows, so it is hard to assess the accuracy using the three types of plot described in the video.

LeviRafal

I have panel data from 2000 to 2019 with health indices as predictors, and data for these indices are missing for some years because of how often the data are reported. What type of missingness is that?

elizabethnalule

It would be nice to know where some of the functions you are using are coming from (without having to visit github). I cannot find locf, nobc or forbak in nomemica. I checked the zoo package. It does not have those but similar ones (na.locf for both LOCF and NOBC).

haraldurkarlsson

Nice presentation. However, I find it difficult to find a good account of the difference between the classes of missingness (MCAR, MAR, MNAR). After reading the descriptions of these classes by different YouTubers, I am just left at a loss. Perhaps no one can explain these things?

haraldurkarlsson

How do you check the quality of the imputation with mice?

abdulbouraa