R Tutorial: Handling missing data


---

In this video we are going to answer two fundamental questions around dealing with missing values in your data.

An important question is what to do with your missing values.

First, explore your dataset, identify its missingness and visualize it.

Then talk to your client about these missing values and see if there is any business rationale that can be applied to deal with them.

In general, there are three main avenues:
* Ignore them by discarding samples/variables with high levels of missingness. Use this option sparingly as this often leads to loss of valuable data.
* Impute them, i.e., replace them with other (hopefully more meaningful) values.
* Accept them and proceed to choose methods that naturally deal with these missing values. Unfortunately, not many methods can do that.

The strategy depends on the type of missingness you have.

The naniar package provides many useful functions to identify, visualize and deal with missing data. any_na will tell you whether there are any missing values in your dataframe.
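A minimal sketch, using R's built-in airquality dataset (not the dataset from the video), which has missing values in its Ozone and Solar.R columns:

```r
library(naniar)

any_na(airquality)   # TRUE: at least one value is missing
n_miss(airquality)   # 44: total number of missing cells (37 in Ozone, 7 in Solar.R)
```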

Sometimes you need to manually replace one or more missing data symbols with NAs, as done here.
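naniar's replace_with_na handles this recoding; the sentinel codes below ("N/A" and -99) and the tiny data frame are made-up examples:

```r
library(naniar)
library(tibble)

# Hypothetical data frame where missing values were coded as "N/A" and -99
df <- tibble(
  name  = c("ana", "N/A", "luis"),
  score = c(10, -99, 8)
)

# Turn each sentinel code into a proper NA, per column
df_clean <- replace_with_na(df, replace = list(name = "N/A", score = -99))
```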

You can easily summarize the level of missingness across variables and instances in your dataset. Here we see that 5 out of 6 variables have missing values.
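With naniar these summaries are one-liners. Shown here on airquality (where only 2 of 6 variables have missing values, unlike the video's dataset):

```r
library(naniar)

miss_var_summary(airquality)    # per-variable: n_miss and pct_miss
miss_case_summary(airquality)   # per-row (case): number and % of missing values
```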

Visualizing the missing values in a dataframe is very easy. Just invoke the vis_miss function. You can optionally arrange the rows according to their missingness with cluster=TRUE.

The gg_miss_case function displays missingness at the row (case) level. In this example, only a very small fraction of the observations have two or more missing values.
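A sketch of both plots on airquality; each call returns a ggplot object:

```r
library(naniar)

vis_miss(airquality)                  # cell-level heatmap of missing vs present
vis_miss(airquality, cluster = TRUE)  # rows reordered by missingness pattern
gg_miss_case(airquality)              # one bar per row: missing values per case
```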

There are three types of missing data:

* Missing Completely at Random (MCAR)
* Missing at Random (MAR)
* Missing Not At Random (MNAR)

To understand these patterns of missingness better, check Chapter 2 of this DataCamp course.

Each missingness type has its own implications when it comes to performing imputation or deletion, as either can introduce bias into the data. There are also some visual cues related to the missingness clusters that may help identify the type of missingness in the data, although these cues are not foolproof.

Evaluating the quality of imputation is also an important issue. You can do that in two different ways.

An external evaluation relies on building a Machine Learning model from the imputed dataset, then assessing its performance as a function of the imputation method alone. All else being equal, this should give us a good indication of how beneficial that imputation was to our entire ML pipeline.

An internal evaluation compares the distributions of the variables before and after the imputation in terms of their mean, variance and scale. Ideally, you want an imputation model that does not drastically change the distribution of the imputed variables or their relationships. Big changes in these indicators could signal a problem with the imputation.
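A minimal internal check on airquality, comparing the mean and variance of Ozone before and after a linear-model imputation (here via simputation's impute_lm; the predictors Wind and Temp are an illustrative choice, and any imputation model could be substituted):

```r
library(simputation)

aq_imp <- impute_lm(airquality, Ozone ~ Wind + Temp)

# Large shifts in these summaries would flag a problematic imputation
c(before = mean(airquality$Ozone, na.rm = TRUE), after = mean(aq_imp$Ozone))
c(before = var(airquality$Ozone,  na.rm = TRUE), after = var(aq_imp$Ozone))
```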

naniar makes it easy to construct a shadow matrix. This is a data structure with new columns named after the original columns but with _NA appended to them. These extra columns indicate whether each value was missing, which makes it very easy to track imputed values.
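A minimal sketch with naniar's bind_shadow, again on airquality:

```r
library(naniar)

aq_shadow <- bind_shadow(airquality)
names(aq_shadow)
# Original columns plus Ozone_NA, Solar.R_NA, ... flagging "NA" vs "!NA"
```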

The simputation package provides the impute_lm function to impute the values of a dependent variable as a linear function of the values of independent variables.
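For example, imputing Ozone in airquality from two fully observed predictors (Wind and Temp, chosen here for illustration):

```r
library(simputation)

aq_imp <- impute_lm(airquality, Ozone ~ Wind + Temp)

sum(is.na(airquality$Ozone))  # 37 missing before
sum(is.na(aq_imp$Ozone))      # 0 after: the predictors are complete in every row
```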

We use bind_rows to aggregate the information from multiple imputation models into a single dataframe that looks like this.
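A sketch of that aggregation step, stacking two imputed copies of the shadowed data and labelling each row with the model that produced it (the model names are made up for illustration):

```r
library(dplyr)
library(naniar)
library(simputation)

aq <- bind_shadow(airquality)

models <- bind_rows(
  lm_wind = impute_lm(aq, Ozone ~ Wind),
  lm_temp = impute_lm(aq, Ozone ~ Temp),
  .id = "imputation"   # new column naming the source model for each row
)

count(models, imputation)   # one full copy of the data per imputation model
```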

We are almost ready to visualize the output of the different imputation models using common tidyr and ggplot2 functions. Let us do that in the exercises.

#DataCamp #RTutorial #PracticingMachineLearningInterviewQuestionsinR #Handlingmissingdata