R Tutorial: How to summarise missing values


---

Now that you understand what missing values are, how to count them, and how they operate, let's scale these up to more detailed summaries of missingness.

We need to summarise missing data to identify variables, cases, or patterns of missingness, as these can bias our data analysis.

There are two main kinds of summary: basic summaries and dataframe summaries.

Basic summaries return a single number, like the number of missing or complete values using n_miss or n_complete.
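As a minimal sketch of these basic summaries, the snippet below applies n_miss and n_complete to a single column. It assumes the naniar package is installed and uses the Ozone column of R's built-in airquality dataset, which is named later in this lesson; the variable names n_missing and n_observed are only illustrative.

```r
library(naniar)

# airquality ships with base R; choosing it here is an assumption
n_missing  <- n_miss(airquality$Ozone)      # number of NA values
n_observed <- n_complete(airquality$Ozone)  # number of non-NA values

n_missing
n_observed
```

Each call returns a single number, and the two counts add up to the number of rows in the data.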

However, you will need more detailed missingness summaries to help you on your journey through a data analysis.

This lesson introduces you to missing data summaries.

naniar provides a family of functions all starting with miss_, each of which provides a different summary of missingness and returns a dataframe.

This allows us to see features that can be difficult to articulate, or time consuming to calculate.

For example, miss_var_summary and miss_case_summary return the number and percentage of missings in each variable or case.

These summaries work with dplyr's group_by, so you can fluidly explore missingness within each group.

Use miss_var_summary to summarise the number of missings in each variable.

This returns a dataframe where each row is a variable. It also includes summaries of the number and percentage of missings for each variable in the dataset, and is sorted by the number of missings.

For example, Ozone has 37 missing values, and is about 24.2 percent missing.
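The Ozone figures quoted above match R's built-in airquality dataset, so the following sketch assumes that is the data being summarised. It assumes naniar is installed; var_summary is just an illustrative name.

```r
library(naniar)

# One row per variable, sorted so the variable with the most missings comes first
var_summary <- miss_var_summary(airquality)
var_summary

# The first row should be Ozone, with its count and percentage of missings
var_summary[1, ]
```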

Similar to miss_var_summary, miss_case_summary returns a summary dataframe, where each row describes one case, identified by its row number in the dataset.

Here, case 5 - the fifth row in the dataset - has 2 missing values, which means 33% of that case is missing.
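A short sketch of the case summary, again assuming naniar is installed and that the data is the built-in airquality dataset (which has 6 variables, so 2 missings is one third of a case). The names case_summary and case_5 are illustrative.

```r
library(naniar)

# One row per case (dataset row), with the count and percentage of missings
case_summary <- miss_case_summary(airquality)

# Pull out the fifth row of the original data by its case number
case_5 <- case_summary[case_summary$case == 5, ]
case_5
```

Filtering on the case column rather than taking row 5 of the result matters here, because the summary may be sorted by the number of missings instead of by row order.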

Tabulations of missingness count how many variables or cases have 0, 1, 2, 3, and so on, missings. They are very useful, compact summaries that reveal interesting structure.

miss_var_table returns a dataframe with the number of missings in a variable, and the number and percentage of variables affected.

For example, four variables have no missings, which corresponds to 66.7 percent of variables; one variable has 7 missings, and one variable has 37 missings.

Similarly, miss_case_table returns the same information, but for cases.
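The two tabulations can be sketched as follows, assuming naniar is installed and the data is the built-in airquality dataset. The object names var_table and case_table are illustrative.

```r
library(naniar)

# How many variables have 0, 7, 37, ... missings
var_table <- miss_var_table(airquality)
var_table

# How many cases (rows) have 0, 1, 2, ... missings
case_table <- miss_case_table(airquality)
case_table
```

In both tables the counts add up to the total: the number of variables for miss_var_table and the number of rows for miss_case_table.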

We can also look at missingness over a given span, or run, of a given variable using miss_var_span and miss_var_run.

These can be really useful for data with many regular measurements, like time series data.

miss_var_span calculates the number of missings in a variable for a repeating span. This is really useful in time series data to look for weekly (7 day) patterns of missingness.

miss_var_span returns a dataframe with a column "span_counter", which identifies the span (the first, the second, and so on), along with the number and proportion of missing and complete values in each span.

For example, in span 10, there are 432 missings and 3568 complete values. Note also that n_miss + n_complete equals the span, here 4000. Of these 4000 values, a proportion of 0.108 are missing and 0.892 are complete, given by prop_miss and prop_complete.
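A hedged sketch of a span summary follows. The transcript does not name the dataset behind these figures; the example assumes it is the pedestrian dataset bundled with naniar and its hourly_counts variable, with a span of 4000 rows. The numbers you see may differ from those quoted, so no expected output is shown.

```r
library(naniar)

# Assumption: pedestrian (bundled with naniar) is the time series in question
spans <- miss_var_span(pedestrian,
                       var = hourly_counts,
                       span_every = 4000)
spans

# Within a full span, missing and complete counts add up to the span length
spans$n_miss[1] + spans$n_complete[1]
```

For a weekly pattern in daily data, span_every = 7 would be the natural choice.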

miss_var_run returns the "runs" or "streaks" of missingness: the length of each consecutive run of "complete" and "missing" data. This is particularly useful for finding unusual or repeating patterns of missingness.
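As a sketch, the runs for a single variable can be listed like this. The example assumes naniar is installed and again uses its bundled pedestrian dataset, which is an assumption since the transcript does not name the data for this step.

```r
library(naniar)

# Each row is one streak: its length and whether it was "complete" or "missing"
runs <- miss_var_run(pedestrian, var = hourly_counts)
head(runs)

# The longest streak of missing values, if any
max(runs$run_length[runs$is_na == "missing"])
```

Because the runs partition the variable from start to end, their lengths add up to the number of rows in the data.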

Sometimes you are interested in missingness for groups in the data.

Each missingness summary function can be calculated by group, using group_by from dplyr.

For example, we can look at the missingness by Month in the airquality dataset.

Here we see that in Month 5 Ozone has 5 missings, but in Month 6 Ozone has 21 missings.
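The grouped summary can be sketched as below, assuming naniar and dplyr are installed. The %>% operator is the pipe that dplyr re-exports from magrittr: it passes the result on its left as the first argument of the function on its right, so the code reads top to bottom as a sequence of steps. The names by_month and ozone_by_month are illustrative.

```r
library(naniar)
library(dplyr)

# Split airquality by Month, then summarise missingness within each month
by_month <- airquality %>%
  group_by(Month) %>%
  miss_var_summary()

# Keep only the Ozone rows to compare months directly
ozone_by_month <- by_month %>%
  filter(variable == "Ozone")
ozone_by_month
```

Without the pipe, the first step would be written as miss_var_summary(group_by(airquality, Month)), which computes the same thing but reads inside out.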

Let's practice

#DataCamp #RTutorial #DealingWithMissingDatainR
Comments

The group by function sounds new. So what does the %>% in the example mean and how to use it? Wish this was also mentioned briefly. Thank you!

bettys