R Tutorial : How to summarise missing values

---

Now that you understand what missing values are, how to count them, and how they operate, let's scale these up to more detailed summaries of missingness.

We need to summarise missing data to identify variables, cases, or patterns of missingness, as these can bias our data analysis.

There are two main kinds of summary: basic summaries and dataframe summaries.

Basic summaries return a single number, like the number of missing or complete values using n_miss or n_complete.
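As a rough illustration of these single-number summaries, here is a minimal sketch using naniar's n_miss and n_complete on the built-in airquality dataset (the same dataset used later in this lesson). The variable names are only for illustration.

```r
library(naniar)

# airquality ships with base R (datasets package): 153 rows, 6 columns
total_missing  <- n_miss(airquality)      # missing values in the whole dataframe
total_complete <- n_complete(airquality)  # non-missing values in the whole dataframe

# Both functions also accept a single vector
ozone_missing <- n_miss(airquality$Ozone)

total_missing
total_complete
ozone_missing
```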

However, you will need more detailed missingness summaries to help you on your journey through a data analysis.

This lesson introduces you to missing data summaries.

naniar provides a family of functions all starting with miss_, each of which provides a different summary of missingness and returns a dataframe.

This allows us to see features that can be difficult to articulate, or time consuming to calculate.

For example, miss_var_summary and miss_case_summary return the number and percentage of missings in each variable or case.

These summaries work with dplyr's group_by, so you can fluidly explore missingness within each group.

Use miss_var_summary to summarise the number of missings in each variable.

This returns a dataframe where each row is a variable. It also includes summaries of the number and percentage of missings for each variable in the dataset, and is sorted by the number of missings.

For example, Ozone has 37 missing values, and is about 24.2 percent missing.
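The Ozone figures quoted here match the built-in airquality dataset, so a call along these lines should reproduce them. This is a sketch; var_summary is just an illustrative name.

```r
library(naniar)

# One row per variable, sorted by number of missings (largest first)
var_summary <- miss_var_summary(airquality)
var_summary

# The first row is the variable with the most missings
var_summary$variable[1]
var_summary$n_miss[1]
```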

Similar to miss_var_summary, miss_case_summary returns a summary dataframe, where each row represents a case, identified by its row number in the dataset.

Here, case 5 - the fifth row in the dataset - has 2 missing values, which means 33% of that case is missing.
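A minimal sketch of the case summary, assuming the airquality dataset as above. Looking up case 5 with base subsetting is an illustrative choice, not necessarily how the lesson displays it.

```r
library(naniar)

# One row per case (dataset row), sorted by number of missings
case_summary <- miss_case_summary(airquality)
case_summary

# Pick out the fifth row of the dataset
case_5 <- case_summary[case_summary$case == 5, ]
case_5
```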

Tabulations of missingness count the number of times there are 0, 1, 2, 3, and so on, missings. These are very useful, compact summaries that reveal interesting structure.

miss_var_table returns a dataframe with the number of missings in a variable, and the number and percentage of variables affected.

For example, there are four variables with no missings detected, which corresponds to 66.7 percent of variables, and there was 1 variable with 7 missings, and 1 variable with 37 missings.

Similarly, miss_case_table returns the same information, but for cases.
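Both tabulations can be sketched on airquality as follows. The object names are illustrative only.

```r
library(naniar)

# How many variables have 0, 7, 37, ... missings
var_table <- miss_var_table(airquality)
var_table

# How many cases (rows) have 0, 1, 2, ... missings
case_table <- miss_case_table(airquality)
case_table
```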

We can also look at missingness over a given span, or run, for a given variable using miss_var_span and miss_var_run.

These can be really useful for data with many regular measurements, like time series data.

miss_var_span calculates the number of missings in a variable for a repeating span. This is really useful in time series data to look for weekly (7 day) patterns of missingness.

miss_var_span returns a dataframe with a column "span_counter", which identifies the span (the first, the second, and so on), along with the number and proportion of missing and complete values.

For example, in span 10, there are 432 missings and 3568 complete values. Note also that n_miss + n_complete equals the span, here 4000. Of these 4000 values, a proportion of 0.108 is missing and 0.892 is complete, given by prop_miss and prop_complete.
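The transcript does not name the dataset behind the span of 4000, so the sketch below swaps in airquality instead: its measurements are daily, so a span of 7 corresponds to the weekly pattern mentioned earlier. The numbers it prints will therefore differ from the ones quoted above.

```r
library(naniar)

# Count missings in Ozone for each consecutive block of 7 rows (one week)
# Assumption: airquality is used for illustration, not the lesson's dataset
ozone_spans <- miss_var_span(airquality, Ozone, span_every = 7)
ozone_spans

# Summing over all spans recovers the total number of missings in Ozone
sum(ozone_spans$n_miss)
```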

miss_var_run returns the "runs" or "streaks" of missingness. This is useful to try and find unusual patterns of missingness. It returns the length of the run of "complete" and "missing" data. This is particularly useful for finding repeating patterns of missingness.
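As a hedged illustration, runs of missingness can be computed for Ozone in airquality (again an assumed stand-in, since the transcript does not show which data the lesson uses here):

```r
library(naniar)

# Each row is one streak: its length and whether it is "complete" or "missing"
ozone_runs <- miss_var_run(airquality, Ozone)
ozone_runs

# The run lengths add up to the number of rows in the dataset
sum(ozone_runs$run_length)
```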

Sometimes you are interested in missingness for groups in the data.

Each missingness summary function can be calculated by group, using group_by from dplyr.

For example, we can look at the missingness by Month in the airquality dataset.

Here we see that in Month 5 Ozone has 5 missings, but in Month 6 Ozone has 21 missings.
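A grouped summary like this one can be sketched as follows. The pipe operator %>% passes the result on its left as the first argument to the function on its right; by_month is an illustrative name.

```r
library(naniar)
library(dplyr)

# Summarise missingness per variable, separately within each Month
by_month <- airquality %>%
  group_by(Month) %>%
  miss_var_summary()
by_month

# Pull out the Ozone rows for May and June
ozone_may  <- by_month$n_miss[by_month$Month == 5 & by_month$variable == "Ozone"]
ozone_june <- by_month$n_miss[by_month$Month == 6 & by_month$variable == "Ozone"]
```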

Let's practice

#DataCamp #RTutorial #DealingWithMissingDatainR
