R Tutorial: Know your data

---
Now that we have read our data into R, let's get to know it a little better.

One of the most important things you can do when working with any new data is to learn about how it was collected. We have been working with the bakers data from The Great British Bake Off.

On each episode of the show, one baker is eliminated, one wins the technical challenge, and one is chosen as star baker. The title of star baker is based on the baker's performance across three timed challenges: the signature, the technical, and the showstopper.

Now, let's have another look at the bakers data.

So far, we've printed tibbles to view them. But if a tibble has lots of columns, many of them will be cut off when you print it.

Here, when we print our bakers data with 10 columns, we see that there are 4 more variables that are hidden.

To see all the columns, we use the function glimpse from the dplyr package. The argument for glimpse is the name of your tibble.

The glimpse output is a transposed view of your data: variables are listed from top to bottom, one per row, instead of left to right. Going across each row, glimpse prints the type of the variable and its first few observed values.

We also see the number of observations and variables at the top.
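Here is a minimal sketch of calling glimpse. The tibble toy_bakers and its values are made up for illustration, since the real bakers data is not reproduced here; only the column names baker and aired_us come from the video.

```r
library(dplyr)  # dplyr re-exports tibble() and glimpse()

# Hypothetical stand-in for the bakers data (values are invented)
toy_bakers <- tibble(
  baker    = c("Ada", "Ben", "Cleo"),
  age      = c(31, 45, 27),
  aired_us = c(TRUE, FALSE, TRUE)
)

# One row of output per column: name, type, first few values
glimpse(toy_bakers)
```

The header of the output reports the number of rows and columns, which is a quick check that the data was read in as expected.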

You may also want to summarize your data by looking at summary statistics for each variable. A quick way to do this is with the skim function from the skimr package.

Like glimpse, the argument for skim is the name of your tibble.

Skim provides statistics for every column, and which statistics you get depends on the type of the variable. The results are printed with one row per variable, divided into sections for each variable type.
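As a rough sketch, a skim call looks like this. The tibble toy_bakers is a made-up stand-in with one character, one numeric, and one logical column, so the output should contain one section per type. The exact layout and statistic names depend on the installed skimr version.

```r
library(dplyr)
library(skimr)

# Hypothetical stand-in for the bakers data (values are invented)
toy_bakers <- tibble(
  baker    = c("Ada", "Ben", "Cleo"),
  age      = c(31, 45, 27),
  aired_us = c(TRUE, FALSE, TRUE)
)

# Summary statistics for every column, grouped by variable type
skimmed <- skim(toy_bakers)
skimmed
```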

Let's break down the first section of output summarizing our three character variables.

For baker, there are no missing values and 10 complete observations. The minimum and maximum values refer to string length. Also, each value is unique here: there are no bakers with the same name.
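To make those character statistics concrete, here is a small base R sketch that computes the same kinds of values by hand. The names in baker_names are invented for illustration and are not the bakers from the show.

```r
# Hypothetical baker names (not the real data)
baker_names <- c("Ada", "Ben", "Cleo")

n_missing  <- sum(is.na(baker_names))      # count of missing values
min_length <- min(nchar(baker_names))      # shortest string length
max_length <- max(nchar(baker_names))      # longest string length
n_unique   <- length(unique(baker_names))  # number of distinct values

c(missing = n_missing, min = min_length, max = max_length, unique = n_unique)
```

When the number of unique values equals the number of observations, as it does here, no value is repeated.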

The next sections of the skim output summarize dates.

The variable last_date_uk is the last date that each baker appeared on the show in the UK. From the min and max values, we can tell that our data spans about 2 years.

For the series factor, there are only 3 unique values across the 10 observations. Looking at the top counts, series 4 is the most common.
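The unique count and top counts for a factor can be reproduced with base R. In this sketch the vector series is hypothetical: it has 10 observations and 3 levels like the data described above, but the individual counts are invented.

```r
# Hypothetical series factor: 10 observations, 3 levels (counts are invented)
series <- factor(c("4", "4", "4", "4", "4", "2", "2", "2", "6", "6"))

n_levels   <- nlevels(series)                      # number of unique values
top_counts <- sort(table(series), decreasing = TRUE)  # most common level first

n_levels
top_counts
```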

The logical column named aired_us is TRUE if that baker appeared in a series that aired in the US, and FALSE if not. R treats TRUE as 1 and FALSE as 0, so the mean of a logical column is the proportion of TRUE values. Here, the mean tells us that 70% of the bakers were seen by US viewers.
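A quick sketch shows why the mean of a logical column reads as a percentage. The vector below is constructed to have 7 TRUE values out of 10, matching the 70% in the output; the order of the values is an assumption.

```r
# 7 TRUE and 3 FALSE values, arranged arbitrarily for illustration
aired_us <- c(rep(TRUE, 7), rep(FALSE, 3))

# TRUE counts as 1 and FALSE as 0, so the mean is the share of TRUE values
prop_aired <- mean(aired_us)
prop_aired
```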

Numeric variables are summarized last.

In addition to the number of missing and complete values, skim returns the means, standard deviations, and quantiles of the variables.

A mini histogram is also printed to give you a sense for the distribution of each variable.

From this skimmed output, we know that the average age of these bakers is 34, and bakers appeared in anywhere from 1 to 10 episodes, with a median of 5. Only one of these 10 bakers was crowned star baker in their time on the show, and they won it twice! Most bakers in this tibble never won the technical challenge, but one did win three technical challenges.
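The numeric statistics that skim reports can also be computed one at a time in base R. The episodes vector here is hypothetical: it was chosen to range from 1 to 10 with a median of 5, like the output described above, but the individual values are not from the real data.

```r
# Hypothetical episode counts for 10 bakers (values are invented)
episodes <- c(1, 2, 3, 4, 5, 5, 6, 8, 9, 10)

episodes_mean      <- mean(episodes)      # average
episodes_sd        <- sd(episodes)        # standard deviation
episodes_quantiles <- quantile(episodes)  # 0%, 25%, 50%, 75%, 100%

episodes_mean
episodes_sd
episodes_quantiles

# A rough text-free look at the distribution, similar in spirit to skim's mini histogram
hist(episodes, plot = FALSE)$counts
```

The 50% quantile is the median, and the 0% and 100% quantiles are the minimum and maximum.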

Now it's time to put glimpse and skim into practice with our bakeoff data.
