R Tutorial: Foundations of 'tidy' machine learning

---
Hi, my name is Dima, and I am excited to welcome you to the Machine Learning in the Tidyverse course.

If you're here, then you must already know how easy it is to explore, manipulate, and analyze your data with tools from the tidyverse.

The good news is that the tidyverse tools also work exceptionally well for building machine learning models.

The reason for this is that the tidyverse tools center around the dataframe structure known as a tibble. What makes a tibble special for machine learning is that it can natively store arbitrarily complex objects using a special column known as the list column.

This is particularly helpful for storing models, since the objects returned by model-fitting functions are typically complex.

With tibbles you can store models in these list columns and, as a result, explore and evaluate them with the rest of the suite of tidy tools.
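As a rough illustration of this idea, the sketch below stores two fitted linear models in a list column. It assumes R's built-in mtcars data rather than the course's gapminder data, and the column names spec and model are made up for this example.

```r
library(tibble)

# Two linear models fit on the built-in mtcars data (an assumption for
# illustration; the course itself works with gapminder data).
models <- tibble(
  spec  = c("mpg ~ wt", "mpg ~ hp"),
  model = list(
    lm(mpg ~ wt, data = mtcars),
    lm(mpg ~ hp, data = mtcars)
  )
)

# The model column is an ordinary list, so each fitted model is kept whole.
is.list(models$model)
class(models$model[[1]])
```

Because each model sits in its own row, it stays next to the label that describes it and can be passed to other tidy tools later.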

Along with the tibble, the functions in the tidyr and purrr packages form the foundational tools for working with list columns. You will use these tools as part of a framework called the List Column Workflow.

At its core, this workflow can be summed up in three basic steps. The first step is to make a list column. The second step is to work with the list column using the appropriate tools. The third and final step is to simplify the list columns into a format that allows further exploration with the familiar tidyverse tools.

These three steps rely on the map family of functions from purrr and the nest and unnest functions from tidyr. To learn how to use the list column workflow you will work with the gapminder dataset.
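The three steps can be sketched roughly as follows. This is only an illustration: the toy tibble, its values, and the mean_life column are made up here and are not the course's gapminder data.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Toy data standing in for gapminder (values are made up for illustration).
toy <- tibble(
  country         = c("A", "A", "B", "B"),
  year            = c(2000, 2001, 2000, 2001),
  life_expectancy = c(70, 71, 60, 62)
)

# Step 1: make a list column with one dataframe per country.
nested <- toy %>%
  group_by(country) %>%
  nest() %>%
  ungroup()

# Step 2: work with the list column using map().
with_means <- nested %>%
  mutate(mean_life = map(data, ~ mean(.x$life_expectancy)))

# Step 3: simplify the new list column into a regular numeric column.
simplified <- with_means %>%
  unnest(mean_life)
```

The ungroup() call is a cautious choice in this sketch so that the later steps operate on a plain tibble.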

Unlike previous courses that have used the gapminder package, this course will use a more granular collection of gapminder data adapted from the dslabs package.

This version contains observations for 77 countries across a time period of 52 years. Each observation has six informational elements associated with it; we will refer to these elements as the features of the observations.

In this video and the exercises that follow it you will learn how to use the nest and unnest functions to manipulate the gapminder data.

Here is an excerpt of the gapminder data colored by country.

The process of nesting compacts the chunk of data for each country into a corresponding entry in the new nested dataframe. This is accomplished by the nest function.

To nest the gapminder data by country, you first use group_by() to group the data by country, then use nest() to create one nested dataframe per country.

This process creates a new list column named data. Each element of this column contains the subsetted dataframe for the corresponding country.

Because the data column in the nested dataframe is a list column, you can access it directly. This can be very helpful for exploring the data and prototyping your approach.

For example, you can view the fourth list entry, the data for Austria, by specifying the data column and extracting that entry with double brackets.
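A small sketch of this kind of direct access is shown below. It assumes a made-up tibble with three placeholder countries instead of the real gapminder data, so the entry extracted here is the second one rather than the fourth.

```r
library(dplyr)
library(tidyr)

# Made-up data with placeholder country names (not the real gapminder data).
toy <- tibble(
  country    = c("A", "A", "B", "B", "C", "C"),
  year       = c(2000, 2001, 2000, 2001, 2000, 2001),
  population = c(10, 11, 20, 22, 30, 33)
)

# Nest by country; the result has one row per country and a data list column.
nested <- toy %>%
  group_by(country) %>%
  nest()

# Double brackets extract a single entry of the list column as a dataframe.
# The second entry holds the rows for country "B".
nested$data[[2]]
```

The extracted dataframe contains only the year and population columns, since the grouping column stays in the outer dataframe.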

For the third step of the list column workflow, you need to simplify list columns.

If the list column contains dataframes, like in this example, you can simplify it using the unnest() function.

In this example, you can see how the nested dataframes were simplified into a dataframe with regular columns.

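Here the column to unnest is specified as an argument in the unnest() function. In older versions of tidyr, calling unnest() with no arguments attempts to unnest all list columns by default; from tidyr 1.0.0 onward, the columns are expected to be named explicitly, and omitting them produces a warning.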
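A minimal sketch of simplifying a nested dataframe with unnest() follows, with the list column named explicitly. The toy tibble and its values are assumptions for illustration, not the course's gapminder data.

```r
library(dplyr)
library(tidyr)

# Made-up data standing in for gapminder.
toy <- tibble(
  country         = c("A", "A", "B", "B"),
  year            = c(2000, 2001, 2000, 2001),
  life_expectancy = c(70, 71, 60, 62)
)

# Build the nested dataframe: one row per country, with a data list column.
nested <- toy %>%
  group_by(country) %>%
  nest() %>%
  ungroup()

# Name the list column to unnest; the nested rows become regular columns again.
flat <- nested %>%
  unnest(data)
```

Naming the column keeps the call explicit about which list column is being simplified, which matters once a dataframe holds more than one.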

Now it's your turn to practice using these tools.
