An Introduction to Machine Learning with R


At a very basic level, Machine Learning explores the construction and usage of algorithms that can learn from data. But when does a machine actually learn? We can say that a machine has the ability to learn if it is able to improve its performance in solving certain tasks when it receives more information. This 'experience' typically comes in the form of observations on how particular instances of a problem were solved before.

Maybe an example will clarify. A possible task for a computer could be to label squares with a color, based on the square's size and edge. Initially, the computer has no idea how to do this. However, suppose that a number of squares were labeled earlier by humans. For example, a small dotted square was labeled green, a big striped square was said to be yellow, a medium-sized square with a normal edge was labeled green as well, and lastly a small striped one was labeled yellow. A machine learning algorithm can use these observations, or instances, to make an informed guess about how to label an unseen square. An example could be a medium striped square. The computer can be right or wrong in doing so.

This specific example was a classification problem. There are many types of machine learning problems; some are related and others are pretty exotic. A concept that keeps popping up is the presence of input knowledge, or simply data. In our example, it was the set of human-labeled squares. In other examples, it can be something totally different. Typically, this data is a data set containing a number of observations that each have a well-defined number of variables, often called features. Each square and its corresponding color is an observation; the features in this case are the size and edge. The color is the label of the square.

As you are well aware, in R, a data set is typically represented by a data frame. Have a look at this code that builds the `squares` data frame for our example.
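The code shown in the video is not reproduced in this transcript. A minimal sketch of what it might look like follows, using the four squares described earlier. The column names `size`, `edge` and `color` and the use of factors are assumptions, not necessarily what the original code used.

```r
# Build the toy data set: one row per labeled square.
# Column names and stringsAsFactors = TRUE are assumptions for illustration.
squares <- data.frame(
  size  = c("small", "big", "medium", "small"),
  edge  = c("dotted", "striped", "normal", "striped"),
  color = c("green", "yellow", "green", "yellow"),
  stringsAsFactors = TRUE
)

# Print the data frame to inspect it
squares
```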

In a data frame, the observations correspond to the rows and the variables correspond to the columns. To find out the dimensions of your data set, you can use the `dim()` function: we see that we are dealing with 4 observations and 3 variables (two features plus the label). The `str()` function gives a more structured overview of our data, also showing how many observations and variables the data set comprises. Another function you can use to inspect your data is the `summary()` function. This function will also give you information on the distribution of each variable. Of course, data sets are typically much larger than this, but we're only dealing with a toy problem here.
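As a rough sketch, the three inspection functions can be called like this. The `squares` data frame is rebuilt here so the snippet runs on its own; its column names and the use of factors are assumptions.

```r
# Toy data set (column names and factor columns are assumptions)
squares <- data.frame(
  size  = c("small", "big", "medium", "small"),
  edge  = c("dotted", "striped", "normal", "striped"),
  color = c("green", "yellow", "green", "yellow"),
  stringsAsFactors = TRUE
)

# Number of rows (observations) and columns (variables): 4 and 3
dim(squares)

# Compact overview: type and first values of each column
str(squares)

# Per-column summary; for factor columns this counts each level
summary(squares)
```

With factor columns, `summary()` reports how often each level occurs. With plain character columns it would only report the length and class.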

Ok! Let's dig into a more theoretical formulation of our example. The problem here is labeling a square. This is actually applying some function to the inputs to generate an output. The size and edge variables of a square go in and the label variable, color, comes out. A machine learning algorithm tries to come up with a way of labeling the square, so you're actually trying to estimate the function here. This function could be estimated based on the previous observations of how the problem was solved. Ideally, the function you're building is generally applicable and can handle all kinds of reasonable inputs. If we put in the unseen example, the medium striped square, with an unknown label, the function will guess a label for us.
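To make the idea of estimating a function from observations concrete, here is a deliberately simple sketch. It is not the method used in the course: it only looks at the `edge` feature and remembers the most frequent color for each edge type. The helper names `learn_edge_rule()` and `predict_color()` are hypothetical, as are the column names of `squares`.

```r
# Toy data set (column names are assumptions)
squares <- data.frame(
  size  = c("small", "big", "medium", "small"),
  edge  = c("dotted", "striped", "normal", "striped"),
  color = c("green", "yellow", "green", "yellow"),
  stringsAsFactors = TRUE
)

# "Learn" a rule from the data: for each edge type,
# store the color that was observed most often.
learn_edge_rule <- function(data) {
  counts <- table(data$edge, data$color)
  majority <- colnames(counts)[apply(counts, 1, which.max)]
  setNames(majority, rownames(counts))
}

# Apply the learned rule to a new square's edge type.
# Edge types not seen during learning yield NA.
predict_color <- function(rule, edge) {
  unname(rule[edge])
}

rule <- learn_edge_rule(squares)

# Guess a label for the unseen medium striped square
predict_color(rule, "striped")
```

The estimated function ignores size entirely, so it is only one of many rules consistent with these four observations. A real learning algorithm would weigh both features.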

It's very important to recognize when you're actually dealing with a machine learning problem. When you are simply calculating, for example, the most frequently occurring color of squares in your data set, or the average size of your squares, you're not doing machine learning. The entire point of machine learning is to build a model that can help make predictions about your data or about future instances of similar problems.

Don't let these general formulations of machine learning confuse you. Some very common problems can actually be thought of as machine learning. Do you know about linear regression? You can actually use it to make predictions on your data. Suppose you've got some data about a group of people: their height and their weight. You can use linear regression to make a function which can predict the height of a new person, given their weight. You do this by using the given data about the known height and weight of the first group of people.
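As an illustration, the sketch below fits such a regression with `lm()`. It uses R's built-in `women` data set (heights in inches and weights in pounds), which is not part of the original example; the weight of 140 for the new person is likewise just an illustrative value.

```r
# Built-in data set from the datasets package: columns height and weight
data(women)

# Estimate height as a linear function of weight
model <- lm(height ~ weight, data = women)

# Predict the height of a new person from their weight
new_person <- data.frame(weight = 140)
predicted_height <- predict(model, newdata = new_person)
predicted_height
```

The fitted `model` plays the role of the estimated function: known heights and weights go in during fitting, and afterwards a weight alone is enough to obtain a predicted height.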

There are many other examples of machine learning, such as shopping basket analysis, movie recommendation systems, decision-making for self-driving cars and more. But let's take this step by step.

In the next set of exercises, you'll consolidate the main ideas about machine learning and take your first gentle steps in this exciting field.