R Tutorial: Distance metrics
---
Welcome to the second video of the course. In this lesson, we will learn about why we need distance metrics, what a distance metric is exactly, and which types of distance metrics are available.
The similarity between MNIST digits can be quantified using a distance metric.
A metric is a function that, for any pair of points, returns a value satisfying the following properties.
First, the triangle inequality, which means the direct distance between two vectors is never longer than a path that passes through a third vector. Second, the symmetry property, that is, the distance between x and y is the same in either direction. And third, the distance between two different vectors is positive, while the distance from a vector to itself is zero.
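Stated formally, for any vectors x, y, and z:

    d(x, z) \le d(x, y) + d(y, z)   % triangle inequality
    d(x, y) = d(y, x)               % symmetry
    d(x, y) > 0 \text{ for } x \ne y, \quad d(x, x) = 0   % positivity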
In a two dimensional space, the Euclidean distance between two points p and q is the length of the line segment connecting them.
In this example, you can see how we compute the Euclidean distance between the points p and q.
The Euclidean distance also generalizes to high-dimensional vectors.
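In the plane this is the familiar Pythagorean formula; for n-dimensional vectors p and q it reads:

    d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}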
You can use the function dist() to compute the Euclidean distance matrix between the rows of a data matrix.
Let's see how we compute the Euclidean distances between the last six digits in mnist_sample.
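A minimal sketch of this step, assuming mnist_sample is a data frame whose first column, label, holds the true digit and whose remaining columns hold pixel intensities:

    # Last six records, with the label column dropped
    digits <- tail(mnist_sample, 6)[, -1]

    # dist() defaults to method = "euclidean" and returns the
    # pairwise distances between the rows
    distances <- dist(digits)
    distances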
In the object distances, you can see the computed values.
Now, we will plot those values using a heatmap with each digit label.
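One way to draw that heatmap with base R, reusing the distances object and the assumed label column from the sketch above:

    # Axis labels: the true digit of each of the six records
    labels <- tail(mnist_sample, 6)$label

    # Rowv = NA and Colv = NA keep the original row order
    heatmap(as.matrix(distances),
            Rowv = NA, Colv = NA,
            labRow = labels, labCol = labels)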
Under this metric, the first and third examples of the digit eight are the most similar pair, which is why their cell is darkest, but the same does not hold for the other digits labeled eight.
Minkowski provided a generalization for the Euclidean distance.
Each named Minkowski distance arises from the order p of a general formula. When p is equal to 1 we call it the Manhattan distance, when p is equal to 2 the Euclidean distance, and when p tends to infinity it is known as the Chebyshev distance.
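For reference, the Minkowski distance of order p between two n-dimensional vectors x and y is:

    d_p(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}

Setting p = 1 recovers the Manhattan distance, p = 2 the Euclidean distance, and letting p tend to infinity gives the Chebyshev distance, \max_i |x_i - y_i|.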
In R, we can compute all of these metrics with the dist() function. The code below shows how we compute the Minkowski distance of order three.
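A sketch of that call, reusing the digits matrix from above (dist() accepts the order through its p argument):

    # Minkowski distance of order 3 between the six digits
    minkowski_3 <- dist(digits, method = "minkowski", p = 3)
    minkowski_3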
The Manhattan distance computes the distance that would be traveled to get from one data point to another if a grid-like path is followed.
The Kullback-Leibler divergence or KL divergence is a measure of how one probability distribution is different from a second one.
It is not a strict metric, since it satisfies neither the symmetry nor the triangle inequality property.
A divergence of 0 indicates that the two distributions are identical.
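For discrete distributions P and Q defined over the same support, the divergence is:

    D_{KL}(P \,\|\, Q) = \sum_{i} P(i) \log \frac{P(i)}{Q(i)}

The asymmetry is visible in the formula: swapping P and Q changes the value, which is why KL is a divergence rather than a true metric.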
It is a common measure used to optimize algorithms in Machine Learning, as in the case of t-SNE. In decision trees, for example, it is known as Information Gain.
To compute the KL divergence in R, we are going to use the philentropy package. First, we load the package and store the last six MNIST records from mnist_sample, leaving out the true label.
To compute the KL divergence, the values of each record need to sum to one, so we normalize the pixel values of each digit. First, we add 1 to all records to avoid getting a NaN while rescaling; then we divide each record by its row sum, obtained with rowSums().
Finally, we compute the KL divergence using the distance() function and generate the corresponding heatmap, as sketched below.
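A sketch of the full pipeline, under the same assumptions about mnist_sample as before:

    library(philentropy)

    # Last six records as a numeric matrix, true label left out
    records <- as.matrix(tail(mnist_sample, 6)[, -1])

    # Add 1 so no pixel is zero (a zero would produce NaN in the log),
    # then rescale each row to sum to one
    records <- records + 1
    probs <- records / rowSums(records)

    # Pairwise KL divergence between the rows
    kl <- distance(probs, method = "kullback-leibler")

    # Heatmap with the true digit labels
    labels <- tail(mnist_sample, 6)$label
    heatmap(kl, Rowv = NA, Colv = NA, labRow = labels, labCol = labels)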
As you can see, this metric does a much better job of finding the similarities between digits: here, all the positions that correspond to the digit eight are the most similar ones.
Now, it's time to practice with the MNIST dataset and compute some similarity or distance metrics to identify similar digits.
#R #RTutorial #DataCamp #Advanced #Dimensionality #Reduction #Distance #metrics