R Tutorial: What is cluster analysis?


---

Hi, my name is Dima and I am very excited to have you join me in learning all about cluster analysis in R. Cluster analysis is a form of data exploration, and the key to harnessing its power lies in understanding how it works. So, in this course you won't just learn the tools necessary to perform cluster analysis - that's the easy part - I will work with you to build the intuition behind the underlying methods. But before we get to the how, let's take a moment to discuss what clustering is.

No matter whether you are working with medical data, retail data, or sports data, as a data scientist you are often presented with a bunch of data that you need to make sense of.

To understand what clustering is, let's put aside the details of our data and instead focus on a toy example where the data is represented as a matrix containing entries of card suits.

To look at it another way, this matrix is composed of rows containing our observations and columns that tell us something that we measured across these observations.

We will refer to these columns as the features of our observations.
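As a rough sketch of that layout in R, here is a small character matrix in which each row is an observation and each column is a feature. The suit values and the `obs_`/`feature_` names are made up for illustration and are not the data shown in the video.

```r
# Hypothetical toy data: each row is one observation,
# each column is one feature (the suit seen in that position).
cards <- matrix(
  c("hearts", "spades",   "hearts",
    "clubs",  "diamonds", "clubs",
    "hearts", "spades",   "hearts",
    "clubs",  "diamonds", "spades"),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    paste0("obs_", 1:4),      # row names: observations
    paste0("feature_", 1:3)   # column names: features
  )
)

n_observations <- nrow(cards)  # rows = observations
n_features     <- ncol(cards)  # columns = features
print(cards)
```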

In cluster analysis, we are interested in grouping our observations such that all members of a group are similar to one another and at the same time they are distinctly different from all members outside of this group.

Imagine in this example we performed a cluster analysis to find which observations are similar to one another based on what suit appears in each column.

In this case, we identified three groups and colored the observations accordingly.

To better see these patterns, let's re-organize our observations into their respective colored clusters.

Here we can start to see clear patterns emerge. Fundamentally, this is how cluster analysis works.

Or to put it another way, cluster analysis is a form of exploratory data analysis where observations are divided into meaningful groups that share common characteristics.

So what are the steps involved in performing cluster analysis?

Well, first, you must make sure that your data is ready for clustering, meaning that your data does not have any missing values and that your features are on similar scales.
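Here is a minimal sketch of that preparation step using base R. The `players` data frame and its columns are hypothetical, and dropping incomplete rows with `na.omit()` and standardizing with `scale()` is only one possible way to handle missing values and mismatched scales.

```r
# Hypothetical toy data: three features measured on six observations.
players <- data.frame(
  height_cm = c(180, 175, NA, 190, 168, 185),
  weight_kg = c(80, 72, 65, 95, NA, 88),
  goals     = c(12, 3, 7, 20, 1, 15)
)

# Drop every observation (row) that has at least one missing value.
players_complete <- na.omit(players)

# Standardize each feature to mean 0 and standard deviation 1,
# so that no feature dominates just because of its units.
players_scaled <- scale(players_complete)

print(players_scaled)
```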

Next, you must decide on what metric is appropriate to capture the similarity between your observations using the features that you have.
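To make this concrete, here is a small sketch that uses base R's `dist()` function with Euclidean distance. The three points are made up, and Euclidean distance is just one of several metrics you might choose.

```r
# Hypothetical observations with two numeric features, x and y.
obs <- matrix(
  c(0, 0,
    3, 4,
    6, 8),
  ncol = 2, byrow = TRUE,
  dimnames = list(c("a", "b", "c"), c("x", "y"))
)

# Pairwise Euclidean distances between all observations (rows).
d <- dist(obs, method = "euclidean")

# Convert to a full matrix to look up individual pairs by name.
d_matrix <- as.matrix(d)
print(d_matrix)
```

The smaller the distance between two observations, the more similar they are under the chosen metric.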

Once you have calculated this, you can use a clustering method to group your observations into clusters based on how similar they are to each other.
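As an illustration of this step, the sketch below feeds a distance matrix into hierarchical clustering with `hclust()` and extracts two groups with `cutree()`. The data, the complete linkage method, and the choice of `k = 2` are all assumptions made for this example.

```r
# Hypothetical data: two visibly separated groups of three points each.
obs <- matrix(
  c(1.0, 1.0,
    1.2, 0.9,
    0.8, 1.1,
    8.0, 8.0,
    8.1, 7.9,
    7.9, 8.2),
  ncol = 2, byrow = TRUE
)

# Pairwise Euclidean distances between observations.
d <- dist(obs)

# Build the hierarchy, then cut it into k = 2 clusters.
hc <- hclust(d, method = "complete")
clusters <- cutree(hc, k = 2)

print(clusters)
```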

But, most importantly, you will need to analyze the output of these clusters to determine whether they provide any meaningful insight into your data. This often requires a deep understanding of the problem and the data that you are working with.
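One simple way to start that analysis is to summarize each feature within each cluster. The sketch below does this on made-up data with `aggregate()` and `table()`. Whether the resulting groups are meaningful still depends on what you know about your own problem.

```r
# Hypothetical data: two numeric features on six observations.
obs <- data.frame(
  x = c(1.0, 1.2, 0.8, 8.0, 8.1, 7.9),
  y = c(1.0, 0.9, 1.1, 8.0, 7.9, 8.2)
)

# Assign each observation to one of two clusters (assumed k = 2).
clusters <- cutree(hclust(dist(obs)), k = 2)

# Mean of every feature within each cluster.
cluster_means <- aggregate(obs, by = list(cluster = clusters), FUN = mean)

# Number of observations in each cluster.
cluster_sizes <- table(clusters)

print(cluster_means)
print(cluster_sizes)
```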

As you can see in this flow chart, the analysis you perform on these clusters may require you to iterate on the clustering steps until you converge on a meaningful grouping of your data.

The first three chapters of this course will help you unpack this process.

In this chapter, you will gain a deeper understanding of what it means for two observations to be similar - or more specifically, dissimilar. You will also learn why the features of your data need to be comparable to one another.

In chapters two and three you will learn how to use two commonly used clustering methods: hierarchical clustering and k-means clustering.
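As a small preview, here is a sketch that runs both methods on the same made-up data using base R's `hclust()` and `kmeans()`. The data, the seed, `k = 2`, and `nstart = 10` are assumptions chosen for illustration. The cluster labels themselves are arbitrary, so the comparison uses a cross-tabulation rather than matching label numbers.

```r
set.seed(42)  # k-means starts from random centers, so fix the seed

# Hypothetical data: two visibly separated groups of three points each.
obs <- matrix(
  c(1.0, 1.0,
    1.2, 0.9,
    0.8, 1.1,
    8.0, 8.0,
    8.1, 7.9,
    7.9, 8.2),
  ncol = 2, byrow = TRUE
)

# Hierarchical clustering: build the tree from distances, then cut it.
hc_clusters <- cutree(hclust(dist(obs)), k = 2)

# k-means clustering: works on the data directly, with 10 random restarts.
km <- kmeans(obs, centers = 2, nstart = 10)
km_clusters <- km$cluster

# Cross-tabulate the two assignments to see how they line up.
agreement <- table(hierarchical = hc_clusters, kmeans = km_clusters)
print(agreement)
```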

At the end of these chapters and in chapter four you will work through two case studies where cluster analysis provides a unique perspective into the underlying data.

So, let's begin!