Python Tutorial : Linear regression by least squares


---

Sometimes two variables are related. You may recall from the prequel to this course that we computed the Pearson correlation coefficient between Obama's vote share in each county in swing states and the total vote count of the respective counties.

The Pearson correlation coefficient is important to compute, but we might like to get a fuller understanding of how the data are related to each other. Specifically, we might suspect some underlying function gives the data its shape.

Oftentimes a linear function is appropriate to describe the data, and this is what we will focus on in this course. The parameters of the function are the slope and intercept. The slope sets how steep the line is, and the intercept sets where the line crosses the y-axis. How do we figure out which slope and intercept best describe the data?
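As a quick illustration of what those two parameters do, here is a minimal sketch in Python; the x-values and the two candidate (slope, intercept) pairs are made up for this example.

```python
import numpy as np

# Two made-up (slope, intercept) pairs to illustrate their roles:
# the slope controls how steep the line is, and the intercept is
# the y-value where the line crosses the y-axis (at x = 0).
x = np.array([0.0, 1.0, 2.0, 3.0])

shallow_line = 2.0 * x + 40.0   # slope 2, crosses the y-axis at 40
steep_line   = 5.0 * x + 35.0   # slope 5, crosses the y-axis at 35

print(shallow_line)
print(steep_line)
```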

A simple answer is that we want to choose the slope and intercept such that the data points collectively lie as close as possible to the line. This is easiest to think about by first considering one data point, say this one. The vertical distance between the data point and the line is called the residual. In this case, the residual has a negative value because the data point lies below the line. Each data point has a residual associated with it.
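To make the idea of a residual concrete, here is a small sketch; the data values and the candidate slope and intercept are invented for illustration, not taken from the election data.

```python
import numpy as np

# Hypothetical data and a candidate line; all values are illustrative only.
x = np.array([0.2, 0.5, 1.0, 1.5, 3.0])      # total votes (100,000s)
y = np.array([38.0, 42.0, 45.0, 44.0, 55.0]) # percent vote for Obama

slope, intercept = 4.0, 40.0                 # candidate parameters

# The residual of each point is its vertical distance from the line:
# negative when the point lies below the line, positive when above.
residuals = y - (slope * x + intercept)
print(residuals)
```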

We define the line that is closest to the data to be the line for which the sum of the squares of all of the residuals is minimal. This process, finding the parameters for which the sum of the squares of the residuals is minimal, is called "least squares".
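In code, the quantity being minimized can be written as a small function; this is a sketch using the same illustrative data as above, not part of the course exercises.

```python
import numpy as np

# Same illustrative data as above.
x = np.array([0.2, 0.5, 1.0, 1.5, 3.0])
y = np.array([38.0, 42.0, 45.0, 44.0, 55.0])

def sum_of_squared_residuals(slope, intercept, x, y):
    """Sum of the squares of the residuals for a candidate line."""
    return np.sum((y - (slope * x + intercept)) ** 2)

# Least squares chooses the (slope, intercept) pair that makes this minimal.
print(sum_of_squared_residuals(4.0, 40.0, x, y))
```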

There are many algorithms to do this in practice. We will use the NumPy function polyfit, which performs least squares analysis with polynomial functions. We can use it because a linear function is a first-degree polynomial. The first two arguments are the x and y data. The third argument is the degree of the polynomial you wish to fit; for linear functions, we enter 1. The function returns the slope and intercept of the best-fit line. The slope tells us that we get about 4 more percentage points of the vote for Obama for every 100,000 additional voters in a county.
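Here is a minimal example of the np.polyfit call described above; the arrays stand in for the election data used in the video and are made up for illustration.

```python
import numpy as np

# Hypothetical data standing in for the county-level election data.
total_votes = np.array([0.2, 0.5, 1.0, 1.5, 3.0])       # units of 100,000 votes
dem_share   = np.array([38.0, 42.0, 45.0, 44.0, 55.0])  # percent vote for Obama

# Fit a first-degree polynomial (a line) by least squares.
# polyfit returns the coefficients from highest degree down,
# so for degree 1 that is (slope, intercept).
slope, intercept = np.polyfit(total_votes, dem_share, 1)

print('slope =', slope, 'percent per 100,000 votes')
print('intercept =', intercept, 'percent')
```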

Now that you know how to perform a linear regression, let's do it with some real data in the exercises!

#DataCamp #PythonTutorial #StatisticalThinkinginPython #StatisticalThinkinginPythonPart2