Welcome to Week 3, Part 2 - Linear Regression


In our last workshop, we talked about taking some continuous data, such as features about houses, and clustering them into classes, such as neighborhoods.

While these classes themselves can be represented continuously (by their centroids), the idea of classes is discrete. For example, if our features are square feet and home value, when we examine our clusters, we may see that our downtown centroid is $[850, 925{,}000]$. These are continuous values; however, we have a set number of clusters in total (K of them!), making our output discrete.

Now, say we have some more continuous data, but we're looking for continuous output, as well. This is where you'll find linear regression to be useful.

Breaking down the words

Just as "K-Means Clustering" sounded complicated, let's once again break down the terminology to understand the meaning of "Linear Regression."

Linear

All this means is that the relationship between our variables, when plotted, will be in the form of a straight(ish!) line. Think back to $y = mx + b$. Not only is that a great example of a linear relationship, it's actually the only thing we're calculating (more on that later). You may see a linear relationship between variables like "time studied" and "course grade."

This is in contrast to quadratic relationships like $y = x^2$, which you might see between variables such as the height of a thrown ball and the time since it was thrown. Relationships can also take nonlinear forms like $y = x^3$, $y = x^5$, and so on, though we won't be covering those today.

Regression

This may be a word you haven't seen before - and that's okay! Normally, in school, a (very, very basic) question would ask you,

"Given the equation $y = 3x - 5$, what is the value of y when x is 9?"

However, we're going to flip this question around. Instead, we're now asking:

"Given this .csv full of x and y values, what is the equation for $y = mx + b?$"

The formal definition of regression, from a quick google search, is:

"A return to a former or less developed state."

When you think about it, we're converting our data from many x and y values to a less developed state: an equation! That's essentially all regression is.

How do I use it?

Let's just get this out of the way now.

$$ y = bx + a $$

where

$$ b = \frac{n\sum xy - \sum x\sum y}{n\sum x^2 - (\sum x)^2} $$

and

$$ a = \bar y - b\bar x $$

Oof. That's ugly. There's a lot to go over here, so let's go step by step.

a and b?

When calculating linear regression, the convention is to use the form $y = bx + a$ rather than $y = mx + b$. Feel free to mentally replace $a$ and $b$ with any other letters not already in use in the equation; they're simply placeholders.

What's wrong with that E?

That's not a normal E; it's the Greek letter Sigma, used to write summation notation. Summation notation is useful because it saves a lot of writing: rather than writing out $x_{0} + x_{1} + x_{2} + \dots$ and so on, we can use Sigma to say "take the sum of all x values." You can actually think of the Sigma like a for loop:

for x in our data:
    total = total + x
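In real Python (no special libraries needed), that loop looks like this, with some made-up x values:

xs = [1, 2, 3, 4]          #some made-up x values
total = 0
for x in xs:               #"for x in our data"
    total = total + x      #add each x onto the running total
print(total)               #prints 10, the same as Python's built-in sum(xs)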

Why do those x's and y's have bars over them?

Those bars denote averages: $\bar x$ is the average of all our x values, and $\bar y$ is the average of all our y values.
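To make the slope and intercept formulas concrete, here's a minimal worked sketch in Python on three made-up points, $(1, 2)$, $(2, 3)$, and $(3, 5)$ (the numbers are invented purely for illustration):

import numpy as np

x = np.array([1, 2, 3])                 #made-up x values
y = np.array([2, 3, 5])                 #made-up y values
n = len(x)

b = (n*np.sum(x*y) - np.sum(x)*np.sum(y)) / (n*np.sum(x**2) - np.sum(x)**2)
a = np.mean(y) - b*np.mean(x)           #a = ȳ - b·x̄

print(b, a)                             #1.5 and roughly 0.33, so y ≈ 1.5x + 0.33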

Seriously, why does this even work?

Asking "why" is always the right thing to do; it's always vital to understand what you're doing. However, an in-depth explanation of the linear regression formula involves some knowledge of calculus that we won't assume during these workshops. If you're interested in specifically what each term of the equation is doing, then I encourage you to check out this PDF for a more involved explanation. For the purpose of not assuming everyone is fully proficient in calculus, though, we're going to stick to a more intuitive idea of why this formula works.

The Line of Best Fit

Let's go back to $y = mx + b$ for a moment, our core concept. All our equation of a line represents is a line of best fit: the line that most effectively represents - or estimates - our data. But how do we define "representing our data"? The standard heuristic is to divide our points so that half are on one side of the line and half are on the other. Does that best represent our data?

[Figure: a scatter plot with a line of best fit drawn through the points (source: BBC)]

Here we see some pretty linear data, and a line through it that looks like it pretty much divides our data evenly. How can we be sure, though, that this is best representing our data? What if we had some data whose line perfectly divided it in half, but one half was extremely close to the line, and one extremely far away? We need a mathematical way to understand "best fit."

Squared Error

We can think about representing our data as the idea that our line should be as close as possible to every single point. How, though, do we represent the distance from our line to each point? Well, at each x value, we could simply subtract our data's y value from the y value of our line (our estimate). Or, mathematically:

Take the ordered pair $(4, 16)$ and the line of best fit as $f(x) = 3x+9$. Calculate the error.
$f(4) = 3(4) + 9 = 21$
$y = 16$
error = $f(4) - y = 21 - 16 = 5$
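The same calculation as a couple of lines of Python (the point and the line here are just the made-up ones from the example):

def f(x):
    return 3*x + 9          #our example line of best fit

error = f(4) - 16           #the line's estimate minus the data's actual y value
print(error)                #prints 5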

However, there's a problem with this method. Let's say we have three points in our dataset, and their errors are $[-6, 2, 4]$. What if we wanted to ask, "What's the total amount our line is off by?" Presumably, we would add them up: $-6 + 2 + 4 = $ ... zero! According to this, our line isn't off at all, even though the individual points were off by -6, 2, and 4! That clearly doesn't capture what we mean.

To counteract this quirk, we take the squared error. This is, quite literally, the error we just calculated, squared. We do this because any negative error, when multiplied by itself, becomes positive. That means that when we add up our errors, they can no longer cancel out to 0 (since there are no negative numbers left). Let's try it:

Errors: $[-6, 2, 4]$
Sum of errors: $-6 + 2 + 4 = 0$
Squared errors: $[36, 4, 16]$
Sum of squared errors: $36 + 4 + 16 = 56$

Much better. If you notice, this also has the effect of emphasizing larger error values. This can be both a good thing and a bad thing.
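Here's that same arithmetic as a quick Python sketch (the three error values are the made-up ones from above):

import numpy as np

errors = np.array([-6, 2, 4])   #the made-up errors from above
print(errors.sum())             #0  -- the raw errors cancel each other out
print((errors**2).sum())        #56 -- squared errors can't cancel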

Putting it all together

So we know that our line of best fit should minimize the distance between each point and that line, and we also know that the way to define "distance" is by using squared error. So what we're really looking to do is minimize the squared error. How do we do that? Well, minimization is a concept you'll hear quite a lot about in calculus, but it's simply beyond the scope of this workshop. Following linear regression, however, I've included another component to this tutorial that not only gives us a way to do this with essentially zero calculus, it also builds on this idea of minimizing error to introduce some of the concepts that lay the foundation for many data science techniques used today.

In summary, if you take anything away from this workshop, let it be that the idea of linear regression is simply minimizing error to make our line of best fit as accurate as possible. There are a number of ways you can define error beyond our squared error, but the concept remains the same.
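To see what "minimizing squared error" means in practice, here's a small sketch that scores two candidate lines on some made-up points; the line with the smaller sum of squared errors is the better fit (all numbers here are invented for illustration):

import numpy as np

x = np.array([1, 2, 3, 4])              #made-up x values
y = np.array([2.1, 2.9, 5.2, 6.8])      #made-up y values

def sse(m, b, x, y):
    '''Sum of squared errors for the line y = m*x + b.'''
    predictions = m*x + b               #the line's estimate at each x
    return np.sum((y - predictions)**2) #square the errors, then add them up

print(sse(1.6, 0.2, x, y))   #one candidate line
print(sse(1.0, 1.0, x, y))   #another; the smaller number is the better fit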

How do we know how accurate it is?

By using another fun formula!

Correlation Coefficient

$$ r = \frac{n(\sum xy) - (\sum x\sum y)}{\sqrt{[n\sum x^2 - (\sum x)^2][n\sum y^2 -(\sum y)^2]}} $$

Brutal, I know, but bear with me. This is called the correlation coefficient. This value, referred to as r, is a measure of the strength of the relationship between two variables. Your r value will range between -1 and 1.

Interpreting the r value can be a bit tricky. I find it much easier to separate the sign from the magnitude. The absolute value of r (just the number, ignoring the +/- sign) represents the strength of the relationship. The closer this value is to 1, the stronger the relationship - the variables are very closely related. The closer this value is to 0, the weaker the relationship - the variables have almost nothing to do with each other.

Now we can bring in the sign of the value (positive or negative). If the sign is positive, there is a positive correlation: as x increases, y also tends to increase. If the sign is negative, there is a negative correlation: as x increases, y tends to decrease.

Take, once again, time studied versus course grade. Pretend we've done a study, gathered data on students, and found an r value of $0.76$. The magnitude is 0.76, which is quite strong, and the sign is positive. This means that as time studied increases, grades also tend to increase, and the relationship is fairly tight.

Now, let's compare absences to course grade. In our fake study, we've found that the r value is $-0.41$. The magnitude is 0.41, which indicates a moderate relationship, and the sign is negative: as absences increase, grades tend to decrease.
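If you'd rather not plug through that formula by hand, NumPy's corrcoef will compute r for you, which is handy for checking your work. A quick sketch with made-up numbers for the time-studied example:

import numpy as np

hours = np.array([1, 2, 3, 4, 5])         #made-up "time studied" values
grades = np.array([52, 60, 71, 75, 88])   #made-up "course grade" values

r = np.corrcoef(hours, grades)[0, 1]   #corrcoef returns a 2x2 matrix; we want the off-diagonal entry
print(r)                               #close to 1: a strong positive relationship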

Coefficient of Determination

We can also look at a value labelled $r^2$, called the coefficient of determination. It represents how well our regression's line of best fit predicts our actual y values.

$r^2$ is the easiest thing we've calculated so far: it's literally $r$... squared! Remember that $r$ falls between $[-1, 1]$, so when we square it, $r^2$ must fall between $[0, 1]$, because squaring a negative number makes it positive.

In short, $r^2$ is simply a measure of the predictive ability of our model.
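For instance, squaring the two made-up r values from earlier:

print(0.76**2)     #about 0.58
print((-0.41)**2)  #about 0.17 -- the sign disappears, so r^2 always lands between 0 and 1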

Can we get to programming already?

Sounds good to me.


Programming Linear Regression

Let's import all of our usual packages.

In [147]:
from math import sqrt #for taking the square root in our r formula
import pandas as pd #pandas as always
import numpy as np #numpy as always

#make our plots appear inline in the notebook
%matplotlib inline 
from IPython.display import display #viewing dataframes
import matplotlib.pyplot as plt #matplotlib for graphs
from pandas.plotting import scatter_matrix #see relationships between our variables
from mpl_toolkits.mplot3d import Axes3D #surprise for later
plt.style.use('fivethirtyeight') #make our graphs look pretty

For this tutorial, I've prepared some NBA data for us to use. Apologies if you're not a sports fan; sports are just handy for data analysis because of how much data is available (and because the stats are pretty straightforward to understand).

Let's read in our data and take a look.

Legend


Team - Self-explanatory
GP - Games played
Win_pct - Percentage of games won
PD - Point differential (points scored - points allowed)
PF - Points for (points scored)
PA - Points allowed

In [148]:
data = pd.read_csv('basketballdata.csv')
display(data.head(5))
TEAM GP Win_pct PD PF PA
0 Golden State 9 0.667 8.3 120.0 111.7
1 Brooklyn 9 0.333 -5.0 114.3 119.3
2 Washington 8 0.500 2.6 113.5 110.9
3 Indiana 9 0.556 2.1 111.3 109.2
4 Orlando 9 0.667 4.7 111.3 106.6

Now that we've loaded our data, let's try to visualize it and look for some relationships. With sports, at the end of the day, we're usually trying to predict win percentage. It's a good question to ask: what makes teams win?

Let's look for the Win_pct row, and scan across it for any linear-looking plots.

In [150]:
scatter_matrix(data);

As we can see here, it seems that point differential looks to have a pretty linear relationship with win percentage. So, we should be able to use point differential to predict win percentage - and we do that with a linear regression!

Let's narrow our data down to just what we care about: point differential and win percentage. We're also going to do something extremely important: split our data into training and testing sets. It is vital that you do this, to prevent something called overfitting.

When you study for a test, do you memorize how to do the specific questions from the textbook, or do you learn the concepts so you can solve any problem on the topic? I'd hope the latter, because the former is an example of overfitting: you're fitting your "model" (your brain) to "predict" (answer) only those specific questions, rather than truly learning the "data" (the concept). We'll take the first half of the rows from data to train our model.
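As a side note, on a larger project you'd typically shuffle the rows before splitting rather than just taking the first half. Here's a minimal pandas sketch of what a shuffled split could look like with a DataFrame like our data (not used in the rest of this tutorial):

train = data.sample(frac=0.5, random_state=0)   #randomly pick half the rows for training
test = data.drop(train.index)                   #everything left over is held out for testing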

In [158]:
pred = data[['TEAM', 'PD','Win_pct']].copy()[:15]
display(pred.head(5))
TEAM PD Win_pct
0 Golden State 8.3 0.667
1 Brooklyn -5.0 0.333
2 Washington 2.6 0.500
3 Indiana 2.1 0.556
4 Orlando 4.7 0.667

Note that the following is technically unnecessary, but I find it helpful when doing linear regression by hand.

Going back to our original equation, there are six quantities we'll be using: $\sum x$, $\sum y$, $\sum xy$, $\sum x^2$, $\sum y^2$, and $n$. I like to create a table of them for reference.

In [4]:
names = ['X', 'Y', 'XY', 'X^2', 'Y^2', 'N']
values = [np.sum(pred['PD']), np.sum(pred['Win_pct']), np.sum(pred['PD'] * pred['Win_pct']), np.sum(pred['PD']**2), np.sum(pred['Win_pct']**2), len(pred)]

pd.DataFrame(values, names)
Out[4]:
0
X 1.30000
Y 15.01000
XY 25.66280
X^2 1039.75000
Y^2 8.38251
N 30.00000

And with that, let's jump right in. We'll calculate our terms, and plug them into the formula. We'll go back to $y = mx + b$ form as we're likely all more familiar with it.

In [159]:
def linear_regression(x, y):
    '''
    Let's define all the variables we need in our slope formula.
    Sigma (sum) for X, Y, XY, X^2, Y^2, as well as n (the length of our data).
    '''
    x_sum = np.sum(x)
    y_sum = np.sum(y)
    xy_sum = np.sum(x*y)
    x2_sum = np.sum(x**2)
    y2_sum = np.sum(y**2)
    n = len(x)
    
    #Calculate the slope of our line y=mx+b
    m = (n*xy_sum - x_sum*y_sum) / (n*x2_sum - x_sum**2)
    
    #Calculate the averages of x and y, used to compute the intercept b
    x_avg = np.mean(x)
    y_avg = np.mean(y)
    
    #Calculate the intercept of our line y=mx+b
    b = y_avg - m * x_avg

    #Print our line
    print("y =",m,"x +",b)
    
    #Use our existing values to compute R, which in turn gives us R^2
    r = (n*xy_sum - x_sum*y_sum) / sqrt((n*x2_sum - (x_sum**2))*(n*y2_sum - y_sum**2))
    r2 = r**2
    
    print("R: ", r)
    print("R^2: ", r2)
    
    return m, b #Return the slope (m) and intercept (b) of our line

x = pred['PD']
y = pred['Win_pct']

m, b = linear_regression(x, y)
y = 0.015077079527353557 x + 0.5344787960054835
R:  0.7438121201131587
R^2:  0.553256470027232

Excellent! We now have our line of best fit. Our r and r2 values show that the relationship between point differential and win percentage is quite strong on our training data. Now, let's make some predictions on our testing data to make sure that our model performs well on data it hasn't seen.
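Before we do, if you'd like to see the fit on the training data visually, here's an optional quick sketch that reuses the x, y, m, and b variables from above:

plt.scatter(x, y)                       #the training points: point differential vs. win percentage
plt.plot(x, m*x + b, color='black');    #our fitted line y = mx + b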

In [166]:
test = data[['TEAM', 'PD','Win_pct']].copy()[15:]
predictions = m*test['PD'] + b

plt.scatter(test['Win_pct'], predictions); #actual win percentage (x-axis) vs. our predictions (y-axis)

Awesome! It looks like our model captured the relationship between our variables and still performs well on data it never saw during training, so it doesn't appear to be overfitting.
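Eyeballing the scatter plot is a good start, but if you'd like to put a number on it, you can reuse the correlation idea on the test set. A quick sketch using the test and predictions variables from above:

r_test = np.corrcoef(test['Win_pct'], predictions)[0, 1]   #correlation between actual and predicted win percentage
print("Test R:  ", r_test)
print("Test R^2:", r_test**2)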

Conclusions - kind of

That is, at its heart, linear regression: you're creating a line of best fit with as little error as possible between your predictions and your actual y values. There's a fair amount of work involved, though, and it doesn't always scale; how often will one variable so neatly predict another? And how often will that alone produce interesting insights?

While you now know linear regression, there's another, very similar concept that some of you may find useful. If you'd like to simply stick to linear regression and call it a day, I don't blame you, and you can exit out here.

If you're interested, though, there's a method very similar to linear regression that lays the foundation for many modern data science techniques: gradient descent. Stay tuned for an upcoming workshop on it!