In this notebook, we're going to look at how to implement linear regression. We're going to look at the 1-dimensional case, where we use a single feature to estimate the value of a related variable. Once you're comfortable with the single-variable case, try our notebook on linear regression with multiple features!

Import Python modules

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

Generate Data for Univariate Regression

In [2]:
n = 100 #Number of observations in our dataset (we'll split it into training and testing sets later)

#Assign True parameters to be estimated
trueIntercept = 10
trueGradient = -4
In [3]:
x = np.random.uniform(0, 100, n)
In [4]:
y = trueIntercept + trueGradient*x + np.random.normal(loc=0, scale=40, size = n) #This is the true relationship that we're trying to model
data = pd.DataFrame({'X':x, 'Y':y}) #Put the two arrays into a dataframe to make it easy to work with
data.head(10) #Inspect the first ten rows of our dataset 
Out[4]:
X Y
0 31.506174 -132.567831
1 76.400537 -317.491894
2 99.337868 -309.746221
3 83.106415 -300.432479
4 35.905641 -139.032556
5 11.756495 -6.963557
6 99.994192 -436.045066
7 57.838770 -211.715518
8 27.014948 -113.085993
9 23.944752 -75.823721

Quickly plot the data so that we know what it looks like

In [5]:
plt.scatter(data['X'], data['Y'])
plt.show()

There appears to be a fairly linear relationship between X and Y in this dataset, so linear regression is appropriate without any feature engineering.

The Algebra

To fit a linear regression model to data $(x_1, y_1), (x_2, y_2),..., (x_n, y_n)$ such that $y_i = \alpha + \beta x_i + \epsilon_i$, where $\epsilon_i$ is the ith error term, the Least-Squares estimates for the parameters $\alpha, \beta$ are:

$$\hat \beta = \frac{\sum^{n}_{i=1}(x_i - \bar x)(y_i - \bar y)}{\sum^{n}_{i=1}(x_i - \bar x)^2}$$

$$\hat \alpha = \bar y - \hat \beta \bar x $$

Once we've calculated $\hat \alpha$ and $\hat \beta$, we can estimate the label corresponding to a newly observed feature, $x^*$, as:

$$ \hat y = \hat \alpha + \hat \beta \times x^* $$
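Before we wrap this up in a class, here's a minimal sketch of these two formulas applied directly with NumPy, assuming the x and y arrays generated in the cells above are still in scope. The comparison against np.polyfit (which fits the same least-squares line) is just a sanity check, not part of the method itself.

xBar, yBar = np.mean(x), np.mean(y)

#Apply the closed-form estimators from the equations above
betaHatCheck = np.sum((x - xBar)*(y - yBar)) / np.sum((x - xBar)**2)
alphaHatCheck = yBar - betaHatCheck*xBar

#np.polyfit with deg=1 fits the same least-squares line, so these should agree
gradientRef, interceptRef = np.polyfit(x, y, deg=1)
print(betaHatCheck, gradientRef)     #Estimated gradient
print(alphaHatCheck, interceptRef)   #Estimated intercept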

How did we arrive at these values?

The short answer is: Multivariate Calculus! There are lots of existing resources online which do a good job of explaining how we derive the equations for linear regression - check out this link if you want to read more.
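As a quick sketch of that derivation: the least-squares estimates are the values of $\alpha$ and $\beta$ that minimise the sum of squared errors,

$$ S(\alpha, \beta) = \sum^{n}_{i=1}(y_i - \alpha - \beta x_i)^2 $$

Setting both partial derivatives to zero,

$$ \frac{\partial S}{\partial \alpha} = -2\sum^{n}_{i=1}(y_i - \alpha - \beta x_i) = 0, \qquad \frac{\partial S}{\partial \beta} = -2\sum^{n}_{i=1}x_i(y_i - \alpha - \beta x_i) = 0 $$

and solving the resulting pair of equations simultaneously gives the expressions for $\hat \alpha$ and $\hat \beta$ quoted above.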

In [6]:
class LinearRegression1D:
    
    def __init__(self, data, target, feature, trainTestRatio = 0.9):
        #data - a pandas dataset 
        #target - the name of the pandas column which contains the true labels
        #feature - the name of the pandas column which we will use to do the regression
        #trainTestRatio - the proportion of the entire dataset which we'll use for training
                    #   - the rest will be used for testing
        
        self.target = target
        self.feature = feature
        
        #Split up data into a training and testing set
        self.train, self.test = train_test_split(data, test_size=1-trainTestRatio)
        
        
    def fitLR(self):
        #Fit a linear regression to the training data
        #The goal is to estimate the gradient of the line of best fit and the intercept of that line
        
        
        
        #Calculate the estimate for the gradient 
        #if possible try to do it in one line by doing pointwise operations on the arrays
        self.betaHat = np.sum((self.train[self.feature] - np.mean(self.train[self.feature]))*(self.train[self.target] - np.mean(self.train[self.target])))/ np.sum((self.train[self.feature] - np.mean(self.train[self.feature]))**2)

        
        #Use the value of self.betaHat you've just calculated to calculate self.alphaHat, the estimated intercept
        self.alphaHat = np.mean(self.train[self.target]) - self.betaHat*np.mean(self.train[self.feature])
        
        return 0 #We've saved the parameter values as part of the class now
    
    def predict(self,x):
        #Given a vector of new observations x, predict the corresponding target values
        
        prediction = self.alphaHat + self.betaHat*x
        return prediction
    

        
        
        

The process should work as follows:

  1. Create an instance of the class LinearRegression1D and pass it some data
  2. Fit a model to the training data (the training and testing data should have been automatically created) by calling the fitLR method
  3. Predict the target value for each observation in the testing data and see how you did. Alternatively, you can plot the estimated line of best fit through the testing data for visual inspection
In [7]:
myModel = LinearRegression1D(data, 'Y', 'X')
myModel.fitLR() #If this returns a zero, then it should have finished. If not, we've got problems!
Out[7]:
0

Now let's see how we did!

We can print the parameters we estimated for the gradient and intercept. Recall that the true values are saved as trueIntercept and trueGradient

In [8]:
print(f'The true intercept value was {trueIntercept}, we estimated the value to be {myModel.alphaHat}')
print(f'The true gradient value was {trueGradient}, we estimated the value to be {myModel.betaHat}')
The true intercept value was 10, we estimated the value to be 12.7003819121893
The true gradient value was -4, we estimated the value to be -4.061376595153571

Not too bad! We didn't have a particularly large dataset to train on, so being in the right ballpark with the parameters is perfectly acceptable given the variability in the data. A larger dataset would most likely give more accurate estimates of the two values.
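If you want to see that for yourself, a rough sketch (reusing trueIntercept, trueGradient and the LinearRegression1D class defined above, with a hypothetical larger sample size) might look like this:

#Regenerate a much larger dataset with the same true parameters and refit
nLarge = 10000
xLarge = np.random.uniform(0, 100, nLarge)
yLarge = trueIntercept + trueGradient*xLarge + np.random.normal(loc=0, scale=40, size=nLarge)
dataLarge = pd.DataFrame({'X':xLarge, 'Y':yLarge})

biggerModel = LinearRegression1D(dataLarge, 'Y', 'X')
biggerModel.fitLR()
print(biggerModel.alphaHat, biggerModel.betaHat) #Typically closer to 10 and -4 than before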

Let's plot the line we have calculated against our data

In [9]:
x = np.arange(np.floor(data['X'].min()), np.ceil(data['X'].max()))

plt.scatter(data['X'], data['Y'], label = 'Data')
plt.plot(x, myModel.alphaHat + x*myModel.betaHat, label = 'Regression line')
plt.legend()
plt.show()

We can see that our regression line runs through the centre of the point cloud, indicating that the model fits nicely.

As a final check, let's plot the residuals (the error for each prediction) and see what they look like. Hopefully they will be evenly scattered either side of 0 - if not, then our predictions are systematically biased, which would suggest that a straight line doesn't capture the true relationship in the data.

In [10]:
testPred = myModel.predict(myModel.test[myModel.feature]) #What our models thinks the true values of y are, given x
residuals = testPred - myModel.test[myModel.target]

plt.scatter(list(range(residuals.shape[0])), residuals)
plt.ylabel('Residuals')
plt.xlabel('index')
plt.show()
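To put a single number on step 3 above ('see how you did'), one simple option is the mean squared error of the test-set predictions. This is a minimal sketch using the residuals we just computed:

#Summarise test-set performance with the mean squared error of the residuals
mse = np.mean(residuals**2)
print(f'Test-set mean squared error: {mse}')
print(f'Root mean squared error: {np.sqrt(mse)}') #On the same scale as Y; compare with the noise standard deviation of 40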