Linear Regression

##linear_regression ##assumptions ##r_squared ##adjusted_r_square

Rajdip Sur Jan 17 2021 · 4 min read

Linear Regression is one of the most fundamental algorithms in machine learning, and a natural place for a beginner to start. We can consider it the door to the magical world of Machine Learning.

What is Linear Regression?

Linear regression models a linear relationship between a dependent variable (the target variable) and one or more independent variables (the feature variables). Suppose we have a dataset of the heights and weights of the students of a school. We know there is a positive relationship between height and weight, so with the help of height we can estimate weight. Here height is our independent variable and weight is the dependent variable. With linear regression we can build a simple linear equation that helps us predict a weight value close to the actual value.

Red dots: data points. Blue line: fitted linear regression line.
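As a minimal sketch of the height and weight example, the snippet below fits a straight line with NumPy's `polyfit`. The five height/weight pairs are made-up illustrative numbers, not measurements from a real school, and the names `heights_cm`, `weights_kg` and `predict_weight` are assumptions introduced here.

```python
import numpy as np

# Made-up illustrative data (NOT real measurements): height in cm, weight in kg.
heights_cm = np.array([150.0, 155.0, 160.0, 165.0, 170.0])
weights_kg = np.array([50.0, 53.0, 57.0, 60.0, 65.0])

# Fit weight = m * height + b by least squares (a degree-1 polynomial).
m, b = np.polyfit(heights_cm, weights_kg, deg=1)


def predict_weight(height_cm):
    """Estimate weight (kg) from height (cm) using the fitted line."""
    return m * height_cm + b


print(f"slope m = {m:.3f}, intercept b = {b:.3f}")
print(f"estimated weight at 162 cm: {predict_weight(162.0):.1f} kg")
```

The positive slope reflects the positive relationship between height and weight described above.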

Assumptions of Linear Regression:

  • Linearity: The relationship between X and the mean of Y is linear. We can check linearity with a scatter plot.
  • Homoscedasticity: The variance of the residuals (the difference between the predicted and actual values) is constant for any value of X.
  • Independence: The observations are independent of each other. In addition, there should be no multicollinearity among the independent variables, i.e. the independent variables should not be highly correlated with each other. We can identify multicollinearity by looking at the correlation matrix or the variance inflation factor (VIF).
  • Normality: The error (residual) terms are normally distributed. This assumption can be checked with a histogram or a Q-Q plot of the residuals.
  • Equation of linear regression:

                                Y = m . X + b

    Here, Y = Dependent variable (the target variable)
                 X = Independent variable (the feature)
                 m = Coefficient of X (slope of the regression line), which represents the relationship between X and Y.
                 b = Intercept of the regression line (the value of Y when X = 0). The residual, or error term, is the part of Y that the line does not explain.
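The multicollinearity check named in the assumptions list can be sketched with NumPy alone. The helper names `r_squared` and `vif` and the synthetic columns `x1`, `x2`, `x3` are assumptions made for illustration. The variance inflation factor of a feature is 1 / (1 - R_j^2), where R_j^2 comes from regressing that feature on the remaining ones.

```python
import numpy as np


def r_squared(X, y):
    """R-squared of a least-squares fit of y on X (with an intercept)."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    rss = np.sum((y - A @ coef) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    return 1.0 - rss / tss


def vif(X):
    """Variance inflation factor of each column of X."""
    X = np.asarray(X, dtype=float)
    factors = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)      # all columns except column j
        factors.append(1.0 / (1.0 - r_squared(others, X[:, j])))
    return np.array(factors)


# Synthetic features, generated only to illustrate the check.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)                     # unrelated to x1
x3 = x1 + 0.05 * rng.normal(size=200)         # almost a copy of x1

vif_independent = vif(np.column_stack([x1, x2]))
vif_collinear = vif(np.column_stack([x1, x2, x3]))
print("VIF without the near-duplicate:", vif_independent)
print("VIF with the near-duplicate:   ", vif_collinear)
```

A common rule of thumb treats a VIF above roughly 5 to 10 as a sign of problematic multicollinearity; the exact threshold is a convention, not a hard rule.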

    Now the question is: how can we find the best-fitting line for our regression model?

    Regression Line

    In the above diagram,

  • Blue Dots = Data points.
  • Red Line = Regression Line.
  • Blue Line = The residual or error: the difference between the value predicted by the regression line and the actual value. We denote this difference as D.
  • Now we need to calculate the sum of the squares of the residuals, which we call the LOSS FUNCTION of the regression model. We need to minimize this loss: the line that gives the minimum value is selected as the best-fit line.
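The idea of picking the line with the smallest sum of squared residuals can be sketched as follows. The data points and the three candidate lines are made up for illustration, and `rss` is a helper name introduced here.

```python
import numpy as np


def rss(m, b, x, y):
    """Residual sum of squares for the candidate line y = m * x + b."""
    d = y - (m * x + b)          # residuals D, one per data point
    return float(np.sum(d ** 2))


# Made-up illustrative points.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Three hypothetical candidate lines, given as (m, b) pairs.
candidates = [(1.0, 1.0), (2.0, 0.0), (3.0, -2.0)]
losses = {(m, b): rss(m, b, x, y) for m, b in candidates}

# The candidate with the smallest loss is the best fit among these three.
best = min(losses, key=losses.get)
for line, loss in losses.items():
    print(f"m={line[0]}, b={line[1]}: RSS={loss:.2f}")
print("best candidate:", best)
```

Comparing a handful of candidates only illustrates the loss; the least squares method finds the minimizing line directly instead of searching.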

    Residual Sum of Squares (RSS) = D_1^2 + D_2^2 + ... + D_n^2, i.e. the sum of the squared residuals over all n data points.

    The visual below shows how the loss is minimized to find the best-fitting line in a regression model.

    Let's try to understand this mathematically (Least Squares Method)

    In the language of statistics, this is called the Least Squares Method, which is shown below.

    According to the first-order condition for a minimum, we take the first-order partial derivatives of the loss with respect to m and b separately and set them equal to zero. From there we can get the optimal values of m and b.

    Now equating these derivatives to zero.
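Written out, and assuming the loss is the RSS of the line Y = mX + b over n data points (x_i, y_i), the first-order conditions take the following standard form. The symbols L, x_i, y_i and the bars for means are notation assumed here.

```latex
\begin{aligned}
L(m, b) &= \sum_{i=1}^{n} \left( y_i - m x_i - b \right)^2 \\
\frac{\partial L}{\partial m} &= -2 \sum_{i=1}^{n} x_i \left( y_i - m x_i - b \right) = 0 \\
\frac{\partial L}{\partial b} &= -2 \sum_{i=1}^{n} \left( y_i - m x_i - b \right) = 0 \\
\Rightarrow \quad b &= \bar{y} - m \bar{x}, \qquad
m = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}
\end{aligned}
```

The second condition gives b in terms of m; substituting it into the first condition gives the expression for m.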

    The same equation can be written in matrix form as:
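As a sketch of the matrix form, the two first-order conditions can be stacked into a 2 x 2 linear system (often called the normal equations) and solved numerically. The data below is made up for illustration, and the variable names are assumptions.

```python
import numpy as np

# Made-up illustrative points.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
n = len(x)

# Normal equations obtained from the two first-order conditions:
#   [ sum(x^2)  sum(x) ] [m]   [ sum(x*y) ]
#   [ sum(x)    n      ] [b] = [ sum(y)   ]
A = np.array([[np.sum(x ** 2), np.sum(x)],
              [np.sum(x), n]])
rhs = np.array([np.sum(x * y), np.sum(y)])
m, b = np.linalg.solve(A, rhs)

# Closed-form solution of the same system, for comparison.
m_closed = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b_closed = y.mean() - m_closed * x.mean()

print(f"matrix solve: m = {m:.4f}, b = {b:.4f}")
print(f"closed form:  m = {m_closed:.4f}, b = {b_closed:.4f}")
```

Both routes give the same slope and intercept, since they solve the same pair of equations.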

    According to the second-order condition, the second-order derivatives should be positive (more precisely, the matrix of second derivatives should be positive definite). Let's see whether the condition is satisfied.

    From the above image, we can see that the second-order condition is also satisfied.
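For reference, a sketch of the second-order check under the same assumption (loss L equal to the RSS of Y = mX + b over n points) is given below. The symbol H for the matrix of second derivatives is notation assumed here.

```latex
\begin{aligned}
\frac{\partial^2 L}{\partial m^2} &= 2 \sum_{i=1}^{n} x_i^2 > 0, \qquad
\frac{\partial^2 L}{\partial b^2} = 2n > 0, \qquad
\frac{\partial^2 L}{\partial m \, \partial b} = 2 \sum_{i=1}^{n} x_i \\
\det H &= 4 \left( n \sum_{i=1}^{n} x_i^2 - \Big( \sum_{i=1}^{n} x_i \Big)^2 \right)
        = 4 n \sum_{i=1}^{n} (x_i - \bar{x})^2 > 0
\end{aligned}
```

The determinant is strictly positive as long as the x_i are not all equal, so H is positive definite and the stationary point is a minimum.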

    When we have two independent variables and one dependent variable, the gradient descent diagram looks like the one shown below.
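Gradient descent is only named in the text, so the following is a minimal sketch rather than the author's procedure. It assumes the loss is the mean of the squared residuals for the single-feature line Y = mX + b, and the learning rate, step count and data are illustrative choices.

```python
import numpy as np


def gradient_descent(x, y, lr=0.05, steps=5000):
    """Minimize the mean squared residual of y = m * x + b by gradient descent."""
    m, b = 0.0, 0.0
    n = len(x)
    for _ in range(steps):
        error = (m * x + b) - y                  # predicted minus actual
        grad_m = (2.0 / n) * np.sum(error * x)   # d(loss)/dm
        grad_b = (2.0 / n) * np.sum(error)       # d(loss)/db
        m -= lr * grad_m                         # step against the gradient
        b -= lr * grad_b
    return m, b


# Made-up illustrative points.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

m, b = gradient_descent(x, y)
print(f"gradient descent: m = {m:.4f}, b = {b:.4f}")
```

On this data the iterates settle at the same slope and intercept that the least squares equations give. A learning rate that is too large would make the updates diverge instead.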

    Another question now arises: how do we measure the accuracy of a regression model?

    R-Squared Statistic:
    R-squared is a statistical measure that describes how close the data points are to the regression line. For a least squares line with an intercept, evaluated on the data it was fitted to, the R-squared value lies between 0 and 1.

    Formula of the R-squared statistic: R-square = 1 - RSS / TSS

    To understand the formula of the R-squared statistic, we first need to know about RSS and TSS.

    RSS and TSS shown diagrammatically.

      The residual sum of squares (RSS) measures the amount of variation in a data set that is not explained by the regression model.

    Residual Sum of Squares = Σ(Yi - Y_hat_i)^2      ( Y = actual value, Y_hat = predicted value )

      The Total Sum of Squares (TSS) tells you how much variation there is in the dependent variable (Y).

    Total Sum of Squares = Σ(Yi - mean of Y)^2

    So when RSS = TSS,

    then R-square = 0,
    which implies that the regression line is unable to explain any of the variation in the dataset.

    Again, when RSS = 0,
    then R-square = 1,
    which implies that the regression line explains 100% of the variation in the data.
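The two extreme cases can be checked with a short sketch. The function name `r_squared` and the sample values are assumptions for illustration: predicting the mean of Y for every point gives RSS = TSS, and predicting every point exactly gives RSS = 0.

```python
import numpy as np


def r_squared(y, y_hat):
    """R-squared = 1 - RSS / TSS for actual values y and predictions y_hat."""
    rss = np.sum((y - y_hat) ** 2)        # unexplained variation
    tss = np.sum((y - np.mean(y)) ** 2)   # total variation in y
    return 1.0 - rss / tss


# Made-up illustrative target values.
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

mean_only = np.full_like(y, y.mean())     # always predict the mean: RSS = TSS
perfect = y.copy()                        # predict every point exactly: RSS = 0

print("R-squared, mean-only predictions:", r_squared(y, mean_only))
print("R-squared, perfect predictions:  ", r_squared(y, perfect))
```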

    Adjusted R-squared Statistic:

     Whenever we add a new independent variable to our model, the R-squared value will increase or, at worst, stay the same, even if the new variable has no correlation with the dependent variable. So we cannot always rely on the R-squared statistic. Let's try to understand this mathematically.
    Suppose we have two models, with one and two independent variables respectively.
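One way to sketch the argument, assuming both models are fitted by least squares on the same data, is shown below. The coefficient names m_1, m_2 and the subscripts on RSS are notation assumed here.

```latex
\begin{aligned}
\mathrm{RSS}_1 &= \min_{m_1,\, b} \sum_{i=1}^{n} \left( y_i - m_1 x_{i1} - b \right)^2 \\
\mathrm{RSS}_2 &= \min_{m_1,\, m_2,\, b} \sum_{i=1}^{n} \left( y_i - m_1 x_{i1} - m_2 x_{i2} - b \right)^2
\;\le\; \mathrm{RSS}_1 \\
\Rightarrow \quad R_2^2 &= 1 - \frac{\mathrm{RSS}_2}{\mathrm{TSS}} \;\ge\; 1 - \frac{\mathrm{RSS}_1}{\mathrm{TSS}} = R_1^2
\end{aligned}
```

The inequality holds because setting m_2 = 0 in the second model reproduces the first model, so the larger model can never fit the training data worse. TSS depends only on Y and is the same for both.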

    So we can conclude that whenever we increase the number of independent variables, the R-squared statistic never decreases, and in practice it almost always increases.

    To rectify this problem, we use the adjusted R-squared statistic, which penalizes the addition of independent variables that do not help explain the dependent variable.
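The equation for this statistic did not survive as text. The usual definition, which agrees with the remark about p = 0 that follows, is sketched below, where n (the number of observations) and p (the number of independent variables) are the conventional symbols and are assumed here.

```latex
R^2_{\text{adj}} = 1 - \left( 1 - R^2 \right) \frac{n - 1}{n - p - 1}
```

Since (n - 1) / (n - p - 1) is at least 1 and grows with p, every extra variable increases the penalty unless it raises R-squared enough to compensate.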

    In the above equation we can see that when p = 0 (where p is the number of independent variables), adjusted R-square = R-square.
    Thus, adjusted R-square <= R-square.
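The behaviour of both statistics can be sketched on synthetic data. This assumes the usual adjusted R-squared definition, 1 - (1 - R^2)(n - 1)/(n - p - 1); the helper names and the random data are illustrative only.

```python
import numpy as np


def fit_r_squared(X, y):
    """R-squared of a least-squares fit of y on X (with an intercept)."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    rss = np.sum((y - A @ coef) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    return 1.0 - rss / tss


def adjusted_r_squared(r2, n, p):
    """Adjusted R-squared for n observations and p independent variables."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


# Synthetic data: y depends on x1 only; noise_feature is unrelated to y.
rng = np.random.default_rng(42)
n = 50
x1 = rng.normal(size=n)
noise_feature = rng.normal(size=n)
y = 3.0 * x1 + rng.normal(size=n)

r2_one = fit_r_squared(x1.reshape(-1, 1), y)
r2_two = fit_r_squared(np.column_stack([x1, noise_feature]), y)
adj_one = adjusted_r_squared(r2_one, n, 1)
adj_two = adjusted_r_squared(r2_two, n, 2)

print(f"one feature:  R2 = {r2_one:.4f}, adjusted R2 = {adj_one:.4f}")
print(f"two features: R2 = {r2_two:.4f}, adjusted R2 = {adj_two:.4f}")
```

R-squared cannot go down when the unrelated feature is added, while adjusted R-squared will often, though not always, drop.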

    So the adjusted R-squared statistic is a more reliable measure than plain R-squared when comparing regression models with different numbers of independent variables.

    Thank you for reading this article.
