All you need to know about Linear Regression
Table of Contents:
- Simple Linear Regression (LR)
- Cost Function and Gradient Descent for LR
- Evaluation Metrics for LR
- Practical Implementation of LR
- Assumptions of LR
- Multiple Linear Regression (MLR)
- Considerations of MLR
- Multicollinearity
- Practical Implementation of MLR
- Influential Observation
- Cook’s distance
- Bias Variance Tradeoff
- Overfitting and Underfitting
- Regularization
- Ridge Regression
- Lasso Regression
- Difference between Ridge and Lasso Regression
- Elastic Net Regression
- Practical Implementation of Regularized Models
Simple Linear Regression (LR):
Linear regression models the linear relationship between the independent variable(s) and the dependent (continuous) variable. If there is a single input variable X (independent variable), the model is called simple linear regression.
The above graph presents the linear relationship between the output(y) variable and predictor(X) variables. The straight line is referred to as the best fit straight line. To calculate the best fit line, linear regression uses a traditional slope-intercept form:
- Yi = B0 + B1Xi, where Yi = Dependent variable, B0 = constant/Intercept, B1 = Slope, Xi = Independent variable.
- Constant/Intercept — The value of Y when Xi= 0
- Slope — With unit change in Independent variable(X) what will be the change in dependent variable(Y)
The difference between the observed value of the dependent variable (Yi) and the predicted value (on the best fit line) is called the residual (random error).
- Residual (RE) = Yi - Ypredicted
- Residual Sum of Squares (RSS) = Σ (Yi - Ypredicted)²
The best fit line is a line that has the least error which means the error between predicted values and actual values should be minimum. It is obtained by minimizing the Residual Sum of Squares(RSS).
Cost Function and Gradient Descent for LR:
The cost is the error in our predicted value. In Linear Regression, the Mean Squared Error (MSE) cost function is generally used: MSE = (1/n) Σ (Yi - Ypredicted)².
Gradient Descent is an optimization algorithm that minimizes the cost function (objective function) to reach the optimal solution. To find the optimal solution, we need to reduce the cost function (MSE) over all data points. This is done by updating the values of B0 and B1 iteratively until we reach the optimal (minimal) solution.
Take, for example, the equation Y = 5x + 4x². We can take the derivative of this equation with respect to x and equate it to zero: dY/dx = 5 + 8x = 0, which gives x = -5/8. This is the point where the equation is at its minimum, and substituting that value back into the equation gives the minimum value.
Since the cost function is quadratic in the coefficients, the graph of B1 vs J(B1) is a parabola with B1 on the x-axis and J(B1) on the y-axis. Similarly, the graph of B0 vs J(B0) is also a parabola. <Refer to the image below>
Our goal is to reach the minimum of the cost function, which happens when B1 equals Bmin. To start, we randomly initialize B1. The current values of B1 and B0 are then updated using the equation below until the loss function reaches a very small value or ideally 0 (which means 0 error, or 100% accuracy). The values of B1 and B0 that we are left with are the optimum values.
L is the learning rate. It controls how much the values of B0 and B1 change with each iteration. As we move closer to the minimum, the slope of the curve becomes less steep, which means that as we approach the minimum value we take smaller and smaller steps.
If the learning rate is very large, we are likely to overshoot the optimal solution. If it is too small, we will need too many iterations to converge to the best values. So choosing a good learning rate is crucial (L could be a small value like 0.0001 for good accuracy).
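To make the update rule concrete, here is a minimal NumPy sketch of gradient descent for simple linear regression. The toy X and y values, the learning rate, and the number of iterations are made up purely for illustration.
import numpy as np
X = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([3, 5, 7, 9, 11], dtype=float)   ## roughly y = 2x + 1
B0, B1 = 0.0, 0.0      ## initialize intercept and slope
L = 0.01               ## learning rate
n = len(X)
for _ in range(10000):
    y_pred = B0 + B1 * X
    error = y_pred - y
    dB0 = (2 / n) * np.sum(error)       ## partial derivative of MSE w.r.t. B0
    dB1 = (2 / n) * np.sum(error * X)   ## partial derivative of MSE w.r.t. B1
    B0 -= L * dB0                       ## move against the gradient
    B1 -= L * dB1
print(B0, B1)   ## should approach 1 and 2 for this toy data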
Evaluation Metrics for LR:
We need to evaluate the model on several metrics in order to build and deploy a generalized model. This allows us to better optimize performance, fine-tune the model, and get better results. Although accuracy on training data is required, it is also crucial that the model produces accurate results on unseen data; otherwise, the model is of little use.
Let’s understand some of the evaluation metrics:
1. Coefficient of Determination or R-Squared (R2):
R-Squared is a number that explains the amount of variation in the target variable that is explained/captured by the developed model.
R2 = 1 - (Residual Sum of Squares/Total Sum of Squares) = 1 – ( RSS/TSS )
Residual Sum of Squares (RSS) is defined as the sum of the squares of the residuals (the difference between the actual observed output and the predicted output) for each data point. The lower the value of RSS, the better the model's predictions. In other words, a regression line is the line of best fit if it minimizes the RSS value.
Residual Sum of Squares (RSS) = Σ(Yi - Ypredicted)²
If we focus on a single residual, we can say that it is the distance that is not captured by the regression line. Therefore, RSS as a whole gives us the variation in the target variable that is not explained by our model.
Total Sum of Squares (TSS) is defined as the sum of the squared deviations of the data points from the mean of the response variable.
Total Sum of Squares (TSS) = Σ(Yi - Ymean)²
We can see that TSS is very similar to the variance of Y. While the variance is the average of the squared sums of difference between actual values and mean, TSS is the total of the squared sums. Therefore, TSS as a whole gives the total variation in Y(Target Variable).
Now, if TSS gives us the total variation in Y, and RSS gives us the variation in Y not explained by X, then TSS - RSS gives us the variation in Y that is explained by our model! We can simply divide this value by TSS to get the proportion of variation in Y that is explained by the model. And this is our R-squared statistic!
R-squared = (TSS - RSS)/TSS = Explained variation/Total variation = 1 - Unexplained variation/Total variation = 1 - (RSS/TSS)
- R-squared gives the degree of variability in the target variable that is explained by the model or the independent variables. If this value is 0.8, then it means that the independent variables explain 80% of the variation in the target variable.
- It always ranges between 0 & 1. The higher the value of R-squared, the better the model fits the data.
The R-squared statistic suffers from a major flaw: its value never decreases no matter how many variables (even redundant ones) we add to our regression model. It either remains the same or increases with the addition of new independent variables. This happens because when a new variable is added, the LR model will assign a value to its coefficient such that RSS decreases, and as a result R-squared increases.
This clearly does not make sense, because some of the independent variables might not be useful in determining the target variable (very little correlation with the target variable). Adjusted R-squared deals with this issue.
Adjusted R-squared:
The Adjusted R-squared takes into account the number of independent variables used for predicting the target variable. In doing so, we can determine whether adding new variables to the model actually increases the model fit. Adjusted R-squared increases only when a newly added independent variable is significant and affects the dependent variable.
Adjusted R2 = 1 - [(1 - R2)(n - 1)/(n - k - 1)]
n represents the number of data points in our dataset
k represents the number of independent variables, and
R2 represents the R-squared values determined by the model
- Adjusted R-squared <= R-squared
- If adding a random independent variable does not help in explaining the variation in the target variable, the R-squared value remains the same (or might even increase), giving a false indication that this variable might be helpful in predicting the output. The Adjusted R-squared value, however, decreases, which indicates that this new variable is not actually capturing the trend in the target variable.
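As a quick sketch of the two formulas above, here is how R-squared and Adjusted R-squared can be computed directly from their definitions; the y_actual and y_predicted values and k are placeholders chosen only for illustration.
import numpy as np
y_actual = np.array([3.0, 5.0, 7.0, 9.0])        ## placeholder observed values
y_predicted = np.array([2.8, 5.1, 7.2, 8.9])     ## placeholder model predictions
k = 1                                            ## number of independent variables
n = len(y_actual)
rss = np.sum((y_actual - y_predicted) ** 2)      ## Residual Sum of Squares
tss = np.sum((y_actual - y_actual.mean()) ** 2)  ## Total Sum of Squares
r2 = 1 - rss / tss
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
print(r2, adj_r2)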
Practical Implementation of LR:
## Importing the libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
## Importing the dataset
data = pd.read_csv('Salary_Data.csv')
## Creating the dependent and independent variables
## (assuming the last column of the dataset is the target)
X = data.iloc[:, :-1].values
y = data.iloc[:, -1].values
## Splitting the data into Training and Test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)
## Training the Simple Linear Regression Model on training set
lr = LinearRegression()
result = lr.fit(X_train, y_train)
#fit method will train the Linear Regression model on the training set
## Predicting the Test set results
y_pred = lr.predict(X_test)
Checking the Evaluation Metrics
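One possible sketch of this step, using y_test and y_pred from the code above:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np
print('R-squared:', r2_score(y_test, y_pred))
print('MAE      :', mean_absolute_error(y_test, y_pred))
print('MSE      :', mean_squared_error(y_test, y_pred))
print('RMSE     :', np.sqrt(mean_squared_error(y_test, y_pred)))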
Assumptions of Linear Regression
For successful regression analysis, it’s essential to validate the following assumptions:
Linear Relationship : There needs to be a linear relationship between the dependent variable and independent variable(s).
What to do if this assumption is violated:
- Apply a nonlinear transformation to the independent and/or dependent variable. Common examples include taking the log, the square root, or the reciprocal of the independent and/or dependent variable.
- Add another independent variable to the model. For example, if the plot of x vs. y has a parabolic shape then it might make sense to add X² as an additional independent variable in the model.
Independence of Residuals (No Autocorrelation of Errors) : The error terms should not be dependent on one another (as in time-series data, where the next value depends on the previous one). Autocorrelation means the residual errors are correlated with one another, which violates this assumption. Simply put, there should not be any visible patterns in the error terms.
Normal distribution of residuals : The residuals should follow a normal distribution with a mean equal to zero or close to zero.
In order to check the normality of the residuals, we can plot their PDF using a KDE plot.
A kdeplot is a Kernel Density Estimate plot, which depicts the probability density function of continuous or non-parametric data variables.
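A minimal sketch of this check, assuming y_test and y_pred come from the model trained above:
import seaborn as sns
import matplotlib.pyplot as plt
residuals = y_test - y_pred   ## residuals of the fitted model
sns.kdeplot(residuals)        ## the curve should look roughly bell-shaped and centred at 0
plt.title('Distribution of residuals')
plt.show()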
We can also use QQ (Quantile — Quantile) plot for this purpose.
If the points on the plot roughly form a straight diagonal line (going from bottom left to top right), then the normality assumption is met. When deviations occur, they are often located at the lower or higher end of the line, whereas deviations in the middle are less likely.
Alternatively, we can use the code below to generate a QQ plot
import statsmodels.api as sm
sm.qqplot(residuals, line='45')   ## residuals computed above
plt.show()
If the points on the plot form a shape other than a straight line (such as an S shape or an exponential curve), our model is probably not correctly specified. We are probably missing some variables, or maybe the relationships are not actually linear!
What to do if this assumption is violated:
- First, verify that any outliers aren’t having a huge impact on the distribution. If there are outliers present, make sure that they are real values and that they aren’t data entry errors. If Possible, treat the outliers properly.
- Next, you can apply a nonlinear transformation to the independent and/or dependent variable. Common examples include taking the log, the square root, or the reciprocal of the independent and/or dependent variable.
Equal variance of residuals (Homoscedasticity — Having the same scatter/spread)
- The error terms must have constant variance.
- The presence of non-constant variance in the error terms is referred to as Heteroscedasticity. Generally, non-constant variance arises in the presence of outliers or extreme leverage values.
- Specifically, heteroscedasticity increases the variance of the regression coefficient estimates, but the regression model doesn’t pick up on this. This makes it much more likely for a regression model to declare that a term in the model is statistically significant, when in fact it is not.
What to do if this assumption is violated
- Transform the dependent variable. — One common transformation is to simply take the log of the dependent variable. For example, if we are using population size (independent variable) to predict the number of hospitals in a city (dependent variable), we may instead try to use population size to predict the log of the number of hospitals in a city. Using the log of the dependent variable, rather than the original dependent variable, often causes heteroskedasticity to go away.
- Redefine the dependent variable. — One common way to redefine the dependent variable is to use a rate, rather than the raw value. For example, instead of using the population size to predict the number of hospitals in a city, we may instead use population size to predict the number of hospitals per capita. In most cases, this reduces the variability that naturally occurs among larger populations since we’re measuring the number of hospitals per person, rather than the sheer amount of hospitals.
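To eyeball this assumption, a common sketch is a plot of residuals against fitted values: a roughly constant band around zero suggests homoscedasticity, while a funnel shape suggests heteroscedasticity. Here y_test and y_pred are assumed to come from the model trained above.
import matplotlib.pyplot as plt
residuals = y_test - y_pred
plt.scatter(y_pred, residuals)                    ## residuals vs. fitted values
plt.axhline(y=0, color='red', linestyle='--')     ## reference line at zero
plt.xlabel('Fitted values')
plt.ylabel('Residuals')
plt.show()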
Multiple Linear Regression(MLR)
Multiple linear regression is a technique to understand the relationship between a single dependent variable and multiple independent variables.
Yi = B0 + B1X1 + B2X2 + … + BiXi + e
Using matrix notation, the above equation can be expressed in a more compact and intuitive form : y=XB+e
Considerations of MLR
All four assumptions made for Simple Linear Regression still hold true for Multiple Linear Regression, along with a few additional considerations:
- Overfitting: When more and more variables are added to a model, the model may become far too complex and usually ends up memorizing all the data points in the training set. This phenomenon is known as the overfitting of a model. This usually leads to high training accuracy and very low test accuracy. This should be avoided. <More on this later in this article>
- Feature Selection: With more variables present, selecting the optimal set of predictors from the pool of given features (many of which might be redundant) becomes an important task for building a relevant and better model.
- Multicollinearity: It is the phenomenon where a model with several independent variables, may have some variables correlated with each other.
Multicollinearity
Multicollinearity is the occurrence of high intercorrelations among two or more independent variables in a multiple regression model.
The Key goal of regression analysis is to isolate the relationship between each independent variable and the dependent variable. The interpretation of a regression coefficient is that it represents the mean change in the dependent variable for each unit change in an independent variable when you hold all of the other independent variables constant.
The idea is that you can change the value of one independent variable and not the others. However, when independent variables are intercorrelated, it indicates that changes in one variable are associated with shifts in another variable. The stronger the correlation, the more difficult it is to change one variable without changing another.
It becomes difficult for the model to estimate the relationship between each independent variable and the dependent variable independently because the independent variables tend to change in unison.
As multicollinearity makes it difficult to find out which variable is actually contributing towards the prediction of the response variable, it can lead one to draw incorrect conclusions about the effect of a variable on the target variable.
Multicollinearity can be detected using the following methods:
- Pairwise Correlations: Checking the pairwise correlations between different pairs of independent variables can throw useful insights in detecting multicollinearity.
- Variance Inflation Factor (VIF): Pairwise correlations may not always be helpful, because it is possible that more than one variable is needed to fully explain one or more other variables. Therefore, we may use VIF, which shows how one independent variable relates to all the other independent variables.
- The general rule of thumb for VIF values is that if VIF > 10, the value is unquestionably high and should be eliminated. Additionally, if the VIF is closer to 5, it might be genuine but should first be examined. A good VIF value is one where VIF is less than 5.
- The square root of the variance inflation factor indicates how much larger the standard error increases compared to if that variable had 0 correlation to other predictor variables in the model.
- If the variance inflation factor of a predictor variable were 5.27 (√5.27 = 2.3), this means that the standard error for the coefficient of that predictor variable is 2.3 times larger than if that predictor variable had 0 correlation with the other predictor variables.
How is VIF calculated?
Set aside the dependent variable. Take one independent variable out and treat it as the target, with the remaining independent variables as predictors. Fit a regression model, find the R-squared for that model, and plug it into the formula below; this gives the VIF of that variable. Repeat for all the other variables to find their VIFs.
VIFi = 1 / (1 - Ri²)
## Where Ri² represents the unadjusted coefficient of determination for
## regressing the ith independent variable on the remaining ones.
A high R-squared value gives a high VIF, which means the variability in that variable is already explained very well by the other variables; hence we do not need that variable and can remove it.
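As a sketch of the procedure described above (X is assumed to be a DataFrame containing only the independent variables; the helper name vif_for_column is made up for illustration):
from sklearn.linear_model import LinearRegression
def vif_for_column(X, col):
    others = X.drop(columns=[col])    ## all remaining independent variables
    r2_i = LinearRegression().fit(others, X[col]).score(others, X[col])
    return 1 / (1 - r2_i)             ## VIFi = 1 / (1 - Ri^2)
## vifs = {col: vif_for_column(X, col) for col in X.columns}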
How to deal with Multicollinearity
- Remove some of the highly correlated independent variables.
- Linearly combine the independent variables, such as adding them together, if possible.
- LASSO and Ridge regression are advanced forms of regression analysis that can handle multicollinearity. <More on this later in this article>
Practical Implementation of MLR
## Importing the dataset
df = pd.read_csv('50_Startups.csv')
## Creating Dependent and Independent Variables
X = df.drop(columns = ['Profit']) ##Matrix of Independent variables
y = df['Profit'] ##Vector of Dependent variable
Encoding the Categorical data
Machine learning algorithms can only understand numbers; they cannot understand text. Hence, we need to convert categorical columns to numerical columns so that the algorithm can understand them. This process is called categorical encoding. There are various ways to do categorical encoding. Some of them are:
- One-Hot Encoding / Dummy Encoding — We apply it when:
- The categorical feature is not ordinal (like the states above)
- The number of categories is small (one-hot encoding creates as many columns as there are categories)
- Label Encoding — We apply it when:
- The categorical feature is ordinal (like Jr. kg, Sr. kg, Primary school, high school)
- The number of categories is quite large as one-hot encoding can lead to high memory consumption
states = pd.get_dummies(X['State'], drop_first= True)
X = X.drop(columns= 'State')
X = pd.concat([X, states], axis = 1)
X.head()
Feature scaling is not strictly needed for MLR, because it does not matter that some features have higher values than others; the coefficients will compensate to put everything on the same scale.
Testing for Multicollinearity
from statsmodels.stats.outliers_influence import variance_inflation_factor
vif_info = pd.DataFrame()
vif_info['columns'] = X.columns
vif_info['VIF'] = [variance_inflation_factor(X.values.astype(float), i) for i in range(X.shape[1])]
vif_info
Since R&D Spend and Marketing Spend have VIF values > 5, the two are highly correlated. We can remove either one and then check the VIF again. Let's drop Marketing Spend; the sketch below recomputes the VIF values after removing it.
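A possible sketch of that step (the column name 'Marketing Spend' is assumed to match the CSV header):
X = X.drop(columns=['Marketing Spend'])   ## drop the correlated feature
vif_info = pd.DataFrame()
vif_info['columns'] = X.columns
vif_info['VIF'] = [variance_inflation_factor(X.values.astype(float), i) for i in range(X.shape[1])]
vif_info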
Now the VIF values are <5. We can safely proceed to build the model now.
## Splitting the dataset into Training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
## Training the MLR model on Training set
from sklearn.linear_model import LinearRegression
mlr = LinearRegression()
mlr.fit(X_train, y_train)
## Predicting the Test set results
y_pred = mlr.predict(X_test)
Checking the Evaluation Metrics
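One possible sketch of this step, computing R-squared and Adjusted R-squared on the test set (y_test, y_pred and X_test come from the code above):
from sklearn.metrics import r2_score
r2 = r2_score(y_test, y_pred)
n, k = X_test.shape                               ## observations and independent variables
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
print('R-squared         :', r2)
print('Adjusted R-squared:', adj_r2)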
Influential Observation
- An influential observation is an observation in a dataset that, when removed, dramatically changes the coefficient estimates of a regression model.
- An outlier is a data point whose response (y) does not follow the general trend of the rest of the data; its error is high.
- A data point has extreme (high) leverage if it has "extreme" predictor (x) values.
- A data point is influential if it unduly influences any part of a regression analysis, such as the predicted responses, the estimated slope coefficients, or the hypothesis test results.
- Outliers and high leverage data points have the potential to be influential, but we generally have to investigate further to determine whether or not they are actually influential.
Cook’s Distance
The most common way to measure the influence of observations is to use Cook’s distance.
Essentially, Cook's distance measures how much all of the fitted values in the model change when the ith observation is deleted. It can be written as Di = (ei² / (p × MSE)) × (hii / (1 - hii)²), where ei is the ith residual, p is the number of fitted coefficients, and MSE is the mean squared error of the model.
- The leverage hii is a measure of the distance between the x value for the ith data point and the mean of the x values for all n data points.
- The larger the value for Cook’s distance, the more influential a given observation.
- A general rule of thumb is that any observation with a Cook's distance greater than 4/n (where n = total number of observations) is considered to be highly influential.
- Or, more simply, any observation with a Cook's distance greater than 1 is considered highly influential.
Let’s consider the scenario below:
import pandas as pd
#create dataset
df = pd.DataFrame({'x': [1, 2, 3, 4, 5, 7, 3, 2, 12, 11, 15, 14, 17, 22],
'y': [23, 24, 23, 19, 34, 35, 36, 36, 34, 32, 38, 41, 42, 180]})
The observation (22, 180) seems to be an outlier with high leverage. Let's check whether it is an influential observation by calculating Cook's distance.
import statsmodels.api as sm
#define response variable
y = df['y']
#define explanatory variable
x = df['x']
#add constant to predictor variables
x = sm.add_constant(x)
#fit linear regression model
model = sm.OLS(y, x).fit()
model.summary()
#create instance of influence
influence = model.get_influence()
#obtain Cook's distance for each observation
cooks = influence.cooks_distance
#display Cook's distances
df['cooks_distance'] = cooks[0]
df
The observation (22,180) has a value significantly greater than 1 for Cook’s distance, which tells us that it’s an influential observation.
Suppose we remove this value from the dataset, fit a new simple linear regression model, and calculate Cook's distance again.
As evident from the images above, the regression coefficients for the intercept and x both changed dramatically. This tells us that removing the influential observation from the dataset completely changed the fitted regression model.
Note :
It’s important to note that Cook’s Distance should be used as a way to identify potentially influential observations. Just because an observation is influential doesn’t necessarily mean that it should be deleted from the dataset.
First, you should verify that the observation isn't the result of a data entry error or some other odd occurrence. If it turns out to be a legitimate value, you can then decide whether it's appropriate to delete it, leave it as is, or replace it with an alternative value such as the median.
Bias Variance Tradeoff
Bias:
- Bias is a measure of how accurate the model is likely to be on future, unseen data. Given sufficient training data, complex models can make precise predictions, while overly naïve models are likely to underperform.
- Simply put, bias is the error on the training data (meaning that, on repeated samples of data, we are on average missing the true pattern).
- Model with high bias pays very little attention to the training data and oversimplifies the model. It always leads to high error on training and test data.
- Generally, linear algorithms have a high bias which makes them fast to learn and easier to understand but in general, are less flexible.
Variance:
- Variance is the sensitivity of the model towards training data, that is it quantifies how much the model will react when input data is changed.
- Ideally, a model should have lower variance which means that the model doesn’t change drastically after changing the training data(it is generalizable). Having higher variance will make a model change drastically even on a small change in the training dataset. That is, the model learns too much from the training data, so much so, that when confronted with new (testing) data, it is unable to predict accurately based on it.
- Simply put, variance shows up as error on test (new) data.
Total Error
- To build a good model, we need to find a good balance between bias and variance such that it minimizes the total error.
Total Error = Bias^2 + Variance + Irreducible Error
- Irreducible error is the error that can’t be reduced by creating good models. It is a measure of the amount of noise in our data. No matter how good we make our model, our data will have certain amount of noise or irreducible error that can not be removed.
The aim of any supervised machine learning algorithm is to achieve low bias and low variance, as such a model is more robust and performs better. An optimal balance of bias and variance never overfits or underfits the model.
The hope is that by increasing the bias a little, we gain a large decrease in variance, reaching an optimal complexity (a more generalized, interpretable, less complex model). This is called the Bias-Variance Tradeoff.
Overfitting and Underfitting
Overfitting:
- When a model learns each and every pattern and noise in the data to such an extent that it affects the performance of the model on the unseen future dataset, it is referred to as overfitting.
- The model fits the data so well that it interprets noise as patterns in the data.
- Overfitting causes the model to become specific rather than generic. This usually leads to high training accuracy (LOW BIAS) and very low test accuracy (HIGH VARIANCE).
There are several ways to prevent overfitting, which are stated below:
- Cross-validation — In cross-validation, you make a fixed number of folds (or partitions) of the data, run the analysis on each fold, and then average the overall error estimate.
- If the training data is too small, add more relevant and clean data.
- If there are too many features, do some feature selection and remove unnecessary features.
- Regularization — It adds a penalty term for large coefficients to the objective function, and hence pushes the coefficients of many variables towards zero, reducing model complexity.
Underfitting:
- When the model fails to learn from the training dataset and is also not able to generalize to the test dataset, it is referred to as underfitting. This type of problem can easily be detected by the performance metrics.
- When a model underfits, it is unable to find the hidden underlying patterns in the data. This usually leads to low training accuracy and low test accuracy (HIGH BIAS, LOW VARIANCE).
There are several ways to prevent underfitting, which are stated below:
- Increase the model complexity
- Increase the number of features in the training data
- Remove noise from the data.
To summarize,
- A model with a high bias error underfits data and makes very simplistic assumptions on it
- A model with a high variance error overfits the data and learns too much from it
- A good model is where both Bias and Variance errors are balanced
Regularization
Linear regression works by selecting coefficients for each independent variable that minimize a loss function. However, if the coefficients become too large (often when features are correlated with each other), the model can overfit the training dataset. Such a model becomes complex and does not generalize well to unseen data, and coefficient magnitudes tend to grow as model complexity increases. A large coefficient means that we are putting a lot of emphasis on that feature, i.e. that the feature is treated as a strong predictor of the outcome. When coefficients become too large, the algorithm starts modelling intricate relations to estimate the output and ends up overfitting the particular training data. To overcome this shortcoming, we use regularization, which penalizes large coefficients.
In regularization, we shrink the coefficients so that the model becomes less complex; the hope is that by adding a little bias to the model, we get a large reduction in variance in return.
To use penalized regression, we first need to standardize the features. This allows us to compare the magnitudes of the regression coefficients across the feature variables (a coefficient then corresponds to a change of one standard deviation in a variable, rather than a change of one unit, which differs from variable to variable).
The two most widely used types of regularization are called L1 and L2 regularization. The idea is quite simple. To create a regularized model, we modify the loss function by adding a penalizing term whose value is higher when the model is more complex.
In regression analysis, to fit our linear model, we need a measure of mismatch: the error at each training point. We want to measure the length of the error vector, and there are two common ways to do so.
Manhattan distance :
In a plane with p1 at (x1, y1) and p2 at (x2, y2), it is |x1 - x2| + |y1 - y2|.
The Euclidean distance :
√[(x2 - x1)² + (y2 - y1)²]
Ridge Regression
Ridge regression performs L2 regularization, i.e. it adds a factor of sum of squares of coefficients in the optimization objective.
Cost function = RSS + α * (sum of square of coefficients), α (alpha) is the parameter which balances the amount of emphasis given to minimizing RSS vs minimizing sum of square of coefficients. α can take various values:
- α = 0: The objective becomes same as simple linear regression. We’ll get the same coefficients as simple linear regression.
- 0 < α < ∞: The magnitude of α decides the weightage given to the different parts of the objective. The coefficients will lie somewhere between 0 and those of simple linear regression. As α is increased, the cost function penalizes the sum of squares of the coefficients more heavily, so the coefficients shrink. As the value of alpha increases, the model complexity reduces.
- As alpha increases, the magnitude of the coefficients reduces significantly; however, even though the coefficients become very, very small, they are NOT zero.
- We increase α in the hope that after adding a little bias to the model (making it less complex), the variance of the model decreases a lot and the overall MSE decreases.
Though higher values of alpha reduce overfitting, significantly high values can cause underfitting as well. Thus alpha should be chosen wisely. A widely accepted technique is cross-validation, i.e. alpha is iterated over a range of values and the value giving the highest cross-validation score is chosen.
In cross-validation, we train our model using a subset of the dataset and then evaluate it using the complementary subset.
The three steps involved in cross-validation are as follows:
- Reserve some portion of the dataset.
- Use the rest of the dataset to train the model.
- Test the model using the reserved portion of the dataset.
K-Fold Cross Validation:
In this method, we split the dataset into k subsets (known as folds), train the model on k - 1 of them, and leave one subset out for evaluating the trained model. We iterate k times, with a different subset reserved for testing each time.
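For example, a sketch of picking alpha for ridge regression with 5-fold cross-validation (the candidate alpha values are arbitrary, and X_train, y_train are assumed to come from an earlier train/test split):
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
alphas = [0.01, 0.1, 1, 10, 50, 100]              ## arbitrary candidate values
for alpha in alphas:
    scores = cross_val_score(Ridge(alpha=alpha), X_train, y_train, cv=5)
    print(alpha, scores.mean())                   ## pick the alpha with the best mean score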
Lasso Regression
LASSO stands for Least Absolute Shrinkage and Selection Operator. Lasso regression performs L1 regularization, i.e. it adds a factor of sum of absolute value of coefficients in the optimization objective.
Cost function = RSS + α * (sum of absolute value of coefficients)
Here, α (alpha) works similar to that of ridge and provides a trade-off between balancing RSS and magnitude of coefficients. Like that of ridge, α can take various values:
- α = 0: Same coefficients as simple linear regression
- α = ∞: All coefficients are zero. Because of the infinite weight on the sum of absolute values of the coefficients, any non-zero coefficient would make the cost function infinite.
- 0 < α < ∞: coefficients between 0 and that of simple linear regression
- The model complexity decreases with increase in the values of alpha.
- For the same values of alpha, the coefficients of lasso regression are much smaller than those of ridge regression (this does not always hold, but it does in many cases)
- For the same alpha, lasso has a higher RSS (a poorer fit) than ridge regression (this does not always hold, but it does in many cases)
Many of the coefficients (for least important features) are zero even for very small values of alpha.
Difference between Ridge and Lasso Regression
- The ridge coefficients are the simple linear regression coefficients shrunk by a factor; they never reach zero, only very small values.
- The lasso coefficients become zero in a certain range and are reduced by a constant amount, which explains their low magnitude in comparison to ridge.
- Ridge includes all of the features in the model. Thus, the major advantage of ridge regression is coefficient shrinkage and reducing model complexity. It is majorly used to prevent overfitting. Since it includes all the features, it is not very useful in case of exorbitantly high features, say in millions, as it will pose computational challenges.
- Along with shrinking coefficients, lasso performs feature selection as well as some of the coefficients become exactly zero, which is equivalent to the particular feature being excluded from the model. Since it provides sparse solutions, it is generally the model of choice for modelling cases where the features are in millions or more. In such a case, getting a sparse solution is of great computational advantage as the features with zero coefficients can simply be ignored.
- Ridge: It generally works well even in presence of highly correlated features as it will include all of them in the model but the coefficients will be distributed among them depending on the correlation.
- Lasso: It arbitrarily selects any one feature among the highly correlated ones and reduces the coefficients of the rest to zero. Also, the chosen variable can change with a change in model parameters. This generally doesn't work as well as ridge regression.
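As an illustration of the difference, here is a small sketch on synthetic data (the dataset parameters and the alpha value are arbitrary): lasso drives some coefficients to exactly zero, while ridge only shrinks them.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso
X_syn, y_syn = make_regression(n_samples=100, n_features=20, n_informative=5,
                               noise=10, random_state=0)
ridge = Ridge(alpha=10).fit(X_syn, y_syn)
lasso = Lasso(alpha=10).fit(X_syn, y_syn)
print('Zero coefficients (ridge):', np.sum(ridge.coef_ == 0))   ## typically 0
print('Zero coefficients (lasso):', np.sum(lasso.coef_ == 0))   ## typically > 0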
Elastic Net Regression
In lasso, some weights are reduced to zero, but others may be quite large. In ridge, weights are small in magnitude, but they are not reduced to zero. In elastic net, we may be able to get the best of both: making some of the weights zero while reducing the magnitude of the others.
Elastic net is basically a combination of both L1 and L2 regularization. So if we know elastic net, we can implement both ridge and lasso by tuning the parameters.
Let's say we have a bunch of correlated independent variables in a dataset; elastic net will simply form a group consisting of these correlated variables. Now if any one variable of this group is a strong predictor (meaning it has a strong relationship with the dependent variable), then we include the entire group in the model building, because omitting the other variables (as lasso does) might result in losing some information in terms of interpretation ability, leading to poor model performance.
Cost function = RSS + a * (sum of absolute value of coefficients) + b * (sum of square of coefficients) = RSS + a * L1 + b * L2
- We need to define alpha and l1_ratio when defining the model. Alpha and l1_ratio are the parameters you can set if you wish to control the L1 and L2 penalties separately.
- Alpha = a + b
- l1_ratio = a / (a + b)
- So when we change the values of alpha and l1_ratio, a and b are set accordingly so that they control the trade-off between L1 and L2
Let alpha (or a+b) = 1, and now consider the following cases:
- If l1_ratio = 1, therefore if we look at the formula of l1_ratio, we can see that l1_ratio can only be equal to 1 if a=1, which implies b=0. Therefore, it will be a lasso penalty.
- If l1_ratio = 0, implies a=0. Then the penalty will be a ridge penalty.
- For l1_ratio between 0 and 1, the penalty is the combination of ridge and lasso.
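A minimal sketch of defining such a model with scikit-learn (the alpha and l1_ratio values are arbitrary, and X_train, X_test are assumed to come from an earlier split):
from sklearn.linear_model import ElasticNet
## l1_ratio = 1 corresponds to a pure lasso penalty, l1_ratio = 0 to a pure ridge penalty
enet = ElasticNet(alpha=1.0, l1_ratio=0.5)
enet.fit(X_train, y_train)
enet_pred = enet.predict(X_test)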
Practical Implementation of Regularized Models
This example uses the Melbourne Housing dataset (Melbourne_Housing_Full.csv).
## Reading the data
data = pd.read_csv('Melbourne_Housing_Full.csv')
cols_to_use = ['Suburb', 'Rooms', 'Type','Method', 'SellerG',
'Regionname', 'Propertycount', 'Distance', 'CouncilArea',
'Bedroom2', 'Bathroom', 'Car', 'Landsize',
'BuildingArea', 'Price']
data = data[cols_to_use]
## Treating the missing values
data.loc[:,['Propertycount', 'Distance', 'Bedroom2', 'Bathroom', 'Car']] = data.loc[:,['Propertycount', 'Distance', 'Bedroom2', 'Bathroom', 'Car']].fillna(0)
data.loc[:,'Landsize'] = data.loc[:,'Landsize'].fillna(data.Landsize.mean())
data.loc[:,'BuildingArea'] = data.loc[:,'BuildingArea'].fillna(data.BuildingArea.mean())
data.dropna(inplace=True)
## Encoding the categorical features
data = pd.get_dummies(data, drop_first=True)
## Creating the dependent and Independent variables
X = data.drop(columns='Price')
y = data['Price']
## Train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2)
## Fitting a Linear Regression model
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)
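To see the gap between training and test performance, one quick sketch is to compare the R-squared scores on both sets:
print('Train R-squared:', lin_reg.score(X_train, y_train))
print('Test R-squared :', lin_reg.score(X_test, y_test))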
This is clearly the case of overfitting. Overfitting causes the model to become specific rather than generic. This usually leads to high training accuracy(LOW BIAS) and very low test accuracy(HIGH VARIANCE).
from sklearn.linear_model import Ridge, Lasso, ElasticNet
lasso_reg = Lasso(alpha = 50)
lasso_reg.fit(X_train, y_train)
ridge_reg = Ridge(alpha = 50)
ridge_reg.fit(X_train, y_train)
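And a quick sketch of comparing the regularized models on both sets:
print('Lasso - train:', lasso_reg.score(X_train, y_train),
      'test:', lasso_reg.score(X_test, y_test))
print('Ridge - train:', ridge_reg.score(X_train, y_train),
      'test:', ridge_reg.score(X_test, y_test))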
By doing regularization, we added some bias (training accuracy dropped) and as a result variance decreased a lot (test accuracy increased).
I hope this article helped you understand the Algorithm and most of the concepts related to it.
HAPPY LEARNING!!!!!