L1 and L2 Regularization

Mehmet Akif Cifci
4 min read · Feb 8, 2022


In everyday language, regularization means modifying a situation or system so that it complies with applicable rules or regulations. In machine learning, regularization is a technique that limits, or regularizes, the weights. When we use machine learning to address a problem, we cannot know in advance whether the data set we have is adequate to produce a suitable model, nor whether the model we develop will overfit or underfit. What we do know is that the model’s performance can often be improved with specific optimization techniques. Performance-enhancement strategies such as regularization can therefore be seen as external parameters that improve the model independently.

Why is it necessary to impose constraints on the weights?

Overfitting is a significant issue in machine learning, and regularization is used to avoid it. If the polynomial degree or the number of features is set too high in polynomial regression, the algorithm learns the noise as well as the signal. It fits the training set so well that it fails to generalize to new data; it is useful only for the data in the training set. This is referred to as overfitting, or the high-variance problem. In this situation the training accuracy is high, but the validation accuracy is low.
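To make this concrete, here is a minimal sketch (assuming NumPy and scikit-learn, with a small synthetic data set invented purely for illustration) of how a very high polynomial degree drives the training error down while the validation error stays high:

```python
# Minimal overfitting sketch: low vs. very high polynomial degree.
# The data set and degree values are assumptions for illustration only.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(60, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=60)  # noisy target

X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

for degree in (3, 15):  # modest vs. very high polynomial degree
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    val_mse = mean_squared_error(y_val, model.predict(X_val))
    # The degree-15 model fits the training set better but validates worse.
    print(degree, round(train_mse, 3), round(val_mse, 3))
```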

Consider the linear regression formula as an example, since it is the simplest one:

y = mx + c

With just one feature, this is the whole formula. In practice, however, we almost never work with a single feature; we usually have many.

As a result, the formula becomes:

y = m1x1 + m2x2 + m3x3 + … + mnxn (Figure 1)

I have omitted the intercept term here because this post concentrates on the weights. The initial values of the slopes m1, m2, m3, …, mn are chosen randomly. Slopes are often referred to as weights in machine learning, and they are adjusted based on the Mean Squared Error (MSE) of the predicted values.

MSE = (1/n) Σ (yi − ŷi)², where yi is the observed value and ŷi is the prediction (Figure 2: Mean Squared Error)
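As a quick illustration, here is a small NumPy sketch of how the predictions and the MSE above might be computed; the feature values and slopes are made up for the example:

```python
# NumPy sketch of the prediction and MSE computation.
# The data and weight values are invented for illustration.
import numpy as np

X = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.5]])  # three samples, two features
y = np.array([5.0, 4.0, 8.0])                        # observed values
m = np.array([1.5, 1.0])                             # current slopes (weights)

y_hat = X @ m                    # y = m1*x1 + m2*x2 (intercept omitted, as in the post)
mse = np.mean((y - y_hat) ** 2)  # Mean Squared Error
print(y_hat, mse)
```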

A regularization term is added to the loss to address the problem of overfitting. Regularization comes in two main forms: L1 and L2.

L1 regularization:

The expression for L1 regularization is as follows. When we employ the L1 norm in linear regression, this is referred to as Lasso regression:

Loss = (1/n) Σ (yi − ŷi)² + λ Σ |mj| (Figure 3: L1 regularization)

The first term in this formula is the plain MSE. The second term is the regularization term: the sum of the absolute values of all the slopes, multiplied by lambda. You select lambda based on the results of cross-validation. A larger lambda makes the loss larger, which means a greater penalty, and as the penalty grows the slopes are pushed toward smaller values.

On the other hand, the penalty term increases proportionally when the slopes increase, so large slopes are punished as well. As a consequence, the slopes begin to shrink. Some of them may reach zero, making the corresponding features irrelevant, because each slope is multiplied by a feature. L1 regularization can therefore also be used for feature selection. The downside is that you must be careful if you do not want to lose information by disabling a feature entirely.

The advantage of L1 regularization over L2 regularization is that it is more resilient to outliers. Additionally, it can be used for feature selection.
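As a rough illustration of the feature-selection effect, here is a sketch using scikit-learn’s Lasso (where lambda is called alpha); the synthetic data set, in which only the first two features matter, is an assumption made for the example:

```python
# Sketch of L1 (Lasso) regularization driving some slopes to exactly zero.
# Synthetic data: only the first two features actually influence y.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

for lam in (0.01, 0.5):              # scikit-learn calls lambda "alpha"
    coef = Lasso(alpha=lam).fit(X, y).coef_
    print(lam, np.round(coef, 3))    # larger lambda -> smaller slopes, more exact zeros
```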

L2 regularization:

The expression for L2 regularization is as follows. When we employ the L2 norm in linear regression, this is referred to as Ridge regression:

Loss = (1/n) Σ (yi − ŷi)² + λ Σ mj² (Figure 4: L2 regularization)

As the formula shows, we multiply the sum of the squared slopes by lambda. As with L1 regularization, increasing lambda increases the loss, resulting in smaller slopes, and larger slopes produce a larger loss, which means a greater penalty. However, because the penalty uses the squares of the slopes, the slopes shrink toward zero but never become exactly zero. As a result, no feature’s contribution to the model is lost entirely.
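A sketch of the same idea with scikit-learn’s Ridge (again, lambda is called alpha, and the synthetic data set is invented for illustration); the slopes shrink as lambda grows, but none of them become exactly zero:

```python
# Sketch of L2 (Ridge) regularization: slopes shrink but never hit exactly zero.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

for lam in (0.1, 100.0):             # lambda is called "alpha" in scikit-learn
    coef = Ridge(alpha=lam).fit(X, y).coef_
    print(lam, np.round(coef, 3))    # all slopes shrink, none are exactly zero
```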

The downside is that outliers influence it excessively: because the weights are squared, a value that is much higher than the others dominates the penalty.

The advantage of the L2 norm is that the derivative of the regularization term is simpler to compute, so it can be used more easily in gradient descent formulae. Additionally, since no slope hits zero, no information is lost, and it may perform better when outliers are not a concern.
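To illustrate why the derivative is convenient, here is a minimal gradient descent step for the ridge loss; the learning rate, lambda, and data are illustrative assumptions, not values from this post:

```python
# One gradient-descent step for MSE + lambda * sum(m**2).
# d/dm [lambda * m^2] = 2 * lambda * m, so the penalty simply adds a term
# proportional to the weights (often called "weight decay").
import numpy as np

def ridge_gradient_step(m, X, y, lam=0.1, lr=0.01):
    """One gradient descent update for the L2-regularized MSE loss."""
    n = len(y)
    grad_mse = -2.0 / n * X.T @ (y - X @ m)   # gradient of the MSE term
    grad_penalty = 2.0 * lam * m              # gradient of the L2 penalty
    return m - lr * (grad_mse + grad_penalty)

X = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.5]])
y = np.array([5.0, 4.0, 8.0])
m = np.zeros(2)
for _ in range(500):
    m = ridge_gradient_step(m, X, y)
print(m)  # the slopes settle at smaller values than unregularized least squares
```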

Conclusion

L1 and L2 regularization each have advantages and disadvantages. You can choose the type of regularization that fits your project, or test both to see which one performs better.

You can follow me on Twitter and YouTube.
