Tuesday, May 23, 2017

Comparing Ridge & Lasso Regression Types and Best Subset Selection method Using k-fold CV with R


 Hello! My very first blog post is about comparing Ridge and Lasso regression and the Best Subset Selection method using k-fold cross validation. I will go through each method in some detail and compare them based on their R² values. Before starting, let's cover the Lasso, Ridge and Best Subset Selection methods and R².


Ridge Regression
 Ridge Regression is a regression method that includes something called a 'shrinkage penalty'. In linear regression we compute the Residual Sum of Squares (RSS) and try to minimize it. In ridge regression we also have a shrinkage penalty, which penalizes the size of each coefficient with a weight lambda. The aim of the method is to prevent overfitting by regularizing the coefficients, so the prediction variance is decreased compared to the Least Squares method. As lambda grows, the coefficient estimates lose their effect; as lambda goes to infinity the coefficients approach zero and we are left with the null model, in which the predictors have no effect on the prediction. The quantity minimized in ridge regression is:

RSS + λ * Σj Bj^2,   where RSS = Σi (yi − B0 − Σj Bj*xij)^2

 The first sum of the equation is our RSS, and the second sum is called the 'shrinkage penalty'. The shrinkage penalty shrinks all of the coefficient estimates towards zero, but not exactly to zero: ridge regression keeps all of the predictors in the model, even if their coefficients become very small. For an extremely large lambda the coefficients are pushed essentially to zero and our prediction reduces to the mean of the response. As we will see, this behaviour is different from Lasso Regression.
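 To see this shrinkage in action, here is a minimal sketch of my own (assuming the glmnet package and a small simulated data set, not the Boston data used later): as lambda grows, every coefficient is pulled towards zero, but none of them becomes exactly zero.

library(glmnet)

set.seed(1)
n = 100; p = 5
x = matrix(rnorm(n * p), n, p)                 # simulated predictors (illustration only)
y = drop(x %*% c(3, -2, 1, 0, 0)) + rnorm(n)   # response built from known coefficients

# alpha = 0 tells glmnet to fit ridge regression; lambda is a grid of penalty values
ridge.demo = glmnet(x, y, alpha = 0, lambda = 10^seq(4, -2, length = 100))

coef(ridge.demo, s = 0.01)    # small lambda: close to the least squares estimates
coef(ridge.demo, s = 10000)   # huge lambda: all coefficients near zero, none exactly zero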

Lasso Regression
 Lasso Regression is another regression method that penalizes the coefficients to obtain a less flexible fit. It has an advantage over Ridge Regression: in Ridge Regression all of the predictors stay in the model even when their coefficients are small, which makes the model harder to interpret. In Lasso, some coefficients shrink exactly to zero, so what we are left with after performing Lasso Regression is a subset of our predictors. In lasso, we minimize the quantity:

RSS + λ * Σj |Bj|



 The difference between Ridge Regression and Lasso Regression lies in the type of penalty. Lasso uses an ℓ1 penalty whereas Ridge uses ℓ2: in lasso we have |Bj|, which is Bj^2 in Ridge regression. The penalties can also be written as constraints. Let's say we have a budget X that the penalty is not allowed to exceed, and that B1 and B2 are our coefficients. For lasso we then have |B1| + |B2| <= X, while in Ridge we have B1^2 + B2^2 <= X. These constraint regions are illustrated in the Introduction to Statistical Learning book.



 From that figure we can see that Ridge Regression ends up with nonzero values for both B1 and B2 at the intersection point, while in Lasso Regression the intersection can land on an axis, so B1 is exactly zero.

 If B1 = B2, lasso regression tends to keep only one of them, while ridge regression takes some portion from both. In the end, we are left with a subset of the predictors if we choose Lasso Regression.
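 This difference is easy to check numerically. Here is a minimal sketch of my own (again on a small simulated data set rather than the Boston data used later), fitting both penalties with glmnet and comparing the coefficients at the same lambda:

library(glmnet)

set.seed(1)
n = 100; p = 5
x = matrix(rnorm(n * p), n, p)                 # simulated predictors (illustration only)
y = drop(x %*% c(3, -2, 1, 0, 0)) + rnorm(n)   # only the first three predictors matter

lasso.demo = glmnet(x, y, alpha = 1)   # alpha = 1 -> lasso (l1 penalty)
ridge.demo = glmnet(x, y, alpha = 0)   # alpha = 0 -> ridge (l2 penalty)

# At a moderate lambda, lasso typically sets some coefficients exactly to zero,
# while ridge keeps every predictor with a small nonzero coefficient.
coef(lasso.demo, s = 0.5)
coef(ridge.demo, s = 0.5)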

Best Subset Selection

 Best Subset Selection simply goes through all possible subsets of the predictors. The best one-predictor model is the single predictor that explains the most variance in the response. The best two-predictor model is the pair of predictors that explains the most variance among all two-predictor models, and so on. The thing to be careful about is that the best two-predictor subset containing, say, B2 and B4 does not guarantee that the best three-predictor subset will also contain B2 and B4. As written, best subset selection goes through all possible subsets at each size and picks the best one, which makes it costly to perform when the number of predictors is large. If we have n predictors, the number of models that contain exactly k predictors is n! / (k! * (n − k)!), i.e. 'n choose k', so in total 2^n models have to be considered.
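 As a quick check of that count for the 13 predictors of the Boston data used below:

# number of candidate models with exactly k predictors when n = 13
choose(13, 1:13)

# total number of models best subset selection has to examine (including the null model)
sum(choose(13, 0:13))   # = 2^13 = 8192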
Now let's perform each method using the Boston data set. Let's begin with Best Subset Selection.

> library(MASS)
> library(leaps)
> library(glmnet)
> ?leaps
> #leaps for best subset
> ?glmnet
> #glmnet for our fits
> mydata = Boston
> dim(Boston)
[1] 506  14
> fit.best = regsubsets(medv~.,data=mydata,nvmax=13)
> summary(fit.best)
> k = 10
> set.seed(1)
> folds = sample(1:k,nrow(Boston),replace=TRUE)
> cv.errors = matrix(NA,k,13,dimnames = list(NULL,paste(1:13)))
> for(i in 1:k){
+     fit = regsubsets(medv~., data = mydata[folds!=i,], nvmax =13)
+     for(j in 1:13){
+         pred = predict(fit, mydata[folds==i,],id=j)
+         cv.errors[i,j] = mean( (mydata$medv[folds==i]-pred)^2)
+     }
+ }
> mean.cv.errors = apply(cv.errors,2,mean)
 regsubsets() fits a linear regression for each subset. By default it only searches through models with up to 8 variables; that is why nvmax = 13 is used, since we have 13 predictors in this setting. The 'id' argument in the predict call determines how many predictors are used for each fit. Note that regsubsets objects do not come with a built-in predict() method; the predict.regsubsets() helper shown after the best subset results must be defined before running this loop.

Now we have the errors from all subset sizes. Let's plot them and see which one performs best:
> min.p = which.min(mean.cv.errors)
> plot(mean.cv.errors,type = "b")
> points(min.p,mean.cv.errors[min.p],col = "red",cex = 2,pch = 20)
It seems the subset with 11 predictors performs best.
> regfit = regsubsets(medv~.,data=mydata,nvmax=13)

Coefficients of the subset with 11 predictors:
> coef(regfit,11)
 (Intercept)          crim            zn 
 36.341145004  -0.108413345   0.045844929 
         chas           nox            rm 
  2.718716303 -17.376023429   3.801578840 
          dis           rad           tax 
 -1.492711460   0.299608454  -0.011777973 
      ptratio         black         lstat 
 -0.946524570   0.009290845  -0.522553457 
> mean.cv.errors[11]
     11 
24.0169 
> error.best.subset = mean.cv.errors[11]
Note that regsubsets() does not provide a predict() method of its own, so the cross-validation loop above relies on this helper function:
predict.regsubsets = function(object, newdata, id, ...) {
    form = as.formula(object$call[[2]])   # recover the formula (medv ~ .)
    mat = model.matrix(form, newdata)     # design matrix for the new data
    coefi = coef(object, id = id)         # coefficients of the best subset of size id
    mat[, names(coefi)] %*% coefi         # predictions for the new data
}
Now we have the error from the Best Subset Selection method. Let's move on to Ridge Regression.
> x = model.matrix(medv~.,mydata)[,-1]
# model.matrix builds the design matrix of predictors from the formula (medv is the response)
# the [,-1] drops the intercept column, since glmnet adds its own intercept
> y = mydata$medv
> grid = 10^seq(4,-2,length=100)

# grid is a sequence of candidate lambda values for glmnet
> cv.out = matrix(NA,10,1)
> set.seed(1)
> folds = sample(1:k,nrow(mydata),replace = TRUE)
> for(i in 1:10){
+     ridge.mod = glmnet(x[folds!=i,],y[folds!=i],alpha=0,lambda=grid)  
+     cv = cv.glmnet(x[folds!=i,],y[folds!=i],alpha=0)
+     lam = cv$lambda.min
+     ridge.pred = predict(ridge.mod,s=lam,newx=x[folds==i,])
+     cv.out[i] = mean((ridge.pred - y[folds==i])^2)
+   }
> error.ridge = mean(cv.out)
> error.ridge
[1] 24.60134

alpha = 0 tells the glmnet function that Ridge Regression is being performed; for Lasso Regression, alpha is set to 1.

What is performed here is basically cross validation within each fold: cv.glmnet is run on the other folds (the training set) to find the lambda that gives the smallest error for the ridge fit, and that value is stored in 'lam'. Predictions are then made on the held-out fold, the one not used to fit ridge.mod, with s = lam. 'cv.out' holds the mean squared error of each fold, and the mean of those values, mean(cv.out), gives us the 10-fold Cross Validation Error of this method.

Okay. We are done with the Ridge Regression method. Now let's move to Lasso Regression:

> cv.out = matrix(NA,k,1)
> for(i in 1:k){
+     lasso.mod = glmnet(x[folds!=i,],y[folds!=i],alpha=1,lambda=grid)
+     cv = cv.glmnet(x[folds!=i,],y[folds!=i],alpha=1)
+     lam = cv$lambda.min
+     lasso.pred = predict(lasso.mod,s=lam,newx = x[folds==i,])
+     cv.out[i] = mean((lasso.pred - y[folds==i])^2)
+ }
> mean(cv.out)
[1] 24.1567
> error.lasso = mean(cv.out)
 Again, as with Ridge Regression, cv.out holds the mean squared error of each fold, and the mean of cv.out is our 10-fold Cross Validation Error from Lasso Regression.

 Now we have the error from the best fit of all three methods. How can we compare them? We can compare them by computing the R² value of each one, which tells us the proportion of the variance in the response that each method explains.

 R² is equal to 1 - (RSS / TSS). TSS is the Total Sum of Squares, which measures how far each response value is from the mean of the response; it is the error of the null model, the squared difference between each response value and its mean. It is the same for every fit, since it is a feature of the data, not of the fit.

 RSS is the Residual Sum of Squares, which measures how far the actual values are from our fitted values. RSS / TSS gives us the proportion of the variation that is NOT explained by a particular regression fit, and 1 - (RSS / TSS) gives us the proportion that IS explained. Our error values, error.lasso, error.ridge and error.best.subset, play the role of RSS here; since both they and the TSS below are computed on the mean-squared scale, the ratio is unchanged. Each of them was computed from the squared differences between the predicted and the actual values, so we have all the information we need. Now let's calculate the R² values.
> mean.medv = mean(mydata$medv)
> TSS = mean((mean.medv - mydata$medv)^2)
> TSS
[1] 84.41956
> r2.ridge = 1 - (error.ridge/TSS)
> r2.lasso = 1 - (error.lasso/TSS)
> r2.best.subset = 1 - (error.best.subset/TSS)
> r2.ridge
[1] 0.7085825
> r2.lasso
[1] 0.7138495
> r2.best.subset
       11 
0.7155055

 Our R² values are really close to each other. The best fit comes from Best Subset Selection in this example, with an R² of about 0.716. This post aims to show the implementation of each method with some explanation, without trying to find the best possible fit among all the possibilities. Since it is my very first blog post, I am sure it contains some mistakes; I look forward to your feedback, corrections and comments!
