Multicollinearity

Multicollinearity is a linear relationship between two or more independent variables in a regression analysis of a dependent variable on a set of independent variables.

Multicollinearity is a serious problem in regression modelling. Including independent variables that are linearly related to each other inflates the standard errors of the parameter estimates. This makes the estimates unreliable, and a model built on unreliable estimates is unstable.

An unstable model performs badly on validation and test samples. Its performance also deteriorates much faster over time than that of a stable model, even when it is scored on samples drawn from the same population. In such a situation the analyst must investigate multicollinearity before finalizing the model.

The next question is how to investigate multicollinearity, and which variable to keep when some variables turn out to be linearly related. This is done by regressing each independent variable on all of the other independent variables. The result is a metric called the variance inflation factor (VIF): for a given predictor, VIF = 1/(1 - Rsquare), where Rsquare is the coefficient of determination of that auxiliary regression.
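As a minimal sketch of this auxiliary-regression idea, the SAS code below regresses one predictor, salary, on the remaining predictors and computes its VIF from the resulting R-square. It reuses the Dev_samp_cust dataset and variables from the PROC REG example further down; the dataset names aux_est and vif_salary are illustrative, not part of the original example.

proc reg data = Dev_samp_cust outest = aux_est rsquare;
  /* Auxiliary regression: salary on all other predictors */
  model salary = age education tax_paid travelling_spend wealth family_members;
run;
quit;

data vif_salary;
  set aux_est;
  vif = 1 / (1 - _RSQ_);   /* VIF = 1 / (1 - Rsquare) */
  keep vif;
run;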

This can be calculated in MS Excel, R, MATLAB, SAS and many other statistical packages available in the market. Here I discuss how to use PROC REG to measure the multicollinearity among the independent variables.

proc reg data = Dev_samp_cust;
  model spend = salary age education tax_paid travelling_spend wealth family_members / vif tol collin;
run;
quit;

The VIF option requests the variance inflation factors, the TOL option requests the tolerances, and the COLLIN option requests collinearity diagnostics. The tolerance is the proportion of variance in a given predictor that is NOT explained by all of the other predictors, and the VIF is simply 1 / tolerance. The VIF is the factor by which the variance of an estimated coefficient is multiplied because of the multicollinearity in the model. Which of the correlated variables to exclude depends upon the objective of the model, domain knowledge and the VIF.
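For example, if the other predictors explain 75% of a given predictor's variance, its tolerance is 0.25 and its VIF is 1/0.25 = 4; the variance of its estimated coefficient is four times what it would be if the predictors were uncorrelated.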

In the banking domain I used to drop a variable if its VIF was more than 1.2, but I have also seen analysts in other domains, such as retail, drop a variable only if its VIF is more than 10. At this stage knowledge of the data and understanding of the business are very important, as they help in deciding the VIF cut-off as well as in selecting one variable from among the linearly related variables.
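As a sketch of this iterative step, assuming (purely hypothetically) that salary showed the highest VIF and was chosen to be dropped, one would refit the model without it and re-check the diagnostics:

proc reg data = Dev_samp_cust;
  /* Refit without the high-VIF variable and re-check the VIFs */
  model spend = age education tax_paid travelling_spend wealth family_members / vif tol;
run;
quit;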

